xemu/hw
Alexey G 7fb394ad8a xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings
Under certain circumstances normal xen-mapcache functioning may be broken
by guest's actions. This may lead to either QEMU performing exit() due to
a caught bad pointer (and with QEMU process gone the guest domain simply
appears hung afterwards) or actual use of the incorrect pointer inside
QEMU address space -- a write to unmapped memory is possible. The bug is
hard to reproduce on a i440 machine as multiple DMA sources are required
(though it's possible in theory, using multiple emulated devices), but can
be reproduced somewhat easily on a Q35 machine using an emulated AHCI
controller -- each NCQ queue command slot may be used as an independent
DMA source ex. using READ FPDMA QUEUED command, so a single storage
device on the AHCI controller port will be enough to produce multiple DMAs
(up to 32). The detailed description of the issue follows.

Xen-mapcache provides an ability to map parts of a guest memory into
QEMU's own address space to work with.

There are two types of cache lookups:
 - translating a guest physical address into a pointer in QEMU's address
   space, mapping a part of guest domain memory if necessary (while trying
   to reduce a number of such (re)mappings to a minimum)
 - translating a QEMU's pointer back to its physical address in guest RAM

These lookups are managed via two linked-lists of structures.
MapCacheEntry is used for forward cache lookups, while MapCacheRev -- for
reverse lookups.

Every guest physical address is broken down into 2 parts:
    address_index  = phys_addr >> MCACHE_BUCKET_SHIFT;
    address_offset = phys_addr & (MCACHE_BUCKET_SIZE - 1);

MCACHE_BUCKET_SHIFT depends on a system (32/64) and is equal to 20 for
a 64-bit system (which assumed for the further description). Basically,
this means that we deal with 1 MB chunks and offsets within those 1 MB
chunks. All mappings are created with 1MB-granularity, i.e. 1MB/2MB/3MB
etc. Most DMA transfers typically are less than 1MB, however, if the
transfer crosses any 1MB border(s) - than a nearest larger mapping size
will be used, so ex. a 512-byte DMA transfer with the start address
700FFF80h will actually require a 2MB range.

Current implementation assumes that MapCacheEntries are unique for a given
address_index and size pair and that a single MapCacheEntry may be reused
by multiple requests -- in this case the 'lock' field will be larger than
1. On other hand, each requested guest physical address (with 'lock' flag)
is described by each own MapCacheRev. So there may be multiple MapCacheRev
entries corresponding to a single MapCacheEntry. The xen-mapcache code
uses MapCacheRev entries to retrieve the address_index & size pair which
in turn used to find a related MapCacheEntry. The 'lock' field within
a MapCacheEntry structure is actually a reference counter which shows
a number of corresponding MapCacheRev entries.

The bug lies in ability for the guest to indirectly manipulate with the
xen-mapcache MapCacheEntries list via a special sequence of DMA
operations, typically for storage devices. In order to trigger the bug,
guest needs to issue DMA operations in specific order and timing.
Although xen-mapcache is protected by the mutex lock -- this doesn't help
in this case, as the bug is not due to a race condition.

Suppose we have 3 DMA transfers, namely A, B and C, where
- transfer A crosses 1MB border and thus uses a 2MB mapping
- transfers B and C are normal transfers within 1MB range
- and all 3 transfers belong to the same address_index

In this case, if all these transfers are to be executed one-by-one
(without overlaps), no special treatment necessary -- each transfer's
mapping lock will be set and then cleared on unmap before starting
the next transfer.
The situation changes when DMA transfers overlap in time, ex. like this:

  |===== transfer A (2MB) =====|

              |===== transfer B (1MB) =====|

                          |===== transfer C (1MB) =====|
 time --->

In this situation the following sequence of actions happens:

1. transfer A creates a mapping to 2MB area (lock=1)
2. transfer B (1MB) tries to find available mapping but cannot find one
   because transfer A is still in progress, and it has 2MB size + non-zero
   lock. So transfer B creates another mapping -- same address_index,
   but 1MB size.
3. transfer A completes, making 1st mapping entry available by setting its
   lock to 0
4. transfer C starts and tries to find available mapping entry and sees
   that 1st entry has lock=0, so it uses this entry but remaps the mapping
   to a 1MB size
5. transfer B completes and by this time
  - there are two locked entries in the MapCacheEntry list with the SAME
    values for both address_index and size
  - the entry for transfer B actually resides farther in list while
    transfer C's entry is first
6. xen_ram_addr_from_mapcache() for transfer B gets correct address_index
   and size pair from corresponding MapCacheRev entry, but then it starts
   looking for MapCacheEntry with these values and finds the first entry
   -- which belongs to transfer C.

At this point there may be following possible (bad) consequences:

1. xen_ram_addr_from_mapcache() will use a wrong entry->vaddr_base value
   in this statement:

   raddr = (reventry->paddr_index << MCACHE_BUCKET_SHIFT) +
       ((unsigned long) ptr - (unsigned long) entry->vaddr_base);

resulting in an incorrent raddr value returned from the function. The
(ptr - entry->vaddr_base) expression may produce both positive and negative
numbers and its actual value may differ greatly as there are many
map/unmap operations take place. If the value will be beyond guest RAM
limits then a "Bad RAM offset" error will be triggered and logged,
followed by exit() in QEMU.

2. If raddr value won't exceed guest RAM boundaries, the same sequence
of actions will be performed for xen_invalidate_map_cache_entry() on DMA
unmap, resulting in a wrong MapCacheEntry being unmapped while DMA
operation which uses it is still active. The above example must
be extended by one more DMA transfer in order to allow unmapping as the
first mapping in the list is sort of resident.

The patch modifies the behavior in which MapCacheEntry's are added to the
list, avoiding duplicates.

Signed-off-by: Alexey Gerasimenko <x1917x@gmail.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
2017-07-21 17:37:06 -07:00
..
9pfs Convert error_report() to warn_report() 2017-07-13 13:49:58 +02:00
acpi include/exec/poison: Mark CONFIG_KVM as poisoned, too 2017-07-04 14:30:03 +02:00
adc STM32F2xx: Add the ADC device 2016-10-04 13:28:07 +01:00
alpha target/alpha: Merge several flag bytes into ENV->FLAGS 2017-07-18 18:41:52 -10:00
arm hw/arm/mps2: Add ethernet 2017-07-17 13:36:09 +01:00
audio audio/adlib: remove limitation of one adlib card 2017-07-17 11:09:02 +02:00
block hw/block/pflash_cfi01, pflash_cfi02: Use memory_region_init_rom_device() 2017-07-14 17:59:42 +01:00
bt be-hci: use backend functions 2017-06-02 11:33:53 +04:00
char hw/arm/mps2: Add UARTs 2017-07-17 13:36:08 +01:00
core migration/next for 20170718 2017-07-19 12:30:41 +01:00
cpu hw/cpu: core.c can be compiled as common object 2017-06-09 12:02:55 +10:00
cris hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
display virtio-gpu migration fix for 2.10 2017-07-17 17:12:41 +01:00
dma * gdbstub fixes (Alex) 2017-07-14 12:16:09 +01:00
gpio qdev: Replace cannot_instantiate_with_device_add_yet with !user_creatable 2017-05-17 10:37:00 -03:00
i2c migration/next for 20170601 2017-06-02 14:07:53 +01:00
i386 xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings 2017-07-21 17:37:06 -07:00
ide -----BEGIN PGP SIGNATURE----- 2017-07-19 13:43:58 +01:00
input memory: Rename memory_region_init_ram() to memory_region_init_ram_nomigrate() 2017-07-14 17:59:42 +01:00
intc s390x/flic: migrate ais states 2017-07-14 12:29:49 +02:00
ipack ipack: Update e-mail address 2016-05-18 15:04:27 +03:00
ipmi qom: enforce readonly nature of link's check callback 2017-07-14 12:04:42 +02:00
isa chardev: move headers to include/chardev 2017-06-02 11:33:52 +04:00
lm32 char: rename CharDriverState Chardev 2017-01-27 18:07:59 +01:00
m68k hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
mem dimm: Convert to DEFINE_PROP_LINK 2017-07-14 12:04:43 +02:00
microblaze hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
mips ahci: add ahci_get_num_ports 2017-07-18 11:47:56 -04:00
misc configure: Rename CONFIG_IVSHMEM to CONFIG_IVSHMEM_DEVICE 2017-07-20 14:56:20 +01:00
moxie hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
net virtio-net: fix offload ctrl endian 2017-07-17 20:13:56 +08:00
nios2 hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
nvram fw_cfg: move QOM type defines and fw_cfg types into fw_cfg.h 2017-07-17 15:41:30 -03:00
openrisc hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
pci hw/pci/pci.c: Use memory_region_init_rom() 2017-07-14 17:59:42 +01:00
pci-bridge pci: Convert shpc_init() to Error 2017-07-03 22:29:49 +03:00
pci-host memory: Rename memory_region_init_ram() to memory_region_init_ram_nomigrate() 2017-07-14 17:59:42 +01:00
pcmcia hw: Clean up includes 2016-01-29 15:07:25 +00:00
ppc ppc patch queue 2017-07-17 2017-07-17 12:52:59 +01:00
s390x Use qemu_tolower() and qemu_toupper(), not tolower() and toupper() 2017-07-21 10:32:41 +01:00
scsi Use qemu_tolower() and qemu_toupper(), not tolower() and toupper() 2017-07-21 10:32:41 +01:00
sd generic-sdhci: Remove user_creatable flag 2017-05-17 10:37:01 -03:00
sh4 hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
smbios stubs: move smbios stubs to hw/smbios 2017-01-16 17:52:35 +01:00
sparc hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
sparc64 memory: Rename memory_region_init_ram() to memory_region_init_ram_nomigrate() 2017-07-14 17:59:42 +01:00
ssi xilinx_spips: allow mmio execution 2017-06-27 15:09:15 +02:00
timer hw/char/cmsdk-apb-timer: Implement CMSDK APB timer device 2017-07-17 13:36:08 +01:00
tpm clean-up: removed duplicate #includes 2016-10-28 18:17:24 +03:00
tricore hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
unicore32 hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
usb usb: Fix build with newer gcc 2017-07-20 10:02:11 +02:00
vfio vfio-pci, ppc64/spapr: Reorder group-to-container attaching 2017-07-17 12:39:09 -06:00
virtio virtio-crypto: Convert to DEFINE_PROP_LINK 2017-07-14 12:04:43 +02:00
watchdog shutdown: Add source information to SHUTDOWN and RESET 2017-05-23 13:28:17 +02:00
xen xen_pt_msi.c: Check for xen_host_pci_get_* failures in xen_pt_msix_init() 2017-07-18 13:27:04 -07:00
xenpv xenfb: remove xen_init_display "temporary" hack 2017-07-07 11:10:03 -07:00
xtensa hw: Use new memory_region_init_{ram, rom, rom_device}() functions 2017-07-14 17:59:42 +01:00
Makefile.objs acpi: filter based on CONFIG_ACPI_X86 rather than TARGET 2017-01-16 17:52:35 +01:00