xemu/hw
Michael Roth 9d7bd0826f virtio-pci: disable vring processing when bus-mastering is disabled
Currently the SLOF firmware for pseries guests will disable/re-enable
a PCI device multiple times via IO/MEM/MASTER bits of PCI_COMMAND
register after the initial probe/feature negotiation, as it tends to
work with a single device at a time at various stages like probing
and running block/network bootloaders without doing a full reset
in-between.

In QEMU, when PCI_COMMAND_MASTER is disabled we disable the
corresponding IOMMU memory region, so DMA accesses (including to vring
fields like idx/flags) will no longer undergo the necessary
translation. Normally we wouldn't expect this to happen since it would
be misbehavior on the driver side to continue driving DMA requests.

However, in the case of pseries, with iommu_platform=on, we trigger the
following sequence when tearing down the virtio-blk dataplane ioeventfd
in response to the guest unsetting PCI_COMMAND_MASTER:

  #2  0x0000555555922651 in virtqueue_map_desc (vdev=vdev@entry=0x555556dbcfb0, p_num_sg=p_num_sg@entry=0x7fffe657e1a8, addr=addr@entry=0x7fffe657e240, iov=iov@entry=0x7fffe6580240, max_num_sg=max_num_sg@entry=1024, is_write=is_write@entry=false, pa=0, sz=0)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:757
  #3  0x0000555555922a89 in virtqueue_pop (vq=vq@entry=0x555556dc8660, sz=sz@entry=184)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:950
  #4  0x00005555558d3eca in virtio_blk_get_request (vq=0x555556dc8660, s=0x555556dbcfb0)
      at /home/mdroth/w/qemu.git/hw/block/virtio-blk.c:255
  #5  0x00005555558d3eca in virtio_blk_handle_vq (s=0x555556dbcfb0, vq=0x555556dc8660)
      at /home/mdroth/w/qemu.git/hw/block/virtio-blk.c:776
  #6  0x000055555591dd66 in virtio_queue_notify_aio_vq (vq=vq@entry=0x555556dc8660)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:1550
  #7  0x000055555591ecef in virtio_queue_notify_aio_vq (vq=0x555556dc8660)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:1546
  #8  0x000055555591ecef in virtio_queue_host_notifier_aio_poll (opaque=0x555556dc86c8)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio.c:2527
  #9  0x0000555555d02164 in run_poll_handlers_once (ctx=ctx@entry=0x55555688bfc0, timeout=timeout@entry=0x7fffe65844a8)
      at /home/mdroth/w/qemu.git/util/aio-posix.c:520
  #10 0x0000555555d02d1b in try_poll_mode (timeout=0x7fffe65844a8, ctx=0x55555688bfc0)
      at /home/mdroth/w/qemu.git/util/aio-posix.c:607
  #11 0x0000555555d02d1b in aio_poll (ctx=ctx@entry=0x55555688bfc0, blocking=blocking@entry=true)
      at /home/mdroth/w/qemu.git/util/aio-posix.c:639
  #12 0x0000555555d0004d in aio_wait_bh_oneshot (ctx=0x55555688bfc0, cb=cb@entry=0x5555558d5130 <virtio_blk_data_plane_stop_bh>, opaque=opaque@entry=0x555556de86f0)
      at /home/mdroth/w/qemu.git/util/aio-wait.c:71
  #13 0x00005555558d59bf in virtio_blk_data_plane_stop (vdev=<optimized out>)
      at /home/mdroth/w/qemu.git/hw/block/dataplane/virtio-blk.c:288
  #14 0x0000555555b906a1 in virtio_bus_stop_ioeventfd (bus=bus@entry=0x555556dbcf38)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio-bus.c:245
  #15 0x0000555555b90dbb in virtio_bus_stop_ioeventfd (bus=bus@entry=0x555556dbcf38)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio-bus.c:237
  #16 0x0000555555b92a8e in virtio_pci_stop_ioeventfd (proxy=0x555556db4e40)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio-pci.c:292
  #17 0x0000555555b92a8e in virtio_write_config (pci_dev=0x555556db4e40, address=<optimized out>, val=1048832, len=<optimized out>)
      at /home/mdroth/w/qemu.git/hw/virtio/virtio-pci.c:613

I.e. the calling code is only scheduling a one-shot BH for
virtio_blk_data_plane_stop_bh, but somehow we end up trying to process
an additional virtqueue entry before we get there. This is likely due
to the following check in virtio_queue_host_notifier_aio_poll:

  static bool virtio_queue_host_notifier_aio_poll(void *opaque)
  {
      EventNotifier *n = opaque;
      VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
      bool progress;

      if (!vq->vring.desc || virtio_queue_empty(vq)) {
          return false;
      }

      progress = virtio_queue_notify_aio_vq(vq);

namely the call to virtio_queue_empty(). In this case, since no new
requests have actually been issued, shadow_avail_idx == last_avail_idx,
so we actually try to access the vring via vring_avail_idx() to get
the latest non-shadowed idx:

  int virtio_queue_empty(VirtQueue *vq)
  {
      bool empty;
      ...

      if (vq->shadow_avail_idx != vq->last_avail_idx) {
          return 0;
      }

      rcu_read_lock();
      empty = vring_avail_idx(vq) == vq->last_avail_idx;
      rcu_read_unlock();
      return empty;

but since the IOMMU region has been disabled we get a bogus value (0
usually), which causes virtio_queue_empty() to falsely report that
there are entries to be processed, which causes errors such as:

  "virtio: zero sized buffers are not allowed"

or

  "virtio-blk missing headers"

and puts the device in an error state.

This patch works around the issue by introducing virtio_set_disabled(),
which sets a 'disabled' flag to bypass checks like virtio_queue_empty()
when bus-mastering is disabled. Since we'd check this flag at all the
same sites as vdev->broken, we replace those checks with an inline
function which checks for either vdev->broken or vdev->disabled.

The 'disabled' flag is only migrated when set, which should be fairly
rare, but to maintain migration compatibility we disable it's use for
older machine types. Users requiring the use of the flag in conjunction
with older machine types can set it explicitly as a virtio-device
option.

NOTES:

 - This leaves some other oddities in play, like the fact that
   DRIVER_OK also gets unset in response to bus-mastering being
   disabled, but not restored (however the device seems to continue
   working)
 - Similarly, we disable the host notifier via
   virtio_bus_stop_ioeventfd(), which seems to move the handling out
   of virtio-blk dataplane and back into the main IO thread, and it
   ends up staying there till a reset (but otherwise continues working
   normally)

Cc: David Gibson <david@gibson.dropbear.id.au>,
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Message-Id: <20191120005003.27035-1-mdroth@linux.vnet.ibm.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-01-05 07:03:03 -05:00
..
9pfs 9pfs: make Error **errp const where it is appropriate 2019-12-18 08:43:19 +01:00
acpi * More uses of RCU_READ_LOCK_GUARD (Dave, myself) 2019-12-20 11:20:25 +00:00
adc Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
alpha hw: replace hw/i386/pc.h with a header just for the i8259 2019-12-17 19:33:49 +01:00
arm hw/arm/nseries: Replace the bluetooth chardev with a "null" chardev 2019-12-16 17:24:07 +01:00
audio hw/audio: Remove the "use_broken_id" hack from the AC97 device 2019-12-18 02:34:12 +01:00
block virtio-blk: fix out-of-bounds access to bitmap in notify_guest_bh 2019-12-19 16:20:25 +00:00
bt Remove the core bluetooth code 2019-12-17 09:01:14 +01:00
char virtio-serial-bus: fix memory leak while attach virtio-serial-bus 2020-01-05 07:03:03 -05:00
core virtio-pci: disable vring processing when bus-mastering is disabled 2020-01-05 07:03:03 -05:00
cpu hw/core: Move cpu.c, cpu.h from qom/ to hw/core/ 2019-08-21 13:24:01 +02:00
cris Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
display vga: two little bugfixes. 2020-01-03 14:29:42 +00:00
dma mips: jazz: Renovate coding style 2019-12-16 13:04:46 +01:00
gpio gpio: fix memory leak in aspeed_gpio_init() 2019-12-16 10:46:34 +00:00
hppa hw: replace hw/i386/pc.h with a header just for the i8259 2019-12-17 19:33:49 +01:00
hyperv hyperv: Use auto rcu_read macros 2019-12-17 19:33:52 +01:00
i2c aspeed/i2c: Add trace events 2019-12-16 10:46:34 +00:00
i386 intel_iommu: fix bug to read DMAR_RTADDR_REG 2020-01-05 07:03:03 -05:00
ide bootdevice: Gather LCHS from all relevant devices 2019-10-31 11:47:29 -04:00
input virtio-input: convert to new virtio_delete_queue 2020-01-05 07:03:03 -05:00
intc * More uses of RCU_READ_LOCK_GUARD (Dave, myself) 2019-12-20 11:20:25 +00:00
ipack Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
ipmi hw/ipmi: Fix realize() error API violations 2019-12-18 08:36:15 +01:00
isa hw/isa/isa-bus: cleanup irq functions 2019-12-17 19:33:51 +01:00
lm32 Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
m68k q800: fix I/O memory map 2019-11-05 18:52:29 +01:00
mem memory-device: Fix memory pre-plug error API violations 2019-12-18 08:36:15 +01:00
microblaze microblaze: fix leak of fdevice tree blob 2019-10-04 18:49:16 +02:00
mips hw: replace hw/i386/pc.h with a header just for the i8259 2019-12-17 19:33:49 +01:00
misc hw/misc/ivshmem: Bury dead legacy INTx code 2019-12-17 09:05:23 +01:00
moxie Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
net qemu_log_lock/unlock now preserves the qemu_logfile handle. 2019-12-18 20:18:02 +00:00
nios2 Clean up inclusion of sysemu/sysemu.h 2019-08-16 13:31:53 +02:00
nubus hw/m68k: add Nubus support 2019-10-28 19:06:47 +01:00
nvram Fix the fw_cfg reboot-timeout=-1 special value, add a test for it. 2019-11-05 20:17:11 +00:00
openrisc Include hw/qdev-properties.h less 2019-08-16 13:31:53 +02:00
pci hw/pci: Remove the "command_serr_enable" property 2019-12-18 02:34:12 +01:00
pci-bridge numa: move numa global variable nb_numa_nodes into MachineState 2019-09-03 11:26:55 -03:00
pci-host hw/pci-host: Add Kconfig entry to select the IGD Passthrough Host Bridge 2019-12-18 02:34:12 +01:00
pcmcia Include hw/hw.h exactly where needed 2019-08-16 13:31:52 +02:00
ppc * More uses of RCU_READ_LOCK_GUARD (Dave, myself) 2019-12-20 11:20:25 +00:00
rdma hw/rdma: Utilize ibv_reg_mr_iova for memory registration 2019-11-06 12:49:04 +02:00
riscv hw/riscv: Add optional symbol callback ptr to riscv_load_kernel() 2019-11-25 12:34:52 -08:00
rtc * microvm docs and fixes (Sergio, Liam) 2019-11-19 16:31:27 +00:00
s390x hw/s390x: rename Error ** parameter to more common errp 2019-12-18 08:43:19 +01:00
scsi scsi: deprecate scsi-disk 2019-11-19 10:01:34 +01:00
sd hw/sd: drop extra whitespace in sdhci_sysbus_realize() header 2019-12-18 08:43:19 +01:00
semihosting Clean up inclusion of sysemu/sysemu.h 2019-08-16 13:31:53 +02:00
sh4 sysemu: Split sysemu/runstate.h off sysemu/sysemu.h 2019-08-16 13:37:36 +02:00
smbios smbios:ipmi: Ignore IPMI devices with no fwinfo function 2019-09-20 14:08:10 -05:00
sparc hw: Move M48T59 device from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:20:45 +02:00
sparc64 hw: Move sun4v hypervisor RTC from hw/timer/ to hw/rtc/ subdirectory 2019-10-24 20:23:15 +02:00
ssi aspeed/smc: Add AST2600 timings registers 2019-12-16 10:46:34 +00:00
timer aspeed: Change the "scu" property definition 2019-12-16 10:46:34 +00:00
tpm hw/tpm: rename Error ** parameter to more common errp 2019-12-18 08:43:19 +01:00
tricore Clean up inclusion of sysemu/sysemu.h 2019-08-16 13:31:53 +02:00
unicore32 Include hw/irq.h a lot less 2019-08-16 13:31:52 +02:00
usb hw/usb: rename Error ** parameter to more common errp 2019-12-18 08:43:19 +01:00
vfio hw/vfio/ap: drop local_err from vfio_ap_realize 2019-12-18 08:43:19 +01:00
virtio virtio-pci: disable vring processing when bus-mastering is disabled 2020-01-05 07:03:03 -05:00
watchdog aspeed: Change the "scu" property definition 2019-12-16 10:46:34 +00:00
xen xen: convert "-machine igd-passthru" to an accelerator property 2019-12-17 19:32:27 +01:00
xenpv Include sysemu/sysemu.h a lot less 2019-08-16 13:31:53 +02:00
xtensa hw/xtensa: add virt machine 2019-10-18 20:38:10 -07:00
Kconfig Remove the core bluetooth code 2019-12-17 09:01:14 +01:00
Makefile.objs Remove the core bluetooth code 2019-12-17 09:01:14 +01:00