xemu/hw
Laszlo Ersek d7dc0682f5 vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously
(1) The virtio-1.2 specification
<http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html> writes:

> 3     General Initialization And Device Operation
> 3.1   Device Initialization
> 3.1.1 Driver Requirements: Device Initialization
>
> [...]
>
> 7. Perform device-specific setup, including discovery of virtqueues for
>    the device, optional per-bus setup, reading and possibly writing the
>    device’s virtio configuration space, and population of virtqueues.
>
> 8. Set the DRIVER_OK status bit. At this point the device is “live”.

and

> 4         Virtio Transport Options
> 4.1       Virtio Over PCI Bus
> 4.1.4     Virtio Structure PCI Capabilities
> 4.1.4.3   Common configuration structure layout
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>
> [...]
>
> The driver MUST configure the other virtqueue fields before enabling the
> virtqueue with queue_enable.
>
> [...]

(The same statements are present in virtio-1.0 identically, at
<http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>.)

These together mean that the following sub-sequence of steps is valid for
a virtio-1.0 guest driver:

(1.1) set "queue_enable" for the needed queues as the final part of device
initialization step (7),

(1.2) set DRIVER_OK in step (8),

(1.3) immediately start sending virtio requests to the device.

(2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES
special virtio feature is negotiated, then virtio rings start in disabled
state, according to
<https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>.
In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for
enabling vrings.

Therefore setting "queue_enable" from the guest (1.1) -- which is
technically "buffered" on the QEMU side until the guest sets DRIVER_OK
(1.2) -- is a *control plane* operation, which -- after (1.2) -- travels
from the guest through QEMU to the vhost-user backend, using a unix domain
socket.

Whereas sending a virtio request (1.3) is a *data plane* operation, which
evades QEMU -- it travels from guest to the vhost-user backend via
eventfd.

This means that operations ((1.1) + (1.2)) and (1.3) travel through
different channels, and their relative order can be reversed, as perceived
by the vhost-user backend.

That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs
against the Rust-language virtiofsd version 1.7.2. (Which uses version
0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost
crate.)

Namely, when VirtioFsDxe binds a virtiofs device, it goes through the
device initialization steps (i.e., control plane operations), and
immediately sends a FUSE_INIT request too (i.e., performs a data plane
operation). In the Rust-language virtiofsd, this creates a race between
two components that run *concurrently*, i.e., in different threads or
processes:

- Control plane, handling vhost-user protocol messages:

  The "VhostUserSlaveReqHandlerMut::set_vring_enable" method
  [crates/vhost-user-backend/src/handler.rs] handles
  VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled"
  flag according to the message processed.

- Data plane, handling virtio / FUSE requests:

  The "VringEpollHandler::handle_event" method
  [crates/vhost-user-backend/src/event_loop.rs] handles the incoming
  virtio / FUSE request, consuming the virtio kick at the same time. If
  the vring's "enabled" flag is set, the virtio / FUSE request is
  processed genuinely. If the vring's "enabled" flag is clear, then the
  virtio / FUSE request is discarded.

Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*.
However, if the data plane processor in virtiofsd wins the race, then it
sees the FUSE_INIT *before* the control plane processor took notice of
VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane
processor. Therefore the latter drops FUSE_INIT on the floor, and goes
back to waiting for further virtio / FUSE requests with epoll_wait.
Meanwhile OVMF is stuck waiting for the FUSET_INIT response -- a deadlock.

The deadlock is not deterministic. OVMF hangs infrequently during first
boot. However, OVMF hangs almost certainly during reboots from the UEFI
shell.

The race can be "reliably masked" by inserting a very small delay -- a
single debug message -- at the top of "VringEpollHandler::handle_event",
i.e., just before the data plane processor checks the "enabled" field of
the vring. That delay suffices for the control plane processor to act upon
VHOST_USER_SET_VRING_ENABLE.

We can deterministically prevent the race in QEMU, by blocking OVMF inside
step (1.2) -- i.e., in the write to the device status register that
"unleashes" queue enablement -- until VHOST_USER_SET_VRING_ENABLE actually
*completes*. That way OVMF's VCPU cannot advance to the FUSE_INIT
submission before virtiofsd's control plane processor takes notice of the
queue being enabled.

Wait for VHOST_USER_SET_VRING_ENABLE completion by:

- setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting
  for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature
  has been negotiated, or

- performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires
  a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.

Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost)
Cc: Eugenio Perez Martin <eperezma@redhat.com>
Cc: German Maglione <gmaglione@redhat.com>
Cc: Liu Jiang <gerry@linux.alibaba.com>
Cc: Sergio Lopez Pascual <slp@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Albert Esteve <aesteve@redhat.com>
[lersek@redhat.com: work Eugenio's explanation into the commit message,
 about QEMU containing step (1.1) until step (1.2)]
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Message-Id: <20231002203221.17241-8-lersek@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22 05:18:16 -04:00
..
9pfs fsdev: Use ThrottleDirection instread of bool is_write 2023-08-29 10:49:24 +02:00
acpi virtio,pci: features, cleanups 2023-10-05 09:01:01 -04:00
adc meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
alpha hw/alpha: Use MachineClass->default_nic in the alpha machine 2023-05-26 09:10:49 +02:00
arm * fix from optionrom build 2023-10-03 07:43:44 -04:00
audio hw/audio/es1370: trace lost interrupts 2023-10-10 12:31:05 +00:00
avr Remove qemu-common.h include from most units 2022-04-06 14:31:55 +02:00
block swim: update IWM/ISM register block decoding 2023-10-06 10:33:43 +02:00
char hw/char: riscv_htif: replace exit calls with proper shutdown 2023-10-12 12:35:36 +10:00
core cpu: Correct invalid mentions of 'softmmu' by 'system-mode' 2023-10-07 19:02:33 +02:00
cpu hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
cris Do not include exec/address-spaces.h if it's not really necessary 2021-05-02 17:24:51 +02:00
cxl hw/cxl: Fix local variable shadowing of cap_hdrs 2023-10-06 10:56:54 +02:00
display gfxstream + rutabaga: enable rutabaga 2023-10-16 11:29:56 +04:00
dma hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
gpio hw/gpio/nrf51: implement DETECT signal 2023-08-22 17:30:59 +01:00
hppa target/hppa: Report and clear BTLBs via fw_cfg at startup 2023-09-15 17:34:38 +02:00
hyperv win32: replace closesocket() with close() wrapper 2023-03-13 15:39:31 +04:00
i2c aspeed/i2c: Clean up local variable shadowing 2023-09-29 10:07:19 +02:00
i386 hw/i386: changes towards enabling -Wshadow=local for x86 machines 2023-10-06 10:56:54 +02:00
ide hw/ide/ahci: Clean up local variable shadowing 2023-10-06 13:16:57 +02:00
input audio: propagate Error * out of audio_init 2023-10-03 10:29:40 +02:00
intc target/riscv: move KVM only files to kvm subdir 2023-10-12 12:20:24 +10:00
ipack meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
ipmi hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
isa hw/acpi/acpi_dev_interface: Remove now unused madt_cpu virtual method 2023-10-04 18:15:05 -04:00
loongarch hw/loongarch/virt: Remove unused 'loongarch_virt_pm' region 2023-10-13 10:04:52 +08:00
m68k q800: add alias for MacOS toolbox ROM at 0x40000000 2023-10-06 10:33:43 +02:00
mem memory-device,vhost: Support automatic decision on the number of memslots 2023-10-12 14:15:22 +02:00
microblaze hw/microblaze: Clean up local variable shadowing 2023-09-29 10:07:16 +02:00
mips vt82c686 machines: Support machine-default audiodev with fallback 2023-10-03 10:29:40 +02:00
misc * Fix CVE-2023-1544 2023-10-16 12:36:55 -04:00
net hw/net/vhost_net: Silence compiler warning when compiling with -Wshadow 2023-10-06 10:56:54 +02:00
nios2 hw/nios2: Clean up local variable shadowing 2023-09-29 10:07:16 +02:00
nubus trace-events: Fix the name of the tracing.rst file 2023-09-08 13:08:51 +03:00
nvme hw/nvme: Clean up local variable shadowing in nvme_ns_init() 2023-09-29 10:07:20 +02:00
nvram crypto: only include tls-cipher-suites in emulators 2023-10-03 10:29:39 +02:00
openrisc *: Add missing includes of qemu/error-report.h 2023-03-22 15:06:57 +00:00
pci pcie_sriov: unregister_vfs(): fix error path 2023-10-04 18:15:06 -04:00
pci-bridge hw/pci-bridge/cxl-upstream: Add serial number extended capability support 2023-10-04 18:15:06 -04:00
pci-host pc: remove short_root_bus property 2023-09-29 09:33:10 +02:00
pcmcia meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
ppc accel/tcg: Replace CPUState.env_ptr with cpu_env() 2023-10-04 11:03:54 -07:00
rdma hw/rdma: Deprecate the pvrdma device and the rdma subsystem 2023-10-12 14:11:44 +02:00
remote exec/memory: Add symbol for memory listener priority for device backend 2023-06-28 14:27:59 +02:00
riscv target/riscv: move KVM only files to kvm subdir 2023-10-12 12:20:24 +10:00
rtc hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
rx hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
s390x s390x: do a subsystem reset before the unprotect on reboot 2023-09-12 11:13:33 +02:00
scsi virtio,pci: features, cleanups 2023-10-05 09:01:01 -04:00
sd aspeed queue: 2023-09-06 11:14:55 -04:00
sensor hw/i2c: spelling fixes 2023-08-31 19:47:43 +02:00
sh4 hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
smbios hw/acpi: changes towards enabling -Wshadow=local 2023-09-29 10:07:18 +02:00
sparc other architectures: spelling fixes 2023-07-25 17:14:07 +03:00
sparc64 hw/pci/pci: Remove multifunction parameter from pci_new_multifunction() 2023-07-10 18:59:32 -04:00
ssi hw/other: spelling fixes 2023-09-21 11:31:16 +03:00
timer aspeed/timer: Clean up local variable shadowing 2023-09-29 10:07:19 +02:00
tpm hw/tpm: spelling fixes 2023-09-20 07:54:34 +03:00
tricore hw/tricore: Log failing test in testdevice 2023-09-29 08:28:02 +02:00
ufs hw/ufs: Fix code coverity issues 2023-10-13 13:56:28 +09:00
usb hw/usb: Silence compiler warnings in USB code when compiling with -Wshadow 2023-10-06 13:27:48 +02:00
vfio vfio/pci: enable MSI-X in interrupt restoring on dynamic allocation 2023-10-05 22:04:51 +02:00
virtio vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronously 2023-10-22 05:18:16 -04:00
watchdog meson: Replace softmmu_ss -> system_ss 2023-06-20 10:01:30 +02:00
xen xen: spelling fix 2023-09-08 13:08:52 +03:00
xenpv hw/xenpv: Initialize Xen backend operations 2023-03-24 14:52:14 +00:00
xtensa trivial: Simplify the spots that use TARGET_BIG_ENDIAN as a numeric value 2023-09-08 13:08:52 +03:00
Kconfig hw/ufs: Initial commit for emulated Universal-Flash-Storage 2023-09-07 14:01:29 -04:00
meson.build hw/ufs: Initial commit for emulated Universal-Flash-Storage 2023-09-07 14:01:29 -04:00