Existing code misses a decrement of cpuid_i when skip leaf 0x1F.
There's a blank CPUID entry(with leaf, subleaf as 0, and all fields
stuffed 0s) left in the CPUID array.
It conflicts with correct CPUID leaf 0.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by:Yang Weijiang <weijiang.yang@intel.com>
Message-ID: <20240125024016.2521244-2-xiaoyao.li@intel.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The name "iothread" is overloaded. Use the term Big QEMU Lock (BQL)
instead, it is already widely used and unambiguous.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-id: 20240102153529.486531-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The Big QEMU Lock (BQL) has many names and they are confusing. The
actual QemuMutex variable is called qemu_global_mutex but it's commonly
referred to as the BQL in discussions and some code comments. The
locking APIs, however, are called qemu_mutex_lock_iothread() and
qemu_mutex_unlock_iothread().
The "iothread" name is historic and comes from when the main thread was
split into into KVM vcpu threads and the "iothread" (now called the main
loop thread). I have contributed to the confusion myself by introducing
a separate --object iothread, a separate concept unrelated to the BQL.
The "iothread" name is no longer appropriate for the BQL. Rename the
locking APIs to:
- void bql_lock(void)
- void bql_unlock(void)
- bool bql_locked(void)
There are more APIs with "iothread" in their names. Subsequent patches
will rename them. There are also comments and documentation that will be
updated in later patches.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Fabiano Rosas <farosas@suse.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com>
Acked-by: Hyman Huang <yong.huang@smartx.com>
Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-id: 20240102153529.486531-2-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Commit 7191f24c7f ("accel/kvm/kvm-all: Handle register access errors")
added error checking for KVM_SET_SREGS/KVM_SET_SREGS2. In doing so, it
exposed a long-running bug in current KVM support for SEV-ES where the
kernel assumes that MSR_EFER_LMA will be set explicitly by the guest
kernel, in which case EFER write traps would result in KVM eventually
seeing MSR_EFER_LMA get set and recording it in such a way that it would
be subsequently visible when accessing it via KVM_GET_SREGS/etc.
However, guest kernels currently rely on MSR_EFER_LMA getting set
automatically when MSR_EFER_LME is set and paging is enabled via
CR0_PG_MASK. As a result, the EFER write traps don't actually expose the
MSR_EFER_LMA bit, even though it is set internally, and when QEMU
subsequently tries to pass this EFER value back to KVM via
KVM_SET_SREGS* it will fail various sanity checks and return -EINVAL,
which is now considered fatal due to the aforementioned QEMU commit.
This can be addressed by inferring the MSR_EFER_LMA bit being set when
paging is enabled and MSR_EFER_LME is set, and synthesizing it to ensure
the expected bits are all present in subsequent handling on the host
side.
Ultimately, this handling will be implemented in the host kernel, but to
avoid breaking QEMU's SEV-ES support when using older host kernels, the
same handling can be done in QEMU just after fetching the register
values via KVM_GET_SREGS*. Implement that here.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Akihiko Odaki <akihiko.odaki@daynix.com>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Lara Lazier <laramglazier@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: <kvm@vger.kernel.org>
Fixes: 7191f24c7f ("accel/kvm/kvm-all: Handle register access errors")
Signed-off-by: Michael Roth <michael.roth@amd.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20231206155821.1194551-1-michael.roth@amd.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230918160257.30127-4-philmd@linaro.org>
The primary console is special because the toolstack maps a page into
the guest for its ring, and also allocates the guest-side event channel.
The guest's grant table is even primed to export that page using a known
grant ref#. Add support for all that in emulated mode, so that we can
have a primary console.
For reasons unclear, the backends running under real Xen don't just use
a mapping of the well-known GNTTAB_RESERVED_CONSOLE grant ref (which
would also be in the ring-ref node in XenStore). Instead, the toolstack
sets the ring-ref node of the primary console to the GFN of the guest
page. The backend is expected to handle that special case and map it
with foreignmem operations instead.
We don't have an implementation of foreignmem ops for emulated Xen mode,
so just make it map GNTTAB_RESERVED_CONSOLE instead. This would probably
work for real Xen too, but we can't work out how to make real Xen create
a primary console of type "ioemu" to make QEMU drive it, so we can't
test that; might as well leave it as it is for now under Xen.
Now at last we can boot the Xen PV shim and run PV kernels in QEMU.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This will allow Linux guests (since v6.0) to use the per-vCPU upcall
vector delivered as MSI through the local APIC.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
A guest which has configured the per-vCPU upcall vector may set the
HVM_PARAM_CALLBACK_IRQ param to fairly much anything other than zero.
For example, Linux v6.0+ after commit b1c3497e604 ("x86/xen: Add support
for HVMOP_set_evtchn_upcall_vector") will just do this after setting the
vector:
/* Trick toolstack to think we are enlightened. */
if (!cpu)
rc = xen_set_callback_via(1);
That's explicitly setting the delivery to GSI#1, but it's supposed to be
overridden by the per-vCPU vector setting. This mostly works in Qemu
*except* for the logic to enable the in-kernel handling of event channels,
which falsely determines that the kernel cannot accelerate GSI delivery
in this case.
Add a kvm_xen_has_vcpu_callback_vector() to report whether vCPU#0 has
the vector set, and use that in xen_evtchn_set_callback_param() to
enable the kernel acceleration features even when the param *appears*
to be set to target a GSI.
Preserve the Xen behaviour that when HVM_PARAM_CALLBACK_IRQ is set to
*zero* the event channel delivery is disabled completely. (Which is
what that bizarre guest behaviour is working round in the first place.)
Cc: qemu-stable@nongnu.org
Fixes: 91cce75617 ("hw/xen: Add xen_evtchn device for event channel emulation")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The per-vCPU upcall vector support had three problems. Firstly it was
using the wrong hypercall argument and would always return -EFAULT when
the guest tried to set it up. Secondly it was using the wrong ioctl() to
pass the vector to the kernel and thus the *kernel* would always return
-EINVAL. Finally, even when delivering the event directly from userspace
with an MSI, it put the destination CPU ID into the wrong bits of the
MSI address.
Linux doesn't (yet) use this mode so it went without decent testing
for a while.
Cc: qemu-stable@nongnu.org
Fixes: 105b47fdf2 ("i386/xen: implement HVMOP_set_evtchn_upcall_vector")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This confuses lscpu into thinking it's running in PVH mode.
Cc: qemu-stable@nongnu.org
Fixes: bedcc13924 ("i386/xen: implement HYPERVISOR_xen_version")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This was introduced in KVM in Linux 2.6.33, we can require it
unconditionally. KVM_CLOCK_TSC_STABLE was only added in Linux 4.9,
for now do not require it (though it would allow the removal of some
pretty yucky code).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This was introduced in KVM in Linux 2.6.36, and could already be used at
the time to save/restore FPU data even on older processor. We can require
it unconditionally and stop using KVM_GET/SET_FPU.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This was introduced in KVM in Linux 3.5, we can require it unconditionally
in kvm_irqchip_send_msi(). However, not all architectures have to implement
it so check it only in x86, the only architecture that ever had MSI injection
but not KVM_CAP_SIGNAL_MSI.
ARM uses it to detect the presence of the ITS emulation in the kernel,
introduced in Linux 4.8. Assume that it's there and possibly fail when
realizing the arm-its-kvm device.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modify migrate_add_blocker and migrate_del_blocker to take an Error **
reason. This allows migration to own the Error object, so that if
an error occurs in migrate_add_blocker, migration code can free the Error
and clear the client handle, simplifying client code. It also simplifies
the migrate_del_blocker call site.
In addition, this is a pre-requisite for a proposed future patch that would
add a mode argument to migration requests to support live update, and
maintain a list of blockers for each mode. A blocker may apply to a single
mode or to multiple modes, and passing Error** will allow one Error object
to be registered for multiple modes.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Tested-by: Michael Galaxy <mgalaxy@akamai.com>
Reviewed-by: Michael Galaxy <mgalaxy@akamai.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <1697634216-84215-1-git-send-email-steven.sistare@oracle.com>
When explicitly booting a multiple vcpus vm with "-cpu +ht", it gets
warning of
warning: host doesn't support requested feature: CPUID.01H:EDX.ht [bit 28]
Make CPUID_HT as supported unconditionally can resolve the warning.
However it introduces another issue that it also expose CPUID_HT to
guest when "-cpu host/max" with only 1 vcpu. To fix this, need mark
CPUID_HT as the no_autoenable_flags.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20231010060539.210258-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Software MMU is TCG specific. Here 'softmmu' is misused
for system emulation. Anyhow, since KVM is system emulation
specific, just rename as 'i386_kvm_ss'.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20231004090629.37473-10-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
accel_cpu_realize() is a generic function working with CPUs
from any target. Rename it using '_common_' to emphasis it is
not target specific.
Suggested-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20231003123026.99229-4-philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
The AccelCPUClass::cpu_realizefn handler is meant for target
specific code, rename it using '_target_' to emphasis it.
Suggested-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20231003123026.99229-3-philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
We use the '*fn' suffix for handlers, this is a public method.
Drop the suffix.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Claudio Fontana <cfontana@suse.de>
Message-Id: <20231003123026.99229-2-philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
kvm_pc_setup_irq_routing() is only defined for x86 targets (in
hw/i386/kvm/apic.c). Its declaration is pointless on all
other targets.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-14-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_has_pit_state2() is only defined for x86 targets (in
target/i386/kvm/kvm.c). Its declaration is pointless on
all other targets. Have it return a boolean.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-13-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_get_apic_state() is only defined for x86 targets (in
hw/i386/kvm/apic.c). Its declaration is pointless on all
other targets.
Since we include "linux-headers/asm-x86/kvm.h", no need
to forward-declare 'struct kvm_lapic_state'.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-12-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_arch_get_supported_cpuid() / kvm_arch_get_supported_msr_feature()
are only defined for x86 targets (in target/i386/kvm/kvm.c). Their
declarations are pointless on other targets.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-11-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Keep the function accessed by target/i386/ and hw/i386/
exposed, restrict the ones accessed by target/i386/kvm/.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-10-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call kvm_enabled() before kvm_hv_vpindex_settable()
to let the compiler elide its call.
kvm-stub.c is now empty, remove it.
Suggested-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-9-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call kvm_enabled() before kvm_enable_x2apic() to let the compiler elide
its call. Cleanup the code by simplifying "!xen_enabled() &&
kvm_enabled()" to just "kvm_enabled()".
Suggested-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-8-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All these functions:
- kvm_arch_get_supported_cpuid()
- kvm_has_smm(()
- kvm_hyperv_expand_features()
- kvm_set_max_apic_id()
are called after checking for kvm_enabled(), which is
false when KVM is not built. Since the compiler elides
these functions, their stubs are not used and can be
removed.
Inspired-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-7-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to have cpu-sysemu.c become accelerator-agnostic,
inline kvm_apic_in_kernel() -- which is a simple wrapper
to kvm_irqchip_in_kernel() -- and use the generic "sysemu/kvm.h"
header.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-6-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Changes the signature of the target-defined functions for
inserting/removing kvm hw breakpoints. The address and length arguments
are now of vaddr type, which both matches the type used internally in
accel/kvm/kvm-all.c and makes the api target-agnostic.
Signed-off-by: Anton Johansson <anjo@rev.ng>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230807155706.9580-4-anjo@rev.ng>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
kvm_arch_get_default_type() returns the default KVM type. This hook is
particularly useful to derive a KVM type that is valid for "none"
machine model, which is used by libvirt to probe the availability of
KVM.
For MIPS, the existing mips_kvm_type() is reused. This function ensures
the availability of VZ which is mandatory to use KVM on the current
QEMU.
Cc: qemu-stable@nongnu.org
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-id: 20230727073134.134102-2-akihiko.odaki@daynix.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
[PMM: added doc comment for new function]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Coverity points out (CID 1507534, 1507968) that we sometimes access
env->xen_singleshot_timer_ns under the protection of
env->xen_timers_lock and sometimes not.
This isn't always an issue. There are two modes for the timers; if the
kernel supports the EVTCHN_SEND capability then it handles all the timer
hypercalls and delivery internally, and all we use the field for is to
get/set the timer as part of the vCPU state via an ioctl(). If the
kernel doesn't have that support, then we do all the emulation within
qemu, and *those* are the code paths where we actually care about the
locking.
But it doesn't hurt to be a little bit more consistent and avoid having
to explain *why* it's OK.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20230801175747.145906-3-dwmw2@infradead.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
MCDT_NO bit indicates HW contains the security fix and doesn't need to
be mitigated to avoid data-dependent behaviour for certain instructions.
It needs no hypervisor support. Treat it as supported regardless of what
KVM reports.
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20230706054949.66556-4-tao1.su@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We use the user_ss[] array to hold the user emulation sources,
and the softmmu_ss[] array to hold the system emulation ones.
Hold the latter in the 'system_ss[]' array for parity with user
emulation.
Mechanical change doing:
$ sed -i -e s/softmmu_ss/system_ss/g $(git grep -l softmmu_ss)
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230613133347.82210-10-philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This had been pulled in via qemu/plugin.h from hw/core/cpu.h,
but that will be removed.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20230310195252.210956-5-richard.henderson@linaro.org>
[AJB: add various additional cases shown by CI]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20230315174331.2959-15-alex.bennee@linaro.org>
Reviewed-by: Emilio Cota <cota@braap.org>
Upstream commit ddf0fd9ae1 "hw/xen: Support HVM_PARAM_CALLBACK_TYPE_GSI callback"
added kvm_xen_maybe_deassert_callback usage to target/i386/kvm/kvm.c file without
conditional preprocessing check. This breaks any build not using CONFIG_XEN_EMU.
Protect call by conditional preprocessing to allow build without CONFIG_XEN_EMU.
Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230308130557.2420-1-mrezanin@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is only part of it; we will also need to get the PV back end drivers
to tear down their own mappings (or do it for them, but they kind of need
to stop using the pointers too).
Some more work on the actual PV back ends and xen-bus code is going to be
needed to really make soft reset and migration fully functional, and this
part is the basis for that.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The default number of PIRQs is set to 256 to avoid issues with 32-bit MSI
devices. Allow it to be increased if the user desires.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The way that Xen handles MSI PIRQs is kind of awful.
There is a special MSI message which targets a PIRQ. The vector in the
low bits of data must be zero. The low 8 bits of the PIRQ# are in the
destination ID field, the extended destination ID field is unused, and
instead the high bits of the PIRQ# are in the high 32 bits of the address.
Using the high bits of the address means that we can't intercept and
translate these messages in kvm_send_msi(), because they won't be caught
by the APIC — addresses like 0x1000fee46000 aren't in the APIC's range.
So we catch them in pci_msi_trigger() instead, and deliver the event
channel directly.
That isn't even the worst part. The worst part is that Xen snoops on
writes to devices' MSI vectors while they are *masked*. When a MSI
message is written which looks like it targets a PIRQ, it remembers
the device and vector for later.
When the guest makes a hypercall to bind that PIRQ# (snooped from a
marked MSI vector) to an event channel port, Xen *unmasks* that MSI
vector on the device. Xen guests using PIRQ delivery of MSI don't
ever actually unmask the MSI for themselves.
Now that this is working we can finally enable XENFEAT_hvm_pirqs and
let the guest use it all.
Tested with passthrough igb and emulated e1000e + AHCI.
CPU0 CPU1
0: 65 0 IO-APIC 2-edge timer
1: 0 14 xen-pirq 1-ioapic-edge i8042
4: 0 846 xen-pirq 4-ioapic-edge ttyS0
8: 1 0 xen-pirq 8-ioapic-edge rtc0
9: 0 0 xen-pirq 9-ioapic-level acpi
12: 257 0 xen-pirq 12-ioapic-edge i8042
24: 9600 0 xen-percpu -virq timer0
25: 2758 0 xen-percpu -ipi resched0
26: 0 0 xen-percpu -ipi callfunc0
27: 0 0 xen-percpu -virq debug0
28: 1526 0 xen-percpu -ipi callfuncsingle0
29: 0 0 xen-percpu -ipi spinlock0
30: 0 8608 xen-percpu -virq timer1
31: 0 874 xen-percpu -ipi resched1
32: 0 0 xen-percpu -ipi callfunc1
33: 0 0 xen-percpu -virq debug1
34: 0 1617 xen-percpu -ipi callfuncsingle1
35: 0 0 xen-percpu -ipi spinlock1
36: 8 0 xen-dyn -event xenbus
37: 0 6046 xen-pirq -msi ahci[0000:00:03.0]
38: 1 0 xen-pirq -msi-x ens4
39: 0 73 xen-pirq -msi-x ens4-rx-0
40: 14 0 xen-pirq -msi-x ens4-rx-1
41: 0 32 xen-pirq -msi-x ens4-tx-0
42: 47 0 xen-pirq -msi-x ens4-tx-1
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This wires up the basic infrastructure but the actual interrupts aren't
there yet, so don't advertise it to the guest.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Just hook up the basic hypercalls to stubs in xen_evtchn.c for now.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Just the basic shell, with the event channel hookup. It only dumps the
buffer for now; a real ring implmentation will come in a subsequent patch.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Which is used to fetch xenstore PFN and port to be used
by the guest. This is preallocated by the toolstack when
guest will just read those and use it straight away.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Xen has eight frames at 0xfeff8000 for this; we only really need two for
now and KVM puts the identity map at 0xfeffc000, so limit ourselves to
four.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Introduce support for one shot and periodic mode of Xen PV timers,
whereby timer interrupts come through a special virq event channel
with deadlines being set through:
1) set_timer_op hypercall (only oneshot)
2) vcpu_op hypercall for {set,stop}_{singleshot,periodic}_timer
hypercalls
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The guest is permitted to specify an arbitrary domain/bus/device/function
and INTX pin from which the callback IRQ shall appear to have come.
In QEMU we can only easily do this for devices that actually exist, and
even that requires us "knowing" that it's a PCMachine in order to find
the PCI root bus — although that's OK really because it's always true.
We also don't get to get notified of INTX routing changes, because we
can't do that as a passive observer; if we try to register a notifier
it will overwrite any existing notifier callback on the device.
But in practice, guests using PCI_INTX will only ever use pin A on the
Xen platform device, and won't swizzle the INTX routing after they set
it up. So this is just fine.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The GSI callback (and later PCI_INTX) is a level triggered interrupt. It
is asserted when an event channel is delivered to vCPU0, and is supposed
to be cleared when the vcpu_info->evtchn_upcall_pending field for vCPU0
is cleared again.
Thankfully, Xen does *not* assert the GSI if the guest sets its own
evtchn_upcall_pending field; we only need to assert the GSI when we
have delivered an event for ourselves. So that's the easy part, kind of.
There's a slight complexity in that we need to hold the BQL before we
can call qemu_set_irq(), and we definitely can't do that while holding
our own port_lock (because we'll need to take that from the qemu-side
functions that the PV backend drivers will call). So if we end up
wanting to set the IRQ in a context where we *don't* already hold the
BQL, defer to a BH.
However, we *do* need to poll for the evtchn_upcall_pending flag being
cleared. In an ideal world we would poll that when the EOI happens on
the PIC/IOAPIC. That's how it works in the kernel with the VFIO eventfd
pairs — one is used to trigger the interrupt, and the other works in the
other direction to 'resample' on EOI, and trigger the first eventfd
again if the line is still active.
However, QEMU doesn't seem to do that. Even VFIO level interrupts seem
to be supported by temporarily unmapping the device's BARs from the
guest when an interrupt happens, then trapping *all* MMIO to the device
and sending the 'resample' event on *every* MMIO access until the IRQ
is cleared! Maybe in future we'll plumb the 'resample' concept through
QEMU's irq framework but for now we'll do what Xen itself does: just
check the flag on every vmexit if the upcall GSI is known to be
asserted.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Add the array of virq ports to each vCPU so that we can deliver timers,
debug ports, etc. Global virqs are allocated against vCPU 0 initially,
but can be migrated to other vCPUs (when we implement that).
The kernel needs to know about VIRQ_TIMER in order to accelerate timers,
so tell it via KVM_XEN_VCPU_ATTR_TYPE_TIMER. Also save/restore the value
of the singleshot timer across migration, as the kernel will handle the
hypercalls automatically now.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This finally comes with a mechanism for actually injecting events into
the guest vCPU, with all the atomic-test-and-set that's involved in
setting the bit in the shinfo, then the index in the vcpu_info, and
injecting either the lapic vector as MSI, or letting KVM inject the
bare vector.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
It calls an internal close_port() helper which will also be used from
EVTCHNOP_reset and will actually do the work to disconnect/unbind a port
once any of that is actually implemented in the first place.
That in turn calls a free_port() internal function which will be in
error paths after allocation.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This adds the basic structure for maintaining the port table and reporting
the status of ports therein.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The kvm_xen_inject_vcpu_callback_vector() function will either deliver
the per-vCPU local APIC vector (as an MSI), or just kick the vCPU out
of the kernel to trigger KVM's automatic delivery of the global vector.
Support for asserting the GSI/PCI_INTX callbacks will come later.
Also add kvm_xen_get_vcpu_info_hva() which returns the vcpu_info of
a given vCPU.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Include basic support for setting HVM_PARAM_CALLBACK_IRQ to the global
vector method HVM_PARAM_CALLBACK_TYPE_VECTOR, which is handled in-kernel
by raising the vector whenever the vCPU's vcpu_info->evtchn_upcall_pending
flag is set.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This is the hook for adding the HVM_PARAM_CALLBACK_IRQ parameter in a
subsequent commit.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Split out from another commit]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
The HVMOP_set_evtchn_upcall_vector hypercall sets the per-vCPU upcall
vector, to be delivered to the local APIC just like an MSI (with an EOI).
This takes precedence over the system-wide delivery method set by the
HVMOP_set_param hypercall with HVM_PARAM_CALLBACK_IRQ. It's used by
Windows and Xen (PV shim) guests but normally not by Linux.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Rework for upstream kernel changes and split from HVMOP_set_param]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Ditch event_channel_op_compat which was never available to HVM guests]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Allow guest to setup the vcpu runstates which is used as
steal clock.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
In order to support Linux vdso in Xen.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Handle the hypercall to set a per vcpu info, and also wire up the default
vcpu_info in the shared_info page for the first 32 vCPUs.
To avoid deadlock within KVM a vCPU thread must set its *own* vcpu_info
rather than it being set from the context in which the hypercall is
invoked.
Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for migration,
and restore it in kvm_arch_put_registers() appropriately.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This is simply when guest tries to register a vcpu_info
and since vcpu_info placement is optional in the minimum ABI
therefore we can just fail with -ENOSYS
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This is when guest queries for support for HVMOP_pagetable_dying.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Specifically XENMEM_add_to_physmap with space XENMAPSPACE_shared_info to
allow the guest to set its shared_info page.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Use the xen_overlay device, add compat support]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Xen will "latch" the guest's 32-bit or 64-bit ("long mode") setting when
the guest writes the MSR to fill in the hypercall page, or when the guest
sets the event channel callback in HVM_PARAM_CALLBACK_IRQ.
KVM handles the former and sets the kernel's long_mode flag accordingly.
The latter will be handled in userspace. Keep them in sync by noticing
when a hypercall is made in a mode that doesn't match qemu's idea of
the guest mode, and resyncing from the kernel. Do that same sync right
before serialization too, in case the guest has set the hypercall page
but hasn't yet made a system call.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
They both do the same thing and just call sched_yield. This is enough to
stop the Linux guest panicking when running on a host kernel which doesn't
intercept SCHEDOP_poll and lets it reach userspace.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
It allows to shutdown itself via hypercall with any of the 3 reasons:
1) self-reboot
2) shutdown
3) crash
Implementing SCHEDOP_shutdown sub op let us handle crashes gracefully rather
than leading to triple faults if it remains unimplemented.
In addition, the SHUTDOWN_soft_reset reason is used for kexec, to reset
Xen shared pages and other enlightenments and leave a clean slate for the
new kernel without the hypervisor helpfully writing information at
unexpected addresses.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Ditch sched_op_compat which was never available for HVM guests,
Add SCHEDOP_soft_reset]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This is just meant to serve as an example on how we can implement
hypercalls. xen_version specifically since Qemu does all kind of
feature controllability. So handling that here seems appropriate.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Implement kvm_gva_rw() safely]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This means handling the new exit reason for Xen but still
crashing on purpose. As we implement each of the hypercalls
we will then return the right return code.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Add CPL to hypercall tracing, disallow hypercalls from CPL > 0]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
There are (at least) three different vCPU ID number spaces. One is the
internal KVM vCPU index, based purely on which vCPU was chronologically
created in the kernel first. If userspace threads are all spawned and
create their KVM vCPUs in essentially random order, then the KVM indices
are basically random too.
The second number space is the APIC ID space, which is consistent and
useful for referencing vCPUs. MSIs will specify the target vCPU using
the APIC ID, for example, and the KVM Xen APIs also take an APIC ID
from userspace whenever a vCPU needs to be specified (as opposed to
just using the appropriate vCPU fd).
The third number space is not normally relevant to the kernel, and is
the ACPI/MADT/Xen CPU number which corresponds to cs->cpu_index. But
Xen timer hypercalls use it, and Xen timer hypercalls *really* want
to be accelerated in the kernel rather than handled in userspace, so
the kernel needs to be told.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Introduce support for emulating CPUID for Xen HVM guests. It doesn't make
sense to advertise the KVM leaves to a Xen guest, so do Xen unconditionally
when the xen-version machine property is set.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Obtain xen_version from KVM property, make it automatic]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
This just initializes the basic Xen support in KVM for now. Only permitted
on TYPE_PC_MACHINE because that's where the sysbus devices for Xen heap
overlay, event channel, grant tables and other stuff will exist. There's
no point having the basic hypercall support if nothing else works.
Provide sysemu/kvm_xen.h and a kvm_xen_get_caps() which will be used
later by support devices.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
These are just a flag that documents the performance characteristic of
an instruction; it needs no hypervisor support. So include them even
if KVM does not show them. In particular, FZRM/FSRS/FSRC have only
been added very recently, but they are available on Sapphire Rapids
processors.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We have two inclusion loops:
block/block.h
-> block/block-global-state.h
-> block/block-common.h
-> block/blockjob.h
-> block/block.h
block/block.h
-> block/block-io.h
-> block/block-common.h
-> block/blockjob.h
-> block/block.h
I believe these go back to Emanuele's reorganization of the block API,
merged a few months ago in commit d7e2fe4aac.
Fortunately, breaking them is merely a matter of deleting unnecessary
includes from headers, and adding them back in places where they are
now missing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221221133551.3967339-2-armbru@redhat.com>
Commit 012d4c96e2 changed the visitor functions taking Error ** to
return bool instead of void, and the commits following it used the new
return value to simplify error checking. Since then a few more uses
in need of the same treatment crept in. Do that. All pretty
mechanical except for
* balloon_stats_get_all()
This is basically the same transformation commit 012d4c96e2 applied
to the virtual walk example in include/qapi/visitor.h.
* set_max_queue_size()
Additionally replace "goto end of function" by return.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221121085054.683122-10-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Specify maximum possible APIC ID assigned for current VM session to KVM
prior to the creation of vCPUs. By this setting, KVM can set up VM-scoped
data structure indexed by the APIC ID, e.g. Posted-Interrupt Descriptor
pointer table to support Intel IPI virtualization, with the most optimal
memory footprint.
It can be achieved by calling KVM_ENABLE_CAP for KVM_CAP_MAX_VCPU_ID
capability once KVM has enabled it. Ignoring the return error if KVM
doesn't support this capability yet.
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20220825025246.26618-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These memory allocation functions return void *, and casting to
another pointer type is useless clutter. Drop these casts.
If you really want another pointer type, consider g_new().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Message-Id: <20220923120025.448759-3-armbru@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Resetting a guest that has Hyper-V VMBus support enabled triggers a QEMU
assertion failure:
hw/hyperv/hyperv.c:131: synic_reset: Assertion `QLIST_EMPTY(&synic->sint_routes)' failed.
This happens both on normal guest reboot or when using "system_reset" HMP
command.
The failing assertion was introduced by commit 64ddecc88b ("hyperv: SControl is optional to enable SynIc")
to catch dangling SINT routes on SynIC reset.
The root cause of this problem is that the SynIC itself is reset before
devices using SINT routes have chance to clean up these routes.
Since there seems to be no existing mechanism to force reset callbacks (or
methods) to be executed in specific order let's use a similar method that
is already used to reset another interrupt controller (APIC) after devices
have been reset - by invoking the SynIC reset from the machine reset
handler via a new x86_cpu_after_reset() function co-located with
the existing x86_cpu_reset() in target/i386/cpu.c.
Opportunistically move the APIC reset handler there, too.
Fixes: 64ddecc88b ("hyperv: SControl is optional to enable SynIc") # exposed the bug
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <cb57cee2e29b20d06f81dce054cbcea8b5d497e8.1664552976.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The MSR_CORE_THREAD_COUNT MSR describes CPU package topology, such as number
of threads and cores for a given package. This is information that QEMU has
readily available and can provide through the new user space MSR deflection
interface.
This patch propagates the existing hvf logic from patch 027ac0cb51
("target/i386/hvf: add rdmsr 35H MSR_CORE_THREAD_COUNT") to KVM.
Signed-off-by: Alexander Graf <agraf@csgraf.de>
Message-Id: <20221004225643.65036-4-agraf@csgraf.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM has grown support to deflect arbitrary MSRs to user space since
Linux 5.10. For now we don't expect to make a lot of use of this
feature, so let's expose it the easiest way possible: With up to 16
individually maskable MSRs.
This patch adds a kvm_filter_msr() function that other code can call
to install a hook on KVM MSR reads or writes.
Signed-off-by: Alexander Graf <agraf@csgraf.de>
Message-Id: <20221004225643.65036-3-agraf@csgraf.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are cases that malicious virtual machine can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs. Notify VM exit is introduced to mitigate such kind of
attacks, which will generate a VM exit if no event window occurs in VM
non-root mode for a specified amount of time (notify window).
A new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT is exposed to user space
so that the user can query the capability and set the expected notify
window when creating VMs. The format of the argument when enabling this
capability is as follows:
Bit 63:32 - notify window specified in qemu command
Bit 31:0 - some flags (e.g. KVM_X86_NOTIFY_VMEXIT_ENABLED is set to
enable the feature.)
Users can configure the feature by a new (x86 only) accel property:
qemu -accel kvm,notify-vmexit=run|internal-error|disable,notify-window=n
The default option of notify-vmexit is run, which will enable the
capability and do nothing if the exit happens. The internal-error option
raises a KVM internal error if it happens. The disable option does not
enable the capability. The default value of notify-window is 0. It is valid
only when notify-vmexit is not disabled. The valid range of notify-window
is non-negative. It is even safe to set it to zero since there's an
internal hardware threshold to be added to ensure no false positive.
Because a notify VM exit may happen with VM_CONTEXT_INVALID set in exit
qualification (no cases are anticipated that would set this bit), which
means VM context is corrupted. It would be reflected in the flags of
KVM_EXIT_NOTIFY exit. If KVM_NOTIFY_CONTEXT_INVALID bit is set, raise a KVM
internal error unconditionally.
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220929072014.20705-5-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several hypervisor capabilities in KVM are target-specific. When exposed
to QEMU users as accelerator properties (i.e. -accel kvm,prop=value), they
should not be available for all targets.
Add a hook for targets to add their own properties to -accel kvm, for
now no such property is defined.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220929072014.20705-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>