xemu/include/sysemu/kvm_int.h

/*
 * Internal definitions for a target's KVM support
 *
 * This work is licensed under the terms of the GNU GPL, version 2 or later.
 * See the COPYING file in the top-level directory.
 *
 */
#ifndef QEMU_KVM_INT_H
#define QEMU_KVM_INT_H
#include "exec/memory.h"
#include "qapi/qapi-types-common.h"
#include "qemu/accel.h"
#include "qemu/queue.h"
#include "sysemu/kvm.h"
#include "hw/boards.h"
#include "hw/i386/topology.h"
#include "io/channel-socket.h"

typedef struct KVMSlot
{
    hwaddr start_addr;
    ram_addr_t memory_size;
    void *ram;
    int slot;
    int flags;
    int old_flags;
    /* Dirty bitmap cache for the slot */
    unsigned long *dirty_bmap;
    unsigned long dirty_bmap_size;
    /* Cache of the address space ID */
    int as_id;
    /* Cache of the offset in ram address space */
    ram_addr_t ram_start_offset;
    int guest_memfd;
    hwaddr guest_memfd_offset;
} KVMSlot;
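
/*
 * Illustrative sketch (not part of this header): a KVMSlot mirrors the
 * kernel's struct kvm_userspace_memory_region, which is installed with
 * the KVM_SET_USER_MEMORY_REGION vm ioctl, roughly:
 *
 *     struct kvm_userspace_memory_region mem = {
 *         .slot            = slot->slot | (kml->as_id << 16),
 *         .flags           = slot->flags,
 *         .guest_phys_addr = slot->start_addr,
 *         .memory_size     = slot->memory_size,
 *         .userspace_addr  = (unsigned long)slot->ram,
 *     };
 *     kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
 *
 * Field names follow the Linux UAPI; the exact call site may differ.
 */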

typedef struct KVMMemoryUpdate {
    QSIMPLEQ_ENTRY(KVMMemoryUpdate) next;
    MemoryRegionSection section;
} KVMMemoryUpdate;

typedef struct KVMMemoryListener {
    MemoryListener listener;
    /* Dynamically sized array of memslots (see nr_slots_allocated) */
    KVMSlot *slots;
    /* Number of slot entries currently in use */
    unsigned int nr_slots_used;
    /* Number of slot entries currently allocated */
    unsigned int nr_slots_allocated;
    int as_id;
    QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_add;
    QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_del;
} KVMMemoryListener;
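
/*
 * Illustrative sketch of the atomic memslot update flow (an assumption
 * about the call flow; names below are illustrative): the memory
 * listener's add/del callbacks only queue KVMMemoryUpdate entries onto
 * transaction_add/transaction_del, and the commit callback drains both
 * queues under kvm_slots_lock, inhibiting KVM ioctls only when an added
 * section overlaps a deleted one, so vCPUs never observe a memslot as
 * temporarily missing:
 *
 *     kvm_region_add()    -> QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, ...);
 *     kvm_region_del()    -> QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, ...);
 *     kvm_region_commit() -> if (overlap) inhibit ioctls;
 *                            apply deletions, then additions;
 *                            if (overlap) resume ioctls;
 */
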
#define KVM_MSI_HASHTAB_SIZE 256

typedef struct KVMHostTopoInfo {
    /* Number of packages on the host */
    unsigned int maxpkgs;
    /* Number of CPUs on the host */
    unsigned int maxcpus;
    /* Number of CPUs on each package */
    unsigned int *pkg_cpu_count;
    /* Each package can have a different maxticks value */
    unsigned int *maxticks;
} KVMHostTopoInfo;

struct KVMMsrEnergy {
    pid_t pid;
    bool enable;
    char *socket_path;
    QIOChannelSocket *sioc;
    QemuThread msr_thr;
    unsigned int guest_vcpus;
    unsigned int guest_vsockets;
    X86CPUTopoInfo guest_topo_info;
    KVMHostTopoInfo host_topo;
    const CPUArchIdList *guest_cpu_list;
    uint64_t *msr_value;
    uint64_t msr_unit;
    uint64_t msr_limit;
    uint64_t msr_info;
};
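
/*
 * Illustrative sketch of the accounting done by the RAPL worker thread
 * (msr_thr): each vCPU's virtual MSR_PKG_ENERGY_STATUS advance is the
 * host package's energy delta scaled by that vCPU thread's share of
 * package CPU time, plus an even share of the energy consumed by QEMU's
 * non-vCPU threads on the same package. Variable names below are
 * illustrative, not the actual implementation:
 *
 *     vcpu_delta = pkg_energy_delta * vcpu_ticks / pkg_ticks
 *                  + nonvcpu_energy_delta / nr_vcpus_on_pkg;
 *     msr_value[cpu] += vcpu_delta;
 */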

enum KVMDirtyRingReaperState {
    KVM_DIRTY_RING_REAPER_NONE = 0,
    /* The reaper is sleeping */
    KVM_DIRTY_RING_REAPER_WAIT,
    /* The reaper is reaping for dirty pages */
    KVM_DIRTY_RING_REAPER_REAPING,
};

/*
 * KVM reaper instance, responsible for collecting the KVM dirty bits
 * via the dirty ring.
 */
struct KVMDirtyRingReaper {
    /* The reaper thread */
    QemuThread reaper_thr;
    volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
    volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
};
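
/*
 * Illustrative lifecycle (an assumption based on the state names above):
 * the reaper starts in NONE before its thread is created, then cycles
 * WAIT -> REAPING -> WAIT, bumping reaper_iteration once per pass:
 *
 *     for (;;) {
 *         r->reaper_state = KVM_DIRTY_RING_REAPER_WAIT;
 *         // ...sleep until woken or a timeout expires...
 *         r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
 *         // ...collect dirty GFNs from each vCPU's ring...
 *         r->reaper_iteration++;
 *     }
 */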

struct KVMState
{
    AccelState parent_obj;
    /* Max number of KVM slots supported */
    int nr_slots_max;
    int fd;
    int vmfd;
    int coalesced_mmio;
    int coalesced_pio;
    struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
    bool coalesced_flush_in_progress;
    int vcpu_events;
#ifdef TARGET_KVM_HAVE_GUEST_DEBUG
    QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints;
#endif
    int max_nested_state_len;
    int kvm_shadow_mem;
    bool kernel_irqchip_allowed;
    bool kernel_irqchip_required;
    OnOffAuto kernel_irqchip_split;
    bool sync_mmu;
    bool guest_state_protected;
    uint64_t manual_dirty_log_protect;
    /*
     * Older POSIX says that ioctl numbers are signed int, but in
     * practice they are not. (Newer POSIX doesn't specify ioctl
     * at all.) Linux, glibc and *BSD all treat ioctl numbers as
     * unsigned, and real-world ioctl values like KVM_GET_XSAVE have
     * bit 31 set, which means that passing them via an 'int' will
     * result in sign-extension when they get converted back to the
     * 'unsigned long' which the ioctl() prototype uses. Luckily Linux
     * always treats the argument as an unsigned 32-bit int, so any
     * possible sign-extension is deliberately ignored, but for
     * consistency we keep to the same type that glibc is using.
     */
    unsigned long irq_set_ioctl;
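    /*
     * Worked example (values derived from the Linux UAPI definitions):
     * KVM_GET_XSAVE is _IOR(KVMIO, 0xa4, struct kvm_xsave), i.e.
     * 0x9000aea4, which has bit 31 set, so it overflows a signed int:
     *
     *     int bad = (int)0x9000aea4;        // negative: bit 31 is set
     *     unsigned long v = bad;            // 0xffffffff9000aea4 on LP64
     *     unsigned long ok = 0x9000aea4UL;  // 0x9000aea4, as intended
     *
     * The kernel truncates the request to 32 bits, so both happen to
     * work, but the 'unsigned long' form matches glibc's prototype.
     */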
    unsigned int sigmask_len;
    GHashTable *gsimap;
#ifdef KVM_CAP_IRQ_ROUTING
    struct kvm_irq_routing *irq_routes;
    int nr_allocated_irq_routes;
    unsigned long *used_gsi_bitmap;
    unsigned int gsi_count;
#endif
    KVMMemoryListener memory_listener;
    QLIST_HEAD(, KVMParkedVcpu) kvm_parked_vcpus;

    /* For "info mtree -f" to tell if an MR is registered in KVM */
    int nr_as;
    struct KVMAs {
        KVMMemoryListener *ml;
        AddressSpace *as;
    } *as;

    uint64_t kvm_dirty_ring_bytes;  /* Size of the per-vcpu dirty ring */
    uint32_t kvm_dirty_ring_size;   /* Number of dirty GFNs per ring */
    bool kvm_dirty_ring_with_bitmap;
    uint64_t kvm_eager_split_size;  /* Eager Page Splitting chunk size */
    struct KVMDirtyRingReaper reaper;
    struct KVMMsrEnergy msr_energy;
    NotifyVmexitOption notify_vmexit;
    uint32_t notify_window;
    uint32_t xen_version;
    uint32_t xen_caps;
    uint16_t xen_gnttab_max_frames;
    uint16_t xen_evtchn_max_pirq;
    char *device;
};

void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
                                  AddressSpace *as, int as_id,
                                  const char *name);
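
/*
 * Usage sketch (an assumption, mirroring accel/kvm/kvm-all.c): the
 * default listener is registered for address space 0 at init time, and
 * targets with extra address spaces (e.g. x86 SMM) register more:
 *
 *     kvm_memory_listener_register(s, &s->memory_listener,
 *                                  &address_space_memory, 0, "kvm-memory");
 */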

void kvm_set_max_memslot_size(hwaddr max_slot_size);

/**
 * kvm_hwpoison_page_add:
 * @ram_addr: the address in the RAM for the poisoned page
 *
 * Add a poisoned page to the list.
 */
void kvm_hwpoison_page_add(ram_addr_t ram_addr);
#endif