Both the master key and key slot passphrases are run through the PBKDF2
algorithm. The iteration count is expected to be generally very large
(many tens or hundreds of thousands). It is hard to define a lower
cutoff, but we can certainly say that the iteration count should be
non-zero. A zero count likely indicates an initialization mistake, so
reject it.
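
As a minimal sketch of the check described above (the helper name is
hypothetical and is not QEMU's actual function), the rule amounts to
rejecting only the zero value:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a PBKDF2 iteration count read
 * from a LUKS header is acceptable. */
static bool luks_iterations_valid(uint32_t iterations)
{
    /* Only zero is rejected; no larger minimum is assumed here. */
    return iterations != 0;
}
```
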
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The LUKS header data on disk is a fixed size; however, there is expected
to be a gap between the end of the header and the first key slot to get
alignment with the 2nd sector on 4k drives. This wasn't originally part
of the LUKS spec, but was always part of the reference implementation,
so it is worth validating this.
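
A rough sketch of such a check follows. The 512-byte sector size and the
4096-byte offset are assumptions made for illustration, and the names
are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE     512u   /* assumed sector size in bytes */
#define KEY_SLOT_OFFSET 4096u  /* assumed byte offset of the first key slot */

/* Hypothetical helper: a key slot must not start before the aligned
 * offset that follows the fixed size header. */
static bool key_slot_offset_valid(uint32_t key_offset_sector)
{
    /* Round the byte offset up to a whole number of sectors. */
    uint32_t min_sector = (KEY_SLOT_OFFSET + SECTOR_SIZE - 1) / SECTOR_SIZE;

    return key_offset_sector >= min_sector;
}
```
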
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
We already validate that LUKS keyslots don't overlap with the
header, or with each other. This closes the remaining hole in
validation of LUKS file regions.
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
We already check that key material doesn't overlap between key slots,
and that it doesn't overlap with the payload. We didn't check for
overlap with the LUKS header.
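
The overlap test itself can be sketched as a generic half-open range
comparison. This is an illustration with hypothetical names, and it
assumes that start + length does not overflow 64 bits:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if [a_start, a_start + a_len) and
 * [b_start, b_start + b_len) share at least one unit.
 * Assumes the sums below do not overflow. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_len,
                           uint64_t b_start, uint64_t b_len)
{
    return a_start < b_start + b_len && b_start < a_start + a_len;
}
```

With the header treated as the range starting at 0, a key slot whose
key material starts inside that range would be rejected.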
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Although the LUKS stripes are encoded in the keyslot header and so
potentially configurable, in practice the cryptsetup implementation mandates
that this has the fixed value 4000. To avoid incompatibility, apply the
same enforcement in QEMU too. This also caps the memory usage for
key material when QEMU tries to open a LUKS volume.
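
To illustrate why the fixed value also bounds memory, here is a small
sketch (hypothetical names, not QEMU's code) of the check and of the
resulting key material size for a given key length:

```c
#include <stdbool.h>
#include <stdint.h>

#define LUKS_STRIPES 4000u  /* fixed value mandated by cryptsetup */

/* Hypothetical helper: accept only the fixed stripe count. */
static bool luks_stripes_valid(uint32_t stripes)
{
    return stripes == LUKS_STRIPES;
}

/* With the stripe count fixed, key material size depends only on the
 * key length, so it is bounded by the largest supported key. */
static uint64_t luks_key_material_bytes(uint32_t key_bytes)
{
    return (uint64_t)key_bytes * LUKS_STRIPES;
}
```
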
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The LUKS spec requires that header strings are NUL-terminated, and our
code relies on that. Protect against maliciously crafted headers by
adding validation.
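
One way to express this validation, shown here as a sketch with a
hypothetical helper name, is to look for a NUL byte within the fixed
size field before treating it as a C string:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the fixed size field contains a NUL
 * byte somewhere within its first len bytes. */
static bool field_is_nul_terminated(const char *field, size_t len)
{
    return memchr(field, '\0', len) != NULL;
}
```
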
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Using FILE * APIs for writing the PSK file results in translation from
UNIX to DOS line endings on Windows. When the crypto PSK code later
loads the credentials the stray \r will result in failure to load the
PSK credentials into GNUTLS.
Rather than switching the FILE* APIs to open in binary format, just
switch to the more concise g_file_set_contents API.
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Tested-by: Bin Meng <bmeng.cn@gmail.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
If setting credentials fails, the handshake will later fail to complete
with an obscure error message which is hard to diagnose.
Reviewed-by: Bin Meng <bmeng.cn@gmail.com>
Tested-by: Bin Meng <bmeng.cn@gmail.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Currently we check status of each submodule, before actually checking
if we're in a git repo. These status commands will all fail, but we
are hiding their output so we don't see it currently.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
As of the kernel commit linked below, Linux ingests an RNG seed
passed as part of the environment block by the bootloader or firmware.
This mechanism works across all different environment block types,
generically, which pass some block via the second firmware argument. On
malta, this has been tested to work when passed as an argument from
U-Boot's linux_env_set.
As is the case on most other architectures (such as boston), when
booting with `-kernel`, QEMU, acting as the bootloader, should pass the
RNG seed, so that the machine has good entropy for Linux to consume. So
this commit implements that quite simply by using the guest random API,
which is what is used on nearly all other archs too. It also
reinitializes the seed on reboot, so that it is always fresh.
Link: https://git.kernel.org/torvalds/c/056a68cea01
Cc: Aleksandar Rikalo <aleksandar.rikalo@syrmia.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When the system reboots, the rng-seed that the FDT has should be
re-randomized, so that the new boot gets a new seed. Since the FDT is in
the ROM region at this point, we add a hook right after the ROM has been
added, so that we have a pointer to that copy of the FDT.
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-12-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When the system reboots, the rng-seed that the FDT has should be
re-randomized, so that the new boot gets a new seed. Since the FDT is in
the ROM region at this point, we add a hook right after the ROM has been
added, so that we have a pointer to that copy of the FDT.
Cc: Stafford Horne <shorne@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-11-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When the system reboots, the rng-seed that the FDT has should be
re-randomized, so that the new boot gets a new seed. Since the FDT is in
the ROM region at this point, we add a hook right after the ROM has been
added, so that we have a pointer to that copy of the FDT.
Cc: Aleksandar Rikalo <aleksandar.rikalo@syrmia.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-9-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Snapshot loading is supposed to be deterministic, so we shouldn't
re-randomize the various seeds used.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-8-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Snapshot loading is supposed to be deterministic, so we shouldn't
re-randomize the various seeds used.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-7-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When the system reboots, the rng-seed that the FDT has should be
re-randomized, so that the new boot gets a new seed. Since the FDT is in
the ROM region at this point, we add a hook right after the ROM has been
added, so that we have a pointer to that copy of the FDT.
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Bin Meng <bin.meng@windriver.com>
Cc: qemu-riscv@nongnu.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-id: 20221025004327.568476-6-Jason@zx2c4.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When the system reboots, the rng-seed that the FDT has should be
re-randomized, so that the new boot gets a new seed. Since the FDT is in
the ROM region at this point, we add a hook right after the ROM has been
added, so that we have a pointer to that copy of the FDT.
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-arm@nongnu.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-5-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Snapshot loading is supposed to be deterministic, so we shouldn't
re-randomize the various seeds used.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-4-Jason@zx2c4.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
When the system reboots, the rng-seed that the FDT has should be
re-randomized, so that the new boot gets a new seed. Several
architectures require this functionality, so export a function for
injecting a new seed into the given FDT.
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-id: 20221025004327.568476-3-Jason@zx2c4.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Snapshot loading only expects to call deterministic handlers, not
non-deterministic ones. So introduce a way of registering handlers that
won't be called when resetting for snapshots.
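
The idea can be sketched with a small standalone registry. All names
below are hypothetical and the fixed size table is a simplification;
this is not the QEMU reset API:

```c
#include <stdbool.h>
#include <stddef.h>

typedef void ResetFn(void *opaque);

typedef struct {
    ResetFn *fn;
    void *opaque;
    bool skip_on_snapshot_load;  /* true for non-deterministic handlers */
} ResetEntry;

#define MAX_RESET_ENTRIES 8
static ResetEntry reset_entries[MAX_RESET_ENTRIES];
static size_t reset_count;

/* Register a handler; returns false if the table is full. */
static bool register_reset(ResetFn *fn, void *opaque, bool skip_on_snapshot_load)
{
    if (reset_count == MAX_RESET_ENTRIES) {
        return false;
    }
    reset_entries[reset_count++] = (ResetEntry){ fn, opaque, skip_on_snapshot_load };
    return true;
}

/* Run all handlers, leaving out the flagged ones on a snapshot load. */
static void run_resets(bool snapshot_load)
{
    for (size_t i = 0; i < reset_count; i++) {
        if (snapshot_load && reset_entries[i].skip_on_snapshot_load) {
            continue;
        }
        reset_entries[i].fn(reset_entries[i].opaque);
    }
}

/* Example handler: counts how often it was invoked. */
static void bump_counter(void *opaque)
{
    ++*(int *)opaque;
}
```

A seed re-randomization handler would be registered with the flag set,
so that a snapshot load leaves the seed untouched.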
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Message-id: 20221025004327.568476-2-Jason@zx2c4.com
[PMM: updated json doc comment with Markus' text; fixed
checkpatch style nit]
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
We had only been reporting the stage2 page size. This causes
problems if stage1 is using a larger page size (16k, 2M, etc),
but stage2 is using a smaller page size, because cputlb does
not set large_page_{addr,mask} properly.
Fix by using the max of the two page sizes.
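
In sketch form, with sizes in bytes and a hypothetical helper name
(the actual representation in the code may differ):

```c
#include <stdint.h>

/* Hypothetical helper: report the larger of the stage1 and stage2
 * page sizes for a combined two-stage translation. */
static uint64_t reported_page_size(uint64_t s1_size, uint64_t s2_size)
{
    return s1_size > s2_size ? s1_size : s2_size;
}
```
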
Reported-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-15-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Perform the atomic update for hardware management of the dirty bit.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-14-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Perform the atomic update for hardware management of the access flag.
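
As a standalone illustration of the technique, the following sketch
uses a C11 compare-and-swap to set the access flag (bit 10 of the
descriptor). The helper name is hypothetical and this is not the code
added by this patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_AF (UINT64_C(1) << 10)  /* access flag bit of a descriptor */

/* Hypothetical helper: try to set AF in *pte, provided the descriptor
 * still holds the value *old that the walk read earlier. Returns true
 * on success. On failure *old receives the current descriptor so the
 * caller can redo the walk with the fresh value. */
static bool set_access_flag(_Atomic uint64_t *pte, uint64_t *old)
{
    uint64_t updated = *old | DESC_AF;

    return atomic_compare_exchange_strong(pte, old, updated);
}
```
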
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-13-richard.henderson@linaro.org
[PMM: Fix accidental PROT_WRITE to PAGE_WRITE; add missing
main-loop.h include]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Replace some gotos with some nested if statements.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20221024051851.3074715-12-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Both GP and DBM are in the upper attribute block.
Extend the computation of attrs to include them,
then simplify the setting of guarded.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20221024051851.3074715-11-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Leave the upper and lower attributes in the place they originate
from in the descriptor. Shifting them around is confusing, since
one cannot read the bit numbers out of the manual. Also, new
attributes have been added which would alter the shifts.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20221024051851.3074715-10-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Always overriding fi->type was incorrect, as we would not properly
propagate the fault type from S1_ptw_translate, or arm_ldq_ptw.
Simplify things by providing a new label for a translation fault.
For other faults, store into fi directly.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20221024051851.3074715-9-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The unconditional loop was used both to iterate over levels
and to control parsing of attributes. Use an explicit goto
in both cases.
While this appears less clean for iterating over levels, we
will need to jump back into the middle of this loop for
atomic updates, which is even uglier.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-8-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
This fault type is to be used with FEAT_HAFDBS when
the guest enables hw updates, but places the tables
in memory where atomic updates are unsupported.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20221024051851.3074715-7-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Separate S1 translation from the actual lookup.
This will enable LPAE hardware updates.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-6-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20221024051851.3074715-5-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The MMFR1 field may indicate support for hardware update of
access flag alone, or access flag and dirty bit.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-4-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Hoist the computation of the mmu_idx for the ptw up to
get_phys_addr_with_struct and get_phys_addr_twostage.
This removes the duplicate check for stage2 disabled
from the middle of the walk, performing it only once.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20221024051851.3074715-3-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reduce the amount of typing required for this check.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20221024051851.3074715-2-richard.henderson@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The semantic difference between the deprecated device_legacy_reset()
function and the newer device_cold_reset() function is that the new
function resets both the device itself and any qbuses it owns,
whereas the legacy function resets just the device itself and nothing
else. In hyperv_synic_reset() we reset a SynICState, which has no
qbuses, so for this purpose the two functions behave identically and
we can stop using the deprecated one.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-id: 20221013171817.1447562-1-peter.maydell@linaro.org
The code for handling the reset level count in the Resettable code
has two issues:
The reset count is only decremented for the 1->0 case. This means
that if there's ever a nested reset that takes the count to 2 then it
will never again be decremented. Eventually the count will exceed
the '50' limit in resettable_phase_enter() and QEMU will trip over
the assertion failure. The repro case in issue 1266 is an example of
this that happens now that the SCSI subsystem uses three-phase reset.
Secondly, the count is decremented only after the exit phase handler
is called. Moving the reset count decrement from "just after" to
"just before" calling the exit phase handler allows
resettable_is_in_reset() to return false during the handler
execution.
This simplifies reset handling in resettable devices. Typically, a
function that updates the device state will just need to read the
current reset state and no longer has to treat the "in a reset-exit
transition" state as a special case.
Note that the semantics change to the *_is_in_reset() functions
will have no effect on the current codebase, because only two
devices (hw/char/cadence_uart.c and hw/misc/zynq_sclr.c) currently
call those functions, and in neither case do they do it from the
device's exit phase method.
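
The intended counting can be sketched with a simplified standalone
model. The type and function names are hypothetical; this is not the
Resettable code itself:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned count;              /* nesting level of active resets */
    bool in_reset_seen_by_exit;  /* what the exit handler observed */
    unsigned exit_calls;
} Dev;

static bool dev_is_in_reset(const Dev *d)
{
    return d->count > 0;
}

/* Stand-in for a device's exit phase handler. */
static void dev_exit_handler(Dev *d)
{
    d->in_reset_seen_by_exit = dev_is_in_reset(d);
    d->exit_calls++;
}

static void dev_assert_reset(Dev *d)
{
    d->count++;
}

static void dev_release_reset(Dev *d)
{
    assert(d->count > 0);
    /* Decrement on every release, and before running the exit handler,
     * so the handler no longer sees the device as being in reset. */
    if (--d->count == 0) {
        dev_exit_handler(d);
    }
}
```

With nested resets, each release lowers the count, and the exit handler
runs only once the outermost reset is released.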
Fixes: 4a5fc890 ("scsi: Use device_cold_reset() and bus_cold_reset()")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1266
Signed-off-by: Damien Hedde <damien.hedde@greensocs.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reported-by: Michael Peter <michael.peter@hensoldt-cyber.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20221020142749.3357951-1-peter.maydell@linaro.org
Buglink: https://bugs.launchpad.net/qemu/+bug/1905297
[PMM: adjust the docs paragraph changed to get the name of the
'enter' phase right and to clarify exactly when the count is
adjusted; rewrite the commit message]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
An exception targeting EL2 from lower EL is actually maskable when
HCR_E2H and HCR_TGE are both set. This applies to both secure and
non-secure Security state.
We can remove the conditions that try to suppress masking of
interrupts when we are Secure and the exception targets EL2 and
Secure EL2 is disabled. This is OK because in that situation
arm_phys_excp_target_el() will never return 2 as the target EL. The
'not if secure' check in this function was originally written before
arm_hcr_el2_eff(), and back then the target EL returned by
arm_phys_excp_target_el() could be 2 even if we were in Secure
EL0/EL1; but it is no longer needed.
Signed-off-by: Ake Koomsin <ake@igel.co.jp>
Message-id: 20221017092432.546881-1-ake@igel.co.jp
[PMM: Add commit message paragraph explaining why it's OK to
remove the checks on secure and SCR_EEL2]
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The "PCI Bus Binding to: IEEE Std 1275-1994" defines the compatible
string for a PCIe bus or endpoint as "pci<vendorid>,<deviceid>" or
similar. Since the initial binding for PCI virtio-iommu didn't follow
this rule, it was modified to accept both strings and ensure backward
compatibility. Also, the unit-name for the node should be
"device,function".
Fix corresponding dt-validate and dtc warnings:
pcie@10000000: virtio_iommu@16:compatible: ['virtio,pci-iommu'] does not contain items matching the given schema
pcie@10000000: Unevaluated properties are not allowed (... 'virtio_iommu@16' were unexpected)
From schema: linux/Documentation/devicetree/bindings/pci/host-generic-pci.yaml
virtio_iommu@16: compatible: 'oneOf' conditional failed, one must be fixed:
['virtio,pci-iommu'] is too short
'pci1af4,1057' was expected
From schema: dtschema/schemas/pci/pci-bus.yaml
Warning (pci_device_reg): /pcie@10000000/virtio_iommu@16: PCI unit address format error, expected "2,0"
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
FEAT_E0PD adds new bits E0PD0 and E0PD1 to TCR_EL1, which allow the
OS to forbid EL0 access to half of the address space. Since this is
an EL0-specific variation on the existing TCR_ELx.{EPD0,EPD1}, we can
implement it entirely in aa64_va_parameters().
This requires moving the existing regime_is_user() to internals.h
so that the code in helper.c can get at it.
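
As a rough standalone sketch of the resulting decision (the function
and its boolean parameters are hypothetical, and only the TCR_EL1 bit
positions are taken from the architecture), an EL0 access to a half
with its E0PD bit set is handled as if that half were disabled:

```c
#include <stdbool.h>
#include <stdint.h>

#define TCR_EPD0  (UINT64_C(1) << 7)
#define TCR_EPD1  (UINT64_C(1) << 23)
#define TCR_E0PD0 (UINT64_C(1) << 55)
#define TCR_E0PD1 (UINT64_C(1) << 56)

/* Hypothetical helper: true if accesses to the selected half of the
 * address space should be treated as having walks disabled.
 * upper_half selects the TTBR1 half; is_el0 is true for an EL0 regime;
 * have_e0pd says whether the CPU implements FEAT_E0PD. */
static bool va_half_disabled(uint64_t tcr, bool upper_half,
                             bool is_el0, bool have_e0pd)
{
    bool epd = (tcr & (upper_half ? TCR_EPD1 : TCR_EPD0)) != 0;
    bool e0pd = (tcr & (upper_half ? TCR_E0PD1 : TCR_E0PD0)) != 0;

    if (have_e0pd && is_el0 && e0pd) {
        epd = true;
    }
    return epd;
}
```
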
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 20221021160131.3531787-1-peter.maydell@linaro.org
Currently, there is no way to configure a CPU affinity inside QEMU when
the sandbox option disables it for QEMU as a whole, for example, via:
-sandbox enable=on,resourcecontrol=deny
While ThreadContext objects can be created on the QEMU commandline and
the CPU affinity can be configured externally via the thread-id, this is
insufficient if a ThreadContext with a certain CPU affinity is already
required during QEMU startup, before we can intercept QEMU and
configure the CPU affinity.
Blocking sched_setaffinity() was introduced in 24f8cdc572 ("seccomp:
add resourcecontrol argument to command line"), "to avoid any bigger of the
process". However, we only care about this once QEMU is running, not when
the instance starting QEMU explicitly requests a certain CPU affinity
on the QEMU commandline.
Right now, for NUMA-aware preallocation of memory backends used for initial
machine RAM, one has to:
1) Start QEMU with the memory-backend with "prealloc=off"
2) Pause QEMU before it starts the guest (-S)
3) Create ThreadContext, configure the CPU affinity using the thread-id
4) Configure the ThreadContext as "prealloc-context" of the memory
backend
5) Trigger preallocation by setting "prealloc=on"
To simplify this handling especially for initial machine RAM,
allow creation of ThreadContext objects before parsing sandbox options,
such that the CPU affinity requested on the QEMU commandline alongside the
sandbox option can be set. As ThreadContext objects essentially only create
a persistent context thread and set the CPU affinity, this is easily
possible.
To make "-name debug-threads=on" keep working as expected for the
context threads, perform earlier parsing of "-name".
With this change, we can create a ThreadContext with a CPU affinity on
the QEMU commandline and use it for preallocation of memory backends
glued to the machine (simplified example):
qemu-system-x86_64 -m 1G \
-object thread-context,id=tc1,cpu-affinity=3-4 \
-object memory-backend-ram,id=pc.ram,size=1G,prealloc=on,prealloc-threads=2,prealloc-context=tc1 \
-machine memory-backend=pc.ram \
-S -monitor stdio -sandbox enable=on,resourcecontrol=deny
And while we can query the current CPU affinity:
(qemu) qom-get tc1 cpu-affinity
[
3,
4
]
We can no longer change it from QEMU directly:
(qemu) qom-set tc1 cpu-affinity 1-2
Error: Setting CPU affinity failed: Operation not permitted
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Message-Id: <20221014134720.168738-8-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's allow for specifying a thread context via the "prealloc-context"
property. When set, preallocation threads will be created via the
thread context -- inheriting the same CPU affinity as the thread
context.
Pinning preallocation threads to CPUs can heavily increase performance
in NUMA setups, because preallocation from a CPU close to the target
NUMA node(s) is faster than preallocation from a more remote CPU,
simply because of memory bandwidth for initializing memory with zeroes.
This is especially relevant for very large VMs backed by huge/gigantic
pages, whereby preallocation is mandatory.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Message-Id: <20221014134720.168738-7-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
... and implement it under POSIX. When a ThreadContext is provided,
create new threads via the context such that these new threads obtain a
properly configured CPU affinity.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Message-Id: <20221014134720.168738-6-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's make it easier to pin threads created via a ThreadContext to
all host CPUs currently belonging to a given set of host NUMA nodes --
which is the common case.
"node-affinity" is simply a shortcut for setting "cpu-affinity" manually
to the list of host CPUs belonging to the set of host nodes. This property
can only be written.
A simple QEMU example to set the CPU affinity to host node 1 on a system
with two nodes, 24 CPUs each, whereby odd-numbered host CPUs belong to
host node 1:
qemu-system-x86_64 -S \
-object thread-context,id=tc1,node-affinity=1
And we can query the cpu-affinity via HMP/QMP:
(qemu) qom-get tc1 cpu-affinity
[
1,
3,
5,
7,
9,
11,
13,
15,
17,
19,
21,
23,
25,
27,
29,
31,
33,
35,
37,
39,
41,
43,
45,
47
]
We cannot query the node-affinity:
(qemu) qom-get tc1 node-affinity
Error: Insufficient permission to perform this operation
But note that due to dynamic library loading this example will not work
before we actually make use of thread_context_create_thread() in QEMU
code, because the type will otherwise not get registered. We'll wire
this up next to make it work.
Note that if the host CPUs for a host node change due to CPU hot(un)plug or
CPU onlining/offlining (i.e., lscpu output changes) after the ThreadContext
was started, the CPU affinity will not get updated.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221014134720.168738-5-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Setting the CPU affinity of QEMU threads is a bit problematic, because
QEMU doesn't always have permissions to set the CPU affinity itself,
for example, after seccomp has been initialized by QEMU with:
-sandbox enable=on,resourcecontrol=deny
General information about CPU affinities can be found in the man page of
taskset:
CPU affinity is a scheduler property that "bonds" a process to a given
set of CPUs on the system. The Linux scheduler will honor the given CPU
affinity and the process will not run on any other CPUs.
While upper layers are already aware of how to handle CPU affinities for
long-lived threads like iothreads or vcpu threads, especially short-lived
threads, as used for memory-backend preallocation, are more involved to
handle. These threads are created on demand and upper layers are not even
able to identify and configure them.
Introduce the concept of a ThreadContext, that is essentially a thread
used for creating new threads. All threads created via that context
thread inherit the configured CPU affinity. Consequently, it's
sufficient to create a ThreadContext and configure it once, and have all
threads created via that ThreadContext inherit the same CPU affinity.
The CPU affinity of a ThreadContext can be configured two ways:
(1) Obtaining the thread id via the "thread-id" property and setting the
CPU affinity manually (e.g., via taskset).
(2) Setting the "cpu-affinity" property and letting QEMU try to set the
CPU affinity itself. This will fail if QEMU doesn't have permissions
to do so anymore after seccomp was initialized.
A simple QEMU example to set the CPU affinity to host CPU 0,1,6,7 would be:
qemu-system-x86_64 -S \
-object thread-context,id=tc1,cpu-affinity=0-1,cpu-affinity=6-7
And we can query it via HMP/QMP:
(qemu) qom-get tc1 cpu-affinity
[
0,
1,
6,
7
]
But note that due to dynamic library loading this example will not work
before we actually make use of thread_context_create_thread() in QEMU
code, because the type will otherwise not get registered. We'll wire
this up next to make it work.
In general, the interface behaves like pthread_setaffinity_np(): host
CPU numbers that are currently not available are ignored; only host CPU
numbers that are impossible with the current kernel will fail. If the
list of host CPU numbers does not include a single CPU that is
available, setting the CPU affinity will fail.
A ThreadContext can be reused, simply by reconfiguring the CPU affinity.
Note that the CPU affinity of previously created threads will not get
adjusted.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20221014134720.168738-4-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Usually, we let upper layers handle CPU pinning, because
pthread_setaffinity_np() (-> sched_setaffinity()) is blocked via
seccomp when starting QEMU with
-sandbox enable=on,resourcecontrol=deny
However, we want to configure and observe the CPU affinity of threads
from QEMU directly in some cases when the sandbox option is either not
enabled or not active yet.
So let's add a way to configure CPU pinning via
qemu_thread_set_affinity() and obtain CPU affinity via
qemu_thread_get_affinity() and implement them under POSIX using
pthread_setaffinity_np() + pthread_getaffinity_np().
Implementation under Windows is possible using SetProcessAffinityMask()
+ GetProcessAffinityMask(), however, that is left as future work.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Message-Id: <20221014134720.168738-3-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Let's
* give the function a "qemu_*" style name
* make sure the parameters in the implementation match the prototype
* rename smp_cpus to max_threads, which makes the semantics of that
parameter clearer
... and add a function documentation.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Message-Id: <20221014134720.168738-2-david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
The element size is encoded in the M3 field, not in the M4
field.
Fixes: be6324c6b7 ("s390x/tcg: Implement VECTOR ISOLATE STRING")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1248
Message-Id: <20221012182755.1014853-3-thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
This is common practice, see the Makefile.target in the aarch64
folder for example.
Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20221012182755.1014853-2-thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Under PV, the guest's TOD clock is under control of the ultravisor and the
hypervisor cannot change it.
With upcoming kernel changes[1], the Linux kernel will reject QEMU's
request to adjust the guest's clock in this case, so don't attempt to set
the clock.
This avoids the following warning message on save/restore of a PV guest:
warning: Unable to set KVM guest TOD clock: Operation not supported
[1] https://lore.kernel.org/all/20221011160712.928239-2-nrb@linux.ibm.com/
Fixes: c3347ed0d2 ("s390x: protvirt: Support unpack facility")
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Message-Id: <20221012123229.1196007-1-nrb@linux.ibm.com>
[thuth: Add curly braces]
Signed-off-by: Thomas Huth <thuth@redhat.com>