mirror of https://github.com/xemu-project/xemu.git
atomics: convert to reStructuredText
No attempts to fix or update the text; these are left for the next patch in the series. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit is contained in:
parent
278fb16273
commit
15e8699f00
|
@ -0,0 +1,446 @@
|
||||||
|
=========================
|
||||||
|
Atomic operations in QEMU
|
||||||
|
=========================
|
||||||
|
|
||||||
|
CPUs perform independent memory operations effectively in random order.
|
||||||
|
but this can be a problem for CPU-CPU interaction (including interactions
|
||||||
|
between QEMU and the guest). Multi-threaded programs use various tools
|
||||||
|
to instruct the compiler and the CPU to restrict the order to something
|
||||||
|
that is consistent with the expectations of the programmer.
|
||||||
|
|
||||||
|
The most basic tool is locking. Mutexes, condition variables and
|
||||||
|
semaphores are used in QEMU, and should be the default approach to
|
||||||
|
synchronization. Anything else is considerably harder, but it's
|
||||||
|
also justified more often than one would like. The two tools that
|
||||||
|
are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
|
||||||
|
|
||||||
|
Macros defined by ``qemu/atomic.h`` fall in three camps:
|
||||||
|
|
||||||
|
- compiler barriers: ``barrier()``;
|
||||||
|
|
||||||
|
- weak atomic access and manual memory barriers: ``atomic_read()``,
|
||||||
|
``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``,
|
||||||
|
``smp_mb_release()``, ``smp_read_barrier_depends()``;
|
||||||
|
|
||||||
|
- sequentially consistent atomic access: everything else.
|
||||||
|
|
||||||
|
|
||||||
|
Compiler memory barrier
|
||||||
|
=======================
|
||||||
|
|
||||||
|
``barrier()`` prevents the compiler from moving the memory accesses either
|
||||||
|
side of it to the other side. The compiler barrier has no direct effect
|
||||||
|
on the CPU, which may then reorder things however it wishes.
|
||||||
|
|
||||||
|
``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
|
||||||
|
architectures, CPU guarantees are strong enough that blocking compiler
|
||||||
|
optimizations already ensures the correct order of execution. In this
|
||||||
|
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
|
||||||
|
compiler barriers.
|
||||||
|
|
||||||
|
Still, ``barrier()`` can be useful when writing code that can be interrupted
|
||||||
|
by signal handlers.
|
||||||
|
|
||||||
|
|
||||||
|
Sequentially consistent atomic access
|
||||||
|
=====================================
|
||||||
|
|
||||||
|
Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
|
||||||
|
consistency*, where "the result of any execution is the same as if the
|
||||||
|
operations of all the processors were executed in some sequential order,
|
||||||
|
and the operations of each individual processor appear in this sequence
|
||||||
|
in the order specified by its program".
|
||||||
|
|
||||||
|
``qemu/atomic.h`` provides the following set of atomic read-modify-write
|
||||||
|
operations::
|
||||||
|
|
||||||
|
void atomic_inc(ptr)
|
||||||
|
void atomic_dec(ptr)
|
||||||
|
void atomic_add(ptr, val)
|
||||||
|
void atomic_sub(ptr, val)
|
||||||
|
void atomic_and(ptr, val)
|
||||||
|
void atomic_or(ptr, val)
|
||||||
|
|
||||||
|
typeof(*ptr) atomic_fetch_inc(ptr)
|
||||||
|
typeof(*ptr) atomic_fetch_dec(ptr)
|
||||||
|
typeof(*ptr) atomic_fetch_add(ptr, val)
|
||||||
|
typeof(*ptr) atomic_fetch_sub(ptr, val)
|
||||||
|
typeof(*ptr) atomic_fetch_and(ptr, val)
|
||||||
|
typeof(*ptr) atomic_fetch_or(ptr, val)
|
||||||
|
typeof(*ptr) atomic_fetch_xor(ptr, val)
|
||||||
|
typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
|
||||||
|
typeof(*ptr) atomic_xchg(ptr, val)
|
||||||
|
typeof(*ptr) atomic_cmpxchg(ptr, old, new)
|
||||||
|
|
||||||
|
all of which return the old value of ``*ptr``. These operations are
|
||||||
|
polymorphic; they operate on any type that is as wide as a pointer.
|
||||||
|
|
||||||
|
Similar operations return the new value of ``*ptr``::
|
||||||
|
|
||||||
|
typeof(*ptr) atomic_inc_fetch(ptr)
|
||||||
|
typeof(*ptr) atomic_dec_fetch(ptr)
|
||||||
|
typeof(*ptr) atomic_add_fetch(ptr, val)
|
||||||
|
typeof(*ptr) atomic_sub_fetch(ptr, val)
|
||||||
|
typeof(*ptr) atomic_and_fetch(ptr, val)
|
||||||
|
typeof(*ptr) atomic_or_fetch(ptr, val)
|
||||||
|
typeof(*ptr) atomic_xor_fetch(ptr, val)
|
||||||
|
|
||||||
|
Sequentially consistent loads and stores can be done using::
|
||||||
|
|
||||||
|
atomic_fetch_add(ptr, 0) for loads
|
||||||
|
atomic_xchg(ptr, val) for stores
|
||||||
|
|
||||||
|
However, they are quite expensive on some platforms, notably POWER and
|
||||||
|
Arm. Therefore, qemu/atomic.h provides two primitives with slightly
|
||||||
|
weaker constraints::
|
||||||
|
|
||||||
|
typeof(*ptr) atomic_mb_read(ptr)
|
||||||
|
void atomic_mb_set(ptr, val)
|
||||||
|
|
||||||
|
The semantics of these primitives map to Java volatile variables,
|
||||||
|
and are strongly related to memory barriers as used in the Linux
|
||||||
|
kernel (see below).
|
||||||
|
|
||||||
|
As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
|
||||||
|
be reordered with each other, and it is also not possible to reorder
|
||||||
|
"normal" accesses around them.
|
||||||
|
|
||||||
|
However, and this is the important difference between
|
||||||
|
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
|
||||||
|
for both threads to access the same volatile variable. It is not the
|
||||||
|
case that everything visible to thread A when it writes volatile field f
|
||||||
|
becomes visible to thread B after it reads volatile field g. The store
|
||||||
|
and load have to "match" (i.e., be performed on the same volatile
|
||||||
|
field) to achieve the right semantics.
|
||||||
|
|
||||||
|
|
||||||
|
These operations operate on any type that is as wide as an int or smaller.
|
||||||
|
|
||||||
|
|
||||||
|
Weak atomic access and manual memory barriers
|
||||||
|
=============================================
|
||||||
|
|
||||||
|
Compared to sequentially consistent atomic access, programming with
|
||||||
|
weaker consistency models can be considerably more complicated.
|
||||||
|
In general, if the algorithm you are writing includes both writes
|
||||||
|
and reads on the same side, it is generally simpler to use sequentially
|
||||||
|
consistent primitives.
|
||||||
|
|
||||||
|
When using this model, variables are accessed with:
|
||||||
|
|
||||||
|
- ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from
|
||||||
|
optimizing accesses out of existence and creating unsolicited
|
||||||
|
accesses, but do not otherwise impose any ordering on loads and
|
||||||
|
stores: both the compiler and the processor are free to reorder
|
||||||
|
them.
|
||||||
|
|
||||||
|
- ``atomic_load_acquire()``, which guarantees the LOAD to appear to
|
||||||
|
happen, with respect to the other components of the system,
|
||||||
|
before all the LOAD or STORE operations specified afterwards.
|
||||||
|
Operations coming before ``atomic_load_acquire()`` can still be
|
||||||
|
reordered after it.
|
||||||
|
|
||||||
|
- ``atomic_store_release()``, which guarantees the STORE to appear to
|
||||||
|
happen, with respect to the other components of the system,
|
||||||
|
after all the LOAD or STORE operations specified afterwards.
|
||||||
|
Operations coming after ``atomic_store_release()`` can still be
|
||||||
|
reordered after it.
|
||||||
|
|
||||||
|
Restrictions to the ordering of accesses can also be specified
|
||||||
|
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
|
||||||
|
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.
|
||||||
|
|
||||||
|
Memory barriers control the order of references to shared memory.
|
||||||
|
They come in six kinds:
|
||||||
|
|
||||||
|
- ``smp_rmb()`` guarantees that all the LOAD operations specified before
|
||||||
|
the barrier will appear to happen before all the LOAD operations
|
||||||
|
specified after the barrier with respect to the other components of
|
||||||
|
the system.
|
||||||
|
|
||||||
|
In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
|
||||||
|
required to have any effect on stores.
|
||||||
|
|
||||||
|
- ``smp_wmb()`` guarantees that all the STORE operations specified before
|
||||||
|
the barrier will appear to happen before all the STORE operations
|
||||||
|
specified after the barrier with respect to the other components of
|
||||||
|
the system.
|
||||||
|
|
||||||
|
In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
|
||||||
|
required to have any effect on loads.
|
||||||
|
|
||||||
|
- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
|
||||||
|
the barrier will appear to happen before all the LOAD or STORE operations
|
||||||
|
specified after the barrier with respect to the other components of
|
||||||
|
the system.
|
||||||
|
|
||||||
|
- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
|
||||||
|
the barrier will appear to happen after all the LOAD or STORE operations
|
||||||
|
specified *before* the barrier with respect to the other components of
|
||||||
|
the system.
|
||||||
|
|
||||||
|
- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
|
||||||
|
before the barrier will appear to happen before all the LOAD and
|
||||||
|
STORE operations specified after the barrier with respect to the other
|
||||||
|
components of the system.
|
||||||
|
|
||||||
|
``smp_mb()`` puts a partial ordering on both loads and stores. It is
|
||||||
|
stronger than both a read and a write memory barrier; it implies both
|
||||||
|
``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
|
||||||
|
coming before the barrier from overtaking LOADs coming after the
|
||||||
|
barrier and vice versa.
|
||||||
|
|
||||||
|
- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On
|
||||||
|
most processors, whenever two loads are performed such that the
|
||||||
|
second depends on the result of the first (e.g., the first load
|
||||||
|
retrieves the address to which the second load will be directed),
|
||||||
|
the processor will guarantee that the first LOAD will appear to happen
|
||||||
|
before the second with respect to the other components of the system.
|
||||||
|
However, this is not always true---for example, it was not true on
|
||||||
|
Alpha processors. Whenever this kind of access happens to shared
|
||||||
|
memory (that is not protected by a lock), a read barrier is needed,
|
||||||
|
and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.
|
||||||
|
|
||||||
|
Note that the first load really has to have a _data_ dependency and not
|
||||||
|
a control dependency. If the address for the second load is dependent
|
||||||
|
on the first load, but the dependency is through a conditional rather
|
||||||
|
than actually loading the address itself, then it's a _control_
|
||||||
|
dependency and a full read barrier or better is required.
|
||||||
|
|
||||||
|
|
||||||
|
This is the set of barriers that is required *between* two ``atomic_read()``
|
||||||
|
and ``atomic_set()`` operations to achieve sequential consistency:
|
||||||
|
|
||||||
|
+----------------+-------------------------------------------------------+
|
||||||
|
| | 2nd operation |
|
||||||
|
| +------------------+-----------------+------------------+
|
||||||
|
| 1st operation | (after last) | atomic_read | atomic_set |
|
||||||
|
+----------------+------------------+-----------------+------------------+
|
||||||
|
| (before first) | .. | none | smp_mb_release() |
|
||||||
|
+----------------+------------------+-----------------+------------------+
|
||||||
|
| atomic_read | smp_mb_acquire() | smp_rmb() [1]_ | [2]_ |
|
||||||
|
+----------------+------------------+-----------------+------------------+
|
||||||
|
| atomic_set | none | smp_mb() [3]_ | smp_wmb() |
|
||||||
|
+----------------+------------------+-----------------+------------------+
|
||||||
|
|
||||||
|
.. [1] Or smp_read_barrier_depends().
|
||||||
|
|
||||||
|
.. [2] This requires a load-store barrier. This is achieved by
|
||||||
|
either smp_mb_acquire() or smp_mb_release().
|
||||||
|
|
||||||
|
.. [3] This requires a store-load barrier. On most machines, the only
|
||||||
|
way to achieve this is a full barrier.
|
||||||
|
|
||||||
|
|
||||||
|
You can see that the two possible definitions of ``atomic_mb_read()``
|
||||||
|
and ``atomic_mb_set()`` are the following:
|
||||||
|
|
||||||
|
1) | atomic_mb_read(p) = atomic_read(p); smp_mb_acquire()
|
||||||
|
| atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
|
||||||
|
|
||||||
|
2) | atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire()
|
||||||
|
| atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
|
||||||
|
|
||||||
|
Usually the former is used, because ``smp_mb()`` is expensive and a program
|
||||||
|
normally has more reads than writes. Therefore it makes more sense to
|
||||||
|
make ``atomic_mb_set()`` the more expensive operation.
|
||||||
|
|
||||||
|
There are two common cases in which atomic_mb_read and atomic_mb_set
|
||||||
|
generate too many memory barriers, and thus it can be useful to manually
|
||||||
|
place barriers, or use atomic_load_acquire/atomic_store_release instead:
|
||||||
|
|
||||||
|
- when a data structure has one thread that is always a writer
|
||||||
|
and one thread that is always a reader, manual placement of
|
||||||
|
memory barriers makes the write side faster. Furthermore,
|
||||||
|
correctness is easy to check for in this case using the "pairing"
|
||||||
|
trick that is explained below:
|
||||||
|
|
||||||
|
+----------------------------------------------------------------------+
|
||||||
|
| thread 1 |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
| before | after |
|
||||||
|
+===================================+==================================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| (other writes) | |
|
||||||
|
| atomic_mb_set(&a, x) | atomic_store_release(&a, x) |
|
||||||
|
| atomic_mb_set(&b, y) | atomic_store_release(&b, y) |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
|
||||||
|
+----------------------------------------------------------------------+
|
||||||
|
| thread 2 |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
| before | after |
|
||||||
|
+===================================+==================================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| y = atomic_mb_read(&b) | y = atomic_load_acquire(&b) |
|
||||||
|
| x = atomic_mb_read(&a) | x = atomic_load_acquire(&a) |
|
||||||
|
| (other reads) | |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
|
||||||
|
Note that the barrier between the stores in thread 1, and between
|
||||||
|
the loads in thread 2, has been optimized here to a write or a
|
||||||
|
read memory barrier respectively. On some architectures, notably
|
||||||
|
ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
|
||||||
|
smp_mb, but smp_rmb and/or smp_wmb are more efficient.
|
||||||
|
|
||||||
|
- sometimes, a thread is accessing many variables that are otherwise
|
||||||
|
unrelated to each other (for example because, apart from the current
|
||||||
|
thread, exactly one other thread will read or write each of these
|
||||||
|
variables). In this case, it is possible to "hoist" the implicit
|
||||||
|
barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
|
||||||
|
a loop. For example, the above definition ``atomic_mb_read()`` gives
|
||||||
|
the following transformation:
|
||||||
|
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
| before | after |
|
||||||
|
+===================================+==================================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| n = 0; | n = 0; |
|
||||||
|
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
|
||||||
|
| n += atomic_mb_read(&a[i]); | n += atomic_read(&a[i]); |
|
||||||
|
| | smp_mb_acquire(); |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
|
||||||
|
Similarly, atomic_mb_set() can be transformed as follows:
|
||||||
|
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
| before | after |
|
||||||
|
+===================================+==================================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| | smp_mb_release(); |
|
||||||
|
| for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
|
||||||
|
| atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); |
|
||||||
|
| | smp_mb(); |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
|
||||||
|
|
||||||
|
The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
|
||||||
|
|
||||||
|
The two tricks can be combined. In this case, splitting a loop in
|
||||||
|
two lets you hoist the barriers out of the loops _and_ eliminate the
|
||||||
|
expensive ``smp_mb()``:
|
||||||
|
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
| before | after |
|
||||||
|
+===================================+==================================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| | smp_mb_release(); |
|
||||||
|
| for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
|
||||||
|
| atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); |
|
||||||
|
| atomic_mb_set(&b[i], false); | smb_wmb(); |
|
||||||
|
| } | for (i = 0; i < 10; i++) |
|
||||||
|
| | atomic_set(&a[i], false); |
|
||||||
|
| | smp_mb(); |
|
||||||
|
+-----------------------------------+----------------------------------+
|
||||||
|
|
||||||
|
|
||||||
|
Memory barrier pairing
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
A useful rule of thumb is that memory barriers should always, or almost
|
||||||
|
always, be paired with another barrier. In the case of QEMU, however,
|
||||||
|
note that the other barrier may actually be in a driver that runs in
|
||||||
|
the guest!
|
||||||
|
|
||||||
|
For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
|
||||||
|
both count as read barriers. A read barrier shall pair with a write
|
||||||
|
barrier or a full barrier; a write barrier shall pair with a read
|
||||||
|
barrier or a full barrier. A full barrier can pair with anything.
|
||||||
|
For example:
|
||||||
|
|
||||||
|
+--------------------+------------------------------+
|
||||||
|
| thread 1 | thread 2 |
|
||||||
|
+====================+==============================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| a = 1; | |
|
||||||
|
| smp_wmb(); | |
|
||||||
|
| b = 2; | x = b; |
|
||||||
|
| | smp_rmb(); |
|
||||||
|
| | y = a; |
|
||||||
|
+--------------------+------------------------------+
|
||||||
|
|
||||||
|
Note that the "writing" thread is accessing the variables in the
|
||||||
|
opposite order as the "reading" thread. This is expected: stores
|
||||||
|
before the write barrier will normally match the loads after the
|
||||||
|
read barrier, and vice versa. The same is true for more than 2
|
||||||
|
access and for data dependency barriers:
|
||||||
|
|
||||||
|
+----------------------+------------------------------+
|
||||||
|
| thread 1 | thread 2 |
|
||||||
|
+======================+==============================+
|
||||||
|
| :: | :: |
|
||||||
|
| | |
|
||||||
|
| b[2] = 1; | |
|
||||||
|
| smp_wmb(); | |
|
||||||
|
| x->i = 2; | |
|
||||||
|
| smp_wmb(); | |
|
||||||
|
| a = x; | x = a; |
|
||||||
|
| | smp_read_barrier_depends(); |
|
||||||
|
| | y = x->i; |
|
||||||
|
| | smp_read_barrier_depends(); |
|
||||||
|
| | z = b[y]; |
|
||||||
|
+----------------------+------------------------------+
|
||||||
|
|
||||||
|
``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
|
||||||
|
and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
|
||||||
|
|
||||||
|
|
||||||
|
Comparison with Linux kernel memory barriers
|
||||||
|
============================================
|
||||||
|
|
||||||
|
Here is a list of differences between Linux kernel atomic operations
|
||||||
|
and memory barriers, and the equivalents in QEMU:
|
||||||
|
|
||||||
|
- atomic operations in Linux are always on a 32-bit int type and
|
||||||
|
use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
|
||||||
|
and use normal C types.
|
||||||
|
|
||||||
|
- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
|
||||||
|
at all. Linux 4.1 updated them to implement volatile
|
||||||
|
semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``).
|
||||||
|
|
||||||
|
QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed
|
||||||
|
semantics if the compiler supports it, and volatile semantics otherwise.
|
||||||
|
Both semantics prevent the compiler from doing certain transformations;
|
||||||
|
the difference is that atomic accesses are guaranteed to be atomic,
|
||||||
|
while volatile accesses aren't. Thus, in the volatile case we just cross
|
||||||
|
our fingers hoping that the compiler will generate atomic accesses,
|
||||||
|
since we assume the variables passed are machine-word sized and
|
||||||
|
properly aligned.
|
||||||
|
|
||||||
|
No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux
|
||||||
|
or QEMU.
|
||||||
|
|
||||||
|
- atomic read-modify-write operations in Linux are of three kinds:
|
||||||
|
|
||||||
|
===================== =========================================
|
||||||
|
``atomic_OP`` returns void
|
||||||
|
``atomic_OP_return`` returns new value of the variable
|
||||||
|
``atomic_fetch_OP`` returns the old value of the variable
|
||||||
|
``atomic_cmpxchg`` returns the old value of the variable
|
||||||
|
===================== =========================================
|
||||||
|
|
||||||
|
In QEMU, the second kind does not exist. Currently Linux has
|
||||||
|
atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub.
|
||||||
|
|
||||||
|
- different atomic read-modify-write operations in Linux imply
|
||||||
|
a different set of memory barriers; in QEMU, all of them enforce
|
||||||
|
sequential consistency, which means they imply full memory barriers
|
||||||
|
before and after the operation.
|
||||||
|
|
||||||
|
- Linux does not have an equivalent of ``atomic_mb_set()``. In particular,
|
||||||
|
note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
|
||||||
|
``atomic_mb_read()`` compiles to the same instructions as Linux's
|
||||||
|
``smp_load_acquire()``, but this should be treated as an implementation
|
||||||
|
detail.
|
||||||
|
|
||||||
|
Sources
|
||||||
|
=======
|
||||||
|
|
||||||
|
- ``Documentation/memory-barriers.txt`` from the Linux kernel
|
|
@ -1,403 +0,0 @@
|
||||||
CPUs perform independent memory operations effectively in random order.
|
|
||||||
but this can be a problem for CPU-CPU interaction (including interactions
|
|
||||||
between QEMU and the guest). Multi-threaded programs use various tools
|
|
||||||
to instruct the compiler and the CPU to restrict the order to something
|
|
||||||
that is consistent with the expectations of the programmer.
|
|
||||||
|
|
||||||
The most basic tool is locking. Mutexes, condition variables and
|
|
||||||
semaphores are used in QEMU, and should be the default approach to
|
|
||||||
synchronization. Anything else is considerably harder, but it's
|
|
||||||
also justified more often than one would like. The two tools that
|
|
||||||
are provided by qemu/atomic.h are memory barriers and atomic operations.
|
|
||||||
|
|
||||||
Macros defined by qemu/atomic.h fall in three camps:
|
|
||||||
|
|
||||||
- compiler barriers: barrier();
|
|
||||||
|
|
||||||
- weak atomic access and manual memory barriers: atomic_read(),
|
|
||||||
atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_mb_acquire(),
|
|
||||||
smp_mb_release(), smp_read_barrier_depends();
|
|
||||||
|
|
||||||
- sequentially consistent atomic access: everything else.
|
|
||||||
|
|
||||||
|
|
||||||
COMPILER MEMORY BARRIER
|
|
||||||
=======================
|
|
||||||
|
|
||||||
barrier() prevents the compiler from moving the memory accesses either
|
|
||||||
side of it to the other side. The compiler barrier has no direct effect
|
|
||||||
on the CPU, which may then reorder things however it wishes.
|
|
||||||
|
|
||||||
barrier() is mostly used within qemu/atomic.h itself. On some
|
|
||||||
architectures, CPU guarantees are strong enough that blocking compiler
|
|
||||||
optimizations already ensures the correct order of execution. In this
|
|
||||||
case, qemu/atomic.h will reduce stronger memory barriers to simple
|
|
||||||
compiler barriers.
|
|
||||||
|
|
||||||
Still, barrier() can be useful when writing code that can be interrupted
|
|
||||||
by signal handlers.
|
|
||||||
|
|
||||||
|
|
||||||
SEQUENTIALLY CONSISTENT ATOMIC ACCESS
|
|
||||||
=====================================
|
|
||||||
|
|
||||||
Most of the operations in the qemu/atomic.h header ensure *sequential
|
|
||||||
consistency*, where "the result of any execution is the same as if the
|
|
||||||
operations of all the processors were executed in some sequential order,
|
|
||||||
and the operations of each individual processor appear in this sequence
|
|
||||||
in the order specified by its program".
|
|
||||||
|
|
||||||
qemu/atomic.h provides the following set of atomic read-modify-write
|
|
||||||
operations:
|
|
||||||
|
|
||||||
void atomic_inc(ptr)
|
|
||||||
void atomic_dec(ptr)
|
|
||||||
void atomic_add(ptr, val)
|
|
||||||
void atomic_sub(ptr, val)
|
|
||||||
void atomic_and(ptr, val)
|
|
||||||
void atomic_or(ptr, val)
|
|
||||||
|
|
||||||
typeof(*ptr) atomic_fetch_inc(ptr)
|
|
||||||
typeof(*ptr) atomic_fetch_dec(ptr)
|
|
||||||
typeof(*ptr) atomic_fetch_add(ptr, val)
|
|
||||||
typeof(*ptr) atomic_fetch_sub(ptr, val)
|
|
||||||
typeof(*ptr) atomic_fetch_and(ptr, val)
|
|
||||||
typeof(*ptr) atomic_fetch_or(ptr, val)
|
|
||||||
typeof(*ptr) atomic_fetch_xor(ptr, val)
|
|
||||||
typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
|
|
||||||
typeof(*ptr) atomic_xchg(ptr, val)
|
|
||||||
typeof(*ptr) atomic_cmpxchg(ptr, old, new)
|
|
||||||
|
|
||||||
all of which return the old value of *ptr. These operations are
|
|
||||||
polymorphic; they operate on any type that is as wide as a pointer.
|
|
||||||
|
|
||||||
Similar operations return the new value of *ptr:
|
|
||||||
|
|
||||||
typeof(*ptr) atomic_inc_fetch(ptr)
|
|
||||||
typeof(*ptr) atomic_dec_fetch(ptr)
|
|
||||||
typeof(*ptr) atomic_add_fetch(ptr, val)
|
|
||||||
typeof(*ptr) atomic_sub_fetch(ptr, val)
|
|
||||||
typeof(*ptr) atomic_and_fetch(ptr, val)
|
|
||||||
typeof(*ptr) atomic_or_fetch(ptr, val)
|
|
||||||
typeof(*ptr) atomic_xor_fetch(ptr, val)
|
|
||||||
|
|
||||||
Sequentially consistent loads and stores can be done using:
|
|
||||||
|
|
||||||
atomic_fetch_add(ptr, 0) for loads
|
|
||||||
atomic_xchg(ptr, val) for stores
|
|
||||||
|
|
||||||
However, they are quite expensive on some platforms, notably POWER and
|
|
||||||
Arm. Therefore, qemu/atomic.h provides two primitives with slightly
|
|
||||||
weaker constraints:
|
|
||||||
|
|
||||||
typeof(*ptr) atomic_mb_read(ptr)
|
|
||||||
void atomic_mb_set(ptr, val)
|
|
||||||
|
|
||||||
The semantics of these primitives map to Java volatile variables,
|
|
||||||
and are strongly related to memory barriers as used in the Linux
|
|
||||||
kernel (see below).
|
|
||||||
|
|
||||||
As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
|
|
||||||
be reordered with each other, and it is also not possible to reorder
|
|
||||||
"normal" accesses around them.
|
|
||||||
|
|
||||||
However, and this is the important difference between
|
|
||||||
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
|
|
||||||
for both threads to access the same volatile variable. It is not the
|
|
||||||
case that everything visible to thread A when it writes volatile field f
|
|
||||||
becomes visible to thread B after it reads volatile field g. The store
|
|
||||||
and load have to "match" (i.e., be performed on the same volatile
|
|
||||||
field) to achieve the right semantics.
|
|
||||||
|
|
||||||
|
|
||||||
These operations operate on any type that is as wide as an int or smaller.
|
|
||||||
|
|
||||||
|
|
||||||
WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
|
|
||||||
=============================================
|
|
||||||
|
|
||||||
Compared to sequentially consistent atomic access, programming with
|
|
||||||
weaker consistency models can be considerably more complicated.
|
|
||||||
In general, if the algorithm you are writing includes both writes
|
|
||||||
and reads on the same side, it is generally simpler to use sequentially
|
|
||||||
consistent primitives.
|
|
||||||
|
|
||||||
When using this model, variables are accessed with:
|
|
||||||
|
|
||||||
- atomic_read() and atomic_set(); these prevent the compiler from
|
|
||||||
optimizing accesses out of existence and creating unsolicited
|
|
||||||
accesses, but do not otherwise impose any ordering on loads and
|
|
||||||
stores: both the compiler and the processor are free to reorder
|
|
||||||
them.
|
|
||||||
|
|
||||||
- atomic_load_acquire(), which guarantees the LOAD to appear to
|
|
||||||
happen, with respect to the other components of the system,
|
|
||||||
before all the LOAD or STORE operations specified afterwards.
|
|
||||||
Operations coming before atomic_load_acquire() can still be
|
|
||||||
reordered after it.
|
|
||||||
|
|
||||||
- atomic_store_release(), which guarantees the STORE to appear to
|
|
||||||
happen, with respect to the other components of the system,
|
|
||||||
after all the LOAD or STORE operations specified afterwards.
|
|
||||||
Operations coming after atomic_store_release() can still be
|
|
||||||
reordered after it.
|
|
||||||
|
|
||||||
Restrictions to the ordering of accesses can also be specified
|
|
||||||
using the memory barrier macros: smp_rmb(), smp_wmb(), smp_mb(),
|
|
||||||
smp_mb_acquire(), smp_mb_release(), smp_read_barrier_depends().
|
|
||||||
|
|
||||||
Memory barriers control the order of references to shared memory.
|
|
||||||
They come in six kinds:
|
|
||||||
|
|
||||||
- smp_rmb() guarantees that all the LOAD operations specified before
|
|
||||||
the barrier will appear to happen before all the LOAD operations
|
|
||||||
specified after the barrier with respect to the other components of
|
|
||||||
the system.
|
|
||||||
|
|
||||||
In other words, smp_rmb() puts a partial ordering on loads, but is not
|
|
||||||
required to have any effect on stores.
|
|
||||||
|
|
||||||
- smp_wmb() guarantees that all the STORE operations specified before
|
|
||||||
the barrier will appear to happen before all the STORE operations
|
|
||||||
specified after the barrier with respect to the other components of
|
|
||||||
the system.
|
|
||||||
|
|
||||||
In other words, smp_wmb() puts a partial ordering on stores, but is not
|
|
||||||
required to have any effect on loads.
|
|
||||||
|
|
||||||
- smp_mb_acquire() guarantees that all the LOAD operations specified before
|
|
||||||
the barrier will appear to happen before all the LOAD or STORE operations
|
|
||||||
specified after the barrier with respect to the other components of
|
|
||||||
the system.
|
|
||||||
|
|
||||||
- smp_mb_release() guarantees that all the STORE operations specified *after*
|
|
||||||
the barrier will appear to happen after all the LOAD or STORE operations
|
|
||||||
specified *before* the barrier with respect to the other components of
|
|
||||||
the system.
|
|
||||||
|
|
||||||
- smp_mb() guarantees that all the LOAD and STORE operations specified
|
|
||||||
before the barrier will appear to happen before all the LOAD and
|
|
||||||
STORE operations specified after the barrier with respect to the other
|
|
||||||
components of the system.
|
|
||||||
|
|
||||||
smp_mb() puts a partial ordering on both loads and stores. It is
|
|
||||||
stronger than both a read and a write memory barrier; it implies both
|
|
||||||
smp_mb_acquire() and smp_mb_release(), but it also prevents STOREs
|
|
||||||
coming before the barrier from overtaking LOADs coming after the
|
|
||||||
barrier and vice versa.
|
|
||||||
|
|
||||||
- smp_read_barrier_depends() is a weaker kind of read barrier. On
|
|
||||||
most processors, whenever two loads are performed such that the
|
|
||||||
second depends on the result of the first (e.g., the first load
|
|
||||||
retrieves the address to which the second load will be directed),
|
|
||||||
the processor will guarantee that the first LOAD will appear to happen
|
|
||||||
before the second with respect to the other components of the system.
|
|
||||||
However, this is not always true---for example, it was not true on
|
|
||||||
Alpha processors. Whenever this kind of access happens to shared
|
|
||||||
memory (that is not protected by a lock), a read barrier is needed,
|
|
||||||
and smp_read_barrier_depends() can be used instead of smp_rmb().
|
|
||||||
|
|
||||||
Note that the first load really has to have a _data_ dependency and not
|
|
||||||
a control dependency. If the address for the second load is dependent
|
|
||||||
on the first load, but the dependency is through a conditional rather
|
|
||||||
than actually loading the address itself, then it's a _control_
|
|
||||||
dependency and a full read barrier or better is required.
|
|
||||||
|
|
||||||
|
|
||||||
This is the set of barriers that is required *between* two atomic_read()
|
|
||||||
and atomic_set() operations to achieve sequential consistency:
|
|
||||||
|
|
||||||
| 2nd operation |
|
|
||||||
|-----------------------------------------------|
|
|
||||||
1st operation | (after last) | atomic_read | atomic_set |
|
|
||||||
---------------+----------------+-------------+----------------|
|
|
||||||
(before first) | | none | smp_mb_release |
|
|
||||||
---------------+----------------+-------------+----------------|
|
|
||||||
atomic_read | smp_mb_acquire | smp_rmb | ** |
|
|
||||||
---------------+----------------+-------------+----------------|
|
|
||||||
atomic_set | none | smp_mb()*** | smp_wmb() |
|
|
||||||
---------------+----------------+-------------+----------------|
|
|
||||||
|
|
||||||
* Or smp_read_barrier_depends().
|
|
||||||
|
|
||||||
** This requires a load-store barrier. This is achieved by
|
|
||||||
either smp_mb_acquire() or smp_mb_release().
|
|
||||||
|
|
||||||
*** This requires a store-load barrier. On most machines, the only
|
|
||||||
way to achieve this is a full barrier.
|
|
||||||
|
|
||||||
|
|
||||||
You can see that the two possible definitions of atomic_mb_read()
|
|
||||||
and atomic_mb_set() are the following:
|
|
||||||
|
|
||||||
1) atomic_mb_read(p) = atomic_read(p); smp_mb_acquire()
|
|
||||||
atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
|
|
||||||
|
|
||||||
2) atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire()
|
|
||||||
atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
|
|
||||||
|
|
||||||
Usually the former is used, because smp_mb() is expensive and a program
|
|
||||||
normally has more reads than writes. Therefore it makes more sense to
|
|
||||||
make atomic_mb_set() the more expensive operation.
|
|
||||||
|
|
||||||
There are two common cases in which atomic_mb_read and atomic_mb_set
|
|
||||||
generate too many memory barriers, and thus it can be useful to manually
|
|
||||||
place barriers, or use atomic_load_acquire/atomic_store_release instead:
|
|
||||||
|
|
||||||
- when a data structure has one thread that is always a writer
|
|
||||||
and one thread that is always a reader, manual placement of
|
|
||||||
memory barriers makes the write side faster. Furthermore,
|
|
||||||
correctness is easy to check for in this case using the "pairing"
|
|
||||||
trick that is explained below:
|
|
||||||
|
|
||||||
thread 1 thread 1
|
|
||||||
------------------------- ------------------------
|
|
||||||
(other writes)
|
|
||||||
atomic_mb_set(&a, x) atomic_store_release(&a, x)
|
|
||||||
atomic_mb_set(&b, y) atomic_store_release(&b, y)
|
|
||||||
|
|
||||||
=>
|
|
||||||
thread 2 thread 2
|
|
||||||
------------------------- ------------------------
|
|
||||||
y = atomic_mb_read(&b) y = atomic_load_acquire(&b)
|
|
||||||
x = atomic_mb_read(&a) x = atomic_load_acquire(&a)
|
|
||||||
(other reads)
|
|
||||||
|
|
||||||
Note that the barrier between the stores in thread 1, and between
|
|
||||||
the loads in thread 2, has been optimized here to a write or a
|
|
||||||
read memory barrier respectively. On some architectures, notably
|
|
||||||
ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
|
|
||||||
smp_mb, but smp_rmb and/or smp_wmb are more efficient.
|
|
||||||
|
|
||||||
- sometimes, a thread is accessing many variables that are otherwise
|
|
||||||
unrelated to each other (for example because, apart from the current
|
|
||||||
thread, exactly one other thread will read or write each of these
|
|
||||||
variables). In this case, it is possible to "hoist" the implicit
|
|
||||||
barriers provided by atomic_mb_read() and atomic_mb_set() outside
|
|
||||||
a loop. For example, the above definition atomic_mb_read() gives
|
|
||||||
the following transformation:
|
|
||||||
|
|
||||||
n = 0; n = 0;
|
|
||||||
for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++)
|
|
||||||
n += atomic_mb_read(&a[i]); n += atomic_read(&a[i]);
|
|
||||||
smp_mb_acquire();
|
|
||||||
|
|
||||||
Similarly, atomic_mb_set() can be transformed as follows:
|
|
||||||
|
|
||||||
smp_mb_release();
|
|
||||||
for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++)
|
|
||||||
atomic_mb_set(&a[i], false); atomic_set(&a[i], false);
|
|
||||||
smp_mb();
|
|
||||||
|
|
||||||
|
|
||||||
The other thread can still use atomic_mb_read()/atomic_mb_set().
|
|
||||||
|
|
||||||
The two tricks can be combined. In this case, splitting a loop in
|
|
||||||
two lets you hoist the barriers out of the loops _and_ eliminate the
|
|
||||||
expensive smp_mb():
|
|
||||||
|
|
||||||
smp_mb_release();
|
|
||||||
for (i = 0; i < 10; i++) { => for (i = 0; i < 10; i++)
|
|
||||||
atomic_mb_set(&a[i], false); atomic_set(&a[i], false);
|
|
||||||
atomic_mb_set(&b[i], false); smb_wmb();
|
|
||||||
} for (i = 0; i < 10; i++)
|
|
||||||
atomic_set(&a[i], false);
|
|
||||||
smp_mb();
|
|
||||||
|
|
||||||
|
|
||||||
Memory barrier pairing
|
|
||||||
----------------------
|
|
||||||
|
|
||||||
A useful rule of thumb is that memory barriers should always, or almost
|
|
||||||
always, be paired with another barrier. In the case of QEMU, however,
|
|
||||||
note that the other barrier may actually be in a driver that runs in
|
|
||||||
the guest!
|
|
||||||
|
|
||||||
For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
|
|
||||||
both count as read barriers. A read barrier shall pair with a write
|
|
||||||
barrier or a full barrier; a write barrier shall pair with a read
|
|
||||||
barrier or a full barrier. A full barrier can pair with anything.
|
|
||||||
For example:
|
|
||||||
|
|
||||||
thread 1 thread 2
|
|
||||||
=============== ===============
|
|
||||||
a = 1;
|
|
||||||
smp_wmb();
|
|
||||||
b = 2; x = b;
|
|
||||||
smp_rmb();
|
|
||||||
y = a;
|
|
||||||
|
|
||||||
Note that the "writing" thread is accessing the variables in the
|
|
||||||
opposite order as the "reading" thread. This is expected: stores
|
|
||||||
before the write barrier will normally match the loads after the
|
|
||||||
read barrier, and vice versa. The same is true for more than 2
|
|
||||||
access and for data dependency barriers:
|
|
||||||
|
|
||||||
thread 1 thread 2
|
|
||||||
=============== ===============
|
|
||||||
b[2] = 1;
|
|
||||||
smp_wmb();
|
|
||||||
x->i = 2;
|
|
||||||
smp_wmb();
|
|
||||||
a = x; x = a;
|
|
||||||
smp_read_barrier_depends();
|
|
||||||
y = x->i;
|
|
||||||
smp_read_barrier_depends();
|
|
||||||
z = b[y];
|
|
||||||
|
|
||||||
smp_wmb() also pairs with atomic_mb_read() and smp_mb_acquire().
|
|
||||||
and smp_rmb() also pairs with atomic_mb_set() and smp_mb_release().
|
|
||||||
|
|
||||||
|
|
||||||
COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
|
|
||||||
============================================
|
|
||||||
|
|
||||||
Here is a list of differences between Linux kernel atomic operations
|
|
||||||
and memory barriers, and the equivalents in QEMU:
|
|
||||||
|
|
||||||
- atomic operations in Linux are always on a 32-bit int type and
|
|
||||||
use a boxed atomic_t type; atomic operations in QEMU are polymorphic
|
|
||||||
and use normal C types.
|
|
||||||
|
|
||||||
- Originally, atomic_read and atomic_set in Linux gave no guarantee
|
|
||||||
at all. Linux 4.1 updated them to implement volatile
|
|
||||||
semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE).
|
|
||||||
|
|
||||||
QEMU's atomic_read/set implement, if the compiler supports it, C11
|
|
||||||
atomic relaxed semantics, and volatile semantics otherwise.
|
|
||||||
Both semantics prevent the compiler from doing certain transformations;
|
|
||||||
the difference is that atomic accesses are guaranteed to be atomic,
|
|
||||||
while volatile accesses aren't. Thus, in the volatile case we just cross
|
|
||||||
our fingers hoping that the compiler will generate atomic accesses,
|
|
||||||
since we assume the variables passed are machine-word sized and
|
|
||||||
properly aligned.
|
|
||||||
No barriers are implied by atomic_read/set in either Linux or QEMU.
|
|
||||||
|
|
||||||
- atomic read-modify-write operations in Linux are of three kinds:
|
|
||||||
|
|
||||||
atomic_OP returns void
|
|
||||||
atomic_OP_return returns new value of the variable
|
|
||||||
atomic_fetch_OP returns the old value of the variable
|
|
||||||
atomic_cmpxchg returns the old value of the variable
|
|
||||||
|
|
||||||
In QEMU, the second kind does not exist. Currently Linux has
|
|
||||||
atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub.
|
|
||||||
|
|
||||||
- different atomic read-modify-write operations in Linux imply
|
|
||||||
a different set of memory barriers; in QEMU, all of them enforce
|
|
||||||
sequential consistency, which means they imply full memory barriers
|
|
||||||
before and after the operation.
|
|
||||||
|
|
||||||
- Linux does not have an equivalent of atomic_mb_set(). In particular,
|
|
||||||
note that smp_store_mb() is a little weaker than atomic_mb_set().
|
|
||||||
atomic_mb_read() compiles to the same instructions as Linux's
|
|
||||||
smp_load_acquire(), but this should be treated as an implementation
|
|
||||||
detail.
|
|
||||||
|
|
||||||
SOURCES
|
|
||||||
=======
|
|
||||||
|
|
||||||
* Documentation/memory-barriers.txt from the Linux kernel
|
|
||||||
|
|
||||||
* "The JSR-133 Cookbook for Compiler Writers", available at
|
|
||||||
http://g.oswego.edu/dl/jmm/cookbook.html
|
|
|
@ -17,6 +17,7 @@ Contents:
|
||||||
loads-stores
|
loads-stores
|
||||||
memory
|
memory
|
||||||
migration
|
migration
|
||||||
|
atomics
|
||||||
stable-process
|
stable-process
|
||||||
testing
|
testing
|
||||||
decodetree
|
decodetree
|
||||||
|
|
Loading…
Reference in New Issue