This fixes a problem I was having where using frame advance with the
debugger open would frequently cause panic alerts about invalid addresses
due to the CPU thread changing MSR.DR while the host thread was trying
to access memory.
To aid in tracking down all the places where we weren't properly locking
the CPU, I've created a new type (in Core.h) that you have to pass as a
reference or pointer to functions that require running as the CPU thread.
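A minimal sketch of the idea, with hypothetical names rather than the exact contents of Core.h: a guard type that can only be constructed by pausing and locking the CPU, so any function that takes one is statically guaranteed to be safe to call.

    #include <cstdint>
    using u32 = std::uint32_t;

    // Hypothetical sketch -- names and locking details are illustrative,
    // not the exact Core.h implementation.
    class CPUThreadGuard final
    {
    public:
      CPUThreadGuard() { /* pause the CPU thread and take the lock */ }
      ~CPUThreadGuard() { /* release the lock and resume the CPU */ }

      CPUThreadGuard(const CPUThreadGuard&) = delete;
      CPUThreadGuard& operator=(const CPUThreadGuard&) = delete;
    };

    // Functions that touch emulated memory now demand proof of the lock:
    u32 ReadU32(const CPUThreadGuard& guard, u32 address);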
While the NV extension is totally fine, the KHR extension should be able to support more hardware.
For NVIDIA, the hardware either supports both or neither; it just needs a driver from the last two years.
For AMD, drivers from around December 2022 seem to bring support for the KHR extension.
For Intel, the KHR extension has also been supported for some years.
- Cancel no longer shuts the queue down, so it can be used multiple
  times throughout the life of the WorkQueue (see the sketch below)
- Remove Clear, so we only have Cancel semantics
- Add IsCancelling so work items can abort early if cancelling
- Replace the m_cancelled and m_thread.joinable() guards with m_shutdown
- Rename Flush to WaitForCompletion (as it's ambiguous whether a function
  called Flush should be blocking or not)
- Add documentation
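A rough sketch of the resulting interface (simplified and illustrative, not the verbatim class):

    #include <functional>

    class WorkQueueThread
    {
    public:
      void Push(std::function<void()> item);  // enqueue a work item
      void Cancel();              // drop pending items; queue stays usable
      bool IsCancelling() const;  // lets a long-running item abort early
      void WaitForCompletion();   // block until all pending items have run
      void Shutdown();            // stop the worker thread for good
    };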
A lot of the remaining complexity in Renderer is the massive Swap function,
which tries to handle a bunch of FrameBegin/FrameEnd events.
Rather than creating a new home for all of that logic, this event system
distributes it across the codebase, next to the state it actually concerns.
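As a sketch of the mechanism (names and signatures are illustrative, not the actual API): each subsystem registers a callback, and the producer fires them all at the appropriate point in the frame.

    #include <cstdint>
    #include <functional>
    #include <vector>

    // Minimal hookable event: listeners register callbacks, the producer
    // triggers them all when the event occurs.
    template <typename... Args>
    class HookableEvent
    {
    public:
      void Register(std::function<void(Args...)> callback)
      {
        m_listeners.push_back(std::move(callback));
      }

      void Trigger(Args... args)
      {
        for (auto& listener : m_listeners)
          listener(args...);
      }

    private:
      std::vector<std::function<void(Args...)>> m_listeners;
    };

    // e.g. code that used to live in Swap subscribes to an end-of-frame
    // event instead (hypothetical event name):
    // HookableEvent<std::uint64_t> g_frame_end;
    // g_frame_end.Register([](std::uint64_t ticks) { /* per-frame work */ });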
Macros whose expansion contains the standard `defined` operator are
undefined behavior.
This is pretty trivial to fix: we can just do the test and then define
the name itself if it's true, rather than making the macro be the set of
definition checks.
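For illustration (macro names made up), the problematic pattern and the fix look like this:

    // Before: using this macro in an #if is undefined behavior, because
    // `defined` appears as a result of macro expansion.
    // #define HAVE_FEATURE (defined(FOO) || defined(BAR))

    // After: do the test once and define a plain name if it holds.
    #if defined(FOO) || defined(BAR)
    #define HAVE_FEATURE
    #endif

    // #ifdef HAVE_FEATURE is now well-defined everywhere.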
Now that we've flipped the C++20 switch, let's start making use of
the nice new <bit> header.
I'm planning on handling this move away from BitUtils.h incrementally
in a series of PRs. There may be a few functions remaining in
BitUtils.h by the end that C++20 doesn't have any equivalents for.
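For example, migrations of roughly this shape become possible (assuming BitUtils.h helpers along these lines):

    #include <bit>

    // Common::CountSetBits(x)       ->  std::popcount(x)
    // Common::CountLeadingZeros(x)  ->  std::countl_zero(x)
    // Common::BitCast<u64>(d)       ->  std::bit_cast<u64>(d)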
The "vector shift by immediate" category encodes the shift amount for
right shifts as `size - amount`, whereas left shifts use `amount`.
We're not actually using SHRN/SHRN2 anywhere, which is why this has gone
undetected.
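Roughly (simplified from the encoding tables; the exact fields vary per instruction class), for an element size `size` the immediate effectively holds:

    left shifts  (SHL, ...):         amount
    right shifts (SHRN, SSHR, ...):  size - amount

so emitting a right shift through the left-shift path silently produces the wrong shift amount.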
For quite some time now, we've had a setting on x86-64 that makes Dolphin
handle NaNs in a more accurate but slower way. There's only one game that
cares about this, Dragon Ball: Revenge of King Piccolo, and what that game
cares about more specifically is that the default NaN (or "generated NaN"
as I believe it's called in PowerPC documentation) is the same as on
PowerPC. On ARM, the default NaN is the same as on PowerPC, so for the
longest time we didn't need to do anything special to get Dragon Ball:
Revenge of King Piccolo working. However, in 93e636a I changed how we
handle FMA instructions in a way that resulted in the sign of NaNs
becoming inverted for nmadd/nmsub instructions, breaking the game.
To fix this, let's implement the AccurateNaNs setting, like on x86-64.
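The core of the fix can be sketched like this (illustrative only, not the actual JIT code): whenever an operation produces a NaN, replace it with the PowerPC default QNaN so the sign always matches hardware.

    #include <bit>
    #include <cmath>
    #include <cstdint>

    double ForceDefaultNaN(double result)  // hypothetical helper
    {
      // PowerPC's generated NaN is the positive quiet NaN.
      if (std::isnan(result))
        return std::bit_cast<double>(UINT64_C(0x7FF8000000000000));
      return result;
    }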
1. In some cases, ps_merge01 can be implemented using one instruction.
2. When we need two instructions for ps_merge01, it's best to start with
a MOV to avoid false dependencies on the destination register.
3. ps_merge10 can be implemented using a single EXT instruction.
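For point 3, the trick can be sketched like so (assuming each pair of singles occupies two 64-bit lanes of a Q register; register names are illustrative): EXT picks the high lane of one source and the low lane of the other in a single instruction.

    EXT V0.16B, V1.16B, V2.16B, #8   // V0 = { V1.ps1, V2.ps0 } = ps_merge10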
This new function is like MOVP2R, except it masks out the lower 12 bits,
returning them instead of writing them to the register. These lower
12 bits can then be used as an offset for LDR/STR. This lets us turn
ADRP+ADD+LDR sequences with a zero offset into ADRP+LDR sequences with
a non-zero offset, saving one instruction.
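For illustration (assembly is schematic), a zero-offset load through the old sequence

    ADRP X0, sym            // page address
    ADD  X0, X0, #lo12      // low 12 bits of the address
    LDR  X1, [X0]

becomes

    ADRP X0, sym
    LDR  X1, [X0, #lo12]    // low 12 bits folded into the load's offset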
ARM64 can perform various types of sign and zero extension on a
register value before using it. The Arm64Emitter already had support for
this, but it was somewhat hidden away.
This commit exposes the functionality by making the ExtendSpecifier enum
available everywhere and adding a new ArithOption constructor.
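For example (the exact emitter call spelling may differ), an extend-and-scale can be folded into a single arithmetic instruction:

    ADD X0, X1, W2, UXTW #2   // X0 = X1 + (ZeroExtend64(W2) << 2)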
The previous implementation of Force25BitPrecision was essentially a
translation of the x86-64 implementation. It worked, but we can make a
more efficient implementation by using an AArch64 instruction I don't
believe x86-64 has an equivalent of: URSHR. The latency is the same as
before, but the instruction count and register count are both reduced.
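A sketch of the idea (the shift amounts follow from keeping 25 of the 53 significand bits, i.e. clearing the low 28; this is not verbatim emitter code): treating each double's bit pattern as a 64-bit integer,

    URSHR V0.2D, V0.2D, #28   // rounding shift right: adds 1 << 27, then shifts
    SHL   V0.2D, V0.2D, #28   // shift back; low 28 bits are now zero

A carry out of the rounding step propagates into the exponent bits, which matches what rounding to nearest requires when the mantissa overflows.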
This was added in 385d8e2b15, but became somewhat redundant with Do in 4c7bbd96e4, and completely redundant now that std::is_trivially_copyable_v is well-supported.