This worked correctly on the JIT vertex loaders, and for the equivalent case with texture coordinates. I'm not aware of any games this affects (but libogc does mention a semi-related scenario at 6bc0317c7d/gc/ogc/gx.h (L1855-L1857).)
Jimmie Johnson's Anything with an Engine is known to use texture coordinate 7 (and only texture coordinate 7) in some cases. There are a lot of possible edge-cases, so this test brute-forces all combinations with coordinates 0, 1, and 2.
This test fails with the non-JIT vertex loader due to an issue fixed in a later commit in this PR. (Note that the non-JIT vertex loader is only used on machines where no JIT is available or if COMPARE_VERTEXLOADERS is enabled in VertexLoaderBase.cpp.)
Dolphin would crash when running with a fastmem arena but without
fastmem. This regression was caused by merging 9192128c50 without
adapting it after the merge of 0606433404.
This must have broken in a rebase of one of my recently merged PRs.
Dolphin still worked correctly with this bug, for two reasons:
1. Most AArch64 users are not on Windows, and therefore normally do have
the entry points map.
2. When the bug was triggered, Dolphin would fall back to the slower
path rather than crashing.
Instead of shifting left by 1, we can first shift right by 2 and then
left by 3. This is both faster and smaller, because we get the right
shift for free with the masking and the left shift for free with the
address calculation. It also happens to match the pseudocode more
closely, which is always nice for readability.
This slightly improves instruction-level parallelism in Jit64's slow
dispatcher by shifting the PC left instead of the MSR.
In the past, this also enabled an optimization in JitArm64's fast path
where we could use LDP to load normalEntry and msrBits in one
instruction, but this was superseded by fd9c970.
Just to make the code easier to understand at a glance. I especially
found it a bit annoying to reason about whether callee-saved registers
like W28 were being used because we needed a callee-saved register or
just for no reason in particular.
X8 and up is what compilers normally use when they're not register
starved.
AArch64's handling of NaNs in arithmetic instructions matches PowerPC's
as long as no more than one of the operands is NaN. If we know that all
inputs except the last input are non-NaN, we can therefore skip checking
the last input. This is an optimization that in principle only works for
non-SIMD operations, but ps_sumX effectively is non-SIMD as far as the
arithmetic part of it is concerned, so we can use it there too.