This test fails with the non-JIT vertex loader due to an issue fixed in a later commit in this PR. (Note that the non-JIT vertex loader is only used on machines where no JIT is available or if COMPARE_VERTEXLOADERS is enabled in VertexLoaderBase.cpp.)
Dolphin would crash when running with a fastmem arena but without
fastmem. This regression was caused by merging 9192128c50 without
adapting it after the merge of 0606433404.
This must have broken in a rebase of one of my recently merged PRs.
Dolphin still worked correctly with this bug, for two reasons:
1. Most AArch64 users are not on Windows, and therefore normally do have
the entry points map.
2. When the bug was triggered, Dolphin would fall back to the slower
path rather than crashing.
Instead of shifting left by 1, we can first shift right by 2 and then
left by 3. This is both faster and smaller, because we get the right
shift for free with the masking and the left shift for free with the
address calculation. It also happens to match the pseudocode more
closely, which is always nice for readability.
This slightly improves instruction-level parallelism in Jit64's slow
dispatcher by shifting the PC left instead of the MSR.
In the past, this also enabled an optimization in JitArm64's fast path
where we could use LDP to load normalEntry and msrBits in one
instruction, but this was superseded by fd9c970.
Just to make the code easier to understand at a glance. I especially
found it a bit annoying to reason about whether callee-saved registers
like W28 were being used because we needed a callee-saved register or
just for no reason in particular.
X8 and up is what compilers normally use when they're not register
starved.
AArch64's handling of NaNs in arithmetic instructions matches PowerPC's
as long as no more than one of the operands is NaN. If we know that all
inputs except the last input are non-NaN, we can therefore skip checking
the last input. This is an optimization that in principle only works for
non-SIMD operations, but ps_sumX effectively is non-SIMD as far as the
arithmetic part of it is concerned, so we can use it there too.
FL_EVIL is only used for blocking instructions from being reordered.
There are three types of instructions which have FL_EVIL set:
1. CR operations. The previous commits improved our CR analysis
and removed FL_EVIL from these instructions.
2. Load/store operations. These are always blocked from reordering
due to always having canCauseException set.
3. isync. I don't know if we actually need to prevent reordering
around this one, since as far as I know we only do reorderings
that are guaranteed to not change the behavior of the program.
But just in case, I've renamed FL_EVIL to FL_NO_REORDER instead of
removing it entirely, so that it can be used for this instruction.
Other than the CR instructions, which we now analyze properly,
all the covered instructions are not integer operations and also
have either FL_ENDBLOCK or FL_EVIL set, so there are two other
checks in CanSwapAdjacentOps that will reject them.