This is the bare minimum required to run a few games on AArch64.
Was able to run starfield and Animal Crossing to the Nintendo logo.
QEmu emulation is literally the slowest thing in the world, it maxes out at around 12mhz on my Core i7-4930MX.
I've tested a few instruction encodings and am expecting most to work as long as one stays away from VFP/SIMD.
This implements mostly instructions to bring up an initial JIT with integer support.
This can be improved to allow ease of use functions in the future, dealing with the raw imms/immr encodings is probably the worst thing ever.
Uses are split into three categories:
- Arbitrary (except for size savings) - constants like RSCRATCH are
used.
- ABI (i.e. RAX as return value) - ABI_RETURN is used.
- Fixed by architecture (RCX shifts, RDX/RAX for some instructions) -
explicit register is kept.
In theory this allows the assignments to be modified easily. I verified
that I was able to run Melee with all the registers changed, although
there may be issues if RSCRATCH[2] and ABI_PARAM{1,2} conflict.
And switch to a register order that consistently prefers callee-save to
caller-save. phire suggested putting rdi/rsi first, even though they're
caller-save, to save code space; this is more conservative and I can do
that later.
Rather than using a variety of registers including RSI, ABI_PARAM1
(either RCX or RDI), RCX, and RDX, the rule is:
- RDI and RSI are never used. This allows them to be allocated on Unix,
bringing parity with Windows.
- RDX is a permanent temporary register along with RAX (and is thus not
FlushLocked). It's used frequently enough that allocating it would
probably be a bad idea, as it would constantly get flushed.
- RCX is allocatable, but is flushed in two situations:
- Non-immediate shifts (rlwnm), because x86 requires RCX to be used.
- Paired single loads and stores, because they require three
temporary registers: the helper functions take two integer
arguments, and another register is used as an index to get the
function address.
These should be relatively rare.
While we're at it, in stores, use the registers directly where possible
rather than always using temporaries (by making SafeWriteRegToReg
clobber less). The address doesn't need to be clobbered in the usual
case, and on CPUs with MOVBE, neither does the value.
Oh, and get rid of a useless MEMCHECK.
This commit does not actually add new registers to the allocation order;
it is intended to test for any performance or correctness issues
separately.
The special case is where the registers are actually to be swapped (i.e.
func(ABI_PARAM2, ABI_PARAM1); this was previously impossible but would
be ugly not to handle anyway.
In two cases, my old code was using a temporary register but not saving
it properly; it basically worked by accident (an otherwise useless
FlushLock was causing CallerSavedRegistersInUse to think it was in use
by the GPR cache, even though it was actually a temporary).
I'm going to modify this in the next commit to use RDX, but I didn't
want to leave a broken revision in the middle.
The register is RBP, previously in the GPR allocation order. The next
commit will investigate whether there are too few GPRs (now or before),
but for now there is no replacement.
Previously, it was accessed RIP relatively; using RBP, anything in the
first 0x100 bytes of ppcState (including all the GPRs) can be accessed
with three fewer bytes. Code to access ppcState is generated constantly
(mostly by register save/load), so in principle, this should improve
instruction cache footprint significantly. It seems that this makes a
significant performance difference in practice.
The vast majority of this commit is mechanically replacing
M(&PowerPC::ppcState.x) with a new macro PPCSTATE(x).
Version 2: gets most of the cases which were using the register access
macros.
GetOpInfo was returning null pointers for invalid ops in subtables
instead of asserting an error. This was causing segfaults when the
jit tried to jit invalid code.