Allows the analyzer to exist independently of the DSP structure. This
makes unit tests easier to write.
SDSP is only necessary during the analysis phase, so we only need to
keep a reference around to it then as opposed to the entire lifecycle of
the analyzer.
This also allows the copy/move assignment operators to be defaulted, as
a reference member variable prevents that.
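The effect of the reference member on assignment can be seen in a minimal sketch. The class names and the `pc` field here are hypothetical stand-ins, not the actual analyzer or SDSP definitions:

```cpp
#include <type_traits>

// Hypothetical stand-in for the real DSP state struct.
struct SDSP
{
  int pc = 0;
};

// Keeps a reference for its whole lifetime. The implicitly declared
// copy/move assignment operators are deleted, because a reference member
// cannot be reseated.
class AnalyzerWithReference
{
public:
  explicit AnalyzerWithReference(const SDSP& dsp) : m_dsp(dsp) {}
  int PC() const { return m_dsp.pc; }

private:
  const SDSP& m_dsp;
};

// Receives the state only for the duration of the analysis call, so no
// reference member is needed and assignment can be defaulted.
class AnalyzerWithoutReference
{
public:
  AnalyzerWithoutReference() = default;
  AnalyzerWithoutReference(const AnalyzerWithoutReference&) = default;
  AnalyzerWithoutReference& operator=(const AnalyzerWithoutReference&) = default;
  AnalyzerWithoutReference& operator=(AnalyzerWithoutReference&&) = default;

  void Analyze(const SDSP& dsp) { m_last_pc = dsp.pc; }
  int LastPC() const { return m_last_pc; }

private:
  int m_last_pc = 0;
};

static_assert(!std::is_copy_assignable<AnalyzerWithReference>::value,
              "reference member deletes copy assignment");
static_assert(std::is_copy_assignable<AnalyzerWithoutReference>::value,
              "copy assignment is available");
static_assert(std::is_move_assignable<AnalyzerWithoutReference>::value,
              "move assignment is available");
```

A unit test can then construct the second kind of analyzer without any DSP state and hand it a throwaway `SDSP` only when calling `Analyze`.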
Now that we have convenience functions for the flag bit manipulations,
there are no external usages of the flags, so we can make them private
to the analyzer implementation.
Now the Analyzer namespace is largely unnecessary and can be merged with
the DSP namespace in the next commit.
Localizes code that modifies m_dsp into the struct itself. This reduces
the overall coupling between DSPCore and SDSP by reducing access to its
member variables.
This commit is only code movement and has no functional changes.
An unfortunately large single commit that deglobalizes the DSP code
(which I'm very sorry about).
This would have otherwise been extremely difficult to separate due to
extensive use of the globals in tightly coupled ways that would require
more scaffolding to work around than is worth it.
Aside from the video code, I believe the DSP code is the hairiest to
deal with in terms of globals, so I guess it's best to get this dealt
with right off the bat.
A summary of what this commit does:
- Turns the DSPInterpreter into its own class
This is the most involved portion of this change.
The bulk of the changes are turning non-member functions into member
functions that would be situated into the Interpreter class.
- Eliminates all usages of globals within DSPCore.
This generally involves turning a lot of non-member functions into
member functions that are either situated within SDSP or DSPCore.
- Discards DSPDebugInterface (it wasn't hooked up to anything,
and for the sake of eliminating global state, I'd rather get rid of
it than think up ways for this class to be integrated with
everything else).
- Readjusts the DSP JIT to handle calling out to member functions.
In most cases, this just means wrapping the respective member function
calls into thunk functions.
Surprisingly, this doesn't even make use of the introduced System class.
It was possible all along to do this without it. We can house everything
within the DSPLLE class, which is quite nice =)
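The thunk approach mentioned for the JIT can be illustrated with a small sketch. All names below are hypothetical and the "emitted call" is simulated with an ordinary function pointer call. The idea is that generated code can only call a plain function address, so a free function takes the object as an explicit argument and forwards to the member function:

```cpp
#include <cstdint>

// Hypothetical class standing in for the interpreter; not the real API.
class Interpreter
{
public:
  void WriteControlRegister(std::uint16_t value) { m_control = value; }
  std::uint16_t ControlRegister() const { return m_control; }

private:
  std::uint16_t m_control = 0;
};

// Thunk: a free function with an ordinary calling convention that forwards
// to the member function on the instance it is given.
void WriteControlRegisterThunk(Interpreter& interpreter, std::uint16_t value)
{
  interpreter.WriteControlRegister(value);
}

using ThunkFn = void (*)(Interpreter&, std::uint16_t);

// Stand-in for JIT-emitted code: it only knows a function address and the
// arguments it was told to pass.
void SimulateEmittedCall(ThunkFn fn, Interpreter& interpreter, std::uint16_t value)
{
  fn(interpreter, value);
}
```

Because the instance travels as an argument, no global interpreter object is needed for the generated code to reach it.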
Shifting zero by any amount always gives zero.
Before:
41 B9 00 00 00 00 mov r9d,0
41 8B CF mov ecx,r15d
49 C1 E1 20 shl r9,20h
49 D3 F9 sar r9,cl
49 C1 E9 20 shr r9,20h
After:
Nothing, register is set to constant zero.
Before:
41 B8 00 00 00 00 mov r8d,0
41 8B CF mov ecx,r15d
49 C1 E0 20 shl r8,20h
49 D3 F8 sar r8,cl
41 8B C0 mov eax,r8d
49 C1 E8 20 shr r8,20h
44 85 C0 test eax,r8d
0F 95 45 58 setne byte ptr [rbp+58h]
After:
C6 45 58 00 mov byte ptr [rbp+58h],0
Occurs a bunch of times in Super Mario Sunshine. Since this is an
arithmetic shift, a similar optimization can be done for constant -1
(0xFFFFFFFF), but I couldn't find any game where this happens.
Shifting zero by any amount always gives zero.
Before:
41 BF 00 00 00 00 mov r15d,0
8B CF mov ecx,edi
49 D3 E7 shl r15,cl
45 8B FF mov r15d,r15d
After:
Nothing, register is set to constant zero.
All games I've tried hit this optimization on launch. In Soul Calibur II
it occurs very frequently during gameplay.
Much like we did for srawx. This was already implemented on JitArm64.
Before:
B8 00 00 00 00 mov eax,0
8B F0 mov esi,eax
C1 E8 1F shr eax,1Fh
23 C6 and eax,esi
D1 FE sar esi,1
88 45 58 mov byte ptr [rbp+58h],al
After:
C6 45 58 00 mov byte ptr [rbp+58h],0
If both input registers hold known values at compile time, we can just
calculate the result on the spot.
Code has mostly been copied from JitArm64 where it had already been implemented.
Before:
BF FF FF FF FF mov edi,0FFFFFFFFh
8B C7 mov eax,edi
C1 FF 10 sar edi,10h
C1 E0 10 shl eax,10h
85 F8 test eax,edi
0F 95 45 58 setne byte ptr [rbp+58h]
After:
C6 45 58 01 mov byte ptr [rbp+58h],1
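A minimal sketch of this kind of folding follows. It assumes the usual PowerPC semantics for the shift right algebraic word instruction (the amount is taken from the low six bits, and the carry is set when the source is negative and any one bits are shifted out). The `Operand` type and function names are hypothetical and are not taken from Jit64 or JitArm64:

```cpp
#include <cstdint>

struct SrawResult
{
  std::uint32_t value;
  bool carry;
};

// Evaluates the shift on concrete values, using only unsigned arithmetic so
// the behavior does not depend on how the compiler shifts signed integers.
SrawResult EvaluateSraw(std::uint32_t rs, std::uint32_t rb)
{
  const bool negative = (rs & 0x80000000u) != 0;
  if ((rb & 0x20u) != 0)
    return {negative ? 0xFFFFFFFFu : 0u, negative};  // Amounts 32..63.

  const std::uint32_t amount = rb & 0x1Fu;
  if (amount == 0)
    return {rs, false};

  const std::uint32_t sign_fill = negative ? (0xFFFFFFFFu << (32 - amount)) : 0u;
  const std::uint32_t value = (rs >> amount) | sign_fill;
  const bool ones_shifted_out = (rs << (32 - amount)) != 0;
  return {value, negative && ones_shifted_out};
}

// Hypothetical model of a guest register whose value may be known while the
// block is being compiled.
struct Operand
{
  bool is_constant;
  std::uint32_t value;  // Only meaningful when is_constant is true.
};

// Returns true and fills `out` when both inputs are known, in which case no
// shift instructions need to be emitted at all.
bool TryFoldSraw(const Operand& rs, const Operand& rb, SrawResult* out)
{
  if (!rs.is_constant || !rb.is_constant)
    return false;
  *out = EvaluateSraw(rs.value, rb.value);
  return true;
}
```

With a source of 0xFFFFFFFF and an amount of 16, as in the listing above, the folded carry is 1, which corresponds to the single store in the "After" code.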
More efficient code can be generated if the shift amount is known at
compile time. We can once again take advantage of shifts with the shift
amount in an 8-bit immediate to eliminate ECX as a scratch register,
reducing register pressure and removing the occasional spill. We can
also do 32-bit shifts instead of 64-bit operations.
We recognize four distinct cases:
- The special case where we're dealing with the PowerPC's quirky shift
amount masking. If the shift amount is a number from 32 to 63, all
bits are shifted out and the result is either all zeroes or all ones.
Before:
B9 F0 FF FF FF mov ecx,0FFFFFFF0h
8B F7 mov esi,edi
48 C1 E6 20 shl rsi,20h
48 D3 FE sar rsi,cl
8B C6 mov eax,esi
48 C1 EE 20 shr rsi,20h
85 F0 test eax,esi
0F 95 45 58 setne byte ptr [rbp+58h]
After:
8B F7 mov esi,edi
C1 FE 1F sar esi,1Fh
0F 95 45 58 setne byte ptr [rbp+58h]
- The shift amount is zero. No calculation needs to be done, just clear
the carry flag.
Before:
B9 00 00 00 00 mov ecx,0
49 C1 E5 20 shl r13,20h
49 D3 FD sar r13,cl
41 8B C5 mov eax,r13d
49 C1 ED 20 shr r13,20h
44 85 E8 test eax,r13d
0F 95 45 58 setne byte ptr [rbp+58h]
After:
C6 45 58 00 mov byte ptr [rbp+58h],0
- The carry flag doesn't need to be computed. Just do the arithmetic
shift.
Before:
B9 02 00 00 00 mov ecx,2
48 C1 E7 20 shl rdi,20h
48 D3 FF sar rdi,cl
48 C1 EF 20 shr rdi,20h
After:
C1 FF 02 sar edi,2
- The carry flag must be computed. In addition to the arithmetic shift,
we do a shift to the left and AND the two results together to know if
any ones were shifted out. It's still better than before, because we
can do 32-bit shifts.
Before:
B9 02 00 00 00 mov ecx,2
49 C1 E5 20 shl r13,20h
49 D3 FD sar r13,cl
41 8B C5 mov eax,r13d
49 C1 ED 20 shr r13,20h
44 85 E8 test eax,r13d
0F 95 45 58 setne byte ptr [rbp+58h]
After:
41 8B C5 mov eax,r13d
41 C1 FD 02 sar r13d,2
C1 E0 1E shl eax,1Eh
44 85 E8 test eax,r13d
0F 95 45 58 setne byte ptr [rbp+58h]
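The four cases above can be modeled in plain C++ to see why the shift-left-and-AND trick yields the carry. This is only a sketch of the arithmetic, not the emitter code. The function names and the `need_carry` parameter are assumptions made for illustration:

```cpp
#include <cstdint>

struct SrawResult
{
  std::uint32_t value;
  bool carry;  // Left as false when the caller does not need it.
};

// Arithmetic shift right by 1..31 bits, written with unsigned operations.
// Models a 32-bit SAR with an 8-bit immediate.
std::uint32_t Sar32(std::uint32_t x, unsigned amount)
{
  const std::uint32_t sign_fill =
      (x & 0x80000000u) != 0 ? (0xFFFFFFFFu << (32 - amount)) : 0u;
  return (x >> amount) | sign_fill;
}

SrawResult SrawKnownAmount(std::uint32_t rs, std::uint32_t rb, bool need_carry)
{
  // Case 1: bit 5 of the amount is set, so every bit is shifted out. A shift
  // by 31 leaves all zeroes or all ones, and the carry is "result != 0".
  if ((rb & 0x20u) != 0)
  {
    const std::uint32_t value = Sar32(rs, 31);
    return {value, value != 0};
  }

  const unsigned amount = rb & 0x1Fu;

  // Case 2: nothing to shift, the carry is simply cleared.
  if (amount == 0)
    return {rs, false};

  const std::uint32_t value = Sar32(rs, amount);

  // Case 3: only the shifted value is wanted.
  if (!need_carry)
    return {value, false};

  // Case 4: the bits that fall off the right end are moved to the top of a
  // second copy. The top `amount` bits of `value` are copies of the sign
  // bit, so the AND is nonzero only for a negative source that lost ones.
  const std::uint32_t shifted_out = rs << (32 - amount);
  return {value, (shifted_out & value) != 0};
}
```

For example, 0xFFFFFFFD shifted by 2 loses the bits `01`, so the carry is set, while 0xFFFFFFFC loses `00` and leaves it clear. A positive source never sets it.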
More efficient code can be generated if the shift amount is known at
compile time. Similar optimizations were present in JitArm64 already,
but were missing in Jit64.
- By using an 8-bit immediate we can eliminate the need for ECX as a
scratch register, thereby reducing register pressure and occasionally
eliminating a spill.
Before:
B9 18 00 00 00 mov ecx,18h
41 8B F7 mov esi,r15d
48 D3 E6 shl rsi,cl
8B F6 mov esi,esi
After:
41 8B CF mov ecx,r15d
C1 E1 18 shl ecx,18h
- PowerPC has strange shift amount masking behavior which is emulated
using 64-bit shifts, even though we only care about a 32-bit result.
If the shift amount is known, we can handle this special case
separately, and use 32-bit shift instructions otherwise. We also no
longer need to clear the upper 32 bits of the register.
Before:
BE F8 FF FF FF mov esi,0FFFFFFF8h
8B CE mov ecx,esi
41 8B F4 mov esi,r12d
48 D3 E6 shl rsi,cl
8B F6 mov esi,esi
After:
Nothing, register is set to constant zero.
- A shift by zero becomes a simple MOV.
Before:
BE 00 00 00 00 mov esi,0
8B CE mov ecx,esi
41 8B F3 mov esi,r11d
48 D3 E6 shl rsi,cl
8B F6 mov esi,esi
After:
41 8B FB mov edi,r11d
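To check that splitting on the known amount preserves the result, the two strategies can be compared in a short sketch. `SlwGeneric` models the 64-bit shift followed by truncation, and `SlwKnownAmount` models the three cases listed above. Both names are hypothetical:

```cpp
#include <cstdint>

// Generic path: a 64-bit shift uses the low six bits of the amount, and
// truncating to 32 bits drops everything that moved past bit 31.
std::uint32_t SlwGeneric(std::uint32_t rs, std::uint32_t rb)
{
  return static_cast<std::uint32_t>(static_cast<std::uint64_t>(rs) << (rb & 0x3Fu));
}

// Known-amount path: decide up front which of the three cases applies.
std::uint32_t SlwKnownAmount(std::uint32_t rs, std::uint32_t rb)
{
  if ((rb & 0x20u) != 0)
    return 0;  // Amounts 32..63 shift every bit out: constant zero.

  const unsigned amount = rb & 0x1Fu;
  if (amount == 0)
    return rs;  // Shift by zero: a plain move.

  return rs << amount;  // 32-bit shift with an immediate amount.
}
```

The two functions agree for every amount, including the 0xFFFFFFF8 case from the listing, where bit 5 is set and the result is zero.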
More efficient code can be generated if the shift amount is known at
compile time. Similar optimizations were present in JitArm64 already,
but were missing in Jit64.
- By using an 8-bit immediate we can eliminate the need for ECX as a
scratch register, thereby reducing register pressure and occasionally
eliminating a spill.
Before:
B9 18 00 00 00 mov ecx,18h
45 8B C1 mov r8d,r9d
49 D3 E8 shr r8,cl
After:
45 8B C1 mov r8d,r9d
41 C1 E8 18 shr r8d,18h
- PowerPC has strange shift amount masking behavior which is emulated
using 64-bit shifts, even though we only care about a 32-bit result.
If the shift amount is known, we can handle this special case
separately, and use 32-bit shift instructions otherwise.
Before:
B9 F8 FF FF FF mov ecx,0FFFFFFF8h
45 8B C1 mov r8d,r9d
49 D3 E8 shr r8,cl
After:
Nothing, register is set to constant zero.
- A shift by zero becomes a simple MOV.
Before:
B9 00 00 00 00 mov ecx,0
45 8B C1 mov r8d,r9d
49 D3 E8 shr r8,cl
After:
45 8B C1 mov r8d,r9d
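One way to picture the decision made for a known shift amount is as a small classification step followed by the matching operation. The enum and function names below are hypothetical and only model the choice, not the actual code generation:

```cpp
#include <cstdint>

// Hypothetical names for the three situations described above.
enum class ShiftStrategy
{
  ConstantZero,  // Amount has bit 5 set: every bit is shifted out.
  Move,          // Amount is zero: the value is copied unchanged.
  Shift32,       // Amount is 1..31: a 32-bit shift with an immediate.
};

ShiftStrategy ClassifySrw(std::uint32_t rb)
{
  if ((rb & 0x20u) != 0)
    return ShiftStrategy::ConstantZero;
  if ((rb & 0x1Fu) == 0)
    return ShiftStrategy::Move;
  return ShiftStrategy::Shift32;
}

// Computes the value each strategy would produce.
std::uint32_t ApplySrw(std::uint32_t rs, std::uint32_t rb)
{
  switch (ClassifySrw(rb))
  {
  case ShiftStrategy::ConstantZero:
    return 0;
  case ShiftStrategy::Move:
    return rs;
  case ShiftStrategy::Shift32:
    return rs >> (rb & 0x1Fu);
  }
  return 0;  // Not reached; keeps compilers that warn on fallthrough quiet.
}
```

The amounts from the three listings (18h, 0FFFFFFF8h, and 0) map to `Shift32`, `ConstantZero`, and `Move` respectively.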