Uses are split into three categories:
- Arbitrary (except for size savings) - constants like RSCRATCH are
used.
- ABI (i.e. RAX as return value) - ABI_RETURN is used.
- Fixed by architecture (RCX shifts, RDX/RAX for some instructions) -
explicit register is kept.
In theory this allows the assignments to be modified easily. I verified
that I was able to run Melee with all the registers changed, although
there may be issues if RSCRATCH[2] and ABI_PARAM{1,2} conflict.
And switch to a register order that consistently prefers callee-save to
caller-save. phire suggested putting rdi/rsi first, even though they're
caller-save, to save code space; this is more conservative and I can do
that later.
Rather than using a variety of registers including RSI, ABI_PARAM1
(either RCX or RDI), RCX, and RDX, the rule is:
- RDI and RSI are never used. This allows them to be allocated on Unix,
bringing parity with Windows.
- RDX is a permanent temporary register along with RAX (and is thus not
FlushLocked). It's used frequently enough that allocating it would
probably be a bad idea, as it would constantly get flushed.
- RCX is allocatable, but is flushed in two situations:
- Non-immediate shifts (rlwnm), because x86 requires RCX to be used.
- Paired single loads and stores, because they require three
temporary registers: the helper functions take two integer
arguments, and another register is used as an index to get the
function address.
These should be relatively rare.
While we're at it, in stores, use the registers directly where possible
rather than always using temporaries (by making SafeWriteRegToReg
clobber less). The address doesn't need to be clobbered in the usual
case, and on CPUs with MOVBE, neither does the value.
Oh, and get rid of a useless MEMCHECK.
This commit does not actually add new registers to the allocation order;
it is intended to test for any performance or correctness issues
separately.
The special case is where the registers are actually to be swapped (i.e.
func(ABI_PARAM2, ABI_PARAM1); this was previously impossible but would
be ugly not to handle anyway.
In two cases, my old code was using a temporary register but not saving
it properly; it basically worked by accident (an otherwise useless
FlushLock was causing CallerSavedRegistersInUse to think it was in use
by the GPR cache, even though it was actually a temporary).
I'm going to modify this in the next commit to use RDX, but I didn't
want to leave a broken revision in the middle.
The register is RBP, previously in the GPR allocation order. The next
commit will investigate whether there are too few GPRs (now or before),
but for now there is no replacement.
Previously, it was accessed RIP relatively; using RBP, anything in the
first 0x100 bytes of ppcState (including all the GPRs) can be accessed
with three fewer bytes. Code to access ppcState is generated constantly
(mostly by register save/load), so in principle, this should improve
instruction cache footprint significantly. It seems that this makes a
significant performance difference in practice.
The vast majority of this commit is mechanically replacing
M(&PowerPC::ppcState.x) with a new macro PPCSTATE(x).
Version 2: gets most of the cases which were using the register access
macros.
GetOpInfo was returning null pointers for invalid ops in subtables
instead of asserting an error. This was causing segfaults when the
jit tried to jit invalid code.
Prior to this change, it was possible to cause an infinite loop by making the string to be replaced and the replacing string the same thing.
e.g.
std::string some_str = "test";
ReplaceAll(some_str, "test", "test");
This also changes the replacing in a way that doesn't require starting from the beginning of the string on each replacement iteration.
For a long time, we've had ugly and inconsistent function names here as
helpers, names like "decodebytesRGB5A3rgba" which are absolutely
incomprehensible to understand. Fix this by introducing a new consistent
naming scheme, where the above function now becomes "DecodeBytes_RGB5A3".
Instead of having three separate functions and checking the tlutfmt in a
variety of places, just do it once in a helper method. This is already
for the slow path either in our Generic decoder or in our Software
renderer, so it doesn't matter that this is slower.
x64 will continue using the separate functions for speed.
The D3D / OGL backends only ever used RGBA textures, and the Software
backend uses its own custom code for sampling. The ARGB path seems to
just be dead code.
Since ARGB and RGBA formats are similar, I don't think this will make
the code more difficult to read or unable to be used as
reference. Somebody who wants to use this code to output ARGB can simply
modify the MakeRGBA function to put the shift at the other end.
This pulls all the duplicate code from TextureDecoder_Generic /
TextureDecoder_x64 out and puts it in a common file. Out custom font
used for debugging the texture cache is also pulled out and put in a
common "sfont.inc" file. At some point we should also combine this font
with the other six binary fonts we ship.
GetPC_TexFormat was never used. It was added in commit d02426a, with the
only user being commented out code. The commented out code was later
removed in 9893122, but the implementation stayed.
We were decoding to BGRA32 textures in our RGBA32 texture decoder. Since
this is the same for the BGRA32 decoder implementation, this is most
likely a copy/paste typo, rather than the texture actually being
bit-swapped. Fix this.
I'm not sure of any games that use the C14X2 texture format, so I'm not
sure this fixes any games, but it does make the code cleaner for when we
clean it up in the future, and merge some of these similar loops.
The person who wrote this seemed to misunderstand how XPending and
XNextEvent actually work. XNextEvent will wait in poll if there's
no event yet, meaning that we don't need to sleep after we process
all the events; the kernel will sleep for us.
This changes indentation, so view with -w or a similar feature to
understand what's actually changed here.
Added the option to handle whether the user wants to iterate through the
assignment of button mappings or assign them one at a time.
fixed formatting issues and code style.
I excluded this option from the config file. This stopped the check box value and the boolean from becoming offset. Since the option should always start as false.
This still causes an issue with the Wiimote input, since the class variable that keeps the state will be wiped, but the check box value will stay the same after closing/reopening without closing the entire Wiimote configuration. I am looking for a way to resolve this.
I also reduced wait time to 2.5 seconds vs. the 5 seconds previously. Seemed to be a little long.
These changes apparently did not go through.
This should fix the Wiimote issue.
Allows user to map all inputs seamlessly without having to
click on each button.
Also increased button timeout to 5 seconds from 1.5 due to pita.
Motion controls are not included since they will be special cases.
As far as I can tell, this has literally been here since the start of the git
history; maybe it was stubbed out because the author wasn't sure it was right?
It matches the PPC/Broadway manuals perfectly, though.
This value was "helpful" for debugging when the stack got corrupted.
Helpful that if gpr[1](Which is the stack pointer with PPC ABI) is zero then the interpreter would spam huge amounts of annoy text saying that we
managed to get in to a "corrupted" state.
This is incremented every instruction on the interpreter, or every block run on the JIT64....Only if debugging is enabled(JIT64 it is a const
variable)
The message is only outputted when interpreter is used and debugging is enabled.
The old method would always evict the first suitable register, i.e. the
same register every time once the cache got full. The cache doesn't get
terribly often, but the result is pathological...
Faster, of course, since we avoid the interpreter, but also means we can
get more a more accurate timer in long blocks by adding the offset from the
start of the block to the retrieved timer. I don't know if this will actually
fix any issues, but it's more correct and a nearly-free improvement.
This causes glDrawArrays to fail in core profile, and thus on OS X, see:
http://renderingpipeline.com/2012/03/attribute-less-rendering/
There must be something bound, even though it is not used.
Fixes#7599. I'm not sure this is actually the best way to fix it,
since AFAICT it makes a nonobvious assumption that *something* will be
bound before the first attributeless rendering in
TextureConverter::DecodeToTexture, but it's what degasus suggested and
seems to work.
Separated out from my gpu-determinism branch by request. It's not a big
commit; I just like to write long commit messages.
The main reason to kill it is hopefully a slight performance improvement
from avoiding the double switch (especially in single core mode);
however, this also improves cycle calculation, as described below.
- FifoCommandRunnable is removed; in its stead, Decode returns the
number of cycles (which only matters for "sync" GPU mode), or 0 if there
was not enough data, and is also responsible for unknown opcode alerts.
Decode and DecodeSemiNop are almost identical, so the latter is replaced
with a skipped_frame parameter to Decode. Doesn't mean we can't improve
skipped_frame mode to do less work; if, at such a point, branching on it
has too much overhead (it certainly won't now), it can always be changed
to a template parameter.
- FifoCommandRunnable used a fixed, large cycle count for display lists,
regardless of the contents. Presumably the actual hardware's processing
time is mostly the processing time of whatever commands are in the list,
and with this change InterpretDisplayList can just return the list's
cycle count to be added to the total. (Since the calculation for this
is part of Decode, it didn't seem easy to split this change up.)
To facilitate this, Decode also gains an explicit 'end' parameter in
lieu of FifoCommandRunnable's call to GetVideoBufferEndPtr, which can
point to there or to the end of a display list (or elsewhere in
gpu-determinism, but that's another story). Also, as a small
optimization, InterpretDisplayList now calls OpcodeDecoder_Run rather
than having its own Decode loop, to allow Decode to be inlined (haven't
checked whether this actually happens though).
skipped_frame mode still does not traverse display lists and uses the
old fake value of 45 cycles. degasus has suggested that this hack is
not essential for performance and can be removed, but I want to separate
any potential performance impact of that from this commit.
This is required to make packing consistent between compilers: with u32, MSVC
would not allocate a bitfield that spans two u32s (it would leave a "hole").