Decreases total Wii state save time (not counting compression) from
~570ms to ~18ms.
The compiler can't remove this check because of potential aliasing; this
might be fixable (e.g. by making mode const), but there is no reason to
have the code work in such a braindead way in the first place.
- DoVoid now uses memcpy.
- DoArray now uses DoVoid on the whole rather than Doing each element
(would fail for an array of STL structures, but we don't have any of
those).
- Do also now uses DoVoid. (In the previous version, it replicated
DoVoid's code in order to ensure each type gets its own implementation,
which for small types then becomes a simple load/store in any modern
compiler. Now DoVoid is __forceinline, which addresses that issue and
shouldn't make a big difference otherwise - perhaps a few extra copies
of the code inlined into DoArray or whatever.)
We were generating a texture without ever setting the data to a known value.
This happened on the old code as well, just that PP shaders are receiving some love and people are using it and noticing some of its issues.
Doesn't support all the FPSCR flags, just the FPRF ones.
Add PPCAnalyzer support to remove unnecessary FPRF calculations.
POV-ray benchmark with enableFPRF forced on for an extreme comparison:
Before: 1500s
After, fmul/fmadd only: 728s
After, all float: 753s
In real games that use FPRF, like F-Zero GX, FPRF previously cost a few percent
of total runtime.
Since FPRF is so much faster now, if enableFPRF is set, just do it for every
float instruction, not just fmul/fmadd like before. I don't know if this will
fix any games, but there's little good reason not to.
The only possible functionality change is that s_efbAccessRequested and
s_swapRequested are no longer reset at init and shutdown of the OGL
backend (only; this is the only interaction any files other than
MainBase.cpp have with them). I am fairly certain this was entirely
vestigial.
Possible performance implications: efbAccessReady now uses an Event
rather than spinning, which might be slightly slower, but considering
the slow loop the flags are being checked in from the GPU thread, I
doubt it's noticeable.
Also, this uses sequentially consistent rather than release/acquire
memory order, which might be slightly slower, especially on ARM...
something to improve in Event/Flag, really.
This shouldn't affect functionality. I'm not sure if the breakpoint
distinction is actually necessary (my commit messages from the old
dc-netplay last year claim that breakpoints are broken anyway, but I
don't remember why), but I don't actually need to change this part of
the code (yet), so I'll stick with the trimmings change for now.
While we're at it, support a bunch of float load/store variants that weren't
implemented in the JIT. Might not have a big speed impact on typical games but
they're used at least a bit in povray and luabench.
694 -> 644 seconds on povray.
In particular, even in code that only runs on x86-64, you can't use
PRIx64 for size_t because, on OS X, one is unsigned long and the other
is unsigned long long and clang whines about the difference. I guess
you could make a size_t specifier macro, but those are horribly ugly, so
I just used casting.
Anyone want to make a nice (and slow) template-based printf?
Now without bare 'unsigned'.