Separated out from my gpu-determinism branch by request. It's not a big
commit; I just like to write long commit messages.
The main reason to kill it is hopefully a slight performance improvement
from avoiding the double switch (especially in single core mode);
however, this also improves cycle calculation, as described below.
- FifoCommandRunnable is removed; in its stead, Decode returns the
number of cycles (which only matters for "sync" GPU mode), or 0 if there
was not enough data, and is also responsible for unknown opcode alerts.
Decode and DecodeSemiNop are almost identical, so the latter is replaced
with a skipped_frame parameter to Decode. Doesn't mean we can't improve
skipped_frame mode to do less work; if, at such a point, branching on it
has too much overhead (it certainly won't now), it can always be changed
to a template parameter.
- FifoCommandRunnable used a fixed, large cycle count for display lists,
regardless of the contents. Presumably the actual hardware's processing
time is mostly the processing time of whatever commands are in the list,
and with this change InterpretDisplayList can just return the list's
cycle count to be added to the total. (Since the calculation for this
is part of Decode, it didn't seem easy to split this change up.)
To facilitate this, Decode also gains an explicit 'end' parameter in
lieu of FifoCommandRunnable's call to GetVideoBufferEndPtr, which can
point to there or to the end of a display list (or elsewhere in
gpu-determinism, but that's another story). Also, as a small
optimization, InterpretDisplayList now calls OpcodeDecoder_Run rather
than having its own Decode loop, to allow Decode to be inlined (haven't
checked whether this actually happens though).
skipped_frame mode still does not traverse display lists and uses the
old fake value of 45 cycles. degasus has suggested that this hack is
not essential for performance and can be removed, but I want to separate
any potential performance impact of that from this commit.
This is required to make packing consistent between compilers: with u32, MSVC
would not allocate a bitfield that spans two u32s (it would leave a "hole").
OSD messages can be disabled, while still leaving them in the status bar. This is incredibly useful for certain users, who may wish to see the messages, but do not wish to have them cover up half of the screen. In particular TASers will generally have OSD messages on the screen 100% of the time, and they cover up useful information, making it critical to turn them off. However the messages are still very useful to them, so it's important to have them somewhere.
This reverts 4a16211bae.
This time, check the address carefully beforehand, since apparently some games
do horrible things like running it on non-RAM addresses, or at the very least
virtual addresses.
This is no longer required since we don't support x86_32 anymore.
x86_64 implies SSE2 support.
Also this check was a bit messed up and was hitting on Generic builds.
A bug that seems to have been uncovered by allowing immediate-address loads.
Super Monkey Ball 2 crashes without this change -- it's possible, however, that
the game actually requires the MMU hack, since it crashed due to accessing an
address in the 0x20000000-0x3fffffff range.
Decreases total Wii state save time (not counting compression) from
~570ms to ~18ms.
The compiler can't remove this check because of potential aliasing; this
might be fixable (e.g. by making mode const), but there is no reason to
have the code work in such a braindead way in the first place.
- DoVoid now uses memcpy.
- DoArray now uses DoVoid on the whole rather than Doing each element
(would fail for an array of STL structures, but we don't have any of
those).
- Do also now uses DoVoid. (In the previous version, it replicated
DoVoid's code in order to ensure each type gets its own implementation,
which for small types then becomes a simple load/store in any modern
compiler. Now DoVoid is __forceinline, which addresses that issue and
shouldn't make a big difference otherwise - perhaps a few extra copies
of the code inlined into DoArray or whatever.)
We were generating a texture without ever setting the data to a known value.
This happened on the old code as well, just that PP shaders are receiving some love and people are using it and noticing some of its issues.
Doesn't support all the FPSCR flags, just the FPRF ones.
Add PPCAnalyzer support to remove unnecessary FPRF calculations.
POV-ray benchmark with enableFPRF forced on for an extreme comparison:
Before: 1500s
After, fmul/fmadd only: 728s
After, all float: 753s
In real games that use FPRF, like F-Zero GX, FPRF previously cost a few percent
of total runtime.
Since FPRF is so much faster now, if enableFPRF is set, just do it for every
float instruction, not just fmul/fmadd like before. I don't know if this will
fix any games, but there's little good reason not to.
The only possible functionality change is that s_efbAccessRequested and
s_swapRequested are no longer reset at init and shutdown of the OGL
backend (only; this is the only interaction any files other than
MainBase.cpp have with them). I am fairly certain this was entirely
vestigial.
Possible performance implications: efbAccessReady now uses an Event
rather than spinning, which might be slightly slower, but considering
the slow loop the flags are being checked in from the GPU thread, I
doubt it's noticeable.
Also, this uses sequentially consistent rather than release/acquire
memory order, which might be slightly slower, especially on ARM...
something to improve in Event/Flag, really.
This shouldn't affect functionality. I'm not sure if the breakpoint
distinction is actually necessary (my commit messages from the old
dc-netplay last year claim that breakpoints are broken anyway, but I
don't remember why), but I don't actually need to change this part of
the code (yet), so I'll stick with the trimmings change for now.
While we're at it, support a bunch of float load/store variants that weren't
implemented in the JIT. Might not have a big speed impact on typical games but
they're used at least a bit in povray and luabench.
694 -> 644 seconds on povray.
In particular, even in code that only runs on x86-64, you can't use
PRIx64 for size_t because, on OS X, one is unsigned long and the other
is unsigned long long and clang whines about the difference. I guess
you could make a size_t specifier macro, but those are horribly ugly, so
I just used casting.
Anyone want to make a nice (and slow) template-based printf?
Now without bare 'unsigned'.
Thanks to magumagu's softfp experiments, we know a lot more about the Wii's
strange floating point unit than we used to. In particular, when doing a
single-precision floating point multiply (fmulsx), it rounds the right hand
side's mantissa so as to lose the low 28 bits (of the 53-bit mantissa).
Emulating this behavior in Dolphin fixes a bunch of issues with games that
require extremely precise emulation of floating point hardware, especially
game replays. Fortunately, we can do this with rather little CPU cost; just ~5
extra instructions per multiply, instead of the vast load of a pure-software
float implementation.
This doesn't make floating-point behavior at all perfect. I still suspect
fmadd rounding might not be quite right, since the Wii uses fused instructions
and Dolphin doesn't, and NaN/infinity/exception handling is probably off in
various ways... but it's definitely way better than before.
This appears to fix replays in Mario Kart Wii, Mario Kart Double Dash, and
Super Smash Brothers Brawl. I wouldn't be surprised if it fixes a bunch of
other stuff too.
The changes to instructions other than fmulsx may not be strictly necessary,
but I included them for completeness, since it feels wrong to fix some
instructions but not others, since some games we didn't test might rely on
them.
(1) Rename ABI_ALL_CALLEE_SAVED to ABI_ALL_CALLER_SAVED, because that's
what it was actually defined as (and used as). Derp.
(2) RegistersInUse is always used for the purpose of saving registers
before calling a C++ function in the middle of a JIT block (without
flushing). There is no need to save callee-saved registers in this
case. Change the name to CallerSavedRegistersInUse and mask with
ABI_ALL_CALLER_SAVED.
Nothing obvious broke when starting up a Melee game. (I added a test
for anything actually being masked out; it happens, but in this
particular case seemed to occur at most a few dozen times per second, so
the actual performance benefit is probably negligible.)
The PPC_FP conversion code can be made a lot simpler with the observation
that the only values that need to be sent through the slow x87 path are
denormals.
A whole bunch faster: 708->678 seconds on POV-RAY.
strictStrings is not supported by debug libraries, and indeed breaks the build.
Drop wbemidl.h (incompatible with strictStrings) dependency by using SDL-style search for XInput GUIDs.
The concept of a "title bar" / "status bar" shouldn't be a core concept,
so remove the Host_UpdateStatusBar function, and move the code handles
whether to update the status bar or titlebar into DolphinWX.
This is effectively unused, as the window handles that we pass to the
GLInterface are window handles for the frame which isn't ever a real
toplevel window. Host_UpdateTitle is what actually sets the proper title
on the render window.
Now that MainNoGUI is properly architected and GLX doesn't need to
sometimes craft its own windows sometimes which we have to thread back
into MainNoGUI, we don't need to thread the window handle that GLX
creates at all.
This removes the reference to pass back here, and the g_pWindowHandle
always be the same as the window returned by Host_GetRenderHandle().
A future cleanup could remove g_pWindowHandle entirely.
We now have two cases: the GLX window is parented into a frame, or it's
parented into the MainNoGUI host. In both cases, the GLX window should
be locked to the size of the parent, so just sync it up based on that.
Our existing code was relying on the GLX backend to create the GLX
window properly, and for the rest of the code to patch that up, sort
of. If we rely on Host_GetRenderHandle() returning a valid window, we
can do a lot better about this.
Create a simple window inside MainNoGUI to make this happen.
Move to one display. There's no reason to have two displays here -- the
comment stated that one should touch GLX and one should touch window
events, and that they should be touched from different threads, but the
current code wasn't this careful.
Just use one Display connection.
Now, the only supported EGL platform is Android. We might eventually add
back support for EGL/X11 or EGL/Wayland, but it will have to be
architected differently.
Yes, this is a fancy new feature, but our Wayland support was
particularly bitrotten, and ideally this would be handled by a platform
layer like SDL. If not, we can always add this back in when GLInterface
has caught up. We might be able to even support wxWidgets and GL
together with subsurfaces!
The framebuffer is no longer rotated the wrong way around in Qualcomm's latest development drivers.
They did something right, only took them over a year.
This catches most instances of configuration failures that can happen in a post processing shader.
Gives a user a helpful error message that lets them know what they have failed to set up correctly
EXI memcard code now doesn't know specifics of how data is flushed to whatever backing storage is used.
GC raw memcard now flushes every 15 seconds if dirty, and on memcard destruction.
GCI folder now flushes only on memcard destruction.
Seems mesa has a quirk where
define THING(x) (#x)
is the same as
define THING(x) (##x)
Didn't realize I messed it up since it just worked since I only tested on Mesa.
The JIT block compare code didn't set the same options for the PPCAnalyzer
as the actual JIT did, which made the PPC side of the JIT block viewer stop
at the first branch instead of the end of the block.
I have no idea what the compiler does with these, and this code probably
isn't triggered in most games, but it's probably better not to taunt the
undefined behavior demon.