It only ever did anything on 32-bit OS X.
Anyway, it wasn't even on the right functions, and these days
ABI_PushRegistersAndAdjustStack should handle maintaining the ABI
correctly.
wxGetActiveWindow is implemented as "return NULL" on OS X, while
wxWindow::FindFocus works. On Windows, the difference is in the use of
GetActiveWindow() vs. GetForegroundWindow(). A MSDN comment says:
> A system has only one active window, which GetForegroundWindow()
> returns. GetActiveWindow() seems to return the same window as
> GetForegroundWindow() if the foreground window belongs to the current
> thread. Otherwise, it always returns null, rather than the topmost
> window of the calling thread.
Since we are on the GUI thread, it shouldn't make any difference.
This noticeably includes GL_ARB_get_program_binary, which was previously
thought unsupported on OS X. Well, actually, the OS X implementation is
trivial and reports 0 binary formats (as of 10.10; this is hardcoded in
GLEngine, by the way), but at least it'll work if it's fixed someday.
It now affects the GPU determinism mode as well as some miscellaneous
things that were calling IsNetPlayRunning. Probably incomplete.
Notably, this can change while paused, if the user starts recording a
movie. The movie code appears to have been missing locking between
setting g_playMode and doing other things, which probably had a small
chance of causing crashes or even desynced movies; fix that with
PauseAndLock.
The next commit will add a hidden config variable to override GPU
determinism mode.
It's a relatively big commit (less big with -w), but it's hard to test
any of this separately...
The basic problem is that in netplay or movies, the state of the CPU
must be deterministic, including when the game receives notification
that the GPU has processed FIFO data. Dual core mode notifies the game
whenever the GPU thread actually gets around to doing the work, so it
isn't deterministic. Single core mode is because it notifies the game
'instantly' (after processing the data synchronously), but it's too slow
for many systems and games.
My old dc-netplay branch worked as follows: everything worked as normal
except the state of the CP registers was a lie, and the CPU thread only
delivered results when idle detection triggered (waiting for the GPU if
they weren't ready at that point). Usually, a game is idle iff all the
work for the frame has been done, except for a small amount of work
depending on the GPU result, so neither the CPU or the GPU waiting on
the other affected performance much. However, it's possible that the
game could be waiting for some earlier interrupt, and any of several
games which, for whatever reason, never went into a detectable idle
(even when I tried to improve the detection) would never receive results
at all. (The current method should have better compatibility, but it
also has slightly higher overhead and breaks some other things, so I
want to reimplement this, hopefully with less impact on the code, in the
future.)
With this commit, the basic idea is that the CPU thread acts as if the
work has been done instantly, like single core mode, but actually hands
it off asynchronously to the GPU thread (after backing up some data that
the game might change in memory before it's actually done). Since the
work isn't done, any feedback from the GPU to the CPU, such as real
XFB/EFB copies (virtual are OK), EFB pokes, performance queries, etc. is
broken; but most games work with these options disabled, and there is no
need to try to detect what the CPU thread is doing.
Technically: when the flag g_use_deterministic_gpu_thread (currently
stuck on) is on, the CPU thread calls RunGpu like in single core mode.
This function synchronously copies the data from the FIFO to the
internal video buffer and updates the CP registers, interrupts, etc.
However, instead of the regular ReadDataFromFifo followed by running the
opcode decoder, it runs ReadDataFromFifoOnCPU ->
OpcodeDecoder_Preprocess, which relatively quickly scans through the
FIFO data, detects SetFinish calls etc., which are immediately fired,
and saves certain associated data from memory (e.g. display lists) in
AuxBuffers (a parallel stream to the main FIFO, which is a bit slow at
the moment), before handing the data off to the GPU thread to actually
render. That makes up the bulk of this commit.
In various circumstances, including the aforementioned EFB pokes and
performance queries as well as swap requests (i.e. the end of a frame -
we don't want the CPU potentially pumping out frames too quickly and the
GPU falling behind*), SyncGPU is called to wait for actual completion.
The overhead mainly comes from OpcodeDecoder_Preprocess (which is,
again, synchronous), as well as the actual copying.
Currently, display lists and such are escrowed from main memory even
though they usually won't change over the course of a frame, and
textures are not even though they might, resulting in a small chance of
graphical glitches. When the texture locking (i.e. fault on write) code
lands, I can make this all correct and maybe a little faster.
* This suggests an alternate determinism method of just delaying results
until a short time before the end of each frame. For all I know this
might mostly work - I haven't tried it - but if any significant work
hinges on the competion of render to texture etc., the frame will be
missed.
videoBuffer -> s_video_buffer
size -> s_video_buffer_write_ptr
g_pVideoData -> g_video_buffer_read_ptr (impl moved to Fifo.cpp)
This eradicates the wonderful use of 'size' as a global name, and makes
it clear that s_video_buffer_write_ptr and g_video_buffer_read_ptr are
the two ends of the FIFO buffer s_video_buffer.
Oh, and remove a useless namespace {}.
This state will be used to calculate sizes for skipping over commands on
a separate thread. An alternative to having these state variables would
be to have the preprocessor stash "state as we go" somewhere, but I
think that would be much uglier.
GetVertexSize now takes an extra argument to determine which state to
use, as does FifoCommandRunnable, which calls it. While I'm modifying
FifoCommandRunnable, I also change it to take a buffer and size as
parameters rather than using g_pVideoData, which will also be necessary
later. I also get rid of an unused overload.
VertexLoader::VertexLoader was setting loop_counter, a *static*
variable, to 0. This was nonsensical, but harmless until I started to
run it on a separate thread, where it had a chance of interfering with a
running vertex translator.
Switch to just using a register for the loop counter.
- Lazily create the native vertex format (which involves GL calls) from
RunVertices rather than RefreshLoader itself, freeing the latter to be
run from the CPU thread (hopefully).
- In order to avoid useless allocations while doing so, store the native
format inside the VertexLoader rather than using a cache entry.
- Wrap the s_vertex_loader_map in a lock, for similar reasons.
Renamed various commands to refer to ISO instead of GCM for consistency,
as the commands are used for both Wii and GameCube files.
CompressGCM --> CompressISO
DeleteGCM --> DeleteISO
MultiCompressGCM --> MultiCompressISO
MultiDecompressGCM --> MultiDecompressISO
SetDefaultGCM --> SetDefaultISO
Fixed COMPRESSISO
Fixed missing "COMPRESSISO"
Fixed more COMPRESSISO
Final fix for COMPRESSISO
Detects a situation where the game is writing to the dcache at the address being DMA'd. As we do not have dcache emulation, invalid data is being DMA'd causing audio glitches. The following code detects this and enables the DMA to complete instantly before the invalid data is written.
Added accurate ARAM DMA transfer timing.
Removed the addition of DSP exception checking.
Updated ARAM DMA and FIFO write exception checking to uses these types.
Conflicts:
Source/Core/Core/PowerPC/Interpreter/Interpreter_Tables.cpp
Source/Core/Core/PowerPC/PPCTables.h
This helps us avoid accidentally clobbering flags between two instructions
when the flags are expected to be maintained. Dolphin will of course crash
immediately, but at least it will crash loudly and alert us of the mistake,
instead of forcing hours of bisecting to find the subtle way in which the JIT
has managed to sneak a flag-modifying instruction where there shouldn't be one.
That commit reorganized fastmem a bit; I wrote it before the patch to
support fastmem in JitIL landed, and forgot to edit it to account for
the fact. Since JitILBase now derives from Jitx86Base, the HandleFault
override can just be removed.
Also correct behavior with regards to which bits in XER are treated as zero
based on a hwtest (probably doesn't affect any real games, but might as well
be correct).
This should dramatically reduce code size in the case of blocks with
lots of branches, and certainly doesn't hurt elsewhere either.
This can probably be improved a good bit through smarter tracking of register
usage, e.g. discarding registers that are going to be overwritten, but this
is a good start and should help reduce code size and register pressure.
Unlike that sort of change, this is a "safe" patch; it only flushes registers,
which can't affect correctness, unlike actually discarding data.
As part of this, refactor PPCAnalyst to support distinguishing between
float and integer registers (to properly handle instructions that access
both, like floating-point loads and stores).
Also update every instruction in the interpreter flags table I could find
that didn't have all the correct flags.
When a game was running and someone opened the video dialog, it would crash. This is because the preprocessor macro should have been __APPLE__ not _APPLE_
Fixes issue 7644.
These ID values would clash with the window parent IDs of all the actual debugger panes (they are in the 350 range as well).
For example, attempting to show and then close the memory window would cause an assertion, because it would attempt to destroy the text control for searching through memory, rather than destroying the actual parent window it's attached to.
These IDs are only used locally, so their value doesn't matter.
This is inconsistent with how other containers are used (i.e. with Do()), but making std::array be used with Do() seems rather confusing when there's also a DoArray available.
I didn't realize the I and W fields were in a different place for these
variants.
This should fix Paper Mario and probably lots of other things I accidentally
broke.
fselx was the main problem, but ps_sel was wrong too (even if there were no
known reported bugs with it).
This fixes Beyond Good and Evil (at the least).
4 more instructions down.
Store ones should be pretty well-tested; load ones seem to almost never be
used. I found them in Turok Evolution, so I was able to check code generation,
but the relevant code didn't seem to be called.
My code didn't maintain correct semantics with floating-point NaNs (a < b is
not the same as "not a >= b" in float), which seems to have broken FIFA 12.
Rather than *MemTools.cpp checking whether the address is in the
emulated range itself (which, as of the next commit, doesn't cover every
kind of access the JIT might want to intercept) and doing PC
replacement, they just pass the access address and context to
jit->HandleFault, which does the rest itself.
Because SContext is now in JitInterface, I wanted JitBackpatch.h (which
defines it) to be lightweight, so I moved TrampolineCache and associated
x64{Analyzer,Emitter} dependencies into its own file. I hate adding new
files in three places, two of which are MSVC...
While I'm at it, edit a misleading comment.
When executing a BL-type instruction, push the new LR onto the stack,
then CALL the dispatcher or linked block rather than JMPing to it. When
executing BLR, compare [rsp+8] to LR, and RET if it's right, which it
usually will be unless the thread was switched out. If it's not right,
reset RSP to avoid overflow.
This both saves a trip through the dispatcher and improves branch
prediction.
There is a small possibility of stack overflow anyway, which should
be handled... *yawn*
These calls are made outside of JIT blocks, and thus previously did not
read any protection - register use is taken into account and the outer
dispatcher stack frame is sufficient. However, if data is to be stored
on the stack, these calls must reserve stack shadow space on Windows to
avoid clobbering it.
Use some SSE4 instructions in on CPUs that support them.
Use float instructions instead of int where appropriate (it's a cycle faster
on CPUs with arithmetic unit forwarding penalties).
JitIL's fastmem was stubbed out when Sonicadvance1 merged JitARMIL
into the tree. Since JitARMIL has been deleted, I simply re-arrange
the inheritance to base JitIL on Jitx86Base, so it can inherit the
backpatch function.
Povray Benchmark: 1985 seconds to 1316 seconds.
Tries as hard as possible to push carry-using operations (like addc and adde)
next to each other. Refactor the instruction reordering to be more flexible
and allow multiple passes.
353 -> 192 x86 instructions on a carry-heavy code block in Pokemon Puzzle.
12% faster overall in Pokemon Puzzle; probably less in typical games (Virtual
Console games seem to be carry-heavy for some reason; maybe a different
compiler?)
Has been broken since the flags-opt merge. The idle skipping code in
JitIL was very brittle and depended on the IL of it's inputs not
changing in any way.
flags-opt changed the IR generated by the cmp instruction, which is part
of the idle loop, causing JitIL to break in really weird ways, which
were almost impossible to track down.
This fixes various wii games crashing/not booting and the Regspill
error on (all?) gamecube mmu games.
Fixes all the current issues I've been experiencing.
Scaled back the register cache idea for now so I can actually work on some real instructions.
Tested this work with unit tests so I know it works.
Unit tests are pretty great things.
In situations where conditional continue isn't supported + if a JIT doesn't implement a instruction that has the FL_ENDBLOCK flag. This would cause an
infinite loop.
In reality all the JITs should implement every FL_ENDBLOCK instruction regardless, but JITIL doesn't implement tw/twi which are FL_ENDBLOCK
instructions.
Missed a define in x64MemTools for when the thought process was Android == ARM
Also changes the variable we use for choosing which folders to copy to and from our jni file.
This has changed since the x86_64 build target uses the library folder x86-64, which is stupid and annoying.
It didn't behave correctly with an input of zero, resulting in some games
breaking (at the least, Fight Night 2). This should be fixed now.
Also clean it up, add a few comments, and fix some variants of the instruction
that are so rare that they probably never got tested.
Google has gotten their act together and fixes a few of the signal handling headers.
Change over to a header that works on both r10 32bit and r10 64bit.
32bit has the old "broken" headers as in some didn't even exist.
64bit has the "fixed" headers that one would expect on any regular unix system.
To avoid FPRs being pushed unnecessarily, I checked the uses: DSPEmitter
doesn't use FPRs, and VertexLoader doesn't use anything but RAX, so I
specified the register list accordingly. The regular JIT, however, does
use FPRs, and as far as I can tell, it was incorrect not to save them in
the outer routine. Since the dispatcher loop is only exited when
pausing or stopping, this should have no noticeable performance impact.
- Factor common work into a helper function.
- Replace confusingly named "noProlog" with "rsp_alignment". Now that
x86 is not supported, we can just specify it explicitly as 8 for
clarity.
- Add the option to include more frame size, which I'll need later.
- Revert a change by magumagu in March which replaced MOVAPD with MOVUPD
on account of 32-bit Windows, since it's no longer supported. True,
apparently recent processors don't execute the former any faster if the
pointer is, in fact, aligned, but there's no point using MOVUPD for
something that's guaranteed to be aligned...
(I discovered that GenFrsqrte and GenFres were incorrectly passing false
to noProlog - they were, in fact, functions without prologs, the
original meaning of the parameter - which caused the previous change to
break. This is now fixed.)
Each emulated Wiimote can have its speaker routed from left to right via the "Speaker Pan" setting in the emulated wiimote settings dialog. Use any value from -127 for leftmost to 127 for rightmost with 0 being the centre.
Added code in the InputConfig to use a spin control for non-boolean values.
Defaulted the setting of "Enable Speaker Data" to disabled.
The Wiimotes are positioned as follows:
Wiimote 0 = Center
Wiimote 1 = Left
Wiimote 2 = Right
Wiimote 3 = Center
The Wiimote speaker output can be disabled via the "Enable Speaker Data" checkbox in the Wiimote settings.
This is the bare minimum required to run a few games on AArch64.
Was able to run starfield and Animal Crossing to the Nintendo logo.
QEmu emulation is literally the slowest thing in the world, it maxes out at around 12mhz on my Core i7-4930MX.
I've tested a few instruction encodings and am expecting most to work as long as one stays away from VFP/SIMD.
This implements mostly instructions to bring up an initial JIT with integer support.
This can be improved to allow ease of use functions in the future, dealing with the raw imms/immr encodings is probably the worst thing ever.
Uses are split into three categories:
- Arbitrary (except for size savings) - constants like RSCRATCH are
used.
- ABI (i.e. RAX as return value) - ABI_RETURN is used.
- Fixed by architecture (RCX shifts, RDX/RAX for some instructions) -
explicit register is kept.
In theory this allows the assignments to be modified easily. I verified
that I was able to run Melee with all the registers changed, although
there may be issues if RSCRATCH[2] and ABI_PARAM{1,2} conflict.
And switch to a register order that consistently prefers callee-save to
caller-save. phire suggested putting rdi/rsi first, even though they're
caller-save, to save code space; this is more conservative and I can do
that later.
Rather than using a variety of registers including RSI, ABI_PARAM1
(either RCX or RDI), RCX, and RDX, the rule is:
- RDI and RSI are never used. This allows them to be allocated on Unix,
bringing parity with Windows.
- RDX is a permanent temporary register along with RAX (and is thus not
FlushLocked). It's used frequently enough that allocating it would
probably be a bad idea, as it would constantly get flushed.
- RCX is allocatable, but is flushed in two situations:
- Non-immediate shifts (rlwnm), because x86 requires RCX to be used.
- Paired single loads and stores, because they require three
temporary registers: the helper functions take two integer
arguments, and another register is used as an index to get the
function address.
These should be relatively rare.
While we're at it, in stores, use the registers directly where possible
rather than always using temporaries (by making SafeWriteRegToReg
clobber less). The address doesn't need to be clobbered in the usual
case, and on CPUs with MOVBE, neither does the value.
Oh, and get rid of a useless MEMCHECK.
This commit does not actually add new registers to the allocation order;
it is intended to test for any performance or correctness issues
separately.
The special case is where the registers are actually to be swapped (i.e.
func(ABI_PARAM2, ABI_PARAM1); this was previously impossible but would
be ugly not to handle anyway.
In two cases, my old code was using a temporary register but not saving
it properly; it basically worked by accident (an otherwise useless
FlushLock was causing CallerSavedRegistersInUse to think it was in use
by the GPR cache, even though it was actually a temporary).
I'm going to modify this in the next commit to use RDX, but I didn't
want to leave a broken revision in the middle.
The register is RBP, previously in the GPR allocation order. The next
commit will investigate whether there are too few GPRs (now or before),
but for now there is no replacement.
Previously, it was accessed RIP relatively; using RBP, anything in the
first 0x100 bytes of ppcState (including all the GPRs) can be accessed
with three fewer bytes. Code to access ppcState is generated constantly
(mostly by register save/load), so in principle, this should improve
instruction cache footprint significantly. It seems that this makes a
significant performance difference in practice.
The vast majority of this commit is mechanically replacing
M(&PowerPC::ppcState.x) with a new macro PPCSTATE(x).
Version 2: gets most of the cases which were using the register access
macros.
GetOpInfo was returning null pointers for invalid ops in subtables
instead of asserting an error. This was causing segfaults when the
jit tried to jit invalid code.
Previously it would fall through to the mmu code path, and raise a dsi
exception, which it would never check for, so it would continue
executing code silently.
Prior to this change, it was possible to cause an infinite loop by making the string to be replaced and the replacing string the same thing.
e.g.
std::string some_str = "test";
ReplaceAll(some_str, "test", "test");
This also changes the replacing in a way that doesn't require starting from the beginning of the string on each replacement iteration.
For a long time, we've had ugly and inconsistent function names here as
helpers, names like "decodebytesRGB5A3rgba" which are absolutely
incomprehensible to understand. Fix this by introducing a new consistent
naming scheme, where the above function now becomes "DecodeBytes_RGB5A3".
Instead of having three separate functions and checking the tlutfmt in a
variety of places, just do it once in a helper method. This is already
for the slow path either in our Generic decoder or in our Software
renderer, so it doesn't matter that this is slower.
x64 will continue using the separate functions for speed.
The D3D / OGL backends only ever used RGBA textures, and the Software
backend uses its own custom code for sampling. The ARGB path seems to
just be dead code.
Since ARGB and RGBA formats are similar, I don't think this will make
the code more difficult to read or unable to be used as
reference. Somebody who wants to use this code to output ARGB can simply
modify the MakeRGBA function to put the shift at the other end.
This pulls all the duplicate code from TextureDecoder_Generic /
TextureDecoder_x64 out and puts it in a common file. Out custom font
used for debugging the texture cache is also pulled out and put in a
common "sfont.inc" file. At some point we should also combine this font
with the other six binary fonts we ship.
GetPC_TexFormat was never used. It was added in commit d02426a, with the
only user being commented out code. The commented out code was later
removed in 9893122, but the implementation stayed.
We were decoding to BGRA32 textures in our RGBA32 texture decoder. Since
this is the same for the BGRA32 decoder implementation, this is most
likely a copy/paste typo, rather than the texture actually being
bit-swapped. Fix this.
I'm not sure of any games that use the C14X2 texture format, so I'm not
sure this fixes any games, but it does make the code cleaner for when we
clean it up in the future, and merge some of these similar loops.
The person who wrote this seemed to misunderstand how XPending and
XNextEvent actually work. XNextEvent will wait in poll if there's
no event yet, meaning that we don't need to sleep after we process
all the events; the kernel will sleep for us.
This changes indentation, so view with -w or a similar feature to
understand what's actually changed here.
Added the option to handle whether the user wants to iterate through the
assignment of button mappings or assign them one at a time.
fixed formatting issues and code style.
I excluded this option from the config file. This stopped the check box value and the boolean from becoming offset. Since the option should always start as false.
This still causes an issue with the Wiimote input, since the class variable that keeps the state will be wiped, but the check box value will stay the same after closing/reopening without closing the entire Wiimote configuration. I am looking for a way to resolve this.
I also reduced wait time to 2.5 seconds vs. the 5 seconds previously. Seemed to be a little long.
These changes apparently did not go through.
This should fix the Wiimote issue.
Allows user to map all inputs seamlessly without having to
click on each button.
Also increased button timeout to 5 seconds from 1.5 due to pita.
Motion controls are not included since they will be special cases.
As far as I can tell, this has literally been here since the start of the git
history; maybe it was stubbed out because the author wasn't sure it was right?
It matches the PPC/Broadway manuals perfectly, though.
This value was "helpful" for debugging when the stack got corrupted.
Helpful that if gpr[1](Which is the stack pointer with PPC ABI) is zero then the interpreter would spam huge amounts of annoy text saying that we
managed to get in to a "corrupted" state.
This is incremented every instruction on the interpreter, or every block run on the JIT64....Only if debugging is enabled(JIT64 it is a const
variable)
The message is only outputted when interpreter is used and debugging is enabled.
The old method would always evict the first suitable register, i.e. the
same register every time once the cache got full. The cache doesn't get
terribly often, but the result is pathological...
Faster, of course, since we avoid the interpreter, but also means we can
get more a more accurate timer in long blocks by adding the offset from the
start of the block to the retrieved timer. I don't know if this will actually
fix any issues, but it's more correct and a nearly-free improvement.
This causes glDrawArrays to fail in core profile, and thus on OS X, see:
http://renderingpipeline.com/2012/03/attribute-less-rendering/
There must be something bound, even though it is not used.
Fixes#7599. I'm not sure this is actually the best way to fix it,
since AFAICT it makes a nonobvious assumption that *something* will be
bound before the first attributeless rendering in
TextureConverter::DecodeToTexture, but it's what degasus suggested and
seems to work.
Separated out from my gpu-determinism branch by request. It's not a big
commit; I just like to write long commit messages.
The main reason to kill it is hopefully a slight performance improvement
from avoiding the double switch (especially in single core mode);
however, this also improves cycle calculation, as described below.
- FifoCommandRunnable is removed; in its stead, Decode returns the
number of cycles (which only matters for "sync" GPU mode), or 0 if there
was not enough data, and is also responsible for unknown opcode alerts.
Decode and DecodeSemiNop are almost identical, so the latter is replaced
with a skipped_frame parameter to Decode. Doesn't mean we can't improve
skipped_frame mode to do less work; if, at such a point, branching on it
has too much overhead (it certainly won't now), it can always be changed
to a template parameter.
- FifoCommandRunnable used a fixed, large cycle count for display lists,
regardless of the contents. Presumably the actual hardware's processing
time is mostly the processing time of whatever commands are in the list,
and with this change InterpretDisplayList can just return the list's
cycle count to be added to the total. (Since the calculation for this
is part of Decode, it didn't seem easy to split this change up.)
To facilitate this, Decode also gains an explicit 'end' parameter in
lieu of FifoCommandRunnable's call to GetVideoBufferEndPtr, which can
point to there or to the end of a display list (or elsewhere in
gpu-determinism, but that's another story). Also, as a small
optimization, InterpretDisplayList now calls OpcodeDecoder_Run rather
than having its own Decode loop, to allow Decode to be inlined (haven't
checked whether this actually happens though).
skipped_frame mode still does not traverse display lists and uses the
old fake value of 45 cycles. degasus has suggested that this hack is
not essential for performance and can be removed, but I want to separate
any potential performance impact of that from this commit.
This is required to make packing consistent between compilers: with u32, MSVC
would not allocate a bitfield that spans two u32s (it would leave a "hole").
OSD messages can be disabled, while still leaving them in the status bar. This is incredibly useful for certain users, who may wish to see the messages, but do not wish to have them cover up half of the screen. In particular TASers will generally have OSD messages on the screen 100% of the time, and they cover up useful information, making it critical to turn them off. However the messages are still very useful to them, so it's important to have them somewhere.
This reverts 4a16211bae.
This time, check the address carefully beforehand, since apparently some games
do horrible things like running it on non-RAM addresses, or at the very least
virtual addresses.
This is no longer required since we don't support x86_32 anymore.
x86_64 implies SSE2 support.
Also this check was a bit messed up and was hitting on Generic builds.
A bug that seems to have been uncovered by allowing immediate-address loads.
Super Monkey Ball 2 crashes without this change -- it's possible, however, that
the game actually requires the MMU hack, since it crashed due to accessing an
address in the 0x20000000-0x3fffffff range.
Decreases total Wii state save time (not counting compression) from
~570ms to ~18ms.
The compiler can't remove this check because of potential aliasing; this
might be fixable (e.g. by making mode const), but there is no reason to
have the code work in such a braindead way in the first place.
- DoVoid now uses memcpy.
- DoArray now uses DoVoid on the whole rather than Doing each element
(would fail for an array of STL structures, but we don't have any of
those).
- Do also now uses DoVoid. (In the previous version, it replicated
DoVoid's code in order to ensure each type gets its own implementation,
which for small types then becomes a simple load/store in any modern
compiler. Now DoVoid is __forceinline, which addresses that issue and
shouldn't make a big difference otherwise - perhaps a few extra copies
of the code inlined into DoArray or whatever.)
We were generating a texture without ever setting the data to a known value.
This happened on the old code as well, just that PP shaders are receiving some love and people are using it and noticing some of its issues.
Doesn't support all the FPSCR flags, just the FPRF ones.
Add PPCAnalyzer support to remove unnecessary FPRF calculations.
POV-ray benchmark with enableFPRF forced on for an extreme comparison:
Before: 1500s
After, fmul/fmadd only: 728s
After, all float: 753s
In real games that use FPRF, like F-Zero GX, FPRF previously cost a few percent
of total runtime.
Since FPRF is so much faster now, if enableFPRF is set, just do it for every
float instruction, not just fmul/fmadd like before. I don't know if this will
fix any games, but there's little good reason not to.
The only possible functionality change is that s_efbAccessRequested and
s_swapRequested are no longer reset at init and shutdown of the OGL
backend (only; this is the only interaction any files other than
MainBase.cpp have with them). I am fairly certain this was entirely
vestigial.
Possible performance implications: efbAccessReady now uses an Event
rather than spinning, which might be slightly slower, but considering
the slow loop the flags are being checked in from the GPU thread, I
doubt it's noticeable.
Also, this uses sequentially consistent rather than release/acquire
memory order, which might be slightly slower, especially on ARM...
something to improve in Event/Flag, really.
This shouldn't affect functionality. I'm not sure if the breakpoint
distinction is actually necessary (my commit messages from the old
dc-netplay last year claim that breakpoints are broken anyway, but I
don't remember why), but I don't actually need to change this part of
the code (yet), so I'll stick with the trimmings change for now.
While we're at it, support a bunch of float load/store variants that weren't
implemented in the JIT. Might not have a big speed impact on typical games but
they're used at least a bit in povray and luabench.
694 -> 644 seconds on povray.
In particular, even in code that only runs on x86-64, you can't use
PRIx64 for size_t because, on OS X, one is unsigned long and the other
is unsigned long long and clang whines about the difference. I guess
you could make a size_t specifier macro, but those are horribly ugly, so
I just used casting.
Anyone want to make a nice (and slow) template-based printf?
Now without bare 'unsigned'.
Thanks to magumagu's softfp experiments, we know a lot more about the Wii's
strange floating point unit than we used to. In particular, when doing a
single-precision floating point multiply (fmulsx), it rounds the right hand
side's mantissa so as to lose the low 28 bits (of the 53-bit mantissa).
Emulating this behavior in Dolphin fixes a bunch of issues with games that
require extremely precise emulation of floating point hardware, especially
game replays. Fortunately, we can do this with rather little CPU cost; just ~5
extra instructions per multiply, instead of the vast load of a pure-software
float implementation.
This doesn't make floating-point behavior at all perfect. I still suspect
fmadd rounding might not be quite right, since the Wii uses fused instructions
and Dolphin doesn't, and NaN/infinity/exception handling is probably off in
various ways... but it's definitely way better than before.
This appears to fix replays in Mario Kart Wii, Mario Kart Double Dash, and
Super Smash Brothers Brawl. I wouldn't be surprised if it fixes a bunch of
other stuff too.
The changes to instructions other than fmulsx may not be strictly necessary,
but I included them for completeness, since it feels wrong to fix some
instructions but not others, since some games we didn't test might rely on
them.
(1) Rename ABI_ALL_CALLEE_SAVED to ABI_ALL_CALLER_SAVED, because that's
what it was actually defined as (and used as). Derp.
(2) RegistersInUse is always used for the purpose of saving registers
before calling a C++ function in the middle of a JIT block (without
flushing). There is no need to save callee-saved registers in this
case. Change the name to CallerSavedRegistersInUse and mask with
ABI_ALL_CALLER_SAVED.
Nothing obvious broke when starting up a Melee game. (I added a test
for anything actually being masked out; it happens, but in this
particular case seemed to occur at most a few dozen times per second, so
the actual performance benefit is probably negligible.)
The PPC_FP conversion code can be made a lot simpler with the observation
that the only values that need to be sent through the slow x87 path are
denormals.
A whole bunch faster: 708->678 seconds on POV-RAY.
strictStrings is not supported by debug libraries, and indeed breaks the build.
Drop wbemidl.h (incompatible with strictStrings) dependency by using SDL-style search for XInput GUIDs.
The concept of a "title bar" / "status bar" shouldn't be a core concept,
so remove the Host_UpdateStatusBar function, and move the code handles
whether to update the status bar or titlebar into DolphinWX.
This is effectively unused, as the window handles that we pass to the
GLInterface are window handles for the frame which isn't ever a real
toplevel window. Host_UpdateTitle is what actually sets the proper title
on the render window.
Now that MainNoGUI is properly architected and GLX doesn't need to
sometimes craft its own windows sometimes which we have to thread back
into MainNoGUI, we don't need to thread the window handle that GLX
creates at all.
This removes the reference to pass back here, and the g_pWindowHandle
always be the same as the window returned by Host_GetRenderHandle().
A future cleanup could remove g_pWindowHandle entirely.
We now have two cases: the GLX window is parented into a frame, or it's
parented into the MainNoGUI host. In both cases, the GLX window should
be locked to the size of the parent, so just sync it up based on that.
Our existing code was relying on the GLX backend to create the GLX
window properly, and for the rest of the code to patch that up, sort
of. If we rely on Host_GetRenderHandle() returning a valid window, we
can do a lot better about this.
Create a simple window inside MainNoGUI to make this happen.