Previously we had decided to busy loop on systems due to Windows' scheduler being terrible and moving us around CPU cores when we yielded.
Along with context switching being a hot spot.
We had decided to busy loop in these situations instead, which allows us greater CPU performance on the video thread.
This can be attributed to multiple things, CPU not downclocking while busy looping, context switches happening less often, yielding taking more time
than a busy loop, etc.
One thing we had considered when moving over to a busy loop is the issues that dual core systems would now face due to Dolphin eating all of their CPU
resources. Effectively we are starving a dual core system of any time to do anything else due to the CPU thread always being pinned at 100% and then
the GPU thread also always at 100% just spinning around. We noted the potential for a performance regression, but dismissed it as most computers are
now becoming quad core or higher.
This change in particular has performance advantages on the dual core Nvidia Denver due to its architecture being nonstandard. If both CPU cores are
maxed out, the CPU can't effectively take any idle time to recompile host code blocks to its native VLIW architecture.
It can still do so, but it does less frequently which results in performance issues in Dolphin due to most code just running through the in-order
instruction decoder instead of the native VLIW architecture.
In one particular example, yielding moves the performance from 35-40FPS to 50-55FPS. So it is far more noticeable on Denver than any other system.
Of course once a triple or quad core Denver system comes out this will no longer be an issue on this architecture since it'll have a free core to do
all of this work.
Instead of abusing whatever VAO is previously bound, which might have
enabled arrays.
Only used in one instance currently, which fixes a crash with older
NVIDIA drivers.
The reason this didn't break is that bitwise instructions like VPAND,
VANDPS, and VANDPD do the exact same thing. The only difference is the
data type they are intended for.
Updated PTE.R bit on Write and Instruction fetch.
Added code to read the PTE from MEM2 if the PTE is stored there.
Refactored the two hash functions to reduce code duplication.
Updated save state version.
Seems to be pretty high in the profile in some geometry-heavy games like The
Last Story, and the compiler-generated assembly is terrifyingly bad, so
SSE-ize it.
Just use regular boolean negation in our pixel shader's depth test everywhere except on Qualcomm.
This works around a bug in the Intel Windows driver where comparing a boolean value against true or false fails but boolean negation works fine.
Quite silly.
Should fix issues #7830 and #7899.
We try to keep as many registers as possible in callee saved registers, so if we have guest registers in the correct registers and the interpreter
call we are falling back to doesn't need the registers then we can dump just those ones. Which means we don't have to dump 100% of our register state
when falling to the interpreter.
ComputeRC was a bit unclear by using 64bit registers for setting the immediate and then calling SXTW on a 6b4it register which is just a bit obscure.
When the source register is an immediate in cntlzwx, just use the built in GCC function instead of our own implementing for counting leading zeros.
Before block linking was enabled but it wasn't ever implemented.
Implements link blocks and destroy block functions and moves the downcount check in the WriteExit function so it doesn't get overwritten when linking.
Changes the dispatcher to make sure to we are saving the LR(X30) to the stack. Also makes sure to keep the stack aligned.
AArch64's AAPCS64 mandates the stack to be quad-word aligned.
Fixes the dispatcher from infinite looping due to a downcount check jumping to the dispatcher. This was because checking exceptions and the state
pointer wouldn't reset the global conditional flags. So it would leave the timing/exception, jump to the start of the dispatcher and then jump back
again due to the conditional branch.
Removes the REG_AWAY nonsense I was doing. I've got to get the JIT more up to speed before thinking of insane register cache things.
Also fixes a bug in immediate setting where if the register being set to an immediate already had a host register tied to it then it wouldn't free the
register it had. Resulting in register exhaustion.
lmw/stmw weren't properly setting input and output registers since they use multiple registers.
dcbz was just missing a flag in the instruction tables.
My recent update to that check broke compilation on 10.9:
https://code.google.com/p/dolphin-emu/issues/detail?id=7900
However, on further review, the check isn't actually necessary. If the
OS X version is older than LSMinimumSystemVersion in Info.plist, the
system will generally refuse to run the binary in the first place. You
can try to launch it via a terminal, but at that point it's the user's
problem if it crashes.
On AArch64 asimd is the new name for NEON.
This fixes a message on application start in Android about the device not supporting NEON.
If it's AArch64 then it supports NEON!
The builtin byteswap routines cause critical failure on AArch64 when built with the Android toolchain.
I didn't experience this issue when building for Linux using a local qemu chroot.
Seems to be only an issue with the Android toolchain when building AArch64.
Use our generic version instead.
This was '7' on all ARMv7 devices but was 'AArch64' on the Nexus 9.
Trying to cast to integer was causing a crash. We don't even use this so may as well as wipe it.
Also adds Nvidia to the CPU implementers list.
This code was obviously wrong, we were sign extending 8 bit unsigned values and loading from the wrong offset as well.
This fixes a bug in Muramasa where some colours were going insane.
If the inputs are both float singles, and the top half is known to be identical
to the bottom half, we can use packed arithmetic instead of scalar to skip
the movddup.
This is slower on a few rather old CPUs, plus the Atom+Silvermont, so detect
Atom and disable it in that case.
Also avoid PPC_FP on stores if we know that the output came from a float op.
This shader constant was previously used for depth remapping in D3D and for pixel center correction. Now it only serves one purpose and the new name makes it clear.