Optimistically assume used GQRs are 0 in blocks that only use one GQR, and
bail at the start of the block and recompile if that assumption fails.
Many games use almost entirely unquantized stores (e.g. Rebel Strike, Sonic
Colors), so this will likely be a big performance improvement across the board
for games with heavy use of paired singles.
Won't work with all games, but provides a nice way to spend extra CPU to make
a variable framerate game faster (e.g. Spyro or The Last Story), or to make
a game use less CPU at the cost of a lower framerate (e.g. Rogue Leader).
If we have a shift amount that is the full length of the source register then we have an invalid instruction.
This can happen when dealing with a couple of PowerPC instructions.
This same adjustment is already in the ARMv7 emitter.
Fixes issues with negative offsets in loadstore instructions.
Adds ADRP/ADR instructions.
Optimizes MOVI2R function to take advantage of ADRP on pointers, can change a 3 instruction operation down to one.
Adds GPR push/pop operations for ABI related things.
The reason this didn't break is that bitwise instructions like VPAND,
VANDPS, and VANDPD do the exact same thing. The only difference is the
data type they are intended for.
The builtin byteswap routines cause critical failure on AArch64 when built with the Android toolchain.
I didn't experience this issue when building for Linux using a local qemu chroot.
Seems to be only an issue with the Android toolchain when building AArch64.
Use our generic version instead.
If the inputs are both float singles, and the top half is known to be identical
to the bottom half, we can use packed arithmetic instead of scalar to skip
the movddup.
This is slower on a few rather old CPUs, plus the Atom+Silvermont, so detect
Atom and disable it in that case.
Also avoid PPC_FP on stores if we know that the output came from a float op.
Move the JITed function/basic-block registration logic out of the CPU
subsystem in order to add JIT registration to JITed DSP and
Video/VertexLoader code.
This necessary in order to add /tmp/perf-$pid.map support to other
JITed code as they need to write to the same file.
When we cleaned up the code to calculate the shm_position and total_mem
in one step, we sometimes skipped over certain views because they were
Wii-only. When looking at the total memory, we'd look at the last field,
whether or not it was skipped. Since Wii-only fields are the last view,
this meant that the shm_position was 0, since it was skipped, causing us
to map a 0-sized field. Fix this by explicitly returning the total size
from MemoryMap_InitializeViews.
Additionally, the shm_position was being calculated incorrectly because
it was adding up the shm_position *before* the mirror, rather than after
it. Fix this by adopting a scheme similar to what we had before.
The code to calculate the offsets into the SHM file wasn't properly
respecting the skip flags, causing it to calculate offsets beyond
the end of the SHM file.
This code was ported from out_ptr, which was a double-pointer, and
wanted to double-check that the proper arena was actually allocated.
When I ported it to store the pointer directly in the view regardless
of whether out_ptr was non-NULL, I got confused here and instead
caused the code to only free the arena if the first byte was non-zero.
This code originally tried to map the "low space" for the Gamecube's
memory layout, but since has expanded to mapping all of the easily
mappable memory on the system. Change the name to "GrabSHMSegment" to
indicate that we're looking for a shared memory segment we can map into
our process space.
These are effectively unused, since the memmap already maps them in one
place. For 32-bit, they might have some slight advantage, but we already
special-case the regular "high-mem" pointer for 32-bit, so just use the
one we already have...
This is a higher level, more concise wrapper for bitsets which supports
efficiently counting and iterating over set bits. It's similar to
std::bitset, but the latter does not support efficient iteration (and at
least in libc++, the count algorithm is subpar, not that it really
matters). The converted uses include both bitsets and, notably,
considerably less efficient regular arrays (for in/out registers in
PPCAnalyst).
Unfortunately, this may slightly pessimize unoptimized builds.
We weren't dropping a newline character from the string, we were cutting off the last character of the hardware name.
This fixes my TK1 being called 'lagun' when it's name is 'laguna'
GCC has optimized this using the exact same code since 4.7 or 4.8.
Android building falls back to the __linux__ route.
No need to keep these around anymore since we aren't building on an old GCC version.
Before this commit, the two were reversed ("cpu_string" had the brand, e.g. "AuthenticAMD"; and "brand_string" had the CPU type, e.g. "AMD Phenom II X4 925").
This is good hygiene, and also happens to be required to build Dolphin
using Clang modules.
(Under this setup, each header file becomes a module, and each #include
is automatically translated to a module import. Recursive includes
still leak through (by default), but modules are compiled independently,
and can't depend on defines or types having previously been set up. The
main reason to retrofit it onto Dolphin is compilation performance - no
more textual includes whatsoever, rather than putting a few blessed
common headers into a PCH. Unfortunately, I found multiple Clang bugs
while trying to build Dolphin this way, so it's not ready yet, but I can
start with this prerequisite.)
I found it via clang complaining about a useless null check on an array,
but I decided to get rid of the array in favor of dynamic allocation, as
there was no reason to assume a maximum length of 0x32 bytes. Plus, add
a CFString type check just in case, and switch to UTF-8 in the
off-chance it matters.
The result has not actually been tested, as I have no CD drive.
It only ever did anything on 32-bit OS X.
Anyway, it wasn't even on the right functions, and these days
ABI_PushRegistersAndAdjustStack should handle maintaining the ABI
correctly.
This helps us avoid accidentally clobbering flags between two instructions
when the flags are expected to be maintained. Dolphin will of course crash
immediately, but at least it will crash loudly and alert us of the mistake,
instead of forcing hours of bisecting to find the subtle way in which the JIT
has managed to sneak a flag-modifying instruction where there shouldn't be one.
This is inconsistent with how other containers are used (i.e. with Do()), but making std::array be used with Do() seems rather confusing when there's also a DoArray available.
To avoid FPRs being pushed unnecessarily, I checked the uses: DSPEmitter
doesn't use FPRs, and VertexLoader doesn't use anything but RAX, so I
specified the register list accordingly. The regular JIT, however, does
use FPRs, and as far as I can tell, it was incorrect not to save them in
the outer routine. Since the dispatcher loop is only exited when
pausing or stopping, this should have no noticeable performance impact.
- Factor common work into a helper function.
- Replace confusingly named "noProlog" with "rsp_alignment". Now that
x86 is not supported, we can just specify it explicitly as 8 for
clarity.
- Add the option to include more frame size, which I'll need later.
- Revert a change by magumagu in March which replaced MOVAPD with MOVUPD
on account of 32-bit Windows, since it's no longer supported. True,
apparently recent processors don't execute the former any faster if the
pointer is, in fact, aligned, but there's no point using MOVUPD for
something that's guaranteed to be aligned...
(I discovered that GenFrsqrte and GenFres were incorrectly passing false
to noProlog - they were, in fact, functions without prologs, the
original meaning of the parameter - which caused the previous change to
break. This is now fixed.)
This is the bare minimum required to run a few games on AArch64.
Was able to run starfield and Animal Crossing to the Nintendo logo.
QEmu emulation is literally the slowest thing in the world, it maxes out at around 12mhz on my Core i7-4930MX.
I've tested a few instruction encodings and am expecting most to work as long as one stays away from VFP/SIMD.
This implements mostly instructions to bring up an initial JIT with integer support.
This can be improved to allow ease of use functions in the future, dealing with the raw imms/immr encodings is probably the worst thing ever.
Uses are split into three categories:
- Arbitrary (except for size savings) - constants like RSCRATCH are
used.
- ABI (i.e. RAX as return value) - ABI_RETURN is used.
- Fixed by architecture (RCX shifts, RDX/RAX for some instructions) -
explicit register is kept.
In theory this allows the assignments to be modified easily. I verified
that I was able to run Melee with all the registers changed, although
there may be issues if RSCRATCH[2] and ABI_PARAM{1,2} conflict.
The special case is where the registers are actually to be swapped (i.e.
func(ABI_PARAM2, ABI_PARAM1); this was previously impossible but would
be ugly not to handle anyway.
Prior to this change, it was possible to cause an infinite loop by making the string to be replaced and the replacing string the same thing.
e.g.
std::string some_str = "test";
ReplaceAll(some_str, "test", "test");
This also changes the replacing in a way that doesn't require starting from the beginning of the string on each replacement iteration.
Decreases total Wii state save time (not counting compression) from
~570ms to ~18ms.
The compiler can't remove this check because of potential aliasing; this
might be fixable (e.g. by making mode const), but there is no reason to
have the code work in such a braindead way in the first place.
- DoVoid now uses memcpy.
- DoArray now uses DoVoid on the whole rather than Doing each element
(would fail for an array of STL structures, but we don't have any of
those).
- Do also now uses DoVoid. (In the previous version, it replicated
DoVoid's code in order to ensure each type gets its own implementation,
which for small types then becomes a simple load/store in any modern
compiler. Now DoVoid is __forceinline, which addresses that issue and
shouldn't make a big difference otherwise - perhaps a few extra copies
of the code inlined into DoArray or whatever.)
(1) Rename ABI_ALL_CALLEE_SAVED to ABI_ALL_CALLER_SAVED, because that's
what it was actually defined as (and used as). Derp.
(2) RegistersInUse is always used for the purpose of saving registers
before calling a C++ function in the middle of a JIT block (without
flushing). There is no need to save callee-saved registers in this
case. Change the name to CallerSavedRegistersInUse and mask with
ABI_ALL_CALLER_SAVED.
Nothing obvious broke when starting up a Melee game. (I added a test
for anything actually being masked out; it happens, but in this
particular case seemed to occur at most a few dozen times per second, so
the actual performance benefit is probably negligible.)
This class loads all the common PP shader configuration options and passes those options through to a inherited class that OpenGL or D3D will have.
Makes it so all the common code for PP shaders is in VideoCommon instead of duplicating the code across each backend.
This was actually never used as far as I can tell. There was no wx event handling done whatsoever for the global ID, So this is basically a dead function.
This moves the Gekko disassembler to Common where it should be. Having it in the Bochs disassembly Externals is incorrect.
Unlike the PowerPC disassembler prior however, this one is updated to have an API that is more fitting for C++. e.g. Not needing to specify a string buffer and size. It does all of this under the hood.
This modifies all the DebuggingInterfaces as necessary to handle this.
On error mmap returns MAP_FAILED(-1) not null.
FreeBSD was checking the return correctly, Linux was not.
This was noticed by triad attempting to run Dolphin under valgrind and not getting a memory space under the 2GB limit(Because -1 wraps around on
unsigned obviously)
Previously using the new "lower 8 bits" registers (SIL, SPL, ...) caused SETcc
to write to other registers (for example, SETcc SIL would generate SETcc DH).
It was only used for Windows XP and lower.
This also bumps the _WIN32_WINNT define in the stdafx precompiled headers to set the minimum version as Windows Vista.
A previous PR changed a whole lot of min/maxes to std::min/std::max
but made a mistake here and used a templated min which cast it's
arguments to unsigned instead of casting return value.
This resulted in glitchy artifacts in bright areas (See issue 7439)
I rewrote the code to use a proper clamping function so it's cleaner
to read.
This shouldn't really be exposed as a public function and should only be called through other Do class functions that take a container type as a parameter.
Sometimes (in particular when using non-typesafe functions) it can be convenient to have a getter method rather than performing a potentially lengthy explicit cast.
This allows for removal of the strcpy calls, also it's technically way more safe, though I doubt we'll ever have a log name larger than 128 characters or a short description larger than 32 characters.
Also moved these assignments into the constructor's initializer list.
If a CPU string was incapable of being found we would return a null pointer, which would crash with strncpy.
Also if we couldn't get a CPU implementer we would call free() to a null pointer.
In addition, detect 64bit ARM running.
MemoryUtil.cpp was incorrectly using the old __x86_64__ define when it should be using _M_X86_64.
It was also using _ARCH_64 when it shouldn't have which was causing an errant PanicAlert to come up in my development.
Some compilers we care about (mostly g++) do not support std::make_unique yet,
but we still want to use it in our codebase to make unique_ptr code more
readable. This commit introduces an implementation derivated from the libc++
code in the Dolphin codebase so we can use it right now everywhere.
Adapted from delroth's pull request.
Set the x87 precision, even on x64. Since we are using x87 instructions
in the JIT now, we can't guarantee that x87 precision will never
influence Dolphin on x64.
The new NOP emitter breaks when called with a negative count. As it
turns out, it did happen when deoptimizing 8 bit MOVs because they are
only 4 bytes long and need no BSWAP.
Fixes issue 6990.
This uses a bit of templating to remove the duplicate code that is the CodeBlocks in each emitter headers.
No actual functionality change in this.
When creating a Fixupbranch we were swapping the BL and B targets.
I think this was found by PPSSPP a while ago, but they never send PRs to merge their changes upstream.
Between C++11 and C++14, volatile types stopped being trivially
copyable. The serializer has no reason to care about this distinction,
so tack on remove_volatile.
The underlying storage type of a bitfield can be any intrinsic integer type,
but also any enumeration.
Custom storage types are supported if the following things are defined on the storage type:
- casting 0 to the storage type
- bit shift operators (in both directions)
- bitwise & operator
- bitwise ~ operator
- std::make_unsigned specialization
Previously he function was misbehaving because of a missing check for
whether an 8-bit operand was a register operand; it would therefore
emit unnecessary REX prefixes, incorrectly assert on 32-bit targets, and
could potentially emit wrong code in rare cases (like a memory to register
operation involving AH.)
Also, some cleanup while I was in the area to make the function easier to
read.
The workaround of using fixed underlying types produces lots of warnings
in GCC because now the bit-fields are too small for the value range used
for conversion semantics.
Breakpoints have one, but memchecks don't, despite being cleared directly in the breakpoint window.
Now DolphinWX should call the interface functions and not the direct functions of the breakpoints or memchecks for clearing.
Our defines were never clear between what meant 64bit or x86_64
This makes a clear cut between bitness and architecture.
This commit also has the side effect of bringing up aarch64 compiling support.
- remove unused variables
- reduce the scope where it makes sense
- correct limits (did you know that strcat()'s last parameter does not
include the \0 that is always added?)
- set some free()'d pointers to NULL
This breaks Linux stdout logging.
This reverts commit 7ac5b1f2f8, reversing
changes made to 9bc14012fc.
Revert "Merge pull request #77 from lioncash/remove-console"
This reverts commit 9bc14012fc, reversing
changes made to b18a33377d.
Conflicts:
Source/Core/Common/LogManager.cpp
Source/Core/DolphinWX/Frame.cpp
Source/Core/DolphinWX/FrameAui.cpp
Source/Core/DolphinWX/LogConfigWindow.cpp
Source/Core/DolphinWX/LogWindow.cpp
Note I do not mean the Logging window, but the console window.
It's literally rarely, if at all used, and offers less advantages over the built-in logging window (ie. it breaks on different locales: http://i.imgur.com/Cs92tQE.png)
This commit should remove all of the console logging.
Floating-point is complicated...
Some background: Denormals are floats that are too close to zero to be
stored in a normalized way (their exponent would need more bits). Since
they are stored unnormalized, they are hard to work with, even in
hardware. That's why both PowerPC and SSE can be configured to operate
in faster but non-standard-conpliant modes in which these numbers are
simply rounded ('flushed') to zero.
Internally, we do the same as the PowerPC CPU and store all floats in
double format. This means that for loading and storing singles we need a
conversion. The PowerPC CPU does this in hardware. We previously did
this using CVTSS2SD/CVTSD2SS. Unfortunately, these instructions are
considered arithmetic and therefore flush denormals to zero if non-IEEE
mode is active. This normally wouldn't be a problem since the next
arithmetic floating-point instruction would do the same anyway but as it
turns out some games actually use floating-point instructions for
copying arbitrary data.
My idea for fixing this problem was to use x87 instructions since the
x87 FPU never supported flush-to-zero and thus doesn't mangle denormals.
However, there is one more problem to deal with: SNaNs are automatically
converted to QNaNs (by setting the most-significant bit of the
fraction). I opted to fix this by manually resetting the QNaN bit of all
values with all-1s exponent.