This is required to make sure two code spaces are relatively close to one another.
In this case I need the AArch64 JIT codespace and its farcode space to be within 128MB of one another for branches.
As it is currently written, CallLambdaTrampoline does not return a
value. However, some of the functions that are being wrapped may
return a value that the JIT is expected to understand. A compiler
*cough cough clang* may opt to alter %rax after the wrapped lambda
returns, e.g. popping a previous value, which can clobber the
return value. If we actually have a return value, then the compiler
must not clobber it.
In a particular hashing heavy scene in Crazy Taxi the Murmur3 hash used 3.11% CPU time.
The new CRC32 hash in the same scene used 1.86%
This was tested on a Nvidia SHIELD Android TV with Cortex-A57s.
This will be a bit slower on the Nexus 9, the Denver CPU core is a bit slower with CRC32 texture hashing than Murmur3 texture hashing.
The new implementation has 3 options:
SyncGpuMaxDistance
SyncGpuMinDistance
SyncGpuOverclock
The MaxDistance controlls how many CPU cycles the CPU is allowed to be in front
of the GPU. Too low values will slow down extremly, too high values are as
unsynchronized and half of the games will crash.
The -MinDistance (negative) set how many cycles the GPU is allowed to be in
front of the CPU. As we are used to emulate an infinitiv fast GPU, this may be
set to any high (negative) number.
The last parameter is to hack a faster (>1.0) or slower(<1.0) GPU. As we don't
emulate GPU timing very well (eg skip the timings of the pixel stage completely),
an overclock factor of ~0.5 is often much more accurate than 1.0
Using SDL_INIT_JOYSTICK implies SDL_INIT_EVENTS which installs a signal
handler for SIGINT and SIGTERM. There will be a way to prevent this in
2.0.4 but for now we'll need to handle SDL_QUIT.
Actually caused by IniFiles::GetLines leaving the output vector in its
old state if the section wasn't found, and Gecko::LoadCodes not checking
the return value. Fix by moving lines->clear() up.
- Change the Wiimote emulation SYSCONF R/W to use the temporary NAND if in use.
- Fix up SysConf API so this actually works.
Kind of a hack. Like I said, this can be cleaned up when configuration
is synced...
Eventually, netplay will be able to use the host's NAND, but this could
still be useful in some cases; for TAS it definitely makes sense to have
a way to avoid using any preexisting NAND.
In terms of implementation: remove D_WIIUSER_IDX, which was just WIIROOT
+ "/", as well as some other indices which are pointless to have as
separate variables rather than just using the actual path (fixed, since
they're actual Wii NAND paths) at the call site. Then split off
D_SESSION_WIIROOT_IDX, which can point to the dummy NAND directory, from
D_WIIROOT_IDX, which always points to the "real" one the user
configured.
- FileSearch is now just one function, and it converts the original glob
into a regex on all platforms rather than relying on native Windows
pattern matching on there and a complete hack elsewhere. It now
supports recursion out of the box rather than manually expanding
into a full list of directories in multiple call sites.
- This adds a GCC >= 4.9 dependency due to older versions having
outright broken <regex>. MSVC is fine with it.
- ScanDirectoryTree returns the parent entry rather than filling parts
of it in via reference. The count is now stored in the entry like it
was for subdirectories.
- .glsl file search is now done with DoFileSearch.
- IOCTLV_READ_DIR now uses ScanDirectoryTree directly and sorts the
results after replacements for better determinism.
There is no nice way to correctly "detect" the "used" memory, so we just say
we're fine to use 50% of the physical memory for custom textures.
This will fix out-of-memory crashes, but we still might run into swapping issues.
Replaces them with forward declarations of used types, or removes them entirely if they aren't used at all. This also replaces certain Common headers with less inclusive ones (in terms of definitions they pull in).
Change TMemCheck::Action to return whether to break rather than calling
PPCDebugInterface::BreakNow, as this simplified the implementation; then
remove said method, as that was its only caller. One "interface" method
down, many to go...
- Move JitState::memcheck to JitOptions because it's an option.
- Add JitOptions::fastmem; switch JIT code to checking that rather than
bFastmem directly.
- Add JitBase::UpdateMemoryOptions(), which sets both two JIT options
(replacing the duplicate lines in Jit64 and JitIL that set memcheck
from bMMU).
- (!) The ARM JITs both had some lines that checked js.memcheck
despite it being uninitialized in their cases. I've added
UpdateMemoryOptions to both. There is a chance this could make
something slower compared to the old behavior if the uninitialized
value happened to be nonzero... hdkr should check this.
- UpdateMemoryOptions forces jo.fastmem and jo.memcheck off and on,
respectively, if there are any watchpoints set.
- Also call that function from ClearCache.
- Have MemChecks call ClearCache when the {first,last} watchpoint is
{added,removed}.
Enabling jo.memcheck (bah, confusing names) is currently pointless
because hitting a watchpoint does not interrupt the basic block. That
will change in the next commit.
Otherwise, it would work but any async sending would be delayed by 4ms or
wait until the next packet was received.
Also increase the client timeout to 250ms, since enet_host_service is now
really interrupted.
With my previous changes Dolphin would fail to create the user directory if it didn't exist, and would dump all the configuration options in to the cwdir.
This was a bit more complicated to fix in a clean fashion, so I took to moving around code concerning user directories.
Instead of having GetUserPath serve a dual purpose of both getting and setting our user directories, break out to a new SetUserPath function.
GetUserPath will know only get the configured user path.
SetUserPath will set our user paths and setup the internal user path state.
This ending up being a lot cleaner overall, which is nice. Also less mind bending when attempting to read the code.
So now we won't dump all of our configuration in to the cwdir if ~/.dolphin-emu isn't found.
Fixes issue 8371.
Clamping a rectangle correctly requires fully clamping all four
coordinates in the general case.
This should fix issue 6923, sort of; at least, it fixes the part where a
rectangle ends up with a nonsensical height after being clamped.
A bit more efficient if we are only pushing two VFP registers.
We can probably be a bit more efficient in the future by mixing paired loadstores in to the other paths as well.
Previously on FPR pushing and popping we would do a single STR/LDR per quad FPR we wanted to push/pop.
In most of our cases when we are pushing and popping VFP registers they will be consecutive registers that will save more efficiently using the NEON
loadstores that can do up to four quad registers.
So this can potentially cutting instructions down to ~1/4th the amount of instructions if the registers are all consecutive.
On the Cortex-A57 this is basically just an icache improvement, but on the Nvidia Denver this may be optimized to be more efficient. Either way it's a
win.
The Load directory wasn't being properly reassigned when the user path changed, which causes a bunch of issues with things loading from the wrong
place when using the -U option in Dolphin.
The UI should decide on where it wants the user directory, not our core system.
This is in anticipation of some upcoming work on Android which will need proper user directory setting.
Since libcommon.a is also the last library to be linked, this has the
totally hacky but useful side-effect that it doesn't require people to
modify CMake files for temporarily adding VTune code to other Dolphin
libraries.
The PowerPC CPU has bits in MSR (DR and IR) which control whether
addresses are translated. We should respect these instead of mixing
physical addresses and translated addresses into the same address space.
This is mostly mass-renaming calls to memory accesses APIs from places
which expect address translation to use a different version from those
which do not expect address translation.
This does very little on its own, but it's the first step to a correct BAT
implementation.
The Windows implementations of CharArrayFromFormatV() and
StringFromFormat() use the "C"/".1252" locale instead of the user
locale (using _vsnprintf_l). On non-Windows, the user locale was used.
This leads to bugs on non-Windows: the Overclock parameter was
serialised with the user locale ("0,279322" in some locale) and was
interpreted back as "0" (because the C locale is used for parsing the
string).
Make non-Windows CharArrayFromFormatV() and StringFromFormat()
consistent with their Windows counterpart.
The locale code is not enables for Android:: uselocale is only
available since API 21 and API 21 only supports C and C.UTF-8.
If we are compiling in the CRC32 hash, clang has an issue with casting a s32 to a u64.
Change our lens argument to a unsigned integer to fix the issue.
Intellisense doesn't like defines in PCH files, and it doesn't like the deleted
constructor for BitField. (I think it's being overly strict about the
"must have no non-default constructors" rule for classes in unions.)
Someone thought it would be a good idea to have the location as the first argument on the instruction.
Changed it to how it is supposed to be disassembled.
Optimistically assume used GQRs are 0 in blocks that only use one GQR, and
bail at the start of the block and recompile if that assumption fails.
Many games use almost entirely unquantized stores (e.g. Rebel Strike, Sonic
Colors), so this will likely be a big performance improvement across the board
for games with heavy use of paired singles.
Won't work with all games, but provides a nice way to spend extra CPU to make
a variable framerate game faster (e.g. Spyro or The Last Story), or to make
a game use less CPU at the cost of a lower framerate (e.g. Rogue Leader).
If we have a shift amount that is the full length of the source register then we have an invalid instruction.
This can happen when dealing with a couple of PowerPC instructions.
This same adjustment is already in the ARMv7 emitter.
Fixes issues with negative offsets in loadstore instructions.
Adds ADRP/ADR instructions.
Optimizes MOVI2R function to take advantage of ADRP on pointers, can change a 3 instruction operation down to one.
Adds GPR push/pop operations for ABI related things.
The reason this didn't break is that bitwise instructions like VPAND,
VANDPS, and VANDPD do the exact same thing. The only difference is the
data type they are intended for.
The builtin byteswap routines cause critical failure on AArch64 when built with the Android toolchain.
I didn't experience this issue when building for Linux using a local qemu chroot.
Seems to be only an issue with the Android toolchain when building AArch64.
Use our generic version instead.
If the inputs are both float singles, and the top half is known to be identical
to the bottom half, we can use packed arithmetic instead of scalar to skip
the movddup.
This is slower on a few rather old CPUs, plus the Atom+Silvermont, so detect
Atom and disable it in that case.
Also avoid PPC_FP on stores if we know that the output came from a float op.
Move the JITed function/basic-block registration logic out of the CPU
subsystem in order to add JIT registration to JITed DSP and
Video/VertexLoader code.
This necessary in order to add /tmp/perf-$pid.map support to other
JITed code as they need to write to the same file.
When we cleaned up the code to calculate the shm_position and total_mem
in one step, we sometimes skipped over certain views because they were
Wii-only. When looking at the total memory, we'd look at the last field,
whether or not it was skipped. Since Wii-only fields are the last view,
this meant that the shm_position was 0, since it was skipped, causing us
to map a 0-sized field. Fix this by explicitly returning the total size
from MemoryMap_InitializeViews.
Additionally, the shm_position was being calculated incorrectly because
it was adding up the shm_position *before* the mirror, rather than after
it. Fix this by adopting a scheme similar to what we had before.
The code to calculate the offsets into the SHM file wasn't properly
respecting the skip flags, causing it to calculate offsets beyond
the end of the SHM file.
This code was ported from out_ptr, which was a double-pointer, and
wanted to double-check that the proper arena was actually allocated.
When I ported it to store the pointer directly in the view regardless
of whether out_ptr was non-NULL, I got confused here and instead
caused the code to only free the arena if the first byte was non-zero.
This code originally tried to map the "low space" for the Gamecube's
memory layout, but since has expanded to mapping all of the easily
mappable memory on the system. Change the name to "GrabSHMSegment" to
indicate that we're looking for a shared memory segment we can map into
our process space.
These are effectively unused, since the memmap already maps them in one
place. For 32-bit, they might have some slight advantage, but we already
special-case the regular "high-mem" pointer for 32-bit, so just use the
one we already have...