This instruction is fairly heavily used by Ikaruga to load a bunch of registers from the stack.
In particular, at the start of the second stage there is a block that takes up ~20% of CPU time and uses lmw to load half of the guest
registers.
The basic thing optimized here is changing from a single 32bit LDR per register to, potentially, a single 128bit load.
A single 32bit LDR is fairly slow, so we can optimize a few ways (sketched below):
- If we have four or more registers to load, do a 64bit LDP into two host registers, byteswap, and then move the high 32bits of the host registers
into the correct mapped guest register locations.
- If we have two registers to load, do a 32bit LDP, which loads two guest registers in a single instruction.
- If we have only one register left to load, load it as before.
This saves quite a few cycles, since the Cortex-A57 and A72's LDR instruction takes a few cycles.
Each 32bit LDR has 4 cycles of latency, plus 1 cycle for the post-index update (which typically happens in parallel).
The 32bit and 64bit LDP both take the same amount of latency.
So we are improving latencies and reducing code bloat here.
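A sketch of the grouping strategy, with hypothetical emitter helpers standing in for the real JitArm64 code (the real code also handles
the register mapping and the byteswaps):

    #include <cstdint>

    void EmitLDP64(uint32_t addr);  // hypothetical: 64bit LDP + byteswap + extract
    void EmitLDP32(uint32_t addr);  // hypothetical: 32bit LDP into two guest regs
    void EmitLDR32(uint32_t addr);  // hypothetical: plain 32bit LDR, as before

    // lmw rD loads guest registers rD..r31, 4 bytes each, from consecutive
    // addresses starting at addr.
    void EmitLoadMultiple(int first_reg, uint32_t addr)
    {
      int remaining = 32 - first_reg;
      while (remaining > 0)
      {
        if (remaining >= 4)
        {
          EmitLDP64(addr);  // four guest regs at once via two host regs
          remaining -= 4;
          addr += 16;
        }
        else if (remaining >= 2)
        {
          EmitLDP32(addr);  // two guest regs in a single instruction
          remaining -= 2;
          addr += 8;
        }
        else
        {
          EmitLDR32(addr);  // one leftover register, loaded as before
          remaining -= 1;
          addr += 4;
        }
      }
    }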
This is a bug that crops up if BindToRegister() is called multiple times in a row without an R() function call between them.
How to reproduce the bug:
1) Have a completely filled cache with no host register remaining
2) Call BindToRegister() with different guest registers
3) Don't call R() between the BindToRegister() calls.
This issue typically wouldn't be seen, for a couple of reasons. We usually have /plenty/ of registers in the cache, and in most cases we only call
BindToRegister() once per instruction. On the off chance that it is called multiple times, it wouldn't update the last-used counts and would flush the
same register as the previous call did.
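A minimal sketch of the failing pattern (RegCache and its methods are stand-ins for the JIT's register cache interface, not the exact
signatures):

    struct RegCache
    {
      void BindToRegister(int guest_reg);  // maps a guest reg into a host reg
      void* R(int guest_reg);              // access that also bumps last-used counts
    };

    void ReproduceBug(RegCache& gpr)
    {
      // Precondition: the cache is completely full; no host register is free.
      gpr.BindToRegister(5);  // evicts the least recently used host register
      gpr.BindToRegister(6);  // without an intervening R(), the last-used counts
                              // were never updated, so the same host register is
                              // evicted again, clobbering the binding for r5
    }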
Samsung updated the video drivers on the SGS6 which introduced a bug when disabling vsync.
Both the driver versions are r5p0, but the md5sums of the blob differ.
To work around the issue, make sure to never disable vsync by calling eglSwapInterval.
We can't actually determine the driver version on Android yet.
So until a driver lands that exposes the driver version in the GL_VERSION string,
we will need to keep this workaround enabled at all times, which is a bit annoying.
Current Mali drivers return the video driver version in one of the EGL strings you can query;
the issue is that Android eats all of those strings, so we can't query it.
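A minimal sketch of the workaround; SetVSync() and s_display are illustrative names, not Dolphin's actual interface:

    #include <EGL/egl.h>

    static EGLDisplay s_display;  // initialized elsewhere via eglGetDisplay()

    void SetVSync(bool enabled)
    {
      // The broken r5p0 blob misbehaves when vsync is disabled, so simply
      // skip the eglSwapInterval(s_display, 0) call and leave the interval at 1.
      if (enabled)
        eglSwapInterval(s_display, 1);
    }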
OpenGL ES 3.2 adds a few things we care about supporting in core. In particular:
- GL_{ARB,EXT,OES}_draw_elements_base_vertex
- KHR_Debug
- Sample Shading
- GL_{ARB,EXT,OES,NV}_copy_image
- Geometry shaders
- Geometry shader instancing (If they support GL_{EXT,OES}_geometry_point_size)
Nvidia was the first to release an OpenGL ES 3.2 driver, which I used to test this on.
This also enables GS Instancing on GLES 3.1 hardware if it supports all of the required extensions.
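A rough sketch of the version check this implies, parsing GL_VERSION by hand; SupportsGLES32() is an illustrative name, not Dolphin's
real feature detection:

    #include <GLES3/gl3.h>
    #include <cstdio>

    bool SupportsGLES32()
    {
      // GLES version strings start with "OpenGL ES <major>.<minor>".
      const char* version = reinterpret_cast<const char*>(glGetString(GL_VERSION));
      int major = 0, minor = 0;
      if (!version || std::sscanf(version, "OpenGL ES %d.%d", &major, &minor) != 2)
        return false;
      return major > 3 || (major == 3 && minor >= 2);
    }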
In particular this optimizes the case where a 32bit float is loaded via lfs, and then used in double operations.
This happens very often in Gekko-based code, because the best way to load a 32bit value as a double is lfs, since it automatically turns into a double value.
There are a few other implications of this in practice as well. For example, if both of the paired registers are loaded via psq_l and then used in double
operations, that case is improved too.
Also, if we implement a double register, we've got to be careful to track whether the value is in the "lower" register or the full 128bit register.
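A minimal sketch, with hypothetical names, of the bookkeeping that last point implies (the real register cache is more involved):

    enum class GuestFprState
    {
      LowerOnly,  // only the low 64bits (one double) are valid, e.g. after lfd
      Full128,    // both paired values are valid in the full 128bit host register
    };

    struct GuestFpr
    {
      int host_reg;
      GuestFprState state;
    };

    // Before a paired-single op touches a LowerOnly register, the JIT must
    // first widen/duplicate it so the upper lane holds a valid value.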
- dynamically allocate the third scratch register instead of forcing ECX
- use LEA as a 3-operand add where possible
- use BT,JC instead of SHR,TEST,JNZ
- merge MOV,TEST
- use the appropriate ABI function (no asm change)
The class itself already acts as a namespace, so the '_interpreter'
trailer isn't necessary. This also gets rid of a duplicate typedef in the
Interpreter_Tables.
Their new driver, which adds GLES3.1 + AEP support, has issues with it.
At the very least, they don't fully implement all of the geometry shader features, which causes shader linker issues when we attempt to use them.
I don't have a device, so I can't fully test; until I do, I'm going to blanket disable the whole thing.
The Star Wars games really push the hardware to its limits, which can cause the generated shaders to be 18KB or more.
Double our maximum shader size to compensate.
Fixes issue #8860
We compile blocks as they are executed, so it's common
to link them contiguously. We end every block by emitting a JMP,
but often one with a distance of 0.
Emitting NOPs instead still "calls" the next block by falling
through, but is easier on the CPU.
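A simplified sketch with hypothetical names (Emitter, GetCodePtr, NOP and JMP stand in for the JIT emitter interface):

    #include <cstdint>

    struct Emitter
    {
      uint8_t* code_ptr;
      uint8_t* GetCodePtr() const { return code_ptr; }
      void NOP(int bytes);              // emit 'bytes' bytes of NOP padding
      void JMP(const uint8_t* target);  // emit a rel32 JMP (5 bytes)
    };

    void EmitBlockExit(Emitter& emit, const uint8_t* target)
    {
      // A rel32 JMP is 5 bytes, so a "distance 0" jump targets GetCodePtr() + 5.
      if (target == emit.GetCodePtr() + 5)
        emit.NOP(5);  // fall through into the next block; easier for the CPU
      else
        emit.JMP(target);
    }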
If we are going to be using lfd, then chances are it is going to be used in double-heavy areas of code.
If we only need to load the lower register, then we also don't need to worry about inserting into the low 64bits of the guest register.
So add a new flag to the backpatching so that lfd loads go directly to the destination register.
This gives a ~3% performance improvement in Povray.
The assert(0) that was introduced in PR #2811 is not user friendly
since it has no explanation at all about what happened. Regular users
probably won't think of looking at the log to get more information.
It kind of sucks that we don't emulate some behaviors properly, but there is
very little ROI for some of these features and I'm not going to spend time
implementing them any time soon. Making the PanicAlerts optional allows for
more testing of the core features on more games while "just" breaking less
important features like reverb.
On older versions there is no such concept of a "back buffer" since positional
audio is not supported. These buffers are just used for temp mixing, and while
it would be really nice if we could support more of that inter-buffer mixing
(TODO :) ) it does not warrant a PanicAlert at this time.
Fails ingame because it mixes to some buffers that are considered the back
buffers for Dolby games. That check might need to be less restrictive for games
that don't use Dolby mixing.
Add a flag for UCodes that only have four non-Dolby mixing destinations
(instead of the standard six destinations).
NTSC IPL is still hopelessly broken.
Used by a few titles (Luigi's Mansion, Animal Crossing) as well as the GameCube
IPL/BIOS.
Note that the IPL does not work yet because it mixes to unknown buffers.
Very close to the TWW version of the UCode; I haven't determined any differences
yet (but I'm sure that will come soon). Works well enough to reach ingame
without any errors other than a few volume issues.
Zelda Twilight Princess (GC) seems to push null words into the command buffer.
The command handler in the UCode ignores initial command mails that do not have
the MSB set.
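A plausible sketch of the guard, assuming a 32bit mail word (names are illustrative, not the actual UCode handler):

    #include <cstdint>

    void HandleCommandMail(uint32_t mail)
    {
      // Zelda TP (GC) pushes null words into the command buffer; real
      // commands have the MSB set, so ignore initial mails without it.
      if ((mail & 0x80000000u) == 0)
        return;
      // ... dispatch the command ...
    }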
I was under the wrong impression that std::array's default constructor
performed value initialization. Turns out it does not, so an array of POD will
not be initialized.
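The pitfall in two lines; this is standard C++ semantics:

    #include <array>

    void Example()
    {
      std::array<int, 4> uninitialized;  // POD elements hold indeterminate values
      std::array<int, 4> zeroed{};       // value-initialized: all elements are 0
    }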
While these are not really unrecoverable errors, it is useful, while Zelda HLE
is in a testing / development phase, to notice these "unexpected" cases (or
cases expected without a known way to reproduce them) by making them as hard as
possible to ignore.
Now renders some of the cutscene audio in Zelda TWW. Most of the work around
sample sources is in a good enough state, even though it is still missing
features like Dolby mixing, IIR, etc.
Control flow for DAC is now accurate. Voice rendering is not implemented yet, but
framing seems correct. Not expected to work for anything other than DAC UCode games.
Push the mail at UCode boot time, not after the first update. This avoids a lot
of crazy busy-looping on the CPU side while waiting for the DSP to be
initialized.
This change is meant to solve the following problem: how to translate the
following snippet to DSPHLE:
SendInterruptAndWaitRead(MAIL_A);
SendAndWaitRead(MAIL_B);
SendInterruptAndWaitRead(MAIL_C);
This should cause the following actions on the CPU side:
---> Woken up by interrupt
Reads MAIL_A
Reads MAIL_B
<--- Exits interrupt handler
---> Woken up by interrupt
Reads MAIL_C
<---
But with the current DSPHLE mail support, the following would happen because
the "AndWaitRead" part is not supported:
---> Woken up by interrupt
Reads MAIL_A
Reads MAIL_B
<--- Exits interrupt handler
[Never gets the second interrupt since it was triggered at the same time as
the first one! Misses MAIL_C.]
This change fixes the issue by storing two values in the mail queue on the DSP
side: the value of the mail itself, and whether a read of that mail should
trigger a DSP interrupt. If nothing is in the queue yet and an interrupt is
requested, just trigger the interrupt. In the present example, the queue will
look like this:
Mail value    Interrupt requested
MAIL_A        No   <-- Interrupt was triggered when pushing
                       the mail to the queue.
MAIL_B        Yes
MAIL_C        No
When the CPU reads MAIL_B, MailHandler will trigger the interrupt, which the
CPU will handle when coming back from the exception handler. MAIL_C is then
successfully read.
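A condensed sketch of the scheme described above (illustrative names, not the exact MailHandler code): each queued mail carries a flag
saying that reading it should raise the interrupt announcing the following mail.

    #include <cstdint>
    #include <deque>

    void TriggerDSPInterrupt();  // assumed host-side interrupt helper

    struct PendingMail
    {
      uint32_t value;
      bool interrupt_on_read;  // raise the interrupt once the CPU reads this mail
    };

    class MailQueue
    {
    public:
      void Push(uint32_t mail, bool interrupt)
      {
        if (interrupt)
        {
          if (m_mails.empty())
            TriggerDSPInterrupt();  // nothing pending: interrupt immediately
          else
            m_mails.back().interrupt_on_read = true;  // defer until the
                                                      // previous mail is read
        }
        m_mails.push_back({mail, false});
      }

      uint32_t Read()
      {
        PendingMail mail = m_mails.front();
        m_mails.pop_front();
        if (mail.interrupt_on_read)
          TriggerDSPInterrupt();  // announce the next queued mail
        return mail.value;
      }

    private:
      std::deque<PendingMail> m_mails;
    };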
With auto-updating lists, searching for the previous value isn't
necessary. Also, this breaks specific functionality out into its own
functions, which helps separate UI code from the data processing code.
This improves performance pretty much across the board for games.
The increase in performance mainly comes from removing some code from the main JIT blocks (pushing and popping millions of registers) and
throwing it in farcode, where it doesn't pollute the icache.
This is required to make sure two code spaces are relatively close to one another.
In this case I need the AArch64 JIT codespace and its farcode space to be within 128MB of one another for branches.
When the emulation is paused and the ALSA backend is used, make the audio
thread wait on a condition variable instead of busy-waiting. This commit
fixes bug #7729
Since the ALSA API is not thread-safe, calls to snd_pcm_drop() and snd_pcm_prepare()
in AlsaSound::Clear() are protected by the same mutex as the condition variable in AlsaSound::SoundLoop()
to make sure that we do not call these functions while a call to
snd_pcm_writei() is ongoing.
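A trimmed sketch of the synchronization (member names illustrative): the audio thread sleeps on a condition variable while paused, and
Clear() takes the same mutex, so snd_pcm_drop()/snd_pcm_prepare() can never overlap a write.

    #include <alsa/asoundlib.h>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    static std::mutex s_mutex;
    static std::condition_variable s_cv;
    static bool s_muted = false;
    static snd_pcm_t* s_handle;  // opened elsewhere with snd_pcm_open()

    void SoundLoopIteration(const int16_t* buffer, snd_pcm_uframes_t frames)
    {
      std::unique_lock<std::mutex> lock(s_mutex);
      s_cv.wait(lock, [] { return !s_muted; });  // block instead of busy-waiting
      snd_pcm_writei(s_handle, buffer, frames);  // error handling elided
    }

    void Clear(bool muted)
    {
      std::lock_guard<std::mutex> lock(s_mutex);
      s_muted = muted;
      if (muted)
        snd_pcm_drop(s_handle);     // safe: snd_pcm_writei() cannot be running
      else
        snd_pcm_prepare(s_handle);
      s_cv.notify_one();
    }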
This fixes a race condition that occurred when starting a game:
Core::EmuThread(), after having started (but not necessarily completed)
the initialization of the audio thread, calls Core::SetState() which calls
CCPU::EnableStepping(), which in turn calls AudioCommon::ClearAudioBuffer().
This means that SoundStream::Clear() can be called before
AlsaSound::AlsaInit() has completed.
04fcb72 fixed an issue with reading the Wii FST size, but I found a second
issue when working on PR #2820 - the size must be shifted left by 2.
DiscScrubber and Boot already do this correctly using separate code.
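The essence of the fix as a sketch (helper name assumed): on Wii discs the FST size field is stored right-shifted by 2, so the raw 32bit
value must be shifted left by 2 to get a byte count.

    #include <cstdint>

    uint64_t GetWiiFstSizeInBytes(uint32_t raw_field)
    {
      // The raw header field is in 4-byte units on Wii.
      return static_cast<uint64_t>(raw_field) << 2;
    }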
If the selected audio backend fails to Start() (which could happen, for
example, if there is no audio device), we currently still use the backend
anyway. This can lead to crashes on some platforms (such as Windows) and
is outright wrong in any case.
This commit falls back to the Null audio backend if the selected backend
couldn't be started.
This fixes bug #6001
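A rough sketch of the fallback (assumed names and interfaces, not the exact AudioCommon code):

    #include <memory>

    class SoundStream
    {
    public:
      virtual ~SoundStream() = default;
      virtual bool Start() = 0;
    };

    class NullSound final : public SoundStream
    {
    public:
      bool Start() override { return true; }  // the null backend always starts
    };

    std::unique_ptr<SoundStream> CreateConfiguredBackend();  // user's selection

    std::unique_ptr<SoundStream> CreateSoundStream()
    {
      std::unique_ptr<SoundStream> backend = CreateConfiguredBackend();
      if (!backend || !backend->Start())
        backend = std::make_unique<NullSound>();  // fall back to Null audio
      return backend;
    }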