The calculation of each address in lmw/stmw currently has a dependency
on the calculation of the previous address. By removing this dependency,
the host CPU should be able to pipeline the loads/stores better. The cost
we pay for this is up to one extra register and one extra MOV instruction
per guest instruction, but often nothing.
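
As a rough illustration of the dependency being removed, here is a minimal sketch in plain C++. It is not the JIT emitter code, and both function names are hypothetical. The first version derives each address from the previous one; the second derives every address from the base alone, so no address has to wait for another.

```cpp
#include <cstdint>
#include <vector>

// Chained form: address i is computed from address i - 1.
std::vector<uint32_t> ChainedAddresses(uint32_t base, int count)
{
  std::vector<uint32_t> addrs;
  uint32_t addr = base;
  for (int i = 0; i < count; ++i)
  {
    addrs.push_back(addr);
    addr += 4;  // serial dependency on the previous iteration
  }
  return addrs;
}

// Independent form: each address only depends on the base register.
std::vector<uint32_t> IndependentAddresses(uint32_t base, int count)
{
  std::vector<uint32_t> addrs;
  for (int i = 0; i < count; ++i)
    addrs.push_back(base + 4 * static_cast<uint32_t>(i));
  return addrs;
}
```

Both forms produce the same addresses; only the shape of the dependency chain differs.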
Making EmitBackpatchRoutine support using any register as the address
register would let us get rid of the MOV, but I consider that to be too
big of a task to do in one go at the same time as this.
Now that we've flipped the C++20 switch, let's start making use of
the nice new <bit> header.
I'm planning on handling this move away from BitUtils.h incrementally
in a series of PRs. There may be a few functions remaining in
BitUtils.h by the end that C++20 doesn't have any equivalents for.
This reverts commit 351d095fff.
In hindsight, my attempted optimization messes with the return
predictor, unlike real tail calls. So I think it does more harm than
good.
Use: callstack(0x80000000).
!callstack(value) works as a 'does not contain'.
Add strings to expr.h conditionals.
Use quotations: callstack("anim") to check symbols/name.
For quite some time now, we've had a setting on x86-64 that makes Dolphin
handle NaNs in a more accurate but slower way. There's only one game that
cares about this, Dragon Ball: Revenge of King Piccolo, and what that game
cares about more specifically is that the default NaN (or "generated NaN"
as I believe it's called in PowerPC documentation) is the same as on
PowerPC. On ARM, the default NaN is the same as on PowerPC, so for the
longest time we didn't need to do anything special to get Dragon Ball:
Revenge of King Piccolo working. However, in 93e636a I changed how we
handle FMA instructions in a way that resulted in the sign of NaNs
becoming inverted for nmadd/nmsub instructions, breaking the game.
To fix this, let's implement the AccurateNaNs setting, like on x86-64.
Operations that have two operands and can't generate a default NaN,
i.e. addition and subtraction, already have the desired NaN handling
on x86. We just need to make sure to not reverse the operands.
This fixes ps_sum0/ps_sum1 outputting NaNs in cases where they shouldn't.
(HandleNaNs assumes that a NaN in a ps0 input always results in a NaN in
the ps0 output, and correspondingly for ps1.)
1. In some cases, ps_merge01 can be implemented using one instruction.
2. When we need two instructions for ps_merge01, it's best to start with
a MOV to avoid false dependencies on the destination register.
3. ps_merge10 can be implemented using a single EXT instruction.
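
For reference, the guest-level behavior of the two instructions can be sketched as follows. This models a paired single as a plain pair of floats, which is a simplification, and the function names are made up for the example.

```cpp
#include <utility>

// A paired single modelled as (ps0, ps1).
using PairedSingle = std::pair<float, float>;

// ps_merge01: ps0 comes from A, ps1 comes from B.
PairedSingle PsMerge01(const PairedSingle& a, const PairedSingle& b)
{
  return {a.first, b.second};
}

// ps_merge10: ps0 is A's ps1, ps1 is B's ps0.
PairedSingle PsMerge10(const PairedSingle& a, const PairedSingle& b)
{
  return {a.second, b.first};
}
```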
This new function is like MOVP2R, except it masks out the lower 12 bits,
returning them instead of writing them to the register. These lower
12 bits can then be used as an offset for LDR/STR. This lets us turn
ADRP+ADD+LDR sequences with a zero offset into ADRP+LDR sequences with
a non-zero offset, saving one instruction.
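
The address split itself can be sketched like this. SplitPageOffset is a hypothetical name used only for illustration, not the name of the new emitter function.

```cpp
#include <cstdint>
#include <utility>

// Splits an address into a 4 KiB-aligned base (what ends up in the
// register) and the low 12 bits (what gets returned for use as an offset).
std::pair<uint64_t, uint32_t> SplitPageOffset(uint64_t address)
{
  const uint64_t page_base = address & ~uint64_t{0xFFF};
  const uint32_t offset = static_cast<uint32_t>(address & 0xFFF);
  return {page_base, offset};
}
```

The base plus the offset always reconstructs the original address, so the load or store reaches the same location either way.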
When emulated GBAs were added to Dolphin, it was possible to control them
using the GC TAS input window. (Z was mapped to Select.) Unaware of this,
I broke the functionality in b296248.
To make it possible to control emulated GBAs using TAS input again,
I'm adding a proper TAS input window for GBAs, with a real Select button
and no analog controls.
I recently talked to a homebrew developer who was trying to add exception
handlers at link time but found out that Dolphin was overwriting their
exception handlers. I figure that's not the usual way to do exception
handlers, but... making us load the executable after setting up memory
rather than before is easy, and matches what we do when booting discs,
so I suppose there's no reason not to do it. It also matches the intent
of why Dolphin is writing default exception handlers – we're writing
them because some homebrew relies on exception handlers being left
around from whatever program was running before it (see 3dd777be70).
Let's take advantage of ARM64's input register shifting one last time,
shall we?
Before:
0x1280005b mov w27, #-0x3
0x1b1b7f18 mul w24, w24, w27
After:
0x4b180b18 sub w24, w24, w24, lsl #2
ARM64's flexible shifting of input registers also allows us to calculate
a negative power of two in one instruction; shift the input of a NEG
instruction.
Before:
0x128001f7 mov w23, #-0x10
0x1b1a7efa mul w26, w23, w26
0x93407f58 sxtw x24, w26
After:
0x4b1a13fa neg w26, w26, lsl #4
0x93407f58 sxtw x24, w26
If the destination register doesn't equal the input register, using it
to temporarily hold the immediate value is fair game as it'll be
overwritten with the result of the multiplication anyway. This can
slightly reduce register pressure.
Before:
0x52800659 mov w25, #0x32
0x1b197f5b mul w27, w26, w25
After:
0x5280065b mov w27, #0x32
0x1b1b7f5b mul w27, w26, w27
By taking advantage of ARM64's ability to shift an input register by any
amount, we can calculate multiplication by a number that is one more
than a power of two with a single instruction.
Before:
0x52800838 mov w24, #0x41
0x1b187f7b mul w27, w27, w24
After:
0x0b1b1b7b add w27, w27, w27, lsl #6
Turn multiplications by a power of two into bitshifts.
Before:
0x52800817 mov w23, #0x40
0x1b167ef6 mul w22, w23, w22
After:
0x531a66d6 lsl w22, w22, #6
Multiplication by one is also trivial. Depending on the registers
involved, either a single MOV or no instructions will be generated.
Before:
0x52800038 mov w24, #0x1
0x1b1a7f1b mul w27, w24, w26
After:
0x2a1a03fb mov w27, w26
Before:
0x52800039 mov w25, #0x1
0x1b1a7f3a mul w26, w25, w26
After:
Nothing!
Add a new function that will handle all the special cases regarding
multiplication. It does nothing for now, but will be expanded in
follow-up commits.
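
As a sketch of the kind of case analysis such a function could do, here is a classifier for the special cases described in the messages above. The enum, the function names and the order of the checks are all assumptions made for this example; they are not taken from the actual JIT code.

```cpp
#include <cstdint>

enum class MulStrategy
{
  Identity,      // x * 1: a MOV or nothing
  ShiftLeft,     // x * 2^n: LSL
  NegShiftLeft,  // x * -(2^n): NEG with a shifted input
  AddShifted,    // x * (2^n + 1): ADD x, x, x LSL n
  SubShifted,    // x * (1 - 2^n): SUB x, x, x LSL n
  GenericMul,    // anything else: MOV + MUL
};

bool IsPow2(uint32_t v)
{
  return v != 0 && (v & (v - 1)) == 0;
}

MulStrategy ClassifyMultiplier(int32_t imm)
{
  // Work on the unsigned bit pattern to avoid signed overflow.
  const uint32_t u = static_cast<uint32_t>(imm);
  if (imm == 0)
    return MulStrategy::GenericMul;  // not covered by the messages above
  if (imm == 1)
    return MulStrategy::Identity;
  if (IsPow2(u))
    return MulStrategy::ShiftLeft;
  if (IsPow2(0u - u))
    return MulStrategy::NegShiftLeft;
  if (IsPow2(u - 1u))
    return MulStrategy::AddShifted;
  if (IsPow2(1u - u))
    return MulStrategy::SubShifted;
  return MulStrategy::GenericMul;
}
```

With this ordering, the multipliers from the before/after listings map to the expected strategies: 0x40 to a shift, 0x41 to ADD, -0x10 to NEG, -0x3 to SUB, and 0x32 to the generic path.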
We can merge an SXTW with the SUB, eliminating one instruction. In
addition, it is no longer necessary to allocate a temporary register,
reducing register pressure.
Before:
0x93407f59 sxtw x25, w26
0x93407ebb sxtw x27, w21
0xcb1b033b sub x27, x25, x27
After:
0x93407f5b sxtw x27, w26
0xcb35c37b sub x27, x27, w21, sxtw
Because of the previous commit, `regs_in_use` must not include `dest_reg`
when calling MMIOLoadToReg. There are also some other registers we can
skip including in `regs_in_use` just for efficiency's sake.
The `addr_reg_set = false` statements that I've added in this commit are
technically redundant – if `mmio_address` is non-zero then `addr_reg_set`
is already false – but it's just a coincidence that that's the case.
The old calculation was stride * (max_index + 1), which fails if stride is less than the size of a component. For instance, if float XYZ positions are used and the stride was set to 4 (i.e. sizeof(float)) instead of 12 (i.e. 3 * sizeof(float)), it would be missing the last 8 bytes of the final element in the array. Or, if stride was set to 0, then no bytes would be recorded at all (though that's not a useful configuration, so it's unlikely to actually exist).
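
The arithmetic can be checked with a small sketch. OldArraySize is the formula quoted above; NewArraySize is only an assumed shape for the replacement (offset of the last element plus one full element), since this message doesn't spell out the new formula.

```cpp
#include <cstdint>

// The old calculation: stride * (max_index + 1).
constexpr uint32_t OldArraySize(uint32_t stride, uint32_t max_index)
{
  return stride * (max_index + 1);
}

// Assumed replacement: start of the last element plus its full size.
constexpr uint32_t NewArraySize(uint32_t stride, uint32_t max_index, uint32_t element_size)
{
  return stride * max_index + element_size;
}

// Float XYZ (12 bytes) with a stride of 4: the old formula is 8 bytes short.
static_assert(NewArraySize(4, 9, 12) - OldArraySize(4, 9) == 8);
// With a stride of 0, the old formula records nothing at all.
static_assert(OldArraySize(0, 9) == 0);
```

When the stride equals the element size, the two formulas agree.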
I'm not aware of any games affected by this issue.
This should fix recording the wall in the staircase leading to the basement in Luigi's Mansion (though I haven't tested it, as I don't own a copy of Luigi's Mansion). This uses NormalIndex3, and the index for the normal vector (generally 0x02XX or 0x01XX) there is always lower than the tangent or binormal (generally 0x07XX). Other games seem to usually have a similar range of indices for the normal, tangent, and binormal, so this issue wouldn't affect them.
In most cases, games will use the same type for all vertex components (either Index8 or Index16 or Direct). However, RS2's deflection towers use Index16 for the texture coordinate and Index8 for everything else, meaning the texture coordinates were recorded incorrectly (the first byte was used, so only indices 0 and 1 were recorded instead of 0 through 0x0192). Worse still, some background elements in RS2 use direct positions but indexed normals or texture coordinates, and those would not be recorded at all.
This is a regression from b5fd35f951.
The previous implementation of Force25BitPrecision was essentially a
translation of the x86-64 implementation. It worked, but we can make a
more efficient implementation by using an AArch64 instruction I don't
believe x86-64 has an equivalent of: URSHR. The latency is the same as
before, but the instruction count and register count are both reduced.
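
To show why a rounding shift can stand in for a mask-and-add sequence, here is a sketch operating on the raw bits of a double. The shift amount of 28 and the exact mask-and-add formula are assumptions made for this example rather than something stated in this message, and both function names are hypothetical.

```cpp
#include <cstdint>

// Mask-and-add form: keep bits 27 and up, then add bit 27 once more,
// which rounds half up at the bit 28 boundary.
uint64_t RoundMaskAdd(uint64_t bits)
{
  return (bits & 0xFFFFFFFFF8000000ULL) + (bits & 0x8000000ULL);
}

// Rounding-shift form: models a rounding right shift by 28 (as URSHR does)
// followed by a left shift by 28.
uint64_t RoundViaRoundingShift(uint64_t bits)
{
  const uint64_t rounded = (bits >> 28) + ((bits >> 27) & 1);
  return rounded << 28;
}
```

Under these assumptions, both functions return the same value for every 64-bit input, including the case where the rounding carries out of the top bit.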
The new `dispatcher_no_timing_check` is the same as `dispatcher_no_check`
except it includes the "stepping check" in debug mode. This lets us avoid
the `m_enable_debugging ? dispatcher : dispatcher_no_check` dance.
Maybe "tail call" isn't quite the right term for what this code
is doing, since it's jumping to the dispatcher rather than
returning, but it's the same optimization as for a tail call.
fregsIn will include FD for double-precision instructions, since for
dependency tracking purposes the instruction does read the upper
half of FD. This is not what we want in HandleNaNs.
The consequence of this bug is that if an instruction was supposed to
output a NaN and FD happens to contain a NaN and FD happens to be the
same register as an unused register in the instruction encoding, the
NaN in FD could get used as the output instead of the correct NaN.
This isn't known to affect any games, which isn't especially surprising
considering that there's only one game that needs AccurateNaNs anyway.
Jumping to `dispatcher` requires first subtracting the downcount,
otherwise `dispatcher` may unpredictably jump to CoreTiming::Advance,
which could break determinism compatibility with JitArm64. We should
jump to `dispatcher_no_check` instead.
The breakpoint check in Jit.cpp makes it redundant.
Normally this redundant check doesn't cause any issues, but if you
create a breakpoint and enable logging without breaking, you get two
log messages if the breakpoint is at the beginning of a block. See
https://bugs.dolphin-emu.org/issues/13044.
This is also a tiny performance improvement for when debugging is
active, since we no longer check for breakpoints for blocks that never
had any breakpoints to begin with.
base is an unsigned variable, so we can make things a little more
consistent by making the loop index unsigned so we aren't doing bit
arithmetic with signed types.
MemoryInterface already does this, so we can leave it alone.
No behavioral changes, just a consistency thing.
Micro-optimization. Some CPUs can fuse CMP+B, TST+B, arith+CBZ, etc.
I also moved things around for CMP+CSET and TST+CSET - which I'm not sure
if any CPUs support - but it doesn't hurt anything, so I might as well.
Improves accuracy but isn't known to affect any games.
This turned out to be fairly convenient to implement; ORing with the
PPC default NaN will quieten SNaNs and do nothing to QNaNs.
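
On raw double bits, the trick looks like this. The sketch assumes the input is already known to be a NaN, and the names are hypothetical.

```cpp
#include <cstdint>

// The PowerPC default NaN: positive sign, all-ones exponent, quiet bit set.
constexpr uint64_t PPC_DEFAULT_NAN = 0x7FF8000000000000ULL;

// Precondition: nan_bits encodes a NaN. ORing sets the quiet bit of an
// SNaN and leaves a QNaN unchanged, since all of those bits are already set.
constexpr uint64_t QuietenNaNBits(uint64_t nan_bits)
{
  return nan_bits | PPC_DEFAULT_NAN;
}
```

The sign and payload bits pass through untouched, so only the quiet bit can change.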
This existed in the initial megacommit (though I don't know why) as IO_SIZE. It was used in Memmap's Init() to compute totalMemSize, but I don't know if it actually did anything then. That use was removed in 2d0f714546, but the constant persisted until cc858c63b8, when it became a static variable.
This was added in 385d8e2b15, but became somewhat redundant with Do in 4c7bbd96e4, and completely redundant now that std::is_trivially_copyable_v is well-supported.
This lets the TAS input code use a higher-level interface for
overriding inputs instead of having to fiddle with raw bits.
WiiTASInputWindow in particular was messy with how much
controller code it had to re-implement.
Fixes a Rogue Squadron II regression from 9d73583.
This set_dirty stuff is pretty tricky to reason about. I thought I
was clever when coming up with set_dirty, but maybe I was too clever
for my own good...
In case the register we're binding is the same as the immediate register,
we should fetch the immediate before calling BindToRegister. The way
the register cache currently works, calling GetImm after BindToRegister
actually does work, but it's better to not rely on it.
Tested on an official DOL-014 (251 blocks) memory card by executing the
0xf4 command on a card with content along its entire length and then
dumping the whole card: it reads as 0xff all the way through.
Therefore, the current implementation is already consistent with hardware.
Texture dumping can already be done using VideoCommon's system (and in fact the same setting already enabled *both* of these). Dumping objects/TEV stages/texture fetches doesn't currently have an equivalent, but could be added to the FIFO player instead.
A (partial) port of #9481 to ARM64. This commit adds special cases for
immediate values equal to 0 or 0xFFFFFFFF, allowing for more efficient
or no code to be generated.
When a guest register is an immediate, it may be necessary to move this
value into a register. This is handled by gpr.R(), which lacks context
on how the register will be used. This leads to cases where the
immediate is written to a register, only for it to be overwritten. Take
for example this code generated by srwx:
0x5280031b mov w27, #0x18
0x53187edb lsr w27, w22, #24
gpr.BindToRegister() does have this context through the do_load
parameter, but didn't handle immediates. By adding this logic, we can
intelligently skip the write when do_load is false.
Fixes https://bugs.dolphin-emu.org/issues/13017. With uCode switching, the existing instance of AXUCode is re-activated when GBAUCode is done, but if the state remains as WaitingForNextTask, it won't be able to do anything. Instead, it needs to be in WaitingForCmdListSize.
(When the AX uCode is resumed, startpc is set to 0x0030, at least for 0x07f88145; this is the same location as MAIL_RESUME jumps to, so DSP_RESUME should be sent when the resuming happens; that's already handled by AXUCode::Update.)
dir_path is used by PanicAlertFormatT, which prior to PR 10209 used a
lambda. Before C++20, referring to structured bindings in lambda captures
was forbidden. The problem is now doubly fixed, so put the structured
binding back in.
Fixes the Dolphin bug mentioned in
https://github.com/dolphin-emu/hwtests/issues/45.
Because this doesn't fix any observed behavior in games (no, 1080°
Avalanche isn't affected), I haven't implemented this in the JITs,
so as to not cause unnecessary performance degradations.
This command does not upload the MAIN buffers to CPU memory. This was
functionally fixed in f11a40f858 without updating the comments and
variable names.
Previously, we had WBFS and CISO which both returned an upper bound
of the size, and other formats which returned an accurate size. But
now we also have NFS, which returns a lower bound of the size. To
allow VolumeVerifier to make better informed decisions for NFS, let's
use an enum instead of a bool for the type of data size a blob has.
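
A minimal sketch of the idea follows. The enumerator names and the helper function are assumptions made for illustration; they show how a consumer could treat the three cases differently, which a bool can't express.

```cpp
#include <cstdint>

// Hypothetical names for the three kinds of size a blob can report.
enum class DataSizeType
{
  Accurate,    // the reported size is exact
  LowerBound,  // the real size is at least the reported size (e.g. NFS)
  UpperBound,  // the real size is at most the reported size (e.g. WBFS, CISO)
};

// Returns true only when the reported size proves the data is too small.
// A lower bound can never prove that, since the real size may be larger.
bool SizeProvesTooSmall(uint64_t reported, DataSizeType type, uint64_t required)
{
  return type != DataSizeType::LowerBound && reported < required;
}
```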
For a few years now, I've been thinking it would be nice to make Dolphin
support reading Wii games in the format they come in when you download
them from the Wii U eShop. The Wii U eShop has some good deals on Wii
games (Metroid Prime Trilogy especially is rather expensive if you try
to buy it physically!), and it's the only place right now where you can
buy Wii games digitally.
Of course, Nintendo being Nintendo, next year they're going to shut down
this only place where you can buy Wii games digitally. I kind of wish I
had implemented this feature earlier so that people would've had ample
time to buy the games they want, but... better late than never, right?
I used MIT-licensed code from the NOD library as a reference when
implementing this. None of the code has been directly copied, but
you may notice that the names of the struct members are very similar.
c1635245b8/lib/DiscIONFS.cpp
Needed for the next commit. NFS disc images are hashed but not encrypted.
While we're at it, also get rid of SupportsIntegrityCheck.
It does the same thing as old IsEncryptedAndHashed and new HasWiiHashes.
CARDUCode, GBAUCode, and INITUCode previously didn't have an implementation of it. In practice it's unlikely that this caused an issue, since these uCodes are only active for a few frames at most, but now that GBAUCode doesn't have global state, we can implement it there. I also implemented it for CARDUCode, although our CARDUCode implementation does not have all states handled yet - this is simply future-proofing so that when the card uCode is properly implemented, the save state version does not need to be bumped. INITUCode does not have any state to save, though.
The accuracy improvements are:
* The request mail must be 0xabba0000 exactly; both the low and high parts are checked
* The address is masked with 0x0fffffff
* Before, the global state meant that after the GBA uCode had been used once, it would accept 0xcdd1 commands immediately. Now, it only accepts them after execution has finished.
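
The first two points can be sketched as follows. The constants come from the list above; the function names are hypothetical, and the looser high-half-only check is shown purely for contrast, as an assumption about what a less strict check would look like.

```cpp
#include <cstdint>

constexpr uint32_t REQUEST_MAIL = 0xabba0000;
constexpr uint32_t ADDRESS_MASK = 0x0fffffff;

// Strict check: both the high and low halves must match.
bool IsRequestMail(uint32_t mail)
{
  return mail == REQUEST_MAIL;
}

// Looser check that ignores the low half (for contrast only).
bool IsRequestMailHighOnly(uint32_t mail)
{
  return (mail >> 16) == (REQUEST_MAIL >> 16);
}

// The address is masked before use.
uint32_t MaskAddress(uint32_t address)
{
  return address & ADDRESS_MASK;
}
```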
These lookup tables total 4 megabytes, and contain data that's entirely redundant to the actual cache state (as part of an optimization, though I'm not sure whether the optimization actually is useful). This change instead recomputes these lookup tables when loading the state (which involves filling the lookup table with a marker (0xff), and then setting the 128 * 8 valid entries (1 kilobyte)).
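
The rebuild step can be sketched roughly like this. The struct layout, the names and the way the table is indexed are all hypothetical; only the 0xff marker and the 128 * 8 entry count come from the description above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint8_t INVALID_WAY = 0xff;
constexpr std::size_t NUM_SETS = 128;
constexpr std::size_t NUM_WAYS = 8;

// Hypothetical saved state: which entries are valid and what each one holds.
struct CacheState
{
  std::array<std::array<bool, NUM_WAYS>, NUM_SETS> valid{};
  std::array<std::array<uint32_t, NUM_WAYS>, NUM_SETS> line_index{};
};

// Rebuilds the derived table (line index -> way) instead of saving it.
std::vector<uint8_t> RebuildLookupTable(const CacheState& state, std::size_t table_size)
{
  // Fill with the marker first, then record the valid entries.
  std::vector<uint8_t> table(table_size, INVALID_WAY);
  for (std::size_t set = 0; set < NUM_SETS; ++set)
  {
    for (std::size_t way = 0; way < NUM_WAYS; ++way)
    {
      const uint32_t index = state.line_index[set][way];
      if (state.valid[set][way] && index < table_size)
        table[index] = static_cast<uint8_t>(way);
    }
  }
  return table;
}
```

Since at most 128 * 8 entries are written, the rebuild cost is dominated by filling the table with the marker.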
Before, we used a replace hook and didn't write anything there. Now, we write a BLR instruction to immediately return, and then use a start hook. This makes the behavior a bit clearer (though it shouldn't matter in practice).
All of our BBA options are technically built in, so the name of the BBA
Built In option made it confusing what the option actually did. Rename
it to BBA HLE to make it clearer what it does and why it doesn't need a
TAP.
I'm not sure what the XMM0 check was supposed to be, but the 0xCC008000 one is for the fifo and is handled elsewhere now (look for `optimizeGatherPipe`).
We currently have two different code paths for initializing controllers:
Either the frontend (DolphinQt) can do it, or if the frontend doesn't do
it, the core will do it automatically when booting. Having these two
paths has caused problems in the past due to only one frontend being
tested (see de7ef47548). I would like to get rid of the latter path to
avoid further problems like this.
The movie config layer is not active for recording, only playback. Thus, recording ends up stuck with default SYSCONF settings.
The fix is simply to add in the movie config layer when recording. The way it's done is a bit hacky, but seems to work.
This also changes the behavior for the invalid gamma value, which was confirmed to behave the same as 2.2.
Note that currently, the gamma value is only used for XFB copies, even though hardware testing indicates it also works for EFB copies. This will be changed in a later commit.
Before, Free Look would accept background input by default, which means it was easy to accidentally move the camera while typing in another window. (This is because HotkeyScheduler::Run sets the input gate to `true` after it's copied the hotkey state, supposedly for other threads (though `SetInputGate` uses a `thread_local` variable so I'm not 100% sure that's correct) and for the GBA windows (which always accept unfocused input, presumably because they won't be focused normally).)
If a 64-bit register is passed to WriteConditionalExceptionExit,
the LDR instruction in it will read too much data. This seems
to be harmless right now, but causes problems in one of my PRs.
This should reduce (but not completely eliminate) gradual audio desyncs in dumps. This also allows for accurate sample rates for the GameCube.
Completely eliminating gradual audio desyncs will require resampling to an integer sample rate, as nothing seems to support a non-integer sample rate.
These values were obtained by setting a breakpoint at a game's entry point, and then observing the register values with Dolphin's register widget.
There are other registers that aren't handled by this PR, including CR, XER, SRR0, SRR1, and "Int Mask" (as well as most of the GPRs). They could be added in a later PR if it turns out that their values matter, but probably most of them don't.
This fixes Datel titles booting with the IPL skipped (see https://bugs.dolphin-emu.org/issues/8223), though when booted this way they are currently missing textures. Due to somewhat janky code, Datel overwrites the syscall interrupt handler and then immediately triggers it (with the `sc` instruction) before they restore the correct one. This works on real hardware due to icache, and also works in Dolphin when the IPL runs due to icache, but prior to this change `HID0.ICE` defaulted to 0 so icache was not enabled when the IPL was skipped.
DSPHLE::Initialize sets the halt and init bits to true (i.e. m_dsp_control.Hex starts as 0x804), which is reasonable behavior (this is the state the DSP will be in when starting a game from the IPL, as after `__OSStopAudioSystem` the control register is 0x804).
However, CMailHandler::m_halted defaults to false, and we only call CMailHandler::SetHalted in DSPHLE::DSP_WriteControlRegister when m_dsp_control.DSPHalt changes, so since DSPHalt defaults to true, if the first thing that happens is writing true to DSPHalt, we won't properly halt the mail handler.
Now, we call CMailHandler::SetHalted on startup. This fixes Datel titles when the IPL is skipped with DSP HLE (though this configuration only works once https://bugs.dolphin-emu.org/issues/8223 is fixed).
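
The mismatch can be reproduced with a stripped-down model. These structs are hypothetical stand-ins for the real classes and only mirror the behavior described above.

```cpp
// Stand-in for the mail handler: its halted flag defaults to false.
struct MailHandlerModel
{
  bool halted = false;
  void SetHalted(bool value) { halted = value; }
};

// Stand-in for the DSP HLE state.
struct DspModel
{
  bool control_halt = true;  // the halt bit is set at startup
  MailHandlerModel mail;

  // The halt state is only forwarded when the bit changes.
  void WriteControlHalt(bool new_halt)
  {
    if (new_halt != control_halt)
    {
      control_halt = new_halt;
      mail.SetHalted(new_halt);
    }
  }

  // The fix: sync the mail handler with the control register on startup.
  void Initialize() { mail.SetHalted(control_halt); }
};
```

Without the Initialize call, writing true to the halt bit as the very first action leaves the mail handler running, because the write is not seen as a change.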
This fixes booting Datel titles with DSPHLE (see https://bugs.dolphin-emu.org/issues/12943). Datel messed up their DSP initialization code, so it only works by receiving a mail later on, but if halting isn't implemented then it receives the mail too early and hangs.
It's cleared whenever the uCode changes, so there's no reason to clear it in a destructor or during initialization.
I've also renamed it to ClearPending.