The SaveToSYSCONF call in BootManager.cpp was unintentionally
overriding the temporary NAND set by the preceding
InitializeWiiRoot call. Fixes
https://bugs.dolphin-emu.org/issues/12500.
Verifying a Wii game creates an instance of IOS, and Dolphin
can't handle more than one instance of IOS at the same time.
Properly supporting it is probably more effort than it's worth.
Fixes https://bugs.dolphin-emu.org/issues/12494.
Avoids the need to copy the *.mo files manually *and* more importantly
this ensures that the mo files are always recreated if the build
output directory is cleared.
Update references was failing to update the references, causing input to stay nullptr and crashing.
I fixed the case that triggered that, though also added checks against nullptrs for safety.
(cherry picked from commit 4bdcf707555a5568eddff957fa3604975ffb6ed7)
I think the AArch64 JIT has come far enough that it doesn't have to
be called experimental anymore.
I'm also labeling the x86-64 JIT as x86-64 for consistence with the
AArch64 JIT. This will especially be helpful if we start supporting
AArch64 on macOS, as AArch64 macOS can run both the x86-64 JIT and
the AArch64 JIT depending on whether you enable Rosetta 2.
I haven't observed this breaking any game, but it didn't match
the behavior of the interpreter as far as I could tell from
reading the code, in that denormals weren't being flushed.
If we can prove that FCVT will provide a correct conversion,
we can use FCVT. This makes the common case a bit faster
and the less likely cases (unfortunately including zero,
which FCVT actually can convert correctly) a bit slower.
Preparation for following commits.
This commit intentionally doesn't touch paired stores,
since paired stores are supposed to flush to zero.
(Consistent with Jit64.)
This simplifies some of the following commits. It does require
an extra register, but hey, we have 32 of them.
Something I think would be nice to add to the register cache
in the future is the ability to keep both the single and double
version of a guest register in two different host registers
when that is useful. That way, the extra register we write to
here can be read by a later instruction, saving us from
having to perform the same conversion again.
Fixes https://bugs.dolphin-emu.org/issues/12388. Might also fix
other games that have problems with float/paired instructions
in JitArm64, but I haven't tested any.
-They might have never drawn if DrawMessages wasn't called before they actually expired
-Their fade was wrong if the duration of the message was less than the fade time
This makes them much more useful for debugging, I know there might be other means
of debugging like logs and imgui, but this was the simplest so that's what I used.
If you want to print the same message every frame, but with a slightly different value
to see the changes, it now work.
To compensate for the fact that they are now always rendered once,
so on start up a lot of old messages (printed while the emulation was off) could show up,
I've added a "drop" time, which means if a msg isn't rendered for the first
time within that time, it will be dropped and never rendered.
When the interpreter writes to a discarded register, its type
must be changed so that it is no longer considered discarded.
Fixes a 62ce1c7 regression.
We normally check for division by zero to know if we should set the
destination register to zero with a XOR. However, when the divisor and
destination registers are the same the explicit zeroing can be omitted.
In addition, some of the surrounding branching can be simplified as
well.
Before:
45 85 FF test r15d,r15d
75 05 jne normal_path
45 33 FF xor r15d,r15d
EB 0C jmp done
normal_path:
B8 5A 00 00 00 mov eax,5Ah
99 cdq
41 F7 FF idiv eax,r15d
44 8B F8 mov r15d,eax
done:
After:
45 85 FF test r15d,r15d
74 0C je done
B8 5A 00 00 00 mov eax,5Ah
99 cdq
41 F7 FF idiv eax,r15d
44 8B F8 mov r15d,eax
done:
Division by a power of two can be slightly improved when the
destination and dividend registers are the same.
Before:
8B C6 mov eax,esi
85 C0 test eax,eax
8D 70 03 lea esi,[rax+3]
0F 49 F0 cmovns esi,eax
C1 FE 02 sar esi,2
After:
85 F6 test esi,esi
8D 46 03 lea eax,[rsi+3]
0F 48 F0 cmovs esi,eax
C1 FE 02 sar esi,2
Repeated erase() + iteration on a std::multimap is extremely slow.
Slow enough that it causes a 7 second long stutter during some
transitions in F-Zero X (a N64 VC game that triggers many, many icache
invalidations).
And slow enough that JitBaseBlockCache::DestroyBlock shows up on a
flame graph as taking >50% of total CPU time on the CPU-GPU thread:
https://i.imgur.com/vvqiFL6.png
This commit optimises those block link queries by replacing the
std::multimap (which is typically implemented with red-black trees)
with hash tables.
Master: https://i.imgur.com/vvqiFL6.png / 7s stutters
(starting from 5.0-2021 and with branch following disabled)
This commit: https://i.imgur.com/hAO74fy.png / ~0.7s stutters, which
is pretty close to 5.0 stable. (5.0-2021 introduced the performance
regression and it is especially noticeable when branch following
is disabled, which is the case for all N64 VC games since 5.0-8377.)
VideoCommon: Change the type of BPMemory.scissorOffset to 10bit signed: S32X10Y10
VideoBackends: Fix Software Clipper.PerspectiveDivide function, use BPMemory.scissorOffset instead of hard code 342
Oversight from #9545, which moved the "new game has been loaded" logic
to a separate OnNewTitleLoad function that has to be called explicitly
*after* a title has loaded.
Coupled with the commit that makes Dolphin not clobber 0x1800-0x3000
when using MIOS, this fixes Wind Waker and other MIOS-patched games
when they are launched from the System Menu.
MIOS puts patch data in low MEM1 (0x1800-0x3000) for its own use.
Overwriting data in this range can cause the IPL to crash when
launching games that get patched by MIOS.
See https://bugs.dolphin-emu.org/issues/11952 for more info.
Not applying the Gecko HLE patches means that Gecko codes will not work
under MIOS, but this is better than the alternative of having specific
games crash.
This particular range is kind of bizarre, and would only interpret
interleave mode 2 as a valid mode, while rejecting interleave mode 1 and
the extension byte mode.
As far as I know, based off the information on Wiibrew, we should be
considering all three values within this range as valid.
texture serialization and deserialization used to involve many memory
allocations and deallocations, along with many copies to and from
those allocations. avoid those by reserving a memory region inside the
output and writing there directly, skipping the allocation and copy to
an intermediate buffer entirely.
This adds a CMake option (DOLPHIN_DEFAULT_UPDATE_TRACK) to allow
configuring SCM_UPDATE_TRACK_STR. This is needed to enable auto-updates
in Windows CMake builds by default.
This adds a function to get the emulated or real Bluetooth device for
an active emulation instance. This lets us deduplicate all the
`ios->GetDeviceByName("/dev/usb/oh1/57e/305")` calls that are currently
scattered in the codebase and ensures Bluetooth passthrough is being
handled correctly.
This also fixes the broken check in WiimoteCommon::UpdateSource.
There was a confusion between "emulated Bluetooth" (as opposed to
"real Bluetooth" aka Bluetooth passthrough) and "emulated Wiimote".
Specifically, 'Scooby-Doo! Mystery Mayhem', 'Scooby-Doo! Unmasked', 'Ed, Edd n Eddy: The Mis-Edventures', and the Wii version of 'Happy Feet'.
The JIT cache causes problems with emulated icache invalidation in these games, resulting in areas failing to load.
This avoids some warnings, which were originally fixed by ignoring loads with a value of zero (see 636bedb207 / #3242).
Note that FifoCI will report some changes, but only on the first frame; these seem to be timing related as they don't happen if a different write is used to replace skipped ones.
They appear to relate to perf queries, and combining them with truely unknown commands would probably hide useful information. Furthermore, 0x20 is issued by every title, so without this every title would be recorded as using an unknown command, which is very unhelpful.
The swaps are confusing and don't accomplish much.
It was originally written like this:
u32 pte = bswap(*(u32*)&base_mem[pteg_addr]);
then bswap was changed to Common::swap32, and then the array access
was replaced with Memory::Read_U32, leading to the useless swaps.
While 6xx_pem.pdf §7.6.1.1 mentions that the number of trailing
zeros in HTABORG must be equal to the number of trailing ones
in the mask (i.e. HTABORG must be properly aligned), this is actually
not a hard requirement. Real hardware will just OR the base address
anyway. Ignoring SDR changes would lead to incorrect emulation.
Logging a warning instead of dropping the SDR update silently is a
saner behaviour.
debaf63fe8 moved the "Sonic epsilon hack"
to vertex shaders. However, it was only done for targets with depth
clamping. If this is not available, for example the target is OpenGL ES,
the Sonic problem appears (https://bugs.dolphin-emu.org/issues/11897).
A version of the "Sonic epsilon hack" is added for targets without
depth clamping.
This changes FileSystemProxy::Open to return a file descriptor wrapper
that will ensure the FD is closed when it goes out of scope.
By using such a wrapper we make it more difficult to forget to close
file descriptors.
This fixes a leak in ReadBootContent. I should have added such a class
from the beginning... In practice, I don't think this would have caused
any obvious issue because ReadBootContent is only called after an IOS
relaunch -- which clears all FDs -- and most titles do not get close
to the FD limit.
JitArm64::DoJit contains a check where it prints a warning and tries
to pause emulation if instructed to compile code at address 0. I'm
assuming this was done in order to provide a nicer error behavior
in cases where PC was accidentally set to null. Unfortunately, it
has started causing us problems recently, as 688bd61 writes and runs
some code at address 0 to simulate the PPC being held in reset.
What makes this worse is that calling Core::SetState from the CPU
thread is actually not allowed and will cause a deadlock instead of
the intended behavior. I don't believe there is anything on a real
console that would stop you from executing code at address 0 (as
long as the MMU has been set up to allow it), and Jit64::DoJit
doesn't contain any check like this, so let's remove the check.
This commit adds a new "discarded" state for registers.
Discarding a register is like flushing it, but without
actually writing its value back to memory. We can discard
a register only when it is guaranteed that no instruction
will read from the register before it is next written to.
Discarding reduces the register pressure a little, and can
also let us skip a few flushes on interpreter fallbacks.
The output of instructions like fabsx and ps_sel is store-safe
if and only if the relevant inputs are. The old code was always
marking the output as store-safe if the output was a single,
and never otherwise.
Also, the old code was treating the output of psq_l/psq_lu as
store-safe, which seems incorrect (if dequantization is disabled).
This improves the speed of verifying Wii WIA/RVZ files.
For me, the verification speed for LZMA2-compressed files
has gone from 11-12 MiB/s to 13-14 MiB/s.
One thing VolumeVerifier does to achieve parallelism is to
compute hashes for one chunk of data while reading the next
chunk of data. In master, when reading data from a Wii
partition, each such chunk is 32 KiB. This is normally fine,
but with WIA and RVZ it leads to rather lopsided read times
(without the compute times being lopsided): The first 32 KiB
of each 2 MiB takes a long time to read, and the remaining
part of the 2 MiB can be read nearly instantly. (The WIA/RVZ
code has to read the entire 2 MiB in order to compute hashes
which appear at the beginning of the 2 MiB, and then caches
the result afterwards.) This leads to us at times not doing
much reading and at other times not doing much computation.
To improve this, this change makes us use 2 MiB chunks
instead of 32 KiB chunks when reading from Wii partitions.
(block = 32 KiB, group = 2 MiB)
This can't actually happen in practice due to how WAD files work,
but it's very easy to add support for thanks to the last commit,
so we might as well add support for it.
The performance gains of doing this aren't too important since you
normally wouldn't run into any disc image that has overlapping blocks
(which by extension means overlapping partitions), but this change also
lets us get rid of things like VolumeVerifier's mutex that used to
exist just for the sake of handling overlapping blocks.
Panic alerts in DiscIO can potentially be very annoying since
large amounts of them can pop up when loading the game list
if you have some particularly weird files in your game list.
This was a much bigger problem back in 5.0 with its
"Tried to decrypt data from a non-Wii volume" panic alert, but
I figured I would take it all the way and remove the remaining
panic alerts that can show up when loading the game list.
I have exempted uses of ASSERT/ASSERT_MSG since they indicate
a bug in Dolphin rather than a malformed file.
If we know at compile time that the PPC carry flag definitely
has a certain value, we can bake that value into the emitted code
and skip having to read from PPCState.
When a save state is loaded, the IOS device serving bluetooth
is cast as BluetoothEmuDevice. If, however, a real Wiimote
with BT passthrough is used, this caused the game to crash.
Now the proper device class is used.
At a first glance it may look like a part of the code I added to
srawx in efeda3b has a bug when a == s. The code actually happens
to work correctly, but in the interest of making the code easier
to reason about, I'd like to change the way it's implemented. This
change should improve the pipelining a little in the a == s case too.
Fix Gamelist context menu item 'Open Containing Folder' opening wrong
target on Windows when game parent folder is [foobar] and grandparent
folder contains file [foobar].bat or [foobar].exe
Add trailing directory separator to parent folder path to force Windows
to interpret path as directory.
Fixes https://bugs.dolphin-emu.org/issues/12411
21c152f added a small hack to DVDInterface to keep WBFS and CISO
files working with Nintendo's "Error #001" anti-piracy check.
Unfortunately I don't think it's possible to support WBFS and
CISO without any kind of hack or heuristic, but what we can do
is replace the 21c152f hack (which applies regardless of file
format) with a hack that only is active when using WBFS or CISO.
This change is similar to 2a5a399, but the disc size is
calculated in a different way.
Add ! before unused variables to 'use' them.
Ubuntu-x64 emits warnings for unused variables because gcc decides
it should ignore the void cast around them. See thread for discussion:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66425
Loop index int i was being compared against GetControllerCount() which
returned a size_t. This was the only place GetControllerCount() was
called from so the change of return type doesn't disturb anything else.
Changing the loop index to size_t wouldn't work as well since it's
passed into GetController(), which takes an int and is called from many
places, so it would need a cast anyway on an already busy line.
...and let's optimize a divisor of 2 ever so slightly for good measure.
I wouldn't have bothered, but most GameCube games seem to hit this on
launch.
- Division by 2
Before:
41 BE 02 00 00 00 mov r14d,2
41 8B C2 mov eax,r10d
45 85 F6 test r14d,r14d
74 0D je overflow
3D 00 00 00 80 cmp eax,80000000h
75 0E jne normal_path
41 83 FE FF cmp r14d,0FFFFFFFFh
75 08 jne normal_path
overflow:
C1 F8 1F sar eax,1Fh
44 8B F0 mov r14d,eax
EB 07 jmp done
normal_path:
99 cdq
41 F7 FE idiv eax,r14d
44 8B F0 mov r14d,eax
done:
After:
45 8B F2 mov r14d,r10d
41 C1 EE 1F shr r14d,1Fh
45 03 F2 add r14d,r10d
41 D1 FE sar r14d,1
Add a function to calculate the magic constants required to optimize
signed 32-bit division.
Since this optimization is not exclusive to any particular architecture,
JitCommon seemed like a good place to put this.
Zero divided by any number is still zero. For whatever reason, this case
shows up frequently too.
Before:
B8 00 00 00 00 mov eax,0
85 F6 test esi,esi
74 0C je overflow
3D 00 00 00 80 cmp eax,80000000h
75 0C jne normal_path
83 FE FF cmp esi,0FFFFFFFFh
75 07 jne normal_path
overflow:
C1 F8 1F sar eax,1Fh
8B F8 mov edi,eax
EB 05 jmp done
normal_path:
99 cdq
F7 FE idiv eax,esi
8B F8 mov edi,eax
done:
After:
Nothing!
When the dividend is known at compile time, we can eliminate some of the
branching and precompute the result for the overflow case.
Before:
B8 54 D3 E6 02 mov eax,2E6D354h
85 FF test edi,edi
74 0C je overflow
3D 00 00 00 80 cmp eax,80000000h
75 0C jne normal_path
83 FF FF cmp edi,0FFFFFFFFh
75 07 jne normal_path
overflow:
C1 F8 1F sar eax,1Fh
8B F8 mov edi,eax
EB 05 jmp done
normal_path:
99 cdq
F7 FF idiv eax,edi
8B F8 mov edi,eax
done:
After:
85 FF test edi,edi
75 04 jne normal_path
33 FF xor edi,edi
EB 0A jmp done
normal_path:
B8 54 D3 E6 02 mov eax,2E6D354h
99 cdq
F7 FF idiv eax,edi
8B F8 mov edi,eax
done:
Fairly common with constant dividend of zero. Non-zero values occur
frequently in Ocarina of Time Master Quest.
Whether the custom RTC setting is enabled shouldn't in itself
affect determinism (as long as the actual RTC value is properly
synced). Alters the logic added in 4b2906c.
I'm not entirely certain that this is correct, but the current
code doesn't really make sense to me... If we need to force the
RTC bias to 0 when custom RTC is enabled, why don't we need to
do it when custom RTC is disabled? The code for getting the
host system's current time doesn't contain any special handling
for the guest's RTC bias as far as I can tell.
The loop in WIARVZFileReader::Chunk::Read could terminate
prematurely if the size argument was smaller than the size
of an exception list which had only been partially loaded.
Fixes issue 11393.
The problem is that left and top make no sense for a width by height array; they only make sense in a larger array where from which a smaller part is extracted. Thus, the overall size of the array is provided to CopyRegion in addition to the sub-region. EncodeXFB already handles the extraction, so CopyRegion's only use there is to resize the image (and thus no sub-region is provided).
BPMEM_TEV_COLOR_ENV + 6 (0xC6) was missing due to a typo. BPMEM_BP_MASK (0xFE) does not lend itself well to documentation with the current FIFO analyzer implementation (since it requires remembering the values in BP memory) but still shouldn't be treated as unknown. BPMEM_TX_SETMODE0_4 and BPMEM_TX_SETMODE1_4 (0xA4-0xAB) were missing entirely.
Additional changes:
- For TevStageCombiner's ColorCombiner and AlphaCombiner, op/comparison and scale/compare_mode have been split as there are different meanings and enums if bias is set to compare. (Shift has also been renamed to scale)
- In TexMode0, min_filter has been split into min_mip and min_filter.
- In TexImage1, image_type is now cache_manually_managed.
- The unused bit in GenMode is now exposed.
- LPSize's lineaspect is now named adjust_for_aspect_ratio.
Additionally, VCacheEnhance has been added to UVAT_group1. According to YAGCD, this field is always 1.
TVtxDesc also now has separate low and high fields whose hex values correspond with the proper registers, instead of having one 33-bit value. This change was made in a way that should be backwards-compatible.
The PPC is supposed to be held in reset when another version of IOS is
in the process of being launched for a PPC title launch.
Probably doesn't matter in practice, though the inaccuracy was
definitely observable from the PPC.
We should only try to load a symbol map for the new title *after* it
has been loaded into memory, not before. Likewise for applying HLE
patches and loading new custom textures.
In practice, loading/repatching too early was only a problem for
titles that are launched via ES_Launch. This commit fixes that.
The extra IPC ack is triggered by a syscall that is invoked in ES's
main function; the syscall literally just sets Y2, IX1 and IX2 in
HW_IPC_ARMCTRL -- there is no complicated ack queue or anything.
Low MEM1 is cleared by IOS before all the other constants are written.
This will overwrite the Gecko code handler but it should be fine
because HLE::Reload (which will set up the code handler hook again)
will be called after a title change is detected.
The Host constructor sets a callback on a lambda that in turn calls
Host_UpdateDisasmDialog. Since that function is not a member function
capturing this is unnecessary.
Fixes -Wunused-lambda-capture warning on freebsd-x64.
When reading a reply from a message sent to the data socket there is
the possibility that the other side gets sent multiple messages
before replying to any of them, which can lead to multiple replies
sent in a row. Though this only happens when things time out, it's
quite possible for these timeouts to happen or build up over time,
especially when initiating the connection.
This change makes sure to flush any pending bytes that have not been
read yet out of the socket after a successful POLL reply is received,
since that is the most common time when backups occur, and as well as
using the exact number of bytes in an expected reply, to ensure
the received data and the message it's replying to do not get out of
sync.
The result of calls to PPCSTATE_OFF_PS0/1 were being cast to u32 and
passed to functions expecting s32 parameters. This changes the casts
to s32 instead.
One location was missing a cast and generated a warning with VS which
is now fixed.
Added `ToggleBreakPoint` to both interface BreakPoints/MemChecks. this would allow us to toggle the state of the breakpoint.
Also the TMemCheck::is_ranged is not longer serialized to string, since can be deduce by comparing the TMemCheck::start_address and TMemCheck::end_address
DualShock UDP Client is the only place in the code that assumed OnConfigChanged()
is called at least once on startup or it won't load up the setting, so I took care of that
This was caused, because we were saving the `break_on_hit` flag with the letter `p`. Then while loading the breakpoints, we read the flag with the letter `b`, resulting in the `break_on_hit` flag being always false
Filesystem accesses aren't magically faster when they are done by ES,
so this commit changes our content wrapper IPC commands to take FS
access times and read operations into account.
This should make content read timings a lot more accurate and closer
to console. Note that the accuracy of the timings are limited to the
accuracy of the emulated FS timings, and currently performance
differences between IOS9-IOS28 and newer IOS versions are not emulated.
Part 1 of fixing https://bugs.dolphin-emu.org/issues/11346
(part 2 will involve emulating those differences)
This makes it more convenient to emulate timings for IPC commands that
perform internal IOS <-> IOS IPC, for example ES relying on FS
for filesystem access.
According to hwtests, older versions of IOS are slower at performing
various filesystem operations:
https://docs.google.com/spreadsheets/d/1OKo9IUuKCrniz4m0kYIaMP_qFtOCmAzHZ_zAmobvBcc/edit
(courtesy of JMC)
A quick glance at IOS9 reveals that older versions of IOS have a
simplistic implementation of memcpy that does not optimize large copies
by copying 16 bytes or 32 bytes per chunk, which makes cached reads
and writes noticeably slower -- the difference was significant enough
that the OoT speedrunning community noticed that IOS9 (the IOS that
is used for the OoT VC title) was slower.
More or less a complete rewrite of the function which aims
to be equally good or better for each given input, without
relying on special cases like the old implementation did.
In particular, we now have more extensive support for
MOVN, as mentioned in a TODO comment.
Instead of constructing IPCCommandResult with static member functions
in the Device class, we can just add the relevant constructors to the
reply struct itself. Makes more sense than putting it in Device
when the struct is used in the kernel code and doesn't use any Device
specific members...
This commit also changes the IPC command handlers to return an optional
IPCCommandResult rather than an IPCCommandResult. This removes the need
for a separate boolean that indicates whether the "result" is actually
a reply, and also avoids the need to set dummy result values and ticks.
It also makes it really obvious which commands can result in no reply
being generated.
Finally, this commit renames IPCCommandResult to IPCReply since the
struct is now only used for actual replies. This new name is less
verbose in my opinion.
The diff is quite large since this touches every command handler, but
the only functional change is that I fixed EnqueueIPCReply to
take a s64 for cycles_in_future to match IPCReply.
PrepareForState is now unnecessary with the new implementation of
HostFileSystem::DoState, which does what the old implementation
(CWII_IPC_HLE_Device_FileIO::PrepareForState) used to do.
I don't really see the use of this. (Maybe in the past it
was used for when we need a constant number of instructions
for backpatching? But we don't use MOVI2R for that now.)
Now that the ES class (now called ESDevice) and the ES namespace do
not conflict anymore, "IOS::" can be dropped in a lot of cases.
This also removes "IOS::HLE::" for code that is already in that
namespace. Some of those names used to be explicitly qualified
only for historical reasons.
There are no functional changes.
Some of the device names can be ambiguous and require fully or partly
qualifying the name (e.g. IOS::HLE::FS::) in a somewhat verbose way.
Additionally, insufficiently qualified names are prone to breaking.
Consider the example of IOS::HLE::FS:: (namespace) and
IOS::HLE::Device::FS (class). If we use FS::Foo in a file that doesn't
know about the class, everything will work fine. However, as soon as
Device::FS is declared via a header include or even just forward
declared, that code will cease to compile because FS:: now resolves
to Device::FS if FS::Foo was used in the Device namespace.
It also leads to having to write IOS::ES:: to access ES types and
utilities even for code that is already under the IOS namespace.
The fix for this is simple: rename the device classes and give them
a "device" suffix in their names if the existing ones may be ambiguous.
This makes it clear whether we're referring to the device class or to
something else.
This is not any longer to type, considering it lets us get rid of the
Device namespace, which is now wholly unnecessary.
There are no functional changes in this commit.
A future commit will fix unnecessarily qualified names.
According to the C standard, an offsetof expression must evaluate to an
address constant, otherwise it's undefined behavior.
Fixes https://bugs.dolphin-emu.org/issues/12409
See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95942
There are still improper uses of offsetof (mostly in JitArm64) but
fixing that will take more effort since there's a PPCSTATE_OFF wrapper
macro that is sometimes used with non-array members and sometimes used
with arrays and variable indices... Let's keep that for another PR.
Fixes the expression window being spammed with the first entry in the
Operators or Functions select menus when scrolling the mouse wheel while
hovering over them.
Fixes https://bugs.dolphin-emu.org/issues/12405
The dolphin-redirect.php script seems to have been present since 2012
at least, but we accidentally stopped using it when the "open wiki"
feature was reimplemented in DolphinQt2 in 2016.
<@delroth> dolphin-redirect.php is slightly smarter and tries to find gameid aliases for e.g. same region
<@delroth> uh, I mean different region
PR 9262 added a bunch of Jit64 optimizations, some of
which were already in JitArm64 and some which weren't.
This change ports the latter ones to JitArm64.
Let's reset m_last_used for each register that will be used
in an instruction before we start allocating any of them,
so that one of the earlier allocations doesn't spill a
register that we want in a later allocation. (We must still
also increment/reset m_last_used in R and RW, otherwise we
end up in trouble when emulating lmw/stmw since those access
more guest registers than there are available host registers.)
This should ensure that the asserts added earlier in this
pull request are never triggered.
If the register pressure is high when allocating registers,
Arm64FPRCache may spill a guest register which we are going to
allocate later during the current instruction, which has the
side effect of turning it into double precision. This will have
bad consequences if we are assuming that it is single precision,
so let's add some asserts to detect if that ever happens.
This doesn't really add any new optimizations, but fixes an issue that
prevented the optimizations introduced in #8551 and #8755 from being
applied in specific cases. A similar issue was solved for subfx as part
of #9425.
Consider the case where the destination register is also an input
register and happens to hold an immediate value. This results in a set
of constraints that forces the RegCache to allocate a register and move
the immediate value into it for us. By the time we check for immediate
values in the JIT, we're too late.
We solve this by refactoring the code in such a way that we can check
for immediates before involving the RegCache.
- Example 1
Before:
41 BF 00 68 00 CC mov r15d,0CC006800h
44 03 FF add r15d,edi
After:
44 8D BF 00 68 00 CC lea r15d,[rdi-33FF9800h]
- Example 2
Before:
41 BE 00 00 00 00 mov r14d,0
44 03 F7 add r14d,edi
After:
44 8B F7 mov r14d,edi
- Example 3
Before:
41 BD 03 00 00 00 mov r13d,3
44 03 6D 8C add r13d,dword ptr [rbp-74h]
After:
44 8B 6D 8C mov r13d,dword ptr [rbp-74h]
41 83 C5 03 add r13d,3
PR #2663 added a Jit64 implementation of dcbX and a fast path to skip JIT cache invalidation. Unfortunately, there is a mismatch between address spaces in this optimization. It tests the effective address (with the top 3 bits cleared) against the valid block bitset which is intended to be indexed by physical address. While this works in the common case, it fails (for example) when the effective address is in the 7E... region (a.k.a. "fake VMEM"). This may also fall apart under more complex memory mapping scenarios requiring full MMU emulation.
The good news is that even without this fast path, the underlying call to JitInterface::InvalidateICache() still does very little work in the common case. It correctly translates the effective address to a physical address which it tests against the valid block bitset, skipping invalidation if it is not necessary. As such, the cost of removing the fast path should not be too high.
The Jit64 implementation is retained, though all it does now is emit a call. This is marginally more efficient than simple interpreter fallback, which involves an extra call. The JitArm64 implementation has also been fixed.
The game Happy Feet is fixed by this change, as it loads code in the 7E... address region and depends upon JIT cache invalidation in reponse to dcbf.
https://bugs.dolphin-emu.org/issues/12133
At least on some CPUs (I found out about this from the
Arm Cortex-A76 Software Optimization Guide), using X30
with BLR is one cycle slower than using another register.
Previously, eaddr would only be partially initialized in the ipv6 case.
Even if there's no support for it, we may as well ensure that the
variable always has deterministic initialization.
While we're at it, we can make the parameter a const reference, given no
members are modified.
EmulationActivity has an instance of Settings. If you go to
SettingsActivity from EmulationActivity and change some settings,
the changes get saved to disk, but EmulationActivity's Settings
instance still contains the old settings in its map of all
settings (assuming the EmulationActivity was not killed by the
system to save memory). Then, once you're done playing your
game and exit EmulationActivity, EmulationActivity calls
Settings.saveSettings. This call to saveSettings first overwrites
the entire INI file with its map of all settings (which is
outdated) in order to save any legacy settings that have changed
(which they haven't, since the GUI doesn't let you change legacy
settings while a game is running). Then, it asks the new config
system to write the most up-to-date values available for non-legacy
settings, which should make all the settings be up-to-date again.
The problem here is that the new config system would skip writing
to disk if no settings changes had been made since the last time
we asked it to write to disk (i.e. since SettingsActivity exited).
NB: Calling Settings.loadSettings in EmulationActivity.onResume
is not a working solution. I assume this is because
SettingsActivity saves its settings in onStop and not onPause.
For certain occurrences of nandx/norx, we declare a ReadWrite constraint
on the destination register, even though the value of the destination
register is irrelevant. This false dependency would force the RegCache
to generate a redundant MOV when the destination register wasn't already
assigned to a host register.
Example 1:
BF 00 00 00 00 mov edi,0
8B FE mov edi,esi
F7 D7 not edi
Example 2:
8B 7D 80 mov edi,dword ptr [rbp-80h]
8B FE mov edi,esi
F7 D7 not edi
Also avoid files without a name before the extension (name: ".ini")
from being added to the list because then they wouldn't be saveable
and it would appear with an empty name anyway.
This function has been marked as obsolete. In Qt 6.0 it's removed
entirely, so we must use getContentsMargin() explicitly instead
(margin() would do this for us).
Ditto for setMargin(), in which case we use setContentsMargin instead.
setMargin() would just pass its argument to all four parameters of
setContentsMargin(), so we can do the same.
This literal was deprecated in 5.14.0. Not to mention it wasn't
documented as part of the API either: see the 5.14.0 changelog here:
https://code.qt.io/cgit/qt/qtbase.git/tree/dist/changes-5.14.0?h=v5.14.0
On Qt 6.0 this define is removed entirely. To stay forward compatible,
we can make use of QStringLiteral instead.
This skips a potentially costly loop if volume is 100% or 0%,
as for former there is no need for volume adjustment,
while latter can be solved by specifying a AUDCLNT_BUFFERFLAGS_SILENT flag
This fixes numerous resource leaks, as not every return path cleaned every created resource
Now they are all managed automatically and "commited" to WASAPIStream class fields only
after it's certain they initialized properly
When a game is selected, the option to add a shortcut of the game to the desktop is given. Uses native Windows API since Qt lacks support for adding shortcuts.
FinalizeCarryOverflow didn't maintain XER[OV/SO] properly due to an
oversight. Here's the code it would generate:
0: 9c pushf
1: 80 65 3b fe and BYTE PTR [rbp+0x3b],0xfe
5: 71 04 jno b <jno>
7: c6 45 3b 03 mov BYTE PTR [rbp+0x3b],0x3
000000000000000b <jno>:
b: 9d popf
At first glance it seems reasonable. The host flags are carefully
preserved with PUSHF. The AND instruction clears XER[OV]. Next, an
conditional branch checks the host's overflow flag and, if needed, skips
over a MOV that sets XER[OV/SO]. Finally, host flags are restored with
POPF.
However, the AND instruction also clears the host's overflow flag. As a
result, the branch that follows it is always taken and the MOV is always
skipped. The end result is that XER[OV] is always cleared while XER[SO]
is left unchanged.
Putting POPF immediately after the AND would fix this, but we already
have GenerateOverflow doing it correctly (and without the PUSHF/POPF
shenanigans too). So let's just use that instead.
Happens in Super Mario Sunshine. You could probably do something similar
for b == -1 (like we do for subfic), but I couldn't find any titles that
do this.
- Case 1: d == a
Before:
41 8B C7 mov eax,r15d
41 BF 00 00 00 00 mov r15d,0
44 2B F8 sub r15d,eax
After:
41 F7 DF neg r15d
- Case 2: d != a
Before:
BF 00 00 00 00 mov edi,0
41 2B FD sub edi,r13d
After:
41 8B FD mov edi,r13d
F7 DF neg edi
Consider the case where d and a refer to the same PowerPC register,
which is known to hold an immediate value by the RegCache. We place a
ReadWrite constraint on this register and bind it to an x86 register.
The RegCache then allocates a new register, initializes it with the
immediate, and returns a RCX64Reg for both d and a.
At this point information about the immediate value becomes unreachable.
In the case of subfx, this generates suboptimal code:
Before 1:
BF 1E 00 00 00 mov edi,1Eh <- done by RegCache
8B C7 mov eax,edi
8B FE mov edi,esi
2B F8 sub edi,eax
Before 2:
BE 00 AC 3F 80 mov esi,803FAC00h <- done by RegCache
8B C6 mov eax,esi
8B 75 EC mov esi,dword ptr [rbp-14h]
2B F0 sub esi,eax
The solution is to explicitly handle the constant a case before having
the RegCache allocate registers for us.
After 1:
8D 7E E2 lea edi,[rsi-1Eh]
After 2:
8B 75 EC mov esi,dword ptr [rbp-14h]
81 EE 00 AC 3F 80 sub esi,803FAC00h
The special case doesn't appear to make a significant difference in any games, and the current implementation has a (minor, fixable) issue that breaks Super Mario Sunshine (both with a failed assertion (https://bugs.dolphin-emu.org/issues/11742) and a rendering issue (https://bugs.dolphin-emu.org/issues/7476)). Hardware testing wasn't able to reproduce the special case, either, so it may just not exist.
PR #9315 contains a fixed implementation of the special case on all video backends, and can serve as a basis for it being reintroduced if it is found to exist under more specific circumstances. For now, I don't see a reason to keep it present.
-If adding 2 devices with the same name, they their unique id wouldn't be increased, causing a conflict.
-Removing a device wouldn't actually remove it from the internal devices list because the list of devices had already been updated when going through it.
-It was possible to remove devices belonging to other sources by adding a device with the same name and then removing it.
The name was confusing as changing it at runtime would not change the window to fullscreen, as it effectively only affects the start of the emulation.
Also blocked the ability to change it when the emulation is running, to be more inline with other similar settings, like "Render to main Window".
Also provides operator!= for logical symmetry.
We can also take the arguments by value, as the arguments are trivially
copyable enum values which fit nicely into registers already.
https://bugs.dolphin-emu.org/issues/6749
This change fixes the scratchy audio in Teenage Mutant Ninja Turtles (SX7E52/SX7P52). The game starts an audio interface DMA with an unaligned address, and because Dolphin was not masking off the low 5 bits of AUDIO_DMA_START_LO, all future AI DMAs were misaligned. To understand why, it is instructive to refer to AUDIO_InitDMA() in libogc, which behaves the same as the official SDK:
_dspReg[25] = (_dspReg[25]&~0xffe0)|(startaddr&0xffff);
The implementation does not mask off the low bits of the passed in value before it ORs them with low bits of the current register value. Therefore, if they are not masked off by the hardware itself, they become permanently stuck once set.
Adding a write mask for AUDIO_DMA_START_LO is enough to fix the bug in TMNT, but I decided to run some tests on GC and Wii to find the correct write masks for the surrounding registers, as only a couple were already being masked. Dolphin has gotten away with not masking the rest because many are already A) masked on read (or never read) by the SDK and/or B) masked on use (or never used) in Dolphin.
This leaves just three registers where the difference may be observable: AR_DMA_CNT_H and AUDIO_DMA_START_HI/LO.
operator[] performs a default construction if an object at the given key
doesn't exist before overwriting it with the one we provide in operator=
insert_or_assign performs optimal insertion by avoiding the default
construction if an entry doesn't exist.
Not a game changer, but it is essentially a "free" change.
Allows lookups to be done with std::string_view or any other string
type. This allows for non-allocating strings to be used with the name
lookup without needing to construct a std::string.
Cleans up some locks that explicitly specify the recursive mutex type in
it. Meant to be included with the previous commit that cleaned out
regular mutexes, but I forgot.
This code was storing references to patch entries which could move around in memory if a patch was erased from the middle of a vector or if the vector itself was reallocated. Instead, NewPatchDialog maintains a separate copy of the patch entries which are committed back to the patch if the user accepts the changes.
These games are erroneously zeroing buffers before they can be fully copied to ARAM by DMA. The responsible memset() calls are followed by a call to DVDRead() which issues dcbi instructions that effectively cancel the memset() on real hardware. Because Dolphin lacks dcache emulation, the effects of the memset() calls are observed, which causes missing audio.
In a comment on the original bug, phire noted that the issue can be corrected by simply nop'ing out the offending memset() calls. Because the games dynamically load different .rel executables based on the character and/or language, the addresses of these calls can vary.
To deal generally with the problem of code being dynamically loaded to fixed, known addresses, the patch engine is extended to support conditional patches which require a match against a known value. This sort of thing is already achievable with Action Replay/Gecko codes, but their use depends on enabling cheats globally in Dolphin, which is not a prerequisite shared by patches.
Patches are included for every region, character, and language combination. They are enabled by default.
The end result is an approximation of the games' behavior on real hardware without the associated complexity of proper dcache emulation.
https://bugs.dolphin-emu.org/issues/9840
The config version should always be incremented whenever config is
changed, regardless of callbacks being suppressed or not.
Otherwise, getters can return stale data until another config change
(with callbacks enabled) happens.
C++17 allows omitting the mutex type, which makes for both less reading
and more flexibility (e.g. The mutex type can change and all occurrences
don't need to be updated).
Allows the analyzer to exist independently of the DSP structure. This
allows for unit-tests to be created in a nicer manner.
SDSP is only necessary during the analysis phase, so we only need to
keep a reference around to it then as opposed to the entire lifecycle of
the analyzer.
This also allows the copy/move assignment operators to be defaulted, as
a reference member variable prevents that.
Now that we have the convenience functions around the flag
bit manipulations, there's no external usages of the flags, so we can
make these private to the analyzer implementation.
Now the Analyzer namespace is largely unnecessary and can be merged with
the DSP namespace in the next commit.
This commit implements the following commands:
* open
* close
* GetMode
* SetLinkState (used to actually trigger scanning)
* GetLinkState (used to check if the driver is in the expected state)
* GetInfo
* RecvFrame and RecvNotification (stubbed)
* Disassociate (stubbed)
GetInfo was already implemented, but the structure wasn't initialized
correctly so the info was being rejected by official titles.
That has also been fixed in this commit.
Some of the checks may seem unimportant but official titles actually
require WD to return error codes... Failing to do so can cause hangs
and softlocks when DS communications are shut down.
This minimal implementation is enough to satisfy the Mii channel
and all other DS games, except Tales of Graces (https://dolp.in/i11977)
which still softlocks because it probably requires us to actually
feed it frame data.
NCD returns an error if it receives a request to lock the driver
when it is already locked.
Emulating this may seem pointless, but it turns out PPC-side code
expects NCD to return an error and will immediately fail and stop
initialising wireless stuff if NCD succeeds.
Localizes code that modifies m_dsp into the struct itself. This reduces
the overal coupling between DSPCore and SDSP by reducing access to its
member variables.
This commit is only code movement and has no functional changes.
An unfortunately large single commit that deglobalizes the DSP code.
(which I'm very sorry about).
This would have otherwise been extremely difficult to separate due to
extensive use of the globals in very coupling ways that would result in
more scaffolding to work around than is worth it.
Aside from the video code, I believe only the DSP code is the hairiest
to deal with in terms of globals, so I guess it's best to get this dealt
with right off the bat.
A summary of what this commit does:
- Turns the DSPInterpreter into its own class
This is the most involved portion of this change.
The bulk of the changes are turning non-member functions into member
functions that would be situated into the Interpreter class.
- Eliminates all usages to globals within DSPCore.
This generally involves turning a lot of non-member functions into
member functions that are either situated within SDSP or DSPCore.
- Discards DSPDebugInterface (it wasn't hooked up to anything,
and for the sake of eliminating global state, I'd rather get rid of
it than think up ways for this class to be integrated with
everything else.
- Readjusts the DSP JIT to handle calling out to member functions.
In most cases, this just means wrapping respective member function
calles into thunk functions.
Surprisingly, this doesn't even make use of the introduced System class.
It was possible all along to do this without it. We can house everything
within the DSPLLE class, which is quite nice =)
Shifting zero by any amount always gives zero.
Before:
41 B9 00 00 00 00 mov r9d,0
41 8B CF mov ecx,r15d
49 C1 E1 20 shl r9,20h
49 D3 F9 sar r9,cl
49 C1 E9 20 shr r9,20h
After:
Nothing, register is set to constant zero.
Before:
41 B8 00 00 00 00 mov r8d,0
41 8B CF mov ecx,r15d
49 C1 E0 20 shl r8,20h
49 D3 F8 sar r8,cl
41 8B C0 mov eax,r8d
49 C1 E8 20 shr r8,20h
44 85 C0 test eax,r8d
0F 95 45 58 setne byte ptr [rbp+58h]
After:
C6 45 58 00 mov byte ptr [rbp+58h],0
Occurs a bunch of times in Super Mario Sunshine. Since this is an
arithmetic shift a similar optimization can be done for constant -1
(0xFFFFFFFF), but I couldn't find any game where this happens.
Shifting zero by any amount always gives zero.
Before:
41 BF 00 00 00 00 mov r15d,0
8B CF mov ecx,edi
49 D3 E7 shl r15,cl
45 8B FF mov r15d,r15d
After:
Nothing, register is set to constant zero.
All games I've tried hit this optimization on launch. In Soul Calibur II
it occurs very frequently during gameplay.
Much like we did for srawx. This was already implemented on JitArm64.
Before:
B8 00 00 00 00 mov eax,0
8B F0 mov esi,eax
C1 E8 1F shr eax,1Fh
23 C6 and eax,esi
D1 FE sar esi,1
88 45 58 mov byte ptr [rbp+58h],al
After:
C6 45 58 00 mov byte ptr [rbp+58h],0
If both input registers hold known values at compile time, we can just
calculate the result on the spot.
Code has mostly been copied from JitArm64 where it had already been implemented.
Before:
BF FF FF FF FF mov edi,0FFFFFFFFh
8B C7 mov eax,edi
C1 FF 10 sar edi,10h
C1 E0 10 shl eax,10h
85 F8 test eax,edi
0F 95 45 58 setne byte ptr [rbp+58h]
After:
C6 45 58 01 mov byte ptr [rbp+58h],1
More efficient code can be generated if the shift amount is known at
compile time. We can once again take advantage of shifts with the shift
amount in an 8-bit immediate to eliminate ECX as a scratch register,
reducing register pressure and removing the occasional spill. We can
also do 32-bit shifts instead of 64-bit operations.
We recognize four distinct cases:
- The special case where we're dealing with the PowerPC's quirky shift
amount masking. If the shift amount is a number from 32 to 63, all
bits are shifted out and the result it either all zeroes or all ones.
Before:
B9 F0 FF FF FF mov ecx,0FFFFFFF0h
8B F7 mov esi,edi
48 C1 E6 20 shl rsi,20h
48 D3 FE sar rsi,cl
8B C6 mov eax,esi
48 C1 EE 20 shr rsi,20h
85 F0 test eax,esi
0F 95 45 58 setne byte ptr [rbp+58h]
After:
8B F7 mov esi,edi
C1 FE 1F sar esi,1Fh
0F 95 45 58 setne byte ptr [rbp+58h]
- The shift amount is zero. Not calculation needs to be done, just clear
the carry flag.
Before:
B9 00 00 00 00 mov ecx,0
49 C1 E5 20 shl r13,20h
49 D3 FD sar r13,cl
41 8B C5 mov eax,r13d
49 C1 ED 20 shr r13,20h
44 85 E8 test eax,r13d
0F 95 45 58 setne byte ptr [rbp+58h]
After:
C6 45 58 00 mov byte ptr [rbp+58h],0
- The carry flag doesn't need to be computed. Just do the arithmetic
shift.
Before:
B9 02 00 00 00 mov ecx,2
48 C1 E7 20 shl rdi,20h
48 D3 FF sar rdi,cl
48 C1 EF 20 shr rdi,20h
After:
C1 FF 02 sar edi,2
- The carry flag must be computed. In addition to the arithmetic shift,
we do a shift to the left and and them together to know if any ones
were shifted out. It's still better than before, because we can do
32-bit shifts.
Before:
B9 02 00 00 00 mov ecx,2
49 C1 E5 20 shl r13,20h
49 D3 FD sar r13,cl
41 8B C5 mov eax,r13d
49 C1 ED 20 shr r13,20h
44 85 E8 test eax,r13d
0F 95 45 58 setne byte ptr [rbp+58h]
After:
41 8B C5 mov eax,r13d
41 C1 FD 02 sar r13d,2
C1 E0 1E shl eax,1Eh
44 85 E8 test eax,r13d
0F 95 45 58 setne byte ptr [rbp+58h]
More efficient code can be generated if the shift amount is known at
compile time. Similar optimizations were present in JitArm64 already,
but were missing in Jit64.
- By using an 8-bit immediate we can eliminate the need for ECX as a
scratch register, thereby reducing register pressure and occasionally
eliminating a spill.
Before:
B9 18 00 00 00 mov ecx,18h
41 8B F7 mov esi,r15d
48 D3 E6 shl rsi,cl
8B F6 mov esi,esi
After:
41 8B CF mov ecx,r15d
C1 E1 18 shl ecx,18h
- PowerPC has strange shift amount masking behavior which is emulated
using 64-bit shifts, even though we only care about a 32-bit result.
If the shift amount is known, we can handle this special case
separately, and use 32-bit shift instructions otherwise. We also no
longer need to clear the upper 32 bits of the register.
Before:
BE F8 FF FF FF mov esi,0FFFFFFF8h
8B CE mov ecx,esi
41 8B F4 mov esi,r12d
48 D3 E6 shl rsi,cl
8B F6 mov esi,esi
After:
Nothing, register is set to constant zero.
- A shift by zero becomes a simple MOV.
Before:
BE 00 00 00 00 mov esi,0
8B CE mov ecx,esi
41 8B F3 mov esi,r11d
48 D3 E6 shl rsi,cl
8B F6 mov esi,esi
After:
41 8B FB mov edi,r11d
More efficient code can be generated if the shift amount is known at
compile time. Similar optimizations were present in JitArm64 already,
but were missing in Jit64.
- By using an 8-bit immediate we can eliminate the need for ECX as a
scratch register, thereby reducing register pressure and occasionally
eliminating a spill.
Before:
B9 18 00 00 00 mov ecx,18h
45 8B C1 mov r8d,r9d
49 D3 E8 shr r8,cl
After:
45 8B C1 mov r8d,r9d
41 C1 E8 18 shr r8d,18h
- PowerPC has strange shift amount masking behavior which is emulated
using 64-bit shifts, even though we only care about a 32-bit result.
If the shift amount is known, we can handle this special case
separately, and use 32-bit shift instructions otherwise.
Before:
B9 F8 FF FF FF mov ecx,0FFFFFFF8h
45 8B C1 mov r8d,r9d
49 D3 E8 shr r8,cl
After:
Nothing, register is set to constant zero.
- A shift by zero becomes a simple MOV.
Before:
B9 00 00 00 00 mov ecx,0
45 8B C1 mov r8d,r9d
49 D3 E8 shr r8,cl
After:
45 8B C1 mov r8d,r9d
The enumerated LOG_TYPE "OSREPORT" is currently used in both EXI_DeviceIPL.cpp and HLE_OS.cpp. In many games, the multitude of game functions detected by HLE_OS.cpp for OSREPORT logging results in poor log readability. This Pull Request remedies that by adding a new enumerated LOG_TYPE "OSREPORT_HLE" for log usage in HLE_OS.cpp.
In the future, further changing how logging in HLE_OS.cpp works may be desirable. As it is, game functions are detected that send a single character to the log. This is a major source of poor readability.
Introduces the system class that will eventually contain all relevant
system state, as opposed to everything being distributed all over the
place as global variables.
Throughout the codebase we have code that from its interface-view, does
not actually require its dependencies to be described in the interface,
and we routinely run into issues with initialization where we sometimes
make use of a facility before it's been initialized, which leads to
annoying to debug cases, because the reader needs to run through the
codebase and see what order things get initialized in, and how they're
being used. This is particularly a frequent issue in the video code.
Further, we also have a lot of code that makes use of file-scope
variables (many of which are non-trivial), which must all be default
initialized before the application can actually enter main(). While this
may not be a huge issue in itself, some of these are allocating, which
means that the application may need to use memory that it otherwise
wouldn't need to (e.g. when a game isn't running, this excess memory is
being used).
Being able to wrap all these subsystems into objects would be nicer,
since they can be constructed when they're actually needed. Them being
objects also means we can better express dependencies on subsystems as
types directly in the interface, making them explicit to the reader
instead of a change randomly blowing up, said reader inspecting it, and finding
out that something needed to be initialized beforehand. With the global
turned into a function parameter, the dependency is explicit and they
know just by reading it, that the given subsystem needs to be in a valid
state before calling the function.
For a prior example of an emulator that has moved to this model, see
yuzu, which has been migrated off of global variables all over the place
and replaced with a system instance (which has now reached the stage,
where the singleton can be removed).
We want to clear/memset the padding bytes, not just each member,
so using assignment or {} initialization is not an option.
To silence the warnings, cast the object pointer to u8* (which is not
undefined behavior) to make it explicit to the compiler that we want
to fill the object representation.
TunTap has recently become unmaintained, and it seems Apple wants developers to move away from kexts in general. TunTap currently takes some finagling to work on Catalina, and it may not work at all on Big Sur, necessitating a non-kext-based solution. Fortunately, fake Ethernet devices were introduced in Sierra and can be used similarly to tap adapters. This commit adds a new type of BBA interface implementation which uses fake Ethernet devices via tapserver (https://github.com/fuzziqersoftware/tapserver) to communicate with the host. This implementation was tested with PSO Episodes I & II, which can successfully connect to a private server running locally.
This implementation is only available on macOS, since that's the only place it's needed - Windows/Linux/Unix are unaffected by TunTap being deprecated.
Make sure m_is_populating_devices is true when a WM_INPUT_DEVICE_CHANGE
event is received directly on the ciface thread, so that callbacks do
not occur while removing devices. This breaks a hold-and-wait deadlock
between the ciface thread and the CPU thread when using emulated
Wiimotes.
Co-authored-by: brainleq <brainleq@users.noreply.github.com>
Co-authored-by: oldmud0 <oldmud0@users.noreply.github.com>
If we want to enable codes in the default game INIs,
we should have some way for users to disable them.
This commit accomplishes that by adding a *_Disabled
section corresponding to each *_Enabled section.
In order to reach the middle guard (at m_stack_base + GUARD_OFFSET)
before the bottom guard (at m_stack_base), the stack pointer
must start at an address which is higher than the middle guard.
It also didn't make sense that we were allocating memory
and then not using the top part of it.
Nintendo's official title installation code and ES both only look at
content IDs but we should probably check for content hashes in addition
to checking for IDs for at least two reasons:
1. Some of the installed contents could be corrupted -- this cannot be
easily detected without checking hashes.
2. Some mod distributors do not bother to update content IDs, which
means that installing updates from the UI would not actually
update the installed game. This is confusing for users.
To keep the existing semantic (for IOS especially), the new content
hash checks are opt-in for callers of GetStoredContentsFromTMD.
This commit changes WiiUtils's WAD installation logic to enable
the content hash checks.
Adds a flag to File::Delete and File::DeleteDir functions to control
whether a console warning is emitted when the file or directory doesn't
exist. The flag is optional and true by default to match current behavior.
Makes File::DeleteDir return true when attempting to delete a
nonexistent path.
The purpose of DeleteDir is to ensure the path doesn't exist after the
call, which is better reflected by the new return value. Additionally,
none of the current callers actually check the return value so this
won't break any existing code.
Fixes https://bugs.dolphin-emu.org/issues/12327.
When we started using fmt in CheckExternalExceptions, JitArm64
mysteriously stopped working even though the code path where
fmt was used never was reached. This is because the compiler
added a function prologue and epilogue to set up the stack,
since the code path that used fmt required the use of the stack.
However, the breakage didn't actually have anything to do
with the usage of the stack in itself, but rather with the
compiler's insertion of a stack canary. In the function
epilogue, a cmp instruction was inserted to check that the
stack canary had not been overwritten during the execution
of the function. This cmp instruction overwriting the status
flags ended up having a disastrous side effect once execution
returned to code emitted by JitArm64::WriteExceptionExit.
JitArm64's dispatcher contains a branch to the "do_timing"
code which is intended to be taken if the PPC downcount is
negative. However, the dispatcher doesn't update the status
flags on its own before this conditional branch, but rather
expects the calling code to have set them as a side effect
of DoDownCount. The root cause of our bug was that
JitArm64::WriteExceptionExit was calling DoDownCount before
Check(External)Exceptions instead of after.
1. Comparing string_views does not behave the same as strncmp
in the case where SConfig's game ID is longer than 6 chars.
2. DTMHeader::GetGameID wasn't excluding null bytes for game IDs
shorter than 6 chars.
3. == was accidentally used instead of !=.
This issue is both severe and surprisingly difficult to find
the root cause of, so I think it would make sense to add a simple
hotfix for now. https://bugs.dolphin-emu.org/issues/12327
Fallback Region
A user-selected fallback to use instead of the default PAL
This is used for unknown region or region free titles to give them
the ability to force region to use. This replaces the current fallback region
of PAL. This can be useful if a user is trying to play a region free
tilte that is originally NTSC and expects to be run at NTSC speeds. This
may be done when a user attempts to dump a WAD of their own without
understanding the settings they have chosen, or could be an intentional
decision by a developer of a ROM hack that can be injected into a
Virtual Console WAD.
Remove using System Menu region being checked in GetFallbackRegion
Use DiscIO::Region instead of std::String for fallback
Add explanation text for Fallback Region
Unfortunately, adding a DEBUG_ASSERT_MSG_FMT isn't actually possible
right now because of compiler bugs:
https://github.com/dolphin-emu/dolphin/pull/9284
We could require a newer version of GCC (10) but that would require
updating GCC on the build machines.
For what it's worth, older versions of GCC (8, 9) are broken in
many ways: adding constexpr to some Matrix functions causes GCC 8
to generate bugged code that causes the Wii IR pointer to disappear,
which means that the generated builds are already unusable
(see https://dolp.in/i12324).
Additionally, we've already had to add workarounds for those versions
in the format macros to fix compilation bugs. This time, it looks like
workarounds won't cut it; even applying the workaround
described in https://github.com/fmtlib/fmt/pull/1580 does not help.
On Windows, when the Rename function fails to replace an existing file
it will now retry the operation multiple times with increasingly long
delays between attempts. This fixes transient rename failures.
I've been getting sporadic yet annoyingly frequent errors saying:
'IOS_FS: Failed to rename temporary FST file'
These typically appear on startup but I've also gotten them randomly.
Investigation shows this happens when the Windows ReplaceFile function
returns the error ERROR_UNABLE_TO_REMOVE_REPLACED. That happens in the
context of using ReplaceFile to perform an atomic file overwrite, which
is required when saving updates to a file to avoid corruption. The
error mainly happens with the /Wii/fst.bin file but I've seen it
happen with multiple other files as well.
I haven't been able to definitively pin down why the error occurs,
though online discussions suggest antivirus scanning may be a major
culprit. That said, I've excluded the Dolphin folder from Windows
Defender scans to no avail and don't have any other antivirus running,
so this is likely to be a problem others are experiencing as well.
The number and duration of retry delays is arbitrary but I feel like a
combined second or so in the worst case is an acceptable tradeoff for
the reduction (actually elimination in my experience) of those errors.
This is even more true when you consider the time it takes to read and
dismiss the error dialogs.
Converts the remaining PowerPC code over to fmt-capable logging.
Now, all that's left to convert over are the lingering remnants within
the frontend code.
The way Config::Get works in master, it first calls
Config::GetActiveLayerForConfig which searches for the
setting in all layers, and then calls Config::Layer::Get
which searches for the same setting again within the given
layer. We can remove this second search by combining the
logic of Config::GetActiveLayerForConfig and
Config::Layer::Get into one function.
gameID isn't null terminated since it is just an std::array<char, 6>
and .data() returns a char* so {fmt} would go way beyond the bounds of
the array when it attempts to determine the length of the string.
The fix is to pass a std::string_view to {fmt}. This commit adds
a GetGameID() function that can also be used to simplify
string comparisons.
Cel-damage depends on lighting being calculated for the first channel
even though there is no color in the vertex format (defaults to the
material color). If lighting for the channel is not enabled, the vertex
will use the default color as before.
The default value of the color is determined by the number of elements in
the vertex format. This fixes the grey cubes in Super Mario Sunshine.
If the color channel count is zero, we set the color to black before the
end of the vertex shader. It's possible that this would be undefined
behavior on hardware if a vertex color index that was greater than the
channel count was used within TEV.
PR #9260 made the MsgAlert macros use lambdas so that local
constexpr variables can be added while keeping the ability to
return a boolean from the macros.
Unfortunately, C++17 forbids referring to structured bindings in lambda
captures. This is fixed in P1091R3 but we cannot rely on C++20 yet...
In CreateDescriptorSetLayouts(), one less dynamic binding is created when
bSupportsGeometryShaders=false. Reduce the dynamicOffsetCount argument by
one in that case. Avoids this validation error:
Attempting to bind 3 descriptorSets with 2 dynamic descriptors, but
dynamicOffsetCount is 3. It should exactly match the number of dynamic
descriptors. The Vulkan spec states: dynamicOffsetCount must be equal to
the total number of dynamic descriptors in pDescriptorSets
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
[Applied clang-format]
Signed-off-by: Léo Lam <leo@leolam.fr>
Unfortunately, {fmt} allows passing too many arguments to a format call
without raising any runtime or compile-time error [1].
As this is a common source of bugs since we started migrating to {fmt},
this commit adds some custom logic to validate the number of
replacement fields in format strings in addition to {fmt}'s own checks.
[1] https://github.com/fmtlib/fmt/issues/492
We want to use positional arguments in translatable strings
that have more than one argument so that translators can change
the order of them, but the question is: Should we also use
positional arguments in translatable strings with only one
argument? I think it makes most sense that way, partially
so that translators don't even have to be aware of the
non-positional syntax and partially because "translatable
strings use positional arguments" is an easier rule for us
to remember than "transitional strings which have more than
one argument use positional arguments". But let me know if
you have a different opinion.
tr calls with more than one argument would have a 0x04 byte
in the returned string when no translation was found
(which always is the case when using Dolphin in English).
They are unused, since there is no C++ code that touches
these settings. See the discussion in PR 9152, which is
a PR that adds a lot more Android-specific settings.
I moved it from the main settings screen to the in-game menu
in PR 8439 so that it could be changed while a game is running,
but now that the main settings can be accessed while a game is
running, there's no reason to not put it in the main settings.
https://bugs.dolphin-emu.org/issues/12067
Adds an interface that uses fmt under the hood, which is much more
flexible than printf, particularly for localization purposes, given fmt
supports positional formatters in a cross-platform manner out of the box
with no configuration necessary.
Time for yet another new iteration of working around the
"surface destruction during boot" problem...
This time, the strategy is to use a mutex in MainAndroid.cpp.
Now that we've converted all of the shader generators over to using fmt,
we can drop the old Write() member function and perform a rename
operation on the WriteFmt() to turn it into the new Write() function.
All changes within this are the removal of a <cstdarg> header, since the
previous printf-based Write() required it, and renaming. No functional
changes are made at all.
Noticed missing include as a build failure on gcc-11:
```
[ 15%] Building CXX object Source/Core/Common/CMakeFiles/common.dir/Config/Config.cpp.o
Source/Core/Common/Config/Config.cpp:23:24:
error: 'unique_lock' in namespace 'std' does not name a template type
23 | using WriteLock = std::unique_lock<std::shared_mutex>;
| ^~~~~~~~~~~
Source/Core/Common/Config/Config.cpp:11:1:
note: 'std::unique_lock' is defined in header '<mutex>';
did you forget to '#include <mutex>'?
```
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Completes the migration over to using the fmt-formatting WriteFmt
function. The next PR will rename all usages of WriteFmt, while
simultaneously getting rid of the old printf code.
Unfortunately, compilers will issue warnings when using offsetof with
non-standard layout types even when offsetof actually works fine here;
just having a virtual function is enough to trigger the warning...
Let's just stop the scan threads explicitly in destructors instead of
relying on member destruction order.
Much of these classes are operating on integral types and are pretty
standard behavior as far as vectors go. Some member functions can be
made constexpr to make them more flexible and allow them to be used in
constexpr contexts.
Replace it with a function-local static that is initialized on first
use. This gets rid of a global variable and removes the need for
manual initialization in UICommon.
This commit also replaces the weird find_if that looks for a non-null
unique_ptr with a simple "is vector empty" check considering that
none of the pointers can be null by construction.
As in the comment, the panning was denaturalizing the volume (when the panning was at 0).
Unless users actually want to use panning, their wii mote volume should not be reduced, it should come out at the same volume the samples were.
If users want to reduce the volume of the wii mote speakers, they can do so from the wii home menu.
I opted for this instead of adding another setting "Enable Panning" as it would have been confusing, and the changes in the panning formula are unlikely to have any negative effect, as it still works.
Provides a basic extension to the interface to begin migration off of
the printf-based logging system.
Everything will go through macros with the same style naming as the old
logging system, except the macros will have the _FMT suffix, while the
migration is in process.
This allows for peacemeal migration over time instead of pulling
everything out and replacing it all in a single pull request, which
makes for much easier reviewing.
Common shouldn't be depending on APIs in Core (in this, case depending
on the PowerPC namespace). Because of the poor separation here, this
moves OSThread functionality into core, so that it resolves the implicit
dependency on core.
Continues migration of the shader generators over to fmt.
With this, all that's left to move over are the pixel shaders (regular
and ubershader variants)
Fixes the following warning:
../../../../../../Core\DiscIO/DirectoryBlob.h:156:3: warning: explicitly defaulted move constructor is implicitly deleted [-Wdefaulted-function-deleted]
DirectoryBlobReader(DirectoryBlobReader&&) = default;
^
../../../../../../Core\DiscIO/DirectoryBlob.h:205:22: note: move constructor of 'DirectoryBlobReader' is implicitly deleted because field 'm_encryption_cache' has a deleted move constructor
WiiEncryptionCache m_encryption_cache;
^
In case someone wants to be very careful with how much bandwidth
they use or with what data GameTDB.com collects on you.
This is already an option in DolphinQt (though in DolphinQt it
will switch entirely from using covers to banners when turned off).
Once nice benefit of fmt is that we can use positional arguments
in localizable strings. This a feature which has been
requested for the Korean translation of strings like
"Errors were found in %zu blocks in the %s partition."
and which will no doubt be useful for other languages too.
Noticed missing include as a build failure on gcc-11:
```
[ 26%] Building CXX object Source/Core/DiscIO/CMakeFiles/discio.dir/WIACompression.cpp.o
../../../../Source/Core/DiscIO/WIACompression.cpp: In lambda function:
../../../../Source/Core/DiscIO/WIACompression.cpp:170:31: error: 'numeric_limits' is not a member of 'std'
170 | std::min<size_t>(std::numeric_limits<unsigned int>().max(), x));
| ^~~~~~~~~~~~~~
../../../../Source/Core/DiscIO/WIACompression.cpp:170:46: error: expected primary-expression before 'unsigned'
170 | std::min<size_t>(std::numeric_limits<unsigned int>().max(), x));
| ^~~~~~~~
```
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
By calling ZSTD_CCtx_setPledgedSrcSize, we can let zstd know
how large a chunk is going to be before which start compressing
it, which lets zstd avoid allocating more memory than needed
for various internal buffers. This greatly reduces the RAM usage
when using a high compression level with a small chunk size,
and doesn't have much of an effect in other circumstances.
A side effect of calling ZSTD_CCtx_setPledgedSrcSize is that
zstd by default will write the uncompressed size into the
compressed data stream as metadata. In order to save space,
and since the decompressed size can be figured out through
the structure of the RVZ format anyway, we disable writing
the uncompressed size by setting ZSTD_c_contentSizeFlag to 0.
Unlike Super Paper Mario, this game doesn't crash as soon as you
try to start it, but rather if you try to skip a certain cutscene.
Thanks to JMC for letting me know about this.
For the non-packed variant of this instruction, a MOVSD instruction was
generated to copy only the lower 64 bits of XMM1 to the destination
register. This was done in order to keep the destination register's
upper half intact.
However, when register c and the destination register are the same,
there is no need for this copy. Because the registers match and due to
the way the mask is generated, BLENDVPD will end up taking the upper
half from the destination register, as intended.
Additionally, the MOVAPS to copy Rc into XMM1 can also be skipped.
Before:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
41 0F 28 CE movaps xmm1,xmm14
66 41 0F 38 15 CA blendvpd xmm1,xmm10,xmm0
F2 44 0F 10 F1 movsd xmm14,xmm1
After:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
66 45 0F 38 15 F2 blendvpd xmm14,xmm10,xmm0
For the non-packed variant of this instruction, a MOVSD instruction was
generated to copy only the lower 64 bits of XMM1 to the destination
register. This was done in order to keep the destination register's
upper half intact.
However, when register c and the destination register are the same,
there is no need for this copy. Because the registers match and due to
the way the mask is generated, VBLENDVPD will end up taking the upper
half from the destination register, as intended.
Before:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
C4 C3 09 4B CA 00 vblendvpd xmm1,xmm14,xmm10,xmm0
F2 44 0F 10 F1 movsd xmm14,xmm1
After:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
C4 43 09 4B F2 00 vblendvpd xmm14,xmm14,xmm10,xmm0
CommandLineParse expects UTF-8 strings. (QApplication, on the
other hand, seems to be designed so that you can pass in the
char** argv untouched on Windows and get proper Unicode handling.)
Any call to Config::SetCurrent will cause the relevant setting
to show up as overridden in the Android GUI, which can be confusing,
so let's not do it when the new value is the same as the original.
SYSCONF very much is saveable. Whether it's in IsSettingSaveable
or not hasn't mattered until now since the SYSCONF settings use
separate config loader code that doesn't check IsSettingSaveable,
but the next commit will require SYSCONF to be marked as saveable.
Adding AmdPowerXpressRequestHighPerformance
This will allow AMD drivers to detect the request to use the dGPU instead of the iGPU on compatible hybrid graphics systems.
Reference: https://community.amd.com/thread/169965
Fixes https://bugs.dolphin-emu.org/issues/12245.
I considered making a change to DolphinQt instead of
the core, but then additional effort would've been
required to add the same fix to the Android GUI once
we start using the new config system there.
It's possible (but rare) for a WIA or RVZ file to support
this for some partitions but not all, and for the game and
the blob code to disagree on how large a partition is.
In particular, I wanted to do change this in
AudioInterface::Init so that dumped GC audio doesn't need
to have a file split (changing from 32000 Hz to 32029 Hz)
when the emulated software initializes the AI registers.
I've also made the same change to DI's DTK code.
The heuristic was not allocating enough space for Metroid: Other M,
at least when using the default settings. (This didn't break the
file, it just caused some headers to be placed at the end of the
file instead of at the start and wasted a few hundred kilobytes.)
The new hash check catches essentially all desync problems
that VolumeVerifier can catch, so from the user's perspective,
such problems will result in Dolphin refusing to start the
game on netplay rather than actually getting a desync.
This moves the scan thread logic and variables into a separate
ScanThread class. By turning ScanThread instances into members of the
most derived class, this ensures that the scan thread is always
properly stopped when the most derived class is destructed and fixes
a race condition that could cause the scan thread to call virtual
member functions from a derived class whose members have already
been destructed.
A drawback of this approach is that the scan thread must be the last
member variable, so this commit also adds static assertions to ensure
that the assumption stays valid.
The "FindLibLZMA.cmake" module in CMake versions prior to 3.14 do not
set an alias like how Externals/liblzma/CMakeLists.txt does, so builds
performed using one of those older CMake versions will fail if the
system LZMA library is detected. To fix this, we need to link against
"lzma" instead of "LibLZMA::LibLZMA".
Fixes: b59ef81a7e ("WIA: Implement bzip2, LZMA, and LZMA2 decompression")
When I first made VolumeVerifier, I figured that the distinction
between an unsigned ticket and an unsigned TMD was a technical
detail that users would have no reason to care about. However,
while this might be true for discs, it isn't equally true for
WADs, due to the widespread practice of fakesigning tickets to
set the console ID to 0. This practice does not require
fakesigning the TMD (though apparently people do it anyway,
at least sometimes...), and the presence of a correctly signed
TMD is a useful indicator that the contents have not been
tampered with, even if the ticket isn't correctly signed.
FCVT doesn't necessarily round to zero, so the result
might be inaccurate if we use it. To ensure correct
rounding, we use FCVTS from double FPR to 32-bit GPR.
Unfortunately, FCVTS can't do double FPR to single FPR.
This is now unused. Seems like it was an improper fix
(there would be a race if saving the screenshot took longer
than 2 seconds) back when it was used too.
The intent here is to generate a more compact instruction if a 32-bit
immediate can be zero-extended to the desired 64-bit immediate.
Nowadays the emitter is smart enough to do this for us, so this logic is
redundant.
There's no need to load the 64-bit immediate into a temporary register.
x64 will sign-extend 32-bit immediates to 64 bits, giving us the exact
value we need in this case.
48 C7 C0 00 00 FF FF mov rax,0FFFFFFFFFFFF0000h
48 21 C2 and rdx,rax
48 81 E2 00 00 FF FF and rdx,0FFFFFFFFFFFF0000h
- LEA is a bit silly when the source and the destination are the same. A
simple ADD or SHL will do in those cases.
66 8D 04 45 00 00 00 00 lea ax,[rax*2]
66 03 C0 add ax,ax
48 8D 04 00 lea rax,[rax+rax]
48 03 C0 add rax,rax
66 8D 14 D5 00 00 00 00 lea dx,[rdx*8]
66 C1 E2 03 shl dx,3
- When scaling by 2, consider summing the register with itself instead.
The former always needs a 32-bit displacement, so the sum is more
compact.
66 8D 14 45 00 00 00 00 lea dx,[rax*2]
66 8D 14 00 lea dx,[rax+rax]
Other than the controller settings and JIT debug settings,
these are the only settings which were defined in Java code
but not defined in the new config system in C++. (There are
still a lot of settings that are defined in the new config
system but not yet saveable in the new config system, though.)
Instead of comparing the game ID, revision, disc number and name,
we can compare a hash of important parts of the disc including
all the aforementioned data but also additional data such as the
FST. The primary reason why I'm making this change is to let us
catch more desyncs before they happen, but this should also fix
https://bugs.dolphin-emu.org/issues/12115. As a bonus, the UI can
now distinguish the case where a client doesn't have the game at
all from the case where a client has the wrong version of the game.
Turns out, Gamecube games actually do check DILENGTH, and if DILENGTH is at 0, they'll think the transfer completed successfully even if DEINT is used, since after all, surely that means everything was sent. That caused all sorts of issues, from audio looping when a disc is removed since it's re-using the same buffer to just flat-out crashing instead of showing the disc removed screen.
In particular:
- Trying to play audio in a non-ready state returns the state-specific error, not an audio buf error
- Audio status cannot be requested in non-ready states
- The audio buffer cannot be configured in states other than ReadyNoReadsMade
- Using the stop motor command while the motor is already stopped doesn't change states
Additionally, the internal state IDs are used (which distinguish ReadyNoReadsMade and Ready), instead of the state IDs exposed in request error. This makes some of the weird behavior a bit more obvious.
State and error behavior of the seek command was not implemented in this commit.
It is my opinion that nobody should use NKit disc images without
being aware of the drawbacks of them. Since it seems like almost
nobody who is using NKit disc images knows what NKit is (hmm, now
how could that have happened...?), I am adding a warning to Dolphin
so that you can't run NKit disc images without finding out about the
drawbacks. In case someone really does want to use NKit disc images,
the warning has a "Don't show this again" option. Unfortunately, I
can't retroactively add the warning where it's most needed:
in Dolphin 5.0, which does not support Wii NKit disc images.
Pretty much the same optimization we did for AVX, although slightly more
constrained because we're stuck with the two-operand instruction where
destination and source have to match.
We could also specialize the case where registers b, c, and d are all
distinct, but I decided against it since I couldn't find any game that
does this.
Before:
66 0F 57 C0 xorpd xmm0,xmm0
66 41 0F C2 C1 06 cmpnlepd xmm0,xmm9
41 0F 28 CE movaps xmm1,xmm14
66 41 0F 38 15 CC blendvpd xmm1,xmm12,xmm0
44 0F 28 F1 movaps xmm14,xmm1
After:
66 0F 57 C0 xorpd xmm0,xmm0
66 41 0F C2 C1 06 cmpnlepd xmm0,xmm9
66 45 0F 38 15 F4 blendvpd xmm14,xmm12,xmm0
AVX has a four-operand VBLENDVPD instruction, which allows for the first
input and the destination to be different. By taking advantage of this,
we no longer need to copy one of the inputs around and we can just
reference it directly, provided it's already in a register (I have yet
to see this not be the case).
Before:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
41 0F 28 CE movaps xmm1,xmm14
66 41 0F 38 15 CA blendvpd xmm1,xmm10,xmm0
F2 44 0F 10 F1 movsd xmm14,xmm1
After:
66 0F 57 C0 xorpd xmm0,xmm0
F2 41 0F C2 C6 06 cmpnlesd xmm0,xmm14
C4 C3 09 4B CA 00 vblendvpd xmm1,xmm14,xmm10,xmm0
F2 44 0F 10 F1 movsd xmm14,xmm1