Make it have no effect on the texture resource as a resource may be used with samplers with different overrides. Also make sure magnification vs. minification is not undefined with it on Direct3D 12.
According to the integral promotion rules https://eel.is/c++draft/conv.prom#5.sentence-1 bit fields can be promoted to `int` if it's wide enough to store their value, and then otherwise, to `unsigned int`. Hopefully fixes Clang building (the `width_div_8` case).
Add special case to TYPE_INT64's EmitAnd for UINT_MAX mask. Do mov32 to 32 if detected to take advantage of implicit zero xt/reg renaming
Add helper function for skipping assignment defs in instr.
Add helper function for checking if an opcode is binary value type
Add several new optimizations to simplificationpass, plus weak NZM calculation code (better full evaluation of Z/NZ will be done later) .
List of optimizations:
If a value is anded with a bitmask that it was already masked against, reuse the old value (this cuts out most FPSCR update garbage, although it does cause a local variable to be allocated for the masked FPSCR and it still repeatedly stores the masked value to the context)
If masking a value that was or'ed against another check whether our mask only considers bits from one value or another. if so, change the operand to the OR input that actually matters
If the only usage of a rotate left's output is an AND against a mask that discards the bits that were rotated in change the opcode to SHIFT_LEFT
If masking against all ones, become an assign.
If XOR or OR against 0, become an assign (additional FPSCR codegen cleanup)
If XOR against all ones, become a NOT
Adding a direct CPUID check to x64_emitter for lzcnt, the version of xbyak we are using is skipping checking for lzcnt on all non-intel cpus, meaning we are generating the much slower bitscan path for AMD cpus.
Float24-as-float32 depth bias is now in the increments of 8, because conversion of the depth to float24 directly in the pixel shaders may destroy the bias qualitatively otherwise if it's too small.
the emitted asm as rel32 calls. Disabled by default, enabled via
resolve_rel32_guest_calls
detect whether cpu has fast jrcxz, fast loop/loope/loopne
much more thorough LoadConstantXMM
New cvar elide_e0_check that allows the backend to assume accesses via
the SP or TLS register will not cross into 0xe0 range
Add x64 codegen for Vector shift uint8
If has fast jrcxz use for some traptrue/breaktrue instructions
Use phat nops
Add cvar use_fast_dot_product, which uses a four instruction sequence
for both dot product instructions which ought to be equivalent. disabled
by default.
On Vulkan, when snorm16 in unsupported, these formats may be emulated as float16, which natively can represent a wide range of numbers including -32 to 32 with blending. However, R16G16_SNORM and R16G16B16A16_SNORM are two separate formats, which may have different support on the device.
Ordering the descriptor sets by the change frequency on Vulkan, in increasing order (the opposite of D3D12 root signatures). The EDRAM binding never changes there (always one storage buffer), while the destination buffer binding may become changeable in the future (to split dispatches if exceeding `maxStorageBufferRange`, for example).
Replace a movzx after setae in both ComputeMemoryAddressOffset and ComputeMemoryAddress with a xor_ of eax prior to the cmp. This reduces the length in bytes of both sequences by 1, and should be a moderate ICache usage reduction thanks to the frequency of these sequences.
While the alpha of the texture data is not used at all (replaced with blue using the view swizzle), still make the shader code state the intention more explicitly if the format is decompressed for use as signed. Unsigned 1.0 is 0xFF, while signed 1.0 is 0x7F.
The original multiplication was likely added early during the development of generic resolution scaling. Before generic resolution scaling, invocations were done for unscaled guest blocks, now they're done for scaled blocks, so with 3x1 scaling, an invocation for 8 blocks writes 8 host blocks, not 24.
Certain games, such as Forza Motorsport 3, submit XMA data with the
stereo flag set with a null second channel. This falls back to mono
conversion when the second channel is null, preventing a crash.
- Changed name of config option to apply_title_update to better reflect what that option does
- Mount TU package to UPDATE: partition
- Simplified UserModule::title_id()
- Splitted loading module into two parts to allow applying TUs and custom patches
6fcf9d21fe made per-vertex diameter vs. constant radius consistent, and with that commit the shader works with direct pixel to NDC conversion, however, the NDC conversion factor was outdated in that commit (still included the 0.5 factor for diameter to radius conversion, resulting in all points being 50% narrower along each axis than needed). Now, the diameter to radius conversion factor is used there properly, and also the multiplication of the per-vertex diameter by 0.5 has been removed from the shader since the constant already includes it now (the constant diameter is passed via the system constants instead of the radius also).
- WinSystemClock is a FILETIME clock without scaling, can convert to
system_time
- XSystemClock is a FILTETIME clock with scaling applied, can only
convert to WinSystemClock
`pxor` is a zero-uop register-rename and `gf2p8affineqb dest, zero, int8`
is a very quick single-instruction way to use affine galois
transformations to fill a register with an immediate byte without
touching memory.
Transfers are functional on a D3D12-like level, but need additional work so fallbacks are used when multisampled integer sampled images are not supported, and to eliminate transfers between render targets within Vulkan format compatibility classes by using different views directly.
Edit one of the lanes in this unit-test to be larger than the width of
the element-size to ensure that this case is handled correctly.
It should only mask the lower `log2(32)=5` bits of the input, causing
`33`(`100001`) to be `1`(`000001`).
`vprolvd` is an almost 1:1 analog with this opcode and can be
conditionally emitted when the host supports AVX512{F,VL}.
Altivec docs say that `vrl{bhw}` masks the lower log2(n) bits of the
element-size.
[vprold](https://www.felixcloutier.com/x86/vprold:vprolvd:vprolq:vprolvq)
modulos the shift-value by the element size in bits, which is the same
as masking the lower log2(n) bits. So `vrlw` maps exactly to `vprold`.
Since timer_delete does not clean up already queued signals, signal info
data needs to be retained after timer deletion and object destruction in
order to circumvent use-after-free bugs.
Schedule callbacks whith the only guarantee that they will not be run for
the minimum duration specified. Useful for garbage collecting POSIX
timer_create() signal info data.
- Mainly use `assert`s, since failure is very rare
- Forward failure of `CreateSemaphore` to guests because it is more easy
to trigger with invalid initial parameters.
Timing dependencies in this tests were causing spurious test failures:
- Create and Run Thread
- Test Thread QueueUserCallback
They have been largely replaced by spin waits.
- Never use `cond_.notify_one()` because it may wake a thread that is
unrelated to the signalled wait handle, resulting in a lost wake and
possible deadlock. Wait conditions are to be checked by the threads
themselves.
- Refactor and simplify `WaitMultiple`
[GPU] Add FXAA post-processing
[UI] Add FidelityFX FSR and CAS post-processing
[UI] Add blue noise dithering from 10bpc to 8bpc
[GPU] Apply the DC PWL gamma ramp closer to the spec, supporting fully white color
[UI] Allow the GPU CP thread to present on the host directly, bypassing the UI thread OS paint event
[UI] Allow variable refresh rate (or tearing)
[UI] Present the newest frame (restart) on DXGI
[UI] Replace GraphicsContext with a far more advanced Presenter with more coherent surface connection and UI overlay state management
[UI] Connect presentation to windows via the Surface class, not native window handles
[Vulkan] Switch to simpler Vulkan setup with no instance/device separation due to interdependencies and to pass fewer objects around
[Vulkan] Lower the minimum required Vulkan version to 1.0
[UI/GPU] Various cleanup, mainly ComPtr usage
[UI] Support per-monitor DPI awareness v2 on Windows
[UI] DPI-scale Dear ImGui
[UI] Replace the remaining non-detachable window delegates with unified window event and input listeners
[UI] Allow listeners to safely destroy or close the window, and to register/unregister listeners without use-after-free and the ABA problem
[UI] Explicit Z ordering of input listeners and UI overlays, top-down for input, bottom-up for drawing
[UI] Add explicit window lifecycle phases
[UI] Replace Window virtual functions with explicit desired state, its application, actual state, its feedback
[UI] GTK: Apply the initial size to the drawing area
[UI] Limit internal UI frame rate to that of the monitor
[UI] Hide the cursor using a timer instead of polling due to no repeated UI thread paints with GPU CP thread presentation, and only within the window
[AltiVec](https://www.nxp.com/docs/en/reference-manual/ALTIVECPEM.pdf)
doc says that it just uses the lower `log2(n)` bits of the shift-amount
rather than the whole element-sized value. So there is no need to handle
an overflow. Also adjusts 64-bit literals to utilize the explicit
`UINT64_C` type.
In the `Int8` case of `VECTOR_SH{R,L}_V128`, when all the values are the
same, then a single-instruction `gf2p8affineqb` can be emitted that does
an int8-based arithmetic-shift, utilizing GF(8) arithmetic.
More info here:
https://wunkolo.github.io/post/2020/11/gf2p8affineqb-int8-shifting/
Also fixes the iteration-type for when detecting if all of the simd
lanes are the same value(was iterating `u16` and not `u8`)
[XAM] Improvements to XamUserReadProfileSettingsEx/
XamUserWriteProfileSettings.
- Unify X_USER_READ_PROFILE_SETTING and X_USER_WRITE_PROFILE_SETTING
into X_USER_PROFILE_SETTING.
- Clean up Setting serialization to use X_USER_PROFILE_SETTING_DATA
instead of manual buffer copying.
- Fix XamUserReadProfileSettingsEx case where user_index is non-zero
and xuids are being used.
- Skip unset settings in XamUserWriteProfileSettings_entry.
Rather than having a huge list of if-statements that all do the same
thing, this preprocessor allows a more concise pattern to detecting if
the emit-flag is enabled as well as the correlated Xbyak flag that it
needs to check for to before allowing the feature-flag to be emitted.
Also moved the AVX-check to the beginning to early-out rather than do a
bunch of wasted work only to find out last that the host doesn't even
support AVX.
In the `Int8` case of `VECTOR_SHA_V128`, when all the values are the same, then a single-instruction `gf2p8affineqb` can be emitted that does an int8-based arithmetic-shift, utilizing GF(8) arithmetic.
More info here:
https://wunkolo.github.io/post/2020/11/gf2p8affineqb-int8-shifting/
As of now(Dec 2021): Tremont(Lakefield), Jasper Lake, Ice lake, Tigerlake, and Rocket Lake support GNFI.
Rather than having a single bool to conditionally detect haswell-level
instruction features. The granularity is increased with a new
`x64_extension_mask` where individual features within the x64 backend
can be turned on or off in a bit-mask manner. Since we have an ARM
backend on the horizon, I've added this to the new `x64`
configuration-group rather than `CPU`. This new pattern will hopefully
allow for testing to be more targetted to certain processor features and
allows the user to determine if they want certain features to be enabled
or disabled(such as avoiding BMI2 on certain AMD processors due to
pdep/pext being incredibly slow). The default configuration is to detect
and utilize all available features.
This addresses a JIT-issue in the case that the `src1` and `dest`
register are both the same. This issue only happens in the "generic"
x86 path but not in the BMI1-accelerated path.
Thanks Rick for the extensive debugging help.
When `src1` and `dest` were the same, then the `addc` instruction at
`82099A08` in title `584108FF` might emit the following assembly:
```
.text:82099A08 andc r11, r10, r11
|
| Jitted
|
V
00000000A0011B15 mov rbx,r10
00000000A0011B18 not rbx
00000000A0011B1B and rbx,rbx
```
This was due to the src1 operand and the destination register being the
same, which used to call the "else" case in the x64 emitter when it
needs to be handled explicitly due to register aliasing/allocation.
Addresses issue #1945
The wheel shader in 4D530910 does vfetch_full to r0 with the index from r0.x, and then vfetch_mini.
Thanks @Gliniak for the finding :3
Also small formatting cleanup in commented-out code.
Fix the the case where src1 is constant and src2 is non-constant causing
an assert due to trying to call `.constant()` on the src2 operand.
Interfaces with an issue Gliniak was encountering where title `4D53082D`
encounters an assert. Also includes a BMI1-acceleration in the 64-bit
case where a temporary register is needed(the `and` x86 instruction only
supports immediate constants up to 32-bits).
In the case of having two register operands for `AndNot`, the `andn` instruction can be used when the host supports `BMI1`. `andn` only supports 32-bit and 64-bit operands, so some register up-casting is needed.
The `BMI1 feature` fits into the current pattern of `use_haswell_instructions` as BMI1 was only introduced in haswell.
Also moved the aliases to the end of the enum rather than interleave it with the bit definitions.
Implements the detection of some baseline `AVX512` subsets and some common aliases into `X64EmitterFeatureFlags`.
So far, `AVX512{F,VL,BW,DQ}` are the only subsets of `AVX512` that are detected with this PR since I anticipate these are the ones that will actually be used a lot in the x64 backend. Some aliases are also implemented such as `kX64EmitAVX512Ortho` which is `AVX512F` and `AVX512VL` combined which are the two subsets of AVX512 required to allow for `AVX512` operations upon `ymm` and `xmm` registers.
These aliases can possibly be collapsed since we could just always require `AVX512VL` to be supported to allow for _any_ kind of `AVX512` to be used since we will practically always want to use `AVX512` on `xmm` registers at the very least as there is no use-case where we want to use the 512-bit `zmm` registers exclusively.
Rather than using `n * 1024 * 1024`, this adds a convenient `_MiB`/`_KiB` user-literal to the new `literals.h` header to concisely describe units of memory in a much more readable way. Any other useful literals can be added to this header. These literals exist in the `xe::literals` namespace so they are opt-in, similar to `std::chrono` literals, and require a `using namespace xe::literals` statement to utilize it within the current scope.
I've done a pass through the codebase to replace trivial instances of `1024 * 1024 * ...` expressions being used but avoided anything that added additional casting complexity from `size_t` to `uint32_t` and such to keep this commit concise.
Just checking if the resulting mask is non-zero means we cannot allow this function to check for multiple features in parallel. A hypothetical computer that supports FMA but not AVX2 will return `true` if you try to call `IsFeatureEnabled(kX64EmitFMA | kX64EmitAVX2)`. We should make sure all the masked flags return `true` rather than check for non-zero.
This is ramping up to allow for particular subsets of AVX512 to be checked for in parallel with a single function call.
All of the (non-third party) cpp impl files use the .cc extension, this one doesn't. I was digging through the code and found this one so thought I might as well rename it whilst I'm here!
XMountUtilityDrive code tries reading/writing from \Device\Harddisk0\Cache0 / Cache1 / Partition0, NullDevice handling \Device\Harddisk0 will make that code think that the reads/writes were successful, so the utility-drive mount can proceed without failing.
XMountUtilityDrive-related code checks some values returned from NtDeviceIoControlFile, stub just returns values that it seems to accept
IoCreateDevice is also used by utility-drive code, writing some values into a pointer returned by it, so stub allocs space so it can write to the pointer without errors.
The fpfs function is using strtof to convert a string to floating point
value, but the type may be a double. Using strtof in that case won't
provide enough precision, so switch to using strtod. When the type is a
float, the double will be down-converted to the correct value.