Add special case to TYPE_INT64's EmitAnd for a UINT_MAX mask. If detected, emit a 32-bit mov instead to take advantage of implicit zero extension and register renaming.
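A minimal sketch of the idea, assuming xbyak (which the x64 backend uses); the registers here are illustrative, not the actual EmitAnd operands:

```cpp
#include <xbyak/xbyak.h>

struct And64UintMaxSketch : Xbyak::CodeGenerator {
  And64UintMaxSketch() {
    // Instead of materializing 0xFFFFFFFF and emitting a 64-bit and, write
    // the low 32 bits with a 32-bit mov: on x64 this implicitly zeroes bits
    // 63:32, and register renaming makes the mov nearly free.
    mov(eax, ecx);  // dest = src & 0xFFFFFFFF
    ret();
  }
};
```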
Add helper function for skipping assignment defs in instr.
Add helper function for checking whether an opcode is a binary value type.
Add several new optimizations to the simplification pass, plus weak NZM (non-zero mask) calculation code (better full evaluation of Z/NZ will be done later).
List of optimizations (a minimal sketch of a few of these rules follows the list):
If a value is ANDed with a bitmask that it was already masked against, reuse the old value (this cuts out most FPSCR update garbage, although it does cause a local variable to be allocated for the masked FPSCR, and it still repeatedly stores the masked value to the context)
If masking a value that was ORed with another, check whether our mask only covers bits from one of the two inputs. If so, change the operand to the OR input that actually matters
If the only usage of a rotate-left's output is an AND against a mask that discards the bits that were rotated in, change the opcode to SHIFT_LEFT
If masking against all ones, become an assign.
If XOR or OR against 0, become an assign (additional FPSCR codegen cleanup)
If XOR against all ones, become a NOT
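A minimal sketch of a few of these rules on a toy value node with a weak NZM estimate attached; the real pass works on Xenia's HIR, and all names here are hypothetical:

```cpp
#include <cstdint>

struct Node {
  enum Op { kOther, kOr } op = kOther;
  Node* lhs = nullptr;
  Node* rhs = nullptr;
  uint64_t nzm = ~0ull;  // weak "non-zero mask": bits that may possibly be set
};

// OR rule: if the AND mask can only see bits from one OR input, use that
// input directly instead of the OR result.
Node* NarrowAndOperand(Node* value, uint64_t mask) {
  if (value->op == Node::kOr) {
    if ((value->rhs->nzm & mask) == 0) return NarrowAndOperand(value->lhs, mask);
    if ((value->lhs->nzm & mask) == 0) return NarrowAndOperand(value->rhs, mask);
  }
  return value;
}

// Constant special cases from the list: XOR/OR with 0 is a plain assign,
// XOR with all ones is a NOT (AND with all ones is likewise an assign).
enum class Rewrite { kNone, kAssign, kNot };
Rewrite ClassifyXorConstant(uint64_t c) {
  if (c == 0) return Rewrite::kAssign;   // x ^ 0 == x
  if (c == ~0ull) return Rewrite::kNot;  // x ^ ~0 == ~x
  return Rewrite::kNone;
}
```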
Add a direct CPUID check to x64_emitter for lzcnt; the version of xbyak we are using skips the lzcnt check on all non-Intel CPUs, meaning we were generating the much slower bitscan path on AMD CPUs.
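A minimal sketch of such a direct check (the helper name is hypothetical, not the actual x64_emitter code); LZCNT support is reported vendor-independently in CPUID leaf 0x80000001, ECX bit 5 (the ABM/LZCNT bit):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

bool CpuHasLzcnt() {
#if defined(_MSC_VER)
  int regs[4];
  __cpuid(regs, 0x80000001);
  return (static_cast<uint32_t>(regs[2]) & (1u << 5)) != 0;
#else
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
    return false;
  }
  return (ecx & (1u << 5)) != 0;
#endif
}
```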
Float24-as-float32 depth bias is now applied in increments of 8, because converting the depth to float24 directly in the pixel shaders may otherwise wipe out the bias if it's too small (float24 has 3 fewer mantissa bits than float32, so its representable steps are 8 times coarser).
Resolve guest calls in the emitted asm as rel32 calls. Disabled by default, enabled via
resolve_rel32_guest_calls
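A minimal sketch of the rel32 patching arithmetic this implies (names are hypothetical); an E8 call is 5 bytes, its displacement is relative to the end of the instruction, and the target must be within roughly +/-2 GiB of the call site:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

bool EmitRel32Call(uint8_t* call_site, const uint8_t* target) {
  ptrdiff_t disp = target - (call_site + 5);
  if (disp < INT32_MIN || disp > INT32_MAX) {
    return false;  // out of rel32 range, keep the indirect call path
  }
  int32_t rel32 = static_cast<int32_t>(disp);
  call_site[0] = 0xE8;                    // CALL rel32 opcode
  std::memcpy(call_site + 1, &rel32, 4);  // little-endian displacement
  return true;
}
```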
Detect whether the CPU has fast jrcxz and fast loop/loope/loopne
Much more thorough LoadConstantXMM
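As one plausible illustration of what "more thorough" can cover (an assumption, not the actual LoadConstantXMM code): common constants can be synthesized in registers instead of being loaded from memory, e.g. with xbyak:

```cpp
#include <xbyak/xbyak.h>

struct XmmConstSketch : Xbyak::CodeGenerator {
  XmmConstSketch() {
    vpxor(xmm0, xmm0, xmm0);     // all zeros
    vpcmpeqd(xmm1, xmm1, xmm1);  // all ones
    vpcmpeqd(xmm2, xmm2, xmm2);  // all ones...
    vpsrld(xmm2, xmm2, 1);       // ...shifted right: 0x7FFFFFFF per lane
    ret();
  }
};
```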
New cvar elide_e0_check that allows the backend to assume accesses via
the SP or TLS register will not cross into 0xe0 range
Add x64 codegen for Vector shift uint8
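x64 has no native per-byte shifts, so one common emulation (shown here for the simple case of a single shared shift amount, as an illustration of the technique rather than Xenia's exact codegen) is to shift 16-bit lanes and mask off the bits that crossed a byte boundary:

```cpp
#include <emmintrin.h>  // SSE2

__m128i ShiftLeftU8(__m128i v, int n) {
  __m128i count = _mm_cvtsi32_si128(n);
  __m128i shifted = _mm_sll_epi16(v, count);  // shift 16-bit lanes
  __m128i mask = _mm_set1_epi8(static_cast<char>(0xFF << n));
  return _mm_and_si128(shifted, mask);        // drop bits shifted in from the neighbor byte
}
```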
If the CPU has fast jrcxz, use it for some traptrue/breaktrue instructions
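A hypothetical illustration of the pattern, assuming xbyak (not the actual trap emitters): jrcxz skips the trap call when the condition in rcx is zero, replacing a separate test + jz pair.

```cpp
#include <cstdint>
#include <xbyak/xbyak.h>

struct TrapTrueSketch : Xbyak::CodeGenerator {
  explicit TrapTrueSketch(const void* trap_helper) {
    Xbyak::Label skip;
    // rcx holds the already zero-extended condition value.
    jrcxz(skip);  // condition == 0 -> no trap
    mov(rax, reinterpret_cast<uintptr_t>(trap_helper));
    call(rax);
    L(skip);
    ret();
  }
};
```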
Use phat (multi-byte) nops
Add cvar use_fast_dot_product, which uses a four-instruction sequence
for both dot product instructions that ought to be equivalent. Disabled
by default.
On Vulkan, when snorm16 is unsupported, these formats may be emulated as float16, which can natively represent a wide range of numbers, including -32 to 32 with blending. However, R16G16_SNORM and R16G16B16A16_SNORM are two separate formats, which may have different support on the device.
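A minimal sketch of querying each format separately before falling back to float16 emulation (function and variable names are assumptions, not Xenia's code):

```cpp
#include <vulkan/vulkan.h>

bool SupportsColorBlending(VkPhysicalDevice physical_device, VkFormat format) {
  VkFormatProperties properties;
  vkGetPhysicalDeviceFormatProperties(physical_device, format, &properties);
  VkFormatFeatureFlags required = VK_FORMAT_FEATURE_COLOR_ATTACHMENT_BIT |
                                  VK_FORMAT_FEATURE_COLOR_ATTACHMENT_BLEND_BIT;
  return (properties.optimalTilingFeatures & required) == required;
}

// Usage: check VK_FORMAT_R16G16_SNORM and VK_FORMAT_R16G16B16A16_SNORM
// independently, since one may be supported while the other is not.
```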
Ordering the descriptor sets by change frequency on Vulkan, in increasing order (the opposite of D3D12 root signatures). The EDRAM binding never changes there (it's always one storage buffer), while the destination buffer binding may become changeable in the future (to split dispatches if exceeding `maxStorageBufferRange`, for example).
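A minimal sketch of that ordering when creating the pipeline layout (all handles and names here are hypothetical): set 0 holds the never-rebound EDRAM storage buffer, set 1 the destination buffer that may need rebinding later.

```cpp
#include <vulkan/vulkan.h>

VkPipelineLayout CreateResolveLayout(VkDevice device,
                                     VkDescriptorSetLayout edram_set_layout,
                                     VkDescriptorSetLayout dest_set_layout) {
  // Least frequently changing set first, per the Vulkan convention.
  VkDescriptorSetLayout set_layouts[] = {edram_set_layout, dest_set_layout};
  VkPipelineLayoutCreateInfo info = {};
  info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
  info.setLayoutCount = 2;
  info.pSetLayouts = set_layouts;
  VkPipelineLayout layout = VK_NULL_HANDLE;
  vkCreatePipelineLayout(device, &info, nullptr, &layout);
  return layout;
}
```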
Replace the movzx after setae in both ComputeMemoryAddressOffset and ComputeMemoryAddress with an xor_ of eax prior to the cmp. This shortens both sequences by 1 byte and should give a moderate ICache usage reduction given how frequently these sequences appear.
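A minimal before/after sketch using xbyak-style calls (registers are illustrative, not the actual memory-address code):

```cpp
#include <xbyak/xbyak.h>

struct SetaeSketch : Xbyak::CodeGenerator {
  SetaeSketch() {
    // Before: setae writes only al, so a 3-byte movzx followed:
    //   cmp(rcx, rdx); setae(al); movzx(eax, al);
    // After: clear eax with a 2-byte xor ahead of the cmp (xor clobbers
    // flags, so it must come first); setae then leaves an already
    // zero-extended result, saving one byte per sequence.
    xor_(eax, eax);
    cmp(rcx, rdx);
    setae(al);
    ret();
  }
};
```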