In the `Int8` case of `VECTOR_SHA_V128`, when all the values are the same, then a single-instruction `gf2p8affineqb` can be emitted that does an int8-based arithmetic-shift, utilizing GF(8) arithmetic.
More info here:
https://wunkolo.github.io/post/2020/11/gf2p8affineqb-int8-shifting/
As of now(Dec 2021): Tremont(Lakefield), Jasper Lake, Ice lake, Tigerlake, and Rocket Lake support GNFI.
Rather than having a single bool to conditionally detect haswell-level
instruction features. The granularity is increased with a new
`x64_extension_mask` where individual features within the x64 backend
can be turned on or off in a bit-mask manner. Since we have an ARM
backend on the horizon, I've added this to the new `x64`
configuration-group rather than `CPU`. This new pattern will hopefully
allow for testing to be more targetted to certain processor features and
allows the user to determine if they want certain features to be enabled
or disabled(such as avoiding BMI2 on certain AMD processors due to
pdep/pext being incredibly slow). The default configuration is to detect
and utilize all available features.
This addresses a JIT-issue in the case that the `src1` and `dest`
register are both the same. This issue only happens in the "generic"
x86 path but not in the BMI1-accelerated path.
Thanks Rick for the extensive debugging help.
When `src1` and `dest` were the same, then the `addc` instruction at
`82099A08` in title `584108FF` might emit the following assembly:
```
.text:82099A08 andc r11, r10, r11
|
| Jitted
|
V
00000000A0011B15 mov rbx,r10
00000000A0011B18 not rbx
00000000A0011B1B and rbx,rbx
```
This was due to the src1 operand and the destination register being the
same, which used to call the "else" case in the x64 emitter when it
needs to be handled explicitly due to register aliasing/allocation.
Addresses issue #1945
The wheel shader in 4D530910 does vfetch_full to r0 with the index from r0.x, and then vfetch_mini.
Thanks @Gliniak for the finding :3
Also small formatting cleanup in commented-out code.
Fix the the case where src1 is constant and src2 is non-constant causing
an assert due to trying to call `.constant()` on the src2 operand.
Interfaces with an issue Gliniak was encountering where title `4D53082D`
encounters an assert. Also includes a BMI1-acceleration in the 64-bit
case where a temporary register is needed(the `and` x86 instruction only
supports immediate constants up to 32-bits).
In the case of having two register operands for `AndNot`, the `andn` instruction can be used when the host supports `BMI1`. `andn` only supports 32-bit and 64-bit operands, so some register up-casting is needed.
The `BMI1 feature` fits into the current pattern of `use_haswell_instructions` as BMI1 was only introduced in haswell.
Also moved the aliases to the end of the enum rather than interleave it with the bit definitions.
Implements the detection of some baseline `AVX512` subsets and some common aliases into `X64EmitterFeatureFlags`.
So far, `AVX512{F,VL,BW,DQ}` are the only subsets of `AVX512` that are detected with this PR since I anticipate these are the ones that will actually be used a lot in the x64 backend. Some aliases are also implemented such as `kX64EmitAVX512Ortho` which is `AVX512F` and `AVX512VL` combined which are the two subsets of AVX512 required to allow for `AVX512` operations upon `ymm` and `xmm` registers.
These aliases can possibly be collapsed since we could just always require `AVX512VL` to be supported to allow for _any_ kind of `AVX512` to be used since we will practically always want to use `AVX512` on `xmm` registers at the very least as there is no use-case where we want to use the 512-bit `zmm` registers exclusively.
Rather than using `n * 1024 * 1024`, this adds a convenient `_MiB`/`_KiB` user-literal to the new `literals.h` header to concisely describe units of memory in a much more readable way. Any other useful literals can be added to this header. These literals exist in the `xe::literals` namespace so they are opt-in, similar to `std::chrono` literals, and require a `using namespace xe::literals` statement to utilize it within the current scope.
I've done a pass through the codebase to replace trivial instances of `1024 * 1024 * ...` expressions being used but avoided anything that added additional casting complexity from `size_t` to `uint32_t` and such to keep this commit concise.
Just checking if the resulting mask is non-zero means we cannot allow this function to check for multiple features in parallel. A hypothetical computer that supports FMA but not AVX2 will return `true` if you try to call `IsFeatureEnabled(kX64EmitFMA | kX64EmitAVX2)`. We should make sure all the masked flags return `true` rather than check for non-zero.
This is ramping up to allow for particular subsets of AVX512 to be checked for in parallel with a single function call.
All of the (non-third party) cpp impl files use the .cc extension, this one doesn't. I was digging through the code and found this one so thought I might as well rename it whilst I'm here!
XMountUtilityDrive code tries reading/writing from \Device\Harddisk0\Cache0 / Cache1 / Partition0, NullDevice handling \Device\Harddisk0 will make that code think that the reads/writes were successful, so the utility-drive mount can proceed without failing.
XMountUtilityDrive-related code checks some values returned from NtDeviceIoControlFile, stub just returns values that it seems to accept
IoCreateDevice is also used by utility-drive code, writing some values into a pointer returned by it, so stub allocs space so it can write to the pointer without errors.
The fpfs function is using strtof to convert a string to floating point
value, but the type may be a double. Using strtof in that case won't
provide enough precision, so switch to using strtod. When the type is a
float, the double will be down-converted to the correct value.
- [Kernel] XEnumerator::WriteItems no longer cares about provided
buffer size, since we know the size when the XEnumerator was created.
- [Kernel] Added XStaticEnumerator template. Previous
XStaticEnumerator renamed to XStaticUntypedEnumerator.
- [XAM] Deferred xeXamEnumerate.
- Ensure that logging waits until everything is written before shutting
down.
- Fix a bug where a new log line would not be written until the next
log line had been appended.
XamAppEnumerateContentAggregate seems to store title_id at 0x140, so I've moved title_id to there, now AGGREGATE_DATA seems to match the size of the struct used by XamContentCreateInternal.
Also removed unneeded/useless checks inside XamAppEnumerateContentAggregate.
Adds a new X_KTHREAD::apc_disable_count field at 0xB0 into X_KTHREAD based on where 360 kernel seems to store it, and made CriticalRegion funcs act on that, instead of locking things between all threads, and changes DeliverAPCs to check that field before running the APCs.
XThread::LockApc/UnlockApc were also updated too as those previously called into EnterCrit/LeaveCrit to work, but AFAIK the code that uses LockApc/UnlockApc does have an actual need for locking between threads, so changed them to work on XThread::global_critical_region_ directly instead.
Prevent Vulkan Swap before display context is assigned.
Prevent resize while fullscreen (like in windows impl).
Use superclass Resize implementation to reduce code duplication.
Remove recursive call to GTKWindow::Close().
Destroy xcb window after superclass Close().
Set hwnd to null after closing on windows implementation.