AVX512 has native unsigned integer comparison instructions, removing
the need to XOR the most-significant bit with a constant in memory to
use the signed comparison instructions. These instructions only write to
a k-mask register, though, and need an additional `vpmovm2*` instruction
to turn the k-mask into a vector mask.
As of Icelake:
`vpcmpu*` is all L3/T1
`vpmovm2d` is L1/T0.33
`vpmovm2{b,w}` is L3/T0.33
As of Zen4:
`vpcmpu*` is all L3/T0.50
`vpmovm2*` is all L1/T0.25
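
For reference, the sign-bit workaround that the native instructions replace can be modelled per lane in scalar C++. This is only a sketch of the idea, not the emitter code; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned "a > b" built from a signed comparison. XORing both operands
// with the sign bit maps [0, 2^32) onto [INT32_MIN, INT32_MAX] while
// preserving order, so the signed comparison gives the unsigned answer.
inline bool UnsignedGreaterViaSigned(uint32_t a, uint32_t b) {
  constexpr uint32_t kSignBit = 0x80000000u;
  const int32_t sa = static_cast<int32_t>(a ^ kSignBit);
  const int32_t sb = static_cast<int32_t>(b ^ kSignBit);
  return sa > sb;
}

// Per-lane model of what `vpmovm2d` produces from one k-mask bit:
// an all-ones dword for a set bit, an all-zeros dword otherwise.
inline uint32_t MaskBitToLane(bool bit) { return bit ? 0xFFFFFFFFu : 0u; }
```

With the native `vpcmpu*` path, the XOR step and its memory constant go away, and the `MaskBitToLane` step is what the extra `vpmovm2*` pays for.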
Other half of #2125. I don't know of any title that utilizes this instruction, but I went ahead and implemented it for completeness.
Verified the implementation with `instr__gen_vsubcuw` from #1348. The test can be grabbed with:
```
git checkout origin/gen_tests -- src\xenia\cpu\ppc\testing\*vsubcuw.s
```
I don't know of any title that utilizes this instruction, but I went
ahead and implemented it for completeness.
Verified the implementation with `instr__gen_vaddcuw` from #1348. The
test can be grabbed with:
```
git checkout origin/gen_tests -- src\xenia\cpu\ppc\testing\*vaddcuw.s
```
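
As a rough per-lane reference for what the two carry-out instructions compute (going by my reading of the VMX definitions, not by the emitter code), a scalar C++ sketch could look like this. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// vaddcuw, one 32-bit lane: carry out of the unsigned sum a + b.
inline uint32_t AddCarryOut(uint32_t a, uint32_t b) {
  const uint64_t wide = static_cast<uint64_t>(a) + b;
  return static_cast<uint32_t>(wide >> 32);
}

// vsubcuw, one 32-bit lane: carry out of a + ~b + 1, which is 1 exactly
// when the subtraction a - b does not borrow (a >= b as unsigned).
inline uint32_t SubCarryOut(uint32_t a, uint32_t b) {
  const uint64_t wide =
      static_cast<uint64_t>(a) + static_cast<uint32_t>(~b) + 1u;
  return static_cast<uint32_t>(wide >> 32);
}
```

Both results reduce to a single unsigned comparison per lane: `(a + b) < a` for the add and `a >= b` for the subtract.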
There's no limit on the number of memory exports in a shader on the real
Xenos, and exports can be done anywhere, including in loops. Now, instead
of deferring the exports to the end of the shader and assuming that export
allocs are executed only once, Xenia flushes exports when it reaches an
alloc, when it reaches the end of the shader, or when a pixel with
outstanding exports is killed. On Xenos, memory exports are terminated by
allocs as well as by individual ALU instructions with `serialize`; the
latter case is not handled, for simplicity, since it's only truly
mandatory to flush memory exports before starting a new one.
To know which eM# registers need to be flushed to memory, Xenia traverses
the successors of each exec that potentially writes any eM#, marking
which eM# registers might have been written before each reached control
flow instruction, until a flush point or the end of the shader is
reached.
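
The traversal can be pictured as a small worklist pass over the control flow graph. The sketch below is an illustration under assumptions, not the translator's actual data structures: `CfNode`, its fields, and `ComputePendingBefore` are hypothetical names, and every kind of flush point is collapsed into one flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical control flow instruction.
struct CfNode {
  uint8_t em_written = 0;       // bit i set: this exec may write eM<i>
  bool is_flush_point = false;  // e.g. an alloc
  std::vector<size_t> successors;
};

// For every node, returns the eM# bits that might have been written,
// and not yet flushed, by the time the node is reached.
std::vector<uint8_t> ComputePendingBefore(const std::vector<CfNode>& nodes) {
  std::vector<uint8_t> pending_before(nodes.size(), 0);
  std::vector<size_t> worklist;
  for (size_t i = 0; i < nodes.size(); ++i) {
    if (nodes[i].em_written) worklist.push_back(i);
  }
  while (!worklist.empty()) {
    const size_t n = worklist.back();
    worklist.pop_back();
    // A flush point clears whatever was outstanding before it.
    uint8_t out = nodes[n].is_flush_point ? 0 : pending_before[n];
    out |= nodes[n].em_written;
    for (size_t s : nodes[n].successors) {
      const uint8_t merged = pending_before[s] | out;
      if (merged != pending_before[s]) {
        pending_before[s] = merged;  // bits only grow, so this terminates
        worklist.push_back(s);
      }
    }
  }
  return pending_before;
}
```

Because the pass runs to a fixed point, an export inside a loop also marks the loop header as possibly having outstanding eM# writes.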
Also, some games export to sub-32bpp formats. These are now supported via
an atomic AND that clears the bits of the dword to be replaced, followed
by an atomic OR that inserts the new byte/short.
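
The AND-then-OR sequence can be sketched on the CPU side with `std::atomic`. The real exports happen in shader code, so this is only an analogy; the function names and the index parameters are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Replaces one byte of a shared dword without touching its neighbours.
inline void StoreByteAtomic(std::atomic<uint32_t>& dword, unsigned byte_index,
                            uint8_t value) {
  assert(byte_index < 4);
  const unsigned shift = byte_index * 8;
  dword.fetch_and(~(uint32_t{0xFF} << shift));  // clear the bits to replace
  dword.fetch_or(uint32_t{value} << shift);     // insert the new byte
}

// Same idea for a 16-bit value in the low or high half of the dword.
inline void StoreShortAtomic(std::atomic<uint32_t>& dword,
                             unsigned short_index, uint16_t value) {
  assert(short_index < 2);
  const unsigned shift = short_index * 16;
  dword.fetch_and(~(uint32_t{0xFFFF} << shift));
  dword.fetch_or(uint32_t{value} << shift);
}
```

The pair is not one atomic operation: another thread can briefly observe the cleared field between the AND and the OR, but concurrent writers to other bytes of the same dword never clobber each other.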
Uses `vpternlogd` to collapse the bitwise select operation into one
instruction. It still needs a `vmovdqa` instruction, though, since
`vpternlogd` both reads and writes its first argument.
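
To show how a single ternary-logic instruction covers a bitwise select, here is a scalar C++ model of one 32-bit lane. The immediate `0xCA` is the usual truth table for "first operand selects between the second and third"; whether the emitter uses that exact immediate and operand order is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of `vpternlogd`: each result bit is looked up
// in the 8-entry truth table imm8, indexed by the bits of (a, b, c),
// where a is the destination/first operand.
inline uint32_t TernaryLogic(uint32_t a, uint32_t b, uint32_t c,
                             uint8_t imm8) {
  uint32_t result = 0;
  for (int bit = 0; bit < 32; ++bit) {
    const unsigned index = (((a >> bit) & 1u) << 2) |
                           (((b >> bit) & 1u) << 1) | ((c >> bit) & 1u);
    result |= ((static_cast<uint32_t>(imm8) >> index) & 1u) << bit;
  }
  return result;
}

// imm8 = 0xCA encodes "a ? b : c" for every bit position.
inline uint32_t BitwiseSelect(uint32_t mask, uint32_t if_set,
                              uint32_t if_clear) {
  return TernaryLogic(mask, if_set, if_clear, 0xCA);
}
```

Since `a` is overwritten with the result, an operand that must survive has to be copied first, which is where the extra `vmovdqa` comes from.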
- References to vector elements are invalidated when the vector
  reallocates on a size change; using them afterwards is UB.
- Add one extra level of indirection so that each wide string's memory
  location stays pinned regardless of where the vector's storage moves.
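
A minimal sketch of the indirection, assuming the strings are owned through `std::unique_ptr` (the actual change may use a different owner type; `WideStringPool` is a made-up name):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// The vector only holds pointers, so a reallocation moves the pointers
// while every std::wstring object stays at its original address.
class WideStringPool {
 public:
  const std::wstring* Add(std::wstring value) {
    strings_.push_back(std::make_unique<std::wstring>(std::move(value)));
    return strings_.back().get();
  }
  size_t size() const { return strings_.size(); }

 private:
  std::vector<std::unique_ptr<std::wstring>> strings_;
};
```

A pointer returned by `Add` stays valid across later `Add` calls, whereas a reference into a plain `std::vector<std::wstring>` would dangle after the first reallocation.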