Uses `vpternlogd` to collapse the bitwise select operation into one
instruction. Though it needs a `vmovdqa` instruction since `vpternlogd`
reads and writes to the first argument.
- References to vector data become UB after vector size changes.
- Add one extra level of indirection to pin the wide string memory
location regardless of vector memory
It is debatable whether this is correct in the general case.
There's nothing really wrong with 0xFFFF logically.
Burnout Paradise however bases its in-game language on this and does not recognise 0xFFFF.
The game uses Japanese in the default case.
I've avoided the "rest of Asia" code since Burnout Paradise seems to use a different value (0x01F8) for that than what I expected (0x01FC).
Burnout Paradise statically expects certain thread handle values based on how many objects it knows it is allocating ahead of time.
From this, it calculates an ID by subtracting the thread handle from a base handle of what it expects the first such thread to be assigned.
The value is statically declared in the executable and is not determined automatically.
The host objects in the handle range made these thread handles higher than what the game expects.
Removing these, and allowing 0xF8000000 to be assigned, allows the thread handles to fit perfectly in the range the game expects.
It is not clear what handle range the host objects should be taking. For now though, they're 0-based rather than 0xF8000000-based.
While a title is running any attempt to open another title will result in a crash this commit prevents these crashes via File->Open and File->Open Recent->Title
Opening a recent title with an invalid path caused a crash.
Reorganized SystemPageFlags for sharedmemory, each field now goes into its own array, the three arrays are page aligned in a single virtual allocation
Refactored sharedmemory a bit, use tzcnt if available when finding ranges (faster on pre-zen4 amd cpus)
Made commandprocessor GetcurrentRingReadcount inline, it was made noinline to match PGO decisions but i think PGO can make extra reg allocation decisions that make this inlining choice yield gains, whereas if we do it manually we lose a tiny bit of performance
Working on a more compact vectorized version of GetScissor to save icache on cmd processor thread
Add WriteRegisterForceinline, will probably end up amending to remove it
add ERMS path for vastcpy
Adding WriteRegistersFromMemCommonSense (name will be changed later), name is because i realized my approach with optimizing writeregisters has been backwards, instead of handling all checks more quickly within the loop that writes the registers, we need a different loop for each range with unique handling. we also manage to hoist a lot of the logic out of the loops
Use 100ns delay for MaybeYield, noticed we often return almost immediately from the syscall so we end up wasting some cpu, instead we give up the cpu for the min waitable time (on my system, this is 0.5 ms)
Added a note about affinity mask/dynamic process affinity updates to threading_win
Add TextureFetchConstantsWritten