stub XeKeysGetConsoleType
Removed the breakpoints in HandleCppException and RtlRaiseException until we have a real implementation of them. Some apps can continue fine afterwards.
Stub version of HalGetCurrentAVPack
Implement MmIsAddressValid
Implement RtlGetStackLimits
The "close window" keyboard hotkey (Guide-B) now toggles between loglevel -1 and the loglevel set in your config.
Added LoggerBatch class, which accumulates strings into the thread's scratch buffer. This is only intended to be used for very high frequency debug logging. If it exhausts the thread buffer, it just silently stops.
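A minimal sketch of the "accumulate until the buffer runs out, then silently stop" behaviour described above. The class name ScratchLogBatch, its members, and the fixed-capacity template parameter are assumptions made for illustration; this is not the actual LoggerBatch interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical stand-in for a LoggerBatch-like accumulator. The real class
// writes into a per-thread scratch buffer; here the buffer is a member.
template <std::size_t kCapacity>
class ScratchLogBatch {
 public:
  // Appends text if it fits. Once an append does not fit, the batch is
  // marked exhausted and every later append is dropped without an error.
  void Append(std::string_view text) {
    if (exhausted_ || text.empty()) {
      return;
    }
    if (text.size() > kCapacity - used_) {
      exhausted_ = true;
      return;
    }
    std::memcpy(buffer_ + used_, text.data(), text.size());
    used_ += text.size();
  }

  std::string_view view() const { return std::string_view(buffer_, used_); }
  bool exhausted() const { return exhausted_; }

 private:
  char buffer_[kCapacity] = {};
  std::size_t used_ = 0;
  bool exhausted_ = false;
};
```

Dropping the whole string rather than truncating it keeps the hot path to one comparison and one copy, which suits very high frequency logging.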
Cleaned nearly 8 years of dust off of the pm4 packet disassembler code, now supports all packets that the command processor supports.
Added extremely verbose logging for gpu register writes. This is not compiled in outside of debug builds, requires LogLevel::Debug and log_guest_driven_gpu_register_written_values = true.
Added full logging of all PM4 packets in the cp. This is not compiled in outside of debug builds, requires LogLevel::Debug and disassemble_pm4.
Piggybacked an implementation of guest callstack backtraces using the stackpoints from enable_host_guest_stack_synchronization. If enable_host_guest_stack_synchronization = false, no backtraces can be obtained.
Added log_ringbuffer_kickoff_initiator_bts. When a thread updates the cp's read pointer, the backtrace of that thread is dumped.
Changed the names of the gpu registers CALLBACK_ADDRESS and CALLBACK_CONTEXT to the correct names.
Added a note about CP_PROG_COUNTER
Added CP_RB_WPTR to the gpu register table
Added notes about CP_RB_CNTL and CP_RB_RPTR_ADDR. Both aren't necessary for HLE
Changed name of UNKNOWN_0E00 gpu register to TC_CNTL_STATUS. Games only seem to write 1 to it (L2 invalidate)
Fully defined the structure.
Single copy of it + single timer across all modules, managing it is now the responsibility of KernelState.
add global_critical_region::PrepareToAcquire, which uses Prefetchw on the global crit. We now know we can use Prefetchw on all cpus that have AVX.
add KeQueryInterruptTime, which is used by some dashboards.
add threading::NanoSleep
Uses a bitmap that splits up the memory space into 64 KiB (65536-byte) blocks, one bit per block. It currently uses the guest virtual address but should be using physical addresses instead.
Currently, if a guest does a reserve on a location and then a reserved store to a totally different location, we trigger a breakpoint. This should never happen.
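The one-bit-per-64-KiB-block bitmap can be sketched as follows. BlockBitmap, TrySet and TryClear are hypothetical names, the sketch assumes a 32-bit address space, and it is deliberately not atomic, so it only shows the indexing, not how the emulator synchronizes access.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: one bit per 65536-byte block of a 32-bit address space.
class BlockBitmap {
 public:
  static constexpr uint32_t kBlockShift = 16;  // 1 << 16 == 65536 bytes
  static constexpr uint32_t kBlockCount = 1u << (32 - kBlockShift);

  BlockBitmap() : words_(kBlockCount / 64, 0) {}

  // Marks the block containing address. Returns false if the block was
  // already marked (for example, by another reservation).
  bool TrySet(uint32_t address) {
    uint32_t block = address >> kBlockShift;
    uint64_t mask = uint64_t{1} << (block & 63);
    uint64_t& word = words_[block >> 6];
    if (word & mask) {
      return false;
    }
    word |= mask;
    return true;
  }

  // Clears the block containing address. Returns whether it had been marked.
  bool TryClear(uint32_t address) {
    uint32_t block = address >> kBlockShift;
    uint64_t mask = uint64_t{1} << (block & 63);
    uint64_t& word = words_[block >> 6];
    bool was_set = (word & mask) != 0;
    word &= ~mask;
    return was_set;
  }

 private:
  std::vector<uint64_t> words_;
};
```

Two addresses in the same 64 KiB block share a bit, so the granularity is coarse by design; the whole 4 GiB space fits in 8 KiB of bitmap.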
Also removed the NEGATED_MUL_blah operations. They weren't necessary, nothing special is needed for the negated result variants.
Added a log message for when watched physical memory has a race; it would be nice to know when it happens and in which games.
Fixed PrefetchW feature check
Added prefetchw check to startup AVX check, there should be no CPUs that support AVX but not PrefetchW.
Init VRSAVE to all ones.
Removed unused disable_global_lock flag.
* Added controller hotkeys setting
Added option to disable controller hotkeys
Minor Changes
* Fixed locked input system
The input system lock should be released even if a controller is not connected.
Implemented Controller Hotkeys
Added controller hotkeys
Added guide button support for XInput and winkey
The hotkey configurations can be found in HID -> Display controller hotkeys
If the Xbox Gamebar overlay is enabled then use the Back button instead of the Guide button.
- Fixed hotkey thread destruction
- Fixed XINPUT_STATE by padding 4 bytes
- Added hotkey vibration for user feedback
- Replaced MessageBoxA with ImGuiDialog::ShowMessageBox
Co-authored-by: Margen67 <Margen67@users.noreply.github.com>
Uses `vpternlogd` to collapse the bitwise select operation into one
instruction. Though it needs a `vmovdqa` instruction since `vpternlogd`
reads and writes to the first argument.
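For readers unfamiliar with ternary logic, here is a portable scalar model of how a three-input truth-table instruction can express a bitwise select. The function names are hypothetical, and the immediate 0xCA with the mask as the first operand is an assumption for illustration, not necessarily the encoding this commit emits.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a three-input ternary-logic operation on one 32-bit lane.
// For each bit position, the bits of a, b and c form a 3-bit index (a is the
// most significant) into the 8-bit truth table imm8.
inline uint32_t TernaryLogic(uint32_t a, uint32_t b, uint32_t c,
                             uint8_t imm8) {
  uint32_t result = 0;
  for (int bit = 0; bit < 32; ++bit) {
    uint32_t index = (((a >> bit) & 1u) << 2) | (((b >> bit) & 1u) << 1) |
                     ((c >> bit) & 1u);
    result |= ((uint32_t{imm8} >> index) & 1u) << bit;
  }
  return result;
}

// Reference bitwise select: take if_set where mask is 1, if_clear elsewhere.
inline uint32_t Select(uint32_t mask, uint32_t if_set, uint32_t if_clear) {
  return (mask & if_set) | (~mask & if_clear);
}
```

With the mask as the first operand, the truth table 0xCA reproduces Select. Because the first operand is also the destination, a copy of whichever value sits there must be made first if it is still needed, which is where the extra `vmovdqa` comes from.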
- References to vector data become UB after vector size changes.
- Add one extra level of indirection to pin the wide string memory
location regardless of vector memory
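A minimal sketch of the extra level of indirection, assuming the wide strings are `std::u16string` values; PinnedStrings and Add are hypothetical names. Each string gets its own heap allocation, so pointers to it stay valid when the vector reallocates.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical holder: the vector stores owning pointers, not the strings
// themselves. Growing the vector moves the pointers, but each string object
// (and its character data) stays at the same address.
class PinnedStrings {
 public:
  const std::u16string* Add(std::u16string value) {
    strings_.push_back(std::make_unique<std::u16string>(std::move(value)));
    return strings_.back().get();
  }

 private:
  std::vector<std::unique_ptr<std::u16string>> strings_;
};
```

Storing the strings directly in the vector would be unsafe here: short strings typically keep their characters inside the string object, so a reallocation moves the characters as well.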
It is debatable whether this is correct in the general case.
There's nothing really wrong with 0xFFFF logically.
Burnout Paradise however bases its in-game language on this and does not recognise 0xFFFF.
The game uses Japanese in the default case.
I've avoided the "rest of Asia" code since Burnout Paradise seems to use a different value (0x01F8) for that than what I expected (0x01FC).
Burnout Paradise statically expects certain thread handle values based on how many objects it knows it is allocating ahead of time.
From this, it calculates an ID by subtracting the thread handle from a base handle of what it expects the first such thread to be assigned.
The value is statically declared in the executable and is not determined automatically.
The host objects in the handle range made these thread handles higher than what the game expects.
Removing these, and allowing 0xF8000000 to be assigned, allows the thread handles to fit perfectly in the range the game expects.
It is not clear what handle range the host objects should be taking. For now though, they're 0-based rather than 0xF8000000-based.
While a title is running, any attempt to open another title results in a crash. This commit prevents these crashes via File->Open and File->Open Recent->Title.
Opening a recent title with an invalid path caused a crash.
Reorganized SystemPageFlags for sharedmemory: each field now goes into its own array, and the three arrays are page aligned in a single virtual allocation.
Refactored sharedmemory a bit, use tzcnt if available when finding ranges (faster on pre-zen4 amd cpus)
Made commandprocessor GetcurrentRingReadcount inline. It was made noinline to match PGO decisions, but I think PGO can make extra register allocation decisions that make this inlining choice yield gains, whereas if we do it manually we lose a tiny bit of performance.
Working on a more compact vectorized version of GetScissor to save icache on cmd processor thread
Add WriteRegisterForceinline, will probably end up amending to remove it
add ERMS path for vastcpy
Adding WriteRegistersFromMemCommonSense (name will be changed later). The name is because I realized my approach to optimizing writeregisters has been backwards: instead of handling all checks more quickly within the loop that writes the registers, we need a different loop for each range with unique handling. We also manage to hoist a lot of the logic out of the loops.
Use 100ns delay for MaybeYield. We often return almost immediately from the syscall, so we end up wasting some CPU; instead we give up the CPU for the minimum waitable time (on my system, this is 0.5 ms).
Added a note about affinity mask/dynamic process affinity updates to threading_win
Add TextureFetchConstantsWritten
- Unified APU error messages
- Removed magic number from SetOffset call
- Commented out that annoying assertion from XmaContext::GetNextFrame
- Removed checks for current_input_packet_count and replaced with bool check
Not sure what to call it. I know that calls with packet count == 1 are a specific case
and are probably handled differently. Is it streaming, or what should it be called?
print thread name in host exception reports
trying to force win32 error descriptions to english
Return if output buffer block count is 0 in XmaContext, this is an attempt to fix a divide by zero crash many users have reported
Add constexpr getters to magicdiv class so it can be used from jitted x64/dxbc
Track the guest return address as well for guest/host sync. If multiple entries have the same guest stack, find the first one with a matching guest return address. This fixes Epic Mickey 2 (which the previous guest-stack change had allowed to go ingame for a bit) and potentially also a crash in Fable 3.
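The lookup rule can be sketched like this. The StackPoint layout and FindStackPoint are hypothetical and only illustrate "same guest stack, then first matching guest return address"; the real stackpoint records may hold different fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record saved each time the host enters a guest function.
struct StackPoint {
  uint64_t host_stack;
  uint32_t guest_stack;
  uint32_t guest_return_address;
};

// Returns the index of the first entry whose guest stack AND guest return
// address both match, or -1 if there is none. Matching on the guest stack
// alone is ambiguous when several entries share the same guest stack value.
inline std::ptrdiff_t FindStackPoint(const std::vector<StackPoint>& points,
                                     uint32_t guest_stack,
                                     uint32_t guest_return_address) {
  for (std::size_t i = 0; i < points.size(); ++i) {
    if (points[i].guest_stack == guest_stack &&
        points[i].guest_return_address == guest_return_address) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  return -1;
}
```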
Break if under debugger when stackpoints are overflowed
Add much more useful output for host exceptions: print out xenia_canary.exe relative offsets if the exception is in the module, FormatMessage for ntstatus/win32err, strerror
Minor d3d12 microoptimization, instead of doing SetEventOnCompletion + WaitForSingleObject do SetEventOnCompletion w/ nullptr so that the wait happens in kernel mode, avoiding two extra context switches
add unimplemented kernel functions:
ExAllocatePoolWithTag
ObReferenceObject
ObDereferenceObject has no return value.
Log a message when ObDereferenceObject/Reference receive unregistered guest kernel objects
gave ObLookupThreadByThreadId its correct error status
hoist object_types initialization out of ObReferenceObjectByHandle
Fix out parameter values on error for a few kernel funcs
add note about msr to KeSetCurrentStackPointers
add X_STATUS_OBJECT_TYPE_MISMATCH check for xeNtSetEvent
add msr_mask field to X_KPCR
Block members can be decorated with Invariant only since SPIR-V 1.5 Revision 2. In earlier versions, Invariant can be used only for variables. Mesa warns about this.
Remove dead #if 0'd code in math.h
On amd64, page_size == 4096 is a constant; on amd64 with win32, allocation_granularity == 65536. These values for x86 Windows haven't changed over the last 20 years, so this is probably safe
and gives a modest code size reduction.
Enable XE_USE_KUSER_SHARED. This sources host time from KUSER_SHARED instead of from QueryPerformanceCounter, which is far faster, but only has a granularity of 100 nanoseconds.
In some games, seemingly random crashes were happening that were hard to trace because
the faulting thread was actually not the one that was misbehaving; another thread's stack was underflowing into the faulting thread.
Added a bunch of code to synchronize the guest stack and host stack so that if a guest longjmps the host's stack will be adjusted.
Changes were also made to allow the guest to call into a piece of an existing x64 function.
This synchronization might have a slight performance impact on lower end cpus, to disable it set enable_host_guest_stack_synchronization to false.
It is possible it may have introduced regressions, but I don't know of any yet.
So far, I know the synchronization change fixes the "hub crash" in super sonic and allows the game "london 2012" to go ingame.
Removed emit_useless_fpscr_updates, not emitting these updates breaks the raiden game
MapGuestAddressToMachineCode now returns nullptr if no address was found, instead of the start of the function
add Processor::LookupModule
Add Backend::DeinitializeBackendContext
Use WriteRegisterRangeFromRing_WithKnownBound<0, 0xFFFF> in WriteRegisterRangeFromRing for inlining (previously regressed on performance of ExecutePacketType0)
add notes about flags that trap in XamInputGetCapabilities
0 == 3 in XamInputGetCapabilities
Name arg 2 of XamInputSetState
PrefetchW in critical section kernel funcs if available & doing cmpxchg
Add terminated field to X_KTHREAD, set it on termination
Expanded the logic of NtResumeThread/NtSuspendThread to include checking the type of the handle (in release, LookupObject doesn't seem to do anything with the type)
and returning X_STATUS_OBJECT_TYPE_MISMATCH if invalid. Do termination check in NtSuspendThread.
Add basic host exception messagebox, need to flesh it out more (maybe use the new stack tracking stuff if on a guest thread?)
Add rdrand patching hack, mostly affects users with nvidia cards who have many threads on zen
Use page_size_shift in more places
Once again disable precompilation! Raiden is mostly weird ppc asm which probably breaks the precompilation. The code is still useful for running the compiler over the whole of an xex in debug to test for issues
"Fix" debug console, we were checking the cvar before any cvars were loaded, and the condition it checks in AttachConsole is somehow always false
add msr field on context
write to msr for mtmsr/mfmsr. We do not have the correct default value for msr yet, nor has mtmsrd been reimplemented
do not evaluate assert expressions in release at all, while still avoiding unused variable warnings
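One common way to get "not evaluated, but no unused-variable warnings" is to wrap the expression in `sizeof`, whose operand is never evaluated. The macro names below are hypothetical and this is a sketch of the general technique, not necessarily how the project's assert macros are written.

```cpp
#include <cassert>
#include <cstdlib>

// Release form: sizeof leaves its operand unevaluated, so expr has no runtime
// cost and no side effects, yet any variable named in expr still counts as
// used and triggers no unused-variable warning.
#define SKETCH_ASSERT_RELEASE(expr) ((void)sizeof(!(expr)))

// Debug form: evaluate expr and abort if it is false.
#define SKETCH_ASSERT_DEBUG(expr) ((expr) ? (void)0 : std::abort())

#if defined(NDEBUG)
#define SKETCH_ASSERT(expr) SKETCH_ASSERT_RELEASE(expr)
#else
#define SKETCH_ASSERT(expr) SKETCH_ASSERT_DEBUG(expr)
#endif
```

The expression must still compile in release builds, which also catches asserts that have rotted after a refactor.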
Also cleanup the code involved in dialog registration, and update the explanation of why dialog removal is delayed until the end of drawing (the original was written back when window listener and UI drawer callback registration during the execution of the callbacks was deferred, but that was wrong as that might result in execution of callbacks belonging to now-deleted objects).
properly byteswap r13 for spinlock
Add PPCOpcodeBits
stub out broken fpscr updating in ppc_hir_builder. it's just code that repeatedly does nothing right now.
add note about 0 opcode bytes being executed to ppc_frontend
Add assert to check that function end is greater than function start, can happen with malformed functions
Disable prefetch and cachecontrol by default, automatic hardware prefetchers already do the job for the most part
minor cleanup in simplification_pass, dont loop optimizations, let the pass manager do it for us
Add experimental "delay_via_maybeyield" cvar, which uses MaybeYield to "emulate" the db16cyc instruction
Add much faster/simpler way of directly calling guest functions, no longer have to do a byte by byte search through the generated code
Generate label string ids on the fly
Fix unused function warnings for prefetch on clang, fix many other clang warnings
Eliminated majority of CallNativeSafes by replacing them with naive generic code paths.
^ Vector rotate left, vector shift left, vector shift right, vector shift arithmetic right, and vector average are included
These naive paths are implemented as small loops that stash the two inputs to the stack and load them into gprs from there. They are not particularly fast, but should be an order of magnitude faster than a callnativesafe
to a host function, which would involve a call, stashing all volatile registers, an indirect call, potentially setting up a stack frame for the arrays that the inputs get stashed to, the actual operations, a return, loading all volatile registers, another return, etc.
Added the fast SHR_V128 path back in
Implement signed vector average byte and signed vector average word. Previously we were emitting no code for them. Signed vector average byte appears in many games.
Fix bug with signed vector average 32: we were doing an unsigned shift, potentially turning negative values into big positive ones.
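The shift bug is easy to see on a single lane. The sketch below assumes the usual rounded-average definition, (a + b + 1) >> 1, and uses hypothetical function names; it models the arithmetic only, not the emitted x64 code.

```cpp
#include <cassert>
#include <cstdint>

// Rounded signed average of one 32-bit lane. The sum is widened to 64 bits so
// it cannot overflow, and the shift is arithmetic (sign-preserving). Right
// shift of a negative value is arithmetic on all mainstream compilers and is
// guaranteed to be so since C++20.
inline int32_t SignedAverage32(int32_t a, int32_t b) {
  int64_t sum = int64_t{a} + int64_t{b} + 1;
  return static_cast<int32_t>(sum >> 1);
}

// Model of the bug: the same sum shifted as an unsigned 32-bit value. The
// sign bit is shifted out instead of replicated, so negative averages come
// out as large positive numbers.
inline int32_t BuggyAverage32(int32_t a, int32_t b) {
  uint32_t sum = static_cast<uint32_t>(a) + static_cast<uint32_t>(b) + 1u;
  return static_cast<int32_t>(sum >> 1);
}
```

For inputs -4 and -2, the arithmetic shift yields -3, while the unsigned shift yields 2147483645.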
Previously, for mips, the dimensions of the texture weren't rounded to powers of two before calculating the mip tail extent, resulting in the mip tail for a 260 blocks tall texture, that contains mips ending at Y of up to 36, having the Y extent calculated as 32. With rounding to powers of two, it would have been 64.
However, with the GetTiledAddressUpperBound functions, none of this is necessary at all (and neither is rounding the extents in TextureGuestLayout::Level to 32x32x4 blocks) - using the same code for calculating the XYZ extents of tiled textures as for linear textures now, which, for the mip tail, calculates the actual maximum coordinates of the mips stored in it - and rounding to tiles is done internally by GetTiledAddressUpperBound.
Fixes the PIX validation warning about missing resource states on every guest draw. Also potentially prevents drivers from making assumptions about the shared memory buffer based on the bindings, though no such cases are currently known.
Uses the single-instruction AVX512 `vperm*` instructions to accelerate
the `INT8_TYPE` and `INT16_TYPE` permutation opcodes.
The `INT8_TYPE` is accelerated using `AVX512VBMI` subset of AVX512.
Available since Icelake(Intel) and Zen4(AMD).
Allows access to byte-element 2-register permutations(32-byte look up
tables) and for 64-bit multi-shifts.
Particularly adding this to accelerate the assembly of our `PERMUTE`
opcode.
Fix guest code profiler, it previously only worked with function precomp + all code you were about to execute already discovered
Allow AndNot if type is V128
Also fixes addressing of MSAA samples 2 and 3 for 64bpp color render targets in the ROV RB implementation on Direct3D 12.
Additionally, with FSI/ROV, alpha test and alpha to coverage are done only if the render target 0 was dynamically written to (according to the Direct3D 9 rules for writing to color render targets, though not sure if they actually apply to the alpha tests on Direct3D 9, but for safety).
There is also some code cleanup for things spotted during the development of the feature.