Commit Graph

2309 Commits

Author SHA1 Message Date
chss95cs@gmail.com 2fa2f1a78c Add more wrapper functions to ppc_context_t in kernel, want to switch…
… over to referencing state through ppc_context as much as possible, it'll make implementing things like kernel processes much easier in the future

Move forward definitions of kernel types into kernel_fwd.h

Stub implementation of  XamLoaderGetMediaInfoEx
Partially implement XamSetDashContext
implement XamGetDashContext
Stub implementation of XamUserIsUnsafeProgrammingAllowed
Stub implementation of XamUserGetSubscriptionType
Expanded the supported object types for ObReferenceObjectByHandle and wrapped the logic for encoding the type in a constexpr function
ObReferenceObjectByName was taking lpstring_t for first param, but the function actually takes X_ANSI_STRING ptr

NtReleaseMutant actually does not return anything from KeReleaseMutant, it just checks the handle and returns whether it was invalid.

Changed the raise/lower irql functions to just set the irql on the pcr instead of setting it on field of Processor, processor is a shared object and irql is per-thread

Semi-stub implementation of KeGetImagePageTableEntry. I locked at it in the HV and got it so the values are in the same range the HV returns + actually reflect the page & memory range, but i doubt its equal to the values the hv returns on real hw. Used by modern dashboards, don't know for what.
Log error message for ObDereferenceObject w/ null ptr.

Allocate a special fixed page that dashboards reference, currently don't know what the data on that page is supposed to be.
Add current_irql field to X_KPCR
Added Device object member of XObject::Type enum
Added some notes about other gpu registers, found a table of register names and indices in xam
2023-04-23 10:39:52 -04:00
Triang3l 8aaa6f1f7d [SPIR-V] Wrap 4-operand ops and 1-3-operand GLSL std calls 2023-04-19 21:44:24 +03:00
Triang3l 19d56001d2 [SPIR-V] Wrap NoContraction operations 2023-04-19 11:53:45 +03:00
Triang3l 78f1d55a36 [SPIR-V] Use Builder createSelectionMerge directly 2023-04-19 11:11:28 +03:00
Triang3l 64d2a80f79 [SPIR-V] Cleanup ALU emulation conditionals 2023-04-19 10:35:09 +03:00
Triang3l eede38ff63 [SPIR-V] Remove more vec2-4 reserve calls 2023-04-18 22:05:02 +03:00
chss95cs@gmail.com 27c4cef1b5 Added logger flags, for selectively disabling categories of logging (cpu, apu, kernel). Need to make more log messages make use of these flags.
The "close window" keyboard hotkey (Guide-B) now toggles between loglevel -1 and the loglevel set in your config.
Added LoggerBatch class, which accumulates strings into the threads scratch buffer. This is only intended to be used for very high frequency debug logging. if it exhausts the thread buffer, it just silently stops.
Cleaned nearly 8 years of dust off of the pm4 packet disassembler code, now supports all packets that the command processor supports.
Added extremely verbose logging for gpu register writes. This is not compiled in outside of debug builds, requires LogLevel::Debug and log_guest_driven_gpu_register_written_values = true.
Added full logging of all PM4 packets in the cp. This is not compiled in outside of debug builds, requires LogLevel::Debug and disassemble_pm4.
Piggybacked an implementation of guest callstack backtraces using the stackpoints from enable_host_guest_stack_synchronization. If enable_host_guest_stack_synchronization = false, no backtraces can be obtained.
Added log_ringbuffer_kickoff_initiator_bts. when a thread updates the cp's read pointer, it dumps the backtrace of that thread
Changed the names of the gpu registers CALLBACK_ADDRESS and CALLBACK_CONTEXT to the correct names.
Added a note about CP_PROG_COUNTER
Added CP_RB_WPTR to the gpu register table
Added notes about CP_RB_CNTL and CP_RB_RPTR_ADDR. Both aren't necessary for HLE
Changed name of UNKNOWN_0E00 gpu register to TC_CNTL_STATUS. Games only seem to write 1 to it (L2 invalidate)
2023-04-16 12:42:42 -04:00
chrisps 12c9135843
Merge branch 'xenia-project:master' into canary_experimental 2023-04-16 09:11:39 -04:00
Triang3l 887fda55c2 [SPIR-V] Remove temp reserve for 4 or less elements 2023-04-13 22:43:44 +03:00
Triang3l 75d805245d [DXBC] `discard` pixels from `kill` with ROV instead of returning
Keep the current lane active as it may be needed for derivatives.
2023-04-09 20:13:22 +03:00
Gliniak 5e0c67438c Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2023-04-09 17:28:04 +02:00
Gliniak 7fce8fce9d Revert "[D3D12] Added workaround for broken Geometry Shader in AMD 7900 Series GPU"
This reverts commit 2afd2cc4d6.
2023-04-09 17:27:55 +02:00
Triang3l 88c645d818 [D3D12] Don't use emit_then_cut due to RDNA 3 crash 2023-04-09 18:07:44 +03:00
Triang3l baa2ff78d8 [Vulkan] Add missing stencil reference unpack in RT transfer + formatting fix 2023-03-30 22:40:40 +03:00
Triang3l c238d8af55 [Vulkan] Fix FragStencilRef store type 2023-03-30 22:28:56 +03:00
Gliniak 23bd18cfca [GPU] Check if memory page is available while copying data 2023-03-11 16:20:01 +01:00
Gloria d62fe21d47
RADV bug fix (#139)
* Use correct typing for stencil, dispatch launch on UI thread
* Clean up some LaunchPath code
2023-03-06 08:31:05 +01:00
Adrian 459497f0b6
Implemented Controller Hotkeys (#111)
Implemented Controller Hotkeys

Added controller hotkeys
Added guide button support for XInput and winkey

The hotkey configurations can be found in HID -> Display controller hotkeys
If the Xbox Gamebar overlay is enabled then use the Back button instead of the Guide button.

- Fixed hotkey thread destruction
- Fixed XINPUT_STATE by padding 4 bytes
- Added hotkey vibration for user feedback
- Replaced MessageBoxA with ImGuiDialog::ShowMessageBox

Co-authored-by: Margen67 <Margen67@users.noreply.github.com>
2023-01-13 09:17:43 +01:00
Gliniak 2afd2cc4d6 [D3D12] Added workaround for broken Geometry Shader in AMD 7900 Series GPU 2022-12-31 11:27:01 +01:00
Gliniak 26415cb8b1 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-12-31 11:19:01 +01:00
Joel Linn b641e39c4d [ImGui] Use new ImageButton signature 2022-12-28 14:16:32 -06:00
Joel Linn d6b5cbd634 [ImGui] Fix deprecated SetCursorPos use 2022-12-28 14:16:32 -06:00
Joel Linn 8f88bb237e [ImGui] Fix empty IDs 2022-12-28 14:16:32 -06:00
chss95cs@gmail.com 3952997116 Fix issue introduced yesterday where the final fetch constant would never be marked as written
Reorganized SystemPageFlags for sharedmemory, each field now goes into its own array, the three arrays are page aligned in a single virtual allocation
Refactored sharedmemory a bit, use tzcnt if available when finding ranges (faster on pre-zen4 amd cpus)
2022-12-15 08:35:36 -08:00
chss95cs@gmail.com f931c34ecb Cleaned up for commit, moved WriteRegistersFromMemCommonSense code into WriteRegistersFromMem
optimized copy_and_swap_32_unaligned further
2022-12-14 11:34:33 -08:00
chss95cs@gmail.com 754293ffc3 Fix mistake with fetch constant dirty mask 2022-12-14 10:13:13 -08:00
chss95cs@gmail.com 7a0fd0f32a Remove MaybeYields when vsync is off 2022-12-14 09:56:33 -08:00
chss95cs@gmail.com 080b6f4cbd Partially vectorized GetScissor (loading and unpacking the bitfields from the registers is still scalar) 2022-12-14 09:33:14 -08:00
chss95cs@gmail.com ab6d9dade0 add avx2 codepath for copy_and_swap_32_unaligned
use the new writerange approach in WriteRegisterRangeFromMem_WithKnownBound
2022-12-14 07:53:21 -08:00
chss95cs@gmail.com 82dcf3f951 faster/more compact MatchValueAndRef
Made commandprocessor GetcurrentRingReadcount inline, it was made noinline to match PGO decisions but i think PGO can make extra reg allocation decisions that make this inlining choice yield gains, whereas if we do it manually we lose a tiny bit of performance

Working on a more compact vectorized version of GetScissor to save icache on cmd processor thread
Add WriteRegisterForceinline, will probably end up amending to remove it

add ERMS path for vastcpy

Adding WriteRegistersFromMemCommonSense (name will be changed later), name is because i realized my approach with optimizing writeregisters has been backwards,  instead of handling all checks more quickly within the loop that writes the registers, we need a different loop for each range with unique handling. we also manage to hoist a lot of the logic out of the loops

Use 100ns delay for MaybeYield, noticed we often return almost immediately from the syscall so we end up wasting some cpu, instead we give up the cpu for the min waitable time (on my system, this is 0.5 ms)

Added a note about affinity mask/dynamic process affinity updates to threading_win

Add TextureFetchConstantsWritten
2022-12-13 11:25:33 -08:00
chss95cs@gmail.com a63f424c0a Directly check PEB for IsDebuggerAttached
Add constexpr getters to magicdiv class so it can be used from jitted x64/dxbc
Track the guest return address as well for guest/host sync, if multiple entries have the same guest stack find the first one with a matching guest retaddr. this fixes epic mickey 2 (which the previous guest-stack change had allowed to go ingame for a bit) and potentially also a crash in fable3.
Break if under debugger when stackpoints are overflowed

Add much more useful output for host exceptions, print out xenia_canary.exe relative offsets if exception is in module, formatmessage for ntstatus/win32err, strerror

Minor d3d12 microoptimization, instead of doing SetEventOnCompletion + WaitForSingleObject do SetEventOnCompletion w/ nullptr so that the wait happens in kernel mode, avoiding two extra context switches

add unimplemented kernel functions:
ExAllocatePoolWithTag
ObReferenceObject

ObDereferenceObject has no return value.
Log a message when ObDereferenceObject/Reference receive unregistered guest kernel objects
gave ObLookupThreadByThreadId its correct error status
hoist object_types initialization out of ObReferenceObjectByHandle
Fix out parameter values on error for a few kernel funcs
add note about msr to KeSetCurrentStackPointers
add X_STATUS_OBJECT_TYPE_MISMATCH check for xeNtSetEvent
add msr_mask field to X_KPCR
2022-12-04 12:38:19 -08:00
Triang3l e97eb75b94 [Vulkan] Update variableMultisampleRate comments (actually supported) [ci skip] 2022-12-04 14:55:56 +03:00
Triang3l 0b4f5ef286 [SPIR-V] Decorate whole gl_PerVertex with Invariant
Block members can be decorated with Invariant only since SPIR-V 1.5 Revision 2. In earlier versions, Invariant can be used only for variables. Mesa warns about this.
2022-12-03 14:27:43 +03:00
chss95cs@gmail.com 90c771526d "Fix" debug console, we were checking the cvar before any cvars were loaded, and the condition it checks in AttachConsole is somehow always false
Remove dead #if 0'd code in math.h

On amd64, page_size == 4096 constant, on amd64 w/ win32, allocation_granularity == 65536. These values for x86 windows havent changed over the last 20 years so this is probably safe
and gives a modest code size reduction

Enable XE_USE_KUSER_SHARED. This sources host time from KUSER_SHARED instead of from QueryPerformanceCounter, which is far faster, but only has a granularity of 100 nanoseconds.

In some games seemingly random crashes were happening that were hard to trace because
the faulting thread was actually not the one that was misbehaving, another threads stack was underflowing into the faulting thread.

Added a bunch of code to synchronize the guest stack and host stack so that if a guest longjmps the host's stack will be adjusted.
Changes were also made to allow the guest to call into a piece of an existing x64 function.

This synchronization might have a slight performance impact on lower end cpus, to disable it set enable_host_guest_stack_synchronization to false.
It is possible it may have introduced regressions, but i dont know of any yet

So far, i know the synchronization change fixes the "hub crash" in super sonic and allows the game "london 2012" to go ingame.

Removed emit_useless_fpscr_updates, not emitting these updates breaks the raiden game

MapGuestAddressToMachineCode now returns nullptr if no address was found, instead of the start of the function

add Processor::LookupModule

Add Backend::DeinitializeBackendContext
Use WriteRegisterRangeFromRing_WithKnownBound<0, 0xFFFF> in WriteRegisterRangeFromRing for inlining (previously regressed on performance of ExecutePacketType0)

add notes about flags that trap in XamInputGetCapabilities

0 == 3 in XamInputGetCapabilities

Name arg 2 of XamInputSetState

PrefetchW in critical section kernel funcs if available & doing cmpxchg

Add terminated field to X_KTHREAD, set it on termination

Expanded the logic of NtResumeThread/NtSuspendThread to include checking the type of the handle (in release, LookupObject doesnt seem to do anything with the type)
and returning X_STATUS_OBJECT_TYPE_MISMATCH if invalid. Do termination check in NtSuspendThread.

Add basic host exception messagebox, need to flesh it out more (maybe use the new stack tracking stuff if on guest thrd?)

Add rdrand patching hack, mostly affects users with nvidia cards who have many threads on zen

Use page_size_shift in more places

Once again disable precompilation! Raiden is mostly weird ppc asm which probably breaks the precompilation. The code is still useful for running the compiler over the whole of an xex in debug to test for issues
"Fix" debug console, we were checking the cvar before any cvars were loaded, and the condition it checks in AttachConsole is somehow always false

Remove dead #if 0'd code in math.h

On amd64, page_size == 4096 constant, on amd64 w/ win32, allocation_granularity == 65536. These values for x86 windows havent changed over the last 20 years so this is probably safe
and gives a modest code size reduction

Enable XE_USE_KUSER_SHARED. This sources host time from KUSER_SHARED instead of from QueryPerformanceCounter, which is far faster, but only has a granularity of 100 nanoseconds.

In some games seemingly random crashes were happening that were hard to trace because
the faulting thread was actually not the one that was misbehaving, another threads stack was underflowing into the faulting thread.

Added a bunch of code to synchronize the guest stack and host stack so that if a guest longjmps the host's stack will be adjusted.
Changes were also made to allow the guest to call into a piece of an existing x64 function.

This synchronization might have a slight performance impact on lower end cpus, to disable it set enable_host_guest_stack_synchronization to false.
It is possible it may have introduced regressions, but i dont know of any yet

So far, i know the synchronization change fixes the "hub crash" in super sonic and allows the game "london 2012" to go ingame.

Removed emit_useless_fpscr_updates, not emitting these updates breaks the raiden game

MapGuestAddressToMachineCode now returns nullptr if no address was found, instead of the start of the function

add Processor::LookupModule

Add Backend::DeinitializeBackendContext
Use WriteRegisterRangeFromRing_WithKnownBound<0, 0xFFFF> in WriteRegisterRangeFromRing for inlining (previously regressed on performance of ExecutePacketType0)

add notes about flags that trap in XamInputGetCapabilities

0 == 3 in XamInputGetCapabilities

Name arg 2 of XamInputSetState

PrefetchW in critical section kernel funcs if available & doing cmpxchg

Add terminated field to X_KTHREAD, set it on termination

Expanded the logic of NtResumeThread/NtSuspendThread to include checking the type of the handle (in release, LookupObject doesnt seem to do anything with the type)
and returning X_STATUS_OBJECT_TYPE_MISMATCH if invalid. Do termination check in NtSuspendThread.

Add basic host exception messagebox, need to flesh it out more (maybe use the new stack tracking stuff if on guest thrd?)

Add rdrand patching hack, mostly affects users with nvidia cards who have many threads on zen

Use page_size_shift in more places

Once again disable precompilation! Raiden is mostly weird ppc asm which probably breaks the precompilation. The code is still useful for running the compiler over the whole of an xex in debug to test for issues
2022-11-27 09:39:33 -08:00
chss95cs@gmail.com c1d922eebf Minor decoder optimizations, kernel fixes, cpu backend fixes 2022-11-05 10:50:33 -07:00
chrisps 8186792113
Revert "Minor decoder optimizations, kernel fixes, cpu backend fixes" 2022-11-01 14:45:36 -07:00
chss95cs@gmail.com bff264b5fd Fixed RtlCompareString and RtlCompareStringN, they were very wrong, for CompareString the params are struct ptrs not char ptrs
Fixed a ton of clang-cl compiler warnings about unused variables, still many left. Fixed a lot of inconsistent override ones too
2022-10-30 10:47:09 -07:00
Triang3l a37b57ca8d [GPU] Fix tiled mip tail extent calculation
Previously, for mips, the dimensions of the texture weren't rounded to powers of two before calculating the mip tail extent, resulting in the mip tail for a 260 blocks tall texture, that contains mips ending at Y of up to 36, having the Y extent calculated as 32. With rounding to powers of two, it would have been 64.

However, with the GetTiledAddressUpperBound functions, none of this is necessary at all (and neither is rounding the extents in TextureGuestLayout::Level to 32x32x4 blocks) - using the same code for calculating the XYZ extents of tiled textures as for linear textures now, which, for the mip tail, calculates the actual maximum coordinates of the mips stored in it - and rounding to tiles is done internally by GetTiledAddressUpperBound.
2022-10-23 21:26:47 +03:00
Triang3l 74f1f6bb6d [Vulkan] Check depthClamp feature 2022-10-23 19:01:17 +03:00
Triang3l 4add1f67b1 [D3D12] Replace unused shared memory view with a null view
Fixes the PIX validation warning about missing resource states on every guest draw. Also potentially prevents drivers from making assumptions about the shared memory buffer based on the bindings, though no such cases are currently known.
2022-10-23 18:09:47 +03:00
Radosław Gliński 7c375879bc
Merge pull request #85 from chrisps/canary_experimental
Kernel improvements, "fix" crash on sandy bridge/ivy bridge
2022-10-21 14:18:03 +02:00
Gliniak 48fea6d9aa Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-10-18 12:19:52 +02:00
Triang3l cdb40ddb28 [DXBC] Fix interpolator copying from v# to r# in PS
The bit count was of `(1<<i)-1` itself (thus couldn't handle interpolators with a smaller index skipped), not of `bits&((1<<i)-1)`.
2022-10-18 13:12:37 +03:00
chss95cs@gmail.com d8b7b3ecec Fix bindless path in d3d12 that i broke in earlier commit (did not affect any users, thats a debug thing)
Fix guest code profiler, it previously only worked with function precomp + all code you were about to execute already discovered
Allow AndNot if type is V128
2022-10-16 07:48:43 -07:00
chrisps a495709344
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-10-15 03:07:35 -07:00
chss95cs@gmail.com efbeae660c Drastically reduce cpu time wasted by XMADecoderThread spinning, went from 13% of all cpu time to about 0.6% in my tests
Commented out lock in WatchMemoryRange, lock is always held by caller
properly set the value/check the irql for spinlocks in xboxkrnl_threading
2022-10-15 03:07:07 -07:00
Gliniak d262214c1b Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-10-14 20:13:03 +02:00
Triang3l 45050b2380 [GPU] Vulkan fragment shader interlock RB and related fixes/cleanup
Also fixes addressing of MSAA samples 2 and 3 for 64bpp color render targets in the ROV RB implementation on Direct3D 12.
Additionally, with FSI/ROV, alpha test and alpha to coverage are done only if the render target 0 was dynamically written to (according to the Direct3D 9 rules for writing to color render targets, though not sure if they actually apply to the alpha tests on Direct3D 9, but for safety).
There is also some code cleanup for things spotted during the development of the feature.
2022-10-09 22:06:41 +03:00
chss95cs@gmail.com 8f7f7dc6ad fixed wine crash from use of NtSetEventPriorityBoost
add xe::clear_lowest_bit, use it in place of shift-andnot in some bit iteration code
make is_allocated_ and is_enabled_ volatile in xma_context
preallocate avpacket buffer in XMAContext::Setup, the reallocations of the buffer in ffmpeg were showing up on profiles
check is_enabled and is_allocated BEFORE locking an xmacontext. XMA worker was spending most of its time locking and unlocking contexts
Removed XeDMAC, dma:: namespace. It was a bad idea and I couldn't make it work in the end. Kept vastcpy and moved it to the memory namespace instead
Made the rest of global_critical_region's members static. They never needed an instance.
Removed ifdef'ed out code from ring_buffer.h
Added EventInfo struct to threading, added Event::Query to aid with implementing NtQueryEvent.
Removed vector from WaitMultiple, instead use a fixed array of 64 handles that we populate. WaitForMultipleObjects cannot handle more than 64 objects.
Remove XE_MSVC_OPTIMIZE_SMALL() use in x64_sequences, x64 backend is now always size optimized because of premake
Make global_critical_region_ static constexpr in shared_memory.h to get rid of wasteage of 8 bytes (empty class=1byte, +alignment for next member=8)
Move trace-related data to the tail of SharedMemory to keep more important data together
In IssueDraw build an array of fetch constant addresses/sizes, then pre-lock the global lock before doing requestrange for each instead of individually locking within requestrange for each of them
Consistent access specifier protected for pm4_command_processor_declare
Devirtualize WriteOneRegisterFromRing.
Move ExecutePacket and ExecutePrimaryBuffer to pm4_command_buffer_x
Remove many redundant header inclusions access xenia-gpu
Minor microoptimization of ExecutePacketType0

Add TextureCache::RequestTextures for batch invocation of LoadTexturesData

Add TextureCache::LoadTexturesData for reducing the number of times we release and reacquire the global lock.
Ideally you should hold the global lock for as little time as possible, but if you are constantly acquiring and releasing it you are actually more likely to have contention
Add already_locked param to ObjectTable::LookupObject to help with reducing lock acquire/release pairs
Add missing checks to XAudioRegisterRenderDriverClient_entry. this is unlikely to fix anything, it was just an easy thing to do
Add NtQueryEvent system call implementation. I don't actually know of any games that need it.
Instead of using std::vector + push_back in KeWaitForMultipleObjects and xeNtWaitForMultipleObjectsEx use a fixed size array of 64 and track the count. More than 64 objects is not permitted by the kernel. The repeated reallocations from push_back were appearing unusually high on the profiler, but were masked until now by waitformultipleobjects natural overhead
Pre-lock the global lock before looking up each handle for xeNtWaitForMultipleObjectsEx and KeWaitForMultipleObjects.
Pre-lock before looking up the signal and waiter in NtSignalAndWaitForSingleObjectEx
add missing checks to NtWaitForMultipleObjectsEx
Support pre-locking in XObject::GetNativeObject
2022-10-08 09:55:17 -07:00
chss95cs@gmail.com 7e58a3b320 Fix compiler errors i introduced under clang-cl
remove xe_kernel_export_shim_fn field of Export function_data, trampoline is now the only way exports get invoked
Remove kernelstate argument from string functions in order to conform to the trampoline signature (the argument was unused anyway)
Constant-evaluated initialization of ppc_opcode_disasm_table, removal of unused std::vector fields
Constant-evaluated initialization of export tables
name field on export is just a const char* now, only immutable static strings are ever passed to it
Remove unused callcount field of export.
PM4 compare op function extracted
Globally apply /Oy, /GS-, /Gw on msvc windows
Remove imgui testwindow code call, it took up like 300 kb
2022-09-29 07:04:17 -07:00
chss95cs@gmail.com eb8154908c atomic cas use prefetchw if available
remove useless memorybarrier
remove double membarrier in wait pm4 cmd
add int64 cvar
use int64 cvar for x64 feature mask
Rework some functions that were frontend bound according to vtune placing some of their code in different noinline functions, profiling after indicating l1 cache misses decreased and perf of func increased
remove long vpinsrd dep chain code for conversion.h, instead do normal load+bswap or movbe if avail
Much faster entry table via split_map, code size could be improved though
GetResolveInfo was very large and had impact on icache, mark callees as noinline + msvc pragma optimize small
use log2 shifts instead of integer divides in memory
minor optimizations in PhysicalHeap::EnableAccessCallbacks, the majority of time in the function is spent looping, NOT calling Protect! Someone should optimize this function and rework the algo completely
remove wonky scheduling log message, it was spammy and unhelpful
lock count was unnecessary for criticalsection mutex, criticalsection is already a recursive mutex
brief notes i gotta run
2022-09-17 04:04:53 -07:00
chss95cs@gmail.com 20638c2e61 use Sleep(0) instead of SwitchToThread, should waste less power and help the os with scheduling.
PM4 buffer handling made a virtual member of commandprocessor, place the implementation/declaration into reusable macro files. this is probably the biggest boost here.
Optimized SET_CONSTANT/ LOAD_CONSTANT pm4 ops based on the register range they start writing at, this was also a nice boost

Expose X64 extension flags to code outside of x64 backend, so we can detect and use things like avx512, xop, avx2, etc in normal code
Add freelists for HIR structures to try to reduce the number of last level cache misses during optimization (currently disabled... fixme later)

Analyzed PGO feedback and reordered branches, uninlined functions, moved code out into different functions based on info from it in the PM4 functions, this gave like a 2% boost at best.

Added support for the db16cyc opcode, which is used often in xb360 spinlocks. before it was just being translated to nop, now on x64 we translate it to _mm_pause but may change that in the future to reduce cpu time wasted

texture util - all our divisors were powers of 2, instead we look up a shift. this made texture scaling slightly faster, more so on intel processors which seem to be worse at int divs. GetGuestTextureLayout is now a little faster, although it is still one of the heaviest functions in the emulator when scaling is on.

xe_unlikely_mutex was not a good choice for the guest clock lock, (running theory) on intel processors another thread may take a significant time to update the clock? maybe because of the uint64 division? really not sure, but switched it to xe_mutex. This fixed audio stutter that i had introduced to 1 or 2 games, fixed performance on that n64 rare game with the monkeys.
Took another crack at DMA implementation, another failure.
Instead of passing as a parameter, keep the ringbuffer reader as the first member of commandprocessor so it can be accessed through this
Added macro for noalias
Applied noalias to Memory::LookupHeap. This reduced the size of the executable by 7 kb.
Reworked kernel shim template, this shaved like 100kb off the exe and eliminated the indirect calls from the shim to the actual implementation. We still unconditionally generate string representations of kernel calls though :(, unless it is kHighFrequency

Add nvapi extensions support, currently unused. Will use CPUVISIBLE memory at some point
Inserted prefetches in a few places based on feedback from vtune.
Add native implementation of SHA int8 if all elements are the same

Vectorized comparisons for SetViewport, SetScissorRect
Vectorized ranged comparisons for WriteRegister
Add XE_MSVC_ASSUME
Move FormatInfo::name out of the structure, instead look up the name in a different table. Debug related data and critical runtime data are best kept apart
Templated UpdateSystemConstantValues based on ROV/RTV and primitive_polygonal
Add ArchFloatMask functions, these are for storing the results of floating point comparisons without doing costly float->int pipeline transfers (vucomiss/setb)
Use floatmasks in UpdateSystemConstantValues for checking if dirty, only transfer to int at end of function.
Instead of dirty |= (x == y) in UpdateSystemConstantValues, now we do dirty_u32 |= (x^y). if any of them are not equal, dirty_u32 will be nz, else if theyre all equal it will be zero. This is more friendly to register renaming and the lack of dependencies on EFLAGS lets the compiler reorder better
Add PrefetchSamplerParameters to D3D12TextureCache
use PrefetchSamplerParameters in UpdateBindings to eliminate cache misses that vtune detected

Add PrefetchTextureBinding to D3D12TextureCache
Prefetch texture bindings to get rid of more misses vtune detected (more accesses out of order with random strides)
Rewrote DMAC, still terrible though and have disabled it for now.
Replace tiny memcmp of 6 U64 in render_target_cache with inline loop, msvc fails to make it a loop and instead does a thunk to their memcmp function, which is optimized for larger sizes

PrefetchTextureBinding in AreActiveTextureSRVKeysUpToDate
Replace memcmp calls for pipelinedescription with handwritten cmp
Directly write some registers that dont have special handling in PM4 functions
Changed EstimateMaxY to try to eliminate mispredictions that vtune was reporting, msvc ended up turning the changed code into a series of blends

in ExecutePacketType3_EVENT_WRITE_EXT, instead of writing extents to an array on the stack and then doing xe_copy_and_swap_16 of the data to its dest, pre-swap each constant and then store those. msvc manages to unroll that into wider stores
stop logging XE_SWAP every time we receive XE_SWAP, stop logging the start and end of each viz query

Prefetch watch nodes in FireWatches based on feedback from vtune
Removed dead code from texture_info.cc
NOINLINE on GpuSwap, PGO builds did it so we should too.
2022-09-11 14:14:48 -07:00
chss95cs@gmail.com c6010bd4b1 nasty commit with a bunch of test code left in, will clean up and pr
Remove the logger_ != nullptr check from shouldlog, it will nearly always be true except on initialization and gets checked later anyway, this shrinks the size of the generated code for some
Select specialized vastcpy for current cpu, for now only have paths for MOVDIR64B and generic avx1
Add XE_UNLIKELY/LIKELY if, they map better to the c++ unlikely/likely attributes which we will need to use soon
Finished reimplementing STVL/STVR/LVL/LVR as their own opcodes. we now generate far less code for these instructions. this also means optimization passes can be written to simplify/remove/replace these instructions in some cases. Found that a good deal of the X86 we were emitting for these instructions was dead code or redundant.
the reduction in generated HIR/x86 should help a lot with compilation times and make function precompilation more feasible as a default

Don't static assert in default prefetch impl, in c++20 the assertion will be triggered even without an instantiation
Reorder some if/else to prod msvc into ordering the branches optimally. it somewhat worked...
Added some notes about which opcodes should be removed/refactored
Dispatch in WriteRegister via vector compares for the bounds. still not very optimal, we ought to be checking whether any register in a range may be special
A lot of work on trying to optimize writeregister, moved wraparound path into a noinline function based on profiling info
Hoist the IsUcodeAnalyzed check out of AnalyzeShader, instead check it before each call. Profiler recorded many hits in the stack frame setup of the function, but none in the actual body of it, so the check is often true but the stack frame setup is run unconditionally
Pre-check whether we're about to write a single register from a ring
Replace more jump tables from draw_util/texture_info with popcnt based sparse indexing/bit tables/shuffle lookups
Place the GPU register file on its own VAD/virtual allocation, it is no longer a member of graphics system
2022-09-04 11:04:41 -07:00
chss95cs@gmail.com 78c9a48bc2 also use vastcpy for shared memory page stuff 2022-08-28 14:52:12 -07:00
chss95cs@gmail.com f31869092c Fixed a bug with readback_resolve and readback_memexport that was responsible for a large portion of their overhead. readback_memexport and resolve are now usable for games, depending on your hardware. in my case games that were slideshows now run at like 20-30 fps, and my hardware isnt the best for xenia.
add split_map class for mapping keys to values in a way that optimizes for frequent searches and infrequent insertions/removals
remove jump table implementation of GetColorRenderTargetFormatComponentCount, it was appearing relatively high in profiles. instead pack the component counts into a single 32 bit word, which is indexed by shifting
Add cvar to align all basic blocks to a boundary
Add mmio aware load paths
liberally apply XE_RESTRICT in ringbuffer related code
Removed the IS_TRUE and IS_FALSE opcodes, they were pointless duplicates of COMPARE_EQ/COMPARE_NE and i want to simplify our set of opcodes for future backends
More work on LVSR/LVSL/STVR/STVL opcodes
Optimized X64 translated code emission, now only compute instrkey once
Add code for pre-computing integer division magic numbers
Optimized GetHostViewportInfo a little
Move args for GetHostViewportInfo into a class, cache the result and compare for future queries. moved GetHostViewportInfo far lower on the profile
Add (currently not functional, and very racy) asynchronous memcpy code. will improve it and actually use it in future commits.
Add non-temporal memcpy function for huge page-aligned allocations. Used for copying to shared memory/readback
hoist are_accumulated_render_targets_valid_ check out of loop in render_target_cache already bound check.
Add stosb/movsb code for small constant memcpys/memsets that arent worth the overhead of memcpy/memset
2022-08-28 14:24:25 -07:00
Radosław Gliński 0b013fdc6b
Merge pull request #61 from chrisps/canary_experimental
performance improvements, kernel fixes, cpu accuracy improvements
2022-08-21 09:31:09 +02:00
chss95cs@gmail.com 457296850e Add OPCODE_NEGATED_MUL_ADD/OPCODE_NEGATED_MUL_SUB
Proper handling of nans for VMX max/min on x64 (minps/maxps has special behavior depending on the operand order that vmx does not have for vminfp/vmaxfp)
Add extremely unintrusive guest code profiler utilizing KUSER_SHARED systemtime. This profiler is disabled on platforms other than windows, and on windows is disabled by default by a cvar
Repurpose GUEST_SCRATCH64 stack offset to instead be for storing guest function profile times, define GUEST_SCRATCH as 0 instead, since thats already meant to be a scratch area
Fix xenia silently closing on config errors/other fatal errors by setting has_console_attached_'s default to false
Add alternative code path for guest clock that uses kusershared systemtime instead of QueryPerformanceCounter. This is way faster and I have tested it and found it to be working, but i have disabled it because i do not know how well it works on wine or on processors other than mine
Significantly reduce log spam by setting XELOGAPU and XELOGGPU to be LogLevel::Debug
Changed some LOGI to LOGD in places to reduce log spam
Mark VdSwap as kHighFrequency, it was spamming up logs
Make logging calls less intrusive for the caller by forcing the test of log level inline and moving the format/AppendLogLine stuff to an outlined cold function
Add swcache namespace for software cache operations like prefetches, streaming stores and streaming loads.
Add XE_MSVC_REORDER_BARRIER for preventing msvc from propagating a value too close to its store or from its load
Add xe_unlikely_mutex for locks we know have very little contention
add XE_HOST_CACHE_LINE_SIZE and XE_RESTRICT to platform.h
Microoptimization: Changed most uses of size_t to ring_size_t in RingBuffer, this reduces the size of the inlined ringbuffer operations slightly by eliminating rex prefixes, depending on register allocation
Add BeginPrefetchedRead to ringbuffer, which prefetches the second range if there is one according to the provided PrefetchTag
added inline_loadclock cvar, which will directly use the value of the guest clock from clock.cc in jitted guest code. off by default
change uses of GUEST_SCRATCH64 to GUEST_SCRATCH
Add fast vectorized xenos_half_to_float/xenos_float_to_half (currently resides in x64_seq_vector, move to gpu code maybe at some point)
Add fast x64 codegen for PackFloat16_4/UnpackFloat16_4. Same code can be used for Float16_2 in future commit. This should speed up some games that use these functions heavily
Remove cvar for toggling old float16 behavior
Add VRSAVE register, support mfspr/mtspr vrsave
Add cvar for toggling off codegen for trap instructions and set it to true by default.
Add specialized methods to CommandProcessor: WriteRegistersFromMem, WriteRegisterRangeFromRing, and WriteOneRegisterFromRing. These reduce the overall cost of WriteRegister
Use a fixed size vmem vector for upload ranges, realloc/memsetting on resize  in the inner loop of requestranges was showing up on the profiler (the search in requestranges itself needs work)
Rename fixed_vmem_vector to better fit xenia's naming convention
Only log unknown register writes in WriteRegister if DEBUG :/. We're stuck on MSVC with c++17 so we have no way of influencing the branch ordering for that function without profile guided optimization
Remove binding stride assert in shader_translator.cc, triangle told me its leftover ogl stuff
Mark xe::FatalError as noreturn
If a controller is not connected, delay by 1.1 seconds before checking if it has been reconnected. Asking Xinput about a controller slot that is unused is extremely slow, and XinputGetState/SetState were taking up
an enormous amount of time in profiles. this may have caused a bit of input lag
Protect accesses to input_system with a lock
Add proper handling for user_index>= 4 in XamInputGetState/SetState, properly return zeroed state in GetState
Add missing argument to NtQueryVirtualMemory_entry
Fixed RtlCompareMemoryUlong_entry, it actually does not care if the source is misaligned, and for length it aligns down
Fixed RtlUpperChar and RtlLowerChar, added a table that has their correct return values precomputed
2022-08-20 11:40:19 -07:00
Gliniak e06978e5be [Premake] Cleanup & Fixed references in cpu-tests 2022-08-17 09:43:55 +02:00
chss95cs@gmail.com 7cc364dcb8 squash reallocs in command buffers by using large prealloced buffer, directly use virtual memory with it so os allocs on demand
mark raw clock functions as noinline, the way msvc was inlining them and ordering the branches meant that rdtsc would often be speculatively executed
add alternative clock impl for win, instead of using queryperformancecounter we grab systemtime from kusershared. it does not have the same precision as queryperformancecounter, we only have 100 nanosecond precision, but we round to milliseconds so it never made sense to use the performance counter in the first place
stubbed out the "guest clock mutex"... (the entirety of clock.cc needs a rewrite)
added some helpers for minf/maxf without the nan handling behavior
2022-08-14 13:42:08 -07:00
chss95cs@gmail.com c9b2d10e17 alternative mutex impl on windows works but i really can't tell if it helps much. use larger size in deferred_command_list to cut down on resizes in big scenes on m:dur 2022-08-14 10:26:50 -07:00
chss95cs@gmail.com 08f7a28920 Alternative mutex 2022-08-14 08:59:11 -07:00
chss95cs@gmail.com cb85fe401c Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90%
But for normal msvc builds i would put it at around 30-50%
Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up
Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger
fixed a number of errors where temporaries were being passed by reference/pointer
Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes.
Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold.
Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me.
Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it
Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total
Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time
Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op
Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x
For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases
^this can be toggled off in the platform_win header
handle indirect call true with constant function pointer, was occurring in h3
lookup host format swizzle in denser array
by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar
^looking up whether its known or not took approx 0.3% cpu time
Changed some things in /cpu to make the project UNITYBUILD friendly
The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead
tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds)
Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed
added support for docdecaduple precision floating point so that we can represent our performance gains numerically
tons of other stuff im probably forgetting
2022-08-13 12:59:00 -07:00
Gliniak 0e3403d6da Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-30 12:42:51 +02:00
Triang3l 7595cdb52b [Vulkan] Non-GS point sprites + minor SPIR-V fixes 2022-07-27 17:14:28 +03:00
Triang3l ff7ef05063 [SPIR-V] Clamp cube face using NClamp, not NMax/FMin 2022-07-26 17:08:12 +03:00
Triang3l 66c995f3aa [SPIR-V] Saturate point sprite coordinates 2022-07-26 17:04:22 +03:00
Triang3l 8fb5da18ea [Vulkan] Add forgotten fullDrawIndexUint32 check 2022-07-26 16:24:14 +03:00
Triang3l 9fa41c27bc [Vulkan] Point sprite geometry shader 2022-07-26 16:01:20 +03:00
Gliniak 76806e08c5 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-26 10:22:38 +02:00
Triang3l f248e23079 [DXBC] Skip backface check in point PsParamGen 2022-07-25 21:48:25 +03:00
Triang3l 77e85ecaa4 [Vulkan] 32-bit index fetch without fullDrawIndexUint32 2022-07-25 16:53:12 +03:00
Gliniak 98c2cb636f Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-24 17:38:08 +02:00
Triang3l 37579d3bf0 [GPU] Treat non-adaptive-tessellated patches as 1-control-point 2022-07-24 17:38:26 +03:00
Gliniak 1fcac00924 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-23 13:26:31 +02:00
Triang3l 3c12814276 [GPU] EDRAM looped addressing (resolves #2031) 2022-07-22 23:51:50 +03:00
Gliniak 0c782ade8e Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-21 18:52:33 +02:00
Triang3l 6ff312afb1 [DXBC] Update PsParamGen comment [ci skip] 2022-07-21 12:42:06 +03:00
Triang3l 1a95bef8b3 [GPU] Eliminate unused shader I/O, UCP culling, centroid on Vulkan
For more optimal usage of exports and the parameter cache on the host regardless of how effective the optimizations in the host GPU driver are. Also reserve space for Vulkan/Metal/D3D11-specific HostVertexShaderTypes to use one more bit for the host vertex shader type in the shader modification bits, so that won't have to be done in the future as that would require invalidating shader storages (which are invalidated by this commit) again.
2022-07-21 12:32:28 +03:00
Gliniak bc315d21e0 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-19 10:45:14 +02:00
Triang3l 0a94b86cb8 [GPU] Remove orphaned GetPresentArea declaration [ci skip] 2022-07-18 21:02:34 +03:00
Gliniak 6e1e62378f Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-17 21:27:52 +02:00
Triang3l 14fdf4b270 [GPU] Up to 7x7 resolution scaling 2022-07-17 20:41:50 +03:00
Triang3l e8652e544a [GPU] Translucent trace viewer controls 2022-07-17 17:29:41 +03:00
Triang3l 25663827ba [GPU] Trace viewer Android content URI loading 2022-07-17 16:37:49 +03:00
Triang3l 2a69d1db4d [Vulkan] Fix a typo in a comment about BC textures [ci skip] 2022-07-14 21:16:23 +03:00
Triang3l 037310f8dc [Android] Unified xenia-app with windowed apps and build prerequisites 2022-07-11 21:45:57 +03:00
Gliniak 1d00372e6b Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-10 10:50:39 +02:00
Triang3l b41bb35a20 [SPIR-V] Make interpolators an array to fix Adreno linkage 2022-07-09 17:52:26 +03:00
Triang3l b3edc56576 [Vulkan] Merge texture and sampler descriptors into a single descriptor set
Put all descriptors used by translated shaders in up to 4 descriptor sets, which is the minimum required, and the most common on Android, `maxBoundDescriptorSets` device limit value
2022-07-09 17:10:28 +03:00
Triang3l e4de8663c4 [Vulkan] All guest draw uniform buffer bindings in a single descriptor set
Reduce the number of bound descriptor sets from 10 to 6, which is still above the minimum limit of 4, but closer
2022-07-07 21:05:56 +03:00
Triang3l 88c055eb30 [CPU] Null backend enough for GPU trace viewing 2022-07-06 23:28:06 +03:00
Triang3l 3ee68d79ea Revert "[GPU] Make Processor optional for GraphicsSystem setup"
The Processor is still required in many places, including the GPU command processor worker thread

This reverts commit fd03d886e9.
2022-07-06 22:43:40 +03:00
Triang3l fd03d886e9 [GPU] Make Processor optional for GraphicsSystem setup 2022-07-05 21:21:22 +03:00
Triang3l bdfd410b13 [CPU] Cleanup x64 backend usage conditionals 2022-07-05 21:07:10 +03:00
Triang3l d263d508cd [GPU] Make operator< const 2022-07-05 20:47:53 +03:00
Triang3l 536f14d94c [GPU] Fix a typo in a Neon intrinsic name 2022-07-05 20:47:34 +03:00
Triang3l feaad639fb [Vulkan] Destroy all RTs before VulkanRenderTargetCache is destroyed 2022-07-04 11:27:51 +03:00
Gliniak 6e753c6399 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-07-04 08:11:04 +02:00
Triang3l 2621dabf0f [Vulkan] Native 24-bit unorm depth where available 2022-07-03 21:21:17 +03:00
Triang3l ee84f4e267 [Vulkan] Update title bar warning 2022-07-03 19:45:48 +03:00