Commit Graph

7431 Commits

Author SHA1 Message Date
Triang3l 4add1f67b1 [D3D12] Replace unused shared memory view with a null view
Fixes the PIX validation warning about missing resource states on every guest draw. Also potentially prevents drivers from making assumptions about the shared memory buffer based on the bindings, though no such cases are currently known.
2022-10-23 18:09:47 +03:00
Wunkolo 5fde7c6aa5 [x64] Add AVX512 optimizations for `PERMUTE_V128`
Uses the single-instruction AVX512 `vperm*` instructions to accelerate
the `INT8_TYPE` and `INT16_TYPE` permutation opcodes.

The `INT8_TYPE` is accelerated using `AVX512VBMI` subset of AVX512.
Available since Icelake(Intel) and Zen4(AMD).
2022-10-21 08:47:31 -05:00
Wunkolo f207239349 [x64] Add `kX64EmitAVX512VBMI` feature-flag and detection
Allows access to byte-element 2-register permutations(32-byte look up
tables) and for 64-bit multi-shifts.
Particularly adding this to accelerate the assembly of our `PERMUTE`
opcode.
2022-10-21 08:47:31 -05:00
Wunkolo d73088e5ca [x64] Add AVX512 optimization for `OPCODE_VECTOR_SUB`(saturated)
Passes the `vsubuws` and `vsubsws` unit-tests from https://github.com/xenia-project/xenia/pull/1348
2022-10-21 08:45:43 -05:00
Radosław Gliński 7c375879bc
Merge pull request #85 from chrisps/canary_experimental
Kernel improvements, "fix" crash on sandy bridge/ivy bridge
2022-10-21 14:18:03 +02:00
chrisps 4493d17acc
Update hir_builder.cc 2022-10-20 14:58:27 -07:00
chrisps adc3405537
change else{if} to else if in AndNot 2022-10-20 14:56:55 -07:00
Gliniak 48fea6d9aa Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-10-18 12:19:52 +02:00
Triang3l cdb40ddb28 [DXBC] Fix interpolator copying from v# to r# in PS
The bit count was of `(1<<i)-1` itself (thus couldn't handle interpolators with a smaller index skipped), not of `bits&((1<<i)-1)`.
2022-10-18 13:12:37 +03:00
chss95cs@gmail.com d8b7b3ecec Fix bindless path in d3d12 that i broke in earlier commit (did not affect any users, thats a debug thing)
Fix guest code profiler, it previously only worked with function precomp + all code you were about to execute already discovered
Allow AndNot if type is V128
2022-10-16 07:48:43 -07:00
chss95cs@gmail.com b41e5060da Fix bindless path in d3d12 that i broke in earlier commit (did not affect any users, thats a debug thing)
Fix guest code profiler, it previously only worked with function precomp + all code you were about to execute already discovered
Allow AndNot if type is V128
2022-10-16 07:47:27 -07:00
chss95cs@gmail.com 22e52cbecd Canary can now run on sandy bridge/e and ivy bridge/e
Stubbed out OPCODE_AND_NOT, its fallback implementation if bmi1 was not supported was broken. it's difficult to tell where the actual issue is there.
2022-10-15 05:14:53 -07:00
chss95cs@gmail.com 7204532b1c Implement RtlUpcaseUnicodeChar 2022-10-15 04:29:13 -07:00
chss95cs@gmail.com d7fa8481af Switch cxxopts over to version with selectany while i wait for the selectany change to be merged there 2022-10-15 03:49:12 -07:00
chrisps a495709344
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-10-15 03:07:35 -07:00
chss95cs@gmail.com efbeae660c Drastically reduce cpu time wasted by XMADecoderThread spinning, went from 13% of all cpu time to about 0.6% in my tests
Commented out lock in WatchMemoryRange, lock is always held by caller
properly set the value/check the irql for spinlocks in xboxkrnl_threading
2022-10-15 03:07:07 -07:00
Gliniak d262214c1b Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-10-14 20:13:03 +02:00
chrisps 28b565c0c7
Merge pull request #82 from chrisps/canary_experimental
Fix missing semicolon
2022-10-09 14:07:43 -07:00
chss95cs@gmail.com ecf6bfbbdf Stub event query for linux, fix missing semicolon in linux SetEventBoostPriority 2022-10-09 12:30:18 -07:00
Triang3l 45050b2380 [GPU] Vulkan fragment shader interlock RB and related fixes/cleanup
Also fixes addressing of MSAA samples 2 and 3 for 64bpp color render targets in the ROV RB implementation on Direct3D 12.
Additionally, with FSI/ROV, alpha test and alpha to coverage are done only if the render target 0 was dynamically written to (according to the Direct3D 9 rules for writing to color render targets, though not sure if they actually apply to the alpha tests on Direct3D 9, but for safety).
There is also some code cleanup for things spotted during the development of the feature.
2022-10-09 22:06:41 +03:00
Gliniak c923ab78a9 [Config] Added note about internal_display_resolution 2022-10-09 12:33:04 +02:00
Gliniak 7975ea78d4 [Base] BitStream: Prevent readout beyond buffer 2022-10-09 12:24:46 +02:00
Gliniak 17b3939bbf Revert "[Base] Changed size of bitstream accessed data (Risky)"
This reverts commit 061000af01.
2022-10-09 12:18:43 +02:00
chrisps 08d38bdff6
Merge pull request #81 from chrisps/canary_experimental
global lock changes, minor kernel changes, premake fix
2022-10-08 12:04:43 -07:00
chss95cs@gmail.com 2dd6f33f4b Fix debug/ui premake too 2022-10-08 10:34:50 -07:00
chrisps bcd57f8663
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-10-08 10:11:30 -07:00
chss95cs@gmail.com d8c94b1aee Fix premake filter mistake that broke debug builds (and likely any build other than release) 2022-10-08 10:10:36 -07:00
chss95cs@gmail.com 8f7f7dc6ad fixed wine crash from use of NtSetEventPriorityBoost
add xe::clear_lowest_bit, use it in place of shift-andnot in some bit iteration code
make is_allocated_ and is_enabled_ volatile in xma_context
preallocate avpacket buffer in XMAContext::Setup, the reallocations of the buffer in ffmpeg were showing up on profiles
check is_enabled and is_allocated BEFORE locking an xmacontext. XMA worker was spending most of its time locking and unlocking contexts
Removed XeDMAC, dma:: namespace. It was a bad idea and I couldn't make it work in the end. Kept vastcpy and moved it to the memory namespace instead
Made the rest of global_critical_region's members static. They never needed an instance.
Removed ifdef'ed out code from ring_buffer.h
Added EventInfo struct to threading, added Event::Query to aid with implementing NtQueryEvent.
Removed vector from WaitMultiple, instead use a fixed array of 64 handles that we populate. WaitForMultipleObjects cannot handle more than 64 objects.
Remove XE_MSVC_OPTIMIZE_SMALL() use in x64_sequences, x64 backend is now always size optimized because of premake
Make global_critical_region_ static constexpr in shared_memory.h to get rid of wasteage of 8 bytes (empty class=1byte, +alignment for next member=8)
Move trace-related data to the tail of SharedMemory to keep more important data together
In IssueDraw build an array of fetch constant addresses/sizes, then pre-lock the global lock before doing requestrange for each instead of individually locking within requestrange for each of them
Consistent access specifier protected for pm4_command_processor_declare
Devirtualize WriteOneRegisterFromRing.
Move ExecutePacket and ExecutePrimaryBuffer to pm4_command_buffer_x
Remove many redundant header inclusions access xenia-gpu
Minor microoptimization of ExecutePacketType0

Add TextureCache::RequestTextures for batch invocation of LoadTexturesData

Add TextureCache::LoadTexturesData for reducing the number of times we release and reacquire the global lock.
Ideally you should hold the global lock for as little time as possible, but if you are constantly acquiring and releasing it you are actually more likely to have contention
Add already_locked param to ObjectTable::LookupObject to help with reducing lock acquire/release pairs
Add missing checks to XAudioRegisterRenderDriverClient_entry. this is unlikely to fix anything, it was just an easy thing to do
Add NtQueryEvent system call implementation. I don't actually know of any games that need it.
Instead of using std::vector + push_back in KeWaitForMultipleObjects and xeNtWaitForMultipleObjectsEx use a fixed size array of 64 and track the count. More than 64 objects is not permitted by the kernel. The repeated reallocations from push_back were appearing unusually high on the profiler, but were masked until now by waitformultipleobjects natural overhead
Pre-lock the global lock before looking up each handle for xeNtWaitForMultipleObjectsEx and KeWaitForMultipleObjects.
Pre-lock before looking up the signal and waiter in NtSignalAndWaitForSingleObjectEx
add missing checks to NtWaitForMultipleObjectsEx
Support pre-locking in XObject::GetNativeObject
2022-10-08 09:55:17 -07:00
chrisps 50fce8bdb3
Merge pull request #80 from chrisps/canary_experimental
Reduce size of final exe by about 1.6mb
2022-10-05 04:15:53 -07:00
chss95cs@gmail.com bae63b95c5 Update to latest version of cxxopts 2022-09-30 06:51:25 -07:00
chss95cs@gmail.com b4c175d8a3 Enable SDL_LEAN_AND_MEAN, SDL_RENDER_DISABLED, saves about 500kb in final exe
Build several projects that arent performance critical with /Os and /O1 under msvc windows
2022-09-29 07:26:38 -07:00
chss95cs@gmail.com 7e58a3b320 Fix compiler errors i introduced under clang-cl
remove xe_kernel_export_shim_fn field of Export function_data, trampoline is now the only way exports get invoked
Remove kernelstate argument from string functions in order to conform to the trampoline signature (the argument was unused anyway)
Constant-evaluated initialization of ppc_opcode_disasm_table, removal of unused std::vector fields
Constant-evaluated initialization of export tables
name field on export is just a const char* now, only immutable static strings are ever passed to it
Remove unused callcount field of export.
PM4 compare op function extracted
Globally apply /Oy, /GS-, /Gw on msvc windows
Remove imgui testwindow code call, it took up like 300 kb
2022-09-29 07:04:17 -07:00
Gliniak 203267b106 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-09-23 12:23:53 +02:00
Joel Linn 9ab4db285c [Premake] Update premake-cmake
- Handle compiler flags per-file. Removes ffmpeg warnings
- Switch to JoelLinn fork since original author stopped maintaining
  and other forks don't seem to care about PRs
2022-09-22 06:36:43 -05:00
Rick Gibbed 3bfa3b05e1
Lint fix. 2022-09-22 06:34:21 -05:00
Gliniak 7d970967c4 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-09-20 21:15:12 +02:00
chrisps def00e6ddb
Merge pull request #76 from beeanyew/input-system-mutex-revert
[Input System] xe_mutex revert
2022-09-18 07:46:02 -07:00
beeanyew cd17f1846f [Input System] xe_mutex revert
On request from chrisps, revert xe_mutex back to xe_unlikely_mutex to avoid mutex deadlocks while initializing hid-winkey.
2022-09-18 15:18:29 +02:00
chrisps a29a7436e0
Merge pull request #75 from chrisps/canary_experimental
misc stuff again
2022-09-17 06:43:50 -07:00
chrisps d0acd68369
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-09-17 07:05:24 -04:00
chss95cs@gmail.com eb8154908c atomic cas use prefetchw if available
remove useless memorybarrier
remove double membarrier in wait pm4 cmd
add int64 cvar
use int64 cvar for x64 feature mask
Rework some functions that were frontend bound according to vtune placing some of their code in different noinline functions, profiling after indicating l1 cache misses decreased and perf of func increased
remove long vpinsrd dep chain code for conversion.h, instead do normal load+bswap or movbe if avail
Much faster entry table via split_map, code size could be improved though
GetResolveInfo was very large and had impact on icache, mark callees as noinline + msvc pragma optimize small
use log2 shifts instead of integer divides in memory
minor optimizations in PhysicalHeap::EnableAccessCallbacks, the majority of time in the function is spent looping, NOT calling Protect! Someone should optimize this function and rework the algo completely
remove wonky scheduling log message, it was spammy and unhelpful
lock count was unnecessary for criticalsection mutex, criticalsection is already a recursive mutex
brief notes i gotta run
2022-09-17 04:04:53 -07:00
Wunkolo addd8c94e5 [x64] Add AVX512 optimization for `OPCODE_VECTOR_ADD`(saturated)
Uses a single `vpternlogd` to test for signed/unsigned
overflow/underflow. Then utilizes AVX512 mask operations to create
either `0x7FFFFFFF` or `0x80000000` arithmetically.
2022-09-14 11:39:03 -05:00
Wunkolo 9fd684594b [x64] Add AVX512 optimization for `OPCODE_VECTOR_CONVERT_F2I`(unsigned)
`vcvttps2udq` already saturates overflowing and unordered values to `0xFFFFFFFF`. Using mask registers, zeroes are written to negative values within the same instruction.
2022-09-12 13:52:57 -05:00
chrisps b4224ff3dc
Merge pull request #74 from chrisps/canary_experimental
Misc optimizations
2022-09-11 18:02:00 -04:00
chss95cs@gmail.com 0fd4a2533b Prevent clang-format from moving d3d12_nvapi above the require d3d12 headers 2022-09-11 14:35:33 -07:00
chss95cs@gmail.com 20638c2e61 use Sleep(0) instead of SwitchToThread, should waste less power and help the os with scheduling.
PM4 buffer handling made a virtual member of commandprocessor, place the implementation/declaration into reusable macro files. this is probably the biggest boost here.
Optimized SET_CONSTANT/ LOAD_CONSTANT pm4 ops based on the register range they start writing at, this was also a nice boost

Expose X64 extension flags to code outside of x64 backend, so we can detect and use things like avx512, xop, avx2, etc in normal code
Add freelists for HIR structures to try to reduce the number of last level cache misses during optimization (currently disabled... fixme later)

Analyzed PGO feedback and reordered branches, uninlined functions, moved code out into different functions based on info from it in the PM4 functions, this gave like a 2% boost at best.

Added support for the db16cyc opcode, which is used often in xb360 spinlocks. before it was just being translated to nop, now on x64 we translate it to _mm_pause but may change that in the future to reduce cpu time wasted

texture util - all our divisors were powers of 2, instead we look up a shift. this made texture scaling slightly faster, more so on intel processors which seem to be worse at int divs. GetGuestTextureLayout is now a little faster, although it is still one of the heaviest functions in the emulator when scaling is on.

xe_unlikely_mutex was not a good choice for the guest clock lock, (running theory) on intel processors another thread may take a significant time to update the clock? maybe because of the uint64 division? really not sure, but switched it to xe_mutex. This fixed audio stutter that i had introduced to 1 or 2 games, fixed performance on that n64 rare game with the monkeys.
Took another crack at DMA implementation, another failure.
Instead of passing as a parameter, keep the ringbuffer reader as the first member of commandprocessor so it can be accessed through this
Added macro for noalias
Applied noalias to Memory::LookupHeap. This reduced the size of the executable by 7 kb.
Reworked kernel shim template, this shaved like 100kb off the exe and eliminated the indirect calls from the shim to the actual implementation. We still unconditionally generate string representations of kernel calls though :(, unless it is kHighFrequency

Add nvapi extensions support, currently unused. Will use CPUVISIBLE memory at some point
Inserted prefetches in a few places based on feedback from vtune.
Add native implementation of SHA int8 if all elements are the same

Vectorized comparisons for SetViewport, SetScissorRect
Vectorized ranged comparisons for WriteRegister
Add XE_MSVC_ASSUME
Move FormatInfo::name out of the structure, instead look up the name in a different table. Debug related data and critical runtime data are best kept apart
Templated UpdateSystemConstantValues based on ROV/RTV and primitive_polygonal
Add ArchFloatMask functions, these are for storing the results of floating point comparisons without doing costly float->int pipeline transfers (vucomiss/setb)
Use floatmasks in UpdateSystemConstantValues for checking if dirty, only transfer to int at end of function.
Instead of dirty |= (x == y) in UpdateSystemConstantValues, now we do dirty_u32 |= (x^y). if any of them are not equal, dirty_u32 will be nz, else if theyre all equal it will be zero. This is more friendly to register renaming and the lack of dependencies on EFLAGS lets the compiler reorder better
Add PrefetchSamplerParameters to D3D12TextureCache
use PrefetchSamplerParameters in UpdateBindings to eliminate cache misses that vtune detected

Add PrefetchTextureBinding to D3D12TextureCache
Prefetch texture bindings to get rid of more misses vtune detected (more accesses out of order with random strides)
Rewrote DMAC, still terrible though and have disabled it for now.
Replace tiny memcmp of 6 U64 in render_target_cache with inline loop, msvc fails to make it a loop and instead does a thunk to their memcmp function, which is optimized for larger sizes

PrefetchTextureBinding in AreActiveTextureSRVKeysUpToDate
Replace memcmp calls for pipelinedescription with handwritten cmp
Directly write some registers that dont have special handling in PM4 functions
Changed EstimateMaxY to try to eliminate mispredictions that vtune was reporting, msvc ended up turning the changed code into a series of blends

in ExecutePacketType3_EVENT_WRITE_EXT, instead of writing extents to an array on the stack and then doing xe_copy_and_swap_16 of the data to its dest, pre-swap each constant and then store those. msvc manages to unroll that into wider stores
stop logging XE_SWAP every time we receive XE_SWAP, stop logging the start and end of each viz query

Prefetch watch nodes in FireWatches based on feedback from vtune
Removed dead code from texture_info.cc
NOINLINE on GpuSwap, PGO builds did it so we should too.
2022-09-11 14:14:48 -07:00
Wunkolo 90fffe1de7 [PPC] Fix memory assert formatting
This was still using printf-style format specifiers. Causing memory
asserts to show up like this while testing.

```
!> 0000438C Memory 10001040 assert failed:
!> 0000438C   Expected: %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X
!> 0000438C     Actual: %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X
!> 0000438C     TEST FAILED
```

Updated them so they format correctly:

```
!> 00002CCC Memory 10001040 assert failed:
!> 00002CCC   Expected: FC FD FE FF 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
!> 00002CCC     Actual: FC FD FE FF 00 00 00 00 00 00 00 00 00 00 00 00
!> 00002CCC     TEST FAILED
```
2022-09-05 13:47:48 -05:00
Wunkolo b0cc3db4d8 [x64] Add AVX512 optimization for `NOT_V128` 2022-09-05 13:47:30 -05:00
chrisps 9a6dd4cd6f
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-09-05 09:08:46 -04:00
chss95cs@gmail.com 0c576877c8 Add constant folding for LVR when 16 aligned, clean up prior commit by removing dead test code for LVR/LVL/STVL/STVR opcodes and legacy hir sequence
Delay using mm_pause in KeAcquireSpinLockAtRaisedIrql_entry, a huge amount of time is spent spinning in halo3
2022-09-04 22:42:51 -05:00