Commit Graph

7153 Commits

Author SHA1 Message Date
chrisps 50fce8bdb3
Merge pull request #80 from chrisps/canary_experimental
Reduce size of final exe by about 1.6mb
2022-10-05 04:15:53 -07:00
chss95cs@gmail.com bae63b95c5 Update to latest version of cxxopts 2022-09-30 06:51:25 -07:00
chss95cs@gmail.com b4c175d8a3 Enable SDL_LEAN_AND_MEAN, SDL_RENDER_DISABLED, saves about 500kb in final exe
Build several projects that arent performance critical with /Os and /O1 under msvc windows
2022-09-29 07:26:38 -07:00
chss95cs@gmail.com 7e58a3b320 Fix compiler errors i introduced under clang-cl
remove xe_kernel_export_shim_fn field of Export function_data, trampoline is now the only way exports get invoked
Remove kernelstate argument from string functions in order to conform to the trampoline signature (the argument was unused anyway)
Constant-evaluated initialization of ppc_opcode_disasm_table, removal of unused std::vector fields
Constant-evaluated initialization of export tables
name field on export is just a const char* now, only immutable static strings are ever passed to it
Remove unused callcount field of export.
PM4 compare op function extracted
Globally apply /Oy, /GS-, /Gw on msvc windows
Remove imgui testwindow code call, it took up like 300 kb
2022-09-29 07:04:17 -07:00
Gliniak 203267b106 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-09-23 12:23:53 +02:00
Joel Linn 9ab4db285c [Premake] Update premake-cmake
- Handle compiler flags per-file. Removes ffmpeg warnings
- Switch to JoelLinn fork since original author stopped maintaining
  and other forks don't seem to care about PRs
2022-09-22 06:36:43 -05:00
Rick Gibbed 3bfa3b05e1
Lint fix. 2022-09-22 06:34:21 -05:00
Gliniak 7d970967c4 Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental 2022-09-20 21:15:12 +02:00
chrisps def00e6ddb
Merge pull request #76 from beeanyew/input-system-mutex-revert
[Input System] xe_mutex revert
2022-09-18 07:46:02 -07:00
beeanyew cd17f1846f [Input System] xe_mutex revert
On request from chrisps, revert xe_mutex back to xe_unlikely_mutex to avoid mutex deadlocks while initializing hid-winkey.
2022-09-18 15:18:29 +02:00
chrisps a29a7436e0
Merge pull request #75 from chrisps/canary_experimental
misc stuff again
2022-09-17 06:43:50 -07:00
chrisps d0acd68369
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-09-17 07:05:24 -04:00
chss95cs@gmail.com eb8154908c atomic cas use prefetchw if available
remove useless memorybarrier
remove double membarrier in wait pm4 cmd
add int64 cvar
use int64 cvar for x64 feature mask
Rework some functions that were frontend bound according to vtune placing some of their code in different noinline functions, profiling after indicating l1 cache misses decreased and perf of func increased
remove long vpinsrd dep chain code for conversion.h, instead do normal load+bswap or movbe if avail
Much faster entry table via split_map, code size could be improved though
GetResolveInfo was very large and had impact on icache, mark callees as noinline + msvc pragma optimize small
use log2 shifts instead of integer divides in memory
minor optimizations in PhysicalHeap::EnableAccessCallbacks, the majority of time in the function is spent looping, NOT calling Protect! Someone should optimize this function and rework the algo completely
remove wonky scheduling log message, it was spammy and unhelpful
lock count was unnecessary for criticalsection mutex, criticalsection is already a recursive mutex
brief notes i gotta run
2022-09-17 04:04:53 -07:00
Wunkolo addd8c94e5 [x64] Add AVX512 optimization for `OPCODE_VECTOR_ADD`(saturated)
Uses a single `vpternlogd` to test for signed/unsigned
overflow/underflow. Then utilizes AVX512 mask operations to create
either `0x7FFFFFFF` or `0x80000000` arithmetically.
2022-09-14 11:39:03 -05:00
Wunkolo 9fd684594b [x64] Add AVX512 optimization for `OPCODE_VECTOR_CONVERT_F2I`(unsigned)
`vcvttps2udq` already saturates overflowing and unordered values to `0xFFFFFFFF`. Using mask registers, zeroes are written to negative values within the same instruction.
2022-09-12 13:52:57 -05:00
chrisps b4224ff3dc
Merge pull request #74 from chrisps/canary_experimental
Misc optimizations
2022-09-11 18:02:00 -04:00
chss95cs@gmail.com 0fd4a2533b Prevent clang-format from moving d3d12_nvapi above the require d3d12 headers 2022-09-11 14:35:33 -07:00
chss95cs@gmail.com 20638c2e61 use Sleep(0) instead of SwitchToThread, should waste less power and help the os with scheduling.
PM4 buffer handling made a virtual member of commandprocessor, place the implementation/declaration into reusable macro files. this is probably the biggest boost here.
Optimized SET_CONSTANT/ LOAD_CONSTANT pm4 ops based on the register range they start writing at, this was also a nice boost

Expose X64 extension flags to code outside of x64 backend, so we can detect and use things like avx512, xop, avx2, etc in normal code
Add freelists for HIR structures to try to reduce the number of last level cache misses during optimization (currently disabled... fixme later)

Analyzed PGO feedback and reordered branches, uninlined functions, moved code out into different functions based on info from it in the PM4 functions, this gave like a 2% boost at best.

Added support for the db16cyc opcode, which is used often in xb360 spinlocks. before it was just being translated to nop, now on x64 we translate it to _mm_pause but may change that in the future to reduce cpu time wasted

texture util - all our divisors were powers of 2, instead we look up a shift. this made texture scaling slightly faster, more so on intel processors which seem to be worse at int divs. GetGuestTextureLayout is now a little faster, although it is still one of the heaviest functions in the emulator when scaling is on.

xe_unlikely_mutex was not a good choice for the guest clock lock, (running theory) on intel processors another thread may take a significant time to update the clock? maybe because of the uint64 division? really not sure, but switched it to xe_mutex. This fixed audio stutter that i had introduced to 1 or 2 games, fixed performance on that n64 rare game with the monkeys.
Took another crack at DMA implementation, another failure.
Instead of passing as a parameter, keep the ringbuffer reader as the first member of commandprocessor so it can be accessed through this
Added macro for noalias
Applied noalias to Memory::LookupHeap. This reduced the size of the executable by 7 kb.
Reworked kernel shim template, this shaved like 100kb off the exe and eliminated the indirect calls from the shim to the actual implementation. We still unconditionally generate string representations of kernel calls though :(, unless it is kHighFrequency

Add nvapi extensions support, currently unused. Will use CPUVISIBLE memory at some point
Inserted prefetches in a few places based on feedback from vtune.
Add native implementation of SHA int8 if all elements are the same

Vectorized comparisons for SetViewport, SetScissorRect
Vectorized ranged comparisons for WriteRegister
Add XE_MSVC_ASSUME
Move FormatInfo::name out of the structure, instead look up the name in a different table. Debug related data and critical runtime data are best kept apart
Templated UpdateSystemConstantValues based on ROV/RTV and primitive_polygonal
Add ArchFloatMask functions, these are for storing the results of floating point comparisons without doing costly float->int pipeline transfers (vucomiss/setb)
Use floatmasks in UpdateSystemConstantValues for checking if dirty, only transfer to int at end of function.
Instead of dirty |= (x == y) in UpdateSystemConstantValues, now we do dirty_u32 |= (x^y). if any of them are not equal, dirty_u32 will be nz, else if theyre all equal it will be zero. This is more friendly to register renaming and the lack of dependencies on EFLAGS lets the compiler reorder better
Add PrefetchSamplerParameters to D3D12TextureCache
use PrefetchSamplerParameters in UpdateBindings to eliminate cache misses that vtune detected

Add PrefetchTextureBinding to D3D12TextureCache
Prefetch texture bindings to get rid of more misses vtune detected (more accesses out of order with random strides)
Rewrote DMAC, still terrible though and have disabled it for now.
Replace tiny memcmp of 6 U64 in render_target_cache with inline loop, msvc fails to make it a loop and instead does a thunk to their memcmp function, which is optimized for larger sizes

PrefetchTextureBinding in AreActiveTextureSRVKeysUpToDate
Replace memcmp calls for pipelinedescription with handwritten cmp
Directly write some registers that dont have special handling in PM4 functions
Changed EstimateMaxY to try to eliminate mispredictions that vtune was reporting, msvc ended up turning the changed code into a series of blends

in ExecutePacketType3_EVENT_WRITE_EXT, instead of writing extents to an array on the stack and then doing xe_copy_and_swap_16 of the data to its dest, pre-swap each constant and then store those. msvc manages to unroll that into wider stores
stop logging XE_SWAP every time we receive XE_SWAP, stop logging the start and end of each viz query

Prefetch watch nodes in FireWatches based on feedback from vtune
Removed dead code from texture_info.cc
NOINLINE on GpuSwap, PGO builds did it so we should too.
2022-09-11 14:14:48 -07:00
Wunkolo 90fffe1de7 [PPC] Fix memory assert formatting
This was still using printf-style format specifiers. Causing memory
asserts to show up like this while testing.

```
!> 0000438C Memory 10001040 assert failed:
!> 0000438C   Expected: %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X
!> 0000438C     Actual: %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X
!> 0000438C     TEST FAILED
```

Updated them so they format correctly:

```
!> 00002CCC Memory 10001040 assert failed:
!> 00002CCC   Expected: FC FD FE FF 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
!> 00002CCC     Actual: FC FD FE FF 00 00 00 00 00 00 00 00 00 00 00 00
!> 00002CCC     TEST FAILED
```
2022-09-05 13:47:48 -05:00
Wunkolo b0cc3db4d8 [x64] Add AVX512 optimization for `NOT_V128` 2022-09-05 13:47:30 -05:00
chrisps 9a6dd4cd6f
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-09-05 09:08:46 -04:00
chss95cs@gmail.com 0c576877c8 Add constant folding for LVR when 16 aligned, clean up prior commit by removing dead test code for LVR/LVL/STVL/STVR opcodes and legacy hir sequence
Delay using mm_pause in KeAcquireSpinLockAtRaisedIrql_entry, a huge amount of time is spent spinning in halo3
2022-09-04 22:42:51 -05:00
chss95cs@gmail.com d372d8d5e3 nasty commit with a bunch of test code left in, will clean up and pr
Remove the logger_ != nullptr check from shouldlog, it will nearly always be true except on initialization and gets checked later anyway, this shrinks the size of the generated code for some
Select specialized vastcpy for current cpu, for now only have paths for MOVDIR64B and generic avx1
Add XE_UNLIKELY/LIKELY if, they map better to the c++ unlikely/likely attributes which we will need to use soon
Finished reimplementing STVL/STVR/LVL/LVR as their own opcodes. we now generate far less code for these instructions. this also means optimization passes can be written to simplify/remove/replace these instructions in some cases. Found that a good deal of the X86 we were emitting for these instructions was dead code or redundant.
the reduction in generated HIR/x86 should help a lot with compilation times and make function precompilation more feasible as a default

Don't static assert in default prefetch impl, in c++20 the assertion will be triggered even without an instantiation
Reorder some if/else to prod msvc into ordering the branches optimally. it somewhat worked...
Added some notes about which opcodes should be removed/refactored
Dispatch in WriteRegister via vector compares for the bounds. still not very optimal, we ought to be checking whether any register in a range may be special
A lot of work on trying to optimize writeregister, moved wraparound path into a noinline function based on profiling info
Hoist the IsUcodeAnalyzed check out of AnalyzeShader, instead check it before each call. Profiler recorded many hits in the stack frame setup of the function, but none in the actual body of it, so the check is often true but the stack frame setup is run unconditionally
Pre-check whether we're about to write a single register from a ring
Replace more jump tables from draw_util/texture_info with popcnt based sparse indexing/bit tables/shuffle lookups
Place the GPU register file on its own VAD/virtual allocation, it is no longer a member of graphics system
2022-09-04 22:42:51 -05:00
illusion0001 f62ac9868a Make portable default for new install 2022-09-04 22:42:40 -05:00
chrisps 5476d5e422
Merge branch 'xenia-canary:canary_experimental' into canary_experimental 2022-09-04 14:45:03 -04:00
chss95cs@gmail.com 2e5c4937fd Add constant folding for LVR when 16 aligned, clean up prior commit by removing dead test code for LVR/LVL/STVL/STVR opcodes and legacy hir sequence
Delay using mm_pause in KeAcquireSpinLockAtRaisedIrql_entry, a huge amount of time is spent spinning in halo3
2022-09-04 11:44:29 -07:00
chss95cs@gmail.com c6010bd4b1 nasty commit with a bunch of test code left in, will clean up and pr
Remove the logger_ != nullptr check from shouldlog, it will nearly always be true except on initialization and gets checked later anyway, this shrinks the size of the generated code for some
Select specialized vastcpy for current cpu, for now only have paths for MOVDIR64B and generic avx1
Add XE_UNLIKELY/LIKELY if, they map better to the c++ unlikely/likely attributes which we will need to use soon
Finished reimplementing STVL/STVR/LVL/LVR as their own opcodes. we now generate far less code for these instructions. this also means optimization passes can be written to simplify/remove/replace these instructions in some cases. Found that a good deal of the X86 we were emitting for these instructions was dead code or redundant.
the reduction in generated HIR/x86 should help a lot with compilation times and make function precompilation more feasible as a default

Don't static assert in default prefetch impl, in c++20 the assertion will be triggered even without an instantiation
Reorder some if/else to prod msvc into ordering the branches optimally. it somewhat worked...
Added some notes about which opcodes should be removed/refactored
Dispatch in WriteRegister via vector compares for the bounds. still not very optimal, we ought to be checking whether any register in a range may be special
A lot of work on trying to optimize writeregister, moved wraparound path into a noinline function based on profiling info
Hoist the IsUcodeAnalyzed check out of AnalyzeShader, instead check it before each call. Profiler recorded many hits in the stack frame setup of the function, but none in the actual body of it, so the check is often true but the stack frame setup is run unconditionally
Pre-check whether we're about to write a single register from a ring
Replace more jump tables from draw_util/texture_info with popcnt based sparse indexing/bit tables/shuffle lookups
Place the GPU register file on its own VAD/virtual allocation, it is no longer a member of graphics system
2022-09-04 11:04:41 -07:00
Radosław Gliński c1d3e35eb9
Merge pull request #66 from chrisps/canary_experimental
Huge boost to readback_memexport/resolve performance by fixing old bug; miscellaneous optimizations
2022-08-29 00:55:24 +02:00
chss95cs@gmail.com 78c9a48bc2 also use vastcpy for shared memory page stuff 2022-08-28 14:52:12 -07:00
chss95cs@gmail.com f31869092c Fixed a bug with readback_resolve and readback_memexport that was responsible for a large portion of their overhead. readback_memexport and resolve are now usable for games, depending on your hardware. in my case games that were slideshows now run at like 20-30 fps, and my hardware isnt the best for xenia.
add split_map class for mapping keys to values in a way that optimizes for frequent searches and infrequent insertions/removals
remove jump table implementation of GetColorRenderTargetFormatComponentCount, it was appearing relatively high in profiles. instead pack the component counts into a single 32 bit word, which is indexed by shifting
Add cvar to align all basic blocks to a boundary
Add mmio aware load paths
liberally apply XE_RESTRICT in ringbuffer related code
Removed the IS_TRUE and IS_FALSE opcodes, they were pointless duplicates of COMPARE_EQ/COMPARE_NE and i want to simplify our set of opcodes for future backends
More work on LVSR/LVSL/STVR/STVL opcodes
Optimized X64 translated code emission, now only compute instrkey once
Add code for pre-computing integer division magic numbers
Optimized GetHostViewportInfo a little
Move args for GetHostViewportInfo into a class, cache the result and compare for future queries. moved GetHostViewportInfo far lower on the profile
Add (currently not functional, and very racy) asynchronous memcpy code. will improve it and actually use it in future commits.
Add non-temporal memcpy function for huge page-aligned allocations. Used for copying to shared memory/readback
hoist are_accumulated_render_targets_valid_ check out of loop in render_target_cache already bound check.
Add stosb/movsb code for small constant memcpys/memsets that arent worth the overhead of memcpy/memset
2022-08-28 14:24:25 -07:00
Radosław Gliński 335a390d43
Merge pull request #64 from beeanyew/cpu-updates-raiden-fighters
Some minor CPU updates
2022-08-28 20:52:42 +02:00
beeanyew 3569e97e0e [CPU] Add rldicx implementation
NOTE: May or may not be correct, but works for 535507D4.
2022-08-28 20:02:39 +02:00
beeanyew 75ed343e72 [CPU] Add stub OE handling implementation for addex and negx 2022-08-28 20:01:26 +02:00
illusion0001 04c9c02270 Guest crash message more useful 2022-08-24 09:42:56 -05:00
Radosław Gliński 9006b309af
Merge pull request #62 from chrisps/canary_experimental
Minor correctness/constant folding fixes, guest code optimizations for pre-ryzen amd processors
2022-08-23 00:01:24 +02:00
chss95cs@gmail.com 1ffd7ecae8 Remove vpcmov print 2022-08-21 12:40:56 -07:00
chss95cs@gmail.com b5ef3453c7 Disable most XOP code by default, the manual must be wrong for the shifts or we must be assembling them incorrectly, will return to it later and fix
comparisons and select done by xop are fine though
2022-08-21 12:32:33 -07:00
chss95cs@gmail.com b26c6ee1b8 Fix some more constant folding
fabsx does NOT set fpscr
turns out that our vector unsigned compare instructions are a bit wierd?
2022-08-21 10:27:54 -07:00
chss95cs@gmail.com 0ebc109d4d add initial xop codepaths, still need to finish the rest of the compares, and then do shifts, rotates, and PERMUTE
Add vector simplification pass, so far it only recognizes whether VECTOR_DENORMFLUSH is useless and optimizes them away
Tag restgplr/savegplr/restvmx/savevmx/restfpr/savefpr with useful information, i intend to inline them (they tend to be the most heavily called guest functions)
2022-08-21 08:55:42 -07:00
Gliniak da00ede181 [XAM/Settings] Check if provided size doesn't exceed maximal setting size 2022-08-21 17:46:00 +02:00
Radosław Gliński 0b013fdc6b
Merge pull request #61 from chrisps/canary_experimental
performance improvements, kernel fixes, cpu accuracy improvements
2022-08-21 09:31:09 +02:00
chss95cs@gmail.com d85bfc1894 Dont constant evaluate MAX with V128!
Fix signed zeroes behavior for vmaxfp emulation, was causing a block in sonic to move perpetually, very slowly
2022-08-20 14:22:05 -07:00
Gliniak 010b59e81c [Emulator] Install Content: Create header for installed packages
This fixes support for certain DLCs
2022-08-20 20:44:30 +02:00
Gliniak 469d062a50 [Emulator] Updated "Install Content" function to match PR status 2022-08-20 20:44:30 +02:00
Gliniak f19cb704aa [Emulator] Added error checking while creating directories 2022-08-20 20:44:30 +02:00
chss95cs@gmail.com 457296850e Add OPCODE_NEGATED_MUL_ADD/OPCODE_NEGATED_MUL_SUB
Proper handling of nans for VMX max/min on x64 (minps/maxps has special behavior depending on the operand order that vmx does not have for vminfp/vmaxfp)
Add extremely unintrusive guest code profiler utilizing KUSER_SHARED systemtime. This profiler is disabled on platforms other than windows, and on windows is disabled by default by a cvar
Repurpose GUEST_SCRATCH64 stack offset to instead be for storing guest function profile times, define GUEST_SCRATCH as 0 instead, since thats already meant to be a scratch area
Fix xenia silently closing on config errors/other fatal errors by setting has_console_attached_'s default to false
Add alternative code path for guest clock that uses kusershared systemtime instead of QueryPerformanceCounter. This is way faster and I have tested it and found it to be working, but i have disabled it because i do not know how well it works on wine or on processors other than mine
Significantly reduce log spam by setting XELOGAPU and XELOGGPU to be LogLevel::Debug
Changed some LOGI to LOGD in places to reduce log spam
Mark VdSwap as kHighFrequency, it was spamming up logs
Make logging calls less intrusive for the caller by forcing the test of log level inline and moving the format/AppendLogLine stuff to an outlined cold function
Add swcache namespace for software cache operations like prefetches, streaming stores and streaming loads.
Add XE_MSVC_REORDER_BARRIER for preventing msvc from propagating a value too close to its store or from its load
Add xe_unlikely_mutex for locks we know have very little contention
add XE_HOST_CACHE_LINE_SIZE and XE_RESTRICT to platform.h
Microoptimization: Changed most uses of size_t to ring_size_t in RingBuffer, this reduces the size of the inlined ringbuffer operations slightly by eliminating rex prefixes, depending on register allocation
Add BeginPrefetchedRead to ringbuffer, which prefetches the second range if there is one according to the provided PrefetchTag
added inline_loadclock cvar, which will directly use the value of the guest clock from clock.cc in jitted guest code. off by default
change uses of GUEST_SCRATCH64 to GUEST_SCRATCH
Add fast vectorized xenos_half_to_float/xenos_float_to_half (currently resides in x64_seq_vector, move to gpu code maybe at some point)
Add fast x64 codegen for PackFloat16_4/UnpackFloat16_4. Same code can be used for Float16_2 in future commit. This should speed up some games that use these functions heavily
Remove cvar for toggling old float16 behavior
Add VRSAVE register, support mfspr/mtspr vrsave
Add cvar for toggling off codegen for trap instructions and set it to true by default.
Add specialized methods to CommandProcessor: WriteRegistersFromMem, WriteRegisterRangeFromRing, and WriteOneRegisterFromRing. These reduce the overall cost of WriteRegister
Use a fixed size vmem vector for upload ranges, realloc/memsetting on resize  in the inner loop of requestranges was showing up on the profiler (the search in requestranges itself needs work)
Rename fixed_vmem_vector to better fit xenia's naming convention
Only log unknown register writes in WriteRegister if DEBUG :/. We're stuck on MSVC with c++17 so we have no way of influencing the branch ordering for that function without profile guided optimization
Remove binding stride assert in shader_translator.cc, triangle told me its leftover ogl stuff
Mark xe::FatalError as noreturn
If a controller is not connected, delay by 1.1 seconds before checking if it has been reconnected. Asking Xinput about a controller slot that is unused is extremely slow, and XinputGetState/SetState were taking up
an enormous amount of time in profiles. this may have caused a bit of input lag
Protect accesses to input_system with a lock
Add proper handling for user_index>= 4 in XamInputGetState/SetState, properly return zeroed state in GetState
Add missing argument to NtQueryVirtualMemory_entry
Fixed RtlCompareMemoryUlong_entry, it actually does not care if the source is misaligned, and for length it aligns down
Fixed RtlUpperChar and RtlLowerChar, added a table that has their correct return values precomputed
2022-08-20 11:40:19 -07:00
Margen67 f551e59015
CI: Remove game patches
People are too stupid to understand how portable mode works, and these can be outdated anyway.
2022-08-19 03:39:29 -07:00
Gliniak e06978e5be [Premake] Cleanup & Fixed references in cpu-tests 2022-08-17 09:43:55 +02:00
Gliniak 0df92130e6 [Memory] Changed amount of kernel reserved pages.
This fixes flickering in games with resoultion scaling enabled
2022-08-15 17:51:29 +02:00
chss95cs@gmail.com 7cc364dcb8 squash reallocs in command buffers by using large prealloced buffer, directly use virtual memory with it so os allocs on demand
mark raw clock functions as noinline, the way msvc was inlining them and ordering the branches meant that rdtsc would often be speculatively executed
add alternative clock impl for win, instead of using queryperformancecounter we grab systemtime from kusershared. it does not have the same precision as queryperformancecounter, we only have 100 nanosecond precision, but we round to milliseconds so it never made sense to use the performance counter in the first place
stubbed out the "guest clock mutex"... (the entirety of clock.cc needs a rewrite)
added some helpers for minf/maxf without the nan handling behavior
2022-08-14 13:42:08 -07:00