xenia-canary

Commit Graph

Author	SHA1	Message	Date
Gliniak	7975ea78d4	[Base] BitStream: Prevent readout beyond buffer	2022-10-09 12:24:46 +02:00
Gliniak	17b3939bbf	Revert "[Base] Changed size of bitstream accessed data (Risky)" This reverts commit `061000af01`.	2022-10-09 12:18:43 +02:00
chrisps	08d38bdff6	Merge pull request #81 from chrisps/canary_experimental global lock changes, minor kernel changes, premake fix	2022-10-08 12:04:43 -07:00
chss95cs@gmail.com	2dd6f33f4b	Fix debug/ui premake too	2022-10-08 10:34:50 -07:00
chrisps	bcd57f8663	Merge branch 'xenia-canary:canary_experimental' into canary_experimental	2022-10-08 10:11:30 -07:00
chss95cs@gmail.com	d8c94b1aee	Fix premake filter mistake that broke debug builds (and likely any build other than release)	2022-10-08 10:10:36 -07:00
chss95cs@gmail.com	8f7f7dc6ad	fixed wine crash from use of NtSetEventPriorityBoost add xe::clear_lowest_bit, use it in place of shift-andnot in some bit iteration code make is_allocated_ and is_enabled_ volatile in xma_context preallocate avpacket buffer in XMAContext::Setup, the reallocations of the buffer in ffmpeg were showing up on profiles check is_enabled and is_allocated BEFORE locking an xmacontext. XMA worker was spending most of its time locking and unlocking contexts Removed XeDMAC, dma:: namespace. It was a bad idea and I couldn't make it work in the end. Kept vastcpy and moved it to the memory namespace instead Made the rest of global_critical_region's members static. They never needed an instance. Removed ifdef'ed out code from ring_buffer.h Added EventInfo struct to threading, added Event::Query to aid with implementing NtQueryEvent. Removed vector from WaitMultiple, instead use a fixed array of 64 handles that we populate. WaitForMultipleObjects cannot handle more than 64 objects. Remove XE_MSVC_OPTIMIZE_SMALL() use in x64_sequences, x64 backend is now always size optimized because of premake Make global_critical_region_ static constexpr in shared_memory.h to get rid of wasteage of 8 bytes (empty class=1byte, +alignment for next member=8) Move trace-related data to the tail of SharedMemory to keep more important data together In IssueDraw build an array of fetch constant addresses/sizes, then pre-lock the global lock before doing requestrange for each instead of individually locking within requestrange for each of them Consistent access specifier protected for pm4_command_processor_declare Devirtualize WriteOneRegisterFromRing. 
Move ExecutePacket and ExecutePrimaryBuffer to pm4_command_buffer_x Remove many redundant header inclusions access xenia-gpu Minor microoptimization of ExecutePacketType0 Add TextureCache::RequestTextures for batch invocation of LoadTexturesData Add TextureCache::LoadTexturesData for reducing the number of times we release and reacquire the global lock. Ideally you should hold the global lock for as little time as possible, but if you are constantly acquiring and releasing it you are actually more likely to have contention Add already_locked param to ObjectTable::LookupObject to help with reducing lock acquire/release pairs Add missing checks to XAudioRegisterRenderDriverClient_entry. this is unlikely to fix anything, it was just an easy thing to do Add NtQueryEvent system call implementation. I don't actually know of any games that need it. Instead of using std::vector + push_back in KeWaitForMultipleObjects and xeNtWaitForMultipleObjectsEx use a fixed size array of 64 and track the count. More than 64 objects is not permitted by the kernel. The repeated reallocations from push_back were appearing unusually high on the profiler, but were masked until now by waitformultipleobjects natural overhead Pre-lock the global lock before looking up each handle for xeNtWaitForMultipleObjectsEx and KeWaitForMultipleObjects. Pre-lock before looking up the signal and waiter in NtSignalAndWaitForSingleObjectEx add missing checks to NtWaitForMultipleObjectsEx Support pre-locking in XObject::GetNativeObject	2022-10-08 09:55:17 -07:00
chrisps	50fce8bdb3	Merge pull request #80 from chrisps/canary_experimental Reduce size of final exe by about 1.6mb	2022-10-05 04:15:53 -07:00
chss95cs@gmail.com	bae63b95c5	Update to latest version of cxxopts	2022-09-30 06:51:25 -07:00
chss95cs@gmail.com	b4c175d8a3	Enable SDL_LEAN_AND_MEAN, SDL_RENDER_DISABLED, saves about 500kb in final exe Build several projects that arent performance critical with /Os and /O1 under msvc windows	2022-09-29 07:26:38 -07:00
chss95cs@gmail.com	7e58a3b320	Fix compiler errors i introduced under clang-cl remove xe_kernel_export_shim_fn field of Export function_data, trampoline is now the only way exports get invoked Remove kernelstate argument from string functions in order to conform to the trampoline signature (the argument was unused anyway) Constant-evaluated initialization of ppc_opcode_disasm_table, removal of unused std::vector fields Constant-evaluated initialization of export tables name field on export is just a const char* now, only immutable static strings are ever passed to it Remove unused callcount field of export. PM4 compare op function extracted Globally apply /Oy, /GS-, /Gw on msvc windows Remove imgui testwindow code call, it took up like 300 kb	2022-09-29 07:04:17 -07:00
Gliniak	203267b106	Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental	2022-09-23 12:23:53 +02:00
Joel Linn	9ab4db285c	[Premake] Update premake-cmake - Handle compiler flags per-file. Removes ffmpeg warnings - Switch to JoelLinn fork since original author stopped maintaining and other forks don't seem to care about PRs	2022-09-22 06:36:43 -05:00
Rick Gibbed	3bfa3b05e1	Lint fix.	2022-09-22 06:34:21 -05:00
Gliniak	7d970967c4	Merge branch 'master' of https://github.com/xenia-project/xenia into canary_experimental	2022-09-20 21:15:12 +02:00
chrisps	def00e6ddb	Merge pull request #76 from beeanyew/input-system-mutex-revert [Input System] xe_mutex revert	2022-09-18 07:46:02 -07:00
beeanyew	cd17f1846f	[Input System] xe_mutex revert On request from chrisps, revert xe_mutex back to xe_unlikely_mutex to avoid mutex deadlocks while initializing hid-winkey.	2022-09-18 15:18:29 +02:00
chrisps	a29a7436e0	Merge pull request #75 from chrisps/canary_experimental misc stuff again	2022-09-17 06:43:50 -07:00
chrisps	d0acd68369	Merge branch 'xenia-canary:canary_experimental' into canary_experimental	2022-09-17 07:05:24 -04:00
chss95cs@gmail.com	eb8154908c	atomic cas use prefetchw if available remove useless memorybarrier remove double membarrier in wait pm4 cmd add int64 cvar use int64 cvar for x64 feature mask Rework some functions that were frontend bound according to vtune placing some of their code in different noinline functions, profiling after indicating l1 cache misses decreased and perf of func increased remove long vpinsrd dep chain code for conversion.h, instead do normal load+bswap or movbe if avail Much faster entry table via split_map, code size could be improved though GetResolveInfo was very large and had impact on icache, mark callees as noinline + msvc pragma optimize small use log2 shifts instead of integer divides in memory minor optimizations in PhysicalHeap::EnableAccessCallbacks, the majority of time in the function is spent looping, NOT calling Protect! Someone should optimize this function and rework the algo completely remove wonky scheduling log message, it was spammy and unhelpful lock count was unnecessary for criticalsection mutex, criticalsection is already a recursive mutex brief notes i gotta run	2022-09-17 04:04:53 -07:00
Wunkolo	addd8c94e5	[x64] Add AVX512 optimization for `OPCODE_VECTOR_ADD`(saturated) Uses a single `vpternlogd` to test for signed/unsigned overflow/underflow. Then utilizes AVX512 mask operations to create either `0x7FFFFFFF` or `0x80000000` arithmetically.	2022-09-14 11:39:03 -05:00
Wunkolo	9fd684594b	[x64] Add AVX512 optimization for `OPCODE_VECTOR_CONVERT_F2I`(unsigned) `vcvttps2udq` already saturates overflowing and unordered values to `0xFFFFFFFF`. Using mask registers, zeroes are written to negative values within the same instruction.	2022-09-12 13:52:57 -05:00
chrisps	b4224ff3dc	Merge pull request #74 from chrisps/canary_experimental Misc optimizations	2022-09-11 18:02:00 -04:00
chss95cs@gmail.com	0fd4a2533b	Prevent clang-format from moving d3d12_nvapi above the require d3d12 headers	2022-09-11 14:35:33 -07:00
chss95cs@gmail.com	20638c2e61	use Sleep(0) instead of SwitchToThread, should waste less power and help the os with scheduling. PM4 buffer handling made a virtual member of commandprocessor, place the implementation/declaration into reusable macro files. this is probably the biggest boost here. Optimized SET_CONSTANT/ LOAD_CONSTANT pm4 ops based on the register range they start writing at, this was also a nice boost Expose X64 extension flags to code outside of x64 backend, so we can detect and use things like avx512, xop, avx2, etc in normal code Add freelists for HIR structures to try to reduce the number of last level cache misses during optimization (currently disabled... fixme later) Analyzed PGO feedback and reordered branches, uninlined functions, moved code out into different functions based on info from it in the PM4 functions, this gave like a 2% boost at best. Added support for the db16cyc opcode, which is used often in xb360 spinlocks. before it was just being translated to nop, now on x64 we translate it to _mm_pause but may change that in the future to reduce cpu time wasted texture util - all our divisors were powers of 2, instead we look up a shift. this made texture scaling slightly faster, more so on intel processors which seem to be worse at int divs. GetGuestTextureLayout is now a little faster, although it is still one of the heaviest functions in the emulator when scaling is on. xe_unlikely_mutex was not a good choice for the guest clock lock, (running theory) on intel processors another thread may take a significant time to update the clock? maybe because of the uint64 division? really not sure, but switched it to xe_mutex. This fixed audio stutter that i had introduced to 1 or 2 games, fixed performance on that n64 rare game with the monkeys. Took another crack at DMA implementation, another failure. 
Instead of passing as a parameter, keep the ringbuffer reader as the first member of commandprocessor so it can be accessed through this Added macro for noalias Applied noalias to Memory::LookupHeap. This reduced the size of the executable by 7 kb. Reworked kernel shim template, this shaved like 100kb off the exe and eliminated the indirect calls from the shim to the actual implementation. We still unconditionally generate string representations of kernel calls though :(, unless it is kHighFrequency Add nvapi extensions support, currently unused. Will use CPUVISIBLE memory at some point Inserted prefetches in a few places based on feedback from vtune. Add native implementation of SHA int8 if all elements are the same Vectorized comparisons for SetViewport, SetScissorRect Vectorized ranged comparisons for WriteRegister Add XE_MSVC_ASSUME Move FormatInfo::name out of the structure, instead look up the name in a different table. Debug related data and critical runtime data are best kept apart Templated UpdateSystemConstantValues based on ROV/RTV and primitive_polygonal Add ArchFloatMask functions, these are for storing the results of floating point comparisons without doing costly float->int pipeline transfers (vucomiss/setb) Use floatmasks in UpdateSystemConstantValues for checking if dirty, only transfer to int at end of function. Instead of dirty \|= (x == y) in UpdateSystemConstantValues, now we do dirty_u32 \|= (x^y). if any of them are not equal, dirty_u32 will be nz, else if theyre all equal it will be zero. 
This is more friendly to register renaming and the lack of dependencies on EFLAGS lets the compiler reorder better Add PrefetchSamplerParameters to D3D12TextureCache use PrefetchSamplerParameters in UpdateBindings to eliminate cache misses that vtune detected Add PrefetchTextureBinding to D3D12TextureCache Prefetch texture bindings to get rid of more misses vtune detected (more accesses out of order with random strides) Rewrote DMAC, still terrible though and have disabled it for now. Replace tiny memcmp of 6 U64 in render_target_cache with inline loop, msvc fails to make it a loop and instead does a thunk to their memcmp function, which is optimized for larger sizes PrefetchTextureBinding in AreActiveTextureSRVKeysUpToDate Replace memcmp calls for pipelinedescription with handwritten cmp Directly write some registers that dont have special handling in PM4 functions Changed EstimateMaxY to try to eliminate mispredictions that vtune was reporting, msvc ended up turning the changed code into a series of blends in ExecutePacketType3_EVENT_WRITE_EXT, instead of writing extents to an array on the stack and then doing xe_copy_and_swap_16 of the data to its dest, pre-swap each constant and then store those. msvc manages to unroll that into wider stores stop logging XE_SWAP every time we receive XE_SWAP, stop logging the start and end of each viz query Prefetch watch nodes in FireWatches based on feedback from vtune Removed dead code from texture_info.cc NOINLINE on GpuSwap, PGO builds did it so we should too.	2022-09-11 14:14:48 -07:00
Wunkolo	90fffe1de7	[PPC] Fix memory assert formatting This was still using printf-style format specifiers. Causing memory asserts to show up like this while testing. ``` !> 0000438C Memory 10001040 assert failed: !> 0000438C Expected: %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X !> 0000438C Actual: %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X %02X !> 0000438C TEST FAILED ``` Updated them so they format correctly: ``` !> 00002CCC Memory 10001040 assert failed: !> 00002CCC Expected: FC FD FE FF 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F !> 00002CCC Actual: FC FD FE FF 00 00 00 00 00 00 00 00 00 00 00 00 !> 00002CCC TEST FAILED ```	2022-09-05 13:47:48 -05:00
Wunkolo	b0cc3db4d8	[x64] Add AVX512 optimization for `NOT_V128`	2022-09-05 13:47:30 -05:00
chrisps	9a6dd4cd6f	Merge branch 'xenia-canary:canary_experimental' into canary_experimental	2022-09-05 09:08:46 -04:00
chss95cs@gmail.com	0c576877c8	Add constant folding for LVR when 16 aligned, clean up prior commit by removing dead test code for LVR/LVL/STVL/STVR opcodes and legacy hir sequence Delay using mm_pause in KeAcquireSpinLockAtRaisedIrql_entry, a huge amount of time is spent spinning in halo3	2022-09-04 22:42:51 -05:00
chss95cs@gmail.com	d372d8d5e3	nasty commit with a bunch of test code left in, will clean up and pr Remove the logger_ != nullptr check from shouldlog, it will nearly always be true except on initialization and gets checked later anyway, this shrinks the size of the generated code for some Select specialized vastcpy for current cpu, for now only have paths for MOVDIR64B and generic avx1 Add XE_UNLIKELY/LIKELY if, they map better to the c++ unlikely/likely attributes which we will need to use soon Finished reimplementing STVL/STVR/LVL/LVR as their own opcodes. we now generate far less code for these instructions. this also means optimization passes can be written to simplify/remove/replace these instructions in some cases. Found that a good deal of the X86 we were emitting for these instructions was dead code or redundant. the reduction in generated HIR/x86 should help a lot with compilation times and make function precompilation more feasible as a default Don't static assert in default prefetch impl, in c++20 the assertion will be triggered even without an instantiation Reorder some if/else to prod msvc into ordering the branches optimally. it somewhat worked... Added some notes about which opcodes should be removed/refactored Dispatch in WriteRegister via vector compares for the bounds. still not very optimal, we ought to be checking whether any register in a range may be special A lot of work on trying to optimize writeregister, moved wraparound path into a noinline function based on profiling info Hoist the IsUcodeAnalyzed check out of AnalyzeShader, instead check it before each call. 
Profiler recorded many hits in the stack frame setup of the function, but none in the actual body of it, so the check is often true but the stack frame setup is run unconditionally Pre-check whether we're about to write a single register from a ring Replace more jump tables from draw_util/texture_info with popcnt based sparse indexing/bit tables/shuffle lookups Place the GPU register file on its own VAD/virtual allocation, it is no longer a member of graphics system	2022-09-04 22:42:51 -05:00
illusion0001	f62ac9868a	Make portable default for new install	2022-09-04 22:42:40 -05:00
chrisps	5476d5e422	Merge branch 'xenia-canary:canary_experimental' into canary_experimental	2022-09-04 14:45:03 -04:00
chss95cs@gmail.com	2e5c4937fd	Add constant folding for LVR when 16 aligned, clean up prior commit by removing dead test code for LVR/LVL/STVL/STVR opcodes and legacy hir sequence Delay using mm_pause in KeAcquireSpinLockAtRaisedIrql_entry, a huge amount of time is spent spinning in halo3	2022-09-04 11:44:29 -07:00
chss95cs@gmail.com	c6010bd4b1	nasty commit with a bunch of test code left in, will clean up and pr Remove the logger_ != nullptr check from shouldlog, it will nearly always be true except on initialization and gets checked later anyway, this shrinks the size of the generated code for some Select specialized vastcpy for current cpu, for now only have paths for MOVDIR64B and generic avx1 Add XE_UNLIKELY/LIKELY if, they map better to the c++ unlikely/likely attributes which we will need to use soon Finished reimplementing STVL/STVR/LVL/LVR as their own opcodes. we now generate far less code for these instructions. this also means optimization passes can be written to simplify/remove/replace these instructions in some cases. Found that a good deal of the X86 we were emitting for these instructions was dead code or redundant. the reduction in generated HIR/x86 should help a lot with compilation times and make function precompilation more feasible as a default Don't static assert in default prefetch impl, in c++20 the assertion will be triggered even without an instantiation Reorder some if/else to prod msvc into ordering the branches optimally. it somewhat worked... Added some notes about which opcodes should be removed/refactored Dispatch in WriteRegister via vector compares for the bounds. still not very optimal, we ought to be checking whether any register in a range may be special A lot of work on trying to optimize writeregister, moved wraparound path into a noinline function based on profiling info Hoist the IsUcodeAnalyzed check out of AnalyzeShader, instead check it before each call. 
Profiler recorded many hits in the stack frame setup of the function, but none in the actual body of it, so the check is often true but the stack frame setup is run unconditionally Pre-check whether we're about to write a single register from a ring Replace more jump tables from draw_util/texture_info with popcnt based sparse indexing/bit tables/shuffle lookups Place the GPU register file on its own VAD/virtual allocation, it is no longer a member of graphics system	2022-09-04 11:04:41 -07:00
Radosław Gliński	c1d3e35eb9	Merge pull request #66 from chrisps/canary_experimental Huge boost to readback_memexport/resolve performance by fixing old bug; miscellaneous optimizations	2022-08-29 00:55:24 +02:00
chss95cs@gmail.com	78c9a48bc2	also use vastcpy for shared memory page stuff	2022-08-28 14:52:12 -07:00
chss95cs@gmail.com	f31869092c	Fixed a bug with readback_resolve and readback_memexport that was responsible for a large portion of their overhead. readback_memexport and resolve are now usable for games, depending on your hardware. in my case games that were slideshows now run at like 20-30 fps, and my hardware isnt the best for xenia. add split_map class for mapping keys to values in a way that optimizes for frequent searches and infrequent insertions/removals remove jump table implementation of GetColorRenderTargetFormatComponentCount, it was appearing relatively high in profiles. instead pack the component counts into a single 32 bit word, which is indexed by shifting Add cvar to align all basic blocks to a boundary Add mmio aware load paths liberally apply XE_RESTRICT in ringbuffer related code Removed the IS_TRUE and IS_FALSE opcodes, they were pointless duplicates of COMPARE_EQ/COMPARE_NE and i want to simplify our set of opcodes for future backends More work on LVSR/LVSL/STVR/STVL opcodes Optimized X64 translated code emission, now only compute instrkey once Add code for pre-computing integer division magic numbers Optimized GetHostViewportInfo a little Move args for GetHostViewportInfo into a class, cache the result and compare for future queries. moved GetHostViewportInfo far lower on the profile Add (currently not functional, and very racy) asynchronous memcpy code. will improve it and actually use it in future commits. Add non-temporal memcpy function for huge page-aligned allocations. Used for copying to shared memory/readback hoist are_accumulated_render_targets_valid_ check out of loop in render_target_cache already bound check. Add stosb/movsb code for small constant memcpys/memsets that arent worth the overhead of memcpy/memset	2022-08-28 14:24:25 -07:00
Radosław Gliński	335a390d43	Merge pull request #64 from beeanyew/cpu-updates-raiden-fighters Some minor CPU updates	2022-08-28 20:52:42 +02:00
beeanyew	3569e97e0e	[CPU] Add rldicx implementation NOTE: May or may not be correct, but works for 535507D4.	2022-08-28 20:02:39 +02:00
beeanyew	75ed343e72	[CPU] Add stub OE handling implementation for addex and negx	2022-08-28 20:01:26 +02:00
illusion0001	04c9c02270	Guest crash message more useful	2022-08-24 09:42:56 -05:00
Radosław Gliński	9006b309af	Merge pull request #62 from chrisps/canary_experimental Minor correctness/constant folding fixes, guest code optimizations for pre-ryzen amd processors	2022-08-23 00:01:24 +02:00
chss95cs@gmail.com	1ffd7ecae8	Remove vpcmov print	2022-08-21 12:40:56 -07:00
chss95cs@gmail.com	b5ef3453c7	Disable most XOP code by default, the manual must be wrong for the shifts or we must be assembling them incorrectly, will return to it later and fix comparisons and select done by xop are fine though	2022-08-21 12:32:33 -07:00
chss95cs@gmail.com	b26c6ee1b8	Fix some more constant folding fabsx does NOT set fpscr turns out that our vector unsigned compare instructions are a bit wierd?	2022-08-21 10:27:54 -07:00
chss95cs@gmail.com	0ebc109d4d	add initial xop codepaths, still need to finish the rest of the compares, and then do shifts, rotates, and PERMUTE Add vector simplification pass, so far it only recognizes whether VECTOR_DENORMFLUSH is useless and optimizes them away Tag restgplr/savegplr/restvmx/savevmx/restfpr/savefpr with useful information, i intend to inline them (they tend to be the most heavily called guest functions)	2022-08-21 08:55:42 -07:00
Gliniak	da00ede181	[XAM/Settings] Check if provided size doesn't exceed maximal setting size	2022-08-21 17:46:00 +02:00
Radosław Gliński	0b013fdc6b	Merge pull request #61 from chrisps/canary_experimental performance improvements, kernel fixes, cpu accuracy improvements	2022-08-21 09:31:09 +02:00
chss95cs@gmail.com	d85bfc1894	Dont constant evaluate MAX with V128! Fix signed zeroes behavior for vmaxfp emulation, was causing a block in sonic to move perpetually, very slowly	2022-08-20 14:22:05 -07:00
Gliniak	010b59e81c	[Emulator] Install Content: Create header for installed packages This fixes support for certain DLCs	2022-08-20 20:44:30 +02:00
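
Two of the micro-optimizations described in the messages above lend themselves to a short illustration: clearing the lowest set bit during bit iteration (commit 8f7f7dc6ad, "add xe::clear_lowest_bit, use it in place of shift-andnot") and accumulating XOR differences instead of boolean comparisons for a dirty check (the second half of commit 20638c2e61, "now we do dirty_u32 \|= (x^y)"). The following is a minimal sketch only. The function names, signatures, and the `uint32_t` element type are assumptions made for illustration and are not taken from the xenia-canary sources.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for xe::clear_lowest_bit; the real signature is not
// shown in the log. v & (v - 1) clears the lowest set bit of v.
constexpr uint32_t clear_lowest_bit(uint32_t v) { return v & (v - 1); }

// Portable index of the lowest set bit. v must be nonzero. A real build
// would more likely use a count-trailing-zeros intrinsic here.
constexpr uint32_t lowest_bit_index(uint32_t v) {
  uint32_t index = 0;
  while ((v & 1u) == 0) {
    v >>= 1;
    ++index;
  }
  return index;
}

// Calls fn(index) for every set bit in mask, from lowest to highest.
// Each iteration removes exactly one bit, so the loop runs once per set bit.
template <typename Fn>
void for_each_set_bit(uint32_t mask, Fn&& fn) {
  while (mask != 0) {
    fn(lowest_bit_index(mask));
    mask = clear_lowest_bit(mask);
  }
}

// Copies fresh into cached and reports whether anything changed.
// Differences are ORed together as XOR results, so the loop body has no
// comparison or branch; a single test against zero happens at the end.
bool update_if_changed(uint32_t* cached, const uint32_t* fresh,
                       std::size_t count) {
  uint32_t diff = 0;
  for (std::size_t i = 0; i < count; ++i) {
    diff |= cached[i] ^ fresh[i];  // nonzero iff the two values differ
    cached[i] = fresh[i];
  }
  return diff != 0;
}
```

With a mask of `0x29`, `for_each_set_bit` visits bit indices 0, 3, and 5. `update_if_changed` returns `false` when both arrays already hold the same values and `true` as soon as any element differs.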

7310 Commits

All Branches