Commit Graph

78 Commits

Author SHA1 Message Date
disjtqz 275454089e [Kernel] Implement ObCreateObject 2023-10-21 17:33:07 +02:00
disjtqz 6a08208dc8 Proper misalignment for AllocatePool, add guest object table 2023-10-14 19:29:25 +02:00
disjtqz b5ddd30572 moved xsemaphore to xthread.d
add typed guest pointer template
add X_KSPINLOCK, rework spinlock functions.
rework irql related code, use irql on pcr instead of on XThread
add guest linked list helper functions
renamed ProcessInfoBlock to X_KPROCESS
assigned names to many kernel structure fields
2023-10-11 17:43:59 +02:00
chss95cs@gmail.com c1d922eebf Minor decoder optimizations, kernel fixes, cpu backend fixes 2022-11-05 10:50:33 -07:00
chss95cs@gmail.com eb8154908c atomic cas use prefetchw if available
remove useless memorybarrier
remove double membarrier in wait pm4 cmd
add int64 cvar
use int64 cvar for x64 feature mask
Rework some functions that were frontend bound according to vtune by placing some of their code in different noinline functions; profiling afterwards indicated l1 cache misses decreased and perf of the functions increased
remove long vpinsrd dep chain code for conversion.h, instead do normal load+bswap or movbe if avail
Much faster entry table via split_map, code size could be improved though
GetResolveInfo was very large and had impact on icache, mark callees as noinline + msvc pragma optimize small
use log2 shifts instead of integer divides in memory
minor optimizations in PhysicalHeap::EnableAccessCallbacks, the majority of time in the function is spent looping, NOT calling Protect! Someone should optimize this function and rework the algo completely
remove wonky scheduling log message, it was spammy and unhelpful
lock count was unnecessary for criticalsection mutex, criticalsection is already a recursive mutex
brief notes i gotta run
2022-09-17 04:04:53 -07:00
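The "log2 shifts instead of integer divides" item in the commit above can be illustrated with a minimal sketch. It assumes the divisor (for example a page size) is always a power of two; `Log2PowerOfTwo` and `PagedRange` are hypothetical names made up for this example and are not taken from the emulator's source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns log2(value) for a power-of-two value.
constexpr uint32_t Log2PowerOfTwo(uint32_t value) {
  uint32_t shift = 0;
  while ((value >> shift) > 1) {
    ++shift;
  }
  return shift;
}

// Hypothetical heap-like structure that stores the shift next to the size,
// so hot paths never need an integer divide.
struct PagedRange {
  uint32_t page_size;
  uint32_t page_size_shift;

  explicit constexpr PagedRange(uint32_t size)
      : page_size(size), page_size_shift(Log2PowerOfTwo(size)) {}

  // Same result as address / page_size when page_size is a power of two.
  constexpr uint32_t PageIndex(uint32_t address) const {
    return address >> page_size_shift;
  }

  // Same result as address % page_size when page_size is a power of two.
  constexpr uint32_t PageOffset(uint32_t address) const {
    return address & (page_size - 1);
  }
};
```

The shift is computed once when the range is created, so each lookup costs a shift and a mask rather than a divide.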
chss95cs@gmail.com 20638c2e61 use Sleep(0) instead of SwitchToThread, should waste less power and help the os with scheduling.
PM4 buffer handling made a virtual member of commandprocessor, place the implementation/declaration into reusable macro files. this is probably the biggest boost here.
Optimized SET_CONSTANT/ LOAD_CONSTANT pm4 ops based on the register range they start writing at, this was also a nice boost

Expose X64 extension flags to code outside of x64 backend, so we can detect and use things like avx512, xop, avx2, etc in normal code
Add freelists for HIR structures to try to reduce the number of last level cache misses during optimization (currently disabled... fixme later)

Analyzed PGO feedback and reordered branches, uninlined functions, moved code out into different functions based on info from it in the PM4 functions, this gave like a 2% boost at best.

Added support for the db16cyc opcode, which is used often in xb360 spinlocks. before it was just being translated to nop, now on x64 we translate it to _mm_pause but may change that in the future to reduce cpu time wasted

texture util - all our divisors were powers of 2, instead we look up a shift. this made texture scaling slightly faster, more so on intel processors which seem to be worse at int divs. GetGuestTextureLayout is now a little faster, although it is still one of the heaviest functions in the emulator when scaling is on.

xe_unlikely_mutex was not a good choice for the guest clock lock, (running theory) on intel processors another thread may take a significant time to update the clock? maybe because of the uint64 division? really not sure, but switched it to xe_mutex. This fixed audio stutter that i had introduced to 1 or 2 games, fixed performance on that n64 rare game with the monkeys.
Took another crack at DMA implementation, another failure.
Instead of passing as a parameter, keep the ringbuffer reader as the first member of commandprocessor so it can be accessed through this
Added macro for noalias
Applied noalias to Memory::LookupHeap. This reduced the size of the executable by 7 kb.
Reworked kernel shim template, this shaved like 100kb off the exe and eliminated the indirect calls from the shim to the actual implementation. We still unconditionally generate string representations of kernel calls though :(, unless it is kHighFrequency

Add nvapi extensions support, currently unused. Will use CPUVISIBLE memory at some point
Inserted prefetches in a few places based on feedback from vtune.
Add native implementation of SHA int8 if all elements are the same

Vectorized comparisons for SetViewport, SetScissorRect
Vectorized ranged comparisons for WriteRegister
Add XE_MSVC_ASSUME
Move FormatInfo::name out of the structure, instead look up the name in a different table. Debug related data and critical runtime data are best kept apart
Templated UpdateSystemConstantValues based on ROV/RTV and primitive_polygonal
Add ArchFloatMask functions, these are for storing the results of floating point comparisons without doing costly float->int pipeline transfers (vucomiss/setb)
Use floatmasks in UpdateSystemConstantValues for checking if dirty, only transfer to int at end of function.
Instead of dirty |= (x != y) in UpdateSystemConstantValues, now we do dirty_u32 |= (x^y). If any of them are not equal, dirty_u32 will be nonzero; if they're all equal it will be zero. This is more friendly to register renaming and the lack of dependencies on EFLAGS lets the compiler reorder better
Add PrefetchSamplerParameters to D3D12TextureCache
use PrefetchSamplerParameters in UpdateBindings to eliminate cache misses that vtune detected

Add PrefetchTextureBinding to D3D12TextureCache
Prefetch texture bindings to get rid of more misses vtune detected (more accesses out of order with random strides)
Rewrote DMAC, still terrible though and have disabled it for now.
Replace tiny memcmp of 6 U64 in render_target_cache with inline loop, msvc fails to make it a loop and instead does a thunk to their memcmp function, which is optimized for larger sizes

PrefetchTextureBinding in AreActiveTextureSRVKeysUpToDate
Replace memcmp calls for pipelinedescription with handwritten cmp
Directly write some registers that dont have special handling in PM4 functions
Changed EstimateMaxY to try to eliminate mispredictions that vtune was reporting, msvc ended up turning the changed code into a series of blends

in ExecutePacketType3_EVENT_WRITE_EXT, instead of writing extents to an array on the stack and then doing xe_copy_and_swap_16 of the data to its dest, pre-swap each constant and then store those. msvc manages to unroll that into wider stores
stop logging XE_SWAP every time we receive XE_SWAP, stop logging the start and end of each viz query

Prefetch watch nodes in FireWatches based on feedback from vtune
Removed dead code from texture_info.cc
NOINLINE on GpuSwap, PGO builds did it so we should too.
2022-09-11 14:14:48 -07:00
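The XOR-based dirty check described in the commit above can be sketched as follows. This is only an illustration under assumptions: the `Constants` struct and both function names are hypothetical, and the real UpdateSystemConstantValues tracks many more fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants block; the real structure is much larger.
struct Constants {
  uint32_t flags;
  uint32_t width;
  uint32_t height;
};

// Comparison style: every field comparison produces a flag-dependent bool.
inline bool UpdateWithCompares(Constants& current, const Constants& next) {
  bool dirty = false;
  dirty |= current.flags != next.flags;
  dirty |= current.width != next.width;
  dirty |= current.height != next.height;
  current = next;
  return dirty;
}

// XOR style: accumulate the bit differences and test once at the end.
// x ^ y is zero only when x == y, so the OR of all XORs is nonzero
// exactly when at least one field changed.
inline bool UpdateWithXor(Constants& current, const Constants& next) {
  uint32_t dirty_u32 = 0;
  dirty_u32 |= current.flags ^ next.flags;
  dirty_u32 |= current.width ^ next.width;
  dirty_u32 |= current.height ^ next.height;
  current = next;
  return dirty_u32 != 0;
}
```

Both functions return the same answer; the second form only needs a single test of the accumulated value at the end.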
chss95cs@gmail.com 08f7a28920 Alternative mutex 2022-08-14 08:59:11 -07:00
chss95cs@gmail.com cb85fe401c Huge set of performance improvements. Combined with an architecture-specific build and clang-cl, users have reported absurd gains over master for some games, in the range of 50%-90%
But for normal msvc builds I would put it at around 30-50%
Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up
Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger
fixed a number of errors where temporaries were being passed by reference/pointer
Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes.
Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold.
Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me.
Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it
Disable the trace writer in release builds. Despite just returning after checking if the file was open, the trace functions were consuming about 0.60% cpu time total
Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time
Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op
Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x
For the most frequently called win32 functions I added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about or that isn't relevant to our use cases
^this can be toggled off in the platform_win header
handle indirect call true with constant function pointer, was occurring in h3
lookup host format swizzle in denser array
by default, don't check if a gpu register is unknown, instead just check if it's out of range. controlled by a cvar
^looking up whether it's known or not took approx 0.3% cpu time
Changed some things in /cpu to make the project UNITYBUILD friendly
The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead
tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds)
Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed
added support for docdecaduple precision floating point so that we can represent our performance gains numerically
tons of other stuff I'm probably forgetting
2022-08-13 12:59:00 -07:00
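The commit above mentions macros wrapping compiler extensions (noinline, expect, cold) and tagging conditions as likely or unlikely. A minimal sketch of that kind of wrapper is shown below. The `EX_` macro names and both functions are hypothetical and chosen for this example; they are not the project's actual definitions. On plain MSVC the branch hints expand to the bare condition, which fits the note that the tags only affect clang builds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrappers; clang-cl defines __clang__, so it takes the first
// branch and gets real branch hints.
#if defined(__clang__) || defined(__GNUC__)
#define EX_LIKELY(x) __builtin_expect(!!(x), 1)
#define EX_UNLIKELY(x) __builtin_expect(!!(x), 0)
#define EX_NOINLINE __attribute__((noinline))
#define EX_COLD __attribute__((cold))
#elif defined(_MSC_VER)
#define EX_LIKELY(x) (!!(x))
#define EX_UNLIKELY(x) (!!(x))
#define EX_NOINLINE __declspec(noinline)
#define EX_COLD
#else
#define EX_LIKELY(x) (!!(x))
#define EX_UNLIKELY(x) (!!(x))
#define EX_NOINLINE
#define EX_COLD
#endif

// Rarely executed path, kept out of line so it does not bloat the hot path.
EX_NOINLINE EX_COLD uint32_t HandleOutOfRange(uint32_t /*index*/) {
  return 0;
}

// Hot path: only a range check, with the failure case marked unlikely.
inline uint32_t ReadRegister(const uint32_t* regs, uint32_t count,
                             uint32_t index) {
  if (EX_UNLIKELY(index >= count)) {
    return HandleOutOfRange(index);
  }
  return regs[index];
}
```

The hints change code layout only; the result is the same with or without them.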
Gliniak 3d96dfa359 Always allocate system heap from top of heap 2022-05-25 07:53:50 +02:00
Gliniak b237b71031 Merge remote-tracking branch 'GliniakRepo/memory_stats' into canary_pr 2022-05-19 10:03:29 +02:00
Triang3l fdec0ab332 [Code] Make union usage more consistent 2021-11-03 20:45:09 +03:00
Gliniak 35321a10c3 [Kernel] Improvements to MmQueryStatistics
- Fixed incorrect calculation of available pages
- Changed amount of total virtual bytes
- Added real amount of reserved virtual bytes
- Removed unused methods
2021-07-15 09:45:35 +02:00
Gliniak a6868d1f8a [Memory] Removed redundant BaseHeap::IsGuestPhysicalHeap 2020-11-22 15:43:53 -06:00
Gliniak c071500ff4 [Base] Specify heap type on initialization 2020-11-22 15:43:53 -06:00
Triang3l 86ae42919d [Memory] Close shared memory FD and properly handle its invalid value 2020-11-22 14:17:37 +03:00
gibbed 5bf0b34445 C++17ification.
C++17ification!

- Filesystem interaction now uses std::filesystem::path.
- Usage of const char*, std::string have been changed to
  std::string_view where appropriate.
- Usage of printf-style functions changed to use fmt.
2020-04-07 16:09:41 -05:00
Triang3l c156616103 [Memory] Invalidate physical memory in Release/Decommit (#1559) 2020-02-24 01:04:30 +03:00
Triang3l f858631245 [Kernel] Trigger memory callbacks after file read 2020-02-22 18:06:56 +03:00
Triang3l 8ec813de82 [Memory, D3D12] Various refactoring from data provider development 2020-02-15 21:35:24 +03:00
Triang3l 8ba6f3fc37 [Memory] Trigger watches when making pages writable, not the other way around 2019-11-10 14:21:36 +03:00
Triang3l 7e6bf8022f [Memory] Refactor GetPhysicalAddress and use it for XMA, resolve #1448 2019-08-24 17:42:06 +03:00
Triang3l e35c609224 Revert "[APU] Temp XMA context allocation region workaround"
This reverts commit 968c337d22.
2019-08-16 21:11:55 +03:00
Triang3l 968c337d22 [APU] Temp XMA context allocation region workaround 2019-08-16 09:47:28 +03:00
Triang3l e862169156 [Memory] BaseHeap::TranslateRelative including host address offset 2019-08-15 00:31:21 +03:00
Triang3l 2152c79965 [Memory] 0xE… adjustment in TranslateVirtual 2019-08-14 00:07:27 +03:00
Triang3l 741b5ae2ec [Memory] Add HostToGuestVirtual and use it in a couple of places 2019-08-13 23:49:49 +03:00
Triang3l cb0e18c7dc [Memory] BaseHeap::host_address_offset 2019-08-04 23:55:54 +03:00
Triang3l 25675cb8b8 [Memory] E0000000 adjustment in watches only for Windows 2019-08-04 23:10:59 +03:00
Triang3l d20c2fa9da [Memory/Vulkan] Move old memory watches to the Vulkan backend 2019-08-03 21:06:59 +03:00
Triang3l 0370f8bbd9 [Memory] Pass exact_range to watch callbacks 2019-08-03 19:16:04 +03:00
Triang3l 24383b9137 [Memory/D3D12] Unwatch up to 256 KB ranges 2019-07-31 00:18:12 +03:00
Triang3l b5fb84473d [Memory] Replace forgotten InvalidateRange in NtReadFile 2019-07-30 09:06:23 +03:00
Triang3l 4aceeb73c4 [Memory] Move new watches to heap-aware Memory from MMIOHandler 2019-07-30 08:00:20 +03:00
Triang3l 6e36101b42 [D3D12] Experimental write watch implementation for shared memory 2018-09-24 23:18:16 +03:00
Triang3l db625892ea [D3D12] Shared memory typo fix and improvements 2018-08-01 01:09:51 +03:00
Triang3l 4f7edff19d [D3D12] SHM: Watches prototype, some uploading 2018-07-26 22:52:26 +03:00
DrChat d0460122f4 [Core] BaseHeap::QueryBaseAndSize 2018-02-10 21:58:44 -06:00
DrChat e3787c05c1 [Core] QueryRegionInfo - report the original allocation size 2018-02-10 19:14:58 -06:00
DrChat 325599948a [Core] Remove hardcoded type field from HeapAllocationInfo 2018-02-10 16:47:53 -06:00
DrChat 4db94473ec [Core] Memory::GetPhysicalHeap 2018-02-10 16:45:06 -06:00
Dr. Chat aee5601c68 xboxkrnl: Initial (untested) implementation of NtProtectVirtualMemory 2017-07-24 21:41:47 -05:00
gibbed 16a15bab98 Exposed total page count. 2016-06-21 10:10:08 -05:00
gibbed 32e0ef397c Attempt at reporting something of an 'accurate' unreserved physical page
count. Still needs work.
2016-06-21 09:37:21 -05:00
Dr. Chat 0e3c113375 Physical write watches -> access watches (read and/or write watching) 2016-03-17 21:55:16 -05:00
Ben Vanik bbff23a8bb REBASE: Fixing Memory::Reset(). 2015-12-29 13:09:18 -08:00
Dr. Chat 432e32f7c2 memory Save/Restore 2015-12-29 13:09:18 -08:00
Ben Vanik ca8d658ffe Speeding up PPC tests significantly. 2015-12-27 12:03:30 -08:00
Ben Vanik 00240945fe Cleanup for the latest clang-format version. 2015-12-03 19:52:02 -08:00
Ben Vanik 249b952de9 Adding some comments. 2015-12-02 17:37:48 -08:00
Ben Vanik 3c96b6fa0a DANGER DANGER. Switching to global critical region.
This changes almost all locks held by guest threads to use a single global
critical region. This emulates the behavior on the PPC of disabling
interrupts (by calls like KeRaiseIrqlToDpcLevel or masking interrupts),
and prevents deadlocks from occurring when threads are suspended or
otherwise blocked.
This has performance implications and a pass is needed to ensure the
locking is as granular as possible. It could also break everything
because it's fundamentally unsound. We'll see.
2015-09-06 09:30:54 -07:00
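The single global critical region described in the commit above can be sketched roughly as one process-wide recursive mutex that every guest-side lock funnels into. This is a simplified illustration under assumptions: `GlobalCriticalRegion` and `GuestCounter` are hypothetical names, and the sketch does not model interrupt masking or thread suspension.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical global critical region: one recursive mutex shared by all
// guest-visible locks. Recursive so nested acquisitions on the same thread
// do not deadlock.
class GlobalCriticalRegion {
 public:
  static std::recursive_mutex& mutex() {
    static std::recursive_mutex instance;
    return instance;
  }

  // Returns an RAII lock; the region is released when the lock is destroyed.
  static std::unique_lock<std::recursive_mutex> Acquire() {
    return std::unique_lock<std::recursive_mutex>(mutex());
  }
};

// Hypothetical guest-side object that used to have its own lock and now
// relies on the global region instead.
struct GuestCounter {
  int value = 0;

  void Increment() {
    auto lock = GlobalCriticalRegion::Acquire();
    ++value;
  }

  // Nested acquisition on the same thread works because the mutex is recursive.
  void IncrementTwice() {
    auto lock = GlobalCriticalRegion::Acquire();
    Increment();
    Increment();
  }
};
```

As the commit message itself warns, funnelling everything through one lock trades granularity for simplicity, so contention can become a performance concern.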