/**
 ******************************************************************************
 * Xenia : Xbox 360 Emulator Research Project                                 *
 ******************************************************************************
 * Copyright 2020 Ben Vanik. All rights reserved.                             *
 * Released under the BSD license - see LICENSE in the root for more details. *
 ******************************************************************************
 */

#ifndef XENIA_CPU_XEX_MODULE_H_
#define XENIA_CPU_XEX_MODULE_H_

#include <string>
#include <vector>

#include "xenia/base/mapped_memory.h"
#include "xenia/cpu/module.h"
#include "xenia/kernel/util/xex2_info.h"

namespace xe {
namespace kernel {
class KernelState;
}  // namespace kernel
}  // namespace xe

namespace xe {
namespace cpu {

constexpr fourcc_t kXEX1Signature = make_fourcc("XEX1");
constexpr fourcc_t kXEX2Signature = make_fourcc("XEX2");
constexpr fourcc_t kElfSignature = make_fourcc(0x7F, 'E', 'L', 'F');
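// Usage sketch (hypothetical, not part of this header): the loader can select
// a path based on the first four bytes of the image, e.g.:
//
//   fourcc_t magic = xe::load_and_swap<fourcc_t>(data);
//   if (magic == kXEX2Signature) { /* XEX2 */ }
//   else if (magic == kXEX1Signature) { /* XEX1 */ }
//   else if (magic == kElfSignature) { /* ELF */ }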
class Runtime;

struct InfoCacheFlags {
  uint32_t was_resolved : 1;  // has this address ever been called/requested
                              // via ResolveFunction?
  uint32_t accessed_mmio : 1;
  uint32_t is_syscall_func : 1;
  uint32_t is_return_site : 1;  // address can be reached from another function
                                // by returning
  uint32_t reserved : 28;
};
static_assert(sizeof(InfoCacheFlags) == 4,
              "InfoCacheFlags size should equal the size of a PPC instruction.");

struct XexInfoCache {
  // increment this to invalidate all user infocaches
  static constexpr uint32_t CURRENT_INFOCACHE_VERSION = 4;

  struct InfoCacheFlagsHeader {
    uint32_t version;

    unsigned char reserved[252];
    InfoCacheFlags* LookupFlags(unsigned offset) {
      // The flag array is stored contiguously, immediately after this header.
      return &reinterpret_cast<InfoCacheFlags*>(&this[1])[offset];
    }
  };
  // For every 4-byte aligned address, records a 4-byte set of flags.
  std::unique_ptr<MappedMemory> executable_addr_flags_;

  void Init(class XexModule*);
  InfoCacheFlagsHeader* GetHeader() {
    if (!executable_addr_flags_) {
      return nullptr;
    }
    uint8_t* data = executable_addr_flags_->data();
    if (!data) {
      return nullptr;
    }
    return reinterpret_cast<InfoCacheFlagsHeader*>(data);
  }
  InfoCacheFlags* LookupFlags(unsigned offset) {
    // One InfoCacheFlags entry per 4-byte PPC instruction.
    offset /= 4;
    if (!executable_addr_flags_) {
      return nullptr;
    }
    uint8_t* data = executable_addr_flags_->data();
    if (!data) {
      return nullptr;
    }
    return reinterpret_cast<InfoCacheFlagsHeader*>(data)->LookupFlags(offset);
  }
};
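// Usage sketch (hypothetical caller, not upstream API documentation): flags
// are keyed by byte offset from the module's first executable address, one
// entry per 4-byte guest instruction.
//
//   XexInfoCache cache;
//   cache.Init(xex_module);  // maps the per-module cache backing file
//   if (InfoCacheFlags* flags = cache.LookupFlags(addr - low_address)) {
//     if (flags->accessed_mmio) {
//       // This instruction wrote MMIO on a previous run.
//     }
//   }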
class XexModule : public xe::cpu::Module {
 public:
  struct ImportLibraryFn {
   public:
    uint32_t ordinal;
    uint32_t value_address;
    uint32_t thunk_address;
  };
  struct ImportLibrary {
   public:
    std::string name;
    uint32_t id;
    xe_xex2_version_t version;
    xe_xex2_version_t min_version;
    std::vector<ImportLibraryFn> imports;
  };
  struct SecurityInfoContext {
    const char* rsa_signature;
    const char* aes_key;
    uint32_t image_size;
    uint32_t image_flags;
    uint32_t export_table;
    uint32_t load_address;
    uint32_t page_descriptor_count;
    const xex2_page_descriptor* page_descriptors;
  };
  enum XexFormat {
    kFormatUnknown,
    kFormatXex1,
    kFormatXex2,
  };
  XexModule(Processor* processor, kernel::KernelState* kernel_state);
  virtual ~XexModule();

  bool loaded() const { return loaded_; }
  const xex2_header* xex_header() const {
    return reinterpret_cast<const xex2_header*>(xex_header_mem_.data());
  }
  const SecurityInfoContext* xex_security_info() const {
    return &security_info_;
  }
  uint32_t image_size() const {
    assert_not_zero(base_address_);

    // Calculate the total size of the XEX image from its page descriptors.
    auto heap = memory()->LookupHeap(base_address_);
    uint32_t total_size = 0;
    for (uint32_t i = 0; i < xex_security_info()->page_descriptor_count; i++) {
      // Byteswap the bitfield manually.
      xex2_page_descriptor desc;
      desc.value =
          xe::byte_swap(xex_security_info()->page_descriptors[i].value);

      total_size += desc.page_count * heap->page_size();
    }
    return total_size;
  }
  const std::vector<ImportLibrary>* import_libraries() const {
    return &import_libs_;
  }
  const xex2_opt_execution_info* opt_execution_info() const {
    xex2_opt_execution_info* retval = nullptr;
    GetOptHeader(XEX_HEADER_EXECUTION_INFO, &retval);
    return retval;
  }
  const xex2_opt_file_format_info* opt_file_format_info() const {
    xex2_opt_file_format_info* retval = nullptr;
    GetOptHeader(XEX_HEADER_FILE_FORMAT_INFO, &retval);
    return retval;
  }
  std::vector<uint32_t> opt_alternate_title_ids() const {
    return opt_alternate_title_ids_;
  }
|
|
|
|
|
  uint32_t base_address() const { return base_address_; }
  bool is_dev_kit() const { return is_dev_kit_; }
  // Gets an optional header. Returns NULL if not found.
  // Special case: if key & 0xFF == 0x00, this function will return the value,
  // not a pointer! This assumes out_ptr points to a uint32_t.
  static bool GetOptHeader(const xex2_header* header, xex2_header_keys key,
                           void** out_ptr);
  bool GetOptHeader(xex2_header_keys key, void** out_ptr) const;

  // Templated versions of the above that hide the void** casts.
  // The same special case applies: if key & 0xFF == 0x00, the value itself is
  // written through out_ptr, so pass a uint32_t*.
  template <typename T>
  static bool GetOptHeader(const xex2_header* header, xex2_header_keys key,
                           T* out_ptr) {
    return GetOptHeader(header, key, reinterpret_cast<void**>(out_ptr));
  }

  template <typename T>
  bool GetOptHeader(xex2_header_keys key, T* out_ptr) const {
    return GetOptHeader(key, reinterpret_cast<void**>(out_ptr));
  }
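A usage sketch covering both lookup modes (hedged: XEX_HEADER_ENTRY_POINT is assumed here to be one of the low-byte-0x00 keys whose 32-bit value is stored inline in the header entry):

  // Pointer case: the out-param receives a pointer into the header data.
  xex2_opt_file_format_info* fmt = nullptr;
  if (module->GetOptHeader(XEX_HEADER_FILE_FORMAT_INFO, &fmt)) {
    // fmt points into the loaded XEX header.
  }

  // Value case (key & 0xFF == 0x00): the value itself is written through the
  // out-param, so pass a uint32_t*.
  uint32_t entry_point = 0;
  module->GetOptHeader(XEX_HEADER_ENTRY_POINT, &entry_point);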
  static const void* GetSecurityInfo(const xex2_header* header);

  const PESection* GetPESection(const char* name);

  uint32_t GetProcAddress(uint16_t ordinal) const;
  uint32_t GetProcAddress(const std::string_view name) const;

  int ApplyPatch(XexModule* module);
  bool Load(const std::string_view name, const std::string_view path,
            const void* xex_addr, size_t xex_length);
  bool LoadContinue();
  bool Unload();
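A hedged sketch of the lifecycle these declarations imply (constructor arguments inferred from the processor_/kernel_state_ members below; names and paths are illustrative):

  auto module = std::make_unique<XexModule>(processor, kernel_state);
  if (!module->Load("default.xex", "game:\\default.xex", xex_data, xex_size)) {
    return false;  // Header rejected or decryption failed.
  }
  if (!module->LoadContinue()) {
    return false;  // PE headers/imports/symbols could not be resolved.
  }
  // ... run ...
  module->Unload();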
  bool ContainsAddress(uint32_t address) override;

  const std::string& name() const override { return name_; }
  bool is_executable() const override {
    return (xex_header()->module_flags & XEX_MODULE_TITLE) != 0;
  }

  bool is_valid_executable() const {
    assert_not_zero(base_address_);
    if (!base_address_) {
      return false;
    }
    uint8_t* buffer = memory()->TranslateVirtual(base_address_);
    // A valid image starts with the DOS "MZ" magic followed by the
    // conventional 0x0090 e_cblp field: bytes 4D 5A 90 00, read here as the
    // little-endian uint32 0x905A4D.
    return *reinterpret_cast<uint32_t*>(buffer) == 0x905A4D;
  }
  bool is_patch() const {
    assert_not_null(xex_header());
    if (!xex_header()) {
      return false;
    }
    return (xex_header()->module_flags &
            (XEX_MODULE_MODULE_PATCH | XEX_MODULE_PATCH_DELTA |
             XEX_MODULE_PATCH_FULL)) != 0;
  }
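For reference, a sketch of how a caller might drive the patch path (illustrative only; the real sequencing lives in the kernel's module loader, and the call direction is an assumption from the ApplyPatch signature):

  if (patch->is_patch()) {
    // ApplyPatch is called on the patch module, targeting the base module.
    if (patch->ApplyPatch(base_module) != 0) {
      // Patch data was incompatible with this base image.
    }
  }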
  InfoCacheFlags* GetInstructionAddressFlags(uint32_t guest_addr);
"Fix" debug console, we were checking the cvar before any cvars were loaded, and the condition it checks in AttachConsole is somehow always false
Remove dead #if 0'd code in math.h
On amd64, page_size == 4096 constant, on amd64 w/ win32, allocation_granularity == 65536. These values for x86 windows havent changed over the last 20 years so this is probably safe
and gives a modest code size reduction
Enable XE_USE_KUSER_SHARED. This sources host time from KUSER_SHARED instead of from QueryPerformanceCounter, which is far faster, but only has a granularity of 100 nanoseconds.
In some games seemingly random crashes were happening that were hard to trace because
the faulting thread was actually not the one that was misbehaving, another threads stack was underflowing into the faulting thread.
Added a bunch of code to synchronize the guest stack and host stack so that if a guest longjmps the host's stack will be adjusted.
Changes were also made to allow the guest to call into a piece of an existing x64 function.
This synchronization might have a slight performance impact on lower end cpus, to disable it set enable_host_guest_stack_synchronization to false.
It is possible it may have introduced regressions, but i dont know of any yet
So far, i know the synchronization change fixes the "hub crash" in super sonic and allows the game "london 2012" to go ingame.
Removed emit_useless_fpscr_updates, not emitting these updates breaks the raiden game
MapGuestAddressToMachineCode now returns nullptr if no address was found, instead of the start of the function
add Processor::LookupModule
Add Backend::DeinitializeBackendContext
Use WriteRegisterRangeFromRing_WithKnownBound<0, 0xFFFF> in WriteRegisterRangeFromRing for inlining (previously regressed on performance of ExecutePacketType0)
add notes about flags that trap in XamInputGetCapabilities
0 == 3 in XamInputGetCapabilities
Name arg 2 of XamInputSetState
PrefetchW in critical section kernel funcs if available & doing cmpxchg
Add terminated field to X_KTHREAD, set it on termination
Expanded the logic of NtResumeThread/NtSuspendThread to include checking the type of the handle (in release, LookupObject doesnt seem to do anything with the type)
and returning X_STATUS_OBJECT_TYPE_MISMATCH if invalid. Do termination check in NtSuspendThread.
Add basic host exception messagebox, need to flesh it out more (maybe use the new stack tracking stuff if on guest thrd?)
Add rdrand patching hack, mostly affects users with nvidia cards who have many threads on zen
Use page_size_shift in more places
Once again disable precompilation! Raiden is mostly weird ppc asm which probably breaks the precompilation. The code is still useful for running the compiler over the whole of an xex in debug to test for issues
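For illustration, a minimal sketch of the KUSER_SHARED time source mentioned above (not Xenia's actual implementation; this assumes the documented Windows layout, where KUSER_SHARED_DATA is mapped at 0x7FFE0000 and SystemTime is a sequence-checked KSYSTEM_TIME at offset 0x14, in 100 ns units):

#include <cstdint>

// Sequence-checked 64-bit time as the kernel publishes it (KSYSTEM_TIME).
struct KSystemTime {
  volatile uint32_t low_part;
  volatile int32_t high1_time;
  volatile int32_t high2_time;
};

inline uint64_t ReadKUserSharedSystemTime() {
  // KUSER_SHARED_DATA lives at a fixed user-mode address; SystemTime is at
  // offset 0x14 (0x7FFE0000 + 0x14).
  const auto* t = reinterpret_cast<const KSystemTime*>(0x7FFE0014);
  for (;;) {
    int32_t high = t->high1_time;
    uint32_t low = t->low_part;
    // The kernel writes high2_time last; if it still matches high1_time,
    // the two 32-bit halves were read consistently (no torn read).
    if (t->high2_time == high) {
      return (static_cast<uint64_t>(static_cast<uint32_t>(high)) << 32) | low;
    }
  }
}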
"Fix" debug console, we were checking the cvar before any cvars were loaded, and the condition it checks in AttachConsole is somehow always false
Remove dead #if 0'd code in math.h
On amd64, page_size == 4096 constant, on amd64 w/ win32, allocation_granularity == 65536. These values for x86 windows havent changed over the last 20 years so this is probably safe
and gives a modest code size reduction
Enable XE_USE_KUSER_SHARED. This sources host time from KUSER_SHARED instead of from QueryPerformanceCounter, which is far faster, but only has a granularity of 100 nanoseconds.
In some games seemingly random crashes were happening that were hard to trace because
the faulting thread was actually not the one that was misbehaving, another threads stack was underflowing into the faulting thread.
Added a bunch of code to synchronize the guest stack and host stack so that if a guest longjmps the host's stack will be adjusted.
Changes were also made to allow the guest to call into a piece of an existing x64 function.
This synchronization might have a slight performance impact on lower end cpus, to disable it set enable_host_guest_stack_synchronization to false.
It is possible it may have introduced regressions, but i dont know of any yet
So far, i know the synchronization change fixes the "hub crash" in super sonic and allows the game "london 2012" to go ingame.
Removed emit_useless_fpscr_updates, not emitting these updates breaks the raiden game
MapGuestAddressToMachineCode now returns nullptr if no address was found, instead of the start of the function
add Processor::LookupModule
Add Backend::DeinitializeBackendContext
Use WriteRegisterRangeFromRing_WithKnownBound<0, 0xFFFF> in WriteRegisterRangeFromRing for inlining (previously regressed on performance of ExecutePacketType0)
add notes about flags that trap in XamInputGetCapabilities
0 == 3 in XamInputGetCapabilities
Name arg 2 of XamInputSetState
PrefetchW in critical section kernel funcs if available & doing cmpxchg
Add terminated field to X_KTHREAD, set it on termination
Expanded the logic of NtResumeThread/NtSuspendThread to include checking the type of the handle (in release, LookupObject doesnt seem to do anything with the type)
and returning X_STATUS_OBJECT_TYPE_MISMATCH if invalid. Do termination check in NtSuspendThread.
Add basic host exception messagebox, need to flesh it out more (maybe use the new stack tracking stuff if on guest thrd?)
Add rdrand patching hack, mostly affects users with nvidia cards who have many threads on zen
Use page_size_shift in more places
Once again disable precompilation! Raiden is mostly weird ppc asm which probably breaks the precompilation. The code is still useful for running the compiler over the whole of an xex in debug to test for issues
  void Precompile() override;

 protected:
  std::unique_ptr<Function> CreateFunction(uint32_t address) override;

 private:
  void PrecompileKnownFunctions();
  void PrecompileDiscoveredFunctions();
  std::vector<uint32_t> PreanalyzeCode();
  friend struct XexInfoCache;

  void ReadSecurityInfo();

  int ReadImage(const void* xex_addr, size_t xex_length, bool use_dev_key);
  int ReadImageUncompressed(const void* xex_addr, size_t xex_length);
  int ReadImageBasicCompressed(const void* xex_addr, size_t xex_length);
  int ReadImageCompressed(const void* xex_addr, size_t xex_length);

  int ReadPEHeaders();
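A minimal sketch of the retail/devkit key autodetection described in the loader notes above (illustrative control flow; the real logic lives in xex_module.cc):

  // Try the retail key first; on failure, retry with the devkit key and
  // record which one decrypted the image successfully.
  int result = ReadImage(xex_addr, xex_length, /*use_dev_key=*/false);
  if (result != 0) {
    result = ReadImage(xex_addr, xex_length, /*use_dev_key=*/true);
    if (result == 0) {
      is_dev_kit_ = true;
    }
  }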
  bool SetupLibraryImports(const std::string_view name,
                           const xex2_import_library* library);
  bool FindSaveRest();

  Processor* processor_ = nullptr;
  kernel::KernelState* kernel_state_ = nullptr;
  std::string name_;
  std::string path_;

  std::vector<uint8_t> xex_header_mem_;  // Holds the XEX header.
  std::vector<uint8_t> xexp_data_mem_;   // Holds XEXP patch data.

  std::vector<ImportLibrary>
      import_libs_;  // Pre-loaded import libraries for ease of use.
  std::vector<PESection> pe_sections_;

  // XEX_HEADER_ALTERNATE_TITLE_IDS loaded into a safe std::vector.
  std::vector<uint32_t> opt_alternate_title_ids_;

  uint8_t session_key_[0x10];
  bool is_dev_kit_ = false;

  bool loaded_ = false;         // Loaded into memory?
  bool finished_load_ = false;  // PE/imports/symbols/etc all loaded?

  uint32_t base_address_ = 0;
  uint32_t low_address_ = 0;
  uint32_t high_address_ = 0;

  XexFormat xex_format_ = kFormatUnknown;
  SecurityInfoContext security_info_ = {};
  uint8_t image_sha_bytes_[20];  // SHA-1 digest of the image.
  std::string image_sha_str_;    // Printable form of image_sha_bytes_.
  XexInfoCache info_cache_;
};

}  // namespace cpu
}  // namespace xe

#endif  // XENIA_CPU_XEX_MODULE_H_