xenia-canary/src/xenia/cpu/xex_module.h

271 lines
8.1 KiB
C
Raw Normal View History

/**
******************************************************************************
* Xenia : Xbox 360 Emulator Research Project *
******************************************************************************
* Copyright 2020 Ben Vanik. All rights reserved. *
* Released under the BSD license - see LICENSE in the root for more details. *
******************************************************************************
*/
#ifndef XENIA_CPU_XEX_MODULE_H_
#define XENIA_CPU_XEX_MODULE_H_
2014-07-14 04:15:37 +00:00
#include <string>
2015-08-07 03:17:01 +00:00
#include <vector>
Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
2022-08-13 19:59:00 +00:00
#include "xenia/base/mapped_memory.h"
2015-03-24 15:25:58 +00:00
#include "xenia/cpu/module.h"
2015-06-29 05:48:24 +00:00
#include "xenia/kernel/util/xex2_info.h"
namespace xe {
namespace kernel {
class KernelState;
2015-08-07 03:17:01 +00:00
} // namespace kernel
} // namespace xe
2015-08-07 03:17:01 +00:00
namespace xe {
namespace cpu {
constexpr fourcc_t kXEX1Signature = make_fourcc("XEX1");
constexpr fourcc_t kXEX2Signature = make_fourcc("XEX2");
constexpr fourcc_t kElfSignature = make_fourcc(0x7F, 'E', 'L', 'F');
2015-03-24 15:25:58 +00:00
class Runtime;
Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
2022-08-13 19:59:00 +00:00
struct InfoCacheFlags {
uint32_t was_resolved : 1; // has this address ever been called/requested
// via resolvefunction?
uint32_t accessed_mmio : 1;
uint32_t is_syscall_func : 1;
uint32_t reserved : 29;
Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
2022-08-13 19:59:00 +00:00
};
struct XexInfoCache {
struct InfoCacheFlagsHeader {
unsigned char reserved[256]; // put xenia version here
InfoCacheFlags* LookupFlags(unsigned offset) {
return &reinterpret_cast<InfoCacheFlags*>(&this[1])[offset];
}
};
/*
for every 4-byte aligned address, records a 4 byte set of flags.
*/
std::unique_ptr<MappedMemory> executable_addr_flags_;
void Init(class XexModule*);
InfoCacheFlags* LookupFlags(unsigned offset) {
offset /= 4;
if (!executable_addr_flags_) {
return nullptr;
}
uint8_t* data = executable_addr_flags_->data();
if (!data) {
return nullptr;
}
return reinterpret_cast<InfoCacheFlagsHeader*>(data)->LookupFlags(offset);
}
};
2015-03-24 15:25:58 +00:00
class XexModule : public xe::cpu::Module {
public:
struct ImportLibraryFn {
public:
uint32_t ordinal;
uint32_t value_address;
uint32_t thunk_address;
};
struct ImportLibrary {
public:
std::string name;
uint32_t id;
xe_xex2_version_t version;
xe_xex2_version_t min_version;
std::vector<ImportLibraryFn> imports;
};
2019-10-28 18:59:30 +00:00
struct SecurityInfoContext {
const char* rsa_signature;
const char* aes_key;
uint32_t image_size;
uint32_t image_flags;
uint32_t export_table;
uint32_t load_address;
uint32_t page_descriptor_count;
const xex2_page_descriptor* page_descriptors;
};
enum XexFormat {
kFormatUnknown,
kFormatXex1,
kFormatXex2,
};
2015-05-31 23:58:12 +00:00
XexModule(Processor* processor, kernel::KernelState* kernel_state);
virtual ~XexModule();
bool loaded() const { return loaded_; }
const xex2_header* xex_header() const {
return reinterpret_cast<const xex2_header*>(xex_header_mem_.data());
}
2019-10-28 18:59:30 +00:00
const SecurityInfoContext* xex_security_info() const {
return &security_info_;
}
uint32_t image_size() const {
assert_not_zero(base_address_);
// Calculate the new total size of the XEX image from its headers.
auto heap = memory()->LookupHeap(base_address_);
uint32_t total_size = 0;
for (uint32_t i = 0; i < xex_security_info()->page_descriptor_count; i++) {
// Byteswap the bitfield manually.
xex2_page_descriptor desc;
desc.value =
xe::byte_swap(xex_security_info()->page_descriptors[i].value);
total_size += desc.page_count * heap->page_size();
}
return total_size;
}
const std::vector<ImportLibrary>* import_libraries() const {
return &import_libs_;
}
const xex2_opt_execution_info* opt_execution_info() const {
xex2_opt_execution_info* retval = nullptr;
GetOptHeader(XEX_HEADER_EXECUTION_INFO, &retval);
return retval;
}
2013-12-25 01:25:29 +00:00
const xex2_opt_file_format_info* opt_file_format_info() const {
xex2_opt_file_format_info* retval = nullptr;
GetOptHeader(XEX_HEADER_FILE_FORMAT_INFO, &retval);
return retval;
}
std::vector<uint32_t> opt_alternate_title_ids() const {
return opt_alternate_title_ids_;
}
const uint32_t base_address() const { return base_address_; }
const bool is_dev_kit() const { return is_dev_kit_; }
2015-06-29 05:48:24 +00:00
// Gets an optional header. Returns NULL if not found.
// Special case: if key & 0xFF == 0x00, this function will return the value,
// not a pointer! This assumes out_ptr points to uint32_t.
static bool GetOptHeader(const xex2_header* header, xex2_header_keys key,
2015-06-29 05:48:24 +00:00
void** out_ptr);
bool GetOptHeader(xex2_header_keys key, void** out_ptr) const;
// Ultra-cool templated version
// Special case: if key & 0xFF == 0x00, this function will return the value,
// not a pointer! This assumes out_ptr points to uint32_t.
template <typename T>
static bool GetOptHeader(const xex2_header* header, xex2_header_keys key,
T* out_ptr) {
return GetOptHeader(header, key, reinterpret_cast<void**>(out_ptr));
}
template <typename T>
bool GetOptHeader(xex2_header_keys key, T* out_ptr) const {
return GetOptHeader(key, reinterpret_cast<void**>(out_ptr));
}
2019-10-28 18:59:30 +00:00
static const void* GetSecurityInfo(const xex2_header* header);
const PESection* GetPESection(const char* name);
uint32_t GetProcAddress(uint16_t ordinal) const;
uint32_t GetProcAddress(const std::string_view name) const;
2015-06-29 05:48:24 +00:00
int ApplyPatch(XexModule* module);
bool Load(const std::string_view name, const std::string_view path,
2015-06-29 05:48:24 +00:00
const void* xex_addr, size_t xex_length);
bool LoadContinue();
2015-07-05 19:03:00 +00:00
bool Unload();
bool ContainsAddress(uint32_t address) override;
2014-07-14 04:15:37 +00:00
const std::string& name() const override { return name_; }
bool is_executable() const override {
return (xex_header()->module_flags & XEX_MODULE_TITLE) != 0;
}
bool is_valid_executable() const {
assert_not_zero(base_address_);
if (!base_address_) {
return false;
}
uint8_t* buffer = memory()->TranslateVirtual(base_address_);
return *(uint32_t*)buffer == 0x905A4D;
}
bool is_patch() const {
assert_not_null(xex_header());
if (!xex_header()) {
return false;
}
return (xex_header()->module_flags &
(XEX_MODULE_MODULE_PATCH | XEX_MODULE_PATCH_DELTA |
XEX_MODULE_PATCH_FULL));
}
Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
2022-08-13 19:59:00 +00:00
InfoCacheFlags* GetInstructionAddressFlags(uint32_t guest_addr);
virtual void Precompile() override;
protected:
std::unique_ptr<Function> CreateFunction(uint32_t address) override;
private:
void PrecompileKnownFunctions();
void PrecompileDiscoveredFunctions();
std::vector<uint32_t> PreanalyzeCode();
Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
2022-08-13 19:59:00 +00:00
friend struct XexInfoCache;
void ReadSecurityInfo();
int ReadImage(const void* xex_addr, size_t xex_length, bool use_dev_key);
int ReadImageUncompressed(const void* xex_addr, size_t xex_length);
int ReadImageBasicCompressed(const void* xex_addr, size_t xex_length);
int ReadImageCompressed(const void* xex_addr, size_t xex_length);
int ReadPEHeaders();
bool SetupLibraryImports(const std::string_view name,
const xex2_import_library* library);
2015-05-06 00:21:08 +00:00
bool FindSaveRest();
2015-07-20 01:32:48 +00:00
Processor* processor_ = nullptr;
kernel::KernelState* kernel_state_ = nullptr;
std::string name_;
std::string path_;
std::vector<uint8_t> xex_header_mem_; // Holds the xex header
std::vector<uint8_t> xexp_data_mem_; // Holds XEXP patch data
std::vector<ImportLibrary>
import_libs_; // pre-loaded import libraries for ease of use
std::vector<PESection> pe_sections_;
// XEX_HEADER_ALTERNATE_TITLE_IDS loaded into a safe std::vector
std::vector<uint32_t> opt_alternate_title_ids_;
uint8_t session_key_[0x10];
bool is_dev_kit_ = false;
bool loaded_ = false; // Loaded into memory?
bool finished_load_ = false; // PE/imports/symbols/etc all loaded?
2015-07-20 01:32:48 +00:00
uint32_t base_address_ = 0;
uint32_t low_address_ = 0;
uint32_t high_address_ = 0;
2019-10-28 18:59:30 +00:00
XexFormat xex_format_ = kFormatUnknown;
SecurityInfoContext security_info_ = {};
Huge set of performance improvements, combined with an architecture specific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
2022-08-13 19:59:00 +00:00
uint8_t image_sha_bytes_[16];
std::string image_sha_str_;
XexInfoCache info_cache_;
};
} // namespace cpu
} // namespace xe
#endif // XENIA_CPU_XEX_MODULE_H_