Adding some docs on CPU optimizations/potential work.

Ben Vanik 2015-07-13 18:20:38 -07:00
parent c6ebcd508d
commit 31dab70a3a
4 changed files with 317 additions and 180 deletions

docs/cpu_todo.md Normal file

@ -0,0 +1,317 @@
# CPU TODO
There are many improvements that can be made under `xe::cpu` to improve
debugging, performance (both to JIT and of generated code), and portability.
Some are in various states of completion, and others are just thoughts that need
more exploring.
## Debugging Improvements
### Reproducible X64 Emission
It'd be useful to be able to run a PPC function through the entire pipeline and
spit out x64 that is byte-for-byte identical across runs. This would allow
automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
will relocate the x64 when placing it in memory, which will be at a different
location each time. Instead it would be nice to have the xbyak `calcJmpAddress`
that performs the relocations use the address of our choosing.
### Stack Walking
Currently we rely on the Windows/VC++ dbghelp stack walking, which is not
portable, is slow, and cannot resolve JIT'ed symbols properly. Having our
own stack walking code that could fall back to dbghelp (via some pluggable
system) for host symbols would let us quickly get stacks through host and guest
code and make things like sampling profilers, kernel callstack tracing, and
other features possible.
### Sampling Profiler
Once we have stack walking it'd be nice to take something like
[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to
support our system. This would let us run continuous performance analysis and
track hotspots in JITed code without a large performance impact. Automatically
showing the top hot functions in the debugger could help track down poor
translation much faster.
### Intel Architecture Code Analyzer Support
The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
is a nifty tool that, given a kernel of x64, can detail theoretical performance
characteristics on different processors down to cycle timings and potential
bottlenecks on memory/execution units. It's designed to run on elf/obj/etc
files, but it simply looks for special markers in the code. Something that
walks the code cache and dumps a specially formatted file with the markers
around basic blocks could allow running the tool in bulk. Alternatively, being
able to invoke it one-off, by dumping a specific x64 block to disk and
processing it for display when looking at the code in the debugger, would be
useful.
I've done some early experiments with this and it's possible to pass just a
bin file with the markers and the x64.
### Function Tracing/Coverage Information
`function_trace_data.h` contains the `FunctionTraceData` struct, which is
currently partially populated by the x64 backend. This enables tracking of which
threads a function is called on, function call count, recent callers of the
function, and even instruction-level counts.
This is all only partially implemented, though, and there's no tool to read it
out. This would be nice to get integrated into the debugger so that it can
overlay the information when viewing a function, but also useful in aggregate to
find hot functions/code paths or enhance callstacks by automatically annotating
thread information.
#### Block-level Counting
Currently the code assumes each instruction has a count, however this is
expensive and often unneeded as it can be done on a block level and then the
instruction counts can be derived from that. This can reduce the overhead (both
in memory and accounting time) by an order of magnitude.
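A rough sketch of the derivation, assuming blocks don't overlap and always run to completion once entered. `BlockCount` and `DeriveInstructionCounts` are hypothetical names for illustration and are not part of `FunctionTraceData`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-block record; the real FunctionTraceData layout differs.
struct BlockCount {
  size_t first_instr;   // Index of the first instruction in the block.
  size_t instr_count;   // Number of instructions in the block.
  uint64_t executions;  // Times the block was entered.
};

// Expands block-level counts into per-instruction counts. Every instruction
// in a block inherits the execution count of that block.
std::vector<uint64_t> DeriveInstructionCounts(
    const std::vector<BlockCount>& blocks, size_t total_instrs) {
  std::vector<uint64_t> counts(total_instrs, 0);
  for (const auto& block : blocks) {
    assert(block.first_instr + block.instr_count <= total_instrs);
    for (size_t i = 0; i < block.instr_count; ++i) {
      counts[block.first_instr + i] = block.executions;
    }
  }
  return counts;
}
```

Only one counter per block has to be bumped at runtime; the expansion can happen lazily when a tool actually asks for instruction-level data.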
### On-Stack Context Inspection
Currently the debugger only works with `--store_all_context_values`, as it can
only get the values of PPC registers when they are stored to the PPC context
after each instruction. As this can slow things down by ~10-20% it could be
useful to be able to preserve the optimized and register-allocated HIR so that
host registers holding context values can be derived on demand. Or, we could
just make `--store_all_context_values` faster.
## JIT Performance Improvements
### Reduce HIR Size
Currently there are a lot of pointers stored within `Instr`, `Value`, and
related types. These are big 8B values that eat a lot of memory and really
hurt the cache (especially with all the block/instruction walking done).
Aligning everything to 16B values in the arena and using 16bit indices
(or something) could shrink things a lot.
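To make the idea concrete, here is a minimal sketch of 16-byte slots addressed by 16-bit indices. `Slot`, `SlotIndex`, and `SlotArena` are hypothetical; the existing arena and the `Instr`/`Value` layouts are not modeled here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-byte slot; real Instr/Value records would need to be
// packed (or split across several slots) to fit.
struct alignas(16) Slot {
  uint8_t bytes[16];
};

// 16-bit handle used in place of an 8-byte pointer.
using SlotIndex = uint16_t;
constexpr SlotIndex kInvalidSlot = 0xFFFF;

class SlotArena {
 public:
  // Returns the index of a fresh zeroed slot, or kInvalidSlot when full.
  SlotIndex Allocate() {
    if (slots_.size() >= kInvalidSlot) return kInvalidSlot;
    slots_.push_back(Slot{});
    return static_cast<SlotIndex>(slots_.size() - 1);
  }

  // Resolves an index back to a slot; valid until the next Allocate().
  Slot* Resolve(SlotIndex index) {
    assert(index < slots_.size());
    return &slots_[index];
  }

 private:
  std::vector<Slot> slots_;
};
```

With 16-bit indices a single arena tops out at 65535 slots (about 1MB of 16B slots), so very large functions would need wider indices or one arena per block.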
### Serialize Code Cache
The x64 code cache is currently set up to use fixed memory addresses and is even
represented as mapped memory. It should be fairly easy to back this with a file
and have all code written to disk. Adding more metadata, or perhaps a side-car
file, would allow for the code to be written to disk. On future runs the code
cache could load this data (by mapping the file containing the code right into
memory) and short cut JIT'ing entirely.
It would be possible to use a common container format (ELF/etc), however there's
elegance in not requiring any additional steps beyond the memory mapping. Such
containers could be useful for running static tools against, though.
## Portability Improvements
### Emulated Opcode Layer
Having a way to use emulated variants for any HIR opcode in a backend would
help when writing a new backend as well as when verifying the existing backends.
This may look like a C library with functions for each opcode/type pairing and
utilities to call out to them. Something like the x64 backend could then call
out to these with CallNativeSafe (or some faster equivalent) and something like
an interpreter backend would be fairly trivial to write.
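One possible shape for such a library, sketched with a tiny made-up subset of opcodes and types. The enums, the `EmulatedFn` signature, and `LookupEmulated` are all assumptions for illustration, not existing xenia APIs:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical opcode and type enums; the real HIR has many more of each.
enum class Opcode { kAdd, kAnd, kCount };
enum class Type { kI32, kI64, kCount };

// Every emulated routine takes and returns 64-bit values for simplicity.
using EmulatedFn = uint64_t (*)(uint64_t, uint64_t);

static uint64_t EmulateAddI32(uint64_t a, uint64_t b) {
  return static_cast<uint32_t>(a) + static_cast<uint32_t>(b);
}
static uint64_t EmulateAddI64(uint64_t a, uint64_t b) { return a + b; }
static uint64_t EmulateAndI32(uint64_t a, uint64_t b) {
  return static_cast<uint32_t>(a) & static_cast<uint32_t>(b);
}
static uint64_t EmulateAndI64(uint64_t a, uint64_t b) { return a & b; }

// One entry per opcode/type pairing.
static const EmulatedFn kEmulated[2][2] = {
    {EmulateAddI32, EmulateAddI64},
    {EmulateAndI32, EmulateAndI64},
};

// Returns the emulated routine for the given opcode/type pairing.
EmulatedFn LookupEmulated(Opcode opcode, Type type) {
  assert(opcode != Opcode::kCount && type != Type::kCount);
  return kEmulated[static_cast<int>(opcode)][static_cast<int>(type)];
}
```

A backend could fetch the function pointer at JIT time and emit a call to it, while an interpreter could dispatch through the same table directly.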
## X64 Backend Improvements
### Implement Emulated Instructions
There are a ton of half-implemented HIR opcodes that call out to C++ to do their
work. These are extremely expensive as they incur a full guest-to-host thunk
(~hundreds of instructions!). Basically, any of the `Emulate*`/`CallNativeSafe`
functions in `x64_sequences.cc` need to be replaced with proper AVX/AVX2
variants.
### Increase Register Availability
Currently only a few x64 registers are usable (due to reservations by the
backend or ABI conflicts). Though register pressure is surprisingly light in
most cases there are pathological cases that result in a lot of spills. By
freeing up some of the registers these spills could be reduced.
### Constant Pooling
This may make sense as a compiler pass instead.
Right now, particular sequences of instructions are nasty - such as anything
using `LoadConstantXmm` to load non-zero or non-1 vec128's. Instead of doing the
super fat (20-30byte!) constant loads as they are done now it may be better to
keep a per-function constant table and instead use RIP-relative addressing (or
something) to use the memory-form AVX instructions.
For example, right now this:
```
v82.v128 = [0,1,2,3]
v83.v128 = or v81.v128, v82.v128
```
Translates to (something like):
```
mov([rsp+0x...], 0x00000000)
mov([rsp+0x...+4], 0x00000001)
mov([rsp+0x...+8], 0x00000002)
mov([rsp+0x...+12], 0x00000003)
vmovdqa(xmm3, [rsp+0x...])
vor(xmm2, xmm2, xmm3)
```
Whereas it could be:
```
vor(xmm2, xmm2, [rip+0x...])
```
Whether the cost of doing the constant de-dupe is worth it remains to be seen.
Right now it's wasting a lot of instruction cache space, increasing decode time,
and potentially using a lot more memory bandwidth.
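A minimal sketch of what a de-duping per-function constant table might look like. `ConstantPool` and `Intern` are hypothetical names, and how the returned offset gets turned into a RIP-relative operand is left out entirely:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Vec128 = std::array<uint32_t, 4>;

// Hypothetical per-function constant table with de-duplication.
class ConstantPool {
 public:
  // Returns the byte offset of the constant, adding it if not yet present.
  size_t Intern(const Vec128& value) {
    auto it = offsets_.find(value);
    if (it != offsets_.end()) return it->second;
    size_t offset = data_.size() * sizeof(Vec128);
    data_.push_back(value);
    offsets_.emplace(value, offset);
    return offset;
  }

  // Bytes that would be emitted alongside the function's code.
  size_t size_bytes() const { return data_.size() * sizeof(Vec128); }

 private:
  std::vector<Vec128> data_;
  std::map<Vec128, size_t> offsets_;
};
```

The `std::map` lookup is the de-dupe cost in question; a hash table, or skipping de-dupe and just appending, are the obvious alternatives to measure against.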
## Optimization Improvements
### Speed Up RegisterAllocationPass
Currently the slowest pass, this could be improved by requiring less use
tracking or perhaps maintaining the use tracking in other passes. A faster
SortUsageList (radix or something fancy?) may be helpful as well.
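For reference, a stable LSD radix sort looks roughly like this. The `Use` record and its 32-bit `ordinal` key are assumptions; what SortUsageList actually sorts on may differ, and whether this beats the current sort is unmeasured:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical use record keyed by the ordinal of the using instruction.
struct Use {
  uint32_t ordinal;
  uint32_t value_id;
};

// Stable LSD radix sort over four 8-bit digits of the ordinal.
void RadixSortUses(std::vector<Use>& uses) {
  std::vector<Use> scratch(uses.size());
  for (int shift = 0; shift < 32; shift += 8) {
    // starts[d] ends up as the output position of the first key with digit d.
    std::array<size_t, 257> starts{};
    for (const Use& use : uses) {
      ++starts[((use.ordinal >> shift) & 0xFF) + 1];
    }
    for (size_t i = 1; i < starts.size(); ++i) starts[i] += starts[i - 1];
    for (const Use& use : uses) {
      scratch[starts[(use.ordinal >> shift) & 0xFF]++] = use;
    }
    uses.swap(scratch);
  }
  assert(scratch.size() == uses.size());
}
```

For short lists the fixed four passes may well be slower than a comparison sort, so a size cutoff would probably be needed.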
### More Opcodes in ConstantPropagationPass
There are a few HIR opcodes with no handling, and others with minimal handling.
It'd be nice to know what paths need improvement and add them, as any work here
makes things free later on.
### Cross-Block ConstantPropagationPass
Constant propagation currently only occurs within a single block. This makes it
difficult to optimize common PPC patterns like loading the constants 0 or 1 into
a register before a loop and other loads of expensive altivec values.
Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed to
track constant load_context/store_context's across block bounds and propagate
the values. This is simpler than dynamic values as no phi functions or anything
fancy needs to happen.
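The merge step at a block entry could be as simple as the sketch below: a context slot is treated as constant only if every predecessor leaves the same constant in it. `ConstantState` and `MeetPredecessors` are hypothetical, and loops (back edges) would additionally need iteration to a fixed point, which is not shown:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Known constant context slots at the end of a block: offset -> value.
using ConstantState = std::map<uint32_t, uint64_t>;

// A slot is constant on entry only if every predecessor leaves the same
// constant in it. With no predecessors nothing is known.
ConstantState MeetPredecessors(const std::vector<ConstantState>& preds) {
  if (preds.empty()) return {};
  ConstantState result = preds.front();
  for (size_t i = 1; i < preds.size(); ++i) {
    for (auto it = result.begin(); it != result.end();) {
      auto found = preds[i].find(it->first);
      if (found == preds[i].end() || found->second != it->second) {
        it = result.erase(it);  // Predecessors disagree; drop the slot.
      } else {
        ++it;
      }
    }
  }
  assert(result.size() <= preds.front().size());
  return result;
}
```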
### Add TypePropagationPass
There are many extensions/truncations in generated code right now due to
various loads/stores of varying widths. Being able to find and short-circuit
the conversions early on would make following passes cleaner and faster, as
they'd have to trace through fewer value definitions, and there'd be fewer
extraneous movs in the final code.
Example (after ContextPromotion):
```
v82.i32 = truncate v81.i64
v84.i32 = and v82.i32, 3F
v85.i64 = zero_extend v84.i32
```
Becomes (after DCE/etc):
```
v85.i64 = and v81.i64, 3F
```
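The rewrite is safe because truncating to 32 bits, masking, and zero-extending gives the same result as a single 64-bit `and` with the zero-extended mask. A quick sketch to check that claim (function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Original sequence: truncate, and, zero_extend.
uint64_t MaskViaI32(uint64_t v81, uint32_t mask) {
  uint32_t v82 = static_cast<uint32_t>(v81);  // truncate
  uint32_t v84 = v82 & mask;                  // and
  return static_cast<uint64_t>(v84);          // zero_extend
}

// Collapsed form: a single 64-bit and with the zero-extended mask.
uint64_t MaskDirect(uint64_t v81, uint32_t mask) {
  return v81 & static_cast<uint64_t>(mask);
}

// Asserts that both forms agree for one input.
void CheckEquivalent(uint64_t value, uint32_t mask) {
  assert(MaskViaI32(value, mask) == MaskDirect(value, mask));
}
```

The same reasoning would not hold for a sign_extend with a mask that has bit 31 set, so the pass would have to check which extension is in play.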
### Enhance MemorySequenceCombinationPass with Extend/Truncate
Currently this pass will look for byte_swap and merge that into loads/stores.
This allows for better final codegen at the cost of making optimization more
difficult, so it only happens at the end of the process.
There are currently TODOs in there for adding extend/truncate support, which
will extend what it does with swaps to also merge the
sign_extend/zero_extend/truncate into the matching load/store. This allows for
the x64 backend to generate the proper mov's that do these operations without
requiring additional steps. Note that if we had a LIR and a peephole optimizer
this would be better done there.
Load with swap and extend:
```
v1.i32 = load v0
v2.i32 = byte_swap v1.i32
v3.i64 = zero_extend v2.i32
```
Becomes:
```
v1.i64 = load_convert v0, [swap|i32->i64,zero]
```
Store with truncate and swap:
```
v1.i64 = ...
v2.i32 = truncate v1.i64
v3.i32 = byte_swap v2.i32
store v0, v3.i32
```
Becomes:
```
store_convert v0, v1.i64, [swap|i64->i32,trunc]
```
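In C++ terms, the two fused operations would behave roughly like the helpers below. This only models the intended semantics; `load_convert`/`store_convert` do not exist yet and the helper names are made up:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reverses the byte order of a 32-bit value.
uint32_t ByteSwap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) |
         (v << 24);
}

// Models `load_convert v0, [swap|i32->i64,zero]` as one fused step.
uint64_t LoadSwapZeroExtend(const uint8_t* address) {
  assert(address != nullptr);
  uint32_t raw;
  std::memcpy(&raw, address, sizeof(raw));
  return static_cast<uint64_t>(ByteSwap32(raw));
}

// Models `store_convert v0, v1.i64, [swap|i64->i32,trunc]`.
void StoreTruncateSwap(uint8_t* address, uint64_t value) {
  assert(address != nullptr);
  uint32_t swapped = ByteSwap32(static_cast<uint32_t>(value));
  std::memcpy(address, &swapped, sizeof(swapped));
}
```

On x64 a 32-bit `movbe` load already swaps and zero-extends into the full 64-bit register, which is presumably the kind of single instruction the fused form could lower to on hosts that support it.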
### Add DeadStoreEliminationPass
Generic DSE pass, removing all redundant stores. ContextPromotion may be
able to take care of most of these, as the input assembly is generally
pretty optimized already. This pass would mainly be looking for introduced
stores, such as those from comparisons.
Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing
edges as well as dominators, and that could be used to check whether stores into
the context are used in their destination block or instead overwritten
(currently they almost never are).
If this pass were able to remove a good number of the stores, the comparisons
feeding them would then be removed by dead code elimination, dramatically
reducing branch overhead.
Example:
```
<block0>:
v0 = compare_ult ... (later removed by DCE)
v1 = compare_ugt ... (later removed by DCE)
v2 = compare_eq ...
store_context +300, v0 <-- removed
store_context +301, v1 <-- removed
store_context +302, v2 <-- removed
branch_true v1, ...
<block1>:
v3 = compare_ult ...
v4 = compare_ugt ...
v5 = compare_eq ...
store_context +300, v3 <-- these may be required if at end of function
store_context +301, v4 or before a call
store_context +302, v5
branch_true v5, ...
```
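The simpler within-block case can be sketched as a backwards walk that remembers which context offsets get stored again later. The instruction model here is hypothetical and heavily simplified; the cross-block case shown above would additionally need the edge information from ControlFlowAnalysisPass:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, heavily simplified instruction model.
enum class Op { kStoreContext, kLoadContext, kCall, kOther };
struct Instr {
  Op op;
  uint32_t offset;  // Context offset for loads/stores; unused otherwise.
  bool dead;        // Set by MarkDeadStores.
};

// Marks stores that are overwritten later in the same block with no
// intervening load of that offset and no call. Walks the block backwards
// and returns the number of stores marked dead.
size_t MarkDeadStores(std::vector<Instr>& block) {
  std::set<uint32_t> overwritten;  // Offsets stored again later on.
  size_t removed = 0;
  for (auto it = block.rbegin(); it != block.rend(); ++it) {
    switch (it->op) {
      case Op::kStoreContext:
        if (!overwritten.insert(it->offset).second) {
          it->dead = true;
          ++removed;
        }
        break;
      case Op::kLoadContext:
        overwritten.erase(it->offset);
        break;
      case Op::kCall:
        overwritten.clear();  // Callee may read any context value.
        break;
      case Op::kOther:
        break;
    }
  }
  assert(removed <= block.size());
  return removed;
}
```

This treats every offset as an independent slot; overlapping stores of different widths would need extra care.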
### Add X64CanonicalizationPass
For various opcodes, add copies or commute the arguments to match x64
operand semantics. This makes code generation easier and, if done
before register allocation, can prevent a lot of extra shuffling in
the emitted code.
Example:
```
<block0>:
v0 = ...
v1 = ...
v2 = add v0, v1 <-- v1 now unused
```
Becomes:
```
v0 = ...
v1 = ...
v1 = add v1, v0 <-- src1 = dest/src, so reuse for both
by commuting and setting dest = src1
```
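The decision for a single binary instruction might look like the following sketch. `BinaryInstr` and `CanonicalizeForX64` are hypothetical, and the liveness flags are assumed to come from existing use tracking:

```cpp
#include <cassert>
#include <utility>

// Hypothetical three-operand HIR instruction using value ids.
struct BinaryInstr {
  int dest;
  int src1;
  int src2;
  bool commutative;
};

// Rewrites `dest = op src1, src2` so that dest aliases src1, matching the
// two-operand x64 form. `src1_dead` / `src2_dead` say whether each source
// has no later uses. Returns true when the instruction was rewritten.
bool CanonicalizeForX64(BinaryInstr& instr, bool src1_dead, bool src2_dead) {
  if (src1_dead) {
    instr.dest = instr.src1;
    return true;
  }
  if (src2_dead && instr.commutative) {
    std::swap(instr.src1, instr.src2);
    instr.dest = instr.src1;
    return true;
  }
  return false;  // A copy of src1 would be needed instead.
}
```

When neither source dies at the instruction, or the opcode is not commutative and only src2 dies, the pass would have to insert an explicit copy, which is the "add copies" half of the description.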
### Add MergeLocalSlotsPass
As the RegisterAllocationPass runs it generates load_local/store_local as it
spills. Currently each set of locals is unique to each block, which in very
large functions can result in a lot of locals that are only used briefly. It
may be useful to use the results of the ControlFlowAnalysisPass to track local
liveness and merge the slots so they are reused when they cannot possibly be
live at the same time. This saves stack space and potentially improves cache
behavior.
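As a starting point, slot merging can be treated as greedy interval assignment. The sketch below assumes each local has a single linear live range in instruction ordinals, which is a simplification; real liveness would come from ControlFlowAnalysisPass and may have holes. `LocalRange` and this `MergeLocalSlots` function are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical live range of one local, in instruction ordinals [start, end].
struct LocalRange {
  size_t start;
  size_t end;
  size_t slot;  // Assigned by MergeLocalSlots.
};

// Greedily reuses a slot once the local that held it is no longer live.
// Returns the number of distinct slots needed.
size_t MergeLocalSlots(std::vector<LocalRange>& locals) {
  std::sort(locals.begin(), locals.end(),
            [](const LocalRange& a, const LocalRange& b) {
              return a.start < b.start;
            });
  std::vector<size_t> slot_free_after;  // Last live ordinal per slot.
  for (LocalRange& local : locals) {
    assert(local.start <= local.end);
    size_t slot = 0;
    while (slot < slot_free_after.size() &&
           slot_free_after[slot] >= local.start) {
      ++slot;
    }
    if (slot == slot_free_after.size()) slot_free_after.push_back(0);
    slot_free_after[slot] = local.end;
    local.slot = slot;
  }
  return slot_free_after.size();
}
```

This ignores slot sizes; mixing 8B and 16B locals would mean either keeping separate pools per size or only merging locals of the same type.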


@ -53,11 +53,6 @@ static const size_t MAX_CODE_SIZE = 1 * 1024 * 1024;
static const size_t STASH_OFFSET = 32;
static const size_t STASH_OFFSET_HIGH = 32 + 32;
// If we are running with tracing on we have to store the EFLAGS in the stack,
// otherwise our calls out to C to print will clear it before DID_CARRY/etc
// can get the value.
#define STORE_EFLAGS 1
const uint32_t X64Emitter::gpr_reg_map_[X64Emitter::GPR_COUNT] = {
Operand::RBX, Operand::R12, Operand::R13, Operand::R14, Operand::R15,
};
@ -539,25 +534,6 @@ void X64Emitter::nop(size_t length) {
}
}
void X64Emitter::LoadEflags() {
#if STORE_EFLAGS
mov(eax, dword[rsp + STASH_OFFSET]);
btr(eax, 0);
#else
// EFLAGS already present.
#endif // STORE_EFLAGS
}
void X64Emitter::StoreEflags() {
#if STORE_EFLAGS
pushf();
pop(dword[rsp + STASH_OFFSET]);
#else
// EFLAGS should have CA set?
// (so long as we don't fuck with it)
#endif // STORE_EFLAGS
}
bool X64Emitter::ConstantFitsIn32Reg(uint64_t v) {
if ((v & ~0x7FFFFFFF) == 0) {
// Fits under 31 bits, so just load using normal mov.


@ -173,9 +173,6 @@ class X64Emitter : public Xbyak::CodeGenerator {
// TODO(benvanik): Label for epilog (don't use strings).
void LoadEflags();
void StoreEflags();
// Moves a 64bit immediate into memory.
bool ConstantFitsIn32Reg(uint64_t v);
void MovMem64(const Xbyak::RegExp& addr, uint64_t v);


@ -24,157 +24,4 @@
#include "xenia/cpu/compiler/passes/validation_pass.h"
#include "xenia/cpu/compiler/passes/value_reduction_pass.h"
// TODO:
// - mark_use/mark_set
// For now: mark_all_changed on all calls
// For external functions:
// - load_context/mark_use on all arguments
// - mark_set on return argument?
// For internal functions:
// - if liveness analysis already done, use that
// - otherwise, assume everything dirty (ACK!)
// - could use scanner to insert mark_use
//
// Maybe:
// - v0.xx = load_constant <c>
// - v0.xx = load_zero
// Would prevent NULL defs on values, and make constant de-duping possible.
// Not sure if it's worth it, though, as the extra register allocation
// pressure due to de-duped constants seems like it would slow things down
// a lot.
//
// - CFG:
// Blocks need predecessors()/successor()
// phi Instr reference
//
// - block liveness tracking (in/out)
// Block gets:
// AddIncomingValue(Value* value, Block* src_block) ??
// Potentially interesting passes:
//
// Run order:
// ContextPromotion
// Simplification
// ConstantPropagation
// TypePropagation
// ByteSwapElimination
// Simplification
// DeadStoreElimination
// DeadCodeElimination
//
// - TypePropagation
// There are many extensions/truncations in generated code right now due to
// various load/stores of varying widths. Being able to find and short-
// circuit the conversions early on would make following passes cleaner
// and faster as they'd have to trace through fewer value definitions.
// Example (after ContextPromotion):
// v81.i64 = load_context +88
// v82.i32 = truncate v81.i64
// v84.i32 = and v82.i32, 3F
// v85.i64 = zero_extend v84.i32
// v87.i64 = load_context +248
// v88.i64 = v85.i64
// v89.i32 = truncate v88.i64 <-- zero_extend/truncate => v84.i32
// v90.i32 = byte_swap v89.i32
// store v87.i64, v90.i32
// after type propagation / simplification / DCE:
// v81.i64 = load_context +88
// v82.i32 = truncate v81.i64
// v84.i32 = and v82.i32, 3F
// v87.i64 = load_context +248
// v90.i32 = byte_swap v84.i32
// store v87.i64, v90.i32
//
// - ByteSwapElimination
// Find chained byte swaps and replace with assignments. This is often found
// in memcpy paths.
// Example:
// v0 = load ...
// v1 = byte_swap v0
// v2 = byte_swap v1
// store ..., v2 <-- this could be v0
//
// It may be tricky to detect, though, as often times there are intervening
// instructions:
// v21.i32 = load v20.i64
// v22.i32 = byte_swap v21.i32
// v23.i64 = zero_extend v22.i32
// v88.i64 = v23.i64 (from ContextPromotion)
// v89.i32 = truncate v88.i64
// v90.i32 = byte_swap v89.i32
// store v87.i64, v90.i32
// After type propagation:
// v21.i32 = load v20.i64
// v22.i32 = byte_swap v21.i32
// v89.i32 = v22.i32
// v90.i32 = byte_swap v89.i32
// store v87.i64, v90.i32
// This could ideally become:
// v21.i32 = load v20.i64
// ... (DCE takes care of this) ...
// store v87.i64, v21.i32
//
// - DeadStoreElimination
// Generic DSE pass, removing all redundant stores. ContextPromotion may be
// able to take care of most of these, as the input assembly is generally
// pretty optimized already. This pass would mainly be looking for introduced
// stores, such as those from comparisons.
//
// Example:
// <block0>:
// v0 = compare_ult ... (later removed by DCE)
// v1 = compare_ugt ... (later removed by DCE)
// v2 = compare_eq ...
// store_context +300, v0 <-- removed
// store_context +301, v1 <-- removed
// store_context +302, v2 <-- removed
// branch_true v1, ...
// <block1>:
// v3 = compare_ult ...
// v4 = compare_ugt ...
// v5 = compare_eq ...
// store_context +300, v3 <-- these may be required if at end of function
// store_context +301, v4 or before a call
// store_context +302, v5
// branch_true v5, ...
//
// - X86Canonicalization
// For various opcodes add copies/commute the arguments to match x86
// operand semantics. This makes code generation easier and if done
// before register allocation can prevent a lot of extra shuffling in
// the emitted code.
//
// Example:
// <block0>:
// v0 = ...
// v1 = ...
// v2 = add v0, v1 <-- v1 now unused
// Becomes:
// v0 = ...
// v1 = ...
// v1 = add v1, v0 <-- src1 = dest/src, so reuse for both
// by commuting and setting dest = src1
//
// - RegisterAllocation
// Given a machine description (register classes, counts) run over values
// and assign them to registers, adding spills as needed. It should be
// possible to directly emit code from this form.
//
// Example:
// <block0>:
// v0 = load_context +0
// v1 = load_context +1
// v0 = add v0, v1
// ...
// v2 = mul v0, v1
// Becomes:
// reg0 = load_context +0
// reg1 = load_context +1
// reg2 = add reg0, reg1
// store_local +123, reg2 <-- spill inserted
// ...
// reg0 = load_local +123 <-- load inserted
// reg0 = mul reg0, reg1
#endif // XENIA_COMPILER_COMPILER_PASSES_H_