Adding some docs on CPU optimizations/potential work.
This commit is contained in:
parent c6ebcd508d
commit 31dab70a3a

@@ -0,0 +1,317 @@
# CPU TODO

There are many improvements that can be made under `xe::cpu` to improve
debugging, performance (both of the JIT itself and of the generated code), and
portability. Some are in various states of completion, and others are just
thoughts that need more exploring.

## Debugging Improvements

### Reproducible X64 Emission

It'd be useful to be able to run a PPC function through the entire pipeline and
spit out x64 that is byte-for-byte identical across runs. This would allow
automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
relocates the x64 when placing it in memory, which will be at a different
location each time. Instead, it would be nice to have the xbyak `calcJmpAddress`
that performs the relocations use an address of our choosing.

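A rough sketch of the idea, assuming we reserve the code buffer ourselves at a
fixed virtual address and hand it to xbyak so relocations always resolve
against the same base. `kFixedCodeBase`, the buffer size, and
`DeterministicEmitter` are made-up names for illustration, not the current
`X64Emitter` API:

```
// Reserve the code buffer at the same virtual address on every run; emission
// into it is then byte-for-byte reproducible (assuming the address is free).
#include <cstdint>
#include <windows.h>
#include <xbyak/xbyak.h>

constexpr uintptr_t kFixedCodeBase = 0x0000000200000000ull;
constexpr size_t kCodeBufferSize = 16 * 1024 * 1024;

void* ReserveDeterministicCodeBuffer() {
  return VirtualAlloc(reinterpret_cast<void*>(kFixedCodeBase), kCodeBufferSize,
                      MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
}

// xbyak can emit directly into user-provided memory, so rel32 jump targets
// computed during relocation come out identical on each run.
class DeterministicEmitter : public Xbyak::CodeGenerator {
 public:
  DeterministicEmitter(void* fixed_buffer, size_t size)
      : Xbyak::CodeGenerator(size, fixed_buffer) {}
};
```
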
### Stack Walking

Currently the Windows/VC++ dbghelp stack walking is relied upon; however, it is
not portable, is slow, and cannot resolve JIT'ed symbols properly. Having our
own stack walking code that could fall back to dbghelp (via some pluggable
system) for host symbols would let us quickly get stacks through host and guest
code and make things like sampling profilers, kernel callstack tracing, and
other features possible.

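A hypothetical sketch of what the pluggable system could look like; none of
these types exist today, and the real resolvers would hang off the code cache
and dbghelp:

```
// Guest (JIT) frames are resolved by a resolver backed by the code cache;
// anything it doesn't own falls through to a host resolver (e.g. dbghelp).
#include <cstdint>
#include <string>
#include <vector>

struct StackFrame {
  uint64_t host_pc;    // Raw return address captured from the stack.
  std::string symbol;  // Resolved name, guest or host.
  bool is_guest;       // True if host_pc falls inside the JIT code cache.
};

class SymbolResolver {
 public:
  virtual ~SymbolResolver() = default;
  // Returns true if the resolver owns this address and filled in the frame.
  virtual bool TryResolve(uint64_t host_pc, StackFrame* out_frame) = 0;
};

// Walks raw return addresses and asks each resolver in order; the guest
// resolver would be registered ahead of the dbghelp fallback.
std::vector<StackFrame> ResolveStack(
    const std::vector<uint64_t>& pcs,
    const std::vector<SymbolResolver*>& resolvers) {
  std::vector<StackFrame> frames;
  for (uint64_t pc : pcs) {
    StackFrame frame{pc, "<unknown>", false};
    for (auto* resolver : resolvers) {
      if (resolver->TryResolve(pc, &frame)) break;
    }
    frames.push_back(frame);
  }
  return frames;
}
```
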
### Sampling Profiler

Once we have stack walking it'd be nice to take something like
[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to
support our system. This would let us run continuous performance analysis and
track hotspots in JIT'ed code without a large performance impact. Automatically
showing the top hot functions in the debugger could help track down poor
translation much faster.

### Intel Architecture Code Analyzer Support

The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
is a nifty tool that, given a kernel of x64, can detail theoretical performance
characteristics on different processors, down to cycle timings and potential
bottlenecks on memory/execution units. It's designed to run on elf/obj/etc
files; however, it simply looks for special markers in the code. Something that
walks the code cache and dumps a specially formatted file with the markers
around basic blocks would allow running the tool in bulk. Alternatively, a
specific x64 block could be dumped to disk one-off and processed for display
when looking at the code in the debugger.

I've done some early experiments with this, and it's possible to pass just a
bin file with the markers and the x64.

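A rough sketch of the one-off dump path, wrapping a block with the standard
IACA start/end marker byte sequences; `DumpBlockForIaca` and its signature are
hypothetical:

```
// Writes an x64 block to disk surrounded by the IACA markers (mov ebx,
// 111/222 followed by fs addr32 nop) so the analyzer can find the region.
#include <cstdint>
#include <cstdio>
#include <vector>

void DumpBlockForIaca(const uint8_t* code, size_t length, const char* path) {
  static const uint8_t kIacaStart[] = {0xBB, 0x6F, 0x00, 0x00, 0x00,
                                       0x64, 0x67, 0x90};
  static const uint8_t kIacaEnd[] = {0xBB, 0xDE, 0x00, 0x00, 0x00,
                                     0x64, 0x67, 0x90};
  std::vector<uint8_t> buffer;
  buffer.insert(buffer.end(), kIacaStart, kIacaStart + sizeof(kIacaStart));
  buffer.insert(buffer.end(), code, code + length);
  buffer.insert(buffer.end(), kIacaEnd, kIacaEnd + sizeof(kIacaEnd));
  FILE* file = fopen(path, "wb");
  if (file) {
    fwrite(buffer.data(), 1, buffer.size(), file);
    fclose(file);
  }
}
```
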
### Function Tracing/Coverage Information

`function_trace_data.h` contains the `FunctionTraceData` struct, which is
currently partially populated by the x64 backend. This enables tracking of
which threads a function is called on, function call counts, recent callers of
the function, and even instruction-level counts.

This is all only partially implemented, though, and there's no tool to read it
out. It would be nice to get this integrated into the debugger so that it can
overlay the information when viewing a function, but it is also useful in
aggregate to find hot functions/code paths or to enhance callstacks by
automatically annotating thread information.

#### Block-level Counting

Currently the code assumes each instruction has a count; however, this is
expensive and often unneeded, as counting can be done at the block level and
the instruction counts derived from that. This can reduce the overhead (both in
memory and accounting time) by an order of magnitude.

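A minimal sketch of the derivation, assuming a single counter incremented per
block; `BlockTraceData` and the lookup are hypothetical stand-ins for whatever
the trace data would actually store:

```
// Every instruction in a basic block runs exactly as often as the block, so
// per-instruction counts can be looked up from per-block counters.
#include <cstdint>
#include <vector>

struct BlockTraceData {
  uint32_t start_address;    // Guest address of the first instruction.
  uint32_t instruction_count;
  uint64_t execution_count;  // Incremented once per block execution.
};

uint64_t DeriveInstructionCount(const std::vector<BlockTraceData>& blocks,
                                uint32_t address) {
  for (const auto& block : blocks) {
    // PPC instructions are 4 bytes each.
    uint32_t end = block.start_address + block.instruction_count * 4;
    if (address >= block.start_address && address < end) {
      return block.execution_count;
    }
  }
  return 0;
}
```
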
### On-Stack Context Inspection

Currently the debugger only works with `--store_all_context_values`, as it can
only get the values of PPC registers when they are stored to the PPC context
after each instruction. As this can slow things down by ~10-20%, it could be
useful to preserve the optimized and register-allocated HIR so that host
registers holding context values can be derived on demand. Or, we could just
make `--store_all_context_values` faster.

## JIT Performance Improvements

### Reduce HIR Size

Currently there are a lot of pointers stored within `Instr`, `Value`, and
related types. These are big 8-byte values that eat a lot of memory and really
hurt the cache (especially with all the block/instruction walking done).
Aligning everything to 16-byte slots in the arena and using 16-bit indices
(or something) could shrink things a lot.

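A sketch of the idea (not the existing `Arena`/`Instr` types): HIR nodes live
in fixed 16-byte slots and refer to each other with 16-bit slot indices instead
of 8-byte pointers:

```
#include <cstdint>
#include <vector>

using HirRef = uint16_t;    // Index of a 16-byte slot; 0 = null.
constexpr HirRef kNullRef = 0;

struct alignas(16) HirSlot {
  uint8_t bytes[16];
};

class HirArena {
 public:
  HirArena() {
    slots_.reserve(1 << 16);  // Avoid reallocation invalidating Resolve().
    slots_.emplace_back();    // Reserve slot 0 as the null ref.
  }

  // Allocates one 16-byte slot and returns its index.
  HirRef Allocate() {
    slots_.emplace_back();
    return static_cast<HirRef>(slots_.size() - 1);
  }

  template <typename T>
  T* Resolve(HirRef ref) {
    static_assert(sizeof(T) <= sizeof(HirSlot), "node must fit in one slot");
    return ref == kNullRef ? nullptr
                           : reinterpret_cast<T*>(slots_[ref].bytes);
  }

 private:
  std::vector<HirSlot> slots_;
};

// Example node layout: next/src links shrink from 8 bytes each to 2.
struct PackedInstr {
  uint16_t opcode;
  HirRef next;
  HirRef src1;
  HirRef src2;
};
```
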
### Serialize Code Cache

The x64 code cache is currently set up to use fixed memory addresses and is even
represented as mapped memory. It should be fairly easy to back this with a file
and have all generated code written to disk. Adding more metadata, or perhaps a
side-car file, would describe the code that was written so that on future runs
the code cache could load this data (by mapping the file containing the code
right into memory) and short-circuit JIT'ing entirely.

It would be possible to use a common container format (ELF/etc); however,
there's elegance in not requiring any additional steps beyond the memory
mapping. Such containers could be useful for running static tools against,
though.

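A sketch of the file-backed variant, assuming the cache can keep its usual
fixed base address; the path handling, size, and base address are made up for
illustration:

```
// Map the cache file at the fixed base so previously serialized code, which
// was emitted against this address, is valid without any relocation step.
#include <windows.h>

constexpr SIZE_T kCacheSize = 64 * 1024 * 1024;
void* const kCacheBase = reinterpret_cast<void*>(0x0000000240000000ull);

void* MapCodeCacheFile(const wchar_t* path) {
  HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE,
                            FILE_SHARE_READ, nullptr, OPEN_ALWAYS,
                            FILE_ATTRIBUTE_NORMAL, nullptr);
  if (file == INVALID_HANDLE_VALUE) return nullptr;
  HANDLE mapping =
      CreateFileMappingW(file, nullptr, PAGE_EXECUTE_READWRITE, 0,
                         static_cast<DWORD>(kCacheSize), nullptr);
  if (!mapping) return nullptr;
  return MapViewOfFileEx(mapping, FILE_MAP_ALL_ACCESS | FILE_MAP_EXECUTE, 0, 0,
                         kCacheSize, kCacheBase);
}
```
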
## Portability Improvements

### Emulated Opcode Layer

Having a way to use emulated variants for any HIR opcode in a backend would
help when writing a new backend as well as when verifying the existing backends.
This may look like a C library with functions for each opcode/type pairing and
utilities to call out to them. Something like the x64 backend could then call
out to these with `CallNativeSafe` (or some faster equivalent) and something
like an interpreter backend would be fairly trivial to write.

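A sketch of what such a library could look like; the `xe_emu_*` names and the
`vec128_t` layout here are hypothetical, not an existing `xe::cpu` API:

```
// One plain C-callable function per opcode/type pairing that a backend can
// thunk to (or an interpreter can dispatch to directly).
#include <cstdint>

extern "C" {

typedef struct vec128_t {
  uint32_t u32[4];
} vec128_t;

// OPCODE_ADD for i32 operands.
int32_t xe_emu_add_i32(int32_t a, int32_t b) { return a + b; }

// OPCODE_OR for v128 operands.
vec128_t xe_emu_or_v128(vec128_t a, vec128_t b) {
  vec128_t r;
  for (int i = 0; i < 4; ++i) {
    r.u32[i] = a.u32[i] | b.u32[i];
  }
  return r;
}

}  // extern "C"
```
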
## X64 Backend Improvements

### Implement Emulated Instructions

There are a ton of half-implemented HIR opcodes that call out to C++ to do their
work. These are extremely expensive as they incur a full guest-to-host thunk
(~hundreds of instructions!). Basically, any of the `Emulate*`/`CallNativeSafe`
functions in `x64_sequences.cc` need to be replaced with proper AVX/AVX2
variants.

### Increase Register Availability

Currently only a few x64 registers are usable (due to reservations by the
backend or ABI conflicts). Though register pressure is surprisingly light in
most cases, there are pathological cases that result in a lot of spills. Freeing
up some of the registers would reduce these spills.

### Constant Pooling

This may make sense as a compiler pass instead.

Right now, particular sequences of instructions are nasty, such as anything
using `LoadConstantXmm` to load vec128 values other than all-zeros or all-ones.
Instead of doing the super fat (20-30 byte!) constant loads as they are done
now, it may be better to keep a per-function constant table and use RIP-relative
addressing (or something) with the memory-form AVX instructions.

For example, right now this:
```
v82.v128 = [0,1,2,3]
v83.v128 = or v81.v128, v82.v128
```

Translates to (something like):
```
mov([rsp+0x...], 0x00000000)
mov([rsp+0x...+4], 0x00000001)
mov([rsp+0x...+8], 0x00000002)
mov([rsp+0x...+12], 0x00000003)
vmovdqa(xmm3, [rsp+0x...])
vor(xmm2, xmm2, xmm3)
```

Whereas it could be:
```
vor(xmm2, xmm2, [rip+0x...])
```

Whether the cost of doing the constant de-dupe is worth it remains to be seen.
Right now the inline loads waste a lot of instruction cache space, increase
decode time, and potentially use a lot more memory bandwidth.

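A sketch of the per-function constant table with de-duplication; the types and
names are hypothetical, and the emitter would still need to place the pool
after the function body and patch the RIP-relative displacements:

```
// Constants are interned by value so repeated loads of the same vec128 share
// one pool slot, which the emitter can reference RIP-relatively.
#include <cstdint>
#include <cstring>
#include <vector>

struct vec128_t {
  uint8_t bytes[16];
  bool operator==(const vec128_t& other) const {
    return std::memcmp(bytes, other.bytes, 16) == 0;
  }
};

class ConstantPool {
 public:
  // Returns the byte offset of the constant within the pool, reusing an
  // existing slot when the same 16-byte value was already requested.
  size_t Intern(const vec128_t& value) {
    for (size_t i = 0; i < entries_.size(); ++i) {
      if (entries_[i] == value) return i * sizeof(vec128_t);
    }
    entries_.push_back(value);
    return (entries_.size() - 1) * sizeof(vec128_t);
  }

  const std::vector<vec128_t>& entries() const { return entries_; }

 private:
  std::vector<vec128_t> entries_;
};
```
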
## Optimization Improvements

### Speed Up RegisterAllocationPass

Currently the slowest pass, this could be improved by requiring less use
tracking or perhaps maintaining the use tracking in other passes. A faster
SortUsageList (radix or something fancy?) may be helpful as well.

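A sketch of what a radix-based SortUsageList could look like, under the
assumption that each use can be keyed by a 32-bit instruction ordinal;
`ValueUse` is a hypothetical stand-in for the real use-tracking data:

```
// LSD radix sort over 32-bit ordinals: O(n) and stable, versus a comparison
// sort of the use list.
#include <array>
#include <cstdint>
#include <vector>

struct ValueUse {
  uint32_t ordinal;  // Position of the using instruction in the function.
  void* instr;       // Opaque pointer to the use site.
};

void RadixSortUses(std::vector<ValueUse>& uses) {
  std::vector<ValueUse> scratch(uses.size());
  // Four passes of 8 bits each over the 32-bit ordinal key.
  for (int shift = 0; shift < 32; shift += 8) {
    std::array<size_t, 257> counts{};
    for (const auto& use : uses) {
      ++counts[((use.ordinal >> shift) & 0xFF) + 1];
    }
    for (size_t i = 1; i < counts.size(); ++i) {
      counts[i] += counts[i - 1];
    }
    for (const auto& use : uses) {
      scratch[counts[(use.ordinal >> shift) & 0xFF]++] = use;
    }
    uses.swap(scratch);
  }
}
```
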
### More Opcodes in ConstantPropagationPass

There are a few HIR opcodes with no handling, and others with minimal handling.
It'd be nice to know which paths need improvement and add them, as any work here
makes things free later on.

### Cross-Block ConstantPropagationPass

Constant propagation currently only occurs within a single block. This makes it
difficult to optimize common PPC patterns like loading the constants 0 or 1 into
a register before a loop, as well as other loads of expensive AltiVec values.

Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed to
track constant load_context/store_context's across block boundaries and
propagate the values. This is simpler than handling dynamic values, as no phi
functions or anything fancy are needed.

### Add TypePropagationPass

There are many extensions/truncations in generated code right now due to
various loads/stores of varying widths. Being able to find and short-circuit
the conversions early on would make following passes cleaner and faster, as
they'd have to trace through fewer value definitions, and there'd be fewer
extraneous movs in the final code.

Example (after ContextPromotion):
```
v82.i32 = truncate v81.i64
v84.i32 = and v82.i32, 3F
v85.i64 = zero_extend v84.i32
```

Becomes (after DCE/etc):
```
v85.i64 = and v81.i64, 3F
```

### Enhance MemorySequenceCombinationPass with Extend/Truncate

Currently this pass will look for byte_swap and merge it into loads/stores.
This allows for better final codegen at the cost of making optimization more
difficult, so it only happens at the end of the process.

There are currently TODOs in there for adding extend/truncate support, which
would extend what it does with swaps to also merge the
sign_extend/zero_extend/truncate into the matching load/store. This allows the
x64 backend to generate the proper movs that do these operations without
requiring additional steps. Note that if we had a LIR and a peephole optimizer
this would be better done there.

Load with swap and extend:
```
v1.i32 = load v0
v2.i32 = byte_swap v1.i32
v3.i64 = zero_extend v2.i32
```

Becomes:
```
v1.i64 = load_convert v0, [swap|i32->i64,zero]
```

Store with truncate and swap:
```
v1.i64 = ...
v2.i32 = truncate v1.i64
v3.i32 = byte_swap v2.i32
store v0, v3.i32
```

Becomes:
```
store_convert v0, v1.i64, [swap|i64->i32,trunc]
```

### Add DeadStoreEliminationPass

Generic DSE pass, removing all redundant stores. ContextPromotion may be
able to take care of most of these, as the input assembly is generally
pretty optimized already. This pass would mainly be looking for introduced
stores, such as those from comparisons.

Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing
edges as well as dominators, and that could be used to check whether stores into
the context are used in their destination block or are instead overwritten
(currently they almost never are).

If this pass were able to remove a good number of the stores, then the
comparisons would also be removed by dead code elimination, dramatically
reducing branch overhead.

Example:
```
<block0>:
  v0 = compare_ult ...     (later removed by DCE)
  v1 = compare_ugt ...     (later removed by DCE)
  v2 = compare_eq ...
  store_context +300, v0   <-- removed
  store_context +301, v1   <-- removed
  store_context +302, v2   <-- removed
  branch_true v1, ...
<block1>:
  v3 = compare_ult ...
  v4 = compare_ugt ...
  v5 = compare_eq ...
  store_context +300, v3   <-- these may be required if at end of function
  store_context +301, v4       or before a call
  store_context +302, v5
  branch_true v5, ...
```

### Add X64CanonicalizationPass

For various opcodes, add copies or commute the arguments to match x64
operand semantics. This makes code generation easier and, if done
before register allocation, can prevent a lot of extra shuffling in
the emitted code.

Example:
```
<block0>:
  v0 = ...
  v1 = ...
  v2 = add v0, v1   <-- v1 now unused
```

Becomes:
```
  v0 = ...
  v1 = ...
  v1 = add v1, v0   <-- src1 = dest/src, so reuse for both
                        by commuting and setting dest = src1
```

### Add MergeLocalSlotsPass

As the RegisterAllocationPass runs it generates load_local/store_local
instructions as it spills. Currently each set of locals is unique to each block,
which in very large functions can result in a lot of locals that are only used
briefly. It may be useful to use the results of the ControlFlowAnalysisPass to
track local liveness and merge the slots so they are reused when they cannot
possibly be live at the same time. This would save stack space and potentially
improve cache behavior.

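A sketch of the slot-merging idea under a simplifying assumption: each spill
local is summarized as a [start, end] live interval over instruction ordinals,
and locals with non-overlapping intervals share a slot (greedy interval
partitioning). None of these types match the existing RegisterAllocationPass
structures:

```
#include <algorithm>
#include <cstdint>
#include <vector>

struct SpillLocal {
  uint32_t start;  // First instruction ordinal where the local is live.
  uint32_t end;    // Last instruction ordinal where the local is live.
  uint32_t slot;   // Assigned stack slot index (output).
};

// Sort by start ordinal, then reuse the first slot whose previous occupant's
// interval has already ended. Returns the total number of slots needed.
uint32_t MergeLocalSlots(std::vector<SpillLocal>& locals) {
  std::sort(locals.begin(), locals.end(),
            [](const SpillLocal& a, const SpillLocal& b) {
              return a.start < b.start;
            });
  std::vector<uint32_t> slot_end;  // Last ordinal each slot is occupied until.
  for (auto& local : locals) {
    bool reused = false;
    for (uint32_t i = 0; i < slot_end.size(); ++i) {
      if (slot_end[i] < local.start) {
        local.slot = i;
        slot_end[i] = local.end;
        reused = true;
        break;
      }
    }
    if (!reused) {
      local.slot = static_cast<uint32_t>(slot_end.size());
      slot_end.push_back(local.end);
    }
  }
  return static_cast<uint32_t>(slot_end.size());
}
```
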
@@ -53,11 +53,6 @@ static const size_t MAX_CODE_SIZE = 1 * 1024 * 1024;
static const size_t STASH_OFFSET = 32;
static const size_t STASH_OFFSET_HIGH = 32 + 32;

// If we are running with tracing on we have to store the EFLAGS in the stack,
// otherwise our calls out to C to print will clear it before DID_CARRY/etc
// can get the value.
#define STORE_EFLAGS 1

const uint32_t X64Emitter::gpr_reg_map_[X64Emitter::GPR_COUNT] = {
    Operand::RBX, Operand::R12, Operand::R13, Operand::R14, Operand::R15,
};

@@ -539,25 +534,6 @@ void X64Emitter::nop(size_t length) {
  }
}

void X64Emitter::LoadEflags() {
#if STORE_EFLAGS
  mov(eax, dword[rsp + STASH_OFFSET]);
  btr(eax, 0);
#else
  // EFLAGS already present.
#endif  // STORE_EFLAGS
}

void X64Emitter::StoreEflags() {
#if STORE_EFLAGS
  pushf();
  pop(dword[rsp + STASH_OFFSET]);
#else
  // EFLAGS should have CA set?
  // (so long as we don't fuck with it)
#endif  // STORE_EFLAGS
}

bool X64Emitter::ConstantFitsIn32Reg(uint64_t v) {
  if ((v & ~0x7FFFFFFF) == 0) {
    // Fits under 31 bits, so just load using normal mov.

@@ -173,9 +173,6 @@ class X64Emitter : public Xbyak::CodeGenerator {

  // TODO(benvanik): Label for epilog (don't use strings).

  void LoadEflags();
  void StoreEflags();

  // Moves a 64bit immediate into memory.
  bool ConstantFitsIn32Reg(uint64_t v);
  void MovMem64(const Xbyak::RegExp& addr, uint64_t v);

@@ -24,157 +24,4 @@
#include "xenia/cpu/compiler/passes/validation_pass.h"
#include "xenia/cpu/compiler/passes/value_reduction_pass.h"

// TODO:
// - mark_use/mark_set
//   For now: mark_all_changed on all calls
//   For external functions:
//   - load_context/mark_use on all arguments
//   - mark_set on return argument?
//   For internal functions:
//   - if liveness analysis already done, use that
//   - otherwise, assume everything dirty (ACK!)
//     - could use scanner to insert mark_use
//
// Maybe:
// - v0.xx = load_constant <c>
// - v0.xx = load_zero
//   Would prevent NULL defs on values, and make constant de-duping possible.
//   Not sure if it's worth it, though, as the extra register allocation
//   pressure due to de-duped constants seems like it would slow things down
//   a lot.
//
// - CFG:
//   Blocks need predecessors()/successor()
//   phi Instr reference
//
// - block liveness tracking (in/out)
//   Block gets:
//     AddIncomingValue(Value* value, Block* src_block) ??

// Potentially interesting passes:
//
// Run order:
//   ContextPromotion
//   Simplification
//   ConstantPropagation
//   TypePropagation
//   ByteSwapElimination
//   Simplification
//   DeadStoreElimination
//   DeadCodeElimination
//
// - TypePropagation
//   There are many extensions/truncations in generated code right now due to
//   various load/stores of varying widths. Being able to find and short-
//   circuit the conversions early on would make following passes cleaner
//   and faster as they'd have to trace through fewer value definitions.
//   Example (after ContextPromotion):
//     v81.i64 = load_context +88
//     v82.i32 = truncate v81.i64
//     v84.i32 = and v82.i32, 3F
//     v85.i64 = zero_extend v84.i32
//     v87.i64 = load_context +248
//     v88.i64 = v85.i64
//     v89.i32 = truncate v88.i64   <-- zero_extend/truncate => v84.i32
//     v90.i32 = byte_swap v89.i32
//     store v87.i64, v90.i32
//   after type propagation / simplification / DCE:
//     v81.i64 = load_context +88
//     v82.i32 = truncate v81.i64
//     v84.i32 = and v82.i32, 3F
//     v87.i64 = load_context +248
//     v90.i32 = byte_swap v84.i32
//     store v87.i64, v90.i32
//
// - ByteSwapElimination
//   Find chained byte swaps and replace with assignments. This is often found
//   in memcpy paths.
//   Example:
//     v0 = load ...
//     v1 = byte_swap v0
//     v2 = byte_swap v1
//     store ..., v2   <-- this could be v0
//
//   It may be tricky to detect, though, as often times there are intervening
//   instructions:
//     v21.i32 = load v20.i64
//     v22.i32 = byte_swap v21.i32
//     v23.i64 = zero_extend v22.i32
//     v88.i64 = v23.i64 (from ContextPromotion)
//     v89.i32 = truncate v88.i64
//     v90.i32 = byte_swap v89.i32
//     store v87.i64, v90.i32
//   After type propagation:
//     v21.i32 = load v20.i64
//     v22.i32 = byte_swap v21.i32
//     v89.i32 = v22.i32
//     v90.i32 = byte_swap v89.i32
//     store v87.i64, v90.i32
//   This could ideally become:
//     v21.i32 = load v20.i64
//     ... (DCE takes care of this) ...
//     store v87.i64, v21.i32
//
// - DeadStoreElimination
//   Generic DSE pass, removing all redundant stores. ContextPromotion may be
//   able to take care of most of these, as the input assembly is generally
//   pretty optimized already. This pass would mainly be looking for introduced
//   stores, such as those from comparisons.
//
//   Example:
//     <block0>:
//       v0 = compare_ult ...     (later removed by DCE)
//       v1 = compare_ugt ...     (later removed by DCE)
//       v2 = compare_eq ...
//       store_context +300, v0   <-- removed
//       store_context +301, v1   <-- removed
//       store_context +302, v2   <-- removed
//       branch_true v1, ...
//     <block1>:
//       v3 = compare_ult ...
//       v4 = compare_ugt ...
//       v5 = compare_eq ...
//       store_context +300, v3   <-- these may be required if at end of function
//       store_context +301, v4       or before a call
//       store_context +302, v5
//       branch_true v5, ...
//
// - X86Canonicalization
//   For various opcodes add copies/commute the arguments to match x86
//   operand semantics. This makes code generation easier and if done
//   before register allocation can prevent a lot of extra shuffling in
//   the emitted code.
//
//   Example:
//     <block0>:
//       v0 = ...
//       v1 = ...
//       v2 = add v0, v1   <-- v1 now unused
//   Becomes:
//       v0 = ...
//       v1 = ...
//       v1 = add v1, v0   <-- src1 = dest/src, so reuse for both
//                             by commuting and setting dest = src1
//
// - RegisterAllocation
//   Given a machine description (register classes, counts) run over values
//   and assign them to registers, adding spills as needed. It should be
//   possible to directly emit code from this form.
//
//   Example:
//     <block0>:
//       v0 = load_context +0
//       v1 = load_context +1
//       v0 = add v0, v1
//       ...
//       v2 = mul v0, v1
//   Becomes:
//       reg0 = load_context +0
//       reg1 = load_context +1
//       reg2 = add reg0, reg1
//       store_local +123, reg2   <-- spill inserted
//       ...
//       reg0 = load_local +123   <-- load inserted
//       reg0 = mul reg0, reg1

#endif  // XENIA_COMPILER_COMPILER_PASSES_H_