# CPU TODO

There are many improvements that can be made under `xe::cpu` to improve
debugging, performance (both of the JIT and of the generated code), and
portability. Some are in various states of completion, and others are just
thoughts that need more exploring.

## Debugging Improvements

### Reproducible X64 Emission

It'd be useful to be able to run a PPC function through the entire pipeline and
spit out x64 that is byte-for-byte identical across runs. This would allow
automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
will relocate the x64 when placing it in memory, which will be at a different
location each time. Instead it would be nice to have the xbyak `calcJmpAddress`
that performs the relocations use an address of our choosing.
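
As a rough sketch of the harness this would enable, assuming a hypothetical
`EmitRelocatedAt` hook (no such entry point exists today) that runs the
pipeline and relocates against a caller-chosen base:
```
#include <cstdint>
#include <vector>

// Hypothetical hook: run the full PPC->HIR->x64 pipeline for the function at
// ppc_address and relocate the result as if it were placed at fixed_base,
// without touching the real code cache.
std::vector<uint8_t> EmitRelocatedAt(uint32_t ppc_address, uint64_t fixed_base);

// With relocation pinned to a constant base, two runs should be byte-for-byte
// identical; any diff points at nondeterminism in the JIT.
bool IsEmissionReproducible(uint32_t ppc_address) {
  constexpr uint64_t kFixedBase = 0x0000000100000000ull;  // arbitrary, fixed
  return EmitRelocatedAt(ppc_address, kFixedBase) ==
         EmitRelocatedAt(ppc_address, kFixedBase);
}
```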

### Stack Walking

Currently the Windows/VC++ dbghelp stack walking is relied on; however, it is
not portable, is slow, and cannot resolve JIT'ed symbols properly. Having our
own stack walking code that could fall back to dbghelp (via some pluggable
system) for host symbols would let us quickly get stacks through host and guest
code and make things like sampling profilers, kernel callstack tracing, and
other features possible.
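
A minimal sketch of what the pluggable system might look like, with entirely
illustrative names; a dbghelp-backed resolver would be registered on Windows
and a JIT-aware one layered on top:
```
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative interface only; nothing like this exists in the tree yet.
class StackWalker {
 public:
  virtual ~StackWalker() = default;

  // Captures up to frame_count return addresses for the given thread,
  // walking through both host frames and JIT'ed guest frames.
  virtual size_t CaptureStackTrace(void* thread_handle, uint64_t* frame_ptrs,
                                   size_t frame_count) = 0;

  // Resolves an address to a symbol name. Guest frames consult the JIT's
  // function map; host frames fall back to the platform resolver
  // (e.g. dbghelp on Windows).
  virtual bool ResolveSymbol(uint64_t address, std::string* out_name) = 0;
};
```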

### Sampling Profiler

Once we have stack walking it'd be nice to take something like
[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to
support our system. This would let us run continuous performance analysis and
track hotspots in JIT'ed code without a large performance impact. Automatically
showing the top hot functions in the debugger could help track down poor
translation much faster.

### Intel Architecture Code Analyzer Support

The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
is a nifty tool that, given a kernel of x64, can detail theoretical performance
characteristics on different processors down to cycle timings and potential
bottlenecks on memory/execution units. It's designed to run on elf/obj/etc.
files; however, it simply looks for special markers in the code. Having
something that walks the code cache and dumps a specially formatted file with
the markers around basic blocks would allow running the tool in bulk.
Alternatively, it would be useful to invoke it one-off by dumping a specific
x64 block to disk and processing it for display when looking at the code in
the debugger.

I've done some early experiments with this and it's possible to pass just a
bin file with the markers and the x64.
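
A sketch of dumping such a bin file. The start/end marker bytes follow Intel's
published `iacaMarks.h` (`mov ebx, 111/222` followed by the `64 67 90` nop);
the function around them is hypothetical plumbing:
```
#include <cstddef>
#include <cstdint>
#include <cstdio>

// IACA markers: mov ebx, imm32 (111 = start, 222 = end) followed by the
// fs/addr32-prefixed nop that the tool scans for.
static const uint8_t kIacaStart[] = {0xBB, 0x6F, 0x00, 0x00, 0x00,
                                     0x64, 0x67, 0x90};
static const uint8_t kIacaEnd[] = {0xBB, 0xDE, 0x00, 0x00, 0x00,
                                   0x64, 0x67, 0x90};

// Writes a .bin IACA can consume: start marker, the raw x64 bytes of a basic
// block pulled from the code cache, end marker.
bool DumpBlockForIaca(const char* path, const uint8_t* code, size_t length) {
  FILE* file = std::fopen(path, "wb");
  if (!file) return false;
  std::fwrite(kIacaStart, 1, sizeof(kIacaStart), file);
  std::fwrite(code, 1, length, file);
  std::fwrite(kIacaEnd, 1, sizeof(kIacaEnd), file);
  std::fclose(file);
  return true;
}
```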

### Function Tracing/Coverage Information

`function_trace_data.h` contains the `FunctionTraceData` struct, which is
currently partially populated by the x64 backend. This enables tracking of
which threads a function is called on, function call count, recent callers of
the function, and even instruction-level counts.

This is all only partially implemented, though, and there's no tool to read it
out. It would be nice to get this integrated into the debugger so that it can
overlay the information when viewing a function, but it would also be useful in
aggregate to find hot functions/code paths or enhance callstacks by
automatically annotating thread information.

#### Block-level Counting

Currently the code assumes each instruction has a count; however, this is
expensive and often unneeded, as counting can be done at the block level and
the instruction counts derived from that. This can reduce the overhead (both
in memory and accounting time) by an order of magnitude.
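
The gist, as an illustrative sketch: every instruction in a block runs exactly
as often as the block is entered, so per-instruction counts become a lookup
rather than runtime accounting:
```
#include <algorithm>
#include <cstdint>
#include <vector>

// One counter per basic block instead of one per guest instruction.
struct BlockCounts {
  std::vector<uint32_t> first_instr_ordinal;  // sorted; block -> first instr
  std::vector<uint64_t> hit_count;            // block -> times entered
};

// Derives an instruction's count from its enclosing block's count. Assumes
// instr_ordinal falls within the function's blocks.
uint64_t InstructionCount(const BlockCounts& counts, uint32_t instr_ordinal) {
  // Last block whose first instruction ordinal is <= instr_ordinal.
  auto it = std::upper_bound(counts.first_instr_ordinal.begin(),
                             counts.first_instr_ordinal.end(), instr_ordinal);
  size_t block = (it - counts.first_instr_ordinal.begin()) - 1;
  return counts.hit_count[block];
}
```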

### On-Stack Context Inspection

Currently the debugger only works with `--store_all_context_values`, as it can
only get the values of PPC registers when they are stored to the PPC context
after each instruction. As this can slow things down by ~10-20%, it could be
useful to preserve the optimized and register-allocated HIR so that host
registers holding context values can be derived on demand. Or, we could just
make `--store_all_context_values` faster.
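
If the register-allocated HIR were preserved, inspection could reduce to a
range lookup along these lines (all types here are hypothetical):
```
#include <cstdint>
#include <map>

// Hypothetical: for a given x64 code offset, which host register (if any)
// currently holds the context value at a given context offset.
struct LiveRange {
  uint32_t code_begin;  // x64 offsets where this mapping is valid
  uint32_t code_end;
  uint8_t host_reg;     // backend register id
};

using ContextRegMap = std::multimap<uint32_t /*context_offset*/, LiveRange>;

// Returns true and the host register if the PPC register stored at
// context_offset is live in a host register at the paused code offset.
bool LookupHostReg(const ContextRegMap& map, uint32_t context_offset,
                   uint32_t code_offset, uint8_t* out_reg) {
  auto range = map.equal_range(context_offset);
  for (auto it = range.first; it != range.second; ++it) {
    if (code_offset >= it->second.code_begin &&
        code_offset < it->second.code_end) {
      *out_reg = it->second.host_reg;
      return true;
    }
  }
  return false;  // fall back to the value stored in the context itself
}
```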

## JIT Performance Improvements

### Reduce HIR Size

Currently there are a lot of pointers stored within `Instr`, `Value`, and
related types. These are big 8B values that eat a lot of memory and really
hurt the cache (especially with all the block/instruction walking done).
Aligning everything to 16B values in the arena and using 16-bit indices
(or something) could shrink things a lot.
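
A hedged sketch of the packed layout, assuming 16 bits of index space per
arena is sufficient (field names are illustrative, not the current `Instr`
layout):
```
#include <cstddef>
#include <cstdint>

// Refer to arena slots by 16-bit index instead of 8-byte pointer. With
// 16B-aligned slots this addresses up to 64Ki slots (1 MiB) per arena.
using ValueRef = uint16_t;
constexpr ValueRef kNoValue = 0xFFFF;

struct PackedInstr {
  uint16_t opcode;
  uint16_t flags;
  ValueRef dest;
  ValueRef src1;
  ValueRef src2;
  ValueRef src3;
  ValueRef next;  // intrusive list link, also an index
  uint16_t pad;
};
static_assert(sizeof(PackedInstr) == 16, "keep instrs cache friendly");

// Resolution is one add from the arena base rather than chasing pointers
// through scattered allocations.
inline PackedInstr* Resolve(void* arena_base, ValueRef ref) {
  return reinterpret_cast<PackedInstr*>(static_cast<uint8_t*>(arena_base) +
                                        size_t(ref) * 16);
}
```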

### Serialize Code Cache

The x64 code cache is currently set up to use fixed memory addresses and is
even represented as mapped memory. It should be fairly easy to back this with
a file and have all code written to disk. Adding more metadata, or perhaps a
side-car file, would make that dumped code reloadable. On future runs the code
cache could load this data (by mapping the file containing the code right into
memory) and shortcut JIT'ing entirely.
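
Since the cache already lives at fixed addresses, reloading could be little
more than a fixed-address file mapping. A POSIX-flavored sketch (Windows would
use `CreateFileMapping`/`MapViewOfFileEx`); error handling trimmed:
```
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps a previously serialized code cache back at the exact address it was
// generated for, so all embedded jump targets remain valid without fixups.
void* MapSerializedCodeCache(const char* path, void* fixed_base) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  fstat(fd, &st);
  // MAP_FIXED requires fixed_base to be the cache's reserved region.
  void* mapped = mmap(fixed_base, st.st_size, PROT_READ | PROT_EXEC,
                      MAP_PRIVATE | MAP_FIXED, fd, 0);
  close(fd);
  return mapped == MAP_FAILED ? nullptr : mapped;
}
```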

It would be possible to use a common container format (ELF/etc.); however,
there's elegance in not requiring any additional steps beyond the memory
mapping. Such containers could be useful for running static tools against,
though.

## Portability Improvements

### Emulated Opcode Layer

Having a way to use emulated variants for any HIR opcode in a backend would
help when writing a new backend as well as when verifying the existing
backends. This may look like a C library with functions for each opcode/type
pairing and utilities to call out to them. Something like the x64 backend
could then call out to these with `CallNativeSafe` (or some faster equivalent)
and something like an interpreter backend would be fairly trivial to write.
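
Two example opcode/type pairings such a library might expose; the naming
scheme is invented here:
```
#include <cstdint>

extern "C" {

// One C-ABI function per opcode/type pairing so any backend (or an
// interpreter) can call out to it. Names/signatures are illustrative.
uint32_t EmulateOpcode_CNTLZ_I32(uint32_t value) {
  if (!value) return 32;
  uint32_t count = 0;
  while (!(value & 0x80000000u)) {
    value <<= 1;
    ++count;
  }
  return count;
}

uint32_t EmulateOpcode_ROTATE_LEFT_I32(uint32_t value, uint8_t amount) {
  amount &= 31;
  // Mask the right-shift count so amount == 0 doesn't shift by 32 (UB).
  return (value << amount) | (value >> ((32 - amount) & 31));
}

}  // extern "C"
```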

## X64 Backend Improvements

### Implement Emulated Instructions

There are a ton of half-implemented HIR opcodes that call out to C++ to do
their work. These are extremely expensive as they incur a full guest-to-host
thunk (~hundreds of instructions!). Basically, any of the
`Emulate*`/`CallNativeSafe` functions in `x64_sequences.cc` need to be
replaced with proper AVX/AVX2 variants.
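
For instance, an emulated packed add could collapse to a single AVX
instruction. An illustrative xbyak fragment, not the actual sequence-table
code:
```
#include "xbyak/xbyak.h"

// Illustrative only: emits vpaddd dst, src1, src2 instead of thunking out to
// a C++ emulation helper.
struct ExampleEmitter : Xbyak::CodeGenerator {
  ExampleEmitter() {
    // dst = src1 + src2 as four packed 32-bit lanes, all in registers --
    // no guest-to-host transition at all.
    vpaddd(xmm0, xmm1, xmm2);
    ret();
  }
};
```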

### Increase Register Availability

Currently only a few x64 registers are usable (due to reservations by the
backend or ABI conflicts). Though register pressure is surprisingly light in
most cases, there are pathological cases that result in a lot of spills. By
freeing up some of the registers these spills could be reduced.

### Constant Pooling

This may make sense as a compiler pass instead.

Right now, particular sequences of instructions are nasty, such as anything
using `LoadConstantXmm` to load vec128 constants other than all-zeros or
all-ones. Instead of doing the super fat (20-30 byte!) constant loads as they
are done now, it may be better to keep a per-function constant table and use
RIP-relative addressing (or something) with the memory-form AVX instructions.

For example, right now this:
```
v82.v128 = [0,1,2,3]
v83.v128 = or v81.v128, v82.v128
```

Translates to (something like):
```
mov([rsp+0x...], 0x00000000)
mov([rsp+0x...+4], 0x00000001)
mov([rsp+0x...+8], 0x00000002)
mov([rsp+0x...+12], 0x00000003)
vmovdqa(xmm3, [rsp+0x...])
vor(xmm2, xmm2, xmm3)
```

Whereas it could be:
```
vor(xmm2, xmm2, [rip+0x...])
```

Whether the cost of doing the constant de-dupe is worth it remains to be seen.
Right now it's wasting a lot of instruction cache space, increasing decode
time, and potentially using a lot more memory bandwidth.
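
A hedged sketch of the de-dupe bookkeeping; the emitter would append `data()`
after the function body and patch RIP-relative displacements to each interned
offset (all names are illustrative):
```
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct vec128 {
  uint32_t u32[4];
};

// Per-function constant pool: identical vec128 constants share one 16-byte
// slot that the emitter references with a RIP-relative memory operand.
class ConstantPool {
 public:
  // Returns the pool offset for the constant, de-duping repeated values.
  size_t Intern(const vec128& value) {
    std::array<uint32_t, 4> key = {value.u32[0], value.u32[1], value.u32[2],
                                   value.u32[3]};
    auto it = offsets_.find(key);
    if (it != offsets_.end()) return it->second;
    size_t offset = data_.size();
    data_.insert(data_.end(), reinterpret_cast<const uint8_t*>(&value),
                 reinterpret_cast<const uint8_t*>(&value) + sizeof(value));
    offsets_.emplace(key, offset);
    return offset;
  }

  const std::vector<uint8_t>& data() const { return data_; }

 private:
  std::map<std::array<uint32_t, 4>, size_t> offsets_;
  std::vector<uint8_t> data_;  // 16B-aligned when emitted after the function
};
```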

## Optimization Improvements

### Speed Up RegisterAllocationPass

Currently the slowest pass, this could be improved by requiring less use
tracking or perhaps maintaining the use tracking in other passes. A faster
SortUsageList (radix or something fancy?) may be helpful as well.
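
If uses are keyed by instruction ordinal, a counting sort over the ordinal
range is one plausible "something fancy" (a sketch only; the `Use` type is
made up):
```
#include <cstddef>
#include <cstdint>
#include <vector>

struct Use {
  uint32_t instr_ordinal;  // position of the using instruction
  // ... whatever else the allocator tracks per use
};

// Counting sort by ordinal: O(n + range) and stable, which tends to beat a
// comparison sort when uses are dense within a function.
void SortUsesByOrdinal(std::vector<Use>& uses, uint32_t min_ordinal,
                       uint32_t max_ordinal) {
  std::vector<uint32_t> counts(max_ordinal - min_ordinal + 2, 0);
  for (const Use& use : uses) ++counts[use.instr_ordinal - min_ordinal + 1];
  for (size_t i = 1; i < counts.size(); ++i) counts[i] += counts[i - 1];
  std::vector<Use> sorted(uses.size());
  for (const Use& use : uses) {
    sorted[counts[use.instr_ordinal - min_ordinal]++] = use;
  }
  uses.swap(sorted);
}
```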

### More Opcodes in ConstantPropagationPass

There are a few HIR opcodes with no handling, and others with minimal
handling. It'd be nice to know which paths need improvement and add them, as
any work here makes things free later on.

### Cross-Block ConstantPropagationPass

Constant propagation currently only occurs within a single block. This makes
it difficult to optimize common PPC patterns like loading the constants 0 or 1
into a register before a loop, and other loads of expensive altivec values.

Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed
to track constant load_context/store_context's across block bounds and
propagate the values. This is simpler than dynamic values as no phi functions
or anything fancy needs to happen.
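
Conceptually the pass is a meet over predecessors: a context slot stays
constant at block entry only if every predecessor exits with the same constant
in it. An illustrative sketch:
```
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Per-block map: context offset -> known constant value at block exit.
using ConstSlots = std::map<uint32_t, uint64_t>;

// No phis needed: any disagreement between predecessors just drops the slot.
ConstSlots MeetPredecessors(const std::vector<const ConstSlots*>& preds) {
  ConstSlots result;
  if (preds.empty()) return result;
  for (const auto& entry : *preds[0]) {
    bool all_agree = true;
    for (size_t i = 1; i < preds.size(); ++i) {
      auto it = preds[i]->find(entry.first);
      if (it == preds[i]->end() || it->second != entry.second) {
        all_agree = false;
        break;
      }
    }
    if (all_agree) result.insert(entry);
  }
  return result;
}
```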

### Add TypePropagationPass

There are many extensions/truncations in generated code right now due to
various loads/stores of varying widths. Being able to find and short-circuit
the conversions early on would make following passes cleaner and faster, as
they'd have to trace through fewer value definitions and there'd be fewer
extraneous movs in the final code.

Example (after ContextPromotion):
```
v82.i32 = truncate v81.i64
v83.i32 = and v82.i32, 3F
v85.i64 = zero_extend v83.i32
```

Becomes (after DCE/etc):
```
v85.i64 = and v81.i64, 3F
```

### Enhance MemorySequenceCombinationPass with Extend/Truncate

Currently this pass will look for byte_swap and merge it into loads/stores.
This allows for better final codegen at the cost of making optimization more
difficult, so it only happens at the end of the process.

There are currently TODOs in there for adding extend/truncate support, which
will extend what it does with swaps to also merge
sign_extend/zero_extend/truncate into the matching load/store. This allows the
x64 backend to generate the proper movs that do these operations without
requiring additional steps. Note that if we had a LIR and a peephole optimizer
this would be better done there.

Load with swap and extend:
```
v1.i32 = load v0
v2.i32 = byte_swap v1.i32
v3.i64 = zero_extend v2.i32
```

Becomes:
```
v1.i64 = load_convert v0, [swap|i32->i64,zero]
```

Store with truncate and swap:
```
v1.i64 = ...
v2.i32 = truncate v1.i64
v3.i32 = byte_swap v2.i32
store v0, v3.i32
```

Becomes:
```
store_convert v0, v1.i64, [swap|i64->i32,trunc]
```
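
On the x64 side these merged forms can map onto single instructions; for
example, load+swap+zero-extend is just `movbe` into a 32-bit register, which
implicitly zero-extends the full 64-bit register. An illustrative xbyak
fragment, assuming the CPU reports MOVBE support:
```
#include "xbyak/xbyak.h"

// Illustrative only: the whole load_convert collapses to one instruction,
// with no separate bswap or movzx step.
struct LoadConvertExample : Xbyak::CodeGenerator {
  LoadConvertExample() {
    movbe(eax, ptr[rdx]);  // byte-swapping load; upper rax bits cleared
    ret();
  }
};
```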

### Add DeadStoreEliminationPass

Generic DSE pass, removing all redundant stores. ContextPromotion may be able
to take care of most of these, as the input assembly is generally pretty
optimized already. This pass would mainly be looking for introduced stores,
such as those from comparisons.

Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing
edges as well as dominators, and that could be used to check whether stores
into the context are used in their destination block or instead overwritten
(currently they almost never are).

If this pass was able to remove a good number of the stores then the
comparisons would also be removed by dead code elimination, dramatically
reducing branch overhead.

Example:
```
<block0>:
  v0 = compare_ult ...    (later removed by DCE)
  v1 = compare_ugt ...    (later removed by DCE)
  v2 = compare_eq ...
  store_context +300, v0  <-- removed
  store_context +301, v1  <-- removed
  store_context +302, v2  <-- removed
  branch_true v2, ...
<block1>:
  v3 = compare_ult ...
  v4 = compare_ugt ...
  v5 = compare_eq ...
  store_context +300, v3  <-- these may be required if at the end of
  store_context +301, v4      the function or before a call
  store_context +302, v5
  branch_true v5, ...
```

### Add X64CanonicalizationPass

For various opcodes, add copies or commute the arguments to match x64 operand
semantics. This makes code generation easier and, if done before register
allocation, can prevent a lot of extra shuffling in the emitted code.

Example:
```
<block0>:
  v0 = ...
  v1 = ...
  v2 = add v0, v1 <-- v1 not used after this
```

Becomes:
```
v0 = ...
v1 = ...
v1 = add v1, v0 <-- src1 = dest/src, so reuse v1 for both by
                    commuting and setting dest = src1
```

### Add MergeLocalSlotsPass

As the RegisterAllocationPass runs it generates load_local/store_local as it
spills. Currently each set of locals is unique to each block, which in very
large functions can result in a lot of locals that are only used briefly. It
may be useful to use the results of the ControlFlowAnalysisPass to track local
liveness and merge the slots so they are reused when they cannot possibly be
live at the same time. This saves stack space and potentially improves cache
behavior.
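
The merge itself would be classic interval packing: sort by first use and
greedily reuse any slot whose live range has ended. An illustrative sketch
with a made-up `Slot` type:
```
#include <algorithm>
#include <cstdint>
#include <vector>

struct Slot {
  uint32_t first_use;  // ordinal of first load_local/store_local
  uint32_t last_use;   // ordinal of last use
  uint32_t assigned;   // merged slot index, filled in below
};

// Greedy interval packing: two locals may share a stack slot iff their live
// ranges don't overlap. Returns the number of merged slots needed.
uint32_t MergeLocalSlots(std::vector<Slot>& slots) {
  std::sort(slots.begin(), slots.end(), [](const Slot& a, const Slot& b) {
    return a.first_use < b.first_use;
  });
  std::vector<uint32_t> slot_free_at;  // merged slot -> ordinal it frees up
  for (Slot& slot : slots) {
    bool reused = false;
    for (uint32_t i = 0; i < slot_free_at.size(); ++i) {
      if (slot_free_at[i] < slot.first_use) {
        slot.assigned = i;
        slot_free_at[i] = slot.last_use;
        reused = true;
        break;
      }
    }
    if (!reused) {
      slot.assigned = static_cast<uint32_t>(slot_free_at.size());
      slot_free_at.push_back(slot.last_use);
    }
  }
  return static_cast<uint32_t>(slot_free_at.size());
}
```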