xenia-canary/docs/cpu_todo.md

# CPU TODO

There are many improvements that can be done under `xe::cpu` to improve
debugging, performance (both to JIT and of generated code), and portability.
Some are in various states of completion, and others are just thoughts that need
more exploring.

## Debugging Improvements

### Reproducable X64 Emission

It'd be useful to be able to run a PPC function through the entire pipeline and
spit out x64 that is byte-for-byte identical across runs. This would allow
automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
will relocate the x64 when placing it in memory, which will be at a different
location each time. Instead it would be nice to have the xbyak `calcJmpAddress`
that performs the relocations use the address of our choosing.

### Sampling Profiler

Once we have stack walking it'd be nice to take something like
[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to
support our system. This would let us run continuous performance analysis and
track hotspots in JITed code without a large performance impact. Automatically
showing the top hot functions in the debugger could help track down poor
translation much faster.

### Intel Architecture Code Analyzer Support

The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
is a nifty tool that, given a kernel of x64, can detail theoretical performance
characteristics on different processors down to cycle timings and potential
bottlenecks on memory/execution units. It's designed to run on elf/obj/etc files
however it simply looks for special markers in the code. Having something that
walks the code cache and dumps a specially formatted file with the markers
around basic blocks could allow running the tool in bulk, or alternatively being
able to invoke it one-off by dumping a specific x64 block to disk and processing
it for display when looking at the code in the debugger would be useful.

I've done some early experiments with this and its possible to pass just a
bin file with the markers and the x64.

### Function Tracing/Coverage Information

`function_trace_data.h` contains the `FunctionTraceData` struct, which is
currently partially populated by the x64 backend. This enables tracking of which
threads a function is called on, function call count, recent callers of the
function, and even instruction-level counts.

This is all only partially implemented, though, and there's no tool to read it
out. This would be nice to get integrated into the debugger so that it can
overlay the information when viewing a function, but also useful in aggregate to
find hot functions/code paths or enhance callstacks by automatically annotating
thread information.

#### Block-level Counting

Currently the code assumes each instruction has a count, however this is
expensive and often unneeded as it can be done on a block level and then the
instruction counts can be derived from that. This can reduce the overhead (both
in memory and accounting time) by an order of magnitude.

### On-Stack Context Inspection

Currently the debugger only works with `--store_all_context_values`, as it can
only get the values of PPC registers when they are stored to the PPC context
after each instruction. As this can slow things down by ~10-20% it could be
useful to be able to preserve the optimized and register-allocated HIR so that
host registers holding context values can be derived on demand. Or, we could
just make `--store_all_context_values` faster.

## JIT Performance Improvements

### Reduce HIR Size

Currently there are a lot of pointers stored within `Instr`, `Value`, and
related types. These are big 8B values that eat a lot of memory and really
hurt the cache (especially with all the block/instruction walking done).
Aligning everything to 16B values in the arena and using 16bit indices
(or something) could shrink things a lot.

### Serialize Code Cache

The x64 code cache is currently set up to use fixed memory addresses and is even
represented as mapped memory. It should be fairly easy to back this with a file
and have all code written to disk. Adding more metadata, or perhaps a side-car
file, would allow for the code to be written to disk. On future runs the code
cache could load this data (by mapping the file containing the code right into
memory) and short cut JIT'ing entirely.

It would be possible to use a common container format (ELF/etc), however there's
elegance in not requiring any additional steps beyond the memory mapping. Such
containers could be useful for running static tools against, though.

## Portability Improvements

### Emulated Opcode Layer

Having a way to use emulated variants for any HIR opcode in a backend would
help when writing a new backend as well as when verifying the existing backends.
This may look like a C library with functions for each opcode/type pairing and
utilities to call out to them. Something like the x64 backend could then call
out to these with CallNativeSafe (or some faster equivalent) and something like
an interpreter backend would be fairly trivial to write.

## X64 Backend Improvements

### Implement Emulated Instructions

There are a ton of half-implemented HIR opcodes that call out to C++ to do their
work. These are extremely expensive as they incur a full guest-to-host thunk
(~hundreds of instructions!). Basically, any of the `Emulate*`/`CallNativeSafe`
functions in `x64_sequences.cc` need to be replaced with proper AVX/AVX2
variants.

### Increase Register Availability

Currently only a few x64 registers are usable (due to reservations by the
backend or ABI conflicts). Though register pressure is surprisingly light in
most cases there are pathological cases that result in a lot of spills. By
freeing up some of the registers these spills could be reduced.

### Constant Pooling

This may make sense as a compiler pass instead.

Right now, particular sequences of instructions are nasty - such as anything
using `LoadConstantXmm` to load non-zero or non-1 vec128's. Instead of doing the
super fat (20-30byte!) constant loads as they are done now it may be better to
keep a per-function constant table and instead use RIP-relative addressing (or
something) to use the memory-form AVX instructions.

For example, right now this:
```
  v82.v128 = [0,1,2,3]
  v83.v128 = or v81.v128, v82.128
```

Translates to (something like):
```
  mov([rsp+0x...], 0x00000000)
  mov([rsp+0x...+4], 0x00000001)
  mov([rsp+0x...+8], 0x00000002)
  mov([rsp+0x...+12], 0x00000003)
  vmovdqa(xmm2, [rsp+0x...])
  vor(xmm2, xmm2, xmm2)
```

Where as it could be:
```
  vor(xmm2, xmm2, [rip+0x...])
```

Whether the cost of doing the constant de-dupe is worth it remains to be seen.
Right now it's wasting a lot of instruction cache space, increasing decode time,
and potentially using a lot more memory bandwidth.

## Optimization Improvements

### Speed Up RegisterAllocationPass

Currently the slowest pass, this could be improved by requiring less use
tracking or perhaps maintaining the use tracking in other passes. A faster
SortUsageList (radix or something fancy?) may be helpful as well.

### More Opcodes in ConstantPropagationPass

There's a few HIR opcodes with no handling, and others with minimal handling.
It'd be nice to know what paths need improvement and add them, as any work here
makes things free later on.

### Cross-Block ConstantPropagationPass

Constant propagation currently only occurs within a single block. This makes it
difficult to optimize common PPC patterns like loading the constants 0 or 1 into
a register before a loop and other loads of expensive altivec values.

Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed to
track constant load_context/store_context's across block bounds and propagate
the values. This is simpler than dynamic values as no phi functions or anything
fancy needs to happen.

### Add TypePropagationPass

There are many extensions/truncations in generated code right now due to
various load/stores of varying widths. Being able to find and short-
circuit the conversions early on would make following passes cleaner
and faster as they'd have to trace through fewer value definitions and there'd
be less extraneous movs in the final code.

Example (after ContextPromotion):
```
  v82.i32 = truncate v81.i64
  v83.i32 = and v82.i32, 3F
  v85.i64 = zero_extend v84.i32
```

Becomes (after DCE/etc):
```
  v85.i64 = and v81.i64, 3F
```

### Enhance MemorySequenceCombinationPass with Extend/Truncate

Currently this pass will look for byte_swap and merge that into loads/stores.
This allows for better final codegen at the cost of making optimization more
difficult, so it only happens at the end of the process.

There's currently TODOs in there for adding extend/truncate support, which
will extend what it does with swaps to also merge the
sign_extend/zero_extend/truncate into the matching load/store. This allows for
the x64 backend to generate the proper mov's that do these operations without
requiring additional steps. Note that if we had a LIR and a peephole optimizer
this would be better done there.

Load with swap and extend:
```
  v1.i32 = load v0
  v2.i32 = byte_swap v1.i32
  v3.i64 = zero_extend v2.i32
```

Becomes:
```
  v1.i64 = load_convert v0, [swap|i32->i64,zero]
```

Store with truncate and swap:
```
  v1.i64 = ...
  v2.i32 = truncate v1.i64
  v3.i32 = byte_swap v2.i32
  store v0, v3.i32
```

Becomes:
```
  store_convert v0, v1.i64, [swap|i64->i32,trunc]
```

### Add DeadStoreEliminationPass

Generic DSE pass, removing all redundant stores. ContextPromotion may be
able to take care of most of these, as the input assembly is generally
pretty optimized already. This pass would mainly be looking for introduced
stores, such as those from comparisons.

Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing
edges as well as dominators, and that could be used to check whether stores into
the context are used in their destination block or instead overwritten
(currently they almost never are).

If this pass was able to remove a good number of the stores then the comparisons
would also be removed with dead code elimination and dramatically reduce branch
overhead.

Example:
```
<block0>:
  v0 = compare_ult ...     (later removed by DCE)
  v1 = compare_ugt ...     (later removed by DCE)
  v2 = compare_eq ...
  store_context +300, v0   <-- removed
  store_context +301, v1   <-- removed
  store_context +302, v2   <-- removed
  branch_true v1, ...
<block1>:
  v3 = compare_ult ...
  v4 = compare_ugt ...
  v5 = compare_eq ...
  store_context +300, v3   <-- these may be required if at end of function
  store_context +301, v4       or before a call
  store_context +302, v5
  branch_true v5, ...
```

### Add X64CanonicalizationPass

For various opcodes add copies/commute the arguments to match x64
operand semantics. This makes code generation easier and if done
before register allocation can prevent a lot of extra shuffling in
the emitted code.

Example:
```
<block0>:
  v0 = ...
  v1 = ...
  v2 = add v0, v1          <-- v1 now unused
```

Becomes:
```
  v0 = ...
  v1 = ...
  v1 = add v1, v0          <-- src1 = dest/src, so reuse for both
                               by commuting and setting dest = src1
```

### Add MergeLocalSlotsPass

As the RegisterAllocationPass runs it generates load_local/store_local as it
spills. Currently each set of locals is unique to each block, which in very
large functions can result in a lot of locals that are only used briefly. It
may be useful to use the results of the ControlFlowAnalysisPass to track local
liveness and merge the slots so they are reused when they cannot possibly be
live at the same time. This saves stack space and potentially improves cache
behavior.
Adding some docs on CPU optimizations/potential work. 2015-07-14 01:20:38 +00:00			`# CPU TODO`

			There are many improvements that can be done under `xe::cpu` to improve
			`debugging, performance (both to JIT and of generated code), and portability.`
			`Some are in various states of completion, and others are just thoughts that need`
			`more exploring.`

			`## Debugging Improvements`

			`### Reproducable X64 Emission`

			`It'd be useful to be able to run a PPC function through the entire pipeline and`
			`spit out x64 that is byte-for-byte identical across runs. This would allow`
			automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
			`will relocate the x64 when placing it in memory, which will be at a different`
			location each time. Instead it would be nice to have the xbyak `calcJmpAddress`
			`that performs the relocations use the address of our choosing.`

			`### Sampling Profiler`

			`Once we have stack walking it'd be nice to take something like`
			`[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to`
			`support our system. This would let us run continuous performance analysis and`
			`track hotspots in JITed code without a large performance impact. Automatically`
			`showing the top hot functions in the debugger could help track down poor`
			`translation much faster.`

			`### Intel Architecture Code Analyzer Support`

			`The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)`
			`is a nifty tool that, given a kernel of x64, can detail theoretical performance`
			`characteristics on different processors down to cycle timings and potential`
			`bottlenecks on memory/execution units. It's designed to run on elf/obj/etc files`
			`however it simply looks for special markers in the code. Having something that`
			`walks the code cache and dumps a specially formatted file with the markers`
			`around basic blocks could allow running the tool in bulk, or alternatively being`
			`able to invoke it one-off by dumping a specific x64 block to disk and processing`
			`it for display when looking at the code in the debugger would be useful.`

			`I've done some early experiments with this and its possible to pass just a`
			`bin file with the markers and the x64.`

			`### Function Tracing/Coverage Information`

			`function_trace_data.h` contains the `FunctionTraceData` struct, which is
			`currently partially populated by the x64 backend. This enables tracking of which`
			`threads a function is called on, function call count, recent callers of the`
			`function, and even instruction-level counts.`

			`This is all only partially implemented, though, and there's no tool to read it`
			`out. This would be nice to get integrated into the debugger so that it can`
			`overlay the information when viewing a function, but also useful in aggregate to`
			`find hot functions/code paths or enhance callstacks by automatically annotating`
			`thread information.`

			`#### Block-level Counting`

			`Currently the code assumes each instruction has a count, however this is`
			`expensive and often unneeded as it can be done on a block level and then the`
			`instruction counts can be derived from that. This can reduce the overhead (both`
			`in memory and accounting time) by an order of magnitude.`

			`### On-Stack Context Inspection`

			Currently the debugger only works with `--store_all_context_values`, as it can
			`only get the values of PPC registers when they are stored to the PPC context`
			`after each instruction. As this can slow things down by ~10-20% it could be`
			`useful to be able to preserve the optimized and register-allocated HIR so that`
			`host registers holding context values can be derived on demand. Or, we could`
			just make `--store_all_context_values` faster.

			`## JIT Performance Improvements`

			`### Reduce HIR Size`

			Currently there are a lot of pointers stored within `Instr`, `Value`, and
			`related types. These are big 8B values that eat a lot of memory and really`
			`hurt the cache (especially with all the block/instruction walking done).`
			`Aligning everything to 16B values in the arena and using 16bit indices`
			`(or something) could shrink things a lot.`

			`### Serialize Code Cache`

			`The x64 code cache is currently set up to use fixed memory addresses and is even`
			`represented as mapped memory. It should be fairly easy to back this with a file`
			`and have all code written to disk. Adding more metadata, or perhaps a side-car`
			`file, would allow for the code to be written to disk. On future runs the code`
			`cache could load this data (by mapping the file containing the code right into`
			`memory) and short cut JIT'ing entirely.`

			`It would be possible to use a common container format (ELF/etc), however there's`
			`elegance in not requiring any additional steps beyond the memory mapping. Such`
			`containers could be useful for running static tools against, though.`

			`## Portability Improvements`

			`### Emulated Opcode Layer`

			`Having a way to use emulated variants for any HIR opcode in a backend would`
			`help when writing a new backend as well as when verifying the existing backends.`
			`This may look like a C library with functions for each opcode/type pairing and`
			`utilities to call out to them. Something like the x64 backend could then call`
			`out to these with CallNativeSafe (or some faster equivalent) and something like`
			`an interpreter backend would be fairly trivial to write.`

			`## X64 Backend Improvements`

			`### Implement Emulated Instructions`

			`There are a ton of half-implemented HIR opcodes that call out to C++ to do their`
			`work. These are extremely expensive as they incur a full guest-to-host thunk`
			(~hundreds of instructions!). Basically, any of the `Emulate*`/`CallNativeSafe`
			functions in `x64_sequences.cc` need to be replaced with proper AVX/AVX2
			`variants.`

			`### Increase Register Availability`

			`Currently only a few x64 registers are usable (due to reservations by the`
			`backend or ABI conflicts). Though register pressure is surprisingly light in`
			`most cases there are pathological cases that result in a lot of spills. By`
			`freeing up some of the registers these spills could be reduced.`

			`### Constant Pooling`

			`This may make sense as a compiler pass instead.`

			`Right now, particular sequences of instructions are nasty - such as anything`
			using `LoadConstantXmm` to load non-zero or non-1 vec128's. Instead of doing the
			`super fat (20-30byte!) constant loads as they are done now it may be better to`
			`keep a per-function constant table and instead use RIP-relative addressing (or`
			`something) to use the memory-form AVX instructions.`

			`For example, right now this:`
			```
			`v82.v128 = [0,1,2,3]`
			`v83.v128 = or v81.v128, v82.128`
			```

			`Translates to (something like):`
			```
			`mov([rsp+0x...], 0x00000000)`
			`mov([rsp+0x...+4], 0x00000001)`
			`mov([rsp+0x...+8], 0x00000002)`
			`mov([rsp+0x...+12], 0x00000003)`
			`vmovdqa(xmm2, [rsp+0x...])`
			`vor(xmm2, xmm2, xmm2)`
			```

			`Where as it could be:`
			```
			`vor(xmm2, xmm2, [rip+0x...])`
			```

			`Whether the cost of doing the constant de-dupe is worth it remains to be seen.`
			`Right now it's wasting a lot of instruction cache space, increasing decode time,`
			`and potentially using a lot more memory bandwidth.`

			`## Optimization Improvements`

			`### Speed Up RegisterAllocationPass`

			`Currently the slowest pass, this could be improved by requiring less use`
			`tracking or perhaps maintaining the use tracking in other passes. A faster`
			`SortUsageList (radix or something fancy?) may be helpful as well.`

			`### More Opcodes in ConstantPropagationPass`

			`There's a few HIR opcodes with no handling, and others with minimal handling.`
			`It'd be nice to know what paths need improvement and add them, as any work here`
			`makes things free later on.`

			`### Cross-Block ConstantPropagationPass`

			`Constant propagation currently only occurs within a single block. This makes it`
			`difficult to optimize common PPC patterns like loading the constants 0 or 1 into`
			`a register before a loop and other loads of expensive altivec values.`

			`Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed to`
			`track constant load_context/store_context's across block bounds and propagate`
			`the values. This is simpler than dynamic values as no phi functions or anything`
			`fancy needs to happen.`

			`### Add TypePropagationPass`

			`There are many extensions/truncations in generated code right now due to`
			`various load/stores of varying widths. Being able to find and short-`
			`circuit the conversions early on would make following passes cleaner`
			`and faster as they'd have to trace through fewer value definitions and there'd`
			`be less extraneous movs in the final code.`

			`Example (after ContextPromotion):`
			```
			`v82.i32 = truncate v81.i64`
			`v83.i32 = and v82.i32, 3F`
			`v85.i64 = zero_extend v84.i32`
			```

			`Becomes (after DCE/etc):`
			```
			`v85.i64 = and v81.i64, 3F`
			```

			`### Enhance MemorySequenceCombinationPass with Extend/Truncate`

			`Currently this pass will look for byte_swap and merge that into loads/stores.`
			`This allows for better final codegen at the cost of making optimization more`
			`difficult, so it only happens at the end of the process.`

			`There's currently TODOs in there for adding extend/truncate support, which`
			`will extend what it does with swaps to also merge the`
			`sign_extend/zero_extend/truncate into the matching load/store. This allows for`
			`the x64 backend to generate the proper mov's that do these operations without`
			`requiring additional steps. Note that if we had a LIR and a peephole optimizer`
			`this would be better done there.`

			`Load with swap and extend:`
			```
			`v1.i32 = load v0`
			`v2.i32 = byte_swap v1.i32`
			`v3.i64 = zero_extend v2.i32`
			```

			`Becomes:`
			```
			`v1.i64 = load_convert v0, [swap\|i32->i64,zero]`
			```

			`Store with truncate and swap:`
			```
			`v1.i64 = ...`
			`v2.i32 = truncate v1.i64`
			`v3.i32 = byte_swap v2.i32`
			`store v0, v3.i32`
			```

			`Becomes:`
			```
			`store_convert v0, v1.i64, [swap\|i64->i32,trunc]`
			```

			`### Add DeadStoreEliminationPass`

			`Generic DSE pass, removing all redundant stores. ContextPromotion may be`
			`able to take care of most of these, as the input assembly is generally`
			`pretty optimized already. This pass would mainly be looking for introduced`
			`stores, such as those from comparisons.`

			`Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing`
			`edges as well as dominators, and that could be used to check whether stores into`
			`the context are used in their destination block or instead overwritten`
			`(currently they almost never are).`

			`If this pass was able to remove a good number of the stores then the comparisons`
			`would also be removed with dead code elimination and dramatically reduce branch`
			`overhead.`

			`Example:`
			```
			`<block0>:`
			`v0 = compare_ult ... (later removed by DCE)`
			`v1 = compare_ugt ... (later removed by DCE)`
			`v2 = compare_eq ...`
			`store_context +300, v0 <-- removed`
			`store_context +301, v1 <-- removed`
			`store_context +302, v2 <-- removed`
			`branch_true v1, ...`
			`<block1>:`
			`v3 = compare_ult ...`
			`v4 = compare_ugt ...`
			`v5 = compare_eq ...`
			`store_context +300, v3 <-- these may be required if at end of function`
			`store_context +301, v4 or before a call`
			`store_context +302, v5`
			`branch_true v5, ...`
			```

			`### Add X64CanonicalizationPass`

			`For various opcodes add copies/commute the arguments to match x64`
			`operand semantics. This makes code generation easier and if done`
			`before register allocation can prevent a lot of extra shuffling in`
			`the emitted code.`

			`Example:`
			```
			`<block0>:`
			`v0 = ...`
			`v1 = ...`
			`v2 = add v0, v1 <-- v1 now unused`
			```

			`Becomes:`
			```
			`v0 = ...`
			`v1 = ...`
			`v1 = add v1, v0 <-- src1 = dest/src, so reuse for both`
			`by commuting and setting dest = src1`
			```

			`### Add MergeLocalSlotsPass`

			`As the RegisterAllocationPass runs it generates load_local/store_local as it`
			`spills. Currently each set of locals is unique to each block, which in very`
			`large functions can result in a lot of locals that are only used briefly. It`
			`may be useful to use the results of the ControlFlowAnalysisPass to track local`
			`liveness and merge the slots so they are reused when they cannot possibly be`
			`live at the same time. This saves stack space and potentially improves cache`
			`behavior.`