Adding some docs on CPU optimizations/potential work.

Ben Vanik 2015-07-13 18:20:38 -07:00
parent c6ebcd508d
commit 31dab70a3a
4 changed files with 317 additions and 180 deletions

docs/cpu_todo.md Normal file

@@ -0,0 +1,317 @@
# CPU TODO
There are many improvements that can be done under `xe::cpu` to improve
debugging, performance (both of the JIT itself and of the generated code), and
portability. Some are in various states of completion, and others are just
thoughts that need more exploring.
## Debugging Improvements
### Reproducible X64 Emission
It'd be useful to be able to run a PPC function through the entire pipeline and
spit out x64 that is byte-for-byte identical across runs. This would allow
automated verification, bulk analysis, etc. Currently `X64Emitter::Emplace`
relocates the x64 when placing it in memory, and that location differs from run
to run. Instead it would be nice to have xbyak's `calcJmpAddress`, which
performs the relocations, use an address of our choosing.
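A rough sketch of one way to get there, reusing the existing trick of
temporarily repointing xbyak's protected `top_` pointer before `ready()` runs
`calcJmpAddress`; the `EmplaceDeterministic` name, the canonical base address,
and the scratch allocation are all made up for illustration:
```
// Dump-only sketch (used instead of Emplace() in a verification mode):
// resolve relocations against a fixed, pre-reserved address so the emitted
// bytes never depend on where the code cache placed the function this run.
// Assumes xbyak's protected top_/size_ members, that ready() performs
// calcJmpAddress(), and Windows (<windows.h>, <cstring>, <vector>).
std::vector<uint8_t> X64Emitter::EmplaceDeterministic() {
  static uint8_t* const kCanonicalBase =
      reinterpret_cast<uint8_t*>(0x120000000ull);  // hypothetical fixed base
  static uint8_t* scratch = reinterpret_cast<uint8_t*>(
      VirtualAlloc(kCanonicalBase, 16 * 1024 * 1024,
                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
  std::memcpy(scratch, top_, size_);  // copy unrelocated code to the fixed base
  uint8_t* old_address = top_;
  top_ = scratch;                     // calcJmpAddress patches against this
  ready();
  std::vector<uint8_t> bytes(scratch, scratch + size_);
  top_ = old_address;
  reset();
  return bytes;                       // byte-for-byte stable across runs
}
```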
### Stack Walking
Currently we rely on the Windows/VC++ dbghelp stack walking; however, it is not
portable, is slow, and cannot resolve JIT'ed symbols properly. Having our
own stack walking code that could fall back to dbghelp (via some pluggable
system) for host symbols would let us quickly get stacks through host and guest
code and make things like sampling profilers, kernel callstack tracing, and
other features possible.
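A sketch of what the pluggable piece could look like (names here are
illustrative, not existing `xe::cpu` API):
```
// Hypothetical interface: the walker captures raw return addresses itself and
// resolves them either against the JIT code cache (guest functions) or via a
// pluggable host resolver (dbghelp on Windows, a no-op elsewhere).
#include <cstddef>
#include <cstdint>
#include <string>

struct StackFrame {
  uint64_t host_pc;         // Raw host return address.
  uint64_t guest_pc;        // PPC address if host_pc is inside JIT'ed code.
  std::string symbol_name;  // Filled in by whichever resolver claims the PC.
};

class HostSymbolResolver {
 public:
  virtual ~HostSymbolResolver() = default;
  // Returns false if the address is not a host symbol (e.g. it is JIT'ed).
  virtual bool Resolve(uint64_t host_pc, std::string* out_name) = 0;
};

class StackWalker {
 public:
  explicit StackWalker(HostSymbolResolver* host_resolver)
      : host_resolver_(host_resolver) {}
  virtual ~StackWalker() = default;
  // Walks the given thread's stack; JIT'ed frames map back to guest functions
  // via the code cache, everything else goes to the host resolver.
  virtual size_t CaptureStackTrace(void* thread_handle, StackFrame* frames,
                                   size_t frame_count) = 0;

 protected:
  HostSymbolResolver* host_resolver_;
};
```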
### Sampling Profiler
Once we have stack walking, it'd be nice to take something like
[micro-profiler](https://code.google.com/p/micro-profiler/) and augment it to
support our system. This would let us run continuous performance analysis and
track hotspots in JIT'ed code without a large performance impact. Automatically
showing the top hot functions in the debugger could help track down poor
translation much faster.
### Intel Architecture Code Analyzer Support
The [Intel ACA](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
is a nifty tool that, given a kernel of x64, can detail theoretical performance
characteristics on different processors down to cycle timings and potential
bottlenecks on memory/execution units. It's designed to run on elf/obj/etc.
files; however, it simply looks for special markers in the code. Something that
walks the code cache and dumps a specially formatted file with the markers
around basic blocks would allow running the tool in bulk. Alternatively, the
tool could be invoked one-off by dumping a specific x64 block to disk and
processing it for display when looking at the code in the debugger.
I've done some early experiments with this and it's possible to pass just a
.bin file with the markers and the x64.
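For reference, a minimal sketch of producing such a .bin, wrapping a dumped x64
block in the standard IACA start/end marker byte sequences (the path handling
is arbitrary):
```
// Sketch: wrap a block of JIT'ed x64 with the IACA start/end markers and write
// it as a raw .bin that the analyzer can process.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

void WriteIacaBin(const char* path, const uint8_t* code, size_t code_size) {
  // mov ebx, 111 ; fs addr32 nop  -> "analysis starts here"
  static const uint8_t kStartMarker[] = {0xBB, 0x6F, 0x00, 0x00, 0x00,
                                         0x64, 0x67, 0x90};
  // mov ebx, 222 ; fs addr32 nop  -> "analysis ends here"
  static const uint8_t kEndMarker[] = {0xBB, 0xDE, 0x00, 0x00, 0x00,
                                       0x64, 0x67, 0x90};
  std::vector<uint8_t> buffer;
  buffer.insert(buffer.end(), kStartMarker, kStartMarker + sizeof(kStartMarker));
  buffer.insert(buffer.end(), code, code + code_size);
  buffer.insert(buffer.end(), kEndMarker, kEndMarker + sizeof(kEndMarker));
  if (FILE* file = std::fopen(path, "wb")) {
    std::fwrite(buffer.data(), 1, buffer.size(), file);
    std::fclose(file);
  }
}
```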
### Function Tracing/Coverage Information
`function_trace_data.h` contains the `FunctionTraceData` struct, which is
currently partially populated by the x64 backend. This enables tracking of which
threads a function is called on, function call count, recent callers of the
function, and even instruction-level counts.
This is all only partially implemented, though, and there's no tool to read it
out. This would be nice to get integrated into the debugger so that it can
overlay the information when viewing a function, but also useful in aggregate to
find hot functions/code paths or enhance callstacks by automatically annotating
thread information.
#### Block-level Counting
Currently the code assumes each instruction has a count; however, this is
expensive and often unneeded, as counting can be done at the block level and
the per-instruction counts derived from that. This can reduce the overhead
(both in memory and accounting time) by an order of magnitude.
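A sketch of the derivation, with an illustrative block-level record rather than
the real `FunctionTraceData` layout:
```
// Sketch: JIT'ed code bumps one counter per basic block on entry, and
// per-instruction counts are derived offline, since every instruction in a
// block executes exactly as many times as the block itself.
#include <cstdint>
#include <utility>
#include <vector>

struct BlockTraceData {
  uint32_t start_address;     // Guest address of the first instruction.
  uint32_t instruction_count; // Number of PPC instructions in the block.
  uint64_t hit_count;         // Incremented by JIT'ed code on block entry.
};

// Expands block counts into (guest address, count) pairs for debugger display.
std::vector<std::pair<uint32_t, uint64_t>> DeriveInstructionCounts(
    const std::vector<BlockTraceData>& blocks) {
  std::vector<std::pair<uint32_t, uint64_t>> counts;
  for (const auto& block : blocks) {
    for (uint32_t i = 0; i < block.instruction_count; ++i) {
      counts.emplace_back(block.start_address + i * 4, block.hit_count);
    }
  }
  return counts;
}
```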
### On-Stack Context Inspection
Currently the debugger only works with `--store_all_context_values`, as it can
only get the values of PPC registers when they are stored to the PPC context
after each instruction. As this can slow things down by ~10-20%, it could be
useful to be able to preserve the optimized and register-allocated HIR so that
host registers holding context values can be derived on demand. Or, we could
just make `--store_all_context_values` faster.
## JIT Performance Improvements
### Reduce HIR Size
Currently there are a lot of pointers stored within `Instr`, `Value`, and
related types. These are big 8B values that eat a lot of memory and really
hurt the cache (especially with all the block/instruction walking that is done).
Aligning everything to 16B slots in the arena and using 16-bit indices
(or something) could shrink things a lot.
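A sketch of what the compact form could look like (illustrative names, not the
current `hir::Value`/arena layout):
```
// Sketch: replace 8-byte Instr*/Value* pointers with 16-bit indices into the
// function's arena, where every node occupies a whole number of 16B slots.
#include <cassert>
#include <cstddef>
#include <cstdint>

using NodeRef = uint16_t;           // Index of a 16B slot; 0 means "none".
constexpr size_t kSlotSize = 16;    // All nodes aligned/sized to 16B.

struct CompactArena {
  uint8_t* base;          // Arena storage, 16B aligned.
  size_t capacity;        // In slots; a 16-bit ref addresses up to 64K slots.
  size_t next_slot = 1;   // Slot 0 reserved so ref 0 can mean "none".

  NodeRef Allocate(size_t size_in_bytes) {
    size_t slots = (size_in_bytes + kSlotSize - 1) / kSlotSize;
    assert(next_slot + slots <= capacity);
    NodeRef ref = static_cast<NodeRef>(next_slot);
    next_slot += slots;
    return ref;
  }
  void* Resolve(NodeRef ref) {
    return ref ? base + size_t(ref) * kSlotSize : nullptr;
  }
};

// A value whose def/use links are 2-byte refs instead of 8-byte pointers.
struct CompactValue {
  NodeRef def;        // Defining instruction.
  NodeRef first_use;  // Head of the use list.
  uint16_t flags;
  uint8_t type;       // i8/i16/.../v128 enum.
  uint8_t reg;        // Assigned register, if any.
};  // 8 bytes total instead of several pointer-sized fields.
```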
### Serialize Code Cache
The x64 code cache is currently set up to use fixed memory addresses and is even
represented as mapped memory. It should be fairly easy to back this with a file
and have all code written to disk. Adding more metadata, or perhaps a side-car
file, would make that on-disk code reusable. On future runs the code cache could
load this data (by mapping the file containing the code right into memory) and
skip JIT'ing entirely.
It would be possible to use a common container format (ELF/etc.); however,
there's elegance in not requiring any additional steps beyond the memory
mapping. Such containers could be useful for running static tools against,
though.
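A rough sketch of the file-backed mapping on Windows, assuming the cache keeps
its fixed base address; names, sizes, and error handling here are illustrative:
```
// Sketch: back the code cache with a file mapped at the same fixed address the
// JIT emitted for, so code from a previous run is valid with no relocation.
// A real implementation needs versioning/invalidations and handle cleanup.
#include <windows.h>
#include <cstdint>

void* MapCodeCacheFile(const wchar_t* path, size_t capacity,
                       void* fixed_base_address) {
  HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE | GENERIC_EXECUTE,
                            0, nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                            nullptr);
  if (file == INVALID_HANDLE_VALUE) return nullptr;
  HANDLE mapping = CreateFileMappingW(
      file, nullptr, PAGE_EXECUTE_READWRITE,
      static_cast<DWORD>(capacity >> 32), static_cast<DWORD>(capacity), nullptr);
  if (!mapping) {
    CloseHandle(file);
    return nullptr;
  }
  // Map at the fixed address; returns nullptr if that region is unavailable.
  return MapViewOfFileEx(mapping, FILE_MAP_ALL_ACCESS | FILE_MAP_EXECUTE, 0, 0,
                         capacity, fixed_base_address);
}
```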
## Portability Improvements
### Emulated Opcode Layer
Having a way to use emulated variants for any HIR opcode in a backend would
help when writing a new backend as well as when verifying the existing backends.
This may look like a C library with functions for each opcode/type pairing and
utilities to call out to them. The x64 backend could then call out to these
with `CallNativeSafe` (or some faster equivalent), and an interpreter backend
would be fairly trivial to write on top of them.
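A sketch of the shape such a library could take, with made-up names (the actual
opcode coverage and calling convention would need real design):
```
// Sketch: one C-linkage function per opcode/type pairing so any backend can
// fall back to it (x64 via CallNativeSafe, an interpreter by calling directly).
#include <cstdint>

extern "C" {

// e.g. OPCODE_MUL_HI for unsigned i64: the high 64 bits of a 128-bit product.
uint64_t xe_emu_mul_hi_u64(uint64_t a, uint64_t b) {
  // Portable 64x64->128 high-half multiply.
  uint64_t a_lo = a & 0xFFFFFFFFull, a_hi = a >> 32;
  uint64_t b_lo = b & 0xFFFFFFFFull, b_hi = b >> 32;
  uint64_t lo_lo = a_lo * b_lo;
  uint64_t hi_lo = a_hi * b_lo + (lo_lo >> 32);
  uint64_t lo_hi = a_lo * b_hi + (hi_lo & 0xFFFFFFFFull);
  return a_hi * b_hi + (hi_lo >> 32) + (lo_hi >> 32);
}

// e.g. OPCODE_CNTLZ for i32: count of leading zero bits.
uint64_t xe_emu_cntlz_u32(uint32_t v) {
  uint64_t count = 0;
  for (uint32_t mask = 0x80000000u; mask && !(v & mask); mask >>= 1) ++count;
  return count;
}

}  // extern "C"
```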
## X64 Backend Improvements
### Implement Emulated Instructions
There are a ton of half-implemented HIR opcodes that call out to C++ to do their
work. These are extremely expensive as they incur a full guest-to-host thunk
(~hundreds of instructions!). Basically, all of the `Emulate*`/`CallNativeSafe`
functions in `x64_sequences.cc` need to be replaced with proper AVX/AVX2
variants.
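As an illustration of the kind of replacement (a sketch, not actual
`x64_sequences.cc` code), a 32-bit-lane vector rotate-left that currently
bounces through a guest-to-host thunk can be done inline with AVX2 variable
shifts. This assumes the usual sequence environment: `e` is the `X64Emitter`,
`i.dest`/`i.src1`/`i.src2` are the operands, `e.xmm0`-`e.xmm2` are free scratch
registers, and `LoadConstantXmm`/`vec128i` are available.
```
// dest = rotate_left(src1, src2) per 32-bit lane, no thunk:
e.LoadConstantXmm(e.xmm0, vec128i(0x1F, 0x1F, 0x1F, 0x1F));
e.vpand(e.xmm0, i.src2, e.xmm0);     // amt = src2 & 31 per lane
e.vpsllvd(e.xmm1, i.src1, e.xmm0);   // lo = src1 << amt
e.LoadConstantXmm(e.xmm2, vec128i(32, 32, 32, 32));
e.vpsubd(e.xmm0, e.xmm2, e.xmm0);    // 32 - amt (== 32 when amt == 0)
e.vpsrlvd(e.xmm0, i.src1, e.xmm0);   // hi = src1 >> (32 - amt); 0 if count 32
e.vpor(i.dest, e.xmm0, e.xmm1);      // dest = hi | lo
```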
### Increase Register Availability
Currently only a few x64 registers are usable (due to reservations by the
backend or ABI conflicts). Though register pressure is surprisingly light in
most cases, there are pathological cases that result in a lot of spills. By
freeing up some of these registers the spills could be reduced.
### Constant Pooling
This may make sense as a compiler pass instead.
Right now, particular sequences of instructions are nasty, such as anything
using `LoadConstantXmm` to load non-zero/non-one vec128's. Instead of doing the
super fat (20-30 byte!) constant loads as they are done now, it may be better to
keep a per-function constant table and use RIP-relative addressing (or
something) to use the memory-form AVX instructions (see the sketch at the end
of this section).
For example, right now this:
```
v82.v128 = [0,1,2,3]
v83.v128 = or v81.v128, v82.v128
```
Translates to (something like):
```
mov([rsp+0x...], 0x00000000)
mov([rsp+0x...+4], 0x00000001)
mov([rsp+0x...+8], 0x00000002)
mov([rsp+0x...+12], 0x00000003)
vmovdqa(xmm3, [rsp+0x...])
vor(xmm2, xmm2, xmm3)
```
Whereas it could be:
```
vor(xmm2, xmm2, [rip+0x...])
```
Whether the cost of doing the constant de-dupe is worth it remains to be seen.
Right now it's wasting a lot of instruction cache space, increasing decode time,
and potentially using a lot more memory bandwidth.
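A sketch of the de-dupe table referenced above (a hypothetical helper, not
existing `X64Emitter` API); the emitter would append the pool after the
function body, 16B-aligned, and reference entries RIP-relatively:
```
// Sketch: per-function constant pool. Constants are de-duped into one blob so
// an op like the one above becomes a single memory-operand AVX instruction,
// e.g. vor(xmm2, xmm2, [rip + pool_offset]). Assumes xenia's vec128_t type.
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

class XmmConstantPool {
 public:
  // Returns the byte offset of |value| within the pool, adding it on first use.
  size_t Intern(const vec128_t& value) {
    for (const auto& entry : entries_) {
      if (std::memcmp(&entry.first, &value, sizeof(vec128_t)) == 0) {
        return entry.second;
      }
    }
    size_t offset = bytes_.size();
    const uint8_t* raw = reinterpret_cast<const uint8_t*>(&value);
    bytes_.insert(bytes_.end(), raw, raw + sizeof(vec128_t));
    entries_.emplace_back(value, offset);
    return offset;
  }
  const std::vector<uint8_t>& bytes() const { return bytes_; }

 private:
  std::vector<std::pair<vec128_t, size_t>> entries_;
  std::vector<uint8_t> bytes_;  // Emitted 16B-aligned after the function body.
};
```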
## Optimization Improvements
### Speed Up RegisterAllocationPass
Currently the slowest pass, this could be improved by requiring less use
tracking or perhaps maintaining the use tracking in other passes. A faster
SortUsageList (radix or something fancy?) may be helpful as well.
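For the sorting piece, a byte-wise LSD radix sort keyed on the use's
instruction ordinal is one option; sketched here over a flat vector (the real
use list is an intrusive linked list, so it would need adapting):
```
// Sketch: O(n) stable radix sort of uses by 32-bit instruction ordinal.
#include <array>
#include <cstdint>
#include <vector>

struct Use {
  uint32_t ordinal;  // Position of the using instruction in the function.
  void* instr;       // Opaque here.
};

void RadixSortUses(std::vector<Use>* uses) {
  std::vector<Use> scratch(uses->size());
  for (int shift = 0; shift < 32; shift += 8) {
    std::array<size_t, 257> counts{};
    for (const Use& u : *uses) ++counts[((u.ordinal >> shift) & 0xFF) + 1];
    for (size_t i = 1; i < counts.size(); ++i) counts[i] += counts[i - 1];
    for (const Use& u : *uses) {
      scratch[counts[(u.ordinal >> shift) & 0xFF]++] = u;
    }
    uses->swap(scratch);
  }
}
```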
### More Opcodes in ConstantPropagationPass
There are a few HIR opcodes with no handling, and others with minimal handling.
It'd be nice to know what paths need improvement and add them, as any work here
makes things free later on.
### Cross-Block ConstantPropagationPass
Constant propagation currently only occurs within a single block. This makes it
difficult to optimize common PPC patterns like loading the constants 0 or 1 into
a register before a loop, as well as other expensive altivec constant loads.
Either ControlFlowAnalysisPass or DataFlowAnalysisPass could be piggy-backed to
track constant load_context/store_context's across block boundaries and
propagate the values. This is simpler than handling dynamic values, as no phi
functions or anything fancy need to be involved.
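A sketch of the bookkeeping (hypothetical simplified types, not the real HIR
API): carry a map of context offset to known constant along edges, and at a
join keep only the entries every predecessor agrees on.
```
// Sketch: per-block "known constant context slots" state. Only constants are
// tracked, so merging states at a join never needs phi nodes: disagreeing
// entries simply drop out. Calls or non-constant stores invalidate entries.
#include <cstdint>
#include <map>

struct KnownContext {
  std::map<uint32_t, uint64_t> constants;  // context offset -> constant value

  // store_context offset, <constant> records a fact.
  void Record(uint32_t offset, uint64_t value) { constants[offset] = value; }
  void Invalidate(uint32_t offset) { constants.erase(offset); }

  // Meet with the state arriving on another incoming edge.
  void MeetWith(const KnownContext& other) {
    for (auto it = constants.begin(); it != constants.end();) {
      auto match = other.constants.find(it->first);
      if (match == other.constants.end() || match->second != it->second) {
        it = constants.erase(it);
      } else {
        ++it;
      }
    }
  }
};
```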
### Add TypePropagationPass
There are many extensions/truncations in generated code right now due to
various load/stores of varying widths. Being able to find and short-circuit
the conversions early on would make the following passes cleaner and faster, as
they'd have to trace through fewer value definitions and there'd be fewer
extraneous movs in the final code.
Example (after ContextPromotion):
```
v82.i32 = truncate v81.i64
v83.i32 = and v82.i32, 3F
v85.i64 = zero_extend v84.i32
```
Becomes (after DCE/etc):
```
v85.i64 = and v81.i64, 3F
```
### Enhance MemorySequenceCombinationPass with Extend/Truncate
Currently this pass will look for byte_swap and merge that into loads/stores.
This allows for better final codegen at the cost of making optimization more
difficult, so it only happens at the end of the process.
There are currently TODOs in there for adding extend/truncate support, which
will extend what it does with swaps to also merge the
sign_extend/zero_extend/truncate into the matching load/store. This allows for
the x64 backend to generate the proper mov's that do these operations without
requiring additional steps. Note that if we had a LIR and a peephole optimizer
this would be better done there.
Load with swap and extend:
```
v1.i32 = load v0
v2.i32 = byte_swap v1.i32
v3.i64 = zero_extend v2.i32
```
Becomes:
```
v1.i64 = load_convert v0, [swap|i32->i64,zero]
```
Store with truncate and swap:
```
v1.i64 = ...
v2.i32 = truncate v1.i64
v3.i32 = byte_swap v2.i32
store v0, v3.i32
```
Becomes:
```
store_convert v0, v1.i64, [swap|i64->i32,trunc]
```
### Add DeadStoreEliminationPass
Generic DSE pass, removing all redundant stores. ContextPromotion may be
able to take care of most of these, as the input assembly is generally
pretty optimized already. This pass would mainly be looking for introduced
stores, such as those from comparisons.
Currently ControlFlowAnalysisPass will annotate blocks with incoming/outgoing
edges as well as dominators, and that could be used to check whether stores into
the context are used in their destination block or instead overwritten
(currently they almost never are).
If this pass were able to remove a good number of the stores, dead code
elimination could then also remove the comparisons and dramatically reduce
branch overhead.
Example:
```
<block0>:
v0 = compare_ult ... (later removed by DCE)
v1 = compare_ugt ... (later removed by DCE)
v2 = compare_eq ...
store_context +300, v0 <-- removed
store_context +301, v1 <-- removed
store_context +302, v2 <-- removed
branch_true v1, ...
<block1>:
v3 = compare_ult ...
v4 = compare_ugt ...
v5 = compare_eq ...
store_context +300, v3 <-- these may be required if at end of function
store_context +301, v4 or before a call
store_context +302, v5
branch_true v5, ...
```
### Add X64CanonicalizationPass
For various opcodes, add copies or commute the arguments to match x64
operand semantics. This makes code generation easier and, if done before
register allocation, can prevent a lot of extra shuffling in the emitted code.
Example:
```
<block0>:
v0 = ...
v1 = ...
v2 = add v0, v1 <-- v1 now unused
```
Becomes:
```
v0 = ...
v1 = ...
v1 = add v1, v0 <-- src1 = dest/src, so reuse for both
by commuting and setting dest = src1
```
### Add MergeLocalSlotsPass
The RegisterAllocationPass generates load_local/store_local instructions as it
spills. Currently each set of locals is unique to each block, which in very
large functions can result in a lot of locals that are only used briefly. It
may be useful to use the results of the ControlFlowAnalysisPass to track local
liveness and merge the slots so they are reused when they cannot possibly be
live at the same time. This saves stack space and potentially improves cache
behavior.
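A sketch of the merging as a linear scan over local live ranges (hypothetical
types; equal-sized slots only, for simplicity):
```
// Sketch: sort local live ranges by first use, then hand each range either a
// recycled slot whose previous occupant has already died, or a fresh slot.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct LocalRange {
  uint32_t first_use;  // Ordinal of the first load_local/store_local.
  uint32_t last_use;   // Ordinal of the last.
  uint32_t slot;       // Output: merged stack slot index.
};

uint32_t MergeLocalSlots(std::vector<LocalRange>* ranges) {
  std::sort(ranges->begin(), ranges->end(),
            [](const LocalRange& a, const LocalRange& b) {
              return a.first_use < b.first_use;
            });
  // Active ranges ordered by when they die, plus a pool of freed slots.
  using Active = std::pair<uint32_t, uint32_t>;  // (last_use, slot)
  std::priority_queue<Active, std::vector<Active>, std::greater<Active>> active;
  std::vector<uint32_t> free_slots;
  uint32_t next_slot = 0;
  for (LocalRange& range : *ranges) {
    while (!active.empty() && active.top().first < range.first_use) {
      free_slots.push_back(active.top().second);
      active.pop();
    }
    if (!free_slots.empty()) {
      range.slot = free_slots.back();
      free_slots.pop_back();
    } else {
      range.slot = next_slot++;
    }
    active.push({range.last_use, range.slot});
  }
  return next_slot;  // Total slots actually needed.
}
```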


@@ -53,11 +53,6 @@ static const size_t MAX_CODE_SIZE = 1 * 1024 * 1024;
static const size_t STASH_OFFSET = 32;
static const size_t STASH_OFFSET_HIGH = 32 + 32;
// If we are running with tracing on we have to store the EFLAGS in the stack,
// otherwise our calls out to C to print will clear it before DID_CARRY/etc
// can get the value.
#define STORE_EFLAGS 1
const uint32_t X64Emitter::gpr_reg_map_[X64Emitter::GPR_COUNT] = {
Operand::RBX, Operand::R12, Operand::R13, Operand::R14, Operand::R15,
};
@@ -539,25 +534,6 @@ void X64Emitter::nop(size_t length) {
}
}
void X64Emitter::LoadEflags() {
#if STORE_EFLAGS
mov(eax, dword[rsp + STASH_OFFSET]);
btr(eax, 0);
#else
// EFLAGS already present.
#endif // STORE_EFLAGS
}
void X64Emitter::StoreEflags() {
#if STORE_EFLAGS
pushf();
pop(dword[rsp + STASH_OFFSET]);
#else
// EFLAGS should have CA set?
// (so long as we don't fuck with it)
#endif // STORE_EFLAGS
}
bool X64Emitter::ConstantFitsIn32Reg(uint64_t v) {
if ((v & ~0x7FFFFFFF) == 0) {
// Fits under 31 bits, so just load using normal mov.


@@ -173,9 +173,6 @@ class X64Emitter : public Xbyak::CodeGenerator {
// TODO(benvanik): Label for epilog (don't use strings).
void LoadEflags();
void StoreEflags();
// Moves a 64bit immediate into memory.
bool ConstantFitsIn32Reg(uint64_t v);
void MovMem64(const Xbyak::RegExp& addr, uint64_t v);


@@ -24,157 +24,4 @@
#include "xenia/cpu/compiler/passes/validation_pass.h"
#include "xenia/cpu/compiler/passes/value_reduction_pass.h"
// TODO:
// - mark_use/mark_set
// For now: mark_all_changed on all calls
// For external functions:
// - load_context/mark_use on all arguments
// - mark_set on return argument?
// For internal functions:
// - if liveness analysis already done, use that
// - otherwise, assume everything dirty (ACK!)
// - could use scanner to insert mark_use
//
// Maybe:
// - v0.xx = load_constant <c>
// - v0.xx = load_zero
// Would prevent NULL defs on values, and make constant de-duping possible.
// Not sure if it's worth it, though, as the extra register allocation
// pressure due to de-duped constants seems like it would slow things down
// a lot.
//
// - CFG:
// Blocks need predecessors()/successor()
// phi Instr reference
//
// - block liveness tracking (in/out)
// Block gets:
// AddIncomingValue(Value* value, Block* src_block) ??
// Potentially interesting passes:
//
// Run order:
// ContextPromotion
// Simplification
// ConstantPropagation
// TypePropagation
// ByteSwapElimination
// Simplification
// DeadStoreElimination
// DeadCodeElimination
//
// - TypePropagation
// There are many extensions/truncations in generated code right now due to
// various load/stores of varying widths. Being able to find and short-
// circuit the conversions early on would make following passes cleaner
// and faster as they'd have to trace through fewer value definitions.
// Example (after ContextPromotion):
// v81.i64 = load_context +88
// v82.i32 = truncate v81.i64
// v84.i32 = and v82.i32, 3F
// v85.i64 = zero_extend v84.i32
// v87.i64 = load_context +248
// v88.i64 = v85.i64
// v89.i32 = truncate v88.i64 <-- zero_extend/truncate => v84.i32
// v90.i32 = byte_swap v89.i32
// store v87.i64, v90.i32
// after type propagation / simplification / DCE:
// v81.i64 = load_context +88
// v82.i32 = truncate v81.i64
// v84.i32 = and v82.i32, 3F
// v87.i64 = load_context +248
// v90.i32 = byte_swap v84.i32
// store v87.i64, v90.i32
//
// - ByteSwapElimination
// Find chained byte swaps and replace with assignments. This is often found
// in memcpy paths.
// Example:
// v0 = load ...
// v1 = byte_swap v0
// v2 = byte_swap v1
// store ..., v2 <-- this could be v0
//
// It may be tricky to detect, though, as often times there are intervening
// instructions:
// v21.i32 = load v20.i64
// v22.i32 = byte_swap v21.i32
// v23.i64 = zero_extend v22.i32
// v88.i64 = v23.i64 (from ContextPromotion)
// v89.i32 = truncate v88.i64
// v90.i32 = byte_swap v89.i32
// store v87.i64, v90.i32
// After type propagation:
// v21.i32 = load v20.i64
// v22.i32 = byte_swap v21.i32
// v89.i32 = v22.i32
// v90.i32 = byte_swap v89.i32
// store v87.i64, v90.i32
// This could ideally become:
// v21.i32 = load v20.i64
// ... (DCE takes care of this) ...
// store v87.i64, v21.i32
//
// - DeadStoreElimination
// Generic DSE pass, removing all redundant stores. ContextPromotion may be
// able to take care of most of these, as the input assembly is generally
// pretty optimized already. This pass would mainly be looking for introduced
// stores, such as those from comparisons.
//
// Example:
// <block0>:
// v0 = compare_ult ... (later removed by DCE)
// v1 = compare_ugt ... (later removed by DCE)
// v2 = compare_eq ...
// store_context +300, v0 <-- removed
// store_context +301, v1 <-- removed
// store_context +302, v2 <-- removed
// branch_true v1, ...
// <block1>:
// v3 = compare_ult ...
// v4 = compare_ugt ...
// v5 = compare_eq ...
// store_context +300, v3 <-- these may be required if at end of function
// store_context +301, v4 or before a call
// store_context +302, v5
// branch_true v5, ...
//
// - X86Canonicalization
// For various opcodes add copies/commute the arguments to match x86
// operand semantics. This makes code generation easier and if done
// before register allocation can prevent a lot of extra shuffling in
// the emitted code.
//
// Example:
// <block0>:
// v0 = ...
// v1 = ...
// v2 = add v0, v1 <-- v1 now unused
// Becomes:
// v0 = ...
// v1 = ...
// v1 = add v1, v0 <-- src1 = dest/src, so reuse for both
// by commuting and setting dest = src1
//
// - RegisterAllocation
// Given a machine description (register classes, counts) run over values
// and assign them to registers, adding spills as needed. It should be
// possible to directly emit code from this form.
//
// Example:
// <block0>:
// v0 = load_context +0
// v1 = load_context +1
// v0 = add v0, v1
// ...
// v2 = mul v0, v1
// Becomes:
// reg0 = load_context +0
// reg1 = load_context +1
// reg2 = add reg0, reg1
// store_local +123, reg2 <-- spill inserted
// ...
// reg0 = load_local +123 <-- load inserted
// reg0 = mul reg0, reg1
#endif // XENIA_COMPILER_COMPILER_PASSES_H_