Seems to be pretty high in the profile in some geometry-heavy games like The
Last Story, and the compiler-generated assembly is terrifyingly bad, so
SSE-ize it.
VertexLoader::VertexLoader was setting loop_counter, a *static*
variable, to 0. This was nonsensical, but harmless until I started to
run it on a separate thread, where it had a chance of interfering with a
running vertex translator.
Switch to just using a register for the loop counter.
- Lazily create the native vertex format (which involves GL calls) from
RunVertices rather than RefreshLoader itself, freeing the latter to be
run from the CPU thread (hopefully).
- In order to avoid useless allocations while doing so, store the native
format inside the VertexLoader rather than using a cache entry.
- Wrap the s_vertex_loader_map in a lock, for similar reasons.
To avoid FPRs being pushed unnecessarily, I checked the uses: DSPEmitter
doesn't use FPRs, and VertexLoader doesn't use anything but RAX, so I
specified the register list accordingly. The regular JIT, however, does
use FPRs, and as far as I can tell, it was incorrect not to save them in
the outer routine. Since the dispatcher loop is only exited when
pausing or stopping, this should have no noticeable performance impact.
- Factor common work into a helper function.
- Replace confusingly named "noProlog" with "rsp_alignment". Now that
x86 is not supported, we can just specify it explicitly as 8 for
clarity.
- Add the option to include more frame size, which I'll need later.
- Revert a change by magumagu in March which replaced MOVAPD with MOVUPD
on account of 32-bit Windows, since it's no longer supported. True,
apparently recent processors don't execute the former any faster if the
pointer is, in fact, aligned, but there's no point using MOVUPD for
something that's guaranteed to be aligned...
(I discovered that GenFrsqrte and GenFres were incorrectly passing false
to noProlog - they were, in fact, functions without prologs, the
original meaning of the parameter - which caused the previous change to
break. This is now fixed.)
This is required to make packing consistent between compilers: with u32, MSVC
would not allocate a bitfield that spans two u32s (it would leave a "hole").
We are used to have a 1:1 mapping of GX vertex formats and the native (OGL + D3D) ones, but there are by far more GX ones.
This new cache maps them directly so that we don't flush on GX vertex format changes as long as the native one doesn't change.
The idea is stolen from galop1n.
Our defines were never clear between what meant 64bit or x86_64
This makes a clear cut between bitness and architecture.
This commit also has the side effect of bringing up aarch64 compiling support.