The old calculation was stride * (max_index + 1), which fails if stride is less than the size of a component (for instance, if float XYZ positions are used, and the stride was set to 4 (i.e. sizeof(float)) instead of 12 (i.e. 3 * sizeof(float)), it would be missing the last 8 bytes of the final element in the array. Or, if stride was set to 0, then no bytes would be recorded at all (though that's not a useful configuration so it's unlikely to actually exist).
I'm not aware of any games affected by this issue.
This should fix recording the wall in the staircase leading to the basement in Luigi's Mansion (though I haven't tested it, as I don't own a copy of Luigi's Mansion). This uses NormalIndex3, and the index for the normal vector (generally 0x02XX or 0x01XX) there is always lower than the tangent or binormal (generally 0x07XX). Other games seem to usually have a similar range of indices for the normal, tangent, and binormal, so this issue wouldn't affect them.
In most cases, games will use the same type for all vertex components (either Index8 or Index16 or Direct). However, RS2's deflection towers use Index16 for the texture coordinate and Index8 for everything else, meaning the texture coordinates were recorded incorrectly (the first byte was used, so only indices 0 and 1 were recorded instead of 0 through 0x0192). Worse still, some background elements in RS2 use direct positions but indexed normals or texture coordinates, and those would not be recorded at all.
This is a regression from b5fd35f951.
`count` is the number of stereo samples to write (where each stereo sample is two shorts), while `BUFFER_SIZE` is the size of the buffer in shorts. So `count` needs to be multiplied by `2`, not `BUFFER_SIZE`. Also, when this check was failed, the previous code just clobbered whatever was past the end of the buffer after logging the warning, which corrupted `basename`, eventually resulting in Dolphin crashing.
This affected Datel's Wii-compatible Action Replay, which uses a block size of 2298, or 18384 stereo samples, which is 36768 shorts, which is bigger than the buffer size of 32768. (However, the previous commit means that only one block is transfered at a time, eliminating this issue; fixing the bounds check is just a general safety thing instead of an actual bugfix now.)
The previous implementation of Force25BitPrecision was essentially a
translation of the x86-64 implementation. It worked, but we can make a
more efficient implementation by using an AArch64 instruction I don't
believe x86-64 has an equivalent of: URSHR. The latency is the same as
before, but the instruction count and register count are both reduced.