Previously, we only calculated the width of a single output circuit, which led to missing a single pixel from the other output circuit and in turn caused offset issues in Persona games. I have customized GetDisplayRect() to also calculate the dimensions of the merged rectangle when both output circuits are enabled through the PMODE register, so this hack is no longer needed. :)
TL;DR - The above commit of mine accurately handles the offset issues by calculating the union of the rects, removing this stupid hack. (Not insulting any other developers, this stupid hack was mine :)
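
As a rough illustration of the rect union mentioned above, here is a minimal sketch. `Rect` and `UnionRect` are hypothetical names and do not reproduce the actual GetDisplayRect() code; the sizes used in the note below are made up.

```cpp
#include <algorithm>

// Hypothetical rectangle type; not the emulator's actual vector type.
struct Rect
{
    int left, top, right, bottom;

    int width() const { return right - left; }
    int height() const { return bottom - top; }
};

// Smallest rectangle that contains both output circuit rectangles.
inline Rect UnionRect(const Rect& a, const Rect& b)
{
    return Rect{
        std::min(a.left, b.left),
        std::min(a.top, b.top),
        std::max(a.right, b.right),
        std::max(a.bottom, b.bottom),
    };
}
```

With two 640x448 circuits where the second one is shifted right by one pixel, the union is 641 pixels wide, while measuring only one circuit reports 640 and loses that pixel.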
Passes the merged output circuit size as the base size for the texture cache scaling code. Helps fix scaling issues where games use both output circuits for rendering.
Future Note: Alter the behavior of the IsEnabled() check, which always prefers the second output circuit for some weird reason. I plan on changing it to a better automatic output circuit selection mechanism, but that can be done some time in the future.
* Upgrade the counter to signed 32 bits. 16 bits is too small to contain the 64K value.
* Read ThreadProc/m_count when the mutex is locked
* Use the old value returned by the fetch instead of reading back the new value
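
A minimal sketch of the counter points above, assuming a `std::atomic` counter. `WorkCounter` and `Add` are hypothetical names, not the actual thread code, and the mutex-protected read of m_count is not shown.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical pending-work counter. 32 bits so that a value of 64K (65536)
// fits; a signed 16-bit counter tops out at 32767.
struct WorkCounter
{
    std::atomic<int32_t> count{0};

    // Adds n and returns the new value, derived from what fetch_add returned.
    // Loading count again after fetch_add could observe another thread's update.
    int32_t Add(int32_t n)
    {
        const int32_t old_value = count.fetch_add(n);
        return old_value + n;
    }
};
```

Deriving the new value from the fetched one also saves a second atomic access.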
Hopefully this translates well to slower systems :)
Tekken Tag:
Before: 79-81fps
After: 82-84fps
Front Mission 4 intro (as it pans over the roofs)
Before: 158-159fps
After: 165-166fps
Previously, the seconds variable of the RTC was updated on progressive modes after every 50 Vsyncs, which was obviously wrong. The code has been adjusted to update the RTC according to the vertical frequency of each video mode.
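
A minimal sketch of the idea, assuming a per-mode vsync counter. `RtcTicker` and its fields are hypothetical names, and the rates are rounded to integers for simplicity (real vertical frequencies such as NTSC's are not whole numbers).

```cpp
#include <cassert>

// Hypothetical per-mode vsync counter; field names and rates are illustrative only.
struct RtcTicker
{
    int vsyncs_per_second; // rounded vertical frequency of the active video mode
    int vsync_count = 0;
    int seconds = 0;

    explicit RtcTicker(int rate) : vsyncs_per_second(rate) { assert(rate > 0); }

    // Call once per vsync; advances seconds once a full second of vsyncs has passed.
    void OnVsync()
    {
        if (++vsync_count >= vsyncs_per_second)
        {
            vsync_count = 0;
            ++seconds;
        }
    }
};
```

With a fixed threshold of 50, a mode running at about 60 Hz would advance the clock roughly 20% too fast; using the mode's own rate avoids that.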
Avoid reading past the end of the disk.
Avoid waiting when there are prefetches remaining.
Fix the maths so that the first prefetch after a request attempts to
read the next block of sectors and not the block of sectors that was
just read (which will just be skipped anyway because the data has just
been cached).
Avoid potential prefetch after disk is swapped (though disc swap doesn't
work properly if you just eject and insert a different disk).
Stop prefetching on disk read failure (Suikoden hits this case - 2048
byte reads are requested, but only 2352 byte reads will succeed).
Also reduce the read retry count to 2.
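
The prefetch maths above could look roughly like the following sketch. `PrefetchRange` and `NextPrefetch` are hypothetical names, and the sketch assumes that cached blocks are aligned to multiples of the block size, which the original code may not do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical description of a block of sectors to prefetch.
struct PrefetchRange
{
    uint32_t start; // first sector to read
    uint32_t count; // number of sectors, 0 when nothing is left to prefetch
};

// After a request inside the block containing requested_lsn, prefetch the
// block that follows it (the requested block is already cached), clamped so
// the read never goes past the last sector of the disc.
inline PrefetchRange NextPrefetch(uint32_t requested_lsn, uint32_t sectors_per_block, uint32_t disc_sectors)
{
    assert(sectors_per_block > 0);
    const uint32_t block_start = requested_lsn - (requested_lsn % sectors_per_block);
    const uint32_t next_start = block_start + sectors_per_block;
    if (next_start >= disc_sectors)
        return {disc_sectors, 0};
    return {next_start, std::min(sectors_per_block, disc_sectors - next_start)};
}
```

A count of 0 tells the caller to stop prefetching instead of issuing a read beyond the end of the disc.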
16B alignment is now useless for nVifBlock (no more SSE)
However, update the alignment of the bucket to 64B. It will reduce the cache miss
probability in the find loop
It avoids memory stalls and greatly reduces the overhead of the dVifUnpack function
Here is a VTune summary of this branch (done on SotC init)
dVifUnpack<1> was 14.5% of effective VU thread time
dVifUnpack<1> is now 3.8% of effective VU thread time
I hope it will translate to better fps
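
For illustration, 64B alignment of a bucket can be expressed as in the sketch below. `Block`, `Bucket` and `kCacheLine` are hypothetical names, the real nVifBlock and bucket layouts are not reproduced, and a 64-byte cache line is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical block record; the real nVifBlock layout is not reproduced here.
struct Block
{
    uint64_t key;
    uintptr_t code_ptr;
};

// Assumed cache line size of 64 bytes; typical for x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Each bucket starts on a cache line boundary, so scanning its first
// entries does not straddle two lines.
struct alignas(kCacheLine) Bucket
{
    Block entries[4];
};

static_assert(alignof(Bucket) == kCacheLine, "bucket must be cache-line aligned");
static_assert(sizeof(Bucket) % kCacheLine == 0, "array elements must stay aligned");
```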
Delete() deletes the menu item but keeps the sub menu. Remove() doesn't
delete the menu item.
Also use AppendSubMenu - using Append on a submenu is deprecated.
It allows comparing only 8B in the lookup, so SSE can be replaced with general-purpose instructions
As a bonus, it allows computing the hash key with a mov rather than a modulo (which was an 'and')
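
A hedged sketch of what an 8B comparison and a mov-only hash key can look like. `BlockKey`, `SameKey` and `BucketIndex` are hypothetical, the field layout is invented for illustration, and a 65536-entry table is assumed.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical layout: the fields that identify a block fill exactly 8 bytes.
// Field names are illustrative, not the actual nVifBlock members.
struct BlockKey
{
    uint16_t hash_key; // doubles as the bucket index, so no modulo or mask is needed
    uint8_t fields[6]; // remaining identifying fields
};
static_assert(sizeof(BlockKey) == 8, "key must fit in one 64-bit register");

// Compare two keys with a single 64-bit integer comparison.
inline bool SameKey(const BlockKey& a, const BlockKey& b)
{
    uint64_t x, y;
    std::memcpy(&x, &a, sizeof(x)); // memcpy avoids aliasing issues; typically compiled to a plain load
    std::memcpy(&y, &b, sizeof(y));
    return x == y;
}

// With a 16-bit key and an assumed 65536-entry table, the index is the key itself.
inline uint32_t BucketIndex(const BlockKey& k)
{
    return k.hash_key;
}
```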
Inline the execution part
Add a num parameter to dVifsetVUptr
Use a local variable for the nVifBlock instead of a global struct state
The goal is to ease future updates of the nVifBlock struct
The previous implementation saved both the chain pointer and the chain size.
Rationale: the size is useful to add a new element and to detect the end of the chain.
The Vif cache rarely misses, so 'add' is barely called and the end of a chain is
barely reached.
The new implementation adds a null cell at the end of the chain. As a
cell contains an x86 pointer, if it is null you can conclude that you
have reached the end of the chain.
The 'add' function will traverse the chain to get the current size. It is
a cold path, and besides, the chain is often short (< 4).
The 'find' function only needs to check the startPtr bytes to detect the end
of the loop.
Note: SizeChain was replaced with a std::array
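
The null-terminated chain can be sketched as follows. `Cell` and `Chain` are hypothetical, and `std::vector` stands in for whatever storage the real code uses; only the termination scheme is the point here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cell: a zero start_ptr marks the end of the chain.
struct Cell
{
    uint64_t key;
    uintptr_t start_ptr; // address of compiled code, never 0 for a valid cell
};

// Chain stored as a contiguous array that always ends with a null cell.
struct Chain
{
    std::vector<Cell> cells{Cell{0, 0}};

    // Hot path: walk until the key matches or the null terminator is reached.
    const Cell* Find(uint64_t key) const
    {
        for (const Cell* c = cells.data(); c->start_ptr != 0; ++c)
        {
            if (c->key == key)
                return c;
        }
        return nullptr;
    }

    // Cold path: walk the chain to get its length, overwrite the terminator
    // with the new cell, then append a fresh terminator.
    void Add(const Cell& cell)
    {
        assert(cell.start_ptr != 0);
        std::size_t size = 0;
        while (cells[size].start_ptr != 0)
            ++size;
        cells[size] = cell;
        cells.push_back(Cell{0, 0});
    }
};
```

No size field is stored, so 'find' touches only the cells themselves and 'add' pays for the extra traversal on the rare miss.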
Safety:
* check remaining space before compilation
* clear hash if recompiler is reset
Perf:
* don't search the hash again after a miss
* reduce branching in Unpack/ExecuteUnpack
Note: a potential speed optimization for dVifsetVUptr
Precompute the length and store it in the cache. However, it needs 2B in the
nVifBlock struct. Maybe we can compact cl/wl, or merge aligned with upkType
(if some bits are useless)
I missed some early returns in my first attempt. Now VTune properly shows me
the time spent in the VU recompiler.
Note: It seems some blocks overlap (likely due to the branching mess). But it is still way better than no data
The GS_Packet constructor calls memset, which is quite slow and useless as the data is overwritten
VTune overhead of Gif_Unit::Execute goes from 5.8% to 3.0% (EE thread)