The idea is to use a floating texture to accumulate the data and
then do a final postprocessing pass to apply the modulo
v2:
* use bounding box to
* fix vertex corruption issue
* use negative number in shader which allow to use half float (+12
fps@4x)
Previous CopyRect function does a memcopy without conversion.
This function will allow to use different format for input/output. Just a
possibility for the future
Basically the code does the alpha multiplication in the shader therefore
the blend unit only does a pure addition. This way the multiplication is
accurate and accurate_blending doesn't requires a costly barrier.
This code also avoid variable duplication to make the code more separated.
Hopefully blending can be done in a separated function
It is preliminary work to support fast color clipping with HDR
v2: fix assertion compilation failure
v3: fix regression in not accurate mode
v3: Cs * As/Af is not an accumulation
Those cases don't need the Cd addition and were already optimized anyway
Fix a regression on GoW2
Accurate options do a better jobs. Technically it can still
be useful for old gpu/driver that doesn't support the GL4.5 extension.
On Windows, you can still rely on Dx
On linux, free driver support it (except Intel)
GS uses integer value and does integer operation too.
This commit trunc the sampled texture, the interpoled fragment color
and the product of the 2.
It impacts negatively the perf of about 3/4% (GPU) but it fixes rendering on
suikoden and potentially some others games too.
Nvidia allows to get the ASM of the shader of the compiled shader. It is useful
to check the performance.
It also allow me to compile most of shader code path for QA
Dump is enabled in linux replayer + debug_glsl_shader = 2
It might save a couple of fps
Add a define to test the perf if we keep only the blue channel. It brokes
the code in Prince Of Persia that use the Red/Green channel... Maybe the
speed hack :( Or find a way to replace all if with a lookup table
Note: it is only supported on OpenGL currently
It might help to fix a bit the color on a couple of games
accurate_fbmask = 1
Code uses GL4.5 extensions. So far it seems the effect is ony used a couple
of time and often in non-overlapping primitive. Speed impact will likely remain small
GS doesn't supports texture shuffle/swizzle so it is emulated in a
complex way.
The idea is to read/write the 32 bits color format as a 16 bit format.
This way, RG (16 lsb bits) or BA (16 msb bits) can be read or written with
square texture that targets pixels 1-8 or pixels 8-16.
However shuffle is limited. For example you can copy the green channel
to either the alpha channel or another green channel.
Note: Partial masking of channel is not yet implemented
V2: improve logging
V3: better support of green channel in shader
V4: improve detection of destination (issue due to rounding)
Gow uses 24 bits buffer, so only color is updated but blending is configured as Cd
so it is a NOP
In this case, we don't lookup the target in the texture cache. It reduces the complexity
to handle depth which can be located at same address as RT
Note: please test DX renderer
The idea will be to replace StretchRect for standard case with a framebuffer
blit. Potentially it toggles less gl state.
Worth a test on Star Ocean 3 that uses a lots this function for stencil emulation
The purpose of the code is to support alpha channel
of RT uses as an index for a palette texure.
I'm afraid that code will likely break pure palette texture. Only used
if paltex is enabled
It fixes missing shadow in Star Ocean 3 (issue #374) in Native resolution
with filter = 0 (no filtering) or = 2 (normal fitering)
Rendering explanation:
The game emulates a stencil buffer with the alpha channel
The alpha channel of the RT can contains a palette texture index (format 4HH)
The idea is to have a gradient of value in the palette (16/32/48/...).
This way you can implement a +16/-16 and even wrap the alpha value every time
you hit the pixel.
Bilinear filtering breaks the rendering because it interpolates between counts
so you doesn't have the exact count
Upscaling breaks the rendering because the RT is reused as an input texture. It means
that we need to scale it down which again create some interpolations.
This way it will allow to implement all blendings operartion in FS.
Of course it will be slow, but it would be nice for debug and quickly check
game error rendering.
Currently we're trying to infer the conversion shader based on the output format
It only works if the input data is RGBA8. It might not be true in the future
Initially we copy pitch by line in the PBO and tell the dma to only
use the first valid byte.
Now, we only copy useful data to the PBO. It reduce the copy and PBO memory requirement.
It seems a bit faster on native resolution
Group opengl calls into a nice name.
Apitrace shows them in a tree format that support folding. Previously it
was a long flat list (10K-40K of lines by frame)
I align the call number with the internal s_n variable. This way it is
easy to map GSdx dump output with the GL debugger :)
If there is no overlap, it is allowed to directly read from the render target.
On SotC testcase with 6x scaling: 30fps -> 40fps
Note: it requires GL_ARB_texture_barrier extension so be sure to have a recent driver
Note2: it requires a lots of testing too
Open question: in case of complex date (written alpha)
Will it be faster to split the draw call into multiple call with no
primitive overlap
Note yet enabled because I'm afraid of data corruption but feel free to test it
The option:
ogl_vertex_storage = 1
Performance note (warm cache+gs replay on colin3)
60 fps -> 76 fps
UserHacks_UnscaleSprite = 1 will unscale flat sprites
UserHacks_UnscaleSprite = 2 will unscale all sprites (don't work well so far)
The idea of the hack is to redo the interpolation of texture coordinate
based on the non-upscaled pixel position.
It avoids various glitches but sprites aren't upscaled anymore (so no
more anti-aliasing, potentially a coefficient can be added).