Basically the code does the alpha multiplication in the shader, so
the blend unit only does a pure addition. This way the multiplication is
accurate and accurate_blending doesn't require a costly barrier.
This code also avoids variable duplication to keep the code better separated.
Hopefully blending can be done in a separate function.
It is preliminary work to support fast color clipping with HDR
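A minimal CPU-side sketch of the split described above, for one 8-bit channel of "Cs * As + Cd". The helper names are hypothetical and the 128 == 1.0 alpha convention is assumed; this is not the actual shader code.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper names. One 8-bit channel of "Cs * As + Cd",
// with As in the PS2 convention where 128 means 1.0 (assumption).

// Done in the fragment shader: the product keeps full float precision.
inline float shader_cs_times_as(std::uint8_t cs, std::uint8_t as)
{
    return static_cast<float>(cs) * (static_cast<float>(as) / 128.0f);
}

// Left to the blend unit (think of a ONE/ONE additive blend): a pure
// addition, saturated to the 8-bit range of the render target.
inline std::uint8_t blend_unit_add(float shader_output, std::uint8_t cd)
{
    const float sum = shader_output + static_cast<float>(cd);
    return static_cast<std::uint8_t>(std::min(std::max(sum, 0.0f), 255.0f));
}
```

For example, Cs = 100 with As = 64 (0.5) gives 50 out of the shader, and the blend unit only has to add Cd.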
v2: fix assertion compilation failure
v3: fix regression in not accurate mode
v3: Cs * As/Af is not an accumulation
Those cases don't need the Cd addition and were already optimized anyway
Fix a regression on GoW2
Do DATE algo selection before blending. This way we can detect bad
interactions.
Regroup all blending/colclip in a single block. Avoid checking abe &&
rt multiple times.
v2: only enable sw blending when abe is true
The updated medium level will run for all sprites. It helps the SotC bloom effect and it remains
fast enough to be enabled by default (at least on 3D games)
The new high level will run for all sprites + color clipping
The idea is that sprites are often used for post-processing effects (except in 2D games, of course)
Most of the time post-processing supports SW blending with a small speed penalty. SW
blending is more accurate so it is better to use it.
Gain: 1% at 4x on SotC (it partially compensates recent additions)
When the color is constant and equal to 128, the MODULATE mode is
equivalent to the DECAL mode. It saves 5 instructions on the FS.
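A small sketch of why the two modes match, assuming the usual GS per-channel formulas (MODULATE = (Ct * Cf) >> 7 clamped to 255, DECAL = Ct). Function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed GS formula for MODULATE on one channel: (Ct * Cf) >> 7, clamped to 255.
inline std::uint8_t tfx_modulate(std::uint8_t ct, std::uint8_t cf)
{
    const unsigned v = (static_cast<unsigned>(ct) * cf) >> 7;
    return static_cast<std::uint8_t>(std::min(v, 255u));
}

// DECAL simply returns the texture color.
inline std::uint8_t tfx_decal(std::uint8_t ct)
{
    return ct;
}

// Checks every possible texel value for a given constant color.
inline bool modulate_equals_decal(std::uint8_t cf)
{
    for (unsigned ct = 0; ct < 256; ++ct)
        if (tfx_modulate(static_cast<std::uint8_t>(ct), cf) != tfx_decal(static_cast<std::uint8_t>(ct)))
            return false;
    return true;
}
```

With Cf = 128 the multiplication and the shift cancel out, so the shader can skip them.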
Accurate options do a better job. Technically it can still
be useful for old GPUs/drivers that don't support the GL4.5 extension.
On Windows, you can still rely on Dx
On Linux, the free drivers support it (except Intel)
Code is not yet enabled because it requires extensive testing
The idea is to replace a point by a 1-pixel sprite with the help of
a geometry shader. In 4x, a point will be replaced by a 4x4 sprite.
// GL42 interacts very badly with sw blending. GL42 uses the primitiveID to find the primitive
// that writes the bad alpha value. Sw blending will force the draw to run primitive by primitive
// (therefore primitiveID will be constant to 1)
It might help to fix the color a bit in a couple of games
accurate_fbmask = 1
Code uses GL4.5 extensions. So far it seems the effect is only used a couple
of times and often on non-overlapping primitives. Speed impact will likely remain small
GS doesn't support texture shuffle/swizzle so it is emulated in a
complex way.
The idea is to read/write the 32 bits color format as a 16 bit format.
This way, RG (16 lsb bits) or BA (16 msb bits) can be read or written with
square texture that targets pixels 1-8 or pixels 8-16.
However shuffle is limited. For example you can copy the green channel
to either the alpha channel or another green channel.
Note: Partial masking of channels is not yet implemented
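A rough illustration of the 32 bits / 16 bits view described above, assuming R sits in the lowest byte and A in the highest. The helper names are hypothetical and this only shows the bit layout, not the shader emulation.

```cpp
#include <cstdint>

// Assumed layout of a 32 bits pixel: R = bits 0-7, G = 8-15, B = 16-23, A = 24-31.

// The 16 lsb bits hold the R and G channels.
inline std::uint16_t low_half_rg(std::uint32_t rgba)
{
    return static_cast<std::uint16_t>(rgba & 0xFFFFu);
}

// The 16 msb bits hold the B and A channels.
inline std::uint16_t high_half_ba(std::uint32_t rgba)
{
    return static_cast<std::uint16_t>(rgba >> 16);
}

// One shuffle that is possible: copy the green channel into the alpha channel.
inline std::uint32_t copy_green_to_alpha(std::uint32_t rgba)
{
    const std::uint32_t g = (rgba >> 8) & 0xFFu;
    return (rgba & 0x00FFFFFFu) | (g << 24);
}
```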
V2: improve logging
V3: better support of green channel in shader
V4: improve detection of destination (issue due to rounding)
GoW uses a 24 bits buffer, so only color is updated, but blending is configured as Cd
so it is a NOP
In this case, we don't look up the target in the texture cache. It reduces the complexity
of handling depth, which can be located at the same address as the RT
Note: please test DX renderer
Please test it!
GS supports 3 formats for the output:
32 bits: normal case
=> no change
24 bits: like 32 bits but without alpha channel
=> mask alpha channel (ie don't write it anymore)
=> Always uses 1.0f as blending coefficient
16 bits: RGB5A1, emulated by a 32 bits openGL texture. I think it would be more correct to use
a real 16 bits GL texture. Unfortunately it would cost several (slow) target conversions.
Anyway, as the current solution
=> apply a mask of 0xF8 on color when SW blending is used (improves Castlevania shadow)
unfortunately normal blending mode still uses the full range of colors!
This commit also corrects a couple of blending factors. 128/255 is equivalent to 1.0f on the PS2, whereas the GPU uses 1.0f. So the blending factor must be 255/128 instead of 2
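The two numeric points above (the 0xF8 mask for the 16 bits format and the 255/128 factor) can be sketched as follows. Function names are hypothetical; the real code lives in the shader.

```cpp
#include <cstdint>

// 16 bits output: keep only the 5 most significant bits of an 8-bit channel.
inline std::uint8_t mask_to_5_bits(std::uint8_t c)
{
    return static_cast<std::uint8_t>(c & 0xF8);
}

// The GPU reads a stored alpha byte as byte/255. The PS2 treats 128 as 1.0,
// so the normalized value is rescaled by 255/128 (not by 2).
inline float ps2_blend_factor(float gpu_normalized_alpha)
{
    return gpu_normalized_alpha * (255.0f / 128.0f);
}
```

A stored alpha of 128 is seen by the GPU as 128/255, and the rescale brings it back to about 1.0.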
Note: disable CRC hack and enable accurate_colclip to see Castlevania shadow ^^
(issue #380).
Note2: SW renderer is darker on Castlevania. I don't know why, maybe it is linked to the poorly emulated 16 bits format
The purpose of the code is to support the alpha channel
of the RT used as an index for a palette texture.
I'm afraid that code will likely break pure palette textures. Only used
if paltex is enabled
It fixes the missing shadow in Star Ocean 3 (issue #374) in native resolution
with filter = 0 (no filtering) or = 2 (normal filtering)
Rendering explanation:
The game emulates a stencil buffer with the alpha channel
The alpha channel of the RT can contain a palette texture index (format 4HH)
The idea is to have a gradient of values in the palette (16/32/48/...).
This way you can implement a +16/-16 and even wrap the alpha value every time
you hit the pixel.
Bilinear filtering breaks the rendering because it interpolates between counts
so you don't have the exact count
Upscaling breaks the rendering because the RT is reused as an input texture. It means
that we need to scale it down, which again creates some interpolation.
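A toy model of the counting trick explained above. It assumes a 16-entry palette indexed by the high nibble of the alpha byte (my reading of the 4HH format) and a +16 step; the names are hypothetical.

```cpp
#include <array>
#include <cstdint>

// 16-entry palette whose entries step by 16: entry i holds ((i + 1) * 16) mod 256.
inline std::array<std::uint8_t, 16> make_increment_palette()
{
    std::array<std::uint8_t, 16> palette{};
    for (unsigned i = 0; i < 16; ++i)
        palette[i] = static_cast<std::uint8_t>(((i + 1) & 0xF) << 4);
    return palette;
}

// One "hit": the high nibble of the RT alpha selects the palette entry
// (assumed 4HH layout) and the fetched value becomes the new alpha.
inline std::uint8_t hit_pixel(const std::array<std::uint8_t, 16>& palette, std::uint8_t alpha)
{
    return palette[alpha >> 4];
}

// What a 50/50 bilinear blend of two neighbouring counts would return.
inline std::uint8_t bilinear_midpoint(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>((a + b) / 2);
}
```

Each hit moves the alpha by +16 and wraps after 240. A filtered lookup between 16 and 32 returns 24, which is not a valid count anymore.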
* Dump context before the increase of s_n
=> aligned with the global call number
* Don't print colclip not supported when it is optimized away
* detach the input texture when it is useless
=> avoid showing a wrong texture in the debugger
This way it will be possible to implement all blending operations in the FS.
Of course it will be slow, but it would be nice for debugging and to quickly check
game rendering errors.
Currently colclip uses 2 passes to wrap the output of the blending unit
However some blending modes are only a plain copy (of 0 or Cs or Cd).
So there is no overflow of [0:255] and no need to wrap it
Note: I saw those cases in GoW.
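A minimal sketch of the shortcut, assuming the GS blend equation has the form (A - B) * C + D so that A == B leaves only D (0, Cs or Cd). The enum and function names are hypothetical.

```cpp
// Hypothetical selector for the A and B inputs of the assumed equation (A - B) * C + D.
enum BlendInput { INPUT_CS = 0, INPUT_CD = 1, INPUT_ZERO = 2 };

// When A and B select the same input, (A - B) * C is 0 and the result is just D,
// which already fits in [0:255].
inline bool is_plain_copy(BlendInput a, BlendInput b)
{
    return a == b;
}

// Color clipping wraps the blended value instead of clamping it.
inline int colclip_wrap(int blended)
{
    return blended & 0xFF;
}

// Only pay for the wrap when the blend can actually leave [0:255].
inline bool needs_colclip_pass(bool colclip_enabled, BlendInput a, BlendInput b)
{
    return colclip_enabled && !is_plain_copy(a, b);
}
```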
Much faster for small batches that write the alpha value. Code can
be enabled with accurate_date option.
Here is a summary of all DATE possibilities:
1/ no overlap of primitives
=> texture barrier (pro no setup of stencil and single draw)
2/ alpha written
=> small batch => texture barrier (primitive by primitive). Done in N-primitive draw calls.
(based on GL_ARB_texture_barrier)
=> bigger batch => compute the first good primitive, slow but only 2 draw calls.
(based on GL_ARB_shader_image_load_store)
=> Otherwise there is the UserHacks_AlphaStencil but it is a hack!
3/ alpha written
=> full setup of stencil ( 2 draw calls)
No barrier => draw all primitives
Barrier but without overlap => draw all primitives
Barrier with overlap => draw primitive by primitive
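The three barrier cases above boil down to one decision. A sketch with hypothetical names:

```cpp
// Hypothetical names: how a batch of primitives gets submitted.
enum class DrawMode { AllPrimitives, PrimitiveByPrimitive };

// Only a barrier combined with overlapping primitives forces the slow path.
inline DrawMode select_draw_mode(bool needs_barrier, bool primitives_overlap)
{
    return (needs_barrier && primitives_overlap) ? DrawMode::PrimitiveByPrimitive
                                                 : DrawMode::AllPrimitives;
}

// Number of draw calls issued for a batch of primitive_count primitives.
inline int draw_call_count(DrawMode mode, int primitive_count)
{
    return mode == DrawMode::PrimitiveByPrimitive ? primitive_count : 1;
}
```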
It will ease the implementation of accurate blending, and why not DATE too
Group opengl calls under a nice name.
Apitrace shows them in a tree format that supports folding. Previously it
was a long flat list (10K-40K lines per frame)
I align the call number with the internal s_n variable. This way it is
easy to map GSdx dump output with the GL debugger :)
If there is no overlap, it is allowed to directly read from the render target.
On SotC testcase with 6x scaling: 30fps -> 40fps
Note: it requires GL_ARB_texture_barrier extension so be sure to have a recent driver
Note2: it requires a lot of testing too
Open question: in case of complex DATE (written alpha),
will it be faster to split the draw call into multiple calls with no
primitive overlap?
DATE is implemented in 2 ways.
1/ with stencil
2/ purely in FS (sw)
I kept method 1 to reduce the work on method 2. It sucks for performance.
So it would be either 1 or 2.
Note: DATE has a big impact on higher upscaling
Note2: you can disable the 2nd method with this configuration parameter
override_GL_ARB_shader_image_load_store = 0
UserHacks_UnscaleSprite = 1 will unscale flat sprites
UserHacks_UnscaleSprite = 2 will unscale all sprites (doesn't work well so far)
The idea of the hack is to redo the interpolation of texture coordinate
based on the non-upscaled pixel position.
It avoids various glitches but sprites aren't upscaled anymore (so no
more anti-aliasing, potentially a coefficient can be added).
It is a replacement for the previous hack (UserHacks_stretch_sprite). Don't enable both at the same time!
The idea of the hack is to move the sprite to the pixel boundary. It
avoids most rounding issues. It also rescales the sprite vertically (avoids horizontal lines on Ace Combat).
I don't like this rescaling, maybe we can limit it to only 1 pixel.
On my limited testcase, results are much better with any upscaling factor.
I still have a bad line in Kingdom Hearts. If you have issues with other
games please provide us a GS dump.
2x upscaling is pixel perfect. Bigger upscaling is better but not yet perfect
Feedback is welcome (note it doesn't solve all upscaling issues, only wrong texture sampling)
For the record:
If you have a texture of [0;16[ texels and draw a primitive [0;16[
The formula to sample the last pixel of the texture is
0.5 + (16*s-1)/(16*s) * 16
Native (s==1): 15.5 (good)
2x (s==2): 16 (bad, outside of the texture)
4x (s==4): 16.25 (bad, really outside of the texture)
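The formula above generalized to any texture size and scale, as a quick way to reproduce the three values. The function name is hypothetical.

```cpp
// Texel coordinate sampled for the last pixel of a primitive covering
// [0;texels[ when the render target is upscaled by "scale".
// Same expression as above: 0.5 + (texels*s - 1) / (texels*s) * texels
inline double last_sampled_texel(int texels, int scale)
{
    const double pixels = static_cast<double>(texels) * scale;
    return 0.5 + (pixels - 1.0) / pixels * texels;
}
```

Anything at or above 16 falls outside of a [0;16[ texture, which is what happens as soon as s > 1.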
* Only a single VAO
=> Format is set once
=> Only a single bind at startup
=> GSVertexBufferStateOGL is nearly useless
=> barely faster but better than nothing :)
Add the --gles build option to the linux main script
Ifdef all gl code not supported on gles3 (note some will be reenabled for gles3.1)
Note: it probably doesn't run anymore. My Nvidia driver doesn't support
egl/gles yet so I can't test it. Feel free to contribute.
UserHacks_AlphaStencil will take precedence over override_GL_ARB_shader_image_load_store
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5891 96395faa-99c1-11dd-bbfe-3dabce05a288
Note: DATE is used for shadows (Persona 3) and other effects
You can disable it by disabling the gl image extension (override_GL_ARB_shader_image_load_store = 0)
You can also enable it on AMD (set it to 1 this time) but no guarantee.
If you feel the extension is too slow, you can try to disable some gl barriers (aka "damn the torpedoes, full steam ahead!").
It can be done with the option UserHacks_DateGL4 = 1. Otherwise just disable the extension.
Note: don't enable UserHacks_AlphaStencil at the same time. GL_ARB_shader_image_load_store is an alternative implementation.
Enabling both at the same time will lead to undefined (well, surely wrong) behavior.
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5890 96395faa-99c1-11dd-bbfe-3dabce05a288
* try to use more subroutines on VS&PS, unfortunately hit a driver crash!
* Call Attach/DetachContext through GSDevice so I can unmap currently mapped buffers
* Implement glsl part of GL_ARB_bindless_texture, again hit another driver crash!
* various fixes of GL_ARB_buffer_storage. Basic benchmark shows improvement only on the 'cold' case, I guess it will improve smoothness
* try to fix GL_clear_texture, no success so far. It seems the extension is limited to basic textures (aka no depth/stencil)
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5752 96395faa-99c1-11dd-bbfe-3dabce05a288
* GL_ARB_shader_subroutine for perf
fix for nvidia => add missing shader declaration. Nvidia got +4fps on colin3 :)
For the moment only 2 PS parameters are supported. Code needs to be extended to support other games that often
switch shader program (like xenosaga).
requires GL4 class hardware and the option override_GL_ARB_shader_subroutine = 1
Note: strangely on AMD linux it is slower!
* GL_ARB_shader_image_load_store for accuracy (DATE)
Use a signed integer texture and reenable color buffer writing
Current status: Amagami_transparency.gs & P3_battle_shadows.gs are now working on Nvidia with a small perf impact.
Current implementation detail:
1/ setup the standard stencil as before
2/ on the remaining pixels, draw once to compute the first primitive that will write a failing alpha value.
3/ final draw based on the primitive id of step 2
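A CPU-side model of steps 2 and 3 for a single pixel. It assumes that the primitive writing the failing alpha is itself still drawn and only later primitives are rejected; that detail and the names are my assumptions, not taken from the shader.

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Step 2 (per pixel): writes_failing_alpha[id] tells whether primitive "id"
// would write an alpha value that makes the DATE test fail afterwards.
// Returns the smallest such id, or INT_MAX when no primitive does.
inline int first_failing_primitive(const std::vector<bool>& writes_failing_alpha)
{
    for (std::size_t id = 0; id < writes_failing_alpha.size(); ++id)
        if (writes_failing_alpha[id])
            return static_cast<int>(id);
    return INT_MAX;
}

// Step 3 (per fragment): keep fragments up to and including that primitive
// (assumption), reject everything drawn after it.
inline bool keep_fragment(int primitive_id, int first_failing_id)
{
    return primitive_id <= first_failing_id;
}
```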
Note: I think we would get bad behavior if depth test & mask are enabled on step 2/3
Note2: on my limited testcase the perf impact was on the CPU. It would be possible to merge steps 1 & 2 to nullify it (could
even be faster actually), however it would require more GPU power.
Again requires GL4 class hardware and the option UserHacks_DateGL4 = 1
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5725 96395faa-99c1-11dd-bbfe-3dabce05a288
* some preliminary work to test/benchmark bindless texture in the future (glsl was not yet updated)
Bindless texture allows getting a GPU texture pointer and then setting it directly
in the shader as a basic uniform.
=> no more texture unit selection/validation
=> no more texture validation nor texture hash lookup
3rdparty: update gl header to the latest gl4.4
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5720 96395faa-99c1-11dd-bbfe-3dabce05a288
Cards that support gs:
only a gs to generate a sprite from a line remains. Even dummy gs are costly for the GPU.
Cards that don't support gs:
remove useless copy of color for line and triangle primitives
Note for dx: opengl 3.2 (maybe not gles) supports both flat interpolation
conventions (GL_FIRST_VERTEX_CONVENTION or GL_LAST_VERTEX_CONVENTION). It might
be possible to shuffle the vertex indices to put the last vertex in first position.
- buff[0] = head + 0;
- buff[1] = head + 1;
- buff[2] = head + 2;
+ buff[0] = head + 2;
+ buff[1] = head + 1;
+ buff[2] = head + 0;
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5718 96395faa-99c1-11dd-bbfe-3dabce05a288
The idea was to replace shader program switches by function pointer calls inside
shaders, at least for parameters that often change between draw calls. So far
I only ported atst and colclip. Unfortunately code is "slower" (on GSdx standalone).
For the moment keep the code but disabled.
If I understand correctly, the validation of the program is done in the "driver thread"
but the additional calls are done in the overloaded MTGS thread. Apitrace
profiling shows faster GPU draw calls. Another possibility is that the driver still
needs to validate the draw call because of other state changes.
Here are some stats on colin3 (90 frames):
without subroutine: UseProgram 125246
with subroutine: UseProgram 2906, subroutine 125945 => 3605 extra calls overhead (not
all parameters are ported to subroutine)
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5715 96395faa-99c1-11dd-bbfe-3dabce05a288
Note: I think we can do the same on DX11
Perf wise: on colin mcrae 3 it reduces shader prog setup from 3005 to 2086 per frame. It saves 2 ms of CPU processing (27->29fps)
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5714 96395faa-99c1-11dd-bbfe-3dabce05a288
* move most of gl states into a separate namespace. Extend it to depth/stencil/blend micro state
=> saves 10,000 opengl calls per frame for colin mcrae 3
* Only setup blend state of first drawbuffer
* Don't request a debug context anymore on dev/release build
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5713 96395faa-99c1-11dd-bbfe-3dabce05a288
* preliminary work for GL4.4 extensions (ARB_clear_texture & ARB_multi_bind). Disabled until I get a 4.4 driver
Note: I also plan to use ARB_buffer_storage
* compute texture gl option in the constructor (avoid a couple of switch cases)
* redo texture unit management. Unit 0-2 for shaders, Unit 3 for texture operations. MultiBind will allow to bind
shader input without disturbing texture binding points.
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5711 96395faa-99c1-11dd-bbfe-3dabce05a288
* add a non-working hack: UserHacks_DateGL4, goal was to replace UserHacks_AlphaStencil
+ Detection of good/bad samples is based on the primitive ID variable. However I'm not sure
the behavior is always the same between draw calls... Anyway let's keep a copy of the current
work
* Dump integer texture into text csv
* add gl4.2 ARB_shader_image_load_store extension (needed by UserHacks_DateGL4)
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5707 96395faa-99c1-11dd-bbfe-3dabce05a288
* allow switching renderer with F9
* skip the first frame in the stats of the replayer
* drop msaa. Fxaa and internal resolution will do the job
* move texture attachment from texture object into device object (allows keeping the state sane)
* split the write buffer and attachment setup
* completely split sampler and texture input setup
* redo GSDeviceOGL::CopyOffscreen to avoid an extra copy.
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5704 96395faa-99c1-11dd-bbfe-3dabce05a288
* use gles header file, disable opengl code (mainly GLX, ARB_sso, geometry shader)
* Properly define the function pointers, GLES uses basic linking whereas GL must get the symbols dynamically
* cmake: properly search and set libglesv2.so
* don't use dual source blending => HW renderer works (only misses unimportant FBA)
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5701 96395faa-99c1-11dd-bbfe-3dabce05a288
* replace vertex interface with block interface. It avoids depending on the ARB_sso extension.
* disable geometry shader on Nvidia & Linux. Slower but better than a black screen !
* default logz to 1, avoid some glitches.
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5699 96395faa-99c1-11dd-bbfe-3dabce05a288
* use svnrev.h on linux too
* replace sprintf_s with snprintf (hope it still compiles on Windows)
* init integer with 0 instead of NULL
* various int -> u32/uint32/uint on for loop index
* remove a couple of unused variables
* init a few variables
* disable unused warning results
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5683 96395faa-99c1-11dd-bbfe-3dabce05a288
* Separate state and shader compilation into separate functions
* replace various hash_map by basic arrays
* Compact VertexScale and offset into a single vec4
* add the new option "ogl_vertex_subdata": subdata is faster on FGLRX, tests are welcome on Nvidia drivers
0 => use map/unmap
1 => use subdata
replay: add "linux_replay" option and compute some nice stat (mean, standard deviation)
cmake: recreate shader header at build time
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5682 96395faa-99c1-11dd-bbfe-3dabce05a288
Now the branch is ready to be merged :)
Dear Windows users, if you can, please test that it:
1/ still compiles
2/ still runs on DX
3/ can run with opengl
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl-wnd@5663 96395faa-99c1-11dd-bbfe-3dabce05a288
* factorize sample object creation
* remember frame buffer attachment state
* Use a basic context on EGL. Allows using Mesa 9.1 on AMD GPUs.
* precompile vertex and geometry shaders to avoid benchmark pollution on replay
* Try harder to detect the FGLRX driver on Windows
* various cleanups
Remaining: fix the coordinate system for upscaling
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl-wnd@5651 96395faa-99c1-11dd-bbfe-3dabce05a288
* Emulate Geometry Shader from the CPU.
* add some options to override opengl extension detection
* redo shader interface (again) to compile on the free driver
SW renderer is now working on the free driver.
To test it on your linux box use this cmake option -DEGL_API=TRUE
Note: needs opengl 3.0, I tested with mesa 9.2 git
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl-wnd@5646 96395faa-99c1-11dd-bbfe-3dabce05a288
* port KrossX patch from r5556 to openGL
* add a basic gui entry, would love an additional description
* also add the pointsampler hack but don't activate it yet
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5570 96395faa-99c1-11dd-bbfe-3dabce05a288
* Update project files
* basic compilation fix: include stdafx, s/uint/uint32/
* add selection of the opengl renderer/device in gsopen
Remaining: fix opengl function declaration/initialization
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl-wnd@5505 96395faa-99c1-11dd-bbfe-3dabce05a288
gsdx:
* add some parentheses to shut up a very verbose gcc warning
* adapt ogl to latest sudonim change
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5290 96395faa-99c1-11dd-bbfe-3dabce05a288
* properly delete program and vertex array. Avoid a crash on plugin reload
* reset shader state. Avoid to reuse invalid data on plugin reload
gsdx:
* add a hack to detach/attach the gl context from different threads. Helps to solve some crashes. The best would be to move gpu operations out of gsreadfifo but it would need more work
* implement logz for test purposes (doesn't seem to help)
gsdx replay:
* use default xdg location
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@5289 96395faa-99c1-11dd-bbfe-3dabce05a288
* Use the new map interface/separate texture coordinate inside shader
* support new format on texture
Note: it is quite unstable with various crashes and GL errors but at least it compiles now :p
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl@5094 96395faa-99c1-11dd-bbfe-3dabce05a288
* implement some missing shaders for DATE, invert coordinates like stretch rectangle
* Use glCopyImageSubDataNV nvidia extension to copy image (you need latest AMD drivers)
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl@5086 96395faa-99c1-11dd-bbfe-3dabce05a288
Current goal is to implement the SW renderer with pure opengl instead of SDL.
I plan to use OpenGL 4.2 capabilities (the latest actually) => need libglew1.7 and a Dx11 capable GPU/drivers.
git-svn-id: http://pcsx2.googlecode.com/svn/branches/gsdx-ogl@4970 96395faa-99c1-11dd-bbfe-3dabce05a288
GSdx: Removed OpenGL "support". Nobody showed any interest in getting this working.
GSdx: Removed PS1 GPU support. pcsx2 does not use this and it is unmaintained, likely broken, and frequently confuses intellisense.
GSDumpGUI: Use the correct export for the library name, was using the PS1 version.
If any of the above code is needed in the future, we have this wonderful technology called version control.
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@2754 96395faa-99c1-11dd-bbfe-3dabce05a288
* Software mode seems to work fine. Suspend and resume emulation work nicely, without flaws.
* Hardware DX9 mode suspends but displays only black after resuming.
* Hardware DX10 status is unknown.
git-svn-id: http://pcsx2.googlecode.com/svn/branches/GSopen2@1842 96395faa-99c1-11dd-bbfe-3dabce05a288
- trying the dx10.1-only "gather" shader instruction for palettized lookups ("8-bit texture" mode), saves 4 instructions which isn't much but still... (not tested, don't have ati)
- may fix the intel gma "no output" bug (don't have gma either :P)
- and the usual small code optimizations
git-svn-id: http://pcsx2.googlecode.com/svn/trunk@1549 96395faa-99c1-11dd-bbfe-3dabce05a288