glBlitFramebuffer() does not bypass the scissor test, which meant that
part of texture copies (e.g. XFB) could have been clipped when running
under OpenGL ES, as glCopyImageSubData() is not supported.
floatindex is clamped to the range [0, 9]. For non-negative numbers
floor() is equivalent to trunc(). Truncation happens implicitly when
converting to uint, so the floor() is unnecessary.
Coherent mappings have a lower overhead and less GL codes.
So enables coherent mapping by default for all drivers.
Both Qualcomm and ARM performs very bad with explicit flushing, so this change helps them as well.
AFAIK there was one GPU generation which was slower on coherent mapping: nvidia tesla
So Geforce 200 and 300 series should be tested with this PR before merging.
As this was last tested many years ago, this issue might have been fixed as well.
Those GPUs are close to 10 years old and not supported any more by nvidia.
Using 8-bit integer math here lead to precision loss for depth copies,
which broke various effects in games, e.g. lens flare in MK:DD.
It's unlikely the console implements this as a floating-point multiply
(fixed-point perhaps), but since we have the float round trip in our
EFB2RAM shaders anyway, it's not going to make things any worse. If we
do rewrite our shaders to use integer math completely, then it might be
worth switching this conversion back to integers.
However, the range of the values (format) should be known, or we should
expand all values out to 24-bits first.
This excludes the second argument from template deduction.
Otherwise, it is required to manually cast the second argument to
the ConfigInfo type (because implicit conversions won't work).
e.g. to set the value for a ConfigInfo<std::string> from a string
literal, you'd need a ugly `std::string("yourstring")`.
Also move it to MathUtils where it belongs with the rest of the
power-of-two functions. This gets rid of pollution of the current scope
of any translation unit with b<value> macros that aren't intended to be
used directly.
Also makes y_scale a dynamic parameter for EFB copies, as it doesn't
make sense to keep it as part of the uid, otherwise we're generating
redundant shaders.
RE4's brightness screen is actually very good for spotting these.
Bug 1: Colors at the end of the scanlines are clamped, instead of a black
border
Bug 2: U and V color channels share coordinates, instead of being offset
by a pixel.
Skip ubershader mode works the same as hybrid ubershaders in that the
shaders are compiled asynchronously. However, instead of using the
ubershader to draw the object, it skips it entirely until the
specialized shader is made available.
This mode will likely result in broken effects where a game creates an
EFB copy, and does not redraw it every frame. Therefore, it is not a
recommended option, however, it may result in better performance on
low-end systems.
Depending on which constructor is invoked, m_id or m_compute_program_id
can end up in an uninitialized state. We should ensure that the object
is completely initialized to something deterministic regardless of the
constructor taken.
This replaces usages of the non-standard __FUNCTION__ macro with the standard
mandated __func__ identifier.
__FUNCTION__ is a preprocessor definition that is provided as an
extension by compilers. This was the only convenient option to rely on
pre-C++11. However, C++11 and greater mandate the predefined identifier
__func__, which lets us accomplish the same thing.
The difference between the two, however, is that __func__ isn't a
preprocessor macro, it's an actual identifier that exists at function
scope. The C++17 draft standard (N4659) at section [dcl.fct.def.general]
paragraph 8 states:
"
The function-local predefined variable __func__ is defined as if a
definition of the form
static const char __func__[] = "function-name ";
had been provided, where function-name is an implementation-defined
string. It is unspecified whether such
a variable has an address distinct from that of any other object in the
program.
"
Thankfully, we don't do any macro or string concatenation with __FUNCTION__
that can't be modified to use __func__.
Fixes a crash which could occur in platforms which do not support
buffer_storage, and EFB2RAM is enabled (which indirectly uses the
attributeless buffer).
Yes, this commit is only to blame OSX and Mali. Through the former supports unsynchronized mappings, the latter supports *no* way to stream dynamic data at all. Let's try to make bad news, as they ignore friendly feature requests. Maybe we just need to make more noise...
We would want to improve the granularity here in the future, but for
now, this should avoid any performance loss from switching to the
VideoCommon shader cache.
tl;dr: This PR speedups dolphin on mobiles with the Mali GPU and ES 3.2
drivers by a factor of 10 by using the method with the biggest overhead.
Please keep care not to buy this shit!
The ARM driver team seems to care very well about their customers. But
bad luck, users and open source developers are *not* their customers. So
even device-independent feature requests are just ignored for *years*:
https://community.arm.com/graphics/f/discussions/4645/gl_ext_buffer_storage-support
The bad point, they neither implement any of the other common ways to
stream dynamic content in unextented GL:
- They just ignore the GL_MAP_UNSYNCHRONIZED_BIT flag
- They don't support on-device buffer updates and just stall with
glBufferSubData
It seems like no benchmark is using any dynamic content - and like no
customer cares about anything but benchmarks, or users...
We have a flag to disable the glBufferSubData way, this PR adds the flag
to also disable the unsychronized mapping way. The second one is
available since their ES 3.2 update, but slow as hell.
So how to continue? The last remaining technical way to stream dynamic
content at all is to alloc a new buffer per draw call with glBufferData.
This is very gross, but still a factor 10 speedup compared to stalling
the GPU. Small tests shows that you can expect another 3-5 times speedup
with EXT_buffer_data, so Mali would be on pair with Adreno here. So if
you have bought such a device unfortunately, please try to make noise on
your vendor forums/support and ask for this extension. If you are going
to buy a new mobile, I'd recormend to avoid *any* mobile with a Mali GPU
in it.
We now differentiate between a resize event and surface change/destroyed
event, reducing the overhead for resizes in the Vulkan backend. It is
also now now safe to change the surface multiple times if the video thread
is lagging behind.
This could cause glReadPixels() calls which assume no buffer is bound
(e.g. CPU EFB access) to fail. The problem was limited to devices which
don't support persistent mapping, as the map path is not otherwise.
Both libusbhid (system library) and libhidapi (3rd party library)
provide a function called hid_init. Dolphin was being linked to both.
The WiimoteScannerHidapi constructor was calling hid_init without
arguments. libusbhid's hid_init expects one argument (a file path).
It was being called as if it was defined without arguments, which
resulted in a garbage path being passed in, and because of that,
the Qt GUI was failing to launch with the following error:
'dolphin-emu-qt2: @ : No such file or directory'
The console appears to behave against standard IEEE754 specification
here, in particular around how NaNs are handled. NaNs appear to have no
effect on the result, and are treated the same as positive or negative
infinity, based on the sign bit.
However, when the result would be NaN (inf - inf, or (-inf) - (-inf)),
this results in a completely fogged color, or unfogged color
respectively. We handle this by returning a constant zero for the A
varaible, and positive or negative infinity for C depending on the sign
bits of the A and C registers. This ensures that no NaN value is passed
to the GPU in the first place, and that the result of the fog
calculation cannot be NaN.
It seems it doesn't like modifying inout variables in place - so instead
use a temporary for ocol0/ocol1 and only write them once at the end of
the shader
This will generate one shader per copy format. For now, it is the same
shader with the colmat hard coded. So it should already improve the GPU
performance a bit, but a rewrite of the shader generator is suggested.
Half of the patch is done by linkmauve1:
VideoCommon: Reorganise the shader writes.
Also skips swapping the window system buffers in headless mode, as there
may not be a surface which can be swapped in the first place. Instead,
we call glFlush() at the end of a frame in this case.
Cel-damage uses the color from the lighting stage of the vertex pipeline
as texture coordinates, but sets numColorChans to zero.
We now calculate the colors in all cases, but override the color before
writing it from the vertex shader if numColorChans is set to a lower value.
All file scope variables are able to be made internally linked.
CD3DFont is essentially used as an extension to the utility interface, so
this is able to be made internal as well, removing a global from
external view.
Some lines of code in Dolphin just plainly grabbed the value of
g_ActiveConfig.iEFBScale, which resulted in Auto being treated as
0x rather than the actual automatically selected scale.
Drivers can return VK_ERROR_OUT_OF_DATE_KHR from vkQueuePresentKHR, and
we should resize the image in this case, as well as when getting it back
from vkAcquireNextImageKHR.
These rely on instance state, or are used within instance-based class
member functions, so they should belong to the instance itself instead
of being file statics.
Ideally Common.h wouldn't be a header in the Common library, and instead be renamed to something else, like PlatformCompatibility.h or something, but even then, there's still some things in the header that don't really fall under that label
This moves the version strings out to their own version header that doesn't dump a bunch of other unrelated things into scope, like what Common.h was doing.
This also places them into the Common namespace, as opposed to letting them sit in the global namespace.
SetFormat() is only ever used internally. ResetBuffer() is only
used to implement the VertexManagerBase class interface, so
there's no need to make it protected.
If we allocate a large amount of memory (A), commit a smaller amount,
then allocate memory smaller than allocation A, we will have already
waited for these fences in A, but not used the space. In this case,
don't set m_free_iterator to a position before that which we know is
safe to use, which would result in waiting on the same fence(s) next
time.
Calling vkCmdClearAttachments with a partial rect, or specifying a
render area in a render pass with the load op set to clear can cause the
GPU to lock up, or raise a bounds violation. This only occurs on MSAA
framebuffers, and it seems when there are multiple clears in a single
command buffer. Worked around by back to the slow path (drawing quads)
when MSAA is enabled.
Before this change, we simply fail if the device does not expose one
queue family that supports both graphics and present. Currently this is
fine, since devices tend to lay out their queues in this way. NV, for
instance, tends to have one queue family for all graphics operations and
one more for transfer only. However, it's not a hard requirement, and it
is cheap to use a separate queue, so we might as well.
Currently, this is only the logic op bit, but this will be extended to
the framebuffer fetch/blend modes. In the future, when/if we move to
VideoCommon pipelines, this state will be part of the pipeline UID
anyway, and we can mask it out in the backend by using a two-level map,
so the shaders/programs are shared.