The Metal shader compiler fails to compile the atomic instructions
when operating on individual components of a vector. Spltting it
into four variables shouldn't make any difference for other
platforms, as they are accessed independently.
floatindex is clamped to the range [0, 9]. For non-negative numbers
floor() is equivalent to trunc(). Truncation happens implicitly when
converting to uint, so the floor() is unnecessary.
Tested on a linux Intel Skylake integrated graphics with
blend_func_extended force-disabled, as it's the only platform I have
that doesn't crash with ubershaders and supports fb_fetch
Currently, this is only the logic op bit, but this will be extended to
the framebuffer fetch/blend modes. In the future, when/if we move to
VideoCommon pipelines, this state will be part of the pipeline UID
anyway, and we can mask it out in the backend by using a two-level map,
so the shaders/programs are shared.
The switch statements in these functions appear to get transformed into
an if..else chain on NVIDIA's OpenGL/Vulkan drivers, resulting in lower
performance than the D3D counterparts. Transforming the switch into a
binary tree of ifs can increase performance by up to 20%.