This fixes bounding box shaders failing to compile under Vulkan, due to
differences between GLSL and HLSL in the return value of vector
comparisons and what types these functions accept. I included all() for
the sake of completeness.
At higher resolutions, our bounding box dimensions end up being
slightly larger than original hardware in some cases. This is not
necessarily wrong, it's just an artifact of rendering at a higher
resolution, due to bringing out detail that wouldn't have appeared on
original hardware. It causes a texel to fall partially on what would
have been a single pixel at native resolution, resulting in the
coordinates getting bumped up to the next valid value. In many cases,
these slightly larger bounding boxes are perfectly fine, as games don't
hard-code expected dimensions. It is problematic in Paper Mario TTYD
though, for a somewhat complicated reason.
Paper Mario TTYD frequently uses EFB copies to pre-render a bunch of
animation frames for a character sprite (especially in Chapter 2), so
that it can then render 100 or more of them without bringing the
GameCube to its knees. Based on my observation, the game seems to set
aside a region of memory to store these EFB copies. This region is
obviously fairly small, as the GameCube only has 24MB of RAM. There are
2 rooms in Chapter 2 where you fight a horde of as many as 100 Jabbies,
which are also rendered using EFB copies, so in this room the game ends
up making 130(!) EFB copies just for Puni and Jabbi sprites. This seems
to nearly fill the region of memory it set aside for them.
Unfortunately, our slightly larger bounding boxes at higher resolutions
results in overflowing this memory, causing very strange behavior. Some
EFB copies partially overlap game state, resulting in reading it as a
garbage RGB5A3 texture that constantly changes. Others apparently
somehow trigger a corner case in our persistent buffer mapping, causing
them to partially overwrite earlier EFB copies.
What this change does is only include the screen coordinates that align
with the equivalent native resolution pixel centers, which generally
results in the bounding boxes being more in line with original
hardware. It isn't perfect, but it's enough to fix Paper Mario TTYD's
Jabbi rooms by avoiding the buffer overflow. Notably, it is more
accurate at odd resolutions than at even resolutions. Native resolution
is completely unaffected by this change, as should be the case. This
change may also have a small positive impact on shader performance at
higher resolutions, as there will be less atomic operations performed.
Running the min/max operation on the upside down, quad-rounded pixel
coordinates before inverting them to the standard upper-left origin
produces wrong results. Therefore, we need to do the inversion before
rounding to pixel quads.
Fragment coordinates always have a 0.5 offset from a whole integer, as
that's where the pixel center is on modern GPUs. Therefore, we want to
always round the fragment coordinates down for bounding box
calculations. This also renders the pixel center offset useless, as 0.5
vs ~0.5833333 makes no difference when rounding down.
The SDK seems to write "default" bounding box values before every draw
(1023 0 1023 0 are the only values encountered so far, which happen to
be the extents allowed by the BP registers) to reset the registers for
comparison in the pixel engine, and presumably to detect whether GX has
updated the registers with real values. Handling these writes and
returning them on read when bounding box emulation is disabled or
unsupported, even without computing real values from rendering, seems
to prevent games from corrupting memory or crashing.
This obviously does not fix any effects that rely on bounding box
emulation, but having the game not clobber its own code/data or just
outright crash is a definite improvement.
This fixes rendering issues in Viewtiful Joe (https://bugs.dolphin-emu.org/issues/12525), but it is not entirely hardware accurate, as hardware testing showed other, more complex behavior in this case. However, it should be good enough for our purposes.
Added RAII wrapper around the the JITPageWriteEnableExecuteDisable() and
JITPageWriteDisableExecuteEnable() to make it so that it is harder to forget to
pair the calls in all code branches as suggested by leoetlino.
This commit adds support for compiling Dolphin for ARM on MacOS so that it can
run natively on the M1 processors without running through Rosseta2 emulation
providing a 30-50% performance speedup and less hitches from Rosseta2.
It consists of several key changes:
- Adding support for W^X allocation(MAP_JIT) for the ARM JIT
- Adding the machine context and config info to identify the M1 processor
- Additions to the build system and docs to support building universal binaries
- Adding code signing entitlements to access the MAP_JIT functionality
- Updating the MoltenVK libvulkan.dylib to a newer version with M1 support
The GC/Wii GPU rasterizes in 2x2 pixel groups, so bounding box values
will be rounded to the extents of these groups, rather than the exact
pixel. To account for this, we'll round the top/left down to even and
the bottom/right up to odd. I have verified that the values resulting
from this change exactly match a real Wii.
The STL has everything we need nowadays.
I have tried to not alter any behavior or semantics with this
change wherever possible. In particular, WriteLow and WriteHigh
in CommandProcessor retain the ability to accidentally undo
another thread's write to the upper half or lower half
respectively. If that should be fixed, it should be done in a
separate commit for clarity. One thing did change: The places
where we were using += on a volatile variable (not an atomic
operation) are now using fetch_add (actually an atomic operation).
Tested with single core and dual core on x86-64 and AArch64.
Previously we set the texture coordinate to zero, now we set
the texture coordinate *index* to zero. This fixes the ripple
effect of the Mario painting in Luigi's Mansion.
This change should have no behavioral differences itself, but allows for changing the behavior of out of bounds tex coord indices more easily in the next commit. Without this change, returning tex0 for out of bounds cases and then applying the fixed-point logic would use the wrong tex dimension info (tex0 with I_TEXDIMS[1] or such), which is inaccurate.
Previously we set the texture coordinate to zero, now we set
the texture coordinate *index* to zero. This fixes the ripple
effect of the Mario painting in Luigi's Mansion.
Co-authored-by: Pokechu22 <Pokechu022@gmail.com>
It is no longer relevant for the current set of loaders after 7030542546. If it becomes relevant again, a static function named IsUsable or IsCompatibleWithCurrentMachine or something would be a better approach.
-Add pause state to FPSCounter.
-Add ability to have more than one "OnStateChanged" callback in core.
-Add GetActualEmulationSpeed() to Core. Returns 1 by default. It's used by my input PRs.
-They might have never drawn if DrawMessages wasn't called before they actually expired
-Their fade was wrong if the duration of the message was less than the fade time
This makes them much more useful for debugging, I know there might be other means
of debugging like logs and imgui, but this was the simplest so that's what I used.
If you want to print the same message every frame, but with a slightly different value
to see the changes, it now work.
To compensate for the fact that they are now always rendered once,
so on start up a lot of old messages (printed while the emulation was off) could show up,
I've added a "drop" time, which means if a msg isn't rendered for the first
time within that time, it will be dropped and never rendered.
VideoCommon: Change the type of BPMemory.scissorOffset to 10bit signed: S32X10Y10
VideoBackends: Fix Software Clipper.PerspectiveDivide function, use BPMemory.scissorOffset instead of hard code 342
texture serialization and deserialization used to involve many memory
allocations and deallocations, along with many copies to and from
those allocations. avoid those by reserving a memory region inside the
output and writing there directly, skipping the allocation and copy to
an intermediate buffer entirely.
This avoids some warnings, which were originally fixed by ignoring loads with a value of zero (see 636bedb207 / #3242).
Note that FifoCI will report some changes, but only on the first frame; these seem to be timing related as they don't happen if a different write is used to replace skipped ones.
They appear to relate to perf queries, and combining them with truely unknown commands would probably hide useful information. Furthermore, 0x20 is issued by every title, so without this every title would be recorded as using an unknown command, which is very unhelpful.
debaf63fe8 moved the "Sonic epsilon hack"
to vertex shaders. However, it was only done for targets with depth
clamping. If this is not available, for example the target is OpenGL ES,
the Sonic problem appears (https://bugs.dolphin-emu.org/issues/11897).
A version of the "Sonic epsilon hack" is added for targets without
depth clamping.
BPMEM_TEV_COLOR_ENV + 6 (0xC6) was missing due to a typo. BPMEM_BP_MASK (0xFE) does not lend itself well to documentation with the current FIFO analyzer implementation (since it requires remembering the values in BP memory) but still shouldn't be treated as unknown. BPMEM_TX_SETMODE0_4 and BPMEM_TX_SETMODE1_4 (0xA4-0xAB) were missing entirely.
Additional changes:
- For TevStageCombiner's ColorCombiner and AlphaCombiner, op/comparison and scale/compare_mode have been split as there are different meanings and enums if bias is set to compare. (Shift has also been renamed to scale)
- In TexMode0, min_filter has been split into min_mip and min_filter.
- In TexImage1, image_type is now cache_manually_managed.
- The unused bit in GenMode is now exposed.
- LPSize's lineaspect is now named adjust_for_aspect_ratio.
Additionally, VCacheEnhance has been added to UVAT_group1. According to YAGCD, this field is always 1.
TVtxDesc also now has separate low and high fields whose hex values correspond with the proper registers, instead of having one 33-bit value. This change was made in a way that should be backwards-compatible.
We want to clear/memset the padding bytes, not just each member,
so using assignment or {} initialization is not an option.
To silence the warnings, cast the object pointer to u8* (which is not
undefined behavior) to make it explicit to the compiler that we want
to fill the object representation.
Cel-damage depends on lighting being calculated for the first channel
even though there is no color in the vertex format (defaults to the
material color). If lighting for the channel is not enabled, the vertex
will use the default color as before.
The default value of the color is determined by the number of elements in
the vertex format. This fixes the grey cubes in Super Mario Sunshine.
If the color channel count is zero, we set the color to black before the
end of the vertex shader. It's possible that this would be undefined
behavior on hardware if a vertex color index that was greater than the
channel count was used within TEV.
Now that we've converted all of the shader generators over to using fmt,
we can drop the old Write() member function and perform a rename
operation on the WriteFmt() to turn it into the new Write() function.
All changes within this are the removal of a <cstdarg> header, since the
previous printf-based Write() required it, and renaming. No functional
changes are made at all.
Completes the migration over to using the fmt-formatting WriteFmt
function. The next PR will rename all usages of WriteFmt, while
simultaneously getting rid of the old printf code.
Replace it with a function-local static that is initialized on first
use. This gets rid of a global variable and removes the need for
manual initialization in UICommon.
This commit also replaces the weird find_if that looks for a non-null
unique_ptr with a simple "is vector empty" check considering that
none of the pointers can be null by construction.
Continues migration of the shader generators over to fmt.
With this, all that's left to move over are the pixel shaders (regular
and ubershader variants)
Adding AmdPowerXpressRequestHighPerformance
This will allow AMD drivers to detect the request to use the dGPU instead of the iGPU on compatible hybrid graphics systems.
Reference: https://community.amd.com/thread/169965
Fixes https://bugs.dolphin-emu.org/issues/12245.
I considered making a change to DolphinQt instead of
the core, but then additional effort would've been
required to add the same fix to the Android GUI once
we start using the new config system there.
This is now unused. Seems like it was an improper fix
(there would be a race if saving the screenshot took longer
than 2 seconds) back when it was used too.
Other than the controller settings and JIT debug settings,
these are the only settings which were defined in Java code
but not defined in the new config system in C++. (There are
still a lot of settings that are defined in the new config
system but not yet saveable in the new config system, though.)
While manually capturing constexpr variables used in lambda
expressions does work, it's really easy to forget doing so since
we don't have a Windows CMake builder and the workaround isn't
necessary anywhere else. Fortunately, MSVC has a flag that fixes
the constexpr capture behavior, so let's use that instead.
These are only ever used with ShaderCode instances and nothing else.
Given that, we can convert these helper functions to expect that type of
object as an argument and remove the need for templates, improving
compiler throughput a marginal amount, as the template instantiation
process doesn't need to be performed.
We can also move the definitions of these functions into the cpp file,
which allows us to remove a few inclusions from the ShaderGenCommon
header. This uncovered a few instances of indirect inclusions being
relied upon in other source files.
One other benefit is this allows changes to be made to the definitions
of the functions without needing to recompile all translation units that
make use of these functions, making change testing a little quicker.
Moving the definitions into the cpp file also allows us to completely
hide DefineOutputMember() from external view, given it's only ever used
inside of GenerateVSOutputMembers().
A very trivial conversion, this simply converts calls to Write over to
WriteFmt and adjusts the formatting specifiers as necessary.
This also allows the const char* parameters to become std::string_view
instances, allowing for ease of use with other string types.
Lambda expressions with uncaptured constants were leading to errors,
and there were also some warnings about deprecated functions
(QFontMetrics::width and inet_ntoa).
It actually maps to postMtxInfo, not posMtxInfo (which isn't a thing).
This is especially confusing because there *are* position matrices (as
opposed to post-transform matrices).
It used to be the case that frame advance skipped duplicate frames
(i.e. it would take 30 frame advances to get through one second
of emulated time in a 30 fps game), but this broke in 9c5c3c0.
Skipping duplicate frames making TASing less annoying.
Hardware tests have shown that if the number of texgens/channels do not
match, you get garbage rendering. Presumably because the output
registers from the XF stage are fed into the incorrect input registers
for TEV/BP.
Currently, this causes Dolphin to crash/generate invalid shaders with an
assertion failure in the hardware backends. Instead, we log an error.
Perhaps in the future we should just spit out all texgens/colors anyway
from both stages, and let cross-stage optimization take care of DCE'ing
it away. But doing so would require changing the UIDs and invalidating
everyone's shader caches.
It seems that the newer version of fmt gets tripped up by bitfields
within structs. However, we can just specify the intended type where
necessary to get around this.
Migrates the shader generator off the use of a global array, eliminating
the use of some global state. This also allows us to move the shader
generation over to using fmt in a subsequent change.
Currently, we do not display every second frame in 25fps/30fps games
which run to vsync. This improves performance as there's less rendering
for the GPU to perform, but when combined with vsync, could cause frame
pacing issues.
This commit adds an option to force every frame generated by the console
to be displayed to the host, which may improve pacing for these games.
The frame number is incremented before the first frame is swapped out.
Fixes ffmpeg creating invalid video files on output if the emulator only
runs for a single frame, e.g. FifoCI.
Now that we have an actual interface to manage things, we can stop
duplicating the calls to to the pixel shader manager and remove the
need to remember to actually do so when disabling or enabling the
bounding box.
Rather than expose the bounding box members directly, we can instead
provide an interface for code to use. This makes it nicer to transition
from global data, as the interface function names are already in
place.
Now that we've extracted all of the stateless functions that can be
hidden, it's time to make the index generator a regular class with
active data members.
This can just be a member that sits within the vertex manager base
class. By deglobalizing the state of the index generator we also get rid
of the wonky dual-initializing that was going on within the OpenGL
backend.
Since the renderer is always initialized before the vertex manager, we
now only call Init() once throughout the execution lifecycle.
We can use if constexpr with the template functions that pass in a
non-type template parameter, allowing the removal of branches that
aren't taken at compile time.
Compilers will generally do this by default, however, we now give a
gentle prodding to the compiler if this would otherwise not be the case.
These don't rely on any of the static members within the IndexGenerator
class, so we can make all of these functions fully internal to the
translation unit.
We can make use of if constexpr in several scenarios here to allow
compilers to exise the relevant code paths out.
Technically a decent compiler would do this already, but now we can give
compilers a little more nudging here in the event that isn't the case.
cmd2 is a u32, so any bitwise arithmetic on it with a type of the same
size or smaller will result in a u32 value. This is also implicitly
converted to an unsigned type in the if statement as well, given that
size_t * int -> size_t.
This is just more explicit about the operations occurring and also
likely silences a sign conversion warning.
We only use these string streams to output into a final std::string
instance, we don't read into types with them. Because of this, we can
just make use of std::ostringstream, rather than the fully-fledged
std::stringstream.
No behavioral change. This is intended to make the transition to fmt
less noisy in subsequent changes by combining insertions of multiple
string literals into one where applicable.
Begins the conversion of the shader generators over to using fmt
formatting specifiers.
This also has a benefit over the older StringFromFormat-based API in
that all formatted data is appended to the existing buffer rather than
creating a completely separate string and then appending it to the
internal string buffer.
Migrates most of VideoCommon over to using fmt, with the exception being
the shader generator code. The shader generators are quite large and
have more corner cases to deal with in terms of conversion (shaders have
braces in them, so we need to make sure to escape them).
Because of the large amount of code that would need to be converted, the
conversion of VideoCommon will be in two parts:
- This change (which converts over the general case string formatting),
- A follow up change that will specifically deal with converting over
the shader generators.
OpenGL doesn't render to a 2-layer backbuffer like D3D/Vulkan for quad-buffered
stereo, instead drawing twice with the eye selected by glDrawBuffer()
(see OGL::Renderer::RenderXFBToScreen).
This is a giant hack which was previously removed because it causes
broken rendering. However, it seems that some devices still do not
support logical operations (looking at you, Adreno/Mali). Therefore, for
a handful of cases where the hack actually makes things slightly better,
we can use it.
... but not without spamming the log with warnings. With my warning
message PR, we can inform the users before emulation starts anyway.
Same behavior without hardcoding the type of the mutex within the lock
guards. This means the type of the mutex would be able to be changed
without needing to also change all occurrences lock guards are used.
Allows callers to std::move strings into the functions (or automatically
assume the move constructor/move assignment operator for rvalue
references, potentially avoiding copies altogether.
This header doesn't actually make use of MathUtil.h within itself, so
this can be removed. Many other source files used VideoCommon.h as an
indirect include to include MathUtil.h, so these includes can also be
adjusted.
While we're at it, we can also migrate valid inclusions of VideoCommon.h
into cpp files where it can feasibly be done to minimize propagating it
via other headers.
This isn't used anywhere, so it can be removed. This also potentially
fixes an underlying compilation error waiting to happen, given DECSTAT
could have potentially been used, someone disables statistics (for
whatever reason), then gets a compilation error due to the #else case
not containing an empty definition of DECSTAT.
Makes the global variable follow our convention of prefixing g_ on
global variables to make it obvious in surrounding code that it's not a
local variable.
Rather than making Statistics' member functions operate on the global
variable instance of itself, we can make these functions member
functions and operate on a by-instance state, removing the direct
dependency on the global variable itself.
This also makes for less reading, as there's no need to repeat "stats."
for all variable accesses.
Normalizes all variables related to statistics so that they follow our
coding style.
These are relatively low traffic areas, so this modification isn't too
noisy.
Now that the floating point members are assigned in bulk, we can remove
their setter macro. While we're at it, we can also remove the setter for
unsigned int, given it's not used.
ImGui::Text() assumes that the incoming text is intended to be
formatted, but we don't actually use it to format anything. We can be
explicit by using the relevant function.
This also has a plus of not needing to go through the formatter itself,
but the gains from that are probably minimal.
This fixes the problem where OBS game capture only grabs the region
inside an ImGui window whenever one is open, when using the OpenGL
backend. Shouldn't have any negative effects, as the scissor would've
been something completely arbitrary anyways.
This may affect other capture software that uses the same hooking
method, but I've only tested OBS.
Previously, this array potentially wouldn't be placed within the
read-only segment, since it wasn't marked const. We can make the lookup
table const, along with any other nearby variables.
Avoids dragging in IniFile, EXI device and SI device headers in this header which is
quite widely used throughout the codebase.
This also uncovered a few cases where indirect inclusions were being
relied upon, which this also fixes.
Since C++17, non-member std::size() is present in the standard library
which also operates on regular C arrays. Given that, we can just replace
usages of ArraySize with that where applicable.
In many cases, we can just change the actual C array ArraySize() was
called on into a std::array and just use its .size() member function
instead.
In some other cases, we can collapse the loops they were used in, into a
ranged-for loop, eliminating the need for en explicit bounds query.
We can use u32 instead of unsigned int to shorten up these definitions
and make them much nicer to read.
While we're at it, change the size array to house u32 elements
to match the return value of the function.
We can use u32 instead of unsigned int to shorten up these definitions
and make them much nicer to read.
While we're at it, change the size array to house u32 elements to match
the return value of the function.
This is only ever used to retrieve a raw view of the given UID data
structure, however it's already valid C++ to retrieve a char/unsigned
char view of an object for bytewise inspection.
u8 maps to unsigned char on all platforms we support, so we can just do
this directly with a reinterpret cast, simplifying the overall
interface.
Zero-initialization zeroes out all members and padding bits, so this is
safe to do. While we're at it, also add static assertions that enforce
the necessary requirements of a UID type explicitly within the ShaderUid
class.
This way, we can remove several memset calls around the shader
generation code that makes sure the underlying UID data is zeroed out.
Now our ShaderUid class enforces this for us, so we don't need to care about
it at the usage sites.
Greatly simplifies the overall interface when it comes to compiling
shaders. Also allows getting rid of a std::string overload of the same
name. Now std::string and const char* both go through the same function.
Makes VertexLoader_Normal completely stateless, eliminating the need for
an Init() function, and by extension, also gets rid of the need for the
FifoAnalyzer to have an Init() function.
Many of the arrays defined within this file weren't declared as
immutable, which can inhibit the strings being put into the read-only
segment. We can declare them constexpr to make them immutable.
While we're at it, we can use std::array, to allow bounds conditional
bounds checking with standard libraries. The declarations can also be
shortened in the future when all platform toolchain versions we use
support std::array deduction guides. Currently macOS and FreeBSD
builders fail on them.
Ensures that the destruction logic is kept local to the translation unit
(making it nicer when it comes to forward declaring non-trivial types).
It also doesn't really do much to define it in the header.
Given we're simply storing the std::string into a deque. We can emplace
it and move it. Completely avoiding copies with the current usage of the
function.