# FastMM4-AVX

FastMM4-AVX (efficient synchronization and AVX1/AVX2/AVX512/ERMS/FSRM support for FastMM4)
 - Copyright (C) 2017-2020 Ritlabs, SRL. All rights reserved.
 - Copyright (C) 2020-2023 Maxim Masiutin. All rights reserved.

Written by Maxim Masiutin <maxim@masiutin.com>

Version 1.0.7

This is a fork of the "Fast Memory Manager" (FastMM) v4.993 by Pierre le Riche
(see below for the original FastMM4 description)

What was added to FastMM4-AVX in comparison to the original FastMM4:

- Efficient synchronization
  - improved synchronization between the threads; proper synchronization
    techniques are used depending on context and availability, e.g., pause-based
    spin-wait loops, umonitor/umwait (WaitPKG), SwitchToThread, or
    critical sections;
  - used the "test, test-and-set" technique for the spin-wait loops; this
    technique is recommended by Intel (see Section 11.4.3 "Optimization with
    Spin-Locks" of the Intel 64 and IA-32 Architectures Optimization Reference
    Manual) to determine the availability of the synchronization variable;
    according to this technique, the first "test" is done via a normal
    (non-locking) memory load to prevent excessive bus locking on each
    iteration of the spin-wait loop; if the variable is available upon
    the normal memory load of the first step ("test"), proceed to the
    second step ("test-and-set"), which is done via the bus-locking atomic
    "xchg" instruction; however, this two-step approach of using "test" before
    "test-and-set" can increase the cost of the uncontended case compared
    to a single-step "test-and-set"; this may explain why the speed benefits
    of FastMM4-AVX are more pronounced when the memory manager is called
    from multiple threads in parallel, while in a single-threaded scenario
    there may be no benefit compared to the original FastMM4;
  - the number of iterations of "pause"-based spin-wait loops is 5000,
    before relinquishing to SwitchToThread();
  - see https://stackoverflow.com/a/44916975 for more details on the
    implementation of the "pause"-based spin-wait loops;
  - using normal memory store to release a lock:
    FastMM4-AVX uses a normal memory store, i.e., the "mov" instruction, rather
    than the bus-locking "xchg" instruction to write into the synchronization
    variable (LockByte) to "release a lock" on a data structure,
    see https://stackoverflow.com/a/44959764
    for discussion on releasing a lock;
    you may define "InterlockedRelease" to get the old behavior of the original
    FastMM4;
  - implemented dedicated lock and unlock procedures that operate on
    synchronization variables (LockByte);
    before that, locking operations were scattered throughout the code;
    now the locking functions have meaningful names:
    AcquireLockByte and ReleaseLockByte;
    the values of the lock byte are now checked for validity when
    FullDebugMode or DEBUG is defined, to detect cases when the same lock is
    released twice, and other improper use of the lock bytes;
  - added compile-time options "SmallBlocksLockedCriticalSection",
    "MediumBlocksLockedCriticalSection" and "LargeBlocksLockedCriticalSection"
    which are set by default (inside the FastMM4Options.inc file) as
    conditional defines. If you undefine these options, you will get the
    old locking mechanism of the original FastMM4 based on loops of Sleep() or
    SwitchToThread().

- AVX, AVX2 or AVX512 instructions for faster memory copy
  - if the CPU supports AVX or AVX2, use the 32-byte YMM registers
    for faster memory copy, and if the CPU supports AVX-512,
    use the 64-byte ZMM registers for even faster memory copy;
  - please note that the speed improvement from using AVX instructions is
    negligible compared to the effect of efficient synchronization;
    sometimes AVX instructions can even slow down the program because of AVX-SSE
    transition penalties and reduced CPU frequency caused by AVX-512
    instructions in some processors; use DisableAVX to turn AVX off completely
    or use DisableAVX1/DisableAVX2/DisableAVX512 to exclude a particular
    AVX-related instruction set from being compiled;
  - if EnableAVX is defined, all memory blocks are aligned by 32 bytes, but
    you can also use the Align32Bytes define without AVX; please note that the
    memory overhead is higher when the blocks are aligned by 32 bytes, because
    some memory is lost to padding; however, if your CPU supports
    "Fast Short REP MOVSB" (Ice Lake or newer), you can disable AVX and align
    by just 8 bytes, and this may even be faster because less memory is wasted
    on alignment;
  - with AVX, memory copy is secure - all XMM/YMM/ZMM registers used to copy
    memory are cleared by vxorps/vpxor, so the leftovers of the copied memory
    are not exposed in the XMM/YMM/ZMM registers;
  - the code attempts to properly handle AVX-SSE transitions to not incur the
    transition penalties: it only calls vzeroupper under AVX1, but not under
    AVX2, since it slows down subsequent SSE code under Skylake / Kaby Lake;
  - on AVX-512, writing to xmm16-xmm31 registers will not affect the turbo
    clocks, and will not impose AVX-SSE transition penalties; therefore, when we
    have AVX-512, we now only use x(y/z)mm16-31 registers.

- Speed improvements due to code optimization and proper techniques
  - if the CPU supports Enhanced REP MOVSB/STOSB (ERMS), use this feature
    for faster memory copy (under 32-bit or 64-bit) (see the EnableERMS define,
    on by default, use DisableERMS to turn it off);
  - if the CPU supports Fast Short REP MOVSB (FSRM), use this feature instead
    of AVX;
  - branch target alignment in assembly routines is only used when
    EnableAsmCodeAlign is defined; Delphi incorrectly encodes conditional
    jumps, i.e., uses long, 6-byte instructions instead of short, 2-byte ones,
    and this may affect branch prediction, so the benefits of branch target
    alignment may not outweigh the disadvantage of affected branch prediction,
    see https://stackoverflow.com/q/45112065
  - compare instructions + conditional jump instructions are put together
    to allow macro-op fusion (which happens since Core2 processors, when
    the first instruction is a CMP or TEST instruction and the second
    instruction is a conditional jump instruction);
  - multiplication and division by a constant that is a power of 2 are
    replaced with shl/shr, because the Delphi64 compiler doesn't replace such
    multiplications and divisions with shl/shr processor instructions,
    and, according to the Intel Optimization Reference Manual, shl/shr is
    faster than imul/idiv, at least for some processors.

- Safer, cleaner code with stricter type adherence and better compatibility
  - names assigned to some constants that used to be "magic constants",
    i.e., unnamed numerical constants - plenty of them were present
    throughout the whole code;
  - removed some typecasts; the code is stricter to let the compiler
    do the job, check everything and mitigate probable errors. You can
    even compile the code with "integer overflow checking" and
    "range checking", as well as with "typed @ operator" - for safer
    code. Also added round brackets in the places where the typed @ operator
    was used, to better emphasize whose address is taken;
  - the compiler environment is more flexible now: you can now compile FastMM4
    with, for example, typed "@" operator or any other option. Almost all
    externally-set compiler directives are honored by FastMM except a few
    (currently just one) - look for the "Compiler options for FastMM4" section
    below to see what options cannot be externally set and are always
    redefined by FastMM4 for itself - even if you set up these compiler options
    differently outside FastMM4, they will be silently
    redefined, and the new values will be used for FastMM4 only;
  - the type of one-byte synchronization variables (accessed via "lock cmpxchg"
    or "lock xchg") changed from Boolean to Byte for stricter type checking;
  - those fixed-block-size memory move procedures that are not needed
    (under the current bitness and alignment combinations) are
    explicitly excluded from compiling, to not rely on the compiler
    that is supposed to remove these functions after compilation;
  - added a length parameter to what were dangerous null-terminated string
    operations via PAnsiChar, to prevent potential stack buffer overruns
    (or maybe even stack-based exploitation?); some Pascal functions
    remain where the argument is not yet checked. See the "todo" comments
    to figure out where the length is not yet checked. Anyway, since these
    memory functions are only used in Debug mode, i.e., in a development
    environment, not in Release (production), the impact of this
    "vulnerability" is minimal (albeit this is a questionable statement);
  - removed all non-US-ASCII characters, to avoid using UTF-8 BOM, for
    better compatibility with very early versions of Delphi (e.g., Delphi 5),
    thanks to Valts Silaputnins;
  - support for Lazarus 1.6.4 with FreePascal (the original FastMM4 4.992
    requires modifications, it doesn't work under Lazarus 1.6.4 with FreePascal
    out-of-the-box); also tested under Lazarus 1.8.2 / FPC 3.0.4 with Win32
    target; later versions should also be supported.

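The "test, test-and-set" acquire and the plain-store release described under "Efficient synchronization" can be sketched roughly as follows. This is an illustration only, not the FastMM4-AVX source: it assumes Delphi on Windows, uses an Integer with `InterlockedExchange` where the library uses a Byte with inline assembly, and omits the "pause" instruction. The names `DemoLock`, `AcquireDemoLock` and `ReleaseDemoLock` are hypothetical.

```pascal
program SpinLockSketch; // illustrative only, not the FastMM4-AVX source

{$APPTYPE CONSOLE}

uses
  Windows; // InterlockedExchange, SwitchToThread (Winapi.Windows in newer Delphi)

const
  cUnlocked = 0;
  cLocked = 1;
  cSpinLimit = 5000; // iteration count quoted in the list above

var
  DemoLock: Integer = cUnlocked; // hypothetical; FastMM4-AVX uses a Byte (LockByte)

procedure AcquireDemoLock(var ALock: Integer);
var
  Spins: Integer;
begin
  Spins := 0;
  while True do
  begin
    // Step 1, "test": ordinary load, no bus locking.
    if ALock = cUnlocked then
      // Step 2, "test-and-set": atomic exchange; we own the lock only if
      // the previous value was cUnlocked.
      if InterlockedExchange(ALock, cLocked) = cUnlocked then
        Exit;
    // The real code executes the "pause" instruction here (inline assembly).
    Inc(Spins);
    if Spins >= cSpinLimit then
    begin
      SwitchToThread; // give up the time slice after too many spins
      Spins := 0;
    end;
  end;
end;

procedure ReleaseDemoLock(var ALock: Integer);
begin
  ALock := cUnlocked; // plain store ("mov"), not "xchg"
end;

begin
  AcquireDemoLock(DemoLock);
  Writeln('lock held: ', DemoLock = cLocked);
  ReleaseDemoLock(DemoLock);
  Writeln('lock released: ', DemoLock = cUnlocked);
end.
```

The plain load in step 1 lets waiting threads spin without locking the bus; only a thread that sees the lock free pays for the atomic exchange. A production version would also need to guarantee that the compiler re-reads the variable on every iteration, which is one reason the library does this in assembly.
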
Here is the comparison of the Original FastMM4 version 4.992, with default
options compiled for Win64 by Delphi 10.2 Tokyo (Release with Optimization),
and the current FastMM4-AVX branch ("AVX-br."). Under some multi-threading
scenarios, the FastMM4-AVX branch is more than twice as fast compared to the
Original FastMM4. The tests have been run on two different computers: one
under Xeon E5-2543v2 with 2 CPU sockets, each has 6 physical cores
(12 logical threads) - with only 5 physical cores per socket enabled for the
test application. Another test was done under an i7-7700K CPU.

Used the "Multi-threaded allocate, use and free" and "NexusDB"
test cases from the FastCode Challenge Memory Manager test suite,
modified to run under 64-bit.

In the tables that follow, Ratio is the AVX-br. value divided by the Orig.
value; a ratio below 100% means FastMM4-AVX was faster.

The "Xeon" columns are for Xeon E5-2543v2 2*CPU (allocated 20 logical threads,
10 physical cores, NUMA), AVX-1; the "i7" columns are for the i7-7700K CPU
(8 logical threads, 4 physical cores), AVX-2.

| Test               | Xeon Orig. | Xeon AVX-br. | Xeon Ratio | i7 Orig. | i7 AVX-br. | i7 Ratio |
|--------------------|-----------:|-------------:|-----------:|---------:|-----------:|---------:|
| 02-threads realloc |      96552 |        59951 |     62.09% |    65213 |      49471 |   75.86% |
| 04-threads realloc |      97998 |        39494 |     40.30% |    64402 |      47714 |   74.09% |
| 08-threads realloc |      98325 |        33743 |     34.32% |    64796 |      58754 |   90.68% |
| 16-threads realloc |     116273 |        45161 |     38.84% |    70722 |      60293 |   85.25% |
| 31-threads realloc |     122528 |        53616 |     43.76% |    70939 |      62962 |   88.76% |
| 64-threads realloc |     137661 |        54330 |     39.47% |    73696 |      64824 |   87.96% |
| NexusDB 02 threads |     122846 |        90380 |     73.72% |    79479 |      66153 |   83.23% |
| NexusDB 04 threads |     122131 |        53103 |     43.77% |    69183 |      43001 |   62.16% |
| NexusDB 08 threads |     124419 |        40914 |     32.88% |    64977 |      33609 |   51.72% |
| NexusDB 12 threads |     181239 |        55818 |     30.80% |    83983 |      44658 |   53.18% |
| NexusDB 16 threads |     135211 |        62044 |     43.61% |    59917 |      32463 |   54.18% |
| NexusDB 31 threads |     134815 |        48132 |     33.46% |    54686 |      31184 |   57.02% |
| NexusDB 64 threads |     187094 |        57672 |     30.25% |    63089 |      41955 |   66.50% |

The above tests have been run on 14-Jul-2017.

Here are some more test results (Compiled by Delphi 10.2 Update 3):

The "Xeon" columns are for Xeon E5-2667v4 2*CPU (allocated 32 logical threads,
16 physical cores, NUMA), AVX-2; the "i9" columns are for the i9-7900X CPU
(20 logical threads, 10 physical cores), AVX-512.

| Test               | Xeon Orig. | Xeon AVX-br. | Xeon Ratio | i9 Orig. | i9 AVX-br. | i9 Ratio |
|--------------------|-----------:|-------------:|-----------:|---------:|-----------:|---------:|
| 02-threads realloc |      80544 |        60025 |     74.52% |    66100 |      55854 |   84.50% |
| 04-threads realloc |      80751 |        47743 |     59.12% |    64772 |      40213 |   62.08% |
| 08-threads realloc |      82645 |        32691 |     39.56% |    62246 |      27056 |   43.47% |
| 12-threads realloc |      89951 |        43270 |     48.10% |    65456 |      25853 |   39.50% |
| 16-threads realloc |      95729 |        56571 |     59.10% |    67513 |      27058 |   40.08% |
| 31-threads realloc |     109099 |        97290 |     89.18% |    63180 |      28408 |   44.96% |
| 64-threads realloc |     118589 |       104230 |     87.89% |    57974 |      28951 |   49.94% |
| NexusDB 01 thread  |     160100 |       121961 |     76.18% |    93341 |      95807 |  102.64% |
| NexusDB 02 threads |     115447 |        78339 |     67.86% |    77034 |      70056 |   90.94% |
| NexusDB 04 threads |     107851 |        49403 |     45.81% |    73162 |      50039 |   68.39% |
| NexusDB 08 threads |     111490 |        36675 |     32.90% |    70672 |      42116 |   59.59% |
| NexusDB 12 threads |     148148 |        46608 |     31.46% |    92693 |      53900 |   58.15% |
| NexusDB 16 threads |     111041 |        38461 |     34.64% |    66549 |      37317 |   56.07% |
| NexusDB 31 threads |     123496 |        44232 |     35.82% |    62552 |      34150 |   54.60% |
| NexusDB 64 threads |     179924 |        62414 |     34.69% |    83914 |      42915 |   51.14% |

The above tests (on Xeon E5-2667v4 and i9) have been done on 03-May-2018.

Here is the single-threading performance comparison in some selected
scenarios between FastMM v5.03 dated May 12, 2021 and FastMM4-AVX v1.05
dated May 20, 2021. FastMM4-AVX is compiled with default options. This
test was run on May 20, 2021, under Intel Core i7-1065G7 CPU, Ice Lake
microarchitecture, base frequency: 1.3 GHz, max turbo frequency: 3.90 GHz,
4 cores, 8 threads. Compiled under Delphi 10.3 Update 3, 64-bit target.
Please note that these are the selected scenarios where FastMM4-AVX is
faster than FastMM5. In other scenarios, especially in multi-threaded
with heavy contention, FastMM5 is faster.

| Test                                    | FastMM5 | AVX-br. |  Ratio |
|-----------------------------------------|--------:|--------:|-------:|
| ReallocMem Small (1-555b) benchmark     |    1425 |    1135 | 79.65% |
| ReallocMem Medium (1-4039b) benchmark   |    3834 |    3309 | 86.31% |
| Block downsize                          |   12079 |   10305 | 85.31% |
| Address space creep benchmark           |   13283 |   12571 | 94.64% |
| Address space creep (larger blocks)     |   16066 |   13879 | 86.39% |
| Single-threaded reallocate and use      |    4395 |    3960 | 90.10% |
| Single-threaded tiny reallocate and use |    8766 |    7097 | 80.96% |
| Single-threaded allocate, use and free  |   13912 |   13248 | 95.23% |

You can find the program used to generate the benchmark data
at https://github.com/maximmasiutin/FastCodeBenchmark

FastMM4-AVX is released under a dual license, and you may choose to use it
under either the Mozilla Public License 2.0 (MPL 2.0, available from
https://www.mozilla.org/en-US/MPL/2.0/) or the GNU Lesser General Public
License Version 3, dated 29 June 2007 (LGPL 3, available from
https://www.gnu.org/licenses/lgpl.html).

FastMM4-AVX is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

FastMM4-AVX is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License
along with FastMM4-AVX (see license_lgpl.txt and license_gpl.txt).
If not, see <http://www.gnu.org/licenses/>.

FastMM4-AVX Version History:

- 1.0.7 (22 March 2023) - implemented the optional use of user mode wait
  (WaitPKG) umonitor/umwait instructions to wait for a synchronization
  variable; it is disabled by default; define the "EnableWaitPKG" conditional
  define to enable this feature; however, it may not be as efficient
  as the pause-based loop, so only use this feature if your tests
  show a clear benefit in your scenarios.

- 1.0.6 (25 August 2021) - it can now be compiled with any alignment (8, 16, 32)
  regardless of the target (x86, x64) and whether inline assembly is used
  or not; use the "PurePascal" conditional define to disable inline assembly
  entirely; however, in this case, efficient locking would not work since it
  uses inline assembly; FreePascal now uses the original FreePascal compiler
  mode, rather than the Delphi compatibility mode as before; resolved many
  FreePascal compiler warnings; supported branch target alignment
  in FreePascal inline assembly; small block types now always have
  block sizes of 1024 and 2048 bytes, while in previous versions
  instead of 1024-byte blocks there were 1056-byte blocks,
  and instead of 2048-byte blocks there were 2176-byte blocks;
  fixed Delphi compiler hints for 64-bit Release mode; Win32 and Win64
  versions compiled under Delphi and FreePascal passed all the FastCode
  validation suites.

- 1.05 (20 May 2021) - improved speed of releasing memory blocks on higher thread
  contention. It is also possible to compile FastMM4-AVX without any
  inline assembly code. Renamed some conditional defines to be self-explanatory.
  Rewrote some comments to be meaningful. Made it compile under FreePascal
  for Linux 64-bit and 32-bit. Also made it compile under FreePascal for
  Windows 32-bit and 64-bit. Memory move functions for 152, 184 and 216 bytes
  were incorrect under Linux. Move216AVX1 and Move216AVX2 Linux implementation
  had invalid opcodes. Added support for GetFPCHeapStatus(). Optimizations of
  single-threaded performance. If you define DisablePauseAndSwitchToThread,
  it will use EnterCriticalSection/LeaveCriticalSection. An attempt to free a
  memory block twice was not caught under 32-bit Delphi. Added SSE fixed block
  copy routines for 32-bit targets. Added support for the "Fast Short REP MOVSB"
  CPU feature. Removed redundant SSE code from 64-bit targets.
- 1.04 (06 October 2020) - improved use of AVX-512 instructions to avoid turbo
  clock reduction and SSE/AVX transition penalty; made explicit order of
  parameters for GetCPUID to avoid calling convention ambiguity that could
  lead to incorrect use of registers and finally crashes, e.g., under Linux;
  improved explanations and comments, e.g., about the use of the
  synchronization techniques.
- 1.03 (04 May 2018) - minor fixes for the debug mode, FPC compatibility
  and code readability cosmetic fixes.
- 1.02 (07 November 2017) - added and tested support for the AVX-512
  instruction set.
- 1.01 (10 October 2017) - made the source code compile under Delphi5,
  thanks to Valts Silaputnins.
- 1.00 (27 July 2017) - initial revision.

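Several entries above refer to conditional defines. As a rough sketch of how such options might be selected, the fragment below uses only names mentioned in this document. It assumes the defines are placed where FastMM4 reads its options (for example `FastMM4Options.inc`) or set as project-level conditional defines; the particular combination shown is hypothetical, not a recommendation.

```pascal
{ Hypothetical option selection; base the real choice on your own measurements. }

{ Opt in to umonitor/umwait-based waiting (disabled by default, see 1.0.7). }
{$define EnableWaitPKG}

{ Keep AVX1/AVX2 code but exclude AVX-512 code from being compiled. }
{$define DisableAVX512}

{ Left disabled here: the leading dot turns the directive into a plain comment.
  Removing the dot would release locks with "xchg", as the original FastMM4 does. }
{.$define InterlockedRelease}
```

Because each option changes locking or copy behavior, it is worth re-running a benchmark after every change rather than enabling several at once.
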
The original FastMM4 description follows:

# FastMM4

Fast Memory Manager

## Description:

A fast replacement memory manager for Embarcadero Delphi applications
that scales well under multi-threaded usage, is not prone to memory
fragmentation, and supports shared memory without the use of external .DLL
files.

## Homepage:

https://github.com/pleriche/FastMM4

## Advantages:

* Fast
* Low overhead. FastMM is designed for an average of 5% and maximum of 10%
  overhead per block.
* Supports up to 3GB of user mode address space under Windows 32-bit and 4GB
  under Windows 64-bit. Add the "$SetPEFlags $20" option (in curly braces)
  to your .dpr to enable this.
* Highly aligned memory blocks. Can be configured for either 8-byte or 16-byte
  alignment.
* Good scaling under multi-threaded applications
* Intelligent reallocations. Avoids slow memory move operations through
  not performing unnecessary downsizes and by having a minimum percentage
  block size growth factor when an in-place block upsize is not possible.
* Resistant to address space fragmentation
* No external DLL required when sharing memory between the application and
  external libraries (provided both use this memory manager)
* Optionally reports memory leaks on program shutdown. (This check can be set
  to be performed only if Delphi is currently running on the machine, so end
  users won't be bothered by the error message.)
* Supports Delphi 4 (or later), C++ Builder 4 (or later), Kylix 3.

## Usage:

### Delphi:

Place this unit as the very first unit under the "uses" section in your
project's .dpr file. When sharing memory between an application and a DLL
(e.g. when passing a long string or dynamic array to a DLL function), both the
main application and the DLL must be compiled using this memory manager (with
the required conditional defines set).

There are some conditional defines
(inside `FastMM4Options.inc`) that may be used to tweak the memory manager. To
enable support for a user mode address space greater than 2GB you will have to
use the EditBin\* tool to set the `LARGE_ADDRESS_AWARE` flag in the EXE header.
This informs Windows x64 or Windows 32-bit (with the /3GB option set) that the
application supports an address space larger than 2GB (up to 4GB). In Delphi 6
and later you can also specify this flag through the compiler directive
`{$SetPEFlags $20}`

\*The EditBin tool ships with the MS Visual C compiler.

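As a minimal sketch of the Delphi instructions above, a project file could look roughly like this. The project name `DemoApp`, the `Buffer` variable and the console setup are hypothetical; the only parts taken from the text are that `FastMM4` is the first unit in the "uses" section and that `{$SetPEFlags $20}` sets the large-address-aware flag.

```pascal
program DemoApp; // hypothetical project name

{$APPTYPE CONSOLE}
{$SetPEFlags $20} // LARGE_ADDRESS_AWARE, as described above (Delphi 6 and later)

uses
  FastMM4,  // must come first so the memory manager is installed before any allocation
  SysUtils; // all other units follow

var
  Buffer: Pointer;
begin
  // Every allocation from here on goes through the installed memory manager.
  GetMem(Buffer, 1024);
  try
    FillChar(Buffer^, 1024, 0);
    Writeln('Allocated and cleared 1024 bytes');
  finally
    FreeMem(Buffer);
  end;
end.
```

A DLL that exchanges long strings or dynamic arrays with this application would list `FastMM4` first in its own "uses" section in the same way.
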
### C++ Builder:

Refer to the instructions inside `FastMM4BCB.cpp`.