Utilising More Than 4GB of Memory in a 32-bit Windows Process

Introduction

Large-scale enterprise services like SQL Server and Exchange Server can be memory-hungry beasts. Given the chance they will devour as much RAM as you can feed them, using it for caching to reduce costly I/O requests. This kind of service is often deployed on some Big Iron hardware with the sole aim of allowing it free rein of the host machine - its job being to serve clients, preferably as many as possible, and in the shortest possible time.

This article will outline the various memory constraints that affect 32-bit processes on the Windows platform and the solutions that both Intel and Microsoft provide for overcoming them through hardware, OS configuration or API changes.

32-bit Process Memory Limits

When Windows NT was first being developed back in the early 1990s you were lucky to find hard disks with a capacity over 2GB, let alone that much physical RAM. The initial design decision was to split the 4GB virtual address space that every 32-bit process would be limited to into two halves. That meant 2GB was reserved for the system (or kernel space) and 2GB for the application (or user space). Even today this address space limit of 4GB is still in effect for 32-bit processes. What the Windows engineers have done instead is provide a variety of techniques to either shift the kernel/user allocation ratio or provide other APIs that allow larger memory regions to be allocated, with different portions of them mapped into the process address space on demand [Russinovich].

Much of the confusion around this particular topic is due to the differences between the following limits: the virtual address space that a process is bound by, the amount of physical RAM that is defined by the hardware and the disk-based virtual memory provided by additional page-files. In some older articles the terms “memory”, “virtual memory” and “address space” are used interchangeably which only compounds the confusion. So, to ensure consistency throughout this article I am going to provide clear definitions of these key constraints.

Process Virtual Address Space

A process lives within a 4GB virtual address space. The limit mirrors that of a 32-bit pointer and is deemed “virtual” because the address pointer does not refer to physical memory but is actually a logical address. Instead, the page belonging to the address can be mapped anywhere within physical RAM or even inside a page-file. This is the classic “Level of Indirection” at play.

Physical Memory

Naturally this is the hardware you have within your machine. Desktop editions of Windows have historically only allowed you to access up to 4GB, whereas the Data Centre Edition of Windows Server supports up to 64GB.

Page-Files

The hard disk can also act as a temporary store for memory pages that are not currently in use. This is called virtual memory because it can only be used for page storage – the pages still have to be present in physical memory to be accessed.

The total size of all paging files defines the virtual memory limit for the entire machine; this can be smaller or larger than the per-process 4GB limit.

Commit Charge (Total Memory)

If you open the Task Manager and look at the “Performance” tab you will see a number of system memory figures quoted. One group of them is labelled “Commit Charge”. Its “Total” is the amount of memory currently committed across all processes, and its “Limit” is roughly the sum of the Physical RAM and the space available in the page-files, i.e. the total amount of memory available to all processes.

Reserved & Committed Process Pages

Within a process the pages that constitute the virtual address space can be in one of three states – Free, Reserved or Committed. Free pages are exactly that – pages which have yet to be used. Committed pages are those that are in use and count towards the process’s footprint as they must be backed either by the page-file (for data) or the executable image (for code). The intermediate state of Reserved is a half-way house used to put a region of address space to one side without actually forcing the OS to commit any physical resources to maintaining it (except for bookkeeping).
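The difference between the Reserved and Committed states can be sketched with VirtualAlloc(). The snippet below is only an illustration: the function names and sizes are my own choices, and the Win32 part is guarded so that the small rounding helper still builds on other platforms.

```cpp
#include <cstddef>

// Round a byte count up to a whole number of pages (page_size must be non-zero).
std::size_t round_up_to_pages(std::size_t bytes, std::size_t page_size)
{
    return ((bytes + page_size - 1) / page_size) * page_size;
}

#ifdef _WIN32
#include <windows.h>

// Reserve 64MB of address space, then commit roughly 1MB at the start of it.
// Returns true if both steps succeed. Illustrative sizes only.
bool reserve_then_commit()
{
    SYSTEM_INFO info;
    GetSystemInfo(&info);

    const std::size_t reserved  = 64u * 1024u * 1024u;
    const std::size_t committed = round_up_to_pages(1000000u, info.dwPageSize);

    // Reserved: the address range is set aside but nothing backs it yet.
    void* base = VirtualAlloc(NULL, reserved, MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL)
        return false;

    // Committed: these pages now count towards the process's footprint.
    void* usable = VirtualAlloc(base, committed, MEM_COMMIT, PAGE_READWRITE);
    const bool ok = (usable != NULL);
    if (ok)
        static_cast<char*>(usable)[0] = 1;   // touching the page is now legal

    // MEM_RELEASE frees the whole reservation; the size argument must be 0.
    VirtualFree(base, 0, MEM_RELEASE);
    return ok;
}
#endif
```

While the reservation is held the full 64MB should show up in Virtual Bytes, but only the committed portion in Private Bytes.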

Reserved memory is a particularly tricky beast because it is invisible in the Task Manager due to there being no physical overhead and yet it creates contention and fragmentation that is difficult to observe without inspecting the process directly.

Running Out of Memory

There are essentially two ways that you can run out of memory. The first is to exhaust your own process’s virtual address space by utilising all the pages within it, or making it impossible for the heap manager to find enough free contiguous pages from which to satisfy a memory allocation request. The second method involves consuming all available system memory (i.e. both physical and virtual) so that the OS cannot allocate a free memory page to any process. The implication of the latter is that a different process is the cause of a memory allocation failure – you might only be the victim.

To diagnose a process breaching its own limits you can monitor it with PERFMON.EXE and watch the Process | Virtual Bytes and Process | Private Bytes counters. The former represents the amount of virtual address space currently reserved or committed for heaps, page-file sections, executable code, etc. The latter is the size of the private Committed Pages, which represents the footprint of the process within the total memory available to the system.

The Private Bytes counter can also be seen in Task Manager under the confusingly named column “VM Size”. Alternatively Process Explorer, via the Properties | Performance tab, provides a single dialog for a process that contains all the important memory statistics. However it uses the term “Virtual Size” in place of “Virtual Bytes”. The following table maps the terms between the various common tools:-

Tool	Working Set	Commit Charge	Address Space
Task Manager	Mem Usage	VM Size	N/A
Perfmon	Working Set	Private Bytes	Virtual Bytes
Process Explorer	Working Set	Private Bytes	Virtual Size

As we shall see later when discussing the mechanisms for breaking the 4GB barrier the column name “Commit Charge” becomes less meaningful; but it is a good first-order approximation.

The exact cause of the process’s exhaustion will likely need much closer examination of the actual page usage, for which WinDbg can be of great assistance. Another more recent tool from the Sysinternals stable, called VMMap, can also be of use. The latter is more graphical in nature than WinDbg so is easier for visualisation.

Determining that the entire machine has hit the buffers can be a much simpler affair. Bring up the Performance tab in Task Manager and compare the Commit Charge “Peak” to the “Limit” – if they’re the same you’ve maxed out. Things will likely start going awry before this point though. If for instance you’re making very heavy use of file or network I/O, you can drain the number of System Page Table Entries which is the pool from which everything flows. The likely indicators here are the Win32 error codes 1450 (Insufficient system resources exist to complete the requested service) and 1453 (Insufficient quota to complete the requested service). Perfmon is able to help you visualise the consumption of this vital system resource via the Memory | Free System Page Table Entries counter. If you’ve hit either of these two conditions then your problem is not going to be solved by any of the solutions below; they may even make it worse!
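If your own code logs Win32 error codes, a small classifier along the following lines can flag the two codes mentioned above. This is only a sketch: the function and constant names are hypothetical, and the values are those that winerror.h gives to ERROR_NO_SYSTEM_RESOURCES and ERROR_WORKING_SET_QUOTA.

```cpp
// Values as defined in <winerror.h>; repeated here so the sketch stands alone.
const unsigned long kErrorNoSystemResources = 1450UL;  // ERROR_NO_SYSTEM_RESOURCES
const unsigned long kErrorWorkingSetQuota   = 1453UL;  // ERROR_WORKING_SET_QUOTA

// True if the error code suggests the machine, rather than just this
// process, is running short of a kernel resource.
bool is_system_wide_exhaustion(unsigned long win32_error)
{
    return win32_error == kErrorNoSystemResources
        || win32_error == kErrorWorkingSetQuota;
}
```

In a real service this would typically be fed the result of GetLastError() straight after a failed I/O call.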

Memory Pressure Solutions

There are a number of different options available for remedying a memory-bound 32-bit process that range from simple OS level configuration changes to architectural changes via the use of certain Win32 APIs. Porting to 64-bit Windows is mentioned here as well, but only for completeness.

Configuration Based Remediation

We start with the OS/process configuration based solutions as they don’t require any code changes per se.

The /3GB or /USERVA boot.ini Flag

One of the simplest ways to increase the amount of memory a process can use on 32-bit Windows is to enable the /3GB flag in the Windows boot.ini file. This has the effect of adjusting the kernel/user address space split in favour of the application by 1GB, i.e. instead of a 2GB/2GB split you have a 3GB/1GB split. The downside to this is that the kernel address space is halved so there is less space for certain key kernel data structures such as the System Page Table Entries mentioned earlier. The /USERVA flag is an alternative to /3GB that allows for fine tuning of this ratio.

Unfortunately this magic flag is no good by itself. The increase in application address space means that all of a sudden an application could start dealing with addresses above 0x7FFFFFFF. Signed pointer arithmetic on memory allocated above this threshold could expose latent bugs that may lead to subtle data loss instead of catastrophic failure. Consequently an application or service must declare itself compatible with this larger address space by being marked with the /LARGEADDRESSAWARE flag in the executable image. This flag is accessible via the Visual C++ linker and is exposed by the later editions of the Visual Studio IDE under Linker | System | Enable Large Addresses. For .Net applications you currently need to use a custom build step that invokes EDITBIN.EXE to set the flag.
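To make the risk concrete, here is a minimal, portable sketch of one such latent bug: a range check that stores 32-bit addresses in signed integers. The function names are invented for illustration, and the addresses are modelled with fixed-width integers rather than real pointers so that the behaviour is the same on any platform.

```cpp
#include <cstdint>

// The latent bug: addresses held in signed 32-bit integers. Anything above
// 0x7FFFFFFF becomes negative (on mainstream compilers), so the comparison
// misfires once a buffer reaches the 2GB boundary.
bool in_range_signed(std::uint32_t addr, std::uint32_t begin, std::uint32_t end)
{
    const std::int32_t a = static_cast<std::int32_t>(addr);
    const std::int32_t b = static_cast<std::int32_t>(begin);
    const std::int32_t e = static_cast<std::int32_t>(end);
    return b <= a && a < e;
}

// The fix: keep the comparison unsigned.
bool in_range_unsigned(std::uint32_t addr, std::uint32_t begin, std::uint32_t end)
{
    return begin <= addr && addr < end;
}
```

Below 2GB the two functions agree, which is why this sort of code can pass years of testing. For a buffer spanning 0x7FFFF000 to 0x80001000, the signed version wrongly rejects the address 0x80000800 while the unsigned version accepts it.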

To aid in testing your application’s compatibility with high addresses there is a flag (MEM_TOP_DOWN) that can be passed to VirtualAlloc() to force the highest available addresses to be allocated first, instead of the default of lowest first.
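A test harness might use the flag along these lines. This is a sketch with hypothetical helper names; the Win32 call is guarded so that the address check alone still builds elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

// True if the pointer lies above the traditional 2GB user-space ceiling.
bool is_above_2gb(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) > 0x7FFFFFFFu;
}

#ifdef _WIN32
#include <windows.h>

// Ask for memory from the top of the address space downwards.
// Returns NULL on failure; release with VirtualFree(p, 0, MEM_RELEASE).
void* allocate_top_down(std::size_t bytes)
{
    return VirtualAlloc(NULL, bytes,
                        MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN,
                        PAGE_READWRITE);
}
#endif
```

Note that in a 32-bit process not marked as large address aware the result should still fall below 2GB, so is_above_2gb() doubles as a quick check that the flag and boot option have actually taken effect.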

As an aside the more recent documentation from Microsoft on this topic now uses the term 4-Gigabyte Tuning (4GT) [MSDN].

Physical Address Extensions (The /PAE boot.ini Flag)

Extending the virtual address space of a single process overcomes one limitation, but there is a second one in play on 32-bit Windows that affects your ability to run many of these “large address aware” processes, such as in a Grid Computing environment. The maximum amount of physical memory that could be managed by Windows was also originally 4GB. This is still the case for the 32-bit desktop editions of Windows, but the server variants are able to address much more physical RAM – up to 64 GB on the Data Centre Server edition.

This has been achieved by utilising an Intel technology known as Physical Address Extensions which was introduced with the Pentium Pro. It adds an extra layer to the page table mechanism and extends entries from 32-bits to 64-bits so that up to 128GB could theoretically be addressed.

The introduction of PAE means that kernel drivers would now also be exposed to physical addresses above the 4GB barrier, something they may not have originally been tested for. Windows tries to keep buffers under the 4GB limit to aid reliability, but once again the enablement of the feature must be a conscious one – this time via the /PAE switch also in boot.ini. If the server hardware supports Hot Add Memory this flag is actually enabled by default.

The Danger of /3GB and /PAE

As always there is a cost to enabling this. Under /PAE the Page Table Index shrinks from 10 bits to 9 bits, so each page table holds half as many entries and there are half as many System Page Table Entries available for use. If you combine this with the /3GB flag you will have significantly reduced this resource and may see the server straining badly under heavy I/O load, i.e. you could start seeing those 1450 and 1453 errors mentioned earlier.
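The change in index width can be pictured by splitting the same 32-bit virtual address both ways. The sketch below uses the standard x86 layouts (10/10/12 bits without PAE, 2/9/9/12 bits with it); the struct and function names are my own.

```cpp
#include <cstdint>

struct ClassicIndices { unsigned directory, table, offset; };
struct PaeIndices     { unsigned pointer, directory, table, offset; };

// Non-PAE: 10-bit directory index, 10-bit table index, 12-bit page offset.
ClassicIndices split_classic(std::uint32_t va)
{
    ClassicIndices r;
    r.directory = (va >> 22) & 0x3FF;
    r.table     = (va >> 12) & 0x3FF;   // 1024 entries per page table
    r.offset    = va & 0xFFF;
    return r;
}

// PAE: 2-bit directory-pointer index, 9-bit directory index,
// 9-bit table index, 12-bit page offset.
PaeIndices split_pae(std::uint32_t va)
{
    PaeIndices r;
    r.pointer   = (va >> 30) & 0x3;
    r.directory = (va >> 21) & 0x1FF;
    r.table     = (va >> 12) & 0x1FF;   // only 512 entries per page table
    r.offset    = va & 0xFFF;
    return r;
}
```

Each entry doubles in size to 64 bits but a page table still occupies a single 4KB page, hence the drop from 1024 to 512 entries.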

The other major casualty is the video adaptor [Chen], but this is often of little consequence as application servers are not generally renowned for their game playing abilities. Of course the rise in general-purpose graphics processing units (GPGPU) puts a different spin on the use of such hardware in modern servers.

Using 64-bit Windows to run a 32-bit Process

Naturally all this /3GB and /PAE nonsense goes away under 64-bit Windows as the total system address space is massive by comparison. Although in theory you have 64-bits to play with, implementation limitations mean there are actually only 48-bits to work with. Still, 256 TB should be enough for anyone?

But 64-bit Windows doesn’t just benefit 64-bit processes; the architecture also changes the address space layout for 32-bit processes. The kernel address space now lives much higher up leaving the entire 4GB region for the application to play with (assuming that your image is marked with the /LARGEADDRESSAWARE flag as before).

Recompiling for 64-bit

The obvious solution to all these shenanigans might just simply be to recompile your application as a 64-bit process. Better still, if you rewrite it in .Net you have the ability to run as either a 32-bit or 64-bit process as appropriate with no extra work. Only, it’s never quite that simple…

There are many issues that make porting to a 64-bit architecture non-trivial, both at the source code level, and due to external dependencies. Ensuring your pointer arithmetic is sound and that any persistence code is size agnostic are two of the main areas most often written about. But you also need to watch those 3rd party libraries and COM components as a 64-bit process cannot host a 32-bit DLL, such as an inproc COM server.
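As an illustration of what “size agnostic” persistence can mean in practice, the sketch below writes a length field as a fixed 8 bytes in a fixed byte order, rather than dumping a size_t whose width differs between 32-bit and 64-bit builds. The function names and the choice of little-endian layout are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 64-bit value as 8 little-endian bytes, whatever the build's word size.
void write_u64(std::vector<unsigned char>& out, std::uint64_t value)
{
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<unsigned char>((value >> (8 * i)) & 0xFF));
}

// Read back 8 little-endian bytes starting at 'pos'.
// The caller must ensure pos + 8 <= in.size().
std::uint64_t read_u64(const std::vector<unsigned char>& in, std::size_t pos)
{
    std::uint64_t value = 0;
    for (int i = 0; i < 8; ++i)
        value |= static_cast<std::uint64_t>(in[pos + i]) << (8 * i);
    return value;
}
```

Data written this way by a 32-bit build can be read back unchanged by a 64-bit one, and vice versa.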

The hardware and operating system will also behave differently. There are plenty of “gotchas” waiting to catch you out during deployment and operations. In the corporate world 32-bit Windows desktops are still probably the norm with 64-bit Windows becoming the norm in the server space. So, whilst the 64-bit editions of SQL & Exchange Server are well bedded-in, custom applications are still essentially developed on a different platform.

Useable Memory

Having a user address space of 2, 3, or even 4 GB does not of course mean that you get to use every last ounce. You’ll have thread stacks and executable image sections taking chunks out. Plus, if you use C++ and COM you have at least two heaps competing, both of which will hold to ransom any virtual address descriptors (VADs) that they reserve, irrespective of whether they are in use or not. Throw in “Virtual Address Space Fragmentation” and you’re pretty much guaranteed (unless you’ve specifically tuned your application’s memory requirements) to get less than you bargained for.

The following table describes my experiences of the differences between the maximum and realistic usable memory for a process making general use of both the COM and CRT heaps:-

Max User Address Space	Useable Space
2.0 GB	1.7 GB
3.0 GB	2.6 GB
4.0 GB	3.7 GB

This kind of information is useful if you want to tune the size of any caches, or if you need to do process recycling such as in a grid or web-hosted scenario. To see the amount of virtual address space used by a process you can watch the “Virtual Bytes” Perfmon counter as described earlier.

Extending Your Footprint Over 4GB

Those who went through the 16-bit to 32-bit Windows transition will no doubt be overly cautious - the risk/reward for porting a line-of-business application that is only in need of a little more headroom may not be sufficient to justify the cost and potential upheaval straight away.

If it’s caching you need, and you don’t mind going out-of-process on the same machine (or even making a remote call) then there are any number of off-the-shelf products in the NoSQL space, such as the open source Memcached. However if you’re looking to do something yourself and you want to avoid additional dependencies, or need performance closer to in-process caching, then there are two options – Address Windowing Extensions and Shared Memory.

What you need to bear in mind though is that it’s not possible to overcome the 4 GB address space limit. What both these mechanisms do allow is the ability to store and access more than 4GB of memory very quickly – just not all at exactly the same time.

Address Windowing Extensions (AWE)

Windows 2000 saw the addition of a new API targeted specifically at this problem, and it is the one SQL Server uses. The AWE API is designed solely with performance in mind and provides the ability to allocate and map portions of the physical address space into a process. As the name implies you cannot directly access all that memory in one go but need to create “windows” onto sections of it as and when you need to. The number and size of windows you can have mapped at any one time is still effectively bound by the 4GB per-process limit.

Due to the way AWE works there are some restrictions on the memory that is allocated:-

The memory is non-paged.
The application must be granted the “Lock Pages in Memory” privilege.

The API functions allow you to allocate memory as raw pages (as indicated by the use of the term Page Frame Numbers) – this is the same structure the kernel itself uses. You then request for a subset of those pages to be mapped into a region of the process’s restricted virtual address space to gain access to it, using the previously returned Page Frame Numbers.
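The allocate/map/unmap cycle described above might look something like the following. Treat it as a rough sketch rather than production code: the function names are invented, error handling is minimal, and it assumes the “Lock Pages in Memory” privilege has already been granted to the account and enabled in the process token (otherwise the allocation simply fails). The Win32 part is guarded so the page-count helper still builds elsewhere.

```cpp
#include <cstddef>
#include <vector>

// Number of whole pages needed to hold 'bytes' (page_size must be non-zero).
std::size_t pages_for(std::size_t bytes, std::size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

#ifdef _WIN32
#include <windows.h>

// Allocate 1MB of physical pages, map them into a window, touch them, tidy up.
bool awe_round_trip()
{
    SYSTEM_INFO info;
    GetSystemInfo(&info);

    ULONG_PTR page_count = pages_for(1024u * 1024u, info.dwPageSize);
    std::vector<ULONG_PTR> frames(page_count);   // receives the Page Frame Numbers

    // 1. Allocate raw physical pages; page_count may come back smaller.
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &page_count, &frames[0]))
        return false;

    // 2. Reserve a "window" of address space to view them through.
    void* window = VirtualAlloc(NULL, page_count * info.dwPageSize,
                                MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
    bool ok = false;
    if (window != NULL)
    {
        // 3. Map the pages into the window and use them.
        if (MapUserPhysicalPages(window, page_count, &frames[0]))
        {
            static_cast<char*>(window)[0] = 42;
            ok = true;
            // 4. Passing NULL unmaps the window; the pages keep their contents.
            MapUserPhysicalPages(window, page_count, NULL);
        }
        VirtualFree(window, 0, MEM_RELEASE);
    }

    FreeUserPhysicalPages(GetCurrentProcess(), &page_count, &frames[0]);
    return ok;
}
#endif
```

A real cache would keep many such page sets allocated and only repeat steps 3 and 4 as different blocks of data are needed.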

For services such as SQL Server and Exchange Server, which are often given an entire host, this API allows them to make the best possible use of the available resources on the proviso that the memory will never be paged out.

Page-File Backed Shared Memory

There is another way to access all that extra memory using the existing Windows APIs in a manner similar to the AWE mechanism, but without many of its limitations – Shared Memory. Apart from not needing any extra privileges the memory allocated can also be paged which is useful for overcoming transient spikes or exploiting the paging algorithm already provided by the OS.

Allocating shared memory under Windows is the job of the same API used for Memory Mapped Files. In essence what you are mapping is a portion of a file, though not an application defined file but a part of the system’s page-file. This is achieved by passing INVALID_HANDLE_VALUE instead of a real file handle to CreateFileMapping(). The example below creates a shared segment of 1MB:-

const size_t size = 1024 * 1024;

// INVALID_HANDLE_VALUE asks for a segment backed by the system page-file.
HANDLE mapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                   PAGE_READWRITE, 0u, size, NULL);

if (mapping == NULL)
    throw std::runtime_error("Failed to create segment");

At this point we have allocated a chunk of memory from the system, but we can’t access it. More importantly though we haven’t consumed any of our address space either. To read and write to it we need to map a portion (or all) of it into our address space, which we do with MapViewOfFile(). When we’re done we can free up the address space again with UnmapViewOfFile(). Continuing our example we require the following code to access the shared segment:-

const size_t offset = 0;
const size_t length = 1024 * 1024;

// The offset is passed as two 32-bit halves (high, then low).
void* region = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0u,
                             static_cast<DWORD>(offset), length);

if (region == NULL)
    throw std::runtime_error("Failed to map segment");

// read & write to the region...

UnmapViewOfFile(region);

Every time we need to access the segment we just map a view, access it and un-map the view again. When we’re completely done with it we can free up the system’s memory with the usual call to CloseHandle().

Limitations of Shared Memory Segments

This approach is not without its own constraints as anyone who has used VirtualAlloc() will know. Just as with any normal heap allocation the actual size will be rounded up to some extent to match the underlying page size. What is more restrictive though is that the “window” you map to access the segment (via MapViewOfFile) must start on an offset which is a multiple of the “allocation granularity”. This is commonly 64K and can be obtained by calling GetSystemInfo(). The length can be any size and will be rounded up to the nearest page boundary. This pretty much guarantees it’s only useful with larger chunks of data.
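One way to cope with the alignment rule is to map from the nearest preceding granularity boundary and then skip forward within the view. The helper below is a portable sketch of that arithmetic; the names are hypothetical, and in real code the granularity would come from the dwAllocationGranularity member filled in by GetSystemInfo().

```cpp
#include <cstddef>
#include <cstdint>

struct ViewRequest
{
    std::uint64_t view_offset;  // aligned offset to pass to MapViewOfFile
    std::size_t   view_length;  // number of bytes to map from view_offset
    std::size_t   delta;        // where the wanted data starts inside the view
};

// Work out which aligned view covers 'length' bytes starting at 'offset'.
// 'granularity' must be non-zero.
ViewRequest align_view(std::uint64_t offset, std::size_t length,
                       std::size_t granularity)
{
    ViewRequest r;
    r.delta       = static_cast<std::size_t>(offset % granularity);
    r.view_offset = offset - r.delta;
    r.view_length = length + r.delta;
    return r;
}
```

For example, with a 64K granularity a request for 100 bytes at offset 70000 becomes a view of 4564 bytes at offset 65536, with the data of interest starting 4464 bytes in.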

A more subtle problem can arise if you fail to match the calls to MapViewOfFile with those to UnmapViewOfFile. Each call to MapViewOfFile bumps the reference count on the underlying segment handle and so calling CloseHandle will not free the segment if any views are still mapped. If left unchecked this could create one almighty memory leak that would be interesting to track down.

Apart from the API limitations there is also the problem of not being able to cache or store raw pointers to the data either outside or inside the memory block – you must use or store offsets instead. The base address of each view is only valid for as long as the view is mapped so care needs to be taken to avoid dangling pointers.
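The offsets-not-pointers rule can be shown without any Windows calls at all. In this sketch a tiny linked list lives inside a block of bytes and its links are byte offsets from the start of the block, so it survives the block turning up at a different address, which is what happens each time a view is re-mapped. The node layout and function names are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// A node stored inside the block. 'next' is a byte offset from the start
// of the block, with 0 meaning "end of list".
struct Node { std::uint32_t value; std::uint32_t next; };

void put_node(unsigned char* base, std::uint32_t at,
              std::uint32_t value, std::uint32_t next)
{
    const Node node = { value, next };
    std::memcpy(base + at, &node, sizeof(node));
}

// Walk the list relative to whatever base address the block has right now.
std::uint32_t sum_list(const unsigned char* base, std::uint32_t first)
{
    std::uint32_t total = 0;
    for (std::uint32_t off = first; off != 0; )
    {
        Node node;
        std::memcpy(&node, base + off, sizeof(node));
        total += node.value;
        off = node.next;
    }
    return total;
}

// Build a three-node list, "re-map" it by copying the block elsewhere,
// and walk the copy.
std::uint32_t demo_relocation()
{
    std::vector<unsigned char> block(64, 0);
    put_node(&block[0], 8, 10, 24);
    put_node(&block[0], 24, 20, 40);
    put_node(&block[0], 40, 12, 0);

    std::vector<unsigned char> moved(block);   // same bytes, different address
    return sum_list(&moved[0], 8);
}
```

Had the nodes held raw pointers into the original block, the walk over the copy would have read the old block instead, or crashed once it had gone.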

One other operational side-effect of this technique that you need to warn your System Administrators about is the massive rise in page faults that they will see in the process stats. What they need to understand is that these are probably just “soft faults” where a physical page is mapped into a process and not a “hard fault” where a disk access also occurs. Although the segment is officially backed by the system page-file if enough physical RAM exists the page should never be written out to disk and so provides excellent performance.

Real-World Use

I have previously used shared memory segments very successfully in two 32-bit COM heavy services that ran alongside other services on a 64-bit Windows 2003 server. One of them cached up to 16 GB of data without any undue side effects, even when transient loads pushed it over the physical RAM limit and some paging occurred for short periods.

I’m currently working on a .Net based system that is dependent on a 32-bit native library and have earmarked the technique again as one method of overcoming out-of-memory problems caused by needing to temporarily cache large intermediate blobs of data.

Research Project - Service-Less Caching

The ability to cache data in shared memory, which is effectively reference counted by the OS, provided the basis for a prototype mechanism that would allow multiple “engine” processes running on the same host to cache common data without needing a separate service process to act as a gateway. This would avoid the massive duplication of cached data for each process, which, as the number of CPUs (and therefore engine processes) increased, would afford more efficient use of the entire pool of system RAM.

The mechanism was fairly simple. Instead of each engine process storing its large blobs of common data in private memory, it would be stored in a shared segment (backed by a deterministic object name) and mapped on demand. The use of similarly (deterministically) named synchronisation objects ensured that only one engine needed to request the data from upstream, and the existence of a locally cached blob could be detected easily too. This was done by exploiting the fact that creation of an object with the same name as another succeeds and returns the special error ERROR_ALREADY_EXISTS.
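The detection trick might be sketched as follows. This is not the prototype’s actual code: the naming scheme, the “Local\” namespace prefix and the function names are all assumptions made for illustration, and the Win32 part is guarded so the naming helper still builds elsewhere.

```cpp
#include <string>

// Hypothetical naming scheme: every engine derives the same object name
// from the identity of the blob it wants.
std::string segment_name(const std::string& blob_id)
{
    return "Local\\EngineCache." + blob_id;
}

#ifdef _WIN32
#include <windows.h>

// Create the named segment, or open it if another engine got there first.
// 'created' is set to true only for the engine that actually created it,
// which is therefore the one that should fetch the data from upstream.
HANDLE open_or_create_segment(const std::string& blob_id, DWORD size,
                              bool& created)
{
    HANDLE mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0u, size,
                                        segment_name(blob_id).c_str());

    // On success for an existing name, the last error is ERROR_ALREADY_EXISTS.
    // It must be read before any other API call overwrites it.
    created = (mapping != NULL) && (GetLastError() != ERROR_ALREADY_EXISTS);
    return mapping;
}
#endif
```

Engines running in different logon sessions would need a different namespace prefix, and a named mutex or event would still be required to stop readers mapping the segment before the creator has finished filling it.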

The idea was prototyped but never used in production as far as I know.

Summary

This article provided a number of techniques to illustrate how a 32-bit Windows process can access more memory than the 2GB default. These ranged from configuration tweaks involving the /3GB and /PAE flags through to the AWE and Shared Memory APIs. Along the way it helped explain some of the terminology and showed how to help diagnose memory exhaustion problems.

Credits

Thanks to Matthew Wilson for reciprocating and commenting on my first draft, and to Frances Buontempo for her valuable feedback and encouragement too.

References

[Chen] Raymond Chen, Kernel address space consequences of the /3GB switch, http://blogs.msdn.com/b/oldnewthing/archive/2004/08/06/209840.aspx

[MSDN] Memory Limits for Windows Releases, http://msdn.microsoft.com/en-gb/library/windows/desktop/aa366778(v=vs.85).aspx

[Russinovich] Mark Russinovich and David Solomon, Windows Internals 4th edition

Chris Oldwood

07/01/2013

Biography

Chris started out as a bedroom coder in the 80s, writing assembler on 8-bit micros. These days it’s C++ and C# on Windows in big plush corporate offices. He is also the commentator for the Godmanchester Gala Day Duck Race and can be contacted via gort@cix.co.uk.