
VMware virtual memory management

Virtual memory management refers to techniques that operating systems employ to manage the allocation of physical memory resources (or RAM) on demand, transparent to the applications that execute on the machine. Modern operating systems, including IBM’s proprietary mainframe OSes, virtually all flavors of Unix and Linux, as well as Microsoft Windows, have uniformly adopted virtual memory management techniques, ever since the first tentative results from using on-demand virtual memory management demonstrated more effective utilization of RAM, compared to static memory partitioning schemes. All but the simplest processor hardware offer support for virtual memory management.

VMware is a hypervisor, responsible for running one or more virtual machine guests, providing each guest machine with a virtualized set of CPU, memory, disk and network resources that VMware is then responsible for allocating and managing. With regard to managing physical memory, VMware initially grants each running virtual machine a virtualized physical address space, the size of which is specified during configuration. From within the guest machine, there is no indication to either the operating system or the process address spaces that execute under the guest OS that physical addresses are virtualized. Unaware that physical addresses are virtualized, the guest machine OS manages its physical memory in its customary manner, allocating physical memory on demand and replacing older pages with new pages whenever it detects contention for “virtualized” physical memory.

To make it possible for guest machines to execute, VMware provides an additional layer of virtual memory:physical memory mapping for each guest machine. VMware is responsible for maintaining a hardware-specific virtual:physical address translation capability, permitting guest machine instructions to access their full virtualized physical address range. Meanwhile, inside the VMware Host, actual physical memory is allocated on demand, as guest machines execute and reference virtualized physical addresses. As actual physical memory fills, VMware similarly must implement page replacement. Unlike the guest OSes it hosts, VMware itself gathers very little information about page reference patterns that would be useful in performing page replacement, due to overhead concerns. Consequently, VMware’s principal page replacement strategy is to try to induce paging inside the guest OS, where, presumably, better informed decisions can be made. This is known as guest machine ballooning.
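The two layers of mapping can be pictured with a toy model. This is only a sketch under stated assumptions: the dictionaries `guest_page_table` and `host_page_map`, the 4 KB page size, and the function names are all invented for illustration and do not reflect how VMware or any processor actually stores these tables.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(addr, page_map):
    """Map an address through a {page number: frame number} dictionary.

    Returns None when the page has no mapping (a fault in a real system).
    """
    page, offset = divmod(addr, PAGE_SIZE)
    frame = page_map.get(page)
    return None if frame is None else frame * PAGE_SIZE + offset


# Hypothetical tables: the guest OS maps virtual page 5 to what it believes
# is physical page 2; the hypervisor maps guest physical page 2 to host page 9.
guest_page_table = {5: 2}
host_page_map = {2: 9}


def guest_virtual_to_host(addr):
    """Apply the guest's mapping first, then the hypervisor's mapping."""
    guest_physical = translate(addr, guest_page_table)
    if guest_physical is None:
        return None  # would be a page fault handled by the guest OS
    return translate(guest_physical, host_page_map)
```

The guest only ever sees the first dictionary; the second is private to the hypervisor, which is why the guest cannot tell that its physical addresses are virtualized.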

To support more aggressive consolidation of guest virtual machine images onto VMware servers, VMware also attempts dynamically to identify identical instances of virtual memory pages within the guest machine or across guest machines that would allow them to be mapped to a single copy of a physical memory page, thus saving on overall physical memory usage. This feature is known as transparent memory sharing.
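The idea of collapsing identical pages onto one physical copy can be sketched as follows. The text above does not say how identical pages are detected, so the use of a content hash followed by a full byte comparison is an assumption made for this example, as are the function and variable names. Copy-on-write handling, which any real implementation would need once a shared page is modified, is left out.

```python
import hashlib


def share_identical_pages(pages):
    """Collapse identical page contents onto a single stored copy.

    pages: dict mapping (guest name, page number) -> bytes.
    Returns (mapping, frames) where mapping gives the frame index backing
    each page and frames holds one copy of each distinct page content.
    """
    frames = []            # one entry per distinct page content
    frames_by_digest = {}  # digest -> list of frame indexes with that digest
    mapping = {}
    for key, content in pages.items():
        digest = hashlib.sha256(content).digest()
        candidates = frames_by_digest.setdefault(digest, [])
        for frame_id in candidates:
            # A full comparison guards against hash collisions.
            if frames[frame_id] == content:
                mapping[key] = frame_id
                break
        else:
            frames.append(content)
            candidates.append(len(frames) - 1)
            mapping[key] = len(frames) - 1
    return mapping, frames


example_pages = {
    ("vm1", 0): bytes(4096),          # an all-zero page
    ("vm2", 7): bytes(4096),          # identical zero page in another guest
    ("vm2", 8): b"\x01" * 4096,       # different contents
}
example_mapping, example_frames = share_identical_pages(example_pages)
```

Here three guest pages end up backed by two stored copies, because the two zero pages share one frame.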

Virtual addressing

Virtual memory refers to the virtualized linear address space that an OS builds and presents to each application. 64-bit address registers, for example, can access a breathtaking range of 2^64 virtual addresses, even though the actual physical memory configuration is much, much smaller. Virtual memory addressing permits applications to be written that can execute (largely) independent of the underlying physical memory configuration.

Transparent to the application process address space, the operating system maintains a set of tables that map virtual addresses to actual physical memory addresses during runtime. This mapping is performed at the level of a page, a block of contiguous memory addresses. A page size of 4 KB (2^12 bytes) is frequently used, although other page sizes are possible. (Some computer hardware allows the OS to select and use a range of supported page sizes.)
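With a 4 KB page, the low 12 bits of an address select a byte within the page and the remaining high bits select the page. A minimal sketch of that split, assuming the 4 KB page size mentioned above (the names are illustrative):

```python
PAGE_SHIFT = 12                 # 2^12 = 4096 bytes per page (assumed)
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1


def split_address(vaddr):
    """Return (virtual page number, byte offset within the page)."""
    return vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK


page_number, offset = split_address(0x12345)  # page 0x12, offset 0x345
```

Only the page number is translated; the offset is carried over unchanged into the physical address.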

To support virtual memory management, the operating system maintains page tables that map virtual memory addresses to physical memory addresses for each process being executed. The precise form of the page tables that are necessary to perform this mapping is specified by the underlying hardware platform. As a computer program executes on the hardware, the processor hardware performs the necessary translation of virtual memory addresses to physical memory addresses dynamically during run-time. Operating system functions that support virtual memory management include setting up and maintaining the per-process page tables that are used to perform this dynamic mapping and instructing the hardware about the location of these memory address translation tables in physical memory, which is accomplished by loading a dedicated control register to point to the process-specific set of address mapping tables. When the execution of one running process blocks, the operating system performs a context switch that loads the page tables of a different process, allowing that process’s valid virtual addresses to be translated.
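The role of the control register during a context switch can be illustrated with a toy model. Everything here is an assumption for the sake of the sketch: `ToyCPU`, its `page_table_base` attribute (standing in for the dedicated control register), and the flat dictionary page tables bear no resemblance to real hardware formats.

```python
class ToyCPU:
    """A toy processor that translates through whichever table is loaded."""

    PAGE_SIZE = 4096  # assumed page size

    def __init__(self):
        # Stands in for the dedicated control register that points to the
        # current process's page tables.
        self.page_table_base = None

    def context_switch(self, page_table):
        """Point the 'register' at a different process's page table."""
        self.page_table_base = page_table

    def translate(self, vaddr):
        page, offset = divmod(vaddr, self.PAGE_SIZE)
        # A KeyError here stands in for a page fault.
        frame = self.page_table_base[page]
        return frame * self.PAGE_SIZE + offset


# Two hypothetical processes use the same virtual page 0, backed by
# different physical frames.
process_a_table = {0: 7}
process_b_table = {0: 3}

cpu = ToyCPU()
cpu.context_switch(process_a_table)
address_seen_by_a = cpu.translate(0x10)
cpu.context_switch(process_b_table)
address_seen_by_b = cpu.translate(0x10)
```

The same virtual address resolves to different physical addresses depending only on which table the register points to.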

The techniques that allow an operating system to execute multiple processes concurrently and switch between them dynamically are collectively known as multiprogramming. Modern operating systems evolved rapidly to support multiprogramming across multiple processors, where each CPU is capable of accessing the full range of installed physical memory locations.

(Large scale multi-core multiprocessors are frequently configured with more than one memory bank, where the result is a NUMA (non-uniform memory access) architecture. In machines with NUMA characteristics, something that is quite common in blade servers, accessing a location that resides in a remote memory bank takes longer than a local memory access, a fact that can have serious performance implications. For optimal performance on NUMA machines, the OS memory manager must factor the NUMA topology into memory allocation decisions, something which VMware evidently does. Further discussion of NUMA architectures and the implications for the performance of guest machines is beyond the scope of the current inquiry, however. Single core multiprocessors from Intel have uniform memory access latency, while AMD single-core multiprocessors have NUMA characteristics.)

Virtual memory management allocates memory on demand, which is demonstrably more effective in managing physical RAM than static partitioning schemes where each executing process acquires a fixed set of physical memory addresses for the duration of its execution. In addition, virtual memory provides a secure foundation for executing multiple processes concurrently since each running process has no capability to access and store data in physical memory locations outside the range of its own unique set of dedicated virtual memory addresses. The OS ensures that each virtual address space is mapped to a disjoint set of physical memory pages. The virtual addresses associated with the OS itself represent a set of pages that are shared in common across all of the process address spaces, a scheme that enables threads in each process to call OS services directly, including the system services enabling interprocess communication (or IPC).

The operating system presents each running process with a range of virtual memory addresses to use that often exceeds the size of physical RAM. Virtualizing memory addressing allows applications to be written that are largely unconcerned with the physical limits of the underlying computer hardware, greatly simplifying their construction. Permitting applications to be portable across a wide variety of hardware configurations, irrespective of the amount of physical memory that is actually available for them to execute, is also of considerable benefit.

The virtual:physical memory mapping and translation that occurs during instruction execution is transparent to the application that is running. However, there are OS functions, including setting up and maintaining the Page Tables, which need to understand and utilize physical memory locations. In addition, device driver software, installed alongside and serving as an extension to the OS, is directly responsible for communicating with all manner of peripheral devices. Device driver software must communicate with those devices using actual physical addresses. Peripheral devices use Direct Memory Access (DMA) interfaces that do not have access to the processor’s virtual address to physical address mapping capability during execution.

Memory over-commitment

Allowing applications access to a range of virtual memory addresses that individually or collectively exceeds the amount of physical memory that is actually available during execution inevitably leads to situations where physical memory is over-committed. When physical memory is over-committed, the operating system implements a page replacement policy that dynamically manages the contents of physical memory, reclaiming a previously allocated physical page and re-purposing it for use backing an entirely different set of virtual memory addresses, possibly in an entirely different process address space. Dynamically replacing the pages of applications that have not been accessed recently with more recently accessed pages has proven to be an effective way to manage this over-commitment. This is known as demand paging.

Allowing applications to collectively commit more virtual memory pages than are actually present in physical memory, but biasing the contents of physical memory based on current usage patterns, permits operating systems that support virtual memory addressing to utilize physical memory resources very effectively. Over-commitment of physical memory works because applications frequently exhibit phased behavior during execution in which they actively access only a relatively small subset of the overall memory locations they have allocated. The subset of the total number of allocated virtual memory pages that are currently active and resident in physical memory is known as the application’s working set of active pages.
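One common way to approximate a working set is to look at the distinct pages touched within a recent window of references. The sketch below takes that approach; the window-based definition, the function name, and the sample trace are assumptions for illustration rather than anything a particular OS computes in this form.

```python
def working_set(reference_trace, window):
    """Distinct pages referenced in the last `window` references."""
    if window <= 0:
        return set()
    return set(reference_trace[-window:])


# A made-up trace: page 3 is touched once early on, after which the
# program settles into a phase that uses only pages 1 and 2.
trace = [1, 2, 3, 1, 2, 1, 2, 1, 2]
allocated_pages = set(trace)
active_pages = working_set(trace, window=4)
```

In this trace three pages are allocated but only two are active, so page 3 is a reasonable candidate for replacement.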

Under virtual memory management, a process address space acquires virtual memory addresses a page at a time, dynamically, on demand. The application process normally requests the OS to allocate a block of contiguous virtual memory addresses for its use. (Since RAM, by definition, is randomly addressable, the process seldom cares where within the address space this specific size block of memory addresses is located. But because fragmentation of the address space can occur, potentially leading to allocation failures when a large enough contiguous block of free addresses is not available to satisfy an allocation request, there is usually some effort on the part of the OS Memory Manager to make virtual allocations contiguous, where possible.)

In Windows, for example, these allocated pages are known as committed pages because the OS has committed to backing the virtual page in either physical memory or in auxiliary memory, which is another name for the paging file located on disk. Windows also has a commit limit, an upper limit on the number of virtual memory pages it is willing to allocate. The commit limit is equal to the sum of the size of RAM and the size of the paging file(s).
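The commit limit arithmetic is simple enough to work through directly. The figures below (8 GiB of RAM, a single 4 GiB paging file, 4 KB pages) are made-up inputs, and the function names are illustrative only.

```python
GIB = 1 << 30


def commit_limit_pages(ram_bytes, paging_file_sizes, page_size=4096):
    """Commit limit in pages: (RAM + all paging files) / page size."""
    return (ram_bytes + sum(paging_file_sizes)) // page_size


def can_commit(requested_pages, committed_pages, limit_pages):
    """True if a new allocation would stay within the commit limit."""
    return committed_pages + requested_pages <= limit_pages


# Hypothetical machine: 8 GiB of RAM and one 4 GiB paging file.
limit = commit_limit_pages(8 * GIB, [4 * GIB])  # 3,145,728 pages
```

An allocation request that would push the committed total past `limit` is refused, regardless of how much physical memory happens to be free at that moment.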

The Page Table entry, or PTE, the format of which is specified by the hardware, is the basic mechanism used by the hardware and operating system to communicate the current allocation status of a virtual page. Two bits in the PTE, the valid bit and the dirty bit, are the key status indicators. When the PTE is flagged as invalid, it is a signal to the hardware to not perform virtual address translation. When the PTE is marked valid, it will contain the address of the physical memory page that was allocated by the OS that is to be used in address translation. When the PTE is marked invalid for address translation, the remaining bits in the PTE can be used by the operating system. For example, if the page in question currently resides on the paging file, the data necessary to access the page from the paging file are usually stored in the PTE. (Additional hardware-specified bits in the PTE are used to indicate that the page is Read-only, the page size, and other status data associated with the page.)

Initially, when a virtual memory page is first allocated, its PTE is marked as invalid because the OS has not yet allocated it a physical memory page. Once it is accessed, and the OS does allocate a physical memory page for it, the PTE is marked as valid, and is updated to reflect the physical memory address that the OS assigned. The hardware sets the “dirty” bit to indicate that an instruction has written or changed data on the page. The OS accesses the dirty bit during page replacement to determine if it is necessary to write the contents of the page to the paging file before the physical memory page can be “re-purposed” for use by a different range of virtual addresses.
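The way the valid and dirty bits are consulted can be sketched with a toy PTE encoded as an integer. The bit positions used here (bit 0 for valid, bit 1 for dirty, frame number from bit 12 upward) are a hypothetical layout chosen for readability; real PTE formats are defined by the hardware and differ from this.

```python
# Hypothetical PTE layout, for illustration only.
VALID = 1 << 0
DIRTY = 1 << 1
FRAME_SHIFT = 12


def make_valid_pte(frame):
    """Build a PTE that maps to `frame` and is marked valid and clean."""
    return (frame << FRAME_SHIFT) | VALID


def is_valid(pte):
    return bool(pte & VALID)


def is_dirty(pte):
    return bool(pte & DIRTY)


def frame_of(pte):
    return pte >> FRAME_SHIFT


def mark_written(pte):
    """What the hardware does on a store: set the dirty bit."""
    return pte | DIRTY


def needs_writeback(pte):
    """During page replacement, only valid dirty pages must be written out."""
    return is_valid(pte) and is_dirty(pte)


fresh_pte = 0                      # newly allocated page: invalid, no frame
mapped_pte = make_valid_pte(42)    # after first access
written_pte = mark_written(mapped_pte)
```

A clean page can be re-purposed immediately, while a dirty one has to be copied to the paging file first.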

Page fault resolution

It is not until an instruction executing inside the process attempts to access a virtual memory address during execution that the OS maps the virtual address to a corresponding physical page in RAM. When the Page Table entry (PTE) used for virtual:physical address translation indicates no corresponding physical memory page has been assigned, an instruction that references a virtual address on that page generates an addressing exception. This condition is known as a page fault. The operating system intercepts this page fault and allocates an available page from physical memory, modifying the corresponding PTE to reflect the change. Once a valid virtual:physical mapping exists, the original failing instruction can be re-executed successfully.
In Windows, in resolving a page fault that results from an initial access, the OS assigns an empty page from its Zero Page list to the process address space that generated the fault and marks the corresponding PTE as valid. These operations are known as Demand Zero page faults.

Page fault resolution is transparent to the underlying process address space, but it does have a performance impact. The instruction thread that was executing when the page fault occurred is blocked until after the page fault is resolved. (In Windows, a Thread Wait State Reason is assigned that indicates an involuntary wait status, waiting for the OS to release the thread again, following page fault resolution.) The operating system attempts to minimize page fault resolution time by maintaining a queue of free physical memory pages that are available to be allocated immediately whenever a demand zero page fault occurs. Resolving a page fault by supplying an empty page from the Zero list is regarded as a “soft” fault in Windows because the whole operation is designed to be handled very quickly and usually does not necessitate a disk operation.

Hard page faults are any that need to be resolved from disk. When a thread from the process address space first writes data to the page, changing its contents, the hardware sets the “dirty” bit in that page’s PTE. Later, if a dirty page is “stolen” from the process address space during a page trimming scan, the dirty bit provides an indication to the OS that the contents of the page must be written to the paging file before the page can be “re-purposed.” When the contents of the page have been written to disk, the PTE is marked, showing its location out in the paging file. If, subsequently, a thread from the original process executes and re-references an address on a previously stolen page, a page fault is generated. During hard page fault resolution, the OS determines from the PTE that the page is currently on disk. It initiates a Page Read operation to the disk that copies the current contents of the page from the paging file into an empty page on the Zero list. When this disk IO operation completes, the OS updates the PTE and re-dispatches the thread that was blocked for the duration of the disk IO.
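The difference between a demand zero fault and a hard fault can be traced through a toy memory manager. This is a sketch under several simplifying assumptions: the class and attribute names are invented, the "paging file" is a dictionary, stolen pages are always written out (the dirty bit is ignored), and no attempt is made to model how Windows actually implements any of this.

```python
PAGE_SIZE = 4096  # assumed page size


class ToyMemoryManager:
    def __init__(self, free_frames):
        self.zero_list = list(range(free_frames))  # frames ready for use
        self.frames = {}        # frame number -> bytearray contents
        self.page_table = {}    # page -> ("valid", frame) or ("paged_out", slot)
        self.paging_file = {}   # slot -> saved bytes
        self.soft_faults = 0
        self.hard_faults = 0

    def access(self, page):
        """Return the page's contents, resolving a page fault if needed."""
        entry = self.page_table.get(page)
        if entry is not None and entry[0] == "valid":
            return self.frames[entry[1]]
        frame = self.zero_list.pop(0)  # IndexError if the list is depleted
        if entry is None:
            # First touch: demand zero fault, no disk access required.
            self.soft_faults += 1
            self.frames[frame] = bytearray(PAGE_SIZE)
        else:
            # Page was stolen earlier: read it back from the paging file.
            self.hard_faults += 1
            self.frames[frame] = bytearray(self.paging_file.pop(entry[1]))
        self.page_table[page] = ("valid", frame)
        return self.frames[frame]

    def steal(self, page):
        """Write a resident page to the paging file and free its frame."""
        state, frame = self.page_table[page]
        if state != "valid":
            raise ValueError("page is not resident")
        self.paging_file[page] = bytes(self.frames.pop(frame))
        self.page_table[page] = ("paged_out", page)
        self.zero_list.append(frame)


manager = ToyMemoryManager(free_frames=2)
manager.access(0)[0] = 42          # demand zero fault, then a store
manager.steal(0)                   # page trimmed and written out
value_after_reload = manager.access(0)[0]  # hard fault brings it back
```

The thread never sees the fault itself, only the delay; the data it stored is intact after the page is read back.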

LRU-based page replacement

Whenever the queue of available physical memory pages on the Zero list becomes depleted, however, the operating system needs to invoke its page replacement policy to replenish it. Page replacement, also known as page stealing or, more euphemistically, page trimming, involves scanning physical memory looking for good candidates for page replacement, based on their pattern of usage.

Specifically, operating systems like Windows or Linux implement page replacement policies that choose to replace pages based on actual memory usage patterns, which requires them to keep track – to a degree – of which virtual memory pages an application allocates that are actually currently in use. A page replacement policy that can identify those allocated pages which are Least Recently Used (LRU) and target them for removal has generally proven quite effective. Most cache management algorithms – and it is quite reasonable to conceptualize physical memory as a “cache” for a virtual address space – in use today use some form of LRU-based page replacement.

In order to identify which allocated pages an application is actually using at a given time, it is necessary for the OS to gather information on page usage patterns. The hardware provides very basic functions that the OS can then exploit to track physical memory usage. Whenever an instruction accesses an address on a page, the hardware sets an access bit in the Page Table Entry (PTE) associated with that page. (Similarly, the hardware sets a “dirty” bit to indicate that an instruction has stored new data somewhere in the page.)

How the OS uses this information from the PTE access bit to keep track of the age of a page varies from vendor to vendor. For instance, some form of “clock algorithm” that periodically resets the access bits of every page that was recently accessed is the approach used in the IBM mainframe OS. Each clock interval, the aging code is dispatched to scan memory and reset the access bit for any page that was accessed during the previous interval. Meanwhile, the clock aging algorithm increments the unreferenced interval count for any page that was not accessed during the interval. Over time, the distribution of unreferenced interval counts for allocated pages yields a partial order over the age of each page on the machine. This partial order allows the page replacement routine to target the oldest pages on the system for page stealing.
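A single aging pass of this kind might look like the sketch below. The data layout and names are invented, and one detail is an assumption not stated above: the unreferenced interval count is reset to zero whenever a page is found to have been accessed.

```python
def age_pages(pages):
    """Run one aging pass over all pages.

    pages: dict mapping page id -> {"accessed": bool, "unreferenced": int}.
    """
    for state in pages.values():
        if state["accessed"]:
            state["accessed"] = False    # reset the access bit
            state["unreferenced"] = 0    # assumption: the count restarts
        else:
            state["unreferenced"] += 1   # one more interval without use


def steal_candidates(pages, count):
    """Return up to `count` page ids, oldest (most unreferenced) first."""
    ordered = sorted(pages, key=lambda p: pages[p]["unreferenced"], reverse=True)
    return ordered[:count]


# Toy run: page "A" is touched every interval, page "B" is never touched.
pages = {
    "A": {"accessed": False, "unreferenced": 0},
    "B": {"accessed": False, "unreferenced": 0},
}
for _ in range(3):
    pages["A"]["accessed"] = True   # simulated hardware setting the bit
    age_pages(pages)

oldest = steal_candidates(pages, 1)
```

Pages with equal counts are indistinguishable, which is why the result is only a partial order. Every pass also visits every page, which is the linear cost discussed next.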

The clock algorithm provides an incremental level of detail on memory usage patterns that is potentially quite useful for performance and capacity planning purposes [3], but it also has some known limitations, especially with regard to performance. One performance limitation is that the execution time of a memory scan varies linearly with the size of RAM. On very large scale machines, with larger amounts of RAM to manage, scanning page table entries is time-consuming. And it is precisely those machines that have the most memory and the least memory contention where the overhead of maintaining memory usage data is the highest.

Windows adopted a form of interval-oriented clock-based page aging algorithm that, hopefully, requires far fewer resources to run, allowing memory management to scale better for machines with very large amounts of RAM to manage. In Windows, the Balance Set Manager is dispatched once per second to “trim” pages aggressively from processes that have working set pages that exceed their target values, which by default are set arbitrarily to low levels. Pages stolen from the address space in this fashion are, in fact, only stolen provisionally. In effect, they are placed in a memory resident cache managed as a FIFO queue called the Standby list. (In some official Windows documentation sources, the Standby list is referred to simply as “the cache.”) When the process references any previously stolen pages that are still resident in the FIFO cache, these pages can be “soft-faulted” back into the process’s working set without the necessity for any IO to the paging disk.

Pages in the Standby list that remain unreferenced are aged during successive page trimming cycles, eventually being pushed to the head of the queue. The Windows OS zero paging thread, which is awakened whenever the Zero list needs replenishing, pulls aged pages from the head of the Standby list, and writes zero values to the page, erasing the previous contents. After being zeroed, the page is then moved to the Zero list, which is used to satisfy any process requests for new page allocations. (Stolen pages that have their “dirty” bit set are detoured first to the Modified List prior to being added to the Standby cache.)
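The provisional nature of page stealing can be modeled with a small FIFO structure. This is a loose sketch, not a description of Windows internals: the `StandbyList` class and its methods are invented, the Modified List is omitted, and pages are represented as byte strings.

```python
from collections import OrderedDict


class StandbyList:
    """FIFO cache of trimmed pages; the oldest entry is reclaimed first."""

    def __init__(self):
        self._pages = OrderedDict()  # (process, page) -> contents, oldest first

    def trim(self, key, contents):
        """Place a provisionally stolen page at the tail of the queue."""
        self._pages[key] = contents

    def soft_fault(self, key):
        """Return the page if it is still resident, removing it; else None."""
        return self._pages.pop(key, None)

    def reclaim_oldest(self):
        """Take the oldest page and return a zeroed page of the same size."""
        _key, contents = self._pages.popitem(last=False)
        return bytes(len(contents))


standby = StandbyList()
standby.trim(("proc", 1), b"abc")
standby.trim(("proc", 2), b"xyz")

recovered = standby.soft_fault(("proc", 2))   # still resident: no disk IO
zeroed = standby.reclaim_oldest()             # page 1 is erased for reuse
too_late = standby.soft_fault(("proc", 1))    # gone: would need a hard fault
```

A page referenced again while it is still queued costs almost nothing to recover; once it has been zeroed, the only copy left is the one in the paging file.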

So long as an application’s set of resident virtual memory pages corresponds reasonably well to its working set of active pages, relatively few incidents of hard page faults will occur during execution, and managing virtual memory on demand will have very little impact on the execution time of the application. Moreover, so long as the operating system succeeds in maintaining an adequate inventory of available physical pages in advance of their actual usage by running processes, what page faults do occur can be resolved relatively quickly, minimizing the execution time delays that running processes incur. However, the performance impact of virtual memory management on the execution time of running tasks can be substantial if, for example, the demand for new pages exceeds the supply, or replenishing the inventory of available physical pages forces the OS to steal pages that are apt to be accessed again quite soon once a blocked process is re-dispatched. This situation where the impact of virtual memory management on performance is significant is commonly referred to as thrashing, conjuring up an image of the machine exerting a great deal of effort on behalf of moving many virtual memory pages in and out of physical memory to the detriment of performing useful work.
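How sharply fault rates can rise once resident memory falls below the working set is easy to see in a small simulation. The sketch below uses strict LRU replacement and a deliberately unfavorable cyclic reference pattern; both are assumptions chosen to make the effect visible, not a model of any particular OS.

```python
from collections import OrderedDict


def count_faults(trace, frames):
    """Count page faults for a reference trace under strict LRU replacement."""
    resident = OrderedDict()  # pages in memory, least recently used first
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used
            resident[page] = True
    return faults


# 100 references cycling over a working set of 4 pages.
trace = [0, 1, 2, 3] * 25

faults_with_enough_memory = count_faults(trace, frames=4)  # 4 faults
faults_one_frame_short = count_faults(trace, frames=3)     # 100 faults
```

With four frames only the first touch of each page faults. With three frames, LRU always evicts the page that is needed next, so every one of the 100 references faults.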

