Virtual memory management in VMware.

Server virtualization technology, as practiced by products such as the VMware ESX hypervisor, applies similar virtual memory management techniques in order to operate an environment where multiple virtual guest machines are provided separate address spaces so they can execute concurrently, sharing a single hardware platform. To avoid confusion, in this section machine memory will refer to the actual physical memory (or RAM) installed on the underlying VMware Host platform. Virtual memory will continue to refer to the virtual address space a guest OS builds for a process. Physical memory will refer to a virtualized view of machine memory that VMware grants to each guest machine. Virtualization thus adds a second level of memory address virtualization. (A white paper published by VMware entitled “Understanding Memory Resource Management in VMware® ESX™ Server” is a good reference.)

When VMware spins up a new virtual guest machine, it grants that machine a set of contiguous virtual memory addresses that correspond to a fixed amount of physical memory, as specified by configuration parameters. The fact that this grant of physical memory pages does not reflect a commitment of actual machine memory is transparent to the guest OS, which then proceeds to create page tables and allocate this (virtualized) physical memory to running processes the same as it would if the OS were running native on the hardware. The VMware hypervisor is then responsible for maintaining a second set of physical:machine memory mapping tables, which VMware calls shadow page tables. Just as the page tables maintained by the OS map virtual addresses to (virtualized) physical addresses, the shadow page tables map the virtualized physical addresses granted to the guest OS to actual machine memory pages, which are managed by the VMware hypervisor.

VMware maintains a set of shadow page tables that map virtualized physical addresses to machine memory addresses for each guest machine that it is executing. In effect, there is a second level of virtual to physical address translation that occurs each time a program executing inside a guest machine references a virtual memory address, once for the guest OS to map the process virtual address to a virtualized physical address and then by the VMware hypervisor to map the virtualized physical address to an actual machine memory address. Server hardware is available that supports this two-phase virtual:physical address mapping, as illustrated in Figure 1. In a couple of white papers, VMware reports this hardware greatly reduces the effort required by the VMware Host software to maintain the shadow page tables.

Figure 1. Two levels of Page Tables are maintained in virtualization Hosts. The first level is the normal set of Page Tables that the guest machines build to map virtual address spaces to (virtualized) physical memory. The virtualization layer builds a second set of shadow Page Tables that are involved in a two-step address translation process to derive the actual machine memory address during instruction execution.
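The two-step translation can be made concrete with a minimal Python sketch. It assumes 4 KB pages and models each table as a plain dictionary. The names translate, guest_page_table and physical_to_machine are hypothetical, chosen for illustration only; real page tables are multi-level structures walked by the hardware, not dictionaries.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def translate(virtual_addr, guest_page_table, physical_to_machine):
    """Map a guest virtual address to a machine address in two steps."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    # Step 1: the guest OS page table maps the virtual page
    # to a (virtualized) physical page.
    ppn = guest_page_table[vpn]
    # Step 2: the hypervisor's table maps that physical page
    # to an actual machine memory page.
    mpn = physical_to_machine[ppn]
    return mpn * PAGE_SIZE + offset


# Hypothetical mappings for one process in one guest machine.
guest_page_table = {0: 7, 1: 3}
physical_to_machine = {7: 42, 3: 19}

machine_addr = translate(1 * PAGE_SIZE + 0x10, guest_page_table, physical_to_machine)
print(hex(machine_addr))  # 0x13010
```

The guest OS only ever sees the first dictionary. The second one belongs to the hypervisor, which is why a guest cannot tell whether a physical page is currently backed by machine memory.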

Ballooning.

VMware attempts to manage virtual memory on demand without unnecessarily duplicating all the effort that its client guest machines already expend on managing virtual memory. The VMware hypervisor, which also needs to scale effectively on machines with very large amounts of physical memory, only gathers a minimum amount of information on the memory access patterns of any virtual machine guests that it is currently running. When VMware needs to replenish its inventory of available pages, it attempts to pressure the resident virtual machines to make those decisions by inducing paging within the guest OS, using a technique known as ballooning.

The VMware memory manager intervenes to handle the page faults that occur when a page initially granted to a guest OS is first referenced. This first reference triggers the allocation of machine memory to back the page affected, and results in the hypervisor setting the valid bit of the corresponding shadow Page Table entry. On the basis of setting the PTE valid bit on this first reference, VMware understands that it is an active page. But, following the initial access, VMware does very little to try to understand the reference patterns of the active pages of a guest OS. Neither does it attempt to use an LRU-based page replacement algorithm.
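To illustrate allocation on first reference, here is a toy Python model, not a description of the actual ESX implementation. The HostMemory class and its touch method are hypothetical names; the dictionary stands in for the valid entries of the hypervisor's mapping tables.

```python
class HostMemory:
    """Toy model: machine pages are handed out only on first reference."""

    def __init__(self, machine_pages):
        self.free_machine_pages = list(range(machine_pages))
        # (guest id, physical page number) -> machine page number.
        # A key being present plays the role of the PTE valid bit.
        self.backing = {}

    def touch(self, guest_id, ppn):
        """Reference a guest physical page, backing it on first use."""
        key = (guest_id, ppn)
        if key not in self.backing:
            # First reference: the fault is handled by the hypervisor,
            # which allocates a machine page and marks the entry valid.
            if not self.free_machine_pages:
                raise MemoryError("no free machine pages; reclamation needed")
            self.backing[key] = self.free_machine_pages.pop()
        return self.backing[key]


host = HostMemory(machine_pages=4)
first = host.touch("vm1", ppn=0)   # faults, allocates a machine page
again = host.touch("vm1", ppn=0)   # already valid, no new allocation
print(first == again, len(host.free_machine_pages))
```

Note that nothing in this model records when or how often a page is referenced after the first touch, which mirrors the point that VMware gathers very little reference-pattern information.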

VMware does try to understand how many of the pages allocated to a guest machine are actually active using sampling. At random, it periodically selects a small sample of the guest machine’s active pages and flips the valid bit in the shadow PTE.[1] This is mainly done to try to identify guest machines that are idle and to calculate what is known as the idle memory tax. Pages from idle guest machines are preferred if VMware needs to perform page replacement. If any of those active pages that are flagged as invalid are referenced again, these pages are then soft-faulted back into the guest OS working set with little delay. The percentage of such pages that are re-referenced within the sampling period is used to estimate the total number of Active pages in the guest machine working set. Note that it is only an estimate.
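The arithmetic behind the estimate is simple enough to sketch in Python. This is an illustration under assumptions, not VMware's code: the function name estimate_active_pages and the was_touched callback (which stands in for "the page was soft-faulted back during the sampling period") are hypothetical, and the default sample size of 100 is taken from the white paper cited in the footnote.

```python
import random


def estimate_active_pages(total_pages, was_touched, sample_size=100, rng=random):
    """Estimate a guest's active page count from a random sample."""
    sample_size = min(sample_size, total_pages)
    if sample_size == 0:
        return 0
    # Pick pages at random; in the real system these are the pages
    # whose valid bit is flipped for the sampling period.
    sample = rng.sample(range(total_pages), sample_size)
    touched = sum(1 for page in sample if was_touched(page))
    # Scale the touched fraction of the sample up to the whole guest.
    return round(total_pages * touched / sample_size)


# Hypothetical guest: 10,000 pages, of which the first 2,500 are in active use.
estimate = estimate_active_pages(10_000, was_touched=lambda page: page < 2_500)
print(estimate)  # varies from run to run, typically near 2,500
```

With only 100 pages sampled, each re-referenced page moves the estimate by 1% of the guest's memory, which is one reason to treat the result as a rough indicator.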

Using the page fault mechanism described above, VMware assigns free machine memory pages to a guest OS on demand. When the amount of free machine memory available for new guest machine allocation requests drops below 6%, ballooning is triggered. Ballooning is an attempt to induce page stealing in the guest OS. Ballooning works as follows. VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate.” vmmemctl.sys is the VMware balloon device driver software installed inside a guest Windows machine that “inflates” on command. The vmmemctl.sys VMware balloon driver uses a private communications channel to poll the VMware Host once per second to obtain a ballooning target. Waldspurger [7] reports that in Windows, the balloon inflates by calling standard routines that are available to device drivers that need to pin virtual memory pages in memory. The two memory allocation APIs Waldspurger references are MmProbeAndLockPages and MmAllocatePagesForMDLEx. These APIs specifically allocate pages that remain resident in physical memory until they are explicitly freed by the device driver.

After allocating these balloon pages, which remain empty of any content, the balloon driver sends a return message to the VMware Host, providing a list of the physical addresses of the pages it has acquired. Since these pages will remain unused, the VMware memory manager can reclaim the machine memory backing them immediately upon receipt of this reply. Ballooning itself, however, has no guaranteed immediate impact on physical memory contention inside the guest. The intent is to pin enough guest OS pages in physical memory to trigger the guest machine’s page replacement policy. If the balloon request can be satisfied without triggering that policy, there will be no visible impact inside the guest machine. If there is no relief from the memory contention, VMware may continue to increase the guest machine’s balloon target until the guest machine starts to shed pages. We will see how effectively this process works in the next blog entry in this series.
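The interaction between the balloon target and the guest's own paging can be sketched with a toy Python model that counts pages rather than tracking individual addresses. The GuestMemory class and set_balloon_target method are hypothetical names; this is not the vmmemctl interface, only an illustration of why a small balloon is invisible to the guest while a larger one forces page replacement.

```python
class GuestMemory:
    """Toy model of a guest's (virtualized) physical memory, in pages."""

    def __init__(self, total_pages, in_use_pages):
        self.total_pages = total_pages
        self.in_use_pages = in_use_pages  # pages holding guest data
        self.balloon_pages = 0            # pages pinned by the balloon driver
        self.paged_out = 0                # pages the guest OS had to evict

    @property
    def free_pages(self):
        return self.total_pages - self.in_use_pages - self.balloon_pages

    def set_balloon_target(self, target_pages):
        """Inflate or deflate the balloon to the host's target.

        Returns the change in balloon size: pages the host may reclaim
        (positive) or must hand back to the guest (negative).
        """
        target_pages = max(0, min(target_pages, self.total_pages))
        delta = target_pages - self.balloon_pages
        shortfall = delta - self.free_pages
        if shortfall > 0:
            # Not enough free pages to pin: the guest's own page
            # replacement policy must evict pages to make room.
            self.in_use_pages -= shortfall
            self.paged_out += shortfall
        self.balloon_pages = target_pages
        return delta


guest = GuestMemory(total_pages=1000, in_use_pages=700)
guest.set_balloon_target(200)   # satisfied from free pages, no guest paging
print(guest.paged_out)          # 0
guest.set_balloon_target(500)   # exceeds free pages, guest must shed pages
print(guest.paged_out)          # 200
```

In this model the first request reclaims 200 pages for the host without the guest noticing, while the second pushes the guest into evicting 200 pages of its own choosing.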

Because inducing page replacement at the guest machine level using ballooning may not act quickly enough to relieve a machine memory shortage, VMware will also resort to random page replacement from guest OS working sets when necessary. In VMware, this is called swapping. Swapping is triggered when the amount of free machine memory available for new guest machine allocation requests drops below 4%. Random page replacement is one page replacement policy that can be performed without gathering any information about the age of resident pages, and while less optimal than an LRU-based approach, simulation studies show its performance can be reasonably effective.
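A random replacement policy is short to express, which is part of its appeal. The following Python sketch uses a hypothetical reclaim_random_pages function and treats a guest's resident pages as a set of page numbers; it is an illustration of the policy in general, not of how ESX selects or writes out swap candidates.

```python
import random


def reclaim_random_pages(resident_pages, pages_needed, rng=random):
    """Choose victim pages uniformly at random.

    No age or reference information is consulted, so nothing has to
    be tracked while the guest is running.
    """
    count = max(0, min(pages_needed, len(resident_pages)))
    victims = set(rng.sample(sorted(resident_pages), count))
    remaining = set(resident_pages) - victims
    return victims, remaining


# Hypothetical guest with 100 resident pages; the host needs 10 back.
victims, remaining = reclaim_random_pages(set(range(100)), pages_needed=10)
print(len(victims), len(remaining))  # 10 90
```

The obvious cost is that a victim may be a page the guest is about to reference, whereas an LRU-based policy would have spent bookkeeping effort to avoid that choice.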

VMware’s current level of physical memory contention is encapsulated in a performance counter called Memory State. This Memory State variable is set based on the amount of Free memory available. Memory state transitions trigger the reclamation actions reported in Table 1:


| State | Value | Free Memory Threshold | Reclamation Action |
|---|---|---|---|
| High | 0 | ≥ 6% | None |
| Soft | 1 | < 6% | Ballooning |
| Hard | 2 | < 4% | Swapping to Disk or Pages compressed |
| Low | 3 | < 2% | Blocks execution of active VMs > target allocations |

Table 1. The values reported in the ESX Host Memory State performance counter.
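When post-processing performance data, it can be handy to derive the state from a free memory percentage. The Python sketch below encodes only the thresholds listed in Table 1; the memory_state function is a hypothetical helper, and it assumes the thresholds apply as simple cut-offs, which may not match the exact transition logic of any particular ESX version.

```python
def memory_state(free_pct):
    """Classify free machine memory (percent) per the Table 1 thresholds.

    Returns (counter value, state name, reclamation action).
    """
    if free_pct >= 6:
        return 0, "High", "None"
    if free_pct >= 4:
        return 1, "Soft", "Ballooning"
    if free_pct >= 2:
        return 2, "Hard", "Swapping to disk or page compression"
    return 3, "Low", "Blocks execution of active VMs above target allocations"


for pct in (8.0, 5.0, 3.0, 1.0):
    value, state, action = memory_state(pct)
    print(f"{pct:4.1f}% free -> {value} ({state}): {action}")
```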

In monitoring the performance of a VMware Host configuration, the Memory State counter is one of the key metrics to track.

In the case study discussed beginning in the next blog entry, a benchmarking workload was executed that generated contention for machine memory on a VMware ESX server. During the benchmark, we observed the memory state transitioning to both the “soft” and “hard” paging states shown in Table 1, triggering both ballooning and swapping.



[1] According to the “Understanding Memory Resource Management in VMware® ESX™ Server” white paper, ESX selects 100 physical pages randomly from each guest machine and records how many of the pages that were selected were accessed in the next 60 seconds. The sampling rate can be adjusted by changing Mem.SamplePeriod in ESX advanced settings.
