Server virtualization technology, as practiced in products such as the VMware ESX hypervisor, applies similar virtual memory management techniques to provide multiple guest machines with separate address spaces so they can execute concurrently while sharing a single hardware platform. To avoid confusion, in this section machine memory will refer to the actual physical memory (or RAM) installed on the underlying VMware Host platform. Virtual memory will continue to refer to the virtual address space a guest OS builds for a process. Physical memory will refer to the virtualized view of machine memory that VMware grants to each guest machine. Virtualization thus adds a second level of memory address virtualization. (A white paper published by VMware entitled "Understanding Memory Resource Management in VMware® ESX™ Server" is a good reference.)
When VMware spins up a new virtual guest machine, it grants that machine a set of contiguous virtual memory addresses that correspond to a fixed amount of physical memory, as specified by configuration parameters. The fact that this grant of physical memory pages does not reflect a commitment of actual machine memory is transparent to the guest OS, which proceeds to create page tables and allocate this (virtualized) physical memory to running processes the same as it would if the OS were running native on the hardware. The VMware hypervisor is then responsible for maintaining a second set of physical:machine memory mapping tables, which VMware calls shadow page tables. Just as the page tables maintained by the OS map virtual addresses to (virtualized) physical addresses, the shadow page tables map the virtualized physical addresses granted to the guest OS to actual machine memory pages, which are managed by the VMware hypervisor.
VMware maintains a set of shadow page tables that map virtualized physical addresses to machine memory addresses for each guest machine it is executing. In effect, a second level of virtual-to-physical address translation occurs each time a program executing inside a guest machine references a virtual memory address: once when the guest OS maps the process virtual address to a virtualized physical address, and again when the VMware hypervisor maps the virtualized physical address to an actual machine memory address. Server hardware is available that supports this two-phase virtual:physical address mapping, as illustrated in Figure 1. In a couple of white papers, VMware reports that this hardware greatly reduces the effort required by the VMware Host software to maintain the shadow page tables.
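To make the two-level translation concrete, here is a minimal sketch in C, assuming a 4 KB page size. The lookup helpers stand in for the guest OS page tables and the hypervisor's shadow page tables; these names are illustrative, not VMware's actual data structures.

#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Level 1: guest page tables, guest-virtual page -> guest-"physical" page. */
extern uint64_t guest_page_table_lookup(uint64_t guest_virtual_page);

/* Level 2: shadow page tables, guest-"physical" page -> machine page. */
extern uint64_t shadow_page_table_lookup(uint64_t guest_physical_page);

/* Translate a guest virtual address all the way to a machine address. */
uint64_t translate(uint64_t guest_virtual_addr)
{
    uint64_t offset  = guest_virtual_addr & PAGE_MASK;
    uint64_t gv_page = guest_virtual_addr >> PAGE_SHIFT;

    uint64_t gp_page = guest_page_table_lookup(gv_page);   /* by guest OS   */
    uint64_t m_page  = shadow_page_table_lookup(gp_page);  /* by hypervisor */

    return (m_page << PAGE_SHIFT) | offset;
}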
Ballooning.
VMware attempts to manage virtual memory on demand without
unnecessarily duplicating all the effort that its client guest machines already
expend on managing virtual memory. The VMware hypervisor, which also needs to
scale effectively on machines with very large amounts of physical memory, only
gathers a minimum amount of information on the memory access patterns of any
virtual machine guests it is currently running. When VMware needs to replenish its inventory of available machine memory pages, it pressures the resident guest machines to make the page replacement decisions themselves by inducing paging within the guest OS, using a technique known as ballooning.
The VMware memory manager intervenes to handle the page fault that occurs when a page initially granted to a guest OS is first referenced. This first reference triggers the allocation of machine memory to back the affected page, and results in the hypervisor setting the valid bit of the corresponding shadow page table entry. Because the valid bit is set on this first reference, VMware knows the page is active. Following the initial access, however, VMware does very little to track the reference patterns of a guest OS's active pages, nor does it attempt to use an LRU-based page replacement algorithm.
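A rough sketch of this first-reference fault path might look as follows; the shadow PTE layout and the allocator helper are assumptions for illustration only, not VMware internals.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shadow page table entry: a machine page frame number
 * plus the valid bit the hypervisor sets on first reference. */
typedef struct {
    uint64_t machine_page;
    bool     valid;
} shadow_pte_t;

extern uint64_t allocate_machine_page(void);  /* assumed allocator */

/* Fault handler invoked when a guest first touches a granted page
 * that has no machine memory behind it yet. */
void handle_first_reference(shadow_pte_t *pte)
{
    if (!pte->valid) {
        /* Commit machine memory only on first use... */
        pte->machine_page = allocate_machine_page();
        /* ...and mark the entry valid: the page is now known active. */
        pte->valid = true;
    }
}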
VMware does, however, try to determine how many of the pages allocated to a guest machine are actually active, using sampling. Periodically, it selects a small random sample of the guest machine's active pages and clears the valid bit in the corresponding shadow PTEs.[1] This is done mainly to identify guest machines that are idle and to calculate what is known as the idle memory tax; pages from idle guest machines are preferred when VMware needs to perform page replacement. If any of the sampled pages flagged invalid are referenced again, they are soft-faulted back into the guest OS working set with little delay. The percentage of sampled pages that are re-referenced within the sampling period is used to estimate the total number of active pages in the guest machine's working set. Note that it is only an estimate.
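The sampling estimate can be sketched as follows, using the 100-page sample and 60-second window described in the footnote; the helper routines are hypothetical.

#include <stdint.h>

#define SAMPLE_SIZE 100

extern uint64_t pick_random_granted_page(void);        /* assumed */
extern void     invalidate_shadow_pte(uint64_t page);  /* assumed */
extern int      was_soft_faulted(uint64_t page);       /* assumed: was the
                                                          page re-referenced? */

/* Estimate the fraction of a guest's granted pages that are active. */
double estimate_active_fraction(void)
{
    uint64_t sample[SAMPLE_SIZE];
    int touched = 0;

    /* Invalidate a small random sample of shadow PTEs. */
    for (int i = 0; i < SAMPLE_SIZE; i++) {
        sample[i] = pick_random_granted_page();
        invalidate_shadow_pte(sample[i]);
    }

    /* ...wait out the sampling period (60 seconds by default)... */

    /* Count how many sampled pages were soft-faulted back in. */
    for (int i = 0; i < SAMPLE_SIZE; i++)
        if (was_soft_faulted(sample[i]))
            touched++;

    /* The touched fraction estimates the guest's active-page fraction. */
    return (double)touched / SAMPLE_SIZE;
}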
Using the page fault mechanism described above, VMware assigns free machine memory pages to a guest OS on demand. When the amount of free machine memory available for new guest machine allocation requests drops below 6%, ballooning is triggered. Ballooning is an attempt to induce page stealing in the guest OS. It works as follows: VMware installs a balloon driver inside the guest OS and signals the driver to begin to "inflate." In a guest Windows machine, the balloon driver is vmmemctl.sys, which uses a private communications channel to poll the VMware Host once per second to obtain a ballooning target. Waldspurger [7] reports that in Windows the balloon inflates by calling standard kernel routines available to device drivers that need to pin virtual memory pages in memory. The two APIs Waldspurger references are MmProbeAndLockPages and MmAllocatePagesForMdlEx, which pin pages so that they remain resident in physical memory until they are explicitly released by the device driver.
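A bare-bones sketch of balloon inflation along these lines appears below, using the documented WDK calls MmAllocatePagesForMdlEx, MmGetMdlPfnArray, and MmFreePagesFromMdl. Error handling and the host communication channel are omitted; this is an illustration under those assumptions, not the actual vmmemctl.sys source.

#include <ntddk.h>

/* Inflate the balloon by `Bytes` worth of pages. The pages stay
 * resident until the driver explicitly frees them, so the guest OS
 * cannot reassign them. Returns the MDL describing the pinned pages. */
PMDL InflateBalloon(SIZE_T Bytes, PPFN_NUMBER *PfnArrayOut)
{
    PHYSICAL_ADDRESS low, high, skip;
    PMDL mdl;

    low.QuadPart  = 0;
    high.QuadPart = MAXLONGLONG;  /* any physical address will do */
    skip.QuadPart = 0;

    mdl = MmAllocatePagesForMdlEx(low, high, skip, Bytes, MmCached, 0);
    if (mdl == NULL)
        return NULL;

    /* The PFN array identifies the (guest-)physical pages now pinned;
     * these are what the driver reports back to the VMware Host. */
    *PfnArrayOut = MmGetMdlPfnArray(mdl);
    return mdl;
}

/* Deflating the balloon releases the pages back to the guest OS:
 *     MmFreePagesFromMdl(mdl);
 *     ExFreePool(mdl);
 */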
After allocating these balloon pages, which remain empty of any content, the balloon driver sends a return message to the VMware Host, providing a list of the physical addresses of the pages it has acquired. Since these pages will remain unused, the VMware memory manager can delete them from machine memory immediately upon receipt of this reply. Ballooning itself thus has no guaranteed immediate impact on physical memory contention inside the guest. The intent, however, is to pin enough guest OS pages in physical memory to trigger the guest machine's page replacement policy. If the balloon request can be satisfied without triggering that policy, i.e., if ballooning does not cause the guest OS to experience memory contention, there will be no visible impact inside the guest machine. If the Host's memory contention is not relieved, VMware may continue to increase the guest machine's balloon target until the guest machine starts to shed pages. We will see how effectively this process works in the next blog entry in this series.
Because inducing page replacement at the guest machine level using ballooning may not act quickly enough to relieve a machine memory shortage, VMware will also resort to random page replacement from guest OS working sets when necessary. In VMware, this is called swapping. Swapping is triggered when the amount of free machine memory available for new guest machine allocation requests drops below 4%. Random page replacement is a policy that can be performed without gathering any information about the age of resident pages, and while it falls short of an LRU-based approach, simulation studies show it can perform reasonably well.
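A toy sketch of random victim selection illustrates why no reference-age bookkeeping is needed; all helper routines here are assumed for illustration.

#include <stdint.h>
#include <stdlib.h>

extern uint64_t guest_resident_page_count(void);          /* assumed */
extern uint64_t resident_page_at(uint64_t index);         /* assumed */
extern void     swap_machine_page_to_disk(uint64_t page); /* assumed */

/* Reclaim `count` machine pages from a guest by picking victims at
 * random. A real implementation would also guard against picking
 * the same page twice. */
void reclaim_random_pages(uint64_t count)
{
    for (uint64_t i = 0; i < count; i++) {
        uint64_t index = (uint64_t)rand() % guest_resident_page_count();
        swap_machine_page_to_disk(resident_page_at(index));
    }
}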
VMware's current level of machine memory contention is encapsulated in a performance counter called Memory State. The Memory State variable is set based on the amount of free machine memory available, and memory state transitions trigger the reclamation actions reported in Table 1:
State | Value | Free Memory Threshold | Reclamation Action
------+-------+-----------------------+--------------------------------------
High  |   0   | ≥ 6%                  | None
Soft  |   1   | < 6%                  | Ballooning
Hard  |   2   | < 4%                  | Swapping to disk or page compression
Low   |   3   | < 2%                  | Blocks execution of active VMs above their target allocations
Table 1. The values reported in the ESX Host Memory State performance counter.
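The threshold logic of Table 1 reduces to a simple state function, sketched here; the enum and function names are mine, not ESX's.

/* The threshold logic of Table 1 as a simple state function. */
typedef enum {
    MEM_HIGH = 0,  /* no reclamation                    */
    MEM_SOFT = 1,  /* ballooning                        */
    MEM_HARD = 2,  /* swapping to disk or compression   */
    MEM_LOW  = 3   /* blocks VMs over target allocation */
} mem_state_t;

mem_state_t memory_state(double free_pct)
{
    if (free_pct >= 6.0) return MEM_HIGH;
    if (free_pct >= 4.0) return MEM_SOFT;
    if (free_pct >= 2.0) return MEM_HARD;
    return MEM_LOW;
}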
In monitoring the performance of a VMware Host
configuration, the Memory State counter is one of the key metrics to track.
In the case study discussed beginning in the next blog entry, a
benchmarking workload was executed that generated contention for machine memory
on a VMware ESX server. During the benchmark, we observed the memory state
transitioning to both the “soft” and “hard” paging states shown in Table 1,
triggering both ballooning and swapping.
[1]
According to the "Understanding Memory Resource Management in VMware® ESX™ Server" white paper, ESX randomly selects 100 physical pages from each guest machine and records how many of the selected pages are accessed over the next 60 seconds. The sampling rate can be adjusted by changing Mem.SamplePeriod in ESX advanced settings.