Virtual memory management refers to techniques that
operating systems employ to manage the allocation of physical memory resources
(or RAM) on demand, transparent to the applications that execute on the
machine. Modern operating systems, including IBM’s proprietary mainframe OSes,
virtually all flavors of Unix and Linux, as well as Microsoft Windows, have
uniformly adopted virtual memory management techniques, ever since the first
tentative results from using on-demand virtual memory management demonstrated
more effective utilization of RAM, compared to static memory partitioning
schemes. All but the simplest processor hardware offers support for virtual
memory management.
VMware is a hypervisor,
responsible for running one or more virtual machine guests, providing each guest
machine with a virtualized set of CPU, memory, disk and network resources that
VMware is then responsible for allocating and managing. With regard to managing
physical memory, VMware initially grants each virtual machine that is running a
virtualized physical address space, the size of which is specified during
configuration. From within the guest machine, there is no indication to either
the operating system or the process address spaces that execute under the guest
OS that physical addresses are virtualized. Unaware that physical
addresses are virtualized, the guest machine OS manages its physical memory in
its customary manner, allocating physical memory on demand and replacing older
pages with new pages whenever it detects contention for “virtualized” physical
memory.
To make it possible for guest machines to execute, VMware
provides an additional layer of virtual memory:physical memory mapping for each
guest machine. VMware is responsible for maintaining a hardware-specific
virtual:physical address translation capability, permitting guest machine
instructions to access their full virtualized physical address range.
Meanwhile, inside the VMware Host, actual physical memory is allocated on
demand, as guest machines execute and reference virtualized physical addresses.
As actual physical memory fills, VMware similarly must implement page
replacement. Unlike the guest OSes it
hosts, VMware itself gathers very little information concerning page reference
patterns – due to overhead concerns – that would be useful in performing page
replacement. Consequently, VMware's principal page replacement strategy is to
try to induce paging inside the guest OS, where, presumably, better-informed
decisions can be made. This is known as guest machine ballooning.
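To make the ballooning idea concrete, here is a minimal user-space sketch in C. It is only a schematic: a real balloon driver runs inside the guest kernel, pins the pages it allocates, and reports them to the host over a VMware-private channel; the function hypervisor_reclaim() below is a hypothetical stand-in for that channel.

    /* Sketch of balloon inflation: allocating pages inside the guest
     * pressures the guest OS into paging out its own least valuable
     * memory; the hypervisor can then reclaim the machine pages that
     * back the balloon. hypervisor_reclaim() is hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE     4096
    #define BALLOON_PAGES 16

    static void hypervisor_reclaim(void *guest_page)
    {
        /* hypothetical: tell the host it may repurpose this page */
        printf("balloon holds guest page at %p\n", guest_page);
    }

    int main(void)
    {
        void *balloon[BALLOON_PAGES];

        /* inflate: take pages away from the guest OS */
        for (int i = 0; i < BALLOON_PAGES; i++) {
            balloon[i] = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
            hypervisor_reclaim(balloon[i]);
        }

        /* deflate: return the pages when memory pressure eases */
        for (int i = 0; i < BALLOON_PAGES; i++)
            free(balloon[i]);
        return 0;
    }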
To support more aggressive consolidation of guest virtual
machine images onto VMware servers, VMware also attempts dynamically to
identify identical instances of virtual memory pages within the guest machine
or across guest machines that would allow them to be mapped to a single copy of
a physical memory page, thus saving on overall physical memory usage. This
feature is known as transparent memory
sharing.
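The general mechanics of content-based page sharing can be sketched as follows: hash the contents of candidate pages, treat matching hashes as sharing candidates, and confirm with a full byte-for-byte comparison before mapping two pages to a single copy. The C sketch below illustrates the idea only; it is not VMware's implementation, and the FNV-1a hash is an arbitrary choice.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* FNV-1a hash over a page's contents; identical pages hash equal. */
    static uint64_t page_hash(const uint8_t *page)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            h ^= page[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Hashes can collide, so a full compare must confirm the match
     * before two virtual pages are mapped to one physical copy. */
    static int shareable(const uint8_t *a, const uint8_t *b)
    {
        return page_hash(a) == page_hash(b) &&
               memcmp(a, b, PAGE_SIZE) == 0;
    }

    int main(void)
    {
        static uint8_t p1[PAGE_SIZE], p2[PAGE_SIZE]; /* both zero-filled */
        printf("shareable: %s\n", shareable(p1, p2) ? "yes" : "no");
        return 0;
    }

Zero-filled pages are the classic beneficiaries of this optimization, which is why the example compares two of them.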
Virtual addressing
Virtual memory refers to the virtualized linear address
space that an OS builds and presents to each application. 64-bit address
registers, for example, can access a breathtaking range of 2^64
virtual addresses, even though the actual physical memory configuration is
much, much smaller. Virtual memory addressing permits applications to be
written that can execute (largely) independent of the underlying physical
memory configuration.
Transparent to the application process address space, the
operating system maintains a set of tables that map virtual addresses to actual
physical memory addresses during runtime. This mapping is performed at the
level of a page, a block of
contiguous memory addresses. A page size of 4K addresses (2^12 bytes)
is frequently used, although other page sizes are possible. (Some computer
hardware allows the OS to select and use a range of supported page sizes.)
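With 4K pages, for example, the low-order 12 bits of a virtual address form the byte offset within the page, and the remaining high-order bits form the virtual page number, as this small C illustration shows:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                   /* 4K pages: 2^12 bytes */
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
        uint64_t va = 0x7f3a12345678ULL;    /* an arbitrary virtual address */
        uint64_t vpn    = va >> PAGE_SHIFT;       /* virtual page number    */
        uint64_t offset = va & (PAGE_SIZE - 1);   /* offset within the page */
        printf("page %#llx, offset %#llx\n",
               (unsigned long long)vpn, (unsigned long long)offset);
        return 0;
    }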
To support virtual memory management, the operating system
maintains page tables that map
virtual memory addresses to physical memory addresses for each process being
executed. The precise form of the page tables that are necessary to perform
this mapping is specified by the underlying hardware platform. As a computer
program executes on the hardware, the processor hardware performs the necessary
translation of virtual memory addresses to physical memory addresses
dynamically during run-time. Operating system functions that support virtual
memory management include setting up and maintaining the per-process page
tables that are used to perform this dynamic mapping, and instructing the
hardware about the location of these memory address translation tables in
physical memory, which is accomplished by loading a dedicated control register
to point to the process-specific set of address mapping tables. When the
execution of one running process blocks, the operating system performs a
context switch that loads a different set of page tables, allowing the valid
virtual addresses of the newly dispatched process to be translated.
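As a rough illustration, the C sketch below models a single-level page table and a context switch that swaps the active table. This is a deliberate simplification: real hardware walks multi-level tables, and the "base register" here stands in for a genuine control register such as CR3 on x86.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)
    #define PTE_VALID  0x1UL

    typedef uint64_t pte_t;

    /* Stand-in for the hardware's page-table base control register. */
    static pte_t *current_table;

    static void context_switch(pte_t *next_process_table)
    {
        current_table = next_process_table;  /* e.g. loading CR3 on x86 */
    }

    /* Translate a virtual address; returns 0 on an invalid PTE (page fault). */
    static int translate(uint64_t va, uint64_t *pa)
    {
        pte_t pte = current_table[va >> PAGE_SHIFT];
        if (!(pte & PTE_VALID))
            return 0;
        *pa = (pte & ~(PAGE_SIZE - 1)) | (va & (PAGE_SIZE - 1));
        return 1;
    }

    int main(void)
    {
        static pte_t table[4];
        table[1] = 0x5000UL | PTE_VALID;     /* virtual page 1 -> frame 0x5000 */
        context_switch(table);

        uint64_t pa;
        if (translate(0x1234, &pa))          /* page 1, offset 0x234 */
            printf("va 0x1234 -> pa %#llx\n", (unsigned long long)pa);
        return 0;
    }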
The techniques that allow an operating system to execute
multiple processes concurrently and switch between them dynamically are
collectively known as multiprogramming.
Modern operating systems evolved rapidly to support multiprogramming across
multiple processors, where each CPU is capable of accessing the full range of
installed physical memory locations.
(Large scale multi-core multiprocessors are frequently configured with more than one memory bank, the result being a NUMA (non-uniform memory access) architecture. In machines with NUMA characteristics – something that is quite common in blade servers – accessing a location that resides in a remote memory bank takes longer than a local memory access, a fact that can have serious performance implications. For optimal performance on NUMA machines, the OS memory manager must factor the NUMA topology into memory allocation decisions, something which VMware evidently does. Further discussion of NUMA architectures and the implications for the performance of guest machines is beyond the scope of the current inquiry, however. Single-core multiprocessors from Intel have uniform memory access latency, while AMD single-core multiprocessors have NUMA characteristics.)
Virtual memory management allocates memory on demand, which
is demonstrably more effective in managing physical RAM than static
partitioning schemes where each executing process acquires a fixed set of physical
memory addresses for the duration of its execution. In addition, virtual memory
provides a secure foundation for executing multiple processes concurrently
since each running process has no capability to access and store data in
physical memory locations outside the range of its own unique set of dedicated
virtual memory addresses. The OS ensures that each virtual address space is
mapped to a disjoint set of physical memory pages. The virtual addresses associated
with the OS itself represent a set of pages that are shared in common across
all of the process address spaces, a scheme that enables threads in each
process to call OS services directly, including the system services enabling
interprocess communication (or IPC).
The operating system presents each running process with a
range of virtual memory addresses to use that often exceeds the size of physical RAM. Virtualizing memory addressing
allows applications to be written that are largely unconcerned with the
physical limits of the underlying computer hardware, greatly simplifying their
construction. Permitting applications to be portable across a wide variety of
hardware configurations, irrespective of the amount of physical memory that is
actually available for them to execute, is also of considerable benefit.
The virtual:physical memory mapping and translation that
occurs during instruction execution is transparent to the application that is
running. However, there are OS functions, including setting up and maintaining
the Page Tables, which need to understand and utilize physical memory
locations. In addition, device driver software, installed alongside and serving
as an extension to the OS, is directly responsible for communicating with all
manner of peripheral devices, and it must communicate with those devices using
actual physical addresses. Peripheral devices use Direct Memory Access (DMA) interfaces that do
not have access to the processor’s virtual address to physical address mapping
capability during execution.
Memory over-commitment
Allowing applications access to a range of virtual memory
addresses that individually or collectively exceeds the amount of physical
memory that is actually available during execution inevitably leads to
situations where physical memory is over-committed. When physical memory is
over-committed, the operating system implements a page replacement policy that
dynamically manages the contents of physical memory, reclaiming a previously
allocated physical page and re-purposing it to back an entirely different
set of virtual memory addresses, possibly in a different process
address space. Dynamically replacing the pages of applications that have not
been accessed recently with more recently accessed pages has proven to be an
effective way to manage this over-commitment. This is known as demand paging.
Allowing applications to collectively commit more virtual
memory pages than are actually present in physical memory, but biasing the
contents of physical memory based on current usage patterns, permits operating
systems that support virtual memory addressing to utilize physical memory
resources very effectively. Over-commitment of physical memory works because
applications frequently exhibit phased behavior during execution in which they
actively access only a relatively small subset of the overall memory locations
they have allocated. The subset of the total number of allocated virtual memory
pages that are currently active and resident in physical memory is known as the
application’s working set of active
pages.
Under virtual memory management, a process address space
acquires virtual memory addresses a page at a time, dynamically, on demand. The
application process normally requests the OS to allocate a block of contiguous virtual memory addresses for it use. (Since RAM, by definition, is randomly addressable, the process seldom cares where within the address space this specific size block of memory addresses is located. But because fragmentation of the address space can occur, potentially leading to allocation failures when a large enough contiguous block of free addresses is not available to satisfy an allocation request, there is usually some effort on the part of the OS Memory Manager to make virtual allocations contiguous, where possible.)
In Windows, for example, these allocated
pages are known as committed pages
because the OS has committed to backing the virtual page in either physical
memory or in auxiliary memory, which is another name for the paging file
located on disk. Windows also has a commit
limit, an upper limit on the number of virtual memory pages it is willing
to allocate. The commit limit is equal to the sum of the size of RAM and the
size of the paging file(s).
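The reserve/commit distinction is visible directly in the Windows VirtualAlloc API. The sketch below, which assumes a Windows build environment, reserves a range of virtual addresses, commits a single page against the commit limit, and then touches it to trigger the first physical allocation:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Reserve a contiguous 64K range of virtual addresses; nothing
         * is charged against the commit limit yet. */
        SIZE_T size = 64 * 1024;
        char *region = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
        if (region == NULL)
            return 1;

        /* Commit one page: it now counts against the commit limit, but
         * no physical page is assigned until the page is first touched. */
        if (VirtualAlloc(region, 4096, MEM_COMMIT, PAGE_READWRITE) == NULL)
            return 1;

        region[0] = 42;  /* first touch: resolved by a demand-zero page fault */
        printf("committed page at %p\n", (void *)region);

        VirtualFree(region, 0, MEM_RELEASE);
        return 0;
    }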
The Page Table entry, or PTE, the format of which is
specified by the hardware, is the basic mechanism used by the hardware and
operating system to communicate the current allocation status of a virtual
page. Two bits in the PTE, the valid bit and the dirty bit, are the key status
indicators. When the PTE is flagged as invalid, it signals the hardware
not to perform virtual address translation. When the PTE is marked valid, it
will contain the address of the physical memory page that was allocated by the
OS that is to be used in address translation. When the PTE is marked invalid
for address translation, the remaining bits in the PTE can be used by the
operating system. For example, if the page in question currently resides on the
paging file, the data necessary to access the page from the paging file are
usually stored in the PTE. (Additional hardware-specified bits in the PTE are
used to indicate that the page is Read-only, the page size, and other status
data associated with the page.)
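To make the status bits concrete, here is a simplified, x86-flavored PTE layout expressed in C. The bit positions are an assumption for illustration; the exact format is dictated by the hardware platform:

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified x86-style PTE for 4K pages; real formats vary by
     * architecture and paging mode. */
    #define PTE_VALID    (1u << 0)    /* hardware translates only if set  */
    #define PTE_WRITABLE (1u << 1)    /* clear = read-only page           */
    #define PTE_ACCESSED (1u << 5)    /* set by hardware on any reference */
    #define PTE_DIRTY    (1u << 6)    /* set by hardware on any store     */
    #define PTE_FRAME    0xFFFFF000u  /* physical page address bits       */

    typedef uint32_t pte_t;

    /* Page replacement must flush a page to the paging file only if it
     * is valid and has been modified since it was last written out. */
    static int needs_writeback(pte_t pte)
    {
        return (pte & PTE_VALID) && (pte & PTE_DIRTY);
    }

    int main(void)
    {
        pte_t pte = 0x5000u | PTE_VALID | PTE_DIRTY;
        printf("frame %#x, needs writeback: %d\n",
               (unsigned)(pte & PTE_FRAME), needs_writeback(pte));
        return 0;
    }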
Initially, when a virtual memory page is first allocated, it
is marked as invalid because the OS has not yet allocated it a physical memory
page. Once it is accessed, and the OS does allocate a physical memory page for
it, the PTE is marked as valid and updated to reflect the physical
memory address that the OS assigned. The hardware sets the “dirty” bit to
indicate that an instruction has written or changed data on the page. The OS
accesses the dirty bit during page replacement to determine if it is necessary
to write the contents of the page to the paging file before the physical memory
page can be “re-purposed” for use by a different range of virtual addresses.
Page fault resolution
It is not until an instruction executing inside the process
attempts to access a virtual memory address during execution that the OS maps
the virtual address to a corresponding physical page in RAM. When the Page
Table entry (PTE) used for virtual:physical address translation indicates no
corresponding physical memory page has been assigned, an instruction that
references a virtual address on that page generates an addressing exception.
This condition is known as a page fault.
The operating system intercepts this page fault and allocates an available page
from physical memory, modifying the corresponding PTE to reflect the change.
Once a valid virtual:physical mapping exists, the original failing instruction can
be re-executed successfully.
In Windows, in resolving a page fault that results from an
initial access, the OS assigns an empty page from its Zero Page list to the
process address space that generated the fault and marks the corresponding PTE
as valid. These operations are known as Demand Zero page faults.
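In outline, resolving a demand-zero fault amounts to popping a pre-zeroed frame off the Zero list and validating the PTE. The C sketch below reuses the simplified 32-bit PTE format from earlier and a hypothetical zero list:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t pte_t;
    #define PTE_VALID (1u << 0)

    /* Hypothetical Zero list: a stack of pre-zeroed physical frame numbers. */
    static uint32_t zero_list[] = { 100, 101, 102 };
    static int zero_top = 3;

    static uint32_t pop_zero_page(void) { return zero_list[--zero_top]; }

    /* Resolving a demand-zero ("soft") fault: hand the faulting address a
     * pre-zeroed frame and validate the PTE; no disk IO is required. */
    static void demand_zero_fault(pte_t *pte)
    {
        uint32_t frame = pop_zero_page();
        *pte = (frame << 12) | PTE_VALID;
        /* the faulting instruction is now re-executed and succeeds */
    }

    int main(void)
    {
        pte_t pte = 0;            /* invalid: no physical page assigned yet */
        demand_zero_fault(&pte);
        printf("PTE now %#x (valid, frame %u)\n",
               (unsigned)pte, (unsigned)(pte >> 12));
        return 0;
    }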
Page fault resolution is transparent to the underlying
process address space, but it does have a performance impact. The instruction
thread that was executing when the page fault occurred is blocked until after the page fault is resolved. (In
Windows, a Thread Wait State Reason is assigned that indicates an involuntary
wait status, waiting for the OS to release the thread again, following page
fault resolution.) The operating system attempts to minimize page fault
resolution time by maintaining a queue of free physical memory pages that are
available to be allocated immediately whenever a demand zero page fault occurs.
Resolving a page fault by supplying an empty page from the Zero list is
regarded as a “soft” fault in Windows because the whole operation is designed
to be handled very quickly and usually does not necessitate a disk operation.
Hard page faults
are those that must be resolved from disk. When a thread from the process
address space first writes data to the page, changing its contents, the
hardware flags that page’s PTE “dirty” bit. Later, if a dirty page is “stolen”
from the process address space during a page trimming scan, the dirty bit
provides an indication to the OS that the contents of the page must be written
to the paging file before the page can be “re-purposed.” When the contents of the page have
been written to disk, the PTE is marked, showing its location out in the paging
file. If, subsequently, a thread from the original process executes and
re-references an address on a previously stolen page, a page fault is
generated. During hard page fault resolution, the OS determines from the PTE
that the page is currently on disk. It initiates a Page Read operation to the
disk that copies the current contents of the page from the paging file into an
empty page on the Zero list. When this disk IO operation completes, the OS
updates the PTE and re-dispatches the thread that was blocked for the duration
of the disk IO.
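A companion sketch of the hard fault path, again with hypothetical helpers standing in for the real OS primitives, shows why it is so much slower: a disk read sits in the middle of it, and the faulting thread remains blocked until the read completes.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t pte_t;
    #define PTE_VALID (1u << 0)

    /* Hypothetical stand-ins for the real OS primitives. */
    static uint32_t pop_zero_page(void) { return 200; }
    static void page_read(uint32_t slot, uint32_t frame)
    {
        /* issue a disk IO copying paging-file slot -> physical frame;
         * the faulting thread stays blocked until it completes */
        printf("read paging-file slot %u into frame %u\n", slot, frame);
    }

    /* Hard fault: while the PTE is invalid, its spare bits record the
     * paging-file slot; the page must be copied back from disk before
     * the blocked thread can be re-dispatched. */
    static void hard_fault(pte_t *pte)
    {
        uint32_t slot  = *pte >> 12;   /* OS-defined encoding while invalid */
        uint32_t frame = pop_zero_page();
        page_read(slot, frame);
        *pte = (frame << 12) | PTE_VALID;
    }

    int main(void)
    {
        pte_t pte = 7u << 12;          /* invalid; contents in slot 7 on disk */
        hard_fault(&pte);
        printf("PTE now %#x\n", (unsigned)pte);
        return 0;
    }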
LRU-based page replacement
Whenever the queue of available physical memory pages on the
Zero list becomes depleted, the operating system needs to invoke its
page replacement policy to replenish it. Page replacement, also known as page stealing or, more euphemistically, page trimming, involves scanning
physical memory looking for good replacement candidates, based on
their patterns of usage.
Specifically, operating systems like Windows or Linux
implement page replacement policies that choose to replace pages based on
actual memory usage patterns, which requires them to keep track – to a degree –
of which virtual memory pages an application allocates that are actually currently
in use. A page replacement policy that can identify those allocated pages which
are Least Recently Used (LRU) and target them for removal has generally proven
quite effective. Most cache management algorithms – and it is quite reasonable
to conceptualize physical memory as a “cache” for a virtual address space – in
use today use some form of LRU-based page replacement.
In order to identify which allocated pages an application is
actually using at a given time, it is necessary for the OS to gather
information on page usage patterns. Physical memory hardware provides very
basic functions that the OS can then exploit to track physical memory usage.
The hardware sets an access bit in the Page Table Entry (PTE) associated with
that corresponding range of physical addresses, indicating that an instruction
accessed some address resident on the page. (Similarly, the hardware sets a
“dirty” bit to indicate that an instruction has stored new data somewhere in
the page.)
How the OS uses this information from the PTE access bit to
keep track of the age of a page varies from vendor to vendor. For instance,
some form of “clock algorithm” that periodically resets the access bits of
every page that was recently accessed is the approach used in the IBM mainframe
OS. The next clock interval in which the aging code is dispatched scans memory
and resets the access bit for any page that was accessed during the previous
interval. Meanwhile, the clock aging algorithm increments the unreferenced interval count for any page
that was not accessed during the interval. Over time, the distribution of unreferenced
interval counts for allocated pages yields a partial order over the age of each
page on the machine. This partial order allows the page replacement routine to
target the oldest pages on the system for page stealing.
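One pass of such a clock scan can be sketched in a few lines of C, where unref_intervals holds the per-page unreferenced interval counts described above (a schematic of the general algorithm, not IBM's implementation):

    #include <stdint.h>
    #include <stdio.h>

    #define NPAGES 8
    #define PTE_ACCESSED (1u << 5)

    static uint32_t pte[NPAGES];          /* access bit set by "hardware"   */
    static unsigned unref_intervals[NPAGES];

    /* One aging pass: clear the access bit on recently referenced pages,
     * and age everything that went untouched during the last interval. */
    static void clock_scan(void)
    {
        for (int i = 0; i < NPAAGES; i++) {
        }
    }

    static void clock_scan_pass(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            if (pte[i] & PTE_ACCESSED) {
                pte[i] &= ~PTE_ACCESSED;  /* referenced: reset for next pass */
                unref_intervals[i] = 0;
            } else {
                unref_intervals[i]++;     /* older = better steal candidate  */
            }
        }
    }

    int main(void)
    {
        pte[2] |= PTE_ACCESSED;           /* simulate a reference to page 2  */
        clock_scan_pass();
        clock_scan_pass();
        for (int i = 0; i < NPAGES; i++)
            printf("page %d: unreferenced for %u intervals\n",
                   i, unref_intervals[i]);
        return 0;
    }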
The clock algorithm provides an incremental level of detail
on memory usage patterns that is potentially quite useful for performance and
capacity planning purposes [3], but it also has some known limitations,
especially with regard to performance. One performance limitation is that the execution
time of a memory scan varies linearly with the size of RAM. On very large scale
machines, with larger amounts of RAM to manage, scanning page table entries is
time-consuming. And it is precisely those machines that have the most
memory and the least memory contention where the overhead of
maintaining memory usage data is highest.
Windows adopted a form of interval-oriented, clock-based page
aging algorithm that, hopefully, requires far fewer resources to run, allowing
memory management to scale better on machines with very large amounts of RAM to
manage. In Windows, the Balance Set Manager is dispatched once per second to
“trim” pages aggressively from processes whose working sets
exceed their target values, which by default are set to arbitrarily low levels.
Pages stolen from the address space in this fashion are, in fact, only stolen
provisionally. In effect, they are placed in a memory resident cache managed as
a FIFO queue called the Standby list. (In some official Windows documentation sources, the Standby list is referred to simply as “the cache.”) When the process references any previously stolen pages that are still resident
in the FIFO cache, these pages can be “soft-faulted” back into the process’s
working set without the necessity for any IO to the paging disk.
Pages in the Standby list that remain unreferenced are aged
during successive page trimming cycles, eventually being pushed to the head of
the queue. The Windows OS zero paging thread, which is awakened whenever the
Zero list needs replenishing, pulls aged pages from the head of the Standby
list and writes zero values to each page, erasing its previous contents. After being
zeroed, the page is then moved to the Zero list, which is used to satisfy any
process requests for new page allocations. (Stolen pages that have their
“dirty” bit set are detoured first to the Modified List prior to being added to
the Standby cache.)
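The C sketch below models the Standby list as a simple FIFO of provisionally stolen frames, with a soft fault recovering a frame before it ages out. It is a schematic of the mechanism just described, not Windows' actual data structures:

    #include <stdio.h>

    #define QSIZE 8

    /* Standby list modeled as a FIFO of provisionally stolen frames. */
    static unsigned standby[QSIZE];
    static int head, tail, count;

    static void trim_to_standby(unsigned frame)
    {
        if (count == QSIZE)
            return;                     /* full: oldest would be zeroed */
        standby[tail] = frame;
        tail = (tail + 1) % QSIZE;
        count++;
    }

    /* Returns 1 if the frame was recovered without disk IO. */
    static int soft_fault(unsigned frame)
    {
        for (int i = 0, j = head; i < count; i++, j = (j + 1) % QSIZE)
            if (standby[j] == frame) {
                standby[j] = standby[head];  /* remove via the head slot */
                head = (head + 1) % QSIZE;
                count--;
                return 1;
            }
        return 0;                       /* already reused: hard fault needed */
    }

    int main(void)
    {
        trim_to_standby(42);
        printf("soft fault on frame 42: %s\n",
               soft_fault(42) ? "hit" : "miss");
        return 0;
    }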
So long as an application’s set of resident virtual memory
pages corresponds reasonably well to its working set of active pages,
relatively few incidents of hard page faults will occur during execution, and managing
virtual memory on demand will have very little impact on the execution time of
the application. Moreover, so long as
the operating system succeeds in maintaining an adequate inventory of available
physical pages in advance of their actual usage by running processes, what page
faults do occur can be resolved relatively quickly, minimizing the execution
time delays that running processes incur. However, the performance impact of
virtual memory management on the execution time of running tasks can be
substantial if, for example, the demand for new pages exceeds the supply, or
replenishing the inventory of available physical pages forces the OS to
steal pages that are apt to be accessed again quite soon once a blocked
process is re-dispatched. This situation where the impact of virtual memory
management on performance is significant is commonly referred to as thrashing, conjuring up an image of the
machine exerting a great deal of effort on behalf of moving many virtual memory
pages in and out of physical memory to the detriment of performing useful work.