Virtual memory management in VMware: memory ballooning

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.


Ballooning is a complicated topic, so bear with me if this post is much longer than the previous ones in this series.

As described earlier, VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate” when it begins to encounter contention for machine memory, defined as the amount of free machine memory available for new guest machine allocation requests dropping below 6%. In the benchmark example I am discussing here, the Memory Usage counter rose to 98% allocation levels and remained there for the duration of the test while all four virtual guest machines were active.

Figure 7, which shows the guest machine Memory Granted counter for each guest, with an overlay showing the value of the Memory State counter reported at the end of each one-minute measurement interval, should help to clarify the state of VMware memory-management during the case study. The Memory state transitions indicated mean that VMware would attempt to use both ballooning and swapping to try to relieve the over-committed virtual memory condition.

The impact of ballooning will be discussed first.

Figure 7. Memory State transitions that are associated with active memory granted to guest machines that drained the supply of free machine memory pages.

Ballooning occurs when the VMware Host recognizes that there is a shortage of machine memory that must be replenished using page replacement. Since VMware has only limited knowledge of current page access patterns, it is not in a position to implement an optimal LRU-based page replacement strategy. Ballooning attempts to shift responsibility for page replacement to the guest machine OS, which presumably can implement a more optimal page replacement strategy than the VMware hypervisor. Essentially, using ballooning, VMware reduces the amount of physical memory available for internal use within the guest machine, forcing the guest OS into exercising its memory management policies.

The impact that ballooning has on virtual memory management being performed internally by the guest OS suggests that it will be well worth looking inside the guest OS to assess how it detects and responds to the shortage of physical memory that ballooning induces. Absent ballooning, when the VMware Host recognizes that there is a shortage of machine memory, this external contention for machine memory is not necessarily manifest inside the guest OS where, for example, its physical memory as configured might, in fact, be sized very appropriately. Unfortunately, for Windows guest machines the problem is not as simple as a guest machine, configured to run well with x amount of physical memory, abruptly finding itself able to access far less than the amount of physical memory configured for it to use, although that is serious enough. An additional complication is that well-behaved Windows applications listen for notifications from the OS and attempt to manage their own process virtual memory address spaces in response to these low memory events.

Baseline measurements without memory contention.

To help understand what happened during the memory contention benchmark, it will be useful to compare those results to a standalone baseline set of measurements gathered when there was no contention for machine memory. We begin by reviewing some of the baseline memory measurements taken from inside Windows when the benchmark program was executed standalone on the VMware ESX server with only one guest machine active. 

When the benchmark program was executed standalone with only one guest machine active, there was no memory contention. A single 8 GB guest defined to run on a 16 GB VMware ESX machine could count on all the machine memory granted to it being available. For the baseline, the VMware Host reported overall machine memory usage never exceeding 60% and the balloon targets communicated to the guest machine balloon driver are zero values. The Windows guest machine does experience some internal memory contention, though, because there is a significant amount of demand paging that occurred during the baseline.

Physical memory usage inside the Windows guest machine during the baseline run is profiled in Figure 8. The Available Bytes counter is shown in light blue, while memory allocated to process address spaces is reported in dark blue. Windows performs page replacement when the pool of available bytes drops below a threshold number of free pages on the Zero list. When the benchmark process is run, beginning at 2:50 pm, physical memory begins to be allocated to the benchmark process, shrinking the pool of available bytes. Over the course of the execution of the benchmark process – a period of approximately 30 minutes – the Windows page replacement policy periodically needs to replenish the pool of available bytes by trimming back the number of working set pages allocated to running processes. 

In a standalone mode, when the benchmark process is active, the Windows guest machine manages to utilize all the 8 GBs of physical memory granted to it under VMware. This is primarily a result of the automatic memory management policy built into the .NET Framework runtime which the benchmark program uses. While a .NET Framework program specifically allocates virtual memory as required, the runtime, not the program, is responsible for reclaiming any memory that was previously allocated but is no longer needed. The .NET Framework runtime periodically reclaims currently unused memory by scheduling a garbage collection when it receives a notification from the OS that there is a shortage of physical memory Available Bytes. The result is a tug of war, with the benchmark process continually growing its working set to fill physical memory and Windows memory management periodically signaling the CLR of an impending shortage of physical memory, causing the CLR to free some of the virtual memory previously allocated by the managed process.
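The tug of war described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of the actual CLR or Windows Memory Manager: the function name, the 512 MB low-memory threshold, the 256 MB allocation step and the assumption that each collection frees half the heap are all hypothetical values chosen to make the sawtooth pattern visible.

```python
def simulate_tug_of_war(physical_mb=8192, low_memory_mb=512,
                        alloc_per_tick_mb=256, garbage_fraction=0.5,
                        ticks=100):
    """Toy model: a managed heap grows until the OS signals low memory.

    All parameter values are illustrative assumptions, not measurements.
    """
    heap_mb = 0
    collections = 0
    history = []
    for _ in range(ticks):
        # The application keeps allocating, shrinking available memory.
        heap_mb += alloc_per_tick_mb
        available_mb = physical_mb - heap_mb
        # The OS "notifies" the runtime when available memory runs low,
        # and the runtime responds with a garbage collection.
        if available_mb < low_memory_mb:
            heap_mb = int(heap_mb * (1 - garbage_fraction))
            collections += 1
        history.append(heap_mb)
    return collections, history


collections, history = simulate_tug_of_war()
print("collections:", collections)
print("heap range after first collection (MB):",
      min(history[30:]), "to", max(history))
```

With these assumed parameters the heap oscillates roughly between half and nearly all of the 8 GB, which is the same general shape as the Large Object Heap fluctuation reported for the standalone run.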

The demand paging rate of the Windows guest is reported in Figure 8 as a dotted line chart, plotted against the right axis. There are several one-minute spikes reaching 150 hard page faults per second. The Windows page replacement policy leads to the OS trimming physical pages that are then re-accessed during the benchmark run that subsequently need to be retrieved from the paging file. In summary, during a standalone execution of the benchmark workload, the benchmark process allocates and uses enough physical memory to trigger the Windows page replacement policy on the 8 GB guest machine.

Figure 8. Physical Memory usage by the Windows guest OS when a single guest machine was run in a standalone mode.

The physical memory utilization profile of the Windows guest machine reflects the use of virtual memory by the benchmark process, which is the only application running. This process, ThreadContentionGenerator.exe, is a multithreaded 64-bit program written in C# that deliberately stresses the automated memory management functions of the .NET runtime. The benchmark program's usage of process virtual memory is highlighted in Figure 9.

The benchmark program allocates some very large data structures and persists them through long processing cycles that access, modify and update them at random. Internally, the program allocates these data structures using the managed Heaps associated with the .NET Framework’s Common Language Runtime (CLR). The CLR periodically schedules a garbage collection thread that automatically deletes and compacts the memory previously allocated to objects that are no longer actively referenced. (Jeff Richter's book, CLR via C#, is a good reference on garbage collection inside a .NET Framework process.)

Figure 9. The sizes of the managed heaps inside the benchmarking process address space when there was no "external" memory contention. The size of the process’s Large Object Heap varies between 4 and 8 GB during a standalone run.

Figure 9 reports the sizes of the four managed Heaps during the standalone benchmark run. It shows the amount of process private virtual memory bytes allocated ranging between 4 and 8 GBs during the run. The size of the Large Object Heap, which is never compacted during garbage collection, dwarfs the sizes of the other 3 managed Heaps, which are generation-based. Normally in a .NET application process, garbage collection is initiated automatically when the size of the Generation 0 heap grows to exceed a threshold value, which is chosen in 64-bit Windows based on the amount of physical memory that is available. But the Generation 0 heap in this case remains quite small, smaller than the Generation 0 “budget.” 

The CLR also initiates garbage collection when it receives a LowMemoryResourceNotification event from the OS, a signal that page trimming is about to occur. Well-behaved Windows applications that allow their working sets to expand until they reach the machine’s physical memory capacity wait on this notification. In response to the Low Memory resource Notification, the CLR dispatches its garbage collection thread to reclaim whatever unused virtual memory it can find inside the process address space. Garbage collections initiated by LowMemoryResourceNotification events cause the size of the Large Object Heap to fluctuate greatly during the standalone benchmark run. 

To complete the picture of virtual memory management at the .NET process level, Figure 10 charts the cumulative number of garbage collections that were performed inside the ThreadContentionGenerator.exe process address space. For this discussion, it is appropriate to focus on the number of Generation 0 garbage collections; the fact that so many Generation 0 collections escalate to Gen 1 and Gen 2 collections is a byproduct of the fact that so much of the virtual memory used by the ThreadContentionGenerator program was allocated in the Large Object Heap. Figure 10 shows about 1000 Generation 0 garbage collections occurring. This represents a reasonable, but rough, estimate of the number of times the OS generated Low Memory resource notifications during the run to trigger garbage collection.

Figure 10. The number of CLR Garbage collections inside the benchmarking process when the guest Windows machine was run in standalone mode.

In summary, the benchmark program is a multi-threaded 64-bit .NET Framework application that will allocate virtual memory up to the physical memory limits on the machine. When the Windows OS encounters a shortage of empty Zero pages as a result of these virtual memory allocations, it issues a Low Memory notification that is received and processed by the CLR. Upon receipt of this Low Memory notification, the CLR schedules a garbage collection to reclaim any private bytes previously allocated on the managed Heaps that are no longer in use.

Introducing external memory contention.

Now, let’s review the same memory management statistics when the same VMware Host is asked to manage a configuration of four such Windows guest machines running concurrently, all actively attempting to allocate and use 8 GB of physical memory. 
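To put rough numbers on the over-commitment, here is a back-of-the-envelope calculation using the configuration figures given in this post (8 GB guests on a 16 GB host). The function names are hypothetical, and the calculation deliberately ignores hypervisor overhead and page sharing; it is not how VMware actually computes balloon targets.

```python
HOST_MACHINE_MEMORY_GB = 16   # from the post: 16 GB VMware ESX machine
GUEST_CONFIGURED_GB = 8       # from the post: each guest is defined with 8 GB


def overcommit_ratio(guest_count,
                     guest_gb=GUEST_CONFIGURED_GB,
                     host_gb=HOST_MACHINE_MEMORY_GB):
    """Configured guest memory divided by machine memory."""
    return guest_count * guest_gb / host_gb


def even_reclaim_per_guest_gb(guest_count,
                              guest_gb=GUEST_CONFIGURED_GB,
                              host_gb=HOST_MACHINE_MEMORY_GB):
    """Memory each guest would have to give up if the shortfall
    were split evenly (a simplifying assumption)."""
    shortfall_gb = max(0, guest_count * guest_gb - host_gb)
    return shortfall_gb / guest_count


for guests in (1, 4):
    print(guests, "guest(s): ratio", overcommit_ratio(guests),
          "| even reclaim per guest (GB)", even_reclaim_per_guest_gb(guests))
```

Under these simplifying assumptions, the baseline uses half of machine memory with nothing to reclaim, while four guests demand twice what the host has, leaving a shortfall of about 4 GB per guest. That is in the same range as the balloon targets reported for the contention run.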

As shown in Figure 11, ballooning begins to kick in during the benchmark run around 9:10 AM, which also corresponds to an interval in which the Memory State transitioned to the “hard” state where both ballooning and swapping would be initiated. (From Figure 2, we saw this corresponds to intervals where the machine memory usage was reported running about 98% full.) These balloon targets are communicated to the balloon driver software resident in the guest OS. An increase in the target instructs the balloon driver to “inflate” by allocating memory, while a decrease in the target causes the balloon driver to deflate. Figure 11 reports the memory balloon targets communicated to each of the guest machine resident balloon drivers, with balloon targets rising to over 4 GB per machine. When the balloon drivers in each guest machine begin to inflate, the guest machines will eventually encounter contention for physical memory, which they will respond to by using their page replacement policies to identify older pages to be trimmed from physical memory.
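The target-driven behavior described above can be sketched as a small toy class. This is a conceptual illustration only: the class and method names are hypothetical, and a real balloon driver pins guest physical pages through kernel interfaces rather than adjusting a counter.

```python
class ToyBalloonDriver:
    """Conceptual model of a balloon driver reacting to targets.

    Hypothetical illustration; not the VMware driver's actual interface.
    """

    def __init__(self, guest_physical_mb):
        self.guest_physical_mb = guest_physical_mb
        self.pinned_mb = 0  # memory currently held inside the balloon

    def set_target(self, target_mb):
        """Move the balloon toward a new target.

        Returns the change in pinned memory: positive means the balloon
        inflated, negative means it deflated.
        """
        target_mb = max(0, min(target_mb, self.guest_physical_mb))
        delta_mb = target_mb - self.pinned_mb
        self.pinned_mb = target_mb
        return delta_mb

    @property
    def usable_mb(self):
        """Physical memory left for the guest OS and its processes."""
        return self.guest_physical_mb - self.pinned_mb


driver = ToyBalloonDriver(guest_physical_mb=8192)
print("inflate by", driver.set_target(4096), "MB; usable:", driver.usable_mb)
print("deflate by", -driver.set_target(0), "MB; usable:", driver.usable_mb)
```

The point of the sketch is that the guest OS never sees the target itself; it only sees less usable physical memory and reacts with its own page replacement policy.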

Figure 11. The memory balloon targets for each guest machine increase to about 4 GB when machine memory fills.

Note that when the Windows guest machine has the VMware Windows tools installed, the balloon driver’s memory targets are also reported in the VM Memory performance counters.

In Windows, when VMware’s vmmemctl.sys balloon driver inflates, it allocates physical memory pages and pins them in physical memory until explicitly released. To determine how effectively ballooning works to relieve a shortage of machine memory, it is useful to drill into the guest machine performance counters and look for signs of increased demand paging and other indicators of memory contention. Based on how Windows virtual memory management works [3], we investigated the following telltale signs that virtual memory appeared to be under stress as a result of the balloon driver inflating inside the Windows guest: 
  • memory allocated in the nonpaged pool should spike due to allocation requests from the VMware balloon driver 
  • a reduction in Available Bytes, leading to an increase in hard page faults (Page Reads/sec) as a result of increased contention for virtual memory 
  • applications that listen for low memory notifications from the OS will initiate page replacement to trim their resident sets voluntarily 
(As discussed above, the number of garbage collections performed inside the ThreadContentionGenerator process address space corresponds to the number of low memory notifications received from the OS indicating that page trimming is about to occur.) 
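The checklist above can be expressed as a simple before-and-after comparison of counter samples. This is a minimal sketch: the function name and dictionary keys are hypothetical, and apart from the nonpaged pool values (30 MB and 50 MB, as reported later in this post) the sample numbers are made-up placeholders, not measurements.

```python
def balloon_signs(before, after):
    """Compare two counter samples and flag each telltale sign.

    Keys are hypothetical labels for the Windows counters discussed
    in the text, not actual Perfmon counter paths.
    """
    return {
        "nonpaged_pool_grew":
            after["nonpaged_pool_mb"] > before["nonpaged_pool_mb"],
        "available_bytes_fell":
            after["available_mb"] < before["available_mb"],
        "hard_faults_rose":
            after["page_reads_per_sec"] > before["page_reads_per_sec"],
        # Garbage collections serve as a proxy for low memory notifications.
        "gc_activity_rose":
            after["gen0_collections"] > before["gen0_collections"],
    }


# Placeholder samples; only the nonpaged pool values come from the post.
before = {"nonpaged_pool_mb": 30, "available_mb": 2000,
          "page_reads_per_sec": 20, "gen0_collections": 100}
after = {"nonpaged_pool_mb": 50, "available_mb": 300,
         "page_reads_per_sec": 60, "gen0_collections": 140}

for sign, observed in balloon_signs(before, after).items():
    print(sign, observed)
```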

Generally speaking, if ballooning is effective, it should cause a reduction in the working set of the benchmark process address space since that is the main consumer of physical memory inside each guest. Let's take a look.

Physical memory usage.

Figure 12 shows physical memory usage inside one of the active Windows guest machines, reporting the same physical memory usage metrics as Figure 8. (All four show a similar pattern.) Beginning at 9:10 AM when the VMware balloon driver inflates, process working sets (again shown in dark blue) are reduced, which reduces the physical memory footprint of the guest machine to approximately 4 GB. Concurrently, the Available Bytes counter, which includes the Standby list, also drops precipitously.

Figure 12. Windows guest machine memory usage counters show a reduction in process working sets when the balloon driver inflates.

Figure 12 also shows an overlay line graph of the Page Reads/sec counter, which is a count of all the hard page faults that need to be resolved from disk. Page Reads/sec is quite erratic while the benchmark program is active. What tends to happen when the OS is short of physical memory is that demand paging to and from disk increases, until the paging disk saturates. Predictably, a physical memory bottleneck is transfigured into a disk IO bottleneck in systems with virtual memory. Throughput of the paging disk serves as an upper limit on performance when there is contention for physical memory. In a VMware configuration, this paging IO bottleneck is compounded when guest machines share the same physical disks, which was the case here. 

Process virtual memory and paging.

Figure 13 drills into the process working set for the benchmarking application, ThreadContentionGenerator.exe. The process working set is evidently impacted by the ballooning action, decreasing in size from a peak value of 6 GB down to less than 4 GB. The overlay line in Figure 13 shows the amount of Available Bytes in Windows. The reduction in the number of Available Bytes triggers the low memory notifications that the OS delivers and the .NET Framework CLR listens for. When the CLR receives an OS LowMemoryResourceNotification event, it schedules a garbage collection run to release previously allocated, but currently unused, virtual memory inside the application process address space.

Figure 13. Working set of the benchmarking application process is reduced from over 4 GB down to about 2 GB when VMware ballooning occurs.

Figure 14 looks for additional evidence that the VMware balloon driver induces memory contention inside the Windows guest machine when the balloon inflates. It graphs the counters associated with physical memory allocations, showing a sharp increase in the number of bytes allocated in the Nonpaged pool, corresponding to the period when the balloon driver begins to inflate. The size of the Nonpaged pool shows a sharp increase from 30 MB to 50 MB, beginning shortly after 9:10 AM. The balloon evidently deflates shortly before 10:30 AM, over an hour later when VMware no longer experiences memory contention.

What is curious, however, is that the magnitude of the increase in the size of the nonpaged pool shown in Figure 14 is so much smaller than the VMware guest machine balloon targets reported in Figure 11. The guest machine balloon target is approximately 4 GB in Figure 11, and it is evident in Figure 12 that the balloon inflating reduced the memory footprint of the OS by approximately 4 GB. However, the increase in the size of the nonpaged pool (in light blue) reported in Figure 14 is only 20 MB. This discrepancy requires some explanation.

What seems likely is that the balloon driver inflates by calling MmProbeAndLockPages, allocating physical memory pages that are not associated with either of the standard paged or nonpaged system memory pools. (Note that the http.sys IIS kernel-mode driver allocates a physical memory resident cache that is similarly outside the range of both the nonpaged and paged pool and is not part of any process address space either. Like the balloon driver, the size of the http.sys cache is not reported in any of the Windows Memory performance counters. By allocating physical memory that is outside the system’s virtual memory addressing scheme, the VMware balloon driver can inflate effectively in 32-bit Windows when the virtual memory size of the nonpaged pool is constrained architecturally.)

The size of the VMware memory balloon is not captured directly by any of the standard Windows memory performance counters. The balloon inflating does appear to cause an increase in the size of the nonpaged pool, probably reflecting the data structures that the Windows Memory Manager places there to keep track of the locked pages that the balloon driver allocates.
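A quick calculation shows why the bookkeeping explanation is at least plausible. The sketch below divides the observed 20 MB nonpaged pool increase by the number of pages in a balloon of approximately 4 GB. It assumes the standard 4 KB page size; the result is only a rough per-page figure and does not confirm what the Memory Manager actually stores.

```python
PAGE_SIZE_BYTES = 4 * 1024            # assumption: standard 4 KB pages
BALLOON_BYTES = 4 * 1024**3           # balloon target of about 4 GB (from the post)
NONPAGED_INCREASE_BYTES = 20 * 1024**2  # 30 MB -> 50 MB increase (from the post)

# Number of pages the balloon driver would need to lock.
balloon_pages = BALLOON_BYTES // PAGE_SIZE_BYTES

# Nonpaged pool growth attributed to each locked page.
bytes_per_locked_page = NONPAGED_INCREASE_BYTES / balloon_pages

print("pages locked by the balloon:", balloon_pages)
print("nonpaged pool bytes per locked page:", bytes_per_locked_page)
```

On these assumptions the increase works out to a few tens of bytes per locked page, which is the order of magnitude one would expect from per-page tracking structures rather than from the ballooned pages themselves.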

Figure 14. Physical memory allocations from inside one of the Windows guest machines. The size of the nonpaged pool shows a sharp increase from 30 MB to 50 MB, beginning shortly after 9:10 AM when the VMware balloon driver inflates.

Figure 14 clearly shows a gap in the measurement data at around 9:15. This gap in the measurement data is probably caused by a missing collection data interval because VMware was taking drastic action to alleviate memory contention by blocking execution of the guest machine, triggered by the amount of free machine memory dipping below 2%.

Figure 15 provides some evidence that ballooning induces an increased level of demand paging inside the Windows guest machines. Windows reports demand paging rates using the Page Reads/sec counter, which shows consistently higher levels of demand paging activity once ballooning is triggered. The increased level of demand paging is less pronounced than might otherwise be expected from the extent of the process working set reduction that was revealed in Figure 13. As discussed above, the demand paging rate is predictably constrained by the bandwidth of the paging disk, which in this configuration is a disk shared by all the guest machines.

Figure 15. Ballooning causes an increased level of demand paging inside the Windows guest machines. The Page Reads/sec counter from one of the active Windows guest machines is shown. Note, however, the demand paging rate is constrained by the bandwidth of the paging disk, which in this test configuration is a single disk shared by all the guest machines.

It is instructive to compare the demand paging rates in Windows from one of the guest machines to the same guest machine running the benchmark workload in a standalone environment where only a single Windows guest was allowed to execute. In Figure 16, the demand paging rate for the ESXAS12B Windows guest during the benchmark is contrasted to the same machine and workload running the standalone baseline. The performance counter data from the standalone run is graphed as a line chart overlaying the data from the memory contention benchmark. In standalone mode, the Windows guest machine has exclusive access to the virtual disk where the OS paging file resides. During the contention run, the disk where the guest machines’ paging files reside is shared. In standalone mode, where there is no disk contention, Figure 16 shows that the Windows guest machine is able to sustain higher demand paging rates.

Because the bandwidth of the shared paging disk imposes a capacity constraint, the most striking comparison in the measurement data shown in Figure 16 is the difference in run-times. The benchmark workload that took about 30 minutes to execute in a standalone mode ran for over 90 minutes when there was memory contention, about three times as long. Longer execution times are the real performance impact of memory contention in VMware, as in any other operating system that supports virtual memory. The hard page faults that occur inside the guest OS delay the execution of every thread that experiences them, even operating system threads. Moreover, when VMware-initiated swapping is also occurring (more about that in a moment), execution of the guest OS workload is also potentially impacted by page faults whose source is external.

Figure 16. Hard page fault rates for one of the guest machines, comparing the rate of Page Reads/sec with and without memory contention. In standalone mode – where there is no disk contention – the Windows guest machine is able to sustain higher demand paging rates. Moreover, the benchmark workload that took about 30 minutes to execute in a standalone mode ran for over 90 minutes when there was memory contention.

VMware ballooning appears to cause Windows to send low memory notifications to the CLR, which then schedules a garbage collection run that attempts to shrink the amount of virtual memory the benchmark process uses. 

Garbage collection inside a managed process.

Finally, let's look inside the process address space for the benchmark application and the memory management performed by the .NET Framework's CLR. The cumulative number of .NET Framework garbage collections that occurred inside the ThreadContentionGenerator process address space is reported for the configuration where there was memory contention in Figure 17. Figure 18 shows the size of the managed heaps inside the ThreadContentionGenerator process address space and the impact of ballooning. Again, the size of the Large Object Heap dwarfs the other managed heaps. 

Comparing the usage of virtual memory by the benchmarking process in Figure 18 to the working set pages that are actually resident in physical memory, as reported in Figure 13, it is apparent that the Windows OS is aggressively trimming pages from the benchmarking process address space. Figure 13 shows the working set of the benchmarking process address space constrained to just 2 GB, once the balloon driver inflated. Meanwhile, the size of the Large Object Heap shown in Figure 18 remains above 4 GB. When an active 4 GB virtual memory address space is confined to running within just 2 GB of physical memory, the inevitable result is that the application’s performance will suffer from numerous paging-related delays. We conclude that ballooning successfully transforms the external contention for machine memory that the VMware Host detects into contention for physical memory that the Windows guest machine needs to manage internally.

Comparing Figure 17 to Figure 10, we see that in both situations, the CLR initiated about 1000 Generation 0 garbage collections. During the memory contention test, the benchmark process takes much longer to execute, so these garbage collections are reported over a much larger interval. This causes the shape of the cumulative distribution in Figure 17 to flatten out, compared to Figure 10. The shape of the distribution changes because the execution time of the application elongates.
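The flattening can be quantified with simple arithmetic on the approximate figures quoted in this post: about 1000 Generation 0 collections in both runs, about 30 minutes standalone and over 90 minutes under contention. The variable names are illustrative, and the results are only as precise as those rounded inputs.

```python
GEN0_COLLECTIONS = 1000       # approximate count in both runs (from the post)
STANDALONE_MINUTES = 30       # approximate standalone run time
CONTENTION_MINUTES = 90       # contention run took "over 90 minutes"

# Average slope of each cumulative garbage collection curve.
standalone_rate = GEN0_COLLECTIONS / STANDALONE_MINUTES
contention_rate = GEN0_COLLECTIONS / CONTENTION_MINUTES

# Elongation of the run time under memory contention.
slowdown = CONTENTION_MINUTES / STANDALONE_MINUTES

print("standalone collections per minute:", round(standalone_rate, 1))
print("contention collections per minute:", round(contention_rate, 1))
print("run-time elongation factor:", slowdown)
```

The same amount of garbage collection work is spread over roughly three times the elapsed time, so the average slope of the cumulative curve drops to about a third of its standalone value.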

Figure 17. CLR Garbage collections inside the benchmarking process when there was "external" memory contention.

Comparing the sizes of the Large Object Heap allocated by the ThreadContentionGenerator process address space, Figure 9 from running standalone shows that the size of the Large Object Heap fluctuates from almost 8 GB down to 4 GB. Meanwhile, in Figure 18, the size of the Large Object Heap is constrained to between 4 and 6 GBs, once the VMware balloon driver began to inflate. The size of the Large Object Heap is constrained due to aggressive page trimming by the Windows OS.

Figure 18. The size of the managed heaps inside the benchmarking process address space when there was "external" memory contention. The size of the Large Object Heap dwarfs the sizes of the other managed heaps, but is constrained to between 4 and 6 GBs due to VMware ballooning.

That completes our look at VMware ballooning during a benchmark run where machine memory was substantially over-committed and VMware was forced into responding, using its ballooning technique.

In the next post, we will consider the impact of VMware's random page replacement algorithm, which it calls swapping.

