A comment on “Memory Overcommitment in the ESX Server”

The VMware Technical Journal recently published a paper entitled “Memory Overcommitment in the ESX Server.” It traverses similar ground to my recent blog entries on the subject of VMware memory management, and similarly illustrates the large potential impact paging can have on the performance of applications running under virtualization. Using synthetic benchmarks, the VMware study replicates the major findings from the VMware benchmark data that I recently reported on.

It is a very positive sign that VMware is willing to air performance results publicly that are less than benign. Unfortunately, it is all too easy for VMware customers to configure machines where memory is overcommitted and subject to severe performance problems. VMware customers require frank guidance from their vendor to help them recognize when this happens and understand what steps to take to keep these problems from arising in the future. The publication of this article in the VMTJ is a solid first step in that direction.

Memory overcommitment factor

One of the interesting results reported in the VMTJ paper is Figure 6, reproduced below. It reports application throughput (in operations per minute for some synthetic benchmark workload that simulates an online DVD store) against a memory overcommitment factor, calculated as: 

Σ guest machine allocated virtual memory / available machine memory

This overcommitment factor is essentially a V:R ratio, the amount of virtual memory allocated to guest machines divided by the amount of RAM available to the VMware ESX host.

Figure 6. Reproduced from the article, “Memory Overcommitment in the ESX Server” published in the VMware Technical Journal.  

The two virtual machines configured in the test allocate about 20 GB of RAM, according to the authors. The benchmark was initially run on a physical machine configured with 42 GB of RAM. With the machine’s memory not yet over-committed, throughput of the test workload was measured at almost 26K operations per minute. The authors then began to tighten the screws on the amount of machine memory available for the VMware Host machine to manage. The performance of the workload degraded gradually as machine memory was reduced in stages down to 12 GB of machine memory. (The memory overcommitment factor for the 12 GB machine is 20/12, or approximately 1.67.)

Finally, the authors attempted to run the benchmark workload in a machine configured with only 7 GB of RAM, yielding the degradation shown on the line chart. Under severe memory constraints, the benchmark app managed less than 9,000 operations/minute, a degradation of about 3X.
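To make the arithmetic concrete, here is a minimal Python sketch that computes the overcommitment factor for the three machine memory sizes mentioned above. The function name and the even 10 GB / 10 GB split between the two guests are my own assumptions for illustration; the paper only says the two guests allocate about 20 GB in total. The throughput figures are the rounded values quoted in the text.

```python
def overcommit_factor(guest_allocated_gb, machine_memory_gb):
    """Sum of guest machine allocations divided by available machine memory."""
    return sum(guest_allocated_gb) / machine_memory_gb

# Two guests totalling about 20 GB; the even split is an assumption.
guests_gb = [10, 10]

factors = {ram_gb: overcommit_factor(guests_gb, ram_gb) for ram_gb in (42, 12, 7)}
for ram_gb, factor in factors.items():
    print(f"{ram_gb:>2} GB machine memory: overcommitment factor {factor:.2f}")

# Rounded throughput figures quoted in the text: almost 26K ops/min with
# ample memory, fewer than 9K ops/min on the 7 GB machine.
slowdown = 26_000 / 9_000
print(f"approximate slowdown at 7 GB: {slowdown:.1f}X")
```

A factor below 1.0 means the guests fit in machine memory with room to spare; the 7 GB configuration pushes the factor close to 3.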

The same sequence of experiments was repeated on the machine, but this time with a solid state disk (SSD) configured for VMware to use for swapping (plotted using the green line). With the faster paging device configured, throughput of the test workload increased about 10% on the 12 GB machine. The magnitude of the improvements associated with using the SSD increased even more for the 7 GB machine, but the workload still suffered serious performance degradation, as shown.

Reading the SSD comparison, I experienced an extremely acute (but very pleasant) sense of déjà vu. Thirty years ago, I worked for StorageTek, which was selling one of the industry’s first SSDs, known as the 4305 because it emulated a high speed disk drum that IBM built called the 2305. Due to its limited capacity and the fact that the storage itself was volatile RAM, the 4305 was best suited to serve as a fast paging device. It was often attached to an IBM mainframe running the MVS operating system.

In the system engineering organization at STK, where I worked, I learned to assess the potential improvement in performance from plugging in the 4305 by calculating a V:R ratio using the customer’s data. I was very taken with the idea that we could use this V:R calculation for performance prediction, something I explored more rigorously a few years later in a paper called “Nonlinear Regression Models and CPE Applications,” available at the Computer Measurement Group’s web site. V:R remains a very useful metric to calculate, and I have also used it subsequently when discussing Windows virtual memory management.

By way of comparison to the results reported in the VMTJ paper, the overcommitment factor in the VMware ESX benchmarks I reported on was 2:1; the performance impact on throughput was an approximately 3X degradation in the elapsed time of the benchmark app under the impact of VMware memory ballooning and swapping.

Flaws in the analysis

I can recommend reading the VMTJ paper on its merits, but I also noticed it was flawed in several places, which I will mention here.

Transparent memory sharing

The first is an error of omission. The authors fail to report any of the shared memory measurements that are available. Instead, they accept on faith the notion that VMware’s transparent memory sharing feature is effective even when machine memory is substantially over-committed. That is not what I found. 

Transparent memory sharing is an extremely attractive mechanism for packing as many idle guest Windows machines as possible onto a single VMware ESX host. What I discovered (and reported here) in the benchmarks we ran was that the amount of memory sharing was effectively reduced to near zero as the guest workloads started to become active.

The VMTJ authors do note that when the Windows OS initializes, it writes zeros to every physical page of RAM that is available. Since VMware can map all the pages on the OS Zero list to a single resident machine memory page, transparent memory sharing is a spectacularly successful strategy initially. However, once the Windows OS’s own virtual memory management kicks in when the guest machine becomes active, the contents of the Zero list, for example, become quite dynamic. This apparently defeats the simple mechanism ESX uses to detect identical virtual pages in real-time.
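The effect is easy to picture with a toy model of content-based page sharing. The sketch below is only an illustration of the general idea (hash each page, then back identical pages with a single copy); it is not VMware’s implementation, and the function name, the 4 KB page size, and the synthetic page contents are all assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; a common x86 page size, assumed here

def share_pages(pages):
    """Back each distinct page content with a single copy.

    Returns (mapping, backing), where mapping[i] is the index in backing
    of the copy that guest page i is mapped to.
    """
    backing = []        # distinct page contents actually kept resident
    index_by_hash = {}  # content hash -> index into backing
    mapping = []
    for page in pages:
        digest = hashlib.sha256(page).digest()
        idx = index_by_hash.get(digest)
        # A full comparison guards against sharing on a hash collision.
        if idx is None or backing[idx] != page:
            idx = len(backing)
            backing.append(page)
            index_by_hash[digest] = idx
        mapping.append(idx)
    return mapping, backing

# An idle guest: every page still holds zeros after OS initialization.
idle_guest = [bytes(PAGE_SIZE)] * 1000
_, idle_backing = share_pages(idle_guest)

# An active guest: every page holds different (synthetic) contents.
active_guest = [i.to_bytes(8, "little") * (PAGE_SIZE // 8) for i in range(1000)]
_, active_backing = share_pages(active_guest)

print(f"idle guest:   1000 pages backed by {len(idle_backing)} resident page(s)")
print(f"active guest: 1000 pages backed by {len(active_backing)} resident page(s)")
```

In this toy model the idle guest collapses to a single resident page, while the active guest gets no benefit at all, which is the pattern I saw in the benchmarks.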

VMware transparent memory sharing is a wonderful mechanism for packing as many mostly idle guest Windows machines onto ESX Servers as possible. On the other hand, it can be a recipe for disaster when you are configuring production workloads to run as VMware virtual machines.

Measuring guest machine Active Memory

A second flaw in the paper is another error of omission. The VMTJ authors decline to discuss the sampling methodology VMware uses to measure a guest machine’s current working set. This is what VMware calls guest machine active memory, the measurement that is plugged into the formula for calculating the memory over-commitment factor as the numerator. To avoid duplicating all the effort that the guest machine OS expends trying to implement some form of LRU-based page replacement, VMware calculates a guest machine’s active memory using a sampling technique. The benchmark results I reported on showed the guest machine Active Memory measurement VMware reports can be subject to troubling sampling errors. 
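To see why a sampled measurement of this kind can wobble, consider the following Python sketch. It is a generic illustration of estimating a working set from a random sample of pages, not VMware’s actual algorithm; the function name, the sample size of 100 pages, and the guest with 30% of its pages active are all assumptions chosen for the example.

```python
import random

def estimate_active_pages(touched, sample_size, rng):
    """Estimate the number of active pages from a random sample of pages."""
    total = len(touched)
    sample = rng.sample(range(total), sample_size)
    hits = sum(1 for page in sample if touched[page])
    return total * hits / sample_size

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical guest: 100,000 pages, of which 30,000 are really active.
total_pages = 100_000
true_active = 30_000
touched = [True] * true_active + [False] * (total_pages - true_active)

# Repeat the measurement 20 times, as if over 20 sampling intervals.
estimates = [estimate_active_pages(touched, 100, rng) for _ in range(20)]

print(f"true active pages: {true_active}")
print(f"estimates range from {min(estimates):.0f} to {max(estimates):.0f}")
```

With a small sample, individual estimates scatter noticeably around the true value even though the workload itself never changes, so a single interval’s reading should be treated with caution.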

The VMTJ authors manage to avoid using this dicey metric to calculate the memory overcommitment factor by relying on a benchmark program that is effectively a memory “sponge.” These are programs designed specifically to stress LRU-based memory management by touching at random all the physical pages that are allocated to the guest machine.

With their sponge program running, the authors can reliably calculate memory overcommitment factor without using the Active Memory counter that is subject to measurement anomalies. With the sponge program, the amount of guest machine committed physical memory is precisely equal to the amount of active memory. For the other DVDStore workload, they were able to assess its active memory requirements by running it in a stand-alone mode without any memory contention.

Again, things aren’t nearly so simple in the real world of virtualization. In order to calculate the memory overcommitment factor for real-world VMware workloads, customers need to use the Active Memory counter, which is not a wholly reliable number due to the way it is calculated using sampling.

To manage VMware, access to a reliable measure of a guest machine’s active memory is very important. Assessing just how much virtual memory your Windows application needs isn’t always a simple matter, as I tried to explain in my discussion of the VMware benchmark results. A complication arises in Windows whenever any well-behaved Windows application is running that (1) expands to fill the size of physical RAM, (2) listens for low physical memory notifications from the OS, and (3) uses these notifications to trigger virtual memory management routines internally. Examples of applications that do this include Microsoft SQL Server. Any 64-bit .NET Framework application, including 64-bit ASP.NET web sites, is also in this category of applications because the Framework’s Common Language Runtime (CLR) triggers garbage collection in its managed heaps whenever it receives low physical memory notifications from the OS. With processes designed to take advantage of whatever physical memory is available, it is actually quite difficult to assess how much RAM they require for good performance, absent, of course, internal performance statistics on throughput or response time.

Memory management in Hyper-V

Finally, the VMware paper mischaracterizes the memory management features in Microsoft’s Hyper-V product, dismissing it as an inferior derivative of what VMware accomplishes in a more comprehensive fashion. Moreover, this dismissal is unaccompanied by any hard performance data to back up the claim that VMware ESX is better. It may well be true that VMware is superior, but I am not aware of any data in the public domain that would enable anyone to argue conclusively one way or the other. (Hmmm. I smell an opportunity there. :-))

It seems entirely possible to me that Hyper-V actually currently holds the advantage when it comes to virtual memory management. Memory management in Hyper-V, which installs on top of the Windows Server OS, can leverage the Windows Memory Manager. Memory management benefits from the fact that the OS understands the guest machine’s actual pattern of physical memory usage since the guest machine appears as simply another process address space executing on the virtualization host.

In its original incarnation, Hyper-V could not take advantage of this because the entire container process that the guest machine executed in was relegated to non-paged memory in the virtualization host. Guest machines acquired their full allotment of virtualized “physical” memory at initialization and retained it throughout as static non-paged memory, which put Hyper-V at a significant disadvantage compared to VMware, which granted guest machine memory on demand.

In the case of a Windows guest machine, VMware would initially have to grant the guest all the physical memory it requested because of the zero memory initialization routine Windows runs at start-up to clear the contents of RAM. While a Windows guest machine is idle, however, VMware’s transparent memory sharing feature reduces the guest machine memory footprint by mapping all those identical, empty zero-filled pages to a single page in machine memory. This behavior was evident in the benchmark test results I reported.

So, VMware’s dynamic approach to memory management lets you over-commit machine memory and pack significantly more guest machines into the same memory footprint. This, as I have indicated, works just great so long as those guest machines are mostly idle, which makes machines devoted to application testing and QA, for example, no-brainer candidates for virtualization.

However, the most recent versions of Hyper-V have a feature that Microsoft calls “Dynamic Memory” that changes all that. It is fair to say that Dynamic Memory gives Hyper-V parity with VMware in this area. RAM is allocated on demand, similar to VMware. When you configure a Hyper-V guest, you specify the maximum amount of RAM that Hyper-V will allocate on its behalf. But that RAM is only granted on demand. To improve start-up performance, you can also specify the amount of an initial grant of RAM to the guest machine. (I suspect a Windows guest machine is “enlightened” when it detects it is running under Hyper-V and skips the zero memory initialization routine.)

There is one other configurable parameter the new dynamic memory management feature uses called the memory buffer. The VMTJ authors mistakenly conflate the Hyper-V memory buffer with the ballooning technique that VMware pioneered. In fact, since the Hyper-V Host understands the memory usage pattern of its guests, there is no reason to resort to anything like ballooning to drive page replacement at the guest machine level when the virtualization host faces memory over-commitment.

The configurable Hyper-V guest machine memory buffer is a feature which tries to maintain a pool of free or available pages to stay ahead of the demand for new pages. In Hyper-V, the virtualization host allocates machine memory pages for the guest based on guest machine committed memory. (My guess is that the allocation mechanism Hyper-V relies on is the guest machine’s construction of the Page Table entry for the virtual memory that is allocated.)

The memory buffer is a cushion that Hyper-V adds to the guest machine’s Committed bytes – but only when sufficient machine memory is available – to accommodate rapid expansion since the number of committed bytes fluctuates as processes come and go, for example. Requests for new pages are satisfied from this pool of available memory immediately without incurring any of the delays associated with page replacement, which can often be relegated to background operations.
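The cushion described above can be sketched in a few lines of Python. This is only my reading of the behavior, not Hyper-V’s actual logic: the function name and parameters are hypothetical, and I am assuming the buffer is expressed as a percentage of the guest’s committed memory and is capped by both the configured maximum and the machine memory the host has to spare.

```python
def target_guest_memory_mb(committed_mb, buffer_percent, maximum_mb, host_free_mb):
    """Hypothetical memory target for a guest: committed bytes plus a cushion.

    The cushion is granted only to the extent that the host has free machine
    memory, and the total never exceeds the guest's configured maximum.
    """
    cushion_mb = committed_mb * buffer_percent / 100
    cushion_mb = min(cushion_mb, max(host_free_mb, 0))
    return min(committed_mb + cushion_mb, maximum_mb)

# Plenty of free machine memory: the full 20% cushion is granted.
print(target_guest_memory_mb(1000, 20, 4096, host_free_mb=8192))

# No free machine memory: the guest gets its committed bytes and no cushion.
print(target_guest_memory_mb(1000, 20, 4096, host_free_mb=0))

# Committed memory near the configured maximum: the target is capped.
print(target_guest_memory_mb(4000, 20, 4096, host_free_mb=8192))
```

Note that nothing in this calculation reclaims memory from the guest; it only decides how much headroom to provision ahead of demand.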

Configuring a guest machine’s memory buffer is basically a hint to the Hyper-V virtualization host memory manager – essentially, the Windows Memory Manager – to allocate and maintain some amount of additional virtual memory for the guest machine to use when it needs to expand. This is a good practice in any virtual memory management system. It is not a ballooning mechanism at all.

