Understanding Guest Machine Performance under Hyper-V: Benchmark results

Benchmark results: baseline measurements with no contention

To gain some additional perspective on the performance impact of virtualization, we will look first at some benchmarking results showing the performance of virtual machines in various simple configurations, which we will also compare to native performance where Windows is installed directly on top of the hardware. For these performance tests, I used a benchmarking program that simulates the multi-threaded CPU and memory load of an active ASP.NET web application, but without issuing disk or network requests so that those limited resources on the target machine are not overwhelmed in the course of executing the benchmark program. 

The benchmark program I used for stress testing Hyper-V guest machines is a Load Generator application I wrote that is parameter-driven to generate a wide variety of “challenging” workloads. The current version is a 64-bit .NET program written in C# called the ThreadContentionGenerator. It has a main dispatcher thread and a variable number of worker threads, similar to ASP.NET. You set it to execute a fixed number of concurrent tasks and to perform a specific number of iterations of each task. Each task allocates a large .NET collection object that it then fills with random data. It then searches the collection repeatedly, and finally deletes all the data. In this fashion, the program stresses both the processor and virtual memory. Periodically, each active thread simulates an IO wait by sleeping, where the simulated IO rate and the IO duration are also subject to some degree of realistic variation.
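To make the shape of one such task concrete, here is a minimal sketch in Python. It is not the ThreadContentionGenerator itself, which is a C# program whose source is not shown in this post. The names `run_task`, `run_benchmark`, `collection_size`, `searches` and `io_wait_seconds` are hypothetical, chosen only for illustration. Standard CPython builds also serialize bytecode execution across threads, so this sketch will not saturate four CPUs the way the .NET program does; it only illustrates the allocate, fill, search, delete and sleep cycle.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, iterations=3, collection_size=10_000,
             searches=100, io_wait_seconds=0.001):
    """One simulated request: allocate, fill, search, delete, then 'wait on IO'."""
    rng = random.Random(task_id)  # per-task generator, so runs are reproducible
    hits = 0
    for _ in range(iterations):
        # Allocate a large collection and fill it with random data.
        data = {rng.randrange(1_000_000) for _ in range(collection_size)}
        # Search the collection repeatedly (the CPU-intensive part).
        for _ in range(searches):
            if rng.randrange(1_000_000) in data:
                hits += 1
        # Delete all the data.
        data.clear()
        # Simulate an IO wait by sleeping, with some random variation.
        time.sleep(io_wait_seconds * rng.uniform(0.5, 1.5))
    return hits


def run_benchmark(tasks=4, **task_options):
    """Dispatch `tasks` concurrent worker tasks; return (elapsed seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        results = list(pool.map(lambda i: run_task(i, **task_options),
                                range(tasks)))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    elapsed, _ = run_benchmark(tasks=8, iterations=5)
    print(f"completed in {elapsed:.2f} seconds")
```

Raising `collection_size` shifts the load toward memory, while raising `searches` shifts it toward the processor, which is the kind of adjustment the real program exposes through its parameters.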

This benchmark program is a very flexible beast that can be adjusted to stress the machine’s CPUs, memory or both. You can execute it in a shared nothing environment where the threads execute independent of each other. Alternatively, you can set a parameter that adds an element of resource sharing to the running process so that the threads face lock contention. In contention mode, the main thread builds shared data structures that the worker threads access serially to generate a degree of realistic lock contention that can be dialed either up or down by increasing or decreasing the amount of processing delay in the critical section.   
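The contention mode can be pictured with a similar sketch. Again, this is an assumption-laden illustration rather than the author's code: `contended_run` and `hold_seconds` are hypothetical names, and a plain mutex stands in for whatever synchronization the real program uses. The point is only that a longer delay inside the critical section forces the worker threads to queue behind one another.

```python
import threading
import time


def contended_run(workers=4, updates_per_worker=50, hold_seconds=0.0):
    """Worker threads update one shared structure serially under a lock.

    Raising hold_seconds lengthens the critical section, which dials
    lock contention up; zero keeps the critical section as short as possible.
    """
    shared = {"total": 0}  # shared data structure built by the main thread
    lock = threading.Lock()

    def worker():
        for _ in range(updates_per_worker):
            with lock:  # only one thread at a time gets past this point
                shared["total"] += 1
                if hold_seconds:
                    time.sleep(hold_seconds)  # processing delay inside the lock

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"], time.perf_counter() - start


if __name__ == "__main__":
    for hold in (0.0, 0.001, 0.005):
        total, elapsed = contended_run(hold_seconds=hold)
        print(f"hold={hold:.3f}s  updates={total}  elapsed={elapsed:.2f}s")
```

Because every update happens while the lock is held, elapsed time grows roughly in proportion to `hold_seconds` multiplied by the total number of updates, no matter how many processors are available.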

For this first set of Hyper-V guest machine performance experiments, I set the number of concurrent worker tasks to 32 and the number of iterations to 90:
 
ThreadContentionGenerator.exe –tasks 32 –iterations 90

There are additional parameters to vary the virtual memory footprint of the program, the duration of IO waits and the rate of lock contention, but for this set of tests I let the program run with default values for those three parameters. With these settings, the program generates a load that is similar in many respects to a busy ASP.NET web application, one that is compute-bound, with requests that can be processed largely independent of each other. Note that the intent was to stress the Hyper-V environment, beginning by stressing the machine’s CPU capacity, without attempting a realistic simulation of a representative or a particular ASP.NET workload. 

The hardware was an Intel i7 single socket machine with four physical CPUs (and Intel Hyper-Threading disabled) and 12 GB of RAM. The OS was Windows Server 2012 R2.

Native performance (baseline)

Running first on the native machine – after re-booting with Hyper-V disabled – the benchmark program ran to completion in about 90 minutes, the baseline execution time we will use to compare the various virtualization configurations that were tested. The only other active process running on the native Windows machine was Demand Technology’s Performance Sentry performance monitor, DmPerfss.exe, gathering performance counters once per minute.

At this stage, the only aspect of the benchmark program’s resource usage profile that is relevant is its CPU utilization. Because each task being processed goes to sleep periodically to simulate I/O, individual worker threads are not CPU-bound. However, since there are 32 worker threads executing concurrently and only four physical CPUs available, the overall workload is CPU-bound, as evidenced in Figure 25, which reports processor utilization by the top 5 consumers of CPU time during a one hour slice when the ThreadContentionGenerator program was active on the native machine.
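A little arithmetic shows how threads that each spend much of their time asleep can still saturate the machine. The sketch below assumes made-up timings (30 ms of computing followed by a 70 ms simulated IO wait); the post does not give the real values, so only the shape of the calculation matters.

```python
def cpu_demand(threads, busy_ms, wait_ms):
    """Average number of CPUs the threads would use if they never had to queue."""
    per_thread_utilization = busy_ms / (busy_ms + wait_ms)
    return threads * per_thread_utilization


def is_cpu_bound(threads, busy_ms, wait_ms, cpus):
    """True when the combined demand exceeds the processors available."""
    return cpu_demand(threads, busy_ms, wait_ms) > cpus


# Hypothetical timings: each thread computes for 30 ms, then sleeps for 70 ms.
demand = cpu_demand(threads=32, busy_ms=30, wait_ms=70)
print(f"demand: {demand:.1f} CPUs, CPU-bound on 4 CPUs: "
      f"{is_cpu_bound(32, 30, 70, cpus=4)}")
```

With these assumed numbers each thread is busy only 30% of the time, yet 32 of them together ask for more than twice the four CPUs available, so the excess shows up as threads waiting in the Ready Queue.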

Figure 25. Native execution of the benchmark program shows CPU utilization near 400% on a single socket machine with 4 physical CPUs. Instantaneous measurements of the System/Processor Queue Length counter, represented by a dotted line chart plotted against the right-hand y-axis, indicate a significant amount of processor queuing.

You can see in Figure 25 that overall processor utilization approaches the capacity of the machine at close to 400%. The dotted line graph in Figure 25 also shows the instantaneous values obtained from the Processor Queue Length counter. The number of threads waiting in the Windows Scheduler Ready Queue exceeds fifteen for some of the observations. Not only are the four physical CPUs on the machine quite busy, but at many intervals a large number of ready threads are also waiting for service. Figure 26 confirms that the threads waiting in the Ready Queue are predominantly from the ThreadContentionGenerator process (shown in blue), which is the behavior I expected.

Figure 26. Threads with a Wait State Reason indicating they are waiting in the OS Scheduler Ready Queue. As expected, most of the ready threads are from the benchmark program, the ThreadContentionGenerator process.

Standalone in the Root partition

In the next scenario, running standalone on the Root partition under Hyper-V with no child partitions active, the same benchmark executed for approximately 100 minutes, about 11% longer than the native execution baseline. In many scenarios a 10% performance penalty is a small price to pay for the other operational benefits virtualization provides, but it is important to keep in mind that there is always some performance penalty that is due whenever you are running an application in a virtualized environment.

Applications take longer to run inside a virtual machine compared to running native because of a variety of virtualization costs that are not encountered on a native machine. These include performance costs associated with Hyper-V intercepts and Hypercalls, plus the additional path length associated with synthetic interrupt processing. As mentioned above, the benchmark program simulates IO by issuing Timer Waits. These require the timer services of the hypervisor, which are less costly than the synthetic interrupt processing associated with disk and network IO. So, the 10% increase in execution time is very likely a best case for the performance degradation to expect.

Those costs of virtualization are minor irritants so long as the Hyper-V Host machine can supply ample resources to the guest machine. The performance costs of virtualization do increase substantially, however, when guest machines start to contend for shared resources on the Host machine. 

Since processor scheduling is under the control of the hypervisor in the second benchmark run, for reliable processor measurements, it is necessary to turn to the Hyper-V Logical Processor counters, as shown in Figure 27. For a one-hour period while the benchmark program was active, overall processor utilization is reported approaching 400%, but you will notice it is slightly lower than the levels reported for the native machine in Figure 25. Figure 27 also shows an overlay line graphing hypervisor processor utilization against the right-hand y-axis, which accounts for some of the difference. The hypervisor consumes about 6% of one processor over the same measurement interval. The amount of CPU time consumed directly by the Hyper-V hypervisor is one readily quantifiable source of virtualization overhead that causes performance of the benchmark application to degrade by 10% or so. 
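To put the hypervisor's CPU consumption in proportion, the short calculation below converts "6% of one processor" into a share of the four-processor machine and sets it beside the elapsed-time increase for the Root partition run. The function name is hypothetical, and the inputs are the rounded figures quoted in this post.

```python
def hypervisor_share_of_capacity(hypervisor_pct_of_one_cpu, logical_processors):
    """Convert '% of one processor' into a percentage of total machine capacity."""
    return hypervisor_pct_of_one_cpu / logical_processors


share = hypervisor_share_of_capacity(6.0, 4)  # 6% of one CPU on a 4-CPU host
elongation_pct = (100 / 90 - 1) * 100         # 100 minutes vs. the 90-minute baseline
print(f"hypervisor: {share:.1f}% of capacity; elapsed time: +{elongation_pct:.0f}%")
```

On these rounded numbers, the CPU time charged directly to the hypervisor is a small fraction of the machine's capacity, which suggests it explains only part of the longer run time, with the rest presumably coming from the other virtualization costs described earlier.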
Figure 27. Running the benchmark workload standalone on the Root partition, the hypervisor consumes about 6% of one processor. Overall CPU utilization approaches 400% busy, slightly less busy than the configuration shown in Figure 25.

Reviewing the Hyper-V counter measurement data, we can see that threads inside the Root Partition execute on a virtual processor that is subject to the hypervisor Scheduler, the same virtual processor scheduling that is performed for any guest machine child partition. When the Windows OS inside the Root Partition executes a thread context switch, the Hyper-V performance counters graphed in Figure 28 show that there is a corresponding hypervisor context switch. For child partitions, there is an additional Hyper-V Scheduler interrupt that requires processing on a context switch, so there is slightly more virtualization overhead whenever child partitions are involved.

Figure 28. Each time the Windows OS inside the Root Partition executes a thread context switch, there is a corresponding hypervisor context switch.

The Hyper-V Logical Processor utilization measurements do include a metric, CPU Wait Time per Dispatch, that should be directly comparable to the System\Processor Queue Length measurement shown in Figure 25. It is available at the virtual processor level. Unfortunately, this performance counter is not helpful. It is not clear what units of Wait Time are reported, although standard Windows 100-nanosecond timer units seem a likely guess. It also reports Wait Time in very discrete, discontinuous measurements, which is strange. Together, these two issues make the counter difficult to interpret.

Fortunately, the System\Processor Queue Length is an instantaneous measurement that remains serviceable under Hyper-V. Figure 29 shows the same set of Process(*)\% Processor Time counters and a Processor Queue Length overlay line as Figure 25. The length of the processor Ready Queue for the Root partition is comparable to the native benchmark run, with even some evidence that the Ready Queue delays are slightly longer in the configuration where virtualization was enabled.

Figure 29. Processor utilization and processor queue length measurements observed from inside the Root partition, where the benchmark program was executed in a standalone mode.

Microsoft strongly recommends against using the Root partition to execute any work other than what is necessary to administer the VM Host machine. No technical obstacle prevents you from executing application programs on the Root partition, as I did with the benchmark program, but while it is technically feasible, it is not normally an option to consider. The Root partition provides a number of high priority virtualization services, like the handling of synthetic disk and network IO requests, and you want to take pains not to impact them by running any other applications in the Root.

Standalone in a single child partition

Given the prohibition against running applications in the Root, the more useful comparison quantifying the minimum overhead of virtualization would be to compare performance of a guest machine in a child partition with performance on native hardware. So, on the same physical machine, I then created a Windows 8.1 virtual machine and configured it to run with 4 virtual processors. Making sure that nothing else was running on the Hyper-V server, I then ran the same benchmark on the 4-way guest machine. This time the benchmark ran to completion in 105 minutes. 

Notice that the benchmark run took about 5% longer on the child partition, where a single 4-way guest machine was configured, than it did on the Root partition. This virtual machine had access to all the physical CPUs that were available on the physical machine and executed in a standalone environment where it did not have to contend with any other guest VMs for processor resources. An execution time of 105 minutes is about 17% longer than it took the same benchmark program to execute in native mode. Figure 30, which shows the rate at which the Hyper-V hypervisor processed several types of virtualization-related interrupts, provides some insight into why execution time elongates under virtualization. Notice that hypervisor Scheduler interrupts occur when child partitions are executing – these Scheduler interrupts do not occur when the benchmark program's threads are executing inside the Root partition, as illustrated back in Figure 28.

Figure 30. Interrupt processing rates reported for the hypervisor when a child partition is active.

This configuration was also noteworthy because the hypervisor CPU consumption was reported as about 8%, a slightly higher utilization level (+25%) than any of the other configurations evaluated.

Today, performance testing is often performed on virtual machines because they are only intermittently active, plus the ease with which you can spin them up and tear them down again. In my experience it is reasonable to expect the same workload to take about 10% longer to execute if you run inside a VM under ideal circumstances, which implies the VM has access to all the resources it needs on the machine, and there is no or minimal contention for those resources from other resident guest machines. The first set of benchmark tests shows that the performance degradation to expect when a guest machine executes on an efficiently-provisioned VM Host is for tasks to run approximately 10% slower. Consider this a stretch factor that elongates execution time due to various virtualization overheads. Furthermore, it is reasonable to expect this stretch factor to increase whenever the guest machine is under-provisioned or the Hyper-V machine is over-committed.

Summarizing the results so far for (1) the native machine, (2) the benchmark workload running in the Root partition for the sake of comparison, and (3) the best case, an over-provisioned guest machine with no contention:

| Configuration | # of machines | CPUs per machine | Elapsed time (minutes) | Stretch factor | Thruput | Hyper-V % Run Time |
|---|---|---|---|---|---|---|
| Native machine | 1 | 4 | 90 | | 1 | |
| Root Partition | 1 | 4 | 100 | 1.11 | 1 | 6% |
| Guest machine | 1 | 4 | 105 | 1.17 | 1 | 8% |
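The stretch factors in this summary are simply each configuration's elapsed time divided by the 90-minute native baseline. A few lines of Python reproduce them from the elapsed times reported above; the variable and function names are illustrative only.

```python
BASELINE_MINUTES = 90  # elapsed time of the native run

elapsed_minutes = {
    "Native machine": 90,
    "Root Partition": 100,
    "Guest machine": 105,
}


def stretch_factor(elapsed, baseline=BASELINE_MINUTES):
    """Elapsed time relative to the native baseline (1.0 means no slowdown)."""
    return elapsed / baseline


stretch = {name: round(stretch_factor(minutes), 2)
           for name, minutes in elapsed_minutes.items()}
print(stretch)
# {'Native machine': 1.0, 'Root Partition': 1.11, 'Guest machine': 1.17}
```

The same function with `baseline=100` gives 1.05 for the guest machine, which is the 5% difference between the child partition and the Root partition runs noted earlier.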

Next: things will get more interesting when we compare these baseline runs to an under-provisioned guest machine, plus various runs illustrating what happens when the Hyper-V Host is over-committed.
