
Understanding Guest Machine Performance under Hyper-V: more benchmarks

Recognizing an under-provisioned guest machine

Another set of benchmark results documents the additional performance delays that guest machines encounter when they are under-provisioned. I simulated this condition by executing the same benchmark on a guest Windows VM that had access to only two of the four available physical processors. Configured with only two virtual processors, the benchmark program required 147 minutes to run to completion.

Obviously, in this scenario the performance of the benchmark workload being executed on the 2-way guest machine suffered because it did not have access to an adequate number of virtual processors. It is easy to see that this guest machine is under-provisioned in this example where the conditions are tightly controlled. The key is being able to recognize when guest machines that are executing an unknown workload are under-provisioned. Look for the combination of the following:


  1. Each of the Hyper-V virtual processors allotted to the child partition shows % Run Time processor utilization measurements approaching 100% busy, and
  2. The guest machine's internal System\Processor Queue Length measurements exceed three times the number of virtual processors that are configured.

Together, these are reliable indicators that the guest machine workload is constrained by access to too few virtual CPUs.
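The two thresholds above can be combined into a simple screening check. Below is a minimal sketch in Python; the function name, the sample values, and the 95%-busy cutoff used to approximate "approaching 100%" are my own assumptions, not part of the measurements reported here:

```python
def looks_underprovisioned(vcpu_run_times, processor_queue_length,
                           busy_threshold=95.0):
    """Apply the two-part rule of thumb described above.

    vcpu_run_times: per-virtual-processor '% Run Time' samples for the
    child partition (one value per configured virtual processor).
    processor_queue_length: the guest's System\\Processor Queue Length.
    """
    n_vcpus = len(vcpu_run_times)
    # Condition 1: every virtual processor is running close to 100% busy.
    all_vcpus_saturated = all(pct >= busy_threshold for pct in vcpu_run_times)
    # Condition 2: the ready queue exceeds 3x the configured vCPU count.
    queue_excessive = processor_queue_length > 3 * n_vcpus
    return all_vcpus_saturated and queue_excessive

# A 2-way guest with both vCPUs pegged and a long ready queue.
print(looks_underprovisioned([99.2, 98.7], 9))   # True
# One vCPU mostly idle: the workload is not starved for processors.
print(looks_underprovisioned([99.2, 45.0], 9))   # False
```

Both conditions have to hold together; a saturated vCPU with a short ready queue, or a long queue with idle vCPUs, does not indicate under-provisioning by this rule.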

Efficiently-provisioned Hyper-V Guest

When the Hyper-V Host machine is efficiently provisioned, application responsiveness is still affected, but it becomes possible to scale an application up and out. By running the same benchmark program simultaneously on two 2-way guest machines, I was able to generate a simple example of this scale-out behavior. When run concurrently in separate two-processor virtual machines, each individual benchmark ran to completion in about 178 minutes, an execution time stretch factor of almost 2 compared to the native execution baseline. But, interestingly, the overall throughput of the guest machines doubled, since two full complements of tasks ran to completion during that time period.
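The arithmetic behind that trade-off is worth making explicit. A short sketch using only the elapsed times reported above (treating one full benchmark run as one "complement" of tasks):

```python
# Compare the native baseline to the two-guest scale-out run.
native_rate = 1 / 90        # complements per minute: one run in 90 minutes
scale_out_rate = 2 / 178    # two concurrent runs, each finishing in 178 minutes

print(f"native 4-way:      {native_rate:.5f} complements/minute")
print(f"two 2-way guests:  {scale_out_rate:.5f} complements/minute")
```

Even though each individual job stretches by a factor of almost 2, the aggregate completion rate of the two guests is essentially the same as the native baseline: the scale-out configuration trades individual job latency for overall throughput.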

Summarizing the benchmark results reported so far:

| Configuration | # of machines | CPUs per machine | Elapsed time (minutes) | Stretch factor | Throughput | Hyper-V % Run Time |
| --- | --- | --- | --- | --- | --- | --- |
| Native machine | 1 | 4 | 90 | 1.00 | 1 | n/a |
| Root Partition | 1 | 4 | 100 | 1.11 | 1 | 6% |
| Guest machine | 1 | 4 | 105 | 1.17 | 1 | 8% |
| Under-provisioned Guest machine | 1 | 2 | 147 | 1.63 | 1 | 4% |
| 2 Guest machines | 2 | 2 | 178 | 1.98 | 2 | 6% |

Over-committed Hyper-V Host 

Having established that the benchmark workload will absorb all the CPU capacity that is available on the Hyper-V Host, it is easy to move from an efficiently provisioned to an over-committed Host machine. This was accomplished by doubling the number of guest machines executing concurrently, compared to the previous benchmark configuration. With four 2-way guest machines executing concurrently, the Hyper-V Host is thoroughly out of CPU capacity. Yet Hyper-V still continues to execute the guest machine workloads efficiently. The execution time of a single benchmark job increases to 370 minutes, a stretch factor slightly greater than 4 compared to the native machine baseline. Throughput also increases proportionately: four times as many tasks were completed during that longer period.

The symptoms that the Hyper-V Host machine is out of CPU capacity are easy to spot. Figure 31 reports that each of the four guest machines consumes close to 100% of one of the available physical CPUs. Hyper-V utilization continues to hold steady at approximately 6% busy. There is no excess processor capacity.
Figure 31. Guest machines consume all available processor cycles when four 2-way guest machines were configured to run concurrently. Hypervisor CPU utilization continued to hold steady at around 6%.

If the physical CPUs are overloaded, you can then drill into the CPU usage by each of the virtual machines. Figure 32 shows the processor utilization distributed evenly across all the child partition virtual processors, which are weighted evenly in this example. 

Figure 32. Guest machine CPU usage is tracked by virtual processor. Here virtual processor usage is distributed evenly across all the child partition virtual processors, which are weighted evenly in this example.
The results from timing the six benchmark runs that were discussed in this post and the previous one are summarized in Table 3, which also calculates a virtualization "stretch factor." The stretch factor is calculated as the ratio of the elapsed execution time of the guest machine configuration to the elapsed time of the same workload running on native Windows.

Table 3.

| Configuration | # of machines | CPUs per machine | Elapsed time (minutes) | Stretch factor | Throughput | Hyper-V % Run Time |
| --- | --- | --- | --- | --- | --- | --- |
| Native machine | 1 | 4 | 90 | 1.00 | 1 | n/a |
| Root Partition | 1 | 4 | 100 | 1.11 | 1 | 6% |
| Guest machine | 1 | 4 | 105 | 1.17 | 1 | 8% |
| Under-provisioned Guest machine | 1 | 2 | 147 | 1.63 | 1 | 4% |
| 2 Guest machines | 2 | 2 | 178 | 1.98 | 2 | 6% |
| 4 Guest machines | 4 | 2 | 370 | 4.08 | 4 | 6% |
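The derived columns in Table 3 follow directly from the raw elapsed times. A minimal sketch of the calculation, using the 90-minute native run as the baseline:

```python
# Recompute the derived columns of Table 3 from the raw elapsed times.
NATIVE_MINUTES = 90  # the native 4-way baseline run

runs = {
    # name: (machines, vCPUs per machine, elapsed minutes)
    "Root Partition":                  (1, 4, 100),
    "Guest machine":                   (1, 4, 105),
    "Under-provisioned Guest machine": (1, 2, 147),
    "2 Guest machines":                (2, 2, 178),
    "4 Guest machines":                (4, 2, 370),
}

for name, (machines, vcpus, minutes) in runs.items():
    stretch = minutes / NATIVE_MINUTES  # elapsed time relative to native
    throughput = machines               # complements completed per run
    print(f"{name:31s}  stretch {stretch:.2f}  throughput {throughput}x")
```

The recomputed stretch factors agree with Table 3 to two decimal places, except the four-guest case, which works out to 370/90 ≈ 4.11 against the table's 4.08; the small gap presumably reflects rounding in the underlying timings.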

Discussion

Summarizing this set of benchmark results, we can see that it is reasonable to expect any timing test to execute at least 15% longer when it is running on an adequately provisioned virtual machine, compared to running on native hardware. And while I provided only a single, simple example, it is readily apparent that an under-provisioned guest machine pays a substantial performance penalty when its configuration settings prevent it from consuming the resources the workload demands. In that example, a known CPU-bound workload was configured with too few virtual CPUs. This under-provisioning caused the benchmark to execute 40% longer than an efficiently provisioned guest machine executing the same workload. If that guest machine were constrained even further, say, configured to access only one virtual CPU, the performance penalty would have been roughly twice as severe.
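The arithmetic behind those last two claims can be checked against the Table 3 timings. Note that the doubling rule for a 1-vCPU guest is an extrapolation assuming a fully CPU-bound workload, not a measurement:

```python
# Ground the two claims above in the Table 3 timings.
efficient_4way = 105    # minutes: efficiently provisioned 4-vCPU guest
constrained_2way = 147  # minutes: same workload limited to 2 vCPUs

penalty = constrained_2way / efficient_4way - 1
print(f"under-provisioning penalty: {penalty:.0%}")   # 40%

# For a fully CPU-bound workload, halving the vCPU count roughly doubles
# the elapsed time, so a 1-vCPU guest would be expected to need about:
print(2 * constrained_2way, "minutes")                # 294
```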

Finally, we see some evidence of the ability of virtualization technology to support applications that need to scale up and out by running multiple machine images in parallel. This is an important capability, helping in situations where resource demand is very elastic, for instance. Coupled with the ability to provision and spin up a new guest machine instance quickly on a virtualized infrastructure, this capability remains one of the prime drivers behind cloud computing initiatives and adoption.

One final word on the benchmark results. The last column in Table 3 shows the CPU utilization directly attributed to the Hyper-V hypervisor, which ranged from 4% to 8%. The amount of hypervisor overhead is a function of the guest machine activity that generates interrupts, intercepts and Hypercalls. Notice that the scenario with the least amount of hypervisor activity is the one with the guest machine that was under-provisioned with only two virtual processors defined. Not all the overhead associated with Hyper-V virtualization is captured by this performance counter, however, since there are also Hyper-V components that execute in the Root partition and in the child partitions. Hyper-V does provide a set of performance counters under the Hyper-V Logical Processor object that help you to assess how much virtualization overhead is involved. Figure 33 shows an example of these measurements, breaking down the rate of interrupt processing by the hypervisor. Among the four categories of hypervisor interrupts, inter-processor interrupts predominate in this workload, which was running four guest machines concurrently. A smaller number of hypervisor Scheduler, Timer and hardware interrupts were also handled.

Figure 33. Hypervisor interrupt processing, broken down by the type of interrupt. Among the four categories of hypervisor interrupts that are counted, inter-processor signaling interrupts predominate in this workload, which was running four guest machines concurrently.
In the next post, we will look at how effective the CPU priority scheduling options in Hyper-V are at protecting preferred guest machines from the negative performance impact of running on an over-committed Host.

