
Understanding Guest Machine Performance under Hyper-V: more benchmarks

Recognizing an under-provisioned guest machine

Another set of benchmark results documents the additional performance delays encountered when a guest machine is under-provisioned. I simulated this condition by executing the same benchmark on a guest Windows VM that had access to only two of the four available physical processors. Configured to use only two virtual processors, the benchmark program required 147 minutes to run to completion.

Obviously, in this scenario the performance of the benchmark workload executing on the 2-way guest machine suffered because the guest did not have access to an adequate number of virtual processors. In this example, where the conditions are tightly controlled, it is easy to see that the guest machine is under-provisioned. The harder problem is recognizing when guest machines executing an unknown workload are under-provisioned. Look for the combination of the following:


  1. Each of the Hyper-V virtual processors allotted to the child partition shows % Run Time processor utilization measurements approaching 100% busy, and
  2. Internal guest machine System\Processor Queue Length measurements exceed 3x the number of configured virtual processors.

Together, these are reliable indicators that the guest machine workload is constrained by access to too few virtual CPUs.
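As a rough illustration, here is a minimal Python sketch of that two-part check, driven by the typeperf command-line tool: it samples % Run Time for the VM's virtual processors on the Hyper-V Host and the Processor Queue Length inside the guest. The counter-set and instance names, and the assumption that the guest's counters are reachable remotely via typeperf -s, should be verified against your own systems.

```python
"""Sketch: flag a guest VM that looks CPU under-provisioned, using the two
indicators above. Counter and instance names are assumptions; verify them
with 'typeperf -q' on the Hyper-V Host and inside the guest."""
import subprocess

def sample_avg(counter_path, computer=None):
    """Take one typeperf sample and average the values of all instances."""
    cmd = ["typeperf", counter_path, "-sc", "1"]
    if computer:
        cmd += ["-s", computer]          # collect from a remote machine
    out = subprocess.check_output(cmd, text=True)
    # typeperf emits CSV; the data rows begin with a quoted timestamp.
    row = [line for line in out.splitlines() if line.startswith('"')][-1]
    values = [float(v.strip('" ')) for v in row.split(",")[1:]]
    return sum(values) / len(values)

def guest_looks_under_provisioned(vm_name, guest_host, n_vcpus):
    # Host side: % Total Run Time averaged over the VM's virtual processors.
    vp_busy = sample_avg(
        rf"\Hyper-V Hypervisor Virtual Processor({vm_name}:Hv VP *)\% Total Run Time")
    # Guest side: the ready-thread backlog measured inside the VM itself.
    queue = sample_avg(r"\System\Processor Queue Length", computer=guest_host)
    # The two indicators: virtual CPUs pegged near 100% busy, plus a run
    # queue longer than 3x the number of configured virtual processors.
    return vp_busy > 95 and queue > 3 * n_vcpus

if __name__ == "__main__":
    # Hypothetical VM and guest machine names, for illustration only.
    print(guest_looks_under_provisioned("BENCHVM1", "BENCHVM1", n_vcpus=2))
```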

Efficiently provisioned Hyper-V Guest

When the Hyper-V Host machine is efficiently provisioned, application responsiveness is still affected, but it becomes possible to scale an application up and out. By running the same benchmark program simultaneously on two 2-way guest machines, I was able to generate a simple example of this scale-out behavior. When run concurrently in separate two-processor virtual machines, each individual benchmark ran to completion in about 178 minutes, an execution time stretch factor of almost 2 compared to the native execution baseline. But, interestingly, the overall throughput of the guest machines doubled, since two full complements of tasks ran to completion during that time period.

Summarizing the benchmark results reported so far:

| Configuration | # of machines | CPUs per machine | Elapsed time (minutes) | Stretch factor | Throughput | Hyper-V % Run Time |
| --- | --- | --- | --- | --- | --- | --- |
| Native machine | 1 | 4 | 90 | 1.00 | 1 | – |
| Root Partition | 1 | 4 | 100 | 1.11 | 1 | 6% |
| Guest machine | 1 | 4 | 105 | 1.17 | 1 | 8% |
| Under-provisioned Guest machine | 1 | 2 | 147 | 1.63 | 1 | 4% |
| 2 Guest machines | 2 | 2 | 178 | 1.98 | 2 | 6% |

Over-committed Hyper-V Host 

Having established that the benchmark workload will absorb all the CPU capacity available on the Hyper-V Host, it is easy to move from an efficiently provisioned to an over-committed Host machine. This was accomplished by doubling the number of guest machines executing concurrently, compared to the previous benchmark configuration. With four 2-way guest machines executing concurrently, the Hyper-V Host is completely out of CPU capacity. Yet Hyper-V still continues to execute the guest machine workloads efficiently. The execution time of a single benchmark job increases to 370 minutes, a stretch factor of approximately 4.1 compared to the native machine baseline. Throughput also increases proportionately: four times as many tasks were completed during that longer period.

The symptoms that the Hyper-V Host machine is out of CPU capacity are easy to spot. Figure 31 reports that each of the four guest machines consumes close to 100% of one of the available physical CPUs. Hyper-V utilization continues to hold steady at approximately 6% busy. There is no excess processor capacity.
Figure 31. Guest machines consume all available processor cycles when four 2-way guest machines were configured to run concurrently. Hypervisor CPU utilization continued to hold steady at around 6%.

If the physical CPUs are overloaded, you can then drill into the CPU usage by each of the virtual machines. Figure 32 shows the processor utilization distributed evenly across all the child partition virtual processors, which are weighted evenly in this example. 

Figure 32. Guest machine CPU usage is tracked by virtual processor. Here virtual processor usage is distributed evenly across all the child partitions, which are weighted evenly in this example.
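A sketch of that drill-down, again using typeperf from Python, might look like the following: it first samples the host's overall logical processor utilization and, when the host looks saturated, lists % Total Run Time for each guest virtual processor along with the hypervisor's own share. The Hyper-V counter-set names here are assumptions and should be checked against what your host actually exposes.

```python
"""Sketch: when the Hyper-V Host looks saturated, break CPU usage down by
guest virtual processor, roughly mirroring Figures 31 and 32."""
import subprocess

def sample_instances(counter_path):
    """One typeperf sample; returns {column header: value} per instance."""
    out = subprocess.check_output(
        ["typeperf", counter_path, "-sc", "1"], text=True)
    rows = [line for line in out.splitlines() if line.startswith('"')]
    header, data = rows[0].split(","), rows[-1].split(",")
    return {h.strip('"'): float(v.strip('" '))
            for h, v in zip(header[1:], data[1:])}   # skip the timestamp

# Step 1: overall pressure on the physical (logical) processors.
total = sample_instances(
    r"\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time")
hypervisor = sample_instances(
    r"\Hyper-V Hypervisor Logical Processor(_Total)\% Hypervisor Run Time")

# Step 2: if the host is pegged, see which virtual processors are busiest.
if next(iter(total.values())) > 90:
    per_vp = sample_instances(
        r"\Hyper-V Hypervisor Virtual Processor(*)\% Total Run Time")
    for vp, pct in sorted(per_vp.items(), key=lambda kv: -kv[1]):
        print(f"{pct:6.1f}%  {vp}")
    print(f"Hypervisor overhead: {next(iter(hypervisor.values())):.1f}%")
```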
The results from timing the six benchmark runs discussed in this post and the previous one are summarized in Table 3, which also calculates a virtualization "stretch factor." The stretch factor is the ratio of the elapsed execution time of each guest machine configuration to the elapsed time of the same workload running natively on Windows.

Table 3.

| Configuration | # of machines | CPUs per machine | Elapsed time (minutes) | Stretch factor | Throughput | Hyper-V % Run Time |
| --- | --- | --- | --- | --- | --- | --- |
| Native machine | 1 | 4 | 90 | 1.00 | 1 | – |
| Root Partition | 1 | 4 | 100 | 1.11 | 1 | 6% |
| Guest machine | 1 | 4 | 105 | 1.17 | 1 | 8% |
| Under-provisioned Guest machine | 1 | 2 | 147 | 1.63 | 1 | 4% |
| 2 Guest machines | 2 | 2 | 178 | 1.98 | 2 | 6% |
| 4 Guest machines | 4 | 2 | 370 | 4.08 | 4 | 6% |
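As a quick cross-check of the stretch factor column, the calculation is simply the configuration's elapsed time divided by the 90-minute native baseline; this short sketch reproduces several of the values in Table 3.

```python
# Stretch factor = elapsed time of a configuration / native elapsed time.
NATIVE_MINUTES = 90
for config, elapsed in [("Root Partition", 100),
                        ("Guest machine", 105),
                        ("Under-provisioned Guest machine", 147),
                        ("2 Guest machines", 178)]:
    print(f"{config:32s} {elapsed / NATIVE_MINUTES:.2f}")
```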

Discussion

Summarizing this set of benchmark results, we can see that it is reasonable to expect any timing test to execute at least 15% longer when it is running on an adequately provisioned virtual machine, compared to running on native hardware. And while I provided only a single, simple example, it is readily apparent that an under-provisioned guest machine pays a substantial performance penalty when its configuration settings restrict it from consuming the resources the workload demands. In that example, a known CPU-bound workload was configured with too few virtual CPUs. This under-provisioning caused the benchmark to execute 40% longer than an efficiently provisioned guest machine executing the same workload. If that guest machine were constrained even further, say, configured to access only one virtual CPU, the performance penalty would have been twice as severe.

Finally, we see some evidence for the ability of virtualization technology to support applications that need to scale up and scale out by running multiple machine images in parallel. This is an important capability, helping, for instance, in situations where resource demand is highly elastic. Coupled with the ability to provision and spin up a new guest machine instance quickly on virtualized infrastructure, this capability remains one of the prime drivers behind cloud computing initiatives and adoption.

One final word on the benchmark results. The last column in Table 3 shows the CPU utilization directly attributed to the Hyper-V hypervisor, which ranged from 4% to 8%. The amount of hypervisor overhead is a function of the guest machine activity that generates interrupts, intercepts and Hypercalls. Notice that the scenario with the least amount of hypervisor activity is the one with the guest machine that was under-provisioned with only two virtual processors defined. Not all the overhead associated with Hyper-V virtualization is captured by this performance counter, however, since there are also Hyper-V components that execute in the Root partition and in the child partitions. Hyper-V does provide a set of performance counters under the Hyper-V Logical Processor object that help you assess how much virtualization overhead is involved. Figure 33 is an example of these measurements, which break down the rate of interrupt processing by the hypervisor. Among the four categories of hypervisor interrupts, inter-processor interrupts predominate in this workload, which was running four guest machines concurrently. A smaller number of hypervisor Scheduler, Timer and hardware interrupts were also handled.

Figure 33. Hypervisor interrupt processing, broken down by the type of interrupt. Among the four categories of hypervisor interrupts that are counted, inter-processor signaling interrupts predominate in this workload, which was running four guest machines concurrently.
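As a rough illustration, the sketch below samples the four interrupt categories named above in a single typeperf call. The exact counter names are assumptions inferred from those category names; use typeperf -q to confirm what your Hyper-V Host actually exposes before relying on them.

```python
"""Sketch: sample the per-category hypervisor interrupt rates shown in
Figure 33. Counter names are assumptions; verify with 'typeperf -q'."""
import subprocess

CATEGORIES = ["Inter-Processor Interrupts/sec", "Scheduler Interrupts/sec",
              "Timer Interrupts/sec", "Hardware Interrupts/sec"]
paths = [rf"\Hyper-V Hypervisor Logical Processor(_Total)\{name}"
         for name in CATEGORIES]

out = subprocess.check_output(["typeperf"] + paths + ["-sc", "1"], text=True)
row = [line for line in out.splitlines() if line.startswith('"')][-1]
for name, raw in zip(CATEGORIES, row.split(",")[1:]):
    rate = float(raw.strip('" '))
    print(f"{name:34s} {rate:10.1f}")
```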
In the next post, we will look at how effective the CPU priority scheduling options in Hyper-V are at protecting preferred guest machines from the negative performance impact of running on an over-committed Host.

