
Interpreting Windows performance gathered from VMware and Hyper-V Guest Machines


In this blog entry, I want to step back and revisit a topic I blogged about much earlier (see https://performancebydesign.blogspot.com/search/label/VMware) and discuss how guest machine performance counters are impacted by virtualization in general. Based on those impacts, we can assess which of the performance measurements that guest machines produce remain viable for diagnosing performance problems and understanding capacity issues. The impact of the virtualization environment varies considerably depending on the type of performance counter.

To begin, it is necessary to understand how both the Hyper-V and VMware ESX hypervisors affect the clocks and timers that are available to their guest machines. Essentially, Hyper-V intercepts all calls made from the guest OS to access hardware-based clock and timer services on the Host machine and substitutes a virtualized Time of Day clock value. The hypervisor takes pains to ensure that the virtual clock value presented to the guest is minimally consistent: any intercepted timer service call always returns a virtual clock value that is later than the value returned by the previous call. Consequently, the guest machine obtains a view of the current time that is monotonically increasing, but the virtual clock does not guarantee that successive measurement intervals are of equal and consistent duration, which is precisely the assumption that guest machine monitoring tools make.

Inside a guest Windows machine, the OS provides the following basic clock and timer services that are relied on in performance monitoring:
  • a periodic clock interrupt that Windows relies upon to maintain the System Time of Day clock,
  • the rdtsc (Read TimeStamp Counter), an Intel hardware instruction that Windows wraps using the QueryPerformanceCounter function.
Under Hyper-V and VMware ESX, both time sources that Windows depends upon for performance measurement are virtualized. From inside the Windows guest machine, there is no clock or timer service that is consistently reliable, which affects any granular measurement, including many Windows performance counters.

The fact that virtualization impacts external, hardware-based clock and timer services should not be too surprising. Whenever the guest machine accesses an external hardware timer or clock, that access is virtualized like any other access to an external hardware device. Any external device IO request is intercepted by the Hyper-V software, which then redirects the request to the hypervisor services that access the actual physical hardware device. If the target device happens to be busy servicing another request originating from a different virtual machine, the Hyper-V software must queue the request. When the actual hardware device completes its processing of the request, there is an additional processing delay associated with Hyper-V routing the result back to the guest machine that originated the request. In the case of a guest machine making a synchronous HPET timer request, this interception and routing overhead introduces an amount of “jitter” into the clock values that are retrieved that is not otherwise present when Windows is running in native mode.

This clock jitter, the result of the virtualization overhead associated with Hyper-V’s intercept of an external clock interface request, is one problem. Hyper-V Host machines that contain multiple sockets face a different problem. Each socket is a separate clock domain with its own TimeStamp Counter (TSC), which increments at a constant rate. Processors support the rdtsc instruction to read the current value of the TSC, a low latency facility that normally provides the granular clock measurements used in the QueryPerformanceCounter API. Currently, Windows synchronizes the TSCs on board each processor core during the boot process, but older hardware does not support this function. The crystal oscillator that drives the update of the TSCs flutters a bit, and over time this fluctuation causes the TSC values on different sockets to diverge.

When clock domains are not synchronized, and a guest machine virtual processor is dispatched by the hypervisor first on one logical processor on one socket, followed by being dispatched on a logical processor on another socket, it is possible for a native rdtsc instruction issued on the 2nd socket to return a clock value that is earlier than the clock value obtained by an earlier rdtsc instruction issued on the 1st socket.

Windows running in native mode on a multi-socket computer has a strategy for dealing with this anomaly that is built into the QueryPerformanceCounter function. A Windows guest machine, however, never has the same opportunity to detect and correct for this potential anomaly because it is dealing with virtual clocks. What Hyper-V does is ensure that the clock value returned by the rdtsc instruction is never less than the previous clock value. Microsoft calls this Hyper-V clock service its reference time enlightenment, which ensures that time never runs backwards in the guest machine. In the process, what gets sacrificed is clock accuracy: a duration calculated from two successive rdtsc instructions or two successive calls to the QueryPerformanceCounter API will not reliably and accurately reflect the elapsed wall clock time between the two timing events.

How this Hyper-V reference time enlightenment affects performance counters gathered on the guest machine depends on the type of performance counter. The impact of virtualization on the % Processor Time measurements reported by a Windows guest is severe, but there is a way to derive reasonably accurate measurements by correcting the distortion that virtualization introduces. On the other hand, virtualization has only a minimal impact on the most common Difference counters, counters that report the number of events that occurred between two successive measurements. And for the class of counters that are, in fact, instantaneous measurements of system or process state, running under virtualization has little or no impact. Finally, I will discuss the Disk performance counters that rely on the QueryPerformanceCounter API to measure IO response time. These measurements remain quite useful, but what they measure changes under virtualization. The sections that follow discuss the impact of virtualization on each of these classes of counters in more detail.

% Processor Time counters

The % Processor Time counters at the system, processor and process level are all impacted by clock virtualization. At the system level, you need to substitute the measurements that the hypervisor Scheduler reports for the guest machine. For example, under Hyper-V you substitute the Hyper-V Virtual Processor\% Run Time measurements for the Processor\% Processor Time measurements you would use when Windows is running native. The corresponding metric in VMware is the CPU Usage (Percentage) counter available from the Guest Aggregate statistics.

For those situations where you need to understand CPU utilization at the process level for the Windows guest machine, you can make a simple adjustment to those guest machine measurements, once you have the hypervisor virtual processor utilization measurements from either VMware or Hyper-V.

To measure processor utilization, the Windows OS uses a periodic clock interrupt designed to interrupt the machine 64 times per second. While servicing this high priority clock interrupt, Windows maintains its System Time of Day clock. (The Windows System Time of Day clock is maintained in 100 nanosecond timer units, but the actual time of day values are only updated 64 times per second – approximately once every 15.6 milliseconds.) The same clock interrupt service routine also samples the execution state of the machine, and these samples form the basis of the % Processor Time measurements that are maintained at the processor, process, and thread levels in Windows.
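As a quick sanity check on those units, here is a sketch of the arithmetic relating the 100-nanosecond timer units to the 64 per second clock interrupt rate (the tick rate is the documented default; nothing here is measured):

```python
# The System Time of Day clock is kept in 100-nanosecond units,
# but it only advances when the periodic clock interrupt fires.
UNITS_PER_SECOND = 10_000_000   # 100 ns units in one second
TICKS_PER_SECOND = 64           # default periodic clock interrupt rate

units_per_tick = UNITS_PER_SECOND // TICKS_PER_SECOND
ms_per_tick = 1000 / TICKS_PER_SECOND

print(units_per_tick)   # → 156250 (100 ns units added per clock tick)
print(ms_per_tick)      # → 15.625 (milliseconds between clock updates)
```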

Virtualization technology impacts any asynchronous timer request of this type. Timer interrupts are subject to additional delay because the original interrupt is serviced first by hypervisor services, and only sometime later is a virtualized device interrupt presented to the guest machine. Interrupt delay time is normally minimal if the virtual guest is currently dispatched. However, if the guest machine is blocked by the Hyper-V Scheduler at the point in time when the periodic clock interrupt that the Windows Scheduler relies upon is scheduled to expire, the interrupt is deferred, delaying the timer expiration that much longer.

As we see in some of the benchmark results at the guest machine level under both Hyper-V and VMware, in extreme circumstances guest machine scheduling delays can grow large enough to cause some periodic timer interrupts to be dropped entirely. On a Windows guest, when the Hyper-V Host machine CPUs start to get busy, the periodic timer interrupt that drives the CPU accounting function becomes highly irregular. Windows is still sampling the virtual processor execution state to see if the processor is idle or executing someone’s code, but the sampling rate can drop considerably below 64 times per second.

So, virtualization does introduce variability in the duration of the interval between samples and sometimes causes the sampling rate itself to fluctuate.

However, neither the variability in the time between samples nor possible changes to the sampling rate undermine the validity of the sampling process. The execution state sampling performed by the periodic clock interrupt handler still amounts to a random sampling of processor utilization. The sample data should still accurately reflect the relative proportion of the actual CPU time used by the individual processes running on the guest machine.
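To see why irregular sampling leaves the relative proportions intact, here is a minimal simulation (a toy model with made-up run times, not real counter data):

```python
import random

# Toy model: within every 100 ms of wall clock time, "Process A" runs
# for the first 40 ms, "Process B" for the next 20 ms, and the CPU is
# idle for the remaining 40 ms. We sample the execution state at
# irregular intervals -- a nominal 15.625 ms tick plus a random
# virtualization delay -- and tally what we observe.
random.seed(42)

def state_at(t_ms):
    phase = t_ms % 100.0
    if phase < 40.0:
        return "A"
    if phase < 60.0:
        return "B"
    return "idle"

counts = {"A": 0, "B": 0, "idle": 0}
t = 0.0
n_samples = 200_000
for _ in range(n_samples):
    t += 15.625 + random.uniform(0.0, 70.0)   # irregular sampling interval
    counts[state_at(t)] += 1

est_a = counts["A"] / n_samples   # expect a value close to 0.40
est_b = counts["B"] / n_samples   # expect a value close to 0.20
print(est_a, est_b)
```

Even though the interval between samples varies widely and the effective sampling rate falls well below 64 per second, the estimated shares still converge on the true 40% and 20% utilizations.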

Since you have access to the actual guest machine CPU usage – from the Hyper-V Virtual Processor\% Run Time counters – you can adjust the internal guest machine process level data accordingly. You need to accumulate a sufficient number of guest machine samples to ensure that the sample data is a good estimator of the underlying population. In practice, that means if you have at least five minutes worth of Windows counter data at both the Hyper-V Host machine and guest machine level for the same five minute interval, you should be able to correct for the distortion due to virtualization and reliably estimate processor utilization at the process level within Windows guest machines.


Calculate a correction factor, w, from the ratio of the actual guest machine physical CPU usage, as reported by the hypervisor, to the processor utilization reported by the Windows guest. For a guest machine with multiple virtual processors, calculate w using the average utilization across virtual CPUs, as follows:

w = mean(Virtual Processor\% Run Time) / mean(Processor\% Processor Time)

w is also a weight factor that can be used to correct the % Processor Time counter values reported at the process level to determine the actual CPU usage of individual guest machine processes.
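As a sketch, the calculation of w might look like the following (the per-interval counter samples here are made-up, illustrative values, not measured data):

```python
# Hypothetical per-interval samples of the two counters, collected over
# the same five-minute window on the Host and inside the guest.
virtual_processor_run_time = [85.0, 90.0, 88.0, 89.0]   # Hyper-V: Virtual Processor\% Run Time
processor_processor_time   = [70.0, 75.0, 72.0, 75.0]   # guest: Processor\% Processor Time

def mean(xs):
    return sum(xs) / len(xs)

# w = mean(Virtual Processor\% Run Time) / mean(Processor\% Processor Time)
w = mean(virtual_processor_run_time) / mean(processor_processor_time)
print(round(w, 3))   # → 1.205 (i.e., 88 / 73)
```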

An example. To work through a simple example of calculating and applying the weight factor to correct the processor utilization measurements at the process level, let’s assume a guest machine that is configured to run with just one virtual CPU. Suppose Windows reports the following performance counter values for the Windows guest machine:


                 % Processor Time (virtual)
Processor(0)     73%
Process A        40%
Process B        20%

From the Hyper-V counters, suppose you see that the actual CPU utilization of the guest machine is 88%.

You can then correct the internal process CPU time measurements by weighting them by the ratio of the actual CPU time consumed compared to the CPU time reported by the guest machine (88:73, or roughly 1.2).


                 % Processor Time (virtual)   % Processor Time (actual)
Processor(0)     73%                          88%
Process A        40%                          48%
Process B        20%                          24%

Difference counters.

Windows performance counters that are difference counters are also impacted by virtualization of the Windows time of day clock. Difference counters, which are mainly of type PERF_COUNTER_COUNTER, are probably the most common type of Windows performance counter. They are based on a simple count of the number of events that occur. Examples include Memory\Pages/sec, Logical Disk(n)\Disk Transfers/sec, TCPvx\TCP segments/sec, and other similar counters. On a Hyper-V guest, you can trust most of these counters, although if the Hyper-V Host gets very busy, you should expect some anomalies.

Difference counters are performance counters whose values are transformed by performance monitoring software and reported as events/second rates. The performance monitoring software calculates an interval delta by subtracting the counter’s value from the previous measurement interval from its current value. This interval delta is then divided by the interval duration to create a rate, i.e., events/second. In that events/second calculation, the numerator – the number of these events that were observed during the interval – remains a valid count field.

As an example of how a difference counter functions, let’s look at the Logical Disk\Disk Bytes/sec counter, which is the sum of the Disk Bytes Read/sec and Disk Bytes Written/sec counters. It is the responsibility of the disk driver software in Windows to maintain a DISK_PERFORMANCE structure associated with each Logical and Physical Disk that is updated at the end of every IO operation. Using the Performance Monitoring API, the Performance Monitor obtains a current value for Disk Bytes Read and Disk Bytes Written. The Performance Monitor retains the previous set of values for these counters from the last collection period, and also remembers the time that collection interval occurred. To generate the disk throughput rate for the last measurement interval, the Performance Monitor calculates:


Disk Bytes/sec = (DiskBytes_t0 - DiskBytes_(t-1)) / (Timestamp_t0 - Timestamp_(t-1))
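In code, the rate calculation above amounts to the following (the byte counts and timestamps are hypothetical, not real counter data):

```python
# Cumulative counter values captured at two successive collection
# intervals, with the timestamp (in seconds) at which each was taken.
prev_disk_bytes, prev_timestamp = 52_000_000, 100.0
curr_disk_bytes, curr_timestamp = 60_400_000, 102.0

def rate(curr_value, prev_value, curr_time, prev_time):
    """events/sec: the interval delta divided by the interval duration."""
    return (curr_value - prev_value) / (curr_time - prev_time)

disk_bytes_per_sec = rate(curr_disk_bytes, prev_disk_bytes,
                          curr_timestamp, prev_timestamp)
print(disk_bytes_per_sec)   # → 4200000.0 bytes/sec
```

Under virtualization, the numerator (the byte count) remains trustworthy; it is the denominator, derived from two virtualized timestamps, that can be distorted.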

Most of the time you can trust the values of these performance counters from a Windows guest machine because the disk driver in the guest machine keeps an accurate count of the number of Disk Bytes transferred. What is not entirely dependable under both VMware and Hyper-V, however, is the interval duration, something which is derived from two successive calls to virtualized timer services that may or may not be delayed significantly. While there is no doubt that the events that were counted during the measurement interval actually occurred, what can be suspect is the calculation of the rate that those events occurred because the timestamps themselves are unreliable.

For difference counters that report disk or network IO rates and throughput for synthetic disks and NICs, you can compare the performance counters reported by the Root partition, which is handling all communication with the physical devices, to those surfaced by the resident guest machines. Figure 1 reports the results of one of these comparisons for a Root partition with one active guest partition that was issuing disk IO requests to a synthetic drive. Since the synthetic hard drive on the child partition is a .vhdx file on the Root partition’s file system, a difference counter like Disk Bytes/sec is being counted independently by both the device driver software running on the Root and the device driver software running inside the child partition. The measurement intervals reported in Figure 1 are one minute apart for a two-hour period. 

Notice that the measurements on the guest machine line up well with those taken on the Root partition, although the Root partition performs some additional disk IO of its own – logging performance data, Event logging, and maintaining the file system on the logical disk where the synthetic disk the guest machine is reading and writing resides. In intervals where the measurements from the Root and the active child partition differ, they consistently differ in the direction of the Root partition issuing more IO requests and transferring more data than the guest machine.

Figure 1. Comparison of a difference counter measuring disk throughput on the guest machine and on the Root partition. The guest machine measurements line up well with those taken on the Root partition. For those intervals where the measurements from the Root and the active child partition differ, they consistently differ in the direction of the Root partition issuing more IO requests and transferring more data than the guest machine.

Instantaneous counters.

On the other hand, there are counter types that are unaffected by virtualized clock and timer values. These are instantaneous counters, effectively a snapshot of a value such as Memory\Available Bytes that is observed at a single point in time. (This sort of counter is known in the official documentation as a PERF_COUNTER_RAWCOUNT.) Since there is no interval timer or duration associated with the production of this type of counter, virtualized clock and timer values have no impact on the validity of these measurements. They are good to use as is.

Disk performance counters that use QueryPerformanceCounter

Finally, there is a set of counters that use the QueryPerformanceCounter API to measure duration at higher precision than the System Time of Day clock permits. QueryPerformanceCounter is a Windows API that wraps the hardware rdtsc instruction. Under both VMware and Hyper-V, timings based on the lightweight rdtsc instruction issued from guest machines are subject to virtualization delays. Note that rdtsc instructions are issued routinely in Windows by system level functions for disk performance monitoring and by the TCP/IP networking stack to measure a TCP session’s Round Trip Time; they can also be issued by a program executing at any protection level. Due to the potential for anomalies on a multi-socket Hyper-V Host machine with unsynchronized clock domains, Hyper-V intercepts all rdtsc instructions and returns a consistent “reference” time, which is not necessarily the actual value reported by the hardware TSC.

Performance counters that utilize the QueryPerformanceCounter API are found mainly among the Logical and Physical Disk counters. The System Time of Day clock, which advances 64 times per second, provides too low a resolution for timing disk IO operations that often complete within 10 milliseconds or less. In fact, the QueryPerformanceCounter API was originally introduced back in Windows 2000 to improve the resolution of performance counters such as Logical Disk(*)\Avg. Disk sec/Transfer – and, thus, the unfortunate name for an API that is actually the Windows high resolution timer function.

In theory, Hyper-V’s virtualization of the rdtsc instruction should not have a major impact on the timer-based calculations used to produce counters like Avg. Disk sec/Transfer that measure disk latency in Windows. To calculate disk latency in Windows, the disk device driver issues a call to QueryPerformanceCounter that executes an rdtsc instruction at time t1 when the IO operation to disk is initiated. The driver issues a second call to QueryPerformanceCounter at time t2 when the IO completes. The device driver software calculates the interval delta, t2 – t1, and accumulates the total elapsed time for the device in a DISK_PERFORMANCE structure. Since the rdtsc timing instructions are effectively issued in line by the guest machine, the clock values returned by virtualized rdtsc calls are usually very close to the actual rdtsc values, plus some slight additional overhead associated with the Hyper-V hypervisor intercepting the instruction.
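A sketch of the driver's bookkeeping follows. This is a simplified model of the DISK_PERFORMANCE accumulation with made-up timer values, not the actual Windows driver code; a real driver works in QueryPerformanceCounter ticks divided by the QueryPerformanceFrequency value, whereas here the timestamps are already in seconds:

```python
# Simplified model of how a disk driver accumulates elapsed IO time
# and how the Avg. Disk sec/Transfer counter is derived from it.
class DiskPerformance:
    def __init__(self):
        self.total_io_time = 0.0    # accumulated elapsed time for the device
        self.transfer_count = 0

    def complete_io(self, t1, t2):
        """Record one IO that started at t1 and completed at t2."""
        self.total_io_time += t2 - t1
        self.transfer_count += 1

    def avg_disk_sec_per_transfer(self):
        return self.total_io_time / self.transfer_count

disk = DiskPerformance()
disk.complete_io(0.000, 0.008)   # an 8 ms IO
disk.complete_io(0.010, 0.022)   # a 12 ms IO
print(round(disk.avg_disk_sec_per_transfer(), 3))   # → 0.01 (10 ms average)
```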

When a guest machine issues an IO request to a synthetic disk, Hyper-V redirects the disk IO request to the Root partition, where it is translated into a request to read or write the .vhdx file in the Root partition’s file system. The precise sequence of calls to the QueryPerformanceCounter function to measure the duration of a synthetic disk IO request is as follows:


  • at t0, by the guest machine device driver marking the beginning of the disk IO request
  • at t1, by the Root partition device driver marking the beginning of the disk IO request
  • at t2, by the Root machine device driver recording the completion of the disk IO request
  • at t3, by the guest machine device driver recording the completion of the disk IO request

The Root partition calculates the response time (RT) of the IO request as t2 – t1. Meanwhile, the guest machine calculates the response time of the IO request as t3 – t0. It follows then that RTguest > RTroot for every disk IO request.
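The four timestamps can be sketched as follows (the values are hypothetical; the ordering t0 < t1 < t2 < t3 is what the call sequence above guarantees):

```python
# Hypothetical QueryPerformanceCounter readings (in seconds) for one
# synthetic disk IO, following the four-step sequence above.
t0 = 0.0000   # guest driver: IO request begins
t1 = 0.0004   # Root partition driver: IO request begins
t2 = 0.0080   # Root partition driver: IO completes
t3 = 0.0086   # guest driver: IO completion observed

rt_root  = t2 - t1   # response time as the Root partition measures it
rt_guest = t3 - t0   # response time as the guest machine measures it

assert rt_guest > rt_root   # holds for every request, by construction
print(rt_root, rt_guest)
```

The difference rt_guest − rt_root quantifies the extra routing and dispatching delay the request incurred on its round trip through the hypervisor.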

To investigate the relative accuracy of disk IO timing based on calls to QueryPerformanceCounter executed by Windows guest machines under Hyper-V, let’s again look at the experiment, illustrated in Figure 1, where I compared the values reported in the disk performance counters for the Root partition and an active child partition that was executing an IO workload. This time, instead of looking at a Logical Disk difference counter, let’s compare the counters that use the QueryPerformanceCounter function to measure the duration of disk IO requests.

A comparison of the Avg. Disk sec/Write measurements gathered at the same time on the Root partition and the active child partition is shown in Figure 2. Ignoring a few intervals where the guest machine reported zero values for the Avg. Disk sec/Write counter, the guest machine measurements again line up well with the ones gathered by the Root partition.

Figure 2. Comparing disk response time measurements gathered using calls to the Win32 QueryPerformanceCounter function by the guest machine and the Root partition. As expected, the Avg. Disk sec/Write measurements gathered by the child partition are uniformly higher than the measurements reported in the same interval by the Root partition.

Consistent with the sequence in which QueryPerformanceCounter clock values are gathered on the guest machine and the Root partition to measure the response time of the IO request, the Avg. Disk sec/Write measurements on the guest machine are uniformly greater than the Avg. Disk sec/Write measurements taken on the Root partition. In fact, IO requests to synthetic devices do take longer to process than native IO, due to the extra processing steps that are required, plus the possibility of deferred interrupts, and now you understand how to measure how much longer. Comparing the Hyper-V Host measurements of disk response time to the same measurements on the guest machine quantifies the dispatching delays that a guest machine potentially experiences under Hyper-V (and VMware).

In practice, note that the comparison between the Host and guest machine measurements is only valid when either (1) a single guest machine is running (which was the case reported here), or (2) the guest machine synthetic disk is allocated on its own dedicated Host machine logical disk, something that is frequently done to ensure better performance of the guest machine workload. If, on the other hand, the Host machine disk is configured to contain virtual disk files from multiple guest machines, the inference is not valid: logically, you should not compare the Hyper-V Host aggregate measurements to one instance of the guest machine measurements. This is another reason why isolating a guest machine synthetic disk to a dedicated Host machine disk is considered a Best Practice when disk performance on the guest machine is critical.
