This post is a prequel to a recent one on correcting the Process(*)\% Processor Time counters on a Windows guest machine. It looks at how VMware virtualization affects four types of Windows performance counters:
- % Processor Time counters
- Difference counters
- Instantaneous counters
- Disk performance counters that use the QueryPerformanceCounter API
To assess the overall impact of the VMware virtualization
environment on the accuracy of the performance measurements available for
Windows guest machines, it is necessary to first understand how VMware affects
the clocks and timers that are available on the guest machine. Basically,
VMware virtualizes all calls made from the guest OS to hardware-based clock and
timer services on the VMware Host. A VMware white paper entitled “Timekeeping in VMware
Virtual Machines” contains an extended discussion of the clock and timer
distortions that occur in Windows guest machines when there are virtual machine
scheduling delays. These clock and timer distortions, in turn, distort a considerable number of Windows performance counters, with the impact depending on the specific type of performance counter. (The different types of performance counters are described here.)
First, let’s look into the clock and timer distortions that
occur in VMware, and then let’s discuss how these distorted clock values impact
various Windows performance measurements.
You may recall from an earlier blog entry on the subject
that the Windows OS provides the following clock and timer services:
a. a periodic clock interrupt that Windows relies upon to maintain the System Time of Day clock, and
b. the rdtsc instruction, which Windows wraps using a call to the QueryPerformanceCounter function.[1]
In VMware, virtualization affects both of the time sources that Windows depends on for performance measurement. From inside the Windows guest machine, there is no clock or timer service that is consistently reliable. The consequences for performance monitoring are profound.
The fact that virtualization impacts external,
hardware-based clock and timer services isn’t too surprising. Whenever the
guest machine accesses an external hardware timer or clock, that access is
virtualized like any other access to an external device. Any external device IO
request is intercepted by the VM Host software, which then redirects the
request to the actual hardware device. If the target device happens to be busy
servicing another request originating from a different virtual machine, the VM
Host software must queue the request. When the actual hardware device completes
its processing of the request, there is an additional processing delay
associated with the VM Host routing the result to the guest machine that
originated the request. In the case of a synchronous HPET timer request, this interception and routing overhead introduces some amount of “jitter” into the retrieved clock values that is not present when Windows is running in native mode.
% Processor Time counters
Windows schedules a periodic clock interrupt designed to
interrupt the machine 64 times per second. During servicing of this interrupt,
Windows maintains its System Time of Day clock. (The Windows System Time of Day clock is maintained in 100-nanosecond timer units, but the actual time of day values are only updated 64 times per second – approximately once every 15.6 milliseconds.) The clock interrupt
service routine also samples the execution state of the machine, and these
samples form the basis of the % Processor Time measurements that are maintained
at the processor, process, and thread levels in Windows.
VMware’s virtualization technology impacts any asynchronous timer request of this type. Timer interrupts are subject to additional delay because the original interrupt is serviced first by VMware Host software, and only sometime later is a virtualized device interrupt presented to the guest machine. Interrupt delay time is normally minimal if the virtual guest is currently dispatched. However, if the guest machine is blocked by the VMware Host scheduler (where it is accumulating Ready time) at the moment the periodic clock interrupt the Windows Scheduler relies upon is due to fire, the interrupt delay time will be significantly longer.
In extreme circumstances, it is possible for VMware guest machine scheduling delays to grow large enough to cause some periodic timer interrupts to be dropped entirely. In VMware terminology, these are known as lost ticks, another tell-tale sign of contention for physical processors. In extreme cases, where the backlog of timer interrupts exceeds 60 seconds, VMware attempts a radical re-synchronization of the time of day clock in the guest machine, zeros out its backlog of timer interrupts, and starts over.
Unfortunately, there is currently no way to measure the magnitude of these guest machine dispatcher delays directly. The VMware measurement that comes closest to characterizing the nature and extent of the delays associated with timely delivery of timer interrupts is CPU Ready milliseconds, which reports the amount of time a guest machine was ready to execute but was delayed in the ESX dispatcher queue. (VMware has a pretty good white paper on the subject available here.)
If the underlying ESX Host is not too busy, guest machines
are subject to very little CPU Ready delay. Under those circumstances, the
delays associated with virtualizing clock timer interrupts will introduce some
jitter into the Windows guest CPU utilization measurements, but these should
not be too serious. However, if the underlying ESX Host does become very
heavily utilized, clock timer interrupts are subject to major delays.
At the guest level, these delays impact both the rate of
execution state samples and the duration between samples. Consider a common
scenario where a dual socket VMware Host machine is configured to run 8 or more
Windows guest machines. Each Windows guest machine expects a periodic clock interrupt 64 times per second, which it uses to perform CPU accounting. If more guest machine vCPUs are defined than physical processors exist, then some of those clock interrupts are going to be subject to VMware dispatching delays.
The
busier the underlying VMware Host machine is servicing other guest machines, the more these measurements are distorted.
This all-too-common scenario is actually a nightmare for VMware – the net effect is multiple Windows guest machines continuously issuing and processing clock interrupts, even when the machines are idle or nearly idle.
You do have recourse to actual measurements of processor
utilization by guest machines from VMware, which are available inside the
Windows guest when the VMware tools are installed. However, be aware that the
VMware guest machine processor utilization counters are not measuring the same
thing as the internal Windows CPU accounting function. VMware measures the amount of time a guest machine was dispatched on a physical processor during the last interval. A Windows guest machine that appears to the VMware scheduler to be running might actually be executing its Idle loop on one or more of its defined vCPUs.
To summarize this discussion, remember that in a
virtualization environment, at the processor level the guest machine’s %
Processor Time counters are difficult to interpret. They are largely
meaningless, although when the underlying VMware Host machine is not very busy
with other guests, the internal measurements are probably not that far from
reality. You can substitute the guest machine processor time measurements that
the VMware Host reports, but they measure the amount of time the guest machine
was dispatched without regard for any periods of time when Windows guest
machine vCPUs were executing the Idle loop.
As discussed in the last blog entry, at the process and
thread level, the guest machine % Processor Time measurements can be readily
corrected so long as you have access to the VMware guest machine measurements,
so they continue to be quite useful.
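The kind of correction the earlier post describes amounts to rescaling the guest's internal measurements by the physical CPU time VMware reports. A minimal sketch of that idea follows; the function and the scaling formula here are illustrative assumptions of mine, not a documented VMware or Microsoft calculation:

```python
def correct_process_cpu(windows_process_pct, windows_total_pct, vmware_used_pct):
    """Rescale a Windows process' % Processor Time so that the guest's
    total agrees with the physical CPU time VMware reports for the guest.
    All inputs are percentages measured over the same interval."""
    if windows_total_pct == 0:
        return 0.0
    # Scale factor: physical CPU time actually granted vs. what Windows saw
    scale = vmware_used_pct / windows_total_pct
    return windows_process_pct * scale

# Windows reports the process at 40% and the guest total at 80%, but
# VMware says the guest was only dispatched 60% of the interval:
print(correct_process_cpu(40.0, 80.0, 60.0))  # 30.0
```

The key point is that the ratio between processes inside the guest is preserved; only the absolute scale is adjusted to match the VMware Host's view.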
Difference counters.
Windows performance counters that are difference counters are also impacted by virtualization of the
Windows time of day clock. Difference counters, which are mainly of type PERF_COUNTER_COUNTER,
are probably the most common type of Windows performance counters. They are
based on the simple counting of the number of events that occur. Examples
include Pages/sec, Disk transfer/sec, TCP segments/sec, and other similar
counters.
Difference counters are performance counters whose values are transformed by performance monitoring software and reported as events/second rates. The performance monitoring software calculates an interval delta by subtracting the counter’s value at the last measurement interval from its current value. This interval delta is then divided by the interval duration to create a rate, i.e., events/second. In making that events/second calculation, the numerator – the number of events that were observed during the interval – remains a valid count field. What is not reliable under VMware, however, is the interval duration, which is derived from two successive calls to virtualized timer services that may or may not be delayed significantly. There should be no doubt that the events that were
counted during the measurement interval actually occurred. What is suspect is
the calculation of the rate that
those events occurred that is performed by Perfmon and other performance
monitoring applications.
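The distortion is easy to see in the arithmetic itself. The following sketch uses hypothetical numbers to show how a stretched interval deflates the reported rate even though the event count remains valid:

```python
def rate_per_sec(prev_count, curr_count, interval_seconds):
    """The standard difference-counter transformation: interval delta
    divided by interval duration yields events per second."""
    return (curr_count - prev_count) / interval_seconds

events = 6400  # events counted during the interval -- this count is valid

# If the virtualized timer reports the true 100 s interval, the rate is right;
# if timer delays stretch the reported interval to 125 s, the rate is deflated:
print(rate_per_sec(0, events, 100.0))  # 64.0 events/sec
print(rate_per_sec(0, events, 125.0))  # 51.2 events/sec -- same events, wrong rate
```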
Instantaneous counters.
On the other hand, there are counter types that are unaffected
by virtualized clock and timer values. These are instantaneous counters, effectively a snapshot of a value such as
Memory\Available Bytes that is observed at a single point in time. (This sort
of counter is known in the official documentation as a PERF_COUNTER_RAWCOUNT.)
Since there is no interval timer or duration associated with production of this
type of counter, virtualized clock and timer values have no impact on the
validity of these measurements.
Disk performance counters that use the QueryPerformanceCounter API
Finally, there is a set of counters that use the QueryPerformanceCounter API to
measure duration at higher precision than the System Time of Day clock permits.
QueryPerformanceCounter is a Windows
API that wraps the hardware rdtsc
instruction. (There is an extensive discussion of this interface here.)
Under VMware, even timings based on the lightweight rdtsc instruction issued from guest machines are subject to
virtualization delays. The VMware Host OS traps all rdtsc instructions and returns virtualized timer values. Despite
the fact that an rdtsc instruction can normally be issued by a program executing at any protection level, such instructions are still trapped and virtualized in VMware. (In contrast, Microsoft’s Hyper-V
chooses not to trap rdtsc
instructions, so guest machines in that environment do have native access to
the hardware Read TimeStamp Counter instruction.)
Performance counters that utilize the QueryPerformanceCounter API are found mainly among the Logical and
Physical Disk counters. The System Time of Day clock, which advances 64 times
per second, provides too low a resolution for timing disk I/O operations that often complete within 10 milliseconds or less. In fact, the QueryPerformanceCounter API was originally introduced back in Windows 2000 to
improve the resolution of performance counters such as Logical Disk(*)\Avg. Disk sec/Transfer – and, thus, the unfortunate
name for an API that is actually the Windows high resolution timer function.
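A back-of-the-envelope sketch shows why the 64 Hz clock is too coarse for disk timing. The quantization function below is my own illustration of what a tick-granularity clock would report:

```python
import math

TICK_MS = 1000.0 / 64  # System Time of Day clock tick: 15.625 ms

def ticks_elapsed(start_ms, end_ms):
    """Duration as a 64 Hz clock would report it: only whole
    clock-tick boundaries crossed are visible."""
    return (math.floor(end_ms / TICK_MS) - math.floor(start_ms / TICK_MS)) * TICK_MS

# An 8 ms disk IO appears to take 0 ms or ~15.6 ms, depending purely on
# where it falls relative to the tick boundaries:
print(ticks_elapsed(1.0, 9.0))    # 0.0    -> IO looks instantaneous
print(ticks_elapsed(10.0, 18.0))  # 15.625 -> IO looks about 2x too long
```

A high-resolution timer like QueryPerformanceCounter avoids this quantization entirely, which is why the disk counters rely on it.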
In theory, VMware’s virtualization of the rdtsc instruction should not have a
major impact on the timer-based calculations used to produce counters like Avg. Disk sec/Transfer that measure disk
latency in Windows. To calculate disk latency in Windows, the disk device
driver issues an rdtsc instruction
at time t1 when the IO operation to disk is initiated. The driver
issues a second rdtsc at time t2
when the IO completes. The device driver software calculates the interval
delta, t2 – t1,
and accumulates the total elapsed time for the device in a DISK_PERFORMANCE
structure. Since the rdtsc timing
instructions are effectively issued in line by the guest machine and presumably
processed in line by VMware, the clock value returned by a virtualized rdtsc call should be very close to the
actual rdtsc value, plus some minor amount
of overhead associated with VMware trapping the instruction.
However, it is entirely possible for the entire guest
machine to be blocked from executing sometime between t1 and t2
while it has a disk IO pending. When the guest machine is re-dispatched, the
disk IO that was pending will finally complete, at which point the clock value
associated with that completion is acquired by the guest machine device driver.
Now, however, when the interval delta, t2
– t1, is calculated, the measurement of disk latency includes
the amount of time the guest machine was delayed in the VMware dispatcher ready
queue. Of course, from the standpoint of the application awaiting the IO operation to complete, the t2 – t1 calculation remains perfectly valid. It is no longer disk latency, per se, but it does accurately represent the delay that the application issuing the IO request perceives.
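The scenario can be sketched with hypothetical numbers, where the timestamps stand in for rdtsc values already converted to milliseconds:

```python
def measured_latency_ms(t1, device_service_ms, dispatch_delay_ms):
    """Latency as the guest device driver measures it: t2 - t1, where t2
    can only be captured after the guest machine is re-dispatched."""
    t2 = t1 + device_service_ms + dispatch_delay_ms
    return t2 - t1

# Guest runs uninterrupted: measured latency equals true device latency
print(measured_latency_ms(0.0, 5.0, 0.0))   # 5.0 ms

# Guest blocked for 20 ms between t1 and t2: Ready time is folded
# into the "disk latency" the driver accumulates
print(measured_latency_ms(0.0, 5.0, 20.0))  # 25.0 ms
```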
A final complication is that it is not possible to tell how often, during any measurement interval, the disk latency time the Windows disk device driver accumulates includes guest machine dispatching delays incurred while IO operations were pending.
As with the VMware Host measurements of guest machine physical processor utilization, VMware does provide direct measurements of disk latency from the point of view of the VMware Host. For each VMware guest machine, additional disk performance statistics are available that track bytes transferred and reads vs. writes. Measurements of disk latency at the guest machine level are not available from VMware, however.
Reconciling
the VMware Host measurements of overall disk latency with the Windows guest
machine measurements of disk performance at a logical or physical disk level is
problematic at best. Figure 1 is an attempt to reconcile these disparate data
sources under the best possible circumstances – the VMware Host was managing a
single guest machine.
Figure 1. Comparing the VMware Host and Windows guest machine measurements of disk latency.
In Figure 1, I plotted about 35 minutes worth of disk
performance data from the VMware Host (the area plotted in light orange) and
from the single Windows guest (the dotted line overlay) for the same one minute
measurement intervals. In an ideal world, this data should be quite similar, and
in some of the intervals, the disk latency measurements are quite similar. But,
in many other intervals, the measurements are quite different. It is unclear
how to reconcile these differences.
Overall, since VMware
is not able to report disk latency at the guest machine level, the disk
performance measurements available internally from the Windows guest machine remain
quite useful. You should attempt to reconcile the VMware disk measurements with
the guest machine internal measurements, which with regard to reads, writes and
bytes transferred should be comparable. Whenever a Windows guest machine
running on a busy VMware Host reports disk latency that is much worse than the
VMware Host disk latency measurements, you need to consider the possibility
that at least some of the difference is due to guest machine dispatcher delays and
not poor disk hardware performance.
[1] You may recall from an earlier blog entry that the
QueryPerformanceCounter function that is used in performance monitoring to
generate granular measurements of elapsed time uses the hardware rdtsc instruction in Windows 7, but
reverts to the HPET external timer on older machines when rdtsc cannot be trusted. For simplicity’s sake, I will ignore the
potential use of the HPET in the discussion here.