
Measuring Processor Utilization in Windows and Windows applications: Part 1

Introduction.
This blog entry discusses the legacy technique for measuring processor utilization in Windows, which is based on sampling, and compares and contrasts it with other sampling techniques. It also introduces newer, event-driven techniques for measuring processor utilization in Windows. The event-driven approaches are distinguished by far greater accuracy. They also entail significantly higher overhead, but measurements indicate this overhead is well within acceptable bounds on today’s high-powered server machines.

As of this writing, Windows continues to report measurements of processor utilization based on the legacy sampling technique. The more accurate event-driven measurements are gaining ground and can be expected to supplant the legacy measurements in the not too distant future.
While computer performance junkies like me positively salivate at the prospect of obtaining more reliable and more precise processor busy metrics, the event-driven measurements do leave several very important issues in measuring CPU utilization unresolved. These include validity and reliability issues that arise when Windows is running as a guest virtual machine under VMware, Xen, or Hyper-V, which impact the accuracy of most timer-based measurements. (As an aside, mitigation techniques for avoiding some of the worst measurement anomalies associated with virtualization are discussed.)
A final topic concerns characteristics of current Intel-compatible processors that undermine the rationale for using measurements of CPU busy that are based solely on thread execution. This section outlines the appeal of using internal hardware measurements of the processor’s instruction execution rate in addition to or instead of wall-clock timing of thread execution state. While I do try to make the case for using internal hardware measurements of the processor’s instruction execution rate to augment more conventional measures of CPU busy, I will also outline some of the current difficulties that advocates of this approach encounter when they attempt to put it into practice today.

Sampling processor utilization.
The methodology used to calculate processor utilization in Windows was originally designed 20 years ago for Windows NT. Since the original design goal of Windows NT was to be hardware independent, the measurement methodology was also designed so that it was not dependent on any specific set of processor hardware measurement features.
The familiar % Processor Time counters in Perfmon are measurements derived using a sampling technique. The OS Scheduler samples the execution state of the processor once per system clock tick, driven by a high priority timer-based interrupt. Currently, the clock tick interval the OS Scheduler uses is usually 15.6 ms, roughly 64 clock interrupts per second. (The precise value that the OS uses between timer interrupts is available by calling the GetSystemTimeAdjustment() function.) If the processor is running the Idle loop when the quantum interrupt occurs, it is recorded as an Idle Time sample. If the processor is running some application thread, that is recorded as a busy sample. Busy samples are accumulated continuously at both the thread and process level.
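The accounting described above can be sketched in a few lines of Python. This is a hypothetical simulation, not actual OS code: it samples a made-up workload (busy 40 ms out of every 100 ms) once per 15.6 ms clock tick and derives % Processor Time as the ratio of busy samples to total samples.

```python
# Hypothetical sketch of sampling-based CPU accounting (not actual OS code).
# The simulated workload alternates 40 ms busy / 60 ms idle, i.e. a true
# utilization of 40%; the scheduler samples the state once per clock tick.

TICK_MS = 15.6          # OS Scheduler clock tick interval
PERIOD_MS = 100.0       # the workload repeats every 100 ms
BUSY_MS = 40.0          # busy for the first 40 ms of each period

def is_busy(t_ms):
    """Execution state of the (simulated) processor at instant t_ms."""
    return (t_ms % PERIOD_MS) < BUSY_MS

def sampled_utilization(interval_ms):
    """% Processor Time as the Scheduler would account it: one execution
    state sample per clock tick, busy samples divided by total samples."""
    busy = total = 0
    t = 0.0
    while t < interval_ms:
        total += 1
        if is_busy(t):
            busy += 1
        t += TICK_MS
    return 100.0 * busy / total

# Over a one-minute interval (~3,800 samples) the estimate converges
# close to the true 40% utilization.
print(sampled_utilization(60_000))
```

Over short intervals the same function returns wildly varying estimates, which is the sample-size problem discussed below.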
Figure 1 illustrates the calculation of CPU time based on this sampling of the processor execution state as reported in the Performance tab of the Windows Task Manager.
Figure 1. The Performance tab of the Windows Task Manager reports processor utilization based on sampling the processor execution state once every quantum, approximately 64 times per second.
When a clock interrupt occurs, the Scheduler performs a number of other tasks, including adjusting the dispatching priority of threads that are currently executing with the intention of stopping the progress of any thread that has exceeded its time slice. Using the same high priority OS Scheduler clock interrupt that is used for CPU accounting to implement processor time-slicing is the reason the interval between Scheduler interrupts is known as the quantum. At one time in Windows NT, the quantum was set based on the speed of the processor: the faster the processor, the shorter the quantum, and the more frequently the OS Scheduler would gain control. Today, however, the quantum value is constant across all processors.
Another measurement function that is performed by the OS Scheduler’s clock interrupt is to take a sample of the length of the processor queue consisting of Ready, but waiting threads. The System\Processor Queue Length counter in Perfmon is an instantaneous counter that reflects the last measurement taken by the OS Scheduler’s clock interrupt of the current number of Ready threads waiting in the OS Scheduler queue. Thus, the System\Processor Queue Length counter represents one sampled observation, and needs to be interpreted with that in mind.
The processor Queue Length metric is sometimes subject to anomalies due to the kind of phased behavior you sometimes see on an otherwise idle system. Even on a mostly idle system, a sizable number of threads can be waiting on the same clock interrupt (typically, polling the system state once per second), one of which also happens to be the Perfmon measurement thread, also cycling once per second. These sleeping threads tend to clump together so that they get woken up at the exact same time by the timer interrupt. (As I mentioned, this happens mainly when the machine is idling with little or no real work to do.) These awakened threads then flood the OS dispatching queue at exactly the same time. If one of these threads is the Perfmon measurement thread that gathers the Processor Queue Length measurement, you can see how this “clumping” behavior could distort the measurements. The Perfmon measurement thread executes at an elevated priority level of 15, so it is scheduled for execution ahead of any other User mode threads that were also awakened by the same Scheduler clock tick. The effect is that at the precise time when the Processor ready queue length is measured, there are likely to be a fair number of Ready threads. Compared to the modeling assumption where processor scheduling is subject to random arrivals, one observes a disproportionate number of Ready threads waiting for service, even (or especially) when the processor itself is not very busy overall.
This anomaly is best characterized as a low-utilization effect that perturbs the measurement when the machine is loafing. It generally ceases to be an issue when processor utilization climbs or there are more available processors on the machine. But this bunching of timer-based interrupts remains a serious concern, for instance, whenever Windows is running as a guest virtual machine under VMware or Hyper-V. Another interesting side discussion is how this clumping of timer-based interrupts interacts with power management, but I do not intend to venture further into that subject here.
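A toy simulation can make the clumping effect concrete. This is purely illustrative, with made-up parameters: ten background threads each wake once per second and run for about 1 ms on a single processor served FIFO. When their wake times are spread randomly across the second, a sample taken at any instant rarely catches anyone waiting; when they all wake on the same clock tick, a sample taken just after that tick sees nearly all of them queued.

```python
# Illustrative simulation of the queue-length sampling anomaly
# (hypothetical parameters, not a model of the actual OS Scheduler).
import random

random.seed(42)

N_THREADS = 10        # background threads, each polling once per second
SERVICE_MS = 1.0      # each runs ~1 ms when it wakes
SECOND_MS = 1000.0

def queue_at(t, wake_times):
    """Number of threads ready-but-waiting at instant t, assuming a
    single processor serves the awakened threads FIFO."""
    ready = sorted(w for w in wake_times if w <= t)
    waiting = 0
    finish = 0.0
    for w in ready:
        start = max(w, finish)
        finish = start + SERVICE_MS
        if start > t:          # woken, but still queued at instant t
            waiting += 1
    return waiting

# Wake times spread randomly across the second vs. every thread waking
# on the same clock tick at t=0.
spread = [random.uniform(0, SECOND_MS) for _ in range(N_THREADS)]
clumped = [0.0] * N_THREADS

# Sample the queue just after the tick, as the Perfmon thread would.
print(queue_at(0.5, spread), queue_at(0.5, clumped))
```

With the spread arrivals the sampled queue is essentially empty; with the clumped arrivals the same sampling instant reports a long Ready queue, even though the processor is idle 99% of the time in both cases.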
To summarize, the CPU utilization measurements at the system, process and thread level in Windows are based on a sampling methodology. Similarly, the processor queue length is also sampled. Like any sampling approach, the data gathered is subject to typical sampling errors, including
  • accumulating a sufficient number of sample observations to be able to make a reliable statistical inference about the underlying population, and
  • ensuring that there aren’t systemic sources of sampling error that cause sub-classes of the underlying population to be markedly under- or over-sampled.
So, these CPU measurements face familiar issues with sample size and the potential for systematic sampling bias, as well as the usual difficulty in ensuring that the sample data is representative of the underlying population (something known as non-sampling error). For example, the interpretation of the CPU utilization data that Perfmon gathers at the process and thread level is subject to limitations based on a small sample size for collection intervals less than, for example, 15 seconds. At one minute intervals, there are enough samples to expect accuracy within 1-2%, a reasonable trade-off of precision against overhead. Over even longer measurement intervals, say 5 or 10 minutes, the current sampling approach leads to minuscule sampling errors, except in anomalous cases where there is systematic under-sampling of the processor’s execution state.
Sample size is also the reason that Microsoft does not currently permit Perfmon to gather data at intervals more frequent than once per second. Running performance data collection at intervals of 0.1 seconds, for example, the impact of relying on a very small number of processor execution state samples is quite evident. At 0.1 second intervals, it is possible to accumulate just 5 or 6 samples per interval. If you are running a micro-benchmark and want to access the Thread\% Processor Time counters from Perfmon over 0.1 second intervals, you are looking for trouble. Under these circumstances, the % Processor Time measurements cease to resemble a continuous function over time.
The limitations of the current approach to measuring CPU busy in Windows and the need for more precise measurements of CPU utilization are recognized in many quarters across the Windows development organization at Microsoft. More on how this area is evolving in the next post in this series.

Comments

  1. from a Reader:

    You say that the Windows OS scheduler performs thread scheduling every clock tick, which is usually 15.6 ms. (Does this hold for Windows 7 and Windows Server 2008?)

    Does that mean if a thread gets pre-empted it can never be scheduled again until 15.6 ms later? Or is this just setting thread priorities and another OS process is pre-empting and running threads using a much finer granularity?

  2. Yes, this holds in both Windows 7 and Windows Server 2008 R2. Windows Server 2008 R2 provides a much longer time slice (a thread can consume more "quantums" of CPU time before being subject to preemption).

    The 15.6 ms clock tick determines when the OS checks for a thread that has exceeded its time slice allotment & blocks it. This gives another thread of equal or lower priority a chance to run.

    The preempted thread will execute at the next context switch when it is the highest thread in the Ready Q.

    Also, at the quantum tick, the OS will reduce (or age) the priority of a running thread whose priority was boosted previously (due to interrupt processing, normally). Aging makes it less likely that the preempted thread will execute again soon. But if there are no other Ready threads, it will start executing again at the next context switch.

    This behavior is visible when you analyze CSwitch events, most easily accomplished using the Concurrency Visualizer option of the VS Profiler. (Which I blogged about here: http://performancebydesign.blogspot.com/2011/11/measuring-thread-execution-state-using.html.)

