Measuring Processor Utilization in Windows and Windows applications: Part 2

An event-driven approach to measuring processor execution state.

The limitations of the legacy approach to measuring CPU busy in Windows and the need for more precise measurements of CPU utilization are recognized in many quarters across the Windows development organization at Microsoft. The legacy sampling approach is doubtless very efficient, and this measurement facility was deeply embedded in the OS kernel’s Scheduler facility, a chunk of code that is very risky to tamper with. But more efficient power management, something that is crucial for battery-powered Windows devices, strongly argues for an event-driven alternative. You do not want the OS to wake up from a low power state at regular intervals on an idle machine just to perform its CPU usage accounting duties.

A straightforward alternative to periodically sampling the processor execution state is to measure the time spent in each processor state directly. This is accomplished by instrumenting the state transitions themselves. Processor state transitions in Windows are known as context switches. A context switch occurs in Windows whenever the processor switches the processor execution context to run a different thread. Processor state transitions also occur as a result of high priority Interrupt Service Routines (ISRs) gaining control following a device interrupt, as well as the Deferred Procedure Calls (DPCs) that ISRs schedule to complete the interrupt processing. By recording the time that each context switch occurs, it is possible to construct a complete and accurate picture of CPU consumption.
(See a two-part article in MSDN Magazine, entitled “Core OS Events in Windows 7,” written by Insung Park and Alex Bendetov and published beginning in September 2009. The authors are, respectively, the architect and lead developer of the ETW infrastructure. The article provides a conceptual overview describing how to use the various OS kernel events to reconstruct a state machine for processor execution, along with other diagnostic scenarios. Park and Bendetov report, “In state machine construction, combining Context Switch, DPC and ISR events enables a very accurate accounting of CPU utilization.”)

It helps to have a good, general understanding of thread scheduling in the OS in order to interpret this stream of events. Figure 3 is a diagram depicting the state machine associated with thread execution. At any point in time, a thread can be in only one of the three states indicated: Waiting, Ready, or Running. The state transition diagram shows the changes in execution state that can occur. A Waiting thread is usually waiting for some event to occur, perhaps a Wait timer to expire, an IO operation to complete, a mouse or keyboard click that signals user interaction with the application, or a synchronization event from another thread that indicates it is OK to continue processing.

A thread that is Ready to run is placed in the Dispatcher’s Ready Queue, which is ordered by priority. When a processor becomes available, the OS Scheduler selects the highest priority thread on the Ready Queue and schedules it for execution on that processor. Once it is running, a thread remains in the Running state until it completes its execution cycle and transitions back to the Wait state. An executing thread can also be interrupted because a higher priority execution unit needs to run (this is known as preemptive scheduling) or it is interrupted by the OS Scheduler because its time-slice has expired. A Running thread can also be delayed because of a page fault, accessing data or an instruction in virtual memory that is not currently resident in physical memory. These thread execution time delays are often referred to as involuntary waits.
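The priority-ordered Ready Queue described above can be modeled in a few lines. This is a minimal sketch, not the Windows dispatcher: the `ReadyQueue` class and its method names are hypothetical, and it assumes that threads of equal priority are dispatched in arrival order.

```python
import heapq
import itertools


class ReadyQueue:
    """Toy model of a priority-ordered ready queue (hypothetical, for illustration)."""

    def __init__(self):
        self._heap = []
        # Arrival counter: breaks ties so equal-priority threads run FIFO (an assumption).
        self._seq = itertools.count()

    def make_ready(self, tid, priority):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), tid))

    def dispatch(self):
        """Return the thread ID to run next, or None when nothing is Ready.

        None corresponds to the case where the Idle thread would be dispatched.
        """
        if not self._heap:
            return None
        _, _, tid = heapq.heappop(self._heap)
        return tid


rq = ReadyQueue()
rq.make_ready(tid=101, priority=8)
rq.make_ready(tid=102, priority=15)
rq.make_ready(tid=103, priority=8)
order = [rq.dispatch() for _ in range(4)]
print(order)  # [102, 101, 103, None]
```

The highest priority thread (102) is selected first, and the two priority 8 threads follow in the order they became Ready.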

Figure 3. A state machine for thread execution.

Figure 4 associates these thread execution state transitions with the ETW events that record when these transitions occur. The most important of these is the CSwitch event record that is written on every processor context switch. The CSwitch event record indicates the thread ID of the thread that is entering the Running state (the new thread ID), the thread ID of the thread that was displaced (the old thread ID), and the Wait Reason code associated with an old thread that is transitioning from Running back to the Wait state. The processor number indicating which logical CPU has undergone this state change is provided in an ETW_BUFFER_CONTEXT structure associated with the ETW standard record header. In addition, it is necessary to know that Thread 0 from Process 0 indicates the Idle thread, which is dispatched on a processor whenever there are no Ready threads waiting for execution. While a thread other than the Idle thread is “active,” the CPU is considered busy.

Conceptually, a context switch event is something like a processor state switch(oldThreadId, newThreadId), with a time stamp identifying when the context switch occurred. The CPU time of a thread is precisely the amount of time it spends in the Running state. It can be measured using the CSwitch events that show the thread transitioning from Ready to the Running state and the CSwitch events that show that thread transitioning back from the Running state to Waiting. To calculate processor busy, you sum the amount of time each processor spends running the Idle thread over the measurement interval, express that as a percentage of the interval, and subtract it from 100%.
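The calculation can be sketched as follows. This is an illustration under stated assumptions, not the code ResMon or xperf uses: the `(timestamp, cpu, old_tid, new_tid)` tuple layout and the `cpu_accounting` function are hypothetical, the events are assumed to be complete and sorted by time stamp, and the old thread in the first event seen on each processor is assumed to have been running since the start of the interval. ISR and DPC time is not broken out.

```python
from collections import defaultdict

IDLE_TID = 0  # Thread 0 of Process 0 is the Idle thread


def cpu_accounting(events, start, end):
    """Return (running time per thread, busy fraction per CPU) over [start, end].

    events: iterable of (timestamp, cpu, old_tid, new_tid), sorted by timestamp.
    This tuple layout is a simplification of the real CSwitch record.
    """
    running = {}                      # cpu -> (tid, time it was switched in)
    thread_time = defaultdict(float)  # tid -> time spent in the Running state
    idle_time = defaultdict(float)    # cpu -> time spent running the Idle thread

    def charge(cpu, tid, since, until):
        elapsed = until - since
        if tid == IDLE_TID:
            idle_time[cpu] += elapsed
        else:
            thread_time[tid] += elapsed

    for ts, cpu, old_tid, new_tid in events:
        if cpu in running:
            tid, since = running[cpu]
            charge(cpu, tid, since, ts)
        else:
            # First event on this CPU: assume old_tid ran from the interval start.
            charge(cpu, old_tid, start, ts)
        running[cpu] = (new_tid, ts)

    # Close out whatever is still running when the interval ends.
    for cpu, (tid, since) in running.items():
        charge(cpu, tid, since, end)

    interval = end - start
    busy = {cpu: 1.0 - idle_time[cpu] / interval for cpu in running}
    return dict(thread_time), busy


# One CPU, an 8-unit interval: idle 0-2, thread 7 runs 2-5, idle 5-6, thread 9 runs 6-8.
events = [(2, 0, 0, 7), (5, 0, 7, 0), (6, 0, 0, 9)]
thread_time, busy = cpu_accounting(events, start=0, end=8)
print(thread_time)  # {7: 3.0, 9: 2.0}
print(busy)         # {0: 0.625}
```

The Idle thread accounts for 3 of the 8 time units, so the processor is 62.5% busy, which matches the 5 units charged to threads 7 and 9.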

Figure 4. The state transition diagram for thread execution, indicating the ETW trace events that mark thread state transitions.

One complication in this approach is that the ETW infrastructure does not guarantee delivery of every event to a Listener application. If the Listener application cannot keep up with the stream of events, then ETW will drop memory-resident buffers filled with events rather than queue them for delivery later. CSwitch events can occur at very high rates. Rates of 20,000-40,000 context switches per second per CPU are not unusual on busy machines, so there is definitely potential to miss enough of the context switch events to bias the calculations that result. In practice, handling the events efficiently in the Listener application and making appropriate adjustments to the ETW record buffering options can minimize the potential for missing events.

To see this event-driven processor execution state measurement facility at work, access the Resource Monitor application (resmon.exe) that is available beginning in Vista and Windows Server 2008. Resource Monitor can be launched directly from the command line, or from either the Performance Monitor plug-in or the Task Manager Performance tab. Figure 5 displays a screen shot that shows Resource Monitor in action on a Windows 7 machine, calculating CPU utilization over the last 60 seconds of operation, breaking out that utilization by process. The CPU utilization measurements that ResMon calculates are based on the context switch events. These measurements are very accurate, about as good as it gets from a vantage point inside the OS.

Figure 5. The Windows 7 Resource Monitor application.

The Resource Monitor measures CPU busy in real time by listening to the ETW event stream that generates an event every time a context switch occurs. It also produces similar reports from memory, disk, and network events.

To summarize these developments, this trace-driven measurement source positions the Windows OS so it could replace its legacy CPU measurement facility with something more reliable and accurate sometime in the near future. Unfortunately, converting all existing features in Windows, including Perfmon and Task Manager, to support the new measurements is a big job, not without its complications and not always as straightforward as one would hope. But we can look forward to future versions of the Windows OS where an accurate, event-driven approach to measuring processor utilization supplants the legacy sampling approach that Task Manager and Perfmon rely on today.

In the next blog entry in this series, I will show a quick example using xperf to calculate the same CPU utilization metrics from the ETW event stream. I will point xperfview at an .etl file gathered during the same measurement interval as the one illustrated in Figure 5 using ResMon.

Comments

Ephi, December 23, 2011 at 9:16 AM
"A Running thread can also be delayed because of a page fault, accessing data or an instruction in virtual memory that is not currently resident in physical memory."

Is there any literature on how to minimize these sorts of delays (specifically loading in pages for instructions)? Is it possible to lock all the code (the running exe and any dlls that are needed) into physical memory so it never swaps out?
Mark B. Friedman, December 23, 2011 at 9:56 AM
The short answer is, "No."

Windows strongly discourages a process locking down its own pages, and there is no API call to do that. You might say it is an approach that flourishes when there is ample physical memory.

There is a working set protection API, however, but it requires that your program understand its memory usage profile very clearly to use effectively.

Windows takes a more cooperative approach. Processes, for example, can listen for a Low Memory event, then shed less important pages when that signal is received. The memory management functions of the .NET Framework manage process local memory in that fashion, so you don't have to worry about any of that inside your code. (There is a range of other memory management considerations, though, in .NET, centered on automatic garbage collection.)

In a demand paging environment, many page faults are simply unavoidable. The 1st time your program references an address on a page, it generates a page fault.

Also, keep in mind that Windows has a "soft" page fault mechanism which resolves recently trimmed pages directly from RAM; the Concurrency Visualizer will show those blocking thread execution very quickly.

An analysis of Page Fault ETW events can show you when and where page faults are impacting your app. I recommend using the PF option of the VSPerfCmd (the VS Profiler command line interface: see http://msdn.microsoft.com/en-us/library/dd255428.aspx) if you need to get a read on where page faults are occurring in your app. Many page faults, particularly at start up, are simply unavoidable though, as I mentioned earlier.
Ephi, December 23, 2011 at 10:17 AM
I am not worried about startup. But I suspect I am getting hard page faults to load code that is used infrequently. On average my server processes events in the microsecond range. There are outliers (say 99.9 percentile) that experience delays of tens to hundreds of milliseconds. I can only explain this behavior by 1) page faults or 2) preemption by some other process that is not executing in a timely manner.
I'm not sure how to capture this behavior with xperf or concurrency analyzer because the amount of information it generates is too large (my server is processing 10,000 events/second when I see this behavior).

Regarding 2) I am planning to test the resource manager tool that comes with Server 2008. I believe that tool allows you to reserve cores for your processes only.
Mark B. Friedman, December 23, 2011 at 10:49 AM
The volume of data these ETW traces produce definitely makes analysis of long running tasks challenging.

Preemption by an ISR/DPC should not lead to delays of that magnitude. Preemption by a higher priority thread would, however.

For longer running tasks, you have to use a different approach, relying more on Windows perf counters for a "longer" view.

A third possible source of delay of that magnitude is lock contention.

If memory serves, the Reskit Tool lets you set processor affinity for ISRs and DPCs. Inside your app, you have access to the same set of Processor Affinity options (see the SetProcessAffinityMask API).
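For readers unfamiliar with affinity masks: the mask passed to SetProcessAffinityMask is a bit vector in which bit n stands for logical processor n. The helpers below are a hypothetical sketch that only builds and decodes such a mask. They make no Windows call, and they assume a single processor group (at most 64 logical processors).

```python
def affinity_mask(cpus):
    """Build an affinity bit mask from logical processor numbers (hypothetical helper)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            # Assumption: a single processor group, so valid bits are 0..63.
            raise ValueError(f"logical processor {cpu} is out of range")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask):
    """Decode a mask back into the sorted list of logical processor numbers."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


mask = affinity_mask([2, 3])
print(hex(mask))           # 0xc
print(cpus_in_mask(mask))  # [2, 3]
```

A mask of 0xC would restrict a process to logical processors 2 and 3, leaving processors 0 and 1 for other work.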
Ephi, December 23, 2011 at 11:17 AM
Yes, a while back a member of my team did find a case of a high priority thread causing delays on our servers. He created a simple test program and I don't remember if he used some of the sysinternals tools (ProcMon) or xperf. There was a Microsoft service (I think msi) process that was behaving badly which was subsequently fixed with a hotfix and service pack.

This brings up another point. What are the downsides of setting the priority of my threads to a high level (not realtime - just high) with SetPriorityClass() and SetThreadPriority()? I'm assuming it would hide the problem in the msi process and allow my threads to continue without excessive delay.

The tool I am referring to is a bit more robust:
Windows Server 2008 includes an optionally installable component called Windows System Resource Manager. It permits the administrator to configure policies that specify CPU utilization, affinity settings, and memory limits for processes. Policies can be applied for specific applications, users or groups.

The ResKit tool to set affinities for ISRs and DPCs would complement this.
Mark B. Friedman, December 23, 2011 at 12:53 PM
My bad.

I did not realize that was where the WSRM is located for WS 2008.

Yes, WSRM has some useful capabilities for setting system-wide priorities. I have used the capabilities of WSRM to establish a minimum CPU resource allotment for a preferred workload effectively. When other workloads start to eat into that "quota," WSRM starts reducing their thread priorities, which tends to elevate the dispatching priority of the preferred workload & improve responsiveness.

With regard to memory priorities, I find the WSRM controls to be rather limited due to the limited amount of discretion the OS gives to a process in general. If you need specific guidance on using either the memory protection API in Windows or would like to discuss memory management further, please let's take that discussion offline, as it is getting a little far afield from the original blog entry.

Regarding setting thread CPU priority, in general, try to restrict threads running at the "Time Critical" setting to ones that process in shorter bursts, if possible. Allow longer running compute-bound threads to execute at Normal priority levels, but ensure there is adequate CPU capacity so they can still execute on a timely basis. This is a simple architectural approach that tends to reduce CPU queuing delays overall.

Again, if there is adequate CPU capacity, processor delays are going to be minimized, so you don't always have to micro-manage too many thread priorities yourself. (The OS does a pretty good job of juggling thread priorities dynamically in most cases.) At some point, you are apt to see diminishing returns from that degree of application tuning, in my experience. I'd be looking at ways to decrease my code path in tandem with setting up my thread priority scheme -- and also using WSRM to ensure other apps aren't eating into my critical resource allotment.

Of course, you have to measure throughout, assessing any tuning change to determine what benefit accrues and to what extent it increases the responsiveness or throughput of your app. This is difficult, detailed & time-consuming work, no matter how you look at it.

But I assume I am preaching to the choir here. Good luck and please share any of your experiences and learning going forward. Thanks.