Skip to main content

Using xperf to analyze CSwitch events

Continuing the discussion from the previous blog entry on event-driven approaches to measuring CPU utilization in Windows...

Last time around I discussed the same CPU busy calculations that the Resource Manager in Windows 6 & 7 makes. This same calculation can also be performed after the fact using the event data from ETW. This is the technique used in the Windows Performance Toolkit (WPT, but which is better known around Microsoft as xperf), for example, to calculate CPU usage metrics.
Once you have downloaded and installed the Windows Performance Toolkit, you can launch a basic ETW collection session using the following xperf command:
xperf -on DiagEasy
Then, after you have accumulated enough data, issue another command to stop tracing and capture the event stream to a file:
xperf -d cputrace.etl
Next, process the cputrace.etl file using the xperfview app. After the trace file is loaded, xperfview provides visualizations that are very similar to ResMon. See Figure 6 for an example.
Figure 6. CPU utilization graphs in xperfview, based on ETW context switch event data gathered with the xperf utility. Additional ISR and DPC events are used to calculate the amount of time device drivers spend processing interrupts.

With xperfview, you can also request a Summary Table which displays CPU usage by process (and thread) by right-clicking on the CPU Usage graph and accessing the pop-up context menu. An example of the CPU Scheduling Aggregate Summary Table is illustrated in Figure 6. It is similar to the one ResMon produces. The data here was gathered while running a multi-threaded CPU soaker program called ThreadContentionGenerator while ResMon was also active. You can see that the calculation in Figure 7 roughly mirrors the results shown in Figure 5 for ResMon, allowing for some variation that is to be expected since the intervals themselves are not identical. The xperf interval shown in Figure 6 is approximately 500 seconds long, while ResMon maintains a rolling window that is only 60 seconds in duration. The ResMon screen shot was taken somewhere in the middle of the longer xperf tracing session.
Figure 7. The CPU Scheduling Aggregate Summary Table calculated by xperfview. The results of these calculations closely resembles the rolling one-minute calculation reported by the Resource Monitor in Figure 5.

For some perspective on the volume of trace events that can be generated, the binary .etl trace file produced in this example was approximately 325 MB on disk for a DiagEasy trace session that ran for more than ten minutes. Running with the xperf defaults, I received a notification when the trace session closed that three ETW 64K buffers of data were dropped during the trace because xperf was unable to keep pace with processing the event stream in real-time.
The Context Switch event also provides the old thread’s Wait Reason code, which helps you to understand why the sequence of thread scheduling events occurred. For reference, a Windows context switch is defined here, while the contents of the ETW (Event Tracing for Windows) context switch event record are defined here, including a list of the current thread WaitReason codes.
Note that you can measure CPU queue time accurately from the ETW events, an important indicator of processor contention when the processor becomes saturated. As illustrated in Figure 4, the transition from the Wait state to the Ready state is marked by a ReadyThread event record. The time between the ReadyThread event and a subsequent CSwitch event marking its transition to Running is one form of CPU Queue time. A second form of CPU queue time is the time between a CSwitch(…,out,WaitReason,…) where the WaitReason is either a Preempt or time-slice quantum expiration and a subsequent re-dispatch. Both forms of CPU queue time can be measured accurately using ETW.
When precision in understanding patterns of CPU consumption is required, post-processing the ETW context switching event stream is a much better way to proceed than attempting to use the Windows % Processor Time performance counters. Measuring CPU consumption from the context switch events is considerably more precise than, for example, the Windows performance counter data available in Perfmon that report processor utilization at the system and process level based on processor state sampling. Such high precision measurements are not always required, of course, and processing the ETW context switching event stream is relatively expensive due to the extremely high volume of trace data that you must deal with.

Figure 6 illustrates several of the CPU utilization graphs that xperfview creates from the context switch event stream. To help make the graph more readable, I filtered out Idle time calculation for all but two of the logical processors on this machine. (The machine illustrated has 8 logical CPUs.) To gain insight into what high priority Interrupt Service Routines (ISRs) and DPCs are running, ISR and DPC events should also be gathered, which the DiagEasy event profile in xperf does automatically. (Windows device driver developers are very interested in them. For an example, see this blog posting discussing using the xperf ETW utility to capture CPU consumption by the TCP/IP network driver stack.)


Popular posts from this blog

Inside the Windows Runtime, Part 2

As I mentioned in the previous post, run-time libraries in Windows provide services for applications running in User mode. For historical reasons, this run-time layer in Windows was always known as the Win32 libraries, even when these services are requested in the 64-bit OS in 32-bit mode. A good example of a Win32 run-time service is any operation that involves opening and accessing a file somewhere in the file system (or the network, or the cloud). A more involved example is the set of Win32 services an application needs to access to play an audio file, including understanding the specific audio file compressed format, and checking authorization and security.
For Windows 8, a portion of the existing Win32 services in Windows were ported to the ARM hardware platform.  The scope of the Win32 API is huge, and it was probably not feasible to convert all of it during the span of a single, time-constrained release cycle. Unfortunately, the fact that the new Windows 8 Runtime library encomp…

High Resolution Clocks and Timers for Performance Measurement in Windows.

Within the discipline of software performance engineering (SPE), application response time monitoring refers to the capability of instrumenting application requests, transactions and other vital interaction scenarios in order to measure their response times. There is no single, more important performance measurement than application response time, especially in the degree which the consistency and length of application response time events reflect the user experience and relate to customer satisfaction. All the esoteric measurements of hardware utilization that Perfmon revels in pale by comparison. Of course, performance engineers usually still want to be able to break down application response time into its component parts, one of which is CPU usage. Other than the Concurrency Visualizer that is packaged with the Visual Studio Profiler that was discussed in the previous post, there are few professional-grade, application response time monitoring and profiling tools that exploit the …

Why is my web app running slowly? -- Part 1.

This series of blog posts picks up on a topic I made mention of earlier, namely scalability models, where I wrote about how implicit models of application scalability often impact the kinds of performance tests that are devised to evaluate the performance of an application. As discussed in that earlier blog post, sometimes the influence of the underlying scalability model is subtle, often because the scalability model itself is implicit. In the context of performance testing, my experience is that it can be very useful to render the application’s performance and scalability model explicitly. At the very least, making your assumptions explicit opens them to scrutiny, allowing questions to be asked about their validity, for example.
The example I used in that earlier discussion was the scalability model implicit when employing stress test tools like HP LoadRunner and Soasta CloudTest against a web-based application. Load testing by successively increasing the arrival rate of customer r…