
Measuring thread execution state using trace events.

Continuing the discussion from the previous blog entry on event-driven approaches to measuring CPU utilization in Windows...

Besides measuring processor utilization at the system level, the stream of context switch events can also be used to drill into CPU consumption at the process and thread level. An excellent example of this approach is the Visual Studio Profiler’s Concurrency Visualizer, available in Visual Studio 2010. (For reference, see “Performance Tuning with the Concurrency Visualizer in Visual Studio 2010,” an MSDN Magazine article written by the tool’s principal architect, Hazim Shafi.) The Concurrency Visualizer gathers Context Switch events to calculate processor utilization for the application being profiled.
The VS Concurrency Visualizer creates a system-level CPU Utilization View with an interesting twist – the view pivots on the application you are profiling, a perspective that matches that of a software performance engineer engaged in a performance investigation. Based on the sequence of context switch trace events, the Concurrency Visualizer calculates processor utilization by process, aggregates it for the current selection window, and displays it in the CPU Utilization View. In the CPU Utilization View, processor activity for all processes other than the one being profiled is lumped together under a category called “Other Processes.” System processes and the “Idle process,” which is a bookkeeping mechanism rather than an actual process that is dispatched, are broken out separately. See Dr. Shafi’s article for more details. (For reference, Figure 12 below illustrates the CPU Utilization View.)
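To make the bookkeeping concrete, here is a minimal sketch, in C#, of how a stream of context switch records could be folded into per-process CPU busy time. This is not the Concurrency Visualizer’s actual implementation, and the CSwitch record and its field names are hypothetical stand-ins for the real ETW payload.

using System.Collections.Generic;

// Hypothetical, simplified context switch record; the field names are illustrative
// stand-ins, not the actual ETW CSwitch payload schema.
record CSwitch(double Timestamp, int Cpu, int OldThreadId, int NewThreadId, int NewProcessId);

static class CpuAccounting
{
    public static Dictionary<int, double> CpuTimeByProcess(IEnumerable<CSwitch> orderedEvents)
    {
        // Per logical CPU: which process was last switched in, and when.
        var running = new Dictionary<int, (int Pid, double Since)>();
        var busy = new Dictionary<int, double>();        // processId -> accumulated CPU seconds

        foreach (var e in orderedEvents)                 // assumed ordered by Timestamp
        {
            if (running.TryGetValue(e.Cpu, out var prev))
            {
                // The interval from the previous switch-in to this switch is CPU time
                // charged to the process that was running on this CPU.
                busy[prev.Pid] = busy.GetValueOrDefault(prev.Pid) + (e.Timestamp - prev.Since);
            }
            running[e.Cpu] = (e.NewProcessId, e.Timestamp);
        }
        return busy;   // divide by the window length for utilization; the Idle "process" can be filtered out
    }
}

Dividing each process’s accumulated busy time by the length of the selection window yields per-process utilization figures of the kind the CPU Utilization View charts.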
The Concurrency Visualizer’s primary focus is on reconstructing the sequence of events that affect an application’s execution progress. The Concurrency Visualizer’s Threads View is the main display showing an application’s execution path. The view here is of execution progress on a thread-by-thread basis. For each thread in your application, the Concurrency Visualizer shows the precise sequence of context switch events that occurred. These OS Scheduler events reflect that thread’s execution state over time. See Figure 8 for an example of this view.

 Figure 8. Screen shot of the Concurrency Visualizer illustrating thread preemption by a higher priority system routine.
Figure 8 shows the execution path of six application threads: a Main thread, a generic worker thread, and four CLR (the Common Language Runtime for .NET languages) worker threads that the application created using the .NET ThreadPool. (There were originally more threads than this, but I chose to hide those that were inactive over the entire run.) For each thread, the execution state of the thread – whether it is running or whether it is blocked – is indicated over time.
The upper half of the display is a timeline that shows the execution state of each thread over time. Each thread’s execution progress is drawn horizontally, from left to right, as a series of rectangles that mark the start and end of a particular thread state. An interval when the thread was running is shown in green. An interval when the thread was sleeping is shown in blue. A ready thread that is blocked from executing because a higher priority thread is running is shown in yellow. (This state is labeled “preemption.”) A thread waiting on a lock in a synchronization delay is shown in red.
On the lower left of the display is a Visible Timeline Profile. This summarizes the state of all threads that are visible within the selected time window. In the screen shot in Figure 8, I have zoomed into a time window that is approximately 150 milliseconds wide. During that interval, the threads shown were actively executing instructions only 11% of the time. For 25% of the time interval, threads were blocked waiting on a lock. Finally, there is a tabbed display at the lower right. If you click on the “Profile Report” tab, a histogram is displayed that summarizes the execution state of each individual thread over the time window. In the screen shot, I have clicked on the “Current stack” tab, which displays the call stack associated with the ETW context switch event. If the thread is blocked, the call stack indicates where in the code the thread will resume execution once it unblocks. We will drill into that call stack in a moment.
Note: The Threads View also displays call stacks from processor utilization samples that ETW gathers on a system-wide basis once per millisecond. Call-stack samples are visible during any period when the thread is executing instructions (and ETW execution sampling is active). One of the ETW OS kernel events that the Concurrency Visualizer does not analyze is the ReadyThread event. The interval between a ReadyThread event and the subsequent Context Switch event that signals the ready thread is being dispatched measures CPU queue time delay directly. Using the event data, it is possible to measure CPU queuing delays precisely. Analysis of the ETW kernel event stream far exceeds anything that can be done with Windows performance counters to estimate the impact of CPU queuing delays.
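As a rough sketch of that calculation (again using hypothetical record types rather than the real ETW schema, and reusing the CSwitch record from the earlier sketch), CPU queue time per thread can be accumulated by pairing each ReadyThread event with the next context switch that dispatches that thread:

using System.Collections.Generic;

record ReadyThread(double Timestamp, int ThreadId);      // hypothetical, simplified event shape

static class QueueTimeAccounting
{
    public static Dictionary<int, double> QueueTimeByThread(IEnumerable<object> orderedEvents)
    {
        var readySince = new Dictionary<int, double>();  // threadId -> time it entered the ready queue
        var queueTime  = new Dictionary<int, double>();  // threadId -> accumulated ready-queue delay

        foreach (var e in orderedEvents)                 // assumed ordered by timestamp
        {
            if (e is ReadyThread r)
            {
                readySince[r.ThreadId] = r.Timestamp;    // the thread became ready to run
            }
            else if (e is CSwitch c && readySince.TryGetValue(c.NewThreadId, out var became))
            {
                // Delay between becoming ready and actually being dispatched on a CPU.
                queueTime[c.NewThreadId] = queueTime.GetValueOrDefault(c.NewThreadId)
                                           + (c.Timestamp - became);
                readySince.Remove(c.NewThreadId);
            }
        }
        return queueTime;
    }
}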
The Concurrency Visualizer screen shot in Figure 8 illustrates the calculation of a running thread’s CPU queuing delay. Thread 6920, which happens to be a CLR thread pool worker thread, is shown at a point in time where it was preempted by a higher priority task. The specific delay that I zoomed in on in the screen shot is preemption due to the scheduling of a high priority LPC or ISR – note that this category in the Concurrency Visualizer also encompasses assorted APCs and DPCs. In this specific example, execution of Thread 6920 was delayed for 0.7718 milliseconds. According to the trace, that is the amount of time between Thread 6920 being preempted by a high priority system routine and the subsequent context switch when the ready thread was re-dispatched.
The tool also displays the call stack of the preempted thread. The call stack indicates that the CLR’s garbage collector (GC) was running at the time that thread execution was preempted. From the call stack, it looks like the GC is sweeping the Large Object Heap (LOH), trying to free up some previously allocated virtual memory. This is not an opportune time to get preempted. You can see that one of the other CLR worker threads, Thread 6420, is also delayed. Notice from the color coding that Thread 6420 is delayed waiting on a lock. Presumably, one of the other active CLR worker threads in the parent process holds the lock that Thread 6420 is waiting for.
This is one of those “Aha” moments. If you click on the synchronization delay that Thread 6420 is experiencing, as illustrated in Figure 9, you can see that the lock that Thread 6420 is trying to acquire is, in fact, currently held by Thread 6920, the one that was preempted somewhere in the midst of running garbage collection. Clicking on the tab that says “Current Stack” (not illustrated) indicates that the duration of the synchronization delay that Thread 6420 suffered in this specific instance of lock contention was about 250 milliseconds.
The scenario here shows one CLR worker thread blocked on a lock that is held by another CLR worker thread, which in turn finds itself being delayed by preemptions from higher priority interrupt processing. Whatever high priority work preempted Thread 6920 had the side effect of also delaying Thread 6420, since 6420 was waiting on a lock that Thread 6920 happened to be holding at the time. The tool in Figure 9 displays the Unblocking stack from Thread 6920, which shows the original memory allocation from the Dictionary.Resize() method call being satisfied. When Thread 6920 resumed execution following its preemption, the GC operation completed and the global GC lock was released. Thread 6920 then continued to execute for another 25 microseconds or so before it was preempted because its time slice expired. Even as Thread 6920 blocked, Thread 6420 continued to wait while a different CLR thread pool thread (4664) began to execute instead. Finally, after another 25-microsecond delay, Thread 6420 resumed execution. For a brief period, both 6420 and 4664 executed in parallel, from approximately the 7640 to the 7650 microsecond marks on the timeline. (However, they were subject to frequent preemptions during that period of overlapped execution.)
Welcome to the indeterminacy associated with parallel programming.
I won’t take the time here to go into what this little concurrent CLR (Common Language Runtime) thread pool application is doing. Suffice it to say that it instantiates and references a very large Dictionary object in .NET, and I wrote it to illustrate some of the performance issues developers can face when trying to do parallel programming, which is a topic I was blogging about at the time. (I should also note that the test program puts the worker threads to sleep periodically to simulate synchronous I/O waits, in order to create an execution profile similar to what one could expect when processing a typical ASP.NET web request that needs to access an external database, an excellent idea I appropriated from a colleague, Joe Hellerstein.)
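For readers who want a feel for the shape of such a test, here is a minimal sketch along the same lines. This is not my actual test program; it simply shows thread pool work items that each build up a large, private Dictionary and periodically sleep to mimic a synchronous I/O wait.

using System;
using System.Collections.Generic;
using System.Threading;

class ParallelDictionaryTest
{
    static void Worker(object state)
    {
        // Each worker builds and references its own large Dictionary instance.
        var dict = new Dictionary<int, string>();
        for (int i = 0; i < 1_000_000; i++)
        {
            dict[i] = i.ToString();
            if (i % 100_000 == 0)
                Thread.Sleep(10);                // simulate a synchronous I/O wait (e.g., a database call)
        }
        ((CountdownEvent)state).Signal();
    }

    static void Main()
    {
        const int workers = 4;
        using var done = new CountdownEvent(workers);
        for (int i = 0; i < workers; i++)
            ThreadPool.QueueUserWorkItem(Worker, done);
        done.Wait();                             // wait for all the worker threads to finish
    }
}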
When I first began to profile this test app using the VS Concurrency Visualizer, I was able to see blocking issues like the one described here where the CLR introduced synchronization and locking considerations that are otherwise opaque to the developer. Well, caveat emptor, I suppose, when it comes to utilizing someone else’s code framework in your application. (See Rico Mariani’s Performance Tidbits blog for a singular discussion of his intriguing proposal that a .NET Framework method provide a performance signature that would allow a developer to make an informed decision before ever calling into some 3rd party’s code. Alas, static code analysis cannot be used to predict the performance of some arbitrarily complex method call embedded in your application, something Rico was eventually forced to concede.)
It turns out that .NET Framework collection classes do use locks to ensure thread-safe operation in a multi-threaded program, whether it is necessary or not. See the MSDN “Thread-Safe Collections” Help topic for more information. Each worker thread in my test program instantiated and accessed a dedicated instance of the Dictionary class during processing, so locking wasn’t necessary in this little test application. Because I had taken steps to ensure thread-safety issues would never arise in my test program, I was unpleasantly surprised when the tool uncovered lock contention for these Dictionary objects. Unfortunately, there is no way for the developer to explicitly signal the runtime that locking is not necessary. Some of the popular .NET Framework collection classes – like the Hashtable – do provide a Synchronized method that exposes a lock created implicitly. But the Synchronized method is designed to support more complex multi-threaded access patterns, such as a multiple-readers-and-writers scenario. To assist in parallel programming tasks, several newer collection classes were introduced in the System.Collections.Concurrent namespace that use “lock-free” and optimistic locking approaches that promise better scalability for parallel programs.
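As an illustration of the distinction (a sketch, not code from my test program): when each task owns its own Dictionary there is no sharing and no locking is logically required, whereas a collection that genuinely is shared across threads can use one of the System.Collections.Concurrent classes rather than an explicit global lock.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class CollectionChoices
{
    // No sharing: each parallel iteration gets its own Dictionary, so no locking is needed.
    static void PerThreadDictionaries()
    {
        Parallel.For(0, 4, _ =>
        {
            var local = new Dictionary<int, int>();
            for (int i = 0; i < 100_000; i++) local[i] = i * i;
        });
    }

    // Genuine sharing: ConcurrentDictionary synchronizes internally using fine-grained
    // techniques, rather than relying on one coarse global lock.
    static void SharedConcurrentDictionary()
    {
        var shared = new ConcurrentDictionary<int, int>();
        Parallel.For(0, 400_000, i => shared[i] = i * i);
    }
}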
I eventually tweaked the test app into an especially ghoulish version I call the LockNestMonster program that uses explicit global locks to shine an even brighter light on these issues.
Figure 9. CLR Worker Thread 6420 blocked because it is waiting on a GC lock that happens to be held by Thread 6920, which is subject to preemption by higher priority system routines.

Time-slicing.

The Concurrency Visualizer also uses context switch events to calculate the delays a thread encounters when it is preempted because its time-slice has expired. In Figure 10, I clicked on the large yellow block on the right-hand side of the execution time bar graph for Thread 6920, indicating another long delay. As in Figure 9, I have hidden all but the three active CLR thread pool threads. Using a combination of zooming to a point of interest in the event stream and filtering out extraneous threads, as illustrated in Figure 10, the Concurrency Visualizer constructs an execution time profile using just those events that are visible in the current time window.
Overall, the three active CLR worker threads are only able to execute 18% of the time, while they are delayed by synchronization 9% of the time and subject to preemption 39% of the time. (You can click on the Profile Report tab in the middle right portion of the display and see a profile report by thread.)
Figure 10. Using the Concurrency Visualizer to drill into thread preemption delays.
At the point indicated by the selection, the time-slice quantum for Thread 6920 expired and the Scheduler preempted the executing thread in favor of some other ready thread. Looking at the visualization, it should be apparent that the ready thread the Scheduler chose to execute next was another CLR thread pool worker thread, namely Thread 4664, which then blocked Thread 6920 from continuing. The tool reports that a context switch (6920, 4664) occurred, and that Thread 6920 was delayed for about 275 milliseconds after being preempted before it resumed execution.
As illustrated in this example, the Concurrency Visualizer uses the ETW-based event data from a profiling run to construct a state machine that reflects the precise execution state of each application thread over the time interval being monitored. It goes considerably beyond calculating processor queue time at the thread level. It understands how to weave the sequence of ReadyThread and Context Switch events together to create this execution time profile. It summarizes the profiling data, calculating the precise amount of time each thread is delayed by synchronous I/O, page faults (i.e., involuntary waits due to memory management overhead[1]), processor contention, preemption by higher priority work, and lock contention over the profiling interval. Furthermore, it analyzes the call stacks gathered at each Context Switch event, looking for signatures that identify the specific blocking reason. And, specifically to help with lock contention issues, which are otherwise often very difficult to identify, it also identifies the thread that ultimately unblocks the thread that was found waiting to acquire a lock.
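A bare-bones version of such a state machine might look like the sketch below. The types are hypothetical, and the classification is far cruder than the Concurrency Visualizer’s, which uses wait reasons and call-stack signatures to distinguish I/O, page faults, lock contention, and preemption; this version only accumulates the time each thread spends in each coarse execution state.

using System.Collections.Generic;

static class ThreadStateProfile
{
    public enum ExecState { Running, Ready, Waiting }    // coarse states; the real tool distinguishes many more

    public static Dictionary<int, Dictionary<ExecState, double>> Build(
        IEnumerable<(double Time, int ThreadId, ExecState NewState)> transitions)
    {
        var lastChange = new Dictionary<int, (ExecState State, double Since)>();
        var profile = new Dictionary<int, Dictionary<ExecState, double>>();

        foreach (var t in transitions)                   // assumed ordered by Time
        {
            if (lastChange.TryGetValue(t.ThreadId, out var prev))
            {
                if (!profile.TryGetValue(t.ThreadId, out var perState))
                    profile[t.ThreadId] = perState = new Dictionary<ExecState, double>();
                // Charge the elapsed interval to the state the thread was in until now.
                perState[prev.State] = perState.GetValueOrDefault(prev.State) + (t.Time - prev.Since);
            }
            lastChange[t.ThreadId] = (t.NewState, t.Time);
        }
        return profile;
    }
}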


[1] In the Concurrency Visualizer, memory management waits that are resolved very quickly, usually in less than a millisecond, correspond to soft page faults. When hard page faults occur, the tool shows a corresponding disk I/O, and the delay is ordinarily several milliseconds in duration, depending on the speed of the paging disk.

Comments

  1. It would be nice to see more information about the LPC/ISR preemptions. Can you use xperf in tandem with the visualizer to reconstruct which ISR preempted the threads?

  2. Yes, apparently the actual source of the higher priority interrupt wasn't important to the Concurrency Visualizer developers, who were focused on the performance of multi-threaded programs. But xperf is helpful in that area. See an earlier blog post that illustrates using xperf to evaluate ISR and DPC usage by the TCP/IP driver stack in Windows: http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx


