As discussed in the last blog entry, unfortunately, an automated, rules-based, expert systems approach to diagnosing performance-related problems turns out to be too brittle to be very effective. The simple threshold-based rules invoked by various authorities often need to be fleshed out with additional conditions and exceptions. Once the rule is burdened with all the predicates necessary to qualify an expert’s assessment of the data in context, the automated reasoning process starts to break down.
It turns out it isn’t so easy to encapsulate an expert’s knowledge and judgment into a simple, declarative rule. The expertise a performance analyst cultivates often involves pattern matching based on experience with many similar incidents encountered in the past. Where the human diagnostic expert often relies on intuition rooted in that background and professional experience, it is difficult to craft a mathematical or logical rule that can accurately mimic that reasoning and decision-making process. We haven’t yet figured out how to get a computerized expert system to play a hunch or take an educated guess.
Moreover, “It depends” is often the right answer when it comes to setting an Alert threshold for many of the common performance metrics that are gathered for Windows – or any other type of machine, for that matter. Generating genuinely useful alerts based on measurements that exceed some pre-defined threshold value – as determined by some expert – usually requires a deeper understanding of what that threshold value depends on.
The fact that the experts often argue over what the Alert threshold for this or that Windows counter should be isn’t too surprising. The rules themselves often need to be interpreted in context – what is the workload, what kind of hardware, what is the application, etc. That is why I often try to turn the argument over what the proper setting for the rule is into a discussion of the reasoning underlying why the expert chose that particular threshold value. If you understand why the expert chose this or that threshold value for the rule, you have a much better chance at getting that rule to work for you in your environment.
Another issue you need to face when it comes to threshold settings for alerts is that in many, many cases what it depends on is what is customary in your specific environment. The shorthand for this dependency that I used in the Win2K3 Server Resource Kit book is along the lines of “Build alerts for important server application processes based on deviation from historical norms [emphasis added].”
For example, take the context switches/sec counter in Windows. It helps, of course, to have some basic understanding of what a context switch is in Windows and how context switches are counted. The “textbook” definition I provided in the Win2K3 Performance Guide reads,
A context switch occurs whenever the operating system stops one running thread and replaces it with another. This can happen because the thread that was originally running voluntarily relinquishes the processor, usually because it needs to wait until an I/O finishes before it can resume processing. A running thread can also be preempted by a higher priority thread that is ready to run, again, often due to an I/O interrupt that has just occurred. User mode threads also switch to a corresponding kernel mode thread whenever the User mode application needs to perform a privileged mode operating system or subsystem service.
The rate at which thread context switches occur is tallied at the Thread level and at the overall system level. While this is an intrinsically interesting statistic, there is very little that a system administrator can do about the rate that context switches occur.
which, let’s face it, may or may not be all that helpful. A thread is a unit of execution, and there are typically hundreds of them, most of which are usually idle. When there is work to be done by a thread, though, the thread is scheduled for execution. A context switch occurs when the thread actually begins execution. Logically, there is a context switch (newthreadid, oldthreadid) event that occurs, and these are what are being counted. (Note: if the processor was idle at the time the new thread is scheduled for execution, the oldthreadid = the Idle thread, which is more of a bookkeeping mechanism than an actual execution thread.)
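That counting model can be sketched in a few lines. This is a toy illustration, not how the OS actually implements the counter: the event tuples, thread ids, and timestamps below are all invented for the example, with id 0 standing in for the Idle thread.

```python
from collections import Counter

# Hypothetical stream of context-switch events, one tuple per switch:
# (new_thread_id, old_thread_id, timestamp_in_seconds).
# Thread id 0 stands in for the Idle thread.
events = [
    (12, 0, 0.1),   # idle processor picks up thread 12
    (34, 12, 0.4),  # thread 34 replaces thread 12
    (12, 34, 0.7),
    (0, 12, 0.9),   # processor goes idle again
    (56, 0, 1.2),
]

def switches_per_second(events, interval=1.0):
    """Tally context-switch events into per-interval counts, which is
    conceptually how a rate counter like context switches/sec is derived."""
    buckets = Counter(int(ts // interval) for _, _, ts in events)
    return dict(buckets)

print(switches_per_second(events))  # {0: 4, 1: 1}
```

The point of the sketch is simply that the counter is a tally of discrete (newthreadid, oldthreadid) events per measurement interval, including switches to and from the Idle thread.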
It certainly sounds like monitoring the rate at which context switches occur could be useful. So far as an alert threshold is concerned, however, there is simply no carved-in-stone, right or wrong threshold you can set based on the number of context switches per second that occur on a Windows machine. It depends on the workload. For additional context (pun intended), let’s look at some other measurements that are apt to be related to the number of context switches that occur.
The Perfmon screen shot in Figure 1 shows that the number of context switches that occur can vary a great deal during a typical execution interval. The data here, as you can see, was gathered over a 50-minute interval and ranges between 8K and 32K context switches per second, which is a considerable degree of variability. This will make building a statistical quality control alert “based on deviation from historical norms” a bit more challenging because it means understanding what that customary range of behavior is. (I will return to the statistical quality control approach in more detail later. For now, see this earlier blog entry of mine that I posted last year.)
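To make the “deviation from historical norms” idea concrete, here is a minimal sketch of a statistical quality control test, assuming you have already collected a baseline of interval samples under normal operation. The baseline numbers and the 3-sigma band are illustrative choices, not recommendations:

```python
import statistics

def sqc_alert(history, current, k=3.0):
    """Flag the current measurement if it falls more than k standard
    deviations away from the historical mean -- a simple statistical
    quality control test. `history` holds interval samples (e.g.,
    context switches/sec) gathered under normal operating conditions."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return abs(current - mean) > k * stdev

# Baseline ranging from 8K to 32K switches/sec, as in Figure 1.
baseline = [8000, 12000, 15000, 22000, 28000, 32000]
print(sqc_alert(baseline, 25000))   # within the customary range
print(sqc_alert(baseline, 100000))  # well outside historical norms
```

Note that the wide 8K–32K baseline inflates the standard deviation, which is exactly why a highly variable metric makes this kind of alert harder to tune.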
Figure 1. Charting the relationship between device interrupts and context switches using Perfmon.
Knowing what a metric like context switches depends on can also be very helpful in determining whether the number you are seeing is problematic or not. That was why I gathered data on Interrupt rates at the same time. Under normal circumstances, each device interrupt that is processed will cause at least two thread context switches:
1) an initial context switch that invokes the Interrupt Service routine, and
2) a subsequent context switch to a user mode thread that is lying dormant, waiting to be woken up by the OS when the data transmitted to or from the device that caused the interrupt is completed.
The relationship between the two measurements is quite evident. (Be careful, Perfmon graphs in real-time wrap; the data on the right side of the chart was gathered before the data on the left side.) These are not two independent variables. They are, in fact, closely related to each other. Context switches/sec is at least partially a function of the device interrupt rate.
I initiated a large file copy operation on this machine after I started tracking the performance counters in Perfmon. Instead of tracking the overall interrupt rate, I could have also looked specifically at the number of disk interrupts that are occurring using the Physical Disk(_Total)\Disk transfers/sec counter, which is especially helpful if the workload happens to be disk IO bound. In fact, if the workload is disk IO bound, as my file copy operation undoubtedly was, the number of context switches/sec that occurs is primarily an artifact of disk IO capacity. If the workload is disk IO bound, and you are able to swap in a faster disk, the number of disk transfers will increase, with a corresponding increase in the number of context switches.
The spike in context switches on the right side of the graph is the result of some web server activity I also initiated once I got the file copy operation going. In the case of network IO requests, things are more complicated. Network Sends and Receives must traverse several layers of the TCP/IP stack, ultimately arriving at the application layer for processing. For example, the data from an http request to an IIS web server is handled in turn by a series of kernel and user mode threads before it finally arrives in your ASP.NET application for processing. In the case of a TCP Receive of an http packet, I would expect to see at least three or four thread context switches before it is finally processed in your ASP.NET application.
Let’s widen the context a little more. In Figure 2, both interrupts/sec and context switches/sec are shown in relationship to overall processor utilization. (I resorted here to serving up a chart from the NTSMF reporting portal that overlays interrupts/sec and context switches/sec over a stacked area graph that reports processor utilization per processor. Perfmon charts don’t do this type of reporting very well.) In Figure 2, the corresponding relationships between device interrupt handling, which includes all network IO requests on a web server machine, thread context switching, and the demand for overall processor resources are evident. Interrupts lead to context switches, which, from another perspective, also represent the units of processor work that need to be performed to service typical web requests.
Figure 2. The relationship between device interrupts, thread context switches and CPU utilization on a Windows machine.
The point is these are not independent measurements. Context switches, interrupts, and processor utilization are measuring related aspects of thread scheduling and thread execution time. From a statistical viewpoint, they are not only highly correlated, they are autocorrelated.
The fact that these metrics are all potentially related adds a whole new dimension to this discussion about performance rules and alerts. Perhaps how many context switches per second my machine can handle is more appropriately a question of how much processor, disk and network capacity I have on hand to field http or other network requests without starting to impact on server responsiveness in my ASP.NET application. From a capacity planning perspective, we can also see that being able to calculate the amount of CPU time per network request on average and then trending that data over time is extremely useful.
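The capacity planning calculation mentioned above is straightforward arithmetic. This sketch assumes you can sample processor utilization and a request count over the same measurement interval; the example figures are invented:

```python
def cpu_ms_per_request(cpu_busy_pct, num_cpus, interval_secs, requests):
    """Average CPU cost per request over a measurement interval:
    total CPU-seconds consumed divided by requests served, in ms.
    Trending this value over time reveals drift in per-request cost."""
    cpu_seconds = (cpu_busy_pct / 100.0) * num_cpus * interval_secs
    return 1000.0 * cpu_seconds / requests

# e.g., 4 CPUs at 35% busy over a 60-second interval serving 8,400 requests
print(round(cpu_ms_per_request(35.0, 4, 60, 8400), 1))  # 10.0 ms per request
```

A sudden jump in this ratio after a new application release, with the request rate unchanged, is exactly the kind of signal worth alerting on.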
Ultimately, it is in this wider overall context that tracking a metric like context switches/sec makes sense. This is why raising an alert when a web server that normally handles 15-30K context switches per second suddenly is processing more than 100,000 context switches per second can be useful. Then you can drill into the context switch data and see whether the relationship between device interrupts, context switches, and CPU processing that held in the past continues to hold in the current situation. Has a disk controller suddenly gone bad and started spewing forth a ton of extra device interrupts? Or maybe it is a denial of service attack on your web server by some evil hackers. Or maybe Ashton Kutcher just tweeted about that cat photo you posted and your web server is being deluged with requests to view it.
Or maybe you’ve just gone online with a new rev of the application, and this is something that should never have gotten past the QA team.
Encapsulate that knowledge into a Performance Rule and you are in business.