
Rules in PAL: the Performance Analysis of Logs tool

In spite of their limitations, some of which were discussed in an earlier blog entry, rule-based bromides for automating computer performance analysis endure. A core problem that the rule-based approach attempts to address is that, with all the sources of performance data that are available, we are simply awash in esoteric data that very few people understand or know how to interpret properly. At a bare minimum, we need expert advice on how to navigate through this morass of data, separating the wheat from the chaff, and transforming the raw data into useful information to aid IT decision-making. Performance rules that filter the data, with additional suggestions about which measurements are potentially important, can be quite helpful.
A good, current example is the Performance Analysis of Logs (PAL) tool, a free Windows counter analysis tool developed by Clint Huffman, a performance analyst of the first rank who works on Microsoft’s Premier Field Engineering team. PAL serves the need of someone like Clint, a hired gun who walks into an unfamiliar environment that is experiencing problems and wants to be able to size up the situation quickly. PAL uses pre-defined data collection sets to gather a core set of Windows counter data and thresholds to filter the results. Threshold templates are stored in files that can easily be edited and changed. The program potentially fills an important gap; it is capable of analyzing large quantities of esoteric performance measurement data gathered using Perfmon and, when used properly, returns a manageable number of cases that are potentially worth investigating further.
As a diagnostic tool, PAL is deliberately designed merely to scratch at the surface of a complex problem. In the hands of a skilled performance analyst like Clint, it does an excellent job of quickly digging through the clutter in a very messy room. The key to using a tool like PAL effectively is understanding that these are mainly filtering rules, heuristics that help a skilled performance analyst size up a new environment quickly. It also helps to maintain a healthy skepticism about the quality of the analysis that these simple, declarative “expert” performance rules provide, apart from the expertise and judgment of the person wielding the tool.
Setting rule thresholds.

PAL defuses the problem that “experts” are likely to disagree over what the threshold values for various rules should be by giving you access to those settings so that they can be easily changed. Experts, of course, love to argue about whether the threshold for the rule on the desirable level of, for example, processor utilization is 70% or 80% or 90% or whatever. IMHO, these arguments are not a productive way to spend time. Rather than obsessing over whether 80% or 90% CPU busy is the proper value for the rule’s firing threshold, I like to transform the debate into a discussion about the reasoning used by the “expert” in selecting that specific target value. This tends to be a much more productive discussion. Knowing what the specific threshold setting depends on is often useful information. If you understand why a specific threshold setting was suggested, it helps you to understand whether or not the Rule is appropriate for your needs, whether the threshold should be adjusted for your specific environment, etc.
Consider the CPU, for example. The rule about excessive levels of CPU utilization is shorthand for what can happen to the performance of your application if, when it needs to use a processor, it finds that the CPUs are all busy with other, higher priority work. When the processors are overloaded, threads are delayed waiting to be scheduled.
If an application thread is Ready to execute, but all the CPUs are currently busy doing higher priority work, the thread is delayed in the OS Scheduler’s Ready Queue. In addition, if the application thread is executing, but some higher priority thread needs to execute (often, as a result of a device Interrupt being received and processed) and no other CPU is available, then the higher priority thread will preemptively interrupt the currently running thread. If this happens frequently enough, the throughput and/or response time of the application will be impacted by these queuing delays that are associated with thread scheduling.
Unfortunately, the amount of queuing delay in thread scheduling is not measured directly in any of the available Windows performance counters. Since thread scheduling delays should only occur when the CPU resources are approaching saturation, measurements of processor utilization, which are readily available, are used as a proxy for detecting thread queuing problems directly. If it were possible to gather measurements of thread execution queuing delay directly, the arguments over what CPU busy threshold to alert based on would surely evaporate. (Note: I described how to use the CSwitch and ReadyThread events in Windows to measure thread execution queue time directly from ETW in an earlier blog post entitled “Measuring Processor Utilization and Queuing Delays in Windows applications.”)
So, the first complication with a CPU busy rule is that processor utilization is really a proxy for a measurement that is not available directly. Fortunately, queuing theory can be invoked to predict mathematically the relationship between processor utilization and thread CPU queue time. A simple m/m/n queuing model predicts the following relationship between processor utilization and queue time, for example:
Figure 1. Response time vs. utilization in a simple m/m/n model.
The chart in Figure 1 shows response time, which is the sum of service time + queue time, rising sharply as the utilization of some resource begins to approach 100%. This particular chart uses the assumption that the average service time of a request (without queuing) is 10 ms., and then calculates the overall response time observed as utilization at the server is varied. At 50% utilization (and this leads to a nice Rule of Thumb), the queue time = the service time, according to the model, which means that overall response time is 2 * the average service time. At 75% utilization, the queue time increases to 3 * service time; an increase to 80% utilization increases the queue time to 4 * the service time, etc.
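The multiples quoted above are consistent with the single-server form of the model, in which queue time = service time * u / (1 - u) at utilization u. Here is a minimal Python sketch of that arithmetic. The function name and the list of utilization levels are my own, chosen for illustration; the 10 ms service time is the one assumed for the chart.

```python
def mm1_times(utilization, service_time_ms=10.0):
    """Return (queue_time_ms, response_time_ms) for a single-server queue.

    Assumes the simple single-server relationship:
    queue time = service time * u / (1 - u).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    queue_time = service_time_ms * utilization / (1.0 - utilization)
    return queue_time, service_time_ms + queue_time


# Tabulate the curve at a few utilization levels (illustrative choices).
for u in (0.25, 0.50, 0.75, 0.80, 0.90, 0.95):
    queue_ms, response_ms = mm1_times(u)
    print(f"{u:4.0%}  queue = {queue_ms:6.1f} ms  response = {response_ms:6.1f} ms")
```

Running it shows the same pattern as the chart: the queue time equals the service time at 50% busy, and it grows much faster than utilization does beyond that point.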
The overall shape of the response time curve in Figure 1 is one which swoops upward towards infinity, with a characteristic “knee” corresponding to a steep spike in response time as the server approaches saturation. Assuming for the moment that the response time curve depicted in Figure 1 is a realistic representation of OS thread scheduling, you should be able to see how this response time curve motivates formulating a rule that alerts us whenever CPU utilization exceeds a threshold of 75, 80, or 85% busy.
The response time curve in Figure 1 does model the behavior that we often can observe when computer systems begin to experience performance problems. These problems often arise suddenly, as if out of nowhere. The response time curve of the m/m/n model mimics that behavior. Response time remains relatively flat when the system is lightly loaded. But the performance of our computer systems does not degrade gradually. Suddenly, and in the face of some capacity constraint or bottleneck, response times spike. As utilization of the CPU increases, an m/m/n model predicts that thread execution time will elongate due to queuing delays. According to the curve shown in Figure 1, queuing delays are apt to be significant at, say, 80% utilization. A performance rule that fires when the processor is observed running at 80% or higher encapsulates that knowledge.
But in order to formulate a useful diagnostic rule that warns us reliably to watch for potential thread execution time delays when the processor is running close to saturation, we need to become more familiar with an m/m/n queuing model and understand how realistically it can be used to represent what actually goes on during OS thread scheduling. It turns out that the specific response time curve from an m/m/n model, depicted in Figure 1, is not an adequate model for predicting queuing delays in OS thread scheduling.
Multiple CPUs. The first important wrinkle in applying the formula to a threshold rule is how to calculate utilization. In the case of a single CPU (in which case, we are dealing with an m/m/1 model), the calculation is straightforward, and so is the rule. When there are multiple CPUs and a thread can be scheduled for execution on any available engine, however, the utilization value that is fed into the model is the probability that all CPUs are busy simultaneously. This is known as the joint probability. A ready thread is only forced to wait on a processor if all CPUs are currently busy. If each processor CPU_i is busy with probability p_i, and the processors are treated as being busy independently of one another, the joint probability that all n processors are busy is p_1 * p_2 * … * p_n, or p^n when every processor is equally busy. For example, if each CPU in a 4-way multiprocessor is 80% busy, the probability that all CPUs are busy simultaneously is 0.8^4, or about 41%.
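The joint probability arithmetic is easy to sketch in Python. The helper names below are hypothetical, and both functions assume that the processors are busy independently of one another, which is a simplification. The second helper inverts the calculation: given a target probability that all CPUs are busy, it returns how busy each of n equally loaded CPUs would have to be.

```python
import math


def all_busy_probability(per_cpu_busy):
    """Probability that every CPU is busy at once (CPUs treated as independent)."""
    if not per_cpu_busy:
        raise ValueError("need at least one CPU")
    for p in per_cpu_busy:
        if not 0.0 <= p <= 1.0:
            raise ValueError("busy probabilities must be between 0 and 1")
    return math.prod(per_cpu_busy)


def per_cpu_busy_for(all_busy_target, cpu_count):
    """Per-CPU busy level at which cpu_count equally busy CPUs reach the target."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if not 0.0 <= all_busy_target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return all_busy_target ** (1.0 / cpu_count)


print(f"{all_busy_probability([0.8] * 4):.2%}")  # 40.96%
print(f"{per_cpu_busy_for(0.8, 4):.1%}")         # 94.6%
```

Under these assumptions, each CPU in a 4-way machine would have to be roughly 94.6% busy before the probability that all four are busy at once reaches 80%.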
Clearly, a rule threshold that is based on how busy one processor is needs to be adjusted upward significantly when multiple CPUs are configured. At a minimum, the rules engine should calculate the joint probability that all CPUs are busy. Elsewhere, I have blogged about all sorts of interpretation issues involving the % Processor Time counters in Windows. So long as it is understood that the CPU busy threshold is a proxy for CPU queue time delays, it is OK to ignore most of these considerations. Interpreting these CPU utilization measurements on NUMA machines, SMT hardware, processors that dynamically overclock themselves, or virtualized hosts raises some extremely funky issues. The fact that OS thread scheduling in Windows uses priority queuing with preemptive scheduling means that the simple response time formula from an m/m/n model is not a very adequate representation of reality. Finally, given that a thread does not compete with itself for access to CPU resources, you actually should calculate CPU utilization relative to all other higher priority executing threads.
At this point, let’s not try to go there. However, if you have to make a specific hardware or software recommendation for a mission-critical application that also involves a good deal of time and money changing hands, all of these murky areas may need to be investigated in detail.
To summarize, a simple processor utilization rule has value as a filtering rule,
  • based on the potential response time delays predicted by simple queuing models when there is heavy contention for CPU resources,
  • if the rule threshold is based on calculating the joint probability that all CPUs are busy,
  • and the rule’s firing is understood to be a proxy for a direct measurement of thread execution time delays due to CPU queuing.
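To make the summary concrete, here is a rough Python sketch of a filtering rule along these lines. It is not how PAL implements its thresholds. The function name, the 0.80 default threshold, and the sample values are all made up for illustration, and the processors are again treated as independent.

```python
import math


def cpu_rule_fires(per_cpu_busy_pct, threshold=0.80):
    """Return True when the estimated all-CPUs-busy probability exceeds threshold.

    per_cpu_busy_pct holds one % busy value (0-100) per logical processor
    for a single measurement interval.
    """
    if not per_cpu_busy_pct:
        raise ValueError("need at least one processor value")
    all_busy = math.prod(pct / 100.0 for pct in per_cpu_busy_pct)
    return all_busy > threshold


# Hypothetical samples: one row per interval, one value per processor.
samples = [
    [35.0, 42.0, 28.0, 40.0],
    [96.0, 98.0, 95.0, 97.0],
    [80.0, 80.0, 80.0, 80.0],
]

# Keep only the intervals worth a closer look.
flagged = [i for i, row in enumerate(samples) if cpu_rule_fires(row)]
print(flagged)  # [1]
```

Only the second interval is flagged. The third, with every CPU at 80% busy, is not, because the estimated probability that all four are busy at once is only about 41%.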
More on the Rules in PAL in the next post.
