“There’s a lot more to running a starship than answering a lot of fool questions.”

Continuing a series of blog posts on “expert” computer Performance rules, I am reminded of something Captain James T. Kirk, commander of the starship Enterprise, once said in an old Star Trek episode:

“There’s a lot more to running a starship than answering a lot of fool questions.”

Star Trek, The Original Series. Episode: The Deadly Years. Season 2, Episode 12. See http://tos.trekcore.com/episodes/season2/2x12/captioninglog.txt.

For some reason, the idea that the rote application of some set of rules derived by a domain “expert” can suffice in computer performance analysis has great sway. At the risk of beating a dead horse, I want to highlight another example of a performance Rule you are likely to face, and, in the process, discuss why there is a whole lot more to applying it than might be obvious at first glance. There happens to be a lot more to computer performance analysis than the rote evaluation of some set of well-formed performance rules.

It ought to be apparent by now that I am not a huge fan of performance Rules as they are conventionally bandied about. I can remember that when I was just starting to learn about the discipline of computer performance analysis in the early 1980s, Rules of Thumb (ROTs) approaches were very popular. The approach continues to be popular today. While filtering rules clearly have a place, as I discussed previously, believing that the rote application of some expert’s set of threshold rules when you are examining copious quantities of performance data can substitute for the experience, knowledge and judgment of the person wielding the same tools is mostly an exercise in wishful thinking.

As I mentioned in the previous blog entry on this same subject, in the right hands, the Performance Analysis of Logs (PAL) reporting program developed by Clint Huffman can be a very useful tool. However, the tool does not substitute for the experience, knowledge and judgment of the person wielding the tool. This, unfortunately, is one of the serious limitations of rule-based approaches to automating performance analysis.

On his blog, Clint offers the following rule for evaluating the health of your server’s disk farm:

“If “\LogicalDisk(*)\Avg Disk Sec/Read” or “\LogicalDisk(*)\Avg Disk Sec/Write” is greater than 15ms, then the computer may [emphasis added] have a disk performance issue.”
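As a minimal sketch, the filtering rule quoted above amounts to a single comparison per counter sample. The Python below is an illustration only, not how PAL itself is implemented; the function name `pal_disk_filter` and the sample values are hypothetical.

```python
PAL_THRESHOLD_SEC = 0.015  # the 15 ms threshold quoted in the rule


def pal_disk_filter(samples, threshold=PAL_THRESHOLD_SEC):
    """Return (instance, counter, value) tuples whose latency exceeds the threshold.

    `samples` maps (disk instance, counter name) -> average latency in seconds.
    """
    flagged = []
    for (instance, counter), value in sorted(samples.items()):
        if value > threshold:
            flagged.append((instance, counter, value))
    return flagged


# Hypothetical interval averages for two logical disks.
samples = {
    ("C:", "Avg. Disk sec/Read"): 0.004,
    ("C:", "Avg. Disk sec/Write"): 0.006,
    ("D:", "Avg. Disk sec/Read"): 0.022,
    ("D:", "Avg. Disk sec/Write"): 0.012,
}

for instance, counter, value in pal_disk_filter(samples):
    print(f"{instance} {counter}: {value * 1000:.1f} ms, may have a disk performance issue")
```

Note that the function only filters; it says nothing about whether the flagged disk matters to the application, which is the point of the discussion that follows.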

Clint’s disk performance rule is one I plan to dissect. This is essentially a filtering rule, and, as such, I have no quarrel with it. It is useful, up to a point. Of course, the 15 ms threshold setting is definitely one of those “It depends” type of thresholds I wrote about last time. Anyone attempting to wield this rule needs to understand the circumstances when the 15 ms threshold might be appropriate and when it isn’t. I plan to drill into that in a moment. But, first, I’d like to step back and view this disk performance rule in a wider context of identifying and resolving disk performance issues in general. This is a subject for a book, or at least a chapter in a book, and the wider context I have in mind encompasses the chapter I wrote on Windows disk performance several years ago. (A reasonable facsimile is available here.)

When Disk performance rules!

One good reason to mention this rule of Clint’s is that disk performance is often quite important on production servers, especially back-end database servers. For example, at Microsoft recently, I was involved in tuning a large server-based production application that was causing fits inside the Developer Division where I worked. (Since this tuning project has been discussed openly in various blogs and presentations, it should be okay for me to write a little about what happened from my point of view.) One of the key things I did early on in this effort was to identify a critical disk performance bottleneck on the Team Foundation Server (TFS) database server instance that was getting hammered by developers (and their managers) in the Division where I worked.

I attended several departmental meetings where the managers were blaming sluggish TFS performance for slowing down the progress of the developers from their teams who were working on the next version of Visual Studio and the .NET Framework. The sluggish performance of the application was starting to impact everyone’s productivity. I chose to interpret my responsibilities for the performance of the Division’s products broadly in this case and began to insert myself into the process to see if I could help resolve the problems.

In essence, TFS source code version control was not scaling, given the number of developers pounding away at Visual Studio and the size of the Visual Studio code base, which is enormous. This, by itself, is hardly unusual. Many complex computer applications encounter similar problems of scale. In fact, I have made a decent living helping to identify these scalability problems and then helping to resolve them over the last thirty years. What was interesting in this case, and this is also pretty typical, was there were lots of opinions about what was wrong with TFS and how to fix its performance from various “experts”, but until I began to get involved, I am not aware of anyone taking a hard look at any of the measurement data that was available. It is good to express an opinion, of course, but even better is expressing an informed opinion.

After first learning a little bit about how TFS itself was put together, I began reviewing Windows performance counter data with the TFS system admins who were responsible for gathering the measurements on the production servers, focusing on the peak periods when the most complaining was occurring. Initially, there was a problem gathering the disk counters – the “Disable Performance Counters” flag tends to rear its ugly head with the disk counters. But I showed the sysadmins how to circumvent the problem (the call to Open the Perflib responsible for gathering the disk counters frequently times out).

In short order, I had access to all the counter data I needed, and I was able to determine that disk throughput and response time were a major bottleneck. An overloaded disk subsystem was also the kind of problem that I thought could be fixed in the short-term with improved hardware, in effect, stopping the bleeding long enough for the TFS developers to address longer term architectural changes to the application to help its scalability. (Which they later did.)

Grant Holliday (whose blog is here) was my main point of contact on this effort – and Grant did an excellent job throughout, including building a Team Foundation Server performance reporting tool based on the work we did together that our customers could use, which was something I lobbied for very hard when I first got involved in the original tuning effort. TFS is a complex, multi-tiered server application, and I was certain our paying customers could easily encounter similar problems of scale.

Interestingly, from Grant’s perspective, there is a relatively simple set of performance rules that anyone can apply systematically and find these sorts of problems quickly. In reality, I don’t think it was quite that simple, but, watching me at work on this, I guess it could have appeared that way to Grant.

A more comprehensive disk performance rule.

What I want to do here is look at the disk performance rule Grant formulated, unbeknownst to me, as a result of the work we did together. Grant’s disk performance rule adds a few wrinkles to the simple filtering rule in PAL, attempting to construct something with greater diagnostic power (quoted directly, typos and all, from his blog entry on the subject):

To determine if you are having significant issue with disk latency you should use the following performance counters:

Object: [Physical Disk] or [Logical Disk]
Counter: [Avg. Disk Sec/Transfer]
Instance: Ideally you collect this for individual disks however you may also use [_Total] to identify general issues. If [_Total] is high then further collections can be taken to isolate the specific disks affected.
Collection Interval: Ideally you should collect at least every 1 minutes (sic). The collection should be run for a significant period of time to show it is an ongoing issue and not just a transient spike. 15 minutes is minimum suggested interval.
Issue Thresholds (seconds):

< 0.020: Normal time and no I/O latency issues are apparent
> 0.00 (sic) – 0.050: You may (be) somewhat concerned. Continue to collect and analyze data. Try to correlate application performance issues to these spikes [emphasis added]
> 0.050 – 0.100: You are concerned and should escalate to SAN administrators with your data and analysis. Correlate spikes to application performance concerns.
> 0.100: You are very concerned and should escalate to SAN administrators. Correlate spikes to application performance concerns.

And, according to Grant, it is as simple as applying that set of logic. Hmmm.
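Taken at face value, the threshold bands in the quoted rule could be sketched in Python as follows. This is only an illustration: the function name `classify_latency` is hypothetical, and because the lower bound of the second band is garbled in the original ("> 0.00"), the sketch assumes that band starts where the first one ends, at 0.020 seconds.

```python
def classify_latency(avg_sec_per_transfer):
    """Map an Avg. Disk sec/Transfer value (in seconds) to a concern level."""
    if avg_sec_per_transfer < 0.020:
        return "normal"
    if avg_sec_per_transfer <= 0.050:  # assumed band: 0.020 to 0.050
        return "somewhat concerned"
    if avg_sec_per_transfer <= 0.100:
        return "concerned"
    return "very concerned"


# Hypothetical one-minute samples, in seconds.
for value in (0.008, 0.035, 0.075, 0.140):
    print(f"{value * 1000:.0f} ms: {classify_latency(value)}")
```

The sketch deliberately leaves out the part of the rule that cannot be reduced to a comparison: correlating the spikes with application performance.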

In transcribing the body of the rule from Grant’s blog, notice that I have highlighted one critical element above that is decidedly not simple, namely, that you need to correlate application performance issues to observed spikes in disk performance. Noticing a sharp spike in disk response time is always interesting, but, in isolation, it is not very meaningful. On the other hand, when you can correlate the spike in disk performance to a spike in application performance service levels, then you are really onto something.
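One hedged way to make "correlate" concrete is to compute a correlation coefficient between the disk response time series and the application response time series over the same intervals. The sketch below uses a plain Pearson correlation written out by hand; the series values are invented for illustration, and nothing in the original analysis says this specific statistic was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical one-minute samples taken over the same ten intervals.
disk_ms = [8, 9, 10, 45, 60, 12, 9, 110, 95, 10]               # disk response time
app_sec = [0.6, 0.7, 0.6, 2.1, 2.9, 0.8, 0.7, 5.2, 4.4, 0.7]   # application response time

r = pearson(disk_ms, app_sec)
print(f"correlation between disk and application response time: {r:.2f}")
```

A coefficient near 1 suggests the two series spike together, which is a reason to look closer, not proof that the disk is the cause.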

Having access to application service level reporting – the rate requests are being processed and the response times of these requests – is crucial. First of all, if there is a performance problem, it should be evident from the service level reporting. (If not, you’ve got a problem with your reporting right there.) From there your performance investigation proceeds to identifying some component of the application’s response time that you find is highly correlated with the response time spikes you are observing. Moreover, over time, as the application load continues to grow and new scalability challenges inevitably emerge, you should use the service level measurements to verify and validate the effectiveness of any continuous quality improvement effort that is initiated.

It turned out that TFS did measure service levels, basically server-side response time, broken out for each TFS command. Initially, when I first got involved in the TFS tuning effort, the reporting of this application response time measurement data was a mess, rendering it almost useless. What was being reported was an aggregate response time for the application. Aggregated together were commands that involved major source code check-ins that required merging thousands of lines of new code into the existing code base (which contained millions of lines of code) with simple check-ins of just a few lines of changed code or simple TFS Work Item queries where the developer needed to access the bug report. Major check-ins would take 30-60 minutes or more, generating literally millions of database calls. Meanwhile, simple commands were routinely executed in less than one second. Aggregating and reporting all these commands together created a service level report that did not reflect developers’ actual experience using the system. It also obscured what was happening on the TFS database back-end.

Once I got Grant to sort that data to break out the commands into buckets of short, medium, and long requests, based on both the resource demands and reasonable service expectations for those requests, the application service level measurements became very, very useful. And, to the extent that the new reports were a better reflection of the actual user experience, Grant’s team started to gain credibility in the regular meetings they held to communicate status of the tuning effort to management. Probably, the most significant longer term contribution I made to TFS performance and scalability during this effort was in straightening out this service level reporting so that it more accurately reflected customers’ experience using the product.
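The effect of bucketing can be shown with a small Python sketch. The command names, the bucket assignments, and the timings below are all hypothetical, chosen only to mirror the short, medium, and long split described above; they are not taken from TFS.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical mapping of command types to service-level buckets.
BUCKETS = {
    "work_item_query": "short",
    "workspace_sync": "medium",
    "major_checkin": "long",
}


def service_levels(requests):
    """Summarize response times per bucket from (command, seconds) pairs."""
    by_bucket = defaultdict(list)
    for command, seconds in requests:
        by_bucket[BUCKETS[command]].append(seconds)
    return {
        bucket: {
            "count": len(times),
            "mean_sec": mean(times),
            "median_sec": median(times),
        }
        for bucket, times in by_bucket.items()
    }


# Hypothetical request log: many quick queries, a couple of syncs, one huge check-in.
requests = (
    [("work_item_query", 0.5)] * 6
    + [("workspace_sync", 20.0)] * 2
    + [("major_checkin", 2400.0)]
)

aggregate_mean = mean(seconds for _, seconds in requests)
levels = service_levels(requests)

print(f"aggregate mean: {aggregate_mean:.1f} s")
for bucket in ("short", "medium", "long"):
    print(bucket, levels[bucket])
```

With these made-up numbers, the aggregate mean is dominated by the single long check-in and describes none of the requests well, while the per-bucket figures can each be compared against a reasonable expectation for that kind of request.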

The improved set of service level reports that Grant built changed the character of the conversation managers could have about the application’s performance radically. When TFS was slow, we knew how slow, and we could quantify that judgment. From a diagnostic standpoint, we could also compare performance on a bad day to one of its better days, and see what was different about what was happening during a bad day. I was also then able to successfully promote the idea that we could evaluate the effectiveness of any proposed change to the TFS application (or its specific run-time configuration in our environment) based on the extent that service levels actually improved as a result.

Dissecting the disk performance rule.

Let’s get back to considering the specific rule Grant formulated for diagnosing disk performance problems, which he based on our experience working together over the course of several months.

Collection interval. So far as the Perfmon counter collection interval is concerned, Grant’s advice is right from the Friedman playbook. Perfmon is capable of gathering counter data at intervals as short as one second, but when you are looking at a large scale production server, I prefer the longer view. For a large scale production server, I am only willing to suggest a remedy for a problem that persists for a long enough period that it will be visible using one minute collection intervals.
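A rough Python sketch of why the longer interval filters out transient spikes: roll one-second samples up into one-minute averages and see which problems survive. The sample values are invented, and the simple arithmetic mean used here is a simplification, since an interval average reported by Perfmon reflects the IO requests completed in the interval, not a mean of per-second averages.

```python
def one_minute_averages(per_second_values):
    """Average consecutive groups of 60 one-second samples (last group may be short)."""
    minutes = []
    for start in range(0, len(per_second_values), 60):
        window = per_second_values[start:start + 60]
        minutes.append(sum(window) / len(window))
    return minutes


# Hypothetical disk response times in seconds:
# minute 1 has a 3-second spike to 200 ms, minute 2 is a sustained 30 ms.
per_second = [0.005] * 57 + [0.200] * 3 + [0.030] * 60

minutes = one_minute_averages(per_second)
for index, value in enumerate(minutes, start=1):
    print(f"minute {index}: {value * 1000:.2f} ms")
```

In this example the brief spike averages out below a 15 or 20 ms threshold, while the sustained slowdown in the second minute remains clearly visible.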

Counters evaluated. On the specific counters selected, the first issue is whether or not these are good counters to review. No problem there. The Avg Disk Sec/Read and Avg Disk Sec/Write counters measure the average response time of disk requests. The measurements are taken from the standpoint of the Logical Disk device driver. (This turns out to be something that can be very important when you are interpreting the data.) Of all disk performance counters available, these are the best measurements to be monitoring. The counters that the rule tests are not only valid service level indicators; they are also the best indicators available in Windows that you might be experiencing a disk performance problem.

But, at this point, it is already worth introducing a caveat. Being the best available indicator is not the same as being a good indicator, just as the mean is the best single estimator of a population, but is not guaranteed to be a good estimator. (Consider the case of a bi-modal distribution, for example.) As I will discuss more in a moment, these disk response time measurements are also difficult to interpret definitively when viewed in isolation. They almost always need to be understood in the context of some of the other, related disk performance counters. In addition, the physical disk configuration also matters. How logical disks are mapped to physical disks and how the physical disks are mapped to hardware entities can also have an impact on our understanding of how to interpret the measurement data.
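The bi-modal caveat is easy to demonstrate with a few lines of Python. The data is invented: assume 90 reads that complete in about 1 ms (say, satisfied from a controller cache) and 10 reads that take about 50 ms.

```python
from statistics import mean


def fraction_near_mean(values, tolerance):
    """Fraction of values that lie within `tolerance` of the mean."""
    m = mean(values)
    return sum(1 for v in values if abs(v - m) <= tolerance) / len(values)


# Hypothetical read times in milliseconds: two clusters, nothing in between.
read_times_ms = [1] * 90 + [50] * 10

print(f"mean read time: {mean(read_times_ms):.1f} ms")
print(f"fraction of reads within 2 ms of the mean: {fraction_near_mean(read_times_ms, 2):.0%}")
```

The average lands between the two clusters, so no individual read in this sample is anywhere near it. An average response time counter would report a value that describes neither the fast reads nor the slow ones.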

The threshold test. The simple 15 millisecond threshold in Clint’s default PAL rule is based on the physical capabilities of the typical disk drives that are normally configured (but not always) on server-class machines. Because disk drives are mechanical devices, several orders of magnitude slower than the solid-state components (i.e., CPU and RAM) of the computer system, many applications are sensitive to good disk performance. The filtering rule in PAL indicates that we needn’t concern ourselves with any logical disk drive where we observe that the average response time is less than 15 ms. (The fact that these counters measure disk response time, which includes queue time, from the standpoint of the OS device driver also turns out to be important. For instance, the 15 ms. disk response time can correspond to a 10-12 ms. service time for the physical disk operation, with the remainder the result of queuing.)

Notice that I used weasel words like “typical” and “normally” in the first sentence in that last paragraph. That was deliberate. Among physical disk drives available for sale, you will find a range of performance capabilities. In the case of a 5400 RPM drive on your laptop computer, a 15 ms average response time isn’t bad at all. These are drives optimized for weight and power consumption, rather than performance. For a premium drive that is optimized for performance and something for which you probably paid a premium price, at 15 ms, you are probably not getting your money’s worth. Previously, I wrote that when the “experts” start to debate what the rule’s threshold test should be, I like to turn that largely unproductive discussion into a much more productive one that concerns how they arrived at the correct setting. So, to reiterate, the disk performance filtering rule in PAL reflects reasonable disk performance service level expectations based on the physical capabilities of the typical disk drives that are often configured on server machines.

With regard to the more complicated set of thresholds that Grant suggests in the diagnostic rule he formulated, I have mixed feelings. Grant’s corresponding filtering rule uses a 20 ms. threshold. My thinking is that when disk response time is above 20 ms. and the spikes in disk performance are correlated with spikes in the performance of the application itself, I am more than just somewhat concerned. And Grant saw that I was extremely concerned that disk performance levels were sometimes worse than 100 ms. on the TFS back-end server during peak loads. But I also explained that the disk response time numbers are only a concern in the context of high IO rates. To give an extreme case, if the disk is only being accessed by the application once or twice a second, even 100 ms. disk response times are not a grave concern. (And, yes, this situation can happen.)

Moreover, when the disk is actually a virtual volume exposed by a high-end, SAN-based storage processor like an EMC Symmetrix that is configured because I want considerably better disk performance than 15 ms average response time, I am likely to be dissatisfied if disk response time is greater than 1-2 ms. The reason for configuring my application to use a high-priced, EMC Symmetrix disk (or similar offerings from other storage vendors, I am not trying to endorse any particular brand of storage processor here) should be that my application requires disk response times in the neighborhood of 1 ms or less, something that the EMC box can deliver reliably (when it is configured properly).

As indicated above, the disk response time measurements are made in software that runs in the OS device driver level. The physical characteristics of the disk device are not generally known at that level. But, ordinarily, you must have knowledge of the physical characteristics of the disk device to interpret the measurement data correctly. You need to know what kind of “physical” disk it is in order to make an informed judgment about what is a reasonable disk service time expectation for the device, plus, some basic understanding of queuing delays at the disk. (A simple M/M/1 model -- illustrated in the previous blog entry -- is sufficient for getting started here.)
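For readers who want the queuing arithmetic spelled out, here is a minimal Python sketch of the standard M/M/1 result, where response time R = S / (1 - U), S is the service time, and utilization U is the IO rate multiplied by S. The 10 ms service time is an assumed figure for a conventional drive, and the IO rates are illustrative.

```python
def mm1_response_time(service_time_sec, io_rate_per_sec):
    """Average response time (service + queue) for an M/M/1 queue, in seconds."""
    utilization = io_rate_per_sec * service_time_sec
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_sec / (1.0 - utilization)


SERVICE_TIME = 0.010  # assumed 10 ms average service time for the physical disk

for io_rate in (10, 33, 50, 80):
    response = mm1_response_time(SERVICE_TIME, io_rate)
    queue = response - SERVICE_TIME
    print(
        f"{io_rate:>3} IO/sec: utilization {io_rate * SERVICE_TIME:.0%}, "
        f"response {response * 1000:.1f} ms (queue {queue * 1000:.1f} ms)"
    )
```

Under these assumptions, a disk with a 10 ms service time reaches roughly a 15 ms response time at about one-third utilization, and queue time grows sharply as utilization approaches 100%. Real disk workloads rarely match the M/M/1 assumptions exactly, so treat the numbers as a starting point for reasoning, not a prediction.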

Another important consideration that affects your interpretation of the disk response time measurements is the block size of the requests. Is the application requesting small blocks of data scattered across wide areas of the disk? Or is it requesting large blocks that are located at locations that are mostly contiguous on the physical spindle? Or maybe it is an unholy mix of those two distinct IO workload profiles. Fortunately, the Avg Disk Bytes/Read and Avg Disk Bytes/Write counters report on the average size of the disk requests observed during the interval.

For example, if my machine is functioning as a video server, it should be reading video data off the disk in very large, contiguously-located chunks. The number of bytes per Read should be 128 KB or more. It takes considerably longer to transfer 128 KB on a disk read than it does to transfer 8 KB on an ordinary file system request. Moreover, streaming video is a disk throughput-oriented operation where I might care more about Read Bytes/sec than Avg. Disk Secs/Read.
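A back-of-the-envelope Python sketch of the block size effect follows. The 100 MB/sec sustained transfer rate is purely an assumed figure for illustration, and the calculation covers data transfer time only, ignoring seek and rotational delays. The throughput function simply multiplies the read rate by the average read size over the same interval.

```python
ASSUMED_TRANSFER_RATE = 100_000_000  # bytes/sec; a hypothetical sustained rate


def transfer_time_ms(block_bytes, bytes_per_sec=ASSUMED_TRANSFER_RATE):
    """Time to move one block across the interface, excluding seek and latency."""
    return block_bytes / bytes_per_sec * 1000.0


def read_throughput(reads_per_sec, avg_bytes_per_read):
    """Read bytes per second implied by the read rate and the average read size."""
    return reads_per_sec * avg_bytes_per_read


small = transfer_time_ms(8 * 1024)
large = transfer_time_ms(128 * 1024)
print(f"8 KB read:   {small:.3f} ms of transfer time")
print(f"128 KB read: {large:.3f} ms of transfer time")

# Hypothetical video server: 200 reads/sec averaging 128 KB each.
print(f"throughput: {read_throughput(200, 128 * 1024) / 1_000_000:.1f} MB/sec")
```

Whatever the actual transfer rate, the 128 KB read spends sixteen times as long moving data as the 8 KB read, which is why a response time threshold tuned for small requests can mislead on a streaming workload.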

IO rate. As mentioned above, there is additional context that I need to understand before I can interpret the measurement data correctly. For example, what is the IO rate? Basically,

· If the disk that is performing sub-optimally isn’t being hammered by the application I care about, do I still need to care?

The answer usually is, “Probably not.”

Many Windows Server machines are configured as dedicated application (IIS, Exchange, SQL Server, SharePoint, business logic middleware, etc.) servers, so you usually don’t have a problem associating the disk activity you observe with the application. On a dedicated application server, it is OK to assume that if the disk is active, it is active on behalf of the application I care about.

A corresponding IO rate threshold test is also useful to catch false positives that the simple disk response time rule diagnoses incorrectly, e.g.,

· If the disk is only being accessed once per second or less, I probably don’t care, unless the response time is off the charts, say, > 250 ms.

The key disk response time counters should always be viewed and interpreted in the context of the IO rate. My Rule of Thumb is I normally only get concerned about the disk response time when the IO rate to the disk in question exceeds 20-25 Disk Transfers/sec.
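Putting the response time test and the IO rate test together might look like the Python sketch below. The function name and parameter defaults are hypothetical, and one detail is an interpretation: the text gives an "off the charts" exception only for disks accessed once per second or less, while this sketch applies the same 250 ms exception to any disk below the busy-rate cutoff.

```python
def disk_concern(
    avg_sec_per_transfer,
    transfers_per_sec,
    latency_threshold=0.015,   # the 15 ms filtering threshold
    busy_rate=20.0,            # Disk Transfers/sec above which latency matters
    off_the_charts=0.250,      # 250 ms: worth a look even on a quiet disk
):
    """Return True when a disk's response time deserves attention."""
    if transfers_per_sec > busy_rate:
        return avg_sec_per_transfer > latency_threshold
    # Lightly used disk: only an extreme response time is interesting.
    return avg_sec_per_transfer > off_the_charts


# Hypothetical (response time in seconds, transfers/sec) observations.
observations = [
    (0.100, 1.0),    # slow but nearly idle
    (0.300, 1.0),    # off the charts, even though idle
    (0.030, 50.0),   # busy and slow
    (0.010, 50.0),   # busy and healthy
]
for latency, rate in observations:
    verdict = "concern" if disk_concern(latency, rate) else "ignore"
    print(f"{latency * 1000:.0f} ms at {rate:.0f} IO/sec: {verdict}")
```

Even in this form, the rule remains a filter: the cutoffs are judgment calls that depend on the device and the workload.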


To sum up, you have probably noticed that the simple threshold-based disk performance rule Grant saw me apply is starting to get quite a bit more complicated.

But knowing what to do next after the rule fires is also crucial. Knowing what can be done to resolve a disk performance problem that you have just diagnosed, what some of the hardware and configuration options are (and what the costs of those options are), and how to implement them with a minimum of disruption is quite important. These are things that Grant’s rule discussion merely touches upon. For instance, good luck contacting your beleaguered SAN administrators without being able to marshal a very complete analysis of the problem that you found, along with some suggestions for fixing it. And simply laying the problem at the feet of your disk vendor seldom brings concerted action unless you are both very well-informed about their hardware and very persistent. (Demonstrating that you have the clout to green light a large check on their behalf will also get their attention.)

I will take up the topic of what to do next (but only briefly, I promise) after the rule fires in the next post.

Performance By Design
