Thursday, May 10, 2012

Measuring application response time using the Scenario instrumentation library.

This blog post describes the Scenario instrumentation library, a simple but useful tool for generating response time measurements from inside a Windows application. The Scenario instrumentation library uses QPC() and QTCT(), the Windows APIs discussed in an earlier blog entry, to gather elapsed times and CPU times between two explicit application-designated code markers. The application response time measurements are then written as ETW events that you can readily gather and analyze.
 You can download a copy of the Scenario instrumentation library here at http://archive.msdn.microsoft.com/Scenario.
The Scenario class library was originally conceived as a .NET Framework-flavored version of the Application Response Measurement (ARM) standard, which was accepted and sponsored by the Open Group. The idea behind ARM was that adding application response time measurements to an application in a standardized way would promote 3rd party tool development. This was moderately successful for a spell, ARM even developed some momentum, and was adopted by a number of IT organizations. Some major management tool vendors, including IBM's Tivoli suite and HP OpenView, supported ARM measurements in their tools.
In the Microsoft world, however, the ARM standard itself never stirred much interest, and application response time measurements are conspicuously absent from the performance counters supplied by many Microsoft-built client and server applications. However, many of these applications are extensively instrumented and can report response time measurements using ETW, which is one of the many reasons that something ARM-like for Windows should leverage ETW.  The Scenario instrumentation library tries to satisfy the same set of requirements as an ARM-compliant tool, but tailored to the Windows platform.
The topic is very much on my mind at the moment – thus this blog post – because I am working on a new tool for reporting on web application response times under Windows using ETW events generated by the HttpServer component of Windows (better known as IIS) and the TCP/IP networking stack. One of the early users of the Beta version of the tool also wanted a way to track application Scenario markers, so I am currently adding that capability. I expect to have an initial version of this reporting tool that will also support web applications instrumented with the Scenario library available next month, so…
Why measure application response times?
There are several very good reasons for instrumenting applications so that they can gather response time measurements. If you are at all serious about performance, it is well-nigh impossible to do good work without access to application response time measurements. If you don’t have a way to quantify empirically what is “good” response time and compare it to periods of “bad” response time, let’s face it, you are operating in the dark. Performance analysis begins with measurement data, and you cannot manage what you can’t measure.
Application response time measurements are important for two main reasons. The first is that application response time measurements correlate with customer satisfaction. In survey after survey of customers, performance concerns usually rank just below reliability (i.e., bugs and other defects) as the factor most influential in forming either a positive or negative attitude towards an application. They are a critical aspect of software quality that you can measure and quantify.
In performance analysis, application response time measurements are also essential to apply any of the important analytic techniques that people have developed over the years for improving application response time. These techniques include using Queuing Theory and related mathematical techniques used by capacity planners to predict response time in the face of growing workloads and changing hardware. Any form of optimization or tuning you want to apply to your application also needs to be grounded – how can you know if this or that optimization leads to an improvement if you are not measuring response times, both before and after. Even knowing which aspect of the application’s response time to target a tuning effort on requires measurements that allow you to break the response times you observe into their component parts – CPU, IO, network, etc., an analysis technique known as response time decomposition.
So, for these and other reasons, application response time measurements are extremely important. Which is why it is especially annoying to discover that application response time measurements are largely absent from the standard Windows performance counter measurements that are available for both native C++ and managed .NET Framework apps in Windows. The Scenario instrumentation library helps to address this gap in a standard fashion, similar to the ARM standard, and likewise enables the development of 3rd party reporting tools.
Tips for gettng your applications instrumented.
Adding ARM-like instrumentation to an application inevitably raises some concerns. The prime concern is that adding the library calls means you have to open up the code and modify it. Even if the instrumentation API is easy to use – and the Scenario class library is very simple – modifying the code is risky, riskier than doing nothing. It needs to be performed by someone who knows the code and will add the instrumentation carefully. A reluctance to open up the code and expose it to additional risk is usually the big initial obstacle organizations face when it comes to adding instrumentation – it is one more thing on the ToDo list that has to be coded and tested, and one more thing that can go wrong.
The best approach to overcoming this obstacle is to line up executive support for the initiative. Let’s face it, your IT management will appreciate receiving regular service level reports that accurate reflect application usage and response time. We all want satisfied customers, and meeting service objectives associated with availability, reliability and performance is highly correlated with customer satisfaction. Application response time data is critical information for IT decision makers.
The 2nd obstacle, which is actually the more serious one, is that someone has to figure out what application scenarios to instrument. In practice, this is not something that is technically difficult. It just requires some thought from the people who designed the application and care about its quality, and perhaps some input from the folks that use the app to understand what pieces of it they rely on the most. Technical input is also required at various points in the process of coming up with the scenario definitions – decisions about what scenarios to instrument need to be made in light of any technical obstacles that can arise.
Let me illustrate one of the technical considerations that will arise when you first consider instrumenting an application to report response times. You will discover rather quickly that reporting response times alone apart from some explanatory context leads to problems in interpretation. Let me illustrate with an example from a well-known and well-instrumented application you are probably familiar with – Google Search. At the top of the panel where Google displays Search results is a response time measurement. For example, I just typed in “search engines” and Google Search returns the information that it knows of 264,000,000 “results” that match that search criteria. Google then reports that it required all of 0.25 seconds to find that information for me and report back. The 250 milliseconds measure is the response time and the 264 million results is the context needed to interpret whether that response time is adequate.
When you instrument your application using the Scenario class library, you have two additional fields that can be used to report context, a 64-bit integer Property called Size and string Property called Category. If you were instrumenting Google Search using calls to the Scenario class, you would set the Size Property to the 260 million results value and place the search keywords into the Category string. The Size and Category Properties that are reported alongside the response time measurement provide context to assist you in interpreting whether or not the response time result the application supplied was adequate in that particular instance.
So, one final aspect of instrumenting your application to consider is what additional context you want to supply that will aid you in interpreting the response time measurements after the fact. The contextual data that is usually most helpful is associated with what I like to call the application’s scalability model. The application’s scalability is your theory about the factors that have the most influence over the application’s response time. If, for example, you think that how many rows in the database the application must process has something to do with how long it takes to compute a result and generate a response message, that conjecture reflects a simple scalability model
                f(x) = y * rowsn  
Populating the Scenario object’s Size and Category Properties from data relevant to the application’s scalability model helps provide the context necessary for interpreting whether a specific response time measurement that was reported is adequate or not.
It may be challenging to squeeze all the data relevant to the application’s scalability model into the limited space the Scenario class provides for customization. In practice, many adopters turn the Category string into a composite value. That’s something I often resort to myself. Consequently, in my reporting program I support a standard method for packing data into the Category string value, which is automatically broken out in the reporting process. Worst case is that you will never be able to shoehorn all the contextual data needed into the Scenario class’s simple Size and Category Property fields. When that happens, you will need to develop your own instrumentation class – and your own reporting.
Software performance engineering and application development lifecycle
Given how fundamentally important application response time measurements to any worthwhile program for maintaining and improving software quality, it is worth thinking a bit about why this critical data is so often missing. Let’s consider the full application development life cycle – from gathering requirements, to design, coding, testing, acceptance and stress testing, to deployment and ongoing “maintenance” activities. Too often, the application’s performance requirements are relegated to a narrow window somewhere towards the end of the development process but presenting a significant hurdle during acceptance testing. When performance testing is concentrated in this fashion late in the acceptance testing stage, this positioning is almost guaranteed to cause resentment among the hard-pressed software developer staff way. Performance requirements should actually be set early in the requirements process as scenarios targeted for development are being defined. Provisioning the application so that it can report on the response times of those key scenarios emphasizes performance considerations at every stage of the application development lifecycle.
In principle, the application scenarios are specified early in the development life cycle, and early in the cycle is also the best time to begin thinking about response time requirements. In the software development methodologies that are in fashion, application performance is usually viewed as a “non-functional” requirement, one that does not get a lot of attention. This is all wrong, of course. As one of my colleagues was fond of telling our CIO, performance isn’t a coat of paint you can slap on an application after it is done that will beautify it. Decisions about how to structure the application made during the design stage often determine what application response times are even achievable.
On the contrary, performance needs to be considered at every stage of application development. Performance goals for key application scenarios need to be defined early in the design stage. Instrumentation to measure application response time allows developers to assess their progress accurately in meeting these goals at every stage of the process. Instrumentation embedded in the application also aids in performance testing. My experience is that with the right level of instrumentation, every functional test can also serve as a performance test.
While application response time measurements are the closest we can get to understanding and quantifying the end user experience, it is worth noting that the correlation between response time and customer satisfaction is typically not a simple linear relationship. Human beings are a little more complicated than that. If you want a set of relatively simple guidelines to help you decide what response times are good for your application and which are bad, I recommend Steve Seow’s book on the subject “Designing and Engineering Time: The Psychology of Time Perception in Software.” The guidelines in Dr. Seow’s book are grounded in the relevant Human-Computer Interaction (HCI) psychological research, but it is not a dry, academic discussion.
Steve’s book also promotes a set of prescriptive techniques for engineering a better user experience whenever the processing requirements associated with a request are too demanding to produce a prompt application response. For instance, by managing a customer’s expectations about how long some longer running operation is potentially going to take, you can engineer a highly satisfying solution without blistering response times. Showing a progress bar that accurately and reliably reflects the application’s forward progress and providing ways for a customer to cancel out of a long running task that ties up her machine are two very effective ways to create a positive experience for your customers when it is just not possible to complete the computing task at hand quickly enough.
Using the Scenario instrumentation library.
The Scenario instrumentation class library provides a convenient way for a developer to indicate in the application the beginning and end of a particular usage scenario. Internally, the Scenario instance uses an ExtendedStopwatch() object to gather both wall clock time (using QueryPerformanceCounter) and the CPU ticks from QTCT() for the Scenario when it executes. The Scenario class can then output these measurements in an ETW trace event record that renders for posterity the elapsed time and CPU time of the designated block of code.
The Scenario class wraps calls to an internal ExtendedStopwatch object that returns both the elapsed time and CPU time of a demarcated application scenario. Once a Scenario object is instantiated by an application, calls to Scenario.Begin() and Scenario.End() are used to mark the beginning and end of a specific application scenario. After the Scenario.End() method executes, the program can access the object’s Elapsed and ElapsedCpu time properties. In addition, the Scenario.Begin() and Scenario.End() methods generate ETW events that can be post-processed. The payload of the ETW trace event that is issued by the Scenario.End() method reports the elapsed time and CPU time measurements that are generated internally by the class.
To support more elaborate application response time monitoring scenarios, there is a Scenario.Step method that provides intermediate CPU and wall clock timings. The Scenario class also provides a correlation ID for use in grouping logically related requests. Nested parent-child relationships among scenarios that are part of larger scenarios are also explicitly supported. For details, see the API and ETW payload documentation pages on the MSDN archive.
I developed an earlier version of the Scenario instrumentation library in conjunction with several product development teams in the Developer Tools Division when I was at Microsoft. An early adopter was a product team attempting to build a next-generation C++ compiler. The test case they were looking to optimize was the commercial Windows Build, a set of very demanding and long running compiler jobs. This product team gave the original Scenario instrumentation library quite a stress test, and I added several features to the library to make it more efficient and effective in that challenging environment.
Subsequently, what was effectively the Scenario instrumentation library version 2.0 was included in the commercial version of Visual Studio 2010, with instrumentation added for several newly developed components that shipped in the product. The Visual Studio 2010 flavor of the Scenario instrumentation library is known as the MeasurementBlock class. If you have Visual Studio 2010 installed, you can incorporate this into your application by referencing the Microsoft.VisualStudio.Diagnostics.Measurement.dll that is located at C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE. After adding a Reference to Microsoft.VisualStudio.Diagnostics.Measurement.dll, you can verify that the semantics of the MeasurementBlock class are functionally identical to the published Scenario library using the Object Browser applet in Visual Studio, as illustrated in the screen shot shown in Figure 1 below.

Figure 1. MeasurementBlock class imported from Microsoft.VisualStudio.Diagnostics.Measurement.dll as viewed in the Visual Studio Object Browser.
  
You will note some cosmetic differences, though. The eponymous Mark() method in the original implementation was renamed to Step() in the subsequent Scenario library. (Ever since I found it what it meant a few years back, I have always wanted to use the word “eponymous” in a sentence! J) MeasurementBlock is uses a different ETW Provider GUID; we didn't want developers adding Scenario markers to their apps suddenly seeing the VS MeasurementBlock events, too, when they enabled the ETW Provider.
Using the Scenario library in your application is straightforward and is documented here, but I will provide some additional coding guidance for use in interfacing to the new performance tool I am working on in the next blog entry.


Monday, April 23, 2012

Presenting two sessions at the upcoming UKCMG meeting in Oxford, England on May 14-15.

Some news that regular readers of this blog might be interested in hearing about...

I plan to present two sessions at the upcoming UKCMG annual conference, which is being held this year on May 14 & 15 at the Oxford Belfry on the outskirts of Oxford, England.
The first presentation is a repeat performance of the one I gave at the US CMG in December, a paper entitled  “Measuring Processor Utilization in Windows and Windows applications,” essentially pulling together the series of blog entries I have been posting here, beginning with the first installment, but with a good deal more material than I have gotten around to posting to the blog.
For instance, the last blog post discussing the high resolution clocks and timer facilities in Windows leads directly to a consideration of what happens to the various CPU utilization measurements when Windows is running as virtual guest under VMware or Hyper-V. That discussion is in the paper, but, unfortunately, hasn’t made it to the blog yet.
But you can download the full paper from my company’s web site here.
It is shameful to admit that the full paper has been available since December. Inept as I am at blogging, I had not alerted you blog readers about its availability. Unfortunately, and it will be forever thus, or at least until I retire from my day job, self-publishing on this blog takes a back seat to work that actually pays the bills around here.
(I will resist the temptation to go off on a rant here about the idiotic and naïve notion expounded by fanatical proponents of Open Source technology that information should be free. That’s a wonderful ideal state, of course, but flies in the face of the economics of information gathering, production, storage and dissemination, which has real costs associated with it. Even in the digital age, which has revolutionized the costs associated with information storage and dissemination, these costs remain and they are considerable. My contrarian view is that no one, other than gods and saints, in possession of potentially valuable information is apt to give it away for free under our system of capitalism, but that is another topic entirely.)
Workshop in Web Application Performance and Tuning.
The second session is an extended workshop on web application performance. It is focused on Windows technology (IIS, ASP.NET, AJAX, etc.), but many of the tools and techniques discussed are directly applicable to other web hosting platforms.
The workshop is based on a course that I used to give in-house back in Microsoft to the developers working on various Microsoft web-based applications. While I have published very little on this topic over the years, it has actually been the focus of much of my software development work over the past five years or so. I do expect to start publishing something soon on the subject, especially as I am in the late stages of developing a new software tool aimed squarely at Microsoft web application performance.
Reading between the lines of some of my recent blog postings that are ETW-oriented, including the CPU measurement series, you would be correct in guessing that the new tool attempts to leverage ETW trace events, specifically, in this case, the events that instrument the Microsoft IIS web server and the TCP/IP networking stack. This new trace analysis tool also correlates these system-oriented trace events from various Windows components with events issued from inside application scenarios instrumented using the Scenario class library (a free component, currently posted in the MSDN Archive here).
Instrumenting your application for performance monitoring is a crucial step, and that is where the Scenario class library comes in. Originally, I conceived of the Scenario instrumentation library as a .NET flavor of the open source Application Response Measuriment (ARM) initiative that was championed by both HP and IBM (and supported by the US CMG, where I was the ARM Committee liaison for many years). Soon after I arrived at Microsoft, it quickly became apparent that I needed to adapt my original conception to leverage ETW tracing technology, which had the considerable weight of the Windows Fundamentals organization behind it.
In the workshop I explain how to use this application-oriented instrumentation as part of integrating software performance engineering best practices into the software development life cycle. This involves first setting performance goals around key application scenarios that you’ve identified, and then instrumenting those scenarios to determine whether or not the application as delivered for testing is actually capable of meeting those goals. The instrumentation can also safely be embedded in the application when it is ultimately deployed in production. This is fundamentally necessary to enable service level reporting and verify, for example, that the app is meeting its stated performance objectives. Most ARM advocates concentrate on monitoring application performance in production, but tend to neglect the crucial earlier stages of application development where it is important to bake goal-oriented, performance monitoring in at the outset.
The new Windows performance tool is currently in a very limited beta release, and contrary to the negative views I expressed in my earlier aside -- not a rant -- about information being free, we are looking at some sort of freebie distribution of the initial “commercial” version of the tool to allow you guys to explore the technology and see what it can do for you.
So, if you happen to be in the neighborhood of Oxford, England next month, you can hear & see more about this initiative. In the meantime, stayed tuned to this space, where I will try to do a better job keeping you posted as we make progress in this area.

Wednesday, April 11, 2012

The mystery of the scalability of a CPU-intensive application on an Intel multi-core processor with HyperThreading (HT).

This blog entry comes from answering the mail (note: names and other incriminating identification data were redacted to protect the guilty):
A Reader writes:
“My name is … and I am a developer for a software company. I encountered something I didn’t understand profiling our software and stumbled across your blog. Listen, hot-shot, since you are such an expert on this subject, maybe you can figure this out.
Our software is very CPU-intensive, so we use QueryPerformanceCounter() and QueryPerformanceFrequency() to get the CPU cycles. Hence, if someone mistakenly commits some change to slow it down, we can catch it. It works fine until one day we decided to fire up more than one instance of the application. Since a large part of our code is sequential and could not be parallelized, by firing up another instance, we can use the other cores that are idle when one core is executing the sequential code. We found the timing profile all messed up. The time difference can be 5x difference for part of the code. Apparently QueryPerformanceCounter() is counting wall time, not CPU cycles. BTW, I have a quad-core hyper-threaded i7 PC.
Then I wrote this small bit of code  (at the end of this email) to test QueryPerformanceCounter(), GetProcessTimes() function AND QueryProcessCycleTime() function. If I run the code solo (just one instance), we get pretty consistent numbers.
However, if I run 10-instances on my Intel 4-way with HyperThreading enabled (8-logical processors) machine, all three methods (QueryPerformanceCounter(),GetProcessTimes() and QueryProcessCycleTime() report random numbers.


# of Concurrent Processes
1
10
QueryPerformanceCounter time
21.5567753
39.6047058
GetProcessTimes (User)
21.4501375
39.4214527
QueryProcessCycleTime
77376526479
141329606436

Can you please help me understand what is going on here?
Signed,
   -- Dazed and confused.

Note that the QPC and GetProcessTimes are reported in seconds. The QPCT times are in cpu cycles, which are model-dependent. You can use the QueryProcessorFrequency() API call to get the processor speed in cycles/second. Or check the ~MHz field in the Registry as illustrated in Figure 1. In my experience, that will report a number similar to QFC().

Swallowing deeply, I humbly replied:
Dear Dazed,
I see that you are confused, but there is a very good explanation for what you are seeing.
If I understand this correctly, the app runs for 21 seconds in a standalone environment. When you run 10 processes concurrently, the app runs for 39 seconds.
You are correct, QueryPerformanceCounter() is counting wall clock time, not cycles. Any time that the thread is switched out between successive calls to QPC() would be included in the timing, which is why you obtained some unexpected results. You can expect thread pre-emption to occur when you have 10 long-running threads Ready to run on your 8-way machine.
But I think there is something else going on in this case, too. Basically, you are seeing two multiprocessor scalability effects – one due to HyperThreading and one due to more conventional processor queuing. The best way to untangle these is to perform the following set of test runs:
  1. run standalone
  2. run two concurrent processes
  3. run four concurrent processes
  4. run 8 concurrent processes
  5. then, keep increasing the # of concurrent processes and see if a pattern emerges.
Using GetProcessTimes is not very precise -- for all the reasons discussed on my blog, it is based on sampling – but for this test of a long running processing (~ 21 seconds), the precision is sufficient. Since the process is CPU-intensive, this measurement is consistent with elapsed time as measured using QPC, but that is just pure luck, because, as it happens, the program you are testing does not incur any significant IO wait time.  My recommendation is to try calling QueryThreadCycleTime instead; it is a better bet, but it is not foolproof either (as I have discussed on my blog). Actually, I recommend you instrument the code using the Scenario class library that we put up on Code Gallery sometime back, see http://archive.msdn.microsoft.com/Scenario. It calls both QPC and QTCT in-line during thread execution.

I personally don't have a lot of direct experience with QueryProcessCycleTime, but my understanding of how it works could make it problematic when you are calling it from inside a multi-threaded app. If your app runs mainly single-threaded, it should report CPU timings similar to QTCT.
 


The Mystery Solved. 
Within a few days, the intrepid Reader supplied some additional measurement data, per my instructions. After instrumenting the program using the Scenario class, he obtained measurements from executing 2, 4, 6, 8, etc. processes in parallel, up to running 32 processes in parallel on this machine. Figure 2 summarizes the results he obtained:



Figure 2. Elapsed times and CPU times as a function of the number of concurrent processes. The machine environment is a quad-core Intel i7 processor with HT enabled (the OS sees 8 logical processors).
(Note that I truncated the results at 24 concurrent processes to create this graphic to focus on left side of the chart. Beyond 24 processes, both line graphs continue consistently along the same pattern indicated. CPU time remains flat and elapsed time nues to scale linearly.)

To be clear about the measurements being reported, the Elapsed property of the Scenario object is the delta between a QPC() clock timer issued at Scenario.Begin() and one issued at Scenario.End(). It is reported in standard Windows 100 nanosecond timer ticks. The ElapsedCpu property is the delta between two calls to QTCT, made immediately after the QPC call in the Scenario.Begin method and just before Scenario.End calls QPC. It is also reported in standard Windows timer ticks. Respectively, they are the elapsed (wall clock time) time and the CPU thread execution time for the thread calling into the embedded Scenario instrumentation library.
 
Looking at the runs for two and four concurrent processes, we see both elapsed time and CPU time holding constant. In terms of parallel scalability, this application running on this machine with 4 physical processor cores scales linearly for up to four concurrent processes.
We also see that for up to 8 concurrent processes, elapsed time and CPU time are essentially identical.
For more than 8 processes, the CPU time curve flattens out, while elapsed time of the application increases linearly with the number of concurrent processes. Running more processes than logical processors does nothing to increase throughput; it simply creates conventional processor queuing delays. The fact that the workload scales linearly up to 32 concurrent processes is actually a very positive result. The queuing delays aren’t exponential, there is no evidence of locking or other blocking that would impede scalability. These results suggest that if you had a 32-way (physical processor) machine, you could run 32 of these processes in parallel very efficiently.
The HT effects are confined to the area of the elapsed and CPU time curves between 4 and 8 concurrent processes. Notice with six processes running concurrently, CPU time elongates only slightly – the benefit of the HT hardware boost is evident here. However, running eight concurrent processes leads to a sharp spike in CPU time – this is a sign that 8 concurrent processes saturate the four processor cores. The HT feature no longer effectively increases instruction execution throughput.
The CPU time curve flattening out after 8 concurrent processes suggests that at least some of that CPU time spike at 8 concurrent processes is due to contention for shared resources internal to the physical processor core from threads running in parallel on logical (HT) processors.
HT is especially tricky because the interaction between threads contending for shared resources internal to the physical processor core is very workload dependent. Running separate processes in parallel eliminates most of the parallel scalability issues associated with “false” sharing of shared cache lines because each process is running in its own dedicated block of virtual memory. What is particularly nefarious about concurrently executing threads “false sharing” of cache lines is that it subjects executing to time-consuming delays due to writing back updated cache lines to RAM and re-fetching them. In theory, these delays can be minimized by changing the way you allocate your data structures. In practice, you do not have a lot of control over how memory is allocated if you are an application developer using the .NET Framework. On the other hand, if you are writing a device driver in Windows and working in C++, false sharing in the processor data cache is something you need to be careful to address.
I trust this further exploration and explanation will prove helpful, and I look forward to seeing a check in the mail from your company to me for helping to sweep aside this confusion.
Signed,
   -- Your obedient servant, etc., etc.

Wednesday, March 7, 2012

Using QueryThreadCycleTime to access CPU execution timing

As a prelude to a discussion of the Scenario instrumentation library, I mentioned in the previous post that a good understanding of the clocks and timers available to the Windows developer would likely prove helpful. Timekeeping in Windows is an area fraught with confusion, and I confess, it was not something I understand well enough at the time I wrote either of my Windows performance books.
The previous post in this series focused on the Windows QueryPerformanceCounter API that is a wrapper around the hardware rdtsc (Read Time-Stamp Counter) instruction. QPC() provides access to a high-resolution timer that is frequently used in performance monitoring (see http://support.microsoft.com/kb/172338, for example). It is also useful in a variety of other measurement contexts. That then led to a brief discussion of the vagaries involved in using the hardware rdtsc instruction itself. Windows 7 attempts to utilize rdtsc where it is appropriate, based on repeatedly measuring the execution time of a routine during system start-up to determine if the execution times are reliably stable across power management events. On older PC hardware, power managements changes that slowed the machine execution cycle time were reflected in the frequency of TSC ticks. On current Intel hardware, however, the frequency of TSC ticks is constant across power management events, one of which is dynamic over-clocking, something Intel brands as its PowerBoost technology.
If all this talk of hardware clocks and timing instructions strikes you as a bit on the esoteric side or unnecessarily complicated, I'd like to assure you that it is actually a fundamental topic in computer performance. In Dr. Neil Gunther's excellent introduction to computer performance modeling, The Practical Performance Analyst, one of the early foundation chapters contains an excellent discussion of the importance of time measurements in the field. (For reference, Neil's often entertaining computer capacity planning blog is here.)
Prior to stumbling into the field of computer capacity planning, Neil's academic background was in astrophysics, so an interest in the subject of time comes naturally to him. The fundamental insight of Einstein's famous Special Theory of Relativity was that time measurements are necessarily relative to the position of the observer. Relativity placed the results of the famous Michelson-Morley experiment published in 1887 in a radically new context that revolutionized 20th century physics. The Michelson-Morley experiment showed that measurements of the speed of light were invariant relative to the motion of the earth through space. This was a surprising result that the astrophysics of the day did not predict. Crucial to interpreting the experimental data was the ingenious instrumentation rig Michelson and Morley constructed, the Michelson interferometer, to measure the speed of light at an unprecedented level of accuracy and precision.
Making the connection to computer performance a little less obscurely, the lifetime achievement award granted by the Computer Measurement Group, of which Dr. Gunther is a recent recipient, is named in honor of Professor Michelson. 
It is interesting to note that a related time measurement phenomenon confounding physics in the late 19th century was the problem of clock synchronization across large geographical distances. This problem arose in the context of coordinating intercontinental rail travel. After trying to synchronize the clocks in the French railway system using an electrical signaling apparatus, the famous French physicist and mathematician Henri Poincare was reluctantly led to conclude that a naive conception of the absolute simultaneity of two events occurring in different places at the same time was a worthless concept, given that the speed of light itself was finite. 
(The conceptual link between these two time measurement problems in 19th century physics is the subject of a recent popular science book entitled "Einstein's Clocks, Poincare's Maps." However, it is not that good a popular science book, IMHO, because the author isn't that good at explaining the linkage. But, the book does quite possibly engender the longest and most convoluted book review I have ever encountered on the Amazon web site. It is from John Ryskamp, evidently an amateur mathematician. So if you crave even more diversion than I can supply directly -- after all, if you are reading this, you are probably just wasting time grazing on the Internet to begin with -- the Ryskamp book review is here,  and to understand the intellectual baggage Ryskamp carries to his reading of the otherwise unassuming Einstein/Poincare book, try this article of his that he references. Fortunately for me -- time-wasting-wise, that is -- I feel quite fortunate that Ryskamp is not a more prolific Amazon reviewer. He is basically Charles Kinbote manifest in the flesh -- and you will have to work out that obscure reference on your own.)
And, finally, to bring this digression full circle back to computer measurement, the problem of synchronizing clocks across large distances reasserted itself in the early days of the development of the Internet Protocol (IP) for use in Wide Area Networking (WAN). The original IP header (version 4) Time-to-Live field had to be reinterpreted as a hop count when the search for a way to synchronize clocks across IP routers arrayed around the world proved futile (once again !). The Transmission Control Protocol (TCP) that sits atop IP in the network stack resorts instead to measuring the Round Trip Time (RTT) of Send requests relative to the sender. RTT, of course, is an important measurement used in TCP to anchor network congestion avoidance strategies.
And now back to Windows clocks and timers...
Beginning in Windows 6 (which refers to both Vista and Windows Server 2008), there is a new, event-driven mechanism for measuring processor utilization at the thread level. This measurement facility relies on the OS Scheduler issuing an rdtsc instruction at the beginning and end of each thread execution dispatch. By accumulating these CPU time measurements at the thread and process level each time a context switch occurs, the OS can maintain an accurate running total of the amount of time on the processor an executing thread consumes. Application programs can then access these accumulated CPU time measurements by calling a new API, QueryThreadCycleTime() and specifying a Thread Id.
QueryThreadCycleTime(), or QTCT, provides measurements of CPU time gathered by issuing an rdtsc instruction each time a context switch occurs. Using the same mechanism, the OS Scheduler also keeps track of processor idle time, which can be retrieved by calling either QueryIdleProcessorCycleTime or QueryIdleProcessorCycleTimeEx(), depending on whether multiple processor groups are defined in Windows 7. (Overall CPU utilization is calculated from Idle time for any given interval by subtracting Idle time from the interval duration:
CPU % Busy =  (IntervalDuration – Idle Time) * 100 /  IntervalDuration )
With QTCT, CPU time is kept in units of clock ticks, which is model-dependent. So, you also need to call QueryPerformanceFrequency and get the clock frequency to transform processor cycles into wall clock time.
The potential significance of the Windows OS thread scheduler being instrumented cannot be under-estimated, but using the rdtsc instruction isn’t quite as straightforward as computer measurement people would like. The OS Scheduler, at least, handles some of the vagaries automatically. Potential clock drift across processors isn't a problem, for instance. The CSwitch event that signals the beginning of a thread execution time interval and the CSwitch event that terminates it both occur on the same (logical) CPU. 
However, if you are running on an older Intel processor that does not have a constant tick rate across power management states, a thread's cycle time is potentially going to be a very difficult number to interpret. That is because on older CPUs, whenever there is a power state change that changes the clock frequency, the frequency of the associated Time Stamp Counter (the TSC register) is adjusted in tandem. When a power management does occur, the OS Scheduler does not attempt to adjust for any change in the time between clock ticks. That means that on one of these machines, it is possible for QTCT() to return accumulated clock ticks for an interval in which one or more p-state changes have occurred such that clock ticks of different lengths are aggregated together. Obviously, this creates a problem in interpretation, but, of course, only to the extent that power management changes are actually occurring during thread execution, and it is a problem that only occurs on older hardware.
Given that set of concerns with an rdtsc-based measurement mechanism, QTCT() remains a major step forward in measuring CPU usage in Windows. Instrumenting the OS Scheduler directly to measure processor usage is the first step towards replacing the legacy sampling technique I discussed at the outset of the series of blog articles. It has all the advantages of accuracy and precision that accrue to an event-oriented methodology. Plus, the OS Scheduler issuing an rdtsc instruction in-line during a context switch is much more efficient than generating an ETW event that must be post-processed subsequently, the approach that xperf uses.
As of Window 7, the OS Scheduler measurements are only exposed through QTCT at the thread level. I suspect that the rdtsc anomalies due to the variable rate of the clock on older machines are probably one factor holding up wider adoption, while the scope of retrofitting all the services in & around the OS that currently rely on the sampling-based processor utilization data is probably another. I am curious to see if Windows 8 takes some of the obvious next steps to improve the CPU utilization measurements.
The QTCT() API that gives access to these timings at the thread level does have one other serious design limitation that I ran into. QTCT currently returns the number of processor cycles a thread has consumed up until the last time a context switch occurred. There is no method that allows a running thread to get an up-to-date, point-in-time measurement that includes the cycle time accumulated up to the present time, not just the last context switch. A serializing method along those lines would make QTCT suitable for explicitly accounting for CPU usage at a thread level between two phases in the execution of a Windows program.

Next: The Scenario class uses QPC and QTCT together to try to calculate both the elapsed time and CPU time consumed between two markers strategically inserted into your Windows application. In the next blog entry in this series, I will discuss the Scenario instrumentation library in more detail.

Thursday, March 1, 2012

High Resolution Clocks and Timers for Performance Measurement in Windows.

Within the discipline of software performance engineering (SPE), application response time monitoring refers to the capability of instrumenting application requests, transactions and other vital interaction scenarios in order to measure their response times. There is no single, more important performance measurement than application response time, especially in the degree which the consistency and length of application response time events reflect the user experience and relate to customer satisfaction. All the esoteric measurements of hardware utilization that Perfmon revels in pale by comparison.
Of course, performance engineers usually still want to be able to break down application response time into its component parts, one of which is CPU usage. Other than the Concurrency Visualizer that is packaged with the Visual Studio Profiler that was discussed in the previous post, there are few professional-grade, application response time monitoring and profiling tools that exploit the ETW facilities in Windows. As we have seen illustrated, the Concurrency Visualizer demonstrates the capability to reconstruct an application’s execution time profile from kernel trace events with great precision and detail. An application response time monitoring tool that generates ETW response time events that can then be correlated with the Windows OS thread scheduling events has potential value in performance investigations of many sorts.
Fortunately, there is such an application response time monitoring tool for both Windows C++ native and .NET applications that is integrated with ETW. It is called the Scenario class. (The name “Scenario” was suggested by a Habib Heydarian, an excellent Program Manager at Microsoft, and is a nod to the popular practice of scenario-based development, associated primarily with Agile software development methodologies. See, for example, Scenario-Based Design of Human-Computer Interactions, by John Carroll and User Stories Applied: For Agile Software Development by Mike Cohen.)

The Scenario instrumentation library is a free download available from the MSDN Code Gallery site here.

The Scenario class is an ETW wrapper built around an extended version of the .NET Stopwatch class. The standard .NET Stopwatch class in the System.Diagnostics Namespace is used to measure elapsed time in a .NET Framework application. The .NET Stopwatch class itself is a managed wrapper around the native Windows APIs called QueryPerformanceFrequency and QueryPerformanceCounter (QPC) that access a high resolution timer and are used to calculate elapsed time. A straightforward extension of the .NET Stopwatch class adds a call to QueryThreadCycleTime (QTCT), a Windows API that provides a measure of CPU usage at the thread level, beginning in Window version 6.
Prior to discussing the use of the application-oriented Scenario instrumentation library, however, we should first take a deeper look at the Windows APIs it utilizes, namely QueryPerformanceCounter (QPC), QueryPerformanceFrequency, and the newer QueryThreadCycleTime (QTCT). Using the Scenario library properly and interpreting the measurement data it produces will benefit from a deeper understanding of how these Windows APIs work.
QueryPerformanceCounter. The QueryPerformanceCounter and QueryPerformanceFrequency APIs were added to Windows beginning with Windows 2000 when the Performance Monitor developers noticed that the granularity of the Windows system clock was inadequate for measuring disk IO response time. As discussed in the first post in this series, the Windows system time of day clock contains time values that display precision to 100 nanosecond timer units. However, the current time of day clock value maintained by the OS is only updated once per quantum, effectively about once every 15.6 milliseconds in current versions of Windows. Using values from the system clock to time disk IOs in early versions of Windows NT, the Logical and Physical Disk\Avg Disk sec/Transfer counters in Perfmon often reported zero values whenever an IO operation started and ran to completion within a single tick of the system clock.
The solution in Windows 2000 was the addition of the QueryPerformanceCounter (QPC) and QueryPerformanceFrequency APIs. In Windows 2000, the QueryPerformanceCounter API returned a value from an rdtsc (Read TimeStampCounter) instruction. The TSC is a special hardware performance monitoring counter on Intel-compatible processors that is incremented once per processor clock cycle. On a processor clocked at 2 GHz, for example, one expects two TSC clock cycle ticks to occur every nanosecond. (As will be discussed further, elapsed time measurements that are based on successive rdtsc instructions are far less precise than the processor’s actual instruction execution cycle time. See, for example, this FAQ from the Intel Software Network for an official explanation from the CPU manufacturer.)

Issuing an rdtsc instruction on a processor clocked at 2 GHz returns a clock value considerably more precise than standard Windows timer values delineated in 100 nanosecond units. Since the processor clock speed is hardware-dependent, an additional API call, QueryPerformanceFrequency, was provided to supply the processor clock speed. Once you know the clock frequency, you can translate the output from successive rdtsc instructions into elapsed wall clock time.
When the QueryPerformanceCounter and QueryPerformanceFrequency APIs became generally available in Windows, they were rapidly adopted by other applications in need of a more precise timer facility than the Windows system clock. However, developers using QPC() soon began to notice discrepancies in time measurements taken using the rdtsc instruction due to the way in which the TSC was implemented in the hardware. There were two major discrepancies that were encountered using the rdtsc instruction on Intel-compatible processors, namely:
(1)    lack of synchronization of the TSC across processors, and
(2)    dynamic changes to the TSC clock update interval as a result of the processor entering a lower power state, slowing both the clock rate and the TSC update interval in tandem.
The effect of these TSC anomalies was quickly evident in the disk driver routine responsible for timing the duration of disk operations when running on a multiprocessor, which was especially ironic since QPC was originally built in order to measure disk IO operations accurately. It was possible on a multi-processor for a disk IO that was initiated on CPU 0 and completed on CPU 1 to retrieve a TSC value from CPU 1 for the IO completion that was before the TSC value retrieved when the IO operation was initiated on CPU 0. (The flavor of the difficulties in dealing with rdtsc instructions across multiple CPUs on multiprocessors is captured in this blog entry written by Microsoft SQL Server customer support representatives.)
Of lesser concern, the latency of an rdtsc instruction was surprisably larger than expected for a hardware instruction, on the order of several hundred clock cycles on older Intel hardware. That, and the fact that the rdtsc instruction does not serialize the machine, made QPC() unsuitable for accurately timing, say, micro-benchmarks of less than several thousand instructions.
To deal with the drift in TSC values across multiple processors, in Windows 6 (Vista and Windows Server 2008), the Windows QueryPerformanceCounter function was changed to use one of several external, high resolution timer facilities that are usually available on the machine. These include the High Precision Event Timer (HPET), often also referred to as the Multimedia Timer in Windows, and the external ACPI Power Management Timer, another high resolution timer facility that is independent of the main processor hardware. Because these timer facilities are external to the processors, they are capable of supplying uniform values that are consistent across CPUs. (At the same time, QueryPerformanceFrequency was also re-written to return the frequency of the external timer source.) This change effectively fixed the problems associated with accurate measurements of disk IO response time that were evident in Windows 2000 and Windows XP.
However, using an external timer in the QPC implementation does have one major disadvantage, namely addional latency. If you wrap rdtsc instructions around QPC() calls in Windows Vista, you can typically measure latency on the order of 800 nanoseconds to call the QPC API, or roughly 1 m-second per call. This latency is particularly problematic given how frequently QPC is called. In ETW tracing, for instance, QPC is the default timer routine that is used to generate the event timestamps. When gathering high volume events such as the CSwitch, ReadyThread, ISR and DPC, using QPC for timestamps in ETW generates significant overhead. If one is expecting, say, 20,000 ETW events to be generated per second per processor on a Vista or Windows Server 2008 machine, calling QPC that frequently adds about 2% additional CPU utilization per processor.
(Note: The clock facility used by default in ETW is QPC. Alternatively, one can specify either the low resolution system timer (an option that should only be used in rare instances where the low resolution 15.6 ms clock ticks suffice), or that an rdtsc instruction be issued directly. The rdtsc option is termed the “CPU cycle counter” clock resolution option. See this ETW documentation entry on the MSDN Library for details.)
For the record, ETW is hardly the only application that routinely relies on QPC-based measurements of elapsed time. When the ETW tracing infrastructure is used to gather OS kernel scheduling events, it is probably the most frequent caller of the API on the machine. Other frequent callers of QPC include the disk driver routines mentioned earlier that measure disk IO response time – reported in Perfmon as the Avg. Disk Secs/Transfer counters. These were re-written in Windows 2000 to use QPC. The TCP protocol, which needs to estimate the Round Trip Time (RTT) of packets sent to remote TCP sessions, utilizes QPC for high resolution timing, also. As mentioned earlier, the .NET Framework Stopwatch class allows an application to issue calls to QPC and QPF as an alternative to using the low resolution DateTime.Now() method that access the Windows system clock.


The hardware manufacturers were meanwhile at work making improvements in the TSC hardware to allow it to serve as an efficient and accurate elapsed time counter. Changing the behavior of the TSC when there was a power state change that adjusted the processor’s clock cycle time was the first fix. Newer machines now contain a TSC with a constant tick rate across power management states. (Which is what I like to call it, instead of what Intel officially calls an invariant TSC, terminology I find a little awkward). For details, see the Intel hardware manual entitled Intel® 64 and IA-32 Architectures, Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1. Section 16.9 of this manual discusses the Time-Stamp Counter on Intel hardware and its processor-dependent behavior. The manual reports, “Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.” [Emphasis added.] 
The second problem related to the TSC clocks not being synchronized across processors still exists, but the TSCs for all the processor cores resident on a single multi-core socket do run off the same underlying hardware source, at least. Clock drift across processor cores remains an issue on multi-socket NUMA hardware, but built-in NUMA node thread and interrupt affinity in Windows minimizes some of these concerns, while not eliminating them completely.
Finally, the hardware vendors also report that the latency of the rdtsc instruction has improved significantly in current hardware. Reportedly, the latency of an rdtsc instruction has improved by an order of magnitude on current hardware. Unfortunately, the hardware vendors are reluctant to disclose the specific latency of their rdtsc instructions. The rdtsc instruction still does not serialize the processor instruction execution engine, so rdtsc continues to return timer values that are subject to some amount of processor core "jitter," imprecision about instruction execution latency simply is a fact of like in superscalar, pipelined machines. The need to maintain a TSC with a constant tick rate across power state changes also results in some loss of precision in rdtsc return values, which is really no big deal since it affects just one or two of the least significant clock resolution timer bits.
The long latency associated with accessing an external clock facility combined, with the rdtsc hardware improvements described above, prompted another round of changes in QueryPerformanceCounter for Windows 7 (and Windows Server 2008 R2). During system initialization, Windows 7 attempts to figure out if the current hardware supports a TSC tick rate that is constant across power state changes. When Windows 7 determines that the processor’s TSC tick rate is constant, the QPC routine is set to issue rdtsc instructions. If it appears that the TSC is not invariant across processor core frequency changes, then QPC will be resolved as in Windows 6 by calling the machine’s external timer. In this fashion, QPC in Windows 7 automatically provides a well-behaved, high resolution hardware clock timer service that uses the low latency rdtsc instruction whenever it is well-behaved.
This ability in Windows 7 to resolve the QPC timer service dynamically based on the current hardware is the reason Windows application developers are advised to stay away from using rdtsc – unless you absolutely know what you are doing – and to use QPC instead. Coding an inline rdtsc instruction is still going to be faster than calling QPC to access the TSC, but I hope I have made clear that using rdtsc directly is not for the faint of heart. One especially has to be aware of older portables in my experience. In newer machines, however, my experience using rdtsc is very positive.
Another good reason to avoid issuing rdtsc instructions in your app is the portability of your app onto ARM machines going forward in Windows 8. On ARM-based tablets and phone, there are a slew of hardware dependencies that we are going to have to start looking out for -- which is a good topic for another set of blog posts!
Next time, the QueryThreadCycleTime API introduced in Windows 6 that is based on instrumenting the OS Scheduler using rdtsc instructions.