Skip to main content

The mystery of the scalability of a CPU-intensive application on an Intel multi-core processor with HyperThreading (HT).

This blog entry comes from answering the mail (note: names and other incriminating identification data were redacted to protect the guilty):
A Reader writes:
“My name is … and I am a developer for a software company. I encountered something I didn’t understand profiling our software and stumbled across your blog. Listen, hot-shot, since you are such an expert on this subject, maybe you can figure this out.
Our software is very CPU-intensive, so we use QueryPerformanceCounter() and QueryPerformanceFrequency() to get the CPU cycles. Hence, if someone mistakenly commits some change to slow it down, we can catch it. It works fine until one day we decided to fire up more than one instance of the application. Since a large part of our code is sequential and could not be parallelized, by firing up another instance, we can use the other cores that are idle when one core is executing the sequential code. We found the timing profile all messed up. The time difference can be 5x difference for part of the code. Apparently QueryPerformanceCounter() is counting wall time, not CPU cycles. BTW, I have a quad-core hyper-threaded i7 PC.
Then I wrote this small bit of code  (at the end of this email) to test QueryPerformanceCounter(), GetProcessTimes() function AND QueryProcessCycleTime() function. If I run the code solo (just one instance), we get pretty consistent numbers.
However, if I run 10-instances on my Intel 4-way with HyperThreading enabled (8-logical processors) machine, all three methods (QueryPerformanceCounter(),GetProcessTimes() and QueryProcessCycleTime() report random numbers.


# of Concurrent Processes
1
10
QueryPerformanceCounter time
21.5567753
39.6047058
GetProcessTimes (User)
21.4501375
39.4214527
QueryProcessCycleTime
77376526479
141329606436

Can you please help me understand what is going on here?
Signed,
   -- Dazed and confused.

Note that the QPC and GetProcessTimes are reported in seconds. The QPCT times are in cpu cycles, which are model-dependent. You can use the QueryProcessorFrequency() API call to get the processor speed in cycles/second. Or check the ~MHz field in the Registry as illustrated in Figure 1. In my experience, that will report a number similar to QFC().

Swallowing deeply, I humbly replied:
Dear Dazed,
I see that you are confused, but there is a very good explanation for what you are seeing.
If I understand this correctly, the app runs for 21 seconds in a standalone environment. When you run 10 processes concurrently, the app runs for 39 seconds.
You are correct, QueryPerformanceCounter() is counting wall clock time, not cycles. Any time that the thread is switched out between successive calls to QPC() would be included in the timing, which is why you obtained some unexpected results. You can expect thread pre-emption to occur when you have 10 long-running threads Ready to run on your 8-way machine.
But I think there is something else going on in this case, too. Basically, you are seeing two multiprocessor scalability effects – one due to HyperThreading and one due to more conventional processor queuing. The best way to untangle these is to perform the following set of test runs:
  1. run standalone
  2. run two concurrent processes
  3. run four concurrent processes
  4. run 8 concurrent processes
  5. then, keep increasing the # of concurrent processes and see if a pattern emerges.
Using GetProcessTimes is not very precise -- for all the reasons discussed on my blog, it is based on sampling – but for this test of a long running processing (~ 21 seconds), the precision is sufficient. Since the process is CPU-intensive, this measurement is consistent with elapsed time as measured using QPC, but that is just pure luck, because, as it happens, the program you are testing does not incur any significant IO wait time.  My recommendation is to try calling QueryThreadCycleTime instead; it is a better bet, but it is not foolproof either (as I have discussed on my blog). Actually, I recommend you instrument the code using the Scenario class library that we put up on Code Gallery sometime back, see http://archive.msdn.microsoft.com/Scenario. It calls both QPC and QTCT in-line during thread execution.

I personally don't have a lot of direct experience with QueryProcessCycleTime, but my understanding of how it works could make it problematic when you are calling it from inside a multi-threaded app. If your app runs mainly single-threaded, it should report CPU timings similar to QTCT.
 


The Mystery Solved. 
Within a few days, the intrepid Reader supplied some additional measurement data, per my instructions. After instrumenting the program using the Scenario class, he obtained measurements from executing 2, 4, 6, 8, etc. processes in parallel, up to running 32 processes in parallel on this machine. Figure 2 summarizes the results he obtained:



Figure 2. Elapsed times and CPU times as a function of the number of concurrent processes. The machine environment is a quad-core Intel i7 processor with HT enabled (the OS sees 8 logical processors).
(Note that I truncated the results at 24 concurrent processes to create this graphic to focus on left side of the chart. Beyond 24 processes, both line graphs continue consistently along the same pattern indicated. CPU time remains flat and elapsed time nues to scale linearly.)

To be clear about the measurements being reported, the Elapsed property of the Scenario object is the delta between a QPC() clock timer issued at Scenario.Begin() and one issued at Scenario.End(). It is reported in standard Windows 100 nanosecond timer ticks. The ElapsedCpu property is the delta between two calls to QTCT, made immediately after the QPC call in the Scenario.Begin method and just before Scenario.End calls QPC. It is also reported in standard Windows timer ticks. Respectively, they are the elapsed (wall clock time) time and the CPU thread execution time for the thread calling into the embedded Scenario instrumentation library.
 
Looking at the runs for two and four concurrent processes, we see both elapsed time and CPU time holding constant. In terms of parallel scalability, this application running on this machine with 4 physical processor cores scales linearly for up to four concurrent processes.
We also see that for up to 8 concurrent processes, elapsed time and CPU time are essentially identical.
For more than 8 processes, the CPU time curve flattens out, while elapsed time of the application increases linearly with the number of concurrent processes. Running more processes than logical processors does nothing to increase throughput; it simply creates conventional processor queuing delays. The fact that the workload scales linearly up to 32 concurrent processes is actually a very positive result. The queuing delays aren’t exponential, there is no evidence of locking or other blocking that would impede scalability. These results suggest that if you had a 32-way (physical processor) machine, you could run 32 of these processes in parallel very efficiently.
The HT effects are confined to the area of the elapsed and CPU time curves between 4 and 8 concurrent processes. Notice with six processes running concurrently, CPU time elongates only slightly – the benefit of the HT hardware boost is evident here. However, running eight concurrent processes leads to a sharp spike in CPU time – this is a sign that 8 concurrent processes saturate the four processor cores. The HT feature no longer effectively increases instruction execution throughput.
The CPU time curve flattening out after 8 concurrent processes suggests that at least some of that CPU time spike at 8 concurrent processes is due to contention for shared resources internal to the physical processor core from threads running in parallel on logical (HT) processors.
HT is especially tricky because the interaction between threads contending for shared resources internal to the physical processor core is very workload dependent. Running separate processes in parallel eliminates most of the parallel scalability issues associated with “false” sharing of shared cache lines because each process is running in its own dedicated block of virtual memory. What is particularly nefarious about concurrently executing threads “false sharing” of cache lines is that it subjects executing to time-consuming delays due to writing back updated cache lines to RAM and re-fetching them. In theory, these delays can be minimized by changing the way you allocate your data structures. In practice, you do not have a lot of control over how memory is allocated if you are an application developer using the .NET Framework. On the other hand, if you are writing a device driver in Windows and working in C++, false sharing in the processor data cache is something you need to be careful to address.
I trust this further exploration and explanation will prove helpful, and I look forward to seeing a check in the mail from your company to me for helping to sweep aside this confusion.
Signed,
   -- Your obedient servant, etc., etc.

Comments

  1. Classic post Mark Informative and funny - what more could a loyal reader want from a performance blog.

    ReplyDelete
  2. Is there another source of the Scenario object? The MSDN archive does not seem to be active any more.

    ReplyDelete
  3. Thanks, Kevin. Bummer. That link to the Scenario.dll no longer appears to be active.

    If you e-mail me directly, I can send you a current version of the class library.

    You can also access the MeasurementBlock class, functionally similar, found in Microsoft.VisualStudio.Diagnostics.Measurement.dll, which you should be able to find in the VS \Common\IDE subdirectory. VS uses a 32-bit version of the library.

    Let me know if either works for you.

    ReplyDelete

Post a Comment

Popular posts from this blog

High Resolution Clocks and Timers for Performance Measurement in Windows.

Within the discipline of software performance engineering (SPE), application response time monitoring refers to the capability of instrumenting application requests, transactions and other vital interaction scenarios in order to measure their response times. There is no single, more important performance measurement than application response time, especially in the degree which the consistency and length of application response time events reflect the user experience and relate to customer satisfaction. All the esoteric measurements of hardware utilization that Perfmon revels in pale by comparison. Of course, performance engineers usually still want to be able to break down application response time into its component parts, one of which is CPU usage. Other than the Concurrency Visualizer that is packaged with the Visual Studio Profiler that was discussed in the previous post, there are few professional-grade, application response time monitoring and profiling tools that exploit the …

Why is my web app running slowly? -- Part 1.

This series of blog posts picks up on a topic I made mention of earlier, namely scalability models, where I wrote about how implicit models of application scalability often impact the kinds of performance tests that are devised to evaluate the performance of an application. As discussed in that earlier blog post, sometimes the influence of the underlying scalability model is subtle, often because the scalability model itself is implicit. In the context of performance testing, my experience is that it can be very useful to render the application’s performance and scalability model explicitly. At the very least, making your assumptions explicit opens them to scrutiny, allowing questions to be asked about their validity, for example.
The example I used in that earlier discussion was the scalability model implicit when employing stress test tools like HP LoadRunner and Soasta CloudTest against a web-based application. Load testing by successively increasing the arrival rate of customer r…

Virtual memory management in VMware: memory ballooning

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.


Ballooning
Ballooning is a complicated topic, so bear with me if this post is much longer than the previous ones in this series.

As described earlier, VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate” when it begins to encounter contention for machine memory, defined as the amount of free machine memory available for new guest machine allocation requests dropping below 6%. In the benchmark example I am discussing here, the Memory Usage counter rose to 98% allocation levels and remained there for duration of the test while all four virtual guest machines were active.

Figure 7, which shows the guest machine Memory Granted counter for each guest, with an overlay showing the value of the Memory State counter reported at the end of each one-minute measurement interval, should help to clarify the state of VMware memory-managemen…