Skip to main content

The mystery of the scalability of a CPU-intensive application on an Intel multi-core processor with HyperThreading (HT).

This blog entry comes from answering the mail (note: names and other incriminating identification data were redacted to protect the guilty):
A Reader writes:
“My name is … and I am a developer for a software company. I encountered something I didn’t understand profiling our software and stumbled across your blog. Listen, hot-shot, since you are such an expert on this subject, maybe you can figure this out.
Our software is very CPU-intensive, so we use QueryPerformanceCounter() and QueryPerformanceFrequency() to get the CPU cycles. Hence, if someone mistakenly commits some change to slow it down, we can catch it. It works fine until one day we decided to fire up more than one instance of the application. Since a large part of our code is sequential and could not be parallelized, by firing up another instance, we can use the other cores that are idle when one core is executing the sequential code. We found the timing profile all messed up. The time difference can be 5x difference for part of the code. Apparently QueryPerformanceCounter() is counting wall time, not CPU cycles. BTW, I have a quad-core hyper-threaded i7 PC.
Then I wrote this small bit of code  (at the end of this email) to test QueryPerformanceCounter(), GetProcessTimes() function AND QueryProcessCycleTime() function. If I run the code solo (just one instance), we get pretty consistent numbers.
However, if I run 10-instances on my Intel 4-way with HyperThreading enabled (8-logical processors) machine, all three methods (QueryPerformanceCounter(),GetProcessTimes() and QueryProcessCycleTime() report random numbers.


# of Concurrent Processes
1
10
QueryPerformanceCounter time
21.5567753
39.6047058
GetProcessTimes (User)
21.4501375
39.4214527
QueryProcessCycleTime
77376526479
141329606436

Can you please help me understand what is going on here?
Signed,
   -- Dazed and confused.

Note that the QPC and GetProcessTimes are reported in seconds. The QPCT times are in cpu cycles, which are model-dependent. You can use the QueryProcessorFrequency() API call to get the processor speed in cycles/second. Or check the ~MHz field in the Registry as illustrated in Figure 1. In my experience, that will report a number similar to QFC().

Swallowing deeply, I humbly replied:
Dear Dazed,
I see that you are confused, but there is a very good explanation for what you are seeing.
If I understand this correctly, the app runs for 21 seconds in a standalone environment. When you run 10 processes concurrently, the app runs for 39 seconds.
You are correct, QueryPerformanceCounter() is counting wall clock time, not cycles. Any time that the thread is switched out between successive calls to QPC() would be included in the timing, which is why you obtained some unexpected results. You can expect thread pre-emption to occur when you have 10 long-running threads Ready to run on your 8-way machine.
But I think there is something else going on in this case, too. Basically, you are seeing two multiprocessor scalability effects – one due to HyperThreading and one due to more conventional processor queuing. The best way to untangle these is to perform the following set of test runs:
  1. run standalone
  2. run two concurrent processes
  3. run four concurrent processes
  4. run 8 concurrent processes
  5. then, keep increasing the # of concurrent processes and see if a pattern emerges.
Using GetProcessTimes is not very precise -- for all the reasons discussed on my blog, it is based on sampling – but for this test of a long running processing (~ 21 seconds), the precision is sufficient. Since the process is CPU-intensive, this measurement is consistent with elapsed time as measured using QPC, but that is just pure luck, because, as it happens, the program you are testing does not incur any significant IO wait time.  My recommendation is to try calling QueryThreadCycleTime instead; it is a better bet, but it is not foolproof either (as I have discussed on my blog). Actually, I recommend you instrument the code using the Scenario class library that we put up on Code Gallery sometime back, see http://archive.msdn.microsoft.com/Scenario. It calls both QPC and QTCT in-line during thread execution.

I personally don't have a lot of direct experience with QueryProcessCycleTime, but my understanding of how it works could make it problematic when you are calling it from inside a multi-threaded app. If your app runs mainly single-threaded, it should report CPU timings similar to QTCT.
 


The Mystery Solved. 
Within a few days, the intrepid Reader supplied some additional measurement data, per my instructions. After instrumenting the program using the Scenario class, he obtained measurements from executing 2, 4, 6, 8, etc. processes in parallel, up to running 32 processes in parallel on this machine. Figure 2 summarizes the results he obtained:



Figure 2. Elapsed times and CPU times as a function of the number of concurrent processes. The machine environment is a quad-core Intel i7 processor with HT enabled (the OS sees 8 logical processors).
(Note that I truncated the results at 24 concurrent processes to create this graphic to focus on left side of the chart. Beyond 24 processes, both line graphs continue consistently along the same pattern indicated. CPU time remains flat and elapsed time nues to scale linearly.)

To be clear about the measurements being reported, the Elapsed property of the Scenario object is the delta between a QPC() clock timer issued at Scenario.Begin() and one issued at Scenario.End(). It is reported in standard Windows 100 nanosecond timer ticks. The ElapsedCpu property is the delta between two calls to QTCT, made immediately after the QPC call in the Scenario.Begin method and just before Scenario.End calls QPC. It is also reported in standard Windows timer ticks. Respectively, they are the elapsed (wall clock time) time and the CPU thread execution time for the thread calling into the embedded Scenario instrumentation library.
 
Looking at the runs for two and four concurrent processes, we see both elapsed time and CPU time holding constant. In terms of parallel scalability, this application running on this machine with 4 physical processor cores scales linearly for up to four concurrent processes.
We also see that for up to 8 concurrent processes, elapsed time and CPU time are essentially identical.
For more than 8 processes, the CPU time curve flattens out, while elapsed time of the application increases linearly with the number of concurrent processes. Running more processes than logical processors does nothing to increase throughput; it simply creates conventional processor queuing delays. The fact that the workload scales linearly up to 32 concurrent processes is actually a very positive result. The queuing delays aren’t exponential, there is no evidence of locking or other blocking that would impede scalability. These results suggest that if you had a 32-way (physical processor) machine, you could run 32 of these processes in parallel very efficiently.
The HT effects are confined to the area of the elapsed and CPU time curves between 4 and 8 concurrent processes. Notice with six processes running concurrently, CPU time elongates only slightly – the benefit of the HT hardware boost is evident here. However, running eight concurrent processes leads to a sharp spike in CPU time – this is a sign that 8 concurrent processes saturate the four processor cores. The HT feature no longer effectively increases instruction execution throughput.
The CPU time curve flattening out after 8 concurrent processes suggests that at least some of that CPU time spike at 8 concurrent processes is due to contention for shared resources internal to the physical processor core from threads running in parallel on logical (HT) processors.
HT is especially tricky because the interaction between threads contending for shared resources internal to the physical processor core is very workload dependent. Running separate processes in parallel eliminates most of the parallel scalability issues associated with “false” sharing of shared cache lines because each process is running in its own dedicated block of virtual memory. What is particularly nefarious about concurrently executing threads “false sharing” of cache lines is that it subjects executing to time-consuming delays due to writing back updated cache lines to RAM and re-fetching them. In theory, these delays can be minimized by changing the way you allocate your data structures. In practice, you do not have a lot of control over how memory is allocated if you are an application developer using the .NET Framework. On the other hand, if you are writing a device driver in Windows and working in C++, false sharing in the processor data cache is something you need to be careful to address.
I trust this further exploration and explanation will prove helpful, and I look forward to seeing a check in the mail from your company to me for helping to sweep aside this confusion.
Signed,
   -- Your obedient servant, etc., etc.

Comments

  1. Classic post Mark Informative and funny - what more could a loyal reader want from a performance blog.

    ReplyDelete
  2. Is there another source of the Scenario object? The MSDN archive does not seem to be active any more.

    ReplyDelete
  3. Thanks, Kevin. Bummer. That link to the Scenario.dll no longer appears to be active.

    If you e-mail me directly, I can send you a current version of the class library.

    You can also access the MeasurementBlock class, functionally similar, found in Microsoft.VisualStudio.Diagnostics.Measurement.dll, which you should be able to find in the VS \Common\IDE subdirectory. VS uses a 32-bit version of the library.

    Let me know if either works for you.

    ReplyDelete

Post a Comment

Popular posts from this blog

Hyper-V Architecture: Intercepts, interrupts and Hypercalls

Intercepts, interrupts and Hypercalls Three interfaces exist that allow for interaction and communication between the hypervisor, the Root partition and the guest partitions: intercepts, interrupts, and the direct Hypercall interface. These interfaces are necessary for the virtualization scheme to function properly, and their usage accounts for much of the overhead virtualization adds to the system. Hyper-V measures and reports on the rate these different interfaces are used, which is, of course, workload dependent. Frankly, the measurements that show the rate that the hypervisor processes interrupts and Hypercalls is seldom of interest outside the Microsoft developers working on Hyper-V performance itself. But these measurements do provide insight into the Hyper-V architecture and can help us understand how the performance of the applications running on guest machines is impacted due to virtualization. Figure 3 is a graph showing these three major sources of virtualization overhead...

High Resolution Clocks and Timers for Performance Measurement in Windows.

Within the discipline of software performance engineering (SPE), application response time monitoring refers to the capability of instrumenting application requests, transactions and other vital interaction scenarios in order to measure their response times. There is no single, more important performance measurement than application response time, especially in the degree which the consistency and length of application response time events reflect the user experience and relate to customer satisfaction. All the esoteric measurements of hardware utilization that Perfmon revels in pale by comparison. Of course, performance engineers usually still want to be able to break down application response time into its component parts, one of which is CPU usage. Other than the Concurrency Visualizer that is packaged with the Visual Studio Profiler that was discussed  in the previous post , there are few professional-grade, application response time monitoring and profi...

Memory Ballooning in Hyper-V

The previous post in this series discussed the various Hyper-V Dynamic Memory configuration options. Ballooning Removing memory from a guest machine while it is running is a bit more complicated than adding memory to it, which makes use of a hardware interface that the Windows OS supports. One factor that makes removing memory from a guest machine difficult is that the Hyper-V hypervisor does not gather the kind of memory usage data that would enable it to select guest machine pages that are good candidates for removal. The hypervisor’s virtual memory capabilities are limited to maintaining the second level page tables needed to translate Guest Virtual addresses to valid machine memory addresses. Because the hypervisor does not maintain any memory usage information that could be used, for example, to identify which of a guest machine’s physical memory pages have been accessed recently, when Guest Physical memory needs to be removed from a partition, it uses ballooning, which transfe...