

Processor performance in the age of multi-core: RISC vs. CISC, part 1.

Reading the news media and trade press coverage of Apple’s announcement that it plans to transition its next generation of Mac computers from Intel-manufactured x64 processors to custom ARM chips prompted me to write a blog entry discussing Apple’s strategy in greater depth and, hopefully, with more insight than the published reports provided. An issue raised by one of the computer industry experts who analyzed the Apple announcement was that it might re-ignite an old debate among CPU hardware engineers over the relative virtues of the CISC vs. RISC approaches to processor design. This seems very unlikely to me, and I will attempt to explain why in this post. Basically, RISC has won the engineering battle, but Intel has good reasons to continue to resist any breaking changes to its hardware platform that would cause existing x86 and x64 software to fail. The most interesting aspect of the Apple announcement is actually its consolidation around a business model where the same company makes both the hardware and the software that runs on its platform, the model that was used very profitably until the Wintel collaboration began to dominate the PC desktop, portable and, eventually, server markets and seemingly broke the mold.

As I discussed in the previous post, Apple has a number of compelling reasons to shift from Intel to Apple silicon to power future Macs, none of which have much to do with the relative merits of the CISC vs. RISC approaches to processor architecture. It is true that Intel microprocessors support a quite complex instruction set (i.e., CISC), a legacy of the original x86 design, which proved so successful for the US company that the market pushed back whenever Intel tried to change it. When Intel did attempt to introduce a radically new architecture, branded as the Itanium, incorporating all the latest ideas about processor hardware performance, it was a colossal flop. That debacle created a window of opportunity for AMD to extend the x86 architecture to 64-bit addressing in a less radical way, and ever since, Intel and AMD x64 processors have maintained strict backward compatibility with older x86 software binaries, ensuring that new hardware advances will not break existing software.

Meanwhile, the ARM processors used in Apple’s cell phones and tablets, as well as in the smartphones and tablets from Apple’s main competitors, originally used the reduced instruction set approach (RISC) – the ARM acronym stands for Advanced RISC Machine. The RISC design approach decreased the complexity of the instruction set in order to streamline the instruction execution pipeline, which reduced the amount of logic needed on board the microprocessor chip to control instruction execution. Processors designed on RISC principles were limited, for example, to executing fixed-length instructions, and they not only reduced the number of instructions available but also restricted those instructions to simple arithmetic, logical and control operations that work on registers, with separate load and store instructions to move data to and from memory. By cutting back on the number of instructions they supported, the first RISC machines introduced in the 1980s were able to cram the instruction execution engine and ample cache memory onto a single-chip microprocessor. Recognizing the role that high-speed cache memory incorporated onto the chip plays in accelerating pipelined instruction execution, the RISC designers also focused on creating an instruction format that made caching easier and improved cache effectiveness.
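To make the load/store restriction concrete, the sketch below shows a simple read-modify-write of a counter in memory. The C fragment is only an illustration; the instruction sequences in the comments are representative of x64 and of a generic RISC load/store machine, not the literal output of any particular compiler.

```c
/* counter.c -- a toy illustration of the CISC vs. RISC split.
 * The commented instruction sequences are representative, not the
 * exact output of any specific compiler. */
#include <stdio.h>

static long counter;

void bump(long delta)
{
    /* A CISC machine like x64 can perform this read-modify-write with
     * a single variable-length instruction that references memory:
     *
     *     add  qword ptr [counter], rdi
     *
     * A classic RISC load/store machine needs three fixed-length
     * instructions, because arithmetic operates only on registers:
     *
     *     ldr  x1, [x2]        ; load the counter (address in x2)
     *     add  x1, x1, x0      ; add the delta, register to register
     *     str  x1, [x2]        ; store the result back
     */
    counter += delta;
}

int main(void)
{
    bump(5);
    printf("counter = %ld\n", counter);
    return 0;
}
```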

David Patterson of UC Berkeley and David Ditzel of AT&T Bell Labs made "The Case for the Reduced Instruction Set Computer" in a 1980 paper that argues persuasively for a slimmed-down instruction set for a high-performance, single-chip computer. At that time, CPU manufacturers like IBM and DEC would add complex instructions to a machine's instruction set that could replace a commonly used sequence of simpler instructions, both for the convenience of assembler language programmers and to reduce the footprint of binaries, which was once a big deal because proprietary RAM on those machines was so expensive. Note that these hardware vendors also made the OS software that ran on their platforms, along with the software development tools, compilers and the like, used to build all the application software that ran there. Whenever new instructions were added to the hardware, these vendors could coordinate the software support for those enhancements. The result was an orchestrated release of new hardware and related software, illuminating a clear upgrade path for customers. This is exactly what Apple did this summer, announcing the new hardware strategy for the Mac and simultaneously releasing the developer tools that support a path for migrating existing software to the new hardware.

Back in the day, for example, IBM handed OS/360 and OS/370 developers a STore Multiple (STM) instruction for storing register values in the call stack prior to branching to a subroutine and a corresponding LoaD Multiple (LM) instruction for restoring register values upon return. Here one complex STM instruction replaced 14 individual ST instructions which needed to be executed every time you issued a BALR 14,15 to branch to a subroutine (or method). The complex STM instruction did not execute any faster than 14 separate ST instructions, but it was a lot more convenient for assembly language programmers. Ironically, ARM processors -- which are based on RISC design principles -- continue to support STM and LDM instructions. Go figure. I guess developer convenience is still a thing.

Requiring less instruction management and control logic on a RISC processor chip also led to efficiencies in power consumption. More efficient use of electric power was initially a secondary benefit of RISC designs, but with the breakdown of what is known as Dennard scaling around 2005, it emerged as one of the key factors that allowed low-power ARM processors to capture the market for battery-powered smartphones and other mobile devices. Dennard scaling was an observation made in the early 1970s that increased chip density in successive generations of semiconductor fabrication technology was accompanied by reduced power consumption, allowing manufacturers to increase the clock speed of microprocessors at the same time that they added more circuitry to the chip. With the breakdown of Dennard scaling, however, processor manufacturers have been unable to increase CPU clock speeds without generating excessive heat that must be dissipated somehow, creating a whole new set of engineering challenges that are chronicled in Hennessy and Patterson’s Turing Award lecture and the paper they published in 2019 to accompany the lecture.
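For readers who like to see the arithmetic, here is a back-of-the-envelope sketch of the idealized Dennard scaling argument, using the standard approximation for the dynamic switching power of a CMOS circuit:

```latex
% Idealized Dennard scaling, sketched from the standard dynamic-power
% approximation P \approx C V^2 f for a CMOS circuit.
% Shrinking feature sizes by a factor k scales capacitance C and supply
% voltage V down by 1/k while allowing the clock frequency f to rise by k:
\[
P' \;\approx\; \frac{C}{k}\left(\frac{V}{k}\right)^{2}(k f)
   \;=\; \frac{C V^{2} f}{k^{2}} \;=\; \frac{P}{k^{2}}
\]
% Since transistor area also shrinks by roughly 1/k^2, power density stays
% approximately constant. Once supply voltage could no longer be lowered
% (leakage and threshold-voltage limits), raising f simply produced more
% heat per unit area, which is the breakdown described above.
```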

Programming RISC machines with simpler instruction sets placed greater emphasis on compilers to generate efficient code from high-level language statements, a shift that was aligned with broader trends in software engineering that diminished the role of assembly language programming in general. As Patterson and Ditzel observe in their 1980 "Case for RISC" paper:

"One of the interesting results of rising software costs is the increasing reliance on high-level languages. One consequence is that the compiler writer is replacing the assembly-language programmer in deciding which instructions the machine will execute. (emphasis added) Compilers are often unable to utilize complex instructions, nor do they use the insidious tricks in which assembly language programmers delight." 

Back when IBM dominated the market for enterprise computing in the 1970s and 1980s, the hardware manufacturer would extend the already large set of machine language instructions on its mainframe computers to assist developers using its assembler language. Sprinkling a few new machine language instructions into each new generation of proprietary IBM mainframe processor hardware also helped IBM maintain its market dominance, the profitability of which was mainly threatened by plug-compatible manufacturers (PCMs) who built machines that executed identical instructions cheaper or faster or, ideally, both. The new instructions presented a moving target to the PCMs, forcing IBM's competitors to play a never-ending game of catch-up in order to maintain strict compatibility with IBM's latest and greatest.

When its dominance in the PC market was similarly threatened by AMD’s x86 plug-compatible processors, Intel adopted a markedly similar strategy, adding new instructions to each subsequent x86 processor model. New x86 instructions targeted at speeding up graphics processing on desktop PCs, for example, clearly benefitted customers, but they also had the advantage of making the instruction set a moving target, forcing a plug-compatible manufacturer like AMD to be in a constant state of flux, always one step behind the latest Intel hardware. Intel’s continuous expansion of the x86 instruction set to stay one step ahead of its PCM rival was less successful than IBM’s, however, because the software produced to run on Intel hardware is generated primarily using compilers produced by 3rd parties like Microsoft (and, at one time, Borland) that are not always aligned with Intel’s business objectives[1]. The compiler developers at Microsoft, for example, have been known to ignore many of Intel’s hardware innovations, focusing less on optimizing code generation for Intel’s latest and greatest and more on software developer productivity. In my previous post, I called attention to this aspect of Apple’s business model, where it builds the hardware and also maintains the compiler software used to build the application software that runs on its hardware platform, similar to the old IBM mainframe business model, which was driven by the profitability of the underlying hardware products.

In order to appreciate the performance benefits of the RISC approach, we will need to dive deeper into pipelining, which breaks instruction execution into multiple stages and executes those stages in parallel. Pipelining is a technique used in all modern CPUs, but it adds substantial complexity to the instruction execution logic, a problem that is compounded when the hardware must manage a very large and complicated set of instructions. We will look at the instruction execution pipeline used by Intel x64 processors to illustrate just how complex this machinery is.
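As a rough first illustration of the overlap involved (a textbook five-stage pipeline, not a model of any real Intel or ARM design), the sketch below prints which stage each instruction occupies on each clock cycle, ignoring hazards, stalls and branches:

```c
/* pipeline.c -- toy model of a textbook 5-stage pipeline
 * (fetch, decode, execute, memory access, write-back).
 * Hazards, stalls and branches are ignored; the point is only to
 * show that several instructions are in flight at the same time. */
#include <stdio.h>

#define STAGES 5
#define INSTRS 4

int main(void)
{
    const char *stage_names[STAGES] = { "IF", "ID", "EX", "MEM", "WB" };
    int total_cycles = STAGES + INSTRS - 1;   /* last instruction retires here */

    printf("cycle:");
    for (int c = 1; c <= total_cycles; c++)
        printf("  %2d ", c);
    printf("\n");

    for (int i = 0; i < INSTRS; i++) {
        printf("  i%d: ", i + 1);
        for (int c = 0; c < total_cycles; c++) {
            int stage = c - i;                /* instruction i enters IF on cycle i+1 */
            if (stage >= 0 && stage < STAGES)
                printf(" %-4s", stage_names[stage]);
            else
                printf("  .  ");
        }
        printf("\n");
    }
    return 0;
}
```

With four instructions and five stages, the whole sequence completes in 8 cycles rather than the 20 a strictly sequential machine would need, which is exactly the win that a large, complicated instruction set makes harder to achieve.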

More recently, CISC- and RISC-style machines have actually converged, with higher-end ARM processors supporting a larger, richer set of instructions to support digital signal processing in phones, high-resolution vector graphics for video streaming apps, and floating point arithmetic for compute-bound tasks, relaxing the original fixed-length instruction restriction in the process. For example, the Apple ARM processors used in high-end iPhones support SIMD (single instruction, multiple data) instructions like the ones that were introduced in Intel x86 machines to speed up vector graphics processing. After all, Apple iPhone customers have demanding requirements: they want jitter-free performance when they are watching cat videos on YouTube or dance videos on TikTok. These instructions added to the RISC core of Apple Silicon make it a capable candidate for powering portable Macs.
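The SIMD convergence is easy to see at the source level. The sketch below adds two float arrays four elements at a time using x86 SSE intrinsics or their ARM NEON counterparts, whichever the compiler reports as available, with a plain scalar loop as a fallback; it is a minimal illustration of the idea, not tuned production code.

```c
/* vec_add.c -- adding two float arrays with SIMD where available.
 * Compile with SSE enabled on x86, or with NEON on ARM, to take the
 * vector path; otherwise the scalar loop does all the work. */
#include <stdio.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

void vec_add(const float *a, const float *b, float *out, int n)
{
    int i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {                 /* 4 floats per instruction */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {                 /* 4 floats per instruction */
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; i++)                           /* scalar tail / fallback */
        out[i] = a[i] + b[i];
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    vec_add(a, b, c, 8);
    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```

Either way, one vector instruction does the work of four scalar additions, which is the kind of data-level parallelism both camps now expose to software.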

Intel responded to competition from RISC-based CPUs by building a “microarchitecture” that breaks complex instructions into RISC-like micro-operations (µops) that are then fed to the instruction execution pipeline. The Intel microarchitecture improves instruction execution performance in a RISC-like fashion without compromising backward compatibility with the machine’s complex instruction set.
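Conceptually, the decoder "cracks" a memory-operand instruction into the same load/operate/store sequence that a RISC compiler would have emitted explicitly, as in the earlier counter example. The sketch below is purely illustrative; the micro-op mnemonics are invented, and real x64 decoders use internal formats that Intel does not document at this level.

```c
/* uop_crack.c -- conceptual sketch of cracking a CISC-style
 * memory-operand add into RISC-like micro-operations.
 * The micro-op mnemonics are invented for illustration only. */
#include <stdio.h>

int main(void)
{
    const char *x64_instr = "add qword ptr [counter], rax";
    const char *uops[] = {
        "uop.load   tmp0, [counter]",   /* read the memory operand  */
        "uop.add    tmp0, tmp0, rax",   /* arithmetic on registers  */
        "uop.store  [counter], tmp0",   /* write the result back    */
    };
    size_t n = sizeof uops / sizeof uops[0];

    printf("x64 instruction: %s\n", x64_instr);
    printf("decoded into %zu micro-ops:\n", n);
    for (size_t i = 0; i < n; i++)
        printf("  %s\n", uops[i]);
    return 0;
}
```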

More generally, steady improvements in semiconductor fabrication technology – as embodied in the observation that came to be known as Moore’s Law – allow the hardware manufacturers to pack more and more logic onto each chip. It has gotten to the point where increases in chip density have outpaced hardware engineers’ ideas about how to make good use of all that additional logic. Multi-core processors are the result – the processor hardware manufacturers don’t currently have any better ideas for what can be done with all the additional circuitry that becomes available on the chip with each new semiconductor fabrication advance. TSMC, which is Apple’s semiconductor manufacturing partner for Apple Silicon chips, is currently using a 7 nm process and beginning the transition to a 5 nm process. Note that the end of Dennard scaling occurred when MOSFET semiconductor fabrication reached 65 nm around 2005, which marked the rise of the multi-core approach.

A lesson learned from the expensive failure of the Itanium architecture is that engineers are wary of introducing radically new hardware features. In the current model, where Intel hardware innovation requires adoption by its software partners like Microsoft, it is not wise for the hardware to get too far ahead of the software. This is one aspect of Apple’s switch from Intel hardware to Apple Silicon that bears watching – it is a return to a business model where the hardware manufacturer also controls the software that runs on its platform, which allows for tighter hardware and software integration, a key ingredient in Apple’s enormously successful iPhone business.

On the other hand, Intel’s experience with adding instructions narrowly targeted at multimedia applications has proved successful. Similarly, extensions to the ARM processor instruction set that mirror earlier Intel multimedia instructions have also worked. Extending the instruction set of both RISC and CISC processors in narrow, targeted ways has proved successful so long as the extensions do not introduce any breaking changes to existing software. Where the new instructions can demonstrate superior performance for very specific functions, manufacturers can then drive adoption by a select group of professional assembly language developers at work on the operating system, on device drivers, or specializing in code generation on the compiler teams.

In the next blog entry in this series, I will drill into these issues more, beginning with a discussion about instruction sets and assembly language programming, followed by a deeper dive into the RISC vs. CISC battle for the hearts and minds of developers.



[1] Intel does build a C++ compiler focused on optimizing code generation for its x86 and x64 platforms that is capable of taking advantage of all the latest Intel hardware extensions. However, adoption of Intel’s own proprietary compiler software is very limited compared to the widespread use of Microsoft software developer tools, which are known for their ease of use.
