 NVIDIA executes a pivot to the Data Center

Announcing the company’s first-quarter fiscal 2024 financial results on May 24, 2023, Jensen Huang, the founder and CEO of NVIDIA, reported that the company is seeing “surging demand” for its latest generation of data center hardware products, which are a key ingredient in building generative AI models like the ones behind OpenAI’s ChatGPT chatbot. On that very promising financial guidance from the NVIDIA founder, the company’s stock (NVDA) leaped from $305 to $356 in after-hours trading on Wednesday, up over 16%. When the Nasdaq exchange, where NVDA is listed, re-opened on Thursday, the stock continued to soar, closing at $380 for the day. To put that rise in perspective, back in September 2022, NVDA stock bottomed out at $112, so it has more than tripled (a gain of roughly 240%) in the intervening 8 months.

In the wake of ChatGPT, which Microsoft has integrated into its Bing search engine to great effect, other companies are scrambling to catch up with, or at least not fall too far behind, what could be a technological game-changer. This includes Google, which is rushing to release its own generative AI chatbot. There is also a rush to build custom AI tools that marry automatic speech recognition to domain-specific generative chatbots for deployment in customer help desk environments. Data centers rely on NVIDIA hardware, which has been optimized for this specific purpose, to create the Large Language Models, the neural networks that are at the heart of ChatGPT and its rivals. Once these models are built, they also need to be run in response to customer inquiries. NVIDIA’s hardware is also deployed in the data center to do that.

That is all certainly good news for NVDA investors, some of whom will be able to brag to their fortunate progeny that “I bought NVIDIA at $125.” What I prefer to focus on in my blogging, however, is the computing technology and the cost/performance tradeoffs that confront the planners who are responsible for building and maintaining large scale computing infrastructure. In this blog post, I will dive deeper into how NVIDIA got here. The computing technology that NVIDIA originally designed for accelerating vectorized computer graphics has now become a staple product, one that is proving essential to running generative AI applications in the modern data center.

For many years NVIDIA has been the leading supplier of high-end GPUs, graphical co-processors optimized for manipulating and displaying high-resolution vector graphics on Intel PCs, where they are associated mainly with super-charging the performance of games like Resident Evil on the Windows platform. With NVIDIA’s gaming business shrinking steadily, the company has engineered a nifty pivot to its data center products, which now account for almost 60% of the company’s total revenue.

Here’s an image of what an NVIDIA DGX H100 unit looks like from the outside. (The “H” designation honors Grace Hopper.) You can probably buy one on eBay in the neighborhood of $40,000. This is the hardware that is flying off the shelves at NVIDIA. It doesn’t look like much, does it? It is a nondescript 19” wide enclosure, 14” high and just under 36” deep, designed to be rack-mounted. It tips the scales at about 130 kilograms, almost 290 lbs., according to the company spec sheet, so clearly it is not designed for the home hobbyist.



 

This unassuming looking unit is a supercomputer, capable of 32 petaFLOPS. On the inside, it is packed with high end components, including 8 NVIDIA H100 Tensor Core GPUs, the most powerful GPUs in NVIDIA’s current product lineup. GPUs in the NVIDIA world are coprocessors, so the Tensor Core units are packaged with dual Intel Xeon 8480C CPUs that provide a total of 112 x64 processor cores. These are the Host processors in the NVIDIA compute model. The Intel processors have access to 2 TB of shared system RAM, while each H100 Tensor Core GPU has access to a 4 TB internal NVMe (Non-Volatile Memory Express) drive.

The key to NVIDIA’s journey from high-end desktop graphics to its current role in accelerating Machine Learning and other AI applications was the CUDA API, first released to the public in 2007. The CUDA interface was a software layer that opened up the GPU for developers to run tasks other than the vector graphic manipulation workloads it was originally designed for. Consider a computer monitor like mine that supports resolutions up to 3840 x 2160 pixels, a rectangular matrix that has about 8.3 million addressable 8-bit cells. Vector graphic operations target some contiguous subset of that two-dimensional matrix, representing the various geometric shapes (circles, cones, cylinders, rectangles, 3-D boxes, other polygons, etc.) that are sub-components of the display image.

Say you want to build a bouncing ball. You start with a circle, which is represented by a simple mathematical function, a vector that identifies the set of all the nearby pixels that are within the required distance from a designated pixel at the center. Then you apply a shader operator to make the flat circle look like a 3-dimensional sphere. Once you have the ball defined, you request an animation operation to move the center of the circle from one spot in the matrix to another. Your graphical sphere is now bouncing around on the screen. You are ready to start building your NBA 2023 basketball app – except for a few messy details like licensing player images and likenesses – something for the lawyers. Back to our bouncing ball: the monitor has a refresh rate of 60 Hz, and the GPU’s job is to rebuild the display matrix completely at least 24-30 times per second, so that the human visual apparatus will interpret any moving images as continuous. Hey, nice dunk.
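To make the bouncing ball a little more concrete, here is a minimal Python sketch of the circle step and the animation step. The helper names `circle_pixels` and `move_center` are hypothetical, made up for this illustration rather than taken from any graphics API, and a real GPU would do this work in hardware, not in an interpreted loop.

```python
def circle_pixels(cx, cy, radius, width, height):
    """Return the set of (x, y) pixels within `radius` of the center (cx, cy)."""
    pixels = set()
    # Only scan the bounding box of the circle, clipped to the display edges.
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                pixels.add((x, y))
    return pixels


def move_center(cx, cy, dx, dy, radius, width, height):
    """Advance the center by one frame, reversing direction at the display edges."""
    if not radius <= cx + dx < width - radius:
        dx = -dx
    if not radius <= cy + dy < height - radius:
        dy = -dy
    return cx + dx, cy + dy, dx, dy


WIDTH, HEIGHT = 3840, 2160  # the display matrix from the example above
ball = circle_pixels(100, 100, 2, WIDTH, HEIGHT)
```

Calling `move_center` once per frame and then `circle_pixels` on the new center is the whole animation loop in this toy version. The shading step is left out entirely.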

Graphics cards like the ones NVIDIA has been building for years allow the main Intel (or AMD or ARM) processor to offload vector graphics operations to the GPU coprocessor. The display adapter driver code can send the circular definition function, the shading function, and the animation function to the GPU, which is then responsible for translating these vector graphical constructs into precise instructions to the video monitor. Fundamentally, it is matrix manipulation. And one good way to accelerate matrix operations is to perform them in parallel, a classic example of the Divide-and-Conquer techniques used in parallel computing. Returning to my NBA 2023 example, divide the 3840 x 2160 display matrix into eight subsets and then hand each subset to one of eight rendering engines working in parallel inside the GPU.
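The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_rows` and `render_band` are hypothetical helpers, the "rendering" is a stand-in that just counts pixels, and Python threads will not deliver the speedup that real parallel hardware does.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, ENGINES = 3840, 2160, 8


def split_rows(height, parts):
    """Divide `height` rows into `parts` contiguous (start, stop) bands."""
    base, extra = divmod(height, parts)
    bands, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bands.append((start, stop))
        start = stop
    return bands


def render_band(band):
    """Stand-in for one rendering engine: count the pixels it is responsible for."""
    start, stop = band
    return (stop - start) * WIDTH


bands = split_rows(HEIGHT, ENGINES)
with ThreadPoolExecutor(max_workers=ENGINES) as pool:
    pixels_per_band = list(pool.map(render_band, bands))

total_pixels = sum(pixels_per_band)
```

Each of the eight bands covers 270 rows, and because no band depends on any other, the work can proceed without coordination until the results are combined at the end.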

With the advent of larger displays and multiple displays, the demand on the GPU increased, which led NVIDIA to respond by building even more powerful hardware, evolving into what are effectively massively parallel processors. NVIDIA also began innovating in ways that the more general-purpose CPUs from Intel and ARM were not. NVIDIA GPUs are based on a proprietary architecture that features a Very Long Instruction Word (VLIW) that allows its compiler to insert “hints” into each machine instruction that the hardware can then use to optimize performance during execution. NVIDIA has also pioneered a unique parallel processing architecture in which a single instruction cycle operates on multiple Threads executing simultaneously on an 8-CPU multi-core unit known as a Streaming Multiprocessor. NVIDIA calls this singular approach to parallel processing Single Instruction, Multiple Threads or SIMT.[1]
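A toy model may help show what "one instruction, many threads" means. The sketch below is only an illustration in Python, not a description of how the hardware or CUDA actually works: the `run_simt` helper, the three made-up instructions, and the group size of eight are all assumptions invented for this example.

```python
def run_simt(program, thread_count=8):
    """Execute one instruction stream in lockstep across `thread_count` threads.

    Each thread has its own register (`acc`) and its own thread id, but all
    threads share a single instruction stream.
    """
    acc = [0] * thread_count
    for op, operand in program:            # one instruction fetched per step...
        for tid in range(thread_count):    # ...and applied to every thread
            if op == "load_tid":
                acc[tid] = tid
            elif op == "mul":
                acc[tid] *= operand
            elif op == "add":
                acc[tid] += operand
            else:
                raise ValueError(f"unknown instruction: {op}")
    return acc


# Every thread computes 2 * tid + 1 on its own data.
program = [("load_tid", None), ("mul", 2), ("add", 1)]
result = run_simt(program)
```

The inner loop runs sequentially here, whereas the point of the hardware is that those per-thread steps happen at the same time.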

Somewhere along the line, inside NVIDIA, the designers of their graphics accelerator boards recognized that their innovations in developing massively parallel processors might benefit numerical processing workloads beyond the range of computer graphics. That recognition led to the development and release of the original CUDA interface that gave other software developers access to the GPU for the first time. NVIDIA wasn’t building data center hardware like the NVIDIA DGX H100 at the time, only GPU cards for accelerating computer graphics. But the developers who started playing with the CUDA interface quickly figured out that leveraging those GPUs for parallel processing was often a big win. With CUDA, you do have to figure out what portions of your compute-intensive code to offload to the GPU and what serial code needs to stay on the Host CPU. But, naturally, NVIDIA started supplying profiler tools to help software developers with those decisions.
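A rough way to reason about that offload decision is Amdahl's law, which says the overall speedup is limited by the fraction of the run time that stays serial. The sketch below is a back-of-the-envelope calculation, not a substitute for profiling, and the figures in it (90% of the run time offloadable, a 20x speedup on the GPU) are invented for illustration.

```python
def amdahl_speedup(parallel_fraction, accelerator_speedup):
    """Overall speedup when only `parallel_fraction` of the run time is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_speedup)


# Hypothetical workload: 90% of the run time can be offloaded to the GPU,
# where it runs 20 times faster than on the Host CPU.
speedup = amdahl_speedup(0.90, 20.0)

# Upper bound on the speedup even if the offloaded part took no time at all.
ceiling = 1.0 / (1.0 - 0.90)
```

With these made-up numbers the program as a whole runs a little under 7 times faster, and no accelerator could push it past about 10 times, because the serial 10% still runs on the Host CPU.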

One of the issues that emerged as soon as CUDA opened up the GPU for numerical computation more generally was that the GPUs NVIDIA was building at the time did not have stellar performance for floating point instructions, since manipulating the computer graphics matrix is done mainly with integer arithmetic. In contrast, the mainstays of numerical computation, which are known as the Basic Linear Algebra Subprograms (BLAS), rely heavily on floating point instructions. One step even closer to Machine Learning-specific algorithms is General Matrix Multiply (GeMM), the BLAS routine for multiplying matrices, which typically also uses floating point arithmetic. From the perspective of the parallel processing hardware, something like the popular Machine Learning framework known as TensorFlow, originally created at Google, generates instruction streams that look very similar to GeMM.
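For readers who have not met GeMM before, the operation it computes is C = alpha * A * B + beta * C. The plain Python version below is a minimal reference sketch of that formula, written for clarity instead of speed; the function name `gemm` and the list-of-rows matrix layout are choices made for this example, not the interface of any BLAS library.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("c must have the shape of a @ b")
    result = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Each output cell is an independent floating point dot product,
            # which is why the work spreads so naturally across parallel threads.
            dot = sum(a[i][k] * b[k][j] for k in range(inner))
            out_row.append(alpha * dot + beta * c[i][j])
        result.append(out_row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
out = gemm(2.0, a, b, 0.5, c)
```

Every cell of the result is a string of floating point multiplies and adds, so the floating point throughput of the hardware dominates the run time.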

In turn, NVIDIA’s response was to beef up the performance of floating point instructions significantly in each succeeding generation of its GPU hardware, which led to a further differentiation of its hardware for data center operations from the original desktop graphics accelerators. If you look at what NVIDIA calls its Tensor Core GPU in its Hopper class machines, what you see is superfast floating point instructions.

I will save a deeper dive into the unique NVIDIA GPU architecture for a later blog post. The point I want to emphasize here is that NVIDIA positioning its hardware in the data center as the premier compute engine for executing Machine Learning programs is the result of long years of innovation in this space. It is certainly starting to pay off now, but for many years prior to the current generative AI moment, it wasn’t obvious that this was a product strategy that would yield a dividend to the company. But NVIDIA has stuck with the program.

The company’s business has actually been looking precarious for the last few years since the market for desktop GPU coprocessors has declined. Decreased revenue from its desktop graphics cards was what caused the NVIDIA stock price to fall to its September 2022 levels. Apple has locked NVIDIA out of its proprietary hardware platforms completely, and Intel has been pushing into that space aggressively by adding more GPU capabilities to its desktop cores. This shrinking market for desktop GPU cards has made NVIDIA’s pivot to the data center business very tricky. I think the manner in which Jensen Huang and his management team have handled this transition is very commendable – they have been juggling a boatload of risk to get to where they are today, especially when Jensen’s previous big move, which was to try to purchase ARM from SoftBank, had to be abandoned.

Before leaving today’s topic, I would like to dispel any suggestion that NVIDIA might be in any way lucky to have arrived at its current position as a technological leader in the Machine Learning space. What I see is a long history of innovation by NVIDIA to adapt its GPUs, designed originally for graphics acceleration, to handle the numerical computations that the GeMM and TensorFlow libraries execute. What I am talking about is what Sara Hooker, an AI researcher at Google, has called the hardware lottery with its emphasis on matrix algebra and floating point arithmetic. (You can read Dr. Hooker’s original hardware lottery paper here, but I realize many people prefer to watch a video instead. 😊) Hooker’s field of AI research is computer vision, and the intensive optimization of floating point matrix algebra featured in the TensorFlow core doesn’t help that much in that area of AI research. Her message is intended as a warning against those very successful hardware optimizations starting to foreclose other research avenues – that would certainly be putting the software cart in front of the hardware horse.

What is noteworthy in that regard is that one of the nascent lines of business over at NVIDIA happens to be autonomous vehicles, technology for self-driving cars. The technology associated with autonomous vehicles has been notoriously over-hyped (take a bow, Elon Musk), but certainly the potential is there -- it is something automakers cannot afford to ignore. Computer vision, using either video cameras or even higher resolution LiDAR, married to Deep Learning neural networks and generative AI might be the path forward for this technology. NVIDIA’s approach has been to partner with automakers like Mercedes, and new hardware requirements are likely to emerge from that collaboration. It is a long shot, but so was CUDA back in the day.


[1] SIMT distinguishes the NVIDIA GPU architecture from a popular approach to speeding up vector operations known as Single Instruction, Multiple Data, or SIMD. Vector instructions have been around since the 1970s, when the designer Seymour Cray built them into the Cray-1 supercomputer. Intel, for example, added SIMD vector instructions to its processor cores back in the late 1990s, starting with MMX and SSE.
