NVIDIA executes a pivot to the Data Center
Announcing the company’s first-quarter fiscal 2024 financial results on May 24, 2023, Jensen Huang, the founder and CEO of NVIDIA, reported that the company was seeing “surging demand” for its latest generation of data center hardware products, which are a key ingredient in building generative AI models like OpenAI’s ChatGPT chatbot. On that very promising financial guidance from the NVIDIA founder, the company’s stock (NVDA) leaped from $305 to $356 in after-hours trading on Wednesday, up over 16%. When trading resumed on Thursday, NVDA continued to soar, closing at $380 for the day. To put that rise in perspective, back in September 2022 NVDA had bottomed out at $112, so the stock has more than tripled in the intervening eight months.
In the wake of ChatGPT, which Microsoft has integrated into its Bing search engine to great effect, other companies are scrambling to catch up with, or at least not fall too far behind, what could prove to be a technological game-changer. This includes Google rushing to release its own generative AI chatbot. There is also a rush to build custom AI tools that marry automatic speech recognition to domain-specific generative chatbots for deployment in customer help desk environments. Data centers rely on NVIDIA hardware, optimized for this specific purpose, to train the Large Language Models (LLMs) that are the neural networks at the heart of ChatGPT. Once these models are built, they also need to be run, in the inference phase, in response to customer inquiries, and NVIDIA’s hardware is deployed in the data center to do that as well.
That is all certainly good news for NVDA investors, some of
whom will be able to brag to their fortunate progeny that “I bought NVIDIA at $125.”
What I prefer to focus on in my blogging, however, is the computing technology and the cost/performance tradeoffs that confront the planners responsible for building and maintaining large-scale computing infrastructure. In this blog post, I will dive deeper into how NVIDIA got here by discussing the computing technology the company originally designed for accelerating vectorized computer graphics, which has since become a staple product, one that is proving essential to running generative AI applications in the modern data center.
For many years NVIDIA has been the leading supplier of high-end GPUs: graphics coprocessors optimized for manipulating and displaying high-resolution vector graphics on Intel PCs, where they are associated mainly with accelerating the super-charged performance of games like Resident Evil on the Windows platform. With NVIDIA’s gaming business shrinking steadily, the company has engineered a nifty pivot to its data center products, which now account for almost 60% of the company’s total revenue.
Here’s an image of what an NVIDIA DGX H100 unit looks like from the outside. (The “H” designation honors Grace Hopper.) This is the hardware that is flying off the shelves at NVIDIA; a single H100 board alone has been fetching in the neighborhood of $40,000 on eBay, and the complete eight-GPU DGX system sells for a great deal more. It doesn’t look like much, does it? It is a nondescript 19” wide enclosure, 14” high and just under 36” deep, designed to be rack-mounted. It tips the scales at about 130 kilograms, almost 290 lbs., according to the company spec sheet, so clearly it is not designed for the home hobbyist.
This unassuming-looking unit is a supercomputer, capable of 32 petaFLOPS (at FP8 precision). On the inside, it is packed with high-end components, including 8 NVIDIA H100 Tensor Core GPUs, the most powerful GPUs in NVIDIA’s current product lineup. GPUs in the NVIDIA world are coprocessors, so the Tensor Core units are packaged with dual Intel Xeon 8480C CPUs that together provide 112 x86-64 cores. These are the Host processors in the NVIDIA compute model. The Intel processors have access to 2 TB of shared system RAM, while each H100 Tensor Core GPU is backed by a roughly 4 TB internal NVMe (non-volatile memory) drive.
The key to NVIDIA’s journey from high-end desktop graphics to its current role in accelerating Machine Learning and other AI applications was the CUDA API, first released to the public in 2007. The CUDA interface was a software layer that opened up the GPU for developers to run tasks other than the vector graphics manipulation workloads it was originally designed for. Consider a computer monitor like mine that supports resolutions up to 3840 x 2160 pixels: a rectangular matrix of more than 8 million addressable cells, each storing 8 bits per color channel. Vector graphic operations target some contiguous subset of that two-dimensional matrix, representing the various geometric shapes (circles, cones, cylinders, rectangles, 3-D boxes, and other polygons) that are sub-components of the display image.
Say you want to build a bouncing ball. You start with a circle, which is represented by a simple mathematical function: a vector that identifies the set of all pixels within the required distance of a designated pixel at the center. Then you apply a shader operator to make the flat circle look like a three-dimensional sphere. Once you have the ball defined, you request an animation operation to move the center of the circle from one spot in the matrix to another. Your graphical sphere is now bouncing around on the screen. You are ready to start building your NBA 2023 basketball app, except for a few messy details like licensing player images and likenesses (something for the lawyers). Back to our bouncing ball: the monitor has a refresh rate of 60 Hz, and the GPU must rebuild the display matrix completely at least 24-30 times per second so that the human visual apparatus will interpret the moving image as continuous. Hey, nice dunk.
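To make that per-pixel computation concrete, here is a minimal sketch in CUDA C++ of the kind of kernel just described: every GPU thread tests whether its pixel lies within the required distance of the circle’s center and writes the result into the display matrix. The kernel name drawCircle, the buffer layout, and the grayscale framebuffer are my own illustrative assumptions, not an NVIDIA API.

// Minimal sketch (illustrative names): one GPU thread per pixel decides
// whether that pixel lies inside the circle and writes it into the frame.
__global__ void drawCircle(unsigned char *frame, int width, int height,
                           float cx, float cy, float radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row handled by this thread
    if (x >= width || y >= height) return;           // threads past the edge do nothing

    float dx = x - cx, dy = y - cy;
    // The "vector" definition of the circle: all pixels within radius of the center.
    bool inside = (dx * dx + dy * dy) <= radius * radius;
    frame[y * width + x] = inside ? 255 : 0;         // white ball on a black background
}

A shader pass and an animation step would be further kernels of the same shape, each sweeping over the same matrix of pixels.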
Graphics cards like the ones NVIDIA has been building for years allow the main Intel (or AMD or ARM) processor to offload vector graphics operations to the GPU coprocessor. The display adapter driver code can send the circle-definition function, the shading function, and the animation function to the GPU, which is then responsible for translating those vector graphical constructs into precise instructions for the video monitor. Fundamentally, it is matrix manipulation, and one good way to accelerate matrix operations is to perform them in parallel, a classic example of the divide-and-conquer techniques used in parallel computing. Returning to my NBA 2023 example: divide the 3840 x 2160 display matrix into eight subsets and operate on them simultaneously, using eight rendering engines working in parallel inside the GPU.
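As a rough sketch of that divide-and-conquer launch, the host code below carves the 3840 x 2160 matrix into 16 x 16-pixel tiles and hands each tile to a block of GPU threads, redrawing the ball at a new position on every pass. It reuses the illustrative drawCircle kernel sketched earlier; in a real renderer each frame would of course be presented to the display before the next one is drawn.

#include <cuda_runtime.h>

// drawCircle is the illustrative kernel sketched above.
int main()
{
    const int width = 3840, height = 2160;                // the display matrix from the example
    unsigned char *frame;
    cudaMalloc((void **)&frame, (size_t)width * height);  // framebuffer in GPU memory

    dim3 block(16, 16);                                   // 256 threads per tile
    dim3 grid((width + block.x - 1) / block.x,            // enough tiles to cover every pixel
              (height + block.y - 1) / block.y);

    for (int t = 0; t < 60; ++t) {                        // rebuild the whole matrix each frame
        float cx = 200.0f + 10.0f * t, cy = 1080.0f;      // simple animation path for the ball
        drawCircle<<<grid, block>>>(frame, width, height, cx, cy, 100.0f);
    }
    cudaDeviceSynchronize();                              // wait for the GPU to finish
    cudaFree(frame);
    return 0;
}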
With the advent of larger displays and multiple displays, the demand on the GPU increased, and NVIDIA responded by building even more powerful hardware, evolving its GPUs into what are effectively massively parallel processors. NVIDIA also began innovating in ways that the more general-purpose CPUs from Intel and ARM were not. NVIDIA GPUs are based on a proprietary architecture that features a Very Long Instruction Word (VLIW) design, which allows the compiler to insert “hints” into each machine instruction that the hardware can then use to optimize performance during execution. NVIDIA has also pioneered a unique parallel processing architecture in which a single instruction cycle operates on multiple threads executing simultaneously, with those threads running on a multi-core unit known as a Streaming Multiprocessor (an 8-core unit in the early designs). NVIDIA calls this singular approach to parallel processing Single Instruction, Multiple Threads, or SIMT.[1]
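A tiny example makes the SIMT idea concrete: in the kernel below there is a single instruction stream, but thousands of threads execute it, each on its own array element. (On NVIDIA hardware those threads are scheduled in lockstep groups, called warps, on the Streaming Multiprocessors.) The kernel and array names are my own illustration, not taken from the post.

// One instruction stream, many threads: every thread performs the same
// multiply-add, but on the element of x and y that its index selects.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index for this thread
    if (i < n)
        y[i] = a * x[i] + y[i];                      // same instruction, different data per thread
}

// Example launch: 1,048,576 elements, 256 threads per block.
// saxpy<<<((1 << 20) + 255) / 256, 256>>>(1 << 20, 2.0f, d_x, d_y);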
Somewhere along the line, the designers of NVIDIA’s graphics accelerator boards recognized that their innovations in massively parallel processing might benefit numerical workloads well beyond the range of computer graphics. That recognition led to the development and release of the original CUDA interface, which gave other software developers access to the GPU for the first time. NVIDIA wasn’t building data center hardware like the DGX H100 at the time, only GPU cards for accelerating computer graphics. But the developers who started playing with the CUDA interface quickly figured out that leveraging those GPUs for parallel processing was often a big win. With CUDA, you do have to figure out which portions of your compute-intensive code to offload to the GPU and which serial code needs to stay on the Host CPU. But, naturally, NVIDIA started supplying profiler tools to help software developers with those decisions.
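A rough sketch of that split, under the same illustrative assumptions as the earlier snippets: the serial bookkeeping (allocation, data movement) stays on the Host CPU, and only the compute-intensive loop, here the saxpy kernel sketched above, is offloaded to the GPU. Whether the copy overhead is repaid by the parallel speedup is exactly the kind of question the profiling tools are meant to answer.

#include <cuda_runtime.h>
#include <vector>

// Host-side wrapper: serial code stays on the CPU, the hot loop goes to the GPU.
void run_saxpy(int n, float a, std::vector<float> &x, std::vector<float> &y)
{
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));    // device buffers
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y); // the offloaded, parallel portion

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}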
One of the issues that emerged as soon as CUDA opened up the GPU for numerical computation more generally was that the GPUs NVIDIA was building at the time did not have stellar performance on floating point instructions, since manipulating the computer graphics matrix is done mainly with integer arithmetic. In contrast, the mainstays of numerical computation, known as the Basic Linear Algebra Subprograms (BLAS), rely heavily on floating point instructions. One step closer still to Machine Learning-specific algorithms is General Matrix Multiply (GeMM), used to solve matrix algebra problems that also typically require floating point arithmetic. From the perspective of the parallel processing hardware, a popular Machine Learning framework such as TensorFlow, originally created at Google, generates instruction streams that look very similar to GeMM.
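To show why such an instruction stream leans so heavily on floating point, here is a deliberately naive GeMM kernel in CUDA; production code would call a tuned library such as cuBLAS instead, and the row-major layout and parameter names are illustrative assumptions.

// Naive GeMM: C = alpha * A * B + beta * C, for row-major M x K and K x N matrices.
// Each thread computes one element of C; the inner loop is K fused multiply-adds.
__global__ void gemm_naive(int M, int N, int K, float alpha,
                           const float *A, const float *B, float beta, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * N + col];      // floating point dominates the work
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}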
NVIDIA’s response, in turn, was to beef up the performance of floating point instructions significantly in each succeeding generation of its GPU hardware, which further differentiated its data center hardware from the original desktop graphics accelerators. If you look at what NVIDIA calls the Tensor Core GPU in its Hopper-class machines, what you see is superfast floating point instructions.
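The post does not go into the programming interface for those units, but as a hedged sketch of how Tensor Cores are exercised from CUDA, the warp-level WMMA API lets a single warp multiply small half-precision tiles and accumulate the result in single precision; the 16 x 16 tile size and pointer names below are illustrative.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively multiplies a pair of 16x16 half-precision tiles and
// accumulates the product in float on the Tensor Cores.
__global__ void tile_mma(const half *A, const half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension of the tiles is 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the matrix-multiply-accumulate step
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launched with exactly one warp, e.g.: tile_mma<<<1, 32>>>(dA, dB, dC);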
I will save a deeper dive into the unique NVIDIA GPU architecture for a later blog post. The point I want to emphasize here is that NVIDIA’s positioning of its hardware in the data center as the premier compute engine for executing Machine Learning programs is the result of long years of innovation in this space. That strategy is certainly starting to pay off now, but for many years prior to the current generative AI moment it wasn’t obvious that it would ever yield a dividend to the company. But NVIDIA has stuck with the program.
The company’s business has actually been looking precarious for the last few years as the market for desktop GPU coprocessors has declined. Decreased revenue from its desktop graphics cards was what caused the NVIDIA stock price to fall to its September 2022 lows. Apple has locked NVIDIA out of its proprietary hardware platforms completely, and Intel has been pushing into the space aggressively by adding more GPU capability to its desktop cores. This shrinking market for desktop GPU cards has made NVIDIA’s pivot to the data center business very tricky. I think the manner in which Jensen Huang and his management team have handled this transition is very commendable; they have been juggling a boatload of risk to get to where they are today, especially after Jensen’s previous big move, the attempted purchase of the chip designer Arm from SoftBank, had to be abandoned.
Before leaving today’s topic,
I would like to dispel any suggestion that NVIDIA might be in any way lucky to
have arrived at its current position as a technological leader in the Machine
Learning space. What I see is a long history of innovation by NVIDIA in adapting its GPUs, designed originally for graphics acceleration, to handle the numerical computations that GeMM routines and frameworks like TensorFlow execute. What I am talking about is what Sara Hooker, an AI researcher at Google, has called the hardware lottery: the way today’s hardware, with its emphasis on matrix algebra and floating point arithmetic, helps determine which research ideas win out. (You can read Dr. Hooker’s original hardware lottery paper here, but I realize many people prefer to watch a video instead. 😊) Hooker’s field of AI research is computer vision, and the intensive optimization of floating point matrix algebra featured in the TensorFlow core doesn’t help that much in that area of AI research. Her message is intended as a warning that those very successful hardware optimizations may start to foreclose other research avenues, which would certainly be putting the software cart in front of the hardware horse.
What is noteworthy in that regard is that one of the nascent lines of business over at NVIDIA happens to be autonomous vehicles: technology for self-driving cars. The technology associated with autonomous vehicles has been notoriously over-hyped (take a bow, Elon Musk), but the potential is certainly there; it is something automakers cannot afford to ignore. Computer vision, using either video cameras or even higher resolution LiDAR, married to Deep Learning neural networks and generative AI, might be the path forward for this technology. NVIDIA’s approach has been to partner with automakers like Mercedes, and new hardware requirements are likely to emerge from that collaboration. It is a long shot, but so was CUDA back in the day.
[1] SIMT distinguishes the NVIDIA GPU architecture from a popular approach to speeding up vector operations known as Single Instruction, Multiple Data, or SIMD. Vector processing of this kind dates back to 1970s supercomputers such as the Control Data STAR-100 and Seymour Cray’s Cray-1, both of which featured hardware vector instructions. Intel, for example, added SIMD vector instructions to its processor cores with the MMX and SSE extensions in the late 1990s.
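To illustrate the footnote’s distinction, the sketch below contrasts the two styles: the SIMD version packs eight floats into one wide register and adds them with a single host (CPU) instruction, while the SIMT version expresses the same addition as one instruction stream executed by many independent GPU threads. Both functions are my own illustration.

#include <immintrin.h>

// SIMD on the host CPU: one AVX instruction adds 8 packed floats at once.
void add_simd8(const float *a, const float *b, float *c)
{
    __m256 va = _mm256_loadu_ps(a);                  // 8 floats in one 256-bit register
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));      // single instruction, 8 data lanes
}

// SIMT on the GPU: one instruction stream, executed by one thread per element.
__global__ void add_simt(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // each thread owns one element
    if (i < n) c[i] = a[i] + b[i];
}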