A high-performance Arm server CPU for use in large AI systems

Kicking off a busy spring GPU Technology Conference this morning, NVIDIA announced that the graphics and accelerator designer will once again be designing its own Arm-based CPU. Named Grace – after Grace Hopper, the computer programming pioneer and US Navy rear admiral – the CPU is NVIDIA’s latest attempt to more fully vertically integrate its hardware stack by offering a high-performance CPU alongside its regular GPU products. According to NVIDIA, the chip is being designed specifically for large-scale neural network workloads, and is expected to become available in NVIDIA products in 2023.

With two years to go before the chip ships, NVIDIA is playing things relatively coy for now. The company is offering only limited details about the chip – it will be based on a future iteration of Arm’s Neoverse cores, for example – as today’s announcement is focused a bit more on NVIDIA’s future workflow model than on speeds and feeds. If nothing else, the company is making it clear early on that, at least for now, Grace is an in-house product for NVIDIA, to be offered as part of its larger server offerings. The company is not directly going after the Intel Xeon or AMD EPYC server market; instead it is building its own chip to complement its GPU offerings, creating a specialized chip that can connect directly to its GPUs and help handle enormous, trillion-parameter AI models.

NVIDIA CPU Specification Comparison

                         Grace                        Carmel               Denver 2
CPU cores                ?                            8                    2
CPU architecture         Next-gen Arm Neoverse        Custom Arm v8.2      Custom Arm v8
Memory bandwidth         >500 GB/sec LPDDR5X (ECC)    137 GB/sec LPDDR4X   60 GB/sec LPDDR4
GPU-to-CPU interface     >900 GB/sec NVLink 4         PCIe 3               PCIe 3
CPU-to-CPU interface     >600 GB/sec NVLink 4         N/A                  N/A
Manufacturing process    ?                            TSMC 12nm            TSMC 16nm
Release year             2023                         2018                 2016

Broadly speaking, Grace is designed to fill the CPU-sized hole in NVIDIA’s AI server offerings. The company’s GPUs are incredibly well-suited for certain classes of deep learning workloads, but not all workloads are purely GPU-bound, if only because a CPU is needed to keep the GPUs fed. NVIDIA’s current server offerings, in turn, are typically based on AMD’s EPYC processors, which are very fast for general compute purposes, but lack the kind of high-speed I/O and deep learning optimizations NVIDIA is looking for. In particular, NVIDIA is currently bottlenecked by the use of PCI Express for CPU-GPU connectivity; their GPUs can talk to each other quickly over NVLink, but not back to the host CPU or system RAM.

The solution to that problem, just as it was before Grace, is to use NVLink for CPU-GPU communications. Previously, NVIDIA worked with the OpenPOWER Foundation to get NVLink into POWER9 for exactly this reason, but that relationship is apparently on its way out, both as POWER’s popularity wanes and as POWER10 skips NVLink. Instead, NVIDIA is going its own way by building an Arm server CPU with the necessary NVLink functionality.

The end result, according to NVIDIA, will be a high-performance, high-bandwidth CPU designed to work in tandem with a future generation of NVIDIA server GPUs. With NVIDIA talking about pairing each NVIDIA GPU with a Grace CPU on a single board – similar to today’s mezzanine cards – not only do CPU performance and system memory scale up with the number of GPUs, but in a roundabout way Grace will serve as a co-processor of sorts to NVIDIA’s GPUs. This, if nothing else, is a very NVIDIA solution to the problem: it not only improves their performance, but also gives them a counter should the more traditionally integrated AMD or Intel attempt a similar CPU+GPU fusion play.

By 2023, NVIDIA will be up to NVLink 4, which will offer at least 900 GB/sec of bandwidth between the CPU and GPU, and over 600 GB/sec between Grace CPUs. Critically, this is greater than the memory bandwidth of the CPU, which means that NVIDIA’s GPUs will have a cache-coherent link to the CPU that can access system memory at full bandwidth, and which also allows the whole system to share a single memory address space. NVIDIA describes this as balancing the amount of bandwidth available in a system, and they are not wrong, but there is more to it. Having an on-package CPU is a major means of increasing the amount of memory NVIDIA’s GPUs can effectively access and use, as memory capacity is still the primary limiting factor for large neural networks – you can only efficiently run a network as big as your local memory pool.
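That coherent, full-bandwidth link matters most in a unified-memory programming model, where the CPU and GPU operate on the same allocations rather than staging copies back and forth. As a rough illustration only – this is a generic CUDA managed-memory sketch, not anything NVIDIA has shown for Grace, and the kernel name and sizes are made up – the pattern looks like this:

// Minimal sketch: CPU and GPU sharing one address space via CUDA managed memory.
// Over a coherent, high-bandwidth CPU-GPU link, the GPU can work on data resident
// in (much larger) system memory without explicit cudaMemcpy staging.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 20;                        // illustrative size
    float *data = nullptr;

    // One allocation, visible to both the CPU and the GPU.
    cudaMallocManaged(&data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes...

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // ...GPU reads and updates...
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // ...CPU reads the result back.
    cudaFree(data);
    return 0;
}

Today, any access a GPU makes to data sitting in system memory is gated by PCIe; the pitch for Grace is that this same pattern keeps working, only with far more memory reachable at NVLink speeds and with full cache coherency.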

CPU and GPU Interconnect Bandwidth Comparison

                         Grace                     EPYC 2 + A100                   EPYC 1 + V100
GPU-to-CPU interface     >900 GB/sec NVLink 4      ~32 GB/sec PCIe 4               ~16 GB/sec PCIe 3
CPU-to-CPU interface     >600 GB/sec NVLink 4      304 GB/sec Infinity Fabric 2    152 GB/sec Infinity Fabric

And this memory-focused strategy is reflected in the design of Grace’s memory pool as well. Since NVIDIA is putting the CPU on a shared package with the GPU, they are going to put the RAM down right next to it. Grace-equipped GPU modules will include a yet-to-be-disclosed amount of LPDDR5X memory, with NVIDIA targeting at least 500 GB/sec of memory bandwidth. Besides likely being the highest-bandwidth non-graphics memory option in 2023, LPDDR5X is also being touted by NVIDIA as a win for energy efficiency, owing to the technology’s mobile-focused roots and very short trace lengths. And since this is a server part, Grace’s memory will be ECC-enabled as well.

In terms of CPU performance, this is actually the area where NVIDIA has said the least. The company will be using a future generation of Arm’s Neoverse CPU cores, where the initial N1 design has already been turning heads. Beyond that, the company is only saying that the cores should break 300 points on the SPECrate2017_int_base throughput benchmark, which would be comparable to some of AMD’s second-generation 64-core EPYC CPUs. The company also isn’t saying much about how the CPUs will be configured, or what optimizations are being added specifically for neural network processing. But since Grace is meant to support NVIDIA’s GPUs, I would expect it to be stronger where GPUs in general are weaker.

Otherwise, as mentioned earlier, NVIDIA’s big-picture goal for Grace is significantly reducing the time required to train the largest neural network models. NVIDIA is aiming for 10x higher performance on 1-trillion-parameter models, and their performance projections for a 64-module Grace + A100 system (with theoretical NVLink 4 support) would bring training such a model down from one month to three days. Or, alternatively, being able to do real-time inference on a 500-billion-parameter model on an 8-module system.
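To put those model sizes in perspective, a quick back-of-envelope estimate (my own numbers, assuming 2-byte FP16 weights and ignoring activations, gradients, and optimizer state) shows why memory capacity is the gating factor here:

\[
500 \times 10^{9}\ \text{parameters} \times 2\ \tfrac{\text{bytes}}{\text{parameter}} \approx 1\ \text{TB of weights},
\qquad
\frac{1\ \text{TB}}{8\ \text{modules}} \approx 125\ \text{GB per module}.
\]

That is well beyond the onboard HBM2e of a single A100, which is exactly where a large, coherent pool of CPU-attached LPDDR5X earns its keep.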

Overall, this is NVIDIA’s second real stab at the data center CPU market – and the first that is likely to stick. NVIDIA’s Project Denver, originally announced just over a decade ago, never really panned out the way NVIDIA expected. The family of custom Arm cores was never quite good enough, and never made it out of NVIDIA’s mobile SoCs. Grace, in contrast, is a much safer project for NVIDIA; they are merely licensing Arm cores rather than building their own, and those cores will also be used by numerous other parties. So NVIDIA’s risk is reduced, coming down largely to getting the I/O and memory plumbing right, and keeping the final design energy efficient.

If all goes according to plan, expect to see Grace in 2023. NVIDIA has already confirmed that Grace modules will be available for use in HGX boards, and by extension DGX and all the other systems that use those boards. While we have not yet seen the full scope of NVIDIA’s Grace plans, it is clear that they intend to make it a core part of future server offerings.

First two supercomputer customers: CSCS and LANL

And although Grace will not ship until 2023, NVIDIA has already lined up its first customers for the hardware – and they are supercomputer customers, no less. Both the Swiss National Supercomputing Centre (CSCS) and Los Alamos National Laboratory announced today that they will be ordering supercomputers based on Grace. Both systems are being built by HPE’s Cray group, and are set to come online in 2023.

CSCS’s system, called Alps, will replace their current Piz Daint system, a Xeon plus NVIDIA P100 cluster. According to the two companies, Alps will offer 20 ExaFLOPS of AI performance, which is presumably a combination of CPU, CUDA core, and tensor core throughput. When it launches, Alps should be the fastest AI-focused supercomputer in the world.


An artist’s rendering of the planned Alps system

Interestingly, however, CSCS’s ambitions for the system extend beyond just machine learning workloads. The institute says they will be using Alps as a general-purpose system, working on more traditional HPC-focused tasks as well. This includes CSCS’s traditional research into weather and climate, which the pre-AI Piz Daint is already being used for.

As mentioned, Alps will be built by HPE, based on their previously announced Cray EX architecture. This would make NVIDIA’s Grace the second CPU option for Cray EX, alongside AMD’s EPYC processors.

Meanwhile, the Los Alamos system is being developed as part of an ongoing collaboration between the lab and NVIDIA, with LANL set to be the first US customer to receive a Grace system. LANL is not discussing the expected performance of their system beyond saying that it is expected to be “leadership class,” though the lab does plan to use it for 3D simulations, taking advantage of the larger data set sizes afforded by Grace. The LANL system is set to be delivered in early 2023.
