

We were under the distinct impression that AMD would not talk much about datacenter compute engines at this year’s Consumer Electronics Show, having already launched its “Genoa” Epyc 9004 server CPUs with great fanfare back in November.
But a week ahead of Intel’s launch – finally – of its “Sapphire Rapids” Xeon SP server CPUs, AMD decided at the last minute to have its president and chief executive officer Lisa Su use the annual CES keynote to talk about datacenter products other than the Epyc 9004s. So we now know a bit more about the future (and as-yet-unshipped) Instinct MI300 hybrid CPU-GPU compute engines. Alongside those accelerators comes an upcoming Alveo V70 matrix math accelerator derived from the hard-coded AI engines in the Xilinx “Everest” Versal FPGAs.
AMD has been talking about so-called “accelerated processing units,” or APUs, for over a decade, and has been making hybrid CPU-GPU chips for desktop, laptop, and game console machines for many years. And it has always wanted to build a big, bad APU for datacenter compute. In fact, after the first “Naples” Epyc 7001 CPU launch in 2017, AMD was expected to ship an APU that combined elements of that CPU and a Radeon “Vega” GPU in a single package. For whatever reason – most likely the cost of integration and the lack of a software platform to program it easily – AMD shelved the effort and never talked about datacenter APUs again, except in passing.
However, that did not mean AMD wasn’t working behind the scenes on an APU for the datacenter. The future “El Capitan” exascale-class supercomputer for Lawrence Livermore National Laboratory, which will be installed later this year, will have one. In fact, Lawrence Livermore, AMD, Cray, and Hewlett Packard Enterprise have all hidden the fact that the El Capitan machine will use an APU rather than a collection of discrete AMD CPUs and GPUs, as the “Frontier” supercomputer installed at Oak Ridge National Laboratory last year does. Diagrams of the El Capitan machine actually show the separate devices – essentially the same one-CPU-to-four-GPU ratio that Frontier has. It doesn’t matter. We understand why the truth was fudged. And it is no surprise that Intel suddenly started working on its “Falcon Shores” hybrid CPU-GPU compute engines for HPC and AI workloads rather than just discrete Xeon SP CPUs and discrete Max Series GPUs codenamed “Ponte Vecchio” and “Rialto Bridge.”
AMD talked about the Instinct MI300A – as the APU variant of the MI300 series will apparently be called – in some detail in June last year. While AMD hasn’t said anything about it, we expect there will also be discrete, PCI-Express 5.0 variants of the MI300 series. What is clear is that Intel, AMD, and Nvidia will all continue to sell discrete CPUs and GPUs alongside AMD’s APU and what Nvidia calls a “superchip.”
Six months ago, AMD said that the MI300 would offer 8X the AI performance of the MI250X GPU accelerator used in the Frontier machine, and this could be accomplished fairly easily by putting four of its “Aldebaran” GPU chiplets in a single package and dropping down to FP4 eighth-precision floating point math. (The MI250X has two Aldebaran GPU chiplets and bottoms out at FP16 and BF16 half-precision floating point.) Or, it could mean four GPU chiplets, each with twice as many cores, supporting the FP8 format that Nvidia supports in its “Hopper” H100 GPUs and Intel supports in its Gaudi2 accelerators.
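As a sanity check on that 8X claim, here is a minimal back-of-the-envelope sketch – the assumption that throughput doubles with each halving of floating point precision is ours, not AMD’s disclosure – showing that both of those hypothetical paths land on the same multiplier:

```python
# Back-of-the-envelope check of two possible paths to AMD's claimed 8X
# AI uplift over the MI250X (2 Aldebaran chiplets, FP16/BF16 floor).
# Assumes throughput doubles with each halving of precision.

MI250X_CHIPLETS = 2  # Aldebaran GPU chiplets in the MI250X package

def uplift(chiplets, core_scale, precision_halvings):
    """Relative AI throughput versus the MI250X baseline."""
    return (chiplets / MI250X_CHIPLETS) * core_scale * 2 ** precision_halvings

# Path A: four chiplets, same cores per chiplet, FP16 -> FP4 (two halvings)
print(uplift(chiplets=4, core_scale=1, precision_halvings=2))  # 8.0

# Path B: four chiplets, twice the cores each, FP16 -> FP8 (one halving)
print(uplift(chiplets=4, core_scale=2, precision_halvings=1))  # 8.0
```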
The big deal with the MI300 is that the CPU and GPU are in a single package, built with 3D packaging techniques and sharing the same HBM memory space. There is no data movement between the CPUs and GPUs in the package – they literally share the same memory. This will apparently simplify the programming of hybrid compute, and Su said it promises a “step function” increase in performance over the 2 exaflops already delivered by the CPU-GPU complexes in the Frontier system.
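As a loose analogy for what that shared memory space means to programmers – a toy Python sketch, emphatically not AMD’s actual programming model – here is the difference between copying buffers between two private memories and giving two compute engines views of one buffer:

```python
# Toy illustration (hypothetical, not AMD's API): copy-based sharing
# between discrete CPU and GPU memories versus one shared memory space.
import numpy as np

# Discrete model: each device has private memory; sharing means copying.
cpu_mem = np.arange(4)
gpu_mem = cpu_mem.copy()   # explicit data movement over an interconnect
gpu_mem += 1               # a GPU-side update...
cpu_mem = gpu_mem.copy()   # ...must be copied back to become visible

# APU model: one physical memory, two views; no copies at all.
shared = np.arange(4)
cpu_view = shared          # the "CPU" sees the buffer...
gpu_view = shared          # ...and the "GPU" sees the very same buffer
gpu_view += 1              # an update is immediately visible to both
assert cpu_view[0] == 1
```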
“To do this, we are building the world’s first datacenter processor that combines a CPU and GPU on a single chip,” Su said in her keynote, and she clearly meant package, not chip. “The Instinct MI300 is the first chip that brings together a datacenter CPU, GPU, and memory in a single integrated design. What this enables us to do is share system resources for memory and I/O, and it results in a significant increase in performance and efficiency while being much easier to program.”
Nvidia will ship its “Grace” Arm CPU plus “Hopper” H100 GPU superchips before the MI300A ships from AMD, but the difference is that the Grace CPU on the superchip has its own LPDDR5 main memory while the Hopper GPU on the superchip has its own HBM3 stacked memory. They have coherent memory – meaning they can move data between the devices quickly and share it over an interconnect – but this is not literally the same physical memory used by both devices, which is what eliminates data movement between the two compute blocks and memory types on the MI300A. (We can debate which is the better approach later, once HPC and AI centers are coding for both Grace-Hopper and the MI300.)
During her keynote, Su gave a few more details about the MI300A APU.
We had thought the MI300A APU would have 64 cores, like the custom “Trento” Epyc 7003 processor used in the Frontier system, possibly cut down to 32 cores if the heat got to be too much in the device. However, it turns out the MI300A will have only 24 of the Zen 4 cores used in the Genoa Epyc 9004s. The Zen 4 cores deliver 14 percent better IPC than the Zen 3 cores used in the Epyc 7003s, meaning those 24 cores running at the same clock speed match about 27 Zen 3 cores on integer work. But the floating point units in the Zen 4 cores do about twice the work of those in the Zen 3 cores, so the 24 Zen 4 cores deliver roughly the FP64 and FP32 performance of 55 Zen 3 cores, clock speed depending of course. (That floating point bump comes from increased memory bandwidth as well as support for AVX-512 instructions, which originated in the Intel Xeon SP architecture.)
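The clock-for-clock arithmetic behind those equivalences is simple enough to show – a minimal sketch using only the figures above, and assuming the 2X floating point uplift compounds with the IPC gain:

```python
# Clock-for-clock Zen 3 core equivalents of the MI300A's 24 Zen 4 cores.
ZEN4_CORES = 24
IPC_UPLIFT = 1.14  # Zen 4 vs Zen 3 instructions per clock
FP_UPLIFT = 2.0    # Zen 4 FP units do roughly twice the Zen 3 work

integer_equiv = ZEN4_CORES * IPC_UPLIFT
fp_equiv = ZEN4_CORES * IPC_UPLIFT * FP_UPLIFT

print(f"Integer: ~{integer_equiv:.1f} Zen 3 cores")  # ~27.4
print(f"FP64/FP32: ~{fp_equiv:.1f} Zen 3 cores")     # ~54.7
```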
AMD says the MI300A package consists of nine 5-nanometer chiplets and four 6-nanometer chiplets, surrounded by HBM3 memory. Here is what the package looks like as rendered:
And here is a very tight zoom in on the package:
That definitely looks like six GPU chiplets, plus two CPU chiplets and an I/O die on top, with four chiplets on the bottom connecting the two banks of HBM3 memory to the complex at eight different points. That would mean AMD has re-implemented the I/O and memory die from the Genoa Epyc 9004 complex in 5-nanometer processes instead of the 6-nanometer process used for that I/O and memory die. We strongly suspect that Infinity Cache is implemented in those four 6-nanometer connectivity chiplets, but nothing has been said about that. The CPU cores in the MI300A package appear to lack 3D V-Cache.
The Genoa Epyc 9004 consists of core complex dies (CCDs) etched with Taiwan Semiconductor Manufacturing Co’s 5-nanometer processes, and that appears to be the case here as well. Each of those Genoa CCDs has eight cores and 32MB of L3 cache.
That is either too few or too many cores, depending on how you interpret the MI300A design. We suspect there are some dud cores, and that this complex actually has 32 physical Zen 4 cores on it with only 24 active. With a gun to our heads, we would estimate that there are four times as many GPU compute engines on this chip complex, spread across six chiplets (rather than the rumored four, and considerably more than the two on the MI250X), with FP8 as the lowest precision. That gets the CDNA 3 GPU architecture to the aforementioned 8X figure for AI performance.
The entire MI300A complex described above has an incredible 146 billion transistors, Su said.
Now let’s talk about the 5X better performance per watt that Su and company talked about. The MI250X runs at 560 watts to deliver peak performance, and if you do the math – 8X the performance at 5X the performance per watt – the MI300A complex will weigh in at around 900 watts. That includes the 128GB of HBM3 memory, which will probably run pretty hot across eight stacks.
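That power estimate follows directly from the figures quoted above – a quick sketch of the arithmetic:

```python
# Estimating MI300A power draw from AMD's relative performance claims.
MI250X_WATTS = 560   # peak power draw of the MI250X baseline
PERF_UPLIFT = 8.0    # claimed AI performance vs the MI250X
PPW_UPLIFT = 5.0     # claimed performance-per-watt vs the MI250X

# perf/watt = perf / watts, so: watts = base_watts * perf / perf_per_watt
mi300a_watts = MI250X_WATTS * PERF_UPLIFT / PPW_UPLIFT
print(f"Implied MI300A power: {mi300a_watts:.0f} watts")  # ~896
```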
Su said it currently takes months to train AI foundation models on thousands of GPUs, at a cost of millions of dollars in electricity alone, adding that MI300A devices will let companies do that training in weeks instead of months – saving a great deal of time and energy, or training larger models for the same money.
The Instinct MI300A is now back from the foundry and in the labs, and it will launch in the second half of this year. Lawrence Livermore, of course, is first in line.
All that remains is the Alveo V70, about which Su said little.
The Alveo V70 takes the DSP matrix math engines from the Versal FPGAs and puts a bunch of them on a new piece of silicon aimed solely at AI inference. (The Ryzen PC chips get the same AI matrix math engines.) This particular device plugs into a PCI-Express 5.0 slot, burns just 75 watts – a magic number for inference accelerators – and delivers 400 tera operations per second (TOPS) of AI performance, presumably at INT8 precision, though it could be FP8 or even INT4; AMD did not say. What Su did say is that compared to the Nvidia T4 accelerator – now a generation behind, with Nvidia having launched its “Lovelace” L40 accelerators in September 2022 – the Alveo V70 can run 70 percent faster on smart city applications, 72 percent faster on patient monitoring applications, and 80 percent faster on smart retail applications that have AI inference as part of the workload.
The Ryzen AI engines used in the Ryzen 7040 series laptop processors weigh in at 3 TOPS each, at whatever clock speed those processors run. If the AI engines run at the same clock speed on the Alveo V70 accelerators – probably around 3 GHz – it would take about 135 of them to reach 400 TOPS of total performance. The Alveo V70 likely runs at a slower clock speed – maybe somewhere between 1 GHz and 1.5 GHz – and could therefore have anywhere from 250 to 400 of these AI engines inherited from the Xilinx FPGAs.
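A rough sketch of that engine-count estimate, assuming TOPS per engine scale linearly with clock speed (our assumption; AMD has published neither the clocks nor the engine count):

```python
# Estimating how many Versal-derived AI engines the Alveo V70 needs to
# hit 400 TOPS, assuming per-engine TOPS scale linearly with clock speed.
TARGET_TOPS = 400
TOPS_PER_ENGINE_AT_3GHZ = 3.0  # per-engine rating in Ryzen 7040 chips

for clock_ghz in (3.0, 1.5, 1.0):
    tops_per_engine = TOPS_PER_ENGINE_AT_3GHZ * clock_ghz / 3.0
    engines = TARGET_TOPS / tops_per_engine
    print(f"{clock_ghz:.1f} GHz -> ~{engines:.0f} engines")
# 3.0 GHz -> ~133 engines, 1.5 GHz -> ~267, 1.0 GHz -> ~400
```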