Deep Dive into Nvidia Blackwell
Nvidia Blackwell GPU Deep Dive: GB202 Architecture Analysis, Performance vs AMD RDNA4, 750mm² Design with 192 SMs and Advanced Memory Subsystem
Nvidia has long been committed to building massive GPUs, and its latest graphics architecture, Blackwell, continues the tradition. GB202 is the largest Blackwell die. It occupies 750 square millimeters and contains 92.2 billion transistors. GB202 has 192 Streaming Multiprocessors (SMs), the closest GPU equivalent of a CPU core, and feeds them with a very large memory subsystem.
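To put those figures in perspective, here is a minimal sketch that derives two rough ratios from the numbers quoted above: average transistor density and the whole-die transistor budget per SM. The function names are illustrative, and the per-SM figure is only an average over the entire die (it includes caches, memory controllers, and other non-SM logic), not a measurement of an SM itself.

```python
# Figures quoted in the text for GB202.
DIE_AREA_MM2 = 750
TRANSISTORS = 92.2e9
SM_COUNT = 192


def density_mtr_per_mm2(transistors, area_mm2):
    """Average density in millions of transistors per square millimeter."""
    return transistors / area_mm2 / 1e6


def transistors_per_sm(transistors, sm_count):
    """Whole-die transistor count divided evenly by the number of SMs."""
    return transistors / sm_count


if __name__ == "__main__":
    density = density_mtr_per_mm2(TRANSISTORS, DIE_AREA_MM2)
    per_sm = transistors_per_sm(TRANSISTORS, SM_COUNT)
    print(f"{density:.1f} MTr/mm^2")
    print(f"{per_sm / 1e6:.0f} M transistors per SM (whole-die average)")
```

With the quoted figures this works out to roughly 123 million transistors per square millimeter and about 480 million transistors of die budget per SM.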
Nvidia's RTX PRO 6000 Blackwell features the largest GB202 configuration to date. It sits alongside the RTX 5090 in Nvidia's product lineup; the RTX 5090 also uses GB202, but with more SMs disabled.
A high-level comparison shows the scale of Nvidia's largest Blackwell products. AMD's RDNA4 line tops out with the RX 9070 and RX 9070 XT. The RX 9070 is slightly cut down, with 4 of its 32 WGPs disabled. I will use the RX 9070 to provide comparison data.
Work Distribution
GPUs use dedicated hardware to launch threads across their cores, unlike CPUs, which rely on software scheduling in the operating system. Hardware thread launch is well suited to the short, small tasks common in GPU workloads. Streaming Multiprocessors (SMs) are the basic building blocks of Nvidia GPUs and are roughly analogous to CPU cores. SMs are grouped into Graphics Processing Clusters (GPCs), which contain rasterizers and the associated work distribution hardware.
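The SM and GPC hierarchy described above can be pictured with a toy model. The sketch below is not Nvidia's actual distribution policy, which is not described here; it simply assumes a round-robin scheme that spreads work items across GPCs before reusing an SM. The 12 GPC by 16 SM layout is also an assumption chosen to total 192 SMs, and the names `build_sm_list` and `distribute` are hypothetical.

```python
from collections import defaultdict
from itertools import cycle

# Assumed layout for illustration only: 12 GPCs x 16 SMs = 192 SMs.
GPC_COUNT = 12
SMS_PER_GPC = 16


def build_sm_list(gpc_count, sms_per_gpc):
    """Return (gpc, sm) pairs ordered so neighbors fall in different GPCs."""
    return [(gpc, sm) for sm in range(sms_per_gpc) for gpc in range(gpc_count)]


def distribute(num_blocks, gpc_count=GPC_COUNT, sms_per_gpc=SMS_PER_GPC):
    """Assign work item IDs to SMs round-robin, spreading across GPCs first."""
    assignment = defaultdict(list)
    targets = cycle(build_sm_list(gpc_count, sms_per_gpc))
    for block_id, target in zip(range(num_blocks), targets):
        assignment[target].append(block_id)
    return dict(assignment)


if __name__ == "__main__":
    result = distribute(200)
    print(len(result), "SMs received work")
    print("SM 0 of GPC 0 got work items", result[(0, 0)])
```

In this model a launch of only 12 work items lands on one SM in each GPC, while a launch larger than 192 items wraps around and gives some SMs a second item. Real hardware also has to account for per-SM resource limits such as registers and shared memory, which this sketch ignores.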