GeForce RTX 3070, 3080 and 3090 preview & analysis


Ampere Architecture Explored

Ampere architecture has over 10,000 FP32 (shading) cores

Ampere brings an updated architecture with a new SM (Streaming Multiprocessor) design. One SM is a cluster that holds your shader processors. As most of you have noticed, the shader processor count was a bit of an enigma; the shader count seems to have mysteriously doubled compared to what was expected in the first place, so there are some nuances to explain. Changes have been made to the Streaming Multiprocessor design that holds the shading cores. The RTX 3000 series GPUs hold SMs that hold FP32 compute units. The Ampere architecture supports parallel execution of FP32 and INT32 operations with independent thread scheduling, also described as concurrent execution of FP32 and INT32 operations. New compared to Turing is a combined INT32/FP32 cluster of shader processors, which effectively doubles the shader count. We'll show by example:



Above, the Turing SM

 


Ampere SM - Look to the left side cluster, INT32+FP32 is a significant change


The RTX 3000 series GPUs hold SMs whose core blocks hold FP32 compute units, as was the case in the past generation as well (Turing). However, look closer: the cluster that previously held only INT32 units is now INT32 + FP32. So to reiterate, the Ampere SM has a new datapath design for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 shader cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 shader cores and 16 INT32 cores. Therein lies the secret sauce, as that doubles the shading throughput. The result of this change (compared to Turing) is that each partition is capable of executing 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. One SM in its entirety, with its four partitions, can now execute 128 FP32 operations per clock, which is double the FP32 rate of a Turing SM (which does 64 FP32 and 64 INT32 operations per clock).

Performance gains will vary at the shader and application level depending on the mix of instructions. According to NVIDIA, ray tracing denoising shaders are good examples that should benefit greatly from the doubled FP32 throughput. Twice the shading performance can of course create bottlenecks of its own at an earlier stage in the pipeline. Therefore Ampere also has twice the shared memory and L1 cache bandwidth per SM: 128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing. Total L1 bandwidth for the GeForce RTX 3080 (128KB L1) is 219 GB/sec versus 116 GB/sec for the GeForce RTX 2080 Super (Turing, 96KB L1). Each segment then leads to one Tensor core and of course an RT core, both again renewed.
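To make the per-clock arithmetic above a bit more tangible, here is a minimal Python sketch of the two datapaths per partition. It is a toy model, not NVIDIA's scheduler: the `Partition` class and its field names are hypothetical, and the count of four partitions per SM is inferred from the 128-versus-32 figures rather than taken from a spec sheet.

```python
from dataclasses import dataclass

# Assumption: four partitions per SM, inferred from 128 FP32 ops/clock per SM
# divided by 32 FP32 ops/clock per partition.
PARTITIONS_PER_SM = 4


@dataclass(frozen=True)
class Partition:
    """Toy model of one SM partition with two 16-lane datapaths."""
    fp32_lanes: int            # datapath that only executes FP32
    second_lanes: int          # second datapath (INT32, plus FP32 on Ampere)
    second_does_fp32: bool     # True for Ampere, False for Turing

    def peak_fp32_per_clock(self) -> int:
        # Best case: every lane that can do FP32 is doing FP32.
        extra = self.second_lanes if self.second_does_fp32 else 0
        return self.fp32_lanes + extra

    def mixed_per_clock(self) -> tuple:
        # Mixed case: second datapath is busy with INT32 work.
        return (self.fp32_lanes, self.second_lanes)


TURING = Partition(fp32_lanes=16, second_lanes=16, second_does_fp32=False)
AMPERE = Partition(fp32_lanes=16, second_lanes=16, second_does_fp32=True)


def sm_peak_fp32(partition: Partition) -> int:
    """Peak FP32 operations per clock for a whole SM."""
    return partition.peak_fp32_per_clock() * PARTITIONS_PER_SM


if __name__ == "__main__":
    print("Turing SM peak FP32/clock:", sm_peak_fp32(TURING))
    print("Ampere SM peak FP32/clock:", sm_peak_fp32(AMPERE))
    print("Ampere partition, mixed FP32/INT32:", AMPERE.mixed_per_clock())
```

The takeaway from the model is that the doubled figure is a best case: as soon as a shader issues INT32 work, the second datapath is no longer available for FP32 in that clock, which is why gains depend on the instruction mix.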




Ampere is built from Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Raster Operators (ROPs), and the memory controllers. The GPC is the dominant high-level hardware block, with all of the key graphics processing units residing inside it. Each GPC includes a dedicated Raster Engine. Ampere has one more change here: the GPC now also carries two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs.

Mind you that with a full 82 SMs (x128) enabled, an RTX 3090 reaches its extraordinary 10,496 shader core count.
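The shader core count follows directly from the SM count, which a few lines of Python can confirm. This is only a sketch: the function name is hypothetical, and the SM counts for the RTX 3080 and RTX 3070 are assumptions taken from NVIDIA's public spec sheets, not from this article.

```python
# FP32 shader cores per Ampere SM, as described in the text (x128).
FP32_CORES_PER_SM = 128


def shader_cores(sm_count: int) -> int:
    """Total FP32 shader cores for a GPU with the given number of enabled SMs."""
    return sm_count * FP32_CORES_PER_SM


# 82 SMs for the RTX 3090 comes from the article. The other two SM counts
# are assumptions based on public specifications.
enabled_sms = {
    "GeForce RTX 3090": 82,
    "GeForce RTX 3080": 68,  # assumption, not stated in the article
    "GeForce RTX 3070": 46,  # assumption, not stated in the article
}

if __name__ == "__main__":
    for name, sms in enabled_sms.items():
        print(f"{name}: {sms} SMs x {FP32_CORES_PER_SM} = {shader_cores(sms)}")
```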


Up to 24GB GDDR6X Graphics memory

Worthy of a paragraph all by itself is not just the bandwidth of the RTX 3000 series but specifically the huge memory volume of the GeForce RTX 3090. It's been discussed widely already; the GeForce RTX 3080 will be released with 10GB of GDDR6X graphics memory initially. There is an option open for a possible 20GB version later down the line. The 10GB model, however, is confirmed. That means the memory bus sits at 320 bits wide for the 3080. The flagship GeForce RTX 3090 is going to make Flight Simulator 2020 fans happy: the mac daddy of the Ampere desktop consumer lineup gets fitted with a staggering 24 GB of GDDR6X graphics memory, and that means a 384-bit wide memory interface. GDDR6X has been announced already and can be configured in a 19 to 20 Gbps range for default configurations. That's an incredible amount of memory bandwidth, as it sits at close to 1 TB/sec in effective data rates. Micron earlier shared a document about GDDR6X being capable of close to 1 TB/sec in bandwidth. Developed in close cooperation with NVIDIA, the new memory would be able to deliver 19 to 21 Gb/s (data rate per pin). That means that with 12 GDDR6X ICs populated on the board, it could reach close to 1 TB/s of bandwidth (effective data rate). Prior to launch, if we math that up a bit loosely with some reserves, it would land somewhere between 912 and 1008 GB/s.
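For those who want to redo that bit of math, the bandwidth range falls out of bus width times per-pin data rate, divided by eight to go from bits to bytes. The Python sketch below is only an illustration: the function name is hypothetical, and the 32-bit interface per GDDR6X IC is an assumption (it is what makes 12 ICs add up to a 384-bit bus).

```python
# Assumption: each GDDR6X IC contributes a 32-bit slice of the memory bus.
BITS_PER_IC = 32


def bandwidth_gb_per_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: pins x Gb/s per pin, divided by 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


rtx_3090_bus = 12 * BITS_PER_IC   # 12 ICs -> 384-bit interface
rtx_3080_bus = 320                # 320-bit interface, as stated in the text

if __name__ == "__main__":
    for rate in (19, 20, 21):
        print(f"384-bit @ {rate} Gb/s: {bandwidth_gb_per_s(rtx_3090_bus, rate):.0f} GB/s")
    print(f"320-bit @ 19 Gb/s: {bandwidth_gb_per_s(rtx_3080_bus, 19):.0f} GB/s")
```

At 19 Gb/s the 384-bit bus lands on 912 GB/s and at 21 Gb/s on 1008 GB/s, which matches the range quoted above.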

