Nvidia GP100 GPU architecture recap - full GPU has 3840 Shader processors
Last week Nvidia announced the GP100 GPU powering the Tesla P100 HPC module, and more rumors are surfacing about GP104 as well. Since the full block diagrams for GP100 are now available, we can also tell what a fully enabled GP100 looks like. In this post, a little recap of the GP100 architecture and its positioning.
This year Nvidia is going to release several GPUs, all based on its new Pascal architecture, across a wide variety of market segments. For consumers, the first wave of graphics cards will be based on the GP104 GPU, which will power high-end products in the 'GTX 980' class. The current rumor is that the new GTX 1070 and 1080, albeit with somewhat odd Full HD-like naming, will use that chip. These should be announced around Computex in June, with availability in the summer (likely July).
Then there is big Pascal, the big daddy Nvidia GPU developed under the codename GP100. On the consumer side, this is the GPU that will power enthusiast-class products such as the Titan. Make no mistake, this product will not launch anytime soon for consumers. Expect at the very best a launch late this year, closer to the Christmas season, and likely even later in Q1/Q2 2017 (we think).
All Pascal products are based on a 16nm FinFET design, and the GP100 in particular comes with stacked HBM2 (16GB in four stacks). The Pascal-based GPU driving the unit holds 15.3 billion transistors, which is roughly double that of the current biggest Maxwell chip. GP100 is huge at 610mm^2. The projected performance (according to Nvidia) is 5.3 TFLOPS using 64-bit floating-point numbers, 10.6 TFLOPS using 32-bit and 21.2 TFLOPS using 16-bit. P100 has 4MB of L2 cache and 14MB of register file in total. The following table provides a high-level comparison of Tesla P100 specifications with previous-generation Tesla GPU accelerators; I have added the GP100 as a fully enabled product:
Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 | GP100 (fully enabled) |
GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) |
SMs | 15 | 24 | 56 | 60 |
TPCs | 15 | 24 | 28 | 30 |
FP32 CUDA Cores / SM | 192 | 128 | 64 | 64 |
FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 | 3840 |
FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 |
FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1920 |
Base Clock | 745 MHz | 948 MHz | 1328 MHz | ~1328 MHz |
GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | ~1480 MHz |
Texture Units | 240 | 192 | 224 | 240 |
Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | 16 GB |
L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB |
Register File Size / SM | 256 KB | 256 KB | 256 KB | 256 KB |
Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB | 15360 KB |
TDP | 235 Watts | 250 Watts | 300 Watts | ~300 Watts |
Transistors | 7.1 billion | 8 billion | 15.3 billion | 15.3 billion |
GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² |
Manufacturing Process | 28-nm | 28-nm | 16-nm | 16-nm |
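The TFLOPS figures Nvidia quotes can be reproduced from the core counts and boost clock in the table. The sketch below is a back-of-the-envelope check, not Nvidia's own method: it assumes each core retires one fused multiply-add (counted as 2 FLOPs) per clock and that FP16 runs at twice the FP32 rate. The function name `peak_tflops` is made up for illustration.

```python
# Theoretical peak = cores x FLOPs per core per clock x clock rate.
# Assumption: one fused multiply-add (2 FLOPs) per core per clock.

def peak_tflops(cores: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak throughput in TFLOPS."""
    return cores * flops_per_clock * clock_mhz * 1e6 / 1e12

BOOST_MHZ = 1480  # Tesla P100 boost clock, from the table

fp32 = peak_tflops(3584, BOOST_MHZ)  # FP32 CUDA cores on Tesla P100
fp64 = peak_tflops(1792, BOOST_MHZ)  # FP64 CUDA cores on Tesla P100
fp16 = 2 * fp32                      # assumption: FP16 at twice the FP32 rate

print(f"FP64 {fp64:.1f}  FP32 {fp32:.1f}  FP16 {fp16:.1f} TFLOPS")
# prints: FP64 5.3  FP32 10.6  FP16 21.2 TFLOPS
```

If a fully enabled GP100 were to run at the same boost clock (which is a guess, hence the tildes in the table), `peak_tflops(3840, BOOST_MHZ)` would land at roughly 11.4 TFLOPS for FP32.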
As the block diagram now shows, the GP100 features six graphics processing clusters (GPCs). Just look at the diagram and count along with me - each GPC holds 10 streaming multiprocessors (SMs) and then each SM has 64 CUDA cores and four texture units. Do the math and you'll reach 640 shader processors per GPC and 3840 shader cores with 240 texture units in total.
- 6 (GPC) x (10x64) = 3840 Shader processor units in total.
This means the GP100 used on the Tesla P100 is not fully enabled. Nvidia is known to ship GPUs with disabled segments, which helps them sell different SKUs. The Tesla P100 has a shader count of 3584 and thus has 56 of the 60 SMs enabled.
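For those who want to check the counting, here is the same arithmetic as a small sketch. The per-SM and per-GPC figures are the ones read off the block diagram; the variable names are only for illustration.

```python
# Unit counts as read from the GP100 block diagram.
GPCS = 6            # graphics processing clusters
SMS_PER_GPC = 10    # streaming multiprocessors per GPC
CORES_PER_SM = 64   # FP32 CUDA cores per SM
TEX_PER_SM = 4      # texture units per SM

full_sms = GPCS * SMS_PER_GPC            # 60
full_cores = full_sms * CORES_PER_SM     # 3840
full_tex = full_sms * TEX_PER_SM         # 240

# Tesla P100 ships with 3584 shader processors enabled.
p100_cores = 3584
p100_sms = p100_cores // CORES_PER_SM    # 56
disabled_sms = full_sms - p100_sms       # 4

print(f"Full GP100: {full_sms} SMs, {full_cores} cores, {full_tex} texture units")
print(f"Tesla P100: {p100_sms} SMs enabled, {disabled_sms} disabled")
```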
GP100’s SM incorporates 64 single-precision (FP32) CUDA Cores. In contrast, the Maxwell and Kepler SMs had 128 and 192 FP32 CUDA Cores, respectively. The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA Cores, an instruction buffer, a warp scheduler, and two dispatch units. While a GP100 SM has half the total number of CUDA Cores of a Maxwell SM, it maintains the same register file size and supports similar occupancy of warps and thread blocks. GP100’s SM has the same number of registers as Maxwell GM200 and Kepler GK110 SMs, but the entire GP100 GPU has far more SMs, and thus many more registers overall. This means threads across the GPU have access to more registers, and GP100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.
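To put numbers on the register argument, the sketch below works out the total register file per GPU and the registers available per CUDA core for the three Tesla parts in the table. It assumes 32-bit (4-byte) registers; the helper `register_stats` is a made-up name for illustration.

```python
REG_FILE_KB_PER_SM = 256   # same for Kepler, Maxwell and Pascal (see table)
BYTES_PER_REGISTER = 4     # assumption: 32-bit registers

regs_per_sm = REG_FILE_KB_PER_SM * 1024 // BYTES_PER_REGISTER  # 65536

def register_stats(sms: int, cores_per_sm: int) -> tuple[int, float]:
    """Return (register file per GPU in KB, registers per FP32 core)."""
    total_kb = sms * REG_FILE_KB_PER_SM
    regs_per_core = regs_per_sm / cores_per_sm
    return total_kb, regs_per_core

# (SM count, FP32 cores per SM), taken from the table
gpus = {
    "GK110 (Tesla K40)": (15, 192),
    "GM200 (Tesla M40)": (24, 128),
    "GP100 (Tesla P100)": (56, 64),
}

for name, (sms, cores) in gpus.items():
    total_kb, per_core = register_stats(sms, cores)
    print(f"{name}: {total_kb} KB total, {per_core:.0f} registers per core")
```

With the register file per SM unchanged and the core count per SM halved, each Pascal core has twice the registers of a Maxwell core to work with.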
Since the graphics memory is on-package HBM2, the VRAM amount is fixed. That means that all GP100 products will get 16GB of memory. The chip uses a wide 4096-bit HBM2 memory interface (1024 bits per stack) with an effective bandwidth of up to a full 1 TB/s.
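The 1 TB/s ceiling follows from the bus width once you pick a per-pin data rate. The sketch below assumes 2 Gbit/s per pin, which is an assumed top rate for HBM2 and not a confirmed Tesla P100 setting; the function name is hypothetical.

```python
def bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s: bus width x per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8

PIN_RATE_GBPS = 2.0  # assumption: per-pin data rate in Gbit/s

total = bandwidth_gb_s(4096, PIN_RATE_GBPS)      # all four stacks
per_stack = bandwidth_gb_s(1024, PIN_RATE_GBPS)  # a single 1024-bit stack

print(f"Per stack: {per_stack:.0f} GB/s, total: {total:.0f} GB/s")
# prints: Per stack: 256 GB/s, total: 1024 GB/s
```

A lower per-pin rate scales the result down linearly, which is why the article says "up to".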
This is a big chip, very big at 610mm^2, hence it is interesting to see that 16nm can offer a lot in terms of clock frequency. The Tesla P100 is an enterprise part that ends up in servers, yet it is already clocked at 1328 MHz with boost capabilities up to 1480 MHz. Even so, the TDP stays at 300W.
Senior Member
Posts: 17415
Joined: 2012-05-18
I've meant 2xfp32 operations with: 1 on fp32-core and 1 on fp64-core.
I think it's impossible to compute 2xfp32 on a single fp64 in one cycle.
Of course it will have limited bandwidth. But it's likely a gain with so many fp64 cores. Like having a titan-x-vanta on top of your gpu.
Dispatcher: if it can feed fp32-multiplication, then it can feed fp32-division and fp64-multiplication, both division could be even easier I suppose. Of course fp64 addition and fp32 addition at the same time could be bottlenecked by dispatcher.
You can't really mix fp32 and fp64, or use fp64 to convert into 2x fp32, or use fp64 along with fp32 at the same time. Each is a separate unit and does its own job.
This is not Intel/AMD cpu AVX/SSE that can do 256bit or combine 2x128bit or split 128bit into 2x64bit..
Senior Member
Posts: 3490
Joined: 2007-01-27
Really? Stock? No custom pcb, those go a long way man.
Senior Member
Posts: 7975
Joined: 2014-09-27
yeah I remember fast inverse sqrt !
wasn't quake 3! way older, or maybe it was... Wasn't quake 3 released in like 2006
that wasn't from id though, they stole the idea from someone else
I've read somewhere that fast inverse sqrt was first used before 1990 in a chemistry modeling software (algorithms look similar but the magic number changes).
That was even before 80286? Math operations were so slow, people were getting results from memory look-up-tables instead of calculating on cpu.
SFU cores in a gpu must be using a very fast look up table to calculate sqrt faster than that quake thing. I tested on my hd7870, hardware function is faster than quake version.
Here it is:
Fast inverse square root (sometimes referred to as Fast InvSqrt() or by the hexadecimal constant 0x5f3759df) is a method of calculating x^(−½), the reciprocal (or multiplicative inverse) of a square root for a 32-bit floating point number in IEEE 754 floating point format. The algorithm was probably developed at Silicon Graphics in the early 1990s, and an implementation appeared in 1999 in the Quake III Arena source code, but the method did not appear on public forums such as Usenet until 2002 or 2003. (There is a discussion on the Chinese developer forum CSDN back in 2000.) At the time, the primary advantage of the algorithm came from avoiding computationally expensive floating point operations in favor of integer operations. Inverse square roots are used to compute angles of incidence and reflection for lighting and shading in computer graphics.
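For anyone curious how the trick works, here is a rough Python sketch of the idea, not the Quake III source itself. It reinterprets the float's bits as an integer, applies the magic constant, and refines the guess with one Newton-Raphson step. Note that Python does the refinement in double precision, so results differ slightly from a pure 32-bit version, and it is only meant for positive inputs.

```python
import struct

MAGIC = 0x5F3759DF  # the constant quoted above

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the bit-level trick."""
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Halving the integer and subtracting from the magic constant yields
    # the bit pattern of a rough first guess for x^(-1/2).
    i = (MAGIC - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer bits as a 32-bit float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to tighten the estimate.
    return y * (1.5 - 0.5 * x * y * y)

for value in (0.25, 1.0, 4.0, 100.0):
    exact = value ** -0.5
    approx = fast_inv_sqrt(value)
    print(f"x={value}: approx={approx:.5f} exact={exact:.5f}")
```

After the single refinement step the estimate is typically within a fraction of a percent of the exact value, which was good enough for lighting math at the time.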
Senior Member
Posts: 236
Joined: 2007-10-08
All this power in a piece of silicon just to play bad console ports.......sigh!
Let's see how it improves the VR before jumping the gun.
Senior Member
Posts: 7114
Joined: 2004-10-01
Like the 980ti, once the 1080ti releases, I will grab the first one I can get, stock, and throw a waterblock on it. Done and done.