AMD Radeon R9 Fury X review



The graphics engine (GCN 1.2) architecture

 

The Graphics Engine Architecture

As explained on the previous page, the architecture used is the latest iteration of the Graphics Core Next (GCN) architecture. Since the Radeon HD 7000 series, the basic building block of the architecture has changed significantly to remove certain inefficiencies seen in the older VLIW architecture. GCN is, in essence, the basis of a GPU that performs well at both graphics and compute tasks.

For the compute side of things, the GCN Compute Unit model was introduced; it is designed for better utilization, high throughput and multitasking. In other words: performance, performance, performance. Each CU, or building block, in Fiji is identical to what you have seen with Tonga.



Your basic shader cluster is called a GCN Compute Unit, or CU. These units are similar to the ones used in the past and feature:

  • Non-VLIW Design
  • 16-wide SIMD units
  • 64 KB of registers per SIMD unit

Four of these SIMD units form the basis of one Compute Unit (CU). Each SIMD unit is 16 lanes wide and there are four of them per Compute Unit, so 4 x 16 = 64. This means that each CU, or shader cluster, has 64 shader processors. So far you are with me, yeah?

  • AMD R9 290X / 390X have 2816 shader processors
  • AMD R9 Fury X has 4096 shader processors

So if one CU has 64 shader processors and the Radeon R9 Fury X has 64 Compute Units, that means 64 shader processors x 64 CUs = 4096 shader processors (for the R9 Fury X).
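To make the shader math above easy to check, here is a minimal Python sketch. It only uses the figures quoted in the text (16-wide SIMD units, four SIMD units per CU); the function and constant names are made up for illustration and do not come from AMD.

```python
# Figures quoted in the article; all names here are illustrative only.
SIMD_WIDTH = 16     # lanes per SIMD unit
SIMDS_PER_CU = 4    # SIMD units per Compute Unit


def shaders_per_cu():
    """Shader processors in one Compute Unit (4 x 16)."""
    return SIMD_WIDTH * SIMDS_PER_CU


def total_shaders(compute_units):
    """Shader processors for a GPU with the given number of CUs."""
    return compute_units * shaders_per_cu()


def compute_units_for(shader_count):
    """Work backwards from a shader count to a CU count."""
    cus, remainder = divmod(shader_count, shaders_per_cu())
    if remainder:
        raise ValueError("shader count is not a whole number of CUs")
    return cus


if __name__ == "__main__":
    print(shaders_per_cu())         # 64
    print(total_shaders(64))        # 4096 (R9 Fury X)
    print(compute_units_for(2816))  # 44 (R9 290X / 390X)
```

Run the other way, the same arithmetic shows that the 2816 shader processors of the R9 290X / 390X correspond to 44 Compute Units.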

  • Engine has Dual Geometry engines / Asynchronous Compute engines
  • 16 render backends / 64 color ROPs per clock cycle / 256 Z/Stencil ROPs per clock
  • Engine ties to 2048 KB R/W L2 cache (up to 16 64 KB L2 cache partitions)
  • Fiji GPU has up to 64 Compute Units
  • 4 Geometry processors (4 primitives per clock cycle)
  • 64 Pixel Output/clock

The Graphics Core Next Compute Unit (CU) has about the same floating point power per clock as the previous one (i.e. Cayman). It also has the same amount of register space (for the vector units). Each CU also has its own registers and local data share. Each CU has four 16-wide vector processors, again for a total of 4 x 16 = 64 operations per clock. GCN also has a scalar processor.
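As a rough illustration of what 64 operations per clock per CU adds up to, the sketch below estimates peak single-precision throughput. Two things here are assumptions that this page does not state: the 1050 MHz engine clock (the Fury X reference clock) and the usual convention of counting a fused multiply-add as two floating point operations per lane per clock. The function name is hypothetical.

```python
def peak_gflops(compute_units, lanes_per_cu, clock_mhz, flops_per_lane=2):
    """Theoretical peak FP32 throughput in GFLOPS.

    flops_per_lane=2 assumes one fused multiply-add per lane per clock,
    counted as two operations (an assumption, not taken from the article).
    """
    return compute_units * lanes_per_cu * flops_per_lane * clock_mhz / 1000


if __name__ == "__main__":
    # 64 CUs x 64 lanes, at an assumed 1050 MHz engine clock.
    print(peak_gflops(64, 64, 1050))  # roughly 8600 GFLOPS, i.e. about 8.6 TFLOPS
```

This is a theoretical ceiling only; real workloads will land below it.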
 



So the theoretical floating point power stays more or less the same per CU, but GCN is more efficient since it does not depend on instruction-level parallelism. GCN is all about creating a GPU that is good at both graphics and compute. At the end of the pipeline, we see eight memory controllers adding up to a 4096-bit wide bus (1024-bit per memory stack). Combined with HBM running at 1.0 Gbps per pin, this gives the Fury X 512 GB/sec of memory bandwidth.
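The bandwidth figure follows directly from the bus width and the per-pin data rate. Here is a small sketch of that calculation; the four-stack count is inferred from 4096 bits / 1024 bits per stack, and the function name is made up for this example.

```python
def memory_bandwidth_gb_s(stacks, bits_per_stack, gbps_per_pin):
    """Peak memory bandwidth in GB/sec from bus width and per-pin data rate."""
    bus_width_bits = stacks * bits_per_stack
    # Gbit/s across the whole bus, divided by 8 bits per byte.
    return bus_width_bits * gbps_per_pin / 8


if __name__ == "__main__":
    # Four HBM stacks, 1024-bit each, at 1.0 Gbps per pin.
    print(memory_bandwidth_gb_s(4, 1024, 1.0))  # 512.0
```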
