The PolyMorph engine - This unit was introduced to cope with the tremendous workload that the all-new function, tessellation, brings into play. With tessellation the number of triangles in a scene can increase by multiple orders of magnitude, so NVIDIA had to make a choice: either build one massive tessellation unit, or break it down into several smaller ones, one per shader cluster, and go for efficiency.
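To put a rough number on that "orders of magnitude" remark, here is a minimal back-of-the-envelope sketch. It assumes the simplest possible model, a uniform subdivision where every edge of a triangle is split into n segments, which yields n squared smaller triangles. The function name and the 10,000-triangle base mesh are made up for illustration; a real DirectX 11 tessellator uses per-edge factors and different partitioning schemes, so treat the results as ballpark figures only. The factor of 64 is the maximum tessellation factor in DirectX 11.

```python
def triangles_after_uniform_subdivision(base_triangles: int, segments_per_edge: int) -> int:
    """Estimate the triangle count after uniformly subdividing every triangle.

    Simplified model (an assumption, not how a real tessellator partitions):
    splitting each edge of a triangle into n segments gives n * n triangles.
    """
    if base_triangles < 0:
        raise ValueError("base_triangles must be non-negative")
    if segments_per_edge < 1:
        raise ValueError("segments_per_edge must be at least 1")
    return base_triangles * segments_per_edge ** 2


if __name__ == "__main__":
    base = 10_000  # hypothetical base mesh size
    for factor in (1, 8, 64):
        total = triangles_after_uniform_subdivision(base, factor)
        print(f"factor {factor:>2}: {total:,} triangles")
```

Under this model a factor of 64 multiplies the triangle count by 4,096, which is why a single fixed-function unit at the front of the pipeline would quickly become a bottleneck.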
If you look at the G80, at the top of the pipeline you had all these little units separated, right? Well, the PolyMorph engine gathers them into one cluster: the Vertex Fetcher, the Viewport Transform, attribute setup and stream output (writing processed geometry back out to memory), and now also the Tessellation unit, all merged together in this one little mini processor. And that is significant: each cluster of shader processors therefore has its own tessellation unit, hence the (what we can only assume) massive tessellation performance. So each of the sixteen PolyMorph engines contains a vertex fetcher and a tessellator, greatly expanding tessellation and (when the output is sent on to the raster engine) rasterization performance. This was an expensive unit to add in terms of transistors, we heard something like 10% of the design, but all that hard labor should bear fruit.
One thing I'd like to add is that a lot of improvement has been made on the ROP side of things, so AA performance will go up significantly. In fact, if I can sidetrack and relate directly to a game: take HAWX, for example. At 8xAA it will perform roughly 2.33 times faster than a GeForce GTX 285.
So the engine allows the workload to be parallelized, which makes for a nicely scalable design that in the end also ensures better usage of the caches. More on that later though. And we'll explain exactly what tessellation is in one of the next chapters, okay?
Let's talk about data caches for a minute. You guys might remember that GT200 all of a sudden had a shared Level 2 cache. It's the same for GF100, but now we also spot an L1 cache, and that is going to help out massively on the compute side of the shader processors.
GF100 Cache setup
L1 Texture Cache (per quad): faster texture filtering
L1 LD/ST Cache (dedicated): 16 or 48KB, efficient physics & raytracing
Total Shared memory: 16 or 48KB, more data reuse among threads
L2 Cache (shared): greater texture coverage, compute perf
So the GF100 has a dedicated L1 cache per shader cluster. Each SM has 64KB of on-chip memory, which can be configured either as 48KB of shared memory with 16KB of L1 cache, or as 16KB of shared memory with 48KB of L1 cache.
Next to that, the GF100 has a 768KB shared L2 cache that services load, store and texture requests. This cache sits in-between all shader clusters and can be accessed by all of them. This unified read/write cache helps ensure program correctness and is a key feature for supporting generic C/C++ programs.
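To make the two configurations a bit more tangible, here is a small sketch that simply does the arithmetic on the figures above: 64KB per SM, split 48/16 or 16/48, plus the 768KB L2. The constant and function names are hypothetical, and the count of sixteen SMs assumes the full chip with one SM per PolyMorph engine as described earlier. It is a bookkeeping model only, not a way to actually configure the hardware.

```python
SM_ONCHIP_KB = 64   # configurable on-chip memory per SM (from the text)
SM_COUNT = 16       # assumption: full GF100, one SM per PolyMorph engine
L2_KB = 768         # shared L2 cache (from the text)

# The two supported splits of the 64KB per-SM block; the keys are made-up labels.
CONFIGS = {
    "prefer_shared": {"shared_kb": 48, "l1_kb": 16},
    "prefer_l1": {"shared_kb": 16, "l1_kb": 48},
}


def split_is_valid(config: dict) -> bool:
    """A split is valid only if shared memory plus L1 uses exactly 64KB."""
    return config["shared_kb"] + config["l1_kb"] == SM_ONCHIP_KB


def chip_totals(name: str) -> dict:
    """Chip-wide totals in KB if every SM uses the same configuration."""
    config = CONFIGS[name]
    return {
        "shared_kb": config["shared_kb"] * SM_COUNT,
        "l1_kb": config["l1_kb"] * SM_COUNT,
        "l2_kb": L2_KB,
    }


if __name__ == "__main__":
    for name, config in CONFIGS.items():
        print(name, "valid:", split_is_valid(config), chip_totals(name))
```

The trade-off in a nutshell: code that shares a lot of data between threads would lean towards the 48KB shared memory split, while code with less predictable memory access would lean towards the larger L1.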
So yes, the caches certainly look a whole lot better, and that is going to be beneficial on many sides and segments.