PolyMorph engine and Data Caches
The PolyMorph engine - This unit was introduced to cope with the tremendous workload that the all new function, tesselation, brings into play. With tessellation the number of triangles in a scene can increase by multiple orders of magnitude and therefore NVIDIA had to come up with a solution to either build one massive tessellation unit, or break it down in several ones per shader cluster and go for efficiency.
If you look at the G80, on top of the pipeline you had all these little units separated right, well the PolyMorph engine is an accumulated cluster with the Vertex Fetcher, Viewport transform, it handles attribute setup and stream output (massive numbers of the same object repeated in a scene), and also now the Tessellation unit can be found in it, all merged together in this one little mini processor. And that is significant, each cluster of shader processors therefore has a tessellate unit hence the (what we can only assume) massive tessellation performance. So now then, each of the sixteen PolyMorph engines contain a vertex fetcher and tesselator, greatly expanding tessellation and (when sent out to the raster engine) rasterization performance. This was an expensive unit in terms of transistors to insert, we heard something like 10% of the design, but should reap the fruit of some hard labor.
One thing I'd like to add is that a lot of improvement has been made in the ROP side of things, the AA performance will go up significantly, in fact if I can sidetrack and relate directly to a game; take HAWX for example at 8xAA it will perform roughly 2.33 times faster than a GeForce GT 285.
So the engine allows to parallelize the workload and have a nicely scalable design which in the end also ensures better usage of caches. More on that later though. And we'll explain in one of the next chapters what tessellation exactly is, okay?
Let's talk about data caches for a minute, you guys might remember that GT200 all of a sudden had a shared Level 2 cache. It's the same for GF100 but now we also spot a L1 cache as well and that is going to help out massively on the compute side of the Shader processors.
GF100 Cache setup
|L1 Texture Cache (per quad)||12 KB||12 KB||Faster Texture Filtering|
|L1 LD/ST Cache dedicated||-||16 or 48KB||Efficient physics & raytracing|
|Total Shared memory||16KB||16 or 48KB||More data reuse among threads|
|L2 Cache (shared)||256KB||768KB||Greater texture coverage, compute perf|
So the GF100 has a dedicated L1 cache per shader cluster, each SM has 64KB of on-chip memory which can be configured as 48KB of shared memory with 16KB of L1 cache, or 16KB of shared memory with 48KB of L1 cache.
Next to that the GF100 has a 768KB shared L2 cache allowing load, store and texture requests. This cache sits in-between all shader clusters and can be accessed by all of them. This unified read/write cache allows program correctness and is a key feature to support generic C/C++ programs.
So yes, the caches certainly look a whole lot better, and that is going to work out beneficial on many sides and segments.