Nvidia might be moving to Multi-Chip-Module GPU design
With Moore's law becoming more difficult to sustain each year, technology is bound to change. At some point it will become impossible to shrink transistors any further, hence companies like Nvidia are already thinking about new methodologies and technologies to adapt. Meet the Multi-Chip-Module (MCM) GPU design.
Nvidia published a paper that shows how they can connect multiple parts (GPU modules) with an interconnect. According to the research, this would allow for bigger GPUs with more processing power. Not only would it help tackle the common scaling problems, it would also be cheaper to achieve, as fabbing four dies that you then connect is cheaper than making one huge monolithic design.
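To get a feel for why four smaller dies can beat one big one on cost, here is a rough yield-model sketch in Python; the defect density and die areas below are illustrative assumptions, not figures from Nvidia's paper:

```python
import math

def die_yield(area_mm2, defects_per_mm2=0.001):
    """Poisson yield model: probability a die of the given area has no defects."""
    return math.exp(-defects_per_mm2 * area_mm2)

# Illustrative comparison: one 600 mm^2 monolithic die vs four 150 mm^2 modules.
monolithic = die_yield(600)   # ~0.55
module     = die_yield(150)   # ~0.86 per module

# Cost per *good* die scales inversely with yield (ignoring packaging costs),
# so the smaller dies buy noticeably more good silicon per wafer.
print(f"monolithic yield: {monolithic:.2f}")
print(f"module yield:     {module:.2f}")
print(f"monolithic cost per good mm^2 vs modules: {module / monolithic:.2f}x")
```

Under these assumed numbers the monolithic die pays roughly 1.6x more per good square millimetre, before packaging and interconnect costs claw some of that back.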
Thinking about it, AMD is doing exactly this with its Threadripper and EPYC processors, where they basically connect two to four Summit Ridge (Zen) dies with that wide link, Infinity Fabric (which uses 64 of the 128 available PCIe lanes per link).
Using a GPU with four GPU modules as an example, the researchers recommend three architecture optimizations that minimize the cost of data communication between the different modules. According to the paper, the loss in performance compared to a monolithic single-die chip would be merely 10%.
Of course, when you think about it, SLI is in essence already a similar methodology (not technology); however, as you guys know, it can be rather inefficient and challenging in terms of scaling and compatibility. The paper states this MCM design would perform 26.8% better than a comparable multi-GPU solution. If and when Nvidia is going to fab MCM-based chips is not known; for now this is just a paper on the topic. The fact that they published it indicates it is bound to happen at some point in time though.
Sorry, I could not resist ... ;)
Senior Member
Posts: 2985
Joined: 2016-08-01
This isn't a replacement for SLI, and not even something for consumer GPUs for a long time (not until performance requirements dictate that a monolithic die is too expensive). That should be a ways off still, and considering that Nvidia's monolithic V100 sells for $13,000, don't expect these to be cheap. It may reduce the cost of the individual dies and make binning easier, but the addition of all the interconnects and SRAM for the L1.5 cache will still make these expensive.
It's a small NUMA setup for a GPU that uses an L1.5 cache to get around some of the issues involved with making NUMA architectures.
This GPM (graphics processing module) approach is destined to be used in Nvidia's exascale architecture, and the Volta V100 successor chip will likely be such an MCM.
Intel discussed a similar idea a year or two ago regarding the Knights Hill architecture, which follows the 72-core Knights Landing HPC-focused x86 CPU.
https://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/
This is the next step in 2.5D architectures. Nvidia's approach discusses how to solve data locality issues and reduce the pJ/bit cost of moving data with their L1.5 cache. I need to read up on Infinity Fabric and HBCC to see if they have any similar provisions. If they don't now, they certainly will need them for large-scale systems with hundreds of thousands or millions of cores.
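The pJ/bit argument is easy to sketch as a back-of-the-envelope model. The energy costs and hit rate below are made-up assumptions purely for illustration; the point is that every remote access the module-side cache absorbs avoids the expensive inter-module hop:

```python
# Toy energy model for a GPM-side ("L1.5"-style) cache. Numbers are assumed.
LOCAL_PJ_PER_BIT  = 0.1   # on-module wire (assumption)
REMOTE_PJ_PER_BIT = 2.0   # on-package inter-module link (assumption)

def traffic_energy_joules(bits, remote_fraction, cache_hit_rate):
    """Energy to move a traffic stream, given the share of remote accesses
    and how much of that share the module-side cache absorbs."""
    local_bits  = bits * (1 - remote_fraction)
    cached_bits = bits * remote_fraction * cache_hit_rate       # served locally
    remote_bits = bits * remote_fraction * (1 - cache_hit_rate) # pays the hop
    pj = (local_bits + cached_bits) * LOCAL_PJ_PER_BIT + remote_bits * REMOTE_PJ_PER_BIT
    return pj * 1e-12

bits = 8e12  # 1 TB of traffic, in bits
print(f"no cache : {traffic_energy_joules(bits, 0.5, 0.0):.2f} J")  # 8.40 J
print(f"80% hits : {traffic_energy_joules(bits, 0.5, 0.8):.2f} J")  # 2.32 J
```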
No one expects them to be cheap, although it's much, much easier to produce two 250 mm² chips than one massive 500 mm² chip, and if the 2x250 gives similar or even just close performance, it will be much cheaper than the massive chip. You can kind of see that with the Intel CPUs now: the 20- and 18-core parts are extremely expensive. Sure, Intel charges extra for the bragging rights, but those are also hard to produce; when you get a fully working 20-core Xeon out of them, you pretty much get the best silicon they can produce, and the rest become the 10, 12, 14...18-core chips. The same goes for GPUs, really.
All this, of course, assuming they make it so the system sees them as a single entity.
Senior Member
Posts: 1309
Joined: 2003-09-14
OK, so nVidia published a paper outlining the theory and application behind the use of MCM in a GPU.
However, having an interconnect that can supply enough bandwidth without large latency hits is a different matter. AMD got very lucky with IF, but will Intel and nVidia be able to replicate those results without hitting on AMD's patents related to IF? If they can't, their only option could be to license the technology from AMD, assuming AMD is game to give up the ace up their sleeve.
I'm pretty sure Infinity Fabric uses PCIe lanes for communication; maybe it can use other transports as well.
Between the CPUs on an EPYC chip, there are 64 PCIe lanes going between each pair of CPUs, if I read the slides correctly.
They can cut the latencies thanks to the short hops between the on-chip CPUs, and the bandwidth should be plenty.
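For a rough sense of scale, here is what 64 lanes of PCIe 3.0-class signalling would deliver; Infinity Fabric runs AMD's own protocol over similar SerDes, so treat this as an order-of-magnitude estimate rather than AMD's spec:

```python
# Rough bandwidth estimate, assuming PCIe 3.0 signalling rates.
GT_PER_SEC = 8            # PCIe 3.0 transfers per lane, in GT/s
ENCODING   = 128 / 130    # 128b/130b line-coding efficiency
LANES      = 64

gb_per_sec = GT_PER_SEC * ENCODING * LANES / 8   # bits -> bytes
print(f"~{gb_per_sec:.0f} GB/s per direction")   # ~63 GB/s
```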
A GPU that has 2x16 PCIe lanes could use the second set for intra-GPU signalling. Ideally you'd want four sets, like the North/South/East/West links on those DEC Alpha chips. That way, each GPU die would be only one hop from any other, up to a certain number of dies.
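A toy sketch of that hop-count idea, comparing a fully connected four-die package against a ring; the topologies and link counts here are hypothetical, not from the paper:

```python
from collections import deque
from itertools import combinations

def max_hops(n_dies, links):
    """Worst-case die-to-die hop count, via breadth-first search."""
    worst = 0
    for src in range(n_dies):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for a, b in links:
                v = b if a == u else a if b == u else None
                if v is not None and v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

# Four dies with a direct link per pair (3 ports per die, 6 links total):
full_mesh = list(combinations(range(4), 2))
# Four dies in a ring (2 ports per die, 4 links total):
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]

print("full mesh:", max_hops(4, full_mesh), "hop(s)")  # 1
print("ring:     ", max_hops(4, ring), "hop(s)")       # 2
```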
Senior Member
Posts: 1781
Joined: 2003-10-27
It just looked like it could potentially be used to make SLI better; at least, I was hoping that, lol. I didn't read in-depth about the architecture; it was just a generalised observation really.
Senior Member
Posts: 14040
Joined: 2004-05-16
They explain all this in the PDF in the article.
[...] data placement, workload distribution and interconnection bandwidth, discussed in Sections 3 and 5, are amplified due to severe NUMA effects from the lower inter-GPU bandwidth. Distributed CTA scheduling together with the first-touch page allocation mechanism (described respectively in Sections 5.2 and 5.3) are also applied to the multi-GPU. We refer to this design as a baseline multi-GPU system. Although a full study of various multi-GPU design options was not performed, alternative options for CTA scheduling and page allocation were investigated. For instance, a fine-grain CTA assignment across GPUs was explored but it performed very poorly due to the high interconnect latency across GPUs. Similarly, round-robin page allocation results in very [...]

[...] GPU organizations and unrealizable hypothetical designs, all normalized to the baseline multi-GPU configuration. The optimized multi-GPU which has GPU-side caches outperforms the baseline multi-GPU by an average of 25.1%. Our proposed MCM-GPU on the other hand, outperforms the baseline multi-GPU by an average of 51.9% mainly due to higher quality on-package interconnect.
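The "first-touch page allocation" the excerpt mentions is simple to sketch: a page is homed in the memory of whichever module touches it first, so subsequent accesses from that module stay local. This is a simplified illustration of the policy, not Nvidia's implementation:

```python
# Simplified first-touch page allocation across GPU modules (GPMs).
PAGE_SIZE = 4096

class FirstTouchAllocator:
    def __init__(self, n_modules):
        self.n_modules = n_modules
        self.page_home = {}  # page number -> module that first touched it

    def access(self, module_id, address):
        """Resolve the home module for the page backing `address`;
        return True if the access is local to the requesting module."""
        page = address // PAGE_SIZE
        # First touch wins: the page lands in the toucher's local memory.
        home = self.page_home.setdefault(page, module_id)
        return home == module_id

alloc = FirstTouchAllocator(n_modules=4)
# Module 0 writes a buffer, then module 1 reads it back:
assert alloc.access(0, 0x1000)      # first touch -> local to module 0
assert not alloc.access(1, 0x1000)  # module 1 now pays a remote access
```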
Senior Member
Posts: 845
Joined: 2015-05-19
NUMA just defines memory access patterns; it has nothing to do with how the actual work is spread over the processors.
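A quick illustration of that distinction, with entirely made-up names and policies: the scheduler decides where work runs, while the NUMA placement policy decides where pages live, and the two are set independently:

```python
# Toy illustration: work distribution and NUMA memory placement are
# independent knobs. Assume CPUs 0-1 sit on node 0 and CPUs 2-3 on node 1.
n_cpus, n_nodes = 4, 2
tasks = range(8)

run_on  = {t: t % n_cpus  for t in tasks}   # work distribution: round-robin
page_on = {t: t % n_nodes for t in tasks}   # memory placement: interleaved

for t in tasks:
    node_of_cpu = run_on[t] // (n_cpus // n_nodes)
    print(f"task {t}: cpu {run_on[t]} (node {node_of_cpu}), "
          f"data on node {page_on[t]}, local={node_of_cpu == page_on[t]}")
```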