Nvidia might be moving to Multi-Chip-Module GPU design

Thanks for the link, it's a great read. Looking forward to seeing what they achieve with their first commercial Multi-Chip-Module GPU.
Both companies need to mask the number of GPUs at the OS/driver level so the system only sees one, and let the GPU's onboard BIOS decide how the utilization gets dished out; otherwise we will be stuck waiting and hoping the developers figure it out. The same goes for CPUs. I really want to find the documents on this; it was discussed back in the mid-2000s how it's possible, but no one wants to do it. From what I can find it was done back in the earlier days, e.g. Voodoo and some other company I forget, where the OS and drivers only saw it as one GPU.
Hello, Voodoo 5?
Good stuff, no more masses of video cards crammed into cases, overheating all over the place. This could be so good, SLI on one card done properly, but I guess it hasn't happened due to current tech limitations.
Yay for a Voodoo 5 🙂 This is understandable, especially with fab processes arriving late and farther and farther apart. If you can't rely on shrinking the chips/transistors, you have to go MCM, or multi-chip-on-board à la Voodoo, for the high-end cards, even if it's only for the datacenter GPU cards at first (they can pay for the R&D with the prices those cards sell for), with consumers getting a trickle-down effect later on. I suspect it hasn't been done lately because they didn't plan to, officially. It also means dedicating on-die space to something that won't be used on single-chip boards, which would be wasted. I could see a medium-sized GPU with a 128-bit GDDR5/X bus, or a single HBM stack (1024 bits wide), and adding up to four of them together on a single card. The HBCC might be a good fit for this, since there would be one per chip, and one of them could become 'master' to the others and give them orders. NVLink is probably set up for this as well.
Good stuff, no more masses of video cards crammed into cases, overheating all over the place. This could be so good, SLI on one card done properly, but I guess it hasn't happened due to current tech limitations.
This isn't a replacement for SLI, and not even something for consumer GPUs for a long time (not until performance requirements dictate that a monolithic die is too expensive). That should be a ways off still, and considering that Nvidia's monolithic V100 sells for $13,000, don't expect these to be cheap. It may reduce the cost of the individual dies and make binning easier, but the addition of all the interconnects and SRAM for the L1.5 cache will still make these expensive. It's a small NUMA setup for a GPU that uses an L1.5 cache to get around some of the issues involved with making NUMA architectures. This GPM (graphics processing module) approach is destined to be used in Nvidia's exascale architecture, and the Volta V100 successor chip will likely be such an MCM. Intel discussed a similar idea a year or two ago regarding the Knights Hill architecture, which follows the 72-core Knights Landing HPC-focused x86 CPU: https://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/ This is the next step in 2.5D architectures. Nvidia's approach discusses how to solve data locality issues and reduce the pJ/bit cost of moving data with the L1.5 cache. I need to read up on Infinity Fabric and HBCC to see if they have any similar provisions. If they don't now, they certainly will need them for large-scale systems with hundreds of thousands or millions of cores.
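A back-of-the-envelope sketch of that pJ/bit point, assuming made-up per-bit energy costs (not figures from Nvidia's paper) and treating the L1.5 cache purely as a filter for remote traffic:

```python
# Rough data-movement energy model for an MCM GPU. The per-bit energy
# costs below are invented placeholders, not numbers from the MCM-GPU paper.
PJ_PER_BIT_LOCAL_DRAM = 5.0   # access to the GPM's own DRAM
PJ_PER_BIT_ONPKG_LINK = 0.5   # crossing the on-package inter-GPM link
PJ_PER_BIT_L15_HIT    = 0.2   # hit in the remote-data (L1.5) cache

def avg_pj_per_bit(remote_fraction, l15_hit_rate):
    """Average energy to deliver one bit to the SMs, given the fraction of
    traffic that targets another GPM's memory and how often the L1.5 catches it."""
    local = (1 - remote_fraction) * PJ_PER_BIT_LOCAL_DRAM
    remote_hit = remote_fraction * l15_hit_rate * PJ_PER_BIT_L15_HIT
    remote_miss = remote_fraction * (1 - l15_hit_rate) * (
        PJ_PER_BIT_ONPKG_LINK + PJ_PER_BIT_LOCAL_DRAM)
    return local + remote_hit + remote_miss

print(f"no L1.5 cache : {avg_pj_per_bit(0.3, 0.0):.2f} pJ/bit")  # 5.15
print(f"80% L1.5 hits : {avg_pj_per_bit(0.3, 0.8):.2f} pJ/bit")  # 3.88
```

Even with toy numbers, the remote-traffic term is what a remote-data cache attacks, which is why the L1.5 matters more than raw link bandwidth.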
This could be so good, SLI on one card done properly, but I guess it hasn't happened due to current tech limitations.
If it's just another form of SLI then it'll actually suck quite badly. SLI support in games has been terrible and will not get much better. If they want to pull this off, they need to find a way to utilize all the GPUs on all workloads; if anything just runs on one of the modules because it's not flexible enough, it's going to be an extremely disappointing experience.
If it's just another form of SLI then it'll actually suck quite badly. SLI support in games has been terrible and will not get much better. If they want to pull this off, they need to find a way to utilize all the GPUs on all workloads; if anything just runs on one of the modules because it's not flexible enough, it's going to be an extremely disappointing experience.
It's not SLI at all. It's NUMA.
OK, so Nvidia published a paper outlining the theory and application behind the use of MCM in a GPU. However, having an interconnect that can supply enough bandwidth without large latency hits is a different matter. AMD got very lucky with IF, but will Intel and Nvidia be able to replicate the results without infringing on AMD's patents related to IF? If they can't, their only option could be to license the technology from AMD, assuming AMD are game for giving up the ace up their sleeve.
Both companies need to mask the number of GPUs at the OS/driver level so the system only sees one, and let the GPU's onboard BIOS decide how the utilization gets dished out; otherwise we will be stuck waiting and hoping the developers figure it out. The same goes for CPUs. I really want to find the documents on this; it was discussed back in the mid-2000s how it's possible, but no one wants to do it. From what I can find it was done back in the earlier days, e.g. Voodoo and some other company I forget, where the OS and drivers only saw it as one GPU.
This isn't feasible now. Workloads are more complex, and that's why it's up to the developer to manage resource allocation, not the drivers. Mantle, DX12, and Vulkan all went in this direction, and not for naught. Rendering engines are becoming more and more involved, and you often have to work preemptively, such as loading textures before they're used when you're near the border of a new area. This is not something the driver can just guess, unless the game exposes the notion of an area and the driver is configured to load textures for nearby areas automatically - think of the infinite combination of situations the driver would then have to cater for. If each GPU die comes with a memory controller and its own VRAM, then choosing to preload textures into one or both pools at a specific time, when the bus is idle or has spare capacity, is vital. Nvidia works with developers, and AMD is starting to. We have huge game engines with tons of backend support, such as Unreal, Frostbite, CryEngine, etc., so if this pattern continues we're good to go.
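A minimal sketch of that per-pool preloading idea; every name here (MemoryPool, preload_async, preload_for_area) is hypothetical, invented for illustration rather than taken from any real engine or driver API:

```python
# Hypothetical sketch of engine-side texture preloading for a GPU whose dies
# each own a VRAM pool. All identifiers are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class MemoryPool:
    die_id: int
    capacity_mb: int
    used_mb: int = 0
    resident: set = field(default_factory=set)

    def preload_async(self, texture, size_mb):
        # In a real engine this would queue a DMA copy while the bus is idle.
        if self.used_mb + size_mb > self.capacity_mb:
            return False
        self.resident.add(texture)
        self.used_mb += size_mb
        return True

def preload_for_area(area_textures, pools, dies_rendering_area):
    """Place each texture only in the pools of the dies that will need it,
    instead of blindly mirroring everything into every pool."""
    for tex, size_mb in area_textures:
        for die in dies_rendering_area:
            pools[die].preload_async(tex, size_mb)

pools = {0: MemoryPool(0, 8192), 1: MemoryPool(1, 8192)}
# The engine knows the player is near the border of "area_7" and that
# dies 0 and 1 will both sample its terrain textures.
preload_for_area([("area_7_albedo", 512), ("area_7_normal", 256)], pools, [0, 1])
print(pools[0].used_mb, pools[1].used_mb)  # 768 768
```

The point is simply that only the engine knows which die will touch which data and when, which is the knowledge a driver-side "pretend it's one GPU" scheme would have to guess at.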
It's not SLI at all. It's NUMA.
NUMA just describes the memory access topology (local vs. remote); it has nothing to do with how the actual work is spread over the processors.
This isn't a replacement for SLI, and not even something for consumer GPUs for a long time (not until performance requirements dictate that a monolithic die is too expensive). That should be a ways off still, and considering that Nvidia's monolithic V100 sells for $13,000, don't expect these to be cheap. It may reduce the cost of the individual dies and make binning easier, but the addition of all the interconnects and SRAM for the L1.5 cache will still make these expensive. It's a small NUMA setup for a GPU that uses an L1.5 cache to get around some of the issues involved with making NUMA architectures. This GPM (graphics processing module) approach is destined to be used in Nvidia's exascale architecture, and the Volta V100 successor chip will likely be such an MCM. Intel discussed a similar idea a year or two ago regarding the Knights Hill architecture, which follows the 72-core Knights Landing HPC-focused x86 CPU: https://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/ This is the next step in 2.5D architectures. Nvidia's approach discusses how to solve data locality issues and reduce the pJ/bit cost of moving data with the L1.5 cache. I need to read up on Infinity Fabric and HBCC to see if they have any similar provisions. If they don't now, they certainly will need them for large-scale systems with hundreds of thousands or millions of cores.
No one expects them to be cheap, although it's much, much easier to produce two 250 mm² chips than one massive 500 mm² chip, and if the 2x250 gives similar or even just close performance it will be much cheaper than the massive chips... You can kind of see that with Intel's CPUs now: the 20- and 18-core parts are extremely expensive. Sure, Intel charges extra for the bragging rights, but those chips are also hard to produce; when you get a fully working 20-core Xeon you pretty much have the best of the best silicon they can make, and the rest become the 10-, 12-, 14-...18-core chips. The same goes for GPUs, really. All this, of course, assumes they make it so the system sees them as a single entity.
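A quick worked example of the yield argument, using the simple Poisson yield model Y = exp(-A·D0) with an assumed defect density (real foundry numbers are not public):

```python
import math

# Illustrative Poisson yield model: Y = exp(-A * D0).
# D0 (defects per cm^2) is an assumed value, not a real foundry figure.
D0 = 0.2

def yield_rate(area_mm2: float) -> float:
    """Fraction of dies with zero defects for a given die area."""
    return math.exp(-(area_mm2 / 100.0) * D0)  # convert mm^2 -> cm^2

big = yield_rate(500)    # one monolithic 500 mm^2 die
small = yield_rate(250)  # one 250 mm^2 chiplet

print(f"500 mm^2 monolithic yield: {big:.1%}")    # ~36.8%
print(f"250 mm^2 chiplet yield:    {small:.1%}")  # ~60.7%

# Good silicon per unit of wafer area consumed is proportional to yield, so
# two good 250 mm^2 chiplets need roughly 1.65x less wafer area than one
# good 500 mm^2 die (ignoring packaging and interconnect overhead).
print(f"wafer-area advantage: {small / big:.2f}x")
```

The advantage grows as dies get bigger and defect density gets worse, which is exactly the regime reticle-limit GPUs live in.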
OK, so Nvidia published a paper outlining the theory and application behind the use of MCM in a GPU. However, having an interconnect that can supply enough bandwidth without large latency hits is a different matter. AMD got very lucky with IF, but will Intel and Nvidia be able to replicate the results without infringing on AMD's patents related to IF? If they can't, their only option could be to license the technology from AMD, assuming AMD are game for giving up the ace up their sleeve.
I'm pretty sure Infinity Fabric can use PCIe lanes for communication; maybe it can use other transports as well. Between the CPU dies on an EPYC package there are 64 PCIe lanes' worth of links, if I read the slides correctly. They can cut the latencies thanks to the short hops between the on-package dies, and the bandwidth should be plenty. A GPU that has 2x16 PCIe lanes could use the second set for intra-GPU signalling. Ideally you'd want four sets, like the north/south/east/west links on those DEC Alpha chips. That way each GPU die would be only one hop from any other, up to a certain number of dies.
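A small sketch of the hop-count point, assuming a budget of four links per die: up to five dies can be fully connected (one hop worst case), after which you are into mesh territory and the worst-case hop count grows:

```python
from collections import deque
from itertools import product

def diameter(nodes, edges):
    """Worst-case hop count between any two nodes (BFS from every node)."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    worst = 0
    for src in nodes:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

# With 4 links per die, up to 5 dies can be fully connected: 1 hop worst case.
full = [(a, b) for a, b in product(range(5), range(5)) if a < b]
print("5 dies, fully connected:", diameter(range(5), full), "hop(s)")

# Beyond that you need a mesh; a 4x4 grid of dies already needs 6 hops
# corner to corner.
mesh_nodes = [(x, y) for x in range(4) for y in range(4)]
mesh_edges = [((x, y), (x + 1, y)) for x in range(3) for y in range(4)] + \
             [((x, y), (x, y + 1)) for x in range(4) for y in range(3)]
print("4x4 mesh:", diameter(mesh_nodes, mesh_edges), "hops")
```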
This isn't a replacement for SLI, and not even something for consumer GPUs for a long time (not until performance requirements dictate that a monolithic die is too expensive). That should be a ways off still, and considering that Nvidia's monolithic V100 sells for $13,000, don't expect these to be cheap. It may reduce the cost of the individual dies and make binning easier, but the addition of all the interconnects and SRAM for the L1.5 cache will still make these expensive. It's a small NUMA setup for a GPU that uses an L1.5 cache to get around some of the issues involved with making NUMA architectures. This GPM (graphics processing module) approach is destined to be used in Nvidia's exascale architecture, and the Volta V100 successor chip will likely be such an MCM. Intel discussed a similar idea a year or two ago regarding the Knights Hill architecture, which follows the 72-core Knights Landing HPC-focused x86 CPU: https://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/ This is the next step in 2.5D architectures. Nvidia's approach discusses how to solve data locality issues and reduce the pJ/bit cost of moving data with the L1.5 cache. I need to read up on Infinity Fabric and HBCC to see if they have any similar provisions. If they don't now, they certainly will need them for large-scale systems with hundreds of thousands or millions of cores.
It just looked like it could potentially be used to make it (SLI) better; at least I was hoping that, lol. I didn't read in depth about the architecture, it was just a generalised observation really.
NUMA just describes the memory access topology (local vs. remote); it has nothing to do with how the actual work is spread over the processors.
They explain all this in the PDF linked in the article.
In such a multi-GPU system the challenges of load imbalance, data placement, workload distribution and interconnection bandwidth discussed in Sections 3 and 5, are amplified due to severe NUMA effects from the lower inter-GPU bandwidth. Distributed CTA scheduling together with the first-touch page allocation mechanism (described respectively in Sections 5.2 and 5.3) are also applied to the multi-GPU. We refer to this design as a baseline multi-GPU system. Although a full study of various multi-GPU design options was not performed, alternative options for CTA scheduling and page allocation were investigated. For instance, a fine grain CTA assignment across GPUs was explored but it performed very poorly due to the high interconnect latency across GPUs. Similarly, round-robin page allocation results in very [...]
Figure 17 summarizes the performance results for different buildable GPU organizations and unrealizable hypothetical designs, all normalized to the baseline multi-GPU configuration. The optimized multi-GPU which has GPU-side caches outperforms the baseline multi-GPU by an average of 25.1%. Our proposed MCM-GPU on the other hand, outperforms the baseline multi-GPU by an average of 51.9% mainly due to higher quality on-package interconnect.
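For anyone unfamiliar with the first-touch policy mentioned in that excerpt, here is a toy sketch of the idea (a simplified illustration, not the paper's implementation):

```python
# Toy model of first-touch page placement across GPU modules (GPMs):
# a page is physically homed on whichever GPM touches it first.
PAGE_SIZE = 4096

class FirstTouchAllocator:
    def __init__(self, num_gpms):
        self.num_gpms = num_gpms
        self.page_home = {}  # virtual page number -> GPM that owns it

    def access(self, gpm_id, address):
        vpn = address // PAGE_SIZE
        # On the first touch, place the page in the local memory of the
        # requesting GPM; later accesses from other GPMs become remote.
        home = self.page_home.setdefault(vpn, gpm_id)
        return "local" if home == gpm_id else "remote"

alloc = FirstTouchAllocator(num_gpms=4)
print(alloc.access(0, 0x1000))  # local  (GPM 0 touches the page first)
print(alloc.access(1, 0x1000))  # remote (page already homed on GPM 0)
print(alloc.access(1, 0x2000))  # local  (new page, first touched by GPM 1)
```

Combined with scheduling that keeps related thread blocks on the same GPM, most pages end up local to the module that actually uses them, which is why it beats round-robin placement in the excerpt above.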
NUMA just describes the memory access topology (local vs. remote); it has nothing to do with how the actual work is spread over the processors.
I know, but the technology is about distributing a workload across an MCM GPU, which effectively turns the GPU into a small NUMA system. The new intra-package interconnect scales much better than SLI because it uses the L1.5 cache to avoid unnecessary communication between the L1 cache and "far" memory attached to a different GPM in the package. It's designed to make the GPU chiplets and their RAM communicate effectively within the MCM. NVLink SLI would be nice.
Pretty interesting... Couldn't resist commenting on the 3dfx Voodoo 5 5500 AGP picture, as I had that card years ago! I still remember my ATI X1800 PE dying on me and having to slap in that good ol' 3dfx Voodoo 5 5500 AGP just to have a display adapter. Damn thing ran Half-Life 2 at 800p with most settings at max... As long as the tech doesn't take a long time to come around, I'm on board.
The paper: http://research.nvidia.com/publication/2017-06_MCM-GPU%3A-Multi-Chip-Module-GPUs
Guru3D news item: http://www.guru3d.com/news-story/nvidia-might-be-moving-to-multi-chip-module-gpu-design.html
Nvidia's answer to AMD's Infinity Fabric. A good read... check the PDF.
Given AMD's size and limited resources, it's amazing they're in the lead on several fronts. I can only imagine if they had Intel's and Nvidia's resources. Good stuff all around. Amazing what stiff competition can do to light a fire under behemoths' behinds.
I assume Nvidia thinks AMD's Navi will be a big hit, since they are also moving in the same direction. To me Vega is pretty boring, but Navi using IF along with a die shrink looks pretty interesting.
It's a physics question, not just a business one.