
Nvidia might be moving to Multi-Chip-Module GPU design

by Hilbert Hagedoorn on: 07/05/2017 08:02 AM | 38 comment(s)

With Moore's law becoming harder to sustain each year, technology is bound to change. At some point it will be impossible to shrink transistors any further, hence companies like Nvidia are already thinking about new methodologies and technologies to adapt to that. Meet the Multi-Chip-Module GPU design.

Nvidia published a paper that shows how they can connect multiple parts (GPU modules) with an interconnect. According to the research, this will allow for bigger GPUs with more processing power. Not only would it help tackle the common scaling problems, it would also be cheaper to achieve, as fabbing four dies that you then connect is less expensive than making one huge monolithic design.
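To see why several smaller dies can be cheaper, consider a back-of-envelope yield calculation. The sketch below uses a simple Poisson defect model with an assumed defect density and hypothetical die sizes; none of these figures come from Nvidia's paper.

```python
import math

DEFECT_DENSITY = 0.1  # assumed defects per cm^2 (illustrative)

def die_yield(area_cm2: float) -> float:
    # Poisson model: probability that a die of this area has zero defects.
    return math.exp(-DEFECT_DENSITY * area_cm2)

def wafer_area_per_good_gpu(die_area_cm2: float, dies_needed: int) -> float:
    # Bad dies are discarded individually, so each good die costs
    # area / yield of wafer silicon on average.
    return dies_needed * die_area_cm2 / die_yield(die_area_cm2)

mono = wafer_area_per_good_gpu(6.0, 1)  # one hypothetical 600 mm^2 die
mcm = wafer_area_per_good_gpu(1.5, 4)   # four hypothetical 150 mm^2 modules

print(f"monolithic: {mono:.2f} cm^2 of wafer per good GPU")  # ~10.93
print(f"4-module MCM: {mcm:.2f} cm^2 per good GPU")          # ~6.97
print(f"silicon saving: {1 - mcm / mono:.0%}")               # ~36%
```

Note that under this model the probability of getting four good small dies equals that of one good big die; the saving comes from being able to discard defective modules individually instead of scrapping a whole monolithic die.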

Thinking about it, AMD is doing exactly this with Threadripper and EPYC processors, where they basically connect two to four Summit Ridge (Zen) dies with a wide PCIe-based link, Infinity Fabric (they use 64 PCIe lanes per link, with 128 available).

Taking a GPU with four GPU modules as an example, the researchers recommend three architecture optimizations that minimize the loss from data communication between the different modules. The paper claims the performance loss compared to a monolithic single-die chip would be merely 10%.
 


Of course, when you think about it, SLI is in essence already a similar methodology (though not the same technology); however, as you guys know, it can be rather inefficient and challenging in terms of scaling and compatibility. The paper states this MCM design would perform 26.8% better than a comparable multi-GPU solution. If and when Nvidia will fab MCM-based chips is not known; for now this is just a paper on the topic. The fact that they published it indicates it is bound to happen at some point, though.
 

Sorry, I could not resist ... ;)












nevcairiel
Senior Member



Posts: 845
Joined: 2015-05-19

#5449632 Posted on: 07/05/2017 12:47 PM
Quote: "It's not SLI at all. It's NUMA."


NUMA just defines memory access patterns; it has nothing to do with how the actual work is spread over the processors.
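A minimal sketch of the distinction being drawn here, with made-up latency numbers: page placement (the NUMA part) determines access cost, independently of why a task happens to run on a given node.

```python
LOCAL_NS, REMOTE_NS = 80, 140  # assumed access latencies, illustrative only

def access_cost(task_node: int, page_node: int) -> int:
    # Cost depends only on whether the page is local to the node
    # running the task, not on how the task got scheduled there.
    return LOCAL_NS if task_node == page_node else REMOTE_NS

tasks = [0, 1, 0, 1]          # node running each task (fixed schedule)
first_touch = [0, 1, 0, 1]    # pages placed on the first-touching node
all_on_node0 = [0, 0, 0, 0]   # naive placement: everything on node 0

for name, pages in (("first-touch", first_touch), ("all-on-node-0", all_on_node0)):
    total = sum(access_cost(t, p) for t, p in zip(tasks, pages))
    print(name, total, "ns")  # first-touch: 320 ns, all-on-node-0: 440 ns
```

The same work distribution produces different totals purely from where the pages live, which is also why the paper (quoted further down) pairs its CTA scheduling with a first-touch page allocator.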

Venix
Senior Member



Posts: 2985
Joined: 2016-08-01

#5449635 Posted on: 07/05/2017 01:02 PM
Quote:
This isn't a replacement for SLI, and not even something for consumer GPUs for a long time (when the performance requirements dictate that a monolithic die is too expensive). That should be a ways off still, and considering that Nvidia's monolithic V100 sells for $13,000, don't expect these to be cheap. It may reduce the cost of the individual dies and make binning easier, but the addition of all the interconnects and SRAM for the L1.5 cache will still make these expensive.

It's a small NUMA setup for a GPU that uses L1.5 cache to get around some of the issues involved with making NUMA architectures.

This GPM (graphics processing module) approach is destined to be used in Nvidia's exascale architecture, and the Volta V100 successor chip will likely be such an MCM.

Intel discussed a similar idea a year or two ago regarding the Knights Hill architecture, which follows the 72-core Knights Landing HPC-focused x86 CPU.

https://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/

This is the next step in 2.5D architectures. Nvidia's approach discusses how to solve data locality issues and reduce the pJ/bit cost of moving data with their L1.5 cache. I need to read up on Infinity Fabric and HBCC to see if it has any similar provisions. If it doesn't now, it certainly will need them for large-scale systems with hundreds of thousands or millions of cores.

No one expects them to be cheap, although it's much, much easier to produce two 250 mm² chips than one massive 500 mm² chip, and if the 2x250 setup gives similar or even just close performance, it will be much cheaper than the massive chips. You can kind of see that with Intel's CPUs now: the 20- and 18-core parts are extremely expensive. Sure, Intel charges extra for the bragging rights, but those chips are also hard to produce; when you get a fully working 20-core Xeon, you pretty much have the best silicon they can produce, and the rest become the 10, 12, 14 ... 18-core chips. The same goes for GPUs, really.
All this, of course, assuming they make it so the system sees them as a single entity.
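On the pJ/bit cost of moving data mentioned in the quoted post: link power is simply energy-per-bit times bit rate. A back-of-envelope sketch with assumed, illustrative numbers (neither the bandwidth nor the energy figures come from Nvidia's paper):

```python
def link_power_watts(bandwidth_gbps: float, energy_pj_per_bit: float) -> float:
    # GB/s -> bits/s, then pJ/bit -> joules/bit
    bits_per_second = bandwidth_gbps * 1e9 * 8
    return bits_per_second * energy_pj_per_bit * 1e-12

BW = 768.0  # GB/s of inter-module traffic (assumed)
print(f"on-package link @ 0.5 pJ/bit: {link_power_watts(BW, 0.5):.1f} W")   # ~3.1 W
print(f"board-level link @ 10 pJ/bit: {link_power_watts(BW, 10.0):.1f} W")  # ~61.4 W
```

That kind of gap is why an on-package interconnect, with a cache in front of it to cut traffic, matters so much at these bandwidths.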

Evildead666
Senior Member



Posts: 1309
Joined: 2003-09-14

#5449636 Posted on: 07/05/2017 01:04 PM
OK, so nVidia published a paper outlining the theory and application behind the use of MCM in a GPU.
However, having an interconnect that can supply enough bandwidth without large latency hits is a different matter. AMD got very lucky with IF, but will Intel and nVidia be able to replicate those results without infringing AMD's patents related to IF? If they can't, their only option could be to license the technology from AMD, assuming AMD is game for giving up the ace up its sleeve.

I'm pretty sure Infinity Fabric uses PCIe lanes for communication; maybe it can use other transports as well.

Between the CPUs on an EPYC chip there are 64 PCIe lanes running between each CPU, if I read the slides correctly.
They can cut the latencies thanks to the short hops between the on-package CPUs, and the bandwidth should be plenty.

A GPU that has 2x16 PCIe lanes could use the second set for intra-GPU signalling. Ideally you'd want four sets, like the North/South/East/West links on those DEC Alpha chips. That way, each GPU die would be only one hop from any other, up to a certain number of dies.
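A tiny sketch of that topology argument: with L point-to-point link sets per die, a fully connected package keeps every die one hop from every other for up to L + 1 dies; past that, traffic needs multi-hop routing. The link counts are hypothetical, following the N/S/E/W idea above.

```python
def max_one_hop_dies(links_per_die: int) -> int:
    # A full mesh needs a dedicated link from each die to every other
    # die, so L links per die connect at most L + 1 dies at one hop.
    return links_per_die + 1

for links in (1, 2, 4):
    print(f"{links} link set(s) per die -> up to "
          f"{max_one_hop_dies(links)} dies at one hop")
# 1 -> 2 dies, 2 -> 3 dies, 4 (N/S/E/W) -> 5 dies
```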

drac
Senior Member



Posts: 1781
Joined: 2003-10-27

#5449638 Posted on: 07/05/2017 01:06 PM
Quote: "This isn't a replacement for SLI, and not even something for consumer GPUs for a long time [...]" (the same post quoted in full above)

It just looked like it could potentially be used to make SLI better; at least that's what I was hoping, lol. I didn't read in-depth about the architecture, it was just a generalised observation really.

Denial
Senior Member



Posts: 14040
Joined: 2004-05-16

#5449639 Posted on: 07/05/2017 01:08 PM
Quote: "NUMA just defines memory access patterns; it has nothing to do with how the actual work is spread over the processors."


They explain all this in the PDF in the article.

In such a multi-GPU system the challenges of load imbalance, data placement, workload distribution and interconnection bandwidth discussed in Sections 3 and 5, are amplified due to severe NUMA effects from the lower inter-GPU bandwidth. Distributed CTA scheduling together with the first-touch page allocation mechanism (described respectively in Sections 5.2 and 5.3) are also applied to the multi-GPU. We refer to this design as a baseline multi-GPU system. Although a full study of various multi-GPU design options was not performed, alternative options for CTA scheduling and page allocation were investigated. For instance, a fine grain CTA assignment across GPUs was explored but it performed very poorly due to the high interconnect latency across GPUs. Similarly, round-robin page allocation results in very [...]


Figure 17 summarizes the performance results for different buildable GPU organizations and unrealizable hypothetical designs, all normalized to the baseline multi-GPU configuration. The optimized multi-GPU which has GPU-side caches outperforms the baseline multi-GPU by an average of 25.1%. Our proposed MCM-GPU on the other hand, outperforms the baseline multi-GPU by an average of 51.9%, mainly due to higher quality on-package interconnect.
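Putting the quoted averages side by side (everything normalized to the baseline multi-GPU, as in the paper's Figure 17) shows how much of the win comes from the package itself:

```python
baseline = 1.000        # baseline multi-GPU (normalization point)
optimized_mgpu = 1.251  # +25.1% with GPU-side caches (quoted above)
mcm_gpu = 1.519         # +51.9% for the proposed MCM-GPU (quoted above)

# MCM-GPU's remaining edge over even the optimized multi-GPU:
print(f"MCM-GPU vs optimized multi-GPU: "
      f"{mcm_gpu / optimized_mgpu - 1:.1%}")  # ~21.4%
```

So even against a multi-GPU that already has GPU-side caches, the on-package interconnect is worth roughly a fifth more performance in this study.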





