NVIDIA announces RTX IO, GPU to Directly Access SSD

Fox2232:

I used to have storage statistics via MSI Afterburner. So I kind of know at what rate games loaded data and when. And games gained almost nothing from my NVMe drives.
I'm not surprised - storage is hardly a bottleneck in games nowadays. I'm still using SATA because I know most games barely load faster from NVMe. It's everything that comes after storage (decompression, transferring over PCIe, dropping into VRAM, etc.) that slows things down. As the article mentions, you could speed things up by removing some of that overhead. For games that don't have official support, my "prefetch" idea (I meant to say prefetch, not paging file) with already-decompressed data could make a measurable performance improvement. In theory, storage should be the bottleneck, but right now it isn't; eliminate the very long and complicated path that game data takes to reach its destination, and storage will likely become the bottleneck again. That isn't such a bad thing either - you want the slowest part in the system to be under 100% load. The fact that it isn't is the problem. DS can help alleviate that problem.
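To illustrate the point that the post-storage pipeline (CPU decompression in particular) dominates load time, here is a rough, self-contained Python sketch. zlib stands in for a game's asset compression and the data is synthetic; the timings are illustrative, not benchmarks:

```python
import time
import zlib

# Build a compressible "asset" in memory (stand-in for game data on disk).
asset = (b"texture block " * 4096) * 64   # a few MB of repetitive data
compressed = zlib.compress(asset, level=6)

# Time the CPU-side decompression step of the load path.
t0 = time.perf_counter()
restored = zlib.decompress(compressed)
cpu_decompress_s = time.perf_counter() - t0

# A "prefetched, already-decompressed" load is essentially just a memory copy.
t0 = time.perf_counter()
prefetched = bytes(asset)
memcpy_s = time.perf_counter() - t0

assert restored == asset  # decompression round-trips correctly
print(f"decompress: {cpu_decompress_s*1e3:.2f} ms, plain copy: {memcpy_s*1e3:.2f} ms")
```

On most machines the decompression step is several times slower than the plain copy, which is the overhead the "prefetch already-decompressed data" idea would skip.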
DirectStorage is coming to PC Sept 1, 2020
We’re excited to bring DirectStorage, an API in the DirectX family originally designed for the Velocity Architecture to Windows PCs! DirectStorage will bring best-in-class IO tech to both PC and console just as DirectX 12 Ultimate does with rendering tech. With a DirectStorage capable PC and a DirectStorage enabled game, you can look forward to vastly reduced load times and virtual worlds that are more expansive and detailed than ever.
https://devblogs.microsoft.com/directx/directstorage-is-coming-to-pc/
Cplifj:

Did nvidia just copy or license the AMD HBCC technology ?
It's not the same. GPUDirect completely bypasses system memory, allowing the GPU to pull from the drive directly. HBCC uses system memory as a VRAM cache and intelligently pulls from it. They're two completely different technologies. Also, DirectStorage is made by Microsoft, not AMD/Nvidia. As for everyone else talking about how they copied AMD: Nvidia announced GPUDirect as part of its Magnum IO API stack back in November last year.
So is this going to be the answer to the faster loading on the PS5/Xbox consoles? Is it all built into the drivers and Windows, or "extra" software that needs to be installed? And since it involves DX12, do I need a newer version of Windows (still on 1907 here)? Is this going to be a universal thing - will old games support it, or will a game have to be patched to support it? Seeing as it involves DX12, what about DX9/10/11 games? Games still use DX9 to this day, DX10 to a lesser degree, and DX11 more than the other two, as far as I can tell.
Cplifj:

Radeon Pro SSG did something similar, using an SSD of up to 1TB for storage via its own M.2 slot. That I'd call similar to this Nvidia tech, just slightly different since Nvidia uses the system SSD.
The SSG with the 1TB storage was kind of similar - it had a PCI-E switch on it and essentially bypassed the CPU to write data directly to the SSD - but it's not like Windows could see that drive or you could install games to it.
Undying:

PS5: we have instant loading! Nvidia: hold my beer...
Just to note that the Xbox Series X has similar PCIe 4.0 NVMe accelerated decompression. It's not just on the PS5.
Denial:

It's not the same. GPUDirect completely bypasses system memory, allowing the GPU to pull from the drive directly. HBCC uses system memory as a VRAM cache and intelligently pulls from it. They're two completely different technologies. Also, DirectStorage is made by Microsoft, not AMD/Nvidia. As for everyone else talking about how they copied AMD: Nvidia announced GPUDirect as part of its Magnum IO API stack back in November last year.
The HBCC can pull directly from any storage device according to some early slides - it can specifically use any storage as a cache (including things like network storage). The HBCC is aware of the different available memory pools and uses a tiered-storage-like solution presented as VRAM. The whitepaper doesn't detail using anything other than NVRAM or RAM, though, so maybe that was cancelled or subject to some erratum. [spoiler] https://www.custompcreview.com/wp-content/uploads/2017/01/amd-vega-ces-2017-press-deck_Page_36.jpg [/spoiler] It's not the same as RTX IO/DirectStorage, though maybe AMD can implement support for DirectStorage in the same or a similar way.
user1:

The HBCC can pull directly from any storage device according to some early slides - it can specifically use any storage as a cache (including things like network storage). The HBCC is aware of the different available memory pools and uses a tiered-storage-like solution presented as VRAM. The whitepaper doesn't detail using anything other than NVRAM or RAM, though, so maybe that was cancelled or subject to some erratum. It's not the same as RTX IO/DirectStorage, though maybe AMD can implement support for DirectStorage in the same or a similar way.
HBCC creates what AMD calls an HBC (High Bandwidth Cache), which resides in both VRAM and system RAM in a tiered hierarchy, with VRAM as the last-level cache. If the GPU requires an asset that's outside of this cache, the controller can request the CPU to fetch it and pull it into the HBC; then the GPU can utilize it. So while it can request data from any location, the data is moved into the HBC first, and it's all done by the CPU. It's really not that much different from how GPUs worked prior to HBCC, but HBCC creates the storage tiers and manages pages/swaps/etc. for the developer. https://www.reddit.com/r/Amd/comments/7x552w/exploring_vega_hbcc_and_its_effect_on_the_system/ This post does a good job investigating the effects of HBCC on the CPU. _ GPUDirect Storage, on the other hand, allows the DMA engine on the NVMe drive to push the requested data directly into the GPU's memory, bypassing system memory, the CPU, and the GPU's DMA engine entirely. I think this section from Nvidia explains it pretty well:
The PCI Express (PCIe) interface connects high-speed peripherals such as networking cards, RAID/NVMe storage, and GPUs to CPUs. PCIe Gen3, the system interface for Volta GPUs, delivers an aggregated maximum bandwidth of 16 GB/s. Once the protocol inefficiencies of headers and other overheads are factored out, the maximum achievable data rate is over 14 GB/s.

Direct memory access (DMA) uses a copy engine to asynchronously move large blocks of data over PCIe rather than loads and stores. It offloads computing elements, leaving them free for other work. There are DMA engines in GPUs and storage-related devices like NVMe drives and storage controllers but generally not in CPUs. In some cases, the DMA engine cannot be programmed for a given destination; for example, GPU DMA engines cannot target storage. Storage DMA engines cannot target GPU memory through the file system without GPUDirect Storage.

DMA engines, however, need to be programmed by a driver on the CPU. When the CPU programs the GPU's DMA, the commands from the CPU to the GPU can interfere with other commands to the GPU. If a DMA engine in an NVMe drive or elsewhere near storage can be used to move data instead of the GPU's DMA engine, then there's no interference in the path between the CPU and GPU. Our use of DMA engines on local NVMe drives vs. the GPU's DMA engines increased I/O bandwidth to 13.3 GB/s, which yielded around a 10% performance improvement relative to the CPU to GPU memory transfer rate of 12.0 GB/s shown in Table 1 below.
The technologies are similar in that they both work to provide data to the GPU, but the similarities kind of end there. HBCC creates a tiered VRAM/system RAM cache and simply requests data the traditional way, but intelligently manages that cache. GPUDirect Storage allows the data on - what I think is any device with a DMA engine - to be written directly to the GPU's memory.
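The HBCC tiered-cache behavior described above can be sketched as a toy model. This is purely a hypothetical illustration (not AMD's implementation): a small "VRAM" tier backed by a larger "system RAM" tier, where a miss means the "CPU" fetches the page from "storage" into the cache before the "GPU" can touch it:

```python
from collections import OrderedDict

class TieredCache:
    """Toy HBCC-style model: VRAM is the small fast tier, system RAM the
    larger second tier; anything else is fetched from 'storage' by the CPU."""
    def __init__(self, vram_pages, ram_pages):
        self.vram = OrderedDict()   # page -> data, kept in LRU order
        self.ram = OrderedDict()
        self.vram_pages = vram_pages
        self.ram_pages = ram_pages
        self.storage_fetches = 0

    def _evict(self, tier, limit, lower):
        while len(tier) > limit:
            page, data = tier.popitem(last=False)  # evict least recently used
            if lower is not None:
                lower[page] = data                 # demote to the next tier

    def read(self, page):
        if page in self.vram:                      # hit in the fast tier
            self.vram.move_to_end(page)
            return self.vram[page]
        if page in self.ram:                       # promote from second tier
            data = self.ram.pop(page)
        else:                                      # miss: "CPU" fetches from storage
            self.storage_fetches += 1
            data = f"data-{page}"
        self.vram[page] = data
        self._evict(self.vram, self.vram_pages, self.ram)
        self._evict(self.ram, self.ram_pages, None)
        return data

cache = TieredCache(vram_pages=2, ram_pages=4)
for p in [0, 1, 2, 0, 1]:
    cache.read(p)
print(cache.storage_fetches)  # prints 3: pages 0, 1, 2 each fetched once
```

The re-reads of pages 0 and 1 hit the RAM tier instead of storage, which is exactly the management HBCC does for the developer; GPUDirect Storage's point is to skip the bounce through these CPU-managed tiers entirely.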
Denial:

The SSG with the 1TB storage was kind of similar - it had a PCI-E switch on it and essentially bypassed the CPU to write data directly to the SSD - but it's not like Windows could see that drive or you could install games to it.
I'd like to see a solution where an SSD is installed on the graphics card and accessible by Windows...
Denial:

HBCC creates what AMD calls an HBC (High Bandwidth Cache), which resides in both VRAM and system RAM in a tiered hierarchy, with VRAM as the last-level cache. If the GPU requires an asset that's outside of this cache, the controller can request the CPU to fetch it and pull it into the HBC; then the GPU can utilize it. So while it can request data from any location, the data is moved into the HBC first, and it's all done by the CPU. It's really not that much different from how GPUs worked prior to HBCC, but HBCC creates the storage tiers and manages pages/swaps/etc. for the developer. https://www.reddit.com/r/Amd/comments/7x552w/exploring_vega_hbcc_and_its_effect_on_the_system/ This post does a good job investigating the effects of HBCC on the CPU. _ GPUDirect Storage, on the other hand, allows the DMA engine on the NVMe drive to push the requested data directly into the GPU's memory, bypassing system memory, the CPU, and the GPU's DMA engine entirely. I think this section from Nvidia explains it pretty well: The technologies are similar in that they both work to provide data to the GPU, but the similarities kind of end there. HBCC creates a tiered VRAM/system RAM cache and simply requests data the traditional way, but intelligently manages that cache. GPUDirect Storage allows the data on - what I think is any device with a DMA engine - to be written directly to the GPU's memory.
Thing is, accessing system memory in any way requires using the CPU, so it's not really useful to show that turning on HBCC uses more CPU energy/cycles - fundamentally there is no other way to access that memory. The fact that the SSG variant has its own SSD it can read from via PCIe, managed by the HBCC, and that the slides show network access, PCIe, XDMA, etc., strongly suggests that it doesn't have to talk to the CPU in order to use storage as a cache - kind of like how AMD used to use XDMA engines for CrossFire over the PCIe bus without CPU involvement. Also found this slide from the SSG press release: [spoiler] https://pics.computerbase.de/7/9/3/5/2/1-630.3959041560.png [/spoiler] So the question remains whether the inclusion of the CPU block in this diagram for accessing "storage" is due to a lack of API/OS support, or a hard limitation.
Don't forget that the GPU is physically connected to the CPU... the 16 lanes come from the CPU's I/O area (internal north bridge), and in the case of Zen 2, that's a dedicated die. Even if the GPU accesses the SSD -directly-, without involving the CPU cores, it will still happen through the CPU's I/O (but not through execution of CPU code).
This is a pretty big game changer regardless of who got there first. It may not be as sexy as ray tracing to demo but this kind of tech will be the unsung hero as textures get ever larger over the foreseeable future. And I agree that we will probably see it sooner than we expect. These kinds of low level features and enhancements can be added without necessarily altering core storage access APIs.
Excuse my ignorance on the subject, but can someone tell me how much of a difference it makes versus the other way of going through the CPU? Seconds, milliseconds, can/can't tell the difference while gaming? Will it give you an edge over someone online using the CPU method? Is this a game changer, pardon the pun, or who cares?
NewTRUMP Order:

Excuse my ignorance on the subject, but can someone tell me how much of a difference it makes versus the other way of going through the CPU? Seconds, milliseconds, can/can't tell the difference while gaming? Will it give you an edge over someone online using the CPU method? Is this a game changer, pardon the pun, or who cares?
I saw it mentioned that it's 10% faster getting the data directly to the GPU vs going through RAM and the CPU. Plus there will be benefits from using less CPU time and RAM bandwidth/space.
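For reference, that figure lines up with the numbers in the GPUDirect Storage excerpt quoted earlier in the thread: 13.3 GB/s using the NVMe drives' DMA engines versus 12.0 GB/s for the CPU-managed path. Worked out:

```python
# Numbers from Nvidia's GPUDirect Storage write-up quoted earlier in the thread.
pcie3_x16_raw = 16.0      # GB/s, aggregate PCIe Gen3 x16 bandwidth
pcie3_x16_usable = 14.0   # GB/s, roughly, after protocol overhead

nvme_dma_path = 13.3      # GB/s, NVMe DMA engines writing straight to GPU memory
cpu_bounce_path = 12.0    # GB/s, CPU-managed transfer through system memory

speedup = nvme_dma_path / cpu_bounce_path - 1.0
print(f"improvement: {speedup:.1%}")   # prints "improvement: 10.8%", i.e. "around 10%"

# Sanity check: the faster path is still under the practical PCIe ceiling.
assert nvme_dma_path < pcie3_x16_usable
```

So the ~10% is a bandwidth gain on the transfer itself; the reduced CPU load and freed RAM are on top of that.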
NewTRUMP Order:

Excuse my ignorance on the subject, but can someone tell me how much of a difference it makes versus the other way of going through the CPU? Seconds, milliseconds, can/can't tell the difference while gaming? Will it give you an edge over someone online using the CPU method? Is this a game changer, pardon the pun, or who cares?
Online games usually preload all the data for a given level (the loading screen with a progress bar for each player). That means no benefit at all unless everyone has the same loading capability (except the feeling that you were fastest). But there are games which take 5~8 seconds to load even from NVMe because the CPU is the limiting factor. Were there no CPU bottleneck, such a game would load within a second. Then there is compression ratio. Once the GPU takes care of the data, better compression can be used, which means that even where storage is the limiting factor, more data will be extracted per second. But the problem is again with people who have no access to this decompression. So it either has to be dynamic compression decided on a per-system basis, or the decompression can't exceed reasonable CPU requirements.
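The compression-ratio point above can be made concrete with some illustrative (entirely made-up) numbers: the effective asset bandwidth is the raw read speed multiplied by the compression ratio, capped by whatever throughput the decompressor can sustain:

```python
def effective_bandwidth(read_gbs, ratio, decompress_gbs):
    """Effective uncompressed-asset bandwidth: the drive delivers read_gbs of
    compressed data (worth read_gbs * ratio uncompressed), but the decompressor
    caps how much of that stream can actually be consumed."""
    return min(read_gbs * ratio, decompress_gbs)

# Illustrative numbers, not measurements:
nvme_read = 7.0          # GB/s raw NVMe read speed
ratio = 2.0              # 2:1 compression

cpu_decompress = 3.0     # GB/s a CPU might sustain
gpu_decompress = 20.0    # GB/s with GPU-accelerated decompression

print(effective_bandwidth(nvme_read, ratio, cpu_decompress))  # 3.0 - CPU is the bottleneck
print(effective_bandwidth(nvme_read, ratio, gpu_decompress))  # 14.0 - storage is the bottleneck again
```

With CPU decompression, raising the compression ratio buys nothing once the CPU saturates; with GPU decompression, a higher ratio directly multiplies effective storage bandwidth, which is the trade-off described above.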
So how many people in the world have NVMe disks in their rigs? 100%?
Mufflore:

I saw it mentioned that it's 10% faster getting the data directly to the GPU vs going through RAM and the CPU. Plus there will be benefits from using less CPU time and RAM bandwidth/space.
Well, it's more that the current method uses the CPU for decompression, which adds latency to getting the data onto the GPU.
GPUDirect Storage is not RTX IO. RTX IO is derived from it to a degree, but whereas GPUDirect Storage is a full-stack Nvidia implementation, RTX IO cuts out the front end and replaces it with Microsoft's DirectStorage API.
Astyanax:

GPUDirect Storage is not RTX IO. RTX IO is derived from it to a degree, but whereas GPUDirect Storage is a full-stack Nvidia implementation, RTX IO cuts out the front end and replaces it with Microsoft's DirectStorage API.
Human language please:D!
Caesar:

Human language please:D!
The machines using GPUDirect run Linux; Nvidia and their device partners for the DGX systems are handling all the work themselves.