JEDEC JESD239 GDDR7 Standard: Upcoming High-Speed Memory Technology

Kaarme:
Card manufacturers are only interested in higher bandwidth and capacity per chip, as that would allow them to cut down the bus width and the number of memory chips, which means lower costs. It would make the businessmen at AMD and Nvidia salivate to imagine four chips for 16 GB and still (barely) enough bandwidth (together with the cache) to keep the card within its intended performance target. No doubt this is exactly what we are going to see: an RTX 5070 with a 128-bit bus, 16 GB, and who knows what bandwidth.

What about latency?
Kaarme:

Card manufacturers are only interested in higher bandwidth and capacity per chip, as that would allow them to cut down the bus width and the number of memory chips, which means lower costs. It would make the businessmen at AMD and Nvidia salivate to imagine four chips for 16 GB and still (barely) enough bandwidth (together with the cache) to keep the card within its intended performance target. No doubt this is exactly what we are going to see: an RTX 5070 with a 128-bit bus, 16 GB, and who knows what bandwidth.

They clearly spell out what the bandwidth per device is, so you can determine the effective bandwidth of a 128-bit bus with 16 GB. Since each device uses a 32-bit interface, a 128-bit bus would require 4 devices: 4 × 192 GB/s = 768 GB/s, which is more bandwidth than the RTX 4080 Super has at the moment (736 GB/s). The question remains whether GDDR7 will be used to produce 4 GB devices or whether manufacturers will opt for larger ones.
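If anyone wants to play with those numbers, here's a quick back-of-the-envelope sketch. Only the 32-bit device interface and the 192 GB/s per device come from the discussion above; the 48 Gbps per-pin rate is simply what 192 GB/s implies over 32 bits, and the 128-bit bus is the hypothetical configuration being talked about, not an announced product.

```cuda
// Rough bandwidth arithmetic for the hypothetical 128-bit GDDR7 card
// discussed above. Only the 32-bit device interface and the 192 GB/s
// per-device figure come from the thread; everything else is derived.
#include <cstdio>

int main()
{
    const double per_device_gbs = 192.0;   // GB/s per GDDR7 device, as quoted above
    const int    device_width   = 32;      // bits per device interface
    const int    bus_width      = 128;     // hypothetical card-level bus

    const int    devices      = bus_width / device_width;              // 4 devices
    const double per_pin_gbps = per_device_gbs * 8.0 / device_width;   // 48 Gbps per pin
    const double total_gbs    = devices * per_device_gbs;              // 768 GB/s

    printf("%d devices x %.0f GB/s = %.0f GB/s (%.0f Gbps per pin)\n",
           devices, per_device_gbs, total_gbs, per_pin_gbps);
    return 0;
}
```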

I wonder what the thermal situation will be like? Perhaps a more aggressive cooling design will be required? Anything to make graphics cards even larger and heavier than they already are seems like exactly what we need. Or maybe it will hardly be any different from the highest-frequency GDDR6 in use currently.
Crazy Joe:

They clearly spell out what the bandwidth per device is, so you can determine the effective bandwidth of a 128-bit bus with 16 GB. Since each device uses a 32-bit interface, a 128-bit bus would require 4 devices: 4 × 192 GB/s = 768 GB/s, which is more bandwidth than the RTX 4080 Super has at the moment (736 GB/s).

It comes down to the bandwidth target. If the performance target can be reached with a lower speed, that reduces costs and also lowers energy consumption. Looking at how my RTX 4070 can't even honestly beat the RTX 3080, we can say Nvidia's performance targets are extremely modest, at least below the flagship level. So don't count on Nvidia going for the fastest possible speed. They will use speeds and chips that are barely enough if it helps them make more profit. The RTX 5090 will be a different thing, naturally. I'm only talking about Nvidia here because that's what I'm currently running, but it's not like I've forgotten that AMD was actually the one who introduced the use of an extra-large cache to make up for a traditionally narrow memory bus.

I wonder if they can make 4 GB chips with this new iteration. That would be a game changer. We went from 1 GB to 2 GB chips, but 4 GB ones would be a real advantage. 24 GB on a 3090 is a lot of chips that could go bad.
Kaarme:

It comes down to the bandwidth target. If the performance target can be reached with a lower speed, that reduces costs and also lowers energy consumption. Looking at how my RTX 4070 can't even honestly beat the RTX 3080, we can say Nvidia's performance targets are extremely modest, at least below the flagship level. So don't count on Nvidia going for the fastest possible speed. They will use speeds and chips that are barely enough if it helps them make more profit. The RTX 5090 will be a different thing, naturally. I'm only talking about Nvidia here because that's what I'm currently running, but it's not like I've forgotten that AMD was actually the one who introduced the use of an extra-large cache to make up for a traditionally narrow memory bus.

Caches don't make up for narrow memory buses; they make up for latency and bandwidth (depending on the memory type used to implement the cache). Performance depends not only on bandwidth (the speed at which data is copied into the GPU) but also on latency (the time it takes before the copy-in starts). Latency is a massive factor in performance because inefficient memory access patterns can cause huge drops in performance.

Typical global memory (as NVIDIA calls the RAM on the GPU) operates best when data is read in cache-line-sized chunks, but the time it takes before such a transfer starts depends on the CAS latency. Only then does the bandwidth determine how long the data takes to arrive. Since NVIDIA doesn't use an L3 cache (yet), the cache line size of the L2 cache is what applies.

For the various cache levels we again have a latency hierarchy. I don't know what the exact latencies are, but each level adds latency. For example, for Intel's Core 2 architecture the latencies were:

- Registers: immediate access, no delay
- L1 cache: 3 clock cycles
- L2 cache: 18 clock cycles
- RAM: 200 clock cycles

As far as I understand, L3 cache is typically at half the latency of RAM, so around 100 clock cycles, but that information might be outdated. The clock rate of the memory then defines how much time is actually spent waiting for that first access. Most CPU memory controllers have some form of predictive functionality that tries to hide the RAM-to-cache latency as much as possible (by prefetching consecutive data), but if you do truly random access at the byte level, nothing will help and you'll be hit with the full RAM latency for every byte you read.

Luckily, most GPU programs are designed to access memory as efficiently as possible (or they should be!) by doing coalesced memory accesses, meaning that threads read data in such a way that it can be streamed continuously from memory. But as you can see, if you violate coalesced memory access patterns, the latencies will hit hard.
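To make the coalescing point concrete, here's a minimal CUDA sketch (the kernel names, buffer size, and the stride of 32 are made up for illustration, not taken from any particular program). In the first kernel, consecutive threads read consecutive words, so a warp's loads merge into a few wide memory transactions; in the second, the reads are scattered, so the warp touches many more cache lines and pays that first-access latency far more often for the same amount of data.

```cuda
// Illustrative sketch only: contrasts coalesced vs. strided global memory reads.
#include <cstdio>
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so consecutive threads in a warp
// touch consecutive 4-byte words and the loads merge into wide transactions.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighbouring threads read elements `stride` slots apart, so each
// load in a warp tends to hit a different cache line and the warp waits on
// DRAM far more often for the same amount of useful data.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}

int main()
{
    const int n = 1 << 24;                        // 16M floats, ~64 MB
    float *in = nullptr, *out = nullptr;
    cudaMalloc((void**)&in,  n * sizeof(float));  // contents don't matter here;
    cudaMalloc((void**)&out, n * sizeof(float));  // only the traffic pattern does

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided  <<<grid, block>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Timing the two kernels (with cudaEvent or a profiler) on pretty much any GPU shows the effective-bandwidth gap between the two access patterns clearly.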