A light has been cast - AMD files patent combining hardware and software for raytracing

Kaarme:

Not to rain on your one-man parade, but by your logic both AMD and Nvidia are always on the back foot (who isn't, then? Intel? Bwahahahaha!). Surely you don't yet need reminding about things like async compute, which Nvidia lacked but quickly tried to implement in their next gen when it became something of a thing. Neither AMD nor Nvidia have always been able to predict everything, which is natural. Dedicated RT cores were a big sacrifice from Nvidia, which they could afford because AMD hasn't been able to offer much competition in the GPU world. Then there are the blur cores as well. If the competition had been fierce, I doubt Nvidia would have been as willing to risk it, unless they knew for a fact AMD was planning it as well.
I do think they caught out AMD this time - but importantly there was time for AMD to turn it around, particularly for the consoles. It's key for ray tracing adoption that the next gen of consoles support it. If it hadn't gone into the consoles then you can bet there'd be a lot less enthusiasm from game developers for using it. AMD also have time to put it into their cards - there aren't many ray-tracing games yet, so as long as they release something with decent DXR next year they'll do ok. Where I do think they were foolish was positioning the 5700s as high as they did - it's pointless competing in the $400 bracket with a card that doesn't have ray tracing now that the 2060S has arrived. If they'd just decided to compete with the 1660/Ti they'd have been fine, but you can't expect people to spend that much money on a card that's lacking what looks like it will be 2020's key feature.
Filed Dec 22, 2017!
Fox2232:

@Astyanax: What you missed is that while nVidia beefed up their SMs a lot to do RT
No they didn't. That's your misunderstanding of the core structure. The Tensor cores are not necessary for denoising but do accelerate the task, and the RT cores are part of the SM blocks and are tiny.
Astyanax:

No they didn't. That's your misunderstanding of the core structure. The Tensor cores are not necessary for denoising but do accelerate the task, and the RT cores are part of the SM blocks and are tiny.
Do you expect SMs to be smaller (in transistor count), then? Will nVidia drop the Tensors? Do you think nVidia can deliver double Turing's RT performance at the same price point as they move to 7nm?
You aren't asking the right questions.
jbscotchman:

I knew there was no way Nvidia could keep this proprietary for long. They were crazy for thinking it would only be possible with dedicated special hardware (aka RT cores).
Wait a minute, are you really arguing that NVIDIA failed to keep it proprietary (which they never actually tried to do) while commenting on a news post about AMD filing a patent on it? o_O That would be pretty hilarious 🙄
schmidtbag:

That was pretty fast getting that patent filed. What I think is especially interesting is how, as far as I can tell, this should allow you to use a secondary GPU as a discrete ray-tracer. I wouldn't mind using the M.2 slot for some cheap low-end GPU for this. I'm not going to use that slot for an NVMe drive, not for a while anyway.
I would drop this idea entirely; it's a bad design choice. You need the RT and shading blocks closely tied together to be efficient.
Astyanax:

You aren't asking the right questions.
And you are only able to make "No", "Wrong", "Does Not" statements. But it is so rare to see you form a post that can be called an argument that your posts are basically worthless, and I only reply to see another post devoid of substance. And I too often remind you of that in the hope that you would actually show something of value. I believe in you. I know that you can do it. Edit: Just to clarify this quote: I am not asking wrong questions. There are no wrong questions. I did ask questions you did not like.
holler:

Filed Dec 22, 2017!
Do you know how many years it takes to go from an architecture design to a hardware implementation in production?
Fox2232:

And you are only able to make "No", "Wrong", "Does Not" statements. But it is so rare to see you form a post that can be called an argument that your posts are basically worthless, and I only reply to see another post devoid of substance. And I too often remind you of that in the hope that you would actually show something of value. I believe in you. I know that you can do it. Edit: Just to clarify this quote: I am not asking wrong questions. There are no wrong questions. I did ask questions you did not like.
There are wrong questions; you don't seem to be able to ask the right ones because of a lack of technical foresight and no understanding of silicon processor design. Your only argument is basically "But AMD did......" This patent is just AMD catching up to what nVidia is already doing with RT; the fixed-function cores will be not unlike nVidia's, as all they are doing is BVH traversal, nothing more. All RT implementations are fixed-function hybrids with software interaction. The only possible thing AMD could do differently is CHEAT the API again, like they did with tessellation on Terascale, Polaris and Vega, because their front end was just not great.
Astyanax:

There are wrong questions; you don't seem to be able to ask the right ones because of a lack of technical foresight and no understanding of silicon processor design. Your only argument is basically "But AMD did......" This patent is just AMD catching up to what nVidia is already doing with RT; the fixed-function cores will be not unlike nVidia's, as all they are doing is BVH traversal, nothing more. All RT implementations are fixed-function hybrids with software interaction. The only possible thing AMD could do differently is CHEAT the API again, like they did with tessellation on Terascale, Polaris and Vega, because their front end was just not great.
Wrong.
Alessio1989:

Do you know how many years it takes to go from an architecture design to a hardware implementation in production?
Of course. I am alluding to the fact that the headline makes it seem like they just filed it, but in reality it has already been published. Ray tracing has been in AMD's pipeline for quite some time...
Fox2232:

They share some things; that's why I call them Dual-CUs. It is not really fixed-function raytracing HW. It is a slight addition to the TMU which enables it to do BVH. @Astyanax: What you missed is that while nVidia beefed up their SMs a lot to do RT, AMD needs just a small increase in transistor count per CU if they use the same arrangement as in the RX 5700 XT. But in reality AMD can use different CU arrangements which will make the CU smaller at the cost of shading, but not of TMU (raytracing) power... Getting more CUs at the same transistor count => higher raytracing performance. (The current GCN-compatible arrangement is the worst-case scenario for RDNA and new features, but understandable, since the RX 5700 does not look like it has the capability described in the patent.) I do expect certain trade-offs, but AMD's GPUs had too much compute for their actual gaming performance until RDNA anyway. And in 5 days it will be seen that shader count is not as important as the ability to use them properly. = = = = And as for the patent: it describes "software" solutions as an introduction to the problem, and says they are ineffective and generally bad even if done at "HW-level" via shaders alone. This is not CPU related. That "software" is a control method done inside the CU. (And it means that there is logic there, in contrast to fully fixed-function HW.)
It becomes quite a complex topic once you start to scratch the surface, but ray/triangle intersection testing is hard for a GPU due to the resulting memory access behaviour; a BVH helps, but doesn't solve the problem entirely. It's not just a matter of raw compute power: building and updating the BVH structure is quite efficient on a GPU, but traversing the BVH down to a triangle is an entirely different story, and that's the hard part being accelerated by NVIDIA with their RT cores. Maxwell and Pascal were already faster than Vega in the ray tracing pass (note that, as with rasterization, you still have to do shading once the triangle has been tested visible), Volta improved greatly on them, but Turing is an entirely different beast, and it's actually still smaller than Vega. At what is almost the same process node, the closest Turing iteration to Vega in terms of transistor count is TU106 (which is slightly smaller); yet, despite all the added dedicated hardware units for ray tracing and AI, it's faster at normal rasterization, which doesn't even make use of them. Actually, simplifying the GCN CU would make it faster at shading but slower at tracing rays; it isn't a simple operation to be repeated many times like shading is. Ray tracing is a rather complex problem to deal with.
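To make the traversal point concrete, here is a minimal sketch, assuming nothing about the patent or either vendor's actual hardware, of iterative BVH traversal in Python; the names AABB, Node, trace and the intersect_triangle callback are made up for illustration. Every pop of the stack is a data-dependent memory fetch, and neighbouring rays can walk completely different paths through the tree, which is the divergence being described above:

    import math

    class AABB:
        # Axis-aligned bounding box given by its min/max corners.
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi

        def hit(self, origin, inv_dir):
            # Slab test: clip the ray against the box on each axis.
            tmin, tmax = 0.0, math.inf
            for o, inv, lo, hi in zip(origin, inv_dir, self.lo, self.hi):
                t0, t1 = (lo - o) * inv, (hi - o) * inv
                tmin, tmax = max(tmin, min(t0, t1)), min(tmax, max(t0, t1))
            return tmin <= tmax

    class Node:
        # BVH node: either two children or a list of triangles (leaf).
        def __init__(self, box, left=None, right=None, triangles=None):
            self.box, self.left, self.right, self.triangles = box, left, right, triangles

    def trace(ray_origin, ray_dir, root, intersect_triangle):
        # Iterative BVH traversal with an explicit per-ray stack; roughly the
        # kind of loop that dedicated traversal hardware runs in fixed function.
        inv_dir = tuple(1.0 / d for d in ray_dir)   # assumes no zero components, for brevity
        stack, closest = [root], None
        while stack:
            node = stack.pop()                       # data-dependent memory fetch
            if not node.box.hit(ray_origin, inv_dir):
                continue
            if node.triangles is not None:           # leaf: test its triangles
                for tri in node.triangles:
                    t = intersect_triangle(ray_origin, ray_dir, tri)
                    if t is not None and (closest is None or t < closest):
                        closest = t
            else:                                    # inner node: keep descending
                stack.extend((node.left, node.right))
        return closest                               # distance to the nearest hit, or None

Building such a tree on the GPU is a fairly regular, parallel job; it is this wandering, per-ray loop that maps poorly onto wide SIMD shader hardware.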
Stefem:

It becomes quite a complex topic once you start to scratch the surface, but ray/triangle intersection testing is hard for a GPU due to the resulting memory access behaviour; a BVH helps, but doesn't solve the problem entirely. It's not just a matter of raw compute power: building and updating the BVH structure is quite efficient on a GPU, but traversing the BVH down to a triangle is an entirely different story, and that's the hard part being accelerated by NVIDIA with their RT cores. Maxwell and Pascal were already faster than Vega in the ray tracing pass (note that, as with rasterization, you still have to do shading once the triangle has been tested visible), Volta improved greatly on them, but Turing is an entirely different beast, and it's actually still smaller than Vega. At what is almost the same process node, the closest Turing iteration to Vega in terms of transistor count is TU106 (which is slightly smaller); yet, despite all the added dedicated hardware units for ray tracing and AI, it's faster at normal rasterization, which doesn't even make use of them. Actually, simplifying the GCN CU would make it faster at shading but slower at tracing rays; it isn't a simple operation to be repeated many times like shading is. Ray tracing is a rather complex problem to deal with.
A triangle has a plane, a ray is a vector, and knowing where the vector hits the plane and whether that point is inside the given triangle is simple and fast. The issue is the extreme number of rays relative to the triangle count, because it is a many-to-many relation. Making the CU smaller != simplifying it. Read the other patent on different CU configurations. And considering what part of the CU is responsible for RT in AMD's patent => more CUs = better RT. And as for rasterization and other things... AMD's approach has it all in one place; nVidia's does not. But how good that is remains to be seen with RDNA2.
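For what it's worth, the per-test arithmetic really is small. Here is a hedged sketch of the standard Moller-Trumbore ray/triangle test (plain Python; the function name intersect_triangle is illustrative only and is not taken from the patent, but it could serve as the callback in the traversal sketch earlier in the thread):

    def intersect_triangle(origin, direction, tri, eps=1e-8):
        # Moller-Trumbore ray/triangle intersection: returns the distance t
        # along the ray to the hit point, or None if the ray misses.
        def sub(a, b):   return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
        def cross(a, b): return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
        def dot(a, b):   return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

        v0, v1, v2 = tri
        e1, e2 = sub(v1, v0), sub(v2, v0)
        p = cross(direction, e2)
        det = dot(e1, p)
        if abs(det) < eps:                  # ray parallel to the triangle's plane
            return None
        inv_det = 1.0 / det
        s = sub(origin, v0)
        u = dot(s, p) * inv_det
        if u < 0.0 or u > 1.0:              # barycentric check: outside the triangle
            return None
        q = cross(s, e1)
        v = dot(direction, q) * inv_det
        if v < 0.0 or u + v > 1.0:
            return None
        t = dot(e2, q) * inv_det
        return t if t > eps else None       # a hit behind the origin doesn't count

For example, intersect_triangle((0.25, 0.25, -1.0), (0.0, 0.0, 1.0), ((0, 0, 0), (1, 0, 0), (0, 1, 0))) returns 1.0. A handful of dot and cross products per test; the expense is in how many such tests each of the millions of rays triggers, and in which order their data arrives from memory.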
Fox2232:

Making the CU smaller != simplifying it. Read the other patent on different CU configurations. And considering what part of the CU is responsible for RT in AMD's patent => more CUs = better RT. And as for rasterization and other things... AMD's approach has it all in one place; nVidia's does not. But how good that is remains to be seen with RDNA2.
I don't really see how this is much different from Nvidia's approach, other than what fetches the BVH tree from memory. With AMD, the texture units are going to do it; with Nvidia, the RT core is going to do it. The intersection calculation itself is done by specialized hardware on both architectures, and on both architectures it scales with the number of cores, as every SM has an RT core and SMs and CUs are essentially the same from a high level.
Denial:

I don't really see how this is much different from Nvidia's approach, other than what fetches the BVH tree from memory. With AMD, the texture units are going to do it; with Nvidia, the RT core is going to do it. The intersection calculation itself is done by specialized hardware on both architectures, and on both architectures it scales with the number of cores, as every SM has an RT core and SMs and CUs are essentially the same from a high level.
The difference is in immediate transistor count and efficiency. AMD needs a minimal change to the texture units, but will be limited by their count. This can be compensated for by differently set up (smaller) CUs and therefore getting more of them. And from an efficiency point of view, that small change is key here: they are already doing an operation that's pretty close to the required one. It looks like they'll chain multiple stages, which will save memory accesses = time & energy. On the practical side of things, the inputs are the same and the results are the same, as you wrote. The winning side is whoever manages to fit more RT capability (within reasonable limits) at a lower transistor cost. It remains to be seen whether nVidia's scalability or AMD's efficiency is better once limited by transistor count. AMD's approach is solid, but we are yet to see how it performs in the real world. (As it is shared HW to a certain degree, and the immediate use of results may or may not compensate for this.) I see it this way: nVidia has 13.6B working transistors in the 2080 Super (as that's the full GPU). A next generation with the same transistor count may want higher RT capability. Adding RT cores means either removing something or optimizing something else to the point where it makes space for the new RT cores. But they may even be able to double them to 96. On the other hand, at the same transistor budget AMD could make 56 CUs in the same CU organization as the 5700 XT, enabled to do RT. Or they can reconfigure to make smaller CUs and get maybe 64 CUs => 256 potential RT units. But this variability is limited by a certain minimum and comes with a trade-off where some other function may be lacking.
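As a back-of-envelope restatement of the napkin math above (a sketch only: every figure is the poster's speculation rather than a confirmed spec, and the 4 TMUs per RDNA CU is an assumption implied by the 64 CUs => 256 units figure):

    # Hypothetical next-gen unit counts at a roughly Turing-sized transistor budget.
    SMS_2080_SUPER = 48               # RTX 2080 Super: one RT core per SM on Turing
    TMUS_PER_CU    = 4                # RDNA texture units per CU (assumed)

    nvidia_next_gen_rt = SMS_2080_SUPER * 2   # "maybe double them to 96"
    amd_56cu_option    = 56 * TMUS_PER_CU     # same CU layout as the 5700 XT -> 224 TMU-based testers
    amd_64cu_option    = 64 * TMUS_PER_CU     # smaller CUs -> the 256 "potential RT units" above

    print(nvidia_next_gen_rt, amd_56cu_option, amd_64cu_option)   # 96 224 256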
Fox2232:

A triangle has a plane, a ray is a vector, and knowing where the vector hits the plane and whether that point is inside the given triangle is simple and fast. The issue is the extreme number of rays relative to the triangle count, because it is a many-to-many relation. Making the CU smaller != simplifying it. Read the other patent on different CU configurations. And considering what part of the CU is responsible for RT in AMD's patent => more CUs = better RT. And as for rasterization and other things... AMD's approach has it all in one place; nVidia's does not. But how good that is remains to be seen with RDNA2.
You are oversimplifying a lot. If you were right, all the render farms and render software would have been GPU-based for decades, and they're not; the CPU was (and still is) the predominant processor, at least until RTX. GPUs excel at highly coherent work, but ray tracing is very incoherent; ray divergence can dramatically reduce efficiency (the theoretical-to-real performance ratio) on a GPU (it's even described in the AMD patent, if you care to read it). You seem to make a lot of arbitrary assumptions; you don't know the real transistor cost of the RT cores. NVIDIA unified the cache (shared memory, L1 and texture cache), so the main supposed advantage of the AMD implementation (saving on buffers) isn't there. Take Navi, which is very close in transistor count to TU106: even with a full node advantage it is just marginally faster, while lacking RT and AI acceleration. Some of your arguments are specious: more CUs the better is obvious, simpler CUs the better is not, especially if we're talking about ray tracing. You are also ignoring Amdahl's law: the RT pass is just a fraction of the frametime, so even with infinitely fast RT cores that completely offset the ray casting process you would only improve your frametime by about 30% in the best-case scenario; even on a production-quality offline render the improvement would be limited.
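A quick worked example of the Amdahl's-law point, assuming, as the post above implies, that the ray-casting pass is about 30% of the frametime; the helper name and the 16.7 ms figure are illustrative only:

    def frametime_after_speedup(frametime_ms, accelerated_fraction, speedup):
        # Amdahl's law applied to a frame: only the accelerated slice of the
        # frametime shrinks, everything else stays exactly as long as before.
        untouched   = frametime_ms * (1.0 - accelerated_fraction)
        accelerated = frametime_ms * accelerated_fraction / speedup
        return untouched + accelerated

    base = 16.7                                                 # ms per frame, ~60 fps
    print(frametime_after_speedup(base, 0.30, 2.0))             # ~14.2 ms with a 2x faster RT pass
    print(frametime_after_speedup(base, 0.30, float("inf")))    # ~11.7 ms with infinitely fast RT
    # Even an infinitely fast RT pass only shortens the frame by ~30%,
    # i.e. roughly a 1.4x overall speedup.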
Stefem:

You are oversimplifying a lot. If you were right, all the render farms and render software would have been GPU-based for decades, and they're not; the CPU was (and still is) the predominant processor, at least until RTX. GPUs excel at highly coherent work, but ray tracing is very incoherent; ray divergence can dramatically reduce efficiency (the theoretical-to-real performance ratio) on a GPU (it's even described in the AMD patent, if you care to read it). You seem to make a lot of arbitrary assumptions; you don't know the real transistor cost of the RT cores. NVIDIA unified the cache (shared memory, L1 and texture cache), so the main supposed advantage of the AMD implementation (saving on buffers) isn't there. Take Navi, which is very close in transistor count to TU106: even with a full node advantage it is just marginally faster, while lacking RT and AI acceleration. Some of your arguments are specious: more CUs the better is obvious, simpler CUs the better is not, especially if we're talking about ray tracing. You are also ignoring Amdahl's law: the RT pass is just a fraction of the frametime, so even with infinitely fast RT cores that completely offset the ray casting process you would only improve your frametime by about 30% in the best-case scenario; even on a production-quality offline render the improvement would be limited.
You wrote: "ray/triangle intersection testing is hard"... That's a false statement. AMD's approach uses the TMUs => more CUs = more TMUs => higher RT performance. There are other downsides, but not on that point. As for the best-case scenario of 30% faster... I think the best-case scenario is DX-R off. And comparing DX-R on, where it is applied to only about 20% of the pixels on screen, against DX-R off shows a much higher performance difference. Or, if you will, consider the performance impact if every single pixel on screen had to go through this pass, and therefore the much higher potential improvement. But I have that strange feeling that you mean something else. Yet I am not sure, because acknowledging that would mean you are an idiot. Amdahl's Law is stupidly simple, and for a workload to achieve only a 30% speedup through parallelization means that ~77% of that workload can't be parallelized. That would apparently be a false statement too. You likely threw Amdahl in out of some confusion. Do you even have a DX-R frametime analysis at hand? Do you even know to what degree DX-R is made to be parallelized? My humble guess is something like 0.999 to 0.99999, as with most rendering methods, to be able to $h*7 out around 400 million pixels per second while using thousands of shader cores (which would otherwise be almost useless). = = = = Was your argument really that more HW units capable of doing a simple plane-to-vector check would not speed up the task?
Never been really that sold on raytracing to begin with, but this is obviously AMD quickly coming up with something to try and compete. nVidia caught them by surprise with a new hardware-level feature.
Their public plan was to do RT via the cloud. I do think Nvidia RTX has forced them to re-evaluate their plans going forward. However, the biggest question consumers are asking is whether RTX games will work on their (future) hardware.