AMD: Asynchronous shaders in GCN handy with DirectX 12

Turanis

2015-03-31 11:50

"All graphics cards based on GCN architecture can now handle multiple command instructions and data flows simultaneously which is managed by compute-engines called ACEs. Each queue can pass instructions without the need to wait on other tasks. That will keep your GPU 100% active as the work-flow is prioritized and thus always available. Per GPU eight ACE (Asynchronous Compute Engine) units are available which can manage eight waiting queues with direct access to the L2 cache and which is called 'global share data'. There really are multiple advantages to be found here, an overall more efficient rendering experience meaning a higher FPS, but due to the decreased latency this is way more optimal for Virtual Reality gaming." Yeah, GCN all the way, compute all the way, double precision all the way. 🙂 We'll see in the future whether the devs can handle ACE (Asynchronous Compute Engine). Can't wait.

#5041371

OnnA

2015-03-31 13:09

Yep! That's GCN :banana: And of course devs can (must) handle it because of the PS4 and Xbox One, so porting will be much easier and we PC gamers get a better gaming experience overall. So DX12 (with GCN 12_3) will benefit all MS platforms. Fingers crossed 🤓

#5041374

Noisiv

2015-03-31 13:22

HD 7000 & Rx 240/250/270/280: command processor x1 queue + 2 ACEs x1 queue + 2 DMA engines -> Graphics/Compute/Copy with limitations
HD 7790 & R7 260: command processor x1 queue + 2 ACEs x8 queues + 2 DMA engines -> Graphics/Compute/Copy
R9 285/290: command processor x1 queue + 8 ACEs x8 queues + 2 DMA engines -> Graphics/Compute/Copy
GTX 400/500/600/700: command processor x1 queue + 1 DMA engine -> No support
GTX 750/780/Titan: command processor x32 queues (limited) + 1 DMA engine -> Compute/Compute
GTX 900/Titan X: command processor x32 queues + 2 DMA engines -> Graphics/Compute/Copy

Among the latest graphics cards, the GeForce GTX 900 series will take full advantage of the optimizations tied to concurrent tasks, as will the Radeon R9 290, for example. On the other hand, it remains to be seen what developers will do with it. The gains will not be automatic and will require the different stages of 3D rendering to be suited to it.

http://www.hardware.fr/news/14133/gdc-d3d12-amd-parle-gains-gpu.html
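To make the queue arithmetic in the listing above concrete, here is a minimal sketch that encodes only the Radeon rows and multiplies ACEs by queues per ACE. The class name `GpuQueues` and its field names are made up for illustration; the numbers are copied from the listing, not measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuQueues:
    """Queue hardware for one GPU family, as given in the listing (hypothetical type)."""
    name: str
    command_processor_queues: int
    aces: int
    queues_per_ace: int
    dma_engines: int

    @property
    def compute_queues(self) -> int:
        # Total ACE-managed queues = number of ACEs x queues per ACE.
        return self.aces * self.queues_per_ace


# Radeon rows only; values copied from the listing.
RADEONS = [
    GpuQueues("HD 7000 & Rx 240/250/270/280", 1, 2, 1, 2),
    GpuQueues("HD 7790 & R7 260", 1, 2, 8, 2),
    GpuQueues("R9 285/290", 1, 8, 8, 2),
]

for gpu in RADEONS:
    print(f"{gpu.name}: {gpu.compute_queues} ACE queues")
```

Under these numbers the R9 285/290 row works out to 8 x 8 = 64 ACE queues, and the HD 7790 row to 2 x 8 = 16.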

#5041396

fantaskarsef

2015-03-31 14:10

http://www.hardware.fr/news/14133/gdc-d3d12-amd-parle-gains-gpu.html

This reads as if the Maxwells and the middle to top 300 AMD cards will treat this pretty much the same, or am I wrong?

#5041479

Denial

2015-03-31 16:41

This reads as if the Maxwells and the middle to top 300 AMD cards will treat this pretty much the same, or am I wrong?

Maxwell can do 32 Queues, 290 can do 64.

#5041480

Enticles

2015-03-31 16:44

This sounds like a GPU version of hyperthreading to me. Happy to see the efficiency improving, and not just for team red or green, so everyone who has a compatible card, regardless of vendor, will be able to enjoy these improvements. Can't wait to see the real-life results 🙂

#5041506

Hootmon

2015-03-31 17:24

Spiffy!

#5041581

sykozis

2015-03-31 18:42

This sounds like a GPU version of hyperthreading to me. Happy to see the efficiency improving, and not just for team red or green, so everyone who has a compatible card, regardless of vendor, will be able to enjoy these improvements. Can't wait to see the real-life results 🙂

No. Each shader processor will still only be able to execute 1 thread at a time, whereas hyperthreading allows a single processor core to execute 2 threads. They're just finally implementing true simultaneous multi-threading for GPUs... and doing so through software. There are enough shader processors within a GPU that hyperthreading really isn't needed. My GTX 970, for example, has 1664 shader processors (or CUDA cores, as NVIDIA calls them). GPUs under DX11 and OpenGL are essentially "in-order" processors, where data is processed in the exact order it's received. With DX12 and Vulkan, the GPU will function more like an "out-of-order" processor, where instructions are prioritized and executed in order of importance.

#5041627

FerCam™

2015-03-31 19:19

Maxwell can do 32 Queues, 290 can do 64.

Hmm, nice to know. I just got a GTX 980 from a 290X RMA, because the store no longer sells the 290X...

#5041642

Lane

2015-03-31 19:49

Maxwell can do 32 Queues, 290 can do 64.

It seems a bit different from that, if I understand it correctly...

Finally, with the Maxwell 2 GPUs (GM200/GM204/GM206), Nvidia removed all of these limitations, contrary to what we thought. First, the second DMA engine is enabled on the GeForce variants. More importantly, when Hyper-Q is active, one of the 32 queues can be a graphics queue.

Ok confirmed by Anandtech:

so we checked with NVIDIA on queues. Fermi/Kepler/Maxwell 1 can only use a single graphics queue or their complement of compute queues, but not both at once – early implementations of HyperQ cannot be used in conjunction with graphics. Meanwhile Maxwell 2 has 32 queues, composed of 1 graphics queue and 31 compute queues (or 32 compute queues total in pure compute mode). So pre-Maxwell 2 GPUs have to either execute in serial or pre-empt to move tasks ahead of each other, which would indeed give AMD an advantage..
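The split described in that quote can be written down as a small rule. This is only a sketch of the quoted description, not of any real driver API; the function name `usable_queues` and its parameters are hypothetical, and the queue count is passed in because the quote does not give one for the older GPUs.

```python
def usable_queues(mixed_mode_supported: bool, total_queues: int, want_graphics: bool) -> dict:
    """Return how many graphics and compute queues are usable at once.

    mixed_mode_supported: True for Maxwell 2 per the quote, False for
    Fermi/Kepler/Maxwell 1 (graphics OR compute, never both).
    """
    if not want_graphics:
        # Pure compute mode: every queue is a compute queue.
        return {"graphics": 0, "compute": total_queues}
    if mixed_mode_supported:
        # One queue becomes the graphics queue, the rest stay compute.
        return {"graphics": 1, "compute": total_queues - 1}
    # Older parts: a single graphics queue and no concurrent compute queues.
    return {"graphics": 1, "compute": 0}


print(usable_queues(True, 32, want_graphics=True))    # Maxwell 2, mixed mode
print(usable_queues(True, 32, want_graphics=False))   # Maxwell 2, pure compute
print(usable_queues(False, 32, want_graphics=True))   # pre-Maxwell 2 with graphics
```

With 32 queues, the Maxwell 2 mixed case gives 1 graphics queue plus 31 compute queues, matching the quote.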

#5041686

anxious_f0x

2015-03-31 21:13

It's an interesting way of doing things. Let's hope it's actually utilised by developers on both PC and console; it certainly puts the PS4 in a good position with its 8 ACEs.

#5041944

fantaskarsef

2015-04-01 07:13

Maxwell can do 32 Queues, 290 can do 64.

so we checked with NVIDIA on queues. Fermi/Kepler/Maxwell 1 can only use a single graphics queue or their complement of compute queues, but not both at once – early implementations of HyperQ cannot be used in conjunction with graphics. Meanwhile Maxwell 2 has 32 queues, composed of 1 graphics queue and 31 compute queues (or 32 compute queues total in pure compute mode). So pre-Maxwell 2 GPUs have to either execute in serial or pre-empt to move tasks ahead of each other, which would indeed give AMD an advantage..

I'm still not entirely sure I get it... doesn't that mean 64 for AMD vs a single one with Maxwell 2 cards? That would indeed look like an advantage for AMD...

#5042561

Dazz

2015-04-02 06:15

Doesn't Maxwell do this anyway, but in hardware? It tries to prioritise traffic; this can clearly be seen in the Maxwell-based 970, which puts frequently used data in the fast memory partition and stores less-used cached data in the reserved part. In essence, Nvidia should get a nice increase if it's done in software first; the hardware can then either adjust it to its own requirements or ignore it as already efficient enough. AMD's solution doesn't do this, so it may benefit immensely. Time will tell, though.

#5042704

xg-ei8ht

2015-04-02 14:06

PS4 has 8 ACEs and 64 queues.

#5042717

Lane

2015-04-02 14:31

Doesn't Maxwell do this anyway, but in hardware? It tries to prioritise traffic; this can clearly be seen in the Maxwell-based 970, which puts frequently used data in the fast memory partition and stores less-used cached data in the reserved part. In essence, Nvidia should get a nice increase if it's done in software first; the hardware can then either adjust it to its own requirements or ignore it as already efficient enough. AMD's solution doesn't do this, so it may benefit immensely. Time will tell, though.

The two are not related. Here we are talking about asynchronous compute at the shader level (not about working around a poor memory-access design as best they can). I encourage you to read the article from Anandtech (just don't look at the queue counts in the table, they seem to be wrong: 8 queues per ACE gives a total of 64 queues, not 8): http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading

I'm still not entirely sure I get it... doesn't that mean 64 for AMD vs a single one with Maxwell 2 cards? That would indeed look like an advantage for AMD...

Of course, it's an architectural advantage for AMD, as GCN has basically been designed for and around it since the 1.0 iteration (HD 7970). Nvidia now has Maxwell, which supports it, but it will indeed not be as good as GCN at it. But I think that by the time we see developers taking advantage of it, the 2016-2017 GPUs will certainly be out (so Pascal). I'd bet that on many fronts Pascal will look really similar to GCN.

#5042725

Spets

2015-04-02 14:46

The two are not related. Here we are talking about asynchronous compute at the shader processor level (not about working around a poor memory-access design as best they can). I encourage you to read the article from Anandtech: http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading It's an architectural advantage for AMD, as GCN has basically been designed for it since the 1.0 iteration (HD 7970). Nvidia now has Maxwell, which supports it, but it will indeed not be as good as GCN at it. But I think that by the time developers really take advantage of it, the 2016 GPUs will certainly be out (so Pascal). I'd bet that on many fronts Pascal will look really similar to GCN.

Going off the chart from the article you linked, it looks like Maxwell 2 has better support than GCN. Everything up to it though does lack in comparison. Would be nice to see developers taking advantage of this.

#5042728

Lane

2015-04-02 14:49

Going off the chart from the article you linked, it looks like Maxwell 2 has better support than GCN. Everything up to it though does lack in comparison. Would be nice to see developers taking advantage of this.

The table numbers are under discussion right now. It seems to be 8 queues per ACE, for a total of 64 queues, not 8 (following the OpenCL GCN table), though it is a bit of a play on words. Don't forget that AMD then has a second level (the ACEs are not in the SM, unlike Nvidia's). Again, for Maxwell this is the "compute queue" count. Asynchronous shading needs three different things working simultaneously: graphics, compute and DMA (copy). This is where the problem lies today with DX11: you can't do them at once. AMD's GCN can do this because it has always supported the three types simultaneously (with some limitations on the first iteration).
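As a rough illustration of why running graphics, compute and copy at once matters, here is a toy timing model. It assumes the three kinds of work are fully independent and overlap perfectly with no contention for shaders or memory bandwidth, which real hardware will not deliver; the workload names and millisecond figures are made up.

```python
def serial_time(tasks: dict) -> float:
    """Frame time when queues run one after another: the sum of all work."""
    return sum(tasks.values())


def overlapped_time(tasks: dict) -> float:
    """Idealised frame time when every queue runs concurrently.

    With perfect overlap, the frame is bounded by the longest single queue.
    """
    return max(tasks.values())


# Hypothetical per-frame workloads in milliseconds (not measurements).
frame = {"graphics": 10.0, "compute": 4.0, "copy": 2.0}

print("serial:", serial_time(frame), "ms")
print("overlapped (best case):", overlapped_time(frame), "ms")
```

The gap between the two figures is only an upper bound on the gain; as noted earlier in the thread, the gains are not automatic and depend on how developers structure the rendering stages.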