AMD Working on 16-Core Processor with Integrated PCI Express 3.0 Controller

Assuming this is gonna be a monolithic die for server applications, though if they did export it to desktop... 😀
By the time they integrate PCI-E 3.0, Intel will be on 4.0 lol.
I doubt that :P
I would hope so. As far as I'm aware, we haven't even saturated 2.0 yet. Aside from some SSDs, I don't really understand the point of releasing PCIe 3.0. AMD already has several 16-core Opterons. I'm guessing their current generation is Steamroller-based, and I figure this new 16-core will also only be an Opteron. AMD has stated before that they're not targeting the high-end/enthusiast desktop PC market anymore, and that's exactly what a 16-core would be classified as. I don't see a reason for a 16-core entering the desktop market anyway - most people still can't put an i7 to good use. But Opterons are still relatively cheap. You could probably make a pretty good desktop computer out of an Opteron system, as long as you accept that the motherboard you get will likely lack Crossfire/SLI support, built-in audio, and a slew of USB ports.
Exactly. The current 16-core Opterons are a bit like the Pentium D and Core 2 Quads, though: dual 8-core Bulldozer/Piledriver dies on one package, so having this as a monolithic CPU would be a big step up for AMD. That said, looking at Kaveri, the new 28 nm process has really shrunk the die size down quite a bit from what you'd expect, so assuming this chip would be on the same process is reasonable. We might even see something that can compete with Haswell-E's 8-core if we're lucky. If I were aiming for a cheap server, though, it would be AMD-based; the Intel Xeons are horrendously expensive, Opterons not so much.
Mantle + 16-core gaming could be good. Trim some L3, add 4 more cores (which means more L1, and even L2), add two more memory channels, add some pipeline depth, decrease frequency and increase efficiency, raise the working temperature limit so it works okay even over 75°C, make the L2 or L3 caches addressable by APIs like CUDA and OpenCL, and... better game physics. All of this could make it even better.
I would hope so. As far as I'm aware, we haven't even saturated 2.0 yet. Aside from some SSDs, I don't really understand the point of releasing PCIe 3.0..
I think you mean releasing PCIe 4.0. We've had 3.0 for a few years now, and I'm running 3.0 x8/x8 in SLI now. Intel plans on releasing 4.0 this year or next; I thought I read somewhere by 2015 at the latest. I saw 780s running in an x8/x8 2.0 rig and they were slower in benches and gaming. 4.0 may be needed for the super-enthusiasts looking to push all that bandwidth. There are already bigger and badder cards than Maxwell planned. I myself am not upgrading, though.
PCI Express 4 ??
I think you mean releasing PCIe 4.0. We've had 3.0 for a few years now, and I'm running 3.0 x8/x8 in SLI now. Intel plans on releasing 4.0 this year or next; I thought I read somewhere by 2015 at the latest. I saw 780s running in an x8/x8 2.0 rig and they were slower in benches and gaming. 4.0 may be needed for the super-enthusiasts looking to push all that bandwidth. There are already bigger and badder cards than Maxwell planned. I myself am not upgrading, though.
There is still no graphics card that can bottleneck a PCI-E x16 v2.0 slot, so why would you need v4? :puke2: I don't think you have any idea about PCI-E and its bandwidth...
Apparently Skylake is getting PCI-E 4.0. Since AMD has removed the CF bridge for the R9 series and made them bridgeless, apparently they can now saturate PCI-E 3.0 @ x16, since the cards talk over the PCI-E bus now.
There is still no graphics card that can bottleneck a PCI-E x16 v2.0 slot, so why would you need v4? :puke2: I don't think you have any idea about PCI-E and its bandwidth...
I remember reading some benches a year ago about PCI-e 2.0 vs 3.0, and if memory serves me right, there was only a 3% degradation when using 2.0 (instead of 3.0). ...Things might change this year, but I won't be upgrading anything because I expect Mantle to supplement my CPU well enough that I'll be able to ride my i7 920 (& R9 290) for two more years 🙂
Mantle + 16-core gaming could be good. Trim some L3, add 4 more cores (which means more L1, and even L2), add two more memory channels, add some pipeline depth, decrease frequency and increase efficiency, raise the working temperature limit so it works okay even over 75°C, make the L2 or L3 caches addressable by APIs like CUDA and OpenCL, and... better game physics. All of this could make it even better.
Ummmm... huh? L1, L2 and L3 aren't addressable by APIs because it would cause problems for the processor. Increasing pipeline depth would be very bad. Decreasing frequency while increasing pipeline depth would be suicide for AMD. To increase efficiency, you have to shorten the pipeline. AMD can't do anything that affects CUDA because they have no rights to it... also, allowing CUDA to access cache wouldn't improve PhysX in the least, as the system would be too unstable to be usable.
Increasing pipeline depth would be very bad.
Why can't increasing the pipeline depth increase instructions per cycle? How can we increase the total performance of a CPU then? Increasing clock frequency versus increasing IPC: which one is more future-proof? Which one is more efficient in terms of "instructions per joule"?
The longer the pipeline, the longer it takes for an instruction to complete, thus reducing performance (and efficiency). The long pipeline was among the drawbacks of Intel's NetBurst architecture; with Conroe, Intel drastically reduced the pipeline length. A shorter pipeline results in instructions completing faster, and shorter pipelines are more efficient. The shorter pipeline of the Athlon-series processors is part of the reason they were just as fast, at lower clock speeds, as the Pentium 4 and Pentium D processors.
Then a longer pipeline increases pipeline latency, so it leads to fewer instructions per second (as long as the instruction issue/fetch rate stays the same)? Then it is like:

Short pipeline (3 stages, single issue):
1 instruction = 3 cycles -> 1 instruction per 3 cycles, inefficient
2 instructions = 4 cycles -> 1 instruction per 2 cycles... OK
3 instructions = 5 cycles -> 3/5, better
4 instructions = 6 cycles -> 2/3, even better but low probability
5 instructions = 7 cycles -> 5/7, best but very hard to maintain?

Long pipeline (tenfold, i.e. 30 stages, single issue):
1 instruction = 30 cycles -> 1/30, yes, very slow
2 instructions = 31 cycles -> 2/31, nearly double the first one
3 instructions = 32 cycles -> 3/32 -> the cycle count hardly increases but the instruction count grows faster
...
10 instructions = 39 cycles -> nearly 1 instruction per 4 cycles

Long pipeline (tenfold, 20 issued):
20 instructions = 49 cycles -> 2/5, very good from the beginning
40 instructions = 69 cycles -> 4/7, even better
60 instructions = 89 cycles -> 2 instructions per 3 cycles

You are right. But faster issuing can help, can't it? By "issue" I meant instruction fetching.
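
For what it's worth, every row above follows one idealized formula: with a D-stage pipeline and one instruction entering per cycle, N instructions finish after D + N - 1 cycles (so 3 + 5 - 1 = 7, 30 + 10 - 1 = 39, and even the "20 issued" rows fit: 30 + 20 - 1 = 49). Below is a minimal Python sketch of that model, assuming a perfectly fed in-order pipeline with no stalls, hazards, or branch mispredictions (which real CPUs certainly have); the issue_width parameter is my own generalization for the wider-issue case, not something taken from the post.

# Minimal sketch: idealized in-order pipeline throughput (no stalls,
# hazards, or branch mispredictions). With a `depth`-stage pipeline and
# `issue_width` instructions entering per cycle, N instructions finish
# after roughly depth + ceil(N / issue_width) - 1 cycles.
from math import ceil

def cycles_to_finish(n_instructions, depth, issue_width=1):
    """Cycles until the last of N instructions leaves a depth-stage pipeline."""
    return depth + ceil(n_instructions / issue_width) - 1

for depth in (3, 30):                       # the 3-stage and 30-stage cases above
    for n in (1, 2, 3, 5, 10, 20, 60):
        c = cycles_to_finish(n, depth)
        print(f"{depth:2d}-stage, single issue: {n:2d} instr in {c:3d} cycles "
              f"-> {n / c:.2f} IPC")

# Throughput tends toward the issue width no matter how deep the pipeline is;
# depth mainly costs latency (and, on real CPUs, a bigger misprediction penalty).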
For all of you not understanding how bandwidth on PCI-e works, take your 3.0 port and a high-end 3.0 GPU, drop it from 16 lanes down to 8 and run benchmarks on both. You likely won't see a difference (maybe 1 or 2 FPS). Now, drop it down to 4 lanes. You might lose a few FPS here and there, but the game should still be playable. The only reason for increasing bandwidth per lane is for the PCI-e x1 devices, such as TV tuners, SSDs, Thunderbolt cards, or USB 3.x cards. Otherwise, we don't even need the bandwidth of 3.0 for modern GPUs. Assuming PCIe 4.0 will continue the trend of doubling bandwidth, one lane will be as fast as 8 lanes from the first generation, which is good enough to run most mid-range GPUs. It won't be long until something like the Titan can run off an x1 slot.
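
To put rough numbers on that lane math, here is a minimal sketch; the transfer rates and encodings are the published per-generation figures, and real-world throughput is a bit lower still because of packet and protocol overhead.

# Per-direction PCIe bandwidth by generation and lane count.
# Transfer rates (GT/s) and line encodings are the published spec values;
# actual usable throughput is lower due to protocol overhead.
GENS = {
    "1.0": (2.5, 8 / 10),     # 8b/10b encoding
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
    "4.0": (16.0, 128 / 130),
}

def bandwidth_gb_s(gen, lanes):
    """Usable bandwidth in GB/s for `lanes` lanes of a given generation."""
    gt_per_s, efficiency = GENS[gen]
    return gt_per_s * efficiency * lanes / 8  # bits per transfer -> bytes

for gen in GENS:
    print(gen, [round(bandwidth_gb_s(gen, lanes), 2) for lanes in (1, 4, 8, 16)])

That lines up with the point above: one 4.0 lane (about 2 GB/s) is roughly as fast as eight 1.0 lanes, and a 3.0 x4 link already carries about as much as a full 1.0 x16 slot.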
16 cores is definitely nice, but AMD needs to focus on single-threaded performance. I'll give AMD this, though: their server CPUs are crazy!
If deeper pipelines were better, AMD's FX series would easily beat Intel's Core series processors, and Intel's NetBurst architecture wouldn't have been replaced by Conroe... or been slower than AMD's K7 and K8 architectures. The fact of the matter is, shorter pipes provide better performance. Shorter pipelines mean instructions complete faster.
Then the main idea is just keeping all the compute cells busy through pipelining, while keeping the pipeline as short as possible? Putting a GPU inside a CPU may have a similar motivation then? Shorter paths, fewer stages between the two? Maybe that's why Nvidia will add stacked DRAM to GPUs? Then the total length of the data paths between cores becomes important if we were to add more cores? (Maybe some optimization algorithms could be used here, like "simulated annealing", to find a golden geometry of cores and the best "core to communication length" ratio?)
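
Purely as a toy illustration of that simulated-annealing idea (nothing like how CPUs are actually floorplanned), the sketch below places a handful of cores on a small grid and anneals random swaps to minimize the total Manhattan distance over a made-up list of communicating core pairs. GRID, LINKS, and the cooling schedule are all hypothetical values chosen for the example.

# Toy sketch: simulated annealing of core placement on a grid so the total
# "wire length" between communicating cores is minimized. All parameters
# here (grid size, link list, cooling schedule) are invented for illustration.
import math
import random

N_CORES = 8
GRID = 4                                   # cores live on a GRID x GRID grid
LINKS = [(0, 1), (1, 2), (2, 3), (4, 5),   # hypothetical pairs of cores
         (5, 6), (6, 7), (0, 4), (3, 7)]   # that communicate heavily

def total_length(pos):
    """Sum of Manhattan distances over all communicating core pairs."""
    return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
               for a, b in LINKS)

def anneal(steps=20000, t_start=2.0, t_end=0.01):
    # Start from a random placement of the cores on distinct grid cells.
    pos = random.sample([(x, y) for x in range(GRID) for y in range(GRID)], N_CORES)
    cost = total_length(pos)
    best, best_cost = list(pos), cost
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # geometric cooling
        i, j = random.sample(range(N_CORES), 2)
        pos[i], pos[j] = pos[j], pos[i]                     # propose: swap two cores
        new_cost = total_length(pos)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or random.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(pos), cost
        else:
            pos[i], pos[j] = pos[j], pos[i]                 # reject: undo the swap
    return best, best_cost

placement, length = anneal()
print("total link length:", length)
print("core positions:", placement)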
"CU" reads more like "compute units" than "CPU" to me, so I guess that first picture is a GPU.