Does the GeForce GTX 970 have a memory allocation bug? (update 3)

There is an explanation on the OCN forum:

From Nai's benchmark, assuming the slowdown is caused by the disabling of SMM units, and given the different bandwidth each GPU shows once the benchmark's memory allocation reaches the 2816 MiB to 3500 MiB range, I can only assume this is caused by the way the SMM units are disabled. Allow me to elaborate on my assumption. As we know, there are four raster engines in the GTX 970 and the GTX 980, and each raster engine has four SMM units. The GTX 980 has all SMM units enabled in each raster engine, so there are 16 SMM units. The GTX 970 is made by disabling 3 of those SMM units. What NVIDIA has refused to tell us is which raster engines have their SMM units disabled. Most reviewers simply modified the high-level GTX 980 architecture diagram by removing one SMM unit from each of three raster engines, leaving one raster engine with all four SMM units intact.

First scenario: what if the first (or the second, third, or fourth) raster engine has 3 of its SMM units disabled, instead of the disabled units being spread evenly across the four raster engines?

Second scenario: or the first raster engine has two SMM units disabled and the second raster engine has one SMM unit disabled?

Please also notice the memory controller in the diagram for each raster engine. If we follow the first scenario, that raster engine will definitely not be able to make full use of the memory controller bandwidth.
Each memory controller is 64-bit; 4 memory controllers in total = a 256-bit bus. Assuming there are 3 raster engines with one SMM disabled each, leaving 1 raster engine with all 4 SMMs intact, the math works out as:

16 SMM = 256 bit = 4096 MB
13 SMM = 208 bit = 3328 MB

where 208 bit is the effective width after disabling SMMs, with 256 bit being the actual memory controller width. The GTX 970 would effectively be a 208-bit card.
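As a quick sanity check of that arithmetic, here is a sketch of the assumption above, namely that bus width and usable memory scale linearly with the number of enabled SMM units (the assumption itself is unconfirmed):

#include <cstdio>

int main()
{
    const int FullSMM = 16;
    const int FullBusBits = 256; // 4 x 64-bit controllers, per the post
    const int FullMB = 4096;
    const int Counts[] = { 16, 13 };

    // Assumption from the post: effective width and usable memory
    // scale linearly with the number of enabled SMM units.
    for (int SMM : Counts)
        printf("%2d SMM -> %3d bit, %4d MB\n",
               SMM, FullBusBits * SMM / FullSMM, FullMB * SMM / FullSMM);
    return 0;
}

This prints 256 bit / 4096 MB for 16 SMMs and 208 bit / 3328 MB for 13, matching the figures above.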
Wow, that is a hell of a bandwidth drop. So NVIDIA was selling 3GB/208-bit cards as 4GB/256-bit? Oh my... 😀
Fresh user.....T_T....okay.... Pill monster, could you take a look at Nai's source code and see if there is any issue with it? I admit I am not code-literate.
Testing methods are all over the place, so there's nothing really conclusive there. I personally won't bother testing until The Witcher 3 comes out; most of these other games are questionable console ports from Ubisoft, or Shadow of Mordor, which I can't be bothered to re-download. Interested to see how this concludes myself, though I'm not too worried about it for the moment. The 970 is still a massive improvement over the 560 Ti I was using previously.
Wow, that is a hell of a bandwidth drop. So NVIDIA was selling 3GB/208-bit cards as 4GB/256-bit? Oh my... 😀
No, they are selling a 4GB/256-bit card, as there is 4GB physically and each of the 8 chips has a 32-bit bus. But from their assumption it is quite possible that, while some parts of the memory can be accessed directly, others are accessed via shared switching infrastructure, as crucial parts of the GPU are cut away.
Fresh user.....T_T....okay.... Pill monster, could you take a look at Nai's source code and see if there is any issue with it? I admit I am not code-literate.
Get the link; while I don't do CUDA, I can check for obvious deviations.
No, they are selling a 4GB/256-bit card, as there is 4GB physically and each of the 8 chips has a 32-bit bus. But from their assumption it is quite possible that, while some parts of the memory can be accessed directly, others are accessed via shared switching infrastructure, as crucial parts of the GPU are cut away. Get the link; while I don't do CUDA, I can check for obvious deviations.
http://www.computerbase.de/forum/showthread.php?t=1435408&p=16868213#post16868213 Please do check, I appreciate it ^.^ This is Nai's source code. The only problem: it is preferable to use the IGP and put the GTX 970 in headless display mode when running the benchmark. Otherwise, the result might be inaccurate due to web browsers and Windows compositing reserving/using VRAM.
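If running headless is not an option, you can at least record how much VRAM is already reserved before the benchmark starts. A minimal sketch using the standard cudaMemGetInfo call (my own illustration, not part of Nai's code):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t FreeBytes = 0, TotalBytes = 0;
    // Free vs. total device memory: whatever the compositor and
    // browser have reserved shows up as a lower free value.
    cudaMemGetInfo(&FreeBytes, &TotalBytes);
    printf("Free: %zu MiB of %zu MiB\n", FreeBytes >> 20, TotalBytes >> 20);
    return 0;
}

That way results can at least be compared against how much memory the benchmark actually had to work with.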
MSI GTX 970 4G here: http://www.skacik.pl/images/WTFNVIDIA.png Right as it reaches the 3rd gigabyte, the bandwidth starts going face down, ass up. The heck, NVIDIA?
Are you running the bench with the NVIDIA GPU in headless display mode?
Same card as above and can confirm as well. Certainly odd. Edit: Noticed someone had posted while I was testing. Started wondering and checked Afterburner, and what do you know: 4038MB of VRAM allocated during the test. Still, it's odd that the 980 didn't show it in the test.
Are you running the bench with the NVIDIA GPU in headless display mode?
No, not really, but I could do that, or retry the test with as much VRAM left free as I can. Generally I believe that there's something wrong here.
=.= Sigh, when you guys run the benchmark, please mention whether you are running it with the NVIDIA GPU in headless display mode! Otherwise, Windows compositing plus the web browser will reserve some portion of VRAM and skew the result.
=.= Sigh, when you guys run the benchmark, please mention whether you are running it with the NVIDIA GPU in headless display mode! Otherwise, Windows compositing plus the web browser will reserve some portion of VRAM and skew the result.
In a normal scenario (games/rendering) nobody is going to run on the IGP; the drop in bandwidth is dramatic, and I guess that something IS wrong here. Technically somebody could try running it clean just to prove it once and for all. Though in normal usage, even if some drop in performance is expected, come on: DRAM drops from ~150 to ~16/~20, and cache bandwidth goes from ~422 to ~16/~25/~77. That much of a performance drop is suspicious. Also, I'm not an expert in VRAM allocation, but I doubt that Windows and normal desktop programs would somehow impact the last gigabyte of VRAM while leaving the other three without any issues.
=.= Sigh, when you guys run the benchmark, please mention whether you are running it with the NVIDIA GPU in headless display mode! Otherwise, Windows compositing plus the web browser will reserve some portion of VRAM and skew the result.
If I understand that mode correctly, what you're asking for is impossible with single GPU configurations.
http://www.computerbase.de/forum/showthread.php?t=1435408&p=16868213#post16868213 Please do check, I appreciate it ^.^ This is Nai's source code. The only problem: it is preferable to use the IGP and put the GTX 970 in headless display mode when running the benchmark. Otherwise, the result might be inaccurate due to web browsers and Windows compositing reserving/using VRAM.
The code looks solid; no logical/math errors.

1. The code allocates memory 128MB chunk by chunk until the card runs out of memory (the last sub-128MB block is not allocated). Therefore, if some memory is already allocated when you run it, the code allocates the remaining memory and the test should not be affected.
2. There were some experimental rewrites. I picked 3 definitions: 2 of them are fixed while the 3rd is based on the previous two, and I am really not sure why there is no fixed value for the 3rd, as it is based on the previous 2 and they are not altered by the code.
3. The only question I have is
__global__ void BenchMarkDRAMKernel(float4* In)
{
    // Global thread index; each thread reads one 16-byte float4 from DRAM.
    int ThreadID = blockDim.x * blockIdx.x + threadIdx.x;
    float4 Temp = make_float4(1);
    // The DRAM read being timed.
    Temp += In[ThreadID];
    // length() is never negative, so this branch never executes;
    // it only keeps the compiler from optimizing the read away.
    if (length(Temp) == -12354)
        In[0] = Temp;
}
and its cache-bench counterpart, as I am not sure what kind of overhead "blockDim.x * blockIdx.x + threadIdx.x" has, since those are CUDA-related allocations, and where those are held. I would rather make something that allocates the entire block, fills it with random incompressible data, and then benchmarks some simple math operation over each chunk, like negation, since it is repeatable and always gives the same result. It would be much slower and would not give bandwidth figures, but it would show whether each chunk gets processed in the same amount of time.
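Roughly, something like this (a quick untested sketch of the idea, not Nai's code; the 128MB chunk size mirrors his scheme, and the cudaMemset merely stands in for a proper random fill):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Negate every float in one chunk: a repeatable operation with
// the same cost per run.
__global__ void NegateChunk(float* Chunk, size_t Count)
{
    size_t i = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < Count)
        Chunk[i] = -Chunk[i];
}

int main()
{
    const size_t ChunkBytes = 128u * 1024u * 1024u; // 128 MiB, as in Nai's bench
    const size_t Count = ChunkBytes / sizeof(float);
    std::vector<float*> Chunks;

    // Allocate chunk by chunk until the card runs out of memory.
    for (;;) {
        float* p = nullptr;
        if (cudaMalloc((void**)&p, ChunkBytes) != cudaSuccess) {
            cudaGetLastError(); // clear the expected out-of-memory error
            break;
        }
        cudaMemset(p, 0x5A, ChunkBytes); // stand-in for a random fill
        Chunks.push_back(p);
    }

    cudaEvent_t Start, Stop;
    cudaEventCreate(&Start);
    cudaEventCreate(&Stop);

    // Run the identical kernel on every chunk; a chunk that lives
    // behind slower memory should take measurably longer.
    for (size_t c = 0; c < Chunks.size(); ++c) {
        cudaEventRecord(Start);
        NegateChunk<<<(unsigned)((Count + 255) / 256), 256>>>(Chunks[c], Count);
        cudaEventRecord(Stop);
        cudaEventSynchronize(Stop);
        float Ms = 0.0f;
        cudaEventElapsedTime(&Ms, Start, Stop);
        printf("Chunk %3zu: %.3f ms\n", c, Ms);
    }

    for (float* p : Chunks)
        cudaFree(p);
    return 0;
}

If one of the last chunks consistently takes longer than the rest, that chunk sits in the slow region regardless of any benchmark overhead.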
If I understand that mode correctly, what you're asking for is impossible with single GPU configurations.
Wouldn't most people have an IGP on their motherboard? Except for maybe older i7s. Easily tested.
Wouldn't most people have an IGP on their motherboard? Except for maybe older i7s. Easily tested.
Of course, yes. Easy to forget that fact when you don't really ever need it.
As you are testing, I would ask you to do exactly the opposite. Do not try to have minimal allocation before you start the test. Allocate even 1GB of VRAM before the test, and terminate the game after the test has allocated the whole remaining 4GB (if it allocates only during the bench itself and not in the earlier part where it pauses, then kill the game as soon as the chunks start getting tested). This way you free up additional space which would otherwise not be allocated by the benchmark, and if the test shows no drop in performance in that additional VRAM, the issue is due to overhead.
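If juggling a game is inconvenient, a small VRAM hog does the same job. A sketch, under the assumption that a plain cudaMalloc reservation behaves like a game's allocation for this purpose:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Stand-in for a running game: grab ~1GB of VRAM and hold it.
    void* Dummy = nullptr;
    if (cudaMalloc(&Dummy, (size_t)1 << 30) != cudaSuccess) {
        printf("Could not reserve 1GB\n");
        return 1;
    }
    printf("Holding 1GB of VRAM. Start the benchmark now, then press Enter here once it has allocated its chunks.\n");
    getchar();

    // Releasing the block mimics killing the game mid-run.
    cudaFree(Dummy);
    printf("Released. Let the benchmark finish.\n");
    getchar();
    return 0;
}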
The code looks solid; no logical/math errors. [...] I would rather make something that allocates the entire block, fills it with random incompressible data, and then benchmarks some simple math operation over each chunk.
Hmm, I see... Perhaps you would like to pass this crucial information to Nai on that forum; I can't speak German... :bang:
Wouldn't most people have an IGP on their motherboard? Except for maybe older i7s. Easily tested.
Why should people have to alter their configuration just to test a theory? Especially a theory that cannot be definitively proven? While I trust Fox's evaluation of the source code, even he can't guarantee that there is not a flaw somewhere causing "unusual" results. If there actually were a flaw, then all the results would fall within a respectable margin of error, which, based on the screenshots posted here, is not happening. Using CUDA to demonstrate a flaw is already creating a flaw in the testing to begin with: CUDA is NVIDIA IP, so it should not have been used, and any results derived from a CUDA-based test should be disregarded.
Fresh user.....T_T....okay.... Pill monster, could you take a look at Nai's source code and see if there is any issue with it? I admit I am not code-literate.
Just to clarify, what that OCN guy said regarding fresh users wasn't directed at you specifically, nor did I mean it that way. 🙂 It was instead pointing out a theme which seems to keep recurring in this situation; similar comments have also been made on AnandTech... And I don't know anything about coding, so I can't help you there, sorry. Coding is one of the things I know least about, tbh.