Does the GeForce GTX 970 have a memory allocation bug? (update 3)

As you are testing, I would ask you to do exactly the opposite thing. Do not try to have minimal allocation before you start the test. Allocate even 1GB of VRAM before the test, and terminate the game after the test allocates the rest of the 4GB. (If it allocates only during the bench itself and not in the earlier part where it pauses, then kill the game as soon as the chunks start to get tested.) This way you free additional space which should not be allocated by the benchmark. And if the test does not show a drop in performance after getting the additional VRAM, the issue is due to overhead.
I'll try that. // Initial test: Lords of the Fallen, maxed out (3.5GB VRAM used) - benchmark crashes right away. Second test: BF4 (1.1GB VRAM used) ran alongside the benchmark. Third test: BF4 (1.1GB VRAM used), killed as soon as the benchmark finished allocating its memory.
Why should people have to alter their configuration just to test a theory? Especially a theory that cannot be definitively proven? While I trust Fox's evaluation of the source code, even he can't guarantee that there is not a flaw somewhere causing "unusual" results. If there actually was a "flaw", then all the results would fall within a respectable margin of error. Using CUDA to demonstrate a flaw already introduces a flaw into the testing to begin with. CUDA is NVidia IP, so it should not have been used, and any results derived from a CUDA-based test should be disregarded.
You are right, I do not know CUDA's inner workings. That is why I want to preallocate 1GB and let the test allocate the remaining 3GB, then kill the first 1GB allocation, leaving free space for the bench itself. If it indeed has CUDA-based memory overhead, 1GB of VRAM should be enough to accommodate it, and the test would show only 22/23 chunks, but they would all perform well. If even with a free 1GB block the test shows that the end blocks perform badly, then it is not caused by the code.
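If it helps, here is a minimal sketch of that pre-allocation step as a standalone CUDA program (the keypress flow and the exact 1GB figure are my own choices, not from the test):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void* hold = nullptr;
    // Grab ~1 GiB of VRAM up front so the benchmark can only take the rest.
    if (cudaMalloc(&hold, 1ull << 30) != cudaSuccess) {
        fprintf(stderr, "could not pre-allocate 1GB\n");
        return 1;
    }
    printf("1GB held - start the benchmark now, press Enter to release\n");
    getchar();
    // Releasing returns the region to the pool; the benchmark has already
    // stopped allocating, so this space stays free for its kernels.
    cudaFree(hold);
    printf("1GB released - press Enter to exit\n");
    getchar();
    return 0;
}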
I'll try that. // Initial test: Lords of the Fallen, maxed out (3.5GB VRAM used) - benchmark crashes right away.
It likely crashed because it could not allocate even the 0th block and then tried to run the bench on it, because there is no protection which checks that even one block got allocated.
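A minimal sketch of that allocation loop with the missing guard added (the structure here is assumed from this discussion, not taken from the benchmark's actual source):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t ChunkSize = 128ull << 20;   // 128 MiB, as in the benchmark
    std::vector<void*> chunks;
    for (;;) {
        void* p = nullptr;
        if (cudaMalloc(&p, ChunkSize) != cudaSuccess)
            break;                           // stop at the first failed allocation
        chunks.push_back(p);
    }
    // The guard the original reportedly lacks: with VRAM already full,
    // zero chunks get allocated and benchmarking "block 0" would crash.
    if (chunks.empty()) {
        fprintf(stderr, "no chunk allocated - aborting instead of crashing\n");
        return 1;
    }
    printf("allocated %zu chunks (%zu MiB total)\n", chunks.size(),
           chunks.size() * (ChunkSize >> 20));
    for (void* p : chunks) cudaFree(p);
    return 0;
}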
Just to clarify, what that OCN guy said regarding fresh users wasn't directed at you specifically, nor did I mean it that way. 🙂 Instead it was illuminating a theme which seems to keep recurring in this situation. Similar comments have also been made on AnandTech... And I don't know anything about coding, so I can't help you there, sorry. Coding is one of the things I know least about, tbh.
Thanks, I really appreciate it.
You are right, I do not know CUDA's inner workings. That is why I want to preallocate 1GB and let the test allocate the remaining 3GB, then kill the first 1GB allocation, leaving free space for the bench itself. If it indeed has CUDA-based memory overhead, 1GB of VRAM should be enough to accommodate it, and the test would show only 22/23 chunks, but they would all perform well. If even with a free 1GB block the test shows that the end blocks perform badly, then it is not caused by the code.
The graphics card needs memory to perform operations, since that's where the data the GPU needs to perform the requested operations is stored. The more data you force into memory, the less is available to continue performing operations, and the memory bandwidth is going to drop. Even an OpenCL-based application would show this occurring; being able to run from the CPU gives OpenCL the advantage of less overhead, which would mean a smaller drop in measured bandwidth. When you execute this "test", the instructions required for the operations are loaded into graphics memory, and from there they are processed. For each function, more instructions have to be loaded into memory. When you flood the memory, there's no place to store the next set of instructions, so the necessary space has to be flushed, which negatively affects memory bandwidth. The only way to avoid this would be for NVidia to partition the RAM to give CUDA its own dedicated memory partition, which just isn't feasible on a consumer graphics card. For the Quadro line it might be, but it would provide limited benefit to consumers compared to the cost of doing so.
The graphics card needs memory to perform operations, since that's where the data the GPU needs to perform the requested operations is stored. The more data you force into memory, the less is available to continue performing operations, and the memory bandwidth is going to drop. Even an OpenCL-based application would show this occurring; being able to run from the CPU gives OpenCL the advantage of less overhead, which would mean a smaller drop in measured bandwidth. When you execute this "test", the instructions required for the operations are loaded into graphics memory, and from there they are processed. For each function, more instructions have to be loaded into memory. When you flood the memory, there's no place to store the next set of instructions, so the necessary space has to be flushed, which negatively affects memory bandwidth.
This overhead is extremely small in comparison to the sizes it allocates. And the test is smart in the initial phase, where it allocates memory regions, since it stops once an allocation fails. That is why I want to free some memory after the allocation: the test will still run on the same blocks, so the results should not be affected. And there is one last thing someone could have missed: running the test with low VRAM allocation (only the desktop) so it can get the most, while monitoring VRAM and system RAM usage, as the performance drop may be caused by some emergency allocation from system RAM.
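For the monitoring part, a quick sketch using cudaMemGetInfo (the output format is my choice); if the driver starts emergency-allocating from system RAM, free VRAM will stop shrinking even while the process keeps growing:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query free/total VRAM as seen by the CUDA driver on the current device.
    size_t freeB = 0, totalB = 0;
    if (cudaMemGetInfo(&freeB, &totalB) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("VRAM: %zu MiB free of %zu MiB\n", freeB >> 20, totalB >> 20);
    return 0;
}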
Well, mine apparently also drops at ~2560MB by both, but I have surpassed this line all the time and performance doesn't drop, so idk how legit this test really is. EDIT: I see it's using 128MB chunks - could be something with that. What if it used 64MB chunks? I would like to test that as well, but idk how to change that..
Well, mine apparently also drops at ~2560MB by both, but I have surpassed this line all the time and performance doesn't drop, so idk how legit this test really is.
Are you using the Zotac GTX 780 3GB GDDR5? You probably didn't run the bench in headless mode for the GPU. I doubt the GTX 780 has the issue in the first place, though. Try unplugging the HDMI/DVI cable from your GPU and plugging it into your motherboard's IGPU output. Make sure your primary display is using the IGPU, then re-run the bench; your model probably doesn't have any bandwidth drop issue. 🙂
idk how legit this test really is.
That's what Fox and I are discussing. He's got a good theory, if you're able to test it and post back with results. In theory, the test should be able to fill all 4GB of RAM without error. If you start with 1GB pre-allocated, it shouldn't crash; it should just loop around and fill the 1GB that was allocated when the test started. In your case it would only be 3GB, but it would still prove or disprove Fox's theory.
The test as it applies to the 970 is nothing by itself. It's when the 980 runs the same test but retains full bandwidth past the 3.2GB point and doesn't choke like the 970 does. Haven't run too many games since getting my 970, but FC4 is about as tough as it gets, and I run that maxed other than modest AA levels. Don't care at all if the card is gimped past the 3.2GB point; I would probably have bought it anyway even if it was a 3GB card. Interesting to see what ManuelG comes up with when he gets answers.
Well, mine apparently also drops at ~2560MB by both, but I have surpassed this line all the time and performance doesn't drop, so idk how legit this test really is. EDIT: I see it's using 128MB chunks - could be something with that. What if it used 64MB chunks? I would like to test that as well, but idk how to change that..
Here are the chunk size definitions:

int Float4Count = 8 * 1024 * 1024;
int ChunkSize = Float4Count * sizeof(float4);

sizeof(float4) is 16 bytes, so if you halve Float4Count (4 * 1024 * 1024), it will take chunks of 64MB. It will still go on until it has as much memory as it can get. Then you have to change:

int BlockSize = 128;

to 64, as otherwise the benchmark part will try to pump 128MB into each 64MB chunk. This should be all.
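To sanity-check the arithmetic behind those two edits, here is a compile-time sketch (the constant names mirror the snippet above; this is my illustration, not part of the benchmark):

#include <cuda_runtime.h>

// sizeof(float4) = 4 floats x 4 bytes = 16 bytes
static_assert(sizeof(float4) == 16, "float4 is 16 bytes");

// Halved element count -> 64 MiB chunks instead of 128 MiB.
constexpr size_t Float4Count = 4ull * 1024 * 1024;
constexpr size_t ChunkSize = Float4Count * sizeof(float4);
static_assert(ChunkSize == (64ull << 20), "chunk is 64 MiB");

// BlockSize is the per-chunk MiB the benchmark pumps; it must match.
constexpr int BlockSize = 64;
static_assert(BlockSize == (ChunkSize >> 20), "BlockSize must equal chunk MiB");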
It looks like Aero would do this to both the 970 and 980.
I can tell you it's not Aero. The odd thing is that when I turn all textures to Extra in CoD: AW, the game starts to microstutter; VRAM usage hovers around 3400MB-ish. However, when I lower just one of the textures to High, the microstuttering is gone, and the VRAM usage seems to hover around 3100MB. I just plugged in 4GB more RAM so I have 12GB total; I don't think that's the issue. Maybe it's just an app-specific problem. I don't really have other games to test with, nor am I tech-savvy about these problems, but I just want to enjoy stutter-free gaming! Hopefully Nvidia has some answers soon 😛
Are you using the Zotac GTX 780 3GB GDDR5? You probably didn't run the bench in headless mode for the GPU. I doubt the GTX 780 has the issue in the first place, though. Try unplugging the HDMI/DVI cable from your GPU and plugging it into your motherboard's IGPU output. Make sure your primary display is using the IGPU, then re-run the bench; your model probably doesn't have any bandwidth drop issue. 🙂
Yeah, the 780 doesn't have this crashing or performance issue if I hit the max of 3040MB in games (Kombustor or that cube VRAM test), but this bandwidth drop is here nonetheless.. On Win8.1; I can't test on the IGPU port atm. http://i.imgur.com/yZ8tHJn.png
That's what Fox and I are discussing. He's got a good theory, if you're able to test it and post back with results. In theory, the test should be able to fill all 4GB of RAM without error. If you start with 1GB pre-allocated, it shouldn't crash; it should just loop around and fill the 1GB that was allocated when the test started. In your case it would only be 3GB, but it would still prove or disprove Fox's theory.
I think it's also the way it allocates these memory blocks, e.g. 128MiB; if it used 64MiB I'm sure the test would be a little different. Although it's kinda strange why the 980 GTX isn't affected like that.. If there is this SMX/256-bit thing, then this test really has to use 64MiB blocks or it's not really legit. 24 x 128 = 3072; 23 x 128 = 2944 << imo it should reach this, not stop at 22 blocks; I closed FF and still ended at 22 blocks. 47 x 64 = 3008. @Fox2232 interesting, but idk how to input this into the exe 😀
Yeah, the 780 doesn't have this crashing or performance issue if I hit the max of 3040MB in games (Kombustor or that cube VRAM test), but this bandwidth drop is here nonetheless.. On Win8.1; I can't test on the IGPU port atm. http://i.imgur.com/yZ8tHJn.png
Yeah, those last 2 chunks are definitely caused by Windows reserving GPU VRAM for desktop compositing. Don't fret about it. Once you can test with the IGPU as primary display, feel free to report back. 🙂 Your current result is quite okay.
Yeah, those last 2 chunks are definitely caused by Windows reserving GPU VRAM for desktop compositing. Don't fret about it. Once you can test with the IGPU as primary display, feel free to report back. 🙂 Your current result is quite okay.
Ah I see, ok thanks. So isn't this the case with the 970 GTX as well - Windows allocation? It apparently allocates 512MB for it? I get a drop at 2560MB, and 3072MB - 2560MB = 512MB 🤓
Ah I see, ok thanks. So isn't this the case with the 970 GTX as well - Windows allocation? It apparently allocates 512MB for it? I get a drop at 2560MB, and 3072MB - 2560MB = 512MB 🤓
Ideally this bench should be run on the GTX 970/980 in headless mode to eliminate the Windows desktop compositing VRAM allocation as well as other VRAM overhead. I have seen a few users who ran this bench in headless mode. GTX 980 = bandwidth maxed all the way to 4GB. GTX 970 = unfortunately, memory bandwidth dropped by a large margin starting in the 3.2GB-3.5GB range. This is not to say the GTX 970 can't use the full 4GB; it's just that the memory bandwidth after that range is too slow.
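For anyone who wants to reproduce that per-chunk measurement independently, a rough sketch of the timing side (assumed structure; the real test runs its own CUDA kernels rather than cudaMemset):

#include <cstdio>
#include <cuda_runtime.h>

// Time a device-side fill of one chunk with CUDA events; on an affected
// 970, chunks sitting past ~3.2GB report much lower bandwidth.
static float chunkBandwidthGBs(void* chunk, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemset(chunk, 0, bytes);       // device-side write across the chunk
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e6) / ms;         // MB per ms == GB per second
}

int main() {
    const size_t bytes = 128ull << 20; // one 128 MiB chunk
    void* p = nullptr;
    if (cudaMalloc(&p, bytes) != cudaSuccess) return 1;
    printf("%.1f GB/s\n", chunkBandwidthGBs(p, bytes));
    cudaFree(p);
    return 0;
}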
I'll try that. // Initial test: Lords of the Fallen, maxed out (3.5GB VRAM used) - benchmark crashes right away. Second test: BF4 (1.1GB VRAM used) ran alongside the benchmark. Third test: BF4 (1.1GB VRAM used), killed as soon as the benchmark finished allocating its memory.
Quoting myself, because I think people missed the last two tests.
I did miss it. And it is interesting. There are 6 regions which show this bad performance, even though one of the tests had enough free VRAM to accommodate any overhead caused by CUDA. Interestingly enough, in the test where BF4 was running in the background, some bad regions were allocated before there was any need to do so. And all the bad regions had the same performance in both tests: 1 better than average, 3 average, 1 worse than average, and 1 even worse than that. At this point I would say that this CUDA test is not defective and truly points towards some issue. If anyone has an environment for compiling CUDA code, I will modify it to eat exactly 3GB of VRAM. Then the victim should try to start some game in the remaining (presumably bad) memory region, even a small one which needs about 500MB of VRAM. That should prove for sure that there is something very bad.
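For what it's worth, a sketch of that proposed modification as a standalone 3GB VRAM eater (the chunk count and hold-on-keypress behavior are my assumptions):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t chunk = 128ull << 20;       // 128 MiB
    for (int i = 0; i < 24; ++i) {           // 24 x 128 MiB = exactly 3 GiB
        void* p = nullptr;
        if (cudaMalloc(&p, chunk) != cudaSuccess) {
            fprintf(stderr, "failed at chunk %d\n", i);
            return 1;
        }
    }
    // Hold the 3GB so a ~500MB game is forced into the suspect upper region.
    printf("3GB of VRAM held - launch the game now, press Enter to release\n");
    getchar();
    return 0;
}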