AMD EPYC 7002 Server Processors Reportedly Harbour a Bug Impacting Core Stability After 1044 Days of Uptime

I'm not sure how I feel about this. On one hand, it's highly unacceptable in the sense that this is a server CPU, where reliability and uptime really matter, regardless of needing to reboot for the sake of security. Even worse is that AMD has no intention of fixing it, since from what I can tell, it's a firmware bug. On the other hand, I don't really understand in what situation a server would ever need C6, to the point that I'm kind of confused why it has that to begin with*. I assume that it's possible for a whole core cluster to basically be put to sleep, but at that point: why bother with such a high-end CPU if you're not even going to use it? * This sleep state might have been intended for desktop/laptop parts and "trickled up" to server parts, simply because "why not?", but that to me suggests Ryzen parts will also have this bug. However, I find it even more unlikely for a laptop or desktop to stay on for that long. Though I have to say: who managed to pull off nearly 3 years of uptime? Even with backup power, that's uncommon.
That's an oddly specific time. That's a little less than 3 years (51 days short). So, if the server never rebooted for any reason (not even server updates) for 1044 days, it's either a "holy crap" level of stability or there's a lack of maintenance. I know they are meant for stability, but even top-tier reliability servers never advertise 100% uptime. It has always been a max of 99.999999999% (the number of 9s matters) per year. So, in the few minutes the server is down, it would reset this bug. Also, as long as server admins that have these chips know about this bug, they could just schedule a system update around that time (such as an OS update) and have a planned reboot. And voilà, the timer is reset to 0. Also, CC6... if your servers use this and it has not been disabled, you fail as a server admin imo. That's just a recipe for downtime. Also, AMD, FFS. That seems so trivial to fix for your engineers.
schmidtbag:

but that to me suggests Ryzen parts will also have this bug. However, I find it even more unlikely for a laptop or desktop to stay on for that long.
Even for a server chip, it seems unlikely not to reboot at least once in 3 years.
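For anyone who wants to check the numbers in this post, here is a minimal Python sketch of the arithmetic: how far 1044 days falls short of 3 years, how much downtime per year a given availability figure allows, and how many days a machine has left before it reaches the 1044-day mark. The function names and the use of `/proc/uptime` (Linux only) are my own assumptions for illustration, not anything AMD has published.

```python
SECONDS_PER_DAY = 86_400
BUG_THRESHOLD_DAYS = 1044  # uptime figure from the article title


def days_short_of(years: int, days: int = BUG_THRESHOLD_DAYS) -> int:
    """How many days `days` falls short of `years` years of 365 days each."""
    return years * 365 - days


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Downtime allowed per 365-day year at the given availability percentage."""
    return (1 - availability_percent / 100) * 365 * SECONDS_PER_DAY


def days_until_threshold(uptime_seconds: float,
                         threshold_days: int = BUG_THRESHOLD_DAYS) -> float:
    """Days remaining before uptime reaches the threshold (negative if past it)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the current uptime in seconds (Linux-specific; assumed path)."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    print("days short of 3 years:", days_short_of(3))
    print("downtime at 99.999%:", downtime_seconds_per_year(99.999), "s/year")
```

Worth noting: "a few minutes" of downtime per year matches five nines (99.999%, roughly 5.3 minutes), while eleven nines would allow well under a second. Something like `days_until_threshold(read_uptime_seconds())` could feed a monitoring alert so the planned reboot happens well before the limit.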
schmidtbag:

On one hand, it's highly unacceptable in the sense that this is a server CPU, where reliability and uptime really matter, regardless of needing to reboot for the sake of security. Even worse is that AMD has no intention of fixing it, since from what I can tell, it's a firmware bug.
Airgapped servers don't need security updates as often, so this means AMD is no longer an option for such deployments.
schmidtbag:

On the other hand, I don't really understand in what situation a server would ever need C6, to the point that I'm kind of confused why it has that to begin with*. I assume that it's possible for a whole core cluster to basically be put to sleep, but at that point: why bother with such a high-end CPU if you're not even going to use it?
You are no longer allowed to run any sort of server farm without low power states turned on in the EU.
Lebon30:

Also, CC6... if your servers use this and it has not been disabled, you fail as a server admin imo. That's just a recipe for downtime.
Spoken as someone that doesn't know crap about server deployments.
Since disabling C6 mitigates the problem, it's pretty much a nothing burger. Worst case, you're using a few more watts at idle; it's unlikely to affect boost much, since EPYC chips don't boost that high to begin with.
The hard part is testing for 3 years before releasing hardware. Even a server farm should get kernel patches and a reboot within 3 years. The C6 state is important for servers to save power. Power is the main cost for servers, especially since last year.
It's the millennium bug and it's found a new home. 🙄
Astyanax:

You are no longer allowed to run any sort of server farm without low power states turned on in the EU.
I can see why some low power states would be mandatory, but is C6? Doesn't really make sense to me why it would be.
user1:

Since disabling C6 mitigates the problem, it's pretty much a nothing burger. Worst case, you're using a few more watts at idle; it's unlikely to affect boost much, since EPYC chips don't boost that high to begin with.
Unless I'm mistaken, C6 only affects sleep, and I don't get why a server would be put to sleep.
I would say that is the definition of good enough. A server would only need to restart once, maybe twice, in its normal lifespan before it would most likely be retired. It is not like Intel would have done better; I have needed to apply security BIOS updates to my Intel machine 5 times in 10 years anyway.
schmidtbag:

I can see why some low power states would be mandatory, but is C6? Doesn't really make sense to me why it would be. Unless I'm mistaken, C6 only affects sleep, and I don't get why a server would be put to sleep.
You are mistaken: cores can be put into C6 to be "turned off" when they are not being used.
With each core now in its own power island with its own LDO, each core can enter sleep states independently. In this case, AMD’s CC6 state powers off most of the core but keeps the L3 cache active in case another CPU uses it – it only takes 100 microseconds to enter/exit this CC6 state. When all the cores are in CC6, the regulators can also disable the L3 cache altogether for a CPUOFF state, giving better power reductions but now the entry/exit latency is around 1.5ms.
https://www.anandtech.com/show/11964/ryzen-mobile-is-launched-amd-apus-for-laptops-with-vega-and-updated-zen/4
https://en.wikichip.org/wiki/acpi/c-states
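To see which per-core idle states a Linux box actually exposes, and the exit latency the kernel reports for each, something like the following rough sketch can help. It reads the kernel's cpuidle sysfs directory; the helper name `list_idle_states` is made up for this example, and how the listed states map to AMD's CC6 depends on the platform firmware and idle driver, so treat the output as a starting point and not as proof that C6 is on or off.

```python
from pathlib import Path


def list_idle_states(cpu: int = 0,
                     sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Return the idle states the Linux cpuidle framework exposes for one CPU.

    Each entry holds the state index, its name, the exit latency in
    microseconds, and whether the state has been disabled via sysfs.
    Returns an empty list if the cpuidle directory is not present.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"

    def read(state_dir: Path, name: str, default: str) -> str:
        f = state_dir / name
        return f.read_text().strip() if f.exists() else default

    states = []
    # Directories are named state0, state1, ...; sort numerically by index.
    for state_dir in sorted(base.glob("state[0-9]*"),
                            key=lambda p: int(p.name[len("state"):])):
        states.append({
            "index": int(state_dir.name[len("state"):]),
            "name": read(state_dir, "name", ""),
            "latency_us": int(read(state_dir, "latency", "0")),
            "disabled": read(state_dir, "disable", "0") != "0",
        })
    return states


if __name__ == "__main__":
    for state in list_idle_states():
        print(state)
```

The `sysfs_root` parameter only exists so the function can be pointed at a fake directory tree for testing; on a real machine the default path is the one to use.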