AMD EPYC 7002 Server Processors Reportedly Harbour a Bug Impacting Core Stability After 1044 Days of Uptime
schmidtbag
I'm not sure how I feel about this.
On one hand, it's highly unacceptable in the sense that this is a server CPU, where reliability and uptime really matter, regardless of needing to reboot for the sake of security. Even worse is that AMD has no intention of fixing it, since from what I can tell, it's a firmware bug.
On the other hand, I don't really understand in what situation a server would ever need C6, to the point that I'm kind of confused why it has that to begin with*. I assume that it's possible for a whole core cluster to basically be put to sleep, but at that point: why bother with such a high-end CPU if you're not even going to use it?
* This sleep state might have been intended for desktop/laptop parts and "trickled up" to server parts, simply because "why not?", but that to me suggests Ryzen parts will also have this bug. However, I find it even more unlikely for a laptop or desktop to stay on for that long.
Though I have to say: who managed to pull off nearly 3 years of uptime? Even with backup power, that's uncommon.
Lebon30
That's an oddly specific time. That's a little less than 3 years (51 days short).
So, if the server never rebooted for any reason (not even server updates) for 1044 days, it's either a "holy crap" level of stability or a lack of maintenance.
I know they are meant for stability, but top-tier reliability servers never advertise 100% uptime. It has always been a max of 99.999999999% (the number of 9s matters) per year. So, in the few minutes the server is down, it would reset this bug.
Also, as long as server admins that have these chips know about this bug, they could just schedule a system update around that time (such as an OS update) and have a planned reboot. And voilà, the timer is reset to 0.
Also, CC6... if your servers use this and it has not been disabled, you fail as a server admin imo. That's just a recipe for downtime.
Also, AMD, FFS. That seems so trivial to fix for your engineers.
Even for a server chip, going 3 years without a single reboot seems unlikely as well.
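For anyone who wants to plan that reboot instead of guessing, here is a rough sketch of the check. It assumes a Linux box, where the first field of /proc/uptime is seconds since boot; the 1044-day figure is the one from the article, and the function names and the 30-day warning margin are made up for illustration.

```python
from pathlib import Path

BUG_THRESHOLD_DAYS = 1044   # uptime figure reported in the article
SECONDS_PER_DAY = 86_400


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read uptime on Linux; the first field of /proc/uptime is seconds since boot."""
    return float(Path(path).read_text().split()[0])


def days_until_threshold(uptime_seconds: float,
                         threshold_days: int = BUG_THRESHOLD_DAYS) -> float:
    """Days left before uptime reaches the threshold (negative if already past it)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot_soon(uptime_seconds: float, margin_days: int = 30) -> bool:
    """True once uptime is within margin_days (a hypothetical margin) of the threshold."""
    return days_until_threshold(uptime_seconds) <= margin_days
```

On an affected machine you would call `needs_reboot_soon(read_uptime_seconds())` from a cron job or monitoring check and book the maintenance window when it turns true.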
Astyanax
user1
Since disabling C6 mitigates the problem, it's pretty much a nothing burger. Worst case, you're using a few more watts at idle, and it's unlikely to affect boost much, since EPYC chips don't boost that high to begin with.
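If you want to see what idle states a box currently exposes, something like this sketch should do on Linux. It assumes the standard cpuidle sysfs layout (`cpuN/cpuidle/stateM/name` and `disable`); which entry corresponds to C6 depends on the platform and idle driver, so the code does not guess, and `list_idle_states` is just an illustrative name.

```python
from pathlib import Path


def list_idle_states(cpu: int = 0, sysfs_root: str = "/sys/devices/system/cpu"):
    """Return [(state_dir, name, disabled)] for one CPU from Linux cpuidle sysfs."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so state10 does not land before state2.
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() == "1"
        states.append((state.name, name, disabled))
    return states
```

Calling `list_idle_states(0)` gives the states for the first core, or an empty list if the cpuidle directory is missing. Actually turning a state off is normally done in the BIOS, or as root by writing 1 to that state's `disable` file.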
Alessio1989
The hard part is testing for 3 years before releasing hardware. Even a server farm should get kernel patches and a reboot within 3 years.
The C6 state is important for servers to save power. Power is the main cost for servers, especially since last year.
vestibule
It's the millennium bug and it's found a new home. 🙄
schmidtbag
TLD LARS
I would say that is the definition of good enough.
A server would only need to restart once, maybe twice, in its normal lifespan before it would most likely be retired.
It is not like Intel would have done better; I have needed to do a security BIOS update on my Intel machine 5 times in 10 years anyway.
user1
https://www.anandtech.com/show/11964/ryzen-mobile-is-launched-amd-apus-for-laptops-with-vega-and-updated-zen/4
https://en.wikichip.org/wiki/acpi/c-states
You are mistaken: cores can be put into C6, to be "turned off" when they are not being used.