AMD EPYC 7002 Server Processors Reportedly Harbour a Bug Impacting Core Stability After 1044 Days of Uptime

AMD's EPYC 7002 server processors have a bug that can cause a compute core to hang after roughly 1044 days of uninterrupted operation.

According to AMD, rebooting the server before that point avoids the problem, and the company currently has no plans to issue a fix. Tom's Hardware spotted the bug in AMD's EPYC 7002 server processor revision guide, issued in April. The guide indicates that a compute core in the EPYC 7002 can hang because it fails to wake from the CC6 sleep state.

The exact time until the bug manifests depends on several factors, including the reference clock used by the processor. To sidestep the bug, users can either disable the CC6 sleep state or reboot the server before it reaches roughly 1044 days of uptime.

The exact figure has prompted speculation online. As Tom's Hardware points out, bugs in processors are fairly common, but the one disclosed in the latest revision guide for the EPYC 7002 "Rome" server chips stands out: a core hangs after about 1044 days of uptime because it cannot exit the CC6 sleep state. The timing varies with factors such as spread spectrum and the REFCLK frequency, the reference clock used for timekeeping. A Reddit user, acid_migrain, has suggested that the issue may actually appear at around 1042 days and 12 hours, based on the TSC ticking at 2800 MHz.
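
The relationship between the two figures can be worked through with simple arithmetic. The following is a minimal sketch, assuming a constant TSC rate of 2800 MHz as in the Reddit estimate; the function names are hypothetical, the 0.5% slower rate is purely illustrative, and the tick counts are just uptime converted to ticks, not thresholds documented by AMD.

```python
SECONDS_PER_HOUR = 3_600
HOURS_PER_DAY = 24

# Assumption: constant TSC rate of 2800 MHz, the figure quoted in the Reddit estimate.
TSC_HZ = 2_800_000_000


def uptime_to_ticks(days: int, hours: int = 0, tsc_hz: int = TSC_HZ) -> int:
    """Convert an uptime into the number of TSC ticks elapsed at tsc_hz."""
    total_hours = days * HOURS_PER_DAY + hours
    return total_hours * SECONDS_PER_HOUR * tsc_hz


def ticks_to_days(ticks: int, tsc_hz: int = TSC_HZ) -> float:
    """Convert a TSC tick count back into days of uptime at tsc_hz."""
    return ticks / tsc_hz / (SECONDS_PER_HOUR * HOURS_PER_DAY)


reddit_ticks = uptime_to_ticks(1042, 12)  # Reddit estimate: 1042 days, 12 hours
amd_ticks = uptime_to_ticks(1044)         # figure from AMD's revision guide

print(f"1042 d 12 h -> {reddit_ticks:#x} ticks")
print(f"1044 d      -> {amd_ticks:#x} ticks")
print(f"gap: {ticks_to_days(amd_ticks - reddit_ticks):.1f} days")

# Hypothetical: the same tick count takes longer to reach at a slightly lower rate.
slower_hz = TSC_HZ * 995 // 1000  # 0.5% lower, an illustrative value only
print(f"at {slower_hz / 1e6:.0f} MHz: {ticks_to_days(reddit_ticks, slower_hz):.1f} days")
```

The last line illustrates why the timing is not a single fixed number: if the counter advances a little more slowly, the same tick count is reached later.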

The workaround is simple: reboot the server before it reaches 1044 days of uptime to reset the CPU's "timer," or disable the CC6 sleep state. Although the bug raises eyebrows, it is unlikely to affect most users, who should be taking servers down for security updates and maintenance far more often than that.

That said, the bug could affect users who rely on Linux live patching or kexec to update without rebooting, since both can lead to uptimes long enough to trigger it. Servers running critical applications that are rarely restarted could be affected as well.
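
For machines that do stay up this long, the reboot-before-the-deadline workaround can be monitored with a small check. This is a minimal sketch for Linux, assuming the first field of /proc/uptime reflects time since the last hardware reset; the 30-day safety margin and all function names are hypothetical. After a kexec-based kernel switch the kernel's uptime counter may restart without the hardware being reset, so in that case the result should be treated as optimistic.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
BUG_UPTIME_DAYS = 1044   # figure from AMD's revision guide, as reported in the article
SAFETY_MARGIN_DAYS = 30  # hypothetical margin; choose one that fits your maintenance windows


def parse_uptime_seconds(proc_uptime_text: str) -> float:
    """Return the first field of /proc/uptime (seconds since boot)."""
    return float(proc_uptime_text.split()[0])


def days_until_deadline(uptime_seconds: float,
                        deadline_days: int = BUG_UPTIME_DAYS,
                        margin_days: int = SAFETY_MARGIN_DAYS) -> float:
    """Days left until uptime reaches deadline_days minus margin_days (negative if past)."""
    return (deadline_days - margin_days) - uptime_seconds / SECONDS_PER_DAY


def main() -> None:
    uptime_path = Path("/proc/uptime")
    if not uptime_path.exists():
        print("/proc/uptime not found; this sketch only works on Linux")
        return
    remaining = days_until_deadline(parse_uptime_seconds(uptime_path.read_text()))
    if remaining <= 0:
        print(f"Reboot overdue by {-remaining:.1f} days")
    else:
        print(f"{remaining:.1f} days left before the reboot deadline")


if __name__ == "__main__":
    main()
```

Run from a scheduled job, a check like this could feed an existing alerting system well before the 1044-day mark.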
