HP Enterprise SSD users, please check and update firmware (before a kill-switch kicks in)

DG21

2020-02-25 12:59

1. Afaik, those SSDs have a three-year warranty, and as long as you don't write an SSD to death, it lasts quite long (my Crucial C300/256GB, bought 12/2009, is still in top condition). 2. They do not mention why it's behaving like this (!!!). They could just say what kind of mistake they made, and you could at least understand why (although it remains annoying for those who own the drive). 3. "By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors." Just combine those three with the fact that since the end of the 50s/beginning of the 60s some companies started to make use of planned obsolescence, because it helps to make more money. If there's a firmware that fixes it, then ok. But for me it initially smelled at least like that, which doesn't mean it must be like that. Oh, and as a physicist I can assure you that the Earth is a globe and not a disc. 🙂

#5763319

schmidtbag

2020-02-25 15:43

Aura89:

If it was planned, then why fix it? Why fix it in advance? Why do anything about it? It's not like if they had planned it, they wouldn't have known that at some point people would get upset, so what is planned about it? Nothing. Calling this "Planned obsolescence" is simply throwing conspiracy into the wind.

Because they're [going to be] under a lot of backlash over artificially crippling perfectly good hardware after a not-such-a-long-time duration. I get the impression you haven't worked with HP server hardware before but they pull this kind of crap all the time. If you don't feel like calling it planned obsolescence, it is/was still an attempt to grab more money from customers.

It's an issue that causes a product to brick itself that has been fixed and announced by the manufacturers so that way it doesn't happen, simple as that, to state anything else is nonsense, unneeded conspiracy theory.

The killswitch never needed to exist in the first place.

Planned Obsolescence: a policy of producing consumer goods that rapidly become obsolete and so require replacing, achieved by frequent changes in design, termination of the supply of spare parts, and the use of nondurable materials.

Yes: and if HP programmed in a time-based killswitch, that is literally a product becoming obsolete by [their] policy. Therefore, it is planned obsolescence. It's planned because it was deliberately programmed in there, and once the time runs out, the product must be replaced. SSDs don't just randomly fail because a certain amount of time has passed. Why else would they have this killswitch if it wasn't meant to force users to buy new drives? The reason this firmware update exists is because 3.7 years is way too short of an estimated lifespan for an SSD. HP's engineers screwed up.

#5763335

Astyanax

2020-02-25 16:36

It's only a good kill switch if you can get the data off it to put it on a new drive.

#5763339

DG21

2020-02-25 16:41

@schmidtbag Thanx for the insight! 🙂

#5763347

Denial

2020-02-25 16:51

schmidtbag:

Because they're [going to be] under a lot of backlash over artificially crippling perfectly good hardware after a not-such-a-long-time duration. I get the impression you haven't worked with HP server hardware before but they pull this kind of crap all the time. If you don't feel like calling it planned obsolescence, it is/was still an attempt to grab more money from customers. The killswitch never needed to exist in the first place. Yes: and if HP programmed in a time-based killswitch, that is literally a product becoming obsolete by [their] policy. Therefore, it is planned obsolescence. It's planned because it was deliberately programmed in there, and once the time runs out, the product must be replaced. SSDs don't just randomly fail because a certain amount of time has passed. Why else would they have this killswitch if it wasn't meant to force users to buy new drives? The reason this firmware update exists is because 3.7 years is way too short of an estimated lifespan for an SSD. HP's engineers screwed up.

I don't really understand this part. How do you know HP programmed in a time-based kill switch? Nearly all, if not all Enterprise drives have fail in place technology built into them - if they detect a cell failure or corruption they automatically brick the entire drive (typically after the next power cycle) - Samsung, Intel, Kingston - they all do this. Sounds more like something in the firmware is simply triggering this to occur after a set time than an intentionally developed killswitch.

#5763351

schmidtbag

2020-02-25 17:03

Denial:

I don't really understand this part. How do you know HP programmed in a time-based kill switch? Nearly all, if not all Enterprise drives have fail in place technology built into them - if they detect a cell failure or corruption they automatically brick the entire drive (typically after the next power cycle) - Samsung, Intel, Kingston - they all do this. Sounds more like something in the firmware is simply triggering this to occur after a set time than an intentionally developed killswitch.

There's an enormous difference between deliberately failing because the drive has a defect and needs to be decommissioned vs deliberately failing just because X amount of time has gone by. The article itself says the drive fails after 3.7 years [regardless of its health]. It is normal for drives to count how many hours they've been on in SMART but it isn't normal for the drives to outright stop working just because a timer was reset, and it's also not normal for the counter to reset after such a short amount of time.

#5763354

Denial

2020-02-25 17:15

schmidtbag:

There's an enormous difference between deliberately failing because the drive has a defect and needs to be decommissioned vs deliberately failing just because X amount of time has gone by. The article itself says the drive fails after 3.7 years, regardless of its health. It is normal for drives to count how many hours they've been on in SMART but it isn't normal for the drives to outright stop working just because a timer was reset, and it's also not normal for the counter to reset after such a short amount of time.

But there isn't that much of a difference if the x amount of time goes by is triggering a failure, right? Like I'm more than confident they use the drive on time for all kinds of functions within the controller - if a single one of those times the number was accidentally stored as an integer and it happens to send the drive into some kind of error state, even for a moment, it's triggering the system to brick the drive. What's more likely, that out of the dozens if not hundreds of times that number is used in the firmware, someone simply forgot to store it correctly and it's causing this problem, or that HP intentionally programmed a kill switch but accidentally set it to specifically the integer number? Do you really think HP sat there and said "i bet no one will notice all their SSDs failing at the exact same time?" "won't look fishy at all?" "oops we accidentally set the if statement to 32,768!" I doubt it. It's clearly just a drive error: the time-on value, stored in an integer, overflows, which triggers an error state and thus bricks the drive by design.
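To make the overflow argument concrete, here is a minimal sketch. It assumes the counter is kept in hours in a signed 16-bit integer, which is what the 3.7-year figure and the number 32,768 suggest; the helper name `wrap_int16` is made up for illustration and nothing here is taken from the actual firmware.

```python
# Hypothetical illustration: a power-on-hours counter kept in a signed
# 16-bit integer. Not taken from any real firmware.

HOURS_PER_YEAR = 24 * 365.25
INT16_MAX = 0x7FFF  # 32,767, the largest value a signed 16-bit integer holds


def wrap_int16(value: int) -> int:
    """Wrap a value into the signed 16-bit range (two's complement)."""
    return (value + 0x8000) % 0x10000 - 0x8000


# How long until the counter needs the value 32,768?
years_to_overflow = (INT16_MAX + 1) / HOURS_PER_YEAR
print(f"{years_to_overflow:.2f} years")

# The last representable hour, and what the next hour turns into.
print(wrap_int16(INT16_MAX))
print(wrap_int16(INT16_MAX + 1))
```

Under these assumptions the counter runs out after roughly 3.7 years of power-on time, and the next increment flips it to a large negative number, which is the kind of value other firmware routines would likely treat as invalid.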

#5763359

schmidtbag

2020-02-25 17:38

Denial:

But there isn't that much of a difference if the x amount of time goes by is triggering a failure, right? Like I'm more than confident they use the drive on time for all kinds of functions within the controller - if a single one of those times the number was accidentally stored as an integer and it happens to send the drive into some kind of error state, even for a moment, it's triggering the system to brick the drive.

Right... but not after 3.7 years. Don't get me wrong, I don't think there's a major problem in bricking a known erroneous drive. But it is plain stupid to just assume a drive is dead without any evidence. SSDs don't break down that quickly just because they're on. If the drive is perfectly functional and has enough life left, why brick it? If administrators have their own rule of "replace all drives after X amount of time", fine, but let them determine that. Don't force people to the same policy.

What's more likely, that out of the dozens if not hundreds of times that number is used in the firmware, someone simply forgot to store it correctly and it's causing this problem, or that HP intentionally programmed a kill switch but accidentally set it to specifically the integer number?

I think it's possible it was accidentally stored as a signed integer instead of unsigned. But still... a timed killswitch never needed to exist in the first place.
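A rough sketch of the signed-versus-unsigned idea, assuming a 16-bit hour counter (an assumption, not something HP has confirmed): the same two bytes read very differently depending on the declared type. The function name `as_signed16` is hypothetical.

```python
import struct

HOURS_PER_YEAR = 24 * 365.25
UINT16_MAX = 0xFFFF  # 65,535


def as_signed16(raw: int) -> int:
    """Reinterpret the bits of an unsigned 16-bit value as a signed one."""
    return struct.unpack("<h", struct.pack("<H", raw))[0]


raw_hours = 32768  # bit pattern 0x8000

# Read as unsigned, the counter is just a normal hour count.
print(raw_hours)

# Read as signed, the very same bits are negative.
print(as_signed16(raw_hours))

# An unsigned 16-bit counter would only run out about twice as late.
print(f"{(UINT16_MAX + 1) / HOURS_PER_YEAR:.2f} years")
```

Note that an unsigned 16-bit counter would merely postpone the wrap to about 7.5 years, so a wider type would be the more durable fix under this assumption.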

Do you really think HP sat there and said "i bet no one will notice all their SSDs failing at the exact same time?" "won't look fishy at all?" "oops we accidentally set the if statement to 32,768!" I doubt it.

Honestly... I don't think they're above that, considering the fact they programmed in a killswitch in the first place. They don't try to hide how much they screw you over with other drive-related issues, printer ink, RAM, or other replacement parts.

#5763361

Denial

2020-02-25 17:47

schmidtbag:

Right... but not after 3.7 years. Don't get me wrong, I don't think there's a major problem in bricking a known erroneous drive. But it is plain stupid to just assume a drive is dead without any evidence. SSDs don't break down that quickly just because they're on. If the drive is perfectly functional and has enough life left, why brick it? If administrators have their own rule of "replace all drives after X amount of time", fine, but let them determine that. Don't force people to the same policy. I think it's possible it was accidentally stored as a signed integer instead of unsigned. But still... a timed killswitch never needed to exist in the first place. Honestly... I don't think they're above that, considering the fact they programmed in a killswitch in the first place. They don't try to hide how much they screw you over with other drive-related issues, printer ink, RAM, or other replacement parts.

Again you're just assuming it's a timed killswitch and not an issue with the firmware sending the drive into an error state based on the "drive on time". For example - let's just say they have a system in the firmware that triggers a trim after exactly two years. Let's say for whatever reason that trim command is broken and sends the drive into an error state, even momentarily. On exactly two years the drives are bricked. No kill switch required. How are we sure something similar isn't happening here? All kinds of functions inside the controller occur based on the drive's time on. If just one of these is broken and sends the drive into an error state, the drive is bricked and they'll all brick at the same time due to that error. You're saying they programmed a killswitch. I'm saying they programmed a controller that fails the drive based on error and there is some bug causing the drive to error after a set "x" amount of time (in this case 32,768 hours because obviously somewhere the drive time is stored incorrectly). That's not a killswitch, it's just a bug causing the drive to error based on time on.
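The scenario can be sketched as a toy model. Everything in it is hypothetical (the class `ToyDrive`, the sanity check, the signed 16-bit hour counter); it only shows how a fail-in-place policy plus one badly typed counter bricks every drive at the same power-on hour without any dedicated kill switch.

```python
# Toy model, not real firmware: a drive that fails in place whenever an
# internal sanity check sees an invalid value.


class ToyDrive:
    def __init__(self) -> None:
        self.raw_hours = 0    # true power-on hours
        self.bricked = False  # fail-in-place latch, never cleared

    def _stored_hours(self) -> int:
        """Hours as seen through a (mistakenly) signed 16-bit field."""
        return (self.raw_hours + 0x8000) % 0x10000 - 0x8000

    def tick_hour(self) -> None:
        """Advance one power-on hour and run housekeeping."""
        if self.bricked:
            return
        self.raw_hours += 1
        # Housekeeping treats a negative uptime as corruption, and the
        # fail-in-place policy reacts by locking the whole drive.
        if self._stored_hours() < 0:
            self.bricked = True


def hours_until_brick(limit: int = 100_000):
    """Return the power-on hour at which the drive bricks, or None."""
    drive = ToyDrive()
    for hour in range(1, limit + 1):
        drive.tick_hour()
        if drive.bricked:
            return hour
    return None


print(hours_until_brick())
```

In this model every drive with the same firmware locks up at the same hour count regardless of its health, which matches the symptom described in the bulletin while still being an ordinary bug.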

#5763365

Extraordinary

2020-02-25 17:57

No way to reset the count by editing firmware or whatever stores it?

#5763366

schmidtbag

2020-02-25 17:57

Denial:

You're saying they programmed a killswitch. I'm saying they programmed a controller that fails the drive based on error and there is some bug causing the drive to error after a set "x" amount of time (in this case 32,768 hours because obviously somewhere the drive time is stored incorrectly). That's not a killswitch, it's just a bug causing the drive to error based on time on.

Aaah ok, I get you now. That does sound more reasonable.

#5763367

Denial

2020-02-25 17:58

Extraordinary:

No way to reset the count by editing firmware or whatever stores it?

Just seems like it would be easier to flash the new firmware and avoid it all together, no?

#5763473

TieSKey

2020-02-26 00:39

Denial:

Again you're just assuming it's a timed killswitch and not an issue with the firmware sending the drive into an error state based on the "drive on time". For example - let's just say they have a system in the firmware that triggers a trim after exactly two years. Let's say for whatever reason that trim command is broken and sends the drive into an error state, even momentarily. On exactly two years the drives are bricked. No kill switch required. How are we sure something similar isn't happening here? All kinds of functions inside the controller occur based on the drive's time on. If just one of these is broken and sends the drive into an error state, the drive is bricked and they'll all brick at the same time due to that error. You're saying they programmed a killswitch. I'm saying they programmed a controller that fails the drive based on error and there is some bug causing the drive to error after a set "x" amount of time (in this case 32,768 hours because obviously somewhere the drive time is stored incorrectly). That's not a killswitch, it's just a bug causing the drive to error based on time on.

While perfectly possible, if that's the case then I wouldn't touch HP hardware for an enterprise with a 10km pole... Software, especially something as important as a storage drive's firmware, has to be robust to errors. If any arbitrary error in any part of the firmware (such as in your theory) can trigger a full lockdown of the drive, they are as brittle as sugar glass... And even if this were the case they could be honest and say "yeah, a wrong variable type in one counter makes X process fail and that triggers Y procedure which calls the security measure to fail the disk, we are truly sorry, this is the fix, we will take (limited) responsibility for any lost disk". But no... HP is not saying anything specific and includes an "if u don't read this warning in our bulletin it's your fault" (Ofc at the end of the day, decision makers, both private and public "servants", don't care much about these things and sign exclusivity contracts after a hard-earned HP-sponsored vacation xD)

#5763511

HeavyHemi

2020-02-26 05:58

schmidtbag:

Aaah ok, I get you now. That does sound more reasonable.

This is precisely what it is. Simply a bug in the firmware that is triggered after the drive reaches a certain amount of time.