Complete SSD failure: Dell and HPE release firmware against 40K hour bug

cryohellinc:

I'm too tired after work to even delve into this, but what the hell are you smoking? Instead, let me send you this, so that "even the simplest person could understand". Good starting point for future research into the topic.
So it's planned obsolescence, but only with a few of their drives, at a very specific amount of time, and they came out on their own, admitted they found the issue and released a patch that instantly fixes it. Seems like a weird way to obsolete a subset of products. Seems more like it might just be a bug.
cryohellinc:

Hmmm, don't have proof - my argument is wrong. You don't have proof either - your argument is right, and you don't need to prove anything. That sounds foolish to me. Counter [spoiler]https://static.tvtropes.org/pmwiki/pub/images/weirdalfoil_2322.jpg[/spoiler] Sure, 100% is too obvious; however, there are countless examples from various industries. The concept of "designed to fail", where equipment breaks after a certain number of cycles, is spread across our world and isn't something new. Manufacturers would always try to implement it. It's a risk that is hard to prove (depending on the product, of course) and many are willing to take it.
LOL wut? https://upload.wikimedia.org/wikipedia/en/8/81/TheWasteMakers.jpg
Ne1l:

If it's nonsensical, how come this isn't the first time we've seen drives dying after a pre-specified number of hours? SSDs 'usually' have 'over-provisioning' chips which get used as issues arise, and bad sectors are listed in the 'defective-sector table'; once the spare capacity is used up the drives 'should' fail into read-only mode, not die totally! There is absolutely no reason to set a die date by hours? That's nonsensical! << unless it can help generate a future sale 🙂 40,000 hours is just short of the usual 5-year extended life support contracts; extending support afterwards costs so much that buying new hardware is more cost & tax efficient. Most corporations decommission their hardware before 5 years, as there are no 'depreciation of assets' tax benefits left, and the hardware ends up on eBay and is therefore someone else's issue. OEMs should be forced to show how many refunds and discounts (against future orders, obviously) they give to customers as hush money for failing to deliver the quality the customer paid over the odds for.
You should learn how to counter a fact-based logical argument, mine, instead of just posting more noise. I made precise logical points.
cryohellinc:

I'm too tired after work to even delve into this, but what the hell are you smoking? Instead, let me send you this, so that "even the simplest person could understand". Good starting point for future research into the topic.
Advising someone to read about a concept is not evidence your concept applies. Simple logic tells us that NOBODY would INTENTIONALLY program a 100% failure mode in a product that is A) easily discovered B) would fail INTENTIONALLY orders of magnitude under its predicted MTBF. I believe most with an IQ over 80 have an idea of the concept of 'planned obsolescence'. I know most with an IQ over 80, know it does not apply here. You're 'special'.
There are a lot of OEM defenders here 🙂 Simple question: why would an SSD FW ever need to contain an expiry date measured in hours? Even if SanDisk accidentally left it in there for testing to see what would happen, why 40k hours and not just set it for 10 hours or 1? Very fi$hy indeed... and considering each OEM issues their own FW, totally inexcusable.
HeavyHemi:

You should learn how to counter a fact-based logical argument, mine, instead of just posting more noise. I made precise logical points.
Your argument is based on them not fixing it and getting caught... my argument explained why they needed to issue a new FW - if the hardware died at 5 years and 1 month they wouldn't get caught and the original hardware owner would still be happy, so sales wouldn't be affected and a new FW might not have been released. I was just pointing out that, from what I saw from the inside, your 'give them the benefit of the doubt' is sweet but somewhat naive ;-)
Ne1l:

There are a lot of OEM defenders here 🙂 Simple question: why would an SSD FW ever need to contain an expiry date measured in hours? Even if SanDisk accidentally left it in there for testing to see what would happen, why 40k hours and not just set it for 10 hours or 1? Very fi$hy indeed... and considering each OEM issues their own FW, totally inexcusable.
It does not, that is why this is a bug. You just solved the case with your own argument, Sherlock. 😉
Ne1l:

Your argument is based on them not fixing it and getting caught... my argument explained why they needed to issue a new FW - if the hardware died at 5 years and 1 month they wouldn't get caught and the original hardware owner would still be happy, so sales wouldn't be affected and a new FW might not have been released. I was just pointing out that, from what I saw from the inside, your 'give them the benefit of the doubt' is sweet but naive ;-)
What are you babbling about? This is not about defending anyone. It's simply about facts. I am not making an argument, and certainly not the bizarre argument you claim I am making; I am merely stating what is and using simple logic. You're the one making silly arguments and claims, trying to make things up to defend your silly claims. Then you completely fail by trying to argue emotion. I do not believe you have any experience in the semi industry or worked in it in any capacity that mattered, nor your invented story that this is a COMMON PRACTICE YOU SAW ALL THE TIME "From the inside". Bullshit. Your arguments and your defense of them are inept and dumb. And there you are.
Denial:

Seems like a weird way to obsolete a subset of products. Seems more like it might just be a bug.
The thing is, it's ridiculously specific and also odd, to be a bug. Someone had to write something very specific for that to happen.
Moderator
Sooooo a few of us should take a break from posting in here... And calm down a little too.
Neo Cyrus:

The thing is, it's ridiculously specific and also odd, to be a bug. Someone had to write something very specific for that to happen.
I don't get why that has to be the case. Various parts of the system firmware use the drive time. For example, garbage collection and various other algorithms to optimize the drive all occur at various times of the drive's life. Most enterprise drives are designed to fail the entire drive if they detect an issue - so any issue involving these algorithms and the drive time could cause the drive to fail.
Ne1l:

This 'bug' appeared last year too.. when is a bug a feature? https://blocksandfiles.com/2019/11/25/hpe-issues-firmware-fix-to-to-stop-ssd-failure/
I mean yeah if you bothered to read the article it mentions it.
ugh, it's just an oversight, probably through an intern submitting unchecked code.
"I mean yeah if you bothered to read the article it mentions it" I did read it; last year's does look like a bug, but once bitten, twice shy? Wouldn't they specifically double-check this when they create future FW and order SSDs? Anyway, I'm outta here; mods appear to let HeavyHemi act like his avatar and abuse people, call them liars and clearly contradict himself while doing so: "You should learn how to counter a fact based logical argument || I am not making an argument"
@Aura89 First they had it 'after precisely 32,768 hours of usage'... now after 40,000 h (so stupid/annoying that companies want a 5-year warranty)... (https://www.guru3d.com/news-story/hp-enterprise-ssd-usersplease-check-and-update-firmware-(before-a-kill-switch-kicks-in).html) One point is that an SSD (SLC/MLC) can last a very long time if it's only modestly hammered with write cycles... A 'bad' example is my Crucial C300 256GB, which has its 10th anniversary this year. It is only a consumer SSD, which had a 3-year warranty and has been running for the last 9.6 years as a Windows 7 system drive with no issues whatsoever... Another 'bad example' is the 850 Pro lineup from Samsung, which has a 10-year warranty. I personally own 4 drives (2 x 512GB, 1TB & 2TB). They tested two of the 256GB models at heise.de - the 'weak' one died after 2.2 PB and the 'good' one after 9.1 petabytes(!!!)! Here's the link: https://www.heise.de/newsticker/meldung/SSD-Langzeittest-beendet-Exitus-bei-9-1-Petabyte-3755009.html And those HP drives are enterprise-level drives... they come with way higher write endurance than consumer drives... The company wants to sell (hard disks & much more), but when you suddenly have customers who order more than a 2- or 3-year warranty, then such a 'bug' must be 'fixed'... which leads me to the question: why was there such a bug at all? And why didn't they remove the 32,768 h 'bug' completely, but only (!)extended(!) the lifecycle to 40,000 h, which is called a 'bug' again (because 5 years are 43,800 h)??? Or do you sell those drives?!? - just open your eyes! For me personally, the probability of this being a coincidental 'bug' is as high as the probability that big companies like HP or even bigger ones care only for the customer & not for themselves. Just my 2 cents...
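For anyone who wants to check the hour figures thrown around in this thread, here is a quick sketch of the arithmetic. It assumes a plain 365-day year (8,760 hours) and ignores leap days; the function name is made up for illustration.

```python
# Convert the power-on-hour figures quoted in the thread into years.
# Assumption: one year = 365 days = 8,760 hours (leap days ignored).
HOURS_PER_YEAR = 365 * 24


def hours_to_years(hours: int) -> float:
    """Return continuous power-on time in years for a given hour count."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    for label, hours in [("first bug", 32_768), ("second bug", 40_000)]:
        print(f"{label}: {hours} h = {hours_to_years(hours):.2f} years")
    print(f"5-year contract = {5 * HOURS_PER_YEAR} h")
```

Under that assumption, both thresholds fall inside a 5-year window of continuous operation, which is the point several posters are arguing about.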
DG21:

@Aura89 The company wants to sell (hard disks & much more), but when you suddenly have customers who order more than a 2- or 3-year warranty, then such a 'bug' must be 'fixed'... which leads me to the question: why was there such a bug at all? For me personally, the probability of this being a coincidental 'bug' is as high as the probability that big companies like HP or even bigger ones care only for the customer & not for themselves.
nicely put... Crucial C300 & Micron C300 were identical apart from the enterprise price and FW; I've still got a few Microns alive and kicking too. I think I even cross-flashed one, as the initial Micron FW was buggy and we were literally binning 100s. Maybe useful for anyone with a Micron C300 bought from eBay that's dead on a shelf somewhere: the drive would freeze and not appear in the BIOS. FW 07 fixed it, but you needed to go to FW 03 first. If they did freeze, you removed the SATA cable or just supplied power for 30 minutes, which triggered a reset internally, and they would come alive again.
Ne1l:

Maybe useful for anyone with a Micron C300 bought from eBay that's dead on a shelf somewhere: the drive would freeze and not appear in the BIOS. FW 07 fixed it, but you needed to go to FW 03 first. If they did freeze, you removed the SATA cable or just supplied power for 30 minutes, which triggered a reset internally, and they would come alive again.
thanx 4 the hint! 🙂
Aura89:

Not sure how people don't understand that you need proof to validate a claim; you don't need proof to not make one. It's about the oddest argument I've ever heard. I'm not making an argument. I'm not making a claim. I have no proof because there's zero reason to have proof for a lack of a claim. Really can't get more simple than that. You, on the other hand, decided to make a claim with zero proof. That's called a conspiracy theory. Here, let me put it in a way even the simplest person could understand. If you bring someone to court claiming they stole from you, it's up to YOU to prove it. If you don't have proof, your court claim is lost and you'll likely have to pay for their court fees. What do they have to prove? Absolutely nothing. They only have to prove their innocence if you HAVE proof. And if during this court case you stated this long story about how they stole from you, but then the lawyer (me in this instance) asked you for proof, you don't get to say "well, prove it didn't happen!". No, that's not how anything works; you don't prove something didn't happen, you prove that it did, and the people claiming it are the ones to do it. The world may be going crazy right now, but left is still left and right is still right, and having to prove a lack of a claim still makes zero sense. ....Unless you want me to claim you're biased @cryohellinc because you and I have worked in the past for a competitor to Dell and HP and the only reason you're making these statements is because you still work for them and are trying to drive business over to the company you work for and that it completely lines up with your personality that I know you have since again, I know you and have worked for you.... Oh, want me to prove the statement I just said? No, you prove that it's not true. Let's all go back to the Salem witch trials since that apparently is where you wish to be.
First, well, you are claiming it's a bug. Where is the proof that it is a bug? You have the testimony of the accused, which is not worth a lot. Second, you are saying it yourself, "if you bring someone to court", so you are entitled to suspect and to ask for the justice system's assistance in proving things (like a judge asking a company for otherwise non-public information). Making fun of people for taking the first step and suspecting something is silly. Are you saying we shouldn't be able to take others to court until we have absolute evidence? Yes, you need something for a judge to even open a case, but HP (and others) don't have a "clean" record in this matter, which is enough to warrant some raised brows and forum posting.
end of quote
-----------------------------------------------------------------------------------------------------------------------
As for reasons to do this and then patch it themselves, there are a couple. Forcing consumers to apply an update or bricking the drives can be useful. For example:
- as already mentioned, a company sells the drives to a "recycler" that resells some of them to other customers who have no idea how, nor the means, to perform the update. Making recycling harder is profitable for them. (Apple makes more money on services and repairs than...... you know the story)
- a window of opportunity to introduce new things in the firmware. You never know: if a war starts, it could be useful to brick or infect all enemy drives? Or the EULA changes and now they can pull usage data directly from the disk (too far-fetched, but not impossible at all)
-----------------------------------------------------------------------------------------------------------------------
Something I never understood is the value for the customer of having a "feature" that bricks a disk after detecting a legitimate failure. If a failure is detected, the drive can report it and let the customer decide what to do with the drive; what's the value in an automatic "implosion"??? I'm not being sarcastic, this is a genuine question.
Denial:

I don't get why that has to be the case. Various parts of the system firmware use the drive time. For example, garbage collection and various other algorithms to optimize the drive all occur at various times of the drive's life. Most enterprise drives are designed to fail the entire drive if they detect an issue - so any issue involving these algorithms and the drive time could cause the drive to fail.
Garbage collection and other methods? How the hell could they screw that up to nuke the drive after X amount of hours? This is a genuine question, if you think it's something along the lines of that, how?
Neo Cyrus:

Garbage collection and other methods? How the hell could they screw that up to nuke the drive after X amount of hours? This is a genuine question, if you think it's something along the lines of that, how?
Look, I'm not going to pretend I'm a firmware engineer for an SSD manufacturer, but I did study CE for 5 years at RIT - lots of projects I've worked on then and since then use a device's power-on time as a trigger for various operations to take place - notably maintenance ones, which is why I mentioned it. In the past, and I'm sure today, manufacturers of enterprise-grade SSDs would fail the entire drive in place if it detected any kind of error in the firmware or with the memory itself. They do this because when you build massive storage arrays, one drive out of hundreds or thousands is basically nothing cost-wise - yet one customer trying to pull error-prone data from a funked SSD creates a shitstorm of problems. So let's say they stored the time as the wrong variable type, or the logic in the code has some case where, when it crosses a time threshold, it fails one of the maintenance commands, or any other scenario where it's simply using time as a trigger and causes an error - boom, device is bricked. I don't know if this is happening - for all I know some intern or disgruntled engineer wrote shit code or intentionally put code in - or perhaps the CEOs of HP and Dell colluded to crash a subset of their drives at exactly the same time, but not before announcing and creating a patch for it. All I'm saying is that it's more than likely just a bug found in both companies' drives, because realistically Dell/HP are probably buying the drives/firmware from the same provider and that provider just didn't do the due diligence of proper unit testing. I think Hanlon's Razor is appropriate here.
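To make the "wrong variable type" scenario concrete, here is a minimal sketch of how that could play out. Everything in it is hypothetical: the function names, the signed 16-bit field and the fail-the-drive rule are assumptions for illustration, not anything taken from the actual Dell, HPE or SanDisk firmware. It only shows why a figure like 32,768 hours can appear without anyone writing a deliberate "expiry date"; it does not explain the 40,000-hour case, which is not a power of two.

```python
import ctypes


def store_power_on_hours(hours: int) -> int:
    """Simulate writing the hour counter into a signed 16-bit field.

    Hypothetical layout: ctypes.c_int16 wraps silently, like a C int16_t
    would on typical hardware.
    """
    return ctypes.c_int16(hours).value


def maintenance_check(hours: int) -> str:
    """Hypothetical maintenance routine that treats a negative age as corruption."""
    stored = store_power_on_hours(hours)
    if stored < 0:
        # Enterprise-style policy described above: any internal
        # inconsistency fails the whole drive rather than risk bad data.
        return "FAIL_DRIVE"
    return "OK"


if __name__ == "__main__":
    for h in (32_766, 32_767, 32_768):
        print(h, store_power_on_hours(h), maintenance_check(h))
```

In this sketch the counter is fine up to 32,767 hours and wraps to a negative value one hour later, so a sanity check that was never meant as a kill switch ends up acting like one.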