SSDD reliability
Yeah, on that subject, anybody else see this:
<>
Absolutely pathetic.
--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On May 4, 2011, at 10:50 AM, Greg Smith wrote:
Your link didn't show up on this.
Sigh... Step 2: paste link in ;-)
<http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>
--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On 5/4/2011 11:15 AM, Scott Ribe wrote:
Sigh... Step 2: paste link in ;-)
<http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>
To be honest, like the article author, I'd be happy with 300+ days to
failure, IF the drives provide an accurate predictor of impending doom.
That is, if I can be notified "this drive will probably quit working in
30 days", then I'd arrange to cycle in a new drive.
The performance benefits vs rotating drives are for me worth this hassle.
OTOH if the drive says it is just fine and happy, then suddenly quits
working, that's bad.
Given the physical characteristics of the cell wear-out mechanism, I
think it should be possible to provide a reasonably accurate remaining
lifetime estimate, but so far my attempts to read this information via
SMART have failed for the drives we have in use here.
FWIW I have a server with 481 days uptime, and 31 months operating that
has an el-cheapo SSD for its boot/OS drive.
On May 4, 2011, at 11:31 AM, David Boreham wrote:
To be honest, like the article author, I'd be happy with 300+ days to failure, IF the drives provide an accurate predictor of impending doom.
No problem with that, for a first step. ***BUT*** the failures in this article and many others I've read about are not in high-write db workloads, so they're not write wear, they're just crappy electronics failing.
--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
No problem with that, for a first step. ***BUT*** the failures in this article and
many others I've read about are not in high-write db workloads, so they're not write wear,
they're just crappy electronics failing.
As a (lapsed) electronics design engineer, I'm suspicious of the notion that
a subassembly consisting of solid state devices surface-mounted on FR4 substrate will fail
except in very rare (and of great interest to the manufacturer) circumstances.
And especially suspicious that one product category (SSD) happens to have a much
higher failure rate than all others.
Consider that an SSD is much simpler (just considering the electronics) than a traditional
disk drive, and subject to less vibration and heat.
Therefore one should see disk drives failing at the same (or a higher) rate.
Even if the owner is highly statically charged, you'd expect them to destroy all categories
of electronics at roughly the same rate (rather than just SSDs).
So if someone says that SSDs have "failed", I'll assume that they suffered from Flash cell
wear-out unless there is compelling proof to the contrary.
On 05/04/2011 03:24 PM, David Boreham wrote:
So if someone says that SSDs have "failed", I'll assume that they
suffered from Flash cell
wear-out unless there is compelling proof to the contrary.
I've been involved in four recovery situations similar to the one
described in that coding horror article, and zero of them were flash
wear-out issues. The telling sign is that the device should fail to
read-only mode if it wears out. That's not what I've seen happen
though; what reports from the field are saying is that sudden, complete
failures are the more likely event.
The environment inside a PC of any sort, desktop or particularly
portable, is not a predictable environment. Just because the drives
should be less prone to heat and vibration issues doesn't mean
individual components can't slide out of spec because of them. And hard
drive manufacturers have a giant head start at working out reliability
bugs in that area. You can't design that sort of issue out of a new
product in advance; all you can do is analyze returns from the field,
see what you screwed up, and do another design rev to address it.
The idea that these new devices, which are extremely complicated and
based on hardware that hasn't been manufactured in volume before, should
be expected to have high reliability is an odd claim. I assume that any
new electronics gadget has an extremely high failure rate during its
first few years of volume production, particularly from a new
manufacturer of that product.
Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
mechanical drives are around 2% during their first year, spiking to 5%
afterwards. I suspect that Intel's numbers are actually much better
than the other manufacturers here, so a SSD from anyone else can easily
be less reliable than a regular hard drive still.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 5/4/2011 6:02 PM, Greg Smith wrote:
On 05/04/2011 03:24 PM, David Boreham wrote:
So if someone says that SSDs have "failed", I'll assume that they
suffered from Flash cell
wear-out unless there is compelling proof to the contrary.
I've been involved in four recovery situations similar to the one
described in that coding horror article, and zero of them were flash
wear-out issues. The telling sign is that the device should fail to
read-only mode if it wears out. That's not what I've seen happen
though; what reports from the field are saying is that sudden,
complete failures are the more likely event.
Sorry to harp on this (last time I promise), but I somewhat do know what
I'm talking about, and I'm quite motivated to get to the bottom of this
"SSDs fail, but not for the reason you'd suspect" syndrome (because we
want to deploy SSDs in production soon).
Here's my best theory at present: the failures ARE caused by cell
wear-out, but the SSD firmware is buggy insofar as it fails to boot up
and respond to host commands due to the wear-out state. So rather than
the expected outcome (SSD responds but has read-only behavior), it
appears to be (and is) dead. At least to my mind, this is a more
plausible explanation for the reported failures vs. the alternative (SSD
vendors are uniquely clueless at making basic electronics
subassemblies), especially considering the difficulty in testing the
firmware under all possible wear-out conditions.
One question worth asking is: in the cases you were involved in, was
manufacturer failure analysis performed (and if so what was the failure
cause reported?).
The environment inside a PC of any sort, desktop or particularly
portable, is not a predictable environment. Just because the drives
should be less prone to heat and vibration issues doesn't mean
individual components can't slide out of spec because of them. And
hard drive manufacturers have a giant head start at working out
reliability bugs in that area. You can't design that sort of issue
out of a new product in advance; all you can do is analyze returns
from the field, see what you screwed up, and do another design rev to
address it.
That's not really how it works (I've been the guy responsible for this
for 10 years in a prior career, so I feel somewhat qualified to argue
about this). The technology and manufacturing processes are common
across many different types of product. They either all work, or they
all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on
the exact same production lines as regular disk drives, DRAM modules,
and so on (manufacturing tends to be contracted to high volume factories
that make all kinds of things on the same lines). The only different
thing about SSDs vs. any other electronics you'd come across is the
Flash devices themselves. However, those are used in extraordinarily high
volumes all over the place and if there were a failure mode with the
incidence suggested by these stories, I suspect we'd be reading about it
on the front page of the WSJ.
Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
mechanical drives is around 2% during their first year, spiking to 5%
afterwards. I suspect that Intel's numbers are actually much better
than the other manufacturers here, so a SSD from anyone else can
easily be less reliable than a regular hard drive still.
Hmm, this is speculation I don't support (non-intel vendors have a 10x
worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process...sure, but all of them ??
For the benefit of anyone reading this who may have a failed SSD : all
the tier 1 manufacturers have departments dedicated to the analysis of
product that fails in the field. With some persistence, you can usually
get them to take a failed unit and put it through the FA process (and
tell you why it failed). For example, here's a job posting for someone
who would do this work :
http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345
I'd encourage you to at least try to get your failed devices into the
failure analysis pile. If units are not returned, the manufacturer never
finds out what broke, and therefore can't fix the problem.
On Wed, May 4, 2011 at 6:31 PM, David Boreham <david_list@boreham.org> wrote:
this). The technology and manufacturing processes are common across many
different types of product. They either all work , or they all fail.
Most of it is. But certain parts are fairly new, i.e. the
controllers. It is quite possible that all these various failing
drives share some long term ~ 1 year degradation issue like the 6Gb/s
SAS ports on the early sandybridge Intel CPUs. If that's the case
then the just plain up and dying thing makes some sense.
On 5/4/2011 9:06 PM, Scott Marlowe wrote:
Most of it is. But certain parts are fairly new, i.e. the
controllers. It is quite possible that all these various failing
drives share some long term ~ 1 year degradation issue like the 6Gb/s
SAS ports on the early sandybridge Intel CPUs. If that's the case
then the just plain up and dying thing makes some sense.
That Intel SATA port circuit issue was an extraordinarily rare screwup.
So ok, yeah...I said that chips don't just keel over and die mid-life
and you came up with the one counterexample in the history of
the industry :) When I worked in the business in the 80's and 90's
we had a few things like this happen, but they're very rare and
typically don't escape into the wild (as Intel's pretty much didn't).
If a similar problem affected SSDs, they would have been recalled
and lawsuits would be underway.
SSDs are just not that different from anything else.
No special voodoo technology (besides the Flash devices themselves).
On Wed, May 4, 2011 at 9:34 PM, David Boreham <david_list@boreham.org> wrote:
On 5/4/2011 9:06 PM, Scott Marlowe wrote:
Most of it is. But certain parts are fairly new, i.e. the
controllers. It is quite possible that all these various failing
drives share some long term ~ 1 year degradation issue like the 6Gb/s
SAS ports on the early sandybridge Intel CPUs. If that's the case
then the just plain up and dying thing makes some sense.
That Intel SATA port circuit issue was an extraordinarily rare screwup.
So ok, yeah...I said that chips don't just keel over and die mid-life
and you came up with the one counterexample in the history of
the industry :) When I worked in the business in the 80's and 90's
we had a few things like this happen, but they're very rare and
typically don't escape into the wild (as Intel's pretty much didn't).
If a similar problem affected SSDs, they would have been recalled
and lawsuits would be underway.
Not necessarily. If there's a chip that has a 15% failure rate
instead of the predicted <1% it might not fail enough for people to
have noticed, since a user with a typically small sample might think
he just got a bit unlucky etc. Nvidia made GPUs that overheated and
died by the thousand, but took 1 to 2 years to die. There WAS a
lawsuit, and now to settle it, they're offering to buy everybody who
got stuck with the broken GPUs a nice single core $279 Compaq
computer, even if they bought a $4,000 workstation with one of those
dodgy GPUs.
There are a lot of possibilities as to why some folks are seeing high
failure rates; it'd be nice to know the cause. But we can't assume
it's not an inherent problem with some part in them any more than we
can assume that it is.
On 05/05/11 03:31, David Boreham wrote:
On 5/4/2011 11:15 AM, Scott Ribe wrote:
Sigh... Step 2: paste link in ;-)
<http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html>
To be honest, like the article author, I'd be happy with 300+ days to
failure, IF the drives provide an accurate predictor of impending doom.
That is, if I can be notified "this drive will probably quit working in
30 days", then I'd arrange to cycle in a new drive.
The performance benefits vs rotating drives are for me worth this hassle.
OTOH if the drive says it is just fine and happy, then suddenly quits
working, that's bad.
Given the physical characteristics of the cell wear-out mechanism, I
think it should be possible to provide a reasonably accurate remaining
lifetime estimate, but so far my attempts to read this information via
SMART have failed for the drives we have in use here.
In what way has the SMART read failed?
(I get the relevant values out successfully myself, and have Munin graph
them.)
FWIW I have a server with 481 days uptime, and 31 months operating that
has an el-cheapo SSD for its boot/OS drive.
Likewise, I have a server with a first-gen SSD (Kingston 60GB) that has
been running constantly for over a year, without any hiccups. It runs a
few small websites, a few email lists, all of which interact with
PostgreSQL databases.. lifetime writes to the disk are close to
three-quarters of a terabyte, and despite its lack of TRIM support, the
performance is still pretty good.
I'm pretty happy!
I note in the comments of that blog post above, it includes:
"I have shipped literally hundreds of Intel G1 and G2 SSDs to my
customers and never had a single in the field failure (save for one
drive in a laptop where the drive itself functioned fine but one of the
contacts on the SATA connector was actually flaky, probably from
vibrational damage from a lot of airplane flights, and one DOA drive). I
think you just got unlucky there."
I do have to wonder if this Portman Wills guy was somehow Doing It Wrong
to get a 100% failure rate over eight disks..
* Greg Smith:
Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
mechanical drives is around 2% during their first year, spiking to 5%
afterwards. I suspect that Intel's numbers are actually much better
than the other manufacturers here, so a SSD from anyone else can
easily be less reliable than a regular hard drive still.
I'm a bit concerned with usage-dependent failures. Presumably, two SSDs
in a RAID-1 configuration wear down in the same way, and it would
be rather inconvenient if they failed at the same point. With hard
disks, this doesn't seem to happen; even bad batches fail pretty much
randomly.
--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
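A toy model of the concern above, as a sketch: a RAID-1 mirror feeds identical writes to both members, so wear-based failures are perfectly correlated, whereas mechanical failures are (roughly) independent. All numbers here are illustrative assumptions, not vendor specs.

```python
# Sketch of correlated wear-out in RAID-1 vs. independent failures.
# ENDURANCE_WRITES and the 2% AFR are illustrative, not real specs.
import random

ENDURANCE_WRITES = 1_000_000  # assumed total writes a drive survives before cell wear-out

def mirror_wear_failure(total_writes):
    """Both mirror members see every write, so they cross the
    endurance threshold on the same write: a simultaneous double failure."""
    drive_a_worn = total_writes >= ENDURANCE_WRITES
    drive_b_worn = total_writes >= ENDURANCE_WRITES
    return drive_a_worn and drive_b_worn

def independent_double_failure(afr, years, rng):
    """Mechanical-style model: each of the two drives fails
    independently with a fixed annual failure rate."""
    p = 1 - (1 - afr) ** years
    return all(rng.random() < p for _ in range(2))

rng = random.Random(42)
# Wear model: one write past endurance loses both copies at once.
print(mirror_wear_failure(ENDURANCE_WRITES))  # True
# Independent model: a double failure in one year at 2% AFR is rare (~0.02 * 0.02).
trials = 100_000
doubles = sum(independent_double_failure(0.02, 1, rng) for _ in range(trials))
print(doubles / trials)  # roughly 0.0004
```

The point of the contrast: under the wear model the mirror provides no protection at all at end-of-life, while under the independent model redundancy multiplies the small probabilities together.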
On 5/5/2011 2:36 AM, Florian Weimer wrote:
I'm a bit concerned with usage-dependent failures. Presumably, two SSDs
in a RAID-1 configuration wear down in the same way, and it would
be rather inconvenient if they failed at the same point. With hard
disks, this doesn't seem to happen; even bad batches fail pretty much
randomly.
fwiw this _can_ happen with traditional drives: we had a bunch of WD
300G Velociraptor drives that had a firmware bug related to a 32-bit
counter roll-over. This happened at exactly the same time for all drives
in a machine (because the counter counted since power-up time). Needless
to say this was quite frustrating!
On 5/4/2011 11:50 PM, Toby Corkindale wrote:
In what way has the SMART read failed?
(I get the relevant values out successfully myself, and have Munin
graph them.)
Mis-parse :) It was my _attempts_ to read SMART that failed.
Specifically, I was able to read a table of numbers from the drive, but
none of the numbers looked particularly useful or likely to be a "time
to live" number. Similar to traditional drives, where you get this table
of numbers that are either zero or random, that you look at saying
"Huh?", all of which are flagged as "failing". Perhaps I'm using the
wrong SMART groking tools ?
I do have to wonder if this Portman Wills guy was somehow Doing It
Wrong to get a 100% failure rate over eight disks..
There are people out there who are especially highly charged.
So if he didn't wear out the drives, the next most likely cause I'd
suspect is that he ESD zapped them.
On May 4, 2011, at 9:34 PM, David Boreham wrote:
So ok, yeah...I said that chips don't just keel over and die mid-life
and you came up with the one counterexample in the history of
the industry
Actually, any of us who really tried could probably come up with a dozen examples--more if we've been around for a while. Original design cutting corners on power regulation; final manufacturers cutting corners on specs; component manufacturers cutting corners on specs or selling outright counterfeit parts...
--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On 5/5/2011 8:04 AM, Scott Ribe wrote:
Actually, any of us who really tried could probably come up with a dozen examples--more if we've been around for a while. Original design cutting corners on power regulation; final manufacturers cutting corners on specs; component manufacturers cutting corners on specs or selling outright counterfeit parts...
These are excellent examples of failure causes for electronics, but they are
not counter-examples. They're unrelated to the discussion about SSD
early lifetime hard failures.
On 05/05/2011 10:35 AM, David Boreham wrote:
On 5/5/2011 8:04 AM, Scott Ribe wrote:
Actually, any of us who really tried could probably come up with a
dozen examples--more if we've been around for a while. Original
design cutting corners on power regulation; final manufacturers
cutting corners on specs; component manufacturers cutting corners on
specs or selling outright counterfeit parts...
These are excellent examples of failure causes for electronics, but
they are not counter-examples. They're unrelated to the discussion
about SSD early lifetime hard failures.
That's really optimistic. For all we know, these problems are the
latest incarnation of something like the bulging capacitor plague circa
5 years ago: some part unique to SSDs, other than the flash cells, of
which there's a giant bad batch.
I think your faith in PC component manufacturing is out of touch with
the actual field failure rates for this stuff, which is produced with
enormous cost cutting pressure driving tolerances to the bleeding edge
in many cases. The equipment of the 80's and 90's you were referring to
ran slower, and was more expensive so better quality components could be
justified. The quality trend at the board and component level has for a
long time been toward cheap over good in almost every case.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Thu, May 5, 2011 at 1:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
I think your faith in PC component manufacturing is out of touch with the
actual field failure rates for this stuff, which is produced with enormous
cost cutting pressure driving tolerances to the bleeding edge in many cases.
The equipment of the 80's and 90's you were referring to ran slower, and
was more expensive so better quality components could be justified. The
quality trend at the board and component level has been trending for a long
time toward cheap over good in almost every case nowadays.
Modern CASE tools make this more and more of an issue. You can be in
a circuit design program, right click on a component and pick from a
dozen other components with lower tolerances and get a SPICE
simulation that says initial production line failure rates will go
from 0.01% to 0.02%. Multiply that times 100 components and it seems
like a small change. But all it takes is one misstep and you've got a
board with a theoretical production line failure rate of 0.05 that's
really 0.08, and the first year failure rate goes from 0.5% to 2 or 3%
and the $2.00 you saved on all components on the board times 1M units
goes right out the window.
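The per-component arithmetic above can be spelled out as a quick sketch: with n independent components each failing at rate p, the board-level failure rate is 1 - (1 - p)^n, so tiny per-component changes multiply quickly. The rates below are the illustrative ones from this message, not measured data.

```python
# Board failure rate from independent per-component failure rates:
# P(board fails) = 1 - P(every component survives) = 1 - (1 - p)**n

def board_failure_rate(p_component, n_components):
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p_component) ** n_components

# 100 components at 0.01% each vs. the same 100 at 0.02% each:
low = board_failure_rate(0.0001, 100)   # ~1%
high = board_failure_rate(0.0002, 100)  # ~2%
print(f"{low:.4f} -> {high:.4f}")
```

Doubling a per-component rate that looks negligible in isolation roughly doubles the board-level rate, which is the "seems like a small change" trap described above.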
BTW, things that fail due to weird and unforeseen circumstances were
often referred to as P.O.M. dependent (phase of the moon), because
they'd often cluster around certain operating conditions that were
unobvious until you collected
and collated a large enough data set. Like hard drives that have
abnormally high failure rates at altitudes above 4500ft etc. Seem
fine til you order 1,000 for your Denver data center and they all
start failing. It could be anything like that. SSDs that operate
fine until they're in an environment with constant % humidity below
15% and boom they start failing like mad. It's impossible to test for
all conditions in the field, and it's quite possible that
environmental factors affect some of these SSDs we've heard about.
More research is necessary to say why someone would see such
clustering though.
On 05/04/2011 08:31 PM, David Boreham wrote:
Here's my best theory at present : the failures ARE caused by cell
wear-out, but the SSD firmware is buggy in so far as it fails to boot
up and respond to host commands due to the wear-out state. So rather
than the expected outcome (SSD responds but has read-only behavior),
it appears to be (and is) dead. At least to my mind, this is a more
plausible explanation for the reported failures vs. the alternative
(SSD vendors are uniquely clueless at making basic electronics
subassemblies), especially considering the difficulty in testing the
firmware under all possible wear-out conditions.
One question worth asking is: in the cases you were involved in, was
manufacturer failure analysis performed (and if so what was the
failure cause reported?).
Unfortunately not. Many of the people I deal with, particularly the
ones with budgets to be early SSD adopters, are not the sort to return
things that have failed to the vendor. In some of these shops, if the
data can't be securely erased first, it doesn't leave the place. The
idea that some trivial fix at the hardware level might bring the drive
back to life, data intact, is terrifying to many businesses when drives
fail hard.
Your bigger point, that this could just easily be software failures due
to unexpected corner cases rather than hardware issues, is both a fair
one to raise and even more scary.
Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
deployments (not OEM ones) is 0.6%. Typical measured AFR rates for
mechanical drives is around 2% during their first year, spiking to 5%
afterwards. I suspect that Intel's numbers are actually much better
than the other manufacturers here, so a SSD from anyone else can
easily be less reliable than a regular hard drive still.
Hmm, this is speculation I don't support (non-intel vendors have a 10x
worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process...sure, but all of them ??
I was postulating that you only have to be 4X as bad as Intel to reach
2.4%, and then be worse than a mechanical drive for early failures. If
you look at http://labs.google.com/papers/disk_failures.pdf you can see
there's a 5:1 ratio in first-year AFR just between light and heavy usage
on the drive. So a 4:1 ratio between best and worst manufacturer for
SSD seemed possible. Plenty of us have seen particular drive models
that were much more than 4X as bad as average ones among regular hard
drives.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 05/05/11 22:50, David Boreham wrote:
On 5/4/2011 11:50 PM, Toby Corkindale wrote:
In what way has the SMART read failed?
(I get the relevant values out successfully myself, and have Munin
graph them.)
Mis-parse :) It was my _attempts_ to read SMART that failed.
Specifically, I was able to read a table of numbers from the drive, but
none of the numbers looked particularly useful or likely to be a "time
to live" number. Similar to traditional drives, where you get this table
of numbers that are either zero or random, that you look at saying
"Huh?", all of which are flagged as "failing". Perhaps I'm using the
wrong SMART groking tools ?
I run:
sudo smartctl -a /dev/sda
And amongst the usual values, I also get:
232 Available_Reservd_Space 0x0002 100 048 000 Old_age Always - 9011683733561
233 Media_Wearout_Indicator 0x0002 100 000 000 Old_age Always - 0
The media wearout indicator is the useful one.
Plus some unknown attributes:
229 Unknown_Attribute 0x0002 100 000 000 Old_age Always - 21941823264152
234 Unknown_Attribute 0x0002 100 000 000 Old_age Always - 953583437830
235 Unknown_Attribute 0x0002 100 000 000 Old_age Always - 1476591679
I found some suggested definitions for those attributes, but they didn't
seem to match up with my values once I decoded them, so mine must be
proprietary.
-Toby
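For anyone wanting to script this, here is a minimal sketch that pulls the normalized Media_Wearout_Indicator value out of `smartctl -a` output. The sample text is the attribute table quoted above; note that attribute IDs and meanings vary by vendor (233 is the Intel convention), so treat this as an assumption-laden starting point rather than a general-purpose parser.

```python
# Minimal parser for smartctl attribute lines, extracting the normalized
# current value of one attribute (100 = new, counting down toward 0).
# Attribute 233 (Media_Wearout_Indicator) is an Intel-specific convention.

SAMPLE = """\
232 Available_Reservd_Space 0x0002 100 048 000 Old_age Always - 9011683733561
233 Media_Wearout_Indicator 0x0002 100 000 000 Old_age Always - 0
"""

def wearout_remaining(smart_text, attr_id=233):
    """Return the normalized value for the given SMART attribute,
    or None if the drive doesn't report it."""
    for line in smart_text.splitlines():
        fields = line.split()
        if fields and fields[0] == str(attr_id):
            return int(fields[3])  # the normalized "current value" column
    return None

print(wearout_remaining(SAMPLE))  # 100
```

In practice you'd feed this the output of `subprocess.run(["smartctl", "-a", "/dev/sda"], ...)` and graph the value over time (as with the Munin setup mentioned above) to watch for the countdown toward 0.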