June 16th, 2015, 10:36 PM | #1
Major Player
Join Date: Jul 2010
Location: Columbus USA
Posts: 312
SSD HD Failure White Paper
June 17th, 2015, 12:35 AM | #2
Equal Opportunity Offender
Join Date: Feb 2009
Location: Brisbane, Australia
Posts: 3,068
Re: SSD HD Failure White Paper
Abstract of linked paper ....
ABSTRACT

Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center environment running modern applications and system software.

This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power.

Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations:

(1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected.

(2) The effects of read disturbance errors are not prevalent in the field.

(3) Sparse logical data layout across an SSD's physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate.

(4) Higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures.

(5) Data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells, due to optimizations in the SSD controller and buffering employed in the system software.

We hope that the findings of this first large-scale flash memory reliability study can inspire others to develop other publicly-available analyses and novel flash reliability solutions.
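Not from the paper itself, but to get a feel for observation (3): many flash translation layer designs need less DRAM metadata when data is laid out contiguously, because a whole run of pages can be described by a single extent, while fragmented data needs a new entry at every break. A minimal Python sketch of that idea, with a hypothetical extent-style mapping and made-up entry sizes:

# Illustrative only: rough estimate of how much translation metadata an
# extent-based mapping would need for contiguous vs. fragmented data.
# The 12-byte entry size is an assumption, not a figure from the paper.

def mapping_metadata_bytes(mapping, bytes_per_extent=12):
    """mapping: list of (logical_page, physical_page) pairs, sorted by logical page.
    Adjacent logical pages that also map to adjacent physical pages share one
    extent entry; every break in contiguity costs another entry."""
    if not mapping:
        return 0
    extents = 1
    for (prev_l, prev_p), (cur_l, cur_p) in zip(mapping, mapping[1:]):
        if cur_l != prev_l + 1 or cur_p != prev_p + 1:
            extents += 1
    return extents * bytes_per_extent

# Contiguous layout: 1024 pages in one run -> a single extent entry.
contiguous = [(i, 5000 + i) for i in range(1024)]
# Fragmented layout: the same 1024 pages scattered -> one entry per page.
fragmented = [(i * 2, 5000 + i * 3) for i in range(1024)]

print(mapping_metadata_bytes(contiguous))   # 12 bytes
print(mapping_metadata_bytes(fragmented))   # 12288 bytes

Same amount of user data, three orders of magnitude more mapping metadata, which is the kind of effect the paper measures indirectly via the SSD-internal DRAM buffer usage.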
June 17th, 2015, 12:37 AM | #3
Equal Opportunity Offender
Join Date: Feb 2009
Location: Brisbane, Australia
Posts: 3,068
Re: SSD HD Failure White Paper
Also ....
SUMMARY AND CONCLUSIONS

We performed an extensive analysis of the effects of various factors on flash-based SSD reliability across a majority of the SSDs employed at Facebook, running production data center workloads. We analyze a variety of internal and external characteristics of SSDs and examine how these characteristics affect the trends for uncorrectable errors. To conclude, we briefly summarize the key observations from our study and discuss their implications for SSD and system design.

Observation 1: We observe that SSDs go through several distinct failure periods – early detection, early failure, usable life, and wearout – during their lifecycle, corresponding to the amount of data written to flash chips. Due to pools of flash blocks with different reliability characteristics, the failure rate in a population does not monotonically increase with respect to the amount of data written to flash chips. This is unlike the failure rate trends seen in raw flash chips. We suggest that techniques should be designed to help reduce or tolerate errors throughout the SSD lifecycle. For example, additional error correction at the beginning of an SSD's life could help reduce the failure rates we see during the early detection period.

Observation 2: We find that the effect of read disturbance errors is not a predominant source of errors in the SSDs we examine. While prior work has shown that such errors can occur under certain access patterns in controlled environments [5, 32, 6, 8], we do not observe this effect across the SSDs we examine. This corroborates prior work which showed that the effect of retention errors in flash cells dominates error rate compared to read disturbance [32, 6]. It may be beneficial to perform a more detailed study of the effect of these types of errors in flash-based SSDs used in servers.

Observation 3: Sparse data layout across an SSD's physical address space (e.g., non-contiguously allocated data) leads to high SSD failure rates; dense data layout (e.g., contiguous data) can also negatively impact reliability under certain conditions, likely due to adversarial access patterns. Further research into flash write coalescing policies with information from the system level may help improve SSD reliability. For example, information about write access patterns from the operating system could potentially inform SSD controllers of non-contiguous data that is accessed very frequently, which may be one type of access pattern that adversely affects SSD reliability and is a candidate for storing in a separate write buffer.

Observation 4: Higher temperatures lead to increased failure rates, but do so most noticeably for SSDs that do not employ throttling techniques. In general, we find techniques like throttling, which may be employed to reduce SSD temperature, to be effective at reducing the failure rate of SSDs. We also find that SSD temperature is correlated with the power used to transmit data across the PCIe bus, which can potentially be used as a proxy for temperature in the absence of SSD temperature sensors.

Observation 5: The amount of data reported to be written by the system software can overstate the amount of data actually written to flash chips, due to system-level buffering and wear reduction techniques. Techniques that simply reduce the rate of software-level writes may not reduce the failure rate of SSDs. Studies seeking to model the effects of reducing software-level writes on flash reliability should also consider how other aspects of SSD operation, such as system-level buffering and SSD controller wear leveling, affect the actual amount of data written to SSDs.

Conclusions. We hope that our new observations, with real workloads and real systems from the field, can aid in (1) understanding the effects of different factors, including system software, applications, and SSD controllers, on flash memory reliability, (2) the design of more reliable flash architectures and systems, and (3) improving the evaluation methodologies for future flash memory reliability studies.
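Observation 5 is something you can sanity-check on your own drives if the vendor exposes both host-write and NAND-write counters (for example via smartctl; attribute names differ by vendor, and some drives report in 32 MiB units). A rough Python sketch, assuming you have already read those two counters off the drive:

def write_amplification(host_units_written, nand_units_written):
    """Ratio of data actually written to flash vs. data the host asked to write.
    Above 1.0 means controller overheads (garbage collection, wear leveling)
    amplified writes; below 1.0 means buffering or compression absorbed some
    host writes before they ever reached the flash chips."""
    if host_units_written == 0:
        return float("nan")
    return nand_units_written / host_units_written

# Example: counters reported in 32 MiB units (some vendors use this scale).
host = 1_500_000    # ~45.8 TiB written by the OS
nand = 1_950_000    # ~59.5 TiB actually written to flash
print(f"write amplification factor: {write_amplification(host, nand):.2f}")  # 1.30

As the paper points out, the gap between the two counters is exactly why reducing software-level writes alone may not reduce wear by the same amount.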