June 16th, 2015, 10:36 PM | #1
Major Player
Join Date: Jul 2010
Location: Columbus USA
Posts: 312
SSD HD Failure White Paper
June 17th, 2015, 12:35 AM | #2
Equal Opportunity Offender
Join Date: Feb 2009
Location: Brisbane, Australia
Posts: 3,068
Re: SSD HD Failure White Paper
Abstract of linked paper ....
ABSTRACT

Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center environment running modern applications and system software.

This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power.

Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations:

(1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected.

(2) The effects of read disturbance errors are not prevalent in the field.

(3) Sparse logical data layout across an SSD's physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate.

(4) Higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures.

(5) Data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells, due to optimizations in the SSD controller and buffering employed in the system software.

We hope that the findings of this first large-scale flash memory reliability study can inspire others to develop other publicly-available analyses and novel flash reliability solutions.
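Not from the paper itself, but to get a feel for observation (3): many flash translation layer designs need less DRAM metadata when data is laid out contiguously, because a whole run of pages can be described by a single extent, while fragmented data needs a new entry at every break. A minimal Python sketch of that idea, with a hypothetical extent-style mapping and made-up entry sizes:

# Illustrative only: rough estimate of how much translation metadata an
# extent-based mapping would need for contiguous vs. fragmented data.
# The 12-byte entry size is an assumption, not a figure from the paper.

def mapping_metadata_bytes(mapping, bytes_per_extent=12):
    """mapping: list of (logical_page, physical_page) pairs, sorted by logical page.
    Adjacent logical pages that also map to adjacent physical pages share one
    extent entry; every break in contiguity costs another entry."""
    if not mapping:
        return 0
    extents = 1
    for (prev_l, prev_p), (cur_l, cur_p) in zip(mapping, mapping[1:]):
        if cur_l != prev_l + 1 or cur_p != prev_p + 1:
            extents += 1
    return extents * bytes_per_extent

# Contiguous layout: 1024 pages in one run -> a single extent entry.
contiguous = [(i, 5000 + i) for i in range(1024)]
# Fragmented layout: the same 1024 pages scattered -> one entry per page.
fragmented = [(i * 2, 5000 + i * 3) for i in range(1024)]

print(mapping_metadata_bytes(contiguous))   # 12 bytes
print(mapping_metadata_bytes(fragmented))   # 12288 bytes

Same amount of user data, three orders of magnitude more mapping metadata, which is the kind of effect the paper measures indirectly via the SSD-internal DRAM buffer usage.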
June 17th, 2015, 12:37 AM | #3
Equal Opportunity Offender
Join Date: Feb 2009
Location: Brisbane, Australia
Posts: 3,068
Re: SSD HD Failure White Paper
Also ....
SUMMARY AND CONCLUSIONS

We performed an extensive analysis of the effects of various factors on flash-based SSD reliability across a majority of the SSDs employed at Facebook, running production data center workloads. We analyze a variety of internal and external characteristics of SSDs and examine how these characteristics affect the trends for uncorrectable errors. To conclude, we briefly summarize the key observations from our study and discuss their implications for SSD and system design.

Observation 1: We observe that SSDs go through several distinct failure periods – early detection, early failure, usable life, and wearout – during their lifecycle, corresponding to the amount of data written to flash chips. Due to pools of flash blocks with different reliability characteristics, the failure rate in a population does not monotonically increase with respect to the amount of data written to flash chips. This is unlike the failure rate trends seen in raw flash chips. We suggest that techniques should be designed to help reduce or tolerate errors throughout the SSD lifecycle. For example, additional error correction at the beginning of an SSD's life could help reduce the failure rates we see during the early detection period.

Observation 2: We find that the effect of read disturbance errors is not a predominant source of errors in the SSDs we examine. While prior work has shown that such errors can occur under certain access patterns in controlled environments [5, 32, 6, 8], we do not observe this effect across the SSDs we examine. This corroborates prior work which showed that the effect of retention errors in flash cells dominates error rate compared to read disturbance [32, 6]. It may be beneficial to perform a more detailed study of the effect of these types of errors in flash-based SSDs used in servers.

Observation 3: Sparse data layout across an SSD's physical address space (e.g., non-contiguously allocated data) leads to high SSD failure rates; dense data layout (e.g., contiguous data) can also negatively impact reliability under certain conditions, likely due to adversarial access patterns. Further research into flash write coalescing policies with information from the system level may help improve SSD reliability. For example, information about write access patterns from the operating system could potentially inform SSD controllers of non-contiguous data that is accessed very frequently, which may be one type of access pattern that adversely affects SSD reliability and is a candidate for storing in a separate write buffer.

Observation 4: Higher temperatures lead to increased failure rates, but do so most noticeably for SSDs that do not employ throttling techniques. In general, we find techniques like throttling, which may be employed to reduce SSD temperature, to be effective at reducing the failure rate of SSDs. We also find that SSD temperature is correlated with the power used to transmit data across the PCIe bus, which can potentially be used as a proxy for temperature in the absence of SSD temperature sensors.

Observation 5: The amount of data reported to be written by the system software can overstate the amount of data actually written to flash chips, due to system-level buffering and wear reduction techniques. Techniques that simply reduce the rate of software-level writes may not reduce the failure rate of SSDs. Studies seeking to model the effects of reducing software-level writes on flash reliability should also consider how other aspects of SSD operation, such as system-level buffering and SSD controller wear leveling, affect the actual amount of data written to SSDs.

Conclusions. We hope that our new observations, with real workloads and real systems from the field, can aid in (1) understanding the effects of different factors, including system software, applications, and SSD controllers, on flash memory reliability, (2) the design of more reliable flash architectures and systems, and (3) improving the evaluation methodologies for future flash memory reliability studies.
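Observation 5 is something you can sanity-check on your own drives if the vendor exposes both host-write and NAND-write counters (for example via smartctl; attribute names differ by vendor, and some drives report in 32 MiB units). A rough Python sketch, assuming you have already read those two counters off the drive:

def write_amplification(host_units_written, nand_units_written):
    """Ratio of data actually written to flash vs. data the host asked to write.
    Above 1.0 means controller overheads (garbage collection, wear leveling)
    amplified writes; below 1.0 means buffering or compression absorbed some
    host writes before they ever reached the flash chips."""
    if host_units_written == 0:
        return float("nan")
    return nand_units_written / host_units_written

# Example: counters reported in 32 MiB units (some vendors use this scale).
host = 1_500_000    # ~45.8 TiB written by the OS
nand = 1_950_000    # ~59.5 TiB actually written to flash
print(f"write amplification factor: {write_amplification(host, nand):.2f}")  # 1.30

As the paper points out, the gap between the two counters is exactly why reducing software-level writes alone may not reduce wear by the same amount.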