Predicting DRAM Failures at Scale: A Two-Stage Approach for Heterogeneous Systems
Memory failures in large-scale production environments pose critical threats to system reliability and service availability. While existing studies have conducted in-depth analyses of the temporal and spatial correlations of memory errors, differences in characteristics across architectures remain largely unexplored. To uncover these overlooked correlations, this paper conducts an extensive analysis of over 130,000 DDR4 DIMMs collected from large-scale heterogeneous production clusters over a nine-month period. Through systematic spatial and temporal analysis across Intel x86v5 architectures and four major DRAM vendors, we uncover five new findings and propose a novel two-stage training strategy. This strategy addresses sample quality issues by applying temporal weighting to positive samples and adaptive reweighting to negative samples. It also incorporates comprehensive multi-dimensional feature engineering, covering static, spatial, temporal, and micro-level characteristics. Finally, it integrates dual-driven sampling strategies and adaptive prediction timing to balance prediction accuracy and operational efficiency. Extensive evaluation shows that our CatBoost-based model achieves F1-scores of 49.9% on Intel x86v5 and 57.6% on Intel x86v6, substantially outperforming existing methods. This cross-architecture validation demonstrates the robustness and generalization of our approach across different hardware platforms. To the best of our knowledge, our work presents the first large-scale cross-architecture analysis of memory error patterns and establishes new benchmarks for production-scale memory failure prediction systems.
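The abstract does not state how positive samples are weighted over time, so the following is only a minimal sketch of one plausible form of temporal weighting, not the authors' method. It assumes that a positive sample counts for more the closer it lies to the eventual failure, and it uses a hypothetical half-life decay; the names `temporal_weight`, `sample_weights`, `half_life_days`, and `neg_weight` are all illustrative. The adaptive reweighting of negative samples is not attempted here; negatives simply receive a constant weight.

```python
def temporal_weight(days_before_failure: float, half_life_days: float = 7.0) -> float:
    """Hypothetical decay: the weight halves for every `half_life_days`
    that separate the sample from the failure it precedes."""
    if days_before_failure < 0:
        raise ValueError("a positive sample must precede its failure")
    return 0.5 ** (days_before_failure / half_life_days)


def sample_weights(labels, days_before_failure, neg_weight=1.0, half_life_days=7.0):
    """Build one weight per sample.

    labels              -- 1 for a DIMM that later failed, 0 otherwise
    days_before_failure -- days between sample and failure (ignored for label 0)
    neg_weight          -- constant weight for negatives (an assumption; the
                           paper describes an adaptive scheme instead)
    """
    weights = []
    for label, days in zip(labels, days_before_failure):
        if label == 1:
            weights.append(temporal_weight(days, half_life_days))
        else:
            weights.append(neg_weight)
    return weights


if __name__ == "__main__":
    # Two positives (0 and 14 days before failure) and one negative.
    print(sample_weights([1, 0, 1], [0, None, 14]))
```

Weights of this kind would typically be handed to a gradient-boosting trainer through its per-sample weight argument, so that late, more informative positives influence the loss more than early ones.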
Mon 2 Feb (displayed time zone: Hobart)
14:10 - 15:30
Time | Duration | Type | Title | Track | Authors
14:10 | 20m | Talk | Predicting DRAM Failures at Scale: A Two-Stage Approach for Heterogeneous Systems | Main Conference | Chenglin Wang (Xiamen University), Shouxin Wang (Xiamen University), Shuyue Zhou (Xiamen University), Ronglong Wu (Xiamen University), Zhirong Shen (Xiamen University), Lu Tang (Xiamen University), Yiming Zhang (Xiamen University), Jialiang Yu (Huawei), Min Zhou (Huawei)
14:30 | 20m | Talk | MemSOS: OS-Guided Selective Memory Mirroring | Main Conference | Junghoon Kim (Seoul National University & Samsung Electronics), Jongheon Jeong (Seoul National University), Seokwon Moon (Seoul National University), Seong Hoon Seo (Seoul National University), Yeonhong Park (Seoul National University), Jinkyu Jeong (Yonsei University), Nam Sung Kim (UIUC), Jae W. Lee (Seoul National University)
14:50 | 20m | Talk | ASPA: Reassigning DDR5 Parity Bandwidth | Main Conference | Fan Li (University of Central Florida), Qiufeng Li (George Washington University), Yanan Guo (University of Rochester), Weidong Cao (George Washington University), Xin Xin (University of Central Florida)
15:10 | 20m | Talk | HR-DCIM: High-Reliability Floating-Point Digital CIM Architecture with Unified Low-Cost Iterative Error Correction | Main Conference | Zhen He (Tsinghua University), Yiqi Wang (Tsinghua University), Zhiheng Yue (Tsinghua University), Zihan Wu (Tsinghua University), Huiming Han (Tsinghua University), Shaojun Wei (Tsinghua University), Yang Hu (Tsinghua University), Fengbin Tu (The Hong Kong University of Science and Technology), Shouyi Yin (Tsinghua University)