MemSOS: OS-Guided Selective Memory Mirroring
Memory errors pose an escalating threat to datacenter reliability as DRAM technology scales to smaller nodes and servers process ever-larger datasets. While Error Correction Code (ECC) provides a first line of defense, uncorrectable errors still cause catastrophic server failures with significant economic impact. Memory mirroring offers complementary protection against these errors, but existing mirroring solutions reserve specific memory regions exclusively for mirroring, which incurs significant capacity overhead and limits widespread adoption. A recent proposal suggests leveraging free memory space for mirroring, but it leaves a critical research question unanswered: which data should be mirrored when free memory is limited? We therefore propose MemSOS, a selective memory mirroring system that dynamically chooses which pages to mirror based on their impact on system reliability. Specifically, MemSOS selects pages based on their criticality and recency. Criticality is evaluated by examining the page type, while recency serves as a proxy for the likelihood of future access. Our evaluation demonstrates that MemSOS reduces system Failures In Time (FIT) by up to 19,000× compared to a state-of-the-art partial mirroring scheme, while maintaining less than 3% performance overhead. In many cases, MemSOS achieves reliability levels comparable to full mirroring, underscoring its effectiveness in maximizing system availability under limited free memory space.
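As a rough illustration of selecting pages by criticality and recency under a limited free-memory budget, the following is a minimal sketch, not the authors' implementation. The page-type names, the criticality ranks, the `Page` record, and the `select_pages_to_mirror` helper are all hypothetical, and the sketch assumes that each mirrored page consumes exactly one free page and that criticality strictly dominates recency.

```python
from dataclasses import dataclass

# Hypothetical criticality ranks by page type (higher = more critical).
# The abstract only says criticality is derived from the page type;
# these specific types and values are assumptions for illustration.
CRITICALITY = {"kernel": 3, "page_table": 3, "anon": 2, "file_cache": 1}


@dataclass(frozen=True)
class Page:
    pfn: int          # page frame number
    page_type: str    # key into CRITICALITY
    last_access: int  # logical timestamp; larger means more recent


def select_pages_to_mirror(pages, free_pages):
    """Return up to `free_pages` pages, most critical first, then most recent.

    Assumes one free page is needed per mirrored page.
    """
    ranked = sorted(
        pages,
        key=lambda p: (CRITICALITY.get(p.page_type, 0), p.last_access),
        reverse=True,
    )
    return ranked[: max(free_pages, 0)]


pages = [
    Page(1, "file_cache", 90),
    Page(2, "kernel", 10),
    Page(3, "anon", 50),
    Page(4, "anon", 80),
]
selected = select_pages_to_mirror(pages, free_pages=2)
selected_pfns = [p.pfn for p in selected]
```

With room for only two mirrors, this sketch picks the kernel page first despite its older access time, then the more recently used of the two anonymous pages. A real system could weigh the two signals differently.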
Mon 2 Feb (displayed time zone: Hobart)
14:10 - 15:30
14:10 (20m) Talk | Predicting DRAM Failures at Scale: A Two-Stage Approach for Heterogeneous Systems | Main Conference | Chenglin Wang (Xiamen University), Shouxin Wang (Xiamen University), Shuyue Zhou (Xiamen University), Ronglong Wu (Xiamen University), Zhirong Shen (Xiamen University), Lu Tang (Xiamen University), Yiming Zhang (Xiamen University), Jialiang Yu (Huawei), Min Zhou (Huawei)
14:30 (20m) Talk | MemSOS: OS-Guided Selective Memory Mirroring | Main Conference | Junghoon Kim (Seoul National University & Samsung Electronics), Jongheon Jeong (Seoul National University), Seokwon Moon (Seoul National University), Seong Hoon Seo (Seoul National University), Yeonhong Park (Seoul National University), Jinkyu Jeong (Yonsei University), Nam Sung Kim (UIUC), Jae W. Lee (Seoul National University)
14:50 (20m) Talk | ASPA: Reassigning DDR5 Parity Bandwidth | Main Conference | Fan Li (University of Central Florida), Qiufeng Li (George Washington University), Yanan Guo (University of Rochester), Weidong Cao (George Washington University), Xin Xin (University of Central Florida)
15:10 (20m) Talk | HR-DCIM: High-Reliability Floating-Point Digital CIM Architecture with Unified Low-Cost Iterative Error Correction | Main Conference | Zhen He (Tsinghua University), Yiqi Wang (Tsinghua University), Zhiheng Yue (Tsinghua University), Zihan Wu (Tsinghua University), Huiming Han (Tsinghua University), Shaojun Wei (Tsinghua University), Yang Hu (Tsinghua University), Fengbin Tu (The Hong Kong University of Science and Technology), Shouyi Yin (Tsinghua University)