HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

Memory errors pose an escalating threat to datacenter reliability as DRAM technology scales to smaller nodes and servers process ever-larger datasets. While Error Correction Code (ECC) provides a first line of defense, uncorrectable errors still cause catastrophic server failures with significant economic impact. Memory mirroring offers complementary protection against these errors, but existing memory mirroring solutions reserve specific memory regions exclusively for mirroring, which incurs significant capacity overhead and thus limits wide adoption. A recent proposal suggests leveraging free memory space for mirroring, but it leaves a critical research question unanswered: which data to mirror when free memory is limited. To answer this question, we propose MemSOS, a selective memory mirroring system that dynamically chooses which pages to mirror based on their impact on system reliability. Specifically, MemSOS selects pages to mirror based on their criticality and recency. Criticality is evaluated by examining the page type, while recency serves as a proxy for the likelihood of future access. Our evaluation demonstrates that MemSOS reduces system Failures In Time (FIT) by up to 19,000× compared to a state-of-the-art partial mirroring scheme, while maintaining less than 3% performance overhead. In many cases, MemSOS achieves reliability levels comparable to full mirroring, underscoring its effectiveness in maximizing system availability under limited free memory space.
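
The selection idea in the abstract (rank pages by criticality, then by recency, and mirror as many as free memory allows) can be illustrated with a minimal sketch. This is not the MemSOS implementation: the abstract does not say how the two signals are combined, so the lexicographic ordering, the page types, the criticality weights, and the names `Page`, `CRITICALITY`, and `select_pages_to_mirror` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical criticality ranks by page type (higher = more critical).
# These types and weights are illustrative assumptions, not taken from MemSOS.
CRITICALITY = {"kernel": 3, "page_table": 3, "anon": 2, "file_clean": 1}


@dataclass(frozen=True)
class Page:
    pfn: int          # page frame number
    page_type: str    # key into CRITICALITY
    last_access: int  # logical timestamp of last access (larger = more recent)


def select_pages_to_mirror(pages, free_frames):
    """Pick at most `free_frames` pages: most critical first, then most recent."""
    if free_frames <= 0:
        return []
    ranked = sorted(
        pages,
        # Assumed policy: criticality dominates, recency breaks ties.
        key=lambda p: (CRITICALITY.get(p.page_type, 0), p.last_access),
        reverse=True,
    )
    return ranked[:free_frames]


pages = [
    Page(pfn=10, page_type="file_clean", last_access=90),
    Page(pfn=11, page_type="kernel", last_access=5),
    Page(pfn=12, page_type="anon", last_access=70),
    Page(pfn=13, page_type="anon", last_access=80),
]
chosen = select_pages_to_mirror(pages, free_frames=2)
print([p.pfn for p in chosen])  # [11, 13]
```

With only two free frames, the sketch mirrors the kernel page even though it was accessed long ago, then the more recently used of the two anonymous pages. A real system would also have to re-evaluate the choice as free memory grows or shrinks, which this sketch does not model.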