PinDrop: Breaking the Silence on SDCs in a Large-Scale Fleet
Silent Data Corruptions (SDCs) pose a significant and often hidden threat to the reliability of large-scale computing infrastructure, as they can compromise data integrity without immediate detection. Detecting such behaviors at hyper-scale is challenging due to their intermittent nature and the vast diversity of hardware and workloads in a large-scale fleet. This work addresses these challenges by introducing PinDrop, a characterization methodology that uses continuous, high-frequency testing infrastructure across millions of servers to gather information about SDCs at scale. Using extensive test suites tailored to mimic real-world applications and exercise a wide range of CPU features, we provide the most comprehensive characterization of SDC failures to date, analyzing over 500 million test executions across millions of devices. Our findings reveal that 0.035% of tested machines suffer from at least one SDC failure during their lifetime. Examining years of data, rather than a single testing snapshot in time, we observe SDCs emerging long after initial deployment and persisting over time. Detailed analysis shows that an average of 0.0024% of tested machines begin failing in each quarter they are tested beyond an initial burn-in period, confirming a critical need for continuous testing. Our findings also provide further insights into the detailed SDC behaviors observed, including failure breakdowns across architectures, test families, specific core IDs, and output-level behaviors.
Mon 2 Feb (displayed time zone: Hobart)
09:50 - 11:10
09:50 | 20m | Talk | Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models | Main Conference | Chiyue Wei (Duke University), Cong Guo (Duke University), Junyao Zhang (Duke University), Haoxuan Shan (Duke University), Yifan Xu (Duke University), Ziyue Zhang (Duke University), Yudong Liu (Duke University), Qinsi Wang (Duke University), Changchun Zhou (Duke University), Hai "Helen" Li (Duke University), Yiran Chen (Duke University)
10:10 | 20m | Talk | LoCaLUT: Harnessing Capacity–Computation Tradeoffs for LUT-Based Inference in DRAM-PIM | Main Conference | Junguk Hong (Seoul National University), Changmin Shin (Seoul National University), Sukjin Kim (Seoul National University), Si Ung Noh (Seoul National University), Taehee Kwon (Seoul National University), Seongyeon Park (Seoul National University), Hanjun Kim (Yonsei University), Youngsok Kim (Yonsei University), Jinho Lee (Seoul National University)
10:30 | 20m | Talk | RPU - A Reasoning Processing Unit | Main Conference | Matthew Adiletta (Harvard University), David Brooks (Harvard University), Gu-Yeon Wei (Harvard University)
10:50 | 20m | Talk | PinDrop: Breaking the Silence on SDCs in a Large-Scale Fleet | Main Conference | Peter W. Deutsch (Massachusetts Institute of Technology/Meta), Harish D. Dixit (Meta), Gautham Vunnam (Meta), Carl Moran (Meta), Eleanor Ozer (Meta), Sriram Sankar (Meta)