HR-DCIM: High-Reliability Floating-Point Digital CIM Architecture with Unified Low-Cost Iterative Error Correction
Digital computing-in-memory (CIM) is a promising computing paradigm for neural network (NN) acceleration. However, while deploying digital CIM chips, we find that existing digital CIM designs face severe computing reliability issues, which are crucial for real product development but remain underexplored. Thus, this work pioneers a systematic computing reliability analysis for digital CIM across the off-memory and in-memory levels. We find that both off-memory floating-point (FP) exponent alignment and in-memory random cell bit-flip errors impair digital CIM’s computing reliability, causing significant accuracy loss from truncation and bit flips, respectively. Critically, existing reliability solutions are incompatible with the unique multi-row accumulation structure of digital CIM: they either severely degrade digital CIM’s performance or incur prohibitive overhead.
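As a rough illustration of the alignment truncation loss mentioned above, the following Python sketch aligns a group of values to the largest exponent in the group, truncates each one to a fixed-width integer mantissa, and accumulates the integers. The function name `align_and_accumulate` and the parameter `mantissa_bits` are hypothetical, and the model is a simplification assumed for illustration, not the datapath described in the paper.

```python
import math


def align_and_accumulate(values, mantissa_bits=8):
    """Simplified model of exponent alignment before integer accumulation.

    Assumption: all values in the group share the largest exponent, and the
    bits shifted out during alignment are truncated (dropped).
    """
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        return 0.0
    # math.frexp(v) returns (m, e) with v == m * 2**e and 0.5 <= |m| < 1.
    max_exp = max(math.frexp(v)[1] for v in nonzero)
    scale = 2.0 ** (mantissa_bits - max_exp)
    # int() truncates toward zero, discarding the low-order bits of small values.
    aligned = [int(v * scale) for v in values]
    return sum(aligned) / scale


values = [1.0, 0.001, 0.001, 0.001]
narrow = align_and_accumulate(values, mantissa_bits=8)
wide = align_and_accumulate(values, mantissa_bits=24)
print(narrow, wide)
```

With an 8-bit aligned mantissa the three small values are truncated to zero and the sum collapses to 1.0, while a 24-bit mantissa retains most of their contribution.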
To address the above challenges, we propose HR-DCIM: a high-reliability FP digital CIM architecture featuring unified low-cost iterative error correction. Specifically, for off-memory reliability, we propose an exponent-mantissa joint-alignment mechanism that repurposes the inherently invalid bits of aligned mantissas as compensation bits to reduce alignment truncation loss without degrading digital CIM’s performance. Then, for in-memory reliability, we propose a remainder aliasing-based unified multiply-accumulate (MAC) error correction mechanism that corrects possible MAC errors caused by various cell error cases with low-cost iteration. Experimental results show that the proposed techniques enable digital CIM to maintain high performance and efficiency across various operating voltage conditions without significant accuracy loss.
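The abstract does not detail the remainder aliasing mechanism, so the sketch below shows only a generic, textbook residue check on an integer MAC, as background on how remainders can expose a corrupted accumulation. It is not the paper's method and it only detects errors rather than correcting them. The helper names and the modulus of 3 are assumptions chosen for illustration, and the check assumes the reference residues are computed from uncorrupted weights.

```python
def mac(weights, inputs):
    """Plain integer multiply-accumulate."""
    return sum(w * x for w, x in zip(weights, inputs))


def expected_residue(weights, inputs, modulus=3):
    """Residue of the MAC result, computed only from operand residues."""
    total = sum((w % modulus) * (x % modulus) for w, x in zip(weights, inputs))
    return total % modulus


def residue_check(result, weights, inputs, modulus=3):
    """Return True when the result is consistent with the operand residues."""
    return result % modulus == expected_residue(weights, inputs, modulus)


weights = [3, -2, 5, 7]
inputs = [1, 4, -6, 2]

clean = mac(weights, inputs)
print(residue_check(clean, weights, inputs))    # consistent result passes

# Hypothetical cell error: flip bit 2 of the second stored weight.
corrupted = list(weights)
corrupted[1] ^= 1 << 2
faulty = mac(corrupted, inputs)
print(residue_check(faulty, weights, inputs))   # mismatch flags the error
```

A single modulus cannot catch every error: any deviation that is a multiple of the modulus leaves the residue unchanged, which is one reason practical schemes need more than this basic check.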
Mon 2 Feb (displayed time zone: Hobart)
14:10 - 15:30
14:10 | 20m | Talk | Predicting DRAM Failures at Scale: A Two-Stage Approach for Heterogeneous Systems | Main Conference | Chenglin Wang (Xiamen University), Shouxin Wang (Xiamen University), Shuyue Zhou (Xiamen University), Ronglong Wu (Xiamen University), Zhirong Shen (Xiamen University), Lu Tang (Xiamen University), Yiming Zhang (Xiamen University), Jialiang Yu (Huawei), Min Zhou (Huawei)
14:30 | 20m | Talk | MemSOS: OS-Guided Selective Memory Mirroring | Main Conference | Junghoon Kim (Seoul National University & Samsung Electronics), Jongheon Jeong (Seoul National University), Seokwon Moon (Seoul National University), Seong Hoon Seo (Seoul National University), Yeonhong Park (Seoul National University), Jinkyu Jeong (Yonsei University), Nam Sung Kim (UIUC), Jae W. Lee (Seoul National University)
14:50 | 20m | Talk | ASPA: Reassigning DDR5 Parity Bandwidth | Main Conference | Fan Li (University of Central Florida), Qiufeng Li (George Washington University), Yanan Guo (University of Rochester), Weidong Cao (George Washington University), Xin Xin (University of Central Florida)
15:10 | 20m | Talk | HR-DCIM: High-Reliability Floating-Point Digital CIM Architecture with Unified Low-Cost Iterative Error Correction | Main Conference | Zhen He (Tsinghua University), Yiqi Wang (Tsinghua University), Zhiheng Yue (Tsinghua University), Zihan Wu (Tsinghua University), Huiming Han (Tsinghua University), Shaojun Wei (Tsinghua University), Yang Hu (Tsinghua University), Fengbin Tu (The Hong Kong University of Science and Technology), Shouyi Yin (Tsinghua University)