CoCoTree: A Computation-Capable Architecture for Collective Communication in Scalable PIM
The growing demand for high-bandwidth and large-capacity memory access in data-intensive workloads has driven the development and deployment of Processing-in-Memory (PIM) architectures. However, existing DIMM-based PIM systems suffer from a severe communication bottleneck between the processing elements (PEs) near the PIM banks, because inter-PE traffic must be forwarded through the host CPU. This bottleneck limits the efficiency of collective operations and degrades scalability and performance for workloads that require inter-PE communication.
To address this communication limitation, we propose CoCoTree, a computation-capable architecture for collective communication in scalable DIMM-based PIM. CoCoTree supports direct, high-throughput inter-PE communication without host intervention. It accelerates key collective communication operations using a novel hierarchical binary tree topology and lightweight in-network computation support. We design and implement microarchitectures for the two main building blocks, Co-Leaf and Co-Node, to efficiently handle data packing, routing, and processing in CoCoTree. We also introduce a packet-based communication protocol tailored to the CoCoTree architecture, which decouples control and data through a two-phase configuration-computation communication mechanism to efficiently support a wide range of collective communication operations. CoCoTree effectively mitigates inter-PE communication bottlenecks, enabling scalable PIM systems capable of meeting the demands of growing data sizes. Experimental results show that CoCoTree achieves up to 95.6× improvement for collective operations and improves end-to-end application performance by up to 10.5× across various workloads over the baseline PIM, while outperforming state-of-the-art PIM communication architectures in both performance and scalability.
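The general idea of combining data inside a binary tree, rather than forwarding it through a host, can be illustrated with a small software model. The sketch below is not the CoCoTree design: it does not model Co-Leaf, Co-Node, packets, or the two-phase protocol. The function name `tree_allreduce` and the power-of-two leaf count are assumptions made only for illustration.

```python
import operator
from typing import Callable, List, Tuple


def tree_allreduce(values: List[int],
                   op: Callable[[int, int], int]) -> Tuple[List[int], int]:
    """Toy all-reduce over a binary tree (hypothetical helper, not from the paper).

    Each leaf holds one value. Every internal node combines the results of
    its two children, so the reduction happens "in the network". The root
    result is then handed back to every leaf.

    Returns (per-leaf results, number of tree levels traversed upward).
    Assumption: the number of leaves is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of leaves must be a power of two")

    level = list(values)
    levels_up = 0
    while len(level) > 1:
        # One tree level: each parent combines its left and right child.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels_up += 1

    root = level[0]
    # Broadcast phase: every leaf receives the reduced value.
    return [root] * n, levels_up


# Usage: eight leaves holding 1..8, reduced with addition.
result, levels = tree_allreduce(list(range(1, 9)), operator.add)
print(result, levels)
```

In this model the number of upward steps grows with the logarithm of the leaf count (three levels for eight leaves), which is the usual motivation for tree-shaped reduction; the actual hardware behavior and performance are described only by the paper itself.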
Mon 2 Feb (displayed time zone: Hobart)
15:50 - 17:10 | Processing-in-Memory Architectures | Main Conference at Collaroy | Chair(s): Byeongho Kim (Samsung Electronics)
15:50 20m Talk | The Memory Processing Unit: A Generalized Interface for End-to-End In-Memory Execution | Main Conference | Minh S. Q. Truong (Carnegie Mellon University), Yiqiu Sun (University of Illinois Urbana-Champaign), Dawei Xiong (University of Illinois Urbana-Champaign), Amol Shah (University of Illinois Urbana-Champaign), Alex Glass (Carnegie Mellon University), Abraham Farrell (University of Illinois Urbana-Champaign), James A. Bain (Carnegie Mellon University), L. Richard Carley (Carnegie Mellon University), Saugata Ghose (University of Illinois Urbana-Champaign)
16:10 20m Talk | CoCoTree: A Computation-Capable Architecture for Collective Communication in Scalable PIM | Main Conference | Shunchen Shi (Institute of Computing Technology, Chinese Academy of Sciences ; University of Chinese Academy of Sciences), Qijia Yang (Institute of Computing Technology, Chinese Academy of Sciences ; University of Chinese Academy of Sciences), Fan Yang (Institute of Computing Technology, Chinese Academy of Science), Yu Huang (Huazhong University of Science and Technology), Youwei Zhuo (Peking University), Zhichun Li (Institute of Computing Technology, Chinese Academy of Sciences ; University of Chinese Academy of Sciences), Ninghui Sun (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences), Xueqi Li (State Key Lab of Processors, Institute of Computing Technology, CAS)
16:30 20m Talk | PIM-malloc: A Fast and Scalable Dynamic Memory Allocator for Processing-In-Memory (PIM) Architectures | Main Conference
16:50 20m Talk | Count2Multiply: Reliable In-Memory High-Radix Counting | Main Conference | Joao Paulo Cardoso de Lima (TU Dresden, ScaDS.AI), Benjamin F. Morris III (Duke University), Asif Ali Khan (TU Dresden, Germany), Jeronimo Castrillon (TU Dresden, Germany), Alex Jones (Syracuse University)