An Efficient and Scalable Hardware Architecture for Number Theoretic Transform on FPGA with Design Automation (HPCA 2026 - Main Conference)

Who

Yilan Zhu, Geng Yang, Xingyu Tian, Dilshan Kumarathunga, Liang Kong, Xianglong Deng, Shengyu Fan, Guang Fan, Guiming Shi, Lei Chen, Bo Zhang, Yisong Chang, Shoumeng Yan, Zhenman Fang, Mingzhe Zhang

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 3 Feb 2026 12:10 - 12:30 at Cronulla - Zero-Knowledge and Private Information Retrieval. Chair(s): Hanjun Kim

Abstract

Fully Homomorphic Encryption (FHE) has become a promising approach to protecting data privacy in emerging application scenarios. Unfortunately, FHE suffers from significant processing speed degradation compared to plaintext computation, with one of the primary bottlenecks being the time-consuming Number Theoretic Transform (NTT). Therefore, accelerating NTT to accommodate various FHE parameters is crucial to advancing FHE towards practical use. With highly reconfigurable and performant logical fabrics, Field Programmable Gate Arrays (FPGAs) have exhibited great potential in NTT acceleration.

By decomposing a large-point NTT with strong data dependencies into independent, simple small-point NTTs, the emerging Ten-step NTT (TNTT) algorithm intuitively enables higher parallelism and thereby has the potential to deliver better performance than traditional algorithms. However, our quantitative analysis reveals that TNTT exhibits significant performance degradation as parallelism increases, due to additional varying-size transpositions and Hadamard products.
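The general idea of splitting one large NTT into small independent NTTs, at the price of an extra element-wise (Hadamard) twiddle multiplication and a transposition, can be illustrated with the classic four-step decomposition. The sketch below is not the paper's TNTT algorithm or the AutoNest dataflow; it only shows the underlying principle. The function names, the toy modulus q = 17, and the root omega = 3 are assumptions chosen for illustration.

```python
def naive_ntt(x, omega, q):
    """Reference O(n^2) cyclic NTT: X[k] = sum_j x[j] * omega^(j*k) mod q."""
    n = len(x)
    return [sum(x[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def four_step_ntt(x, n1, n2, omega, q):
    """N-point NTT (N = n1 * n2) built from small NTTs.

    omega must be a primitive N-th root of unity modulo q.
    The input is viewed as an n1 x n2 row-major matrix a[r][c] = x[n2*r + c].
    """
    n = n1 * n2
    assert len(x) == n
    a = [x[r * n2:(r + 1) * n2] for r in range(n1)]
    w_col = pow(omega, n2, q)  # primitive n1-th root, used by column NTTs
    w_row = pow(omega, n1, q)  # primitive n2-th root, used by row NTTs

    # Step 1: n2 independent n1-point NTTs, one per column.
    cols = [naive_ntt([a[r][c] for r in range(n1)], w_col, q)
            for c in range(n2)]

    # Step 2: Hadamard product with twiddle factors omega^(c * k1).
    b = [[cols[c][k1] * pow(omega, c * k1, q) % q for c in range(n2)]
         for k1 in range(n1)]

    # Step 3: n1 independent n2-point NTTs, one per row.
    rows = [naive_ntt(b[k1], w_row, q) for k1 in range(n1)]

    # Step 4: transposed read-out, X[k1 + n1*k2] = rows[k1][k2].
    out = [0] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = rows[k1][k2]
    return out


if __name__ == "__main__":
    q, omega = 17, 3           # 3 has multiplicative order 16 modulo 17
    x = list(range(16))
    assert four_step_ntt(x, 4, 4, omega, q) == naive_ntt(x, omega, q)
    print("four-step result matches the naive NTT")
```

In this sketch the small NTTs in steps 1 and 3 are mutually independent, which is where the extra parallelism comes from, while steps 2 and 4 are the Hadamard product and transposition overheads of the kind the abstract refers to.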

This paper proposes AutoNest, an efficient and scalable hardware architecture, along with an accelerator auto-generation framework for TNTT. The proposed hardware architecture maximizes performance by 1) adopting a 2D block decomposition dataflow to address critical path delays in the transpose logic, thereby improving clock frequency, and 2) integrating algorithm-level, cost-free twiddle factor fusion to reduce the number of modular multiplications in Hadamard products, thereby allowing higher on-chip parallelism. Moreover, we deliver an accelerator generation framework that conducts automated design space exploration to derive a performant TNTT architecture within the target FPGA's resource budget for user-defined FHE parameters. Experimental results on the AMD-Xilinx U280 FPGA demonstrate that NTT accelerators generated by AutoNest achieve an average speedup of 2.31× compared to prior designs.

Yilan Zhu, Ant Group
Geng Yang, Ant Group
Xingyu Tian, Simon Fraser University
Dilshan Kumarathunga, Simon Fraser University
Liang Kong, Ant Group
Xianglong Deng, UCAS
Shengyu Fan, UCAS
Guang Fan, Ant Group
Guiming Shi, Tsinghua University
Lei Chen, University of Chinese Academy of Sciences, China
Bo Zhang, Ant Group
Yisong Chang, Ant Group
Shoumeng Yan, Ant Group
Zhenman Fang, Simon Fraser University, Canada
Mingzhe Zhang, Ant Group, China

Session Program

Tue 3 Feb
Displayed time zone: Hobart

11:30 - 12:50	Zero-Knowledge and Private Information Retrieval (Main Conference) at Cronulla. Chair(s): Hanjun Kim (POSTECH)

11:30 20m Talk		zkPHIRE: A Programmable Accelerator for ZKPs over HIgh-degRee, Expressive Gates (Main Conference). Alhad Daftardar (New York University), Jianqiao Cambridge Mo (New York University), Joey Ah-kiow (New York University), Benedikt Bünz (New York University), Siddharth Garg (New York University), Brandon Reagen (New York University)
11:50 20m Talk		Conflux: A High-Performance Keyword Private Retrieval System for Dynamic Datasets (Main Conference). Zehao Chen (Shandong University), Zhaoyan Shen (Shandong University), Qian Wei (Shandong University), Hang Lu (Institute of Computing Technology, Chinese Academy of Sciences), Lei Ju (Shandong University)
12:10 20m Talk		An Efficient and Scalable Hardware Architecture for Number Theoretic Transform on FPGA with Design Automation (Main Conference). Yilan Zhu (Ant Group), Geng Yang (Ant Group), Xingyu Tian (Simon Fraser University), Dilshan Kumarathunga (Simon Fraser University), Liang Kong (Ant Group), Xianglong Deng (UCAS), Shengyu Fan (UCAS), Guang Fan (Ant Group), Guiming Shi (Tsinghua University), Lei Chen (University of Chinese Academy of Sciences), Bo Zhang (Ant Group), Yisong Chang (Ant Group), Shoumeng Yan (Ant Group), Zhenman Fang (Simon Fraser University), Mingzhe Zhang (Ant Group)
12:30 20m Talk		IVE: An Accelerator for Single-Server Private Information Retrieval Using a Versatile Processing Element (Main Conference). Sangpyo Kim (Seoul National University), Hyesung Ji (Seoul National University), Jongmin Kim (Seoul National University), Jaiyoung Park (Seoul National University), Wonseok Choi (Seoul National University), Jung Ho Ahn (Seoul National University)