HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

SPHINCS$^+$ is a stateless hash-based signature scheme known for its strong post-quantum security, but its signature generation is slow because it relies heavily on hash operations. The parallel architecture of GPUs offers a potential advantage for accelerating the computation of SPHINCS$^+$ signatures. However, existing GPU-based optimization efforts for SPHINCS$^+$ either do not fully exploit the inherent parallelism of its Merkle tree-based structure or lack fine-grained, compiler-level customization tailored to its diverse computational kernels.

This paper proposes HERO-Sign, which adopts hierarchical tuning methodologies and efficient compiler-time GPU optimizations for SPHINCS$^+$. HERO-Sign rethinks the parallelization potential arising from data independence in the components of SPHINCS$^+$, including FORS (Forest of Random Subsets), MSS (Merkle Signature Scheme), and WOTS$^+$ (Winternitz One-Time Signature Plus). First, it introduces a Tree Fusion strategy for FORS, whose structure contains a large number of branches. This FORS Fusion strategy is supported by an automated Tree Tuning search algorithm, which lets it adapt and optimize fusion schemes across various GPU platforms. To further enhance performance, HERO-Sign adopts an adaptive compilation strategy that accounts for the varying effectiveness of compiler optimizations across the different SPHINCS$^+$ component kernels (\texttt{FORS_Sign}, \texttt{TREE_Sign}, \texttt{WOTS$^+$_Sign}). This strategy automatically selects between PTX and native branches during the compilation phase to maximize efficiency. For multiple batches of message signatures, HERO-Sign optimizes kernel-level overlapping and employs a Task Graph-based construction strategy to minimize multi-stream idle time and reduce kernel launch overhead. Compared to state-of-the-art GPU implementations, HERO-Sign improves throughput on an RTX 4090 by 1.28$\times$–3.13$\times$, 1.28$\times$–2.92$\times$, and 1.24$\times$–2.60$\times$ under the SPHINCS$^+$-128f, SPHINCS$^+$-192f, and SPHINCS$^+$-256f parameter sets, respectively. Similar performance improvements are achieved on other architectures, including the A100, H100, and GTX 2080. HERO-Sign also reduces kernel launch latency by two orders of magnitude.
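The data independence mentioned in the abstract can be illustrated with a minimal Python sketch of a Merkle tree reduction. This is not the HERO-Sign implementation: the names `hash_pair`, `merkle_root`, and `forest_roots` are hypothetical, and plain SHA-256 stands in for the tweakable, address-dependent hashing that SPHINCS$^+$ uses. The sketch only shows where the independent work sits: pairs within one tree level, and whole trees within a forest.

```python
import hashlib


def hash_pair(left: bytes, right: bytes) -> bytes:
    # Assumption: plain SHA-256 replaces the tweakable hash of SPHINCS+.
    return hashlib.sha256(left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Reduce a power-of-two list of leaf hashes to a single root."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("number of leaves must be a power of two")
    level = leaves
    while len(level) > 1:
        # Each pair at this level depends only on its own two children,
        # so on a GPU every pair could be handled by a separate thread.
        level = [hash_pair(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def forest_roots(trees: list[list[bytes]]) -> list[bytes]:
    # Trees in a forest share no data, so they could run concurrently too.
    return [merkle_root(tree) for tree in trees]
```

The only sequential dependency in this sketch is between successive levels of one tree; the work inside a level and across trees is independent, which is the kind of structure a fusion or tuning strategy on a GPU can exploit.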

Wed 4 Feb

Displayed time zone: Hobart

09:50 - 11:10
Hardware Security and Side-Channel Defenses
Main Conference at Collaroy
Chair(s): Georgios Vavouliotis Huawei Zurich Research Center, Switzerland
09:50
20m
Talk
DSASSASSIN: Cross-VM Side-Channel Attacks by Exploiting Intel Data Streaming Accelerator
Main Conference
Ben Chen The Hong Kong University of Science and Technology (Guangzhou), Kunlin Li The Hong Kong University of Science and Technology (Guangzhou), Shuwen Deng Tsinghua University, Dongsheng Wang Tsinghua University, Yun Chen The Hong Kong University of Science and Technology (Guangzhou)
10:10
20m
Talk
SSBleed: Non-speculative Side-channel Attacks via Speculative Store Bypass on Armv9 CPUs
Main Conference
Chang Liu Tsinghua University, Hongpei Zheng Tsinghua University, Xin Zhang Peking University, Dapeng Ju Tsinghua University, Dongsheng Wang Tsinghua University, Yinqian Zhang Southern University of Science and Technology, Trevor E. Carlson National University of Singapore
10:30
20m
Talk
Protean: A Programmable Spectre Defense
Main Conference
Nicholas Mosier Stanford University, Hamed Nemati KTH Royal Institute of Technology, John C. Mitchell Stanford University, Caroline Trippel Stanford University
10:50
20m
Talk
HERO-Sign: Hierarchical Tuning and Efficient Compiler-Time GPU Optimizations for SPHINCS$^+$ Signature Generation
Main Conference
Yaoyun Zhou University of California, Merced, Qian Wang University of California, Merced (UC Merced)