CTRU-Prime High-throughput Implementation Based on CUDA Core and Tensor Core

CLC Number: TP309

    Abstract:

    The rapid development of quantum computers poses a significant threat to existing cryptographic systems, which makes the implementation of post-quantum cryptographic algorithms, and migration to them, a pressing task. Among these algorithms, NTRU lattice-based schemes have attracted attention for their simplicity and computational efficiency. The CTRU-Prime scheme, based on NTRU lattices, stands out for its security, bandwidth, and implementation efficiency. Because GPUs are well suited to large-scale parallel workloads, this study presents the first high-throughput implementation of CTRU-Prime using Tensor Cores and compute unified device architecture (CUDA) Cores. The underlying algebraic structure of CTRU-Prime is the large-Galois-group prime-degree prime-ideal number field (LPPNF), which resists attacks that target cyclotomic rings but also makes polynomial multiplication harder to implement. First, two GPU implementations of polynomial multiplication over LPPNF are proposed. The CUDA Core-based pseudo-Mersenne incomplete number theoretic transform (NTT) polynomial multiplication uses layer fusion to optimize memory access patterns and achieves a throughput speedup of 256.98 times. The Tensor Core-based schoolbook polynomial multiplication converts polynomial multiplication into matrix operations and uses low-precision matrix-multiply-and-accumulate (MMA) operations, achieving a throughput speedup of 177.24 times. Next, a throughput-oriented overall architecture for CTRU-Prime on the GPU platform is presented. The architecture combines batch and single modes with multi-stream and multi-threading techniques. Fused kernels, coalesced global memory access, and optimized memory access patterns are used to speed up memory access and computation in the individual kernel functions.
Experimental results show that, on the RTX 3060 platform, CTRU-Prime-653, CTRU-Prime-761, and CTRU-Prime-1277 perform key generation at 63,000, 54,000, and 16,000 operations per second, respectively; key encapsulation at 635,000, 2,745,000, and 1,601,000 operations per second, respectively; and key decapsulation at 351,000, 2,622,000, and 1,524,000 operations per second, respectively. Compared to the C implementation, these rates are 68.85, 79.78, and 66.84 times higher for key generation; 10.32, 46.57, and 46.81 times higher for key encapsulation; and 11.43, 89.19, and 90.32 times higher for key decapsulation. Compared to the latest Kyber implementation, the key encapsulation throughput is 1.46 times higher and the key decapsulation throughput is 1.74 times higher. The implementation is also 26 times more efficient than other high-throughput NTRU lattice-based GPU implementations.
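
As an illustration of how schoolbook polynomial multiplication can be converted into matrix operations, the following is a minimal CPU-side sketch, not the authors' GPU kernel. It assumes the ring Z_q[x]/(x^p - x - 1), the form used by NTRU Prime-style schemes, and uses toy values for P and Q that are hypothetical and far smaller than any real parameter set. All function names are made up for this example. Column j of the matrix holds a(x)·x^j reduced in the ring, so multiplying that matrix by a batch of coefficient vectors yields a batch of products in one matrix-matrix multiplication, which is the shape of work that MMA units accept.

```python
# Toy parameters for illustration only (assumptions, not CTRU-Prime values).
P = 7      # degree of the reduction polynomial x^P - x - 1
Q = 113    # coefficient modulus

def mul_by_x(v):
    """Multiply v(x) by x in Z_Q[x]/(x^P - x - 1), using x^P = x + 1."""
    top = v[-1]                 # coefficient that would overflow to x^P
    out = [0] + v[:-1]          # shift every coefficient up by one degree
    out[0] = (out[0] + top) % Q
    out[1] = (out[1] + top) % Q
    return out

def mult_matrix(a):
    """P x P matrix whose column j is a(x) * x^j reduced in the ring."""
    cols = []
    v = [c % Q for c in a]
    for _ in range(P):
        cols.append(v)
        v = mul_by_x(v)
    return [[cols[j][i] for j in range(P)] for i in range(P)]

def to_columns(polys):
    """Stack a batch of coefficient lists as the columns of a P x n matrix."""
    return [[poly[k] % Q for poly in polys] for k in range(P)]

def matmul_mod(M, B):
    """Plain matrix product M @ B mod Q (the step an MMA unit would perform)."""
    n = len(B[0])
    return [[sum(M[i][k] * B[k][j] for k in range(P)) % Q for j in range(n)]
            for i in range(P)]

def schoolbook_reference(a, b):
    """Direct schoolbook product followed by reduction, used as a cross-check."""
    c = [0] * (2 * P - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    for i in range(2 * P - 2, P - 1, -1):   # fold x^i into x^(i-P+1) + x^(i-P)
        c[i - P] += c[i]
        c[i - P + 1] += c[i]
        c[i] = 0
    return [x % Q for x in c[:P]]

a = [1, 2, 3, 4, 5, 6, 7]
batch = [[3, 0, 1, 0, 2, 0, 5], [9, 8, 7, 6, 5, 4, 3]]
C = matmul_mod(mult_matrix(a), to_columns(batch))
products = [[C[i][j] for i in range(P)] for j in range(len(batch))]
```

Each entry of `products` is one polynomial product from the batch. On a GPU, the modular reduction of the accumulated entries and the splitting of coefficients into low-precision limbs would need separate handling, which this sketch does not attempt.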

Get Citation

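Hu Xiaowen, Zou Hengchuan, Shen Shiyu, Li Wenqian, Zhao Yunlei. CTRU-Prime High-throughput Implementation Based on CUDA Core and Tensor Core. Journal of Software (软件学报),,():1-20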

History
  • Received: November 26, 2024
  • Revised: March 17, 2025
  • Online: January 14, 2026
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4