Journal of Software, 2020, 31(10): 3184-3196

General Implementation of 1-D FFT on the Sunway 26010 Processor
ZHAO Yu-Wen, AO Yu-Long, YANG Chao, LIU Fang-Fang, YIN Wan-Wang, LIN Rong-Fen
(Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China)
Received:January 22, 2018    Revised:September 20, 2018
Abstract: A many-core parallel algorithm for the 1-D FFT, based on a two-layer decomposition, is proposed for the Sunway 26010 many-core processor. Built on the iterative Stockham FFT framework and the Cooley-Tukey FFT algorithm, it decomposes a large FFT into a series of small FFTs. Performance is improved through platform-specific optimizations: careful task partitioning, register communication, double buffering, and SIMD vectorization. Compared with the FFTW 3.3.4 library running on a single master core (MPE), the algorithm achieves an average speedup of 44.53x and a maximum speedup of 56.33x, with bandwidth utilization of up to 83.45%.
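The decomposition of a large FFT into small FFTs mentioned in the abstract can be illustrated with a single Cooley-Tukey step. The following is a minimal sketch, not the authors' implementation: the function names `naive_dft` and `decomposed_fft` and the factor sizes are hypothetical, a direct DFT stands in for the small FFT kernels, and none of the Sunway-specific optimizations (register communication, double buffering, SIMD) or the Stockham data layout are modeled.

```python
import cmath

def naive_dft(x):
    """Direct O(n^2) DFT; stands in for a small FFT kernel (illustrative only)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def decomposed_fft(x, n1, n2):
    """One Cooley-Tukey step: a length n1*n2 DFT computed as n2 DFTs of
    length n1, a twiddle multiplication, and n1 DFTs of length n2."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 small DFTs of length n1 over the strided subsequences x[j2::n2].
    cols = [naive_dft(x[j2::n2]) for j2 in range(n2)]  # indexed cols[j2][k1]
    # Step 2: multiply each intermediate value by the twiddle factor W_n^(j2*k1).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Step 3: n1 small DFTs of length n2; the output index is k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        row = naive_dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

Calling `decomposed_fft(x, 3, 4)` on a length-12 input should agree with `naive_dft(x)` up to floating-point rounding. Applying the same step recursively to the small transforms is one way to obtain a multi-level decomposition.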
CLC number: TP301
Foundation items:National Key Research and Development Program of China (2016YFB0200603); Beijing Natural Science Foundation, China (JQ18001)
Reference text:

ZHAO Yu-Wen, AO Yu-Long, YANG Chao, LIU Fang-Fang, YIN Wan-Wang, LIN Rong-Fen. General Implementation of 1-D FFT on the Sunway 26010 Processor. Journal of Software, 2020, 31(10): 3184-3196