Many-core Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro

doi:10.13328/j.cnki.jos.006527

Journal of Software, 2023, 34(9): 4421-4436

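Authors: Hu Yi (胡怡), Chen Daokun (陈道琨), Yang Chao (杨超), Liu Fangfang (刘芳芳), Ma Wenjing (马文静), Yin Wanwang (尹万旺), Yuan Xinhui (袁欣辉), Lin Rongfen (林蓉芬)
Author biographies: Hu Yi (b. 1995), female, Ph.D. candidate; research interests: high-performance computing, heterogeneous parallelism, BLAS libraries, and algorithms for dense matrices. Chen Daokun (b. 1994), male, Ph.D. candidate; research interests: high-performance computing, heterogeneous parallelism, and algorithms for sparse matrices. Yang Chao (b. 1979), male, Ph.D., professor, doctoral supervisor; research interests: high-performance computing, scientific and engineering computing. Liu Fangfang (b. 1982), female, professor-level senior engineer, CCF professional member; research interests: high-performance extended math libraries and supercomputer benchmarking software. Ma Wenjing (b. 1981), female, associate researcher, CCF professional member; research interests: high-performance computing, code generation and optimization. Yin Wanwang (b. 1980), male, associate researcher; research interests: high-performance computing, numerical simulation, and parallel debugging. Yuan Xinhui (b. 1989), male, assistant researcher; research interests: hardware-software co-design, parallel algorithm design and optimization. Lin Rongfen (b. 1984), female, engineer; research interests: high-performance computing and its applications.
Corresponding author: Yang Chao, E-mail: chao_yang@pku.edu.cn
Funding: National Key Research and Development Program of China (2020YFB0204601)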



Abstract:

BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements high-performance Level 1 and Level 2 BLAS routines for SW26010-Pro, a Chinese domestic many-core processor. A reduction strategy among the slave cores (CPEs) is designed on top of the RMA communication mechanism, which improves the reduction efficiency of several Level 1 and Level 2 BLAS routines. For routines with data dependencies, such as TRSV and TPSV, an efficient parallel algorithm is proposed. It maintains the data dependencies through point-to-point synchronization and uses an efficient task mapping mechanism suited to triangular matrices, which effectively reduces the number of point-to-point synchronizations among CPEs and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further improve the memory access bandwidth utilization of the Level 1 and Level 2 BLAS routines. Experimental results show that the memory access bandwidth utilization of the Level 1 BLAS routines reaches up to 95% and averages more than 90%, while that of the Level 2 BLAS routines reaches up to 98% and averages more than 80%. Compared with the widely used open-source math library GotoBLAS, the proposed Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. By calling the proposed routines, LU decomposition, QR decomposition, and the symmetric eigenvalue problem achieve an average speedup of 10.99 times.
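To make the data dependency in TRSV concrete, the following is a minimal serial sketch of a blocked forward substitution for a lower-triangular system. It is not the algorithm proposed in the paper and uses no SW26010-Pro, RMA, or synchronization API; the function name `trsv_lower_blocked` and the `block` parameter are hypothetical names chosen for illustration. The sketch only shows which part of the work must wait for earlier results (the diagonal block solve) and which part consists of independent row updates that a many-core implementation could distribute.

```python
def trsv_lower_blocked(L, b, block=2):
    """Solve L x = b for a lower-triangular L (list of rows) by blocks.

    Illustrative sketch only: hypothetical helper, serial execution.
    """
    n = len(b)
    x = [float(v) for v in b]  # work on a copy; overwritten block by block
    for start in range(0, n, block):
        end = min(start + block, n)
        # Step 1 (dependent): solve the small diagonal block.
        # Entries x[start:end] already have the contributions of all
        # earlier blocks subtracted, so earlier blocks must be finished
        # first. This is the dependency a parallel version must respect.
        for i in range(start, end):
            s = x[i]
            for j in range(start, i):
                s -= L[i][j] * x[j]
            x[i] = s / L[i][i]
        # Step 2 (independent): subtract this block's contribution from
        # every row below. Each row update is a small matrix-vector
        # product that does not depend on the other rows.
        for i in range(end, n):
            for j in range(start, end):
                x[i] -= L[i][j] * x[j]
    return x


if __name__ == "__main__":
    L = [[2, 0, 0, 0],
         [1, 3, 0, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 10]]
    b = [2, 7, 32, 90]  # chosen so that the solution is 1, 2, 3, 4
    print(trsv_lower_blocked(L, b, block=2))
```

In this arrangement only the diagonal block solves form a sequential chain, so a larger block size means fewer points at which workers would have to wait for one another, at the cost of more serial work per block.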


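Cite this article

Hu Yi, Chen Daokun, Yang Chao, Liu Fangfang, Ma Wenjing, Yin Wanwang, Yuan Xinhui, Lin Rongfen. Many-core Optimization of Level 1 and Level 2 BLAS Routines on SW26010-Pro. Journal of Software (软件学报), 2023, 34(9): 4421-4436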


History

Received: 2021-07-02
Last revised: 2021-09-22
Published online: 2022-11-30
