Basic linear algebra subprograms (BLAS) is one of the most fundamental and important underlying math libraries. In a standard BLAS library, the level-3 BLAS functions, which cover matrix-matrix operations, are particularly important and are widely invoked in many large-scale scientific and engineering computing applications. Moreover, level-3 BLAS functions are compute-intensive and play a vital role in fully exploiting the computing performance of a processor. This study investigates many-core parallel optimization techniques for level-3 BLAS functions on the homegrown SW26010-Pro processor. Specifically, according to the memory hierarchy of SW26010-Pro, a multi-level blocking algorithm is designed to exploit the parallelism of matrix operations. On this basis, a data-sharing scheme based on the remote memory access (RMA) mechanism is designed to improve the efficiency of data transfers among computing processing elements (CPEs). Furthermore, the algorithm is comprehensively optimized with triple buffering and parameter tuning to hide the memory access overhead of direct memory access (DMA) and the communication overhead of RMA. In addition, by exploiting the two hardware pipelines and several vectorized computation/memory-access instructions of SW26010-Pro, several operations of the level-3 BLAS functions, including matrix-matrix multiplication, triangular matrix equation solving, and matrix transposition, are hand-optimized in assembly, which improves the floating-point computing efficiency of the functions. Experimental results show that the proposed parallel optimization techniques bring significant performance improvements to level-3 BLAS functions on SW26010-Pro: the floating-point performance of the single-core-group level-3 BLAS functions reaches up to 92% of peak performance, and that of the multi-core-group level-3 BLAS functions reaches up to 88% of peak performance.