Journal of Software:2010.21(zk):214-223

Performance Testing and Analysis of BLAS Libraries on Multi-Core CPUs
CHEN Shao-Hu, ZHANG Yun-Quan, ZHANG Xian-Yi, CHENG Hao
(Laboratory of Parallel Software and Computational Science, Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Computing Science, Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; Graduate University, The Chinese Academy of Sciences, Beijing 100190, China)
Received:June 15, 2010    Revised:December 10, 2010
Abstract: BLAS is the most basic math library in high-performance computing, and its performance has a great impact on the performance of supercomputers. As CPUs move to multi-core designs, the multi-core parallel performance of BLAS has become more important than its architecture-dependent single-core performance. Taking the Xeon and Opteron series of multi-core X86 processors, which are popular in HPC, as example platforms, the experiments comprehensively test all Level 1, 2, and 3 functions of four mainstream BLAS libraries (GotoBLAS, Atlas, MKL, and ACML) across different problem sizes and multi-core parallel configurations. Based on the test results, the BLAS source code, library documentation, and published papers, we analyze which BLAS optimization and parallelization methods are effective and which platforms they suit, and we offer suggestions for optimizing and using BLAS and for the development of high-performance processors. The results show that, compared with a processor that has powerful but complex logic, a processor with larger and faster caches, wider memory bandwidth, lower memory latency, and a higher clock frequency can often achieve better performance in HPC applications. The situation on the X86 platform is also a useful reference for other architectures.
Keywords: BLAS; architecture; multi-core parallelism; X86; GotoBLAS; Atlas; MKL; ACML; optimization
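The abstract's conclusion about caches and memory bandwidth can be illustrated with a rough arithmetic-intensity estimate for one routine from each BLAS level. This is a minimal sketch and not the authors' test method: it assumes double precision, square operands, and an idealized cache in which every operand crosses the memory bus exactly once. The function names are hypothetical.

```python
# Illustrative only: flops per byte of memory traffic for one routine
# from each BLAS level. Assumptions (not from the paper): 8-byte doubles,
# square matrices, and each operand moved to or from memory exactly once.

BYTES_PER_DOUBLE = 8


def axpy_intensity(n):
    """Level 1, y = a*x + y: 2n flops; read x, read y, write y."""
    flops = 2 * n
    traffic = 3 * n * BYTES_PER_DOUBLE
    return flops / traffic


def gemv_intensity(n):
    """Level 2, y = A*x + y: 2n^2 flops; read A, x, y and write y."""
    flops = 2 * n * n
    traffic = (n * n + 3 * n) * BYTES_PER_DOUBLE
    return flops / traffic


def gemm_intensity(n):
    """Level 3, C = A*B + C: 2n^3 flops; read A, B, C and write C."""
    flops = 2 * n ** 3
    traffic = 4 * n * n * BYTES_PER_DOUBLE
    return flops / traffic


if __name__ == "__main__":
    for n in (100, 1000, 4000):
        print(n, axpy_intensity(n), gemv_intensity(n), gemm_intensity(n))
```

Under these assumptions the Level 1 and Level 2 ratios stay below 0.25 flops per byte regardless of size, so those routines are limited by memory bandwidth, while the Level 3 ratio grows as n/16 and can be pushed toward peak compute throughput only if blocks are reused from cache.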
Foundation items: Supported by the National Natural Science Foundation of China under Grant No.60533020; the National High-Tech Research and Development Plan of China (863) under Grant Nos.2006AA01A125, 2009AA01A134, 2009AA01A129; the National Key Scientific & Technological Project HEGAOJI of China under Grant No.2009ZX01036-001-002
Reference text:

CHEN Shao-Hu, ZHANG Yun-Quan, ZHANG Xian-Yi, CHENG Hao. Performance Testing and Analysis of BLAS Libraries on Multi-Core CPUs. Journal of Software, 2010, 21(zk): 214-223