Performance Testing and Analysis of BLAS Libraries on Multi-Core CPUs

Supported by the National Natural Science Foundation of China under Grant No. 60533020; the National High-Tech Research and Development Plan of China (863 Program) under Grant Nos. 2006AA01A125, 2009AA01A134, 2009AA01A129; and the National Key Scientific & Technological Project HEGAOJI of China under Grant No. 2009ZX01036-001-002

Abstract:

BLAS is the most fundamental math library in high-performance computing, and its performance strongly affects the performance of supercomputers. As CPUs move to multi-core designs, the multi-core parallel performance of BLAS has become more important than its architecture-dependent single-core performance. Taking the Xeon and Opteron series of multi-core x86 processors, which are widely used in HPC, as example platforms, the experiments comprehensively test all Level 1, 2, and 3 functions of four mainstream BLAS libraries (GotoBLAS, ATLAS, MKL, and ACML), covering different problem sizes and multi-core parallel execution. Drawing on the test results together with the libraries' source code, documentation, and published papers, we analyze which optimization and parallelization methods are effective for BLAS and which platforms they suit, and we offer suggestions for optimizing and using BLAS and even for the development of high-performance processors. The results show that, compared with a processor whose logic is powerful but complex, a processor with larger and faster caches, higher memory bandwidth, lower memory latency, and a higher clock frequency can often achieve better performance in high-performance computing. The situation on the x86 platform is also a valuable reference for other architectures.
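The abstract's observation that caches and memory bandwidth matter can be illustrated with a back-of-the-envelope operation count for one representative routine per BLAS level. The sketch below is not taken from the paper: the function name `blas_level_intensity` is hypothetical, and it assumes square n x n matrices, length-n vectors, and an idealized model in which every operand is moved between memory and the CPU exactly once.

```python
# Illustrative only: idealized operation counts, not measurements from the paper.
# Assumption: each operand is read or written exactly once (no cache modeling).

def blas_level_intensity(n):
    """Return {routine: (flops, elements_moved, flops_per_element)}."""
    counts = {
        # Level 1, y = a*x + y: one multiply and one add per element;
        # read x, read y, write y.
        "axpy": (2 * n, 3 * n),
        # Level 2, y = A*x + y: two flops per matrix element;
        # read A (n*n elements), read x, read y, write y.
        "gemv": (2 * n * n, n * n + 3 * n),
        # Level 3, C = A*B + C: 2n flops per element of C;
        # read A, B, and C, then write C.
        "gemm": (2 * n ** 3, 4 * n * n),
    }
    return {name: (f, m, f / m) for name, (f, m) in counts.items()}


if __name__ == "__main__":
    for name, (flops, moved, ratio) in blas_level_intensity(1000).items():
        print(f"{name}: {flops} flops, {moved} elements moved, "
              f"{ratio:.2f} flops per element")
```

Under this model the Level 1 and Level 2 routines perform at most about two floating-point operations per element moved, so they tend to be limited by memory bandwidth, while the Level 3 ratio grows with n, which is why cache blocking can pay off there.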

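Cite this article

Chen Shaohu, Zhang Yunquan, Zhang Xianyi, Cheng Hao. Performance testing and analysis of BLAS libraries on multi-core CPUs. Journal of Software, 2010, 21(zk): 214-223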

History
  • Received: 2010-06-15
  • Last revised: 2010-12-10