Journal of Software, 2011, 22(zk2): 163-171

Reduction Algorithm Optimization Based on the OpenCL
YAN Shen-Gen, ZHANG Yun-Quan, LONG Guo-Ping, LI Yan
(Laboratory of Parallel Software and Computational Science, Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Computing Science, Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; Graduate University, The Chinese Academy of Sciences, Beijing 100190, China)
Received:July 15, 2011    Revised:December 02, 2011
Abstract: Reduction algorithms are widely used in scientific computing, image processing, and related fields. This paper systematically studies cross-platform performance optimization of reduction algorithms on GPUs under the OpenCL framework. Whereas previous work has generally focused on a single hardware architecture, this paper uses OpenCL to examine how several optimization approaches, including vectorization, on-chip memory bank conflicts, thread organization, and instruction selection, affect performance on different GPU hardware platforms. Taking the minMax function as an example, it gives a detailed performance analysis of each optimization method and explains why each one improves performance. Tests on AMD and NVIDIA GPU platforms show that the optimized algorithm achieves good speedups on both. On the AMD ATI Radeon HD 5850, bandwidth utilization for Int and Float data reaches up to 89% of the measured bandwidth. On the NVIDIA Tesla C2050, performance reaches 1.3 to 1.9 times that of the corresponding function in the CUDA version.
Keywords: GPU; parallel reduction; OpenCL; CUDA
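
The abstract names minMax as the reduction under study and lists thread organization and bank conflicts among the optimization angles. As a rough illustration of what such a reduction computes, the following is a minimal sequential sketch in Python that simulates a two-stage tree reduction with sequential addressing. It is not the authors' OpenCL kernel: the function names, the `group_size` parameter, and the padding strategy are assumptions made for this example.

```python
# Illustrative sketch only: a sequential simulation of a work-group tree
# reduction that computes min and max together. All names and the default
# group size are assumptions, not taken from the paper.

def group_min_max(chunk):
    """Reduce one group's chunk to (min, max) in log2(n) halving steps."""
    n = len(chunk)
    assert n > 0 and n & (n - 1) == 0, "chunk length must be a power of two"
    lo = list(chunk)  # stands in for a local-memory buffer of running minima
    hi = list(chunk)  # stands in for a local-memory buffer of running maxima
    stride = n // 2
    while stride > 0:
        # Sequential addressing: simulated work-item i combines slot i with
        # slot i + stride, so active work-items touch consecutive slots.
        for i in range(stride):
            lo[i] = min(lo[i], lo[i + stride])
            hi[i] = max(hi[i], hi[i + stride])
        stride //= 2
    return lo[0], hi[0]


def min_max(data, group_size=8):
    """Two-stage reduction: per-group partial results, then a final pass."""
    assert len(data) > 0, "input must be non-empty"
    partial = []
    for start in range(0, len(data), group_size):
        chunk = list(data[start:start + group_size])
        # Pad a short final chunk with one of its own values; repeating an
        # existing value cannot change the minimum or the maximum.
        chunk += [chunk[0]] * (group_size - len(chunk))
        partial.append(group_min_max(chunk))
    return min(p[0] for p in partial), max(p[1] for p in partial)
```

For example, `min_max([3, -1, 7, 4, 10, 2, 8, 5, 6])` reduces two groups of eight (the second one padded) and then combines the two partial results. On a real GPU each inner loop iteration would be a separate work-item, with a barrier between strides.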
Foundation items: National Natural Science Foundation of China (60303020, 60533020); National High Technology Research and Development Program of China (863 Program) (2006AA01A102); ISCAS-AMD Joint Fusion Software Center
Reference text:

YAN Shen-Gen, ZHANG Yun-Quan, LONG Guo-Ping, LI Yan. Reduction Algorithm Optimization Based on the OpenCL. Journal of Software, 2011, 22(zk2): 163-171