Journal of Software, 2020, 31(9): 2944-2964

Performance Optimizing Method for Sparse Convolutional Neural Networks on GPU
DONG Xiao, LIU Lei, LI Jing, FENG Xiao-Bing
(State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China)
Received:October 05, 2019    Revised:January 13, 2020
Abstract: In recent years, deep convolutional neural networks have shown remarkable capability in many tasks and have been deployed in applications such as object detection, autonomous driving, and machine translation. However, these models usually have a very large number of parameters and impose a heavy computational burden. Neural network pruning identifies and removes parameters that contribute little to accuracy, reducing both the parameter count and the theoretical amount of computation, and thus offers an opportunity for efficient execution. Pruned sparse models are nevertheless hard to execute efficiently on GPUs: their performance can be even worse than that of the dense models before pruning, so pruning rarely delivers real performance gains. This study proposes a sparsity-aware code generation method that produces efficient GPU programs for sparse convolutions. First, an operator template is designed for the convolution operator, and its code is optimized in several ways for the GPU architecture. The template source code is compiled and analyzed to obtain an intermediate representation template; a sparse code generation method then combines this template with the pruned sparse parameters to generate the corresponding sparse convolution code. In addition, data access and data placement are optimized according to the data access characteristics of neural network execution, which improves memory throughput. Finally, the positions of the sparse parameters are encoded implicitly in the generated code, so no extra index structure is needed and the memory access demand is reduced. Experiments show that, compared with existing methods for executing sparse neural networks on GPUs, the proposed method effectively improves the performance of sparse convolutional neural networks.
Keywords: neural network; sparsity; GPU; performance optimization; convolution; code generation
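The idea of encoding the positions of nonzero weights implicitly in generated code can be illustrated with a toy sketch. The Python example below is not the authors' implementation: the paper generates GPU code from an intermediate representation template, whereas this sketch emits plain Python source for a small, stride-1, unpadded 2D convolution. All names (`generate_sparse_conv_source`, `compile_sparse_conv`, `dense_conv`) are hypothetical and chosen for illustration only.

```python
# Toy illustration (assumption: not the paper's GPU code generator).
# Nonzero weights and their offsets become literals in the generated source,
# so the specialized function needs no index arrays at run time.

def generate_sparse_conv_source(weights, name="sparse_conv"):
    """Emit Python source for a valid (no padding, stride 1) 2D convolution
    specialized to the nonzero entries of `weights`."""
    kh, kw = len(weights), len(weights[0])
    terms = [
        f"{w!r} * x[i + {r}][j + {c}]"
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
        if w != 0  # pruned (zero) weights emit no code at all
    ]
    body = " + ".join(terms) if terms else "0.0"
    return (
        f"def {name}(x):\n"
        f"    out_h = len(x) - {kh} + 1\n"
        f"    out_w = len(x[0]) - {kw} + 1\n"
        f"    return [[{body} for j in range(out_w)] for i in range(out_h)]\n"
    )


def compile_sparse_conv(weights, name="sparse_conv"):
    """Compile the generated source and return the specialized function."""
    namespace = {}
    exec(generate_sparse_conv_source(weights, name), namespace)
    return namespace[name]


def dense_conv(x, weights):
    """Reference dense convolution that multiplies by every weight."""
    kh, kw = len(weights), len(weights[0])
    return [
        [
            sum(weights[r][c] * x[i + r][j + c]
                for r in range(kh) for c in range(kw))
            for j in range(len(x[0]) - kw + 1)
        ]
        for i in range(len(x) - kh + 1)
    ]


# A 3x3 kernel with only two weights left after pruning.
pruned = [[0.0, 2.0, 0.0],
          [0.0, 0.0, 0.0],
          [-1.0, 0.0, 0.0]]
image = [[float(4 * i + j) for j in range(4)] for i in range(4)]

conv = compile_sparse_conv(pruned)
result = conv(image)
print(generate_sparse_conv_source(pruned))
print(result)
```

In this sketch the generated function contains exactly two multiply terms, one per surviving weight, and agrees with the dense reference. A GPU version would additionally have to address thread mapping and data placement, which this example does not attempt.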
CLC number: TP303
Foundation items: National Natural Science Foundation of China (61521092); National Key Research and Development Program of China (2017YFB1003103)
Reference text:

DONG Xiao, LIU Lei, LI Jing, FENG Xiao-Bing. Performance Optimizing Method for Sparse Convolutional Neural Networks on GPU. Journal of Software, 2020, 31(9): 2944-2964