Journal of Software (软件学报), 2017, Vol. 28, Issue (11): 2879-2890


Convolution Neural Network Feature Importance Analysis and Feature Selection Enhanced Model
LU Hong-Yu1,2, ZHANG Min1,2, LIU Yi-Qun1,2, MA Shao-Ping1,2
1. State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084, China;
2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Foundation item: National Natural Science Foundation of China (61622208, 61532011, 61672311); National Program on Key Basic Research Project of China (973) (2015CB358700)
Abstract: Because of their strong expressive power and outstanding classification performance, deep neural networks (DNNs), such as the convolution neural network (CNN), are widely used in various fields. When faced with high-dimensional features, DNNs are usually considered robust, since they can implicitly select relevant features. However, because of the huge number of parameters, when data is insufficient the training of a neural network will be inadequate and its feature selection unsatisfactory. Moreover, a DNN is a black box, which makes it difficult to observe which features are chosen and to evaluate its feature selection ability. This paper proposes a feature contribution analysis method based on the neuron receptive field. Using this method, the feature importance learned by a neural network, for example a CNN, can be obtained explicitly. The study further finds that the neural network's ability to recognize relevant and noise features is weaker than that of traditional feature evaluation methods. To enhance its feature selection ability, a feature selection enhanced CNN model is proposed, which improves classification accuracy by applying traditional feature evaluation methods to the learning process of the neural network. In the task of text-based user attribute modeling in social media, experimental results demonstrate the validity of the proposed model.
Key words: convolution neural network; feature importance analysis; feature selection; text categorization

1 Related Work

1.1 Feature Analysis of Samples by Neural Networks

 $S_{x_{ij}} = \partial L\left( {\tilde y, x} \right) / \partial x_{ij}$
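The sensitivity score above is simply the gradient of the loss with respect to each input feature. It can be approximated numerically without backpropagation; a minimal numpy sketch on a toy squared-error model (the model and all names are illustrative, not the paper's network):

```python
import numpy as np

def loss(x, w, y):
    """Squared-error loss of a toy linear model (stand-in for L(y~, x))."""
    return (x @ w - y) ** 2

def sensitivity(x, w, y, eps=1e-6):
    """Approximate S_{x_i} = dL/dx_i by central finite differences."""
    s = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        s[i] = (loss(xp, w, y) - loss(xm, w, y)) / (2 * eps)
    return s

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
grad = sensitivity(x, w, 0.0)
# For this model the analytic gradient is 2*(x@w - y)*w,
# so the numeric estimate should agree closely with it.
```

In a real network one would of course obtain the same quantity from automatic differentiation rather than finite differences.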

1.2 Evaluation of Sample Feature Analysis Methods

 $x_{MF}^{\left( 0 \right)} = x;\forall 1 \le k \le L:x_{MF}^{\left( k \right)} = g\left( {x_{MF}^{\left( {k-1} \right)}, {r_k}} \right)$ (1)

 $AOPC = \frac{1}{{L + 1}}{\left\langle {\sum\nolimits_{k = 0}^L {f\left( {x_{MF}^{\left( 0 \right)}} \right)-f\left( {x_{MF}^{\left( k \right)}} \right)} } \right\rangle _x}$ (2)
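The MoRF perturbation of Eq. (1) and the AOPC of Eq. (2) can be computed together: delete the most relevant features one at a time and average the drop in the model's score. A minimal numpy sketch, taking "deleting a feature" to mean zeroing it (a common choice of g; the scoring function here is a toy, illustrative only):

```python
import numpy as np

def aopc(f, x, relevance, L):
    """AOPC: average drop of f(x) after deleting the L most-relevant
    features in MoRF order; deletion g zeroes the chosen feature."""
    order = np.argsort(-relevance)   # most relevant first
    x_k = x.copy()
    drops = [0.0]                    # k = 0 term: f(x^(0)) - f(x^(0)) = 0
    for k in range(L):
        x_k[order[k]] = 0.0          # x^(k) = g(x^(k-1), r_k)
        drops.append(f(x) - f(x_k))
    return sum(drops) / (L + 1)

f = lambda v: float(v.sum())         # toy scoring function
x = np.array([3.0, 1.0, 2.0])
rel = np.array([0.9, 0.1, 0.5])
score = aopc(f, x, rel, L=2)
```

A better relevance ranking removes high-impact features earlier, so a larger AOPC indicates a better feature analysis method, which is how the paper uses it.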

1.3 Traditional Feature Selection Methods

 ${\chi ^2} = \sum\nolimits_{i = 1}^2 {\sum\nolimits_{j = 1}^k {\frac{{{{\left( {{A_{ij}}-{E_{ij}}} \right)}^2}}}{{{E_{ij}}}}} }$ (3)

 ${\rho _{X, C}} = \frac{{\mathrm{cov}\left( {X, C} \right)}}{{{\sigma _X}{\sigma _C}}}$ (4)
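The χ² statistic of Eq. (3) is computed from a 2 × k contingency table of feature-occurrence counts per class, with expected counts derived from the marginals. A minimal numpy sketch (the counts are hypothetical):

```python
import numpy as np

def chi2_score(A):
    """Chi-square statistic for a 2 x k contingency table A
    (rows: feature present/absent, columns: the k classes).
    E is the expected count under independence: row sum * col sum / total."""
    E = A.sum(axis=1, keepdims=True) * A.sum(axis=0, keepdims=True) / A.sum()
    return float(((A - E) ** 2 / E).sum())

# Hypothetical counts for a term that occurs mostly in class 0:
A = np.array([[30.0, 5.0],
              [10.0, 25.0]])
score = chi2_score(A)
```

A large score means the feature's occurrence depends strongly on the class, so χ² ranks such features as highly relevant.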

2 Feature Importance Analysis of Neural Networks

2.1 Receptive-Field-Based Feature Contribution Analysis for Neural Networks

 Fig. 1 Sketch map of the feature contribution analysis based on receptive field

1. The contribution of output-layer neuron yj is initialized as $C_{y_j} = \delta_{jc}$, where δ is the Kronecker delta and c is the class under observation (e.g., the sample's true class).

2. The value of output-layer neuron yj is obtained from the pooling-layer neurons p through one fully connected layer, so the contribution $C_{p_i}$ of pi can be computed from $C_{y_j}$ and the corresponding fully connected weight $w_{p_i y_j}$:

 ${C_{{p_i}}} = {w_{{p_i}{y_j}}}{C_{{y_j}}}$ (5)

3. The max-pooling neuron pi keeps only the largest entry of the corresponding feature map fmi (winner-take-all), so the contribution $C_{p_i}$ of the pooling neuron is back-propagated entirely to the maximally activated convolution neuron $conv_{i, k_{\max}}$ of feature map fmi:

 $C_{conv_{i, k}} = I_{k = k_{\max}} C_{p_i}$ (6)

4. The activation of convolution neuron $conv_{j,k}$ is obtained by convolving the features wi within its receptive field with the convolution kernel parameters; therefore, the contribution $C_{w_i}$ of wi can be obtained via the dot product of its word vector $x_{w_i}$ and the kernel parameter vector at the corresponding position:

 $C_{w_i} = \sum\nolimits_j {\sum\nolimits_k {I_{i \in RF\left( k \right)}\, conv\_kernel_{i - k + kernel\_size/2} \cdot x_{w_i} \times C_{conv_{j, k}}} }$ (7)

 $im{p_{{w_i}}} = \frac{1}{N}\sum\nolimits_{j \in doc\left( {{w_i}} \right)} {im{p_{{w_{ij}}}}}$ (8)
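The four steps above can be sketched end-to-end for a single convolution filter over a short word sequence; a minimal numpy example (the sizes, random toy inputs, and variable names are all illustrative, not the paper's trained network):

```python
import numpy as np

# Toy setup: 4 words, embedding dim 2, one conv filter of width 2.
np.random.seed(0)
X = np.random.randn(4, 2)          # word embeddings x_{w_i}
kernel = np.random.randn(2, 2)     # conv kernel over (width, dim)
w_fc = 1.5                         # weight from pooled neuron to output y_c

# Forward pass: valid 1-D convolution, then max pooling.
conv = np.array([np.sum(X[k:k + 2] * kernel) for k in range(3)])
k_max = int(np.argmax(conv))

# Step 1: contribution of the observed output neuron is the Kronecker delta.
C_y = 1.0
# Step 2: the pooled neuron inherits contribution through the FC weight (Eq. 5).
C_p = w_fc * C_y
# Step 3: winner-take-all -- only the max-activated conv neuron gets C_p (Eq. 6).
C_conv = np.zeros(3)
C_conv[k_max] = C_p
# Step 4: each word in the winner's receptive field receives the dot product of
# its embedding with the matching kernel row, weighted by C_conv (Eq. 7).
C_w = np.zeros(4)
for k in range(3):
    for offset in range(2):
        C_w[k + offset] += float(X[k + offset] @ kernel[offset]) * C_conv[k]
```

By construction, only the two words inside the winning receptive field receive nonzero contribution, and their contributions sum to the winner's activation times its back-propagated contribution.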

2.2 Comparative Experiments on the Validity of Sample Feature Importance Analysis Methods

2.2.1 Experimental Data and Model

 Fig. 2 A convolution neural network model for text categorization tasks

2.2.2 Validity Experiments and Result Analysis

 Fig. 3 Visual display of feature contribution and feature sensitivity analysis

 Fig. 4 Validity experiments of the feature analysis methods

2.3 Feature Selection Results of the Neural Network

Table 1 Top-10 keywords of different feature importance evaluation methods

3 Comparative Analysis of the Neural Network's Feature Selection Ability and Traditional Feature Selection Methods

3.1 Evaluation of Feature Selection Ability

3.2 Experimental Comparison of the Ability to Identify High-Importance Features (Positive Selection)

 Fig. 5 Experimental results of positive selection

3.3 Experimental Comparison of the Ability to Identify Noise Features (Reverse Occlusion)

 Fig. 6 Experimental results of reverse occlusion

4 Feature Selection Enhanced Model for Convolutional Neural Networks

4.1 Feature Selection Layer

 Fig. 7 Sketch map of the feature selection layer

 $x' = \mathit{ReLU}\left( {x \odot w + b} \right)$ (9)
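Eq. (9) defines the feature selection layer as an element-wise gate: each input feature is scaled by its own learned weight, shifted, and passed through a ReLU, so features whose gate is driven to zero (or negative) are suppressed entirely. A minimal numpy sketch (the gate values here are hypothetical, standing in for learned parameters):

```python
import numpy as np

def feature_selection_layer(x, w, b):
    """Element-wise gating layer x' = ReLU(x * w + b): a learned per-feature
    weight w can drive noisy features to exactly zero through the ReLU."""
    return np.maximum(x * w + b, 0.0)

x = np.array([2.0, -1.0, 3.0])
w = np.array([1.0, 0.0, -0.5])   # illustrative learned gates
b = np.array([0.0, 0.0, 0.0])
out = feature_selection_layer(x, w, b)
# Feature 0 passes through; features 1 and 2 are gated off.
```

Because the gate is differentiable almost everywhere, it can be trained jointly with the rest of the network, which is what allows traditional feature evaluation scores to be injected into the learning process.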

 Fig. 8 Feature selection enhanced model applied to the convolutional neural network with embedding layer

 Fig. 9 Feature selection enhanced model applied to neural networks with fixed-length features

4.2 Model Validity Verification

Table 2 Experimental results of the feature selection enhanced convolution neural network model

5 Conclusions and Future Work
