WANG He-Peng , WANG Hong-Zhi , LI Jian-Zhong , GAO Hong
2017, 28(11):2814-2824. DOI: 10.13328/j.cnki.jos.005344
Abstract: In recent years, with the increasing amount of data in real life, inconsistent data has become more common, which makes manual correction of inconsistent data time-consuming. Moreover, manual correction is prone to human error, so such a correction method is no longer feasible. How to classify inconsistent data directly, without correcting the data beforehand, is the core research content of this paper. In this paper, the objective function of the decision tree generation algorithm is improved so that it can classify inconsistent data directly and achieve better results. Multidimensional measures of each feature's influence on the classification results are used to adjust the feature's influence factor, so that decision tree nodes can be split more accurately and more effective classification results can be achieved.
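As a rough illustration of the idea of weighting a split criterion by a feature's influence on the classification result, the sketch below rescales ordinary information gain by a per-feature influence factor; the factor values, the toy data and the criterion itself are illustrative assumptions, not the paper's actual objective function.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array (which may contain inconsistent duplicates)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_gain(X, y, feature, influence):
    """Information gain of splitting on `feature`, rescaled by its influence factor."""
    base = entropy(y)
    cond = 0.0
    for v in np.unique(X[:, feature]):
        mask = X[:, feature] == v
        cond += mask.mean() * entropy(y[mask])
    return influence[feature] * (base - cond)

# Toy data: the third and fourth rows are inconsistent (same features, different labels).
X = np.array([[0, 1], [1, 0], [1, 1], [1, 1]])
y = np.array([0, 1, 1, 0])
influence = {0: 1.0, 1: 0.6}          # hypothetical per-feature influence factors
best = max(range(X.shape[1]), key=lambda f: weighted_gain(X, y, f, influence))
print("feature chosen for the split:", best)
```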
CHEN Yu , ZHAO Su-Yun , LI Xue-Feng , CHEN Hong , LI Cui-Ping
2017, 28(11):2825-2835. DOI: 10.13328/j.cnki.jos.005337
Abstract: Traditional attribute reduction is less effective when applied to large-scale datasets because of its high time and space complexity. In this paper, random sampling is introduced into traditional rough reduction. First, statistical discernibility degree and statistical rough reduction are proposed based on statistical rough approximation. Here the statistical rough reduction is no longer the traditional reduction; rather, it is a subset that keeps the statistical discernibility degree almost invariant. By using random sampling to estimate the statistical discernibility degree, all the condition attributes can be sorted, and the reduction can then be performed on the sorted attributes while keeping the statistical discernibility degree almost invariant. Finally, numerical experimental comparison demonstrates that the random sampling based rough reduction is effective in both time and space consumption.
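A minimal sketch of the sampling idea, under the assumption that the statistical discernibility degree of an attribute can be estimated by counting, over randomly sampled object pairs from different decision classes, how often the attribute takes different values; the exact statistical definitions in the paper may differ.

```python
import numpy as np

def sampled_discernibility(X, y, attr, n_pairs=1000, rng=None):
    """Estimate how well a single condition attribute discerns objects from
    different decision classes, using randomly sampled object pairs."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    diff_class = y[i] != y[j]
    if diff_class.sum() == 0:
        return 0.0
    discerned = (X[i, attr] != X[j, attr]) & diff_class
    return discerned.sum() / diff_class.sum()

# Rank condition attributes by their estimated discernibility on toy categorical data.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 6))
y = (X[:, 0] + X[:, 2]) % 2                      # attributes 0 and 2 are the relevant ones
scores = [sampled_discernibility(X, y, a, rng=rng) for a in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
print("attributes sorted by estimated discernibility:", ranking)
```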
WANG Yan , PENG Tao , HAN Jia-Yu , LIU Lu
2017, 28(11):2836-2850. DOI: 10.13328/j.cnki.jos.005343
Abstract: Clustering is an important method for data analysis in the field of data mining. The purpose of clustering is to divide unlabeled data into several groups according to data similarity. CSDP is a density-based clustering method whose efficiency is relatively low when the data size is large or the data dimensionality is high. In order to improve clustering efficiency, this paper proposes a density-based distributed clustering method, called MRCSDP, which uses MapReduce to cluster text data. This method introduces the definitions of independent calculation unit and independent calculation block. First, data are split into several data blocks which are used to construct independent calculation units and independent calculation blocks, and a task is assigned to each independent calculation block. Then the distributed calculation is conducted to obtain the local density of each data block, and the local densities are combined into the global density. The center value is calculated according to the global density. Based on the global density and the center value, the candidate cluster centers of each data block are obtained. Finally, the global cluster centers are obtained by calculating the density of all candidate cluster centers. MRCSDP achieves better clustering performance by reducing time complexity. Experimental results show that, compared with CSDP, MRCSDP can process large-scale data more effectively with load balancing across computing nodes.
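A plain-Python stand-in for the map/reduce decomposition of the density computation: each "independent calculation block" contributes partial neighbour counts, which the reduce step merges into the global density. The cutoff kernel, the block splitting and the toy data are assumptions for illustration only.

```python
import numpy as np

def map_partial_density(block_points, all_points, d_c):
    """Map step: for one data block, emit (point_index, partial_count) pairs,
    counting how many block points fall within the cutoff d_c of each point."""
    out = []
    for idx, p in enumerate(all_points):
        dist = np.linalg.norm(block_points - p, axis=1)
        out.append((idx, int((dist < d_c).sum())))
    return out

def reduce_global_density(partials, n_points):
    """Reduce step: sum the partial counts from all blocks into a global density."""
    rho = np.zeros(n_points, dtype=int)
    for idx, count in partials:
        rho[idx] += count
    return rho - 1                       # remove each point's self-count

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
blocks = np.array_split(points, 4)       # 4 independent calculation blocks
partials = [kv for b in blocks for kv in map_partial_density(b, points, d_c=1.0)]
rho = reduce_global_density(partials, len(points))
print("candidate centers (highest global density):", np.argsort(rho)[::-1][:2])
```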
TAN Qiao-Yu , YU Guo-Xian , WANG Jun , GUO Mao-Zu
2017, 28(11):2851-2864. DOI: 10.13328/j.cnki.jos.005339
Abstract: Weak label learning is an important sub-branch of multi-label learning, which has been widely studied and applied in replenishing the missing labels of partially labeled instances or classifying new instances. However, existing weak label learning methods are generally vulnerable to noisy and redundant features in high-dimensional data, where multiple labels and missing labels are more likely to be present. To accurately classify high-dimensional multi-label instances, this paper proposes an ensemble weak label classification method that maximizes the dependency between labels and features (EnWL for short). EnWL first repeatedly applies affinity propagation clustering in the feature space of the high-dimensional data to find cluster centers. Next, it uses the obtained cluster centers to construct representative feature subsets and thus reduce the impact of noisy and redundant features. Then, EnWL trains a semi-supervised multi-label classifier on each feature subset by maximizing the dependency between labels and features. Finally, it combines these base classifiers into an ensemble classifier via majority vote. Experimental results on several high-dimensional datasets show that EnWL significantly outperforms other related methods across various evaluation metrics.
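A hedged sketch of the ensemble construction, assuming that affinity propagation is used to cluster the features themselves so that the exemplars form a representative feature subset, and using one-vs-rest logistic regression as a stand-in for the dependency-maximizing semi-supervised base classifier described in the abstract.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def build_feature_subset(X, rng):
    """Cluster the features (columns) with affinity propagation on a random sample
    of the data and keep one representative feature per cluster exemplar."""
    sample = X[rng.choice(len(X), size=min(200, len(X)), replace=False)]
    ap = AffinityPropagation(random_state=int(rng.integers(1 << 30))).fit(sample.T)
    idx = ap.cluster_centers_indices_
    return idx if len(idx) > 0 else np.arange(X.shape[1])   # fall back to all features

def ensemble_predict(X_train, Y_train, X_test, n_members=5, seed=0):
    """Train one base multi-label classifier per feature subset and majority-vote."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(X_test), Y_train.shape[1]))
    for _ in range(n_members):
        subset = build_feature_subset(X_train, rng)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=500))
        clf.fit(X_train[:, subset], Y_train)
        votes += clf.predict(X_test[:, subset])
    return (votes >= n_members / 2).astype(int)              # majority vote over the ensemble

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
Y = np.stack([(X[:, 0] > 0).astype(int), (X[:, 5] + X[:, 7] > 0).astype(int)], axis=1)
pred = ensemble_predict(X[:250], Y[:250], X[250:])
print("predicted labels for the first test instance:", pred[0])
```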
2017, 28(11):2865-2878. DOI: 10.13328/j.cnki.jos.005341
Abstract: Multi-label learning has become a hotspot in machine learning research. In the multi-label learning problem, each instance is usually described by multiple class labels, which may be correlated with each other, and it is well known that exploiting label correlations is important for multi-label learning. In this paper, an improved association rule mining algorithm based on the matrix divide-and-conquer strategy is designed. In addition, a proof is given to show that the proposed algorithm finds the correct frequent items, and an application of the algorithm to the multi-label learning framework is also provided. Moreover, multi-label classification methods based on global and on local association rule mining are proposed. Experimental results on several datasets show that the proposed methods obtain better classification performance on 5 different evaluation criteria.
LU Hong-Yu , ZHANG Min , LIU Yi-Qun , MA Shao-Ping
2017, 28(11):2879-2890. DOI: 10.13328/j.cnki.jos.005349
Abstract: Because of their strong expressive power and outstanding classification performance, deep neural networks (DNNs), such as the convolutional neural network (CNN), are widely used in various fields. When faced with high-dimensional features, DNNs are usually considered robust because they can implicitly select relevant features. However, due to the huge number of parameters, if the data is insufficient, the learning of the neural network will be inadequate and its feature selection undesirable. A DNN is a black box, which makes it difficult to observe which features are chosen and to evaluate its feature selection ability. This paper proposes a feature contribution analysis method based on neuron receptive fields. Using this method, the feature importance of a neural network, for example a CNN, can be obtained explicitly. Further, the study finds that the neural network's ability to recognize relevant and noisy features is weaker than that of traditional feature evaluation methods. To enhance its feature selection ability, a feature selection enhanced CNN model is proposed, which improves classification accuracy by applying traditional feature evaluation methods to the learning process of the neural network. In the task of text-based user attribute modeling in social media, experimental results demonstrate the validity of the proposed model.
DU Chao , WANG Zhi-Hai , JIANG Jing-Jing , SUN Yan-Ge
2017, 28(11):2891-2904. DOI: 10.13328/j.cnki.jos.005350
Abstract: The pattern-based Bayesian model is one of the solutions to the classification problem in data mining. Most pattern-based Bayesian classifiers consider the supports of patterns in the dataset of the home class only, while the supports of the patterns in the counterpart class are ignored. In addition, for high-speed, dynamically changing and unbounded data streams, pattern-based Bayesian classifiers that target static datasets cannot work. To overcome these problems, EPDS (a Bayesian classifier algorithm based on emerging patterns for data streams) is proposed. EPDS is a Bayesian classification model based on emerging patterns discovered over the data stream. In this model, EPDS uses a simple hybrid forest (HYF) data structure to maintain the itemsets of transactions in memory, and a fast pattern extraction mechanism to accelerate the algorithm. EPDS adopts a partially-lazy learning strategy to update emerging itemsets continuously, and establishes a local classification model in each class for the test transaction. Experimental results on real and synthetic data streams show that EPDS achieves higher classification accuracy than other classic classifiers.
LIANG Tian-Xin , YANG Xiao-Ping , WANG Liang , ZHANG Yong-Jun , ZHU Yan-Li , XU Cui
2017, 28(11):2905-2924. DOI: 10.13328/j.cnki.jos.005334
Abstract: Firstly, this paper introduces the key features of memory neural networks under the strongly supervised model and the weakly supervised model. Then the corresponding application scenarios and processing methods, as well as the advantages and disadvantages of the two models, are summarized. Next, a brief survey of the development and application of the two models (including innovations on the models and innovations in their application) is provided, and the key roles of individual innovative models in natural language processing are summarized. Finally, the complex challenges of memory neural networks in natural language processing and their future development are discussed.
KUANG Qiu-Ming , YANG Xue-Bing , ZHANG Wen-Sheng , HE Xian-Feng , HUI Jian-Zhong
2017, 28(11):2925-2939. DOI: 10.13328/j.cnki.jos.005336
Abstract: High spatiotemporal resolution rainfall estimation is closely related to transportation, tourism, agricultural irrigation and people's daily travel. However, accurate high-resolution rain/no-rain classification is a very challenging problem. This paper proposes a multi-source data based multi-view learning method for rain/no-rain classification. The multi-source data used in this paper include radar, satellite and ground observation factors as well as rain/no-rain observation data. The method can be summarized as follows. Firstly, the VisCAPPI and VisPPI views are constructed from the radar observation factors, the VisSat view is constructed from the Himawari satellite data, and the VisGround view is constructed from the ground observation factors. Secondly, the views VisCAPPI_PPI, VisRadar_Sat, VisRadar_Ground, VisSat_Ground and VisRadar_Sat_Ground are obtained by combining features from the preconstructed views, and random forest (RF) classification models are trained on these views. Finally, the rain/no-rain classification results are obtained from the estimates of these RF classification models. The main contributions of this paper are as follows: (1) a method is presented for constructing the VisCAPPI, VisPPI, VisSat and VisGround views and their feature-combined views from radar, satellite and ground observations; (2) a multi-view weighted random forest method (MVWRF) is proposed, by which multi-source data from radar, satellite and near-surface observations are fused for rain/no-rain classification with a temporal resolution of 6 minutes and a spatial resolution of 1 km×1 km. The experimental results show that the proposed method obtains high rain/no-rain classification precision after training and testing on 393 meteorological stations covered by the radar in Quanzhou on October 7 and 8, 2016.
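A small sketch of how the per-view random forests might be weighted and combined for the final rain/no-rain decision; weighting each view by its validation accuracy is an assumption, since the abstract does not specify how MVWRF sets the view weights, and the construction of the views from radar/satellite/ground data is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train_view_forests(views_train, y_train, views_val, y_val, seed=0):
    """Train one random forest per view and weight each view by its validation accuracy.
    Assumes binary labels (0 = no rain, 1 = rain) with both classes present in y_train."""
    forests, weights = [], []
    for Xtr, Xva in zip(views_train, views_val):
        rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(Xtr, y_train)
        forests.append(rf)
        weights.append(accuracy_score(y_val, rf.predict(Xva)))
    weights = np.asarray(weights) / np.sum(weights)
    return forests, weights

def weighted_vote(forests, weights, views_test):
    """Rain/no-rain decision: weighted average of the per-view rain probabilities."""
    prob_rain = sum(w * rf.predict_proba(Xte)[:, 1]
                    for rf, w, Xte in zip(forests, weights, views_test))
    return (prob_rain >= 0.5).astype(int)
```

Here each element of views_train, views_val and views_test would be the feature matrix of one view (e.g., VisCAPPI, VisSat, VisGround) for the same set of samples.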
TANG Shi-Qi , WEN Yi-Min , QIN Yi-Xiu
2017, 28(11):2940-2960. DOI: 10.13328/j.cnki.jos.005352
Abstract: In recent years, transfer learning has gained more and more attention. However, most existing online transfer learning methods transfer knowledge from a single source domain, and it is hard to achieve effective transfer when the similarity between the source domain and the target domain is low. To solve this problem, this paper proposes a multi-source online transfer learning method, LC-MSOTL, based on local classification accuracy. LC-MSOTL stores multiple classifiers, each trained on a different source domain. For a newly arrived sample, it computes the distances to the samples in the target domain to find the k nearest neighbors, computes the local classification accuracy of each source domain classifier on these k nearest neighbors, then selects the source classifier with the highest local classification accuracy and combines it with the target domain classifier, thereby realizing knowledge transfer from multiple source domains to the target domain. Experiments on artificial and real datasets illustrate that LC-MSOTL can selectively and effectively transfer knowledge from multiple source domains, and achieves higher classification accuracy than the single-source online transfer learning algorithm OTL.
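A numpy sketch of the source-selection step, assuming already-trained scikit-learn-style classifiers with a predict method and binary labels in {0, 1}; the simple vote at the end is a stand-in for whatever combination rule LC-MSOTL actually uses to merge the selected source classifier with the target classifier.

```python
import numpy as np

def predict_with_best_source(x, X_target, y_target, source_clfs, target_clf, k=5):
    """For a new sample x, find its k nearest labelled target-domain neighbours,
    score every source classifier by its local accuracy on those neighbours,
    and combine the best source classifier with the target classifier."""
    dist = np.linalg.norm(X_target - x, axis=1)
    nn = np.argsort(dist)[:k]
    local_acc = [np.mean(clf.predict(X_target[nn]) == y_target[nn]) for clf in source_clfs]
    best = source_clfs[int(np.argmax(local_acc))]
    # simple combination of the selected source classifier and the target classifier
    votes = best.predict(x.reshape(1, -1))[0] + target_clf.predict(x.reshape(1, -1))[0]
    return int(votes >= 1)        # positive if either classifier predicts the positive class
```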
JI Zhong , SUN Tao , YU Yun-Long
2017, 28(11):2961-2970. DOI: 10.13328/j.cnki.jos.005338
Abstract: Zero-shot classification aims at recognizing instances from unseen categories, for which no training instances are available in the training stage. To address this task, most existing approaches resort to class semantic information to transfer knowledge from the seen classes to the unseen ones. In this paper, a transductive dictionary learning approach is proposed to facilitate the task in two steps. A discriminative dictionary learning model is first proposed to construct the relations between the visual modality and the class semantic modality using the labeled seen instances. Then a transductive modification of the model is used to alleviate the domain shift issue caused by the disjointness between the seen classes and the unseen classes. Experimental results on three benchmark datasets (AwA, CUB and SUN) demonstrate the effectiveness and superiority of the proposed approach.
YANG Liu , YU Jian , LIU Ye , ZHAN De-Chuan
2017, 28(11):2971-2991. DOI: 10.13328/j.cnki.jos.005348
Abstract: In the age of big data, learning from multi-source data plays an important role in many real applications. To date, plenty of multi-source data learning algorithms have been proposed; however, they pay little attention to the underlying theoretical laws. Meanwhile, it is hard for the classical machine learning theories to govern all learning systems and thus to provide theoretical support for multi-source learning algorithms. From the perspective of knowledge acquisition through learning, a survey is given on the research progress of three key problems: the human cognitive mechanism, the classical machine learning theories (computational learning theory, statistical learning theory, and probabilistic graphical models), and the design of multi-source learning algorithms. Future theoretical research issues of multi-source data learning are also presented and discussed.
QI Ren , ZHU Peng-Fei , LIANG Jian-Qing
2017, 28(11):2992-3001. DOI: 10.13328/j.cnki.jos.005346
Abstract: How to choose a proper distance metric is vital to many machine learning and pattern recognition tasks. Metric learning mainly uses discriminative information to learn a Mahalanobis distance or similarity metric. However, most existing metric learning methods are designed for numerical data, and it is unreasonable to calculate the similarity between two heterogeneous objects (e.g., categorical data) using traditional distance metrics. Besides, they suffer from the curse of dimensionality, resulting in poor efficiency and scalability when the feature dimension is very high. In this paper, a geometric mean metric learning method is proposed for heterogeneous data. The numerical data and categorical data are mapped to a reproducing kernel Hilbert space using different kernel functions, thereby avoiding the negative influence of high feature dimensionality. At the same time, a multiple kernel metric learning model based on the geometric mean is introduced, which transforms the metric learning problem for heterogeneous data into finding the midpoint between two points on the Riemannian manifold. Experiments on benchmark UCI datasets show that the presented method achieves promising accuracy in comparison with state-of-the-art metric learning methods.
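For two symmetric positive-definite matrices A and B (here standing in for the kernels on the numerical and the categorical features), the Riemannian midpoint is the matrix geometric mean A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}. The snippet below is only a numerical check of this standard formula, not the paper's full multiple kernel metric learning model.

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def geometric_mean(A, B):
    """Midpoint of two symmetric positive-definite matrices on the SPD manifold:
    A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    A_half = sqrtm(A)
    A_half_inv = inv(A_half)
    mid = A_half @ sqrtm(A_half_inv @ B @ A_half_inv) @ A_half
    return np.real((mid + mid.T) / 2)      # symmetrize away numerical noise

rng = np.random.default_rng(0)
# Two toy SPD matrices standing in for kernels on numerical and categorical features.
P = rng.normal(size=(4, 4)); A = P @ P.T + 4 * np.eye(4)
Q = rng.normal(size=(4, 4)); B = Q @ Q.T + 4 * np.eye(4)
G = geometric_mean(A, B)
# The geometric mean satisfies G A^{-1} G = B (up to numerical error).
print(np.allclose(G @ inv(A) @ G, B, atol=1e-6))
```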
YUAN Ji-Dong , WANG Zhi-Hai , SUN Yan-Ge , ZHANG Wei
2017, 28(11):3002-3017. DOI: 10.13328/j.cnki.jos.005331
Abstract: The temporal alignment based k-nearest neighbor classifier is a benchmark for time series classification. Since complex time series generally exhibit different global behaviors within classes in real applications, it is difficult for standard alignment, where features are treated equally and local discriminative behaviors are ignored, to handle such challenging time series correctly and efficiently. To facilitate aligning and classifying such complex time series, this paper proposes a discriminative locally weighted dynamic time warping dissimilarity measure that reveals the commonly shared subsequence within classes as well as the most differential subsequence between classes. Meanwhile, time series alignments on the positive and negative subsets are employed to iteratively learn a discriminative weight for each feature of each time series. Experiments performed on synthetic and real datasets demonstrate that this locally weighted, temporal alignment based k-nearest neighbor classifier is effective in differentiating time series with good interpretability. An extension of the proposed weighting strategy to multivariate time series is also discussed.
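A compact sketch of a locally weighted DTW distance for univariate series, where each point of the query series carries its own weight; the weights here are set by hand purely for illustration, whereas the paper learns them iteratively from alignments on the positive and negative subsets.

```python
import numpy as np

def weighted_dtw(a, b, w):
    """Dynamic time warping distance between univariate series a and b, where each
    point of a carries a local weight w[i] (larger weight = more discriminative)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = w[i - 1] * (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

# Uniform weights recover standard DTW; emphasising one segment changes the measure.
a = np.sin(np.linspace(0, 2 * np.pi, 50)); b = np.sin(np.linspace(0, 2 * np.pi, 50) + 0.3)
uniform = weighted_dtw(a, b, np.ones(50))
local = weighted_dtw(a, b, np.where(np.arange(50) < 10, 3.0, 1.0))   # hypothetical learned weights
print(uniform, local)
```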
2017, 28(11):3018-3029. DOI: 10.13328/j.cnki.jos.005332
Abstract: In recent years, deep learning has made great progress in computer vision, showing good application prospects in medical image reading. In this paper, a two-level deep convolutional neural network model is designed to perform feature extraction, feature fusion, and classification of fundus photographs. Comparison with doctors' diagnoses shows that the output of the model is highly consistent with them. In addition, an improved method for fine-grained image classification using weakly supervised learning is proposed. Finally, future research directions are discussed.
GOU Cheng-Cheng , QIN Yu-Jun , TIAN Tian , WU Da-Yong , LIU Yue , CHENG Xue-Qi
2017, 28(11):3030-3042. DOI: 10.13328/j.cnki.jos.005333
Abstract: Outbreak prediction in social networks is a part of popularity dynamics analysis of social networks, and it is an active research topic in the domain of social computing. This study proposes a social message outbreak prediction model based on recurrent neural networks (SMOP) that models the message propagation process. Compared with traditional machine learning models, SMOP directly models the arrival process of messages without the tedious feature engineering of traditional methods. Compared with point process models, SMOP is able to automatically learn the rate function of the propagation process, making it adaptable to a variety of scenarios. Moreover, time vectors and user vectors, which encode the periodicity of time and the user profile, are used as input to improve outbreak prediction performance. Experimental results on real-world datasets such as Twitter and Sina Weibo show that SMOP has excellent data adaptability and is able to predict whether a message will break out, with a higher F1 score, at the beginning of the message's spread (within 0.5 h).
QIAO Shao-Jie , HAN Nan , LI Tian-Rui , LI Rong-Hua , LI Bin-Yong , WANG Xiao-Teng , Louis Alberto GUTIERREZ
2017, 28(11):3043-3057. DOI: 10.13328/j.cnki.jos.005340
Abstract: Smart phones, GPS-equipped vehicles and wearable devices generate a large amount of trajectory data. These data not only describe the historical trajectories of moving objects, but also accurately reflect their movement characteristics. The existing trajectory prediction approaches have the following drawbacks: prediction accuracy and efficiency cannot be guaranteed at the same time, effective trajectory prediction is limited to road-network constrained local spatial areas, and complex, large-scale location data are difficult to process. To cope with these problems, a prefix projection based trajectory prediction model targeting massive trajectory data of moving objects is proposed, employing the basic idea of frequent sequential pattern discovery. The new model, called PPTP (prefix projection based trajectory prediction model), includes two essential steps: (1) discovering frequent trajectory patterns by creating projected databases and iteratively mining frequent prefix trajectory patterns from them; (2) trajectory matching, by incrementally extending the postfix trajectory of each frequent sequential pattern and outputting the longest continuous trajectory whose support is greater than the minimum support count threshold. The advantages of the proposed algorithm are that it can generate long-term trajectory patterns from short frequent sequential patterns in an incremental manner, and that it does not generate useless candidate trajectory sequences, which overcomes the time-intensive drawback of discovering frequent sequential patterns. Extensive experiments are conducted on real large-scale trajectory data from multiple aspects, and the results show that the PPTP algorithm has very high trajectory prediction accuracy compared with the first-order Markov chain prediction algorithm, with an average accuracy improvement of 39.8%. A generic trajectory prediction system is developed based on the proposed model, and the complete predicted trajectories are visualized in order to assist users in path planning.
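Step (1) can be pictured with a PrefixSpan-style prefix projection over trajectories treated as sequences of location ids, as sketched below; the toy trajectories and the minimum support are illustrative, and the incremental trajectory matching of step (2) is not reproduced here.

```python
def prefixspan(sequences, min_support, prefix=()):
    """Recursively mine frequent sequential patterns by building projected databases:
    for every frequent item, project each sequence onto its suffix after the first
    occurrence of that item, then recurse with the extended prefix."""
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    patterns = []
    for item, support in counts.items():
        if support < min_support:
            continue
        new_prefix = prefix + (item,)
        patterns.append((new_prefix, support))
        # Projected database: suffixes following the first occurrence of the item.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        projected = [s for s in projected if s]
        patterns += prefixspan(projected, min_support, new_prefix)
    return patterns

# Trajectories as sequences of visited location ids.
trajectories = [
    ["A", "B", "C", "D"],
    ["A", "C", "D"],
    ["B", "A", "C", "E"],
    ["A", "B", "C", "D", "E"],
]
for pattern, support in sorted(prefixspan(trajectories, min_support=3),
                               key=lambda ps: len(ps[0]), reverse=True):
    print(pattern, support)
```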
2017, 28(11):3058-3071. DOI: 10.13328/j.cnki.jos.005342
Abstract: Mobile application security detection and protection is an active research topic in the domain of software security. The traditional security solution is to install an APP developed by security vendors on the user terminal. However, ordinary users lacking security awareness do not understand the seriousness of security threats or the importance of security management, which leads to insufficient terminal security defense. It is therefore necessary to provide protection at the threat source and along the transmission route. From the perspectives of the threat source, the transmission route and the threatened terminal, this paper implements various security measures, including source code authorship attribution based on coding style, mobile application security reinforcement and channel monitoring, and mobile application security detection based on deep learning. A mobile application security ecological chain is also constructed to protect users' personal information security. The paper verifies the effectiveness of the proposed method in a practical application environment, and the results show that it can achieve the goal of all-around application security protection. Future work in this research area is also discussed.
2017, 28(11):3072-3079. DOI: 10.13328/j.cnki.jos.005345
Abstract: It is costly to identify bugs among the numerous source code files of a large software project, so locating bugs automatically and effectively is a worthwhile problem. Bug reports are one of the most valuable sources of bug descriptions, and precisely locating the source code related to a bug report can help reduce software development cost. Currently, most research on bug localization based on deep neural networks focuses on the design of network structures while paying little attention to the loss function, which significantly impacts performance in prediction tasks. In this paper, a cost-sensitive margin distribution optimization (CSMDO) loss function is proposed and applied to deep neural networks. The new method is capable of handling the imbalance of software defect datasets and improves localization accuracy significantly.
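Since the abstract does not give the CSMDO formula, the snippet below is only an illustrative stand-in inspired by cost-sensitive margin distribution learning: it rewards a large (cost-weighted) margin mean and penalizes margin variance, with a higher misclassification cost on the minority buggy class. It should not be read as the paper's actual loss.

```python
import numpy as np

def cost_sensitive_margin_loss(w, X, y, cost_pos=5.0, lam1=1.0, lam2=1.0, reg=0.1):
    """Illustrative cost-sensitive margin-distribution loss for a linear scorer.
    y holds labels in {-1, +1}; y = +1 marks the rare buggy files, which are
    weighted by cost_pos. Not the paper's exact CSMDO formulation."""
    margins = y * (X @ w)                          # functional margins
    costs = np.where(y == 1, cost_pos, 1.0)        # mislabelling a buggy file costs more
    mean_m = np.average(margins, weights=costs)
    var_m = np.average((margins - mean_m) ** 2, weights=costs)
    hinge = np.average(np.maximum(0.0, 1.0 - margins), weights=costs)
    return hinge - lam1 * mean_m + lam2 * var_m + 0.5 * reg * w @ w
```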
FAN Hong , HOU Cun-Cun , ZHU Yan-Chun , RAO Ruo-Xia
2017, 28(11):3080-3093. DOI: 10.13328/j.cnki.jos.005335
Abstract: Existing soft subspace clustering algorithms are susceptible to random noise when MR images are segmented, and they easily fall into local optima due to the choice of the initial cluster centers, which leads to unsatisfactory segmentation results. To solve these problems, this paper proposes a soft subspace clustering algorithm for MR images based on the fireworks algorithm. Firstly, a new objective function with boundary constraints and noise clustering is designed to overcome the noise sensitivity of existing algorithms. Next, a new method of calculating the affiliation degree is proposed to find the subspace where each cluster is located quickly and accurately. Then, an adaptive fireworks algorithm is introduced into the clustering process to effectively balance local and global search, overcoming the tendency of existing algorithms to fall into local optima. Experiments comparing with the EWKM, FWKM, FSC and LAC algorithms are conducted on UCI datasets, synthetic images, the Berkeley image dataset, as well as clinical breast MR images and brain MR images. The results demonstrate that the proposed algorithm not only obtains better results on UCI datasets, but also has better anti-noise performance; in particular, for MR images it achieves high clustering precision and robustness, and effective MR image segmentation.
GUO Mao-Zu , WANG Shi-Ming , LIU Xiao-Yan , TIAN Zhen
2017, 28(11):3094-3102. DOI: 10.13328/j.cnki.jos.005351
Abstract: MicroRNAs (miRNAs) play an important role in the process of life. In recent years, predicting the associations between miRNAs and diseases has become a hot research topic. Existing computational methods can be mainly divided into two categories: methods based on similarity measurement and methods based on machine learning. The former predict miRNA-disease associations by measuring the similarity of nodes in biological networks, but they need to build high-quality biological networks. The latter apply machine learning algorithms to this problem, but they need to build a highly credible negative sample collection. To address these shortcomings, this paper presents a novel computational model called BNPDCMDA (bipartite network projection based on density clustering to predict miRNA-disease associations). First, a miRNA-disease double-layer network model is constructed. Then, miRNA similarity is used to perform density clustering. Next, bipartite network projection is applied to the miRNA-disease double-layer network composed of the density-clustered miRNAs and the disease set. Finally, miRNA-disease associations are predicted. Experimental results show that the proposed approach achieves an AUC of 99.08% under leave-one-out cross-validation, demonstrating better predictive performance of BNPDCMDA than other methods. Moreover, miRNAs associated with certain common diseases are predicted by BNPDCMDA.
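The bipartite network projection step can be illustrated with a standard resource-allocation projection on a toy miRNA-disease association matrix, as below; the density clustering of miRNAs and the double-layer network construction are omitted, so this is only a sketch of the projection itself.

```python
import numpy as np

def bipartite_projection_scores(A):
    """Resource-allocation projection of a bipartite miRNA-disease network.
    A[i, j] = 1 if miRNA i is known to be associated with disease j.
    Returns a score matrix of the same shape for candidate associations."""
    k_mirna = A.sum(axis=1, keepdims=True)      # miRNA degrees
    k_disease = A.sum(axis=0, keepdims=True)    # disease degrees
    k_mirna[k_mirna == 0] = 1                   # avoid division by zero
    k_disease[k_disease == 0] = 1
    # Two-step resource spreading: diseases -> miRNAs -> diseases.
    W = (A / k_disease) @ (A / k_mirna).T       # miRNA-to-miRNA transfer matrix
    return W @ A                                # redistributed resource = association scores

A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0]], dtype=float)
scores = bipartite_projection_scores(A)
print(np.round(scores, 2))    # a high score for an unobserved pair suggests a candidate association
```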
YANG Gui , ZHENG Wen-Ping , WANG Wen-Jian , ZHANG Hao-Jie
2017, 28(11):3103-3114. DOI: 10.13328/j.cnki.jos.005347
Abstract:Most community detection algorithms in complex networks find communities based on topological structure of the network. Some important information is included in real network data, which represents data reliability or link closeness. Combined these prior information to detect communities might obtain better clustering results. An overlapping community detection on weighted networks (OCDW) is proposed in this study. Edge weight is defined by combining network topological structure and real information. Then, vertex weight is induced by edge weight. To obtain cluster, OCDW selects seed nodes according to vertex weight. After finding a cluster, edges in this cluster reduce their weights to avoid being selected as a seed node with high probability. Compared with some classical algorithms on 9 real networks including 5 unweighted networks and 4 weighted networks, OCDW shows a considerable or better performance on F-measure, accuracy, separation, NMI, ARI, modularity and time efficiency.