CHENG Xue-Qi , JIN Xiao-Long , WANG Yuan-Zhuo , GUO Jia-Feng , ZHANG Tie-Ying , LI Guo-Jie
2014, 25(9):1889-1908. DOI: 10.13328/j.cnki.jos.004674
Abstract:This paper first introduces the key features of big data in different processing modes and their typical application scenarios, as well as corresponding representative processing systems. It then summarizes three development trends of big data processing systems. Next, the paper gives a brief survey on system supported analytic technologies and applications (including deep learning, knowledge computing, social computing, and visualization), and summarizes the key roles of individual technologies in big data analysis and understanding. Finally, the paper lays out three grand challenges of big data processing and analysis, i.e., data complexity, computation complexity, and system complexity. Potential ways for dealing with each complexity are also discussed.
REN Lei , DU Yi , MA Shuai , ZHANG Xiao-Long , DAI Guo-Zhong
2014, 25(9):1909-1936. DOI: 10.13328/j.cnki.jos.004645
Abstract:Visual analytics is an important method used in big data analysis. The aim of big data visual analytics is to take advantage of human’s cognitive abilities in visualizing information while utilizing computer’s capability in automatic analysis. By combining the advantages of both human and computers, along with interactive analysis methods and interaction techniques, big data visual analytics can help people to understand the information, knowledge and wisdom behind big data directly and effectively. This article emphasizes on the cognition, visualization and human computer interaction. It first analyzes the basic theories, including cognition theory, information theory, interaction theory and user interface theory. Based on the analysis, the paper discusses the information visualization techniques used in mainstream applications of big data, such as text visualization techniques, network visualization techniques, spatio-temporal visualization techniques and multi-dimensional visualization techniques. In addition, it reviews the interaction techniques supporting visual analytics, including interface metaphors and interaction components, multi-scale/multi-focus/multi-facet interaction techniques, and natural interaction techniques faced on Post-WIMP. Finally, it discusses the bottleneck problems and technical challenges of big data visual analytics.
ZHANG Yu , LIU Yan-Bing , XIONG Gang , JIA Yan , LIU Ping , GUO Li
2014, 25(9):1937-1952. DOI: 10.13328/j.cnki.jos.004636
Abstract:How to effectively compress and represent the large-scale graphic data becomes the fundamental issue for analysis and processing. Graphic data compression technology is an effective solution to significantly reduce the storage space while supporting fast access in the compressed form. An in-depth analysis is provided on the current development of the technologies, including compression technology based on the traditional storage structure, Web graph compression technology, social network compression technology and compression technology for a particular query. A detailed analysis and performance comparison about the representative methods of each technology is presented. Finally, the summary and prospect are listed.
YIN Jian , WANG Zhi-Sheng , LI Qi , SU Wei-Jie
2014, 25(9):1953-1966. DOI: 10.13328/j.cnki.jos.004648
Abstract:This paper explores the area of personalized recommendation based on large-scale implicit feedback, where only positive feedback is available. To tackle the difficulty arising from lack of negative samples, a novel latent factor model IFRM is proposed, to convert the recommendation task into adoption probability optimization problem. To further improve efficiency and scalability, a parallel version of IFRM named p-IFRM is presented. By randomly partitioning users and items into buckets and thus reconstructing update sequence, IFRM can be learnt in parallel. The study theoretically derives the model from Bayesian analysis and experimentally demonstrates its effectiveness and efficiency by implementing p-IFRM under MapReduce framework and conducting comprehensive experiments on real world large datasets. The experiment results show that the model improves recommendation quality and performs well in scalability.
HE Zhi-Fen , YANG Ming , LIU Hui-Dong
2014, 25(9):1967-1981. DOI: 10.13328/j.cnki.jos.004634
Abstract:In this paper, joint learning of multi-label classification and label correlations (JMLLC) is proposed. In JMLLC, a directed conditional dependency network is constructed based on class label variables. This not only enables joint learning of independent label classifiers to enhance the performance of label classifiers, but also allows joint learning of label classifiers and label correlations, thereby making the learned label correlations more accurate. JMLLC-LR (JMLLC with logistic regression) and JMLLC-LS (JMLLC with least squares), are proposed respectively by adopting two different loss functions: logistic regression and least squares, and are both extended to the reproducing kernel Hilbert space (RKHS). Finally, both JMLLC-LR and JMLLC-LS can be solved by alternating solution approaches. Experimental results on twenty benchmark data sets based on five different evaluation criteria demonstrate that JMLLC outperforms the state-of-the-art MLL algorithms.
2014, 25(9):1982-1991. DOI: 10.13328/j.cnki.jos.004639
Abstract:With the explosive growth of open source software, retrieving desired software in open source software communities becomes a great challenge. Tagging open source software is usually a manual process which assigns software with several tags describing its functions and characteristics. Users can search their desired software by matching the keywords. Because of the simplicity and convenience, software retrieval based on tags has been widely used. However, since human effort is expensive and time-consuming, developers are not willing to tag software sufficiently when uploading software projects. Thus automatic software tagging, with tags describing functions and characteristics according to software projects’ text descriptions provided by users, becomes key to effective software retrieval. This article formalizes this problem as a multi-label learning problem and proposes a new multi-label learning method ML-CKNN which can effectively solve this problem when the number of different tags is extremely large. By imposing cost value of wrong classification into multi-label learning, ML-CKNN can effectively solve this imbalanced problem, as each tag instances associated with this tag are much less than those not associated with this tag. Experiments on three open source software community datasets show that ML-CKNN can provide high-quality tags for new uploading open source software while significantly outperforming existing methods.
2014, 25(9):1992-2001. DOI: 10.13328/j.cnki.jos.004641
Abstract:In the framework of multi-label learning, each example is represented by a single instance (feature vector) while simultaneously associated with multiple class labels. A common strategy adopted by most existing multi-label learning algorithms is that the very feature set of each example is employed in the discrimination processes of all class labels. However, this popular strategy might be suboptimal as each label is supposed to possess specific characteristics of its own. Based on this assumption, a multi-label learning algorithm named LIFT is proposed, in which label specific feature of each label is utilized in the discrimination process of the corresponding label. LIFT contains two steps: label-specific features construction and classification models induction. LIFT constructs the label-specific features by querying the clustering results and then induces the classification model with the corresponding label-specific features. In this paper, three variants of LIFT are studied, all employ other label-specific feature construction mechanisms while retaining the classification models induction process of LIFT. To validate the general helpfulness of label-specific feature mechanism to multi-label learning and the effectiveness of those label-specific features adopted by LIFT, two groups of experiments are conducted on a total of twelve multi-label benchmark datasets.
2014, 25(9):2002-2017. DOI: 10.13328/j.cnki.jos.004635
Abstract:Time series data is an important object of data mining. In analysis of time series, misjudgment of correlation will occur if time lags are not considered. Therefore, there exists mutual restraint between correlation and time lags in time series. Based on the exploration of correlation and simultaneousness of time series, the correlation identification and curve registration methods for double sequences are given in this paper. Concretely, the study investigates the reasons and characteristics of two types of errors in correlation analysis in the view of time warping, and then deduces the correlation coefficient’s bounds in a certain significance level by its asymptotic distribution. Further, a correlation identification method based on time-lag series is proposed. Finally, the curve registration model of maximizing the correlation coefficient is presented with a broader application than AISE. Smoothing-generalized expectation maximization (S-GEM) algorithm is used to solve the time warping function of the new model. The experimental results on simulated and real data demonstrate that the proposed correlation identification approach is more effective than 3 correlation coefficients and Granger causality test in recognition of spurious regression. The registration method provided is obviously performed better than the classical continuous monotone registration method (CMRM), Self-modeling registration (SMR) and maximum likelihood registration (MLR) in most situations. Linear correlation of double series and functional curve registration are considered here, and the results can provide the theoretical basis for correlation identification and time alignment in regression and reference direction for correlation analysis and curves registration of multiple series.
ZHOU Hang-Xing , CHEN Song-Can
2014, 25(9):2018-2025. DOI: 10.13328/j.cnki.jos.004649
Abstract:Multi-View learning is a method to improve the robustness and learning performance of single-view learning by fusing the complementary information. Canonical correlation analysis (CCA) which is used to analyze correlation between two datasets of the same objects is an important method for multi-view feature fusion. CCA aims to seek a pair of projections associated with the two sets of data such that they are maximally correlated. However, CCA results in constraint of the classification performance due to not utilizing the class information or ordinal information of different classes for some applications in which the data labels are ordinal. In order to compensate such a shortcoming, ordinal discriminative canonical correlation analysis (OR-DisCCA) is proposed in this paper by incorporating the class information and ordinal information for extending the traditional CCA. The experimental results show that OR-DisCCA outperforms existing related methods.
LI Jia-Fei , ZHOU Bin , LIU Da-You , HU Liang , WANG Feng
2014, 25(9):2026-2036. DOI: 10.13328/j.cnki.jos.004632
Abstract:To solve the problem that the evidence theory can’t efficiently deal with the fusion of massive information, a new method combining clustering and the convex evidence theory is put forward. The method aims to solve the common and important application problems of the status evaluation. First, the famous clustering algorithm BIRCH is performed to pre-process the data, generating multiple clusters. Second, the centroid of each cluster is calculated as the representation of the cluster pertaining to the fact that most data used for status evaluation have numeric attribute or ordinal attribute. Then, to form the evidence provided by the information in each cluster, the centroid information is given a basic probability assignment value based on the generalized triangular fuzzy membership function. Finally, evidences are combined according to the combination rule of the convex evidence theory. As a result, the massive information fusion is achieved. The results of simulation experiment show that the presented method can efficiently and reasonably perform the massive information fusion, providing a new way to improve the massive information fusion techniques.
DING Shi-Fei , JIA Hong-Jie , SHI Zhong-Zhi
2014, 25(9):2037-2049. DOI: 10.13328/j.cnki.jos.004643
Abstract:Spectral clustering is a flexible and effective clustering method for complex structure data sets. It is based on spectral graph theory and can produce satisfactory clustering results by mapping the data points into a low-dimensional space constituted by eigenvectors so that the data structure is optimized. But in the process of spectral clustering, the computational complexity of eigen-decomposition is usually O(n3), which limits the application of spectral clustering algorithm in big data problems. Nyström extension method uses partial points sampled from the data set and approximate calculation to simulate the real eigenspace. In this way, the computational complexity can be effectively reduced, which provides a new idea for big data spectral clustering algorithm. The selection of sampling strategy is essential for Nyström extension technology. In this paper, the design of an adaptive Nyström sampling method is presented. The sampling probability of every data point will be updated after each sampling pass, and a proof is given that the sampling error will decrease exponentially with the increase of sample times. Based on the adaptive Nyström sampling method, a spectral clustering algorithm for big data analysis is presented, and its feasibility and effectiveness is verified by experiments.
2014, 25(9):2050-2075. DOI: 10.13328/j.cnki.jos.004644
Abstract:To deal with the challenging problem of recognizing the small number of distinguishable genes which can tell the cancer patients from normal people in a dataset with a small number of samples and tens of thousands of genes, novel hybrid gene selection algorithms are proposed in this paper based on the statistical correlation and K-means algorithm. The Pearson correlation coefficient and Wilcoxon signed-rank test are respectively adopted to calculate the importance of each gene to the classification to filter the least important genes and preserve about 10 percent of the important genes as the pre-selected gene subset. Then the related genes in the pre-selected gene subset are clustered via K-means algorithm, and the weight of each gene is calculated from the related coefficient of the SVM classifier. The most important gene, with the biggest weight or with the highest votes when the roulette wheel strategy is used, is chosen as the representative gene of each cluster to construct the distinguishable gene subset. In order to verify the effectiveness of the proposed hybrid gene subset selection algorithms, the random selection strategy (named Random) is also adopted to select the representative genes from clusters. The proposed distinguishable gene subset selection algorithms are compared with Random and the very popular gene selection algorithm SVM-RFE by Guyon and the pre-studied gene selection algorithm SVM-SFS. The average experimental results of 200 runs of the aforementioned gene selection algorithms on some classic and very popular gene expression datasets with extensive experiments demonstrate that the proposed distinguishable gene subset selection algorithms can find the optimal gene subset, and the classifier based on the selected gene subset achieves very high classification accuracy.
HUAI Bao-Xing , BAO Teng-Fei , ZHU Heng-Shu , LIU Qi
2014, 25(9):2076-2087. DOI: 10.13328/j.cnki.jos.004642
Abstract:Named entity linking (NEL) is an advanced technology which links a given named entity to an unambiguous entity in the knowledge base, and thus plays an important role in a wide range of Internet services, such as online recommender systems and Web search engines. However, with the explosive increasing of online information and applications, traditional solutions of NEL are facing more and more challenges towards linking accuracy due to the large number of online entities. Moreover, the entities are usually associated with different semantic topics (e.g., the entity “Apple” could be either a fruit or a brand) whereas the latent topic distributions of words and entities in same documents should be similar. To address this issue, this paper proposes a novel topic modeling approach to named entity linking. Different from existing works, the new approach provides a comprehensive framework for NEL and can uncover the semantic relationship between documents and named entities. Specifically, it first builds a knowledge base of unambiguous entities with the help of Wikipedia. Then, it proposes a novel bipartite topic model to capture the latent topic distribution between entities and documents. Therefore, given a new named entity, the new approach can link it to the unambiguous entity in the knowledge base by calculating their semantic similarity with respect to latent topics. Finally, the paper conducts extensive experiments on a real-world data set to evaluate our approach for named entity linking. Experimental results clearly show that the proposed approach outperforms other state-of-the-art baselines with a significant margin.
OUYANG Dan-Tong , QU Jian-Feng , YE Yu-Xin
2014, 25(9):2088-2101. DOI: 10.13328/j.cnki.jos.004638
Abstract:Distant supervision is a suitable method for relation extraction in big data. It provides a large amount of sample data by aligning relation instances in knowledge base with nature sentences in corpus. In this paper, a new method of distant supervision with expansion of ontology-based sampling is investigated to address the difficulty of extracting relations from sparse training data. First, an ontology which has a deep link with relation extraction is sought through the definition of cover ratio and volume ratio. Second, some relation instances are added by ontology reasoning and examples of queries. Finally, the expansion of training sets is completed by aligning the new relation instances and nature sentences in corpus. The experiment shows that the presented method is capable of extracting some relations whose training sets are weak, a task impossible by the normal distant supervision method.
LI Zhao-Kui , DING Li-Xin , HE Jin-Rong , HU Qing-Hui
2014, 25(9):2102-2118. DOI: 10.13328/j.cnki.jos.004651
Abstract:This paper presents a face feature representation method based on image decomposition (FRID). FRID first decomposes an image into a series of orientation sub-images by executing multiple orientations operator. Then, each orientation sub-image is decomposed into a real part image and an imaginary part image by applying Euler mapping operator. For each real and imaginary part image, FRID divides them into multiple non-overlapping local blocks. The real and imaginary part histograms are calculated by accumulating the number of different values of image blocks respectively. All the real and imaginary part histograms of an image are concatenated into a super-vector. Finally, the dimensionality of the super-vector is reduced by linear discriminant analysis to yield a low-dimensional, compact, and discriminative representation. Experimental results show that FRID achieves better results in comparison with state-of-the-art methods, and is the most stable method.
XU Fei-Fei , LEI Jing-Sheng , BI Zhong-Qin , MIAO Duo-Qian , DU Hai-Zhou
2014, 25(9):2119-2135. DOI: 10.13328/j.cnki.jos.004640
Abstract:For the big data on electric power, many specific applications, such as load forecasting and fault diagnosis, need to consider data changes during a period of time to determine their decision classes, as deriving a class label of only one data record is meaningless. Based on the above discussion, interval-valued rough set is introduced into big data classification. Employing algebra and information theory, this paper defines the related concepts and proves the properties for interval-valued reductions based on dependency and mutual information, and presents the corresponding heuristic reduction algorithms. The proposed methods can not only enrich and develop the interval-valued rough set theory, but also provide a new way for the analysis of big data. Pertaining to the distributed data storage architecture of big data, this paper further proposes the interval-valued global reduction in multi-decision tables with proofs of its properties. The corresponding algorithm is also given. In order for the algorithms to achieve better results in practical applications, approximate reduction is introduced. To evaluate three proposed algorithms, it uses six months’ operating data of one 600MW unit in some power plant. Experimental results show that the three algorithms proposed in this article can maintain high classification accuracy with the proper parameters, and the numbers of objects and attributes can both be greatly reduced.
WANG Xu-Cong , LI Cui-Ping , CHEN Hong
2014, 25(9):2136-2148. DOI: 10.13328/j.cnki.jos.004637
Abstract:P-Rank enriches the traditional similarity measure, SimRank. It is also a method to measure the similarity between two objects in graph model. Different from SimRank which only considers the in-link information, P-Rank also takes the out-link information into consideration. Consequently, P-Rank could effectively and comprehensively measure “how similar two nodes are”. P-Rank is applied widely in graph mining. With the arrival of big-data era, the data scale which P-Rank processes is increasing. The existing methods which implement P-Rank, such as the MapReduce model, are essentially synchronous iterative methods. These methods have some shortcomings in common: the iterative time, especially the waiting time of processors during iterative computing, is long, thus leading to very low efficiency. To solve this problem, this paper uses a new iterative method—the Asynchronous Accumulative Update method. Different from the traditional synchronous methods, this method successfully implementes asynchronous computations and as a result reduces the waiting time of processors during computing. This paper implements P-Rank using the asynchronous accumulative update method, and the experiment results indicate that this method can effectively improve the computation speed.
DING Li-Zhong , JIA Lei , LIAO Shi-Zhong
2014, 25(9):2149-2159. DOI: 10.13328/j.cnki.jos.004650
Abstract:Model selection is critical to support vector learning. Previous model selection methods mainly adopt a nested two-layer framework, where the inner layer trains the learner and the outer one conducts model selection by minimizing the estimate of the generalization error. Breaking from this framework, this paper proposes an approach of simultaneously tuning multiple parameters of support vector learning, which integrates model selection and learning into one optimization process. It first combines the parameters and hyperparameters involved in support vector learning into one parameter vector. Then, using sequential unconstrained minimization technique (SUMT), it reformulates the constrained optimization problems for support vector classification (SVC) and support vector regression (SVR) as unconstrained optimization problems to give the simultaneous tuning model of SVC and SVR. In addition, it proves the basic properties of the simultaneous tuning model of SVC and SVR, including the local Lipschitz continuity and the boundedness of their level sets. Further, it develops a simultaneous tuning algorithm to iteratively solve simultaneous tuning model. Finally, it proves the convergence of the developed algorithm based on the basic properties of the simultaneous tuning model and provides analysis on complexity of the algorithm as compared with related approaches. The empirical evaluation on benchmark datasets shows that the proposed simultaneous approach has lower running time complexity and exhibits similar predictive performance as existing approaches. Theoretical and experimental results demonstrate that the simultaneous tuning approach is a sound and efficient model selection approach for support vector learning.
SHAO Yan-Jian , TAO Qing , JIANG Ji-Yuan , ZHOU Bai
2014, 25(9):2160-2171. DOI: 10.13328/j.cnki.jos.004633
Abstract:Stochastic gradient descent (SGD) is one of the efficient methods for dealing with large-scale data. Recent research shows that the black-box SGD method can reach an O(1/T) convergence rate for strongly-convex problems. However, for solving the regularized problem with L1 plus L2 terms, the convergence rate of the structural optimization method such as COMID (composite objective mirror descent) can only attain O(lnT/T). In this paper, a weighted algorithm based on COMID is presented, to keep the sparsity imposed by the L1 regularization term. A prove is provided to show that it achieves an O(1/T) convergence rate. Furthermore, the proposed scheme takes the advantage of computation on-the-fly so that the computational costs are reduced. The experimental results demonstrate the correctness of theoretic analysis and effectiveness of the proposed algorithm.
HAN Bing , LIAO Qian , GAO Xin-Bo
2014, 25(9):2172-2179. DOI: 10.13328/j.cnki.jos.004647
Abstract:In this paper, a method for recognizing the arc aurora sequences from all-sky image sequences is proposed. For the movement trend of arc aurora sequences, a method named ST-PVLBP (spatial-temporal poleward volume local binary patterns), which is based on VLBP (volume local binary patterns) and uses ST-PVLBP to present the aurora sequences, is proposed. Combined with the interframe continuity information of the sequence and the spatial location information of the single frame, the algorithm reduces the feature dimension while maintaining high classification accuracy at the same time. The proposed method was evaluated using auroral observations at the Chinese Arctic Yellow River Station. Experimental results show that the proposed method can effectively detect the poleward moving arc aurora sequences.
WANG Han , LIU Chong-Jin , FU Xiang , FENG Ju-Fu
2014, 25(9):2180-2186. DOI: 10.13328/j.cnki.jos.004646
Abstract:While minutiae is important for high-resolution palmprint matching, the quality of minutiae is affected by principal lines, creases and other noises, and therefore it is necessary to estimate the quality of minutiae and to exclude poor minutiae. In this paper, a minutiae quality estimation algorithm based on learning for high-resolution palmprint is proposed. First, a series of features obtained by applying Gabor convolution, complex filtering, etc., are used to describe the local area of minutiae redundancy. Then, with each feature as a weak classifier, AdaBoost algorithm is applied in multi-layered training to identify the best features for discriminating minutiae. Finally, the response of weighted linear combination of weak classifiers is used as minutiae quality score, and minutiae with score above the threshold is selected as true minutiae. Comparing with the method based on Fourier transform response, the presented method is superior at distinguishing true from false minutiae, and provides better evaluation of minutiae quality.