Journal of Software, 2020, 31(11): 3492-3505

类属型数据核子空间聚类算法
徐鲲鹏,陈黎飞,孙浩军,王备战
(福建师范大学 数学与信息学院, 福建 福州 350117;数字福建环境监测物联网实验室(福建师范大学), 福建 福州 350117;汕头大学 工学院, 广东 汕头 515063;厦门大学 软件学院, 福建 厦门 361005)
Kernel Subspace Clustering Algorithm for Categorical Data
XU Kun-Peng, CHEN Li-Fei, SUN Hao-Jun, WANG Bei-Zhan
(College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, China; Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring (Fujian Normal University), Fuzhou 350117, China; College of Engineering, Shantou University, Shantou 515063, China; College of Software, Xiamen University, Xiamen 361005, China)
Received:January 10, 2018    Revised:May 16, 2018
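> Abstract (translated from the Chinese): Most existing subspace clustering methods for categorical data assume that features are mutually independent and do not consider linear or nonlinear correlations between attributes. This paper proposes a kernel subspace clustering method for categorical data. First, kernel functions originally designed for continuous data are introduced to project categorical data into a kernel space, and a feature-weighted similarity measure for categorical data is defined in that space. Second, based on this measure, an objective function for kernel subspace clustering of categorical data is derived, and an efficient optimization method for solving it is proposed. Finally, a kernel subspace clustering algorithm for categorical data is defined. The algorithm not only considers the relationships between attributes in a nonlinear space but also, during clustering, assigns each attribute a feature weight that measures its relevance to the clusters, thereby achieving embedded feature selection for categorical attributes. A cluster validity index is also defined to evaluate the quality of categorical clustering results. Experimental results on synthetic and real-world datasets show that, compared with existing subspace clustering algorithms, the kernel subspace clustering algorithm can uncover nonlinear relationships among categorical attributes and effectively improves the quality of the clustering results.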
Abstract: Currently, mainstream subspace clustering methods for categorical data rely on linear similarity measures and overlook the relationships between attributes. In this study, an approach is proposed for clustering categorical data with a novel kernel soft feature-selection scheme. First, categorical data are projected into a high-dimensional kernel space by introducing a kernel function, and a similarity measure for categorical data in the kernel subspace is given. Based on this measure, the kernel subspace clustering objective function is derived, and an optimization method is proposed to solve it. Finally, a kernel subspace clustering algorithm for categorical data is proposed. The algorithm considers the relationships between attributes and assigns each attribute a weight measuring its degree of relevance to the clusters, enabling automatic feature selection during the clustering process. A cluster validity index is also defined to evaluate the categorical clusters. Experimental results on synthetic and real-world datasets demonstrate that the proposed method effectively uncovers the nonlinear relationships among attributes and improves the performance and efficiency of clustering.
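The abstract does not state the similarity measure itself, so the following is only a minimal illustrative sketch of the general idea of a feature-weighted kernel similarity for categorical objects. It assumes each attribute is one-hot encoded and compared with a Gaussian kernel, and that the per-attribute kernel values are combined by a weighted sum. The function names, the choice of kernel, the `sigma` parameter, and the weight values are all assumptions made for illustration and are not taken from the paper.

```python
import math


def attribute_kernel(a, b, sigma=1.0):
    """Gaussian kernel between the one-hot encodings of two category values.

    Two one-hot vectors are either identical (squared distance 0) or differ
    in exactly two positions (squared distance 2), so no explicit encoding
    is needed here.
    """
    sq_dist = 0.0 if a == b else 2.0
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def weighted_kernel_similarity(x, y, weights, sigma=1.0):
    """Weighted sum of per-attribute kernel values (hypothetical measure)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(
        w * attribute_kernel(a, b, sigma) for a, b, w in zip(x, y, weights)
    )


# Two categorical objects described by three attributes.
x = ("red", "small", "round")
y = ("red", "large", "round")

# Illustrative attribute weights (assumed, not learned): a larger weight
# means the attribute is treated as more relevant to the cluster.
weights = [0.5, 0.1, 0.4]

sim_xy = weighted_kernel_similarity(x, y, weights)
sim_xx = weighted_kernel_similarity(x, x, weights)
```

In this sketch the two objects differ only in the low-weight second attribute, so `sim_xy` stays close to the self-similarity `sim_xx`. In a subspace clustering setting the weights would be learned per cluster during optimization rather than fixed by hand.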
Article ID:     CLC number: TP181    Document code:
Foundation items: National Natural Science Foundation of China (U1805263, 61672157); Project of Science and Technology Bureau, Fujian Province (JK2017007); Program of Innovative Research Team of Fujian Normal University (IRTL1704)
Reference text:

徐鲲鹏,陈黎飞,孙浩军,王备战.类属型数据核子空间聚类算法.软件学报,2020,31(11):3492-3505

XU Kun-Peng, CHEN Li-Fei, SUN Hao-Jun, WANG Bei-Zhan. Kernel Subspace Clustering Algorithm for Categorical Data. Journal of Software, 2020, 31(11): 3492-3505