Spectral Clustering Based Unsupervised Feature Selection Algorithms

doi:10.13328/j.cnki.jos.005927


Journal of Software (软件学报), 2020, 31(4): 1009-1024

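About the authors: XIE Juanying (b. 1971), female, from Xi'an, Shaanxi, Ph.D., professor, doctoral supervisor, CCF senior member; research interests: machine learning, data mining, and biomedical data analysis. WANG Mingzhao (b. 1990), male, Ph.D. candidate; research interests: data mining and bioinformatics. DING Lijuan (b. 1994), female, master's student; research interests: machine learning and data mining.
Corresponding author: xiejuany@snnu.edu.cn
CLC number: TP181
Funding: National Natural Science Foundation of China (61673251); Key Projects of Science and Technology Research in Shaanxi Province (2018ZDXMSF-079); National Key Research and Development Program of China (2016YFC0901900); Scientific and Technological Achievements Transformation and Cultivation Funds of Shaanxi Normal University (GK201806013); Fundamental Research Funds for the Central Universities (GK201701006); Innovation Funds of Graduate Programs at Shaanxi Normal University (2015CXS028, 2016CSY009, 2018TS078)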



Abstract:

Gene expression data typically comprise a small number of samples described by tens of thousands of genes, many of which are unrelated to disease. Feature selection is therefore the first step in analyzing such data. Common feature selection methods require labeled data, but sample labels are often difficult to obtain. To address feature selection for gene expression data, this paper proposes an unsupervised feature selection idea named FSSC (feature selection by spectral clustering). FSSC applies spectral clustering to all features so that highly similar features fall into the same cluster. It defines the discernibility and the independence of a feature, measures feature importance as the product of the two, and selects representative features from each feature cluster to construct the feature subset. Depending on the spectral clustering algorithm used, three unsupervised feature selection algorithms are obtained: FSSC-SD (FSSC based on standard deviation), FSSC-MD (FSSC based on mean distance), and FSSC-ST (FSSC based on self-tuning). SVM (support vector machine) and KNN (K-nearest neighbors) classifiers are used to evaluate the selected feature subsets. Experimental results on 10 gene expression datasets show that FSSC-SD, FSSC-MD, and FSSC-ST all select feature subsets with strong classification ability.
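The selection step described in the abstract (score each feature by the product of its discernibility and independence, then keep a representative per feature cluster) can be illustrated with a minimal sketch. The abstract does not give the paper's definitions, so everything concrete here is an assumption made only for illustration: discernibility is taken to be the standard deviation of a feature across samples, independence is taken to be one minus the strongest absolute Pearson correlation with another feature in the same cluster, and exactly one feature is kept per cluster. The function names `pearson` and `select_representatives` are hypothetical, and the cluster labels are assumed to come from some spectral clustering of the features that is not shown.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_representatives(features, labels):
    """Pick one feature index per cluster by discernibility * independence.

    features: list of feature columns, each a list of values over the samples.
    labels:   cluster id of each feature (assumed output of spectral clustering).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    selected = []
    for lab in sorted(clusters):
        members = clusters[lab]
        best_idx, best_score = None, -1.0
        for i in members:
            # Assumed discernibility: spread of the feature across samples.
            discernibility = pstdev(features[i])
            # Assumed independence: 1 - strongest |correlation| with a cluster mate.
            others = [abs(pearson(features[i], features[j]))
                      for j in members if j != i]
            independence = 1.0 - max(others) if others else 1.0
            score = discernibility * independence
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


if __name__ == "__main__":
    # Toy data: 4 features measured on 4 samples, grouped into 2 clusters.
    features = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.1],  # nearly proportional to feature 0, larger spread
        [5.0, 1.0, 4.0, 2.0],
        [0.0, 0.0, 0.0, 0.0],  # constant, so zero discernibility
    ]
    labels = [0, 0, 1, 1]
    print(select_representatives(features, labels))
```

Under these assumed definitions, a constant feature scores zero and is never preferred over a varying cluster mate, while among near-duplicate features the one with the larger spread is kept.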


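Cite this article

XIE Juanying, DING Lijuan, WANG Mingzhao. Spectral clustering based unsupervised feature selection algorithms. Journal of Software (软件学报), 2020, 31(4): 1009-1024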


History

Received: 2019-05-31
Last revised: 2019-07-29
Published online: 2020-01-14
Published: 2020-04-06
