Journal of Software, 2019, 30(9): 2857-2868

PUseqClust: A Clustering Analysis Method for RNA-Seq Data
SHI Xian-Feng, LIU Xue-Jun, ZHANG Li
(College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China)
Received: January 03, 2017    Revised: September 17, 2017
Abstract: Clustering analysis is an important technique in gene expression data analysis. It groups genes with similar expression profiles in order to explore unknown gene functions. In recent years, RNA-seq technology has been widely adopted to measure gene expression levels, producing large amounts of read data that make clustering analysis of gene expression feasible. Because reads are not uniformly distributed, read counts are generally modeled with the negative binomial distribution. Existing negative-binomial-based algorithms and traditional clustering algorithms model read counts directly; they do not fully account for the various sources of noise in the experiment or for the uncertainty of gene expression measurements, or they give insufficient consideration to the uncertainty of cluster centers. This study proposes the PUseqClust (propagating uncertainty into RNA-seq clustering) framework for clustering RNA-seq read data. The framework builds on the PGSeq model to simulate the stochastic process of read generation, uses the Laplace method to account for the correlation between gene expression levels across multiple conditions and replicates and thereby obtain the uncertainty of the expression levels, and then applies a Student's t mixture model to cluster gene expression. Experimental results show that the method yields more biologically meaningful clustering results than other methods.
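As a small illustration of why the negative binomial distribution is a common choice for read counts, the sketch below compares its variance with the Poisson variance at the same mean. It is not part of PUseqClust or PGSeq; the function names and the values of `mu` (mean count) and `alpha` (dispersion) are hypothetical and chosen only for demonstration.

```python
import numpy as np


def nb_variance(mu, alpha):
    """Variance of a negative binomial with mean mu and dispersion alpha.

    A Poisson with the same mean has variance mu, so the extra
    alpha * mu**2 term is the overdispersion.
    """
    return mu + alpha * mu ** 2


def sample_nb_counts(mu, alpha, size, rng):
    """Draw counts with mean mu and dispersion alpha (alpha > 0)."""
    n = 1.0 / alpha       # NumPy's "number of successes" parameter
    p = n / (n + mu)      # success probability that yields mean mu
    return rng.negative_binomial(n, p, size=size)


# Hypothetical values, not taken from the paper.
mu, alpha = 100.0, 0.1
rng = np.random.default_rng(0)
counts = sample_nb_counts(mu, alpha, 200_000, rng)

print("sample mean:        ", counts.mean())
print("sample variance:    ", counts.var())
print("theoretical NB var: ", nb_variance(mu, alpha))
print("Poisson variance:   ", mu)
```

With these assumed values the theoretical variance is roughly eleven times the Poisson variance, which is the kind of extra variability that a count model for unevenly distributed reads has to absorb.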
CLC number: TP311
Foundation items: National Natural Science Foundation of China (61170152); Aeronautical Science Foundation of China (20151452021)
Reference text:

SHI Xian-Feng, LIU Xue-Jun, ZHANG Li. PUseqClust: A Clustering Analysis Method for RNA-Seq Data. Journal of Software, 2019, 30(9): 2857-2868