A Novel Approach to Revealing Positive and Negative Co-Regulated Genes
-
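Summary: Discovering co-regulated genes is one of the main goals of microarray data analysis and a necessary step toward revealing the relationships among the members of a gene regulatory network. An important manifestation of co-regulation is co-expression, in which the expression profiles of a group of genes rise and fall together under some subset of conditions. Traditional distance-based clustering methods are ill suited to gene expression data, because objects (genes) that are far apart in spatial distance may still have clearly correlated expression patterns (expression profiles). Recently proposed pattern-based clustering methods try to address this problem, but they can only find clusters of co-regulated genes in which the expression values of any two genes follow a specific linear relationship, such as a parallel or proportional one. Microarray experiments are highly susceptible to other factors (such as the probe concentration on the chip and the precision of image scanning), so microarray data inevitably contain a great deal of noise, and admitting only pure parallel or proportional patterns is clearly too strict.

As an improvement, Liu et al. proposed a tendency-based method to relax the excessive strictness of pattern-based clustering. That method does not require a specific magnitude transformation relationship among the expression profiles of co-regulated genes; it only requires that the profiles carry the same tendency information. However, the tendency-based method reorders the attribute sequence by expression value and reports, as a co-regulated gene cluster, the genes whose profiles show a "rising" tendency under the same permutation, which may lose many biologically meaningful clusters. In addition, it treats a gene's expression values on different attributes as mutually independent and ignores the sequential nature of time-series data, so it is also unsuitable for gene-time microarray datasets. Beyond these shortcomings, both pattern-based and tendency-based methods overlook two important issues: (1) Regulation significance testing. These methods always assume that any increase in expression value is a positive regulation and any decrease is a negative regulation, whereas in fact some small changes in expression value have negligible biological meaning. (2) Negative co-regulation. A series of recent biological findings shows that co-regulated genes exhibit certain expression patterns that current pattern-based or tendency-based methods cannot discover. Negative co-regulation is one of them: when the expression level of one gene rises, that of another falls, or vice versa. From a biological point of view, it is therefore necessary to group positively and negatively co-regulated genes into the same cluster.

To address these problems, a new solution is proposed. The main contributions are: (1) a new subspace clustering model for co-regulated genes, g-Cluster, which clusters positively and negatively co-regulated genes simultaneously; (2) a new coding-based method under which two genes are co-regulated if and only if they have the same code; (3) two tree-based clustering algorithms, one depth-first and one breadth-first, combined with effective pruning and optimization strategies, to mine all maximal g-Clusters that satisfy the requirements; (4) extensive experiments on real and synthetic datasets to validate the algorithms. The experimental results confirm that the proposed algorithms outperform existing ones.

Abstract: As explored by biologists, there is a real and emerging need to identify co-regulated gene clusters, which include both positively and negatively regulated gene clusters. However, the existing pattern-based and tendency-based clustering approaches are designed only for finding positively regulated gene clusters. In this paper, a new subspace clustering model called g-Cluster is proposed for gene expression data. The proposed model has the following advantages: 1) it finds both positively and negatively co-regulated genes at once, 2) it removes the restriction of a magnitude transformation relationship among co-regulated genes, and 3) it guarantees the quality of clusters and the significance of regulations by using a novel similarity measurement, gCode, and a user-specified regulation threshold δ, respectively. No previous work accomplishes this task. Moreover, the minimum description length (MDL) technique is introduced to avoid generating insignificant g-Clusters. A tree structure, namely the GS-tree, is also designed, and two algorithms, combined with efficient pruning and optimization strategies, are developed to identify all qualified g-Clusters. Extensive experiments are conducted on real and synthetic datasets. The experimental results show that 1) the algorithms are able to find a number of co-regulated gene clusters missed by previous models, which are potentially of high biological significance, and 2) the algorithms are effective and efficient, and outperform the existing approaches.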
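The coding idea in the abstract (two genes are co-regulated if and only if they share the same code, with changes smaller than the threshold δ treated as insignificant) can be illustrated with a small sketch. This is not the paper's gCode definition, which the abstract does not spell out. It assumes that δ is applied to the absolute difference between consecutive conditions, that all conditions are used rather than a subspace, and that a negatively co-regulated gene is one whose step code is the exact mirror image of another's. The function names `regulation_code`, `canonical_code` and `group_genes` are hypothetical.

```python
from collections import defaultdict


def regulation_code(profile, delta):
    """Encode each step between consecutive conditions as +1, -1 or 0.

    Assumption: a step counts as a regulation only if the absolute
    change exceeds delta; smaller changes are treated as insignificant.
    """
    code = []
    for prev, curr in zip(profile, profile[1:]):
        change = curr - prev
        if change > delta:
            code.append(1)
        elif change < -delta:
            code.append(-1)
        else:
            code.append(0)
    return tuple(code)


def canonical_code(code):
    """Flip the sign of a code so that its first non-zero step is +1.

    A code and its mirror image then map to the same canonical code,
    so positively and negatively co-regulated genes share one key.
    """
    for step in code:
        if step != 0:
            return code if step > 0 else tuple(-s for s in code)
    return code  # all steps insignificant


def group_genes(profiles, delta):
    """Group genes by canonical code; '+' or '-' marks the orientation."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        code = regulation_code(profile, delta)
        canon = canonical_code(code)
        sign = "+" if code == canon else "-"
        groups[canon].append((gene, sign))
    return dict(groups)


if __name__ == "__main__":
    # Made-up expression profiles, for illustration only.
    profiles = {
        "g1": [1.0, 3.0, 2.9, 0.5],
        "g2": [10.0, 14.0, 14.2, 9.0],
        "g3": [5.0, 2.0, 2.3, 6.0],
        "g4": [1.0, 1.1, 3.0, 3.2],
    }
    for code, members in group_genes(profiles, delta=0.5).items():
        print(code, members)
```

In this toy input, `g1` and `g2` differ greatly in magnitude yet share a code, and `g3` moves in the opposite direction at every significant step, so all three fall into one group. Mining such groups over subsets of conditions, as the g-Cluster model requires, is the harder problem that the tree-based algorithms address.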
-
Keywords:
- microarray data
- pattern-based clustering
- co-regulated genes
-
[1] Liu J, Wang W. Op-cluster: Clustering by tendency in high dimensional space. In Proc. ICDM 2003 Conference, Melbourne, USA, 2003, pp.187-194.
[2] Haixun Wang, Wei Wang, Jiong Yang, Philip S Yu. Clustering by pattern similarity in large data sets. In Proc. the 2002 ACM SIGMOD Conference, Wisconsin, 2002, pp.394-405.
[3] Jian Pei, Xiaoling Zhang, Moonjung Cho et al. Maple: A fast algorithm for maximal pattern-based clustering. In Proc. ICDM 2003 Conf., Florida, 2003, pp.259-266.
[4] Haixun Wang, Fang Chu, Wei Fan, Philip S Yu, Jian Pei. A fast algorithm for subspace clustering by pattern similarity. In Proc. Scientific and Statistical Database Management Conference, Santorini Island, Greece, 2004, pp.51-62.
[5] Lizhuang Zhao, Mohammed J Zaki. Tricluster: An effective algorithm for mining coherent clusters in 3d microarray data. In Proc. SIGMOD 2005 Conference, Maryland, USA, 2005, pp.51-62.
[6] Jinze Liu, Jiong Yang, Wei Wang. Biclustering in gene expression data by tendency. In Proc. 3rd Int. IEEE Computer Society Computational Systems Bioinformatics Conf., Stanford, USA, 2004, pp.182-193.
[7] Selnur Erdal, Ozgur Ozturk, David L Armbruster et al. A time series analysis of microarray data. In Proc. 4th IEEE Int. Symp. Bioinformatics and Bioengineering Conference, Taichung, 2004, pp.366-378.
[8] Daxin Jiang, Chun Tang, Aidong Zhang. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng., 2004, 16(11): 1370-1386.
[9] Jason Ernst, Gerard J Nau, Ziv Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 2005, 21(Suppl): 159-168.
[10] Yizong Cheng, George M Church. Biclustering of expression data. In Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology 2000 Conference, San Diego, USA, 2000, pp.93-103.
[11] Yu H, Luscombe N, Qian J, Gerstein M. Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet, 2003, 19(8): 422-427.
[12] Zhang Y, Zha H, Chu C H. A time-series biclustering algorithm for revealing co-regulated genes. In Proc. Int. Symp. Information and Technology: Coding and Computing (ITCC 2005), Las Vegas, USA, 2005, pp.32-37.
[13] Terry P Speed. Review of "stochastic complexity in statistical inquiry". IEEE Trans. Information Theory, 1991, 37(6): 1739-1746.
[14] Kesheng Wu, Ekow J. Otoo, Arie Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proc. VLDB 2004 Conference, Canada, 2004, pp.24-35.
[15] Kesheng Wu, Ekow J. Otoo, Arie Shoshani. Compressing bitmap indexes for faster search operations. In Proc. SSDBM 2002 Conference, Scotland, UK, 2002, pp.99-108.
[16] Golub T R, Slonim D K, Tamayo P et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286(5439): 531-537.
[17] Spellman P T, Sherlock G, Zhang M Q et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 1998, 1(9): 3273-3297.
[18] Levine E, Getz G, Domany E. Coupled two-way clustering analysis of gene microarray data. In Proc. National Academy of Sciences USA, 2000, pp.12079-12084.