›› 2014,Vol. 29 ›› Issue (3): 423-435.doi: 10.1007/s11390-014-1440-y

所属专题: Artificial Intelligence and Pattern Recognition Data Management and Data Mining Emerging Areas

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

基于最优集合覆盖的基因表达独占行二聚类研究

Amichai Painsky and Saharon Rosset   

  1. School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel
  • 收稿日期:2013-08-30 修回日期:2014-03-14 出版日期:2014-05-05 发布日期:2014-05-05
  • 作者简介:Amichai Painsky received his B.Sc. degree in electrical engineering from Tel Aviv University (2007) and his M.Eng. degree in electrical engineering from Princeton University (2009). He is currently carrying a statistics Ph.D. at School of Mathematical Sciences, Tel Aviv University. His research interests include data mining, machine learning, statistical learning and their connection to information theory.
  • 基金资助:

    This research was funded in part by Israeli Science Foundation under Grant No. 1227/09 and by a grant to Amichai Painsky from the Israeli Center for Absorption in Science.

Optimal Set Cover Formulation for Exclusive Row Biclustering of Gene Expression

Amichai Painsky and Saharon Rosset   

  1. School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel
  • Received:2013-08-30 Revised:2014-03-14 Online:2014-05-05 Published:2014-05-05
  • About author:Amichai Painsky received his B.Sc. degree in electrical engineering from Tel Aviv University (2007) and his M.Eng. degree in electrical engineering from Princeton University (2009). He is currently carrying a statistics Ph.D. at School of Mathematical Sciences, Tel Aviv University. His research interests include data mining, machine learning, statistical learning and their connection to information theory.
  • Supported by:

    This research was funded in part by Israeli Science Foundation under Grant No. 1227/09 and by a grant to Amichai Painsky from the Israeli Center for Absorption in Science.

在过去的十年里,大量可用的微阵列数据引起了研究人员对于双分聚类方法越来越大的兴趣。根据不同的相似度量和约束条件,研究人员提出了一系列用于识别基因和环境子集的算法。本文聚焦于基因表达中的独占行二聚类(即投影聚类)问题,即一行数据只能属于二聚类中的某一类,而一列数据则可能属于多个类。此种类型的二聚类对于癌症患者的划分可能十分有用。例如,每一个患者(行)只会患上一种类型的癌症,而一种癌症类型则可能与多个(或相互重叠的)基因(列)相关。基于最优集合覆盖问题,结合已有的二聚类算法和组合拍卖技术,本文提出了一种新的独占行二聚类方法。进一步,基于与零模型的比较和Gap统计方法,本文提出了此种新的独占行二聚类方法的阈值调整方法。无论是在仿真或者真实的基因表达数据上,本文所提出的方法都能识别出具有实际特定意义的大跨度无重叠行子矩阵。

Abstract: The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem (also known as projected clustering) for gene expression, in which each row can only be a member of a single bicluster while columns can participate in multiple clusters. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters in the spirit of the optimal set cover problem. We present our algorithmic solution as a combination of existing biclustering algorithms and combinatorial auction techniques. Furthermore, we devise an approach for tuning the threshold of our algorithm based on comparison with a null model, inspired by the Gap statistic approach. We demonstrate our approach on both synthetic and real world gene expression data and show its power in identifying large span non-overlapping rows submatrices, while considering their unique nature.

[1] Madeira S C, Oliveira A L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1(1): 2445.

[2] Cheng Y, Church G M. Biclustering of expression data. In Proc. the 8th Int. Conf. Intelligent Systems for Molecular Biology, Aug. 2000, pp.93-103.

[3] Yang J, Wang W, Wang H, Yu P S. Enhanced biclustering on expression data. In Proc. the 3rd IEEE Symposium on Bioinformatics and Bioengineering, Mar. 2000, pp.321-327.

[4] Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. Bioinformatics, 2003, 19(suppl. 2): 196205.

[5] Tang C, Zhang L, Zhang A, Ramanathan M. Interrelated twoway clustering: An unsupervised approach for gene expression data analysis. In Proc. the 2nd IEEE Int. Symposium on Bioinformatics and Bioengineering, Nov. 2001, pp.41-48.

[6] Divina F, Aguilar-Ruize J. Biclustering of expression data with evolutionary computation. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(5): 590-602.

[7] Aggarwal C C, Procopiuc C, Wolf J L, Yu P S, Park J S. Fast algorithm for projected clustering. ACM SIGMOD Record, 1999, 28(2): 61-72.

[8] Yip K Y, Cheng D W, Ng M K. HARP: A practical projected clustering algorithm. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(11): 1387-1397.

[9] Bouguessa M, Wang S. PCGEN: A practical approach to projected clustering and its application to gene expression data. In Proc. the IEEE Symposium on Computational Intelligence and Data Mining, April 2007, pp.661-667.

[10] Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 2002, 18(suppl. 1): 136-144.

[11] Ayadi W, Elloumi M, Hao J K. BicFinder: A biclustering algorithm for microarray data analysis. Knowledge and Information Systems, 2012, 30(2): 341-358.

[12] Vuki?evi? M, Kirchner K, Delibaši? B, Jovanovi? M, Ruhland J, Suknovi?M. Finding best algorithmic components for clustering microarray data. Knowledge and Information Systems, 2013, 35(1): 111-130.

[13] Leyton-Brown K. Resource allocation in competitive multiagent systems [Ph.D. Thesis]. Stanford University, 2003.

[14] Rothkopf M, Pekec A, Harstad R. Computationally manageable combinatorial auctions. Management Science, 1998, 44(8): 1131-1147.

[15] de Vries S, Vohra R. Combinatorial auctions: A survey. INFORMS Journal on Computing, 2003, 15(3): 284-309.

[16] Nisan N. Bidding and allocation in combinatorial auctions. In Proc. the 2nd ACM Conference on Electronic Commerce, Oct. 2000, pp.1-12.

[17] Tenenholtz M. Some tractable cominatorial auctions. In Proc. the AAAI/IAAI, Jul. 2000, pp.98-103.

[18] Sandholm T. Algorithm for optimal winner determination in combinatorial auctions. Artificial Intelligence, 2002, 135(1/2): 1-54.

[19] Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2001, 63(2): 411-423.

[20] Mohajer M, Englmeier K H, Schmid V J. A comparison of Gap statistic definitions with and without logarithm function. arXiv:1103.4767v1 [Stat ME], 2011.

[21] Armstrong S A, Stauton J E, Silveman L B et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 2002, 30(1): 41-47.

[22] Gordon G, Jensen R, Hsiao L et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 2002, 62(17): 4963-4967.

[23] Hubert L, Araie P. Comparing partitions. Journal of Classificastion, 1985, 2(1): 193-218.

[24] Duan K B, Rajapakse J C,Wang H, Azuaje F. Multiple SVMRFE for gene selection in cancer classification with expression data. IEEE Transactions on NanoBioscience, 2005, 4(3): 228-234.

[25] Alba E, Garcia-Nieto J, Jourdan L, Talbi E G. Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. In Proc. the IEEE Congress on Evolutionary Computation, Sept. 2007, pp.284-290.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: