|
计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (2): 299-309.doi: 10.1007/s11390-021-0804-3
所属专题: Emerging Areas
Li Wang1, Hao Zhang1,2,*, Senior Member, CCF, Hao-Wu Chang2, Qing-Ming Qin3, Bo-Rui Zhang4, Xue-Qing Li2, Tian-Heng Zhao2, and Tian-Yue Zhang2
Li Wang1, Hao Zhang1,2,*, Senior Member, CCF, Hao-Wu Chang2, Qing-Ming Qin3, Bo-Rui Zhang4, Xue-Qing Li2, Tian-Heng Zhao2, and Tian-Yue Zhang2
背景:
近年来,由于用于低维数据全局搜索的传统聚类方法不能很好地适应于解决高维数据和大型数据聚类问题,使得双向聚类方法迅速发展起来,在基因分析、文本聚类、推荐系统等领域得到广发应用。目前绝大部分的双聚类算法都是针对差异表达的生物大数据而设计,因为差异表达数据中存在非常丰富的生物信息。相比于差异表达的数据而言,生物大数据中更多的是关系型数据,即二进制数据,典型的有miRNA-mRNA靶向数据,但是针对此类二进制数据的双聚方法鲜有探讨。
目标:
本文提出一种新的双聚类算法对二进制数据进行建模,解析二进制数据中变量集(属性集)之间的相关性,从而获取到类型丰富且泛化性良好的双聚类。
方法:
本文首先应用网络爬虫广泛收集大豆的miRNA-mRNA的相关数据,然后提出一种并行图自编码器模型PGAE用于捕获数据矩阵中样本集或变量集的相似性,并且根据这一相似性度量,进一步提出一种新的非规则聚类策略BiGAE来挖掘符合实际生物意义的双聚类功能模块。按照这一方法本文对大豆的miRNA-mRNA靶向数据进行了必要的预处理,紧接着应用GAEBic方法来挖掘大豆的miRNA-mRNA功能模块。
结果:
本文对大豆miRNA-mRNA靶向数据分别应用GAEBic与Bimax、Bibit、Spectral Biclustering算法,并对大豆的双聚类模块进行GO富集分析,比较四种算法之间的优劣性。本文比较结果发现,GAEBic在挖掘窄型双聚类上具有很大的优势,而且其挖掘到的双聚类中包含更多的价值信息。本文提出的GAEBic方法相关代码及数据结果发布在https://github.com/wang1i/GAEBic。
讨论:
本文从实验结果发现,GAEBic算法在挖掘miRNA靶向mRNA二进制数据中双聚类结果上明显优于Bimax、Bibit以及Spectral Biclustering算法,对于解析miRNA-mRNA之间复杂的调控机制具有巨大的潜力,而且GAEBic可以由用户根据需求指定不同的相似度阈值从而寻找到不同数量的双聚类数和不同质量的双聚类,这点在实际问题中是非常灵活的。同时也说明了通过并行图自编码器PGAE捕获的相似性关系是具有重大价值的,根据这一相似性而定义的双聚类更加符合实际的生物学意义。但是GAEBic时间成本相对较高,未来我们会考虑采用一些技术性方案来优化其中的BiGAE模型,比如利用一种巧妙的树或图结构将原有的多次迭代过程用树或图的遍历来取代。另外,GAEBic目前针对的是关系型数据集,而当今的研究往往面对的是多源异构数据,即数据集来源多样且数据类型不一,如何将GAEBic算法的应用从单源同构数据推广到多源异构数据是我们未来重点考虑的地方。
[1] Kuwabara P E. DNA microarrays and gene expression:From experiments to data analysis and modeling. Briefings in Functional Genomics and Proteomics, 2003, 2(1):80-81. DOI:10.1093/bfgp/2.1.80. [2] Jain A K, Murty M N, Flynn P J et al. Data clustering:A review. ACM Computing Surveys, 1999, 31(3):264-323. DOI:10.1145/331499.331504. [3] Wang H, Wang W, Yang J et al. Clustering by pattern similarity in large data sets. In Proc. the 2002 ACM SIGMOD International Conference on Management of Data, June 2002, pp.394-405. DOI:10.1145/564691.564737. [4] Gasch A P, Eisen M B. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 2002, 3(11):Article No. research0059. DOI:10.1186/gb-2002-3-11-research0059. [5] Cheng Y, Church G M. Biclustering of expression data. In Proc. the 8th International Conference on Intelligent Systems for Molecular Biology, August 2000, pp.93-103. [6] Madeira S C, Oliveira A L. Biclustering algorithms for biological data analysis:A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1(1):24-45. DOI:10.1109/TCBB.2004.2. [7] Busygin S, Prokopyev O A, Pardalos P M et al. Biclustering in data mining. Computers & Operations Research, 2008, 35(9):2964-2987. DOI:10.1016/j.cor.2007.01.005. [8] Eren K, Deveci M, Küçüktunç O et al. A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 2013, 14(3):279-292. DOI:10.1093/bib/bbs032. [9] Oghabian A, Kilpinen S, Hautaniemi S et al. Biclustering methods:Biological relevance and application in gene expression analysis. PLoS ONE, 2014, 9(3):Ariticle No. e90801. DOI:10.1371/journal.pone.0090801. [10] Pontes B, R. Giráldez, Aguilar-Ruiz J S. Biclustering on expression data:A review. Journal of Biomedical Informatics, 2015, 57:163-180. DOI:10.1016/j.jbi.2015.06.028. [11] Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences of the United States of America, 2000, 97(22):12079-12084. DOI:10.1073/pnas.210134797. [12] Bhattacharya A, De Rajat K. Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics, 2009, 25(21):2795-2801. DOI:10.1093/bioinformatics/btp526. [13] Prelić A, Bleuler S, Zimmermann P et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 2006, 22(9):1122-1129. DOI:10.1093/bioinformatics/btl060. [14] Hartigan J A. Direct clustering of a data matrix. Journal of the American Statistical Association, 1972, 67(337):123-129. DOI:10.1080/01621459.1972.10481214. [15] Yang J, Wang H, Wang W et al. Enhanced biclustering on expression data. In Proc. the 3rd IEEE Symposium on BioInformatics and BioEngineering, March 2003, pp.321-327. DOI:10.1109/BIBE.2003.1188969. [16] Liu J, Wang W. OP-cluster:Clustering by tendency in high dimensional space. In Proc. the 3rd IEEE International Conference on Data Mining, November 2003, pp.187-194. DOI:10.1109/ICDM.2003.1250919. [17] Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. In Proc. the 10th International Conference on Intelligent Systems for Molecular Biology, August 2002, pp.136-144. [18] Rodriguez-Baena D S, Perez-Pulido A J, Aguilarruiz J S. A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics, 2011, 27(19):2738-2745. DOI:10.1093/bioinformatics/btr464. [19] Alzahrani M, Kuwahara H, Wang W et al. Gracob:A novel graph-based constant-column biclustering method for mining growth phenotype data. Bioinformatics, 2017, 33(16):2523-2531. DOI:10.1093/bioinformatics/btx199. [20] Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. Bioinformatics, 2003, 19(suppl 2):ii196-ii205. DOI:10.1093/bioinformatics/btg1078. [21] Kluger Y, Basri R, Chang J T et al. Spectral biclustering of microarray data:Coclustering genes and conditions. Genome Research, 2003, 13(4):703-716. DOI:10.1101/gr.648603. [22] Kipf T, Welling M. Semi-supervised classification with graph convolutional networks. In Proc. the 5th International Conference on Learning Representations, April 2017. [23] Niepert M, Ahmed M H, Kutzkov K. Learning convolutional neural networks for graphs. In Proc. the 33rd International Conference on Machine Learning, June 2016, pp.2014-2023. [24] Kipf T N, Welling M. Variational graph auto-encoders. arXiv:1611.07308, 2016. https://arxiv.org/abs/1611.07308, November 2020. [25] Zhou J, Cui G, Zhang Z et al. Graph neural networks:A review of methods and applications. arXiv:1812.08434, 2018. https://arxiv.org/abs/1812.08434, July 2020. [26] Wu Z, Pan S, Chen F et al. A comprehensive survey on graph neural networks. arXiv:1901.00596, 2019. https://arxiv.org/abs/1901.00596v4, December 2019. [27] Cao S S, Lu W, Xu Q K. Deep neural networks for learning graph representations. In Proc. the 13th AAAI Conference on Artificial Intelligence, February 2016, pp.1145-1152. [28] Hammer B, Micheli A, Sperduti A. Universal approximation capability of cascade correlation for structures. Neural Computation, 2005, 17(5):1109-1159. DOI:10.1162/0899766053491878. [29] Wang D, Cui P, Zhu W. Structural deep network embedding. In Proc. the 22nd ACM Conference on Knowledge Discovery and Data Mining, August 2016, pp.1225-1234. DOI:10.1145/2939672.2939753. [30] Hamilton W L, Ying Z, Leskovec J. Inductive representation learning on large graphs. In Proc. the 31st Annual Conference on Neural Information Processing Systems, December 2017, pp.1024-1034. |
[1] | Jooil Lee, Yanhua Jin, and Won Suk Lee. SUBic:一种发现高质量双聚类的无监督可扩展框架[J]. , 2013, 28(4): 636-646. |
|
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |