计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (2): 299-309.doi: 10.1007/s11390-021-0804-3

所属专题: Emerging Areas

• • 上一篇    下一篇

GAEBic:一种基于图自编码器的miRNA-mRNA靶向数据新型双聚类分析方法

Li Wang1, Hao Zhang1,2,*, Senior Member, CCF, Hao-Wu Chang2, Qing-Ming Qin3, Bo-Rui Zhang4, Xue-Qing Li2, Tian-Heng Zhao2, and Tian-Yue Zhang2   

  1. 1 College of Software, Jilin University, Changchun 130012, China;
    2 College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    3 College of Plant Science, Jilin University, Changchun 130062, China;
    4 Department of Biochemistry, University of Illinois at Urbana-Champaign, Champaign 61820, U.S.A
  • 收稿日期:2020-07-14 修回日期:2021-03-05 出版日期:2021-03-05 发布日期:2021-04-01
  • 通讯作者: Hao Zhang E-mail:zhangh@jlu.edu.cn
  • 作者简介:Li Wang is a Master candidate of College of Software, Jilin University, Changchun. His research interests include data mining, machine learning, bioinformatics, pattern recognition, image processing, and neural network.
  • 基金资助:
    This work was supported by the National Natural Science Foundation of China under Grant No. 62072210 and the Project of the Development and Reform Commission of Jilin Province of China under Grant No. 2019C053-6.

GAEBic: A Novel Biclustering Analysis Method for miRNA-Targeted Gene Data Based on Graph Autoencoder

Li Wang1, Hao Zhang1,2,*, Senior Member, CCF, Hao-Wu Chang2, Qing-Ming Qin3, Bo-Rui Zhang4, Xue-Qing Li2, Tian-Heng Zhao2, and Tian-Yue Zhang2        

  1. 1 College of Software, Jilin University, Changchun 130012, China;
    2 College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    3 College of Plant Science, Jilin University, Changchun 130062, China;
    4 Department of Biochemistry, University of Illinois at Urbana-Champaign, Champaign 61820, U.S.A
  • Received:2020-07-14 Revised:2021-03-05 Online:2021-03-05 Published:2021-04-01
  • Contact: Hao Zhang E-mail:zhangh@jlu.edu.cn
  • About author:Li Wang is a Master candidate of College of Software, Jilin University, Changchun. His research interests include data mining, machine learning, bioinformatics, pattern recognition, image processing, and neural network.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China under Grant No. 62072210 and the Project of the Development and Reform Commission of Jilin Province of China under Grant No. 2019C053-6.

背景:
近年来,由于用于低维数据全局搜索的传统聚类方法不能很好地适应于解决高维数据和大型数据聚类问题,使得双向聚类方法迅速发展起来,在基因分析、文本聚类、推荐系统等领域得到广发应用。目前绝大部分的双聚类算法都是针对差异表达的生物大数据而设计,因为差异表达数据中存在非常丰富的生物信息。相比于差异表达的数据而言,生物大数据中更多的是关系型数据,即二进制数据,典型的有miRNA-mRNA靶向数据,但是针对此类二进制数据的双聚方法鲜有探讨。
目标:
本文提出一种新的双聚类算法对二进制数据进行建模,解析二进制数据中变量集(属性集)之间的相关性,从而获取到类型丰富且泛化性良好的双聚类。
方法:
本文首先应用网络爬虫广泛收集大豆的miRNA-mRNA的相关数据,然后提出一种并行图自编码器模型PGAE用于捕获数据矩阵中样本集或变量集的相似性,并且根据这一相似性度量,进一步提出一种新的非规则聚类策略BiGAE来挖掘符合实际生物意义的双聚类功能模块。按照这一方法本文对大豆的miRNA-mRNA靶向数据进行了必要的预处理,紧接着应用GAEBic方法来挖掘大豆的miRNA-mRNA功能模块。
结果:
本文对大豆miRNA-mRNA靶向数据分别应用GAEBic与Bimax、Bibit、Spectral Biclustering算法,并对大豆的双聚类模块进行GO富集分析,比较四种算法之间的优劣性。本文比较结果发现,GAEBic在挖掘窄型双聚类上具有很大的优势,而且其挖掘到的双聚类中包含更多的价值信息。本文提出的GAEBic方法相关代码及数据结果发布在https://github.com/wang1i/GAEBic。
讨论:
本文从实验结果发现,GAEBic算法在挖掘miRNA靶向mRNA二进制数据中双聚类结果上明显优于Bimax、Bibit以及Spectral Biclustering算法,对于解析miRNA-mRNA之间复杂的调控机制具有巨大的潜力,而且GAEBic可以由用户根据需求指定不同的相似度阈值从而寻找到不同数量的双聚类数和不同质量的双聚类,这点在实际问题中是非常灵活的。同时也说明了通过并行图自编码器PGAE捕获的相似性关系是具有重大价值的,根据这一相似性而定义的双聚类更加符合实际的生物学意义。但是GAEBic时间成本相对较高,未来我们会考虑采用一些技术性方案来优化其中的BiGAE模型,比如利用一种巧妙的树或图结构将原有的多次迭代过程用树或图的遍历来取代。另外,GAEBic目前针对的是关系型数据集,而当今的研究往往面对的是多源异构数据,即数据集来源多样且数据类型不一,如何将GAEBic算法的应用从单源同构数据推广到多源异构数据是我们未来重点考虑的地方。

关键词: 双聚类, 图自编码器, miRNA-mRNA, 二进制数据

Abstract: Unlike traditional clustering analysis, the biclustering algorithm works simultaneously on two dimensions of samples (row) and variables (column). In recent years, biclustering methods have been developed rapidly and widely applied in biological data analysis, text clustering, recommendation system and other fields. The traditional clustering algorithms cannot be well adapted to process high-dimensional data and/or large-scale data. At present, most of the biclustering algorithms are designed for the differentially expressed big biological data. However, there is little discussion on binary data clustering mining such as miRNA-targeted gene data. Here, we propose a novel biclustering method for miRNA-targeted gene data based on graph autoencoder named as GAEBic. GAEBic applies graph autoencoder to capture the similarity of sample sets or variable sets, and takes a new irregular clustering strategy to mine biclusters with excellent generalization. Based on the miRNA-targeted gene data of soybean, we benchmark several different types of the biclustering algorithm, and find that GAEBic performs better than Bimax, Bibit and the Spectral Biclustering algorithm in terms of target gene enrichment. This biclustering method achieves comparable performance on the high throughput miRNA data of soybean and it can also be used for other species.

Key words: biclustering, graph autoencoder, miRNA-targeted gene, binary data

[1] Kuwabara P E. DNA microarrays and gene expression:From experiments to data analysis and modeling. Briefings in Functional Genomics and Proteomics, 2003, 2(1):80-81. DOI:10.1093/bfgp/2.1.80.
[2] Jain A K, Murty M N, Flynn P J et al. Data clustering:A review. ACM Computing Surveys, 1999, 31(3):264-323. DOI:10.1145/331499.331504.
[3] Wang H, Wang W, Yang J et al. Clustering by pattern similarity in large data sets. In Proc. the 2002 ACM SIGMOD International Conference on Management of Data, June 2002, pp.394-405. DOI:10.1145/564691.564737.
[4] Gasch A P, Eisen M B. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 2002, 3(11):Article No. research0059. DOI:10.1186/gb-2002-3-11-research0059.
[5] Cheng Y, Church G M. Biclustering of expression data. In Proc. the 8th International Conference on Intelligent Systems for Molecular Biology, August 2000, pp.93-103.
[6] Madeira S C, Oliveira A L. Biclustering algorithms for biological data analysis:A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1(1):24-45. DOI:10.1109/TCBB.2004.2.
[7] Busygin S, Prokopyev O A, Pardalos P M et al. Biclustering in data mining. Computers & Operations Research, 2008, 35(9):2964-2987. DOI:10.1016/j.cor.2007.01.005.
[8] Eren K, Deveci M, Küçüktunç O et al. A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 2013, 14(3):279-292. DOI:10.1093/bib/bbs032.
[9] Oghabian A, Kilpinen S, Hautaniemi S et al. Biclustering methods:Biological relevance and application in gene expression analysis. PLoS ONE, 2014, 9(3):Ariticle No. e90801. DOI:10.1371/journal.pone.0090801.
[10] Pontes B, R. Giráldez, Aguilar-Ruiz J S. Biclustering on expression data:A review. Journal of Biomedical Informatics, 2015, 57:163-180. DOI:10.1016/j.jbi.2015.06.028.
[11] Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences of the United States of America, 2000, 97(22):12079-12084. DOI:10.1073/pnas.210134797.
[12] Bhattacharya A, De Rajat K. Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics, 2009, 25(21):2795-2801. DOI:10.1093/bioinformatics/btp526.
[13] Prelić A, Bleuler S, Zimmermann P et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 2006, 22(9):1122-1129. DOI:10.1093/bioinformatics/btl060.
[14] Hartigan J A. Direct clustering of a data matrix. Journal of the American Statistical Association, 1972, 67(337):123-129. DOI:10.1080/01621459.1972.10481214.
[15] Yang J, Wang H, Wang W et al. Enhanced biclustering on expression data. In Proc. the 3rd IEEE Symposium on BioInformatics and BioEngineering, March 2003, pp.321-327. DOI:10.1109/BIBE.2003.1188969.
[16] Liu J, Wang W. OP-cluster:Clustering by tendency in high dimensional space. In Proc. the 3rd IEEE International Conference on Data Mining, November 2003, pp.187-194. DOI:10.1109/ICDM.2003.1250919.
[17] Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. In Proc. the 10th International Conference on Intelligent Systems for Molecular Biology, August 2002, pp.136-144.
[18] Rodriguez-Baena D S, Perez-Pulido A J, Aguilarruiz J S. A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics, 2011, 27(19):2738-2745. DOI:10.1093/bioinformatics/btr464.
[19] Alzahrani M, Kuwahara H, Wang W et al. Gracob:A novel graph-based constant-column biclustering method for mining growth phenotype data. Bioinformatics, 2017, 33(16):2523-2531. DOI:10.1093/bioinformatics/btx199.
[20] Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. Bioinformatics, 2003, 19(suppl 2):ii196-ii205. DOI:10.1093/bioinformatics/btg1078.
[21] Kluger Y, Basri R, Chang J T et al. Spectral biclustering of microarray data:Coclustering genes and conditions. Genome Research, 2003, 13(4):703-716. DOI:10.1101/gr.648603.
[22] Kipf T, Welling M. Semi-supervised classification with graph convolutional networks. In Proc. the 5th International Conference on Learning Representations, April 2017.
[23] Niepert M, Ahmed M H, Kutzkov K. Learning convolutional neural networks for graphs. In Proc. the 33rd International Conference on Machine Learning, June 2016, pp.2014-2023.
[24] Kipf T N, Welling M. Variational graph auto-encoders. arXiv:1611.07308, 2016. https://arxiv.org/abs/1611.07308, November 2020.
[25] Zhou J, Cui G, Zhang Z et al. Graph neural networks:A review of methods and applications. arXiv:1812.08434, 2018. https://arxiv.org/abs/1812.08434, July 2020.
[26] Wu Z, Pan S, Chen F et al. A comprehensive survey on graph neural networks. arXiv:1901.00596, 2019. https://arxiv.org/abs/1901.00596v4, December 2019.
[27] Cao S S, Lu W, Xu Q K. Deep neural networks for learning graph representations. In Proc. the 13th AAAI Conference on Artificial Intelligence, February 2016, pp.1145-1152.
[28] Hammer B, Micheli A, Sperduti A. Universal approximation capability of cascade correlation for structures. Neural Computation, 2005, 17(5):1109-1159. DOI:10.1162/0899766053491878.
[29] Wang D, Cui P, Zhu W. Structural deep network embedding. In Proc. the 22nd ACM Conference on Knowledge Discovery and Data Mining, August 2016, pp.1225-1234. DOI:10.1145/2939672.2939753.
[30] Hamilton W L, Ying Z, Leskovec J. Inductive representation learning on large graphs. In Proc. the 31st Annual Conference on Neural Information Processing Systems, December 2017, pp.1024-1034.
[1] Jooil Lee, Yanhua Jin, and Won Suk Lee. SUBic:一种发现高质量双聚类的无监督可扩展框架[J]. , 2013, 28(4): 636-646.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 李万学;. Almost Optimal Dynamic 2-3 Trees[J]. , 1986, 1(2): 60 -71 .
[2] 冯玉琳;. Recursive Implementation of VLSI Circuits[J]. , 1986, 1(2): 72 -82 .
[3] C.Y.Chung; 华宣仁;. A Chinese Information Processing System[J]. , 1986, 1(2): 15 -24 .
[4] 章萃; 赵沁平; 徐家福;. Kernel Language KLND[J]. , 1986, 1(3): 65 -79 .
[5] 黄学东; 蔡莲红; 方棣棠; 迟边进; 周立; 蒋力;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[6] 史忠植;. Knowledge-Based Decision Support System[J]. , 1987, 2(1): 22 -29 .
[7] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] 夏培肃; 方信我; 王玉祥; 严开明; 张廷军; 刘玉兰; 赵春英; 孙继忠;. Design of Array Processor Systems[J]. , 1987, 2(3): 163 -173 .
[9] 孙永强; 陆汝占; 黄小戎;. Termination Preserving Problem in the Transformation of Applicative Programs[J]. , 1987, 2(3): 191 -201 .
[10] S.T.Chanson; 梁路平; A.Kumar;. Throughput Models of CSMA Network with Stations Uniformly Distributed along the Bus[J]. , 1987, 2(4): 243 -264 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: