GAEBic: A Novel Graph Autoencoder-Based Biclustering Method for miRNA-mRNA Targeting Data

doi:10.1007/s11390-021-0804-3

GAEBic: A Novel Biclustering Analysis Method for miRNA-Targeted Gene Data Based on Graph Autoencoder

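Abstract

Background:
Traditional clustering methods, which perform a global search over low-dimensional data, adapt poorly to clustering high-dimensional and large-scale data. As a result, biclustering methods have developed rapidly in recent years and are now widely applied in gene analysis, text clustering, recommendation systems, and other fields. At present, the vast majority of biclustering algorithms are designed for differentially expressed big biological data, because differential expression data contain very rich biological information. A larger share of big biological data, however, is relational data, that is, binary data, typified by miRNA-mRNA targeting data, and biclustering methods for such binary data have rarely been explored.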
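Objective:
We propose a new biclustering algorithm that models binary data and analyzes the correlations among the variable sets (attribute sets) in those data, so as to obtain biclusters that are diverse in type and generalize well.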
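Methods:
We first use a web crawler to collect soybean miRNA-mRNA data broadly. We then propose a parallel graph autoencoder model, PGAE, to capture the similarity of sample sets or variable sets in the data matrix. Based on this similarity measure, we further propose a new irregular clustering strategy, BiGAE, to mine bicluster functional modules that are biologically meaningful. Following this approach, we apply the necessary preprocessing to the soybean miRNA-mRNA targeting data and then apply GAEBic to mine soybean miRNA-mRNA functional modules.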
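The idea of grouping rows by a similarity measure and a threshold can be illustrated with a minimal sketch. It is not the PGAE or BiGAE procedure: Jaccard similarity between target sets stands in for the similarity that PGAE would learn, a greedy seed-based grouping stands in for the irregular clustering strategy, and the names `jaccard`, `threshold_biclusters`, and `toy` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of column (target gene) indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_biclusters(matrix, threshold):
    """Greedy illustration only (an assumption, not the authors' BiGAE).

    Each unassigned row seeds a group, rows at least `threshold`-similar
    to the seed join it, and the columns shared by every row in the group
    become the bicluster's columns.
    """
    # Represent each row (miRNA) by the set of columns (mRNAs) it targets.
    targets = [{j for j, v in enumerate(row) if v} for row in matrix]
    unassigned = set(range(len(matrix)))
    biclusters = []
    for seed in range(len(matrix)):
        if seed not in unassigned:
            continue
        rows = [seed] + [
            i for i in sorted(unassigned)
            if i != seed and jaccard(targets[seed], targets[i]) >= threshold
        ]
        cols = set.intersection(*(targets[i] for i in rows))
        unassigned -= set(rows)
        # Keep only groups with more than one row and at least one shared column.
        if len(rows) > 1 and cols:
            biclusters.append((rows, sorted(cols)))
    return biclusters


# Toy binary matrix: 5 miRNAs (rows) x 6 mRNAs (columns), invented for illustration.
toy = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1],
]

print(threshold_biclusters(toy, 0.6))
print(threshold_biclusters(toy, 0.2))
```

On this toy matrix, lowering the threshold pulls a fifth row into the first group and shrinks its shared column set, which illustrates how a user-chosen similarity threshold trades the number of rows against the width of a bicluster.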
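Results:
We apply GAEBic, Bimax, Bibit, and Spectral Biclustering to the soybean miRNA-mRNA targeting data, perform GO enrichment analysis on the resulting soybean bicluster modules, and compare the strengths and weaknesses of the four algorithms. The comparison shows that GAEBic has a clear advantage in mining narrow biclusters and that the biclusters it finds contain more valuable information. The code and results for GAEBic are available at https://github.com/wang1i/GAEBic.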
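GO enrichment of a gene module is commonly scored with a one-sided hypergeometric test. The abstract does not say which test was used, so the sketch below is an assumption about a typical setup, and the function name `enrichment_p_value` and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of background genes
    annotated:  background genes carrying the GO term
    sample:     genes in the bicluster
    hits:       bicluster genes carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# 10 background genes, 4 annotated; a 3-gene module contains 2 annotated genes.
print(enrichment_p_value(10, 4, 3, 2))
```

A smaller value indicates that the module contains more annotated genes than random sampling would be expected to produce. When many GO terms are tested, a multiple-testing correction is normally applied as well.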
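Discussion:
The experimental results show that GAEBic clearly outperforms Bimax, Bibit, and Spectral Biclustering at mining biclusters from binary miRNA-mRNA targeting data, and that it has great potential for analyzing the complex regulatory mechanisms between miRNAs and mRNAs. GAEBic also lets users specify different similarity thresholds as needed, yielding different numbers of biclusters of different quality, which is very flexible in practice. The results further indicate that the similarity relations captured by the parallel graph autoencoder PGAE are highly valuable, and that biclusters defined by this similarity are more biologically meaningful. However, the time cost of GAEBic is relatively high. In future work we will consider technical approaches to optimizing the BiGAE model, for example using a suitable tree or graph structure so that the current repeated iterations are replaced by a tree or graph traversal. In addition, GAEBic currently targets relational datasets, whereas current research often faces multi-source heterogeneous data, that is, datasets from diverse sources with differing data types. Extending GAEBic from single-source homogeneous data to multi-source heterogeneous data is a main focus of our future work.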

Abstract: Unlike traditional clustering analysis, a biclustering algorithm works simultaneously on two dimensions: samples (rows) and variables (columns). In recent years, biclustering methods have developed rapidly and have been widely applied in biological data analysis, text clustering, recommendation systems, and other fields, because traditional clustering algorithms do not adapt well to high-dimensional and/or large-scale data. At present, most biclustering algorithms are designed for differentially expressed big biological data, and there has been little discussion of biclustering for binary data such as miRNA-targeted gene data. Here, we propose GAEBic, a novel biclustering method for miRNA-targeted gene data based on a graph autoencoder. GAEBic applies a graph autoencoder to capture the similarity of sample sets or variable sets, and uses a new irregular clustering strategy to mine biclusters that generalize well. Using the miRNA-targeted gene data of soybean, we benchmark several different types of biclustering algorithm and find that GAEBic performs better than Bimax, Bibit, and the Spectral Biclustering algorithm in terms of target gene enrichment. This biclustering method achieves comparable performance on the high-throughput miRNA data of soybean, and it can also be used for other species.
