Journal of Computer Science and Technology, 2022, Vol. 37, Issue (2): 320-329. doi: 10.1007/s11390-021-1174-6

Special Topic: Artificial Intelligence and Pattern Recognition


Imputation of Missing DNA Methylation Data Based on Transfer Learning

  

  • Received: 2020-11-23 Revised: 2021-09-06 Accepted: 2022-02-18 Published: 2022-03-31 Online: 2022-03-31

Imputing DNA Methylation by Transferred Learning Based Neural Network

Xin-Feng Wang1 (王新峰), Xiang Zhou1 (周翔), Jia-Hua Rao1 (饶家华), Zhu-Jin Zhang1 (张柱金), and Yue-Dong Yang1,2,* (杨跃东), Member, CCF        

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
  2. Key Laboratory of Machine Intelligence and Advanced Computing of Ministry of Education (Sun Yat-sen University), Guangzhou 510000, China
  • Received:2020-11-23 Revised:2021-09-06 Accepted:2022-02-18 Online:2022-03-31 Published:2022-03-31
  • Contact: Yue-Dong Yang E-mail:yangyd25@mail.sysu.edu.cn
  • About author: Yue-Dong Yang is a professor in the School of Computer Science and the National Super Computer Center at Guangzhou, Sun Yat-sen University, Guangzhou. He received his Ph.D. degree in computational biology from the University of Science and Technology of China (USTC), Hefei, in 2006. Dr. Yang has published more than 100 articles that have been cited more than 4,000 times, including five ESI highly cited articles. Currently his research group focuses on developing HPC and AI algorithms for multi-scale integration of omics data and intelligent drug design. He is also responsible for constructing the HPC platform for biomedical applications based on the Tianhe-2 supercomputer.
  • Supported by:
    This study was supported by the National Key Research and Development Program of China under Grant No. 2020YFB0204803, the National Natural Science Foundation of China under Grant No. 61772566, the Guangdong Key Field Research and Development Plan under Grant Nos. 2019B020228001 and 2018B010109006, the Introducing Innovative and Entrepreneurial Teams of Guangdong under Grant No. 2016ZT06D211, and the Guangzhou Science and Technology Research Plan under Grant No. 202007030010.

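Background
DNA methylation is an important type of epigenetic modification and plays a vital role in many major diseases, including cancer. With the development of high-throughput sequencing technology, much progress has been made in revealing the relationship between DNA methylation and disease. However, limitations of experimental techniques leave randomly missing values in the measured data, which poses a major challenge for DNA methylation data analysis. Many methods already exist for imputing missing values, but most of them rely on correlations between individual samples, so their results are affected by abnormal cancer samples.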
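Objective
Our goal is to make full use of the general patterns shared across cancers. We learn the general correlations among DNA methylation values from pan-cancer samples and then transfer these correlations to the imputation of data from a single cancer, thereby reducing the negative impact of small single-cancer datasets and of abnormal samples.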
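Methods
We propose a new neural-network-based transfer learning method for imputing missing DNA methylation data, namely TDimpute-DNAmeth. The model is first trained on the pan-cancer dataset to obtain a general model and is then fine-tuned on the target cancer dataset. A model trained in this way learns both the correlations across the pan-cancer dataset and the distinctive characteristics of the target cancer dataset. We use 5-fold cross-validation to ensure the stability of the model and compare it with other methods, including simple mean imputation, k-nearest neighbors (KNN), principal component analysis (PCA), singular value decomposition (SVD), and random forest.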
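The two-stage training described above (a general model trained on pan-cancer data, then fine-tuned on the target cancer) can be illustrated with a small sketch. This is not the TDimpute-DNAmeth implementation: the one-hidden-layer network, the random-masking objective, the layer sizes, the learning rates, and the synthetic matrices are all assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def init_params(n_features, n_hidden, rng):
    """Small random weights for a one-hidden-layer network (sizes are hypothetical)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_features)),
        "b2": np.zeros(n_features),
    }

def forward(params, x):
    """Return the hidden activations and the reconstructed rows."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h, h @ params["W2"] + params["b2"]

def train(params, data, epochs, lr, mask_rate, rng):
    """Learn to reconstruct full rows from rows with randomly zeroed entries.

    Works on a copy, so the parameters passed in are left unchanged.
    """
    params = {k: v.copy() for k, v in params.items()}
    n = data.shape[0]
    for _ in range(epochs):
        keep = rng.random(data.shape) >= mask_rate  # True where a value stays visible
        x_in = data * keep
        h, out = forward(params, x_in)
        err = (out - data) / n                      # gradient of 0.5 * per-sample mean squared error
        grad_h = (err @ params["W2"].T) * (1.0 - h ** 2)
        params["W2"] -= lr * (h.T @ err)
        params["b2"] -= lr * err.sum(axis=0)
        params["W1"] -= lr * (x_in.T @ grad_h)
        params["b1"] -= lr * grad_h.sum(axis=0)
    return params

def impute(params, data, observed):
    """Fill unobserved entries with the network output; keep observed entries as they are."""
    _, out = forward(params, np.where(observed, data, 0.0))
    return np.where(observed, data, out)

rng = np.random.default_rng(0)
pan_cancer = rng.random((200, 20))  # synthetic stand-in for pan-cancer methylation values
target = rng.random((30, 20))       # synthetic stand-in for one cancer type

# Stage 1: general model on the large pan-cancer matrix.
general = train(init_params(20, 8, rng), pan_cancer, epochs=200, lr=0.05, mask_rate=0.3, rng=rng)
# Stage 2: fine-tune the same weights on the small target matrix with a lower learning rate.
specific = train(general, target, epochs=50, lr=0.01, mask_rate=0.3, rng=rng)

observed = rng.random(target.shape) >= 0.3  # simulated missingness pattern
filled = impute(specific, target, observed)
```

Starting the second stage from the pretrained weights, instead of from a fresh initialization, is what lets a small target dataset benefit from the larger pan-cancer one.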
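Results
Tested on 16 cancer datasets, our method proved superior to other commonly used methods. The results indicate that correlations across pan-cancer datasets do help improve imputation accuracy on single-cancer datasets. Further analysis showed that DNA methylation is related to cancer survival and can serve as a biomarker for cancer prognosis.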
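Conclusions
The results show that using transfer learning to exploit correlations of DNA methylation across pan-cancer samples effectively addresses the problem of small sample size and high dimensionality. In tests on simulated missing DNA methylation data, our model consistently outperformed existing methods on both RMSE and R². We further applied it to impute truly missing data and performed survival analysis on the imputed data. The results confirm that data imputed by our model better reflect patient status. More importantly, the framework is not limited to imputing cancer DNA methylation data and could in future be applied to other omics types, other disease types, and other tasks based on the imputed results, such as age prediction and cell classification.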
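The two evaluation metrics named above can be computed on the entries that were masked out to simulate missingness. The sketch below assumes that R² means the coefficient of determination; the paper may instead use the squared Pearson correlation, and the toy values are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over the evaluated entries."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: true values at the masked positions and hypothetical imputed values.
truth = np.array([0.1, 0.4, 0.8, 0.9])
imputed = np.array([0.2, 0.4, 0.7, 0.9])
print(rmse(truth, imputed), r2_score(truth, imputed))
```

For these toy values the RMSE is about 0.071 and R² is about 0.951. A lower RMSE and a higher R² both indicate imputed values closer to the held-out truth.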


Keywords: neural network, transfer learning, DNA methylation, data imputation, survival analysis

Abstract:

DNA methylation is an important epigenetic modification that plays a vital role in many diseases, including cancers. With the development of high-throughput sequencing technology, much progress has been made in uncovering the relationship between DNA methylation and disease. However, analyzing DNA methylation data is challenging because of the missing values caused by the limitations of current techniques. While many methods have been developed to impute the missing values, they are mostly based on correlations between individual samples and are thus limited when applied to the abnormal samples found in cancers. In this study, we present a novel transfer-learning-based neural network for imputing missing DNA methylation data, namely the TDimpute-DNAmeth method. The method learns common relations between DNA methylation from pan-cancer samples and then fine-tunes the learned relations on each specific cancer type to impute the missing data. Tested on 16 cancer datasets, our method was shown to outperform other commonly used methods. Further analyses indicated that DNA methylation is related to cancer survival and can thus be used as a biomarker of cancer prognosis.
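To make concrete what an imputation method "based on the correlations between individual samples" looks like, here is a minimal sketch of k-nearest-neighbor imputation over samples. It is an illustrative baseline only, not the authors' code and not the exact algorithm of any cited tool; the function name `knn_impute`, the distance definition, and the toy matrix are assumptions.

```python
import numpy as np

def knn_impute(data, k=3):
    """Fill NaNs in each row with the column mean over the k most similar rows."""
    filled = data.copy()
    col_means = np.nanmean(data, axis=0)  # fallback when no neighbor has the value
    for i in range(data.shape[0]):
        miss = np.isnan(data[i])
        if not miss.any():
            continue
        # Compare rows only on columns observed in both row i and the other row.
        shared = ~np.isnan(data) & ~miss
        diff = np.where(shared, data - data[i], 0.0)
        counts = shared.sum(axis=1)
        dist = np.full(data.shape[0], np.inf)
        ok = counts > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / counts[ok]  # mean squared difference
        dist[i] = np.inf  # a row is not its own neighbor
        order = np.argsort(dist)
        for j in np.flatnonzero(miss):
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not np.isnan(data[r, j])][:k]
            filled[i, j] = data[donors, j].mean() if donors else col_means[j]
    return filled

# Rows are samples, columns are methylation sites (toy values).
data = np.array([
    [0.10, 0.20, np.nan],
    [0.10, 0.20, 0.90],
    [0.80, 0.90, 0.10],
    [0.12, 0.22, 0.70],
])
filled = knn_impute(data, k=2)
```

Here the missing value in the first row is filled from the two most similar rows (the second and fourth), giving 0.8. Because the donors are chosen purely by sample similarity within one dataset, an atypical sample has few good neighbors, which is the limitation the abstract points to.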


Key words: neural network, transfer learning, DNA methylation, data imputation, survival analysis
