Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (2): 234-247. doi: 10.1007/s11390-021-0851-9

Special Topic: Emerging Areas


A Framework for Predicting Human Blood-Secretory Proteins Based on Transfer Learning and Deep Learning

Wei Du1, Member, CCF, IEEE, Yu Sun1, Hui-Min Bao1, Liang Chen2, Member, CCF, Ying Li1,*, Senior Member, CCF, and Yan-Chun Liang1,3,*, Senior Member, CCF   

    1 Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    2 Department of Computer Science, College of Engineering, Shantou University, Shantou 515063, China;
    3 Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai 519041, China
  • Received: 2020-07-30 Revised: 2021-02-28 Online: 2021-03-05 Published: 2021-04-01
  • Contact: Ying Li, Yan-Chun Liang E-mail: liying@jlu.edu.cn; ycliang@jlu.edu.cn
  • About author: Wei Du received his Ph.D. degree in computer science and technology from Jilin University, Changchun, in 2011. He was a visiting scholar with the University of Georgia, Athens, from 2015 to 2016. He is currently an associate professor in the College of Computer Science and Technology, Jilin University, Changchun. He has published more than 40 journal and conference papers. His major research interests include bioinformatics, computational biology, and computational intelligence.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China under Grant Nos. 61872418, 61972174, and 62002212, the Natural Science Foundation of Jilin Province of China under Grant Nos. 20180101050JC and 20180101331JC, the Science and Technology Planning Project of Guangdong Province of China under Grant No. 2020A0505100018, and the Guangdong Key-Project for Applied Fundamental Research under Grant No. 2018KZDXM076.

DeepHBSP: A Deep Learning Framework for Predicting Human Blood-Secretory Proteins Using Transfer Learning


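1. Context:
The identification and detection of protein biomarkers in blood have important clinical value. However, because the protein composition of human blood is complex, it is difficult to directly compare and analyze blood proteomic data between disease and control samples. A feasible approach is to perform comparative analysis on the proteins that may be secreted into the blood. Existing methods for predicting blood-secretory proteins are mainly based on traditional machine learning algorithms, which rely heavily on annotated protein features. However, feature engineering and feature selection may yield incomplete or biased features.
2. Objective:
This study aims to propose a blood-secretory protein prediction method that does not depend on annotated protein features and predicts directly from amino acid sequences. The method automatically learns protein feature representations and performs end-to-end, high-accuracy prediction of blood-secretory proteins from amino acid sequences.
3. Method:
We propose DeepHBSP, a deep learning model combined with transfer learning, which integrates a classification network and a ranking network to predict blood-secretory proteins using amino acid sequence information alone. The feature extraction subnetwork is composed of a multi-lane capsule network, and the training loss combines the classification loss of the classification network with the compactness loss of the ranking network. Because only a small sample of verified blood-secretory proteins is available, transfer learning is used to train a highly accurate generalized model.
4. Results and Findings:
For the classification task, the proposed model achieves prediction accuracies of 0.915 and 0.917 on the training set and the independent test set, respectively. It outperforms existing methods based on traditional machine learning algorithms as well as other mainstream deep learning methods for biological sequence analysis. On human blood-secretory proteins obtained from biological experiments, the model achieves a true positive rate of 0.895; on known blood biomarkers of colorectal cancer and lung cancer, it achieves true positive rates of 0.878 and 0.858, respectively. We also developed a web server for blood-secretory protein prediction, available at http://www.csbg-jlu.info/DeepHBSP/.
5. Conclusions:
The proposed model is of practical value to biomedical researchers searching for protein biomarkers in blood, especially when they already have candidate proteins obtained from transcriptomic or proteomic data analysis. The main contributions of this study are as follows: 1) a deep learning model that uses only amino acid sequences is proposed; it performs well and outperforms existing blood-secretory protein prediction methods; 2) the model focuses on the feature distribution of blood-secretory proteins and provides both classification and ranking predictions; 3) the blood-secretory proteins identified by the model are statistically significant compared with known blood-based cancer biomarkers.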
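For readers who want to reproduce the kind of figures quoted in the results, the following is a minimal sketch of how accuracy and true positive rate are conventionally computed from binary labels and predictions. The function names and the toy labels are illustrative assumptions; this is not the authors' evaluation code, and it does not use their data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = blood-secretory."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of all proteins whose predicted label matches the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def true_positive_rate(y_true, y_pred):
    """Fraction of truly positive proteins that are predicted positive."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)


# Toy example (made-up labels, not data from the paper).
toy_true = [1, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 0, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_tpr = true_positive_rate(toy_true, toy_pred)
```

When the evaluation set contains only known positives, such as a list of established biomarkers, the true positive rate reduces to the fraction of that list the model predicts as blood-secretory.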


Abstract: The identification of blood-secretory proteins and the detection of protein biomarkers in the blood have important clinical value. Existing methods for predicting blood-secretory proteins are mainly based on traditional machine learning algorithms and rely heavily on annotated protein features. Unlike traditional machine learning algorithms, deep learning algorithms can automatically learn better feature representations from raw data, and are therefore promising for predicting blood-secretory proteins. We present a novel deep learning model (DeepHBSP), combined with transfer learning, that integrates a binary classification network and a ranking network to identify blood-secretory proteins from amino acid sequence information alone. During training, the loss function of DeepHBSP applies a descriptive loss to the binary classification network and a compactness loss to the ranking network. The feature extraction subnetwork of DeepHBSP is a multi-lane capsule network. Additionally, transfer learning is used to train a highly accurate generalized model from a small sample of blood-secretory proteins. The main contributions of this study are as follows: 1) a novel deep learning architecture integrating a binary classification network and a ranking network is proposed, which is superior to existing traditional machine learning algorithms and other state-of-the-art deep learning architectures for biological sequence analysis; 2) the proposed model for blood-secretory protein prediction uses only amino acid sequences, overcoming the heavy dependence of existing methods on annotated protein features; 3) the blood-secretory proteins predicted by our model are statistically significant compared with existing blood-based biomarkers of cancer.
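To make the two-part training objective concrete, here is a minimal NumPy sketch of one way a descriptive loss and a compactness loss can be combined. It rests on several assumptions that the abstract does not state: the descriptive loss is taken to be binary cross-entropy, the compactness loss is taken to be the mean squared distance of each feature vector from the mean of the other vectors in the batch, and the two are joined by a hypothetical scalar `weight`. It is not the authors' implementation.

```python
import numpy as np


def descriptive_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and 0/1 labels.

    Assumption: the abstract does not give the exact form of this loss.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))


def compactness_loss(features):
    """Mean squared distance of each feature vector from the mean of the others.

    Assumption: one common compactness formulation; small values mean the
    batch of feature vectors is tightly clustered.
    """
    x = np.asarray(features, dtype=float)
    n, k = x.shape
    if n < 2:
        raise ValueError("need at least two feature vectors")
    # For row i, the mean of all rows except row i.
    others_mean = (x.sum(axis=0, keepdims=True) - x) / (n - 1)
    z = x - others_mean
    return float(np.sum(z * z) / (n * k))


def combined_loss(probs, labels, features, weight=0.1):
    """Descriptive loss plus a weighted compactness loss.

    `weight` is a hypothetical hyperparameter introduced for illustration.
    """
    return descriptive_loss(probs, labels) + weight * compactness_loss(features)


# Toy batch (made-up numbers): two proteins with 2-dimensional features.
toy_total = combined_loss(probs=[0.5, 0.5], labels=[0, 1],
                          features=[[0.0, 0.0], [2.0, 0.0]], weight=0.5)
```

In this sketch the compactness term is computed on the features of the positive (blood-secretory) class only if the caller passes only those rows; how the batches are split between the two networks is left open here because the abstract does not specify it.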

Key words: blood-secretory protein, deep learning, capsule network, transfer learning
