Journal of Computer Science and Technology ›› 2019, Vol. 34 ›› Issue (1): 3-15. doi: 10.1007/s11390-019-1895-y

Special topics: Artificial Intelligence and Pattern Recognition Emerging Areas

Decoding the Structural Keywords in Protein Structure Universe

Wessam Elhefnawy1, Min Li2, Jian-Xin Wang2, Member, IEEE, and Yaohang Li1,*, Member, ACM, IEEE   

  1. Department of Computer Science, Old Dominion University, Norfolk, VA 23452, U.S.A.
  2. Department of Computer Science, Central South University, Changsha 410083, China
  • Received: 2018-07-13  Revised: 2018-12-04  Online: 2019-01-05  Published: 2019-01-12
  • About author: Wessam Elhefnawy is a Ph.D. candidate in the Department of Computer Science at Old Dominion University, Norfolk, Virginia. His research interests lie in computational biology, scientific computing, machine learning, artificial intelligence, image processing, and bioinformatics. Wessam received his B.Sc. degree in computer engineering from Arab Academy for Science & Technology, and Maritime Transport, Cairo, Egypt, in 2004. He received his M.Sc. degree in computer engineering from Arab Academy for Science & Technology, Cairo, Egypt, in 2011.
  • Supported by:
    This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61728211 and 61832019.

Abstract: Although the protein sequence-structure gap continues to widen with the development of high-throughput sequencing tools, the protein structure universe appears to be approaching completeness, since no proteins with novel structural folds have been deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues, which serve as structural "keywords" that can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in the PDB. We analyze the impact of clustering cut-offs on the performance of the fragment libraries. The Frag-K fragments are then employed as structural features to classify protein structures into the major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary of ~400 Frag-K fragments, each 4 to 20 residues long, is capable of classifying major SCOP folds with high accuracy.

Key words: protein fragment, fold recognition, protein structure universe
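
To make the idea of fragments as structural features concrete, the following is a minimal sketch of how a protein backbone might be turned into a count vector over a fragment library. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which similarity measure is used, so distance RMSD (dRMSD) is assumed here because it needs no superposition, and the names `drmsd`, `fragment_histogram`, and `cutoff` are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Distances between all point pairs of a fragment, in a fixed index order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(frag_a, frag_b):
    """Distance RMSD between two equal-length fragments.

    Assumed similarity measure (not taken from the source). It is invariant
    to rotation and translation, but it cannot tell mirror images apart.
    """
    da, db = pairwise_distances(frag_a), pairwise_distances(frag_b)
    if len(da) != len(db) or not da:
        raise ValueError("fragments must have the same length (>= 2 points)")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def fragment_histogram(backbone, library, cutoff):
    """Count how many backbone windows each library fragment matches best.

    Every sliding window is compared with the library fragments of the same
    length; it is counted for its nearest fragment only if the dRMSD is
    within `cutoff` (a hypothetical threshold in the coordinate units).
    """
    counts = [0] * len(library)
    for length in sorted({len(f) for f in library}):
        candidates = [(i, f) for i, f in enumerate(library) if len(f) == length]
        for start in range(len(backbone) - length + 1):
            window = backbone[start:start + length]
            best_i, best_d = min(
                ((i, drmsd(window, f)) for i, f in candidates),
                key=lambda pair: pair[1],
            )
            if best_d <= cutoff:
                counts[best_i] += 1
    return counts
```

Given a list of C-alpha coordinates for one chain and a library of representative fragments, `fragment_histogram` returns one count per library fragment. A vector of this kind could then be passed to a classifier such as a random forest, although the exact feature encoding used in the paper is not stated in the abstract.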
