

Decoding the Structural Keywords in Protein Structure Universe


     

Abstract: Although the protein sequence-structure gap continues to widen with the development of high-throughput sequencing tools, the protein structure universe appears to be approaching completeness: no proteins with novel structural folds have been deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues, which serve as structural "keywords" that can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in the PDB. We analyze the impact of clustering cut-offs on the performance of the fragment libraries. The Frag-K fragments are then employed as structural features to classify protein structures into the major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary of about 400 Frag-K fragments, each 4 to 20 residues long, is capable of classifying major SCOP folds with high accuracy.
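As an illustration of how a fragment dictionary can act as a set of structural features, the following minimal sketch counts, for each dictionary entry, how many sliding windows of a backbone are closest to it. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`distance_signature`, `fragment_histogram`), the use of pairwise C-alpha distances as a rotation-invariant descriptor, the nearest-entry assignment, and the two toy 4-residue "keywords". The abstract does not state how fragments are actually matched or encoded.

```python
import math

def distance_signature(coords):
    """Pairwise distances within a fragment (unchanged by rotation/translation)."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

def signature_distance(a, b):
    """Root-mean-square difference between two equal-length signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def fragment_histogram(backbone, dictionary):
    """Count, per dictionary entry, the sliding windows closest to it.

    backbone:   list of (x, y, z) C-alpha coordinates (hypothetical input)
    dictionary: name -> list of (x, y, z); all entries share one length here
    """
    frag_len = len(next(iter(dictionary.values())))
    signatures = {name: distance_signature(frag)
                  for name, frag in dictionary.items()}
    counts = {name: 0 for name in dictionary}
    for start in range(len(backbone) - frag_len + 1):
        window = distance_signature(backbone[start:start + frag_len])
        nearest = min(signatures,
                      key=lambda name: signature_distance(window, signatures[name]))
        counts[nearest] += 1
    return counts

# Toy dictionary: two made-up 4-residue "keywords" with 3.8 A spacing.
dictionary = {
    "extended": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)],
    "compact":  [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)],
}

# A straight 6-residue chain along the y-axis yields three 4-residue windows.
chain = [(0.0, 3.8 * i, 0.0) for i in range(6)]
histogram = fragment_histogram(chain, dictionary)
print(histogram)
```

The resulting counts form a fixed-length feature vector per structure, which could then be passed to any standard classifier. A real dictionary would mix fragment lengths from 4 to 20 residues, so each length would need its own set of windows.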

     
