2010, Vol. 25, Issue (1): 82-94.

• Special Issue on Computational Challenges from Modern Molecular Biology •

Understanding the "Horizontal Dimension" of Molecular Evolution to Annotate, Classify, and Discover Proteins with Functional Domains

Gloria Rendon1,2, Mao-Feng Ger2,3, Ruth Kantorovitz1,4, Shreedhar Natarajan5, Jeffrey Tilson6, and Eric Jakobsson1,2,3, Fellow, APS   

    1National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, U.S.A.
    2Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, U.S.A.
    3Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, U.S.A.
    4Department of Mathematics, University of Illinois at Urbana-Champaign, U.S.A.
    5Department of Biology, University of Pennsylvania, Philadelphia, U.S.A.
    6Renaissance Computing Institute, Chapel Hill, North Carolina 27517, U.S.A.
  • Received:2009-10-05 Revised:2009-12-16 Online:2010-01-05 Published:2010-01-05
  • About author:
    Jeffrey Tilson is currently a senior research scientist at the Renaissance Computing Institute, USA. He received the Ph.D. degree in physical chemistry from Michigan State University in 1992. His research interests are parallel processing, development of parallel algorithms, performance models, computational chemistry, and computational biology.
    Eric Jakobsson is the director of the National Center for Design of Biomimetic Nanoconductors and is a professor in the Department of Molecular and Integrative Physiology at the University of Illinois at Urbana-Champaign. He also has appointments at the National Center for Supercomputing Applications (NCSA) and the Beckman Institute for Advanced Science and Technology. His lab works on computational studies of membrane biophysics and organization, ion channel function, and ion channel evolution. His research interests are bioinformatics, cell physiology, computational biology, ion transport, protein dynamics, and protein structure.
  • Supported by:

    This work is supported by NSF of USA under Grant Nos. 0835718 and 0235792, NIH under Grant Nos. 5PN2EY016570-06 and 5R01NS063405-02, the Beckman Institute for Advanced Science and Technology, the National Center for Supercomputing Applications, and the Renaissance Computing Institute.

Protein evolution proceeds by two distinct processes: 1) individual mutation and selection for adaptive mutations, and 2) rearrangement of entire domains within proteins into novel combinations, producing new protein families that combine functional properties in ways that did not previously exist. Domain rearrangement poses a challenge to sequence alignment-based search methods, such as BLAST, in predicting homology, since these methods implicitly assume that related proteins differ from each other primarily by individual mutations. Moreover, there is ample evidence that the evolutionary process has used (and continues to use) domains as building blocks; it therefore seems fitting to use computational, domain-based methods to reconstruct that process. A challenge and opportunity for computational biology is how to use knowledge of evolutionary domain recombination to characterize families of proteins whose evolutionary history includes such recombination, to discover novel proteins, and to infer protein-protein interactions. In this paper we review techniques and databases that exploit our growing knowledge of "horizontal" protein evolution, and suggest possible areas of future development. We illustrate the power of domain-based methods and the possible directions of future development with a case history in progress, aimed at facilitating a particular approach to understanding microbial pathogenicity.
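
The contrast drawn in the abstract, between methods that assume collinear, mutation-level divergence and methods that compare domain content, can be illustrated with a toy sketch. This is not the authors' method or any tool reviewed in the paper: the function names, the Jaccard measure, and the placeholder domain labels "A", "B", "C" are all assumptions chosen for illustration. Each protein is represented only as an ordered list of domain labels.

```python
from typing import Sequence


def domain_jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity of two domain sets, ignoring order and copy number."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty architectures are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def positional_identity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions carrying the same domain when read left to right.

    A crude stand-in for a collinear comparison: it rewards matches only
    when the same domain sits at the same position in both proteins.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


# Hypothetical architectures: the same three domains in a rearranged order.
protein_a = ["A", "B", "C"]
protein_b = ["C", "A", "B"]

order_free = domain_jaccard(protein_a, protein_b)
order_bound = positional_identity(protein_a, protein_b)
print(f"domain-set similarity: {order_free:.2f}")
print(f"positional identity:   {order_bound:.2f}")
```

In this toy case the two proteins share every domain, so the order-free measure sees them as fully related, while the position-bound measure sees no match at all. Real domain-based methods work from profile or hidden Markov model hits rather than exact labels, so this sketch only conveys the shape of the argument.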



ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences