Journal of Computer Science and Technology, 2010, Vol. 25, Issue (1): 107-123.

• Special Issue on Computational Challenges from Modern Molecular Biology

Challenges in Computational Analysis of Mass Spectrometry Data for Proteomics

Bin Ma (马斌)   

  1. Cheriton School of Computer Science, University of Waterloo, Canada
    Dingsheng Technologies, Beijing 100085, China
  • Received: 2009-09-09; Revised: 2009-11-21; Online: 2010-01-05; Published: 2010-01-05
  • About author:
    Bin Ma is an associate professor and university research chair in the David R. Cheriton School of Computer Science at the University of Waterloo. He received his Ph.D. degree from Beijing University in 1999. From 2000 to 2008 he worked at the University of Western Ontario as an assistant professor, associate professor, and Canada research chair. He received the Ontario PREA Award in 2003 and the Ontario Premier's Catalyst Award for Best Young Innovator in 2009.
  • Supported by:

    This work is supported by the National High-Tech Research and Development (863) Program of China under Grant No. 2008AA02Z313, by NSERC RGPIN under Grant No. 238748-2006, and by a start-up grant from the University of Waterloo.

Mass spectrometry is an analytical technique for determining the composition of a sample. It has recently become a primary tool in proteomics research for protein identification, protein quantification, and the characterization of post-translational modifications. Both the size and the complexity of the data produced by this experimental technique pose great computational challenges for data analysis. This article reviews some of these challenges and serves as an entry point for readers who want to study the area in general.

ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn