Journal of Computer Science and Technology  2010, 25(1) 3-9 DOI:     ISSN: 1000-9000 CN: CN 11-2296/TP

本期目录 | 过刊浏览 | 高级检索                                                            [打印本页]   [关闭]
扩展功能
本文信息
Supporting info
PDF(212KB)
服务与反馈
把本文推荐给朋友
加入我的书架
加入引用管理器
引用本文
Email Alert
文章反馈
浏览反馈信息
本文关键词相关文章
genome sequencing
new generation sequencing
read mapping
optical mapping
sequence assembly
Eulerian graphs
本文作者相关文章
David C. Schwartz
Michael S. Waterman
PubMed
Article by David C. Schwartz
Article by Michael S. Waterman
中文题目: 测序机器和其面临的计算挑战
中文导读

新一代测序系统提高了一种新的分子生物学研究手段。 新一代测序的应用将给医疗卫生带来深刻的变化,使提供个性化医疗成为可能。不为人注意的是由此带来的对计算,包括CPU和储存,以及新的计算方法的爆炸性需求。在本文中,我们将简要回顾在计算方法领域所取得的一些成果以及新的需求。
当前技术方法
虽然商业新一代平台在获得序列的方法各不相同,但是它们有共同的优势是不需要细菌的克隆库。在许多方面,消除克隆库的构建和处理是基因组测序成本大幅下降,并平台的吞吐量大大提高的重要原因。下一代平台通过微乳液或者桥式PCR扩增技术直接从个人的DNA基因组分子构建“克隆库”。目前,有可供选择的四种商用系统,每个提供强大的技术支持但是需要考虑整体的费用和应用:1) Illumina Genome Analyzer,2) Life Technologies SOLiD System,3) Roche 454 GS FLX,and 4) Helicos’ Heliscope Sequencer。各种平台基本都采用合成测序全(SBS)的方法,即使用标记的核苷酸实时检测聚合酶的合成过程。除此之外,三代测序平台也在开发当中,如Pacific BioSciences (PacBio)和Life Technologies。其特点是单分子模板、实时无周期检测、消除了模板扩增步骤,检测合成过程中每个增加核苷酸的瞬态。而四代测序,即纳米孔测序(Nanopore Sequencing)是DNA测序最简单的方法,但是却是最难实现的。一个DNA单链通过一个小孔(~1-2nm),测量每一个碱基,获得它通过小孔之后它阻挡了多少电流。嘌呤(A,G)比嘧啶(C,T)体积大,阻挡更多的电流。纳米孔测序的优势是DNA样本可以立即分析而不需要标记或其他的准备工作。
计算新一代测序拼接
测序得到小片段,需要经过拼接才能恢复原始DNA序列。传统的sanger方法得到的序列片断一般采用Roger Staden和Gingeras等人提出的贪心方法,把overlap最长的两个片段连起来形成更长的片段。这个贪心算法的简单表述是:给定一个小片段的集合,包含每一个小片段的最短长度的超串是什么?理论证明显示贪心算法所找出的解最差不超过最优值的4倍,实际中可能达到2倍。 但是贪心是一个很弱的算法,在人类基因组中,有很多重复的基因片段,用贪心算法得到的结果不能信服。
这种序列拼接算法包括三个步骤:overlap-layout-consensus。overlap需要考虑任意两个小片段在在当前方向上或另外的方向上是否有重复。layout需要确定小片段相互一致的序列重复,和大致确定它们在基因组上的位置顺序。consensus,从方向一致的小片段中给出一个大致的多序列联配,并最终得到一条一致序列(consensus sequence)。
与sanger测序不同,二代测序技术得到的片断要短得多,也要多很多,用overlap-layout-consensus 的方法将很难拼接。一个替代的方法就是Idury与Waterman(1995)提出的欧拉图方法。其基本思想是把一个序列编码成一个图。给定一个k值,得到n-k+1个k长的片段。序列图的每个顶点是不同的k-1长片段,每条边是相应的k片段。如果把k长片段看作图中的每一个点,那么片断拼接就变成一个NP-hard的Hamiltonian路径的问题。但是欧拉图方法的存储开销很大,没有基本的计算方法的突破(包括存储技术)很难在测序方法上有几个数量级的改进。
前景
第二代和第三代测序系统需要一定数量和质量的测序数据。期望在2年之后,使用第二代和第三代测序系统产生的数据,使用新的算法回贴形成参考基因组。
当经济上可行的时候,将可以分析切片组织细微的基因变异。这些很少发生的基因变异来自于基因固有的不稳定性,并可能发展成为癌症。目前这样的计算量超过了我们的能力。如果能将多个细胞的DNA合并测序,就可以覆盖多个细胞基因组也可能包括变异的基因。由此产生更大的计算量,更有意义的是研究发现基因变异的计算方法。

New Generations: Sequencing Machines and Their Computational Challenges

David C. Schwartz1 and Michael S. Waterman2,3

1Laboratory for Molecular and Computational Genomics, Department of Chemistry and Laboratory of Genetics, University of Wisconsin-Madison, WI 53706, U.S.A.
2Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-2910, U.S.A.
3Department of Automation, Tsinghua University, Beijing 100084, China

Abstract:

New generation sequencing systems are changing how molecular biology is practiced. The widely promoted $1000 genome will be a reality with attendant changes for healthcare, including personalized medicine. More broadly the genomes of many new organisms with large samplings from populations will be commonplace. What is less appreciated is the explosive demands on computation, both for CPU cycles and storage as well as the need for new computational methods. In this article we will survey some of these developments and demands.

Keywords: genome sequencing    new generation sequencing    read mapping    optical mapping    sequence assembly    Eulerian graphs  
收稿日期 2009-09-05 修回日期 2009-11-24 出版日期  
DOI:
基金项目:

This work is supported by NIH under Grant No. R01 HG000225 (DCS) and NSF of USA under Grant No. DBI-0501818 (DCS).

作者简介:
David C. Schwartz has been a professor of chemistry, genetics, and biotechnology at the University of Wisconsin-Madison since 1999. Previous faculty appointments were at New York University and The Carnegie Institution of Washington-Baltimore. He works at the interface between nanotechnology and genomic science through his invention of single molecule platforms for genome analysis. As a graduate student, he invented Pulsed Field Gel Electrophoresis and continues to train students in the science of how to conceive and develop new tools for biological investigation as the director of a training program.
Michael S. Waterman is university professor at the University of Southern California where he has been a faculty member since 1982. Prior to that he held positions at Los Alamos National Laboratory and Idaho State University. He also currently holds a chair professor team at Tsinghua University. Professor Waterman works in the area of computational biology, particularly on the analysis DNA, RNA and protein sequence data. He is the co-developer of the Smith-Waterman algorithm for sequence comparison and of the Lander-Waterman formula for physical mapping. He is a member of the U.S. National Academy of Sciences and of the French Académie des Sciences.

参考文献:

[1] Harris T D, Buzby P R, Babcock H, Beer E, Bowers J, Braslavsky I, Causey M, Colonell J, Dimeo J, Efcavitch J W, Giladi E, Gill J, Healy J, Jarosz M, Lapen D, Moulton K, Quake S R, Steinmann K, Thayer E, Tyurina A, Ward R, Weiss H, Xie Z. Single-molecule DNA sequencing of a viral genome. Science, 2008, 320(5872): 106-109.
[2] Fuller C W, Middendorf L R, Benner S A, Church G M, Harris T, Huang X, Jovanovich S B, Nelson J R, Schloss J A, Schwartz D C, Vezenov D V. The challenges of sequencing by synthesis. Nat. Biotechnol., 2009, 27(11): 1013-1023.
[3] Pemov A, Modi H, Chandler D P, Bavykin S. DNA analysis with multiplex microarray-enhanced PCR. Nucleic Acids Res., 2005, 33(2): e11.
[4] Ronaghi M, Uhlen M, Nyren P. A sequencing method based on real-time pyrophosphate. Science, 1998, 281(5375): 363- 365.
[5] Ronaghi M, Karamohamed S, Pettersson B, Uhlen M, Nyren P. Real-time DNA sequencing using detection of pyrophosphate release. Anal. Biochem., 1996, 242(1): 84-89.
[6] Bentley D R. Whole-genome re-sequencing. Curr. Opin. Genet. Dev., 2006, 16(6): 545-552.
[7] Eid J, Fehr A, Gray J et al. Real-time DNA sequencing from single polymerase molecules. Science, 2009, 323(5910): 133- 138.
[8] Kasianowicz J J, Brandin E, Branton D, Deamer D W. Characterization of individual polynucleotide molecules using a membrane channel. Proc. Natl. Acad. Sci. USA, 1996, 93(24): 13770-13773.
[9] Astier Y, Braha O, Bayley H. Toward single molecule DNA sequencing: Direct identification of ribonucleoside and deoxyribonucleoside 5_-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J. Am. Chem. Soc., 2006, 128(5): 1705-1710.
[10] Branton D, Deamer D W, Marziali A et al. The potential and challenges of nanopore sequencing. Nat. Biotechnol., 2008, 26(10): 1146-1153.
[11] Sigalov G, Comer J, Timp G, Aksimentiev A. Detection of DNA sequences using an alternating electric field in a nanopore capacitor. Nano. Lett., 2008, 8(1): 56-63.
[12] Jett J H, Keller R A, Martin J C, Marrone B L, Moyzis R K, Ratliff R L, Seitzinger N K, Shera E B, Stewart C C. Highspeed DNA sequencing: An approach based upon fluorescence detection of single molecules. J. Biomol. Struct. Dyn., 1989, 7(2): 301-309.
[13] Clarke J, Wu H C, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol., 2009, 4(4): 265-270.
[14] Zhang M Q, Smith A D. Challenges in understanding genomewide DNA methylation. J. Comput. Sci. & Technol., 2010, 25(1): 26-34.
[15] Morozova O, Marra M. Applications of next-generation sequencing technologies in functional genomics. Genomics, 2008, 92(5): 255-264.
[16] Chen Y, Souaiaia T, Chen T. PerM: Efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics, 2009, 25 (19): 2514-2521.
[17] Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res., June 11, 1979, 6(7): 2601-2610.
[18] Gingeras T R, Milazzo J P, Sciaky D, Roberts R J. Computer programs for the assembly of DNA sequences. Nucleic Acids Res., September 25, 1979, 7(9): 529-545.
[19] Gallant J, Maier D, Storer J. On finding minimal length superstrings. J. Computer System Sci., 1980, 20(1): 50-58.
[20] Waterman M S. Introduction to Computational Biology. Chapman & Hall, 1995.
[21] Kececioglu J D, Myers E W. Combinatiorial algorithms for DNA sequence assembly. Algorithmica, 1995, 13(1/2): 7-51.
[22] Kececioglu J D. Exact and approximation algorithms for DNA sequence reconstruction
[Ph.D. Dissertation]. University of Arizona, Tucson, USA, 1992.
[23] Myers E W. Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 1995, 2(2): 275-290.
[24] Myers E S. The fragment assembly string graph. Bioinformatics, 2005, 21(Suppl. 2): ii79-ii85.
[25] Venter J C, Adams M D, Myers E W et al. The sequence of the human genome. Science, 2001, 291: 1304-1351.
[26] Lander E S, Linton L M, Birren B et al. Initial sequencing and analysis of the human genome. Nature, 2001, 409(6822): 860-921.
[27] Istrail S, Sutton G, Florea L et al. Whole genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. USA, 2003, 101(7): 1916-1921.
[28] Idury R, Waterman M S. A new algorithm for DNA sequence. J. Comput. Biol., 1995, 2(2): 291-306.
[29] Chaisson M J, Pevzner P A. Short read fragment assembly of bacterial genomes. Genome Res., 2008, 18(2): 324-330.
[30] Chaisson M J, Tang H, Pevzner P A. Fragment assembly with short reads. Bioinformatics, 2004, 20(13): 2067-2074.
[31] Myers E W. The fragment assembly string graph. Bioinformatics, 2005, 21(Suppl. 2): ii79-ii85, doi:10.1093/ bioinformatics/ bti7114.
[32] Pevzner P A, Tang H. Fragment assembly with doublebarreled data. Bioinformatics, 2001, 17(Suppl. 1): S225- S233.
[33] Pevzner P A, Tang H, Waterman M S. A Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 2001, 98(17): 9748-9753.
[34] Zerbino D R, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 2008, 18(5): 821-829.
[35] Church D M, Goodstadt L, Hillier L W et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol., 2009, 7(5): e1000112.
[36] Valouev A, Schwartz D C, Zhou S, Waterman M S. An algorithm for assembly of ordered restriction maps from single DNA molecules. Proc. Natl. Acad. Sci. USA, 2006, 103(43): 15770-15775.
[37] Nagarajan N, Read T D, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics, 2008, 24(10): 1229-1235.
[38] Schwartz D C, Li X, Hernandez L, Ramnarain S P, Huff E J, Wang Y K. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science, 1993, 262(5130): 110-114.
[39] Zhou S, Herschleb J, Schwartz D C. A Single Molecule System forWhole Genome Analysis. New Methods for DNA Sequencing, Mitchelson K R (ed)., Amsterdam: Elsevier, 2007.
[40] Dimalanta E T, Lim A, Runnheim R et al. A microfluidic system for large DNA molecule arrays. Anal. Chem., 2004, 76(18): 5293-5301.
[41] Valouev A, Zhang Y, Schwartz D C, Waterman M S. Refinement of optical map assemblies (original paper). Bioinformatics, 2006, 22(10): 1217-1224.
[42] Valouev A, Li L, Liu Y C, Schwartz D C, Yang Y, Zhang Y, Waterman M S. Alignment of optical maps. J. Comput. Biol., 2006, 13(2): 442-462.
[43] Jo K, Dhingra D M, Odijk T et al. A single-molecule barcoding system using nanoslits for DNA analysis. Proc. Natl. Acad. Sci. USA, 2007, 104(8): 2673-2678.
[44] Ramanathan A, Pape L, Schwartz D C. High-density polymerase-mediated incorporation of fluorochrome-labeled nucleotides. Analytical Biochemistry, 2005, 337(1): 1-11.
[45] Ramanathan A, Huff E J, Lamers C C, Potamousis K D. Forrest D K, Schwartz D C. An integrative approach for the optical sequencing of single DNA molecules. Analytical Biochemistry, 2004, 330(2): 227-241.
[46] Zhou S, Pape L, Schwartz D C. Optical Sequencing: Acquisition from Mapped Single Molecule Templates. Next Generation Genome Sequencing: Towards Personalized Medicine, Janitz M (ed.), 2008, Weinheim: Wiley-VCH Verlag & Co., pp.133-149.
[47] Aguilera A, Gomez-Gonzalez B. Genome instability: A mechanistic view of its causes and consequences. Nat. Rev. Genet., 2008, 9(3): 204-217.

文章评论

Copyright 2008 by Journal of Computer Science and Technology