Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (2): 248-260. doi: 10.1007/s11390-021-0856-4
Special topic: Emerging Areas
Jian Liu1,*, Member, CCF, Jia-Liang Sun1, and Yong-Zhuang Liu2
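Background (Context)
Fungal diseases have drawn wide attention from researchers in China and abroad. Pathogenic fungi can cause mild effects such as indigestion and allergies and, in severe cases, hallucinations, organ failure, or even death. Effectively identifying the pathogens that cause human disease is therefore of great significance and value. As sequencing technology continues to advance, fast and accurate identification and annotation of fungi and other microorganisms from sequencing data has become an active research topic.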
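Objective
As sequencing costs continue to fall, large volumes of genome sequencing data have become available. Despite this growth, easy-to-use, efficient, and accurate workflows for analyzing sequencing data are still lacking, in particular workflows for efficient identification and annotation of large-scale fungal genome sequencing data. Moreover, second-generation platforms such as Illumina produce reads that are highly accurate but short, whereas third-generation platforms such as PacBio produce reads that are long but less accurate. When identifying and annotating new strains, researchers often combine the strengths of second- and third-generation data in a joint analysis. Building an efficient workflow that supports both short-read and long-read data is therefore important for improving the quality and efficiency of microbial sequencing data analysis.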
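Method
Targeting second- and third-generation sequencing data, this paper first studies analysis methods for short-read and long-read fungal genome data and, on that basis, builds PFGI, an automated bioinformatics workflow for fast identification and annotation of fungal genome sequencing data. Specifically, PFGI first selects a short-read or long-read analysis mode. After preprocessing such as quality control, it identifies the fungal genome sequencing data in three steps: sequence assembly, sequence alignment, and identification of similar reference genomes. In addition, PFGI provides CDS annotation and supports Prokka annotation, MLST annotation, and other functions.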
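The stage ordering described above can be sketched as a small planner. This is only an illustration of the control flow stated in the abstract (mode selection, quality control, three identification steps, optional annotation). The names `plan_workflow`, `WorkflowPlan`, and the stage labels are hypothetical and are not taken from PFGI itself, and the concrete tools behind each stage are deliberately left unnamed because the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List

# Stage labels mirror the steps listed in the abstract. They are illustrative
# names only (an assumption), not identifiers from the PFGI code base.
IDENTIFICATION_STAGES = [
    "sequence_assembly",
    "sequence_alignment",
    "similar_reference_genome_identification",
]
ANNOTATION_STAGES = ["cds_annotation", "prokka_annotation", "mlst_annotation"]


@dataclass(frozen=True)
class WorkflowPlan:
    """Ordered list of stages for one run (hypothetical container)."""
    mode: str
    stages: List[str]


def plan_workflow(mode: str, annotate: bool = True) -> WorkflowPlan:
    """Return the stage order for a short-read or long-read run."""
    if mode not in ("short", "long"):
        raise ValueError(f"mode must be 'short' or 'long', got {mode!r}")
    # Preprocessing (quality control) depends on the selected read type.
    stages = [f"{mode}_read_quality_control"] + IDENTIFICATION_STAGES
    if annotate:
        # Annotation runs after identification; extending the new list
        # leaves the module-level constants untouched.
        stages += ANNOTATION_STAGES
    return WorkflowPlan(mode=mode, stages=stages)


if __name__ == "__main__":
    for step in plan_workflow("long").stages:
        print(step)
```

Keeping the plan as plain data separates the choice of analysis mode from the execution of each stage, which is one common way to let a single workflow serve both read types.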
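Results and Findings
To evaluate the performance of the PFGI workflow, short-read and long-read genome sequencing data for Aspergillus fumigatus, Candida albicans, Saccharomyces cerevisiae, Verticillium dahliae, and other species were selected from the EMBL Nucleotide Sequence Database for testing. The experimental evaluation shows that PFGI is efficient and accurate: it completes identification and annotation of short-read and long-read fungal genome sequencing data quickly and effectively and provides accurate results.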
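Conclusions
This paper presents PFGI, an efficient identification and annotation workflow that supports second- and third-generation sequencing data and targets large-scale fungal genome data. PFGI also provides CDS annotation, MLST annotation, and other analysis functions. It offers biologists, clinicians, and other researchers an easy-to-use, fast, and accurate bioinformatics tool, and can be widely applied to the identification and engineering of industrial microbial strains and to clinical diagnosis and treatment.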