Journal of Computer Science and Technology, 2021, Vol. 36, Issue (2): 248-260. doi: 10.1007/s11390-021-0856-4

Special Topic: Emerging Areas

Effective Identification and Annotation of Fungal Genomes

Jian Liu1,*, Member, CCF, Jia-Liang Sun1, and Yong-Zhuang Liu2        

  • 1 College of Computer Science, Nankai University, Tianjin 300350, China;
    2 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2020-08-01 Revised: 2021-02-23 Online: 2021-03-05 Published: 2021-04-01
  • Contact: Jian Liu, E-mail: jianliu@hit.edu.cn
  • About the author: Jian Liu received his M.S. and Ph.D. degrees in computer application technology from Northeastern University, Shenyang, in 2009 and 2014, respectively. He is a professor in the College of Computer Science, Nankai University, Tianjin. His current research interests include massive biological database management, multi-omics data analysis, and bioinformatics. He has published over 30 papers in international journals, conferences, and edited books in these areas since 2010.
  • Supported by:
    The work was supported by the National Key Research and Development Program of China under Grant Nos. 2018YFC1603800, 2018YFC1603802, 2020YFA0908700 and 2020YFA0908702, and the National Natural Science Foundation of China under Grant No. 61872115.

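Background (Context)
Diseases caused by fungi have drawn wide attention from researchers in China and abroad. Pathogenic fungi can cause mild effects such as indigestion and allergies, and in severe cases can lead to hallucinations, organ failure, and even death. Effectively identifying the pathogens that cause human disease is therefore of great significance. As sequencing technology continues to advance, the fast and accurate identification and annotation of fungi and other microorganisms from sequencing data has become a research hotspot.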
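Objective
As sequencing costs keep falling, large volumes of genome sequencing data have become available. Despite this growth, easy-to-use, efficient, and accurate workflows for sequencing data analysis are still lacking, in particular workflows for the efficient identification and annotation of large-scale fungal genome sequencing data. In addition, second-generation platforms such as Illumina produce reads that are accurate but short, whereas third-generation platforms such as PacBio produce reads that are long but less accurate. When identifying and annotating new strains, researchers often combine the strengths of second- and third-generation data in one analysis. Building an efficient workflow that supports both short-read and long-read analysis therefore plays an important role in improving the quality and efficiency of microbial sequencing data analysis.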
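Method
Targeting second- and third-generation sequencing data, this paper first studies analysis methods for short-read and long-read fungal genome data, and on that basis builds PFGI, an automated bioinformatics workflow for the rapid identification and annotation of fungal genome sequencing data. Specifically, PFGI first selects a short-read or long-read analysis mode. After preprocessing such as quality control, it identifies the fungal genome sequencing data in three steps: sequence assembly, sequence alignment, and identification of similar reference genomes. In addition, PFGI provides CDS annotation and supports Prokka annotation and MLST annotation.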
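The stage order described in the Method paragraph can be sketched as a small driver. This is only an illustrative outline, not PFGI's actual code: the stage names are taken from the description above, while the function names (`plan_stages`, `run_workflow`), the handler interface, and the relative order of the three annotation steps are assumptions made for this example.

```python
from typing import Callable, Dict, List

# Stage order follows the Method description; all identifiers are hypothetical.
IDENTIFICATION_STAGES = (
    "quality_control",
    "assembly",
    "alignment",
    "reference_identification",
)
# The relative order of the annotation steps is an assumption.
ANNOTATION_STAGES = ("cds_annotation", "prokka_annotation", "mlst_annotation")


def plan_stages(mode: str, annotate: bool = True) -> List[str]:
    """Return the ordered stage names for a short-read or long-read run."""
    if mode not in ("short", "long"):
        raise ValueError(f"mode must be 'short' or 'long', got {mode!r}")
    stages = list(IDENTIFICATION_STAGES)
    if annotate:
        stages.extend(ANNOTATION_STAGES)
    return stages


def run_workflow(
    mode: str,
    handlers: Dict[str, Callable[[str], None]],
    annotate: bool = True,
) -> List[str]:
    """Call one handler per stage, in order, and return the stages that ran."""
    completed = []
    for stage in plan_stages(mode, annotate):
        handlers[stage](mode)  # a real handler would wrap an external tool
        completed.append(stage)
    return completed


if __name__ == "__main__":
    # Placeholder handlers that do nothing; real ones are out of scope here.
    noop = {s: (lambda mode: None) for s in IDENTIFICATION_STAGES + ANNOTATION_STAGES}
    print(run_workflow("long", noop, annotate=False))
    # → ['quality_control', 'assembly', 'alignment', 'reference_identification']
```

Passing the read mode to every handler keeps the short-read and long-read paths on one code path, so each stage can pick the tool appropriate to the data type.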
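Results and Findings
To evaluate PFGI's analysis performance, short-read and long-read genome sequencing data for Aspergillus fumigatus, Candida albicans, Saccharomyces cerevisiae, and Verticillium dahliae were selected from the EMBL Nucleotide Sequence Database for testing. The experimental evaluation shows that PFGI is efficient and accurate: it completes the identification and annotation of short-read and long-read fungal genome sequencing data quickly and effectively, and provides accurate analysis results.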
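Conclusions
This paper builds PFGI, an efficient identification and annotation workflow for large-scale fungal genome data that supports both second- and third-generation sequencing data. PFGI also provides analysis functions such as CDS annotation and MLST annotation. It offers biologists, clinicians, and other researchers an easy-to-use, fast, and accurate bioinformatics analysis tool, and can be applied widely to the identification and engineering of industrial microbial strains as well as to clinical diagnosis and treatment.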

Keywords: fungal genome, fungal identification, analysis workflow

Abstract: Over the past few decades, the dangers of mycosis have caused widespread concern. As sequencing technology has developed, the effective analysis of fungal sequencing data has become a research hotspot. Although the volume of fungal sequencing data keeps growing, adequate approaches for the identification and functional annotation of fungal chromosomal genomes are still lacking. To address this challenge, this paper first examines approaches to identifying and annotating fungal genomes from short and long reads sequenced on multiple platforms such as Illumina and PacBio. It then develops an automated bioinformatics pipeline, PFGI, for the identification and annotation task. An experimental evaluation on real-world data from the European Nucleotide Archive (ENA) shows that PFGI offers a user-friendly way to perform fungal identification and annotation from sequencing data, and that its results are accurate to the species level (97% sequence identity).

Key words: fungal genome, fungal identification, bioinformatics pipeline
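The abstract quotes 97% sequence identity as the species-level criterion but does not say how identity is computed. As a rough illustration only, the sketch below assumes identity is the share of matching columns in a pairwise alignment, with gaps written as `-` and counted in the alignment length. The function names and the constant are hypothetical and not part of PFGI.

```python
# Species-level cut-off quoted in the abstract, expressed in percent.
SPECIES_IDENTITY_THRESHOLD = 97.0


def percent_identity(aligned_query: str, aligned_ref: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.

    Assumes both strings come from one pairwise alignment (equal length,
    gaps as '-'). Gap columns count toward the length but never as matches.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, r in zip(aligned_query.upper(), aligned_ref.upper())
        if q == r and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def is_species_level_match(
    aligned_query: str,
    aligned_ref: str,
    threshold: float = SPECIES_IDENTITY_THRESHOLD,
) -> bool:
    """True when the alignment reaches the assumed species-level threshold."""
    return percent_identity(aligned_query, aligned_ref) >= threshold


if __name__ == "__main__":
    query = "A" * 100
    reference = "A" * 97 + "C" * 3  # 3 mismatches in 100 columns
    print(percent_identity(query, reference))        # → 97.0
    print(is_species_level_match(query, reference))  # → True
```

Other definitions of identity exist, for example ones that exclude gap columns from the denominator, so the threshold should be read together with whichever definition the alignment tool reports.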
