
Test Data Sets and Evaluation of Gene Prediction Programs on the Rice Genome

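  • Summary: With several rice genome sequencing projects around the world nearing completion, finding and predicting genes by computer has become an urgent task. Evaluating the predictions of different programs requires good test data that do not overlap with the training set of any program. We aligned the 28,469 full-length cDNAs in the Japanese KOME database against the japonica BAC sequences of the International Rice Genome Project and, after strict selection, constructed two test sets: OsSNG550, made up of 550 single-gene sequences, and OsMTG62, made up of 62 multi-gene sequences containing 271 genes. We used these two test sets to evaluate five programs that predict genes in the rice genome: RiceHMM, GlimmerR, GeneMark, FGENESH and our own BGF. BGF is based on a hidden semi-Markov model and a dynamic programming algorithm; our main new contribution is the introduction of several new signal models. Predictions were evaluated at the nucleotide, exon and whole gene structure levels; in addition to the commonly used measures, we defined several new ones. The results show that the performance of the five programs has, on the whole, improved steadily with the year in which each was completed. BGF and FGENESH currently perform best. Of the 550 and 271 genes in the two test sets, BGF (FGENESH) predicted 237 (231) and 124 (114) entirely correctly, missed 5 (4) and 2 (2) entirely, and found 308 (315) and 145 (155) partially correctly. The results of these programs are also complementary to some extent, which suggests that combining the output of several programs could improve prediction further.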

     

    Abstract: With several rice genome projects approaching completion, gene prediction by computer algorithms has become an urgent task. Two test sets were constructed by mapping the newly published 28,469 full-length KOME rice cDNAs to the RGP BAC clone sequences of Oryza sativa ssp. japonica: a single-gene set of 550 sequences and a multi-gene set of 62 sequences with 271 genes. These data sets were used to evaluate five ab initio gene prediction programs: RiceHMM, GlimmerR, GeneMark, FGENESH and BGF. The predictions were compared at the nucleotide, exon and whole gene structure levels using commonly accepted measures and several new measures. The test results show that performance has improved in chronological order of the programs' release. At the same time, the complementarity of the programs hints at the possibility of further improvement and at the feasibility of reaching better performance by combining several gene-finders.
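To make the levels of comparison concrete, here is a minimal Python sketch of a nucleotide-level and a whole-gene-level comparison between an annotated gene and a predicted one. The abstract does not define its measures, so the formulas are an assumption: the sketch uses the definitions common in gene-prediction evaluation, where sensitivity is the fraction of annotated coding nucleotides that were predicted and specificity is the fraction of predicted coding nucleotides that are annotated. The exon coordinate format, the function names and the three labels "exact", "missed" and "partial" are hypothetical and are not taken from the paper or from any of the five programs.

```python
from typing import List, Set, Tuple

# Hypothetical exon format: inclusive 1-based (start, end) coordinates.
Exon = Tuple[int, int]


def coding_positions(exons: List[Exon]) -> Set[int]:
    """Return the set of nucleotide positions covered by the exons."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_sn_sp(annotated: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Nucleotide-level sensitivity and specificity (assumed definitions)."""
    true_pos = coding_positions(annotated)
    pred_pos = coding_positions(predicted)
    shared = len(true_pos & pred_pos)  # coding nucleotides found by both
    sensitivity = shared / len(true_pos) if true_pos else 0.0
    specificity = shared / len(pred_pos) if pred_pos else 0.0
    return sensitivity, specificity


def classify_gene(annotated: List[Exon], predicted: List[Exon]) -> str:
    """Label a prediction as exact, missed, or partial (assumed criteria)."""
    if sorted(annotated) == sorted(predicted):
        return "exact"  # every exon boundary matches
    if not (coding_positions(annotated) & coding_positions(predicted)):
        return "missed"  # no coding overlap at all
    return "partial"  # some overlap, but the structure differs


# Toy example: the second predicted exon starts 10 nt too late.
annotated = [(101, 200), (301, 400)]
predicted = [(101, 200), (311, 400)]
sn, sp = nucleotide_sn_sp(annotated, predicted)
print(f"Sn={sn:.2f} Sp={sp:.2f} class={classify_gene(annotated, predicted)}")
```

An exon-level measure could be built the same way by comparing the exon tuples themselves rather than individual positions.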

     
