时间和空间高效的DNA序列联配算法

时间和空间高效的DNA序列联配算法

Improvement of Performance of MegaBlast Algorithm for DNA Sequence Alignment

摘要

摘要: 序列比对是生物信息学研究中一个最基本的问题之一。随着基因组测序的进行和序列规模增加，原始两两序列比对问题也扩展到有多个序列组成的集合之间的比对，现有的程序存在内存和时间消耗太大的瓶颈。其中最常用的NCBI BLAST 工具包中MegaBlast程序就是一个典型的例子。MegaBlast采用的是启发式算法，单纯从时间和空间复杂的角度进一步降低其复杂度的空间相当有限。从实验的方法出发，结合现代计算机系统的体系结构优化算法是另外一种可选的方法。在此背景下，本文首先分析了MegaBlast程序的流程和计算行为特征。由于需要排序得到最终的比对结果，MegaBlast采用先计算分值然后输出的串行过程，这样还导致大量中间结果的积累出现内存消耗过大的情况。现在的大多数的计算机系统采用的是I/O控制器,以减少CPU处理I/O的负担，即从硬件角度看，CPU单元的计算和I/O控制器的处理可以并行进行。基于这一体系结构特点，本文对MegaBlast程序的流程进行了优化，采用对库造表的策略，实现了计算和输出的重叠。而且不需要保留中间结果，大大减少了内存的消耗。机群系统是目前最流行的高性能并行计算机系统。尽管优化的MegaBlast在时间和存储上得到了改善，但当问题规模增大时，合适的并行实现是有必要。本文根据优化程序的计算特点，采用了对查询序列分割的策略。在4000A系统上的实现获得了较好的绝对加速比和可扩展性。

Abstract: MegaBlast is one of the most important programs in NCBIBLAST (Basic Local Alignment Search Tool) toolkits. However, MegaBlastis computation and I/O intensive. It consumes a great deal of memorywhich is proportional to the size of the query sequencesset and subject (database) sequences set of product. This paper proposes a newstrategy for optimizing MegaBlast. The new strategy exchanges the queryand subject sequences sets, and builds a hash table based on newsubject sequences. It overlaps I/O with computation, shortensthe overall time and reduces the cost of memory, since the memory hereis only proportional to the size of subject sequences set. The optimizedalgorithm is suitable to be parallelized in cluster systems. Theparallel algorithm uses query segmentation method. As our experimentsshown, the parallel program which is implemented with MPI has finescalability.

HTML全文

参考文献()

施引文献

资源附件()