计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (1): 56-70.doi: 10.1007/s11390-020-0825-3

所属专题: Computer Architecture and Systems

• • 上一篇    下一篇

计算存储一体化基因比对FM-index算法加速体系结构设计

Xue-Qi Li, Student Member, CCF, ACM, IEEE, Guang-Ming Tan, Member, CCF, ACM, IEEE, and Ning-Hui Sun, Fellow, CCF, Member, ACM, IEEE   

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China University of Chinese Academy of Sciences, Beijing 100049, China
  • 收稿日期:2020-07-15 修回日期:2021-01-06 出版日期:2021-01-05 发布日期:2021-01-23
  • 作者简介:Xue-Qi Li is currently working toward his Ph.D. degree in computer architecture at Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also a visiting Ph.D. student in Prof. Yuan Xie's lab at UCSB. His current research interests include high-performance computing, computer architecture, and bioinformatics.
  • 基金资助:
    This work is supported by the National Key Research and Development Program of China under Grant Nos. 2018YFB0204400, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300, and XDC01030000, the National Natural Science Foundation of China under Grant Nos. 6197237, and 61702483, and the CAS QYZDJ-SSW-JSC035 Funding.

PIM-Align: A Processing-in-Memory Architecture for FM-Index Search Algorithm

Xue-Qi Li, Student Member, CCF, ACM, IEEE, Guang-Ming Tan, Member, CCF, ACM, IEEE, and Ning-Hui Sun, Fellow, CCF, Member, ACM, IEEE        

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2020-07-15 Revised:2021-01-06 Online:2021-01-05 Published:2021-01-23
  • About author:Xue-Qi Li is currently working toward his Ph.D. degree in computer architecture at Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also a visiting Ph.D. student in Prof. Yuan Xie's lab at UCSB. His current research interests include high-performance computing, computer architecture, and bioinformatics.
  • Supported by:
    This work is supported by the National Key Research and Development Program of China under Grant Nos. 2018YFB0204400, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300, and XDC01030000, the National Natural Science Foundation of China under Grant Nos. 6197237, and 61702483, and the CAS QYZDJ-SSW-JSC035 Funding.

研究背景:在生物信息学中,FM-Index算法是基因组数据分析中的一个重要算法。测序技术产生的海量基因大数据给基于FM-index的基因比对程序带来了巨大的挑战。现阶段的研究工作通过利用SIMD,FPGA和ASIC等技术加速了FM-index算法,特别是在定制加速器领域取得了很好的加速效果。但是,大量的随机内存访问造成了传统冯·诺依曼体系结构中处理单元与内存之间的巨大数据搬移,现有的内存带宽也限制了对于算法并行性的挖掘
。目的:本文中,我们认为计算存储一体化(或称近内存计算)是解决这些挑战的可行解决方案。
方法:本文量化分析了FM-index算法的计算和访存特征,基于FM-index算法特征和3D堆叠存储器的特性,设计并实现了一个基因数据比对算法的加速体系结构。为了充分利用3D堆叠技术提供的更高且可扩展的内存带宽,本文提出了(1)一种充分利用可用内存带宽的新加速器结构;(2)轻量级消息传递机制和非阻塞通信机制;(3)计算-访存解耦与数据预取机制。
结果:实验表明,与最佳可用的ASIC解决方案相比,近内存计算加速器远未触及3D堆叠存储器逻辑层的能耗、面积等开销限制的红线,并且在原有加速器基础上将性能提升了20多倍。

关键词: 加速体系结构设计, 基因比对算法, 计算存储一体化

Abstract: Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase for sequence alignment has been well explored in computing-centric architectures on field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), and graphics processing unit (GPU) (e.g., the Smith-Waterman algorithm). Compared with the extension phase, the seeding phase is more critical and essential. However, the seeding phase is bounded by memory, i.e., fine-grained random memory access and limited parallelism on conventional system. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to address these problems. This paper describes \PIM-Align"|an application-driven near-data processing architecture for sequence alignment. In order to achieve memory-capacity proportional performance by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions, and a specialized hardware prefetcher for memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture can achieve 20x and 1 820x speedup when compared with the best available ASIC implementation and the software running on 32-thread CPU, respectively.

Key words: accelerator design, genomic sequence alignment, near-memory computing

[1] Shendure J, Ji H. Next-generation DNA sequencing. Nature Biotechnology, 2008, 26(10):1135-1145. DOI:10.1038/nbt1486.
[2] Erdmann J. Next generation technology edges genome sequencing toward the clinic. Chemistry & Biology, 2011, 18(12):1513-1514. DOI:10.1016/j.chembiol.2011.12.006.
[3] Stephens Z D, Lee S Y, Faghri F, Campbell R H, Zhai C, Efron M J, Iyer R, Schatz M C, Sinha S, Robinson G E. Big data:Astronomical or genomical? PLoS Biology, 2015, 13(7):Article No. e1002195. DOI:10.1371/journal.pbio.1002195.
[4] Turakhia Y, Bejerano G, Dally W J. Darwin:A genomics co-processor provides up to 15,000X acceleration on long read assembly. In Proc. the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2018, pp.199-213. DOI:10.1145/3173162.3173193.
[5] Zhang J, Lin H, Balaji P, Feng W C. Optimizing burrowswheeler transform-based sequence alignment on multicore architectures. In Proc. the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, May 2013, pp.377-384. DOI:10.1109/CCGrid.2013.67.
[6] Lu M, Tan Y, Bai G, Luo Q. High-performance short sequence alignment with GPU acceleration. Distributed and Parallel Databases, 2012, 30(5/6):385-399. DOI:10.1007/s10619-012-7099-x.
[7] Chang M C F, Chen Y T, Cong J, Huang P T, Kuo C L, Yu C H. The SMEM seeding acceleration for DNA sequence alignment. In Proc. the 24th International Symposium on Field-Programmable Custom Computing Machines, May 2016, pp.32-39. DOI:10.1109/FCCM.2016.21.
[8] Wang Y, Li X, Zang D, Tan G, Sun N. Accelerating FMindex search for genomic data processing. In Proc. the 47th International Conference on Parallel Processing, Aug. 2018, Article No. 65. DOI:10.1145/3225058.3225134.
[9] Kocberber O, Grot B, Picorel J, Falsafi B, Lim K, Ranganathan P. Meet the walkers accelerating index traversals for in-memory databases. In Proc. the 46th IEEE/ACM International Symposium on Microarchitecture, Dec. 2013, pp.468-479. DOI:10.1145/2540708.2540748.
[10] Weis C, Wehn N, Igor L, Benini L. Design space exploration for 3D-stacked DRAMs. In Proc. the Design, Automation & Test in Europe, Mar. 2011, pp.389-394. DOI:10.1109/DATE.2011.5763068.
[11] Langmead B, Trapnell C, Pop M, Salzberg S L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009, 10(3):Article No. R25. DOI:10.1186/gb-2009-10-3-r25.
[12] Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997, 2013, Mar. 2013. https://arxiv.org/abs/1303.3997, Nov. 2020.
[13] Langmead B, Salzberg S L. Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012, 9(4):357-359. DOI:10.1038/nmeth.1923.
[14] Luo R, Wong T, Zhu J et al. SOAP3-dp:Fast, accurate and sensitive GPU-based short read aligner. PloS One, 2013, 8(5):Article No. e65632. DOI:10.1371/journal.pone.0065632.
[15] Ahmed N, Bertels K, Al-Ars Z. A comparison of seed-andextend techniques in modern DNA read alignment algorithms. In Proc. the 2016 IEEE International Conference on Bioinformatics and Biomedicine, Dec. 2016, pp.1421- 1428. DOI:10.1109/BIBM.2016.7822731.
[16] Hu X, Stow D, Xie Y. Die stacking is happening. IEEE Micro, 2018, 38(1):22-28. DOI:10.1109/MM.2018.011441561.
[17] Shevgoor M, Kim J S, Chatterjee N, Balasubramonian R, Davis A, Udipi A N. Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device. In Proc. the 46th International Symposium on Microarchitecture, Feb. 2013, pp.198-209. DOI:10.1145/2540708.2540726.
[18] Zhu Y, Wang B, Li D, Zhao J. Integrated thermal analysis for processing in die-stacking memory. In Proc. the 2nd International Symposium on Memory Systems, Oct. 2016, pp.402-414. DOI:10.1145/2989081.2989093.
[19] Gao M, Ayers G, Kozyrakis C. Practical near-data processing for in-memory analytics frame-works. In Proc. the 2015 International Conference on Parallel Architecture and Compilation, Mar. 2015, pp.113-124. DOI:10.1109/PACT.2015.22.
[20] Kim Y, Yang W, Mutlu O. Ramulator:A fast and extensible dram simulator. IEEE Computer Architecture Letters, 2015, 15(1):45-49. DOI:10.1109/LCA.2015.2414456.
[21] Chen K, Li S, Muralimanohar N, Ahn J H, Brockman J B, Jouppi N P. CACTI-3DD:Architecture-level modeling for 3D die-stacked dram main memory. In Proc. the Conference on Design, Automation and Test in Europe, Mar. 2012, pp.33-38. DOI:10.1109/DATE.2012.6176428.
[22] Pugsley S H, Jestes J, Zhang H, Balasubramonian R, Srinivasan V, Buyuktosunoglu A, Davis A, Li F. NDC:Analyzing the impact of 3D-stacked memory+ logic devices on MapReduce workloads. In Proc. the IEEE International Symposium on Performance Analysis of Systems and Software, Mar. 2014, pp.190-200. DOI:10.1109/ISPASS.2014.6844483.
[23] Pran K, Taher A. Logic Synthesis Using Synopsysr. Springer Science & Business Media, 2012.
[24] Canzar S, Salzberg S L. Short read mapping:An algorithmic tour. Proc. the IEEE, 2017, 105(3):436-458. DOI:10.1109/JPROC.2015.2455551.
[25] Xin H, Lee D, Hormozdiari F, Yedkar S, Mutlu O, Alkan C. Accelerating read mapping with FastHASH. BMC Genomics, 2013, 14(Suppl 1):Article No. S13. DOI:10.1186/1471-2164-14-S1-S13.
[26] Alkan C, Kidd J M, Marques-Bonet T et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 2009, 41(10):1061-1067. DOI:10.1038/ng.437.
[27] Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler E E, Sahinalp S C. mrsFAST:A cache-oblivious algorithm for short-read mapping. Nature Methods, 2010, 7(8):576-577. DOI:10.1038/nmeth0810-576.
[28] David M, Dzamba M, Lister D, Ilie L, Brudno M. SHRiMP2:Sensitive yet practical short read mapping. Bioinformatics, 2011, 27(7):1011-1012. DOI:10.1093/bioinformatics/btr046.
[29] Li H, Durbin R. Fast and accurate short read alignment with burrows wheeler transform. Bioinformatics, 2009, 25(14):1754-1760. DOI:10.1093/bioinformatics/btp324.
[30] Fernandez E, Najjar W, Lonardi S. String matching in hardware using the FM-index. In Proc. the 19th Annual International Symposium on Field-Programmable Custom Computing Machines, May 2011, pp.218-225. DOI:10.1109/FCCM.2011.55.
[31] Fernandez E B, Najjar W A, Lonardi S, Villarreal J. Multithreaded FPGA acceleration of DNA sequence mapping. In Proc. the 2012 IEEE Conference on High Performance Extreme Computing, Sept. 2012. DOI:10.1109/HPEC.2012.6408669.
[32] Fernandez E B, Villarreal J, Lonardi S, Najjar W A. FHAST:FPGA-based acceleration of Bowtie in hardware. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015, 12(5):973-981. DOI:10.1109/TCBB.2015.2405333.
[33] Liu Y, Schmidt B. Evaluation of GPU-based seed generation for computational genomics using burrows-wheeler transform. In Proc. the 26th IEEE International Symposium on Parallel and Distributed Processing Symposium Workshops & PhD Forum, Aug. 2012, pp.684-690. DOI:10.1109/IPDPSW.2012.85.
[34] Fujiki D, Subramaniyan A, Zhang T, Zeng Y, Das R, Blaauw D, Narayanasamy S. GenAx:A genome sequencing accelerator. In Proc. the 45th Annual International Symposium on Computer Architecture, July 2018, pp.69-82. DOI:10.1109/ISCA.2018.00017.
[35] Balasubramonian R, Chang J, Manning T, Moreno J H, Murphy R, Nair R, Swanson S. Near-data processing:Insights from a micro-46 workshop. IEEE Micro, 2014, 34(4):36-42. DOI:10.1109/MM.2014.55.
[36] Seshadri V, Kim Y, Fallin C et al. RowClone:Fast and energy-efficient in-dram bulk data copy and initialization. In Proc. the 46th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2013, pp.185-197. DOI:10.1145/2540708.2540725.
[37] Zhu Q, Akin B, Sumbul H E, Sadi F, Hoe J C, Pileggi L, Franchetti F. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In Proc. the 2013 IEEE International 3D Systems Integration Conference, Oct. 2013. DOI:10.1109/3DIC.2013.6702348.
[38] Zhu Q, Graf T, Sumbul H E, Pileggi L, Franchetti F. Accelerating sparse matrix-matrix multiplication with 3Dstacked logic-in-memory hardware. In Proc. the 2013 IEEE High Performance Extreme Computing Conference, Sept. 2013. DOI:10.1109/HPEC.2013.6670336.
[39] Vijayaraghavan T, Rajesh A, Sankaralingam K. MPUBWM:Accelerating sequence alignment. IEEE Computer Architecture Letters, 2018, 17(2):179-182. DOI:10.1109/LCA.2018.2849064.
[40] Asghari-Moghaddam H, Son Y H, Ahn J H, Kim N S. Chameleon:Versatile and practical near-DRAM acceleration architecture for large memory systems. In Proc. the 49th Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2016. DOI:10.1109/MICRO.2016.7783753.
[41] Kaplan R, Yavits L, Ginosar R, Weiser U. A resistive cam processing-in-storage architecture for DNA sequence alignment. IEEE Micro, 2017, 37(4):20-28. DOI:10.1109/MM.2017.3211121.
[42] Huangfu W, Li S, Hu X, Xie Y. RADAR:A 3D-ReRAM based DNA alignment accelerator architecture. In Proc. the 55th Design Automation Conference, Jun. 2018, Article No. 59. DOI:10.1145/3195970.3196098.
[43] Ahn J, Hong S, Yoo S, Mutlu O, Choi K. A scalable processing-in-memory accelerator for parallel graph processing. In Proc. the 42nd Annual International Symposium on Computer Architecture, June 2015, pp.105-117. DOI:10.1145/2749469.2750386.
[44] Nagasaka Y, Nukada A, Matsuoka S. Adaptive multi-level blocking optimization for sparse matrix vector multiplication on GPU. Procedia Computer Science, 2016, 80:131- 142. DOI:10.1016/j.procs.2016.05.304.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 周笛;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] 陈世华;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[3] 王建潮; 魏道政;. An Effective Test Generation Algorithm for Combinational Circuits[J]. , 1986, 1(4): 1 -16 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 郑国梁; 李辉;. The Design and Implementation of the Syntax-Directed Editor Generator(SEG)[J]. , 1986, 1(4): 39 -48 .
[7] 黄学东; 蔡莲红; 方棣棠; 迟边进; 周立; 蒋力;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[8] 许小曙;. Simplification of Multivalued Sequential SULM Network by Using Cascade Decomposition[J]. , 1986, 1(4): 84 -95 .
[9] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[10] 衷仁保; 邢林; 任朝阳;. An Interactive System SDI on Microcomputer[J]. , 1987, 2(1): 64 -71 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: