2016, Vol. 31, Issue (1): 36-49. DOI: 10.1007/s11390-016-1610-1

Special Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •


Performance-Centric Optimization for Racetrack Memory Based Register File on GPUs

Yun Liang*(梁云), Member, CCF, ACM, IEEE, and Shuo Wang(王硕)   

  1. Center for Energy-Efficient Computing and Applications(CECA), School of Electrical Engineering and Computer Sciences, Peking University, Beijing 100871, China
  • Received:2015-09-09 Revised:2015-12-03 Online:2016-01-05 Published:2016-01-05
  • Contact: Yun Liang E-mail:ericlyun@pku.edu.cn
  • About author:Yun Liang obtained his B.S. degree in software engineering from Tongji University, Shanghai, and his Ph.D. degree in computer science from National University of Singapore, in 2004 and 2010, respectively. He was a research scientist with Advanced Digital Science Center, University of Illinois Urbana-Champaign, Urbana, IL, USA, from 2010 to 2012. He has been an assistant professor with the School of Electronics Engineering and Computer Science, Peking University, Beijing, since 2012. His current research interests include graphics processing unit (GPU) architecture and optimization, heterogeneous computing, embedded system, and high level synthesis. Dr. Liang was a recipient of the Best Paper Award in International Symposium on Field-Programmable Custom Computing Machines (FCCM) 2011 and the Best Paper Award nominations in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) 2008 and Design Automation Conference (DAC) 2012. He serves a technical committee member for Asia South Pacific Design Automation Conference (ASPDAC), Design Automation and Test in Europe (DATE), International Conference on Compilers Architecture and Synthesis for Embedded System (CASES), and International Conference on Parallel Architectures and Compilation Techniques (PACT). He is the TPC subcommittee chair for ASPDAC 2013.
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant No. 61300005.


Abstract: The key to high performance for GPU architectures lies in their massive threading capability, which drives a large number of cores and enables execution overlapping among threads. In reality, however, the number of threads that can execute simultaneously is often limited by the size of the register file on GPUs. The traditional SRAM-based register file occupies such a large share of chip area that it cannot scale to meet the increasing demands of GPU applications. Racetrack memory (RM) is a promising technology for designing a large-capacity register file on GPUs due to its high data storage density. However, without careful deployment, the lengthy shift operations of an RM-based register file may hurt performance. In this paper, we explore RM for designing a high-performance register file for GPU architectures. High-density RM helps to improve thread-level parallelism (TLP), but if the bits of a register are not aligned to the access ports, shift operations are required to move them to the ports before they can be accessed, delaying the read/write operations. We develop an optimization framework for RM-based register files on GPUs that employs three optimization techniques at the application, compilation, and architecture levels, respectively: we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations help to determine the TLP without causing cache or register file contention and reduce the shift operation overhead. Experimental results on a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.
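The shift overhead described above can be illustrated with a toy model: registers occupy positions along a racetrack, and each access must shift the requested bits under a fixed port, paying a cost proportional to the distance moved. The sketch below is a simplified illustration under assumed parameters (a single track, one port at position 0, unit cost per shift), and the frequency-based placement heuristic is our own example, not the register mapping algorithm proposed in the paper.

```python
# Toy model of shift cost on a single racetrack-memory track.
# Assumptions (not from the paper): one access port fixed at track
# position 0, one register per track position, cost 1 per unit shift.

PORT = 0  # assumed position of the single access port

def shift_cost(placement, trace):
    """Total shifts for an access trace, given a reg -> position map.

    The track starts aligned with the port; each access must bring the
    requested register's bits under the port, costing the distance from
    the currently aligned position.
    """
    pos = PORT
    total = 0
    for reg in trace:
        target = placement[reg]
        total += abs(target - pos)
        pos = target
    return total

def greedy_placement(regs, trace):
    """Place more frequently accessed registers nearer the port
    (a simple illustrative heuristic)."""
    freq = {r: trace.count(r) for r in regs}
    ordered = sorted(regs, key=lambda r: -freq[r])
    return {r: i for i, r in enumerate(ordered)}

regs = ["r0", "r1", "r2", "r3"]
trace = ["r2", "r2", "r0", "r2", "r1", "r2", "r0"]

naive = {r: i for i, r in enumerate(regs)}  # r0 at 0, r1 at 1, ...
tuned = greedy_placement(regs, trace)       # hot r2 lands next to the port

print(shift_cost(naive, trace), shift_cost(tuned, trace))  # → 10 7
```

Even this crude placement heuristic cuts the shift count for the sample trace, which is the intuition behind optimizing register placement at compile time; preshifting further hides the remaining shift latency by moving the track before the access arrives.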
