2016, Vol. 31, Issue (1): 36-49. doi: 10.1007/s11390-016-1610-1

Special Issue: Computer Architecture and Systems

• Special Section on Computer Architecture and Systems with Emerging Technologies •

Performance-Centric Optimization for Racetrack Memory Based Register File on GPUs

Yun Liang*(梁云), Member, CCF, ACM, IEEE, and Shuo Wang(王硕)   

  1. Center for Energy-Efficient Computing and Applications (CECA), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  • Received: 2015-09-09 Revised: 2015-12-03 Online: 2016-01-05 Published: 2016-01-05
  • Contact: Yun Liang E-mail: ericlyun@pku.edu.cn
  • About author: Yun Liang obtained his B.S. degree in software engineering from Tongji University, Shanghai, and his Ph.D. degree in computer science from National University of Singapore, in 2004 and 2010, respectively. He was a research scientist with Advanced Digital Science Center, University of Illinois at Urbana-Champaign, Urbana, IL, USA, from 2010 to 2012. He has been an assistant professor with the School of Electronics Engineering and Computer Science, Peking University, Beijing, since 2012. His current research interests include graphics processing unit (GPU) architecture and optimization, heterogeneous computing, embedded systems, and high-level synthesis. Dr. Liang was a recipient of the Best Paper Award in International Symposium on Field-Programmable Custom Computing Machines (FCCM) 2011 and the Best Paper Award nominations in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) 2008 and Design Automation Conference (DAC) 2012. He serves as a technical committee member for Asia South Pacific Design Automation Conference (ASPDAC), Design Automation and Test in Europe (DATE), International Conference on Compilers Architecture and Synthesis for Embedded System (CASES), and International Conference on Parallel Architectures and Compilation Techniques (PACT). He is the TPC subcommittee chair for ASPDAC 2013.
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant No. 61300005.

The key to high performance on GPU architectures lies in their massive threading capability, which drives a large number of cores and enables execution overlapping among threads. In practice, however, the number of threads that can execute simultaneously is often limited by the size of the register file. The traditional SRAM-based register file takes up such a large amount of chip area that it cannot scale to meet the increasing demand of GPU applications. Racetrack memory (RM) is a promising technology for designing a large-capacity register file on GPUs because of its high storage density. However, without careful deployment of an RM-based register file, the lengthy shift operations of RM may hurt performance. In this paper, we explore RM for designing a high-performance register file for GPU architectures. The high storage density of RM helps to improve thread-level parallelism (TLP), but if the bits of a register are not aligned with the access ports, shift operations are required to move them to the ports before they can be accessed, which delays read/write operations. We develop an optimization framework for the RM-based register file on GPUs that employs three optimization techniques, at the application, compilation, and architecture levels, respectively. Specifically, we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations help to determine a TLP that does not cause cache and register file resource contention and to reduce the shift operation overhead. Experimental results using a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.
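To illustrate why register placement matters for shift overhead, the following is a minimal sketch of a toy cost model. It is not the paper's register mapping algorithm. It assumes a single track with one access port, a cost of one shift per domain of misalignment, and a track that stays where the last access left it. The function name `shift_cost`, the access trace, and both placements are hypothetical and chosen only for illustration.

```python
def shift_cost(accesses, position, port=0):
    """Count shifts for an access sequence on one track with a single port.

    accesses: register names in access order.
    position: dict mapping register name -> domain index on the track.
    port:     domain index currently aligned with the access port.

    Assumption: aligning a register costs one shift per domain of distance.
    Physically the track moves past a fixed port; tracking the aligned
    index instead gives the same distances.
    """
    total = 0
    for reg in accesses:
        target = position[reg]
        total += abs(target - port)  # shifts needed to align this register
        port = target                # no shift back after the access
    return total


# Hypothetical access trace and two hypothetical placements of four registers.
trace = ["r0", "r3", "r0", "r3", "r1", "r2"]
naive = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}  # placed in numbering order
tuned = {"r0": 0, "r3": 1, "r1": 2, "r2": 3}  # frequently paired registers adjacent

print(shift_cost(trace, naive))  # 12
print(shift_cost(trace, tuned))  # 5
```

Under these assumptions, placing registers that are accessed back to back in adjacent domains reduces the shift count from 12 to 5 for the same trace. This is the kind of saving that a compile-time register mapping targets.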

