2014, Vol. 29, Issue (2): 273-280. doi: 10.1007/s11390-014-1429-6

Special Issue: Computer Architecture and Systems

• Special Section on Cloud-Sea Computing Systems •

Reinventing Memory System Design for Many-Accelerator Architecture

Ying Wang 1,2 (王颖), Student Member, CCF, ACM, IEEE, Lei Zhang 1 (张磊), Member, CCF, ACM, IEEE, Yin-He Han 1,* (韩银和), Member, CCF, ACM, IEEE, and Hua-Wei Li 1 (李华伟), Senior Member, CCF, IEEE, Member, ACM

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2013-11-19  Revised: 2014-01-21  Online: 2014-03-05  Published: 2014-03-05
  • About author: Ying Wang received the B.S. and M.S. degrees in electrical engineering from Harbin Institute of Technology in 2007 and 2009, respectively. He is currently a Ph.D. candidate at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing. His research interests include reconfigurable computing, interconnects, memory systems, and fault tolerance for many-core architectures.
  • Supported by the National Natural Science Foundation of China under Grant Nos. 61173006 and 60921002, the National Basic Research 973 Program of China under Grant No. 2011CB302503, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010403.

The many-accelerator architecture, composed mostly of general-purpose cores and accelerator-like function units (FUs), has become an attractive alternative to homogeneous chip multiprocessors (CMPs) because of its superior power efficiency. However, the emerging many-accelerator processor shows a much more complicated memory access pattern than general-purpose processors (GPPs), because the abundant on-chip FUs tend to generate highly concurrent memory streams with distinct locality and bandwidth demands. The disordered memory streams issued by diverse accelerators interfere with one another and cannot be handled efficiently by the conventional main memory interface, which provides an inflexible data fetching mode. Unlike traditional DRAM memory, our proposed Aggregation Memory System (AMS) adapts to the characteristics of the memory streams from different FUs: it provides the FUs with different data fetching sizes and protects their memory access locality by interleaving their data across memory devices through sub-rank binding. Moreover, with our optimized memory scheduling policy, AMS can batch requests that have no sub-rank conflict into a read burst. Experimental results from trace-based simulation show that AMS brings both a conspicuous performance boost and energy savings.
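To make the batching idea concrete, the following is a minimal Python sketch of grouping pending requests so that no two requests in the same read burst touch the same sub-rank. It is an illustration under stated assumptions, not the scheduling policy of the paper, which the abstract does not specify: the greedy first-fit strategy, the sub-rank count of 8, the bitmask encoding, and the names `Request`, `subrank_mask`, and `batch_requests` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumption: one rank is divided into 8 independently addressable sub-ranks.
NUM_SUBRANKS = 8


@dataclass(frozen=True)
class Request:
    fu_id: int         # hypothetical identifier of the issuing FU
    subrank_mask: int  # bit i set => the request reads from sub-rank i


def batch_requests(queue: List[Request]) -> List[List[Request]]:
    """Greedily pack requests into bursts with no sub-rank conflict.

    Requests are scanned in arrival order. A request joins the current
    burst only if none of its sub-ranks is already claimed by that burst;
    otherwise it is deferred to a later burst.
    """
    pending = list(queue)
    bursts: List[List[Request]] = []
    while pending:
        used = 0  # bitmask of sub-ranks claimed by the current burst
        burst: List[Request] = []
        deferred: List[Request] = []
        for req in pending:
            if req.subrank_mask & used == 0:
                burst.append(req)
                used |= req.subrank_mask
            else:
                deferred.append(req)
        bursts.append(burst)
        pending = deferred
    return bursts


# Usage: FU 0 fetches from sub-ranks 0-3, FU 1 from 4-5,
# FU 2 from 0-1 (conflicts with FU 0), FU 3 from 6-7.
queue = [
    Request(0, 0b00001111),
    Request(1, 0b00110000),
    Request(2, 0b00000011),
    Request(3, 0b11000000),
]
bursts = batch_requests(queue)
burst_fus = [[r.fu_id for r in b] for b in bursts]
```

In this sketch, a wider `subrank_mask` stands in for a larger data fetching size, and binding an FU to a fixed subset of sub-ranks keeps its accesses from colliding with those of FUs bound elsewhere. A real controller would also need to account for DRAM timing constraints and fairness, which are omitted here.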


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved