2009, Vol. 24, Issue (6): 1086-1097.

• Special Section on International Partnership Programs Supported by CAS

PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization

Wei Mi1,2 (米伟), Xiao-Bing Feng1 (冯晓兵), Member, CCF, ACM, Yao-Cang Jia1 (贾耀仓), Li Chen1 (陈莉), Member, CCF, ACM, and Jing-Ling Xue3 (薛京灵), Senior Member, IEEE   

  1. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  2. Graduate University of Chinese Academy of Sciences, Beijing 100039, China
  3. Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
  • Received: 2009-06-12 Revised: 2009-09-29 Online: 2009-11-05 Published: 2009-11-05
  • About author:
    Wei Mi is a Ph.D. candidate in the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences. His research interests include compilers and computer architecture. He received his Bachelor's degree in computer science from the University of Science and Technology of China in 2003.
    Xiao-Bing Feng was born in 1969. He received his B.E. degree from Tianjin University in 1992, his M.S. degree from Peking University in 1996, and his Ph.D. degree from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He is currently a professor at the Key Laboratory of Computer System and Architecture, ICT, CAS. His research interests include program analysis, compilers, and tools.
    Yao-Cang Jia is a Ph.D. candidate in the Key Laboratory of Computer System and Architecture, ICT, CAS. His research interests include compilers and dynamic optimization. He received his Bachelor's degree in computer science from Northwestern Polytechnical University in 2003.
    Li Chen was born in 1970. She received her B.E. and M.E. degrees from Shandong University of Science and Technology in 1992 and 1995, respectively, and her Ph.D. degree from ICT, CAS, where she is currently an assistant professor. Her research interests include parallel optimization and environment.
    Jing-Ling Xue is a professor of computer science and engineering at the University of New South Wales. He received his B.Eng and M.Eng degrees in computer science and engineering from Tsinghua University in 1984 and 1987, respectively. He received his Ph.D. degree in computer science and engineering from Edinburgh University in 1992. He leads the Programming Languages and Compilers Group and its subgroup Compiler Research Group (CORG) at UNSW. His research interests are programming languages, compiler optimisations, computer architecture, parallel computing, distributed systems and cluster computing, and embedded systems.
  • Supported by:

    Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321602, and the National Natural Science Foundation of China under Grant No. 60736012.

DRAM row buffer conflicts can increase memory access latency significantly. This paper presents a new page-allocation-based optimization that works seamlessly with some existing hardware and software optimizations to eliminate significantly more row buffer conflicts. Validation in simulation, using a set of selected scientific and engineering benchmarks against a few representative memory controller optimizations, shows that our method can reduce row buffer miss rates by up to 76% (37.4% on average). This reduction in row buffer miss rates translates into performance speedups of up to 15% (5% on average).
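
As a rough illustration of why physical page placement can matter for row buffer behavior, the following toy model counts row buffer misses for two different page placements. It is only a sketch under stated assumptions and is not the paper's algorithm: the bank count, the frame-to-bank mapping, the open-page policy, and all function names here are hypothetical and chosen for the example.

```python
# Toy model (assumed parameters, not taken from the paper): each physical
# page frame maps to one DRAM row in one bank, and each bank keeps a single
# open row in its row buffer (open-page policy).
NUM_BANKS = 4          # hypothetical bank count
LINES_PER_PAGE = 8     # hypothetical number of accesses per page


def bank_and_row(frame):
    """Map a physical page frame to (bank, row) by simple interleaving."""
    return frame % NUM_BANKS, frame // NUM_BANKS


def row_buffer_miss_rate(frame_stream):
    """Replay a stream of page-frame accesses and return the miss rate."""
    open_row = {}      # bank -> row currently held in that bank's row buffer
    misses = 0
    for frame in frame_stream:
        bank, row = bank_and_row(frame)
        if open_row.get(bank) != row:
            misses += 1            # the row must be activated (again)
            open_row[bank] = row
    return misses / len(frame_stream)


def interleaved_stream(frame_a, frame_b):
    """Model a loop that alternates between two pages, e.g. a[i] * b[i]."""
    stream = []
    for _ in range(LINES_PER_PAGE):
        stream += [frame_a, frame_b]
    return stream


# Placement 1: both pages land in bank 0 but in different rows, so every
# access closes the row the other page needs (a row buffer conflict).
conflicting = row_buffer_miss_rate(interleaved_stream(0, 4))

# Placement 2: the allocator picks frames in different banks, so each row
# stays open after its first activation.
separated = row_buffer_miss_rate(interleaved_stream(0, 5))

print(f"same bank: {conflicting:.3f}, different banks: {separated:.3f}")
```

With these assumed parameters the script prints `same bank: 1.000, different banks: 0.125`. Real controllers use more elaborate address mappings and scheduling, so the numbers only show the direction of the effect, not its size.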



ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved