• Architecture and High Performance Computer Systems • Previous Articles     Next Articles

Runtime Engine for Dynamic Profile Guided Stride Prefetching

Qiong Zou1,2, Xiao-Feng Li3, and Long-Bing Zhang2   

  1. 1Department of Computer Science, University of Science and Technology of China, Hefei 230027, China 2Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China 3Intel China Research Center, Beijing 100190, China
  • Received:2007-12-10 Revised:2008-05-09 Online:2008-07-10 Published:2008-07-10

Stride prefetching is recognized as an important technique to improve memory access performance. The prior work usually profiles and/or analyzes the program behavior offline, and uses the identified stride patterns to guide the compilation process by injecting the prefetch instructions at appropriate places. There are some researches trying to enable stride prefetching in runtime systems with online profiling, but they either cannot discover cross-procedural prefetch opportunity, or require special supports in hardware or garbage collection. In this paper, we present a prefetch engine for JVM (Java Virtual Machine). It firstly identifies the candidate load operations during just-in-time (JIT) compilation, and then instruments the compiled code to profile the addresses of those loads. The runtime profile is collected in a trace buffer, which triggers a prefetch controller upon a protection fault. The prefetch controller analyzes the trace to discover any stride patterns, then modifies the compiled code to inject the prefetch instructions in place of the instrumentations. One of the major advantages of this engine is that, it can detect striding loads in any virtual code places for both regular and irregular code, not being limited with plain loop or procedure scopes. Actually we found the cross-procedural patterns take about 30\% of all the prefetchings in the representative Java benchmarks. Another major advantage of the engine is that it has runtime overhead much smaller (the maximal is less than 4.0\%) than the benefits it brings. Our evaluation with Apache Harmony JVM shows that the engine can achieve an average 6.2\% speed-up with SPECJVM98 and DaCapo on Intel Pentium 4 platform, in spite of the runtime overhead.

Key words: fault-tolerance; check-pointing; atomic operation; recoverability of file systems;

[1] Youfeng Wu. Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching. {\it SIGPLAN Not.}, 2002, 37(5): 210--221.
[2]} Robert Muth, Harish Patil, Richard Weiss, P Geoffrey Lowney, Robert Cohn. Profile-guided post-link stride prefetching. In {\it Proc. 16th Int. Supercomputing}, New York, USA, June 22--26, 2002, pp.167--178.
[3]} Brendon D Cahoon. Effective compile-time analysis for data prefetching in Java [Dissertation]. University of Massachusetts, Amherst, 2002.
[4]} Inagaki T, Onodera T, Komatsu H, Nakatani T. Stride prefetching by dynamically inspecting objects. In {\it Proc. the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation}, San Diego, California, USA, June 9--11, 2003, pp.269--277.
[5]} Adl-Tabatabai A, Hudson R L, Serrano M J, Subramoney S. Prefetch injection based on hardware monitoring and object metadata. In {\it Proc. the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation}, Washington DC, USA, June 9--11, 2004, pp.267--276.
[6]} Vanderwiel S P, Lilja D J. Data prefetch mechanisms. {\it ACM Comput. Surv.}, Jun. 2000, 32(2): 174--199.
[7]} Bernstein D, Cohen D, Freund A. Compiler techniques for data prefetching on the PowerPC. In {\it Proc. the IFIP Wg10.3 Working Conference on Parallel Architectures and Compilation Techniques}, Limassol, Cyprus, June 27--29, 1995, pp.19--26.
[8]} Santhanam V, Gornish E H, Hsu W. Data prefetching on the HP PA-8000. In {\it Proc. the 24th Annual International Symposium on Computer Architecture}, Denver, Colorado, United States, June 1--4, 1997, pp.264--273.
[9]} H\"olzle U, Ungar D. Reconciling responsiveness with performance in pure object-oriented languages. {\it ACM Trans. Program. Lang. Syst.}, July 1996, 18(4): 355--400.
[10]} Suganuma T, Yasue T, Kawahito M, Komatsu H, Nakatani T. A dynamic optimization framework for a Java just-in-time compiler. In {\it Proc. the 16th ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications}, Tampa Bay, FL, USA, October 14--18, 2001, pp.180--195.
[11]} Arnold M, Ryder B G. A framework for reducing the cost of instrumented code. In {\it Proc. the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation}, Snowbird, Utah, United States, New York, pp.168--179.
[12]} Apache Harmony project. http://harmony.apache.org/.
[13]} Standand Performance Evaluation Corporation (SPEC). JVM client98 (SPECjvm98).
[14]} The DaCapo benchmark suite. http://dacapobench.org/.
[15]} Intel Corporation. Intel(R) architecture software developer's manual, Volume 2: Instruction set reference manual. http:// download.intel.com/design/intarch/manuals.
[16]} Shuf Y, Serrano M J, Gupta M, Singh J P. Characterizing the memory behavior of Java workloads: A structured view and opportunities for optimizations. {\it SIGMETRICS Perform. Eval. Rev.}, June 2001, pp.194--205.
[17]} Hosking A L, Moss J E, Stefanovic D. A comparative performance evaluation of write barrier implementation. In {\it Proc. Object-Oriented Programming Systems, Languages, and Applications}, Vancouver, British Columbia, Canada, October 18--22, 1992, pp.92--109.
[18]} Intel Corporation. VTune performance analyzer. http://www. intel.com/cd/software/products/apac/zho/vtune/ 275878.htm.
[19]} Chen W, Bhansali S, Chilimbi T, Gao X, Chuang W. Profile-guided proactive garbage collection for locality optimization. {\it SIGPLAN Not.}, Jun. 2006, pp.332--340.
[20]} Chilimbi T M, Davidson B, Larus J R. Cache-conscious structure definition. In {\it Proc. the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation}, Atlanta, Georgia, United States, May 1--4, 1999, pp.13--24.
[21]} Huang X, Blackburn S M, McKinley K S, Moss J B, Wang Z, Cheng P. The garbage collection advantage: Improving program locality. {\it SIGPLAN Not.}, Oct. 2004, 39(10): 69--80.
[1] Gui-Juan Wang, Cheng-Kuan Lin, Jian-Xi Fan, Jing-Ya Zhou, Bao-Lei Cheng. Fault-Tolerant Hamiltonicity and Hamiltonian Connectivity of BCube with Various Faulty Elements [J]. Journal of Computer Science and Technology, 2020, 35(5): 1064-1083.
[2] Hong-Liang Li, Jie Wu, Zhen Jiang, Xiang Li, Xiao-Hui Wei. A Task Allocation Method for Stream Processing with Recovery Latency Constraint [J]. Journal of Computer Science and Technology, 2018, 33(6): 1125-1139.
[3] Seong Woo Kwak and Jung-Min Yang. Optimal Checkpoint Placement on Real-Time Tasks with Harmonic Periods [J]. , 2012, 27(1): 105-112.
[4] InSung Kang, SungJin Choi, Member, IEEE, SoonYoung Jung, and SangKeun Lee. Tree-Based Index Overlay in Hybrid Peer-to-Peer Systems [J]. , 2010, 25(2): 313-329.
[5] Ji-Gang Wu and Thambipillai Srikanthan. Power Efficient Sub-Array in Reconfigurable VLSI Meshes [J]. , 2005, 20(5): 647-653 .
[6] Chun-Hua Yang, Geert Deconinck, and Wei-Hua Gui. Fault-Tolerant Scheduling for Real-Time Embedded Control Systems [J]. , 2004, 19(2): 0-0.
[7] WEI Xiaohui; JU Jiubin;. SCR Algorithm: Saving/Restoring States of File Systems [J]. , 2000, 15(4): 393-400.
[8] WEI Xiaohui(魏晓辉)and JU Jiubin(鞠九滨). SCR Algorithm:Saving/Restoring States of File Systems [J]. , 2000, 15(4): 0-0.
[9] WEI Xiaohui; JU Jiubin;. SFT: A Consistent Checkpointing Algorithm with Short Freezing Time [J]. , 2000, 15(2): 169-175.
[10] LI Layuan; LI Chunlin;. A Semantics-Based Approach for Achieving Self Fault-Tolerance of Protocols [J]. , 2000, 15(2): 176-183.
[11] WEI Hua; LUO Yupin; YANG Shiyuan;. Fault Tolerance of Reconfigurable Bi-Directional Double-Loop LANs [J]. , 1999, 14(4): 379-385.
Full text



[1] Yao Xin; Li Guojie;. General Simulated Annealing[J]. , 1991, 6(4): 329 -338 .
[2] Zhao Yu; Zhang Qiong; Xiang Hui; Shi Jiaosing; He Zhijun;. A Simplified Model for Generating 3D Realistic Sound in the Multimedia and Virtual Reality Systems[J]. , 1996, 11(4): 461 -470 .
[3] BI Jun; WU Jianping;. An Approach to Concurrent TTCN Test Generation[J]. , 1999, 14(6): 614 -618 .
[4] Qi Ge, Hai-Tao Wang, and Hong Zhu. An Improved Algorithm for Finding the Closest Pair of Points[J]. , 2006, 21(1): 27 -31 .
[5] Zhi-Wei Xu, Hao-Jie Zhou, and Guo-Jie Li. Usability Issues of Grid System Software[J]. , 2006, 21(5): 641 -647 .
[6] Guo-Liang Chen, Guang-Zhong Sun, Yun-Quan Zhang, and Ze-Yao Mo. Study on Parallel Computing[J]. , 2006, 21(5): 665 -673 .
[7] Yuan Jiang (姜远), Member, CCF, Ming Li (黎铭), Member, CCF, ACM, IEEE, and Zhi-Hua Zhou (周志华), Senior Member, CCF, IEEE, <. Software Defect Detection with ROCUS[J]. , 2011, 26(2): 328 -342 .
[8] Luke Kien-Weng Tan (陈坚永), Jin-Cheon Na (罗镇川), Member, ACM, Yin-Leng Theng (邓燕玲), and Kuiyu Chang (张圭煜). Phrase-Level Sentiment Polarity Classification Using Rule-Based Typed Dependencies and Additional Complex Phrases Consideration[J]. , 2012, 27(3): 650 -666 .
[9] Shao-Bin Huang (黄绍斌), Senior Member, CCF, Hong-Tao Huang (黄宏涛), Zhi-Yuan Chen (陈志远), Tian-Yang Lü (吕天阳), Member, CCF, and Tao Zhang (张涛). Lazy Slicing for State-Space Exploration[J]. , 2012, 27(4): 872 -890 .
[10] Abdelwadood Mesleh, Dmitriy Skopin, Sergey Baglikov, Anas Quteishat. Heart Rate Extraction from Vowel Speech Signals[J]. , 2012, 27(6): 1243 -1251 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved