2010, Vol. 25, Issue (4): 886-894. doi: 10.1007/s11390-010-1069-4

• Architecture and High Performance Computer Systems

Landing Stencil Code on Godson-T

Hui-Min Cui1,2(崔慧敏), Lei Wang1,2(王 蕾), Dong-Rui Fan1(范东睿), Member CCF, IEEE and Xiao-Bing Feng1(冯晓兵)   

    1. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
    2. Graduate University of Chinese Academy of Sciences, Beijing 100039, China
  • Received:2009-06-12 Revised:2010-05-21 Online:2010-07-09 Published:2010-07-09
  • About author:
    Hui-Min Cui is a Ph.D. candidate in the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, CAS. Her research interests include compilers, runtime systems, and binary translation. She received her Bachelor's and Master's degrees in computer science from Tsinghua University in 2001 and 2004, respectively.
    Lei Wang was born in 1976. She received her B.E. and M.S. degrees from Beijing Institute of Technology in 1999 and 2002. She is currently an assistant professor in the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences. Her research interests include compilers and runtime systems.
    Dong-Rui Fan graduated from the Department of Mathematical Science at Beijing Jiaotong University with a Bachelor's degree in 2000. He received his Ph.D. degree from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) in 2005. He is now an associate professor at ICT and a member of IEEE and CCF. He worked with the members of the AMS (Advanced Micro-System) research group to design the new processing models Godson-X and Godson-T. His current research interests focus on many-core systems, including micro-architecture design, parallel processing, and runtime systems.
    Xiao-Bing Feng was born in 1969. He received his B.E. degree from Tianjin University in 1992, his M.S. degree from Peking University in 1996, and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences. He is currently a professor in the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences. His research interests include program analysis, compilers, and tools.
  • Supported by:

    The National Basic Research 973 Program of China under Grant No. 2005CB321602, the National Natural Science Foundation of China under Grant No. 60736012, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z110 and 2009AA01Z103.

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architecture features together with adequate compiler technology; combined, the two may have a profound impact. This paper presents a case study, using the 1-D Jacobi computation, of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Two distinctive features of the Godson-T architecture are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; and 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of the globally accessed SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
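For readers unfamiliar with the computation, the following is a minimal sketch of a 1-D Jacobi stencil and of one simple tiling variant. It is not the paper's implementation: the function names, the 3-point averaging kernel, and the tile parameters are assumptions made for illustration, and the tiled version uses overlapped tiling (redundant halo computation), which is a simpler technique than the timed tiling named in the abstract. The per-tile `local` list merely stands in for a core-local scratchpad buffer.

```python
def jacobi_reference(a, steps):
    """Plain 1-D Jacobi: every interior cell becomes the mean of its
    3-point neighbourhood; the two boundary cells stay fixed."""
    cur = list(a)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_overlapped_tiles(a, steps, tile):
    """Overlapped tiling (illustrative, hypothetical helper).

    Each tile owns the interior cells [start, end). It copies those cells
    plus a halo of `steps` cells per side into a local buffer (a stand-in
    for scratchpad memory), runs all time steps locally, and writes back
    only the cells it owns. Halo cells are recomputed redundantly, so the
    tiles are independent of one another."""
    n = len(a)
    out = list(a)
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        lo = max(start - steps, 0)   # halo clipped at the global boundary
        hi = min(end + steps, n)
        local = a[lo:hi]             # one bulk copy into the local buffer
        for _ in range(steps):
            nxt = local[:]
            for j in range(1, len(local) - 1):
                nxt[j] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
            local = nxt
        # Stale values creep in one cell per step from each clipped halo
        # edge, so after `steps` steps only the owned cells are reliable.
        out[start:end] = local[start - lo:end - lo]
    return out


if __name__ == "__main__":
    grid = [float(i % 7) for i in range(64)]
    same = jacobi_overlapped_tiles(grid, 4, 8) == jacobi_reference(grid, 4)
    print("tiled result matches reference:", same)
```

The design trade-off in this sketch is redundant arithmetic in the halos in exchange for tiles that need no communication during the time steps. Schemes that avoid the redundancy instead require neighbouring tiles to exchange boundary values, which is where a fine-grain synchronization mechanism such as full-empty bits becomes relevant.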



ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved