Journal of Computer Science and Technology, 2009, Vol. 24, Issue (6): 1061-1073.

Special Issue: Computer Architecture and Systems

• Special Section on International Partnership Programs Supported by CAS

Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Dong-Rui Fan* (范东睿), Member, CCF, IEEE, Nan Yuan (袁楠), Jun-Chao Zhang (张军超), Member, CCF, ACM, Yong-Bin Zhou (周永彬), Wei Lin (林伟), Feng-Long Song (宋风龙), Xiao-Chun Ye (叶笑春), He Huang (黄河), Lei Yu (余磊), Guo-Ping Long (龙国平), Hao Zhang (张浩), and Lei Liu (刘磊)   

  1. Key Laboratory of Computer Systems and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2009-03-13 Revised: 2009-09-28 Online: 2009-11-05 Published: 2009-11-05
  • About author:
    Dong-Rui Fan graduated from the Department of Mathematical Science at Beijing Jiaotong University with a Bachelor's degree in 2000, and he received the Ph.D. degree from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) in 2005. He is now an associate researcher at ICT and a member of CCF and IEEE. He worked together with members of the AMS (Advanced Micro-System) research group and designed the new processing models Godson-X and Godson-T. Currently, his research interests focus on many-core systems, including microarchitecture design, parallel processing, and runtime systems.
    Nan Yuan graduated from the Department of Computer Science and Technology at Beijing University of Posts and Telecommunications with a Bachelor's degree in 2004, and he is currently a Ph.D. candidate at ICT, CAS. His current research interests include parallel architecture design and runtime system design.
    Jun-Chao Zhang is currently an engineer at ICT, CAS. He received his Ph.D. degree in computer science from ICT, CAS in 2005 and his B.Eng. degree from Xi'an Jiaotong University in 1999. His research interests include computer architecture, parallel computing, compilers, and parallel languages. He is a member of ACM and CCF.
    Yong-Bin Zhou received his B.Eng. degree from the University of Science and Technology of China (USTC). Currently, he is a Ph.D. candidate in computer science at ICT, CAS. His recent research topics include computer architecture and parallel computing.
    Wei Lin received his B.Sc. degree from Tianjin University. Currently, he is a Ph.D. candidate in computer science at ICT, CAS. His research interests include computer architecture, parallel computing, and operating systems.
    Feng-Long Song graduated from the Department of Management and Economics at Shandong Normal University and received a Master's degree in 2006. He is a Ph.D. candidate at ICT, CAS. His research interests focus on high-performance computer architecture, on-chip memory hierarchy, and parallel computing.
    Xiao-Chun Ye received his B.Sc. degree from Beijing Normal University in 2004. Currently, he is a Ph.D. candidate in computer science at ICT, CAS. His recent research topics include computer architecture, parallel computing, and bioinformatics.
    He Huang is a Ph.D. candidate at ICT, CAS. His research interests include processor microarchitecture, operating systems, and VLSI backend design.
    Lei Yu is currently a Ph.D. candidate at ICT, CAS. His current research interests include computer architecture and parallel computing.
    Guo-Ping Long is currently a Ph.D. candidate at ICT, CAS. His research interests include parallel programming, performance modeling and evaluation.
    Hao Zhang is an assistant researcher at ICT, CAS. He received the Ph.D. degree in computer science from ICT in 2008. His research interests include the design, analysis, implementation, and benchmarking of processor architectures; switching and routing of on-chip networks; and high-throughput memory systems.
    Lei Liu received his B.Sc. degree from Peking University in 2004. Currently, he is a Ph.D. candidate at ICT, CAS. His research topic is power management of many-core architectures.
  • Supported by:

    Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321600, the National High-Tech Research and Development 863 Program of China under Grant No. 2009AA01Z103, the National Natural Science Foundation of China under Grant No. 60736012, the National Science Fund for Distinguished Young Scholars under Grant No. 60925009, and the Beijing Natural Science Foundation under Grant No. 4092044.

Moore's law will grant computer architects ever more transistors for the foreseeable future; the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to address this challenge. On the hardware side, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which together enable highly efficient utilization of on-chip resources. On the software side, Godson-T provides a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make the many-core design flexibly programmable. This hardware/software cooperative design methodology brings high-end computing within reach of mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture achieves good scalability, fast synchronization, high computational efficiency, and flexible programmability.



ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences