2011, Vol. 26, Issue (4): 578-587. doi: 10.1007/s11390-011-1158-z

Special Issue: Surveys; Computer Architecture and Systems

• Special Section on Perspectives on Future Computer Science •

New Methodologies for Parallel Architecture

Dong-Rui Fan (范东睿), Member, CCF, IEEE, Xiao-Wei Li (李晓维), and Guo-Jie Li (李国杰), Fellow, CCF

  1. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2011-05-03 Online: 2011-07-05 Published: 2011-07-05
  • Supported by:

    This work is supported in part by the National Basic Research 973 Program of China under Grant Nos. 2011CB302500 and 2005CB321600, and by the National Natural Science Foundation of China under Grant No. 60921002.

Moore's law will continue to grant computer architects ever more transistors for the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. This paper systematically presents the achievements of our research project on parallel architecture, which is supported by the National Basic Research 973 Program of China. It summarizes the innovative approaches and techniques developed to solve significant problems in parallel architecture design, including architecture-level optimization, compiler and language support, reliability, power-performance efficient design, test and verification, and platform building. Two prototype chips, the multi-heavy-core Godson-3 and the many-light-core Godson-T, are described to demonstrate highly scalable and reconfigurable parallel architecture designs. We also present some of our results published in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, and other venues.

ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn