Citation: | Fang Zheng, Hong-Liang Li, Hui Lv, Feng Guo, Xiao-Hong Xu, Xiang-Hui Xie. Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture[J]. Journal of Computer Science and Technology, 2015, 30(1): 145-162. DOI: 10.1007/s11390-015-1510-9 |
[1] |
Manferdelli J L, Govindaraju N K, Crall C. Challenges and opportunities in many-core computing. Proceedings of the IEEE, 2008, 96(5): 808-815.
|
[2] |
Shalf J, Dosanjh S, Morrison J. Exascale computing technology challenges. In Proc. the 9th Int. High Performance Computing for Computational Science (VECPAR), June 2011, pp.1-25.
|
[3] |
Daga M, Aji A M, Feng W. On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing. In Proc. Symposium on Application Accelerators in High-Performance Computing, July 2011, pp.141-149.
|
[4] |
Chung E S, Milder P A, Hoe J C, Mai K. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proc. the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010, pp.225-236.
|
[5] |
Lee V W, Grochowski E, Geva R. Performance benefits of heterogeneous computing in HPC workloads. In Proc. the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), May 2012, pp.16-26.
|
[6] |
Kumar R, Farkas K I, Jouppi N P et al. Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proc. the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2003, pp.81-92.
|
[7] |
Lee V W, Kim C, Chhugani J et al. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In Proc. the 37th Annual International Symposium on Computer Architecture (ISCA), June 2010, pp. 451-460.
|
[8] |
Wittenbrink C M, Kilgariff E, Prabhu A. Fermi GF100 GPU architecture. IEEE Micro, 2011, 31(2): 50-59.
|
[9] |
Kapasi U J, Dally W J, Rixner S et al. The Imagine stream processor. In Proc. IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD), September 2002, pp. 282-288.
|
[10] |
Duran A, Klemm M. The Intel® many integrated core architecture. In Proc. International Conference on High Performance Computing and Simulation (HPCS), July 2012, pp. 365-366.
|
[11] |
Alves M A Z, Freitas H C, Navaux P O A. Investigation of shared L2 cache on many-core processors. In Proc. the 22nd International Conference on Architecture of Computing Systems (ARCS), March 2009, pp. 1-10.
|
[12] |
Wentzlaff D, Griffin P, Hoffmann H et al. On-chip interconnection architecture of the Tile Processor. IEEE Micro, 2007, 27(5): 15-31.
|
[13] |
Howard J, Dighe S, Hoskote Y et al. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Proc. IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb. 2010, pp.108-109.
|
[14] |
Hoskote Y, Vangal S, Singh A et al. A 5-GHz mesh interconnect for a Teraflops processor. IEEE Micro, 2007, 27(5): 51-61.
|
[15] |
Gries M, Hoffmann U, Konow M et al. SCC: A flexible architecture for many-core platform research. Computing in Science and Engineering, 2011, 13(6): 79-83.
|
[16] |
Balakrishnan A, Naeemi A. Interconnect network analysis of many-core chips. IEEE Transactions on Electron Devices, 2011, 58(9): 2831-2837.
|
[17] |
Taylor M B, Lee W, Amarasinghe S et al. Scalar operand networks: On-chip interconnect for ILP in partitioned architectures. In Proc. the 9th IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2003, pp.341-353.
|
[18] |
Kim J. Low-cost router microarchitecture for on-chip networks. In Proc. the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2009, pp.255-266.
|
[19] |
Jung H, Ju M, Che H. A theoretical framework for design space exploration of manycore processors. In Proc. the 19th Annual IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, July 2011, pp.117-125.
|
[20] |
Seiler L, Carmean D, Sprangle E et al. Larrabee: A manycore x86 architecture for visual computing. IEEE Micro, 2009, 29(1): 10-21.
|
[21] |
Chen P, Zhao H L, Tao C, Sang H S. Block-run-based connected component labelling algorithm for GPGPU using shared memory. Electronics Letters, 2011, 47(24): 1309-1311.
|
[22] |
Sawant N, Kulkarni D. Performance evaluation of feature extraction algorithm on GPGPU. In Proc. International Conference on Communication Systems and Network Technologies (CSNT), June 2011, pp. 536-540.
|
[23] |
Heinecke A, Klemm M, Bungartz H J. From GPGPU to many-core: Nvidia Fermi and Intel many integrated core architecture. Computing in Science & Engineering, 2012, 14(2): 78-83.
|
[24] |
Bell S, Edwards B, Amann J et al. TILE64™ processor: A 64-core SoC with mesh interconnect. In Proc. IEEE Int. Solid-State Circuits Conference (ISSCC), February 2008, pp.88-89, 598.
|
[25] |
Sewell K, Dreslinski R G, Manville T et al. Swizzle-switch networks for many-core systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2012, 2(2): 278-294.
|
[26] |
Kim J, Balfour J, Dally W. Flattened butterfly topology for on-chip networks. In Proc. the 40th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2007, pp.172-182.
|
[27] |
Bakhoda A, Kim J, Aamodt T M. Throughput-effective on-chip networks for manycore accelerators. In Proc. the 43rd Annual IEEE/ACM MICRO, Dec. 2010, pp. 421-432.
|
[28] |
Fan D, Zhang H, Wang D et al. Godson-T: An efficient many-core processor exploring thread-level parallelism. IEEE Micro, 2012, 32(2): 38-47.
|
[29] |
Wang X, Gan G, Manzano J et al. A quantitative study of the on-chip network and memory hierarchy design for manycore processor. In Proc. the 14th IEEE International Conference on Parallel and Distributed Systems, Dec. 2008, pp. 689-696.
|
[30] |
Taylor M B, Psota J, Saraf A et al. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In Proc. the 31st Annual International Symposium on Computer Architecture (ISCA), June 2004, pp. 2-13.
|
[31] |
Taylor M B, Kim J, Miller J et al. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 2002, 22(2): 25-35.
|
[32] |
Bakhoda A, Kim J, Aamodt T M. Throughput-effective on-chip networks for manycore accelerators. In Proc. the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2010, pp.421-432.
|
[33] |
Asanovic K, Bodik R, Catanzaro B C et al. The landscape of parallel computing research: A view from Berkeley. Technical Report, UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
|
[34] |
Bell N, Garland M. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proc. Conference on High Performance Computing Networking, Storage and Analysis, November 2009, Article No.18.
|
[35] |
Choi J W, Singh A, Vuduc R. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proc. the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2010, pp.115-126.
|
[36] |
Luo L, Wong M, Hwu W. An effective GPU implementation of breadth-first search. In Proc. the 47th Design Automation Conference (DAC), June 2010, pp.52-55.
|
[37] |
Bo Z, Zheng-hui X, Wu R et al. Accelerating FDTD algorithm using GPU computing. In Proc. IEEE International Conference on Microwave Technology & Computational Electromagnetics, May 2011, pp.410-413.
|
[38] |
Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach (5th edition). Morgan Kaufmann, 2011.
|
[39] |
Hill M, Marty M. Amdahl's law in the multicore era. IEEE Computer, 2008, 41(7): 33-38.
|
[40] |
Riley M W, Warnock J D, Wendel D F. Cell broadband engine processor: Design and implementation. IBM Journal of Research and Development, 2007, 51(5): 545-557.
|
[41] |
Kahle J A, Day M N, Hofstee H P et al. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 2005, 49(4): 589-604.
|
[42] |
Woo D H, Lee H H S. Extending Amdahl's law for energy-efficient computing in the many-core era. IEEE Computer, 2008, 41(12): 24-31.
|
[43] |
Kumar R, Tullsen D M, Ranganathan P et al. Single-ISA heterogeneous multicore architectures for multithreaded workload performance. In Proc. the 31st Annual International Symposium on Computer Architecture, June 2004, pp. 64-75.
|
[44] |
Yang Y, Xiang P, Mantor M et al. CPU-assisted GPGPU on fused CPU-GPU architectures. In Proc. the 18th IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2012.
|
[45] |
Branover A, Foley D, Steinman M. AMD fusion APU: Llano. IEEE Micro, 2012, 32(2): 28-37.
|
[46] |
Keckler S W, Dally W J, Khailany B et al. GPUs and the future of parallel computing. IEEE Micro, 2011, 31(5): 7-17.
|
[47] |
Khunjush F, Dimopoulos N J. Extended characterization of DMA transfers on the Cell BE processor. In Proc. IEEE International Symposium on Parallel and Distributed Processing, April 2008.
|
[48] |
Gebhart M, Keckler S W, Khailany B et al. Unifying primary cache, scratch, and register file memories in a throughput processor. In Proc. the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2012, pp.96-106.
|
[49] |
Keckler S W, Dally W J, Maskit D et al. Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor. ACM SIGARCH Computer Architecture News, 1998, 26(3): 306-317.
|
[50] |
Korch M, Rauber T, Scholtes C. Memory-intensive applications on a many-core processor. In Proc. the 13th IEEE International Conference on High Performance Computing and Communications (HPCC), September 2011, pp.126-134.
|
[51] |
Abellán J L, Fernández J, Acacio M E. Efficient hardware barrier synchronization in many-core CMPs. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(8): 1453-1466.
|
[52] |
Watkins M A, Albonesi D H. ReMAP: A reconfigurable heterogeneous multicore architecture. In Proc. the 43rd IEEE International Symposium on Microarchitecture, Dec. 2010, pp. 497-508.
|
[53] |
Yu L, Liu Z, Fan D et al. Study on fine-grained synchronization in many-core architecture. In Proc. the 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, May 2009, pp.524-529.
|