[1] Tolentino M, Cameron K W. The optimist, the pessimist, and the global race to exascale in 20 megawatts. Computer, 2012, 45(1):95-97.[2] Kogge P. The tops in flops. IEEE Spectrum, 2011, 48(2):48-54.[3] Kogge P, Bergman K, Borkar S et al. ExaScale computing study:Technology challenges in achieving exascale systems. Technical Report TR-2008-13, Defense Advanced Research Projects Agency Information Processing Technigues Office, 2008. http://www.citeulike.org/group/11430/article/6638217, Dec. 2017.[4] Milutinovi V, Salom J, Trifunovic N, Giorgi R. Guide to DataFlow Supercomputing:Basic Concepts, Case Studies, and a Detailed Example. Springer, 2015.[5] Dennis J B. First version of a data flow procedure language. In Proc. the Programming Symp., April 1974, pp.362-376.[6] Oriato D, Tilbury S, Marrocu M, Pusceddu G. Acceleration of a meteorological limited area model with dataflow engines. In Proc. Symp. Application Accelerators in High Performance Computing, July 2012, pp.129-132.[7] Pratas F, Oriato D, Pell O, Mata R A, Sousa L. Accelerating the computation of induced dipoles for molecular mechanics with dataflow engines. In Proc. the 21st IEEE Annual Int. Symp. Field-Programmable Custom Computing Machines, April 2013, pp.177-180.[8] Fu H H, Gan L, Clapp R G et al. Scaling reverse time migration performance through reconfigurable dataflow engines. IEEE Micro, 2014, 34(1):30-40.[9] Ackerman W B, Dennis J B. VAL-A value-oriented algorithmic language:Preliminary reference manual. Technical Report TR-218, Computation Structure Group, Laboratory for Computer Science, MIT, 1979. http://citeseerx.ist.psu.edu/showciting?cid=928490, Dec. 2017.[10] Burger D, Keckler S W, McKinley K S, Dahlin M, John L K, Lin C, Moore C R, Burrill J, McDonald R G, Yoder W. Scaling to the end of silicon with edge architectures. Computer, 2004, 37(7):44-55.[11] Arvind N, Gostelow K, Plouffe W. An asynchronous programming language and computing machine. Technical Report TR114a, Department of Information and Computer Science, University of California, 1978.[12] Arvind K, Nikhil R S. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Computers, 1990, 39(3):300-318.[13] Swanson S, Schwerin A, Mercaldi M et al. The wavescalar architecture. ACM Trans. Computer Systems, 2007, 25(2):Article No. 4.[14] Zuckerman S, Suetterlein J, Knauerhase R, Gao G R. Position paper:Using a "codelet" program execution model for exascale machines. In Proc. the 1st Int. Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, June 2011, pp.64-69.[15] Suettlerlein J, Zuckerman S, Gao G R. An implementation of the codelet model. In Proc. the 19th Int. Conf. Parallel Processing, August 2013, pp.633-644.[16] Pell O, Averbukh V. Maximum performance computing with dataflow engines. Computing in Science & Engineering, 2012, 14(4):98-103.[17] Voitsechov D, Etsion Y. Single-graph multiple flows:Energy efficient design alternative for GPGPUs. ACM SIGARCH Computer Architecture News, 2014, 42(3):205-216.[18] Gurd J R, Kirkham C C, Watson I. The Manchester prototype dataflow computer. Communications of the ACM, 1985, 28(1):34-52.[19] Shen X W, Ye X C, Tan X, Wang D, Zhang L K, Li W M, Zhang Z M, Fan D R, Sun N H. An efficient network-onchip router for dataflow architecture. Journal of Computer Science and Technology, 2017, 32(1):11-25.[20] Tan X, Shen X W, Ye X C, Wang D, Fan D R, Zhang L K, Li W M, Zhang Z M, Tang Z M. A non-stop double buffering mechanism for dataflow architecture. Journal of Computer Science and Technology, 2018, 33(1):145-157.[21] Ye X C, Fan D R, Sun N H et al. SimICT:A fast and flexible framework for performance and power evaluation of large-scale architecture. In Proc. Int. Symp. Low Power Electronics and Design, September 2013, pp.273-278.[22] Nguyen A, Satish N, Chhugani J et al. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proc. ACM/IEEE Int. Conf. for High Performance Computing Networking Storage and Analysis, Nov. 2010.[23] Kurzak J, Tomov S, Dongarra J. Autotuning GEMM kernels for the Fermi GPU. IEEE Trans. Parallel and Distributed Systems, 2012, 23(11):2045-2057.[24] Zhang Y P, Mueller F. Autogeneration and autotuning of 3D stencil codes on homogeneous and heterogeneous GPU clusters. IEEE Trans. Parallel and Distributed Systems, 2013, 24(3):417-427.[25] del Mundo C, Feng W C. Towards a performance-portable FFT library for heterogeneous computing. In Proc. the 11th ACM Conf. Computing Frontiers, May 2014, Article No.11.[26] Li S, Ahn J H, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT:An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp.469-480.[27] Naffziger S. High-performance processors in a power-limited world. In Proc. Symp. VLSI Circuits Digest of Technical Papers, June 2006, pp.93-97.[28] Solinas M, Badia R M, Bodin F et al. The TERAFLUX project:Exploiting the dataflow paradigm in next generation teradevices. In Proc. Euromicro Conf. Digital System Design, September 2013, pp.272-279.[29] Carter N P, Agrawal A, Borkar S et al. Runnemede:An architecture for ubiquitous high-performance computing. In Proc. the 19th IEEE Int. Symp. High Performance Computer Architecture, February 2013, pp.198-209. |