[1] Chen T S, Du Z D, Sun N H, Wang J, Wu C Y, Chen Y J, Temam O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2014, pp.269-284.
[2] Liu D F, Chen T S, Liu S L, Zhou J H, Zhou S Y, Temam O, Feng X B, Zhou X H, Chen Y J. PuDianNao: A polyvalent machine learning accelerator. In Proc. the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2015, pp.369-381.
[3] Voitsechov D, Etsion Y. Single-graph multiple flows: Energy efficient design alternative for GPGPUs. In Proc. the 41st Int. Symp. Computer Architecture, Jun. 2014, pp.205-216.
[4] Oriato D, Tilbury S, Marrocu M, Pusceddu G. Acceleration of a meteorological limited area model with dataflow engines. In Proc. the Int. Symp. Application Accelerators in High Performance Computing, Jul. 2012, pp.129-132.
[5] Pratas F, Oriato D, Pell O, Mata R A, Sousa L. Accelerating the computation of induced dipoles for molecular mechanics with dataflow engines. In Proc. the 21st Int. Symp. Field-Programmable Custom Computing Machines, Apr. 2013, pp.177-180.
[6] Fu H H, Gan L, Clapp R G, Ruan H B, Pell O, Mencer O, Flynn M, Huang X M, Yang G W. Scaling reverse time migration performance through reconfigurable dataflow engines. IEEE Micro, 2014, 34(1):30-40.
[7] Theobald K B. EARTH: An efficient architecture for running threads [Ph.D. Thesis]. McGill University, Montreal, Que., Canada, 1999.
[8] Milutinovic V, Salom J, Trifunovic N, Giorgi R. Guide to Dataflow Supercomputing (1st edition). Springer Press, 2015.
[9] Sancho J C, Kerbyson D J. Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE. In Proc. the IEEE Int. Symp. Parallel and Distributed Processing, Apr. 2008.
[10] Che W J, Chatha K. Compilation of stream programs onto scratchpad memory based embedded multicore processors through retiming. In Proc. the 48th Design Automation Conference, Jun. 2011, pp.122-127.
[11] Saidi S, Tendulkar P, Lepley T, Maler O. Optimizing explicit data transfers for data parallel applications on the cell architecture. ACM Transactions on Architecture and Code Optimization, 2012, 8(4):Article No. 37.
[12] Deng Y, Wang L, Yan X B, Yang X J. A double-buffering strategy for the SRF management in the Imagine stream processor. In Proc. the 9th International Conference for Young Computer Scientists, Nov. 2008, pp.160-165.
[13] Zinner C, Kubinger W. ROS-DMA: A DMA double buffering method for embedded image processing with resource optimized slicing. In Proc. the 12th IEEE Real-Time and Embedded Technology and Applications Symp., Apr. 2006, pp.361-372.
[14] Bai Y W, Liu C C. The performance improvement of a photo card reader by the use of a high-integration chip solution with double FIFO buffers. IEEE Transactions on Consumer Electronics, 2005, 51(2):329-334.
[15] Li J, Han K P, Hong S, Luo S M, Dong Z J, Lu P. A prefetching method with double-buffer for multimedia streaming servers. In Proc. International Conference on Transportation, Mechanical and Electrical Engineering, Dec. 2011, pp.1485-1489.
[16] Singh H, Lee M H, Lu G M, Kurdahi F J, Bagherzadeh N, Filho E C. MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers, 2000, 49(5):465-481.
[17] Zhang C, Li P, Sun G Y, Guan Y J, Xiao B J, Cong J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proc. the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2015, pp.161-170.
[18] Shen X W, Ye X C, Tan X, Wang D, Zhang L K, Li W M, Zhang Z M, Fan D R, Sun N H. An efficient network-on-chip router for dataflow architecture. Journal of Computer Science and Technology, 2017, 32(1):11-25.
[19] Ye X C, Fan D R, Sun N H, Tang S B, Zhang M Z, Zhang H. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture. In Proc. the Int. Symp. Low Power Electronics and Design, Sept. 2013, pp.273-278.
[20] Holewinski J, Pouchet L N, Sadayappan P. High-performance code generation for stencil computations on GPU architectures. In Proc. the 26th ACM International Conference on Supercomputing, Jun. 2012, pp.311-320.
[21] Zhang Y P, Mueller F. Autogeneration and autotuning of 3D stencil codes on homogeneous and heterogeneous GPU clusters. IEEE Trans. Parallel and Distributed Systems, 2013, 24(3):417-427.
[22] Kurzak J, Tomov S, Dongarra J. Autotuning GEMM kernels for the Fermi GPU. IEEE Trans. Parallel and Distributed Systems, 2012, 23(11):2045-2057.
[23] Li S, Ahn J H, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp.469-480.
[24] Solinas M, Badia R M, Bodin F, Cohen A, Evripidou P, Faraboschi P, Fechner B, Gao G R, Garbade A, Girbal S, Goodman D, Khan B, Koliai S, Li F, Luján M, Morin L, Mendelson A, Navarro N, Pop A, Trancoso P, Ungerer T, Valero M, Weis S, Watson I, Zuckermann S, Giorgi R. The TERAFLUX project: Exploiting the dataflow paradigm in next generation teradevices. In Proc. the Euromicro Conference on Digital System Design, Sept. 2013, pp.272-279.
[25] Carter N P, Agrawal A, Borkar S, Cledat R, David H, Dunning D, Fryman J, Ganev I, Golliver R A, Knauerhase R, Lethin R, Meister B, Mishra A K, Pinfold W R, Teller J, Torrellas J, Vasilache N, Venkatesh G, Xu J P. Runnemede: An architecture for ubiquitous high-performance computing. In Proc. the 19th Int. Symp. High Performance Computer Architecture, Feb. 2013, pp.198-209.
[26] Burger D, Keckler S W, McKinley K S, Dahlin M, John L K, Lin C, Moore C R, Burrill J, McDonald R G, Yoder W. Scaling to the end of silicon with EDGE architectures. Computer, 2004, 37(7):44-55.
[27] Swanson S, Schwerin A, Mercaldi M, Petersen A, Putnam A, Michelson K, Oskin M, Eggers S J. The WaveScalar architecture. ACM Transactions on Computer Systems, 2007, 25(2):Article No. 4.