Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 92-120. doi: 10.1007/s11390-020-9781-1

Special Issue: Computer Architecture and Systems

Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics

I/O Acceleration via Multi-Tiered Data Buffering and Prefetching

Anthony Kougkas, Member, ACM, IEEE, Hariharan Devarajan, Xian-He Sun, Fellow, IEEE        

  1. Department of Computer Science, Illinois Institute of Technology, Chicago 60616, U.S.A
  • Received: 2019-06-21 Revised: 2019-08-25 Online: 2020-01-05 Published: 2020-01-14
  • About author: Anthony Kougkas is a research assistant professor of computer science at the Department of Computer Science in the Illinois Institute of Technology (IIT), Chicago. He is a faculty member and the director of I/O research development of the Scalable Computing Software Laboratory at Illinois Tech. He recently received his Ph.D. degree under Dr. Xian-He Sun with a dissertation titled "Accelerating I/O Using Data Labels: A Contention-aware, Multi-tiered, Scalable, and Distributed I/O Platform". Dr. Kougkas is an ACM/IEEE member and is very active in the storage community. Before joining IIT, he worked for more than 12 years as a military officer. He holds a B.Sc. degree in military science, an MBA in leadership, and an M.Sc. degree in computer science, all received in Athens, Greece. His research is focused on parallel and distributed systems, parallel I/O optimizations, HPC storage, Big Data analytics, I/O convergence, and advanced I/O buffering. He is the recipient of the 2019 Karsten Schwan Best Paper Award for his work LABIOS at the 28th International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'19). More information about Dr. Kougkas can be found at
  • Supported by:
    This work is funded by the National Science Foundation of USA under Grants Nos. OCI-1835764 and CSR-1814872.

Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure on external storage and boost the burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in practice. However, each layer of DMSH is an independent heterogeneous system, and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. Further, accessing data with high throughput and low latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency by requesting data before it is needed to move it from a high-latency medium (e.g., disk) to a low-latency one (e.g., main memory). However, existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach resolves challenges such as cache pollution and redundancy. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH.
We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Additionally, we demonstrate the benefits of a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms state-of-the-art buffering platforms by more than 2x. Lastly, results show 10%-35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.
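
The abstract does not spell out the three placement policies, so the following is only a minimal sketch of the general idea of multi-tiered buffering: a write lands in the fastest tier that still has room and otherwise spills to the next, slower tier. The class names (`Tier`, `TieredBuffer`), the tier names, and the capacities are hypothetical and chosen for illustration; this is not Hermes's API or one of its actual policies.

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    """One layer of the hierarchy (hypothetical model, capacities in bytes)."""
    name: str
    capacity: int
    used: int = 0
    blocks: dict = field(default_factory=dict)

    def free(self) -> int:
        return self.capacity - self.used


class TieredBuffer:
    """Fastest-tier-first placement with spill-down (an assumed, simplified policy)."""

    def __init__(self, tiers):
        # Tiers must be ordered from fastest to slowest.
        self.tiers = list(tiers)

    def _evict(self, key):
        # Drop any existing copy so a key lives in at most one tier.
        for tier in self.tiers:
            if key in tier.blocks:
                tier.used -= len(tier.blocks.pop(key))

    def put(self, key, data: bytes) -> str:
        """Store data in the fastest tier with enough free space; return its name."""
        self._evict(key)
        for tier in self.tiers:
            if tier.free() >= len(data):
                tier.blocks[key] = data
                tier.used += len(data)
                return tier.name
        raise MemoryError("no tier has enough free space")

    def get(self, key) -> bytes:
        """Search tiers from fastest to slowest and return the buffered data."""
        for tier in self.tiers:
            if key in tier.blocks:
                return tier.blocks[key]
        raise KeyError(key)


# Usage: three tiers with deliberately tiny, made-up capacities.
buf = TieredBuffer([Tier("ram", 8), Tier("nvme", 16), Tier("burst_buffer", 64)])
first = buf.put("a", b"12345678")   # fills the RAM tier completely
second = buf.put("b", b"1234")      # RAM is full, so this spills to NVMe
third = buf.put("c", b"x" * 20)     # too large for RAM or NVMe
```

A real system would also have to weigh bandwidth, locality, and data hotness when choosing a tier, and would move data between tiers over time; the sketch only shows capacity-driven spill-down.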

Key words: I/O buffering; heterogeneous buffering; layered buffering; deep memory hierarchy; burst buffers; hierarchical data prefetching; data-centric architecture

