Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (1): 27-46. doi: 10.1007/s11390-020-9799-4
Special topics: Survey; Computer Architecture and Systems
Yu-Tong Lu (1), Distinguished Member, CCF; Peng Cheng (2); Zhi-Guang Chen (1), Member, CCF
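As high-performance computing (HPC), big data, and artificial intelligence (AI) continue to converge, the HPC community urgently needs computing systems that support all three scenarios simultaneously in order to accelerate scientific discovery. However, the explosive growth of scientific data and the sharply different I/O characteristics of applications in these scenarios pose severe data storage and management challenges for traditional HPC systems that must support such converged applications. This paper discusses the background and drivers of this convergence trend, analyzes three challenges in data storage and management, and summarizes our work addressing them at three levels: the parallel file system, the data management middleware, and the upper-layer applications. At the file system level, we propose metadata pre-allocation and a proxy server mechanism to improve metadata operation throughput, and we customize the metadata index structure and a key-value database to reduce access latency for large directories and small files. At the middleware level, we design a hierarchical data management strategy to optimize I/O performance, a data-aware task scheduling mechanism to reduce data movement overhead, a machine-learning-based data management strategy that intelligently matches application characteristics, and in-situ indexing and data query mechanisms to meet data locating needs. At the application level, taking computational simulation, data analysis, deep learning, and scientific workflow applications on the Tianhe-2 supercomputer as representatives, we present some application-specific optimizations and evaluate the effect of each optimization. As HPC systems advance toward exascale computing, this paper focuses on how to achieve "application-driven" data management, aiming to provide lessons for the deep convergence of the exascale computing ecosystem with big data and AI.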
[1] Kai Wu, Dong Li. Unimem: Runtime data management on non-volatile memory-based heterogeneous main memory for high performance computing [J]. Journal of Computer Science and Technology, 2021, 36(1): 90-109.
[2] Zhi-Guang Chen, Yu-Bo Liu, Yong-Feng Wang, Yu-Tong Lu. GPU-based metadata acceleration for large-scale parallel file systems [J]. Journal of Computer Science and Technology, 2021, 36(1): 44-55.
[3] Hong-Mei Wei, Jian Gao, Peng Qing, Kang Yu, Yan-Fei Fang, Ming-Lu Li. MPI-RCDD: A framework for MPI runtime communication deadlock detection [J]. Journal of Computer Science and Technology, 2020, 35(2): 395-411.
[4] André Brinkmann, Kathryn Mohror, Weikuan Yu, Philip Carns, Toni Cortes, Scott A. Klasky, Alberto Miranda, Franz-Josef Pfreundt, Robert B. Ross, Marc-André Vef. Ad hoc file systems for high-performance computing [J]. Journal of Computer Science and Technology, 2020, 35(1): 4-26.
[5] Qi Chen, Kang Chen, Zuo-Ning Chen, Wei Xue, Xu Ji, Bin Yang. An introduction to optimizations of the Sunway storage system for improving application I/O performance [J]. Journal of Computer Science and Technology, 2020, 35(1): 47-60.
[6] Marc-André Vef, Nafiseh Moti, Tim Süß, Markus Tacke, Tommaso Tocci, Ramon Nou, Alberto Miranda, Toni Cortes, André Brinkmann. GekkoFS: A temporary burst buffer file system for HPC applications [J]. Journal of Computer Science and Technology, 2020, 35(1): 72-91.
[7] Robert B. Ross, George Amvrosiadis, Philip Carns, Charles D. Cranor, Matthieu Dorier, Kevin Harms, Greg Ganger, Garth Gibson, Samuel K. Gutierrez, Robert Latham, Bob Robey, Dana Robinson, Bradley Settlemyer, Galen Shipman, Shane Snyder, Jerome Soumagne, Qing Zheng. Mochi: Composing data services for high-performance computing environments [J]. Journal of Computer Science and Technology, 2020, 35(1): 121-144.
[8] Xu Tan, Xiao-Wei Shen, Xiao-Chun Ye, Da Wang, Dong-Rui Fan, Lunkai Zhang, Wen-Mi. A stall-free double buffering mechanism for dataflow architectures [J]. , 2018, 33(1): 145-157.
[9] Xiao-Wei Shen, Xiao-Chun Ye, Xu Tan, Da Wang, Lunkai Zhang, Wen-Ming Li, Zhi-Min Zhang, Dong-Rui Fan, Ning-Hui Sun. An efficient on-chip routing structure for dataflow architectures [J]. , 2017, 32(1): 11-25.