Design and Implementation of the Tianhe-2 Data Storage and Management System
-
Abstract: With the convergence of high-performance computing (HPC), big data, and artificial intelligence (AI), the HPC community is pushing for "triple use" systems to expedite scientific discoveries. However, supporting these converged applications on HPC systems presents formidable challenges in terms of storage and data management, due to the explosive growth of scientific data and the fundamental differences in I/O characteristics among HPC, big data, and AI workloads. In this paper, we discuss the driving force behind the converging trend, highlight three data management challenges, and summarize our efforts in addressing them on a typical HPC system at the parallel file system, data management middleware, and user application levels. At the parallel file system level, we propose metadata pre-allocation and a proxy-server mechanism to raise metadata operation throughput, and customize metadata index structures and key-value stores to reduce the access latency of large directories and small files. At the data management middleware level, we design tiered data management policies to improve I/O performance, data-aware task scheduling to reduce data movement overhead, machine-learning-based data management policies that adapt to application characteristics, and in-situ indexing and query mechanisms to locate data. At the user application level, taking computational simulation, data analysis, deep learning, and scientific workflow applications on the Tianhe-2 supercomputer as representatives, we describe several application-specific optimizations and evaluate the effect of each. As HPC systems approach exascale computing, this paper sheds light on how to enable application-driven data management as a preliminary step toward the deep convergence of exascale computing ecosystems, big data, and AI.
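As a rough illustration of the data-aware task scheduling idea summarized above, the sketch below shows a minimal greedy placement heuristic in Python: each task is assigned to the node that already holds the largest share of its input bytes, so only the remainder has to be staged in. This is a toy model written for this summary, not the mechanism deployed on Tianhe-2; the Task class, the place_tasks function, and the node and file names are hypothetical.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:
    # A toy task: each input file has a size in bytes and a node that currently holds it.
    name: str
    inputs: Dict[str, int]      # file name -> size in bytes
    locations: Dict[str, str]   # file name -> node currently holding the file

def place_tasks(tasks: List[Task], nodes: List[str]) -> Dict[str, str]:
    # Greedy data-aware placement: run each task where most of its input already resides,
    # so only the remaining bytes need to be moved across the network.
    placement = {}
    for task in tasks:
        local_bytes = {node: 0 for node in nodes}
        for fname, size in task.inputs.items():
            holder = task.locations.get(fname)
            if holder in local_bytes:
                local_bytes[holder] += size
        best = max(nodes, key=lambda n: local_bytes[n])
        moved = sum(task.inputs.values()) - local_bytes[best]
        placement[task.name] = best
        print(f"{task.name}: run on {best}, {moved} bytes to stage in")
    return placement

if __name__ == "__main__":
    tasks = [
        Task("analysis-0",
             inputs={"snap_000.h5": 4 << 30, "mesh.nc": 1 << 30},
             locations={"snap_000.h5": "node-a", "mesh.nc": "node-b"}),
        Task("analysis-1",
             inputs={"snap_001.h5": 4 << 30},
             locations={"snap_001.h5": "node-b"}),
    ]
    place_tasks(tasks, nodes=["node-a", "node-b"])

A real scheduler would also weigh storage-tier bandwidth, network topology, and node load; this sketch only captures the data-locality aspect mentioned in the abstract.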
-
Cited by journal articles (4)
1. Jia Wei, Mo Chen, Longxiang Wang et al. Status, challenges and trends of data-intensive supercomputing. CCF Transactions on High Performance Computing, 2022, 4(2): 211.
2. Fahad Alblehai. A Caching-Based Pipelining Model for Improving the Input/Output Performance of Distributed Data Storage Systems. Journal of Nanoelectronics and Optoelectronics, 2022, 17(6): 946.
3. Cheng Luo. Computer Data Storage and Management Platform Based on Big Data. Journal of Physics: Conference Series, 2021, 2066(1): 012022.
4. Wei Zhang, Yunxiao Lv. A Connectivity Reduction Based Optimization Method for Task Level Parallel File System. In Proc. the 2022 IEEE 10th International Conference on Computer Science and Network Technology (ICCSNT).
Other citations (0)
-
Other related files
-
English full text of this article (PDF)
2020-1-3-9799-Highlights (529 KB)
External link to the article's supplementary material
https://rdcu.be/cRrmJ