Journal of Computer Science and Technology, 2020, Vol. 35, Issue (1): 27-46. doi: 10.1007/s11390-020-9799-4

Special topics: Survey; Computer Architecture and Systems


Design and Implementation of the Tianhe-2 Data Storage and Management System

Yu-Tong Lu¹, Distinguished Member, CCF; Peng Cheng²; Zhi-Guang Chen¹, Member, CCF

  ¹ National Supercomputer Center in Guangzhou, Sun Yat-sen University, Guangzhou 510000, China;
  ² College of Computer, National University of Defense Technology, Changsha 410073, China
  • Received: 2019-07-15; Revised: 2019-10-14; Online: 2020-01-05; Published: 2020-01-14
  • About author: Yu-Tong Lu is a professor in the School of Data and Computer Science, Sun Yat-sen University, Guangzhou. She is also the Director of the National Supercomputer Center in Guangzhou. Her research interests include high-performance computing, parallel file systems, and advanced programming environments.
  • Supported by:
    Special Section on Selected I/O Technologies for High-Performance Computing and Data Analytics. This work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000302, the National Natural Science Foundation of China under Grant Nos. U1611261 and 61872392, and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211.

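Abstract (Chinese version, in translation): As high-performance computing (HPC), big data, and artificial intelligence (AI) continue to converge, the HPC community urgently needs computing systems that support all three scenarios at once to accelerate scientific discovery. However, the explosive growth of scientific data and the sharply different I/O characteristics of applications in each scenario mean that traditional HPC systems face severe data storage and management challenges when supporting such converged applications. This paper discusses the background and causes driving the convergence trend, analyzes three challenges in data storage and management, and summarizes our work to address them at three levels: the parallel file system, the data management middleware, and the user applications. At the file system level, we propose metadata pre-allocation and a proxy server mechanism to raise metadata operation throughput, and we customize the metadata index structure and use a key-value database to reduce access latency for large directories and small files. At the middleware level, we design a hierarchical data management strategy to optimize I/O performance, a data-aware task scheduling mechanism to reduce data movement overhead, a machine-learning-based data management strategy that matches application characteristics, and in-situ indexing and data query mechanisms to meet data-locating needs. At the application level, we take computational simulation, data analytics, deep learning, and scientific workflow applications on the Tianhe-2 supercomputer as representatives, present selected application-specific optimizations, and evaluate the effect of each optimization. As HPC systems advance toward exascale computing, this paper focuses on how to realize "application-driven" data management, aiming to offer reusable experience for the deep convergence of the exascale computing ecosystem with big data and AI.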


Abstract: With the convergence of high-performance computing (HPC), big data and artificial intelligence (AI), the HPC community is pushing for "triple use" systems to expedite scientific discoveries. However, supporting these converged applications on HPC systems presents formidable challenges in terms of storage and data management due to the explosive growth of scientific data and the fundamental differences in I/O characteristics among HPC, big data and AI workloads. In this paper, we discuss the driving force behind the converging trend, highlight three data management challenges, and summarize our efforts in addressing these data management challenges on a typical HPC system at the parallel file system, data management middleware, and user application levels. As HPC systems are approaching the border of exascale computing, this paper sheds light on how to enable application-driven data management as a preliminary step toward the deep convergence of exascale computing ecosystems, big data, and AI.

Key words: high-performance computing (HPC), data management, converged application, parallel file system
