Journal of Computer Science and Technology, 2021, Vol. 36, Issue (5): 1184-1199. DOI: 10.1007/s11390-021-0232-4

Special Issue: Computer Networks and Distributed Computing

• Regular Paper •

Apollo: Rapidly Picking the Optimal Cloud Configurations for Big Data Analytics Using a Data-Driven Approach

Yue-Wen Wu1, Yuan-Jia Xu1, Heng Wu2,* (Member, CCF, ACM, IEEE), Lin-Gang Su1, Wen-Bo Zhang2 (Senior Member, CCF), and Hua Zhong2 (Senior Member, CCF)

  • 1 University of Chinese Academy of Sciences, Beijing 100049, China;
    2 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2020-02-27; Revised: 2021-08-05; Online: 2021-09-30; Published: 2021-09-30
  • About author: Yue-Wen Wu received his M.S. degree in software engineering from Huazhong University of Science and Technology, Wuhan, in 2013. He is a Ph.D. candidate at the Institute of Software, Chinese Academy of Sciences, Beijing. His current research interests include cloud computing and resource provisioning, machine learning, and performance modeling.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1001804.

Big data analytics applications are increasingly deployed on cloud computing infrastructures, yet picking the optimal cloud configurations in a cost-effective way remains a major challenge. In this paper, we address this problem with high accuracy and low overhead. We propose Apollo, a data-driven approach that rapidly picks the optimal cloud configurations by reusing data from similar workloads. We first classify 12 typical workloads in BigDataBench by characterizing pairwise correlations in our offline benchmarks. When a new workload arrives, we run it on several small datasets to rank its key characteristics and identify its similar workloads. Based on this ranking, we then limit the search space of cloud configurations through a classification mechanism. Finally, we leverage a hierarchical regression model to determine which cluster is more suitable and use a local search strategy to pick the optimal cloud configurations within a few extra tests. Our evaluation on 12 typical workloads in HiBench shows that, compared with state-of-the-art approaches, Apollo improves search accuracy by up to 30% while reducing the overhead of picking the optimal cloud configurations by as much as 50%.
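
The similarity step described above (ranking a new workload's key characteristics and matching it to known workloads) can be illustrated with a minimal sketch. It assumes that similarity is measured with a Kendall rank correlation over per-resource usage scores; the abstract does not specify the exact measure, and the workload names, the four resource dimensions, and the helper names `kendall_tau` and `most_similar` are hypothetical and used only for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score sequences.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have the same length (>= 2)")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def most_similar(new_profile, known_profiles):
    """Return the name of the known workload whose characteristic ranking
    correlates best with the new workload's profile."""
    return max(
        known_profiles,
        key=lambda name: kendall_tau(new_profile, known_profiles[name]),
    )


# Hypothetical profiles: scores for [CPU, memory, disk, network] usage,
# as they might be measured from runs on small datasets.
known = {
    "cpu_bound_job": [0.9, 0.4, 0.2, 0.1],
    "io_bound_job": [0.2, 0.3, 0.9, 0.6],
}
new_workload = [0.8, 0.5, 0.3, 0.2]

print(most_similar(new_workload, known))
```

In this toy setting the new workload ranks its resources in the same order as the CPU-bound profile, so that profile would be selected, and the search over cloud configurations could then start from the configurations that worked well for it.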

Key words: big data analytics; cloud configuration; data driven


ISSN 1000-9000(Print)

CN 11-2296/TP
