A Data-Driven Approach to Cloud Configuration Optimization for Big Data Environments
Apollo: Rapidly Picking the Optimal Cloud Configurations for Big Data Analytics Using a Data-Driven Approach
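Extended abstract: Cloud computing has become the mainstream runtime environment for big data analytics applications, and picking a suitable cloud configuration to optimize their performance remains a major challenge. To achieve both high accuracy and low overhead in cloud configuration selection, this paper proposes Apollo, a data-driven cloud configuration optimization approach for big data environments. Apollo reuses historical data, based on the similarity of big data workloads, to reduce the overhead of configuration selection. In the offline analysis phase, machine learning methods are used to characterize typical big data workloads. When a new workload arrives, the online phase narrows the search space using the historical data and applies a hierarchical regression algorithm to quickly find the optimal search path. Results on the HiBench and BigDataBench benchmarks show that, compared with existing approaches, Apollo improves the accuracy of cloud configuration selection by 30% while reducing the overhead of configuration search by 50%.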
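Note: This file will be made available for free download on the JCST website and used in other publicity channels in China. Its purpose is to help readers in China grasp the paper's research content and contributions more quickly, which in turn supports the dissemination and citation of the work. The abstract should be independent and self-explanatory, so that readers can obtain the necessary information without reading the full paper; in other words, the abstract is a complete short text that can be cited on its own. The Chinese abstract here is a long abstract: it is more detailed than the paper's English abstract rather than a simple translation of it, and it is not a listing of the contents of each section. The suggested outline follows; please refer to it when writing the Chinese long abstract. The paper's Highlight (in English), which must also be provided, should follow the same outline. Thank you!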
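1. Research background (Context).
2. Objective: Accurately describe the purpose of the study, explain why the problem is raised, and indicate the scope and importance of the research.
3. Method: Briefly describe the basic design of the study and how the conclusions were obtained.
4. Results and findings: Briefly list the main results of the study and any new findings, and state their value and limitations. The description should be specific and accurate, giving quantitative data where possible rather than only qualitative descriptions, along with confidence values for the results (if any).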
5. Conclusions: Briefly summarize the lessons learned, the validated viewpoints and their theoretical or practical value, whether related problems remain for further study, and whether the work can be generalized and what its application value is.
Abstract: Big data analytics applications are increasingly deployed on cloud computing infrastructures, and picking the optimal cloud configurations in a cost-effective way remains a major challenge. In this paper, we address this problem with high accuracy and low overhead. We propose Apollo, a data-driven approach that can rapidly pick the optimal cloud configurations by reusing data from similar workloads. We first classify 12 typical workloads in BigDataBench by characterizing pairwise correlations in our offline benchmarks. When a new workload arrives, we run it with several small datasets to rank its key characteristics and identify similar workloads. Based on the rank, we then limit the search space of cloud configurations through a classification mechanism. Finally, we leverage a hierarchical regression model to measure which cluster is more suitable and use a local search strategy to pick the optimal cloud configurations in a few extra tests. Our evaluation on 12 typical workloads in HiBench shows that, compared with state-of-the-art approaches, Apollo improves search accuracy by up to 30% while reducing the overhead of picking the optimal cloud configurations by as much as 50%.
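The reuse-then-refine idea in the abstract can be illustrated with a small sketch. It is not Apollo's implementation: the characteristic vectors, the workload names and their "best" cluster sizes, the cost function, and the helper names (`pearson`, `rank_similar`, `local_search`) are all hypothetical, and the hierarchical regression step is left out in favour of a plain hill climb. The sketch only shows how a correlation-based similarity ranking might seed a local search that needs a few extra test runs.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length characteristic vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_similar(new_profile, history):
    """Rank historical workloads by correlation with the new workload's profile."""
    scores = {name: pearson(new_profile, rec["profile"])
              for name, rec in history.items()}
    return sorted(scores, key=scores.get, reverse=True)

def local_search(start, grid, cost, max_tests=6):
    """Hill-climb over neighbouring configurations; each cost() call is one test run."""
    measured = {start: cost(start)}
    best = start
    while len(measured) < max_tests:
        i = grid.index(best)
        # Only neighbours that have not been tested yet consume the test budget.
        fresh = [grid[j] for j in (i - 1, i + 1)
                 if 0 <= j < len(grid) and grid[j] not in measured]
        fresh = fresh[: max_tests - len(measured)]
        if not fresh:
            break
        for cfg in fresh:
            measured[cfg] = cost(cfg)
        candidate = min(fresh, key=measured.get)
        if measured[candidate] >= measured[best]:
            break  # no neighbour improves on the current configuration
        best = candidate
    return best, len(measured)

# Hypothetical offline data: characteristic profile and best-known cluster size.
history = {
    "sort":      {"profile": [0.9, 0.2, 0.7, 0.1], "best": 10},
    "wordcount": {"profile": [0.2, 0.8, 0.3, 0.9], "best": 4},
}
grid = [2, 4, 6, 8, 10, 12, 16]          # assumed candidate cluster sizes
cost = lambda n: (n - 8) ** 2 + 5         # assumed stand-in for a measured run cost

new_profile = [0.8, 0.3, 0.6, 0.2]        # assumed profile from small-dataset runs
closest = rank_similar(new_profile, history)[0]
config, tests = local_search(history[closest]["best"], grid, cost)
print(closest, config, tests)
```

Starting the search from the configuration of the most similar historical workload is what keeps the number of extra test runs small in this sketch; a cold start from an arbitrary point on the grid would generally need more measurements.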