大数据学习场景中面向二分图的分布式图划分算法

陈榕; 施佳鑫; 陈海波; 臧斌宇

doi:10.1007/s11390-015-1501-x

大数据学习场景中面向二分图的分布式图划分算法

Bipartite-Oriented Distributed Graph Partitioning for Big Learning

摘要

摘要: 众多机器学习和数据挖掘问题,如推荐系统、主题建模和医学诊断等,能够被抽象成基于二分图的计算.然而,大部分分布式图并行计算系统均忽视了二分图的独有特征,并且现有的在线图划分算法也普遍会造成大量冗余的顶点复制和严重的网络通信负担.本文针对在分布式机器学习和数据挖掘应用中对二分图进行划分存在的机遇和挑战进行了深入分析,并在此基础上提出一组面向二分图的划分算法(BiGraph).BiGraph利用二分图顶点分布特征和两组顶点间计算和数据负载的不平衡特性构建了最优的图划分算法,能够尽可能减少图顶点的复制数和网络通信开销.在PowerGraph系统上的实现和测试表明,BiGraph通过减少高达80%的顶点复制和96%的网络开销,将四个典型机器学习和数据挖掘算法的性能提升1.16倍至17.75倍.

Abstract: Many machine learning and data mining (MLDM) problems like recommendation, topic modeling, and medical diagnosis can be modeled as computing on bipartite graphs. However, most distributed graph-parallel systems are oblivious to the unique characteristics in such graphs and existing online graph partitioning algorithms usually cause excessive replication of vertices as well as significant pressure on network communication. This article identifies the challenges and opportunities of partitioning bipartite graphs for distributed MLDM processing and proposes BiGraph, a set of bipartite-oriented graph partitioning algorithms. BiGraph leverages observations such as the skewed distribution of vertices, discriminated computation load and imbalanced data sizes between the two subsets of vertices to derive a set of optimal graph partitioning algorithms that result in minimal vertex replication and network communication. BiGraph has been implemented on PowerGraph and is shown to have a performance boost up to 17.75X (from 1.16X) for four typical MLDM algorithms, due to reducing up to 80% vertex replication, and up to 96% network traffic.

HTML全文

参考文献()

施引文献

资源附件()