›› 2015,Vol. 30 ›› Issue (1): 20-29.doi: 10.1007/s11390-015-1501-x

所属专题: Computer Architecture and Systems Artificial Intelligence and Pattern Recognition Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

大数据学习场景中面向二分图的分布式图划分算法

Rong Chen(陈榕), Member, CCF, ACM, IEEE, Jia-Xin Shi(施佳鑫), Hai-Bo Chen(陈海波), Senior Member, CCF, Member, ACM, IEEE, Bin-Yu Zang(臧斌宇), Senior Member, CCF, Member, ACM, IEEE   

  1. Shanghai Key Laboratory of Scalable Computing and Systems, Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, Shanghai 200240, China
  • 收稿日期:2014-07-15 修回日期:2014-10-14 出版日期:2015-01-05 发布日期:2015-01-05
  • 作者简介:Rong Chen received his B.S., M.S., and Ph.D. degrees in computer science from Fudan University, Shanghai, in 2004, 2007, and 2011, respectively. He is currently an assistant professor of the Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China. He is a member of CCF, ACM, and IEEE. His current research interests include, but are not limited to, distributed systems, operating systems and virtualization.
  • 基金资助:

    This work was supported in part by the Doctoral Fund of Ministry of Education of China under Grant No. 20130073120040, the Program for New Century Excellent Talents in University of Ministry of Education of China, the Shanghai Science and Technology Development Funds under Grant No. 12QA1401700, a foundation for the Author of National Excellent Doctoral Dissertation of China, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A05, the National Natural Science Foundation of China under Grant Nos. 61003002, 61402284, the Shanghai Science and Technology Development Fund for High-Tech Achievement Translation under Grant No. 14511100902, and the Singapore National Research Foundation under Grant No. CREATE E2S2.

Bipartite-Oriented Distributed Graph Partitioning for Big Learning

Rong Chen(陈榕), Member, CCF, ACM, IEEE, Jia-Xin Shi(施佳鑫), Hai-Bo Chen(陈海波), Senior Member, CCF, Member, ACM, IEEE, Bin-Yu Zang(臧斌宇), Senior Member, CCF, Member, ACM, IEEE   

  1. Shanghai Key Laboratory of Scalable Computing and Systems, Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, Shanghai 200240, China
  • Received:2014-07-15 Revised:2014-10-14 Online:2015-01-05 Published:2015-01-05
  • About author:Rong Chen received his B.S., M.S., and Ph.D. degrees in computer science from Fudan University, Shanghai, in 2004, 2007, and 2011, respectively. He is currently an assistant professor of the Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China. He is a member of CCF, ACM, and IEEE. His current research interests include, but are not limited to, distributed systems, operating systems and virtualization.
  • Supported by:

    This work was supported in part by the Doctoral Fund of Ministry of Education of China under Grant No. 20130073120040, the Program for New Century Excellent Talents in University of Ministry of Education of China, the Shanghai Science and Technology Development Funds under Grant No. 12QA1401700, a foundation for the Author of National Excellent Doctoral Dissertation of China, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A05, the National Natural Science Foundation of China under Grant Nos. 61003002, 61402284, the Shanghai Science and Technology Development Fund for High-Tech Achievement Translation under Grant No. 14511100902, and the Singapore National Research Foundation under Grant No. CREATE E2S2.

众多机器学习和数据挖掘问题,如推荐系统、主题建模和医学诊断等,能够被抽象成基于二分图的计算.然而,大部分分布式图并行计算系统均忽视了二分图的独有特征,并且现有的在线图划分算法也普遍会造成大量冗余的顶点复制和严重的网络通信负担.本文针对在分布式机器学习和数据挖掘应用中对二分图进行划分存在的机遇和挑战进行了深入分析,并在此基础上提出一组面向二分图的划分算法(BiGraph).BiGraph利用二分图顶点分布特征和两组顶点间计算和数据负载的不平衡特性构建了最优的图划分算法,能够尽可能减少图顶点的复制数和网络通信开销.在PowerGraph系统上的实现和测试表明,BiGraph通过减少高达80%的顶点复制和96%的网络开销,将四个典型机器学习和数据挖掘算法的性能提升1.16倍至17.75倍.

Abstract: Many machine learning and data mining (MLDM) problems like recommendation, topic modeling, and medical diagnosis can be modeled as computing on bipartite graphs. However, most distributed graph-parallel systems are oblivious to the unique characteristics in such graphs and existing online graph partitioning algorithms usually cause excessive replication of vertices as well as significant pressure on network communication. This article identifies the challenges and opportunities of partitioning bipartite graphs for distributed MLDM processing and proposes BiGraph, a set of bipartite-oriented graph partitioning algorithms. BiGraph leverages observations such as the skewed distribution of vertices, discriminated computation load and imbalanced data sizes between the two subsets of vertices to derive a set of optimal graph partitioning algorithms that result in minimal vertex replication and network communication. BiGraph has been implemented on PowerGraph and is shown to have a performance boost up to 17.75X (from 1.16X) for four typical MLDM algorithms, due to reducing up to 80% vertex replication, and up to 96% network traffic.

[1] Malewicz G, Austern M H, Bik A J, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In Proc. the 2010 ACM SIGMOD International Conference on Management of Data, June 2010, pp.135-146.

[2] Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning. In Proc. the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2001, pp.269-274.

[3] Zha H, He X, Ding C, Simon H, Gu M. Bipartite graph partitioning and data clustering. In Proc. the 10th International Conference on Information and Knowledge Management, August 2001, pp.25-32.

[4] Gao B, Liu T Y, Zheng X, Cheng Q S, Ma W Y. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proc. the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, August 2005, pp.41-50.

[5] Gao B, Liu T Y, Feng G, Qin T, Cheng Q S, Ma W Y. Hierarchical taxonomy preparation for text categorization using consistent bipartite spectral graph copartitioning. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(9): 1263-1273.

[6] Chen R, Shi J, Chen Y, Guan H, Zang B, Chen H. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report, IPADSTR-2013-001, Shanghai Jiao Tong University, 2013.

[7] Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein J M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.

[8] Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. Powergraph: Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symp. Operating Systems Design and Implementation, October 2012, pp.17-30.

[9] Jain N, Liao G, Willke T L. Graphbuilder: Scalable graph ETL framework. In Proc. the 1st International Workshop on Graph Data Management Experiences and Systems, June 2013, Article No.4.

[10] Chen R, Shi J, Zang B, Guan H. Bipartite-oriented distributed graph partitioning for big learning. In Proc. the 5th Asia-Paci c Workshop on Systems, June 2014, pp.14:1-14:7.

[11] Chen R, Ding X, Wang P, Chen H, Zang B, Guan H. Computation and communication efficient graph processing with distributed immutable view. In Proc. the 23rd International Symposium on High-Performance Parallel and Distributed Computing, June 2014, pp.215-226.

[12] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998, 30(1): 107-117.

[13] Schloegel K, Karypis G, Kumar V. Parallel multilevel algorithms for multi-constraint graph partitioning. In Proc. the 6th Int. Euro-Par Conf. Parallel Processing, August 2000, pp.296-310.

[14] Ng A Y, Jordan M I,Weiss Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, Dietterich T G, Becker S, Ghahramani Z (eds), MIT Press, 2002, pp.849-856.

[15] LÜcking T, Monien B, Elsässer R. New spectral bounds on k-partitioning of graphs. In Proc. the 13th Annual ACM Symposium on Parallel Algorithms and Architectures, July 2001, pp.255-262.

[16] Stanton I, Kliot G. Streaming graph partitioning for large distributed graphs. In Proc. the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2012, pp.1222-1230.

[17] Tsourakakis C, Gkantsidis C, Radunovic B, Vojnovic M. FENNEL: Streaming graph partitioning for massive scale graphs. In Proc. the 7th ACM International Conference on Web Search and Data Mining, February 2014, pp.333-342.

[18] Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In Proc. the 20th International Parallel and Distributed Processing Symposium, April 2006, p.124.

[19] Leskovec J, Lang K J, Dasgupta A, Mahoney M W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009, 6(1): 29-123.

[20] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer, 2009, 42(8): 30-37.

[21] Kumar A, Beutel A, Ho Q, Xing E P. Fugue: Slow-workeragnostic distributed learning for big models on big data. In Proc. the 17th International Conference on Arti cial Intelligence and Statistics, April 2014, pp.531-539.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] 朱鸿;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] 李明慧;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: