›› 2015, Vol. 30 ›› Issue (1): 20-29.doi: 10.1007/s11390-015-1501-x

Special Issue: Computer Architecture and Systems; Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Computer Architecture and Systems for Big Data • Previous Articles     Next Articles

Bipartite-Oriented Distributed Graph Partitioning for Big Learning

Rong Chen(陈榕), Member, CCF, ACM, IEEE, Jia-Xin Shi(施佳鑫), Hai-Bo Chen(陈海波), Senior Member, CCF, Member, ACM, IEEE, Bin-Yu Zang(臧斌宇), Senior Member, CCF, Member, ACM, IEEE   

  1. Shanghai Key Laboratory of Scalable Computing and Systems, Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, Shanghai 200240, China
  • Received:2014-07-15 Revised:2014-10-14 Online:2015-01-05 Published:2015-01-05
  • About author:Rong Chen received his B.S., M.S., and Ph.D. degrees in computer science from Fudan University, Shanghai, in 2004, 2007, and 2011, respectively. He is currently an assistant professor of the Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China. He is a member of CCF, ACM, and IEEE. His current research interests include, but are not limited to, distributed systems, operating systems and virtualization.
  • Supported by:

    This work was supported in part by the Doctoral Fund of Ministry of Education of China under Grant No. 20130073120040, the Program for New Century Excellent Talents in University of Ministry of Education of China, the Shanghai Science and Technology Development Funds under Grant No. 12QA1401700, a foundation for the Author of National Excellent Doctoral Dissertation of China, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A05, the National Natural Science Foundation of China under Grant Nos. 61003002, 61402284, the Shanghai Science and Technology Development Fund for High-Tech Achievement Translation under Grant No. 14511100902, and the Singapore National Research Foundation under Grant No. CREATE E2S2.

Many machine learning and data mining (MLDM) problems like recommendation, topic modeling, and medical diagnosis can be modeled as computing on bipartite graphs. However, most distributed graph-parallel systems are oblivious to the unique characteristics in such graphs and existing online graph partitioning algorithms usually cause excessive replication of vertices as well as significant pressure on network communication. This article identifies the challenges and opportunities of partitioning bipartite graphs for distributed MLDM processing and proposes BiGraph, a set of bipartite-oriented graph partitioning algorithms. BiGraph leverages observations such as the skewed distribution of vertices, discriminated computation load and imbalanced data sizes between the two subsets of vertices to derive a set of optimal graph partitioning algorithms that result in minimal vertex replication and network communication. BiGraph has been implemented on PowerGraph and is shown to have a performance boost up to 17.75X (from 1.16X) for four typical MLDM algorithms, due to reducing up to 80% vertex replication, and up to 96% network traffic.

[1] Malewicz G, Austern M H, Bik A J, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In Proc. the 2010 ACM SIGMOD International Conference on Management of Data, June 2010, pp.135-146.

[2] Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning. In Proc. the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2001, pp.269-274.

[3] Zha H, He X, Ding C, Simon H, Gu M. Bipartite graph partitioning and data clustering. In Proc. the 10th International Conference on Information and Knowledge Management, August 2001, pp.25-32.

[4] Gao B, Liu T Y, Zheng X, Cheng Q S, Ma W Y. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proc. the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, August 2005, pp.41-50.

[5] Gao B, Liu T Y, Feng G, Qin T, Cheng Q S, Ma W Y. Hierarchical taxonomy preparation for text categorization using consistent bipartite spectral graph copartitioning. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(9): 1263-1273.

[6] Chen R, Shi J, Chen Y, Guan H, Zang B, Chen H. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report, IPADSTR-2013-001, Shanghai Jiao Tong University, 2013.

[7] Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein J M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.

[8] Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. Powergraph: Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symp. Operating Systems Design and Implementation, October 2012, pp.17-30.

[9] Jain N, Liao G, Willke T L. Graphbuilder: Scalable graph ETL framework. In Proc. the 1st International Workshop on Graph Data Management Experiences and Systems, June 2013, Article No.4.

[10] Chen R, Shi J, Zang B, Guan H. Bipartite-oriented distributed graph partitioning for big learning. In Proc. the 5th Asia-Paci c Workshop on Systems, June 2014, pp.14:1-14:7.

[11] Chen R, Ding X, Wang P, Chen H, Zang B, Guan H. Computation and communication efficient graph processing with distributed immutable view. In Proc. the 23rd International Symposium on High-Performance Parallel and Distributed Computing, June 2014, pp.215-226.

[12] Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998, 30(1): 107-117.

[13] Schloegel K, Karypis G, Kumar V. Parallel multilevel algorithms for multi-constraint graph partitioning. In Proc. the 6th Int. Euro-Par Conf. Parallel Processing, August 2000, pp.296-310.

[14] Ng A Y, Jordan M I,Weiss Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, Dietterich T G, Becker S, Ghahramani Z (eds), MIT Press, 2002, pp.849-856.

[15] LÜcking T, Monien B, Elsässer R. New spectral bounds on k-partitioning of graphs. In Proc. the 13th Annual ACM Symposium on Parallel Algorithms and Architectures, July 2001, pp.255-262.

[16] Stanton I, Kliot G. Streaming graph partitioning for large distributed graphs. In Proc. the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2012, pp.1222-1230.

[17] Tsourakakis C, Gkantsidis C, Radunovic B, Vojnovic M. FENNEL: Streaming graph partitioning for massive scale graphs. In Proc. the 7th ACM International Conference on Web Search and Data Mining, February 2014, pp.333-342.

[18] Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In Proc. the 20th International Parallel and Distributed Processing Symposium, April 2006, p.124.

[19] Leskovec J, Lang K J, Dasgupta A, Mahoney M W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 2009, 6(1): 29-123.

[20] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer, 2009, 42(8): 30-37.

[21] Kumar A, Beutel A, Ho Q, Xing E P. Fugue: Slow-workeragnostic distributed learning for big models on big data. In Proc. the 17th International Conference on Arti cial Intelligence and Statistics, April 2014, pp.531-539.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] Li Minghui;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved