Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (5): 985-1001.doi: 10.1007/s11390-021-1234-y

Special Issue: Data Management and Data Mining

• Special Section of APPT 2021 (Part 1) • Previous Articles     Next Articles

SOOP: Efficient Distributed Graph Computation Supporting Second-Order Random Walks

Songjie Niu1,2, Student Member, CCF, and Dongyan Zhou3        

  1. 1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Bytedance Technology, Beijing 100086, China
  • Received:2020-12-25 Revised:2021-08-23 Online:2021-09-30 Published:2021-09-30
  • About author:Songjie Niu received her B.E. degree in software engineering from Beijing Institute of Technology, Beijing, in 2014. She is a Ph.D. candidate at Institute of Computing Technology, Chinese Academy of Sciences, Beijing. Her research interests include graph computation, database systems, and big data processing. She is a student member of CCF.

The second-order random walk has recently been shown to effectively improve the accuracy in graph analysis tasks. Existing work mainly focuses on centralized second-order random walk (SOW) algorithms. SOW algorithms rely on edge-to-edge transition probabilities to generate next random steps. However, it is prohibitively costly to store all the probabilities for large-scale graphs, and restricting the number of probabilities to consider can negatively impact the accuracy of graph analysis tasks. In this paper, we propose and study an alternative approach, SOOP (second-order random walks with on-demand probability computation), that avoids the space overhead by computing the edge-to-edge transition probabilities on demand during the random walk. However, the same probabilities may be computed multiple times when the same edge appears multiple times in SOW, incurring extra cost for redundant computation and communication. We propose two optimization techniques that reduce the complexity of computing edge-to-edge transition probabilities to generate next random steps, and reduce the cost of communicating out-neighbors for the probability computation, respectively. Our experiments on real-world and synthetic graphs show that SOOP achieves orders of magnitude better performance than baseline precompute solutions, and it can efficiently computes SOW algorithms on billion-scale graphs.

Key words: second-order random walk (SOW); Node2Vec; second-order PageRank; distributed graph computation; SOOP (second-order random walks with on-demand probability computation);

[1] Grover A, Leskovec J. Node2Vec:Scalable feature learning for networks. In Proc. the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, pp.855-864. DOI:10.1145/2939-672.2939754.
[2] Wu Y, Bian Y, Zhang X. Remember where you came from:On the second-order random walk based proximity measures. Proceedings of the VLDB Endowment, 2016, 10(1):13-24. DOI:10.14778/3015270.3015272.
[3] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013., March 2021.
[4] Tsoumakas G, Katakis I. Multi-label classification:An overview. International Journal of Data Warehousing and Mining, 2007, 3(3):1-13. DOI:10.4018/jdwm.2007070101.
[5] Liben-Nowell D, Kleinberg J M. The link prediction problem for social networks. In Proc. the 2003 ACM CIKM International Conference on Information and Knowledge Management, November 2003, pp.556-559. DOI:10.1145/956863.956972.
[6] Tang L, Liu H. Leveraging social media networks for classification. Data Min. Knowl. Discov., 2011, 23(3):447-478. DOI:10.1007/s10618-010-0210-x.
[7] Perozzi B, Al-Rfou R, Skiena S. DeepWalk:Online learning of social representations. In Proc. the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2014, pp.701-710. DOI:10.1145/2623330.2623732.
[8] Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE:Large-scale information network embedding. In Proc. the 24th International Conference on World Wide Web, May 2015, pp.1067-1077. DOI:10.1145/2736277.2741093.
[9] Malewicz G, Austern M H, Bik A J C, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel:A system for large-scale graph processing. In Proc. the 2010 ACM SIGMOD International Conference on Management of Data, June 2010, pp.135-146. DOI:10.1145/1807167.1807184.
[10] Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph:Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symposium on Operating Systems Design and Implementation, October 2012, pp.17-30.
[11] Salihoglu S, Widom J. GPS:A graph processing system. In Proc. the 25th International Conference on Scientific and Statistical Database Management, July 2013, Article No. 22. DOI:10.1145/2484838.2484843.
[12] Tian Y, Balmin A, Corsten S A, Tatikonda S, McPherson J. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment, 2013, 7(3):193-204. DOI:10.14778/2732232.2732238.
[13] Xin R S, Gonzalez J E, Franklin M J, Stoica I. Graphx:A resilient distributed graph system on Spark. In Proc. the 1st International Workshop on Graph Data Management Experiences and Systems, June 2013, Article No. 2. DOI:10.1145/2484425.2484427.
[14] Yan D, Cheng J, Lu Y, Ng W. Blogel:A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment, 2014, 7(14):1981-1992. DOI:10.14778/2733085.2733103.
[15] Chen R, Shi J, Chen Y, Chen H. PowerLyra:Differentiated graph computation and partitioning on skewed graphs. In Proc. the 10th European Conference on Computer Systems, April 2015, Article No. 1. DOI:10.1145/2741948.2741970.
[16] Zhu X, Chen W, Zheng W, Ma X. Gemini:A computationcentric distributed graph processing system. In Proc. the 12th USENIX Symposium on Operating Systems Design and Implementation, November 2016, pp.301-316.
[17] Fan W, Xu J, Wu Y, Yu W, Jiang J, Zheng Z, Zhang B, Cao Y, Tian C. Parallelizing sequential graph computations. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.495-510. DOI:10.1145/303-5918.3035942.
[18] Iosup A, Hegeman T, Ngai W L, Heldens S, Prat-Pérez A, Manhardt T, Chafi H, Capot? M, Sundaram N, Anderson M J, Tanase I G, Xia Y, Nai L, Boncz P A. LDBC graphalytics:A benchmark for large-scale graph analysis on parallel and distributed platforms. Proceedings of the VLDB Endowment, 2016, 9(13):1317-1328. DOI:10.14778/3007263.3007270.
[19] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin M J, Shenker S, Stoica I. Resilient distributed datasets:A fault-tolerant abstraction for in-memory cluster computing. In Proc. the 9th USENIX Symposium on Networked Systems Design and Implementation, April 2012, pp.15-28.
[20] Zhou D, Niu S, Chen S. Efficient graph computation for Node2Vec. arXiv:1805.00280, 2018., March 2021.
[21] Andersen R, Chung F R K, Lang K J. Local graph partitioning using PageRank vectors. In Proc. the 47th Annual IEEE Symposium on Foundations of Computer Science, October 2006, pp.475-486. DOI:10.1109/FOCS.2006.44.
[22] Yang K, Zhang M, Chen K, Ma X, Bai Y, Jiang Y. KnightKing:A fast distributed graph random walk engine. In Proc. the 27th ACM Symposium on Operating Systems Principles, October 2019, pp.524-537. DOI:10.1145/334-1301.3359634.
[23] Vose M D. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Software Eng., 1991, 17(9):972-975. DOI:10.1109/32.92917.
[24] Niu S, Chen S. Optimizing CPU cache performance for Pregel-like graph computation. In Proc. the 31st IEEE International Conference on Data Engineering Workshops, April 2015, pp.149-154. DOI:10.1109/ICDEW.2015.7129568.
[25] Yang J, Leskovec J. Defining and evaluating network communities based on ground-truth. In Proc. the 12th IEEE International Conference on Data Mining, December 2012, pp.745-754. DOI:10.1109/ICDM.2012.138.
[26] Boldi P, Vigna S. The WebGraph framework I:Compression techniques. In Proc. the 13th International World Wide Web Conference, May 2004, pp.595-601. DOI:10.1145/988672.988752.
[27] Chakrabarti D, Zhan Y, Faloutsos C. R-MAT:A recursive model for graph mining. In Proc. the 4th SIAM International Conference on Data Mining, April 2004, pp.442-446. DOI:10.1137/1.9781611972740.43.
[28] Park H, Kim M. TrillionG:A trillion-scale synthetic graph generator using a recursive vector model. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.913-928. DOI:10.1145/3035918.3064014.
[1] Songjie Niu, Shimin Chen. TransGPerf: Exploiting Transfer Learning for Modeling Distributed Graph Computation Performance [J]. Journal of Computer Science and Technology, 2021, 36(4): 778-791.
Full text



[1] Zhou Di;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] Li Wei;. A Structural Operational Semantics for an Edison Like Language(2)[J]. , 1986, 1(2): 42 -53 .
[3] Chen Shihua;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[4] Li Wanxue;. Almost Optimal Dynamic 2-3 Trees[J]. , 1986, 1(2): 60 -71 .
[5] Feng Yulin;. Recursive Implementation of VLSI Circuits[J]. , 1986, 1(2): 72 -82 .
[6] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[7] Wang Xuan; Lü Zhimin; Tang Yuhai; Xiang Yang;. A High Resolution Chinese Character Generator[J]. , 1986, 1(2): 1 -14 .
[8] C.Y.Chung; H.R.Hwa;. A Chinese Information Processing System[J]. , 1986, 1(2): 15 -24 .
[9] Sun Zhongxiu; Shang Lujun;. DMODULA:A Distributed Programming Language[J]. , 1986, 1(2): 25 -31 .
[10] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
  Copyright ©2015 JCST, All Rights Reserved