
Skyway: Accelerate Graph Applications with a Dual-Path Architecture and Fine-Grained Data Management

Mo Zou, Ming-Zhe Zhang, Ru-Jia Wang, Xian-He Sun, Xiao-Chun Ye, Dong-Rui Fan, Zhi-Min Tang

Citation: Zou M, Zhang MZ, Wang RJ et al. Skyway: Accelerate graph applications with a dual-path architecture and fine-grained data management. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 39(4): 871−894, July 2024. DOI: 10.1007/s11390-023-2939-x. CSTR: 32374.14.s11390-023-2939-x.


Funds: This work was supported in part by the U.S. National Science Foundation under Grant Nos. CCF-2008907 and CCF-2029014, the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No. YSBR-029, and the Chinese Academy of Sciences Project for Youth Innovation Promotion Association.
    Author Bio:

    Mo Zou received her Bachelor's degree in software engineering from Shandong University, Jinan, in 2017, and her Ph.D. degree in computer architecture from the University of Chinese Academy of Sciences, Beijing, in 2023. She is now a postdoctoral researcher at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. Her research interests include computer architecture and memory systems, especially domain-specific hardware optimization.

    Ming-Zhe Zhang is currently an associate professor at the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing. His research interests include NVM, memory-centric architecture, and domain-specific accelerators.

    Ru-Jia Wang received her Bachelor's degree in automation from Zhejiang University, Hangzhou, in 2013, and her M.S. and Ph.D. degrees in electrical and computer engineering from the University of Pittsburgh, Pittsburgh, in 2015 and 2018, respectively. She is now an assistant professor in computer science at the Illinois Institute of Technology, Chicago. Her research interests are in the broader computer architecture and systems area, including scalable, secure, reliable, and high-performance memory systems and architectures.

    Xian-He Sun is a University Distinguished Professor and the Ron Hochsprung Endowed Chair of the Department of Computer Science at the Illinois Institute of Technology (Illinois Tech), Chicago. Dr. Sun is an IEEE Fellow and is known for his memory-bounded speedup model, also called Sun-Ni's Law, for scalable computing. His research interests include high-performance computing, memory and I/O systems, and performance evaluation and optimization. Dr. Sun is the Editor-in-Chief of IEEE Transactions on Parallel and Distributed Systems. He received the Golden Core Award from the IEEE Computer Society in 2017 and the Overseas Outstanding Contributions Award from CCF in 2018. More information about Dr. Sun can be found at his website: www.cs.iit.edu/~scs/sun.

    Xiao-Chun Ye received his Ph.D. degree in computer architecture from the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, in 2010. Currently he is a professor and the director of the High-Throughput Computer Research Center at the Institute of Computing Technology, CAS, Beijing. His main research interests include many-core processor architecture and graph accelerators.

    Dong-Rui Fan received his Ph.D. degree in computer architecture from the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, in 2005. He is currently a professor and Ph.D. supervisor at the Institute of Computing Technology, CAS, Beijing. His main research interests include high-throughput computer architecture and high-performance computer architecture.

    Zhi-Min Tang received his B.S. degree from the Department of Computer Science, Nanjing University, Nanjing, in 1985, and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, in 1990, both in computer science. He is currently a professor at the Institute of Computing Technology, CAS, Beijing. His research interests include high-performance computer architecture, parallel processing, and VLSI design.

  • Extended Abstract:
    Research Background

    Graph processing has become a key component of many artificial intelligence (AI) and big data applications. However, because of its large number of irregular memory requests and the coexistence of multiple memory access patterns, its memory access efficiency is extremely low, which limits graph processing performance. Memory structures specifically tailored to these access characteristics are therefore needed.

    Objective

    Targeting graph processing, this paper proposes dedicated optimizations from two perspectives, the on-chip cache and the off-chip memory, to improve memory bandwidth utilization and ultimately speed up graph processing.

    Methods

    This paper proposes Skyway, which consists of an on-chip multi-path buffer structure, PBuf, and an off-chip data-aware hardware structure, DRow. PBuf accurately distinguishes the different memory access types in graph processing and provides a fast path and fine-grained storage units for requests with poor locality, improving resource utilization. DRow identifies specific data in a cache-like manner and provides fine-grained protection and a fast return path, reducing the interference between different types of memory accesses.
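
    Below is a minimal illustrative sketch (not the paper's implementation) of the access-type distinction that PBuf relies on: requests are tagged by comparing their addresses against the array address ranges and the hot-vertex boundary held in the registers listed in Table 5, and state-array requests are steered to the dedicated direct path described in the abstract. All identifiers (ArrayRangeRegs, classify, useDirectPath) are hypothetical.

        #include <cstdint>

        // Request types distinguished in Skyway's analysis (offset, edge, and
        // state arrays); the tagging logic below is an illustrative assumption.
        enum class ReqType { Offset, Edge, StateHot, StateCold, Other };

        // Hypothetical mirror of the registers listed in Table 5: six 64-bit
        // registers holding the start/end of each array, plus one register
        // marking the end of the hot-vertex region of the state array.
        struct ArrayRangeRegs {
            uint64_t offset_start, offset_end;
            uint64_t edge_start,   edge_end;
            uint64_t state_start,  state_end;
            uint64_t hot_state_end;  // end address of hot vertices in the state array
        };

        // Classify a physical address by a simple range check.
        inline ReqType classify(uint64_t addr, const ArrayRangeRegs& r) {
            if (addr >= r.offset_start && addr < r.offset_end) return ReqType::Offset;
            if (addr >= r.edge_start   && addr < r.edge_end)   return ReqType::Edge;
            if (addr >= r.state_start  && addr < r.state_end)
                return addr < r.hot_state_end ? ReqType::StateHot : ReqType::StateCold;
            return ReqType::Other;
        }

        // Illustrative routing decision: state-array elements use the dedicated
        // core-to-memory path instead of the regular cache hierarchy.
        inline bool useDirectPath(ReqType t) {
            return t == ReqType::StateHot || t == ReqType::StateCold;
        }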

    Results

    To verify the effectiveness of Skyway, we implement it on the ZSim and DRAMsim3 simulators and evaluate it on 35 commonly used graph workloads. Compared with the state-of-the-art graph-specialized hardware optimizations, Skyway improves performance by 23%.

    Conclusions

    This paper analyzes the memory access characteristics of graph processing in depth and designs dedicated hardware memory structures for different types of memory requests, providing fast paths for specific requests and reducing the interference among requests within the memory structures. The proposed Skyway design significantly improves memory bandwidth utilization and graph processing performance, but further research is needed to adjust the path selection dynamically to fit the structural characteristics of input graphs.

    Abstract:

    Graph processing is a vital component of many AI and big data applications. However, due to its poor locality and complex data access patterns, graph processing is also a known performance killer of AI and big data applications. In this work, we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations. We first identify that memory accesses to the offset, edge, and state array have distinct locality and impact on performance. We then introduce the Skyway architecture, which consists of two primary components: 1) a dedicated direct data path between the core and memory to transfer state array elements efficiently, and 2) a data-type aware fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path. The proposed Skyway architecture is able to improve the overall performance by reducing the memory access interference and improving data access efficiency with a minimal overhead. We evaluate Skyway on a set of diverse algorithms using large real-world graphs. On a simulated four-core system, Skyway improves the performance by 23% on average over the best-performing graph-specialized hardware optimizations.
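
    To make the three arrays concrete, the following is a minimal CSR push-style update loop in the spirit of Figure 1 (a generic PageRank-like update, not code from the paper; CSRGraph, pushIteration, and contrib are illustrative names). It shows why accesses to the offset and edge arrays stream sequentially while accesses to the state array are irregular and data-dependent, which is the imbalance Skyway's dual-path design targets.

        #include <cstdint>
        #include <vector>

        // Minimal CSR graph (cf. Figure 1): offsets[v]..offsets[v+1] delimits the
        // out-neighbors of v in the edge array.
        struct CSRGraph {
            std::vector<uint64_t> offsets;  // |V| + 1 entries, read sequentially
            std::vector<uint32_t> edges;    // |E| entries, read sequentially
        };

        // One push-style iteration: each source scatters a contribution to the
        // state of every out-neighbor.
        void pushIteration(const CSRGraph& g,
                           const std::vector<double>& contrib,  // per-source contributions
                           std::vector<double>& state) {        // per-vertex states
            const uint32_t numV = static_cast<uint32_t>(g.offsets.size()) - 1;
            for (uint32_t src = 0; src < numV; ++src) {
                for (uint64_t e = g.offsets[src]; e < g.offsets[src + 1]; ++e) {
                    uint32_t dst = g.edges[e];
                    state[dst] += contrib[src];  // irregular: dst is data-dependent
                }
            }
        }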

  • Figure  1.   Example graph and its CSR representation encoding push-based approach.

    Figure  2.   Typical cache hierarchy and main memory in the dual-core system.

    Figure  3.   Breakdown of memory accesses to the different data arrays in the cache hierarchy (_C) and the DRAM (_D), taking PR as an example.

    Figure  4.   Comparison of the cacheline reuse rates for different arrays, taking PR as an example.

    Figure  5.   Row buffer hit rates of different arrays, taking PR as an example. Note that each rate is calculated separately as the ratio of the number of hits to the total number of accesses to the same array.

    Figure  6.   Performance improvements when removing the impact of one type of DRAM requests, taking PR as an example. IPC means instructions per cycle.

    Figure  7.   DRAM bandwidth utilization across various workloads.

    Figure  8.   Overview of Skyway hardware structure integrated with a four-core system. The Skyway components are shown in color.

    Figure  9.   Extended memory request format in Skyway.

    Figure  10.   Property buffer hardware design.

    Figure  11.   Property buffer workflow.

    Figure  12.   Duplication row hardware design. The added components are shown in color.

    Figure  13.   Duplication row workflow.

    Figure  14.   Performance improvements of (a) BFS, (b) BC, (c) CC, (d) PR, and (e) SSSP over the baseline.

    Figure  15.   DRAM bandwidth utilization of (a) BFS, (b) BC, (c) CC, (d) PR, and (e) SSSP over the baseline.

    Figure  16.   Memory traffic of (a) BFS, (b) BC, (c) CC, (d) PR, and (e) SSSP over the baseline.

    Figure  17.   Run-time request queue occupancy of PBuf and the baseline of (a) BFS, (b) BC, (c) CC, (d) PR, and (e) SSSP.

    Figure  18.   Bank conflicts reduction of DRow over the baseline.

    Figure  19.   Performance improvements for different (a) ProCache capacities and (b) LineBuf capacities.

    Figure  20.   Performance improvements for different (a) DRow numbers and (b) segments in each row.

    Table  1   System Configurations

    Hardware | Configuration
    Core | Four OoO cores, 4 GHz clock frequency, 128-entry ROB, 4-wide issue width, 16 MSHRs per core
    L1-I/D cache | Private, 8-way, 32 KB per core, 64 B cache line, 4-cycle access latency
    L2 cache | Private, 8-way, 256 KB per core, 64 B cache line, 12-cycle access latency
    LLC | Shared, 32-way, 8 MB, 64 B cache line, 32-cycle access latency
    Memory controller | 64-entry read/write queue, FR-FCFS[34] scheduling policy, open-page policy, address interleaving: ro-ch-ra-ba-bg-co
    DRAM | Four channels, 2 ranks/channel, 4 bankgroups/rank, 4 banks/bankgroup, 16 Gb DDR4-2400 x8 chips, 8 KB row buffer size[35], tRCD/tRAS/tWR 17/39/18 cycles, peak bandwidth 76.8 GB/s

    Table  2   Graph Applications

    Application | Brief Description | Model
    BFS[36] | Traversing a graph from one root vertex until all neighbors are accessed and returning a distance array | Push
    BC[37] | Scoring the centrality of every vertex to find the center | Push
    CC[38] | Labeling vertices into disjoint subsets to calculate the number of components | Both
    PR | Ranking all vertices based on incoming neighbors until convergence or reaching the iteration limit | Pull
    SSSP[39] | Finding the shortest paths from one source vertex to all the other vertices in a weighted graph | Push
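
    Complementing the push-style loop shown after the abstract, the sketch below illustrates the pull model listed in the Model column of Table 2: each destination vertex gathers from its in-neighbors, so the irregular accesses fall on reads of source states rather than on scattered writes. It assumes a reverse-CSR (in-neighbor) layout; pullIteration, in_offsets, and in_edges are illustrative names, not code from the paper.

        #include <cstdint>
        #include <vector>

        // One pull-style iteration over a reverse-CSR (in-neighbor) layout.
        void pullIteration(const std::vector<uint64_t>& in_offsets,
                           const std::vector<uint32_t>& in_edges,
                           const std::vector<double>& contrib,  // per-source contributions
                           std::vector<double>& state) {        // per-vertex states
            const uint32_t numV = static_cast<uint32_t>(in_offsets.size()) - 1;
            for (uint32_t dst = 0; dst < numV; ++dst) {
                double sum = 0.0;
                for (uint64_t e = in_offsets[dst]; e < in_offsets[dst + 1]; ++e)
                    sum += contrib[in_edges[e]];  // irregular reads of source states
                state[dst] = sum;
            }
        }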

    Table  3   Scale of the Graph Datasets

    Graph Dataset | |V| (×10⁶) | |E| (×10⁶)
    Orkut | 2.9 | 327
    DBpedia[40] | 18 | 136
    PLD[41] | 43 | 623
    Web[42] | 51 | 1930
    MPI[43] | 53 | 1963
    Twitter[2] | 62 | 1468
    UK-2002[44] | 134 | 261

    Table  4   MPKI in Various Workloads Including Seven Real-World Graph Datasets and Five Graph Applications

    Graph Dataset | BFS | BC | CC | PR | SSSP
    DBpedia | 16 | 3 | 17 | 41 | 21
    MPI | 20 | 3 | 28 | 60 | 20
    PLD | 35 | 5 | 19 | 64 | 28
    Twitter | 53 | 3 | 16 | 39 | 19
    Web | 8 | 2 | 4 | 14 | 12
    Orkut | 15 | 3 | 11 | 27 | 17
    UK-2002 | 8 | 1 | 6 | 14 | 10
    GM | 18 | 3 | 12 | 32 | 17

    Table  5   Skyway Configurations and Hardware Overhead

    Hardware | Configuration | Overhead
    PBuf | ProCache: 32 KB per core, shared, 4-way set-associative, 4 B entries, 4-cycle latency; LineBuf: 1 KB per core, shared, 1-way set-associative, 64 B entries, 2-cycle latency | 132 KB
    DRow | 8 KB per extra buffer, eight segments in one buffer, four buffers per bank, tCCD five cycles, LRU replacement policy | 4 MB
    DRowM | 32 entries per bank, 19 bits per entry | 9.5 KB
    Register | One 64-bit register to record the end address of hot vertices in the state array, six 64-bit registers to record array address ranges (start and end) | 56 B
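
    As a back-of-the-envelope cross-check (our derivation, assuming the DRAM organization of Table 1, i.e., 4 channels × 2 ranks/channel × 4 bankgroups/rank × 4 banks/bankgroup = 128 banks, and the four-core system of Figure 8), the overhead column follows from the configuration column:

        $$
        \begin{aligned}
        \text{PBuf: } & 4 \times (32\,\text{KB} + 1\,\text{KB}) = 132\,\text{KB},\\
        \text{DRow: } & 128 \times 4 \times 8\,\text{KB} = 4\,\text{MB},\\
        \text{DRowM: } & 128 \times 32 \times 19\,\text{bits} = 9{,}728\,\text{B} \approx 9.5\,\text{KB},\\
        \text{Registers: } & (1 + 6) \times 64\,\text{bits} = 56\,\text{B}.
        \end{aligned}
        $$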

    Table  6   Power-Law Distribution of DBpedia and Orkut

    Edge Percentage (%) | Vertex Percentage (%), DBpedia | Vertex Percentage (%), Orkut
    70 | 7 | 5
    75 | 9 | 6
    80 | 11 | 7
    85 | 13 | 8
    90 | 16 | 11
    95 | 21 | 15
    99 | 46 | 23
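
    The percentages in Table 6 capture the power-law skew of the input graphs: a small fraction of high-degree vertices covers most of the edges, which is what motivates tracking hot vertices in the state array separately (cf. the hot-vertex register in Table 5). A hedged helper for computing one row of such a table is sketched below (vertexFractionCoveringEdges and its parameters are illustrative names, not code from the paper).

        #include <algorithm>
        #include <cstdint>
        #include <functional>
        #include <vector>

        // Smallest fraction of vertices (taken in decreasing degree order) whose
        // edges cover at least edgeFrac of all edges -- the quantity in Table 6.
        double vertexFractionCoveringEdges(std::vector<uint64_t> degrees, double edgeFrac) {
            if (degrees.empty()) return 0.0;
            std::sort(degrees.begin(), degrees.end(), std::greater<uint64_t>());

            uint64_t totalEdges = 0;
            for (uint64_t d : degrees) totalEdges += d;

            uint64_t covered = 0, vertices = 0;
            for (uint64_t d : degrees) {
                if (covered >= edgeFrac * totalEdges) break;
                covered += d;
                ++vertices;
            }
            return static_cast<double>(vertices) / degrees.size();
        }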
  • [1]

    Fan W F. Graph pattern matching revised for social network analysis. In Proc. the 15th International Conference on Database Theory, Mar. 2012, pp.8–21. DOI: 10.1145/2274576.2274578.

    [2]

    Kwak H, Lee C, Park H, Moon S. What is Twitter, a social network or a news media? In Proc. the 19th International Conference on World Wide Web, Apr. 2010, pp.591–600. DOI: 10.1145/1772690.1772751.

    [3]

    Tang L, Liu H. Graph mining applications to social network analysis. In Managing and Mining Graph Data, Aggarwal C C, Wang H X (eds.), Springer, 2010, pp.487–513. DOI: 10.1007/978-1-4419-6045-0_16.

    [4]

    Caetano T S, McAuley J J, Cheng L, Le Q V, Smola A J. Learning graph matching. IEEE Trans. Pattern Analysis and Machine Intelligence, 2009, 31(6): 1048–1058. DOI: 10.1109/TPAMI.2009.28.

    [5]

    Navlakha S, Schatz M C, Kingsford C. Revealing biological modules via graph summarization. Journal of Computational Biology, 2009, 16(2): 253–264. DOI: 10.1089/cmb.2008.11TT.

    [6]

    Han S, Liu X Y, Mao H Z, Pu J, Pedram A, Horowitz M A, Dally W J. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News, 2016, 44(3): 243–254. DOI: 10.1145/3007787.3001163.

    [7]

    Mukkara A, Beckmann N, Abeydeera M, Ma X S, Sanchez D. Exploiting locality in graph analytics through hardware-accelerated traversal scheduling. In Proc. the 51st Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2018. DOI: 10.1109/MICRO.2018.00010.

    [8]

    Arai J, Shiokawa H, Yamamuro T, Onizuka M, Iwamura S. Rabbit order: Just-in-time parallel reordering for fast graph analysis. In Proc. the 2016 IEEE International Parallel and Distributed Processing Symposium, May 2016, pp.22–31. DOI: 10.1109/IPDPS.2016.110.

    [9]

    Balaji V, Lucia B. When is graph reordering an optimization? Studying the effect of lightweight graph reordering across applications and input graphs. In Proc. the 2018 IEEE International Symposium on Workload Characterization, Sept. 30–Oct. 2, 2018, pp.203–214. DOI: 10.1109/IISWC.2018.8573478.

    [10]

    Faldu P, Diamond J, Grot B. A closer look at lightweight graph reordering. In Proc. the 2019 IEEE International Symposium on Workload Characterization, Nov. 2019. DOI: 10.1109/IISWC47752.2019.9041948.

    [11]

    Lakhotia K, Singapura S, Kannan R, Prasanna V. ReCALL: Reordered cache aware locality based graph processing. In Proc. the 24th International Conference on High Performance Computing, Dec. 2017, pp.273–282. DOI: 10.1109/HiPC.2017.00039.

    [12]

    Wei H, Yu J X, Lu C, Lin X M. Speedup graph processing by graph ordering. In Proc. the 2016 International Conference on Management of Data, Jun. 2016, pp.1813–1828. DOI: 10.1145/2882903.2915220.

    [13]

    Zhang Y M, Kiriansky V, Mendis C, Amarasinghe S, Zaharia M. Making caches work for graph analytics. In Proc. the 2017 IEEE International Conference on Big Data, Dec. 2017, pp.293–302. DOI: 10.1109/BigData.2017.8257937.

    [14]

    Zou M, Zhang M Z, Wang R J, Sun X H, Ye X C, Fan D R, Tang Z M. Accelerating graph processing with lightweight learning-based data reordering. IEEE Computer Architecture Letters, 2022, 21(1): 5–8. DOI: 10.1109/LCA.2022.3151087.

    [15]

    Balaji V, Crago N, Jaleel A, Lucia B. P-OPT: Practical optimal cache replacement for graph analytics. In Proc. the 2021 IEEE International Symposium on High-Performance Computer Architecture, Feb. 27–Mar. 3, 2021, pp.668–681. DOI: 10.1109/HPCA51647.2021.00062.

    [16]

    Faldu P, Diamond J, Grot B. Domain-specialized cache management for graph analytics. In Proc. the 2020 IEEE International Symposium on High Performance Computer Architecture, Feb. 2020, pp.234–248. DOI: 10.1109/HPCA47549.2020.00028.

    [17]

    Mukkara A, Beckmann N, Sanchez D. PHI: Architectural support for synchronization- and bandwidth-efficient commutative scatter updates. In Proc. the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2019, pp.1009–1022. DOI: 10.1145/3352460.3358254.

    [18]

    Rahman S, Abu-Ghazaleh N, Gupta R. GraphPulse: An event-driven hardware accelerator for asynchronous graph processing. In Proc. the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2020, pp.908–921. DOI: 10.1109/MICRO50266.2020.00078.

    [19]

    Yan M Y, Hu X, Li S C, Basak A, Li H, Ma X, Akgun I, Feng Y J, Gu P, Deng L, Ye X C, Zhang Z M, Fan D R, Xie Y. Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach. In Proc. the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2019, pp.615–628. DOI: 10.1145/3352460.3358318.

    [20]

    Zhang D, Ma X Y, Thomson M, Chiou D. Minnow: Lightweight offload engines for worklist management and worklist-directed prefetching. ACM SIGPLAN Notices, 2018, 53(2): 593–607. DOI: 10.1145/3296957.3173197.

    [21]

    Zhang Y, Liao X F, Jin H, He L G, He B S, Liu H K, Gu L. DepGraph: A dependency-driven accelerator for efficient iterative graph processing. In Proc. the 2021 IEEE International Symposium on High-Performance Computer Architecture, Feb. 27–Mar. 3, 2021, pp.371–384. DOI: 10.1109/HPCA51647.2021.00039.

    [22]

    Zou M, Yan M Y, Li W M, Tang Z M, Ye X C, Fan D R. GEM: Execution-aware cache management for graph analytics. In Proc. the 22nd International Conference on Algorithms and Architectures for Parallel Processing, Oct. 2022, pp.273–292. DOI: 10.1007/978-3-031-22677-9_15.

    [23]

    Maass S, Min C, Kashyap S, Kang W, Kumar M, Kim T. Mosaic: Processing a trillion-edge graph on a single machine. In Proc. the 12th European Conference on Computer Systems, Apr. 2017, pp.527–543. DOI: 10.1145/3064176.3064191.

    [24]

    Shun J L, Blelloch G E. Ligra: A lightweight graph processing framework for shared memory. In Proc. the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb. 2013, pp.135–146. DOI: 10.1145/2442516.2442530.

    [25]

    Beamer S, Asanović K, Patterson D. The GAP benchmark suite. arXiv: 1508.03619, 2015. https://doi.org/10.48550/arXiv.1508.03619, Jan. 2024.

    [26]

    Kyrola A, Blelloch G, Guestrin C. GraphChi: Large-scale graph computation on just a PC. In Proc. the 10th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2012, pp.31–46.

    [27]

    Sundaram N, Satish N, Patwary M M A, Dulloor S R, Vadlamudi S G, Das D, Dubey P. GraphMat: High performance graph analytics made productive. Proceedings of the VLDB Endowment, 2015, 8(11): 1214–1225. DOI: 10.14778/2809974.2809983.

    [28]

    Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the Internet topology. In The Structure and Dynamics of Networks, Newman M, Barabási A L, Watts D J (eds.), Princeton University Press, 2006, pp.195–206. DOI: 10.1515/9781400841356.195.

    [29]

    Gonzalez J E, Low Y, Gu H J, Bickson D, Guestrin C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2012, pp.17–30.

    [30]

    Jiang L, Chen L S, Qiu J. Performance characterization of multi-threaded graph processing applications on many-integrated-core architecture. In Proc. the 2018 IEEE International Symposium on Performance Analysis of Systems and Software, Apr. 2018, pp.199–208. DOI: 10.1109/ISPASS.2018.00033.

    [31]

    Sanchez D, Kozyrakis C. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer Architecture News, 2013, 41(3): 475–486. DOI: 10.1145/2508148.2485963.

    [32]

    Li S, Yang Z Y, Reddy D, Srivastava A, Jacob B. DRAMsim3: A cycle-accurate, thermal-capable DRAM simulator. IEEE Computer Architecture Letters, 2020, 19(2): 106–109. DOI: 10.1109/LCA.2020.2973991.

    [33]

    Basak A, Li S C, Hu X, Oh S M, Xie X F, Zhao L, Jiang X W, Xie Y. Analysis and optimization of the memory hierarchy for graph processing workloads. In Proc. the 2019 IEEE International Symposium on High Performance Computer Architecture, Feb. 2019, pp.373–386. DOI: 10.1109/HPCA.2019.00051.

    [34]

    Rixner S, Dally W J, Kapasi U J, Mattson P, Owens J D. Memory access scheduling. ACM SIGARCH Computer Architecture News, 2000, 28(2): 128–138. DOI: 10.1145/342001.339668.

    [35]

    Hassan H, Patel M, Kim J S, Yaglikci A G, Vijaykumar N, Ghiasi N M, Ghose S, Mutlu O. CROW: A low-cost substrate for improving DRAM performance, energy efficiency, and reliability. In Proc. the 46th International Symposium on Computer Architecture, Jun. 2019, pp.129–142. DOI: 10.1145/3307650.3322231.

    [36]

    Beamer S, Asanovic K, Patterson D. Direction-optimizing breadth-first search. In Proc. the 2012 International Conference on High Performance Computing, Networking, Storage and Analysis, Nov. 2012. DOI: 10.1109/SC.2012.50.

    [37]

    Madduri K, Ediger D, Jiang K, Bader D A, Chavarria-Miranda D. A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In Proc. the 2009 IEEE International Symposium on Parallel & Distributed Processing, May 2009. DOI: 10.1109/IPDPS.2009.5161100.

    [38]

    Sutton M, Ben-Nun T, Barak A. Optimizing parallel graph connectivity computation via subgraph sampling. In Proc. the 2018 IEEE International Parallel and Distributed Processing Symposium, May 2018, pp.12–21. DOI: 10.1109/IPDPS.2018.00012.

    [39]

    Zhang Y M, Brahmakshatriya A, Chen X Y, Dhulipala L, Kamil S, Amarasinghe S, Shun J. Optimizing ordered graph algorithms with Graphit. In Proc. the 18th ACM/IEEE International Symposium on Code Generation and Optimization, Feb. 2020, pp.158–170. DOI: 10.1145/3368826.3377909.

    [40]

    Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. DBpedia: A nucleus for a web of open data. In Proc. the 6th International Semantic Web Conference on the Semantic Web, Nov. 2007, pp.722–735. DOI: 10.1007/978-3-540-76298-0_52.

    [41]

    Lehmberg O, Meusel R, Bizer C. Graph structure in the web: Aggregated by pay-level domain. In Proc. the 2014 ACM Conference on Web Science, Jun. 2014, pp.119–128. DOI: 10.1145/2615569.2615674.

    [42]

    Kunegis J. KONECT: The Koblenz network collection. In Proc. the 22nd International Conference on World Wide Web, May 2013, pp.1343–1350. DOI: 10.1145/2487788.2488173.

    [43]

    Cha M, Haddadi H, Benevenuto F, Gummadi K. Measuring user influence in Twitter: The million follower fallacy. In Proc. the 2010 International AAAI Conference on Web and Social Media, May 2010, pp.10–17. DOI: 10.1609/icwsm.v4i1.14033.

    [44]

    Davis T A, Hu Y F. The University of Florida sparse matrix collection. ACM Trans. Mathematical Software, 2011, 38(1): Article No. 1. DOI: 10.1145/2049662.2049663.

    [45]

    Wang Y H, Orosa L, Peng X J, Guo Y, Ghose S, Patel M, Kim J S, Luna J G, Sadrosadati M, Ghiasi N M, Mutlu O. FIGARO: Improving system performance via fine-grained In-DRAM data relocation and caching. In Proc. the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2020, pp.313–328. DOI: 10.1109/MICRO50266.2020.00036.

    [46]

    Lin B, Healy M B, Miftakhutdinov R, Emma P G, Patt Y. Duplicon cache: Mitigating off-chip memory bank and bank group conflicts via data duplication. In Proc. the 51st Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 2018, pp.285–297. DOI: 10.1109/MICRO.2018.00031.

    [47]

    Muralimanohar N, Balasubramonian R, Jouppi N P. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proc. the 40th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2007, pp.3–14. DOI: 10.1109/MICRO.2007.33.

    [48]

    Jaleel A, Theobald K B, Steely S C, Emer J. High performance cache replacement using re-reference interval prediction (RRIP). In Proc. the 37th International Symposium on Computer Architecture, Jun. 2010, pp.60–71. DOI: 10.1145/1815961.1815971.

    [49]

    Gupta S, Gao H L, Zhou H Y. Adaptive cache bypassing for inclusive last level caches. In Proc. the 27th International Symposium on Parallel and Distributed Processing, May 2013, pp.1243–1253. DOI: 10.1109/IPDPS.2013.16.

    [50]

    Xiang L X, Chen T Z, Shi Q S, Hu W. Less reused filter: Improving L2 cache performance via filtering less reused lines. In Proc. the 23rd International Conference on Supercomputing, Jun. 2009, pp.68–79. DOI: 10.1145/1542275.1542290.

    [51]

    John L K, Subramanian A. Design and performance evaluation of a cache assist to implement selective caching. In Proc. the 1997 International Conference on Computer Design VLSI in Computers and Processors, Oct. 1997, pp.510–518. DOI: 10.1109/ICCD.1997.628916.

    [52]

    Malkowski K, Link G, Raghavan P, Irwin M J. Load miss prediction-exploiting power performance trade-offs. In Proc. the 2007 IEEE International Parallel and Distributed Processing Symposium, Mar. 2007. DOI: 10.1109/IPDPS.2007.370536.

    [53]

    Etsion Y, Feitelson D G. Exploiting core working sets to filter the L1 cache with random sampling. IEEE Trans. Computers, 2012, 61(11): 1535–1550. DOI: 10.1109/TC.2011.197.

    [54]

    Collins J D, Tullsen D M. Hardware identification of cache conflict misses. In Proc. the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, Nov. 1999, pp.126–135. DOI: 10.1109/MICRO.1999.809450.

    [55]

    Jalminger J, Stenstrom P. A novel approach to cache block reuse predictions. In Proc. the 2003 International Conference on Parallel Processing, Oct. 2003, pp.294–302. DOI: 10.1109/ICPP.2003.1240592.

    [56]

    Wang P Y, Wang J, Li C, Wang J Z, Zhu H J, Guo M Y. Grus: Toward unified-memory-efficient high-performance graph processing on GPU. ACM Trans. Architecture and Code Optimization, 2021, 18(2): Article No. 22. DOI: 10.1145/3444844.

    [57]

    Wang P Y, Li C, Wang J, Wang T L, Zhang L, Leng J W, Chen Q, Guo M Y. Skywalker: Efficient alias-method-based graph sampling and random walk on GPUs. In Proc. the 30th International Conference on Parallel Architectures and Compilation Techniques, Sept. 2021, pp.304–317. DOI: 10.1109/PACT52795.2021.00029.

    [58]

    Sabet A H N, Zhao Z J, Gupta R. Subway: Minimizing data transfer during out-of-GPU-memory graph processing. In Proc. the 15th European Conference on Computer Systems, Apr. 2020, Article No. 12. DOI: 10.1145/3342195.3387537.

Publication History
  • Received: 2022-10-27
  • Accepted: 2023-10-13
  • Published online: 2023-10-16
  • Issue date: 2024-08-14
