Abstract Performance metrics and models are prerequisites for scientific understanding and optimization. This paper introduces a new footprint-based theory and reviews the research of the past four decades that led to it. The review groups past work into metrics and their models, in particular those of reuse distance; metrics conversion; models of shared cache; performance and optimization; and other related techniques.
This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61232008, the NSFC Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao under Grant No. 61328201, the National Science Foundation of USA under Contract Nos. CNS-1319617, CCF-1116104, and CCF-0963759, an IBM CAS Faculty Fellowship, and a research grant from Huawei. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding organizations.
About the author: Chen Ding received his Ph.D. degree from Rice University, his M.S. degree from Michigan Technological University, and his B.S. degree from Beijing University, all in computer science, before joining the University of Rochester in 2000. His research received young investigator awards from NSF and DOE. He co-founded the ACM SIGPLAN Workshop on Memory System Performance and Correctness (MSPC) and was a visiting researcher at Microsoft Research and a visiting associate professor at MIT. He is an external faculty fellow at the IBM Center for Advanced Studies.
Cite this article:
Chen Ding, Xiaoya Xiang, Bin Bao, Hao Luo, Ying-Wei Luo, and Xiao-Lin Wang. Performance Metrics and Models for Shared Cache[J]. Journal of Computer Science and Technology, 2014, 29(4): 692-712.
Copyright 2010 by Journal of Computer Science and Technology