›› 2012, Vol. 27 ›› Issue (1): 75-91.doi: 10.1007/s11390-012-1207-2

Special Issue: Computer Architecture and Systems

• Architecture and VLSI Design • Previous Articles     Next Articles

Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

Benjamín Sahelices1, Agustín de Dios1, Pablo Ibáñez2, Member, IEEE, Víctor Viñals-Yúfera2, Member, ACM, IEEE, and José María Llabería3   

  1. 1. Computer Science Department and HiPEAC European Network of Excellence, University of Valladolid, Valladolid, Spain;
    2. Computer Science and Systems Engineering Department, I3A Research Institute and HiPEAC European Network of Excellence, University of Zaragoza, Zaragoza, Spain;
    3. Computer Architecture Department and HiPEAC European Network of Excellence, Polytechnic University of Cataluña Barcelona, Spain
  • Received:2010-10-30 Revised:2011-08-23 Online:2012-01-05 Published:2012-01-05
  • Supported by:

    This work was supported in part by Spanish Government and European ERDF under Grant Nos. TIN2007-66423, TIN2010-21291-C02-01 and TIN2007-60625, gaZ: T48 research group (Aragón Government and European ESF), Consolider CSD2007-00050 (Spanish Government) and HiPEAC-2 NoE (European FP7/ICT 217068)

Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.

[1] Mellor-Crummey J M, Scott M L. Algorithms for scalablesynchronization on shared-memory multiprocessors. ACMTrans. Computer Systems, 1991, 9(1): 21-65.

[2] Michael M M, Scott M L. Implementation of atomic primi-tives on distributed shared memory multiprocessors. In Proc.the 1st HPCA, Raleigh, USA, Jan. 22-25, 1995, pp.221-231.

[3] Anderson T E. The performance implications of spin-waitingalternatives for shared-memory multiprocessors. In Proc.ICPP, volume II Software, University Park, USA, Aug. 1989,pp.170-174.

[4] Anderson T E. The performance of spin lock alternatives forshared-memory multiprocessors. IEEE Trans. Parallel andDistributed Systems, 1990, 1(1): 6-16.

[5] Goodman J R, Vernon M K,Woest P J. Efficient synchroniza-tion primitives for large-scale cache-coherent multiprocessors.In Proc. the 3rd ASPLOS, Boston Mass, USA, Apr. 3-6, 1989,pp.64-75.

[6] Kagi A. Mechanisms for efficient shared-memory, lock-basedsynchronization. [PhD thesis]. University of Wisconsin-Madison, May 1999.

[7] Kagi A, Burger D, Goodman J R. Efficient synchronization:Let them eat QOLB. In Proc. the 24th ISCA, Denver, USA,June 2-4, 1997, pp.170-180.

[8] Kumar S, Jiang D, Chandra R, Singh J P. Evaluating synchro-nization on shared address space multiprocessors: Methodo-logy and performance. In Proc. ACM SIGMETRICS 1999,Atlanta, USA, May 1-4, 1999, pp.23-34.

[9] Rudolph L, Segall Z. Dynamic decentralized cache schemesfor mimd parallel processors. In Proc. the 11th ISCA, AnnArbor, USA, June 5-1, 1984, pp.340-347.

[10] Radovic Z, Hagersten E. Hierarchical backoff locks for nonuni-form communication architectures. In Proc. the 9th HPCA,Anaheim, USA, Feb. 8-12, 2003, pp.241-252.

[11] Graunke G, Thakkar S. Synchronization algorithms forshared-memory multiprocessors. IEEE Computer, 1990,23(6): 60-69.

[12] Magnusson P S, Landin A, Hagersten E. Queue locks on cachecoherent multiprocessors. In Proc. the 8th ISPP, Cancun,Mexico, Apr. 26-29, 1994, pp.165-171.

[13] Rajwar R, Kagi A, Goodman J R. Improving the throughputof synchronization by insertion of delays. In Proc. the 6thHPCA, Toulouse, France, Jan. 8-12, 2000, pp.168-179.

[14] Rajwar R, Kagi A, Goodman J R. Inferential queueing andspeculative push for reducing critical communication laten-cies. In Proc. the 17th ICS, San Francisco, USA, June 23-26,2003, pp.273-284.

[15] Hoffmann R, Korch M, Rauber T. Performance evaulation oftask pools based on hardware synchronization. In Proc. the18th SC, Pittsburgh, USA, Nov. 6-12, 2004, pp.44.

[16] Vallejo E, Beivide R et al. Architectural support for fairreader-writer locking. In Proc. the 43rd MICRO, Atlanta,USA, Dec. 4-8, 2010, pp.275-286.

[17] Lev Y, Luchangco V, Olszewski M. Scalable reader-writerlocks. In Proc. the 21st SPAA, Calgary, Canada, Aug. 11-13,2009, pp.101-110.

[18] Suleman M A, Mutlu O, Qureshi M L, Patt Y N. Accelerat-ing critical section execution with asymmetric multi-core ar-chitectures. In Proc. the 14th ASPLOS, Washington, USA,March 7-11, 2009, pp.253-264.

[19] Kuskin J et al. The stanford flash multiprocessor. In Proc.the 21st ISCA, Chicago, USA, Apr. 18-21, 1994, pp.302-313.

[20] Laudon J, Lenoski D. The sgi origin: A ccNUMA highlyscalable server. In Proc. the 24th ISCA, Denver, USA, June2-4, 1997, pp.170-180.

[21] Barroso L A et al. Piranha: A scalable architecture based onsingle-chip multiprocessing. In Proc. the 27th ISCA, Vancou-ver, Canada, June 10-14, 2000, pp.282-293.

[22] Gharachorloo K et al. Architecture and design of AlphaServer GS320. In Proc. the 9th ASPLOS, Cambridge, USA,Nov. 12-15, 2000, pp.13-24.

[23] James D D, Laundrie A T, Gjessing S, Sohni G S. Distributeddirectory scheme: Scalable coherence interface. IEEE Com-puter, June 1990, 23(6): 74-77.

[24] Agarwal A, Bianchini R et al. The MIT alewife machine: Ar-chitecture and performance. In Proc. the 22nd ISCA, SantaMargherita Ligure, Italy, 22-24, 1995, pp.2-13.

[25] Chaudhuri M, Heinrich M. The impact of negative acknow-ledgments in shared memory scientific applications. IEEETrans. Parallel and Distributed Systems, 2004, 15(2): 134-152.

[26] Hu W, Hou R, Xiao J, Zhang L. High Performance general-purpose microprocessors: Past and future. Journal of Com-puter Science and Technology, 2006, 21(5): 631-640.

[27] Pai V S, Ranganathan P, Adve S V. RSIM: An execution-driven simulator for ilp-based shared-memory multiproces-sors and uniprocessors. In Proc. the 3rd Workshop on Com-puter Architecture Education, San Antonio, USA, Feb. 1-5,1997.

[28] Pai V S, Ranganathan P, Adve S V. RSIM reference manualversion 1.0. Technical Report 9705, Dept. of Electrical andComputer Engineering, Rice University, 1997.

[29] Gharachorloo K, Gupta A, Hennessy J. Two techniques toenhance the performance of memory consistency models. InProc. ICPP, Austin, USA, Aug. 1991, pp.355-364.

[30] Woo S C et al. The splash-2 programs: Characterizationand methodological considerations. In Proc. the 22nd ISCA,Santa Margherita Ligure, Italy, June 22-24, 1995, pp.24-36.

[31] Heinrich M, Chaudhuri M. Ocean warning: Avoid drowing.Computer Architecture News, 2003, 31(3): 30-32.

[32] de Dios A, Sahelices B, Ib?a~nez P, Vi~nals V, Llabería J M.Speeding-up synchronizations in dsm multiprocessors. InProc. the 12nd Euro-Par, Dresden, Germany, Aug. 28-Sept.1, 2006, pp.473-484.

[33] Alameldeen A R,Wood D A. Variability in architectural simu-lations of multi-threaded workloads. In Proc. the 9th HPCA,Anaheim, USA, Feb. 8-12, 2003, pp.7-18.

[34] Lenoski D et al. The Stanford dash multiprocessor. IEEEComputer, 1992, 25(3): 63-79.

[35] Rajwar R, Goodman J R. Speculative lock elision: Enablinghighly concurrent multithreaded execution. In Proc. the 34thMICRO, Austin, USA, Dec. 2-5, 2001, pp.294-305.
No related articles found!
Full text



[1] Lian Lin; Zhang Yili; Tang Changjie;. A Non-Recursive Algorithm Computing Set Expressions[J]. , 1988, 3(4): 310 -316 .
[2] Xu Jiepan; Wang Lei;. A New Approach to Database Auto-Design by Logic[J]. , 1991, 6(2): 201 -204 .
[3] Adelino Santos;. Cooperative Hypermedia Editing with CoMEdiA[J]. , 1993, 8(3): 67 -79 .
[4] Chen Yiyun;. Head Boundedness of Nonterminating Rewritings[J]. , 1995, 10(3): 281 -284 .
[5] Zhang Zhong;. Simulation of ATPG Neural Network and Its Experimental Results[J]. , 1995, 10(4): 310 -324 .
[6] Xu Dianxiang; Zheng Guoliang;. Logical Object as a Basis of Knowledge Based Systems[J]. , 1995, 10(5): 425 -438 .
[7] Lu Bo; Cai Shijie;. A Skeleton-Based Approach of Automatically Generating Some Chinese Typefaces[J]. , 1996, 11(1): 30 -38 .
[8] Fu Yuxi;. Symmetric π-Calculus[J]. , 1998, 13(3): 202 -208 .
[9] WANG Bingshan; LI Zhoujun; CHEN Huowang;. Universal Abstract Consistency Class and Universal Refutation[J]. , 1999, 14(2): 165 -172 .
[10] WANG Xiaodong; XU Ming; ZHOU Xingming;. Fast Multicast on Multistage Interconnection Networks Using Multi-Head Worms[J]. , 1999, 14(3): 250 -258 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved