2014, Vol. 29, Issue (2): 303-315. doi: 10.1007/s11390-014-1432-y

Topic: Computer Architecture and Systems

• Special Section on Selected Paper from NPC 2011 •

An Asynchronous Metadata Consistency Protocol for Cluster File Systems

Bing-Qing Shao1 (邵冰清), Jun-Wei Zhang1 (张军伟), Cai-Ping Zheng1 (郑彩平), Hao Zhang1,2 (张浩), Zhen-Jun Liu1 (刘振军), and Lu Xu1 (许鲁)

  1 Data Storage and Management Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2013-11-17 Revised: 2014-01-09 Online: 2014-03-05 Published: 2014-03-05
  • About the author: Bing-Qing Shao received her M.S. degree in computer architecture from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, in 2012. She is an assistant engineer at the Data Storage and Management Technology Research Center, ICT, CAS. Her research interests include cluster file systems and network storage.
  • Supported by:

    This work was supported by the National Basic Research 973 Program of China under Grant No. 2011CB302304, the National High Technology Research and Development 863 Program of China under Grant Nos. 2011AA01A102 and 2013AA013205, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010401, and the Chinese Academy of Sciences Key Deployment Project under Grant No. KGZD-EW-103-5(7).

A Non-Forced-Write Atomic Commit Protocol for Cluster File Systems


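In cluster file systems, an architecture with multiple metadata servers has become an inevitable trend. Maintaining the consistency of distributed metadata operations is therefore a key issue affecting the reliability and availability of a cluster file system. Existing atomic commit protocols all require multiple synchronous disk writes, which greatly degrade the performance of distributed metadata operations. Given that the latency of network communication is far lower than that of a disk write, this paper proposes Dual-Log (DL), an atomic commit protocol that needs no synchronous writes. DL is designed for distributed metadata operations that involve only two metadata servers: the two servers send the redo logs of their local sub-operations to each other over the network. When one of the metadata servers crashes, it can recover its own sub-operations from the redo log redundantly recorded on the other metadata server. We implemented DL in a cluster file system and evaluated its performance. The results show that, compared with two other widely used atomic commit protocols, EP and S2PC-MP, the average response time of distributed metadata operations in a system using DL is reduced by 40%-60%. Moreover, a crashed server takes no more than 1 s to recover up to 10000 uncompleted distributed metadata operations.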

Abstract: Distributed metadata consistency is one of the critical issues for metadata clusters in distributed file systems. Existing methods for maintaining metadata consistency generally need several forced log writes. Because synchronous disk I/O is very inefficient, the average response time of metadata operations increases greatly. This paper presents an asynchronous atomic commit protocol (ACP) named Dual-Log (DL), which needs no forced log writes. Optimized for distributed metadata operations that involve only two metadata servers, DL records each server's redo log on its counterpart by transferring it over the low-latency network. A crashed metadata server can then redo the metadata operation from the redundant redo log. Since network latency is much lower than disk I/O latency, DL can improve the performance of the distributed metadata service significantly. A prototype of DL is implemented on top of a local journal. Its performance is compared with that of two widely used protocols, EP and S2PC-MP. The results show that DL reduces the average response time of distributed metadata operations by about 40% to 60%, and that recovery takes only 1 second with 10 thousand uncompleted distributed metadata operations.
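The mutual redo-log idea in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: the class `MetadataServer`, the functions `distributed_op` and `recover`, and the record layout `(op_id, key, value)` are all hypothetical names chosen for illustration, a list append stands in for the network transfer, and message ordering, asynchronous log flushing, and log trimming are not modeled because the abstract does not describe them.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataServer:
    """Toy metadata server (hypothetical); everything here lives in memory."""
    name: str
    state: dict = field(default_factory=dict)     # applied metadata
    own_log: list = field(default_factory=list)   # redo records of local sub-operations, never force-written
    peer_log: list = field(default_factory=list)  # redundant copy of the counterpart's redo records

    def apply(self, record):
        _op_id, key, value = record
        self.state[key] = value

    def crash(self):
        # Nothing was forced to disk, so a crash loses all volatile data.
        self.state.clear()
        self.own_log.clear()
        self.peer_log.clear()


def distributed_op(op_id, a, update_a, b, update_b):
    """Run one two-server operation: log locally, ship the redo record, apply."""
    for local, remote, (key, value) in ((a, b, update_a), (b, a, update_b)):
        record = (op_id, key, value)
        local.own_log.append(record)    # buffered write, no synchronous disk I/O
        remote.peer_log.append(record)  # stands in for the network transfer
        local.apply(record)


def recover(crashed, survivor):
    """Replay the crashed server's sub-operations from the survivor's redundant copy."""
    for record in survivor.peer_log:
        crashed.own_log.append(record)
        crashed.apply(record)


# Assumed example: creating a file touches a directory entry on mds1
# and an inode on mds2.
mds1 = MetadataServer("mds1")
mds2 = MetadataServer("mds2")
distributed_op(1, mds1, ("dir:/a/f", "inode 7"), mds2, ("inode:7", {"nlink": 1}))

before = dict(mds1.state)
mds1.crash()
recover(mds1, mds2)
recovered = mds1.state == before
```

In this sketch the redundant copy survives only while the counterpart stays up, which matches the abstract's scenario of a single crashed server recovering from the other one. After recovery, `mds1.peer_log` is still empty, so a real system would also need to re-establish the redundant copy of the counterpart's records; how DL does that is not stated in the abstract.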

Copyright © Editorial Office of Journal of Computer Science and Technology