Journal of Computer Science and Technology, 2020, Vol. 35, Issue (6): 1258-1277. doi: 10.1007/s11390-020-0496-0

Special Issue: Software Systems


Learning Human-Written Commit Messages to Document Code Changes

Yuan Huang1, Nan Jia2, Hao-Jie Zhou1, Xiang-Ping Chen3,*, Member, IEEE, Zi-Bin Zheng1, Senior Member, IEEE, and Ming-Dong Tang4,5, Member, ACM, IEEE

    1 National Engineering Research Center of Digital Life, School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China;
    2 School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China;
    3 Guangdong Key Laboratory for Big Data Analysis and Simulation of Public Opinion, School of Communication and Design, Sun Yat-sen University, Guangzhou 510006, China;
    4 School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China;
    5 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China
  • Received: 2020-04-05 Revised: 2020-10-15 Online: 2020-11-20 Published: 2020-12-01
  • Contact: Xiang-Ping Chen, E-mail: chenxp8@mail.sysu.edu.cn
  • About author: Yuan Huang received his Ph.D. degree in computer science from Sun Yat-sen University, Guangzhou, in 2017. He is an associate research fellow in the School of Data and Computer Science, Sun Yat-sen University, Guangzhou. He is particularly interested in software evolution and maintenance, code analysis and comprehension, and mining software repositories.
  • Supported by:
    This work was (partially) supported by the Key-Area Research and Development Program of Guangdong Province of China under Grant No. 2020B010164002, the National Natural Science Foundation of China under Grant Nos. 61902441, 61722214 and 61976061, the China Postdoctoral Science Foundation under Grant No. 2018M640855, the Fundamental Research Funds for the Central Universities of China under Grant Nos. 20wkpy06 and 20lgpy129, and the Opening Project of Guangdong Key Laboratory of Big Data Analysis and Processing under Grant No. 202003.

Commit messages are important complementary information for understanding code changes. To address the scarcity of messages, several approaches have been proposed for automatically generating commit messages. However, most of these approaches focus on summarizing the changed software entities at a superficial level, without considering the intent behind the code changes (e.g., existing approaches cannot generate a message such as “fixing null pointer exception”). Considering that developers often describe the intent behind a code change when writing messages, we propose ChangeDoc, an approach that reuses existing messages in version control systems for automatic commit message generation. Our approach considers syntax, semantic, pre-syntax, and pre-semantic similarities between commits. For a given commit without a message, it discovers the most similar past commit in a large commit repository and recommends that commit's message as the message of the given commit. Our repository contains half a million commits collected from SourceForge. We evaluate our approach on the commits from 10 projects. The results show that 21.5% of the messages recommended by ChangeDoc can be used directly without modification, and 62.8% require minor modifications. To evaluate the quality of the commit messages recommended by ChangeDoc, we performed two empirical studies involving a total of 40 participants (10 professional developers and 30 students). The results indicate that the recommended messages are very good approximations of the ones written by developers and often include important intent information that is not included in the messages generated by other tools.
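
The retrieval idea described in the abstract (find the most similar past commit and reuse its message) can be illustrated with a minimal sketch. This is not ChangeDoc's implementation: a plain bag-of-tokens cosine similarity stands in for the four similarity measures named above, and the function names `tokenize`, `cosine`, and `recommend_message`, as well as the `(diff, message)` history format, are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(diff_text):
    """Split a diff into lowercase identifier-like tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*", diff_text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_message(new_diff, history):
    """Return (message, score) of the past commit most similar to new_diff.

    history is a list of (diff_text, commit_message) pairs; this format is an
    assumption of the sketch, not something specified by the paper.
    """
    query = Counter(tokenize(new_diff))
    best_message, best_score = None, -1.0
    for past_diff, message in history:
        score = cosine(query, Counter(tokenize(past_diff)))
        if score > best_score:
            best_message, best_score = message, score
    return best_message, best_score


if __name__ == "__main__":
    history = [
        ("if (user != null) { user.getName(); }", "fixing null pointer exception"),
        ('log.info("started");', "add startup logging"),
    ]
    message, score = recommend_message("if (order != null) { order.getId(); }", history)
    print(message, round(score, 3))
```

In this toy history, the new change shares the null-check tokens with the first past commit and nothing with the second, so the first commit's message is the one recommended. A real system would need a far richer notion of similarity than token overlap.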

Key words: commit message recommendation; code syntax similarity; code semantic similarity; code change comprehension
