2016, Vol. 31, Issue (5): 883-909. doi: 10.1007/s11390-016-1671-1

Special topic: Software Systems

• Special Section on Selected Paper from NPC 2011 •

Summarizing Software Artifacts: A Literature Review

Najam Nazar1, Yan Hu2, Member, CCF, ACM, and He Jiang1,2*, Member, CCF, ACM   

  1 Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, School of Software, Dalian University of Technology, Dalian 116621, China;
  2 State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China
  • Received: 2015-11-20  Revised: 2016-07-30  Online: 2016-09-05  Published: 2016-09-05
  • Contact: He Jiang, E-mail: jianghe@dlut.edu.cn
  • About author: Najam Nazar received his B.Sc. (Hons.) degree in computer science from University of the Punjab, Lahore, Pakistan, in 2005, and his M.S. degree in software engineering from Chalmers University of Technology, Sweden, in 2010. He is currently working towards his Ph.D. degree in software engineering at Dalian University of Technology, Dalian. His current research interests include mining software repositories, data mining, natural language processing, and machine learning.
  • Supported by:

    This work was supported in part by the National Basic Research 973 Program of China under Grant No. 2013CB035906, the Fundamental Research Funds for the Central Universities of China under Grant No. DUT13RC(3)53, and in part by the New Century Excellent Talents in University of China under Grant No. NCET-13-0073 and the National Natural Science Foundation of China under Grant No. 61300017.

Abstract: This paper presents a literature review of research on summarizing software artifacts, focusing on bug reports, source code, mailing lists, and developer discussions. From Jan. 2010 to Apr. 2016, numerous summarization techniques, approaches, and tools were proposed to meet the ongoing demand for improving software performance and quality and for helping developers understand the problems at hand. Because these artifacts contain both structured and unstructured data, researchers have applied a variety of machine learning and data mining techniques to generate summaries. This paper first provides a general perspective on the state of the art, describing the types of artifacts, the approaches to summarization, and the portions of the experimental procedure that are shared across these artifacts. We then discuss the applications of summarization, i.e., the tasks that have been accomplished through summarization. Next, we present tools that have been developed for, or employed during, summarization tasks. In addition, we present the summarization evaluation methods employed in the selected studies, as well as other important factors used to evaluate generated summaries, such as adequacy and quality. We also briefly discuss modern communication channels and the complementarities and commonalities among different software artifacts. Finally, we discuss challenges that apply to the existing studies in general, along with future research directions. This survey of existing studies should give future researchers broad and useful background knowledge on the main aspects of this research field.

Copyright © Editorial Office of Journal of Computer Science and Technology