Journal of Computer Science and Technology, 2016, Vol. 31, Issue (5): 910-924. doi: 10.1007/s11390-016-1672-0

• Special Section on Selected Paper from NPC 2011 •

What Security Questions Do Developers Ask? A Large-Scale Study of Stack Overflow Posts

Xin-Li Yang(杨昕立)1, David Lo2, Member, ACM, IEEE, Xin Xia(夏鑫)1*, Member, CCF, ACM, IEEE, Zhi-Yuan Wan(万志远)1, and Jian-Ling Sun(孙建伶)1, Member, CCF, ACM   

  1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
  2 School of Information Systems, Singapore Management University, Singapore, Singapore
  • Received: 2016-03-21  Revised: 2016-08-14  Online: 2016-09-05  Published: 2016-09-05
  • Contact: Xin Xia  E-mail: xxkidd@zju.edu.cn
  • About author: Xin-Li Yang is a Ph.D. candidate in the College of Computer Science and Technology, Zhejiang University, Hangzhou. His research interests include mining software repositories and empirical studies.
  • Supported by:

    This work is supported by the National Natural Science Foundation of China under Grant No. 61572426 and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH17F01.

Abstract: Security has always been a popular and critical topic, and the rapid development of information technology keeps drawing attention to it. Because the field has a long history, it covers a wide and shifting range of topics, from classic cryptography to the more recently popular mobile security. There is therefore a need to investigate security-related topics and trends, which can guide security researchers, educators, and practitioners. To address this need, we conduct a large-scale study of security-related questions on Stack Overflow, a popular online question-and-answer site where software developers communicate, collaborate, and share information with one another. Among the many topics covered by the questions posted on Stack Overflow, security-related questions account for a large proportion and hold an important position. We first use two heuristics, based on the tags of the posts, to extract the security-related questions from the dataset. We then use an advanced topic model, Latent Dirichlet Allocation (LDA) tuned using a Genetic Algorithm (GA), to cluster the security-related questions based on their texts. After obtaining the topics, we use the questions' metadata to carry out various analyses: we group the topics into five main categories and investigate the popularity and difficulty of the different topics. Based on the results of our study, we draw several implications for researchers, educators, and practitioners.
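The abstract does not spell out the two tag-based heuristics, so the following is only a minimal sketch of how such a filter might look. It assumes a seed tag (`security`) and two hypothetical thresholds: one on how exclusively a tag appears alongside the seed tag (`min_relevance`) and one on how often it appears among seed-tagged posts (`min_significance`). The function names, thresholds, and toy data are illustrative assumptions, not the authors' actual procedure, and the GA-tuned LDA step is not covered here.

```python
from collections import Counter


def select_security_tags(posts, seed="security",
                         min_relevance=0.6, min_significance=0.1):
    """Pick tags that look security-related (hypothetical heuristics).

    posts: list of sets of tag strings, one set per question.
    """
    seed_posts = [tags for tags in posts if seed in tags]
    if not seed_posts:
        return {seed}
    # How often each tag occurs overall and among seed-tagged posts.
    all_counts = Counter(t for tags in posts for t in tags)
    seed_counts = Counter(t for tags in seed_posts for t in tags)
    selected = {seed}
    for tag, n_in_seed in seed_counts.items():
        # Assumed heuristic 1: share of the tag's posts that carry the seed tag.
        relevance = n_in_seed / all_counts[tag]
        # Assumed heuristic 2: share of seed-tagged posts that carry the tag.
        significance = n_in_seed / len(seed_posts)
        if relevance >= min_relevance and significance >= min_significance:
            selected.add(tag)
    return selected


def filter_security_posts(posts, security_tags):
    """Keep posts that carry at least one selected tag."""
    return [tags for tags in posts if tags & security_tags]


# Toy data, invented for illustration only.
posts = [
    {"security", "encryption"},
    {"security", "encryption", "java"},
    {"security", "java"},
    {"java", "swing"},
    {"java", "spring"},
    {"encryption", "aes"},
    {"python"},
]
security_tags = select_security_tags(posts)
security_posts = filter_security_posts(posts, security_tags)
print(sorted(security_tags))   # ['encryption', 'security']
print(len(security_posts))     # 4
```

In this toy run, `java` is excluded because only half of its posts carry the seed tag, while the `encryption`/`aes` post is picked up even though it lacks the seed tag itself. The extracted question texts would then be the input to the topic-modeling step.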
