2016, Vol. 31, Issue (5): 910-924. doi: 10.1007/s11390-016-1672-0

• Special Section on Software Systems 2016

What Security Questions Do Developers Ask? A Large-Scale Study of Stack Overflow Posts

Xin-Li Yang (杨昕立)¹, David Lo², Member, ACM, IEEE, Xin Xia (夏鑫)¹*, Member, CCF, ACM, IEEE, Zhi-Yuan Wan (万志远)¹, and Jian-Ling Sun (孙建伶)¹, Member, CCF, ACM

  1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  2 School of Information Systems, Singapore Management University, Singapore, Singapore
  • Received: 2016-03-21  Revised: 2016-08-14  Online: 2016-09-05  Published: 2016-09-05
  • Contact: Xin Xia, E-mail: xxkidd@zju.edu.cn
  • About author: Xin-Li Yang is a Ph.D. candidate in the College of Computer Science and Technology, Zhejiang University, Hangzhou. His research interests include mining software repositories and empirical studies.
  • Supported by:

    This work is supported by the National Natural Science Foundation of China under Grant No. 61572426 and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH17F01.

Security has always been a popular and critical topic, and the rapid development of information technology keeps it in the spotlight. Because security has a long history, it covers a wide and changing range of topics, from classic cryptography to the recently popular mobile security. Investigating security-related topics and trends can therefore guide security researchers, educators and practitioners. To address this need, we conduct a large-scale study of security-related questions on Stack Overflow, a popular online question and answer site where software developers communicate, collaborate, and share information with one another. The questions posted on Stack Overflow span many topics, and security-related questions make up a large and important share of them. We first use two heuristics, based on the tags of the posts, to extract the security-related questions from the dataset. We then use an advanced topic model, Latent Dirichlet Allocation (LDA) tuned using a Genetic Algorithm (GA), to cluster the security-related questions based on their texts. After obtaining the topics of the security-related questions, we analyze them using the metadata of the posts. We group the topics into five main categories and investigate the popularity and difficulty of the different topics. Based on the results of our study, we draw several implications for researchers, educators and practitioners.
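
The abstract does not spell out the two tag-based heuristics. As a rough sketch only, one plausible form starts from a seed tag, collects the posts that carry it, and keeps a co-occurring tag when two ratios are high enough: the share of that tag's posts that fall in the seed set, and the share of the seed set that carries the tag. The function names, the thresholds and the sample posts below are hypothetical and are not taken from the paper.

```python
from collections import Counter


def expand_security_tags(posts, seed_tags, min_significance, min_relevance):
    """Grow a seed tag set with two ratio-based heuristics (illustrative only)."""
    seed_tags = set(seed_tags)
    # Posts carrying at least one seed tag form the initial security set.
    seed_posts = [tags for tags in posts if seed_tags & set(tags)]
    in_seed = Counter(t for tags in seed_posts for t in set(tags))
    overall = Counter(t for tags in posts for t in set(tags))

    selected = set(seed_tags)
    for tag, hits in in_seed.items():
        # Heuristic 1 (assumed): how exclusive the tag is to the security set.
        significance = hits / overall[tag]
        # Heuristic 2 (assumed): how common the tag is inside the security set.
        relevance = hits / len(seed_posts)
        if significance >= min_significance and relevance >= min_relevance:
            selected.add(tag)
    return selected


def filter_posts(posts, tags):
    """Keep the posts that carry at least one of the given tags."""
    return [p for p in posts if set(tags) & set(p)]


# Hypothetical toy data: each post is represented by its list of tags.
posts = [
    ["security", "encryption"],
    ["security", "encryption", "java"],
    ["encryption", "aes"],
    ["java", "swing"],
    ["java", "collections"],
    ["security", "php"],
    ["php", "mysql"],
]

security_tags = expand_security_tags(posts, {"security"}, 0.6, 0.5)
security_posts = filter_posts(posts, security_tags)
print(sorted(security_tags), len(security_posts))
```

On this toy data the tag "encryption" passes both thresholds while "java" and "php" do not, so the post tagged only with "encryption" and "aes" is pulled into the security set even though it lacks the seed tag. The texts of the selected posts would then be handed to the topic model.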


Journal of Computer Science and Technology, ISSN 1000-9000 (Print), 1860-4749 (Online)