计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (5): 1167-1183.doi: 10.1007/s11390-021-0218-2

所属专题: Artificial Intelligence and Pattern Recognition Computer Networks and Distributed Computing

• • 上一篇    下一篇

基于网络搜索欺诈关键词社区结构的网络搜索欺诈检测

Dong-Hui Yang1,2, Zhen-Yu Li1,2,*, Member, CCF, ACM, IEEE, Xiao-Hui Wang3, Kavé Salamatian4, and Gao-Gang Xie2,5, Member, CCF, ACM, IEEE   

  1. 1 Network Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Global Energy Interconnection Research Institute Co., Ltd., Beijing 102209, China;
    4 LISTIC Laboratory of Computer Science, Systems, Information and Knowledge Processing, Université Savoie Mont Blanc, Chambéry 73011, France;
    5 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  • 收稿日期:2019-12-11 修回日期:2021-07-01 出版日期:2021-09-30 发布日期:2021-09-30
  • 作者简介:Dong-Hui Yang is currently a Ph.D. candidate at the Institute of Computing Technology (ICT), Chinese Academy Sciences (CAS), Beijing. He received his B.E. degree in Department of Automation from Tsinghua University, Beijing, in 2016. His research interests include the domain name system (DNS) and Internet measurement.
  • 基金资助:
    This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1800205, the National Natural Science Foundation of China under Grant Nos. 61725206 and U20A20180, and CAS-Austria Project under Grant No. GJHZ202114.

Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search

Dong-Hui Yang1,2, Zhen-Yu Li1,2,*, Member, CCF, ACM, IEEE, Xiao-Hui Wang3, Kavé Salamatian4, and Gao-Gang Xie2,5, Member, CCF, ACM, IEEE        

  1. 1 Network Technology Research Center, Institute of Computing Technology, Chinese Academy of Sciences Beijing 100190, China;
    2 University of Chinese Academy of Sciences, Beijing 100049, China;
    3 Global Energy Interconnection Research Institute Co., Ltd., Beijing 102209, China;
    4 LISTIC Laboratory of Computer Science, Systems, Information and Knowledge Processing, Université Savoie Mont Blanc, Chambéry 73011, France;
    5 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2019-12-11 Revised:2021-07-01 Online:2021-09-30 Published:2021-09-30
  • About author:Dong-Hui Yang is currently a Ph.D. candidate at the Institute of Computing Technology (ICT), Chinese Academy Sciences (CAS), Beijing. He received his B.E. degree in Department of Automation from Tsinghua University, Beijing, in 2016. His research interests include the domain name system (DNS) and Internet measurement.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1800205, the National Natural Science Foundation of China under Grant Nos. 61725206 and U20A20180, and CAS-Austria Project under Grant No. GJHZ202114.

1、研究背景(context):
互联网用户在很大程度上依赖网络搜索引擎来获取自己想要的信息。搜索引擎的主要收入是广告。然而,搜索广告存在欺诈行为。欺诈者可以产生虚假的、不能到达预期受众的流量,以此增加广告商的成本;欺诈者也可以增加某些关键词的搜索量、产生虚假印象,以提高它们的竞价价格,并通过出售它们获利;欺诈者还可以发送某些精心设计的搜索关键词,目的是对搜索引擎的索引进行逆向工程、毒害其排名算法,甚至发现脆弱的网络服务器。总之,网络搜索中的各种欺诈行为会造成巨额的经济损失,因此,检测网络搜索中的欺诈行为是非常重要的。
2、目的(Objective):本文从欺诈搜索关键词的角度提出了一种简单又有效的检测网络搜索中欺诈行为的方法,在实际数据集上验证方法的有效性,并用检测结果分析欺诈关键词和欺诈用户行为特征。
3、方法(Method):本文方法的基本思路来自于一个观察,即服务于同一目标任务的欺诈搜索关键词具有社区结构。我们首先将欺诈搜索关键词之间的时间相关性建模成图,通过对图的分析发现这些词形成紧密连接的社区。然后我们使用一些种子欺诈关键词作为输入,通过挖掘搜索关键词与种子词之间的相关性,并利用若干技术手段逐步过滤掉偶然与种子词共同出现的非欺诈搜索关键词,完善检测结果。
4、结果(Result&Findings):实验证明,欺诈搜索关键词确实形成紧密连接的社区结构,并且我们提出的检测方法简单而有效,取得了很高的准确率和正确率。通过对检测结果的进一步分析,我们发现了欺诈搜索关键词的若干典型时间演进模式,其中既包括持续性的欺诈搜索行为,也包括显示出昼夜节律性的行为。此外,我们发现既存在机器人(自动程序)也存在人工进行欺诈搜索的用户。
5、结论(Conclusions):我们从欺诈搜索关键词的角度,提出了一种简单而有效的网络搜索欺诈检测方法,并分析了欺诈搜索关键词和欺诈者的特征。

关键词: 社区结构, 欺诈分析, 欺诈关键词检测, 网络搜索

Abstract: Internet users heavily rely on web search engines for their intended information. The major revenue of search engines is advertisements (or ads). However, the search advertising suffers from fraud. Fraudsters generate fake traffic which does not reach the intended audience, and increases the cost of the advertisers. Therefore, it is critical to detect fraud in web search. Previous studies solve this problem through fraudster detection (especially bots) by leveraging fraudsters' unique behaviors. However, they may fail to detect new means of fraud, such as crowdsourcing fraud, since crowd workers behave in part like normal users. To this end, this paper proposes an approach to detecting fraud in web search from the perspective of fraudulent keywords. We begin by using a unique dataset of 150 million web search logs to examine the discriminating features of fraudulent keywords. Specifically, we model the temporal correlation of fraudulent keywords as a graph, which reveals a very well-connected community structure. Next, we design DFW (detection of fraudulent keywords) that mines the temporal correlations between candidate fraudulent keywords and a given list of seeds. In particular, DFW leverages several refinements to filter out non-fraudulent keywords that co-occur with seeds occasionally. The evaluation using the search logs shows that DFW achieves high fraud detection precision (99%) and accuracy (93%). A further analysis reveals several typical temporal evolution patterns of fraudulent keywords and the co-existence of both bots and crowd workers as fraudsters for web search fraud.

Key words: community structure, fraud analysis, fraudulent keyword detection, web search

[1] Jansen B J. Click fraud. Computer, 2007, 40(7):85-86. DOI:10.1109/MC.2007.232.
[2] Jansen B J, Mullen T. Sponsored search:An overview of the concept, history, and technology. International Journal of Electronic Business, 2008, 6(2):114-131.
[3] Ghose A, Yang S. An empirical analysis of search engine advertising:Sponsored search in electronic markets. Management Science, 2009, 55(10):1605-1622. DOI:10.1504/IJEB.2008.018068.
[4] Fain D C, Pedersen J O. Sponsored search:A brief history. Bulletin of the American Society for Information Science and Technology, 2006, 32(2):12-13. DOI:10.1002/bult.1720320206.
[5] Gallaugher J. Information Systems:A Manager's Guide to Harnessing Technology. University of Minnesota Libraries Publishing, 2015.
[6] Shakiba T, Zarifzadeh S, Derhami V. Spam query detection using stream clustering. World Wide Web, 2018, 21(2):557-572. DOI:10.1007/s11280-017-0471-z.
[7] Stone-Gross B, Stevens R, Zarras A, Kemmerer R, Kruegel C, Vigna G. Understanding fraudulent activities in online ad exchanges. In Proc. the 2011 ACM SIGCOMM Internet Measurement Conference, November 2011, pp.279-294. DOI:10.1145/2068816.2068843.
[8] Pandit S, Chau D H, Wang S, Faloutsos C. Netprobe:A fast and scalable system for fraud detection in online auction networks. In Proc. the 16th International Conference on World Wide Web, May 2007, pp.201-210. DOI:10.1145/1242572.1242600.
[9] Yu F, Xie Y L, Ke Q F. SBotMiner:Large scale search bot detection. In Proc. the 3rd ACM International Conference on Web Search and Data Mining, February 2010, pp.421-430. DOI:10.1145/1718487.1718540.
[10] Zhang J J, Xie Y L, Yu F, Soukal D, Lee W. Intention and origination:An inside look at large-scale bot queries. In Proc. the 20th Annual Network and Distributed System Security Symposium, February 2013.
[11] Zhang L F, Guan Y. Detecting click fraud in pay-per-click streams of online advertising networks. In Proc. the 28th International Conference on Distributed Computing Systems, June 2008, pp.77-84. DOI:10.1109/ICDCS.2008.98.
[12] Buehrer G, Stokes J W, Chellapilla K, Platt J. Classification of automated web traffic. In Weaving Services and People on the World Wide Web, King I, Baeza-Yates R (eds.), Springer Verlag, 2009, pp.3-26. DOI:10.1007/978-3-642-00570-11.
[13] Sadagopan N, Li J. Characterizing typical and atypical user sessions in clickstreams. In Proc. the 17th International Conference on World Wide Web, April 2008, pp.885-894. DOI:10.1145/1367497.1367617.
[14] Duskin O, Feitelson D G. Distinguishing humans from robots in web search logs:Preliminary results using query rates and intervals. In Proc. the 2009 Workshop on Web Search Click Data, February 2009, pp.15-19. DOI:10.1145/1507509.1507512.
[15] Tian T, Zhu J, Xia F, Zhuang X, Zhang T. Crowd fraud detection in Internet advertising. In Proc. the 24th International Conference on World Wide Web, May 2015, pp.1100-1110. DOI:10.1145/2736277.2741136.
[16] Kang H W, Wang K S, Soukal D, Behr F, Zheng Z J. Largescale bot detection for search engines. In Proc. the 19th International Conference on World Wide Web, April 2010, pp.501-510. DOI:10.1145/1772690.1772742.
[17] Haidar R, Elbassuoni S. Website navigation behavior analysis for bot detection. In Proc. the 2017 IEEE International Conference on Data Science and Advanced Analytics, October 2017, pp.60-68. DOI:10.1109/DSAA.2017.13.
[18] Guo Y, Shi J, Cao Z, Kang C, Xiong G, Li Z. Machine learning based cloudbot detection using multi-layer traffic statistics. In Proc. the 21st IEEE International Conference on High Performance Computing and Communications, the 17th IEEE International Conference on Smart City and the 5th IEEE International Conference on Data Science and Systems, August 2019, pp.2428-2435. DOI:10.1109/HPCC/SmartCity/DSS.2019.00339.
[19] Toffalini F, Abbà M, Carra D, Balzarotti D. Google dorks:Analysis, creation, and new defenses. In Proc. the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, July 2016, pp.255-275. DOI:10.1007/978-3-319-40667-113.
[20] Metwally A, Agrawal D, Abbadi A E. Duplicate detection in click streams. In Proc. the 14th International Conference on World Wide Web, May 2005, pp.12-21. DOI:10.1145/1060745.1060753.
[21] Metwally A, Agrawal D, Abbadi A E. Detectives:Detecting coalition hit inflation attacks in advertising networks streams. In Proc. the 16th International Conference on World Wide Web, May 2007, pp.241-250. DOI:10.1145/1242572.1242606.
[22] Immorlica N, Jain K, Mahdian M, Talwar K. Click fraud resistant methods for learning click-through rates. In Proc. the 1st International Workshop on Internet and Network Economics, December 2005, pp.34-45. DOI:10.1007/116009305.
[23] Dave V, Guha S, Zhang Y. ViceROI:Catching clickspam in search ad networks. In Proc. the 2013 ACM SIGSAC Conference on Computer & Communications Security, November 2013, pp.765-776. DOI:10.1145/2508859.2516688.
[24] Li X, Zhang M, Liu Y Q, Ma S P, Jin Y J, Ru L Y. Search engine click spam detection based on bipartite graph propagation. In Proc. the 7th ACM International Conference on Web Search and Data Mining, February 2014, pp.93-102. DOI:10.1145/2556195.2556214.
[25] Nagaraja S, Shah R. Clicktok:Click fraud detection using traffic analysis. In Proc. the 12th Conference on Security and Privacy in Wireless and Mobile Networks, May 2019, pp.105-116. DOI:10.1145/3317549.3323407.
[26] DeBlasio J, Guha S, Voelker G M, Snoeren A C. Exploring the dynamics of search advertiser fraud. In Proc. the 2017 Internet Measurement Conference, November 2017, pp.157-170. DOI:10.1145/3131365.3131393.
[27] Wei C, Liu Y Q, Zhang M, Ma S P, Ru L Y, Zhang K. Fighting against web spam:A novel propagation method based on click-through data. In Proc. the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2012, pp.395-404. DOI:10.1145/2348283.2348338.
[28] Haider C M R, Iqbal A, Rahman A H, Rahman M S. An ensemble learning based approach for impression fraud detection in mobile advertising. Journal of Network and Computer Applications, 2018, 112:126-141. DOI:10.1016/j.jnca.2018.02.021.
[29] Dong F, Wang H Y, Li L, Guo Y, Bissyandé T F, Liu T M, Xu G A, Klein J. FraudDroid:Automated ad fraud detection for Android apps. In Proc. the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, November 2018, pp.257-268. DOI:10.1145/3236024.3236045.
[30] Halfaker A, Keyes O, Kluver D, Thebault-Spieker J, Nguyen T, Shores K, Uduwage A, Warncke-Wang M. User session identification based on strong regularities in interactivity time. In Proc. the 24th International Conference on World Wide Web, May 2015, pp.410-418. DOI:10.1145/2736277.2741117.
[31] Jones R, Klinkner K L. Beyond the session timeout:Automatic hierarchical segmentation of search topics in query logs. In Proc. the 17th ACM Conference on Information and Knowledge Management, October 2008, pp.699-708. DOI:10.1145/1458082.1458176.
[32] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974, 19(6):716-723. DOI:10.1109/TAC.1974.1100705.
[33] Fruchterman T M J, Reingold E M. Graph drawing by forcedirected placement. Software:Practice and Experience, 1991, 21(11):1129-1164. DOI:10.1002/spe.4380211102.
[34] Saram?ki J, Kivel? M, Onnela J P, Kaski K, Kertesz J. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E, 2007, 75(2):Article No. 027105. DOI:10.1103/PhysRevE.75.027105.
[35] Schütze H, Manning C D, Raghavan P. Introduction to Information Retrieval. Cambridge University Press, 2008.
[36] Pelleg D, Moore A. X-means:Extending K-means with efficient estimation of the number of clusters. In Proc. the 17th International Conference on Machine Learning, June 29-July 2, 2000, pp.727-734.
[37] Onnela J P, Saram?ki J, Hyv?nen J, Szabó G, Lazer D, Kaski K, Kertész J, Barabási A L. Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences, 2007, 104(18):7332-7336. DOI:10.1073/pnas.0610245104.
[38] Whang J J, Jung Y, Kang S, Yoo D, Dhillon I S. Scalable Anti-TrustRank with qualified site-level seeds for link-based web spam detection. In Proc. the 2020 Web Conference, April 2020, pp.593-602. DOI:10.1145/3366424.3385773.
[39] Wang H D, Xu F L, Li Y, Zhang P Y, Jin D P. Understanding mobile traffic patterns of large scale cellular towers in urban environment. In Proc. the 2015 Internet Measurement Conference, October 2015, pp.225-238. DOI:10.1145/2815675.2815680.
[40] Corpet F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 1988, 16(22):10881-10890. DOI:10.1093/nar/16.22.10881.
[41] Davies D L, Bouldin D W. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1979, 1(2):224-227. DOI:10.1109/TPAMI.1979.4766909.
[42] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(12):1650-1654. DOI:10.1109/TPAMI.2002.1114856.
[1] Yi-Ting Wang, Jie Shen, Zhi-Xu Li, Qiang Yang, An Liu, Peng-Peng Zhao, Jia-Jie Xu, Lei Zhao, Xun-Jie Yang. 基于搜索引擎丰富上下文信息的实体链接方法[J]. 计算机科学技术学报, 2020, 35(4): 724-738.
[2] Hua-Wei Shen (沈华伟), Member, CCF, Xue-Qi Cheng (程学旗), Senior Member, CCF, Member, IEEE, Yuan-Zhuo Wang1 (王元卓), Senior Member, CCF, Member, ACM, IEEE, and Yixin Chen (陈一昕), Senior Member, IEEE. 一种用于发现异质网络的多尺度结构的降维框架[J]. , 2012, (2): 341-357.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 周笛;. A Recovery Technique for Distributed Communicating Process Systems[J]. , 1986, 1(2): 34 -43 .
[2] 陈世华;. On the Structure of Finite Automata of Which M Is an(Weak)Inverse with Delay τ[J]. , 1986, 1(2): 54 -59 .
[3] C.Y.Chung; 华宣仁;. A Chinese Information Processing System[J]. , 1986, 1(2): 15 -24 .
[4] 王建潮; 魏道政;. An Effective Test Generation Algorithm for Combinational Circuits[J]. , 1986, 1(4): 1 -16 .
[5] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[6] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[7] 郑国梁; 李辉;. The Design and Implementation of the Syntax-Directed Editor Generator(SEG)[J]. , 1986, 1(4): 39 -48 .
[8] 黄学东; 蔡莲红; 方棣棠; 迟边进; 周立; 蒋力;. A Computer System for Chinese Character Speech Input[J]. , 1986, 1(4): 75 -83 .
[9] 许小曙;. Simplification of Multivalued Sequential SULM Network by Using Cascade Decomposition[J]. , 1986, 1(4): 84 -95 .
[10] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: