
Web Search Fraud Detection Based on the Community Structure of Fraudulent Search Keywords

Exploiting the Community Structure of Fraudulent Keywords for Fraud Detection in Web Search

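  • Abstract: 1. Context:
    Internet users rely heavily on web search engines to find the information they want. The main revenue source of search engines is advertising. However, search advertising suffers from fraud. Fraudsters can generate fake traffic that never reaches the intended audience, thereby increasing advertisers' costs; they can inflate the search volume of certain keywords and generate fake impressions to raise the keywords' bid prices, then profit by selling them; and they can send carefully crafted search keywords to reverse-engineer a search engine's index, poison its ranking algorithm, or even discover vulnerable web servers. In short, the various kinds of fraud in web search cause huge economic losses, so detecting fraud in web search is very important.
    2. Objective: From the perspective of fraudulent search keywords, this paper proposes a simple yet effective method for detecting fraud in web search, validates the method on a real-world dataset, and uses the detection results to analyze the characteristics of fraudulent keywords and of fraudulent user behavior.
    3. Method: The basic idea comes from one observation: fraudulent search keywords that serve the same target task exhibit a community structure. We first model the temporal correlation among fraudulent search keywords as a graph; analysis of this graph shows that these keywords form tightly connected communities. We then take a set of seed fraudulent keywords as input, mine the correlations between search keywords and the seeds, and apply several techniques to progressively filter out non-fraudulent search keywords that merely happen to co-occur with the seeds, refining the detection results.
    4. Results & Findings: Experiments show that fraudulent search keywords do form tightly connected community structures, and that the proposed detection method is simple yet effective, achieving high precision and accuracy. Further analysis of the detection results reveals several typical temporal evolution patterns of fraudulent search keywords, including both persistent fraudulent search behavior and behavior with a diurnal rhythm. In addition, we find that fraudulent searches are carried out both by bots (automated programs) and by human users.
    5. Conclusions: From the perspective of fraudulent search keywords, we propose a simple yet effective method for detecting web search fraud, and we analyze the characteristics of fraudulent search keywords and of fraudsters.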

     

    Abstract: Internet users rely heavily on web search engines to find the information they want. The main revenue source of search engines is advertising (ads). However, search advertising suffers from fraud. Fraudsters generate fake traffic that does not reach the intended audience and increases advertisers' costs. Therefore, it is critical to detect fraud in web search. Previous studies address this problem by detecting fraudsters (especially bots) through their distinctive behaviors. However, such approaches may fail to detect new forms of fraud, such as crowdsourcing fraud, because crowd workers behave in part like normal users. To this end, this paper proposes an approach to detecting fraud in web search from the perspective of fraudulent keywords. We begin by using a unique dataset of 150 million web search logs to examine the discriminating features of fraudulent keywords. Specifically, we model the temporal correlation of fraudulent keywords as a graph, which reveals a very well-connected community structure. Next, we design DFW (detection of fraudulent keywords), which mines the temporal correlations between candidate fraudulent keywords and a given list of seeds. In particular, DFW applies several refinements to filter out non-fraudulent keywords that only occasionally co-occur with seeds. An evaluation on the search logs shows that DFW achieves high fraud detection precision (99%) and accuracy (93%). Further analysis reveals several typical temporal evolution patterns of fraudulent keywords and the co-existence of both bots and crowd workers as fraudsters in web search fraud.
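
    The abstract does not spell out how DFW computes temporal correlation or what its refinements are, so the following is only a minimal illustrative sketch of the general idea of seed-based keyword expansion, not the authors' implementation. It assumes that logs are (Unix timestamp, keyword) pairs, that correlation is measured as the Pearson correlation of hourly search counts, and that a candidate is kept only if it correlates strongly with several seeds (a simple stand-in for filtering out keywords that co-occur with seeds by chance). The function names, the hourly bucketing, and the thresholds `min_corr` and `min_seed_matches` are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def hourly_series(logs, num_buckets, bucket_seconds=3600):
    """Count searches per keyword per time bucket.

    logs: iterable of (unix_timestamp, keyword) pairs (assumed log format).
    """
    series = defaultdict(lambda: [0] * num_buckets)
    for ts, keyword in logs:
        bucket = int(ts // bucket_seconds)
        if 0 <= bucket < num_buckets:
            series[keyword][bucket] += 1
    return dict(series)


def pearson(x, y):
    """Pearson correlation of two equal-length count series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def expand_from_seeds(series, seeds, min_corr=0.8, min_seed_matches=2):
    """Return non-seed keywords whose series correlate with enough seeds.

    Requiring several seed matches is a hypothetical refinement meant to drop
    keywords that resemble a single seed only by coincidence.
    """
    known_seeds = [s for s in seeds if s in series]
    detected = set()
    for keyword, counts in series.items():
        if keyword in seeds:
            continue
        matches = sum(
            1 for s in known_seeds if pearson(counts, series[s]) >= min_corr
        )
        if matches >= min_seed_matches:
            detected.add(keyword)
    return detected
```

    With this sketch, a keyword whose hourly volume rises and falls together with at least two seeds is flagged, while a keyword with flat or opposite-phase volume is not. The pairwise correlations computed here could also serve as edge weights if one wanted to build the keyword graph mentioned in the abstract.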

     
