Detecting Marionette Microblog Users for Improved Information Credibility

吴贤; 范伟; 高晶; 冯子明; 俞勇

doi:10.1007/s11390-015-1584-4

Abstract

This paper studies the detection of a special group of microblog users: "marionette" users, which are created or employed by backstage "puppeteers" (typically marketing companies), either manually or through programs. Unlike normal users, who use microblogs for information sharing or social communication, marionette users perform specific tasks for financial profit. For example, they follow certain users to increase those users' "statistical popularity", or retweet certain tweets to amplify their "statistical impact". The fabricated follower and retweet counts not only mislead normal users but also seriously impair microblog-based applications such as hot tweet selection and expert finding. The problem is challenging because puppeteers employ sophisticated strategies to make marionette users behave like normal ones. To tackle this challenge, we take into account two types of discriminative information: 1) the tweeting behavior of individual users and 2) the social interactions among users. By integrating both types of information into a semi-supervised probabilistic model, we can effectively distinguish marionette users from normal ones. Applying the proposed model to Sina Weibo, one of the most popular microblog platforms in China, we find that it detects marionette users with an F-measure close to 0.9. In addition, we apply the model to calculate the marionette ratio of the top 200 most followed microbloggers and the top 50 most retweeted posts on Sina Weibo. To speed up detection and reduce the cost of feature generation, we further propose a lightweight model that uses fewer features to identify marionettes among the retweeters of popular posts.
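The two quantities reported in the abstract, the F-measure and the marionette ratio, can be illustrated with a minimal sketch. It assumes the balanced F1 form of the F-measure (the abstract does not state a weighting) and reads "marionette ratio" as the share of accounts in a set that a detector labels as marionettes. The function names `f_measure` and `marionette_ratio` are hypothetical and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from confusion counts.

    Assumption: equal weight on precision and recall; the abstract
    does not specify the weighting it uses.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def marionette_ratio(is_marionette: list[bool]) -> float:
    """Share of accounts labeled as marionettes in a set.

    Assumption: the set could be the followers of one microblogger or
    the retweeters of one post, with one detector label per account.
    """
    if not is_marionette:
        return 0.0
    return sum(is_marionette) / len(is_marionette)
```

For instance, a detector with 90 true positives, 10 false positives, and 10 false negatives has precision and recall of 0.9 each, so `f_measure(90, 10, 10)` is approximately 0.9.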
