Journal of Computer Science and Technology, 2015, Vol. 30, Issue (5): 1082-1096. doi: 10.1007/s11390-015-1584-4

Special Issue: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Social Media Processing •

Detecting Marionette Microblog Users for Improved Information Credibility

Xian Wu1 (吴贤), Wei Fan2 (范伟), Member, ACM, Jing Gao3 (高晶), Member, ACM, IEEE, Zi-Ming Feng1 (冯子明), Yong Yu1 (俞勇)

  1 Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China;
  2 Baidu Research Big Data Laboratory, Sunnyvale, CA 94089, U.S.A.;
  3 Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14214, U.S.A.
  • Received: 2014-11-15; Revised: 2015-06-15; Online: 2015-09-05; Published: 2015-09-05
  • About author: Xian Wu is now a Ph.D. candidate in the Department of Computer Science of Shanghai Jiao Tong University. His research interests include data mining, statistical learning, and natural language processing. He received his Master's degree from Shanghai Jiao Tong University in 2007 and his Bachelor's degree from Southeast University, Nanjing, in 2004, both in computer science.

In this paper, we propose to detect a special group of microblog users: the "marionette" users, who are created or employed by backstage "puppeteers", either through programs or manually. Unlike normal users, who access microblogs for information sharing or social communication, marionette users perform specific tasks to earn financial profits. For example, they follow certain users to increase those users' "statistical popularity", or retweet certain tweets to amplify their "statistical impact". The fabricated follower and retweet counts not only mislead normal users toward wrong information, but also seriously impair microblog-based applications, such as hot tweet selection and expert finding. Detecting marionette users on microblog platforms is challenging because puppeteers employ complicated strategies to generate marionette users that behave similarly to normal users. To tackle this challenge, we propose to take into account two types of discriminative information: 1) individual user tweeting behavior and 2) the social interactions among users. By integrating both types of information into a semi-supervised probabilistic model, we can effectively distinguish marionette users from normal ones. Applying the proposed model to Sina Weibo, one of the most popular microblog platforms in China, we find that it detects marionette users with an F-measure close to 0.9. In addition, we apply the model to calculate the marionette ratio of the top 200 most followed microbloggers and the top 50 most retweeted posts on Sina Weibo. To accelerate detection and reduce the cost of feature generation, we further propose a lightweight model that uses fewer features to identify marionettes among retweeters.
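
The two quantities reported in the abstract, the F-measure of the detector and the marionette ratio of an account's followers or a post's retweeters, can be illustrated with a minimal sketch. It assumes the standard F1 definition (harmonic mean of precision and recall) and takes the marionette ratio to be the fraction of accounts flagged as marionettes; the function names `f_measure` and `marionette_ratio` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (1 = marionette, 0 = normal user)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def marionette_ratio(y_pred: Sequence[int]) -> float:
    """Fraction of accounts (e.g. one user's followers) flagged as marionettes."""
    return sum(y_pred) / len(y_pred) if y_pred else 0.0


# Toy labels invented for illustration only; they are not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(f_measure(y_true, y_pred))
print(marionette_ratio(y_pred))
```

Under these assumptions, an F-measure close to 0.9 means that precision and recall on the marionette class are both high, since the harmonic mean is pulled toward the smaller of the two.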

ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved