• Articles • Previous Articles     Next Articles

Clustering Text Data Streams

Yu-Bao Liu1, Jia-Rong Cai1, Jian Yin1, and Ada Wai-Chee Fu2   

  1. 1Department of Computer Science, Sun Yat-Sen University, Guangzhou 510275, China 2Department of Computer Science and Engineering, the Chinese University of Hong Kong, Hong Kong, China
  • Received:2007-06-25 Revised:2007-11-14 Online:2008-01-15 Published:2008-01-10

Clustering text data streams is an important issue in data mining community and has a number of applications such as news group filtering, text crawling, document organization and topic detection and tracing etc. However, most methods are similarity-based approaches and only use the TF$*$IDF scheme to represent the semantics of text data and often lead to poor clustering quality. Recently, researchers argue that semantic smoothing model is more efficient than the existing TF$*$IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for dynamic text data context. In this paper, we extend the semantic smoothing model into text data streams context firstly. Based on the extended model, we then present two online clustering algorithms OCTS and OCTSM for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure named cluster profile which can capture the semantics of text data streams dynamically and at the same time speed up the clustering process. Some efficient implementations for our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.

Key words: volume visualization; volume rendering; virtual reality; volumegraphics; object simplification; level of detail; wavelets theory;

[1] Dou Shen, Qiang Yang, JianTao Sun, Zheng Chen. Thread detection in dynamic text message streams. In -\it Proc. ACM SIGIR} 2006, Seattle, Washington, August 6--11, pp.35--42.

[2] Aggarwal C C. A framework for diagnosing changes in evolving data streams. In -\it Proc. ACM SIGMOD} 2003, San Diego, June 9--12, pp.575--586.

[3] Agrawal C C, Han J, Wang J, Yu P S. A framework for clustering evolving data streams. In -\it Proc. VLDB 2003}, Berlin, September 9--12, 2003, pp.81--92.

[4] Agrawal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In -\it Proc. VLDB 2004}, Toronto, August 31--September 3, pp.852--863.

[5] O'Callaghan L, Mishra N, Meyerson A, Guha S. Streaming data algorithms for high-quality clustering. In -\it Proc. ICDE 2002}, San Jose, CA, February 26--March 1, pp.685--704.

[6] Aggarwal C C, Yu P S, A framework for clustering massive text and categorical data streams. In -\it Proc. SIAM Conference on Data Mining}, Bethesda, MD, April 20--22, 2006, pp.407--411.

[7] Yang Y, Pierce T, Carbonell J. A study of retrospective and on-line event detection. In -\it Proc. ACM SIGIR}, Melbourne, August 24--28, 1998, pp.28--36.

[8] Arindam Banerjee, Sugato Basu. Topic models over text streams: A study of batch and online unsupervised learning. In -\it Proc. SIAM Conference on Data Mining}, Minneapolis, April 26--28, 2007, pp.437--442.

[9] Xiaodan Zhang, Xiaohua Zhou, Xiaohua Hu. Semantic smoothing for model-based document clustering. In -\it Proc. ICDM06}, Hong Kong, December 18--22, pp.1193--1198.

[10] Zhou X, Hu X, Zhang X, Lin X, Song I Y. Context-sensitive semantic smoothing for the language modeling approach to genomic IR. In -\it Proc. ACM SIGIR}, Seattle, Washington, August 6--11, 2006, pp.170--177.

[11] Zhai C, Lafferty J. Two-stage language models for information retrieval. In -\it Proc. ACM SIGIR}, Tampere, August 11--15, 2002, pp.49--56.

[12] Zhai C, Lafferty J. A study of smoothing methods for language models applied to ad hoc information retrieval. In -\it Proc. ACM SIGIR}, New Orleans, September 9--13, 2001, pp.334--342.

[13] Yubao Liu, Jiarong Cai, Jian Yin, Ada Wai-Chee Fu. Clustering massive text data streams by semantic smoothing model. In -\it Proc. ADMA}, Harbin, August 6--8, 2007, pp.389--400.

[14] Zhong S, Ghosh J. Generative model-based document clustering: A comparative study. -\it Knowledge and Information Systems}, 2005, 8(3): 374--384.

[15] Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In -\it Proc. Text Mining Workshop}, -\it KDD 2000}, Boston, August 20--23, pp.1--20.

[16] Guha S, Mishra N, Motwani R, O'Callaghan L. Clustering data streams. In -\it Proc. FOCS 2000}, California, November 12--14, pp.359--366.

[17] Shi Zhong. Efficient streaming text clustering. -\it Neural Networks}, 2005, 18(5-6): 790--798.

[18] Fung G P C, Yu J X, Yu P S, Lu H. Parameter free bursty events detection in text streams. In -\it Proc. VLDB} 2005, Trondheim, August 30--September 2, pp.181--192

[19] Qi He, Kuiyu Chang, Ee-Peng Lim, Jun Zhang. Bursty feature representation for clustering text streams. In -\it Proc. SIAM Conference on Data Mining} 2007, Minneapolis, April 26--28, pp.491--496.

[20] Xu-Bin Deng, Yang-Yong Zhu. L-tree match: A new data extraction model and algorithm for huge text stream with noises. -\it Journal of Computer Science and Technology}, 2005, 20(6): 763--773.

[21] Gabriel Pui Cheong Fung, Jeffery Xu Yu, Hongjun Lu. Classifying text streams in the presence of concept drifts. In -\it Proc. PAKDD 2004}, Sydney, May 26--28, pp.373--383.

[22] Haixun Wang, Jian Yin, Jian Pei, Philip S Yu, Jeffrey Xu Yu. Suppressing model over-fitting in mining concept-drifting data streams. In -\it Proc. KDD 2006}, Philadelphia, August 20--23, pp.736--741.

[23] Weiheng Zhu, Jian Pei, Jian Yin, Yihuang Xie. Granula\-rity adaptive density estimation and on demand clustering of concept-drifting data streams. In -\it Proc. DaWaK 2006}, Krakow, September 4--8, pp.322--331.

[24] Qiaozhu Mei, Chengxiang Zhai. Discovering evolutionary theme patterns from text -An exploration of temporal text mining. In -\it Proc. KDD 2005}, Chicago, August 21--24, pp.198--207.

[25] Shouke Qin, Weining Qian, Aoying Zhou. Approximately processing multi-granularity aggregate queries over data streams. In -\it Proc. ICDE 2006}, Atlanta, April 3--8, p.67.

[26] Dong-Hong Han, Guo-Ren Wang, Chuan Xiao, Rui Zhou. Load shedding for window joins over streams. \it Journal of Computer Science and Technology, \rm 2007, 22(2): 182--189.

[27] Zhi-Hong Chong, Jeffrey Xu Yu, Zhen-Jie Zhang, Xue-Min Lin, Wei Wang, Ao-Ying Zhou. Efficient computation of $k$-medians over data streams under memory constraints. \it Journal of Computer Science and Technology, \rm 2006, 21(2): 284--296.

[28]Jian Pei, Haixun Wang, Philip S Yu. Online mining of data streams: Applications, techniques and progress. In -\it Proc. KDD 2004} (Tutorials), Seattle, WA, August 22--25, pp.1--60.

[29] Joong Hyuk Chang, Won Suk Lee. Effect of count estimation in finding frequent itemsets over online transactional data streams. \it Journal of Computer Science and Technology, \rm 2005, 20(1): 63--69.

[30] Jiawei Han, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W Wah, Jianyong Wang, Y Dora Cai. Stream cube: An architecture for multi-dimensional analysis of data streams. \it Distributed and Parallel Databases, \rm 2005, 18(2): 173--197.

[31]Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W Wah, Jianyong Wang. Multi-dimensional regression analysis of time-series data streams. In \it Proc. VLDB 2002, \rm August 20--23, Hong Kong, pp.323--334.

[32] Yubao Liu, Jiarong Cai, Jian Yin, Zhilan Huang. Document clustering based on semantic smoothing approach. In \it Proc. AWIC 2007, \rm Fontainebleau, June 25--27, pp.217--222.
[1] Masoud Zadghorban Lifkooee, Celong Liu, Yongqing Liang, Yimin Zhu, Xin Li. Real-Time Avatar Pose Transfer and Motion Generation Using Locally Encoded Laplacian Offsets [J]. Journal of Computer Science and Technology, 2019, 34(2): 256-271.
[2] Dorde M. Durdević and Igor I. Tartalja. Domino Tiling: A New Method of Real-Time Conforming Mesh Construction for Rendering Changeable Height Fields [J]. , 2011, 26(6): 971-987.
[3] Wai-Ho Mak (麦伟豪), Yingcai Wu (巫英才), Ming-Yuen Chan (陈明远), and Huamin Qu (屈华民), Member, IEEE. Visibility-Aware Direct Volume Rendering [J]. , 2011, 26(2): 217-228.
[4] Takuma Kawamura, Naohisa Sakamoto,and Koji Koyamada, Member, IEEE. Level-of-Detail Rendering of Large-Scale Irregular Volume Datasets Using Particles [J]. , 2010, 25(5): 905-915.
[5] Sungchan Kim, Kunwoo Lee, Taesik Hong, and Moonki Jung. Modeling in Multi-Resolution and Its Applications [J]. , 2006, 21(2): 272-278 .
[6] Rong-Hua Liang, Zhi-Geng Pan, and Chun Chen. New Algorithm for 3D Facial Model Reconstruction and Its Application in Virtual Reality [J]. , 2004, 19(4): 0-0.
[7] Michael Boyles and FANG ShiaoFen(方晓芬). 3DIVE: An Immersive Environment for Interactive Volume Data Exploration [J]. , 2003, 18(1): 0-0.
[8] CAO Weiqun(曹卫群),BAO Hujun(鲍虎军)and PENG Qunsheng(彭群生). An Algorithm for LOD by Merging Near Coplanar Faces Based on Gauss Sphere [J]. , 2001, 16(5): 0-0.
[9] ZHANG Xiaopeng(张晓鹏),CHEN Yanyun(陈彦云)and WU Enhua(吴恩华). Hair Image Generation Using Connected Texels [J]. , 2001, 16(4): 0-0.
[10] PAN Zhigeng(潘志庚),ZHOU Kun(周昆)and SHI Jiaoying(石教英). A New Mesh Simplification Algorithm Based on Triangle Collapses [J]. , 2001, 16(1): 0-0.
[11] HE Taosong;. Volumetric Virtual Environments [J]. , 2000, 15(1): 37-46.
[12] Tong Xin; Tang Zesheng;. Hardware Assisted Fast Volume Rendering with Boundary Enhancement [J]. , 1998, 13(5): 393-401.
[13] Cai Yong; Heng Phengann; Wu Enhua; Liu Xuehui; Li Hongju; Sun Qingjie;. An Image-Based Virtual Reality Prototype System [J]. , 1998, 13(5): 475-480.
[14] Zhang Qiong; Shi Jiaoring;. Acoustic Simulation with Dynamic Mechanisms in Virtual Reality [J]. , 1998, 13(3): 285-288.
[15] Cai Wenli; Shi Jiaoying;. Composed Scattering Model for Direct Volume Rendering [J]. , 1996, 11(5): 433-442.
Full text



[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] Li Minghui;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .

ISSN 1000-9000(Print)

CN 11-2296/TP

Editorial Board
Author Guidelines
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved