›› 2014, Vol. 29 ›› Issue (1): 116-141.doi: 10.1007/s11390-013-1416-3

Special Issue: Data Management and Data Mining

• Data Management and Data Mining • Previous Articles     Next Articles

On Density-Based Data Streams Clustering Algorithms:A Survey

Amineh Amini, Member, IEEE, Teh Ying Wah, and Hadi Saboohi, Member, ACM, IEEE   

  1. Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya Kuala Lumpur 50603, Malaysia
  • Received:2013-04-11 Revised:2013-11-13 Online:2014-01-05 Published:2014-01-05
  • Supported by:

    The work is supported by the University of Malaya Research under Grant No. RG097-12ICT.

Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the number of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using densitybased methods in the clustering process and at the same time overcoming the constraints, which are put out by data stream's nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms' performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.

[1] Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques (3rd edition). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[2] Gaber M, Zaslavsky A, Krishnaswamy S. Mining data streams: A review. ACM SIGMOD Record, 2005, 34(2): 1826.

[3] Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In Proc. the 2006 SIAM Conference on Data Mining, April 2006, pp.328-339.

[4] Chen Y, Tu L. Density-based clustering for real-time stream data. In Proc. the 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Aug. 2007, pp.133-142.

[5] Aggarwal C C (ed.). Data Streams: Models and Algorithms. New York, USA: Springer, 2007.

[6] Hahsler M, Dunham M H. Temporal structure learning for clustering massive data streams in real-time. In Proc. the 11th SIAM Conference on Data Mining, April 2011, pp.664675.

[7] O?allaghan L, Mishra N, Meyerson A et al. Streaming-data algorithms for high-quality clustering. In Proc. the 18th Int. Conf. Data Engineering, Feb. 26-Mar. 1, 2002, pp.685-694.

[8] Barbará D. Requirements for clustering data streams. SIGKDD Explorations Newsletter, 2002, 3(2): 23-27.

[9] Guha S, Meyerson A, Mishra N et al. Clustering data streams: Theory and practice. IEEE Trans. Knowledge and Data Engineering, 2003, 15(3): 515-528.

[10] Aggarwal C C, Han J, Wang J, Yu P S. A framework for clustering evolving data streams. In Proc. the 29th International Conference on Very Large Data Bases, Sept. 2003, pp.81-92.

[11] Ackermann M R, Lammersen C, Märtens M, Raupach C, Sohler C, Swierkot K. StreamKM++: A clustering algorithm for data streams. In Proc. the 12th Workshop on Algorithm Engineering and Experiments, Jan. 2010, pp.173-187.

[12] Ikonomovska E, Loskovska S, Gjorgjevik D. A survey of stream data mining. In Proc. the 8th National Conference with International Participation, Sept. 2007, pp.19-25.

[13] Gaber M, Zaslavsky A, Krishnaswamy S. Data stream mining. Data Mining and Knowledge Discovery Handbook, 2010, pp.759-787.

[14] Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In Proc. the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2002, pp.1-16.

[15] Jain A K, Dubes R C. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

[16] Jain A K. Data clustering: 50 years beyond K-means. Pattern Recognition Letter, 2010, 31(8): 651-666.

[17] Mahdiraji A. Clustering data stream: A survey of algorithms. Int. J. Knowledge-Based and Intelligent Engineering Systems, 2009, 13(2): 39-44.

[18] Amini A, Wah T, Saybani M et al. A study of density-grid based clustering algorithms on data streams. In Proc. the 8th Int. Conf. Fuzzy Systems and Knowledge Discovery, July 2011, pp.1652-1656.

[19] Amini A, Wah T Y. Density micro-clustering algorithms on data streams: A review. In Proc. Int. Multiconf. Data Mining and Applications, March 2011, pp.410-414.

[20] Amini A, Wah T Y. A comparative study of density-based clustering algorithms on data streams: Micro-clustering approaches. In Lecture Notes in Electrical Engineering 110, Ao S, Castillo O, Huang X (eds.), Springer, 2012, pp.275-287.

[21] Aggarwal C C. A survey of stream clustering algorithms. In Data Clustering: Algorithms and Applications, Aggarwal C C, Reddy C (eds.), CRC Press, 2013, pp.457-482.

[22] Han J, Kamber M. Data Mining: Concepts and Techniques (2nd edition). Morgan Kaufmann, 2006.

[23] MacQueen J. Some methods for classification and analysis of multivariate observations. In Proc. the 5th Berkeley Symposium on Mathematical Statistics and Probability, June 21July 18, 1967, pp.281-297.

[24] Lloyd S P. Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982, 28(2): 129-137.

[25] Guha S, Mishra N, Motwani R, O'Callaghan L. Clustering data streams. In Proc. the 41st Annual Symposium on Foundations of Computer Science, Nov. 2000, pp.359-366.

[26] Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. the 1996 ACM SIGMOD International Conference on Management of Data, June 1996, pp.103-114.

[27] Karypis G, Han E, Kumar V. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 1999, 32(8): 6875.

[28] Kranen P, Assent I, Baldauf C, Seidl T. The clustree: Indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst., 2011, 29(2): 249-272.

[29] Wang W, Yang J, Muntz R R. STING: A statistical information grid approach to spatial data mining. In Proc. the 23rd Int. Conf. Very Large Data Bases, Aug. 1997, pp.186-195.

[30] Sheikholeslami G, Chatterjee S, Zhang A. Wavecluster: A wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 2000, 8(3/4): 289-304.

[31] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record, 1998, 27(2): 94-105.

[32] Tu L, Chen Y. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery Data, 2009, 3(3): Article No. 12.

[33] Wan L, Ng W K, Dang X H et al. Density-based clustering of data streams at multiple resolutions. ACM Trans. Knowledge Discovery from Data, 2009, 3(3): Article No. 14.

[34] Dempster A P, Laird N M, Rubin D B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977, 39(1): 1-38.

[35] Dang X, Lee V, Ng W K et al. An EM-based algorithm for clustering data streams in sliding windows. In Proc. the Int. Conf. Database Systems for Advanced Applications, Apr. 2009, pp.230-235.

[36] Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining, Aug. 1996, pp.226-231.

[37] Ankerst M, Breunig M M, Kriegel H P, Sander J. Optics: Ordering points to identify the clustering structure. ACM SIGMOD Record, 1999, 28(2): 49-60.

[38] Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In Proc. the 4th KDD, Sept. 1998, pp.58-65.

[39] Matysiak M. Data stream mining: Basic methods and techniques. Technical Report, RWTH Aachen University, 2012.

[40] Zhou A, Cao F, Qian W, Jin C. Tracking clusters in evolving data streams over sliding windows. Knowledge and Information Systems, 2008, 15(2): 181-214.

[41] Ren J, Ma R. Density-based data streams clustering over sliding windows. In Proc. the 6th Int. Conf. Fuzzy systems and Knowledge Discovery, Aug. 2009, pp.248-252.

[42] Charikar M, O'Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In Proc. the 35th Annual ACM Symp. Theory of Computing, June 2003, pp.30-39.

[43] Gao J, Li J, Zhang Z, Tan P N. An incremental data stream clustering algorithm based on dense units detection. In Proc. the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 2005, pp.420-425.

[44] Aggarwal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In Proc. the 30th International Conference on Very Large Data Bases, Volume 30, Aug. 29-Sept. 3, 2004, pp.852-863.

[45] Aggarwal C C, Han J, Wang J, Yu P S. On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery, 2005, 10(3): 251-273.

[46] Babcock B, Datar M, Motwani R, O'Callaghan L. Maintaining variance and k-medians over data stream windows. In Proc. the 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, June 2003, pp.234-243.

[47] Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In Lecture Notes in Computer Science 6380, Hameurlain A, Küng J,Wagner R et al. (eds.), Springer Berlin/Heidelberg, 2010, pp.1-30.

[48] Vitter J S. Random sampling with a reservoir. ACM Trans. Math. Softw., 1985, 11(1): 37-57.

[49] Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look: A tutorial. In Proc. the 2002 ACM SIGMOD Int. Conf. Management of Data, June 2002, pp.635-635.

[50] Aggarwal C C, Yu P S. A survey of synopsis construction in data streams. In Advances in Database Systems 31, Aggarwal C C (ed.), Springer, 2007, pp.169-207.

[51] Garofalakis M N. Wavelets on streams. In Encyclopedia of Database Systems, Springer US, 2009, pp.3446-3451.

[52] Gilbert A C, Kotidis Y, Muthukrishnan S, Strauss M J. Onepass wavelet decompositions of data streams. IEEE Trans. Knowl. and Data Eng., 2003, 15(3): 541-554.

[53] Gama J, Gaber M M (eds.). Learning from Data Streams { Processing Techniques in Sensor Networks. Springer, 2007.

[54] Rosset S, Inger A. KDD-cup 99: Knowledge discovery in a charitable organization's donor database. SIGKDD Explorations Newsletter, 2000, 1(2): 85-90.

[55] Hubert L J, Levin J R. A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 1976, 83(6): 1072-1080.

[56] Kaufman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 2005.

[57] Wu J, Xiong H, Chen J. Adapting the right measures for K-means clustering. In Proc. the 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, June 2009, pp.877-886.

[58] Rand W M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, 66(336): 846-850.

[59] Zhao Y, Karypis G. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 2004, 55(3): 311-331.

[60] Dongen S. Performance criteria for graph clustering and Markov cluster experiments. Technical Report, National Research Institute for Mathematics and Computer Science, Stichting Mathematisch Centrum, Netherlands, 2000.

[61] Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proc. the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 2007, pp.410-420.

[62] Meil穉 M. Comparing clusterings: An axiomatic view. In Proc. the 22nd Int. Conf. Machine Learning, Aug. 2005, pp.577584.

[63] Rijsbergen C J V. Information Retrieval. Newton, MA, USA: Butterworth-Heinemann, 1979.

[64] Milligan G. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 1981, 46(2): 187-199.

[65] Pereira C M M, de Mello R F. A comparison of clustering algorithms for data streams. In Proc. the 1st Int. Conf. Integrated Comp. Tech., May 31-June 2, 2011, pp.59-74.

[66] Manning C D, Raghavan P, Schtze H. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.

[67] Forestiero A, Pizzuti C, Spezzano G. A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Mining and Knowledge Discovery, 2013, 26(1): 1-26.

[68] Bifet A, Holmes G, Pfahringer B et al. MOA: Massive online analysis, a framework for stream classification and clustering. Journal of Machine Learning Research, 2010, 11: 44-50.

[69] Holmes G, Donkin A, Witten I H. WEKA: A machine learning workbench. In Proc. the 2nd Australian and New Zealand Conference on Intelligent Information Systems, Nov. 29Dec. 3, 1994, pp.357-361.

[70] Kranen P, Kremer H, Jansen T et al. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. In Proc. the IEEE Int. Conf. Data Mining Workshops, Dec. 2010, pp.1400-1403.

[71] Kremer H, Kranen P, Jansen T et al. An e甧ctive evaluation measure for clustering on evolving data streams. In Proc. the 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, July 2011, pp.868-876.

[72] De Francisci Morales G. SAMOA: A platform for mining big data streams. In Proc. the 22nd Int. Conf. World Wide Web Companion, May 2013, pp.777-778.

[73] Tasoulis D K, Ross G, Adams N M. Visualising the cluster structure of data streams. In Proc. the 7th International Conference on Intelligent Data Analysis, Sept. 2007, pp.8192.

[74] Ruiz C, Menasalvas E, Spiliopoulou M. C-DenStream: Using domain knowledge on a data stream. In Proc. the 12th International Conference on Discovery Science, Oct. 2009, pp.287-301.

[75] Liu L, Jing K, Guo Y et al. A three-step clustering algorithm over an evolving data stream. In Proc. the IEEE Int. Conf. Intelligent Computing and Intelligent Systems, Nov. 2009, pp.160-164.

[76] Lin J, Lin H. A density-based clustering over evolving heterogeneous data stream. In Proc. the 2nd Int. Colloquium on Computing, Communication, Control, and Management, Aug. 2009, pp.275-277.

[77] Isaksson C, Dunham M, Hahsler M. SOStream: Self organizing density-based clustering over data stream. In Lecture Notes in Computer Science 7376, Perner P (ed.), Springer Berlin Heidelberg, 2012, pp.264-278.

[78] Ntoutsi I, Zimek A, Palpanas T et al. Density-based projected clustering over high dimensional data streams. In Proc. the 12th SIAM Int. Conf. Data Mining, April 2012, pp.987-998.

[79] Hassani M, Spaus P, Gaber M M, Seidl T. Density-based projected clustering of data streams. In Proc. the 6th Int. Conf. Scalable Uncertainty Management, Sept. 2012, pp.311-324.

[80] Jia C, Tan C, Yong A. A grid and density-based clustering algorithm for processing data stream. In Proc. the 2nd Int. Conf. Genetic and Evolutionary Computing, Sept. 2008, pp.517-521.

[81] Ren J, Cai B, Hu C. Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 2011, 6(1): 83-93.

[82] Yang Y, Liu Z, Zhang J et al. Dynamic density-based clustering algorithm over uncertain data streams. In Proc. the 9th Int. Conf. Fuzzy Systems and Knowledge Discovery, May 2012, pp.2664-2670.

[83] Amini A, Teh Ying W. DENGRIS-Stream: A density-grid based clustering algorithm for evolving data streams over sliding window. In Proc. International Conference on Data Mining and Computer Engineering, Dec. 2012, pp.206-210.

[84] Bhatnagar V, Kaur S, Chakravarthy S. Clustering data streams using grid-based synopsis. Knowledge and Information Systems, June 2013.

[85] Ruiz C, Spiliopoulou M, Menasalvas E. C-DBSCAN: Densitybased clustering with constraints. In Proc. the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, May 2007, pp.216-223.

[86] Yang C, Zhou J. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proc. the 6th IEEE Int. Conf. Data Mining Workshops, Dec. 2006, pp.682-688.

[87] Kohonen T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 1982, 43(1): 59-69.

[88] Bohm C, Kailing K, Kriegel H P, Kroger P. Density connected clustering with local subspace preferences. In Proc. the 4th IEEE Int. Conf. Data Mining, Nov. 2004, pp.27-34.

[89] Kennedy J F, Kennedy J, Eberhart R C. Swarm Intelligence. Morgan Kaufmann Pub, 2001.

[90] Shamshirband S, Anuar N, Kiah M et al. An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique. Engineering Applications of Artificial Intelligence, 2013, 26(9): 2105-2127.

[91] Sander J, Ester M, Kriegel H P, Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 1998, 2(2): 169-194.

[92] Plant C, Teipel S J, Oswald A et al. Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer's disease. NeuroImage, 2010, 50(1): 162-174.

[93] Mete M, Kockara S, Aydin K. Fast density-based lesion detection in dermoscopy images. Computerized Medical Imaging and Graphics, 2011, 35(2): 128-136.

[94] Yang D, Rundensteiner E A, Ward M O. Summarization and matching of density-based clusters in streaming environments. Proc. VLDB Endow., 2011, 5(2): 121-132.

[95] Lee C H. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 2012, 39(10): 9623-9641.

[96] Yu Y, Wang Q, Wang X, Wang H, He J. Online clustering for trajectory data stream of moving objects. Computer Science and Information Systems, 2013, 10(3): 1293-1317.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] Liu Mingye; Hong Enyu;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] Chen Shihua;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] Gao Qingshi; Zhang Xiang; Yang Shufan; Chen Shuqing;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] Chen Zhaoxiong; Gao Qingshi;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] Huang Heyan;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] Min Yinghua; Han Zhide;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] Tang Tonggao; Zhao Zhaokeng;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] Min Yinghua;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] Zhu Hong;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
[10] Li Minghui;. CAD System of Microprogrammed Digital Systems[J]. , 1987, 2(3): 226 -235 .

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved