On Density-Based Data Streams Clustering Algorithms: A Survey

    Abstract: Clustering data streams has drawn a great deal of attention in the last few years due to their ever-growing presence. Data streams pose additional challenges for clustering, such as limited time and memory and the need for one-pass processing. Furthermore, discovering clusters of arbitrary shape is very important in data stream applications. Data streams are infinite and evolve over time, and the number of clusters is not known in advance. In a data stream environment, noise also appears occasionally due to various factors. Density-based methods are a remarkable class for clustering data streams: they can discover clusters of arbitrary shape, detect noise, and do not need the number of clusters in advance. Because of these data stream characteristics, traditional density-based clustering is not directly applicable. Recently, many density-based clustering algorithms have been extended to data streams. The main idea of these algorithms is to use density-based methods in the clustering process while overcoming the constraints imposed by the nature of data streams. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms for data streams and discuss their uniqueness and limitations, but also explain how they address the challenges of clustering data streams. Moreover, we investigate the evaluation metrics used to validate cluster quality and measure algorithm performance. It is hoped that this survey will serve as a stepping stone for researchers studying data stream clustering, particularly density-based algorithms.

     
