内存约束条件下流数据中 k 中值点的快速计算

摘要: 正在出现的数据密集型应用，如传感器网络中的数据处理，广域网中网络流量分析和电子商务及股市在线分析等等，都需要新的技术和方法来处理高速产生的大规模流数据。作为数据挖掘的主要技术之一，对流数据进行聚类的方法得到广泛的研究。在本篇论文中，我们研究高速产生的数据流中 k中值点（k-median）的快速估计，特别是这些数据来源于高维空间。k中值点是度量聚类方法的一种。我主要目的是能够尽量减少计算时间，同时能够用尽量小的内存获得尽量精确的k中值点。我们的工作起始于以下的观察：已有的算法尽管有不同量级的理论结果，但是他们实际的性能却是非常一致。这种原因在于用尽量小的内存来获得近似的k中值点时，通常需要复杂的运算来维护一个压缩的“概要”（sketch）结构。这个“概要”往往通过一个现成的算法将不同时段的数据加以压缩处理而得到。但是在每个时段上所运行的算法也不能得到精确的解。因此，在不同时段上所产生的误差会随着流数据注入而积累起来，这使得大多数算法只能取得较差的近似结果，在实际运算中模糊不同算法的差异。我们在本篇论文中，提出基于格子的快速近似算法。这种近似方法不要将流数据分成不同的时段，因此不会产生随时间增长的误差。其中的一个重要概念是（1－ε）占优。在流数据上，我们将k中值点问题简化为一个流数据上频繁项搜索问题。这个简化是基于格子中点密集程度和（1－ε）占优之间的联系来达成的。这种方法的主要优点是不需要分时段压缩数据，这样就省掉每段费时的压缩运算，同时不会产生误差的累积。我们的主要贡献是， • 我们提出基于格子的有损压缩方法，它比基于分段压缩数据的方法有明显的优点。通过（ 1－ε）占优的概念，这种方法能够控制格子的个数这个关键问题。 • 我们提供两种方法来建立基于格子的压缩结构。 • 我们通过详尽的实验来研究各种算法的优劣。在实验数据和真实数据上，发现我们的方法都能够取得好的性能。同时，我们的实验研究也是第一次实现一些理论结果。

Abstract: In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1-\epsilon)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.

内存约束条件下流数据中 k 中值点的快速计算

Efficient Computation of k-Medians over Data Streams Under Memory Constraints