大规模数据处理的最小核数据计算

郭鸿杰; 李建中; 高宏

doi:10.1007/s11390-022-2429-6

摘要: 1、研究背景（Context）：
在大数据分析处理中，核数据计算是将数据分析任务扩展到大规模数据集的一种关键技术。给定一个问题，由核数据计算得到的解是由整个数据集以可证明的近似比得到的解的近似版本。核数据被广泛应用于几何优化，聚类和近似查询处理等问题中。为了减少存储、传输、查询的代价，在相同的近似误差下，人们更希望找到更小的核数据。Wang等人在2021-PODS上发表的一篇文章中首次提出了最小epsilon-coreset和最小epsilon-kernel的概念。他们证明了epsilon-coreset和epsilon-kernel的大小的最坏情况复杂度相同，但是它们的最小版本是否可以互相规约是一个open问题。
2、目的（Objective）：
我们研究的目的是为了回答Wang等人在2021-PODS上提出的open问题，并给出高效地算法计算最小epsilon-kernel。
3、方法（Method）：
首先我们定义了最小epsilon-kernel问题并证明了问题的复杂性。其次，我们运用几何方法和启发式方法给出了最小epsilon-kernel问题的近似解。然后，根据计算复杂性理论和算法复杂性理论证明了算法的近似比和复杂性，证明了了最小epsilon-kernel问题与最小epsilon-coreset问题的关系。最后，提出了一个随机算法进一步提高了计算效率。
4、结果（Result & Findings）：
首先，我们证明了在维度大于等于3时，最小epsilon-kernel问题是NP-难的。其次，一个近似比为O(dlog1/ε)的伪亚线性近似算法被提出来解决该问题。然后，我们证明了最小epsilon-coreset问题和最小epsilon-kernel问题可以互相图灵规约。最后，随机算法的提出使得计算复杂度从O((logn/ε^((d-1)/2) +(|Q|)/ε^((d-1)) )降到了O((logn+dlog 1/ε|Q|)/ε^((d-1)/2) )，其中|Q|是核数据大小。
5、结论（Conclusions）：
本研究提出最小epsilon-kernel计算问题，并理论证明了问题的复杂度。我们解决了Wang等人提出的open问题。基于索引的伪亚线性算法被提出来解决最小epsilon-kernel计算问题。我们理论分析了算法的近似比和复杂度。最后，引入随机算法进一步降低了算法的复杂度，并分析了新的近似比。

Abstract: Kernel is a kindof data summarywhich is elaborately extracted from a largedataset. Given a problem, the solution obtainedfrom the kernelis an approximate version of the solutionobtained from thewhole dataset witha provable approximate ratio.It is widely used in geometric optimization, clustering, and approximate query processing, etc.,for scaling them up to massivedata. In this paper, we focuson the minimum ε-kernel (MK) computation that asks for a kernelof the smallest sizefor large-scale dataprocessing. For the openproblem presented by Wang et al.that whether theminimum ε-coreset (MC) problemand the MK problem can be reduced to each other,we first formalize the MK problemand analyze its complexity. Due to theNP-hardness of theMK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK),is developed to solve it. We prove that the MC problem and the MK problemcan be Turing-reduced to each other.Then, we discuss the update of MK underinsertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reducethe complexity of SCMK. The efficiency andeffectiveness of SCMK and RA-SCMKare verified by experimental results on real-world and synthetic datasets. Experiments show that thekernel sizes of SCMK are 2x and 17.6x smallerthan those of an ANN-basedmethod on real-world and synthetic datasets, respectively. The speedup ratioof SCMK over the ANN-based methodis 5.67 on synthetic datasets.RA-SCMK runs up tothree times faster than SCMK on synthetic datasets.

大规模数据处理的最小核数据计算

Minimum Epsilon-Kernel Computation for Large-Scale Data Processing