We use cookies to improve your experience with our site.
Hong-Jie Guo, Jian-Zhong Li, Hong Gao. Minimum Epsilon-Kernel Computation for Large-Scale Data Processing[J]. Journal of Computer Science and Technology, 2022, 37(6): 1398-1411. DOI: 10.1007/s11390-022-2429-6
Citation: Hong-Jie Guo, Jian-Zhong Li, Hong Gao. Minimum Epsilon-Kernel Computation for Large-Scale Data Processing[J]. Journal of Computer Science and Technology, 2022, 37(6): 1398-1411. DOI: 10.1007/s11390-022-2429-6

Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

  • Kernel is a kindof data summarywhich is elaborately extracted from a largedataset. Given a problem, the solution obtainedfrom the kernelis an approximate version of the solutionobtained from thewhole dataset witha provable approximate ratio.It is widely used in geometric optimization, clustering, and approximate query processing, etc.,for scaling them up to massivedata. In this paper, we focuson the minimum ε-kernel (MK) computation that asks for a kernelof the smallest sizefor large-scale dataprocessing. For the openproblem presented by Wang et al.that whether theminimum ε-coreset (MC) problemand the MK problem can be reduced to each other,we first formalize the MK problemand analyze its complexity. Due to theNP-hardness of theMK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK),is developed to solve it. We prove that the MC problem and the MK problemcan be Turing-reduced to each other.Then, we discuss the update of MK underinsertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reducethe complexity of SCMK. The efficiency andeffectiveness of SCMK and RA-SCMKare verified by experimental results on real-world and synthetic datasets. Experiments show that thekernel sizes of SCMK are 2x and 17.6x smallerthan those of an ANN-basedmethod on real-world and synthetic datasets, respectively. The speedup ratioof SCMK over the ANN-based methodis 5.67 on synthetic datasets.RA-SCMK runs up tothree times faster than SCMK on synthetic datasets.
  • loading

Catalog

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return