基于数据场理论和网格相似度的密度峰值聚类算法

俞庆英; 施格格; 徐东升; 汪文凯; 陈传明; 罗永龙

doi:10.1007/s11390-023-2984-5

基于数据场理论和网格相似度的密度峰值聚类算法

Density Peak Clustering Algorithm Based on Data Field Theory and Grid Similarity

摘要

摘要:
研究背景 聚类是数据挖掘研究和应用中必不可少的工具。同一集群中的元素具有最大的相似性，各个不同集群之间存在明显差异。聚类分析在机器学习、图聚类、生物学、传感器网络和图像识别等领域已得到广泛应用。2014年，Rodriguez等人提出了一种通过快速搜索和查找密度峰值（Clustering by fast search and find of density peaks，DPC）的新型聚类算法。该算法假设聚类中心被具有较低局部密度的邻居包围，并且它们与具有较高局部密度的任何点的距离相对较大；DPC可以快速有效地识别聚类中心，但还是存在一些问题需要解决：1）单步分配策略易导致非中心点的连续分配错误；2）DPC算法无法有效处理一个簇中有多个密度峰值的数据集；3）通过绘制决策图的方法手动识别聚类中心会导致聚类中心不确定。
目的针对研究背景提出的几个问题，我们的研究目标如下：1）改进聚类中心的选取方法从而准确地找到聚类中心；2）改进非中心点的分配策略从而降低连续分配错误发生的可能性；3）提高传统DPC算法的聚类精度和运行效率。
方法我们提出了一种基于数据场理论和网格相似度的密度峰值聚类算法（DFGS-DPC），该方法主要包括以下四个步骤：1）数据网格化：将原始数据网格化得到子空间编码和网格数据。2）密度峰值发现：提出潜在网格密度的概念，综合考虑了网格中数据的潜在价值、周围环境的影响以及所包含数据点的个数。通过计算每个网格的潜在网格密度来评价其疏密程度，然后按照密度值的降序排列输出。3）邻域扩充：从降序网格序列中选取具有最大潜在网格密度的网格为初始聚类中心，使用网格相似度来判断两个网格间的相似程度，找到满足邻域相似半径要求的网格纳入当前簇中，并在这些网格中找到邻域扩展中心，将其作为新的聚类中心来不断扩展当前聚类。4）集群合并：如果聚类中心间的网格相似性小于合并阈值，则合并这两个集群；如果存在聚类中心和其他所有聚类中心间的网格相似性都不满足合并阈值，则不将此集群和任何集群合并。在真实世界数据集和合成数据集上的实验表明，本文提出的密度峰值聚类算法能有效处理一个簇中有多个密度峰值的数据集，改进了聚类效果。
结果我们在真实世界数据集和合成数据集上进行了大量实验，从两个方面来评估DFGS-DPC的性能：1）聚类效果(ACC, ARI, NMI, F1)与同类算法相比，DFGS-DPC在合成数据集上运行结果的ACC最多提高了32.47%，ARI最多提高了58.32%，NMI最多提高了44.14%，F1最多提高了28.31%；DFGS-DPC在真实世界数据集上运行结果的ACC最多提高了18.63%，ARI最多提高了11.08%，NMI最多提高了8.46%，F1最多提高了21.59%。2）运行效率在大型数据集上，DFGS-DPC的运行时间远远少于同类算法，最多比同类算法短595.9634秒。从聚类效果和运行效率来看，DFGS-DPC的性能优于同类算法。
结论为了准确找到聚类中心，本文提出了基于网格空间和数据场理论的潜在网格密度计算方法，并提出了网格相似度评价方法用于快速准确地找到集群成员。同时还提出了邻域扩展中心的概念，以有效处理一个簇中多密度峰对聚类结果的影响。在合成数据集和真实数据集上的实验结果表明，本文提出的密度峰值聚类算法DFGS-DPC的性能优于对比的四种算法，可以有效识别密度不均匀、交叉缠绕的数据集。此外，DFGS-DPC可以自动获取聚类中心，在大规模数据集上的运行效率有较大优势。DFGS-DPC引入了k和α两个参数，未来将找寻一种自动确定参数的方法，简化算法参数的确定过程。

Abstract: The density peak clustering algorithm can rapidly identify cluster centers by drawing decision graphs without any prior knowledge; however, when multiple density peaks are present in one cluster of the dataset, the cluster centers cannot be accurately obtained, leading to incorrect clustering. Moreover, the single-step allocation strategy is poorly fault-tolerant and can lead to successive allocation errors. To address these problems, this study proposes a density peak clustering method based on data field theory and grid similarity, which divides the original data into grid spaces to obtain subspace codes and grid data. Subsequently, the potential grid density is introduced to measure the density of each grid based on the data field, and the density peaks are identified to obtain the cluster centers from the high-density grids. Finally, self-organized clustering is realized based on neighborhood extension centers and grid similarity. It thus reduces the possibility of successive assignment errors and the negative impact of multiple density peaks on the clustering results. In addition, the algorithm automatically determines the cluster centers and can correctly assign the non-density-peak grid to the corresponding clusters. Experimental results with synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar algorithms in terms of accuracy and efficiency when dealing with complex datasets with large density differences and cross-tangling.

HTML全文

参考文献()

施引文献

资源附件()