Combining <i>K</i>NN with AutoEncoder for Outlier Detection

刘叔正; 马帅; 陈瀚清; 崔立真; 丁杰

doi:10.1007/s11390-023-2403-y

Structured abstract:

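Background: Outlier detection is the problem of finding patterns in data whose behavior does not conform to expectations, and it is a fundamental data mining technique in data analysis. When the data are large in scale or the outlier patterns are complex, labels are usually unavailable or too costly to obtain, so unsupervised outlier detection methods have been widely studied, including K-nearest neighbor (KNN)-based methods and AutoEncoder-based methods. However, KNN-based methods fall short on high-dimensional data, a setting that AutoEncoder-based methods generally handle; conversely, AutoEncoder-based methods usually do not preserve data proximity relationships well, whereas KNN-based methods have proven effective at preserving them. Methods that address both high-dimensional data and the preservation of data proximity relationships have rarely been studied, so an effective way to combine KNN with AutoEncoder is needed to obtain a better outlier detection method.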

Objective: XXXXX

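Methods: First, we propose the Nearest Neighbor AutoEncoder (NNAE), which learns embeddings in a low-dimensional space while preserving data proximity relationships through a newly designed loss function that incorporates KNN. To further alleviate the k-selection problem, we design a new outlier score, the K-nearest reconstruction neighbors (KNRN), which combines the k-distance used in KNN-based methods with the reconstruction error used in AutoEncoder-based methods. To improve usability, we also develop an indicator Z that uses the reconstruction errors of NNAE to select better structural parameters for NNAE. Finally, we evaluate our NNAE+KNRN approach through extensive experiments on five real-world datasets, comparing it with four other methods.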
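The abstract does not give the KNRN formula, so the following is only an illustrative sketch of the general idea of combining a k-distance with a reconstruction error. The function names, the min-max normalization, the weight `alpha`, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import math

def k_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def normalize(values):
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def combined_score(embeddings, recon_errors, k, alpha=0.5):
    """Hypothetical score (NOT the paper's KNRN definition): a weighted sum of
    the normalized k-distance and the normalized reconstruction error."""
    kd = normalize(k_distance(embeddings, k))
    re = normalize(recon_errors)
    return [alpha * a + (1 - alpha) * b for a, b in zip(kd, re)]

# Toy low-dimensional embeddings and per-point reconstruction errors
# (made-up values standing in for the output of a trained autoencoder).
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
recon_errors = [0.02, 0.03, 0.02, 0.03, 0.90]

scores = combined_score(embeddings, recon_errors, k=2)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
print(most_anomalous, scores)
```

In this toy setup the isolated point with the large reconstruction error receives the highest score. A weighted sum is only one possible way to merge the two signals; the paper's actual combination may differ.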

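Results: NNAE+KNRN is an effective outlier detection method. Across all datasets, compared with KNN, a traditional AutoEncoder, Robust AutoEncoder, and Isolation Forest, NNAE+KNRN improves the AUC by 11.40%, 34.93%, 31.97%, and 8.85% on average, respectively. In addition, NNAE+KNRN alleviates the k-selection problem by exploiting the newly designed loss function and the reconstruction errors, and a small number of nearest neighbors suffices. Finally, the structure indicator Z is a reasonable criterion for selecting the structural parameters of NNAE, which improves usability.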
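For readers unfamiliar with the AUC metric reported above, here is a minimal sketch of how it can be computed from outlier scores and ground-truth labels, using the standard pairwise-ranking interpretation. The function name `roc_auc` and the toy scores and labels are assumptions made for this example; they are not the paper's data or evaluation code.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen outlier (label 1)
    scores higher than a randomly chosen inlier (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up outlier scores and labels (1 = outlier, 0 = inlier).
toy_labels = [0, 0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.30]
auc = roc_auc(toy_scores, toy_labels)
print(round(auc, 4))
```

Because AUC depends only on how the scores rank outliers against inliers, it needs no decision threshold, which makes it a common choice for comparing unsupervised detectors.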

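Conclusions: This paper proposes NNAE+KNRN, a new method for outlier detection on high-dimensional data that seamlessly combines KNN with AutoEncoder. The Nearest Neighbor AutoEncoder (NNAE) copes with the curse of dimensionality and alleviates the k-selection problem, and the K-nearest reconstruction neighbors (KNRN) further alleviates the k-selection problem. In addition, we design a structure indicator Z to help select good structural parameters for NNAE. Extensive experiments verify the effectiveness and ease of use of the proposed method. Possible future directions include designing better methods for selecting the structural parameters and applying AutoEncoder to other data mining algorithms.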

Abstract: K-nearest neighbor (KNN) is one of the most fundamental methods for unsupervised outlier detection because of its various advantages, e.g., ease of use and relatively high accuracy. Currently, most data analytic tasks need to deal with high-dimensional data, and the KNN-based methods often fail due to “the curse of dimensionality”. AutoEncoder-based methods have recently been introduced to use reconstruction errors for outlier detection on high-dimensional data, but the direct use of AutoEncoder typically does not preserve the data proximity relationships well for outlier detection. In this study, we propose to combine KNN with AutoEncoder for outlier detection. First, we propose the Nearest Neighbor AutoEncoder (NNAE) by preserving the original data proximity in a much lower dimension that is more suitable for performing KNN. Second, we propose the K-nearest reconstruction neighbors (KNRNs) by incorporating the reconstruction errors of NNAE with the K-distances of KNN to detect outliers. Third, we develop a method to automatically choose better parameters for optimizing the structure of NNAE. Finally, using five real-world datasets, we experimentally show that our proposed approach NNAE+KNRN is much better than existing methods, i.e., KNN, Isolation Forest, a traditional AutoEncoder using reconstruction errors (AutoEncoder-RE), and Robust AutoEncoder.
