维持概率分布不变的启发式抽样算法

doi:10.1007/s11390-020-0065-6

摘要: 1、目的（Objective）：
抽样是生成数据子集的基本方法之一。由于许多数据分析方法都是基于概率分布，因此在抽样时保持概率分布不变有助于确保可靠的数据分析结果。然而，如何维持概率分布不变的条件下抽取最小数据子集仍然是一个挑战。本文基于贝叶斯网络，将联合概率分布分解为条件概率的乘积，在此基础上，我们依据皮尔逊检验判定若数据子集符合分布检验则概率分布不变。由此，本文提出了一个维持概率分布的抽样问题。此外，本文设计了一种基于遗传算法的启发式抽样方法，通过设计两个评分函数来抽取所需的子集：一个是基于卡方检验，另一个是基于似然函数。
2、方法（Method）：
我们首先利用皮尔逊卡方检验构造了分布检验评分函数，然后假设条件概率的先验分布服从狄利克雷分布，从狄利克雷分布函数出发构造了似然评分函数，基于评分函数将问题转化成了多目标非线性规划问题。为求解该问题，本文设计了一个基于遗传算法的启发式抽样算法，每次迭代中，遗传算法基于前述评分函数优化选取|ΔD′|个样本加入数据子集，然后判定数据子集是否满足分布检验，迭代直至数据子集满足皮尔逊卡方分布检验。
3、结果（Result&Findings）：
我们在4种不同的数据集上进行了验证，当样本量规模为6万且显著性差异水平α为0.05时，表1展示了本文算法可分别排除99.9%、99.0%、93.1%、96.7%的样本。当抽取同样规模样本时，表2表明本文算法在保证概率分布的同时平均分布误差仅为0.03，而随机抽样仅能通过83.8%的分布检验而平均分布误差为0.24。
4、结论（Conclusions）：
本文提出了一个新的抽样问题：利用皮尔逊检验抽取一个保持概率分布的最小数据子集。针对此问题，本文设计了基于遗传算法的启发式抽样方法，并构造了基于卡方检验与似然函数的两个评分函数用于指导抽样。实验表明当显著性差异水平为0.05时，我们的算法可以分别排除4种规模为6万的不同数据集的99.9%、99.0%、93.1%和96.7%的样本，且平均分布误差仅为0.03。

Abstract: Sampling is a fundamental method for generating data subsets. As many data analysis methods are developed based on probability distributions, maintaining distributions when sampling can help to ensure good data analysis performance. However, sampling a minimum subset while maintaining probability distributions is still a problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks and use the chi-square test to formulate a sampling problem that requires that the sampled subset pass the distribution test to ensure the distribution. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets with a size of 60 000 show that when the significant difference level, α, is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1% and 96.7% of the samples based on their Bayesian networks—ASIA, ALARM, HEPAR2, and ANDES, respectively. When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests and the average distribution difference is approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, and the average distribution difference is approximately 0.24.

维持概率分布不变的启发式抽样算法

A Heuristic Sampling Method for Maintaining the Probability Distribution