A Heuristic Sampling Method for Maintaining the Probability Distribution

Jiao-Yun Yang; Jun-Da Wang; Yi-Fang Zhang; Wen-Juan Cheng; Lian Li

doi:10.1007/s11390-020-0065-6

Jiao-Yun Yang, Jun-Da Wang, Yi-Fang Zhang, Wen-Juan Cheng, Lian Li. A Heuristic Sampling Method for Maintaining the Probability Distribution[J]. Journal of Computer Science and Technology, 2021, 36(4): 896-909. DOI: 10.1007/s11390-020-0065-6

Abstract

Sampling is a fundamental method for generating data subsets. Because many data analysis methods are built on probability distributions, maintaining the distribution when sampling helps to ensure good data analysis performance. However, sampling a minimum subset while maintaining the probability distribution remains an open problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks, and we use the chi-square test to formulate a sampling problem in which the sampled subset must pass the distribution test, so that the distribution is maintained. Furthermore, a heuristic sampling algorithm is proposed to generate the required subset by designing two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four types of datasets with a size of 60 000 show that when the significance level α is set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1% and 96.7% of the samples for the datasets based on the Bayesian networks ASIA, ALARM, HEPAR2, and ANDES, respectively. When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests, and the average distribution difference is approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, and the average distribution difference is approximately 0.24.
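
As a rough illustration of the distribution test mentioned in the abstract, the following minimal sketch checks whether a subset of a single discrete variable is consistent with the distribution of the full dataset using Pearson's chi-square statistic at α = 0.05. It is not the authors' algorithm: the paper applies the test to conditional distributions derived from a Bayesian network and searches for a minimum subset with scoring functions, whereas this sketch only covers one variable and one fixed subset. The function names and the small critical-value table are assumptions introduced here for illustration.

```python
from collections import Counter

# Standard chi-square critical values at alpha = 0.05, keyed by degrees of
# freedom. Only df 1..5 are listed (an assumption to keep the sketch small).
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_statistic(subset, population):
    """Pearson statistic of the subset's counts against the counts
    expected from the population's empirical proportions.

    Assumes every value in `subset` also occurs in `population`.
    """
    pop_counts = Counter(population)
    sub_counts = Counter(subset)
    n_pop, n_sub = len(population), len(subset)
    stat = 0.0
    for value, count in pop_counts.items():
        expected = n_sub * count / n_pop      # expected count in the subset
        observed = sub_counts.get(value, 0)   # observed count in the subset
        stat += (observed - expected) ** 2 / expected
    return stat


def passes_distribution_test(subset, population):
    """Return True if the subset is not rejected at alpha = 0.05."""
    df = len(set(population)) - 1
    if df == 0:
        return True  # a single category leaves nothing to test
    # Raises KeyError for df > 5, since the table above is deliberately short.
    return chi_square_statistic(subset, population) <= CRITICAL_0_05[df]


if __name__ == "__main__":
    population = ["a"] * 600 + ["b"] * 400
    balanced = ["a"] * 6 + ["b"] * 4   # matches the 60/40 split
    skewed = ["a"] * 10                # drops category "b" entirely
    print(passes_distribution_test(balanced, population))
    print(passes_distribution_test(skewed, population))
```

Note that Pearson's approximation is unreliable when expected counts are very small, so a practical check would likely need larger subsets or an exact test.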
