Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (4): 896-909. DOI: 10.1007/s11390-020-0065-6

Special Issue: Artificial Intelligence and Pattern Recognition

• Regular Paper •

A Heuristic Sampling Method for Maintaining the Probability Distribution

Jiao-Yun Yang1,2,3, Member, CCF, Jun-Da Wang1,2,4,*, Yi-Fang Zhang1,2,3, Wen-Juan Cheng1,2,3, and Lian Li1,2,3, Member, CCF        

  1 Key Laboratory of Knowledge Engineering with Big Data of Ministry of Education, Hefei University of Technology, Hefei 230601, China;
    2 National Smart Eldercare International Science and Technology Cooperation Base, Hefei University of Technology, Hefei 230601, China;
    3 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China;
    4 School of Mathematics, Hefei University of Technology, Hefei 230601, China
  • Received: 2019-10-05 Revised: 2020-08-15 Online: 2021-07-05 Published: 2021-07-30
  • Contact: Jun-Da Wang
  • About author:Jiao-Yun Yang received his B.S. degree and Ph.D. degree in computer science from University of Science and Technology of China, Hefei, in 2009 and 2014, respectively. He is an associate professor in the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei. His research interests include heuristic search, machine learning and health computing. He is a member of CCF.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1003204, Anhui Provincial Key Technologies Research and Development Program under Grant Nos. 1804b06020378 and 1704e1002221, and the National 111 Project of China under Grant No. B14025.

Sampling is a fundamental method for generating data subsets. Because many data analysis methods are built on probability distributions, maintaining the distribution when sampling helps to ensure good data analysis performance. However, sampling a minimum subset while maintaining the probability distribution remains an open problem. In this paper, we decompose a joint probability distribution into a product of conditional probabilities based on Bayesian networks and use the chi-square test to formulate a sampling problem that requires the sampled subset to pass the distribution test. Furthermore, we propose a heuristic sampling algorithm that generates the required subset using two scoring functions: one based on the chi-square test and the other based on likelihood functions. Experiments on four datasets of 60,000 samples each, generated from the Bayesian networks ASIA, ALARM, HEPAR2, and ANDES, show that with the significance level α set to 0.05, the algorithm can exclude 99.9%, 99.0%, 93.1%, and 96.7% of the samples, respectively. When subsets of the same size are sampled, the subset generated by our algorithm passes all the distribution tests and the average distribution difference is approximately 0.03; by contrast, the subsets generated by random sampling pass only 83.8% of the tests, and the average distribution difference is approximately 0.24.
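The acceptance criterion described above — a sampled subset is valid only if it passes a chi-square goodness-of-fit test against the full data's distribution — can be sketched as follows. This is a minimal single-variable illustration, not the paper's algorithm (which applies the test to each conditional distribution in a Bayesian network and searches heuristically); the function names are hypothetical, and 3.841 is the standard chi-square critical value for one degree of freedom at α = 0.05.

```python
from collections import Counter

def chi_square_statistic(subset, full_data, categories):
    """Pearson chi-square statistic comparing the subset's category counts
    with the counts expected under the full data's distribution."""
    full_counts = Counter(full_data)
    sub_counts = Counter(subset)
    n_full, n_sub = len(full_data), len(subset)
    stat = 0.0
    for c in categories:
        expected = n_sub * full_counts[c] / n_full  # expected count in subset
        observed = sub_counts[c]
        stat += (observed - expected) ** 2 / expected
    return stat

def passes_distribution_test(subset, full_data, categories, critical_value):
    """Accept the subset only if it is not significantly different from
    the full data's distribution (statistic below the critical value)."""
    return chi_square_statistic(subset, full_data, categories) < critical_value

CHI2_CRIT_DF1_005 = 3.841  # chi-square critical value: df = 1, alpha = 0.05

# Binary variable: the full data is 70% "yes", 30% "no".
full = ["yes"] * 700 + ["no"] * 300
balanced = ["yes"] * 70 + ["no"] * 30  # preserves the 70/30 split -> passes
skewed = ["yes"] * 50 + ["no"] * 50    # distorts the split -> fails
```

A subset that keeps the 70/30 split yields a statistic of 0 and passes, while the 50/50 subset yields a statistic of about 19 and is rejected; the paper's heuristic search scores candidate subsets so that the retained minimum subset still clears this test for every tested distribution.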

Key words: Bayesian network; chi-square test; sampling; probability distribution


ISSN 1000-9000(Print)

CN 11-2296/TP
