Journal of Computer Science and Technology ›› 2019, Vol. 34 ›› Issue (2): 494-506. doi: 10.1007/s11390-019-1921-0

Special Topic: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining



Lossless Compression of Random Forests

Amichai Painsky1, Saharon Rosset2   

  1 School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 91904, Israel;
  2 Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel
  • Received: 2018-02-08; Revised: 2019-01-03; Online: 2019-03-05; Published: 2019-03-16
  • About author: Amichai Painsky received his B.Sc. degree in electrical engineering from Tel Aviv University, Tel Aviv, in 2007, his M.Eng. degree in electrical engineering from Princeton University, Princeton, in 2009, and his Ph.D. degree in statistics from the School of Mathematical Sciences at Tel Aviv University, Tel Aviv. He is currently a post-doctoral fellow, co-affiliated with the Israeli Center of Research Excellence in Algorithms (ICORE) at the Hebrew University of Jerusalem, Jerusalem, and the Signals, Information and Algorithms (SIA) Lab at MIT (Massachusetts Institute of Technology). His research interests include data mining, machine learning, statistical learning, and their connection to information theory.
  • Supported by:
    This work was supported by the Israel Science Foundation under Grant No. 1487/12 and by a Returning Scientist Fellowship from the Israeli Ministry of Immigration awarded to Amichai Painsky.


Abstract: Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber-based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble's trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.
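
As a rough illustration of the model-clustering step described in the abstract, the sketch below groups per-leaf class-probability vectors with a KL-divergence (Bregman) hard-clustering loop; for any Bregman divergence the optimal cluster representative is simply the arithmetic mean of its members, which keeps the update step trivial. This is only a minimal sketch under assumed names: the number of clusters k, the smoothing constant eps, and the toy leaf_probs data are illustrative, not the authors' implementation.

    import numpy as np

    def kl_divergence(p_rows, q, eps=1e-12):
        # KL(p || q) of each row of p_rows against a single distribution q.
        p = np.clip(p_rows, eps, 1.0)
        q = np.clip(q, eps, 1.0)
        return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

    def bregman_kmeans(probs, k, n_iter=100, seed=0):
        # Hard-cluster probability vectors (rows of probs) into k groups.
        rng = np.random.default_rng(seed)
        centroids = probs[rng.choice(len(probs), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: nearest centroid under KL(point || centroid).
            dists = np.stack([kl_divergence(probs, c) for c in centroids], axis=1)
            labels = dists.argmin(axis=1)
            # Update step: the arithmetic mean of the assigned vectors
            # minimizes the total divergence for any Bregman divergence.
            new_centroids = np.array([probs[labels == j].mean(axis=0)
                                      if np.any(labels == j) else centroids[j]
                                      for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage: 200 leaf distributions over 3 classes, summarized by 4 models.
    leaf_probs = np.random.default_rng(1).dirichlet(np.ones(3), size=200)
    labels, models = bregman_kmeans(leaf_probs, k=4)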

Key words: entropy coding, lossless compression, lossy compression, random forest
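
To make the entropy-coding aspect concrete, the following minimal sketch Huffman-codes a stream of per-leaf cluster indices such as the labels produced above, so that frequently used cluster models receive short codewords and the coded size approaches the empirical entropy of the label distribution. The labels example and the 2-bits-per-label uncoded baseline are illustrative assumptions, not the coder used in the paper.

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        # Return {symbol: bitstring} for a Huffman code of the symbol stream.
        freq = Counter(symbols)
        if len(freq) == 1:                       # degenerate single-symbol stream
            return {next(iter(freq)): "0"}
        # Heap entries: (weight, tie-breaker, {symbol: partial codeword}).
        heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)
            w2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + b for s, b in c1.items()}
            merged.update({s: "1" + b for s, b in c2.items()})
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        return heap[0][2]

    labels = [0, 0, 0, 1, 2, 0, 1, 0]            # e.g., one cluster index per leaf
    code = huffman_code(labels)
    coded_bits = sum(len(code[s]) for s in labels)
    print(code, coded_bits, "bits vs", 2 * len(labels), "bits at 2 bits/label")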
