We use cookies to improve your experience with our site.

Indexed in:

SCIE, EI, Scopus, INSPEC, DBLP, CSCD, etc.

Submission System
(Author / Reviewer / Editor)
Shuai Han, Xian-Min Liu, Jian-Zhong Li. Efficient Partitioning Method for Optimizing the Compression on Array Data[J]. Journal of Computer Science and Technology, 2022, 37(5): 1049-1067. DOI: 10.1007/s11390-022-2371-7
Citation: Shuai Han, Xian-Min Liu, Jian-Zhong Li. Efficient Partitioning Method for Optimizing the Compression on Array Data[J]. Journal of Computer Science and Technology, 2022, 37(5): 1049-1067. DOI: 10.1007/s11390-022-2371-7

Efficient Partitioning Method for Optimizing the Compression on Array Data

Funds: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61832003 and U1811461.
More Information
  • Author Bio:

    Shuai Han received her B.S. and M.S. degrees in computer science and technology from Harbin Engineering University, Harbin. She is currently pursuing her Ph.D. degree of computer science and technology in Harbin Institute of Technology, Harbin. Her main research interests include data management, query processing.

  • Received Date: March 30, 2022
  • Revised Date: September 13, 2022
  • Accepted Date: September 20, 2022
  • Published Date: September 29, 2022
  • Array partitioning is an important research problem in array management area, since the partitioning strategies have important influence on storage, query evaluation, and other components in array management systems. Meanwhile, compression is highly needed for the array data due to its growing volume. Observing that the array partitioning can affect the compression performance significantly, this paper aims to design the efficient partitioning method for array data to optimize the compression performance. As far as we know, there still lacks research efforts on this problem. In this paper, the problem of array partitioning for optimizing the compression performance (PPCP for short) is firstly proposed. We adopt a popular compression technique which allows to process queries on the compressed data without decompression. Secondly, because the above problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which can explain the core idea of the partitioning algorithms proposed by us. The first principle shows that the compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy which can well support the selection of the partitioning positions heuristically. Supported by the two principles, two greedy strategy based array partitioning algorithms are designed for the independent case and the dependent case respectively. Observing the expensive cost of the algorithm for the dependent case, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, the experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance on both compression and query evaluation.
  • [1]
    Duggan J, Stonebraker M. Incremental elasticity for array databases. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp.409-420. DOI: 10.1145/2588555.2588569.
    [2]
    Li J, Rotem D, Wong H K. A new compression method with fast searching on large databases. In Proc. the 13th International Conference on Very Large Data Bases, Sept. 1987, pp.311-318.
    [3]
    Wang J, Lin C, Papakonstantinou Y, Swanson S. An experimental study of bitmap compression vs. inverted list compression. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.993-1008. DOI: 10.1145/3035918.3064007.
    [4]
    Damme P, Ungethüm A, Hildebrandt J, Habich D, Lehner W. From a comprehensive experimental survey to a cost-based selection strategy for lightweight integer compression algorithms. ACM Transactions on Database Systems, 2019, 44(3): Article No. 9. DOI: 10.1145/3323991.
    [5]
    Sarawagi S, Stonebraker M. Efficient organization of large multidimensional arrays. In Proc. the 1994 International Conference on Data Engineering, Feb. 1994, pp.328-336. DOI: 10.1109/ICDE.1994.283048.
    [6]
    Otoo E J, Rotem D, Seshadri S. Optimal chunking of large multidimensional arrays for data warehousing. In Proc. the 10th International Workshop on Data Warehousing and OLAP, Nov. 2007, pp.25-32. DOI: 10.1145/1317331.1317337.
    [7]
    Stonebraker M, Brown P, Poliakov A, Raman S. The architecture of SciDB. In Proc. the 23rd International Conference on Scientific and Statistical Database Management, Jul. 2011, pp.1-16. DOI: 10.1007/978-3-642-22351-8.
    [8]
    Soroush E, Balazinska M, Wang D. ArrayStore: A storage manager for complex parallel array processing. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2011, pp.253-264. DOI: 10.1145/1989323.1989351.
    [9]
    Rusu F, Cheng Y. A survey on array storage, query languages, and systems. arXiv:1302.0103, 2013. https://arxiv.org/abs/1302.0103, Feb. 2022.
    [10]
    Chang C, Moon B, Acharya A, Shock C, Sussman A, Saltz J. Titan: A high-performance remote-sensing database. In Proc. the 13th International Conference on Data Engineering, April 1997, pp.375-384. DOI: 10.1109/ICDE.1997.581883.
    [11]
    Marathe A P, Salem K. Query processing techniques for arrays. The VLDB Journal, 2002, 11(1): 68-91. DOI: 10.1007/s007780200062.
    [12]
    Brown P G. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2010, pp.963-968. DOI: 10.1145/1807167.1807271.
    [13]
    Papadopoulos S, Datta K, Madden S, Mattson T. The TileDB array data storage manager. Proceedings of the VLDB Endowment, 2016, 10(4): 349-360. DOI: 10.14778/3025111.3025117.
    [14]
    Kaddoura M, Ranka S, Wang A. Array decompositions for nonuniform computational environments. Journal of Parallel and Distributed Computing, 1996, 36(2): 91-105. DOI: 10.1006/jpdc.1996.0092.
    [15]
    Muthukrishnan S, Suel T. Approximation algorithms for array partitioning problems. Journal of Algorithms, 2005, 54(1): 85-104. DOI: 10.1016/j.jalgor.2003.11.006.
    [16]
    Moudani W, Hussein M, Moukhtar M, Mora-Camino F. Intelligent data compression approach in multidimensional data warehouse. International Journal on Computer Science and Engineering, 2011, 3(2): 951-960.
    [17]
    Rahmani M M, Al-Mahmud A, Hossen M A, Rahman M, Ahmed M R, Sohan M F. A comparative analysis of traditional and modern data compression schemes for large multi-dimensional extendible array. In Proc. the 2019 International Conference on Electrical, Computer and Communication Engineering, Feb. 2019. DOI: 10.1109/ECACE.2019.8679182.
    [18]
    Cormode G, Garofalakis M, Haas P J et al. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends® in Databases, 2011, 4(1/2/3): 1-294. DOI: 10.1561/1900000004.
    [19]
    Vitter J S. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1985, 11(1): 37-57. DOI: 10.1145/3147.3165.
    [20]
    Deshpande A, Garofalakis M, Rastogi R. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proc. the 2001 ACM SIGMOD International Conference on Management of Data, May 2001, pp.199-210. DOI: 10.1145/375663.375685.
    [21]
    Reiner B, Hahn K, Höfling G, Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proc. the 13th International Conference on Database and Expert Systems Applications, Sept. 2002, pp.689-700. DOI: 10.1007/3-540-46146-9.
    [22]
    Chai C, Li G, Li J, Deng D, Feng J. Cost-effective crowdsourced entity resolution: A partial-order approach. In Proc. the 2016 International Conference on Management of Data, Jun. 26-Jul. 1, 2016, pp.969-984. DOI: 10.1145/2882903.2915252.
    [23]
    Chai C, Liu J, Tang N, Li G, Luo Y. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment, 2022, 15(7): 1466-1478. DOI: 10.14778/3523210.3523223.
    [24]
    Liu J, Chai C, Luo Y, Lou Y, Feng J, Tang N. Feature augmentation with reinforcement learning. In Proc. the 2022 IEEE International Conference on Data Engineering, May 2022, pp.3360-3372. DOI: 10.1109/ICDE53745.2022.00317.
    [25]
    Manne F, Sørevik T. Partitioning an array onto a mesh of processors. In Proc. the 3rd International Workshop on Applied Parallel Computing, Aug. 1996, pp.467-477. DOI: 10.1007/3-540-62095-8.
    [26]
    Mingozzi A, Ricciardelli S, Spadoni M. Partitioning a matrix to minimize the maximum cost. Discrete Applied Mathematics, 1995, 62(1/2/3): 221-248. DOI: 10.1016/0166-218X(94)00154-6.
    [27]
    Nicol D M. Rectilinear partitioning of irregular data parallel computations. Journal of Parallel and Distributed Computing, 1994, 23(2): 119-134. DOI: 10.1006/jpdc.1994.1126.
    [28]
    Saule E, Baş E Ö, Çatalyürek Ü V. Load-balancing spatially located computations using rectangular partitions. Journal of Parallel and Distributed Computing, 2012, 72(10): 1201-1214. DOI: 10.1016/j.jpdc.2012.05.013.
    [29]
    Roth M A, Van Horn S J. Database compression. SIGMOD Record, 1993, 22(3): 31-39. DOI: 10.1145/163090.163096.
    [30]
    Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Software–Practice and Experience, 2015, 45(1): 1-29. DOI: 10.1002/spe.2203.
    [31]
    Schlegel B, Gemulla R, Lehner W. Fast integer compression using SIMD instructions. In Proc. the 6th International Workshop on Data Management on New Hardware, June 2010, pp.34-40. DOI: 10.1145/1869389.1869394.
    [32]
    Antoshenkov G. Byte-aligned bitmap compression. In Proc. the 1995 Data Compression Conference, Mar. 1995, pp.476. DOI: 10.1109/DCC.1995.515586.
    [33]
    Wu K, Otoo E J, Shoshani A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 2006, 31(1): 1-38. DOI: 10.1145/1132863.1132864.
    [34]
    Lemire D, Kaser O, Aouiche K. Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 2010, 69(1): 3-28. DOI: 10.1016/j.datak.2009.08.006.
    [35]
    Colantonio A, Di Pietro R. CONCISE: Compressed ‘n’ composable integer set. Information Processing Letters, 2010, 110(16): 644-650. DOI: 10.1016/j.ipl.2010.05.018.
    [36]
    Chambi S, Lemire D, Kaser O, Godin R. Better bitmap performance with roaring bitmaps. Software–Practice and Experience, 2016, 46(5): 709-719. DOI: 10.1002/spe.2325.
    [37]
    Li G, Chai C, Fan J, Weng X, Li J, Zheng Y, Li Y, Yu X, Zhang X, Yuan H. CDB: Optimizing queries with crowd-based selections and joins. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.1463-1478. DOI: 10.1145/3035918.3064036.
    [38]
    Chai C, Cao L, Li G, Li J, Luo Y, Madden S. Human-in-the-loop outlier detection. In Proc. the 2020 International Conference on Management of Data, Jun. 2020, pp.19-33. DOI: 10.1145/3318464.3389772.
    [39]
    Huffman D A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 1952, 40(9): 1098-1101. DOI: 10.1109/JRPROC.1952.273898.
    [40]
    Witten I H, Neal R M, Cleary J G. Arithmetic coding for data compression. Communications of the ACM, 1987, 30(6): 520-540. DOI: 10.1145/214762.214771.
    [41]
    Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343. DOI: 10.1109/TIT.1977.1055714.
    [42]
    Agarwal R, Khandelwal A, Stoica I. Succinct: Enabling queries on compressed data. In Proc. the 12th USENIX Symposium on Networked Systems Design and Implementation, May 2015, pp.337-350.
    [43]
    Mutlu O, Shen X, Zhai J, Zhang F. Potential of a method for text analytics directly on compressed data. Technical Report, North Carolina State University, 2017. https://repository.lib.ncsu.edu/bitstream/handle/1840.20/ 35582/TR-2017-4.pdf?sequence=1&isAllowed=y, Feb. 2022.
    [44]
    Zhang F, Zhai J, Shen X, Mutlu O, Chen W. Zwift: A programming framework for high performance text analytics on compressed data. In Proc. the 32nd International Conference on Supercomputing, Jun. 2018, pp.195-206. DOI: 10.1145/3205289.3205325.
    [45]
    Nevill-Manning C G, Witten I H. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 1997, 7: 67-82. DOI: 10.1613/jair.374.
    [46]
    Zhang F, Pan Z, Zhou Y, Zhai J, Shen X, Mutlu O, Du X. G-TADOC: Enabling efficient GPU-based text analytics without decompression. In Proc. the 2021 IEEE International Conference on Data Engineering, Apr. 2021, pp.1679-1690. DOI: 10.1109/ICDE51399.2021.00148.
    [47]
    Zhang F, Zhai J, Shen X, Mutlu O, Du X. POCLib: A high-performance framework for enabling near orthogonal processing on compression. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(2): 459-475. DOI: 10.1109/TPDS.2021.3093234.
  • Related Articles

    [1]Yu-Ping Shen, Xi-Shun Zhao. NP-Logic Systems and Model-Equivalence Reductions[J]. Journal of Computer Science and Technology, 2010, 25(6): 1321-1326. DOI: 10.1007/s11390-010-1104-5
    [2]Ya-Feng Wu, Yin-Long Xu, Guo-Liang Chen. Approximation Algorithms for Steiner Connected Dominating Set[J]. Journal of Computer Science and Technology, 2005, 20(5): 713-716.
    [3]Yong Zhang, Hong Zhu. Approximation Algorithm for Weighted Weak Vertex Cover[J]. Journal of Computer Science and Technology, 2004, 19(6).
    [4]DONG SheQin, HONG XianLong, WU YuLiang, GU Jun. Deterministic VLSI Block Placement Algorithm Using Less Flexibility First Principle[J]. Journal of Computer Science and Technology, 2003, 18(6).
    [5]LIU Tian. A Note on Closeness between NP-Hard Sets and C=P[J]. Journal of Computer Science and Technology, 2000, 15(2): 194-195.
    [6]Chen Bin, Hong Yong. FCV_1: A New Fast Greedy Covering Algorithm[J]. Journal of Computer Science and Technology, 1998, 13(4): 369-374.
    [7]Cheng Qi, Zhu Hong. MNP: A Class of NP Optimization Problems[J]. Journal of Computer Science and Technology, 1997, 12(4): 306-313.
    [8]Chen Bin, Hong Jiarong, Wang Yadong. The Minimum Feature Subset Selection Problem[J]. Journal of Computer Science and Technology, 1997, 12(2): 145-153.
    [9]Jiamg Xiong. Some Undecidable Problems on Approximability of NP Optimization Problems[J]. Journal of Computer Science and Technology, 1996, 11(2): 126-132.
    [10]Yu Xiangdong. Some Hard Examples for the Resolution Method[J]. Journal of Computer Science and Technology, 1990, 5(3): 302-304.
  • Others

  • Cited by

    Periodical cited type(2)

    1. Chenguang Fang, Shaoxu Song, Haoquan Guan, et al. Grouping Time Series for Efficient Columnar Storage. Proceedings of the ACM on Management of Data, 2023, 1(1): 1. DOI:10.1145/3588703
    2. Xinyi Zhang, Zhuo Chang, Hong Wu, et al. A Unified and Efficient Coordinating Framework for Autonomous DBMS Tuning. Proceedings of the ACM on Management of Data, 2023, 1(2): 1. DOI:10.1145/3589331

    Other cited types(0)

Catalog

    Article views (139) PDF downloads (2) Cited by(2)
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return