最优化压缩性能的多维数组分块方法

韩帅; 刘显敏; 李建中

doi:10.1007/s11390-022-2371-7

摘要: 在多维数组管理领域，数组划分是一个重要的研究问题，因为数组划分策略可以对数组的存储、查询执行等数组管理系统中其他的部件产生重要影响。与此同时，压缩技术在数据量迅速增长的多维数组中应用广泛。由于观察到数组分块策略对压缩性能有重要影响，本文旨在研究能最优化压缩性能的数组划分算法。在数组的划分方法方面已有较多的研究成果，但是并未有人研究过最优化压缩性能的多维数组分块方法，最优化压缩性能的划分不仅可以节约多维数组的存储空间，还可以将邻近的数据紧密地存储在一起，为后续获得较好的查询性能提供保障。本文首先提出最优化压缩性能的多维数组分块问题（PPCP），并给出问题的形式化定义，采用了一种支持无解压计算的压缩方法；接着，由于该问题是NP难问题，本文提出两个规则来寻找最优的数组划分方案，其中第一个规则说明数组压缩性能可以被优化如果数组被划分成稀疏程度不同的两部分，第二个规则提出了一个支持启发式寻找最优划分点的贪心策略。基于这两个策略，本文在维度相互独立、维度依赖的两种假设下，分别提出最优化压缩性能的多维数组分块的启发式算法。由于在维度独立假设下的数组分块算法的时间复杂度较高，因此本文提出了线性时间的基于抽样和维度分组的优化算法。最后，实验证明了本文所提最优化压缩性能的多维数组分块算法具有较好的压缩性能和查询性能。

Abstract: Array partitioning is an important research problem in array management area, since the partitioning strategies have important influence on storage, query evaluation, and other components in array management systems. Meanwhile, compression is highly needed for the array data due to its growing volume. Observing that the array partitioning can affect the compression performance significantly, this paper aims to design the efficient partitioning method for array data to optimize the compression performance. As far as we know, there still lacks research efforts on this problem. In this paper, the problem of array partitioning for optimizing the compression performance (PPCP for short) is firstly proposed. We adopt a popular compression technique which allows to process queries on the compressed data without decompression. Secondly, because the above problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which can explain the core idea of the partitioning algorithms proposed by us. The first principle shows that the compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy which can well support the selection of the partitioning positions heuristically. Supported by the two principles, two greedy strategy based array partitioning algorithms are designed for the independent case and the dependent case respectively. Observing the expensive cost of the algorithm for the dependent case, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, the experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance on both compression and query evaluation.

最优化压缩性能的多维数组分块方法

Efficient Partitioning Method for Optimizing the Compression on Array Data