O2iJoin:一种基于索引的重叠区间高效连接算法

doi:10.1007/s11390-018-1872-x

摘要: 时态关系通常为元组关联时间区间来表示元组的有效时间。时态关系上的重叠区间连接在各种查询中至关重要。因此，人们提出了很多重叠区间连接的高效算法，这些算法大多都基于树状的索引结构，如四叉树、B+树和区间树，它们在访问数据时不可避免地访问树种的深路径，因而它们不如基于数据划分或平面扫描的算法高效。本文提出一种称为O2i索引的索引结构，并基于它设计了一种高效的重叠区间连接算法。O2i索引是一种两层扁平结构。第一层用数组存储所有区间的终点，并借助两个函数建立近似的嵌套结构；第二层用倒排表管理满足近似嵌套结构的所有区间。利用O2i索引，连接算法仅需访问必须访问的倒排链表而跳过其他链表。理论分析和实验结果表明，本文提出的算法同基于数据划分和平面扫描的最佳算法一样高效，时态大数据重叠区间连接操作提供了新的选择。
目的： 时间/空间高效地在时态大数据上执行重叠区间连接操作。
创新点： 提出了一种扁平的两层索引结构和基于这种索引结构的重叠区间连接算法，使得IO操作仅需一遍且内存开销可控的情况下，高效地完成时态大数据上的重叠区间连接操作。
方法： 利用一维数组管理所有区间的终点，通过新定义的前驱函数和后继函数在这些终点上建立近似的嵌套结构。执行连接操作时，将关系R中的每个区间对齐到近似嵌套结构上，以此确定关系S中的哪些区间需要访问。通过理论分析和算法设计，最终确保索引结构仅被IO一次就可以高效完成R和S上的重叠区间连接操作。
结论： 本文提出的两层扁平区间索引结构可以在O（nlogn）时间和O（nlogn）空间内建立，其中n是被索引的时态关系的元组数。基于这种索引结构的算法可以在O（n_z+mlog²n）的CPU时间开销和一遍IO读写开销下高效完成两个时态关系的重叠区间连接操作，其中n_z是连接结果的数量，m，n是参与连接的两个关系的大小。人工数据集和真实数据集上的实验表明，本文提出的算法与现有基于划分和平面扫描的最佳算法同样高效。它为时态大数据重叠区间连接操作提供了新的选择。

Abstract: Time intervals are often associated with tuples to represent their valid time in temporal relations, where overlap join is crucial for various kinds of queries. Many existing overlap join algorithms use indices based on tree structures such as quad-tree, B⁺-tree and interval tree. These algorithms usually have high CPU cost since deep path traversals are unavoidable, which makes them not so competitive as data-partition or plane-sweep based algorithms. This paper proposes an efficient overlap join algorithm based on a new two-layer flat index named as Overlap Interval Inverted Index (i.e., O2i Index). It uses an array to record the end points of intervals and approximates the nesting structures of intervals via two functions in the first layer, and the second layer uses inverted lists to trace all intervals satisfying the approximated nesting structures. With the help of the new index, the join algorithm only visits the must-be-scanned lists and skips all others. Analyses and experiments on both real and synthetic datasets show that the proposed algorithm is as competitive as the state-of-the-art algorithms.

O2iJoin:一种基于索引的重叠区间高效连接算法

O2iJoin: An Efficient Index-Based Algorithm for Overlap Interval Join