We use cookies to improve your experience with our site.

AS-Index:一种使用n-Grams和代数签名的串搜索结构

AS-Index:A Structure for String Search Using n-Grams and Algebraic Signatures

  • 摘要: 本文介绍了AS-Index,一种在磁盘数据库中精确搜索字符串的新的索引结构。AS-Index基于一种传统倒排文件结构,其主要创新是为一种概率搜索,该搜索基于用于n-grams散列和模式搜索的代数签名的属性。具体来说,通过检查发布列表中仅两个磁盘的访问,我们的签名属性允许实现一次搜索。因此,文中的算法要求特定次数的磁盘访问,该次数与模式大小和数据库规模无关。我们在大型数据库上做了广泛的实验,以评估我们的索引行为。实验表明我们的索引结构提供了稳定的搜索性能,该性能与获得发布列表的两个磁盘的访问成比例。这让我们的结构成为要求在大型文本数据库中进行快速检索的应用的一种选择。论文描述了索引结构,代数签名的使用,以及搜索算法。基于影响结构行为的参数,我们探讨了操作性的权衡,并给出了理论和实验性能分析。此外,我们对比了AS-Index和其它先进的索引结构。结果表明:1)因为结构的相似性,它的构造时间与其它结构的相匹配;2)关于检索时间,得益于本文所提出搜索方法的核心:签名计算所实现的对数据的较少的访问,本文方法稳定地优于标准方法。

     

    Abstract: We present the AS-Index, a new index structure for exact string search in disk resident databases. AS-Index relies on a classical inverted file structure, whose main innovation is a probabilistic search based on the properties of algebraic signatures used for both n-grams hashing and pattern search. Specifically, the properties of our signatures allow to carry out a search by inspecting only two of the posting lists. The algorithm thus enjoys the unique feature of requiring a constant number of disk accesses, independently from both the pattern size and the database size. We conduct extensive experiments on large datasets to evaluate our index behavior. They confirm that it steadily provides a search performance proportional to the two disk accesses necessary to obtain the posting lists. This makes our structure a choice of interest for the class of applications that require very fast lookups in large textual databases. We describe the index structure, our use of algebraic signatures, and the search algorithm. We discuss the operational trade-offs based on the parameters that affect the behavior of our structure, and present the theoretical and experimental performance analysis. We next compare the AS-Index with the state-of-the-art alternatives and show that 1) its construction time matches that of its competitors, due to the similarity of structures, 2) as for search time, it constantly outperforms the standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of our search method.

     

/

返回文章
返回