2016, Vol. 31, Issue (1): 147-166. doi: 10.1007/s11390-016-1618-6

Special Issue: Data Management and Data Mining; Theory and Algorithms

• Data Management and Data Mining

AS-Index: A Structure for String Search Using n-Grams and Algebraic Signatures

Camelia Constantin1, Cédric du Mouza2, Witold Litwin3, Fellow, ACM, Philippe Rigaux2, and Thomas Schwarz4   

  1 LIP6 Laboratory, University Pierre et Marie Curie, Paris 75005, France;
  2 CEDRIC Laboratory, Conservatoire National des Arts et Métiers, Paris 75003, France;
  3 LAMSADE Laboratory, University Paris-Dauphine, Paris 75775, France;
  4 DICC Laboratory, Universidad Católica del Uruguay, Montevideo 11200, Uruguay
  • Received: 2014-02-20 Revised: 2014-07-23 Online: 2016-01-05 Published: 2016-01-05
  • About author: Witold Litwin is an exceptional-class professor of computer science at the University Paris-Dauphine. His research areas are multidatabase systems, data structures, and scalable distributed data structures (SDDSs). For this work, Dr. Litwin was named ACM Fellow in 2001. At Dauphine, Dr. Litwin was the director of the Centre d'Etudes et de Recherches en Informatique Appliquée (CERIA) from 1996 to 2008. Previously, he was an invited lecturer and scientist at several universities, including UC Berkeley (1992 to 1994) and Stanford University (1990 to 1991). He has written over 160 research papers and edited or contributed to 11 books. He is the recipient of three US patents, filed by HP Labs and IBM Research & CSIS (GMU). His profile is public at Google Scholar, with over 6400 citations.
  • Supported by:

    This work has been partially supported by the Advanced European Research Council Grant Webdam.

We present the AS-Index, a new index structure for exact string search in disk-resident databases. The AS-Index relies on a classical inverted file structure; its main innovation is a probabilistic search based on the properties of algebraic signatures, which are used for both n-gram hashing and pattern search. Specifically, the properties of our signatures allow a search to be carried out by inspecting only two of the posting lists. The algorithm thus has the unique feature of requiring a constant number of disk accesses, independent of both the pattern size and the database size. We conduct extensive experiments on large datasets to evaluate the behavior of our index. They confirm that it consistently provides search performance proportional to the two disk accesses needed to obtain the posting lists. This makes our structure a choice of interest for applications that require very fast lookups in large textual databases. We describe the index structure, our use of algebraic signatures, and the search algorithm. We discuss the operational trade-offs arising from the parameters that affect the behavior of our structure, and present a theoretical and experimental performance analysis. We then compare the AS-Index with state-of-the-art alternatives and show that 1) its construction time matches that of its competitors, owing to the similarity of the structures, and 2) its search time consistently outperforms the standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of our search method.
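The idea of answering a search from only two posting lists can be illustrated with a small in-memory sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the field GF(2^8) with generator polynomial 0x11D, posting lists keyed by the raw n-gram rather than by a hashed n-gram signature, and all function names (`build_index`, `search`, and so on). Each posting entry stores a position together with the cumulative signature of the text preceding it, so that the signature of any candidate substring can be recovered from the entries for the pattern's first and last n-grams alone.

```python
from collections import defaultdict

# Log/antilog tables for GF(2^8); alpha = 2 is primitive for polynomial 0x11D.
# (Field size and polynomial are assumptions chosen for this sketch.)
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha_pow(i):
    """alpha^i, using the fact that alpha has order 255."""
    return EXP[i % 255]


def signature(data):
    """Algebraic signature: XOR-sum of data[k] * alpha^k."""
    s = 0
    for k, byte in enumerate(data):
        s ^= gf_mul(byte, alpha_pow(k))
    return s


def build_index(text, n):
    """Map each n-gram to a list of (position, cumulative signature before position)."""
    postings = defaultdict(list)
    cum = 0  # signature of text[:pos], with symbol k weighted by alpha^k
    for pos in range(len(text) - n + 1):
        postings[text[pos:pos + n]].append((pos, cum))
        cum ^= gf_mul(text[pos], alpha_pow(pos))
    return postings


def search(postings, pattern, n):
    """Return candidate match positions using only two posting lists."""
    length = len(pattern)
    if length < n:
        raise ValueError("pattern must be at least n symbols long")
    first, last = pattern[:n], pattern[length - n:]
    sig_pattern = signature(pattern)
    sig_last = signature(last)
    ends = dict(postings.get(last, []))  # position -> cumulative signature
    hits = []
    for p, cum_p in postings.get(first, []):
        q = p + length - n  # where the last n-gram must start
        if q not in ends:
            continue
        # Cumulative signature up to the end of the candidate substring.
        cum_end = ends[q] ^ gf_mul(sig_last, alpha_pow(q))
        # cum_p ^ cum_end equals alpha^p * signature(text[p:p+length]).
        if cum_p ^ cum_end == gf_mul(sig_pattern, alpha_pow(p)):
            hits.append(p)
    return hits


text = b"abracadabra abracadabra"
index = build_index(text, 3)
candidates = search(index, b"cadab", 3)
```

In this sketch every true occurrence is always reported, while a false positive can slip through whenever two different strings share a signature. With 8-bit signatures that happens far too often for real use; a practical design would need wider signatures to make such collisions rare.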


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved