1. Department of Computer Science, Hefei University of Technology, Hefei 230009, China;
2. Department of Computer Science, University of Vermont, Burlington, VT 05405, U. S. A. ;
3. Department of Computer Science and Technology, Hefei Normal University, Hefei 230601, China
Abstract Pattern matching with wildcards (PMW) has great theoretical and practical significance in bioinformatics, information retrieval, and pattern mining. Due to the uncertainty of wildcards, not only the number of all matches is exponential with respect to the maximal gap flexibility and the pattern length, but the matching positions in PMW are also hard to choose. The objective to count the maximal number of matches one by one is computationally infeasible. Therefore, rather than solving the generic PMW problem, many research efforts have further defined new problems within PMW according to different application backgrounds. To break through the limitations of either fixing the number or allowing an unbounded number of wildcards, pattern matching with flexible wildcards (PMFW) allows the users to control the ranges of wildcards. In this paper, we provide a survey on the state-of-the-art algorithms for PMFW, with detailed analyses and comparisons, and discuss challenges and opportunities in PMFW research and applications
This paper is supported in part by the National Natural Science Foundation of China under Grant Nos. 61229301 and 60828005, the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China, under Grant No. IRT13059, and the National Science Foundation (NSF) of USA under Grant No. 0514819.
About author: Xindong Wu is a Yangtze RiverScholar in the School of Computer Science and Information Engineering at the Hefei University of Technology, China, a professor of computer science at the University of Vermont, USA, and a fellow of IEEE and AAAS. He received his B.S. and M.S. degrees in computer science from the Hefei University of Technology, China, and his Ph.D. degree in artificial intelligence from the University of Edinburgh, Britain. His research interests include data mining, big data analytics, knowledgebased systems, and Web information exploration. He is currently the steering committee chair of the IEEE International Conference on Data Mining (ICDM), the editorin-chief of Knowledge and Information Systems (KAIS, by Springer), and a series editor-in-chief of the Springer Book Series on Advanced Information and Knowledge Processing (AI & KP). He was the editor-in-chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE, by the IEEE Computer Society) between 2005 and 2008. He served as program committee chair/co-chair for the 2003 IEEE International Conference on Data Mining, the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, and the 19th ACM Conference on Information and Knowledge Management.
Cite this article:
Xindong Wu, Ji-Peng Qiang, Fei Xie.Pattern Matching with Flexible Wildcards[J] Journal of Computer Science and Technology, 2014,V29(5): 740-750
 Cole J R, Chai B, Farris R J et al. The Ribosomal Database Project (RDP-II): Sequences and tools for high-throughput rRNA analysis. Nucleic Acids Research, 2005, 33(Database Issue): 294-296. Mendivelso J, PinzÓn Y, Lee I. Finding overlaps within regular expressions with variable-length gaps. In Proc. the 2013 Research in Adaptive and Convergent Systems, Oct. 2013, pp.16-21. Patnaik D, Laxman S, Chandramouli B, Ramakrishnan N. A general streaming algorithm for pattern discovery. Knowledge and Information Systems, 2013, 37(3): 585-610. Xie F, Wu X, Hu X et al. Sequential pattern mining with wildcards. In Proc. the 22nd IEEE International Conference on Tools with Artificial Intelligence, Oct. 2010, pp.241-247. Ding B, Lo D, Han J, Khoo S. Efficient mining of closed repet- Xindong Wu et al.: Pattern Matching with Flexible Wildcards 749 itive gapped subsequences from a sequence database. In Proc. the 25th IEEE International Conference on Data Engineering, Mar. 29-April 2, 2009, pp.1024-1035. El-Ramly M, Stroulia E, Sorenson P. From run-time behavior to usage scenarios: An interaction-pattern mining approach. In Proc. the 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, July 2002, pp.315-324. Manber U, Baeza-Yates R. An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 1991, 37(3): 133-136. de Pablo-Sánchez C, Segura-Bedmar I, Martínez P, IglesiasMaqueda A. Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining. Knowledge and Information Systems, 2013, 35(1): 87-109. Barbieri N, Bonchi F, Manco G. Topic-aware social influence propagation models. Knowledge and Information Systems, 2013, 37(3): 555-584. Wei Y, Dominique F, Jean-Paul B. An automatic keyphrase extraction system for scientific documents. Knowledge and Information Systems, 2013, 34(3): 691-724. Fischer M J, Paterson M S. String matching and other products. Technical Report, Massachusetts Institute of Technology, 1974. Muthukrishnan S, Palem K. Non-standard stringology: Algorithms and complexity. In Proc. the 26th Annual ACM Symposium on Theory of Computing, May 1994, pp.770-779. Indyk P. Faster algorithms for string matching problems: Matching the convolution bound. In Proc. the 39th Symp. Foundations of Computer Science, Nov. 1998, pp.166-173. Clifford P, Clifford R. Simple deterministic wildcard matching. Information Processing Letters, 2007, 101(2): 53-54. Cole R, Hariharan R. Verifying candidate matches in sparse and wildcard matching. In Proc. the 34th Annual ACM Symposium on Theory of Computing, May 2002, pp.592-601. Guo D, Hu X, Xie F, Wu X. Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph. Applied Intelligence, 2013, 39(1): 57-74. Navarro G, Raffinot M. Fast and simple character classes and bounded gaps pattern matching, with application to protein searching. In Proc. the 5th Annual International Conference on Computational Biology, April 2001, pp.231-240. Morgante M, Policriti A, Vitacolonna N, Zuccolo A. Structured motifs search. Journal of Computational Biology, 2005, 12(8): 1065-1082. Cole R, Gottlieb L, Lewenstein M. Dictionary matching and indexing with errors and don't cares. In Proc. the 36th Annual ACM Symposium on the Theory of Computing, June 2004, pp.91-100. Kalai A. Efficient pattern-matching with don't cares. In Proc. the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, January 2002, pp.655-656. Haapasalo T, Silvasti P, Sippu S et al. Online dictionary matching with variable-length gaps. In Proc. the 10th Int. Conf. Experimental Algorithms, May 2011, pp.76-87. Kucherov G, Rusinowitch M. Matching a set of strings with variable length don't cares. Theoretical Computer Science, 1997, 178(1/2): 129-154. Zhang M, Zhang Y, Hu L. A faster algorithm for matching a set of patterns with variable length don't cares. Information Processing Letter, 2010, 110(6): 216-220. Wu X, Zhu X, He Y, Arslan A N. PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine, 2013, 43(5): 481-492. Rahman M S, Iliopoulos C S, Lee I et al. Finding patterns with variable length gaps or don't cares. In Proc. the 12th Annual International Computing and Combinatorics Conference, August 2006, pp.146-155. Bille P, Gørtz I L, Vildhøj H W, Wind D K. String matching with variable length gaps. Theoretical Computer Science, 2012, 443(20): 25-34. Min F, Wu X, Lu Z. Pattern matching with independent wildcard gaps. In Proc. the 8th IEEE Int. Conf. Dependable, Autonomic and Secure Computing, December 2009, pp.194-199. Zhu X, Wu X. Mining complex patterns across sequences with gap requirements. In Proc. the 20th Int. Joint Conf. Artifi-cial Intelligence, January 2007, pp.2934-2940. Chen G, Wu X, Zhu X, Arslan A, He Y. Efficient string matching with wildcards and length constraints. Knowledge and Information Systems, 2006, 10(4): 399-419. Guo D, Hong X, Hu X et al. A bit-parallel algorithm for sequential pattern matching with wildcards. Cybernetics and Systems, 2011, 42(6): 382-401. Lin P C, Li Z X, Lin Y D et al. Profiling and accelerating string matching algorithms in three network content security applications. IEEE Communications Surveys and Tutorials, 2006, 8(2): 24-37. Aho A V, Corasick M J. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 1975, 18(6): 333-340. Tuck N, Sherwood T, Calder B, Varghese G. Deterministic memory-efficient string matching algorithms for intrusion detection. In Proc. the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies, March 2004, pp.2628-2639. Norton M. Optimizing pattern matching for intrusion detection. http://pdf.aminer.org/000/309/890/optimizing pattern match.pdf, July 2014. Boyer R S, Moore J S. A fast string searching algorithm. Communications of the ACM, 1977, 20(10): 762-772. Wu S, Manber U. A fast algorithm for multi-pattern searching. Technical Report TR-94-17, University of Arizona, 1994. Muth R, Manber U. Approximate multiple string search. In Proc. the 7th Annual Symposium on Combinatorial Pattern Matching (CPM), June 1996, pp.75-86. Karp R M, Rabin M O. Efficient randomized patternmatching algorithms. IBM Journal of Research and Development, 1987, 31(2): 249-260. Baeza-Yates R, Gonnet G H. A new approach to text searching. Communications of the ACM, 1992, 35(10): 74-82. Navarro G, Raffinot M. A bit-parallel approach to suffix automata: Fast extended string matching. In Proc. the 9th Annual Symp. Combinatorial Pattern Matching, July 1998, pp.14-33. Navarro G. A guided tour to approximate string matching. ACM Computing Surveys, 2001, 33(1): 31-88. Kim S, Kim Y. A fast multiple string pattern matching algorithm. In Proc. the 17th AoM/IAoM Conference on Computer Science, August 1999. Agrawal R, Srikant R. Mining sequential patterns. In Proc. the 11th Int. Conf. Data Engineering, March 1995, pp.3-14. Akutsu T. Approximate string matching with variable length don't care characters. Information Processing Letters, 1995, 55(5): 235-239. Lee I, Apostolico A, Iliopoulos C S, Park K. Finding approximate occurrences of a pattern that contains gaps. In Proc. the 14th Australasian Workshop on Combinatorial Algorithms, July 2003, pp.89-100. Zhang M, Kao B, Cheung D W, Yip K. Mining periodic patterns with gap requirement from sequences. In Proc. the ACM SIGMOD International Conference on Management of Data, June 2005, pp.623-633. Min F, Wu X. A comparative study of pattern matching algorithms on sequences. In Proc. the 12th International Confer- 750 J. Comput. Sci. & Technol., Sept. 2014, Vol.29, No.5 ence on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Dec. 2009, pp.510-517. Wang H, Xie F, Hu X, Li P, Wu X. Pattern matching with flexible wildcards and recurring characters. In Proc. the 2010 IEEE International Conference on Granular Computing, Aug. 2010, pp.782-786. Wu Y, Wu X, Jiang H, Min F. A heuristic algorithm for MPMGOOC. Chinese Journal of Computers, 2011, 34(8): 1452-1462. (in Chinese)