
Xu-Bin Deng, Yang-Yong Zhu. L-tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises[J]. Journal of Computer Science and Technology, 2005, 20(6): 763-773.

L-tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises

More Information
  • Received Date: March 25, 2004
  • Revised Date: January 24, 2005
  • Published Date: November 14, 2005
  • This paper presents a new method, named L-tree match, for extracting data from complex data sources. First, based on the data extraction logic presented in this work, a new data extraction model is constructed in which model components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, along with some object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Finally, the method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.
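To make the abstract's terms concrete, the following is a minimal sketch of template-driven extraction from a line stream in which fields may be optional, may arrive in any order, and may be interleaved with noise. It is not the L-tree match algorithm from the paper, and it does not cover nested components. The field tags (`ID`, `DE`, `OS`), the `//` record terminator, and all names such as `TEMPLATE` and `extract_records` are assumptions made for illustration only.

```python
import re
from typing import Dict, Iterable, Iterator

# Hypothetical template: field name -> (line pattern, is_required).
# The tags and the "//" terminator are illustrative assumptions,
# not taken from the paper.
TEMPLATE = {
    "id": (re.compile(r"^ID\s+(\S+)"), True),
    "organism": (re.compile(r"^OS\s+(.+)"), False),  # optional field
    "description": (re.compile(r"^DE\s+(.+)"), True),
}
RECORD_END = "//"


def extract_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per complete record, reading the stream line by line."""
    current: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == RECORD_END:
            # Emit the record only if every required field was seen.
            required = [name for name, (_, req) in TEMPLATE.items() if req]
            if all(name in current for name in required):
                yield current
            current = {}
            continue
        # Fields may appear in any order, so try every pattern on each line.
        for name, (pattern, _) in TEMPLATE.items():
            match = pattern.match(line)
            if match:
                # Keep the first value seen for a field within a record.
                current.setdefault(name, match.group(1).strip())
                break
        # Lines that match no pattern are treated as noise and skipped.
    # A trailing record without a terminator is discarded.


sample = [
    "DE   Hypothetical protein A",  # fields out of order
    "## page 12 ##",                # noise line
    "ID   P001",
    "//",
    "ID   P002",
    "OS   Homo sapiens",            # optional field present
    "DE   Hypothetical protein B",
    "//",
    "OS   Mus musculus",            # required fields missing: dropped
    "//",
]
records = list(extract_records(sample))
print(len(records))
```

Because the function is a generator that holds only the current record in memory, the same loop can be fed a file object or any other line iterator rather than an in-memory list, which is the property that matters for very large inputs.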
