Data Extraction from the Web Based on Pre-Defined Schema
-
Abstract
With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available on the Web are in HTML, which was originally designed for document formatting with little consideration of content. Effectively extracting data from such documents remains a non-trivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system then induces the mapping rules and generates a wrapper that takes the HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also able to produce a wrapper that extracts the required data from input pages with high accuracy.
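For illustration only, the schema-to-wrapper pipeline summarized above can be sketched in a few lines of Python. This is not the prototype system: the schema layout (`SCHEMA`), the rule format (`RULES`, a field mapped to a tag and class), and the names `RuleMatcher` and `wrap` are all assumptions made for this example, and the rules are written by hand rather than induced from sample mappings.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical user-defined schema: root element plus the fields to extract.
SCHEMA = {"root": "book", "fields": ["title", "price"]}

# Hypothetical mapping rules: field -> (tag, class). In the approach described
# above these would be induced from sample mappings; here they are hand-written.
RULES = {"title": ("h1", "title"), "price": ("span", "price")}


class RuleMatcher(HTMLParser):
    """Collects the text of the first element matching each rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.values = {}
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag that opened the current capture

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            return  # already inside a captured element
        classes = (dict(attrs).get("class") or "").split()
        for field, (rule_tag, rule_class) in self.rules.items():
            if tag == rule_tag and rule_class in classes and field not in self.values:
                self._field, self._tag = field, tag
                self.values[field] = ""
                break

    def handle_data(self, data):
        if self._field is not None:
            self.values[self._field] += data

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self.values[self._field] = self.values[self._field].strip()
            self._field = None


def wrap(html, schema=SCHEMA, rules=RULES):
    """Toy wrapper: HTML page in, XML string shaped by the schema out."""
    matcher = RuleMatcher(rules)
    matcher.feed(html)
    matcher.close()
    root = ET.Element(schema["root"])
    for field in schema["fields"]:
        ET.SubElement(root, field).text = matcher.values.get(field, "")
    return ET.tostring(root, encoding="unicode")


page = ('<html><body><h1 class="title">Web Data</h1>'
        '<p>Only <span class="price"> 42.00 </span></p></body></html>')
print(wrap(page))
```

The sketch keeps the schema (what to extract) separate from the rules (where to find it), which mirrors the division of labor in the abstract: the user supplies the former, and the system is responsible for the latter.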