Innovating Web Page Classification Through Reducing Noise
-
Abstract
This paper presents a new method that reduces noise in Web page classification. It first describes the representation of a Web page based on its HTML tags. Then, through a novel distance formula, it eliminates noise from the similarity measure. After carefully analyzing Web pages, we design an algorithm that can distinguish related hyperlinks from noisy ones. We use the non-noisy hyperlinks to improve the performance of Web page classification (the CAWN algorithm): any page can be classified through the text and categories of the neighbor pages related to it. The experimental results show that our approach improves classification accuracy.
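The idea of classifying a page with help from its non-noisy neighbors can be illustrated with a minimal sketch. This is not the CAWN algorithm itself, whose distance formula and link analysis are not given in this abstract. The function name, the similarity threshold used as a stand-in for noise detection, and the weighting scheme are all assumptions made for illustration.

```python
from collections import Counter


def classify_with_neighbors(own_scores, neighbors,
                            min_similarity=0.5, neighbor_weight=0.5):
    """Pick a category from text scores plus votes of related neighbors.

    own_scores: dict mapping category -> score from the page's own text.
    neighbors: list of (similarity, category) pairs for linked pages.
    min_similarity and neighbor_weight are hypothetical parameters.
    """
    votes = Counter()
    for similarity, category in neighbors:
        # Assumption: a link to a dissimilar page is treated as noise.
        if similarity >= min_similarity:
            votes[category] += similarity

    combined = dict(own_scores)
    total = sum(votes.values())
    if total > 0:
        # Add the normalized neighbor votes to the text-based scores.
        for category, weight in votes.items():
            combined[category] = (combined.get(category, 0.0)
                                  + neighbor_weight * weight / total)
    return max(combined, key=combined.get)


if __name__ == "__main__":
    text_scores = {"sports": 0.5, "finance": 0.45}
    noisy_links = [(0.2, "finance"), (0.3, "finance")]
    # With filtering, the low-similarity links are ignored.
    print(classify_with_neighbors(text_scores, noisy_links))
    # Without filtering, the noisy links change the decision.
    print(classify_with_neighbors(text_scores, noisy_links,
                                  min_similarity=0.0))
```

In this toy input, the two low-similarity links would flip the decision if they were allowed to vote, which is the kind of error that filtering noisy hyperlinks is meant to prevent.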