Clustering DTDs: An Interactive Two-Level Approach
Abstract
XML (eXtensible Markup Language) is a standard that is widely applied in data representation and data exchange. However, the DTD (Document Type Definition), an important concept in XML, is not used to full advantage in current applications. This paper presents a new method for clustering DTDs, which can also be used in XML document clustering. The two-level method clusters the elements in DTDs and the DTDs themselves separately. Element clustering forms the first level and produces element clusters, which are generalizations of relevant elements. DTD clustering uses this generalized information and forms the second level of the overall clustering process. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separate words in the vector model; 3) it facilitates the search for outliers. Experiments show that the method is able to categorize relevant DTDs effectively.