A Semi-Structured Document Model for Text Mining
-
Abstract
A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can be fully utilized. To take advantage of the structure and link information in a semi-structured document for better mining, this paper presents a structured link vector model (SLVM), in which a vector represents a document and the vector's elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the two steps of the K-means procedure: calculating document similarity and calculating cluster centers. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65--0.73 to 0.82--0.86.
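The two K-means steps named above (document similarity and cluster center) can be illustrated with a minimal sketch. It assumes each document is already a plain numeric vector and uses cosine similarity; it does not reproduce how SLVM derives vector elements from document structure and neighboring documents, which the abstract does not specify. The function names (`cosine`, `assign`, `update_centers`, `kmeans`) are hypothetical and chosen only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign(docs, centers):
    """Step 1: give each document the index of its most similar cluster center."""
    return [
        max(range(len(centers)), key=lambda k: cosine(doc, centers[k]))
        for doc in docs
    ]


def update_centers(docs, labels, centers):
    """Step 2: recompute each center as the mean of its member vectors."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [doc for doc, label in zip(docs, labels) if label == k]
        if not members:
            # Keep the previous center when a cluster loses all its members.
            new_centers.append(old_center)
        else:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


def kmeans(docs, centers, max_iter=20):
    """Alternate the two steps until the assignment stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(docs, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(docs, labels, centers)
    return labels, centers


# Toy data (assumed for illustration): four 2-term document vectors, two seeds.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = kmeans(docs, centers=[[1.0, 0.0], [0.0, 1.0]])
```

In this sketch, swapping in a different document representation only requires changing how the input vectors are built and, if needed, the `cosine` function; the alternation between the two steps stays the same.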