February 2, 2009
Dr. Shanfeng Zhu, Associate Professor, Fudan University, China
Enhancing MEDLINE Document Clustering by Incorporating MeSH Semantic Similarity
Clustering of MEDLINE documents is usually based on the vector space
model, which measures the content similarity between two documents
essentially as the inner product of their word vectors.
Recently, the semantic information of MeSH (Medical Subject Headings)
terms has been applied to clustering MEDLINE documents by mapping
documents to MeSH concept vectors, which are then clustered.
However, current approaches to using MeSH terms have two serious
limitations: first, important semantic information may be lost when
generating MeSH concept vectors, and second, the content information
of the original text is discarded.
Our new strategy includes three key points. First, we develop
a sound method for measuring the semantic similarity between two
documents over the MeSH ontology. Second, we combine both the
semantic and the content similarities to generate the integrated
similarity matrix between documents. Third, we apply a spectral
approach to clustering documents over the integrated similarity matrix.
Using 100 different datasets of MEDLINE records, we conduct extensive
experiments, varying the similarity measures and parameters.
Experimental results show that integrating the semantic and content
similarities outperforms using either similarity alone, and that the
improvement is statistically significant.
We further find the best parameter setting, which is consistent across
all experimental conditions tested.
Finally, we show a typical example of the resulting clusters, confirming
the effectiveness of our strategy in improving MEDLINE document
clustering.
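
The three key points of the strategy can be illustrated with a small
sketch. This is not the speaker's implementation. The semantic measure
here is a simple Jaccard overlap of MeSH term sets, used as a stand-in
for the ontology-based measure described in the abstract. The weight
`alpha`, the function names, and the toy documents are all assumptions
made for illustration. The spectral step is reduced to a two-way split
using the second eigenvector of the normalised similarity matrix,
computed by power iteration with the Python standard library only.

```python
import math
from collections import Counter


def content_similarity(text_a, text_b):
    """Cosine similarity: inner product of length-normalised word-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def semantic_similarity(mesh_a, mesh_b):
    """Stand-in only: Jaccard overlap of MeSH term sets (not an ontology-based measure)."""
    a, b = set(mesh_a), set(mesh_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def integrated_similarity(docs, alpha=0.5):
    """alpha * semantic + (1 - alpha) * content; alpha is an assumed mixing weight."""
    n = len(docs)
    return [[alpha * semantic_similarity(docs[i]["mesh"], docs[j]["mesh"])
             + (1 - alpha) * content_similarity(docs[i]["text"], docs[j]["text"])
             for j in range(n)] for i in range(n)]


def spectral_bipartition(sim, iters=200):
    """Two-way split by the sign of the second eigenvector of D^-1/2 S D^-1/2."""
    n = len(sim)
    degree = [sum(row) for row in sim]
    root = [math.sqrt(d) for d in degree]
    total = math.sqrt(sum(degree))
    v1 = [r / total for r in root]      # known leading eigenvector (eigenvalue 1)
    x = [float(i) for i in range(n)]    # deterministic start vector
    for _ in range(iters):
        # Multiply by (N + I) so that every eigenvalue is non-negative.
        y = [x[i] + sum(sim[i][j] * x[j] / (root[i] * root[j]) for j in range(n))
             for i in range(n)]
        # Deflate: remove the component along the leading eigenvector.
        proj = sum(a * b for a, b in zip(y, v1))
        y = [a - proj * b for a, b in zip(y, v1)]
        norm = math.sqrt(sum(a * a for a in y))
        x = [a / norm for a in y]
    # Document 0 is always labelled 0; the others are labelled by relative sign.
    return [0 if (xi >= 0) == (x[0] >= 0) else 1 for xi in x]


# Toy records (hypothetical, not taken from MEDLINE).
docs = [
    {"text": "protein folding kinetics study", "mesh": ["Protein Folding", "Kinetics"]},
    {"text": "protein folding pathways study", "mesh": ["Protein Folding"]},
    {"text": "influenza vaccine trial study", "mesh": ["Influenza Vaccines"]},
    {"text": "influenza vaccine efficacy study", "mesh": ["Influenza Vaccines", "Vaccination"]},
]
similarity = integrated_similarity(docs, alpha=0.5)
labels = spectral_bipartition(similarity)
```

On these toy records the two protein-folding documents receive one
label and the two influenza documents the other. Setting `alpha` to 0
or 1 reduces the matrix to content-only or semantic-only similarity,
which are the two baselines the integrated approach is compared against.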