ISSN 2630-0583 (Print)

ISSN 2630-0656 (Online)

JCST

Journal of Current Science and Technology

http://jcst.rsu.ac.th

Rangsit Journal of Arts and Sciences. Vol. 7 No. 1, January - June 2017.

A comparative study of clustering techniques for non-segmented language documents

Todsanai Chumwatana

Abstract

Document clustering has become an important area of study due to the rapid increase in the number of electronic documents. It can be employed to group and categorize documents, as well as to provide a useful summary of the categories for browsing purposes. Many clustering techniques have been developed for grouping documents in both segmented languages, such as English, and non-segmented languages, such as some Asian languages. However, document clustering can be a complicated task for many Asian languages such as Chinese, Japanese, Korean and Thai, because these languages are written without explicit word boundary delimiters such as white space. The aim of this paper is to provide a comprehensive and comparative study of non-segmented document clustering techniques using the self-organizing map (SOM) and k-means, as they are two classic and well-known methods in the area of text clustering. To illustrate these two methods, experimental and comparative studies on clustering non-segmented documents with SOM and k-means are presented in this paper. Keyword extraction is first applied to find the number of occurrences of each keyword. These occurrence counts are then used as the input to the subsequent clustering process. The experimental results show that the k-means technique is simple and has a low computation cost. Meanwhile, SOM is relatively complex, but its clustering results are more visual and easier to comprehend. Consequently, k-means has become a well-known text clustering method and is used in many fields due to its straightforwardness, while SOM performs well at detecting noisy documents, making it more suitable for applications such as navigation of document collections and multi-document summarization.

Keywords: document clustering, k-means, non-segmented languages, self-organizing map
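
As an illustration of the pipeline outlined in the abstract (extract terms, count their occurrences, cluster the resulting vectors), the following is a minimal sketch and not the paper's implementation. It makes several assumptions: overlapping character bigrams stand in for the keyword extraction step, which the abstract does not specify; the helper names `ngram_counts`, `to_vectors` and `kmeans` are hypothetical; and the k-means seeding (first document, then the farthest document) is a simple deterministic choice made for this example. The SOM side of the comparison is not shown.

```python
from collections import Counter
import math


def ngram_counts(text, n=2):
    """Count overlapping character n-grams; no word segmentation is required."""
    text = "".join(text.split())  # drop any white space
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_vectors(docs, n=2):
    """Turn documents into length-normalised n-gram frequency vectors."""
    counts = [ngram_counts(d, n) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        v = [float(c[g]) for g in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors, vocab


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy Thai documents (made up for this example): two about a cat, two about a train.
docs = ["แมวกินปลา แมวนอน", "แมวกินปลา", "รถไฟวิ่งเร็ว รถไฟ", "รถไฟวิ่งเร็ว"]
vectors, vocab = to_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two cat sentences should fall into one cluster and the two train sentences into the other, since the two groups share no character bigrams. Real collections would need a more careful term extraction step and a choice of k.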
