Document Clustering using K Means
Reading time: 30 minutes
Clustering documents is an important task: it groups similar documents together, which is useful for recommendations, similarity detection, building topic-specific datasets, and generating new data that follows the same pattern. Clustering has always been a central task in Natural Language Processing, and in this article we combine TF-IDF scores with a similarity metric and the K-Means algorithm to cluster documents.
Introduction
Prerequisites: It is recommended that you read the articles on Document Similarity and K Means Clustering from OpenGenus IQ for a better understanding.
Document Clustering: the application of cluster analysis to text documents, so that large collections can be organized into meaningful, topic-specific clusters or groups.
Applications
- In Information Retrieval, it ensures speed and efficiency
- It has important applications in Organization of Information or Data
- Used in Topic Extraction
Procedure
- Since we require document vectors, we compute TF-IDF scores for the terms in each document
- We find the similarity between each pair of documents using the TF-IDF vectors and a similarity metric
- We apply a clustering algorithm (such as K-Means) to group the documents, unsupervised, based on these similarity values
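Before walking through the hand-worked example, the whole procedure can be sketched with scikit-learn (an assumption: the library is installed; note that its `TfidfVectorizer` lowercases terms and uses a smoothed idf, so the scores will differ slightly from the hand calculation below):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Jim Corbett National Park",
    "Ghatprabha Bird Sanctuary",
    "Bandipur National Park",
    "Ranthambore Tiger Sanctuary",
]

# Step 1: turn documents into tf-idf vectors
X = TfidfVectorizer().fit_transform(documents)

# Step 2: cluster the vectors with K-Means (k = 2)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

One cluster label is produced per document; the manual example below mirrors these two steps.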
Example
d1 - Jim Corbett National Park
d2 - Ghatprabha Bird Sanctuary
d3 - Bandipur National Park
d4 - Ranthambore Tiger Sanctuary
Step 1
We calculate the tf-idf scores for the terms in each document. Here, tf is the number of times a term appears in a document divided by the total number of terms in that document, and idf = log10(N/df), where N = 4 is the total number of documents and df is the number of documents containing the term (rounded to 0.6 and 0.3 below):
Term | tf d1 | tf d2 | tf d3 | tf d4 | idf | tf-idf d1 | tf-idf d2 | tf-idf d3 | tf-idf d4 |
---|---|---|---|---|---|---|---|---|---|
Jim | 0.25 | 0 | 0 | 0 | 0.6 | 0.15 | 0 | 0 | 0 |
Corbett | 0.25 | 0 | 0 | 0 | 0.6 | 0.15 | 0 | 0 | 0 |
National | 0.25 | 0 | 0.33 | 0 | 0.3 | 0.075 | 0 | 0.099 | 0 |
Park | 0.25 | 0 | 0.33 | 0 | 0.3 | 0.075 | 0 | 0.099 | 0 |
Ghatprabha | 0 | 0.33 | 0 | 0 | 0.6 | 0 | 0.198 | 0 | 0 |
Bird | 0 | 0.33 | 0 | 0 | 0.6 | 0 | 0.198 | 0 | 0 |
Sanctuary | 0 | 0.33 | 0 | 0.33 | 0.3 | 0 | 0.099 | 0 | 0.099 |
Bandipur | 0 | 0 | 0.33 | 0 | 0.6 | 0 | 0 | 0.198 | 0 |
Ranthambore | 0 | 0 | 0 | 0.33 | 0.6 | 0 | 0 | 0 | 0.198 |
Tiger | 0 | 0 | 0 | 0.33 | 0.6 | 0 | 0 | 0 | 0.198 |
Thus, the document vectors are as follows:
Document | tf-idf vector |
---|---|
d1 | [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0] |
d2 | [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0] |
d3 | [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0] |
d4 | [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198] |
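The table above can be reproduced with a short pure-Python sketch, rounding tf and idf the same way the article does:

```python
from math import log10

# Tokenized documents from the example
docs = {
    "d1": "Jim Corbett National Park".split(),
    "d2": "Ghatprabha Bird Sanctuary".split(),
    "d3": "Bandipur National Park".split(),
    "d4": "Ranthambore Tiger Sanctuary".split(),
}
vocab = ["Jim", "Corbett", "National", "Park", "Ghatprabha",
         "Bird", "Sanctuary", "Bandipur", "Ranthambore", "Tiger"]

def tf(term, doc):
    # term frequency: occurrences of the term / total terms in the document
    return round(doc.count(term) / len(doc), 2)

def idf(term):
    # inverse document frequency: log10(N / documents containing the term)
    df = sum(1 for d in docs.values() if term in d)
    return round(log10(len(docs) / df), 1)

# one tf-idf vector per document, ordered by the vocabulary above
vectors = {name: [round(tf(t, d) * idf(t), 3) for t in vocab]
           for name, d in docs.items()}
print(vectors["d1"])  # → [0.15, 0.15, 0.075, 0.075, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```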
Step 2
We apply the K-Means Clustering Algorithm
- We set the desired number of clusters k to 2, and initialize the first two document vectors as our initial cluster centres.
C1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
C2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
- Next, we find the distance of each point from the cluster centres. Here, we use Manhattan Distance, i.e. the sum of the absolute component-wise differences
- Each document is assigned to the cluster whose centre is nearest
Document | Distance from C1 | Distance from C2 | Cluster Assigned |
---|---|---|---|
d1 | 0 | 0.945 | C1 |
d2 | 0.945 | 0 | C2 |
d3 | 0.546 | 0.891 | C1 |
d4 | 0.945 | 0.792 | C2 |
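The distance and assignment table above can be checked with a short sketch (pure Python, with the document vectors hardcoded from Step 1):

```python
# Manhattan (L1) distance: sum of absolute component-wise differences
def manhattan(a, b):
    return round(sum(abs(x - y) for x, y in zip(a, b)), 3)

d1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
d2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
d3 = [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0]
d4 = [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198]
C1, C2 = d1, d2  # initial cluster centres

for name, vec in [("d1", d1), ("d2", d2), ("d3", d3), ("d4", d4)]:
    to_c1, to_c2 = manhattan(vec, C1), manhattan(vec, C2)
    # assign to the nearer centre (ties go to C1)
    print(name, to_c1, to_c2, "C1" if to_c1 <= to_c2 else "C2")
```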
- The cluster centres are updated to the mean of their assigned document vectors.
C1 = [0.075, 0.075, 0.087, 0.087, 0, 0, 0, 0.099, 0, 0]
C2 = [0, 0, 0, 0, 0.099, 0.099, 0.099, 0, 0.099, 0.099]
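The updated centres above are just component-wise means of each cluster's members; a minimal sketch:

```python
# New centre = component-wise mean of the cluster's document vectors
def centroid(points):
    return [round(sum(xs) / len(points), 3) for xs in zip(*points)]

d1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
d2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
d3 = [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0]
d4 = [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198]

# C1 averages its members d1 and d3; C2 averages d2 and d4
print(centroid([d1, d3]))  # → [0.075, 0.075, 0.087, 0.087, 0.0, 0.0, 0.0, 0.099, 0.0, 0.0]
print(centroid([d2, d4]))  # → [0.0, 0.0, 0.0, 0.0, 0.099, 0.099, 0.099, 0.0, 0.099, 0.099]
```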
We repeat the assignment step with the updated cluster centres:
Document | Distance from C1 | Distance from C2 | Cluster Assigned |
---|---|---|---|
d1 | 0.273 | 0.945 | C1 |
d2 | 0.918 | 0.396 | C2 |
d3 | 0.273 | 0.891 | C1 |
d4 | 0.918 | 0.396 | C2 |
There will be no more iterations, since the cluster assignments, and therefore the cluster centres, remain the same.
As can be seen from the clusters assigned, the results are intuitive: the two National Parks are grouped together, as are the two Sanctuaries.
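Putting the steps above together, the whole iteration can be sketched as a small K-Means loop, using Manhattan distance for assignment and component-wise means for the centre update, exactly as in the worked example:

```python
# Manhattan (L1) distance between two vectors
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# New centre = component-wise mean of a cluster's members
def centroid(points):
    return [round(sum(xs) / len(points), 3) for xs in zip(*points)]

def kmeans(points, centres, max_iters=10):
    for _ in range(max_iters):
        clusters = [[] for _ in centres]
        for p in points:
            dists = [manhattan(p, c) for c in centres]
            clusters[dists.index(min(dists))].append(p)  # nearest centre wins
        new_centres = [centroid(c) for c in clusters]
        if new_centres == centres:  # converged: centres stopped moving
            break
        centres = new_centres
    return clusters

d1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
d2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
d3 = [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0]
d4 = [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198]

# First two points as initial centres, as in the example
clusters = kmeans([d1, d2, d3, d4], centres=[d1, d2])
```

As in the hand computation, the loop converges after the second assignment, with d1 and d3 in one cluster and d2 and d4 in the other.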
Question
Can computing Cosine Similarity be a useful factor for Document Clustering?
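One way to explore this question is to compute cosine similarity directly on the tf-idf vectors from the example (a sketch in pure Python):

```python
from math import sqrt

# cosine similarity: dot product divided by the product of vector lengths
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return round(dot / (na * nb), 3)

d1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
d2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
d3 = [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0]

print(cosine(d1, d3))  # positive: d1 and d3 share "National Park"
print(cosine(d1, d2))  # zero: d1 and d2 share no terms
```

Since d1 and d3 score well above d1 and d2, cosine similarity separates the documents the same way the Manhattan-distance clustering did, so it is indeed a reasonable basis for document clustering.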