Open-Source Internship opportunity by OpenGenus for programmers. Apply now.
Reading time: 30 minutes
Clustering documents is an important task as it groups similar documents together which can be used for a variety of tasks such as recommendations, similarity detection, creating dataset of a topic, generate new data following same pattern and so on. Clustering has always been a central task in Natural Language Processing and in this article, we use ideas from TF IDF and similarity metrics to use K Means clustering algorithm to cluster documents.
Introduction
Prerequisites: It is recommended that you read articles on Document Similarity and K Means Clustering from OpenGenus IQ for better understanding
Document Clustering: It is defined as the application of cluster analysis to text documents such that large amounts can be organized into meaningful and topic-specific clusters or groups.
Applications
- In Information Retrieval, it ensures speed and efficiency
- It has important applications in Organization of Information or Data
- Used in Topic Extraction
Procedure
- Since, we require document vectors, we find tf-idf scores for the terms in the document
- We find similarity between document pair using the above TF IDF data and a similarity metric
- Use a Clustering Algorithm (like K Means) to perform the unsupervised operation on the documents on the similarity values between documents
Example
d1 - Jim Corbett National Park
d2 - Ghatprabha Bird Sanctuary
d3 - Bandipur National Park
d4 - Ranthambore Tiger Sanctuary
Step 1
We calculate the tf-idf scores for the terms in each document==
Term | tf d1 | tf d2 | tf d3 | tf d4 | idf | tf-idf d1 | tf-idf d2 | tf-idf d3 | tf-idf d4 |
---|---|---|---|---|---|---|---|---|---|
Jim | 0.25 | 0 | 0 | 0 | 0.6 | 0.15 | 0 | 0 | 0 |
Corbett | 0.25 | 0 | 0 | 0 | 0.6 | 0.15 | 0 | 0 | 0 |
National | 0.25 | 0 | 0.33 | 0 | 0.3 | 0.075 | 0 | 0.099 | 0 |
Park | 0.25 | 0 | 0.33 | 0 | 0.3 | 0.075 | 0 | 0.099 | 0 |
Ghatprabha | 0 | 0.33 | 0 | 0 | 0.6 | 0 | 0.198 | 0 | 0 |
Bird | 0 | 0.33 | 0 | 0 | 0.6 | 0 | 0.198 | 0 | 0 |
Sanctuary | 0 | 0.33 | 0 | 0.33 | 0.3 | 0 | 0.099 | 0 | 0.099 |
Bandipur | 0 | 0 | 0.33 | 0 | 0.6 | 0 | 0 | 0.198 | 0 |
Ranthambore | 0 | 0 | 0 | 0.33 | 0.6 | 0 | 0 | 0 | 0.198 |
Tiger | 0 | 0 | 0 | 0.33 | 0.6 | 0 | 0 | 0 | 0.198 |
Thus, the document vectors are as follows:
Document | tf-idf vector |
---|---|
d1 | [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0] |
d2 | [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0] |
d3 | [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0] |
d4 | [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198] |
Step 2
We apply the K-Means Clustering Algorithm
- We have decided that the desired number of clusters or k will be 2. Let us initialize the first 2 points as our initial cluster centres.
C1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
C2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
- Next, we find the distances of the other points w.r.t our cluster centres. Here, we use Manhattan Distance for the same purpose
- Assign clusters on the basis of minimum distances.
Document | Distance from C1 | Distance from C2 | Cluster Assigned |
---|---|---|---|
d1 | 0 | 0.945 | C1 |
d2 | 0.945 | 0 | C2 |
d3 | 0.546 | 0.891 | C1 |
d4 | 0.945 | 0.792 | C2 |
- Cluster centres are updated.
C1 = [0.075, 0.075, 0.087, 0.087, 0, 0, 0, 0.099, 0, 0]
C2 = [0, 0, 0, 0, 0.099, 0.099, 0.099, 0, 0.099, 0.099]
Repeat Step 4
Document | Distance from C1 | Distance from C2 | Cluster Assigned |
---|---|---|---|
d1 | 0.273 | 0.945 | C1 |
d2 | 0.918 | 0.396 | C2 |
d3 | 0.225 | 0.891 | C1 |
d4 | 0.918 | 0.396 | C2 |
There will be no more iterations since the clusters centres remain the same
As can be seen from the clusters assigned, the results are intuitive.