Document Clustering using K Means


Reading time: 30 minutes

Clustering documents is an important task as it groups similar documents together which can be used for a variety of tasks such as recommendations, similarity detection, creating dataset of a topic, generate new data following same pattern and so on. Clustering has always been a central task in Natural Language Processing and in this article, we use ideas from TF IDF and similarity metrics to use K Means clustering algorithm to cluster documents.

Introduction

Prerequisites: It is recommended that you read articles on Document Similarity and K Means Clustering from OpenGenus IQ for better understanding

Document Clustering: It is defined as the application of cluster analysis to text documents such that large amounts can be organized into meaningful and topic-specific clusters or groups.

Applications

  • In Information Retrieval, it ensures speed and efficiency
  • It has important applications in Organization of Information or Data
  • Used in Topic Extraction

Procedure

  1. Since, we require document vectors, we find tf-idf scores for the terms in the document
Understand about TF IDF in depth
  1. We find similarity between document pair using the above TF IDF data and a similarity metric
Learn how TF IDF can be used for finding similarity between documents
  1. Use a Clustering Algorithm (like K Means) to perform the unsupervised operation on the documents on the similarity values between documents
Learn about K means clustering algorithm

Example

d1 - Jim Corbett National Park
d2 - Ghatprabha Bird Sanctuary
d3 - Bandipur National Park
d4 - Ranthambore Tiger Sanctuary

Step 1

We calculate the tf-idf scores for the terms in each document==

Term tf d1 tf d2 tf d3 tf d4 idf tf-idf d1 tf-idf d2 tf-idf d3 tf-idf d4
Jim 0.25 0 0 0 0.6 0.15 0 0 0
Corbett 0.25 0 0 0 0.6 0.15 0 0 0
National 0.25 0 0.33 0 0.3 0.075 0 0.099 0
Park 0.25 0 0.33 0 0.3 0.075 0 0.099 0
Ghatprabha 0 0.33 0 0 0.6 0 0.198 0 0
Bird 0 0.33 0 0 0.6 0 0.198 0 0
Sanctuary 0 0.33 0 0.33 0.3 0 0.099 0 0.099
Bandipur 0 0 0.33 0 0.6 0 0 0.198 0
Ranthambore 0 0 0 0.33 0.6 0 0 0 0.198
Tiger 0 0 0 0.33 0.6 0 0 0 0.198

Thus, the document vectors are as follows:

Document tf-idf vector
d1 [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
d2 [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
d3 [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0]
d4 [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198]

Step 2

We apply the K-Means Clustering Algorithm

  1. We have decided that the desired number of clusters or k will be 2. Let us initialize the first 2 points as our initial cluster centres.
C1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
C2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
  1. Next, we find the distances of the other points w.r.t our cluster centres. Here, we use Manhattan Distance for the same purpose

  1. Assign clusters on the basis of minimum distances.
Document Distance from C1 Distance from C2 Cluster Assigned
d1 0 0.945 C1
d2 0.945 0 C2
d3 0.546 0.891 C1
d4 0.945 0.792 C2
  1. Cluster centres are updated.
C1 = [0.075, 0.075, 0.087, 0.087, 0, 0, 0, 0.099, 0, 0]
C2 = [0, 0, 0, 0, 0.099, 0.099, 0.099, 0, 0.099, 0.099]

Repeat Step 4

Document Distance from C1 Distance from C2 Cluster Assigned
d1 0.273 0.945 C1
d2 0.918 0.396 C2
d3 0.225 0.891 C1
d4 0.918 0.396 C2

There will be no more iterations since the clusters centres remain the same

As can be seen from the clusters assigned, the results are intuitive.

Question

Can computation of Cosine Similiarity be a useful factor for Document Clustering?

Yes
No
Yes, Cosine Similarity can be converted to Cosine Distance, which, in turn, can be used as the distance measure to compute cluster centres in the K-Means Clustering Algorithm.