Document Clustering using K Means

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

Reading time: 30 minutes

Clustering documents is an important task as it groups similar documents together which can be used for a variety of tasks such as recommendations, similarity detection, creating dataset of a topic, generate new data following same pattern and so on. Clustering has always been a central task in Natural Language Processing and in this article, we use ideas from TF IDF and similarity metrics to use K Means clustering algorithm to cluster documents.

Introduction

Prerequisites: It is recommended that you read articles on Document Similarity and K Means Clustering from OpenGenus IQ for better understanding

Document Clustering: It is defined as the application of cluster analysis to text documents such that large amounts can be organized into meaningful and topic-specific clusters or groups.

Applications

In Information Retrieval, it ensures speed and efficiency
It has important applications in Organization of Information or Data
Used in Topic Extraction

Procedure

Since, we require document vectors, we find tf-idf scores for the terms in the document

Understand about TF IDF in depth

We find similarity between document pair using the above TF IDF data and a similarity metric

Learn how TF IDF can be used for finding similarity between documents

Use a Clustering Algorithm (like K Means) to perform the unsupervised operation on the documents on the similarity values between documents

Learn about K means clustering algorithm

Example

d1 - Jim Corbett National Park
d2 - Ghatprabha Bird Sanctuary
d3 - Bandipur National Park
d4 - Ranthambore Tiger Sanctuary

Step 1

We calculate the tf-idf scores for the terms in each document==

Term	tf d1	tf d2	tf d3	tf d4	idf	tf-idf d1	tf-idf d2	tf-idf d3	tf-idf d4
Jim	0.25	0	0	0	0.6	0.15	0	0	0
Corbett	0.25	0	0	0	0.6	0.15	0	0	0
National	0.25	0	0.33	0	0.3	0.075	0	0.099	0
Park	0.25	0	0.33	0	0.3	0.075	0	0.099	0
Ghatprabha	0	0.33	0	0	0.6	0	0.198	0	0
Bird	0	0.33	0	0	0.6	0	0.198	0	0
Sanctuary	0	0.33	0	0.33	0.3	0	0.099	0	0.099
Bandipur	0	0	0.33	0	0.6	0	0	0.198	0
Ranthambore	0	0	0	0.33	0.6	0	0	0	0.198
Tiger	0	0	0	0.33	0.6	0	0	0	0.198

Thus, the document vectors are as follows:

Document	tf-idf vector
d1	[0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
d2	[0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
d3	[0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0]
d4	[0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198]

Step 2

We apply the K-Means Clustering Algorithm

We have decided that the desired number of clusters or k will be 2. Let us initialize the first 2 points as our initial cluster centres.

C1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
C2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]

Next, we find the distances of the other points w.r.t our cluster centres. Here, we use Manhattan Distance for the same purpose

Assign clusters on the basis of minimum distances.

Document	Distance from C1	Distance from C2	Cluster Assigned
d1	0	0.945	C1
d2	0.945	0	C2
d3	0.546	0.891	C1
d4	0.945	0.792	C2

Cluster centres are updated.

C1 = [0.075, 0.075, 0.087, 0.087, 0, 0, 0, 0.099, 0, 0]
C2 = [0, 0, 0, 0, 0.099, 0.099, 0.099, 0, 0.099, 0.099]

Repeat Step 4

Document	Distance from C1	Distance from C2	Cluster Assigned
d1	0.273	0.945	C1
d2	0.918	0.396	C2
d3	0.225	0.891	C1
d4	0.918	0.396	C2

There will be no more iterations since the clusters centres remain the same

As can be seen from the clusters assigned, the results are intuitive.

Question

Can computation of Cosine Similiarity be a useful factor for Document Clustering?

Yes

Yes, Cosine Similarity can be converted to Cosine Distance, which, in turn, can be used as the distance measure to compute cluster centres in the K-Means Clustering Algorithm.