Get this book -> Problems on Array: For Interviews and Competitive Programming

Reading time: 30 minutes

Clustering documents is an important task as it groups similar documents together which can be used for a variety of tasks such as recommendations, similarity detection, creating dataset of a topic, generate new data following same pattern and so on. Clustering has always been a central task in Natural Language Processing and in this article, we use ideas from TF IDF and similarity metrics to use K Means clustering algorithm to cluster documents.

### Introduction

Prerequisites: It is recommended that you read articles on Document Similarity and K Means Clustering from OpenGenus IQ for better understanding

**Document Clustering**: It is defined as the application of cluster analysis to text documents such that large amounts can be organized into meaningful and topic-specific clusters or groups.

### Applications

- In Information Retrieval, it ensures speed and efficiency
- It has important applications in Organization of Information or Data
- Used in Topic Extraction

### Procedure

- Since, we require document vectors, we find tf-idf scores for the terms in the document

- We find similarity between document pair using the above TF IDF data and a similarity metric

- Use a Clustering Algorithm (like K Means) to perform the unsupervised operation on the documents on the similarity values between documents

### Example

```
d1 - Jim Corbett National Park
d2 - Ghatprabha Bird Sanctuary
d3 - Bandipur National Park
d4 - Ranthambore Tiger Sanctuary
```

**Step 1**

We calculate the tf-idf scores for the terms in each document==

Term | tf d1 |
tf d2 |
tf d3 |
tf d4 |
idf | tf-idf d1 |
tf-idf d2 |
tf-idf d3 |
tf-idf d4 |
---|---|---|---|---|---|---|---|---|---|

Jim | 0.25 | 0 | 0 | 0 | 0.6 | 0.15 | 0 | 0 | 0 |

Corbett | 0.25 | 0 | 0 | 0 | 0.6 | 0.15 | 0 | 0 | 0 |

National | 0.25 | 0 | 0.33 | 0 | 0.3 | 0.075 | 0 | 0.099 | 0 |

Park | 0.25 | 0 | 0.33 | 0 | 0.3 | 0.075 | 0 | 0.099 | 0 |

Ghatprabha | 0 | 0.33 | 0 | 0 | 0.6 | 0 | 0.198 | 0 | 0 |

Bird | 0 | 0.33 | 0 | 0 | 0.6 | 0 | 0.198 | 0 | 0 |

Sanctuary | 0 | 0.33 | 0 | 0.33 | 0.3 | 0 | 0.099 | 0 | 0.099 |

Bandipur | 0 | 0 | 0.33 | 0 | 0.6 | 0 | 0 | 0.198 | 0 |

Ranthambore | 0 | 0 | 0 | 0.33 | 0.6 | 0 | 0 | 0 | 0.198 |

Tiger | 0 | 0 | 0 | 0.33 | 0.6 | 0 | 0 | 0 | 0.198 |

Thus, the document vectors are as follows:

Document | tf-idf vector |
---|---|

d1 | [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0] |

d2 | [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0] |

d3 | [0, 0, 0.099, 0.099, 0, 0, 0, 0.198, 0, 0] |

d4 | [0, 0, 0, 0, 0, 0, 0.099, 0, 0.198, 0.198] |

**Step 2**

We apply the K-Means Clustering Algorithm

*We have decided that the desired number of clusters or k will be 2. Let us initialize the first 2 points as our initial cluster centres.*

```
C1 = [0.15, 0.15, 0.075, 0.075, 0, 0, 0, 0, 0, 0]
C2 = [0, 0, 0, 0, 0.198, 0.198, 0.099, 0, 0, 0]
```

*Next, we find the distances of the other points w.r.t our cluster centres. Here, we use Manhattan Distance for the same purpose*

*Assign clusters on the basis of minimum distances.*

Document | Distance from C1 | Distance from C2 | Cluster Assigned |
---|---|---|---|

d1 | 0 | 0.945 | C1 |

d2 | 0.945 | 0 | C2 |

d3 | 0.546 | 0.891 | C1 |

d4 | 0.945 | 0.792 | C2 |

*Cluster centres are updated.*

```
C1 = [0.075, 0.075, 0.087, 0.087, 0, 0, 0, 0.099, 0, 0]
C2 = [0, 0, 0, 0, 0.099, 0.099, 0.099, 0, 0.099, 0.099]
```

*Repeat Step 4*

Document | Distance from C1 | Distance from C2 | Cluster Assigned |
---|---|---|---|

d1 | 0.273 | 0.945 | C1 |

d2 | 0.918 | 0.396 | C2 |

d3 | 0.225 | 0.891 | C1 |

d4 | 0.918 | 0.396 | C2 |

**There will be no more iterations since the clusters centres remain the same**

As can be seen from the clusters assigned, the results are intuitive.