Implement Document Clustering using K Means in Python
Reading time: 25 minutes | Coding time: 15 minutes
The concepts of tf-idf, document similarity and document clustering were covered in my previous articles. This article implements those concepts and builds a working demo of document clustering in Python.
I created my own dataset, 'Books.csv', containing titles of Computer Science books on topics such as Data Science, Python Programming and Artificial Intelligence, along with the author names.
The structure of the data is as follows:
Author | Title |
---|---|
Tim Roughgarden | Algorithms Illuminated |
Steven Skiena | The Data Science Design Manual |
Kevin Ferguson, Max Pumperla | Deep Learning and the Game of Go |
Zed Shaw | Learn Python the Hard Way |
The task is to cluster the book titles using tf-idf and K-Means Clustering.
First, I imported all the required libraries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import csv
Next, I loaded the .csv file by specifying its path:
data=pd.read_csv("/home/Documents/Books.csv")
print(data.shape)
print(list(data.columns))  # column names as a list
data.describe()
Output:
(31, 2)
['Author', 'Title']
| | Author | Title |
---|---|---|
count | 31 | 31 |
unique | 31 | 29 |
top | Max Bramer | Introduction to Deep Learning |
freq | 1 | 2 |
Since I intend to apply K-Means Clustering only to the book titles, I used the following code snippet to create two separate lists for authors and titles:
authors = []
titles = []
# If the file is not in the project directory, give its full path
with open("/home/Documents/Books.csv", newline="") as csvfobj:  # csvfobj is the file object
    readCSV = csv.reader(csvfobj, delimiter=',')  # each field is separated by ','
    # The reader yields each row as a list of strings
    next(readCSV, None)  # skip the header row ('Author', 'Title')
    # Iterate through the file row by row
    for row in readCSV:
        author = row[0]
        title = row[1]
        # Add the fields to their lists
        authors.append(author)
        titles.append(title)
print(authors)
print(titles)
Output:
['Brian Christian, Tom Griffiths', 'Jeff Erickson', 'Martin Erwig', 'Robert Sedgewick', 'Tim Roughgarden', 'Steven Skiena', 'Max Bramer', .....]
['Algorithms to live by', 'Algorithms', 'Once upon an Algorithm', 'Algorithms', 'Algorithms Illuminated', 'The Data Science Design Manual', 'Principles of Data Mining', 'R for Data Science', ...]
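Since the file is already being read with pandas, the same two lists can also be taken straight from the DataFrame columns. The following is a minimal sketch of that alternative, not the approach used in this article; the in-memory `csv_text` is a hypothetical stand-in for Books.csv built from the four sample rows shown earlier.

```python
import io

import pandas as pd

# Hypothetical stand-in for Books.csv: only the four sample rows from the table above.
csv_text = """Author,Title
Tim Roughgarden,Algorithms Illuminated
Steven Skiena,The Data Science Design Manual
"Kevin Ferguson, Max Pumperla",Deep Learning and the Game of Go
Zed Shaw,Learn Python the Hard Way
"""

data = pd.read_csv(io.StringIO(csv_text))

# read_csv has already consumed the header row, so the lists hold data rows only.
authors = data["Author"].tolist()
titles = data["Title"].tolist()

print(authors)
print(titles)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path to Books.csv.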
tf-idf
Now, the tf-idf vectors for all titles are calculated using sklearn.feature_extraction.text.TfidfVectorizer.
To display the tf-idf vector for any title, it helps to place it in a DataFrame, which presents the output in tabular form.
tfidfvect = TfidfVectorizer(stop_words='english')
X = tfidfvect.fit_transform(titles)
first_vector = X[0]
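# get_feature_names_out() needs scikit-learn 1.0 or later; older versions use get_feature_names()
dataframe = pd.DataFrame(first_vector.T.todense(), index = tfidfvect.get_feature_names_out(), columns = ["tfidf"])
dataframe.sort_values(by = ["tfidf"], ascending=False)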
Output:
| | tfidf |
---|---|
live | 0.796514 |
algorithms | 0.604621 |
software | 0.000000 |
.... | |
.... | |
.... | |
history | 0.000000 |
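To make these weights less of a black box, the sketch below recomputes them by hand and compares the result with TfidfVectorizer. It assumes scikit-learn's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses a small hypothetical subset of the titles rather than the full dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Small illustrative subset of titles (not the full Books.csv).
titles = [
    "Algorithms to live by",
    "Algorithms",
    "Algorithms Illuminated",
    "The Data Science Design Manual",
]

# Raw term counts, using the same tokenizer and stop-word list as the article.
counts = CountVectorizer(stop_words="english").fit_transform(titles).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of titles containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn default)

weights = counts * idf                     # tf * idf
weights = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # L2-normalise each row

reference = TfidfVectorizer(stop_words="english").fit_transform(titles).toarray()
print(np.allclose(weights, reference))
```

A term that appears in many titles, such as "algorithms", gets a lower idf and therefore a lower weight than a rarer term in the same title, which is why "live" outranks "algorithms" in the table above.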
K-Means Clustering
Finally, the K-Means Clustering model is built using sklearn.cluster.KMeans, here with 5 clusters and k-means++ initialization:
num = 5
kmeans = KMeans(n_clusters = num, init = 'k-means++', max_iter = 500, n_init = 1)
kmeans.fit(X)
print(kmeans.cluster_centers_) #This will print cluster centroids as tf-idf vectors
Output:
[[0.03475224 0.07692308 0.24686471 0.03475224 0.         0.15046002
  0.         0.05581389 0.03475224 0.03580537 0.         0.
  0.         0.         0.         0.06492219 0.         0.
  0.06492219 0.         0.         0.         0.         0.04000335
  0.03475224 0.         0.         0.03475224 0.08848366 0.03475224
  0.06127028 0.         0.15046002 0.         0.03475224 0.
  0.         0.06127028 0.         0.03580537 0.         0.
  0.         0.         0.         0.         0.12802633 0.03475224
  0.03580537 0.         0.         0.         0.         0.
  0.         0.         0.04000335 0.        ]
 [...]]
To predict the cluster for a new book title:
X_new = tfidfvect.transform(["Data Structures and Algorithms"])  # reuse the fitted vectorizer
predicted = kmeans.predict(X_new)
print(predicted)
Output:
[0]
The result has the form [n], where n is the number of the cluster to which the book "Data Structures and Algorithms" is assigned. Because no random seed is fixed, the cluster numbering can differ between runs.
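A cluster number on its own says little, so it can help to list which existing titles share that cluster. The following is a minimal sketch using the fitted model's `labels_` attribute; the title subset, the choice of 2 clusters and the fixed `random_state` are assumptions made for illustration only.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Small illustrative subset of titles (not the full Books.csv).
titles = [
    "Algorithms to live by",
    "Algorithms",
    "Once upon an Algorithm",
    "Algorithms Illuminated",
    "The Data Science Design Manual",
    "Principles of Data Mining",
    "R for Data Science",
]

tfidfvect = TfidfVectorizer(stop_words="english")
X = tfidfvect.fit_transform(titles)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# labels_[i] is the cluster assigned to titles[i] during fitting.
clusters = defaultdict(list)
for title, label in zip(titles, kmeans.labels_):
    clusters[int(label)].append(title)

for label in sorted(clusters):
    print(label, clusters[label])

# Look up the neighbours of a new title by predicting its cluster.
X_new = tfidfvect.transform(["Data Structures and Algorithms"])
new_label = int(kmeans.predict(X_new)[0])
print("Same cluster as the new title:", clusters[new_label])
```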
References
- Document clustering using K Means by Chaitanyasuma Jain
- Understanding TF IDF by Chaitanyasuma Jain
- Finding similarity between documents using TF IDF by Chaitanyasuma Jain
- K Means clustering by Ronit Ray