Implement Document Clustering using K-Means in Python


Reading time: 25 minutes | Coding time: 15 minutes

While the concepts of tf-idf, document similarity, and document clustering have already been discussed in my previous articles, this article covers the implementation of those concepts and builds a working demo of document clustering in Python.

I have created my own dataset called 'Books.csv', containing titles of Computer Science books on topics such as Data Science, Python Programming, and Artificial Intelligence, along with the author names.

The structure of the data is as follows:

Author                         Book title
Tim Roughgarden                Algorithms Illuminated
Steven Skiena                  The Data Science Design Manual
Kevin Ferguson, Max Pumperla   Deep Learning and the Game of Go
Zed Shaw                       Learn Python the Hard Way

The task is to cluster the book titles using tf-idf and K-Means Clustering.

First, I imported all the required libraries.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import csv

I imported the .csv file by specifying its path:

data=pd.read_csv("/home/Documents/Books.csv")
print(data.shape)
print (list(data.columns))#returns list
data.describe()

Output:

(31, 2)
['Author', 'Title']
            Author                          Title
count           31                             31
unique          31                             29
top     Max Bramer  Introduction to Deep Learning
freq             1                              2

Since I intend to apply K-Means clustering only to the book titles, I used the following snippet to create two separate lists, one for authors and one for titles:

authors = []
titles = []
with open("/home/Documents/Books.csv") as csvfobj:  # csvfobj = file object
    readCSV = csv.reader(csvfobj, delimiter=',')  # each field is separated by ','
    # if the file is not in the same project, add the file path
    next(readCSV)  # skip the header row ('Author', 'Title')
    # iterate through the file row by row
    for row in readCSV:
        author = row[0]
        title = row[1]

        # add to the lists
        authors.append(author)
        titles.append(title)

print(authors)
print(titles)

Output:

['Brian Christian, Tom Griffiths', 'Jeff Erickson', 'Martin Erwig', 'Robert Sedgewick', 'Tim Roughgarden', 'Steven Skiena', 'Max Bramer', .....]
['Algorithms to live by', 'Algorithms', 'Once upon an Algorithm', 'Algorithms', 'Algorithms Illuminated', 'The Data Science Design Manual', 'Principles of Data Mining', 'R for Data Science', ...]
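Since the file was already loaded into a pandas DataFrame above, the same two lists can also be produced directly from its columns. This is a minimal sketch using a small inline DataFrame standing in for Books.csv (the sample rows are hypothetical):

```python
import pandas as pd

# A few rows standing in for Books.csv
data = pd.DataFrame({
    "Author": ["Tim Roughgarden", "Steven Skiena", "Zed Shaw"],
    "Title": ["Algorithms Illuminated",
              "The Data Science Design Manual",
              "Learn Python the Hard Way"],
})

# Column-to-list conversion replaces the csv.reader loop
authors = data["Author"].tolist()
titles = data["Title"].tolist()
print(titles)
```

This avoids handling the header row manually, since pd.read_csv treats the first row as column names by default.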

tf-idf

Now, tf-idf vectors for all titles are calculated using sklearn.feature_extraction.text.TfidfVectorizer. To display the tf-idf vector for any single title, it is placed in a DataFrame, which presents the output in tabular form.
tfidfvect = TfidfVectorizer(stop_words='english')
X = tfidfvect.fit_transform(titles)

first_vector = X[0]

dataframe = pd.DataFrame(first_vector.T.todense(), index=tfidfvect.get_feature_names_out(), columns=["tfidf"])
dataframe.sort_values(by=["tfidf"], ascending=False)

Output:

            tfidf
live        0.796514
algorithms  0.604621
software    0.000000
...
history     0.000000

K-Means Clustering

Finally, the K-Means clustering model is built using sklearn.cluster.KMeans:

num = 5
kmeans = KMeans(n_clusters=num, init='k-means++', max_iter=500, n_init=1)
kmeans.fit(X)
print(kmeans.cluster_centers_)  # prints the cluster centroids as tf-idf vectors

Output:

[[0.03475224 0.07692308 0.24686471 0.03475224 0. 0.15046002
0. 0.05581389 0.03475224 0.03580537 0. 0.
0. 0. 0. 0.06492219 0. 0.
0.06492219 0. 0. 0. 0. 0.04000335
0.03475224 0. 0. 0.03475224 0.08848366 0.03475224
0.06127028 0. 0.15046002 0. 0.03475224 0.
0. 0.06127028 0. 0.03580537 0. 0.
0. 0. 0. 0. 0.12802633 0.03475224
0.03580537 0. 0. 0. 0. 0.
0. 0. 0.04000335 0. ]
[...]]
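Besides the centroids, kmeans.labels_ gives the cluster assignment for each title, which is what you would use to list the titles grouped by cluster. A minimal sketch on four hypothetical titles (with a fixed random_state and a higher n_init so the result is stable):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = ["Algorithms to live by", "Algorithms Illuminated",
          "R for Data Science", "Python for Data Science"]
X = TfidfVectorizer(stop_words='english').fit_transform(titles)
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)

# labels_ holds one cluster index per title, in input order
for title, label in zip(titles, km.labels_):
    print(label, title)
```

Here the two algorithms titles and the two data-science titles share a cluster each, because their tf-idf vectors overlap on the terms 'algorithms' and 'data'/'science' respectively.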

To predict the cluster for a new book title:

new_vec = tfidfvect.transform(["Data Structures and Algorithms"])  # keep X intact for the fitted model
predicted = kmeans.predict(new_vec)
print(predicted)

Output:

[0]

The result has the form [n], where n is the index of the cluster to which the book "Data Structures and Algorithms" is assigned.
