Reading time: 25 minutes | Coding time: 15 minutes
The concepts of tf-idf, document similarity and document clustering were discussed in my previous articles. In this article, we implement those concepts and build a working demo of document clustering in Python.
I created my own dataset, 'Books.csv', which contains the titles of Computer Science books on topics such as Data Science, Python Programming and Artificial Intelligence, along with the author names.
The structure of the data is as follows:
| Author | Title |
|---|---|
| Tim Roughgarden | Algorithms Illuminated |
| Steven Skiena | The Data Science Design Manual |
| Kevin Ferguson, Max Pumperla | Deep Learning and the Game of Go |
| Zed Shaw | Learn Python the Hard Way |
The task is to cluster the book titles using tf-idf and K-Means Clustering.
First, I imported all the required libraries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
import csv
I imported the .csv file by specifying its path:
data = pd.read_csv("/home/Documents/Books.csv")
print(data.shape)
print(list(data.columns))  # returns the column names as a list
data.describe()
(31, 2)
['Author', 'Title']
| | Author | Title |
|---|---|---|
| top | Max Bramer | Introduction to Deep Learning |
Since I intend to apply K-Means Clustering only to the book titles, I used the following code snippet to create two separate lists, one for authors and one for titles:
authors = []
titles = []
# If the file is not in the same project, give its full path
with open("/home/Documents/Books.csv", newline="") as csvfobj:  # csvfobj is the file object
    # The reader object yields each row as a list of fields separated by ','
    readCSV = csv.reader(csvfobj, delimiter=',')
    next(readCSV, None)  # skip the header row ('Author', 'Title')
    # Iterate through the file row by row
    for row in readCSV:
        author = row[0]
        title = row[1]
        # Add each field to its list
        authors.append(author)
        titles.append(title)
print(authors)
print(titles)
['Brian Christian, Tom Griffiths', 'Jeff Erickson', 'Martin Erwig', 'Robert Sedgewick', 'Tim Roughgarden', 'Steven Skiena', 'Max Bramer', .....]
['Algorithms to live by', 'Algorithms', 'Once upon an Algorithm', 'Algorithms', 'Algorithms Illuminated', 'The Data Science Design Manual', 'Principles of Data Mining', 'R for Data Science', ...]
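Since the file is already loaded into a pandas DataFrame, the same two lists can also be taken directly from its columns. The snippet below is a minimal sketch of that alternative, not the approach used in this article; `csv_text` is a hypothetical in-memory stand-in for Books.csv built from four rows shown earlier, so the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for Books.csv: four rows from the table shown earlier.
# The author field containing a comma is quoted, as the CSV format requires.
csv_text = '''Author,Title
Tim Roughgarden,Algorithms Illuminated
Steven Skiena,The Data Science Design Manual
"Kevin Ferguson, Max Pumperla",Deep Learning and the Game of Go
Zed Shaw,Learn Python the Hard Way
'''

data = pd.read_csv(io.StringIO(csv_text))

# pandas treats the first line as the header, so no row needs to be skipped.
authors = data["Author"].tolist()
titles = data["Title"].tolist()

print(authors)
print(titles)
```

With the real file, `io.StringIO(csv_text)` would simply be replaced by the path passed to `pd.read_csv` earlier.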
Next, the tf-idf vectors for all titles are calculated using sklearn.feature_extraction.text.TfidfVectorizer.
- To display the tf-idf vector for any title, it is convenient to place it in a DataFrame, which presents the output in tabular form.
tfidfvect = TfidfVectorizer(stop_words='english')
X = tfidfvect.fit_transform(titles)
first_vector = X[0]  # tf-idf vector of the first title
# get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names()
dataframe = pd.DataFrame(first_vector.T.toarray(), index=tfidfvect.get_feature_names_out(), columns=["tfidf"])
dataframe.sort_values(by=["tfidf"], ascending=False)
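To see what the vectorizer computes, the sketch below rebuilds the tf-idf matrix by hand and compares it with the TfidfVectorizer output. It assumes scikit-learn's default settings (smoothed idf, raw term counts, L2 row normalization) and uses seven of the titles printed above as a sample; the names `counts`, `doc_freq`, `idf` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample of titles taken from the list printed above.
titles = [
    "Algorithms to live by",
    "Algorithms",
    "Once upon an Algorithm",
    "Algorithms Illuminated",
    "The Data Science Design Manual",
    "Principles of Data Mining",
    "R for Data Science",
]

# Library result, converted to a dense array for comparison.
X = TfidfVectorizer(stop_words="english").fit_transform(titles).toarray()

# Manual computation under the default settings (an assumption of this sketch):
# raw term counts, idf = ln((1 + n) / (1 + df)) + 1, then L2-normalize each row.
counts = CountVectorizer(stop_words="english").fit_transform(titles).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)          # number of titles containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(X, manual))
```

Because every row is scaled to unit length, short and long titles become directly comparable, which matters for the distance-based clustering that follows.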
Finally, the K-Means Clustering model is built using sklearn.cluster.KMeans:
num = 5
kmeans = KMeans(n_clusters=num, init='k-means++', max_iter=500, n_init=1)
kmeans.fit(X)
print(kmeans.cluster_centers_)  # prints the cluster centroids as tf-idf vectors
To predict the cluster for a new book title:
# Use a new name so the training matrix X is not overwritten
X_new = tfidfvect.transform(["Data Structures and Algorithms"])
predicted = kmeans.predict(X_new)
print(predicted)
This will give a result in the form of [n] in which n will indicate the cluster number to which the book "Data Structures and Algorithms" would belong.
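To make the predicted number meaningful, it helps to see which training titles share that cluster. The following is a minimal sketch under the same assumptions as a small sample run: seven titles from the list above, two clusters, and a fixed `random_state`. None of these are the article's settings, and the name `clusters` is illustrative.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample of titles taken from the list printed earlier in the article.
titles = [
    "Algorithms to live by",
    "Algorithms",
    "Once upon an Algorithm",
    "Algorithms Illuminated",
    "The Data Science Design Manual",
    "Principles of Data Mining",
    "R for Data Science",
]

tfidfvect = TfidfVectorizer(stop_words="english")
X = tfidfvect.fit_transform(titles)

# Assumed settings for this tiny sample.
num = 2
kmeans = KMeans(n_clusters=num, n_init=10, random_state=0).fit(X)

# kmeans.labels_ holds the cluster assigned to each training title, in order.
clusters = defaultdict(list)
for title, label in zip(titles, kmeans.labels_):
    clusters[int(label)].append(title)

# transform() reuses the fitted vocabulary; words it has never seen are ignored.
new_title = "Data Structures and Algorithms"
predicted = int(kmeans.predict(tfidfvect.transform([new_title]))[0])

print(f"'{new_title}' falls in cluster {predicted} together with:")
for title in clusters[predicted]:
    print(" -", title)
```

Note that only words already in the fitted vocabulary contribute to the prediction, so a title made up entirely of unseen words maps to an all-zero vector.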