Finding the similarity between documents is an important task. With the rise of Machine Learning, there are sophisticated techniques that use RNNs to model the grammar and flow of two documents in order to judge their similarity. This might sound like a difficult task, but we have been doing it since the 1980s using Natural Language Processing techniques. In this article, we will use NLP techniques such as TF-IDF to achieve it.
Did you know that even today most production systems rely on classical NLP techniques to measure document similarity?
Introduction
In Natural Language Processing, Document Similarity plays a very important role in Information Retrieval. Document Similarity is the task of determining how similar two or more documents are to each other. It is used not only for search but also for duplicate detection.
The key idea is to represent documents as vectors using TF-IDF. Once we have the vector representation, we can compute the similarity between documents using any similarity metric, such as cosine similarity.
Similarity Metrics
- Cosine Similarity
- Jaccard Distance
- Euclidean Distance
In this article, we will focus on Cosine Similarity using tf-idf.
Jaccard Distance
This is the most intuitive and easy method of calculating Document Similarity.
Jaccard Similarity(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|
= (elements common to d1 and d2) / (all distinct elements in d1 and d2 together)
let us say d1 and d2 are vectors
d1 = [ 1 3 2 ]
d2 = [ 5 0 3]
In this case, treating the vectors as sets, d1 ∩ d2 = {3} and d1 ∪ d2 = {1, 3, 2, 5, 0}
Jaccard similarity between d1 and d2 is 1/5 = 0.2
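The set-based calculation above can be sketched in a few lines of Python (the function name is ours, not from any library):

```python
def jaccard_similarity(d1, d2):
    """Jaccard similarity: |intersection| / |union|, treating the inputs as sets."""
    s1, s2 = set(d1), set(d2)
    return len(s1 & s2) / len(s1 | s2)

d1 = [1, 3, 2]
d2 = [5, 0, 3]
print(jaccard_similarity(d1, d2))  # 1 common element out of 5 distinct -> 0.2
```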
Cosine Similarity
Definition - Cosine similarity measures the similarity between two documents as the cosine of the angle between the vectors derived from those documents.
The steps to find the cosine similarity are as follows -
- Calculate document vector. (Vectorization)
Vectors hold numbers, so to represent a text document as a vector we compute the tf-idf value of each of its terms.
- Calculate tf-idf for the given document d
Note: Scikit-Learn provides a transformer called the TfidfVectorizer in the module feature_extraction.text for vectorizing documents with TF–IDF numerics.
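A minimal usage sketch with scikit-learn follows. Note that TfidfVectorizer's defaults differ from the hand calculation later in this article: it uses a smoothed idf, l2-normalises each row, and its default tokenizer drops single-character words such as "a", so the resulting numbers will not match the worked tables exactly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Music is a universal language",
    "Music is a miracle",
    "Music is a universal feature of the human experience",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # one tf-idf row vector per document
sims = cosine_similarity(X)          # 3x3 matrix of pairwise similarities
print(sims[2])                       # similarity of d3 to d1, d2 and itself
```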
- Calculate the Cosine Similarity
The Cosine Similarity is found by taking the dot product of the document vectors calculated in the previous step and dividing it by the product of their magnitudes.
cosθ = v1.v2 / ( |v1| * |v2| )
where v1.v2 is dot product
|v1| is magnitude of v1
Understanding dot product and magnitude:
let us say v1 and v2 are vectors
v1 = [ 1 3 2 ]
v2 = [ 5 0 -3]
In this case, the dot product of v1 and v2 will be:
v1 . v2 = 1 * 5 + 3 * 0 + 2 * -3 = 5 + 0 + -6 = -1
Magnitude will be:
|v1| = sqrt(1^2 + 3^2 + 2^2) = sqrt(1 + 9 + 4) = sqrt(14)
|v2| = sqrt(5^2 + 0^2 + (-3)^2) = sqrt(25 + 0 + 9) = sqrt(34)
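Putting the pieces together in plain Python (the helper names are ours):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    # note the square root: |v| = sqrt(sum of squared components)
    return math.sqrt(sum(a * a for a in v))

def cosine_similarity(u, v):
    return dot(u, v) / (magnitude(u) * magnitude(v))

v1 = [1, 3, 2]
v2 = [5, 0, -3]
print(dot(v1, v2))               # -1
print(magnitude(v1))             # sqrt(14) ≈ 3.742
print(cosine_similarity(v1, v2)) # -1 / (sqrt(14) * sqrt(34))
```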
Euclidean Distance
- Calculation of Euclidean Distance is similar in process to Cosine Similarity.
- Calculate Document Vectors and apply the following formula to find the Euclidean Distance.
Euclidean Distance = sqrt(∑(xi−yi)^2), where the sum runs from i = 1 to n (the number of components in each vector)
Example:
let us say v1 and v2 are vectors
v1 = [ 1 3 2 ]
v2 = [ 5 0 -3]
The Euclidean distance will be:
Euclidean distance = sqrt ( (1-5)^2 + (3-0)^2 + (2-(-3))^2 )
= sqrt ( 16 + 9 + 25 )
= sqrt (50)
= 5 * sqrt(2)
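The same example as a quick Python check (the helper name is ours):

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

v1 = [1, 3, 2]
v2 = [5, 0, -3]
print(euclidean_distance(v1, v2))  # sqrt(50) = 5 * sqrt(2) ≈ 7.071
```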
Example
Say document similarity is to be found for the following documents -
d1 - Music is a universal language
d2 - Music is a miracle
d3 - Music is a universal feature of the human experience
Step 1 - We calculate tf-idf for all three documents. Here tf is the relative term frequency (term count divided by document length) and idf = log10(N / df), where N = 3 is the number of documents and df is the number of documents containing the word.
Word | tf d1 | tf d2 | tf d3 | idf | tf-idf d1 | tf-idf d2 | tf-idf d3 |
---|---|---|---|---|---|---|---|
music | 0.2 | 0.25 | 0.11 | 0 | 0 | 0 | 0 |
is | 0.2 | 0.25 | 0.11 | 0 | 0 | 0 | 0 |
a | 0.2 | 0.25 | 0.11 | 0 | 0 | 0 | 0 |
universal | 0.2 | 0 | 0.11 | 0.176 | 0.035 | 0 | 0.019 |
language | 0.2 | 0 | 0 | 0.477 | 0.095 | 0 | 0 |
miracle | 0 | 0.25 | 0 | 0.477 | 0 | 0.119 | 0 |
of | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052 |
the | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052 |
feature | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052 |
experience | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052 |
human | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052 |
Document | tf-idf vector |
---|---|
d1 | [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0] |
d2 | [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0] |
d3 | [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052] |
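The tables above can be reproduced with a short script using raw term frequency (count / document length) and idf = log10(N / df). Since the tables round tf to two decimals (0.11 for 1/9), the printed tf-idf values may differ in the third decimal place:

```python
import math

docs = [
    "music is a universal language",
    "music is a miracle",
    "music is a universal feature of the human experience",
]
tokenized = [d.split() for d in docs]
N = len(docs)

vocab = sorted({w for doc in tokenized for w in doc})
# document frequency: number of documents containing each word
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

def tfidf_vector(doc):
    return {w: (doc.count(w) / len(doc)) * math.log10(N / df[w]) for w in vocab}

vectors = [tfidf_vector(doc) for doc in tokenized]
print(round(vectors[0]["universal"], 3))  # 0.035
print(round(vectors[1]["miracle"], 3))    # 0.119
```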
Let us calculate the document similarity w.r.t d3 -
Jaccard similarity between d1 and d3 -
d1 and d3 share 4 words (music, is, a, universal), and their union contains 10 distinct words, so
J(d1, d3) = 4/10
= 0.4
Jaccard similarity between d2 and d3 -
d2 and d3 share 3 words (music, is, a), and their union contains 10 distinct words, so
J(d2, d3) = 3/10
= 0.3
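These overlaps can be checked directly by treating each document as a set of words:

```python
def jaccard(doc_a, doc_b):
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

d1 = "music is a universal language"
d2 = "music is a miracle"
d3 = "music is a universal feature of the human experience"
print(jaccard(d1, d3))  # 4 shared words / 10 distinct words = 0.4
print(jaccard(d2, d3))  # 3 shared words / 10 distinct words = 0.3
```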
Euclidean Distance between d1 and d3 using relative term frequency values -
E(d1, d3) = sqrt[ (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0)^2 + 0 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 ]
= sqrt[ 0.0081 + 0.0081 + 0.0081 + 0.0081 + 0.04 + 0.0121 + 0.0121 + 0.0121 + 0.0121 + 0.0121 ]
= sqrt(0.1329)
≈ 0.365
Euclidean Distance between d2 and d3 using relative term frequency values -
E(d2, d3) = sqrt[ (0.25 - 0.11)^2 + (0.25 - 0.11)^2 + (0.25 - 0.11)^2 + (0 - 0.11)^2 + 0 + (0.25 - 0)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2]
= sqrt[ 0.0196 + 0.0196 + 0.0196 + 0.0121 + 0.0625 + 0.0121 + 0.0121 + 0.0121 + 0.0121 + 0.0121 ]
= sqrt(0.1939)
≈ 0.440
Cosine similarity between d1 and d3 -
num = [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0] * [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]
= 0*0 + 0*0 + 0*0 + 0.035*0.019 + 0.095*0 + 0*0 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052
= 0.000665
den = sqrt[0 + 0 + 0 + 0.0012 + 0.009 + 0 + 0 + 0 + 0 + 0 + 0] * sqrt[0 + 0 + 0 + 0.0004 + 0 + 0 + 0.0027 + 0.0027 + 0.0027 + 0.0027 + 0.0027]
= 0.101 * 0.118
= 0.0119
cosθ = 0.000665 / 0.0119
≈ 0.056
Cosine similarity between d2 and d3 -
num = [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0] * [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]
= 0*0 + 0*0 + 0*0 + 0*0.019 + 0*0 + 0.119*0 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052
= 0
Thus,
cosθ = 0
Thus, it can be concluded that d1 and d3 have the greater Cosine Similarity, as intuition suggests.
Note:
- In this example, Cosine Similarity on the tf-idf vectors gives the clearest result: the idf weighting zeroes out the words common to all three documents, so d2 and d3 end up with no informative terms in common (cosθ = 0).
- Jaccard similarity agrees, ranking the pair d1, d3 above the pair d2, d3.
- Euclidean Distance must be read in reverse: a smaller distance means greater similarity. Here E(d1, d3) < E(d2, d3), so it also ranks d1 closer to d3; however, since it operates on raw term frequencies, it is sensitive to differences in document length and can mislead when documents differ greatly in size.
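As a cross-check, the cosine ranking can be verified directly from the tf-idf vectors tabulated above:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v))
    return num / den

# rounded tf-idf vectors from the tables above
d1 = [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0]
d2 = [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0]
d3 = [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]

print(cosine(d1, d3) > cosine(d2, d3))  # True: d1 is closer to d3 than d2 is
print(cosine(d2, d3))                   # 0.0: d2 and d3 share no informative terms
```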