Find similarity between documents using TF IDF


Reading time: 25 minutes

Finding similarity between documents is an important task. In the rise of Machine Learning techniques, there are complicated techniques using RNN to detect the grammer and flow behind two documents to determine their similarity. This might sound to be a difficult task but we have been doing this since 1980s using Natural Language Processing techniques. We will follow NLP techniques like TF IDF to achieve this in this article.

Did you know that even today 4 out of 5 systems use NLP techniques to deal with document similarity?

Introduction

In Natural Language Processing, during Information Retrieval, Document Similarity plays a very important role. Document Similarity is a concept which involves determination of how similar two or more documents are with respect to each other. It is not only used for searching but also for duplication detection.

Key idea is to represent documents as vectors using TF-IDF. Once we have the vector representation, we can similarly find the similarity using any of the similarity metrics such as cosine similarity.

Understand TF-IDF in depth

Similarity Metrics

  1. Cosine Similarity
  2. Jaccard Distance
  3. Euclidean Distance

In this article, we will focus on Cosine Similarity using tf-idf.

Jaccard Distance

This is the most intuitive and easy method of calculating Document Similarity.

Jaccard Similarity(d1, d2) = d1 ∩ d2 / d1 ∪ d2
    = common things between d1 and d1 / all things in d1 and d2 together

let us say d1 and d2 are vectors

d1 = [ 1 3 2 ]
d2 = [ 5 0 3]

In this case, d1 ∩ d2 is: [3] and d1 ∪ d2 is [1 3 2 5 0]

Jaccard similarity between d1 and d2 is 1/5 = 0.2

Cosine Similarity

Definition - Cosine similarity defines the similarity between two or more documents by measuring cosine of angle between two vectors derived from the documents.

The steps to find the cosine similarity are as follows -

  • Calculate document vector. (Vectorization)

As we know, vectors represent and deal with numbers. Thus, to be able to represent text documents, we find their tf-idf numerics.

  • Calculate tf-idf for the given document d

Note: Scikit-Learn provides a transformer called the TfidfVectorizer in the module feature_extraction.text for vectorizing documents with TF–IDF numerics.

  • Calculate the Cosine Similarity

The Cosine Similarity can be found by taking the Dot Product of the document vectors calculated in the previous step.

cosθ = v1.v2 / ( |v1| * |v2| )

where v1.v2 is dot product
      |v1| is magnitude of v1

Understanding dot product and magnitude:

let us say v1 and v2 are vectors

v1 = [ 1 3 2 ]
v2 = [ 5 0 -3]

In this case, the dot product of v1 and v2 will be:

v1 . v2 = 1 * 5 + 3 * 0 + 2 * -3 = 5 + 0 + -6 = -1

Magnitude will be:

|v1| = 1^2 + 3^2 + 2^2 = 1 + 9 + 4 = 14
|v2| = 5^2 + 0^2 + (-3)^2 = 25 + 0 + 9 = 34

Euclidean Distance

  • Calculation of Euclidean Distance is similar in process to Cosine Similarity.
  • Calculate Document Vectors and apply the following formula to find the Euclidean Distance.
Euclidean Distance = sqrt(∑(xi−yi)^2), where i = 1 to i = n (number of vectors)

Example:

let us say v1 and v2 are vectors

v1 = [ 1 3 2 ]
v2 = [ 5 0 -3]

The Euclidean distance will be:

Euclidean distance = sqrt ( (1-5)^2 + (3-0)^2 + (2-(-3))^2 ) 
                   = sqrt ( 16 + 9 + 25 )
                   = sqrt (50)
                   = 5 * sqrt(2)

Example

Say, document similarity is to be found out for the following documents -

d1 - Music is a universal language
d2 - Music is a miracle
d3 - Music is a universal feature of the human experience

Step 1 - We calculate tf-idf for all three documents

Word tf d1 tf d2 tf d3 IDF tf-idfd1 tf-idfd2 tf-idfd3
music 0.2 0.25 0.11 0 0 0 0
is 0.2 0.25 0.11 0 0 0 0
a 0.2 0.25 0.11 0 0 0 0
universal 0.2 0 0.11 0.176 0.035 0 0.019
langauge 0.2 0 0 0.477 0.095 0 0
miracle 0 0.25 0 0.477 0 0.119 0
of 0 0 0.11 0.477 0 0 0.052
the 0 0 0.11 0.477 0 0 0.052
feature 0 0 0.11 0.477 0 0 0.052
experience 0 0 0.11 0.477 0 0 0.052
human 0 0 0.11 0.477 0 0 0.052
Document tf-idf vector
d1 [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0]
d2 [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0]
d3 [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]

Let us calculate the document similarity w.r.t d3 -

Jaccard Distance between d1 and d3 -

J(d1, d3) = 4/14
              = 0.286

Jaccard Distance between d2 and d3 -

J(d2, d3) = 3/13
              = 0.231

Euclidean Distance between d1 and d3 using relative term frequency values -

E(d1, d3) = sqrt[ (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + 0 + 0 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2]
              = sqrt[ 0.0081 + 0.0081 + 0.0081 + 0.0081 + 0.012 + 0.012 + 0.012 + 0.012 + 0.012]
              = sqrt(0.0924)
              = 0.303

Euclidean Distance between d2 and d3 using relative term frequency values -

E(d2, d3) = sqrt[ (0.25 - 0.11)^2 + (0.25 - 0.11)^2 + (0.25 - 0.11)^2 + (0 - 0.11)^2 + 0 + (0.25 - 0)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2]
              = sqrt[ 0.0196 + 0.0196 + 0.0196 + 0.012 + 0.0625 + 0.012 + 0.012 + 0.012 + 0.012 + 0.012]
              = sqrt(0.1933)
              = 0.439

Cosine similarity between d1 and d3 -

num = [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0] * [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]
    = 0*0 + 0*0 + 0*0 + 0.035*0.019 + 0.095*0 + 0*0 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 
    = 0.000665
    
den = sqrt[0 + 0 + 0 + 0.0012 + 0.009 + 0 + 0 + 0 + 0 + 0 + 0] * sqrt[0 + 0 + 0 + 0.0003 + 0 + 0 + 0.0027 + 0.0027 + 0.0027 + 0.0027 + 0.0027]
    = 0.0102 + 0.0138
    = 0.024
    
cosθ = 0.000665 / 0.024
     = 0.028

Cosine similarity between d2 and d3 -

num = [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0] * [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]
    = 0*0 + 0*0 + 0*0 + 0*0.019 + 0*0 + 0.119*0 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 
    = 0

Thus,
cosθ = 0

Thus, it can be concluded that d1 and d3 have greater Cosine Similarity as is intuitive.

Note:

  • In this example, the most accurate Document Similarity is provided by Cosine Similarity.
  • Meanwhile, Jaccard Distance is fairly accurate as it states that the document pair d1 and d3 are more similar as compared to d2 and d3.
  • Euclidean Distance, on the other hand, makes a wrong conclusion altogether as the document similarity for the document pair d2 and d3 is greater than d1 and d3.