
Finding the similarity between documents is an important task. With the rise of Machine Learning, there are sophisticated techniques that use RNNs to model the grammar and flow of two documents in order to determine their similarity. This might sound like a difficult task, but we have been doing it since the 1980s using Natural Language Processing techniques. In this article, we will use NLP techniques such as TF-IDF to achieve this.

Did you know that even today, roughly 4 out of 5 systems use NLP techniques to handle document similarity?

### Introduction

In Natural Language Processing, Document Similarity plays a very important role in Information Retrieval. Document Similarity is the determination of how **similar** two or more documents are to each other. It is used not only for search but also for duplicate detection.

The key idea is to represent documents as vectors using TF-IDF. Once we have the vector representation, we can measure how close two documents are using any of the standard similarity metrics, such as cosine similarity.

Understand TF-IDF in depth.

### Similarity Metrics

- Cosine Similarity
- Jaccard Distance
- Euclidean Distance

In this article, we will focus on Cosine Similarity using tf-idf.

## Jaccard Distance

This is the most intuitive and easy method of calculating Document Similarity. It is based on the Jaccard similarity, from which the Jaccard distance is obtained as 1 - similarity:

```
Jaccard Similarity(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|
                           = elements common to d1 and d2 / all elements in d1 and d2 together
```

Let us say d1 and d2 are the following vectors, treated here as sets of their elements:

```
d1 = [ 1 3 2 ]
d2 = [ 5 0 3]
```

In this case, d1 ∩ d2 is {3} and d1 ∪ d2 is {1, 3, 2, 5, 0}.

Jaccard similarity between d1 and d2 is 1/5 = 0.2
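
The set-based calculation above can be sketched in Python, treating each document as the set of its distinct elements:

```python
def jaccard_similarity(d1, d2):
    """Jaccard similarity: |intersection| / |union| of the two sets."""
    s1, s2 = set(d1), set(d2)
    if not s1 and not s2:
        return 1.0  # convention: two empty documents are identical
    return len(s1 & s2) / len(s1 | s2)

d1 = [1, 3, 2]
d2 = [5, 0, 3]
print(jaccard_similarity(d1, d2))  # intersection {3}, union {0, 1, 2, 3, 5} -> 0.2
```

For real documents, `d1` and `d2` would be lists of tokens (words) rather than numbers; the set operations work the same way.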

## Cosine Similarity

**Definition** - Cosine Similarity measures the similarity between two or more documents by computing the cosine of the angle between the vectors derived from those documents.

The steps to find the cosine similarity are as follows -

- Calculate the document vectors (**Vectorization**)

As we know, vectors deal in numbers. Thus, to represent text documents as vectors, we compute their tf-idf values.

- Calculate tf-idf for the given document **d**

Note: Scikit-Learn provides a transformer called TfidfVectorizer in the module feature_extraction.text for vectorizing documents with TF-IDF numerics.
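
In practice you would reach for scikit-learn's TfidfVectorizer; the following is a minimal pure-Python sketch that instead reproduces the hand computation used later in this article (tf = count / document length, idf = log10(N / df)). Note that sklearn's defaults differ (e.g. idf smoothing and L2 normalization), so its numbers will not match exactly.

```python
import math

def tfidf_vectors(docs):
    """tf-idf with tf = count / doc length and idf = log10(N / df),
    matching the worked example in this article."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    vectors = [
        [(doc.count(w) / len(doc)) * math.log10(n / df[w]) for w in vocab]
        for doc in tokenized
    ]
    return vocab, vectors

docs = ["Music is a universal language",
        "Music is a miracle",
        "Music is a universal feature of the human experience"]
vocab, vecs = tfidf_vectors(docs)
# "language" appears only in d1: tf-idf = 0.2 * log10(3) ≈ 0.095
print(round(vecs[0][vocab.index("language")], 3))
```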

- Calculate the Cosine Similarity

The Cosine Similarity is found by taking the Dot Product of the document vectors calculated in the previous step and dividing it by the product of their magnitudes.

```
cosθ = v1.v2 / ( |v1| * |v2| )

where v1.v2 is the dot product
      |v1| is the magnitude of v1
```

Understanding dot product and magnitude:

let us say v1 and v2 are vectors

```
v1 = [ 1 3 2 ]
v2 = [ 5 0 -3]
```

In this case, the dot product of v1 and v2 will be:

```
v1 . v2 = 1 * 5 + 3 * 0 + 2 * -3 = 5 + 0 + -6 = -1
```

The magnitudes will be:

```
|v1| = sqrt(1^2 + 3^2 + 2^2) = sqrt(1 + 9 + 4) = sqrt(14)
|v2| = sqrt(5^2 + 0^2 + (-3)^2) = sqrt(25 + 0 + 9) = sqrt(34)
```
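
The dot product and magnitude calculations combine into cosine similarity as a short function:

```python
import math

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (|v1| * |v2|)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = math.sqrt(sum(a * a for a in v1))
    mag2 = math.sqrt(sum(b * b for b in v2))
    if mag1 == 0 or mag2 == 0:
        return 0.0  # convention: similarity with a zero vector is 0
    return dot / (mag1 * mag2)

v1 = [1, 3, 2]
v2 = [5, 0, -3]
print(cosine_similarity(v1, v2))  # -1 / (sqrt(14) * sqrt(34)) ≈ -0.0458
```

Since tf-idf values are never negative, cosine similarity between document vectors always lands in [0, 1]; the negative value here comes from the toy vectors only.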

## Euclidean Distance

- The calculation of Euclidean Distance follows a process similar to that of Cosine Similarity.
- Calculate the document vectors and apply the following formula to find the Euclidean Distance.

```
Euclidean Distance = sqrt( ∑ (xi − yi)^2 ), where i = 1 to n (the number of vector components)
```

Example:

let us say v1 and v2 are vectors

```
v1 = [ 1 3 2 ]
v2 = [ 5 0 -3]
```

The Euclidean distance will be:

```
Euclidean distance = sqrt ( (1-5)^2 + (3-0)^2 + (2-(-3))^2 )
= sqrt ( 16 + 9 + 25 )
= sqrt (50)
= 5 * sqrt(2)
```
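
The same calculation as a function:

```python
import math

def euclidean_distance(v1, v2):
    """Square root of the sum of squared componentwise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

v1 = [1, 3, 2]
v2 = [5, 0, -3]
print(euclidean_distance(v1, v2))  # sqrt(16 + 9 + 25) = sqrt(50) ≈ 7.071
```

Unlike the two similarity measures, this is a distance: a *smaller* value means the documents are *more* similar.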

## Example

Say document similarity is to be found for the following documents -

```
d1 - Music is a universal language
d2 - Music is a miracle
d3 - Music is a universal feature of the human experience
```

*Step 1* - We calculate tf-idf for all three documents:

Word | tf (d1) | tf (d2) | tf (d3) | IDF | tf-idf (d1) | tf-idf (d2) | tf-idf (d3)
---|---|---|---|---|---|---|---
music | 0.2 | 0.25 | 0.11 | 0 | 0 | 0 | 0
is | 0.2 | 0.25 | 0.11 | 0 | 0 | 0 | 0
a | 0.2 | 0.25 | 0.11 | 0 | 0 | 0 | 0
universal | 0.2 | 0 | 0.11 | 0.176 | 0.035 | 0 | 0.019
language | 0.2 | 0 | 0 | 0.477 | 0.095 | 0 | 0
miracle | 0 | 0.25 | 0 | 0.477 | 0 | 0.119 | 0
of | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052
the | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052
feature | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052
experience | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052
human | 0 | 0 | 0.11 | 0.477 | 0 | 0 | 0.052

Document | tf-idf vector
---|---
d1 | [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0]
d2 | [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0]
d3 | [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]

Let us calculate the document similarity w.r.t *d3* -

Jaccard Similarity between *d1* and *d3* (d1 and d3 share the 4 words {music, is, a, universal}, and together contain 10 distinct words) -

```
J(d1, d3) = 4/10
          = 0.4
```

Jaccard Similarity between *d2* and *d3* (d2 and d3 share the 3 words {music, is, a}, and together contain 10 distinct words) -

```
J(d2, d3) = 3/10
          = 0.3
```

Euclidean Distance between *d1* and *d3* using relative term frequency values -

```
E(d1, d3) = sqrt[ (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0.11)^2 + (0.2 - 0)^2 + 0 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 ]
          = sqrt[ 0.0081 + 0.0081 + 0.0081 + 0.0081 + 0.04 + 0.0121 + 0.0121 + 0.0121 + 0.0121 + 0.0121 ]
          = sqrt(0.1329)
          = 0.365
```

Euclidean Distance between *d2* and *d3* using relative term frequency values -

```
E(d2, d3) = sqrt[ (0.25 - 0.11)^2 + (0.25 - 0.11)^2 + (0.25 - 0.11)^2 + (0 - 0.11)^2 + 0 + (0.25 - 0)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 + (0 - 0.11)^2 ]
          = sqrt[ 0.0196 + 0.0196 + 0.0196 + 0.0121 + 0.0625 + 0.0121 + 0.0121 + 0.0121 + 0.0121 + 0.0121 ]
          = sqrt(0.1939)
          = 0.440
```
```

Cosine similarity between *d1* and *d3* -

```
num = [0, 0, 0, 0.035, 0.095, 0, 0, 0, 0, 0, 0] . [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]
    = 0.035 * 0.019
    = 0.000665
den = sqrt(0.035^2 + 0.095^2) * sqrt(0.019^2 + 5 * 0.052^2)
    = sqrt(0.0102) * sqrt(0.0139)
    = 0.101 * 0.118
    = 0.0119
cosθ = 0.000665 / 0.0119
     = 0.056
```

Cosine similarity between *d2* and *d3* -

```
num = [0, 0, 0, 0, 0, 0.119, 0, 0, 0, 0, 0] . [0, 0, 0, 0.019, 0, 0, 0.052, 0.052, 0.052, 0.052, 0.052]
    = 0*0 + 0*0 + 0*0 + 0*0.019 + 0*0 + 0.119*0 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052 + 0*0.052
    = 0
Thus,
cosθ = 0
```

Thus, it can be concluded that *d1* and *d3* have greater Cosine Similarity as is intuitive.
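
Putting the pieces together, a minimal pure-Python pipeline (again using tf = count / document length and idf = log10(N / df), so a library such as scikit-learn will give slightly different numbers) reproduces this ranking:

```python
import math

def tfidf_cosine(docs):
    """Return tf-idf vectors plus a cosine function for comparing them."""
    toks = [d.lower().split() for d in docs]
    vocab = sorted({w for t in toks for w in t})
    n = len(docs)
    df = {w: sum(w in t for t in toks) for w in vocab}
    vecs = [[(t.count(w) / len(t)) * math.log10(n / df[w]) for w in vocab]
            for t in toks]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        mu = math.sqrt(sum(a * a for a in u))
        mv = math.sqrt(sum(b * b for b in v))
        return dot / (mu * mv) if mu and mv else 0.0

    return vecs, cos

docs = ["Music is a universal language",
        "Music is a miracle",
        "Music is a universal feature of the human experience"]
vecs, cos = tfidf_cosine(docs)
print(cos(vecs[0], vecs[2]))  # d1 vs d3: small but positive ("universal" is shared)
print(cos(vecs[1], vecs[2]))  # d2 vs d3: 0.0, no shared word with nonzero idf
```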

**Note**:

- In this example, the most useful measure of Document Similarity is provided by Cosine Similarity.
- Jaccard Similarity agrees: it ranks the pair *d1* and *d3* as more similar than the pair *d2* and *d3*.
- Euclidean Distance also agrees, since a smaller distance means greater similarity and E(d1, d3) is smaller than E(d2, d3). However, Euclidean Distance is sensitive to document length, which is why Cosine Similarity is usually preferred for text.