Jaccard Similarity (Intersection over Union)

Jaccard Similarity is a similarity metric that is used to determine how similar two data points are with each other. It is, originally, defined over sets as (Intersection between two sets) / (Union of two sets).

Jaccard Similarity is, also, known as Jaccard Index or Intersection over Union.

Set A and B
I = Intersection of sets A and B
U = Union of sets A and B

Jaccard Similarity = I / U

Jaccard similarity is always between 0 and 1 as the intersection of two sets can never be larger than the union of the two sets.

Key terms:

Intersection of two sets: Common elements in both sets
Union of two sets: All elements that belong to either of the sets or both sets

This is an important metric due to an unique property that links it to minhashing.

MinHash (Probabilistic Data Structure for Similarity) determines the Jaccard Similarity

Properties of Jaccard Similarity

The properties of Jaccard Similarity are as follows:

Jaccard Similarity is 0 if the sets are distinct
Jaccard Similarity is 1 if the sets are same
Jaccard Similarity does not depend on the order of elements in the set. This point becomes interesting as we link it with MinHash.

Use of Jaccard Similarity

The most common use cases of Jaccard Similarity are as follows:

Find similarity in writing of two Authors

Jaccard Similarity works well in this use case. The works of the authors form two distinct sets and Jaccard Similarity is calculated over it. It captures notions of writing style well provided its similicity.

It must be ensured that the works of the two authors are in the same field (like Computer Science) so that the results are accurate.

Simple way to estimate similarity

Jaccard Similarity is a simple yet highly useful metric to estimate similarity between two data points. It is used in various NLP problems frequently.

There do exist other more accurate similarity metrics but Jaccard Similarity is still in use because of its simplicity.

Use to estimate the value of MinHash which is another similarity metric which is widely used like in AltaVista Search Engine, in NLP tasks and more.

Jaccard Similarity with numerical sets

Let us understand the calculation using an example of two sets with numerical data and we need to find Jaccard Similarity of the two sets.

The two sets are:

S1 = {1, 9, 5, 19}
S2 = {0, -10, 20, 5}

Intersection of S1 and S2 = {5}; size = 1

Union of S1 and S2 = {1, 9, 5, 19, 0, -10, 20} = size 7

Jaccard Similarity = Intersection / Union

Jaccard Similarity = 1 / 7 = 0.1428

S1, S2 = two sets with numbers
intersection = 0
union = 0

for number N1 in S1
    if N1 is in S2
        ++ intersection
    ++ union

for number N2 in S2
    if N2 is not in S1
        ++ union
        
Jaccard Similarity = intersection / union

Jaccard Similarity with strings

Jaccard Similarity can be used to find similarity between two strings or body of text. For the body of text, each word is considered as a unit of a set while in comparing strings, each character is considered as a unit of a set.

Let us consider two strings:

opengenus
opensource

We need to find the similarity between the two strings. Each string can be visualized as a set of characters.

S1 = opengenus = {o, p, e, n, g, e, n, u, s}
S2 = opensource = {o, p, e, n, s, o, u, r, c, e}

As a set shall consist of unique elements only, the sets are as follows:

S1 = opengenus = {o, p, e, n, g, n, u, s}
S2 = opensource = {o, p, e, n, s, u, r, c}

S1 intersection S2 is {o, p, e, n, u, s} with size 6.

S1 union S2 is {o, p, e, n, g, n, u, s, r, c} with size 10.

The Jaccard Similarity of "opengenus" and "opensource" is 6/10 = 0.6

The pseudocode is as follows:

S1, S2 = two text body
remove duplicates in S1 and S2
intersection = 0
union = 0

for word1 in S1
    if word1 is in S2
        ++ intersection
    ++ union

for word1 in S2
    if word1 is not in S1
        ++ union
        
Jaccard Similarity = intersection / union

Minhashing and Jaccard Similarity

Given two sets, the probability of similarity using minhash function for a random permutation of data of the two sets is same as Jaccard similarity of the two sets.

Minhashing is a technique which is used to find the similarity of two sets using hashing techniques. The idea is to represent a set using a hash value that takes into consideration the order of elements.

The relation is intuitive in nature as a random order represents a general set and hence, the values of similarity using MinHash and Jaccard Similarity become same.

Conclusion

Jaccard Similarity is still a very useful metric to find the similarity of two datasets. It may not be used in cases when the order of elements in data is a significant feature.