Jaccard Similarity (Intersection over Union)
Jaccard Similarity is a similarity metric that is used to determine how similar two data points are with each other. It is, originally, defined over sets as (Intersection between two sets) / (Union of two sets).
Jaccard Similarity is, also, known as Jaccard Index or Intersection over Union.
Set A and B
I = Intersection of sets A and B
U = Union of sets A and B
Jaccard Similarity = I / U
Jaccard similarity is always between 0 and 1 as the intersection of two sets can never be larger than the union of the two sets.
Key terms:
- Intersection of two sets: Common elements in both sets
- Union of two sets: All elements that belong to either of the sets or both sets
This is an important metric due to an unique property that links it to minhashing.
Related:
- MinHash (Probabilistic Data Structure for Similarity) determines the Jaccard Similarity
Properties of Jaccard Similarity
The properties of Jaccard Similarity are as follows:
- Jaccard Similarity is 0 if the sets are distinct
- Jaccard Similarity is 1 if the sets are same
- Jaccard Similarity does not depend on the order of elements in the set. This point becomes interesting as we link it with MinHash.
Use of Jaccard Similarity
The most common use cases of Jaccard Similarity are as follows:
- Find similarity in writing of two Authors
Jaccard Similarity works well in this use case. The works of the authors form two distinct sets and Jaccard Similarity is calculated over it. It captures notions of writing style well provided its similicity.
It must be ensured that the works of the two authors are in the same field (like Computer Science) so that the results are accurate.
- Simple way to estimate similarity
Jaccard Similarity is a simple yet highly useful metric to estimate similarity between two data points. It is used in various NLP problems frequently.
There do exist other more accurate similarity metrics but Jaccard Similarity is still in use because of its simplicity.
- Use to estimate the value of MinHash which is another similarity metric which is widely used like in AltaVista Search Engine, in NLP tasks and more.
Jaccard Similarity with numerical sets
Let us understand the calculation using an example of two sets with numerical data and we need to find Jaccard Similarity of the two sets.
The two sets are:
- S1 = {1, 9, 5, 19}
- S2 = {0, -10, 20, 5}
Intersection of S1 and S2 = {5}; size = 1
Union of S1 and S2 = {1, 9, 5, 19, 0, -10, 20} = size 7
Jaccard Similarity = Intersection / Union
Jaccard Similarity = 1 / 7 = 0.1428
S1, S2 = two sets with numbers
intersection = 0
union = 0
for number N1 in S1
if N1 is in S2
++ intersection
++ union
for number N2 in S2
if N2 is not in S1
++ union
Jaccard Similarity = intersection / union
Jaccard Similarity with strings
Jaccard Similarity can be used to find similarity between two strings or body of text. For the body of text, each word is considered as a unit of a set while in comparing strings, each character is considered as a unit of a set.
Let us consider two strings:
- opengenus
- opensource
We need to find the similarity between the two strings. Each string can be visualized as a set of characters.
S1 = opengenus = {o, p, e, n, g, e, n, u, s}
S2 = opensource = {o, p, e, n, s, o, u, r, c, e}
As a set shall consist of unique elements only, the sets are as follows:
S1 = opengenus = {o, p, e, n, g, n, u, s}
S2 = opensource = {o, p, e, n, s, u, r, c}
S1 intersection S2 is {o, p, e, n, u, s} with size 6.
S1 union S2 is {o, p, e, n, g, n, u, s, r, c} with size 10.
The Jaccard Similarity of "opengenus" and "opensource" is 6/10 = 0.6
The pseudocode is as follows:
S1, S2 = two text body
remove duplicates in S1 and S2
intersection = 0
union = 0
for word1 in S1
if word1 is in S2
++ intersection
++ union
for word1 in S2
if word1 is not in S1
++ union
Jaccard Similarity = intersection / union
Minhashing and Jaccard Similarity
Given two sets, the probability of similarity using minhash function for a random permutation of data of the two sets is same as Jaccard similarity of the two sets.
Minhashing is a technique which is used to find the similarity of two sets using hashing techniques. The idea is to represent a set using a hash value that takes into consideration the order of elements.
The relation is intuitive in nature as a random order represents a general set and hence, the values of similarity using MinHash and Jaccard Similarity become same.
Conclusion
Jaccard Similarity is still a very useful metric to find the similarity of two datasets. It may not be used in cases when the order of elements in data is a significant feature.