Different Types of Similarity measurements

Reading time: 15 minutes

Similarity measurements or metrics are used to find the similarity between two data points (in N dimensional space), two strings, two probability distribution and two sets. These are used widely in Statistics, Machine Learning and Computing. We have listed and explored different Similarity measurements.
Similarity measurements are same as Distance measurements.

The different types of similarity measurements are:

  • Similarity between two data points (N dimensional)
    • Euclidean distance
    • Manhattan distance
    • Minkowski distance
    • Chebyshev distance
  • Similarity between Strings
    • Edit distance
    • Levenshtein distance
    • Damerau Levenshtein distance
    • Lee distance
    • Hamming distance
    • Jaro distance
    • Jaro Winkler distance
  • Similarity between two probability distribution
    • Mahalanobis distance
    • Bhattacharyya distance
    • Hellinger distance
  • Similarity between two sets
    • Jaccard index
    • Sorensen similarity index

Similarity between two data points

The different Similarity metrics between two data points are:

  • Euclidean distance
  • Manhattan distance
  • Minkowski distance
  • Chebyshev distance

Euclidean distance

Euclidean distance is a distance metric to find the distance between two points in an N-dimensional space. Euclidean distance is a special case of Minkowski distance when P=2.

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

Then, the Euclidean distance between P1 and P2 is given as:

$$ \sqrt[2]{{(x1-y1)}^2\ +\ {(x2-y2)}^2\ +\ ...\ +\ {(xN-yN)}^2}
$$

Read about Euclidean distance/ L2 norm in depth

Manhattan distance

Manhattan distance is a distance metric to find the distance between two points in an N-dimensional space. Manhattan distance is a special case of Minkowski distance when P=1.

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

Then, the Manhattan distance between P1 and P2 is given as:

$$ {{(x1-y1)}\ +\ {(x2-y2)}\ +\ ...\ +\ {(xN-yN)}}
$$

Minkowski distance

Minkowski distance is a distance/ similarity measurement between two points in the normed vector space (N dimensional real space) and is a generalization of the Euclidean distance and the Manhattan distance.

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

Then, the Minkowski distance between P1 and P2 is given as:

$$ \sqrt[p]{{(x1-y1)}^p\ +\ {(x2-y2)}^p\ +\ ...\ +\ {(xN-yN)}^p}
$$

Chebyshev distance

Chebyshev distance is a distance metric which is the maximum absolute distance in one dimension of two N dimensional points. It has real world applications in Chess, Warehouse logistics and many other fields.

It is, also, known as Tchebychev distance, maximum metric, chessboard distance and Lāˆž metric.

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

Then, the Chebyshev distance between P1 and P2 is given as:

Chebyshev distance = MAXIMUM (|Xi - Yi|) where i is 1 to N

One application of Chebyshev distance:

  • Minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares

Similarity between Strings

The different Similarity metrics between Strings are:

  • Edit distance
  • Levenshtein distance
  • Damerau Levenshtein distance
  • Lee distance
  • Hamming distance
  • Jaro distance
  • Jaro Winkler distance

Edit distance

Edit distance is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string.

Given two strings S1 and S2, edit distance is the minimum cost associated with operations to convert string S1 to S2. Common operations are:

  • Insertion: Insert a character at a position (cost: 1)
  • Deletion: Delete a character at any position (cost: 1)
  • Replace: Replace a character at any position with another character (cost: 1)

One application of Edit distance is to Quantify the similarity of DNA sequences.

Levenshtein distance

Levenshtein distance is a type of Edit distance which is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string.

It can be seen as a way of pairwise string alignment.

Damerau Levenshtein distance

Damerau Levenshtein distance is a variant of Levenshtein distance which is a type of Edit distance. Edit distance is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string.

Given two strings S1 and S2, edit distance is the minimum cost associated with operations to convert string S1 to S2. 3 Operations available in Levenshtein distance are:

  • Insertion: Insert a character at a position (cost: 1)
  • Deletion: Delete a character at any position (cost: 1)
  • Replace: Replace a character at any position with another character (cost: 1)

Let us add another operation: transposition for Damerau Levenshtein distance:

  • transposition: swap two adjacent characters (cost: 1)

For the transposition operation, we will look back one character to apply this operation.

Lee distance

Lee distance is a similarity metric developed by C. Y. Lee to find the similarity between two strings. Lee distance is a generalization of Hamming distance.

Let the two strings S1 and S2 be represented as:

  • S1: X1, X2, X3, ..., XN
  • S2: Y1, Y2, Y3, ..., YN

The Lee distance between S1 and S2 is as follows:

Lee distance = SUM( MINIMUM( |Xi - Yi|, q - |Xi - Yi|)

where q >= 2 (based on experimentation).

Hamming distance

Hamming distance is a similarity metric developed by Richard Hamming to find the similarity between two strings. Hamming distance is a special case of Lee Distance when q = 2 or 3.

This is a common metric used widely in error correcting codes.

Jaro distance

Jaro distance is a similarity metric to find the similarity between two strings.

Let there be two strings S1 and S2. Then, Jaro distance is defined as follows:

if M = 0, then jaro_distance = 0

Else, jaro_distance = (1/3) * (M / |S1| + M / |S2| + (M-T)/M)

where:

|S1| is length of string S1
|S2| is length of string S2
M is number of matching characters between S1 and S2
T is number of transposition between S1 and S2

Jaro Winkler distance

Jaro Winkler distance is a similarity metric to find the similarity between two strings. Jaro Winkler distance is a modification of Jaro distance.

Let there be two strings S1 and S2. Then, Jaro distance is defined as follows:

Let sim_j = Jaro Winkler distance
sim_i = Jaro distance

Then,
sim_j = sim_i + L * P * (1 - sim_i)

where:

  • L: length of common prefix at the start (maximum: 4)
  • P: constant scaling factor

Similarity between two probability distribution

Different Similarity metrics between two probability distribution are:

  • Mahalanobis distance
  • Bhattacharyya distance
  • Hellinger distance

Mahalanobis distance

Mahalanobis distance is a distance metric used to find the distance from a point P to a distribution D. Mahalanobis distance is a special case of Bhattacharyya distance.

Mahalanobis distance was developed by P. C. Mahalanobis in 1936. He was the founder of Indian Statistical Institute.

Bhattacharyya distance

Bhattacharyya distance is a similarity metric used to measure similarity between two probability distribution. This was developed by Anil Kumar Bhattacharya, a statistician from Indian Statistical Institute.

This is more reliable than Mahalanobis distance. Bhattacharyya distance is a generalization of Mahalanobis distance.

Let P and Q be two probability distribution.

P = x1, x2, ...
Q = y1, y2, ...

Bhattacharyya distance = - log( BC(P, Q))

where BC = Bhattacharyya Coefficient.
BC(P, Q) = SUM( (P(x) * Q(x))^0.5 )

Hellinger distance

Hellinger distance is a similarity metric used to measure similarity between two probability distribution.

Hellinger distance is related to Bhattacharyya distance. It was developed by Ernst Hellinger in 1909.

Similarity between two sets

Different Similarity metrics between two sets are:

  • Jaccard index
  • Sorensen similarity index

Jaccard index

Jaccard index is a metric that is used to find the similarity between two sets.

Let A and B be two sets, then Jaccard index is defined as:

Jaccard index = (A intersection B) / (A union B)
Jaccard index = (A intersection B) / (A + B - (A interection B))

Sorensen similarity index

Sorensen similarity index is a metric that is used to find the similarity between two sets.

Let A and B be two sets, then Jaccard index is defined as:

Sorensen similarity index = (A intersection B) / (A + B)

With this article at OpenGenus, you must have the complete idea of different Similarity metrics that are used in practice.