Different Types of Similarity measurements
Reading time: 15 minutes
Similarity measurements or metrics are used to find the similarity between two data points (in N dimensional space), two strings, two probability distribution and two sets. These are used widely in Statistics, Machine Learning and Computing. We have listed and explored different Similarity measurements.
Similarity measurements are same as Distance measurements.
The different types of similarity measurements are:
- Similarity between two data points (N dimensional)
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Chebyshev distance
- Similarity between Strings
- Edit distance
- Levenshtein distance
- Damerau Levenshtein distance
- Lee distance
- Hamming distance
- Jaro distance
- Jaro Winkler distance
- Similarity between two probability distribution
- Mahalanobis distance
- Bhattacharyya distance
- Hellinger distance
- Similarity between two sets
- Jaccard index
- Sorensen similarity index
Similarity between two data points
The different Similarity metrics between two data points are:
- Euclidean distance
- Manhattan distance
- Minkowski distance
- Chebyshev distance
Euclidean distance
Euclidean distance is a distance metric to find the distance between two points in an N-dimensional space. Euclidean distance is a special case of Minkowski distance when P=2.
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Euclidean distance between P1 and P2 is given as:
$$ \sqrt[2]{{(x1-y1)}^2\ +\ {(x2-y2)}^2\ +\ ...\ +\ {(xN-yN)}^2}
$$
Manhattan distance
Manhattan distance is a distance metric to find the distance between two points in an N-dimensional space. Manhattan distance is a special case of Minkowski distance when P=1.
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Manhattan distance between P1 and P2 is given as:
$$ {{(x1-y1)}\ +\ {(x2-y2)}\ +\ ...\ +\ {(xN-yN)}}
$$
Minkowski distance
Minkowski distance is a distance/ similarity measurement between two points in the normed vector space (N dimensional real space) and is a generalization of the Euclidean distance and the Manhattan distance.
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance between P1 and P2 is given as:
$$ \sqrt[p]{{(x1-y1)}^p\ +\ {(x2-y2)}^p\ +\ ...\ +\ {(xN-yN)}^p}
$$
Chebyshev distance
Chebyshev distance is a distance metric which is the maximum absolute distance in one dimension of two N dimensional points. It has real world applications in Chess, Warehouse logistics and many other fields.
It is, also, known as Tchebychev distance, maximum metric, chessboard distance and Lā metric.
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Chebyshev distance between P1 and P2 is given as:
Chebyshev distance = MAXIMUM (|Xi - Yi|) where i is 1 to N
One application of Chebyshev distance:
- Minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares
Similarity between Strings
The different Similarity metrics between Strings are:
- Edit distance
- Levenshtein distance
- Damerau Levenshtein distance
- Lee distance
- Hamming distance
- Jaro distance
- Jaro Winkler distance
Edit distance
Edit distance is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string.
Given two strings S1 and S2, edit distance is the minimum cost associated with operations to convert string S1 to S2. Common operations are:
- Insertion: Insert a character at a position (cost: 1)
- Deletion: Delete a character at any position (cost: 1)
- Replace: Replace a character at any position with another character (cost: 1)
One application of Edit distance is to Quantify the similarity of DNA sequences.
Levenshtein distance
Levenshtein distance is a type of Edit distance which is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string.
It can be seen as a way of pairwise string alignment.
Damerau Levenshtein distance
Damerau Levenshtein distance is a variant of Levenshtein distance which is a type of Edit distance. Edit distance is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string.
Given two strings S1 and S2, edit distance is the minimum cost associated with operations to convert string S1 to S2. 3 Operations available in Levenshtein distance are:
- Insertion: Insert a character at a position (cost: 1)
- Deletion: Delete a character at any position (cost: 1)
- Replace: Replace a character at any position with another character (cost: 1)
Let us add another operation: transposition for Damerau Levenshtein distance:
- transposition: swap two adjacent characters (cost: 1)
For the transposition operation, we will look back one character to apply this operation.
Lee distance
Lee distance is a similarity metric developed by C. Y. Lee to find the similarity between two strings. Lee distance is a generalization of Hamming distance.
Let the two strings S1 and S2 be represented as:
- S1: X1, X2, X3, ..., XN
- S2: Y1, Y2, Y3, ..., YN
The Lee distance between S1 and S2 is as follows:
Lee distance = SUM( MINIMUM( |Xi - Yi|, q - |Xi - Yi|)
where q >= 2 (based on experimentation).
Hamming distance
Hamming distance is a similarity metric developed by Richard Hamming to find the similarity between two strings. Hamming distance is a special case of Lee Distance when q = 2 or 3.
This is a common metric used widely in error correcting codes.
Jaro distance
Jaro distance is a similarity metric to find the similarity between two strings.
Let there be two strings S1 and S2. Then, Jaro distance is defined as follows:
if M = 0, then jaro_distance = 0
Else, jaro_distance = (1/3) * (M / |S1| + M / |S2| + (M-T)/M)
where:
|S1| is length of string S1
|S2| is length of string S2
M is number of matching characters between S1 and S2
T is number of transposition between S1 and S2
Jaro Winkler distance
Jaro Winkler distance is a similarity metric to find the similarity between two strings. Jaro Winkler distance is a modification of Jaro distance.
Let there be two strings S1 and S2. Then, Jaro distance is defined as follows:
Let sim_j = Jaro Winkler distance
sim_i = Jaro distance
Then,
sim_j = sim_i + L * P * (1 - sim_i)
where:
- L: length of common prefix at the start (maximum: 4)
- P: constant scaling factor
Similarity between two probability distribution
Different Similarity metrics between two probability distribution are:
- Mahalanobis distance
- Bhattacharyya distance
- Hellinger distance
Mahalanobis distance
Mahalanobis distance is a distance metric used to find the distance from a point P to a distribution D. Mahalanobis distance is a special case of Bhattacharyya distance.
Mahalanobis distance was developed by P. C. Mahalanobis in 1936. He was the founder of Indian Statistical Institute.
Bhattacharyya distance
Bhattacharyya distance is a similarity metric used to measure similarity between two probability distribution. This was developed by Anil Kumar Bhattacharya, a statistician from Indian Statistical Institute.
This is more reliable than Mahalanobis distance. Bhattacharyya distance is a generalization of Mahalanobis distance.
Let P and Q be two probability distribution.
P = x1, x2, ...
Q = y1, y2, ...
Bhattacharyya distance = - log( BC(P, Q))
where BC = Bhattacharyya Coefficient.
BC(P, Q) = SUM( (P(x) * Q(x))^0.5 )
Hellinger distance
Hellinger distance is a similarity metric used to measure similarity between two probability distribution.
Hellinger distance is related to Bhattacharyya distance. It was developed by Ernst Hellinger in 1909.
Similarity between two sets
Different Similarity metrics between two sets are:
- Jaccard index
- Sorensen similarity index
Jaccard index
Jaccard index is a metric that is used to find the similarity between two sets.
Let A and B be two sets, then Jaccard index is defined as:
Jaccard index = (A intersection B) / (A union B)
Jaccard index = (A intersection B) / (A + B - (A interection B))
Sorensen similarity index
Sorensen similarity index is a metric that is used to find the similarity between two sets.
Let A and B be two sets, then Jaccard index is defined as:
Sorensen similarity index = (A intersection B) / (A + B)
With this article at OpenGenus, you must have the complete idea of different Similarity metrics that are used in practice.