Reading time: 15 minutes
Damerau Levenshtein distance is a variant of Levenshtein distance which is a type of Edit distance. Edit distance is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string. It can be seen as a way of pairwise string alignment.
Motivation: Damerau stated that the four operations in Damerau Levenshtein distance correspond to more than 80% of all human misspellings
The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transposition operation among its allowable operations in addition to the three single character edit operations (insertions, deletions and substitutions).
Damerau Levenshtein distance has a vast range of applications in computational linguistics, computer science, natural language processing, bioinformatics and many others.
Given two strings S1 and S2, edit distance is the minimum cost associated with operations to convert string S1 to S2. 3 Operations available in Levenshtein distance are:
- Insertion: Insert a character at a position (cost: 1)
- Deletion: Delete a character at any position (cost: 1)
- Replace: Replace a character at any position with another character (cost: 1)
Let us add another operation: transposition
- transposition: swap two adjacent characters (cost: 1)
For the transposition operation, we will look back one character to apply this operation.
S1: ag qwe S2: a wqe
Operations in Levenshtein distance:
Operation 1: delete g at index 1 Operation 2: delete q at index 4 Operation 3: insert q at index 5 Levenshtein distance = 3
Operations in Levenshtein distance with transposition operation:
Operation 1: delete g at index 1 Operation 2: transposition: swap characters at index 4 and 5 Levenshtein distance with transposition operation = 2
One problem with this is that it assumes no character where added or delete between the transposed characters.
So, in Damerau Levenshtein distance, for the transposition operation, we will look beyond one character and hence, we can swap characters between any index.
If we consider S1 to be row and S2 to be column (in algorithms like Wagner Fischer algorithm)
The cost of a transposition is calculated as:
(cost before transposition) + (distance between rows) + (distance between columns) + 1
Implementation point of view:
The cost for all four operations are calculated and the minimum cost operation is chosen.
Levenshtein distance with non-negative cost satisfies the axioms of a metric giving rise to a metric space of strings, when the following conditions are met:
- Every edit operation has positive cost
- For every operation, there is an inverse operation with equal cost
Properties of unit-cost Levenshtein distances include:
- LCS distance is bounded above by the sum of lengths of a pair of strings
- LCS distance is an upper bound on Levenshtein distance
- For strings of the same length, Hamming distance is an upper bound on Levenshtein distance
Regardless of cost/weights, the following property holds of all edit distances:
- When a and b share a common prefix, this prefix has no effect on the distance
Algorithms that can be used to compute Levenshtein distance are:
- Wagner Fischer algorithm
- Ukkonen algorithm
- Landau Myers and Schmidt algorithm
- Dissimilarity between protein sequences
- Quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T
Natural Language Processing
- Correction of spelling mistakes
- Correct OCR errors
- Measure the similarity between two data points