# Edit distance

Reading time: 15 minutes

Edit distance is a large class of distance metric of measuring the dissimilarity between two strings by computing a minimum number of operations (from a set of operations) used to convert one string to another string. It can be seen as a way of pairwise string alignment.

Edit distance has several variants such as Levenshtein distance, Spatial Edit Distance, Graph edit distance and many others.

Edit distance has a vast range of applications in computational linguistics, computer science, natural language processing, bioinformatics and many others.

### Computation

Given two strings S1 and S2, edit distance is the minimum cost associated with operations to convert string S1 to S2. Common operations are:

- Insertion: Insert a character at a position (cost: 1)
- Deletion: Delete a character at any position (cost: 1)
- Replace: Replace a character at any position with another character (cost: 1)

Example:

Consider the following strings:

```
S1 = aadbdeg
S2 = aabdec
```

Operations used to convert S1 to S2:

Operation 1: Delete d at position 3

Operation 2: Replace g at position 6 by c

Edit distance = 1 + 1 = 2

### Properties

Edit distance with non-negative cost satisfies the axioms of a metric giving rise to a metric space of strings, when the following conditions are met:

- Every edit operation has positive cost
- For every operation, there is an inverse operation with equal cost

Properties of unit-cost edit distances include:

- LCS distance is bounded above by the sum of lengths of a pair of strings
- LCS distance is an upper bound on Levenshtein distance
- For strings of the same length, Hamming distance is an upper bound on Levenshtein distance

Regardless of cost/weights, the following property holds of all edit distances:

- When a and b share a common prefix, this prefix has no effect on the distance

### Algorithms

Algorithms that can be used to compute edit distance are:

- Wagner Fischer algorithm
- Ukkonen algorithm
- Landau Myers and Schmidt algorithm

### Applications

Computational Biology

- Quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T

Natural Language Processing

- Correction of spelling mistakes
- Correct OCR errors

General computation

- Measure the similarity between two data points