Learning Python's difflib Module


Difflib is a Python module that contains several easy-to-use functions and classes that allow users to compare sets of data. The module presents the results of these sequence comparisons in a human-readable format, utilizing deltas to display the differences more cleanly.

Much of the time difflib is used to compared string sequences, however, it can also be used to compare other data types as long as they are hashable. We know that an object is hashable if its hash value doesn't change through the duration of its lifetime.

The most commonly used classes in the difflib module are the Sequence Matcher and the Differ classes. There are also several other helper classes and functions that can assist with more specific operations. Let's take a closer look at some of these functions.

Sequence Matcher

First, let's start off with a fairly self-explanatory method of the difflib module: SequenceMatcher. The SequenceMatcher method will compare two given strings and return data that presents how similar the two strings are. Let's try this out together using the ratio() object. This will return the comparison data in decimal format.

>>> import difflib
>>> from difflib import SequenceMatcher
>>> str1 = 'I like pizza'
>>> str2 = 'I like tacos'
>>> seq = SequenceMatcher(a=str1, b=str2)
>>> print(seq.ratio())
0.66666666

In the example above, we start off by importing the difflib module as well as the SequenceMatcher class in our terminal or Command Prompt. Then we define the two string values that we will be comparing using the SequenceMatcher class. Next, you'll notice that we create a new variable that encapsulates the SequenceMatcher class with two parameters, a and b. Although, the method actually accepts three parameters: None, a, and b.

In order for the the method to acknowledge our two strings, we need to assign each of the string values to the method's variables, SequenceMatcher(a=str1, b=str2).

Once all of the necessary variables have been defined and the SequenceMatcher has been given at least two parameters, we can now print the value using the ratio() object that we'd mentioned earlier. This determines the ratio of characters that are similar in the two strings and the result is then returned as a decimal. Just like that, we have compared two simple strings and received a result on their similarities.

Note

The ratio() object is one of a few that belong to the Sequence Matcher class. Check out the Python documentation to find out about more of these objects to perform more operations on sequences.

Differ

The Differ class is the opposite of SequenceMatcher; it takes in lines of text and finds the differences between the strings. However, the Differ class is unique in its usage of deltas, making it even more readable and easier for humans to spot the differences.

For instance, when adding new characters to the second string in a comparison between two strings, a '+ ' will appear before the line that has received the additional characters.

As you have probably guessed, deleting some of the characters that were visible in the first string will cause '- ' to pop up before the second line of text.

If a line is the same in both sequences, ' ' will be returned and if there is a line missing, then you will see '? '. Additionally, you can also utilize attributes like ratio(), which we saw in the last example. Let's see the Differ class in action.

>>> import difflib
>>> from difflib import Differ
>>> str1 = "I would like to order a pepperoni pizza"
>>> str2 = "I would like to order a veggie burger"
>>> str1_lines = str1.splitlines()
>>> str2_lines = str2.splitlines()
>>> d = difflib.Differ()
>>> diff = d.compare(str1_lines, str2_lines)
>>> print('\n'.join(diff))
# output
I would like to order a 
'- ' pepperoni pizza
'+ ' veggie burger

In the example above, we begin by importing the module and Differ class. Once we have defined our two strings that we want to compare, we must invoke the splitlines() function on the two strings.

>>> str1_lines = str1.splitlines()
>>> str2_lines = str2.splitlines()

This will allow us to compare the strings by each line rather than by each individual character.

Once we have defined a variable that contains the Differ class, we create another that contains Differ with the compare() object, taking in the two strings as parameters.

>>> diff = d.compare(str1_lines, str2_lines)

We call the print function and join the diff variable with a line enter so that our result is formatted in a way that makes it more readable.

get_close_matches

Another simple yet powerful tool in difflib is its get_close_matches method. It's exactly what it sounds like: a tool that will take in arguments and return the closest matches to the target string. In pseudocode, the function works like this:

get_close_matches(target_word, list_of_possibilities, n=result_limit, cutoff)

As we can see above, get_close_matches can take in 4 arguments but only requires the first 2 in order to return results.

The first parameter is the word that we are targeting; what we want the method to return similarities to. The second parameter can be an array of terms, or a variable that points to an array of strings. The third parameter allows the user to define a limit to the number of results that are returned. The last parameter determines how similar two words need to be in order to be returned as a result.

With the first two parameters, alone, the method will return results based on the default cutoff of 0.6 (in the range of 0 - 1) and a default result limit of 3. Take a look at a couple of examples in order to see how this function really works.

>>> import difflib
>>> from difflib import get_close_matches
>>> get_close_matches('bat', ['baton', 'chess', 'bat', 'bats', 'fireflies', 'batter'])

['bat', 'bats', 'baton']

Notice how the example above only returns three results even though there is a fourth term that is similar to 'bats': 'batter'. This is because we did not specify a result limit as our third parameter. Let's try that again, but this time we will define a result_limit and a cutoff.

>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.6)

['bat', 'bats', 'baton', 'batter']

This time we get all four results that are at least 60% similar to the word, 'bat'. The cutoff is equivalent to the original because we just defined the same value as the default, 0.6. However, this can be changed to make the results more or less strict. The closer to 1, the more strict the constraints will be. In the example below, the constraint has been changed to 0.9. This means that the results will need to be at least 90% similar to the word 'bat'.

>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.9)

['bat']

unified_diff & context_diff

There are two classes in difflib which operate in a very similar fashion; the unified_diff and the context_diff. The only major difference between the two is the result.

The unified_diff takes in two strings of data and then returns each word that was either added or removed from the first. The best way to understand this concept is by seeing it in practice:

>>> import sys
>>> import difflib
>>> from difflib import unified_diff
>>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n']
>>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n']
>>> sys.stdout.writelines(unified_diff(str1, str2))
---
+++
@@ -1,5 +1,5 @@
-dog
-cat
-frog
-bear
+puppy
+kitten
+tadpole
+cub
 animals

As evidenced by the results, the unified_diff returns the removed words prefixed with - and returns the added words prefixed with +. The final word, 'animals' contains no prefix because it was present in both strings.

The context_diff works in the same way as the unified_diff. However, instead of revealing what was added and removed from the original string, it simply returns what lines have changed by returning the changed lines with a prefix of '!'.

>>> from difflib import context_diff
>>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n']
>>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n']
>>> sys.stdout.writelines(context_diff(str1, str2))
***
---
***************
*** 1,5 ****
! dog
! cat
! frog
! bear
  animals
--- 1,5 ----
! puppy
! kitten
! tadpole
! cub
  animals

Within these examples, we can see that many of the functions and classes of the difflib module resemble one another. Each have their own set of benefits and it's important to analyze which will work best for your project. Comparing sets of data becomes effortless when leveraging the difflib module, but your results can be even better when your program returns results in the most readable format possible for your data. The best way to improve your skills with this module is by putting these examples into practice.

With this article at OpenGenus, we must have a complete idea of using difflib module in Python. Enjoy.

Learn more: