Difflib is a Python module that contains several easy-to-use functions and classes that allow users to compare sets of data. The module presents the results of these sequence comparisons in a human-readable format, utilizing deltas to display the differences more cleanly.
Much of the time difflib is used to compared string sequences, however, it can also be used to compare other data types as long as they are hashable. We know that an object is hashable if its hash value doesn't change through the duration of its lifetime.
The most commonly used classes in the
difflib module are the Sequence Matcher and the Differ classes. There are also several other helper classes and functions that can assist with more specific operations. Let's take a closer look at some of these functions.
First, let's start off with a fairly self-explanatory method of the difflib module:
SequenceMatcher method will compare two given strings and return data that presents how similar the two strings are. Let's try this out together using the
ratio() object. This will return the comparison data in decimal format.
>>> import difflib >>> from difflib import SequenceMatcher >>> str1 = 'I like pizza' >>> str2 = 'I like tacos' >>> seq = SequenceMatcher(a=str1, b=str2) >>> print(seq.ratio()) 0.66666666
In the example above, we start off by importing the difflib module as well as the
SequenceMatcher class in our terminal or Command Prompt. Then we define the two string values that we will be comparing using the
SequenceMatcher class. Next, you'll notice that we create a new variable that encapsulates the
SequenceMatcher class with two parameters,
b. Although, the method actually accepts three parameters:
In order for the the method to acknowledge our two strings, we need to assign each of the string values to the method's variables,
Once all of the necessary variables have been defined and the SequenceMatcher has been given at least two parameters, we can now print the value using the
ratio() object that we'd mentioned earlier. This determines the ratio of characters that are similar in the two strings and the result is then returned as a decimal. Just like that, we have compared two simple strings and received a result on their similarities.
ratio() object is one of a few that belong to the Sequence Matcher class. Check out the Python documentation to find out about more of these objects to perform more operations on sequences.
Differ class is the opposite of
SequenceMatcher; it takes in lines of text and finds the differences between the strings. However, the
Differ class is unique in its usage of deltas, making it even more readable and easier for humans to spot the differences.
For instance, when adding new characters to the second string in a comparison between two strings, a
'+ ' will appear before the line that has received the additional characters.
As you have probably guessed, deleting some of the characters that were visible in the first string will cause
'- ' to pop up before the second line of text.
If a line is the same in both sequences,
' ' will be returned and if there is a line missing, then you will see
'? '. Additionally, you can also utilize attributes like
ratio(), which we saw in the last example. Let's see the
Differ class in action.
>>> import difflib >>> from difflib import Differ >>> str1 = "I would like to order a pepperoni pizza" >>> str2 = "I would like to order a veggie burger" >>> str1_lines = str1.splitlines() >>> str2_lines = str2.splitlines() >>> d = difflib.Differ() >>> diff = d.compare(str1_lines, str2_lines) >>> print('\n'.join(diff)) # output I would like to order a '- ' pepperoni pizza '+ ' veggie burger
In the example above, we begin by importing the module and
Differ class. Once we have defined our two strings that we want to compare, we must invoke the
splitlines() function on the two strings.
>>> str1_lines = str1.splitlines() >>> str2_lines = str2.splitlines()
This will allow us to compare the strings by each line rather than by each individual character.
Once we have defined a variable that contains the
Differ class, we create another that contains
Differ with the
compare() object, taking in the two strings as parameters.
>>> diff = d.compare(str1_lines, str2_lines)
We call the print function and join the
diff variable with a line enter so that our result is formatted in a way that makes it more readable.
Another simple yet powerful tool in
difflib is its
get_close_matches method. It's exactly what it sounds like: a tool that will take in arguments and return the closest matches to the target string. In pseudocode, the function works like this:
get_close_matches(target_word, list_of_possibilities, n=result_limit, cutoff)
As we can see above,
get_close_matches can take in 4 arguments but only requires the first 2 in order to return results.
The first parameter is the word that we are targeting; what we want the method to return similarities to. The second parameter can be an array of terms, or a variable that points to an array of strings. The third parameter allows the user to define a limit to the number of results that are returned. The last parameter determines how similar two words need to be in order to be returned as a result.
With the first two parameters, alone, the method will return results based on the default cutoff of 0.6 (in the range of 0 - 1) and a default result limit of 3. Take a look at a couple of examples in order to see how this function really works.
>>> import difflib >>> from difflib import get_close_matches >>> get_close_matches('bat', ['baton', 'chess', 'bat', 'bats', 'fireflies', 'batter']) ['bat', 'bats', 'baton']
Notice how the example above only returns three results even though there is a fourth term that is similar to 'bats': 'batter'. This is because we did not specify a result limit as our third parameter. Let's try that again, but this time we will define a result_limit and a cutoff.
>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.6) ['bat', 'bats', 'baton', 'batter']
This time we get all four results that are at least 60% similar to the word, 'bat'. The cutoff is equivalent to the original because we just defined the same value as the default, 0.6. However, this can be changed to make the results more or less strict. The closer to 1, the more strict the constraints will be. In the example below, the constraint has been changed to 0.9. This means that the results will need to be at least 90% similar to the word 'bat'.
>>> get_close_matches('bat', ['baton', 'chess', 'batter', 'bats', 'fireflies', 'battering'], n=4, cutoff=0.9) ['bat']
unified_diff & context_diff
There are two classes in
difflib which operate in a very similar fashion; the unified_diff and the context_diff. The only major difference between the two is the result.
unified_diff takes in two strings of data and then returns each word that was either added or removed from the first. The best way to understand this concept is by seeing it in practice:
>>> import sys >>> import difflib >>> from difflib import unified_diff >>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n'] >>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n'] >>> sys.stdout.writelines(unified_diff(str1, str2)) --- +++ @@ -1,5 +1,5 @@ -dog -cat -frog -bear +puppy +kitten +tadpole +cub animals
As evidenced by the results, the
unified_diff returns the removed words prefixed with
- and returns the added words prefixed with
+. The final word, 'animals' contains no prefix because it was present in both strings.
context_diff works in the same way as the
unified_diff. However, instead of revealing what was added and removed from the original string, it simply returns what lines have changed by returning the changed lines with a prefix of '!'.
>>> from difflib import context_diff >>> str1 = ['dog\n', 'cat\n', 'frog\n', 'bear\n', 'animals\n'] >>> str2 = ['puppy\n', 'kitten\n', 'tadpole\n', 'cub\n', 'animals\n'] >>> sys.stdout.writelines(context_diff(str1, str2)) *** --- *************** *** 1,5 **** ! dog ! cat ! frog ! bear animals --- 1,5 ---- ! puppy ! kitten ! tadpole ! cub animals
Within these examples, we can see that many of the functions and classes of the
difflib module resemble one another. Each have their own set of benefits and it's important to analyze which will work best for your project. Comparing sets of data becomes effortless when leveraging the
difflib module, but your results can be even better when your program returns results in the most readable format possible for your data. The best way to improve your skills with this module is by putting these examples into practice.
With this article at OpenGenus, we must have a complete idea of using difflib module in Python. Enjoy.