Working with filecmp in Python


If you have ever wanted to compare files or directories in a simple way, then Python's filecmp module is a good place to start. The operations behind it are simple (loops over lists of file names) and could be replicated in any programming language. filecmp just simplifies and shortens your code by taking care of the logic for you.

filecmp is related to the difflib module, which we have discussed in a previous article: difflib shows how two sequences differ, while filecmp only reports whether files or directories differ. Check it out if you're curious to learn more about Python's modules.

In order to understand and use the filecmp module, we will walk through a series of examples together. First, we need to go over the structure of the module and how it works.

cmp function

The cmp function returns a simple boolean result, True or False, depending on whether it finds two files to be equal.

filecmp.cmp(f1, f2, shallow=True)

The function takes two file names and an optional third parameter, shallow, which defaults to True. This parameter determines how the two files are compared: from either a shallow or a deep perspective.

In either mode, cmp calls the os.stat() function on both files. If shallow is True and the two stat signatures are identical, the files are taken to be equal without being read. Otherwise, the contents of the files are compared.

The os.stat() function takes a file path and returns a series of stats, such as the size of the file (in bytes), the time it was last modified, the time it was last accessed, and the user id of the file owner. cmp does not use all of these: the signature it compares is made up of the file type, the size, and the modification time.
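To make the idea of a stat signature concrete, here is a minimal sketch that builds one by hand. It assumes the signature consists of the file type, the size in bytes, and the modification time, as the filecmp documentation describes; the helper name stat_signature and the file example.txt are hypothetical and used only for illustration.

```python
import os
import stat
import tempfile


def stat_signature(path):
    """Return (file type, size in bytes, modification time) for path.

    Hypothetical helper: it only illustrates the kind of information
    a shallow comparison looks at.
    """
    st = os.stat(path)
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as f:
        f.write("hello")  # five ASCII characters, so five bytes

    file_type, size, mtime = stat_signature(path)

print(stat.S_ISREG(file_type))  # True: it is a regular file
print(size)                     # 5
```

Two files with the same signature are not guaranteed to have the same contents, which is why the deep comparison exists.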

cmp also caches the results of past comparisons, so the same pair of files is not compared again unless either file's stat information changes. This prevents the program from repeating work unnecessarily. The whole cache can be cleared with filecmp.clear_cache().

>>> import filecmp
>>> filecmp.cmp('../dir1/text3.txt', '../dir2/text3.txt', shallow=False)
False
>>> filecmp.cmp('../dir1/text3.txt', '../dir2/text3.txt', shallow=True)
True
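If you would like to reproduce this behaviour without the article's directories, the following sketch builds two temporary files that have the same size and modification time but different contents. The file names a.txt and b.txt and the fixed timestamp are assumptions made for the demonstration.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    with open(a, "w") as f:
        f.write("abc")
    with open(b, "w") as f:
        f.write("xyz")  # same length as "abc", different bytes

    # Pin both files to the same access and modification time so that
    # their stat signatures (type, size, mtime) are identical.
    os.utime(a, (1_000_000_000, 1_000_000_000))
    os.utime(b, (1_000_000_000, 1_000_000_000))

    shallow_result = filecmp.cmp(a, b, shallow=True)
    deep_result = filecmp.cmp(a, b, shallow=False)

print(shallow_result)  # True: the signatures match, contents are not read
print(deep_result)     # False: the contents differ
```

The shallow comparison is faster because it never opens the files, but as this shows, it can report a match for files that are not actually identical.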

The cmp function is fairly basic in that it doesn't call any external programs, making it very portable. However, it is limited in its abilities since it returns only a boolean result and can only compare two files at a time.

cmpfiles

Earlier, we saw how the cmp function returns a boolean after comparing two files. cmpfiles compares a list of files across two directories and returns the result in three lists: match, mismatch, and errors. This gives us much more information, helping us understand the relationships between directories in more depth.

Using cmpfiles is very simple. Two directories and a list of file names are passed in as parameters. We start out by defining common as a list containing the name of each file that is present in both directories. This tells cmpfiles to loop through and compare the files that we've specified in common.

If a pair of files is found to be the same, cmpfiles appends the name to the match list; names of files that differ are appended to mismatch. As you may have guessed, names of files that cannot be compared at all (for example, because a file is missing or unreadable) are appended to the errors list.

Let's see cmpfiles in action:

import filecmp
# from filecmp import cmpfiles

dir1 = "/Users/lyndi/documents/opengenus/dir1"
dir2 = "/Users/lyndi/documents/opengenus/dir2"
common = ["text1.txt", "text2.txt", "text3.txt"]

# shallow comparison
match, mismatch, errors = filecmp.cmpfiles(dir1, dir2, common)

# Note that we did not specify a shallow parameter
# This defaults to shallow=True

print("Shallow Comparison")
print("Match: ", match)
print("Mismatch: ", mismatch)
print("Errors: ", errors, "\n\n")

# deep Comparison
match, mismatch, errors = filecmp.cmpfiles(dir1, dir2, common, shallow=False)

print("Deep Comparison")
print("Match: ", match)
print("Mismatch: ", mismatch)
print("Errors: ", errors)

# output:
Shallow Comparison
Match:  ['text2.txt']
Mismatch:  ['text3.txt']
Errors:  ['text1.txt']


Deep Comparison
Match:  ['text2.txt']
Mismatch:  ['text3.txt']
Errors:  ['text1.txt']

In the above example, we ran cmpfiles twice: once with the default shallow comparison and once with shallow=False. In this case, both calls evaluated to the same result.
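The example above depends on directories that exist on the author's machine. As a self-contained sketch, the snippet below creates two temporary directories and fills them so that each of the three lists receives exactly one name. The helper write and the file names same.txt, changed.txt, and missing.txt are hypothetical.

```python
import filecmp
import os
import tempfile


def write(directory, name, text):
    """Create a small text file inside directory (illustrative helper)."""
    with open(os.path.join(directory, name), "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    write(left, "same.txt", "identical")
    write(right, "same.txt", "identical")
    write(left, "changed.txt", "short")
    write(right, "changed.txt", "a much longer line")
    write(left, "missing.txt", "only on the left")  # absent from right

    names = ["same.txt", "changed.txt", "missing.txt"]
    match, mismatch, errors = filecmp.cmpfiles(left, right, names, shallow=False)

print(match)     # ['same.txt']
print(mismatch)  # ['changed.txt']
print(errors)    # ['missing.txt']
```

A name ends up in errors here because the file cannot be found in one of the directories, so no comparison is possible.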

dircmp class

The dircmp class compares two directories. Constructing dircmp(a, b) creates a directory comparison object whose attributes describe how the two directories differ. By default, files that share a name are compared with a shallow comparison.

For instance, in the next example we have three common files: "text1.txt", "text2.txt", and "text3.txt". They are common because both directories, dir1 and dir2, contain each of these files.

Remember that a shallow comparison treats two files as a match as soon as their signatures are the same. If we take a peek at what is contained within the files (or perform a deep comparison), we may find that not all of these files are actually identical.

Below is an example that utilizes some of the attributes of the dircmp class:

from filecmp import dircmp

# prints out the difference between directories
def printDiff(difference):
    for name in difference.diff_files:
        print("The difference found in %s and %s is %s" % (difference.left,
              difference.right, name))

difference = dircmp("../dir1", "../dir2")
printDiff(difference)
# output: The difference found in ../dir1 and ../dir2 is text3.txt

From the output, you may notice that "text3.txt" was the only file found to have a difference between the two directories. Essentially, this means that even though a file named "text3.txt" exists in both dir1 and dir2, the text inside did not match up exactly in both files.

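This result also implies that dircmp did not find a difference in "text1.txt" or "text2.txt". Keep in mind that this is a shallow comparison, so files with identical signatures are treated as the same without their contents being read. If we had received no output, that would mean that none of the files with common names were found to differ.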
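When a shallow result is not enough, one option is to feed the common file names found by dircmp into cmpfiles with shallow=False. The sketch below is one way to do that, not the only one; the helper write, the file names, and the pinned timestamp are assumptions chosen so that the shallow and deep results disagree.

```python
import filecmp
import os
import tempfile


def write(directory, name, text, mtime=1_000_000_000):
    """Create a file and pin its timestamps (illustrative helper)."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(text)
    os.utime(path, (mtime, mtime))  # identical mtime on every file


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    write(left, "a.txt", "abc")
    write(right, "a.txt", "xyz")  # same size and mtime, different bytes
    write(left, "b.txt", "same")
    write(right, "b.txt", "same")

    comparison = filecmp.dircmp(left, right)
    shallow_diff = sorted(comparison.diff_files)

    # Re-check every common file by reading its contents.
    match, mismatch, errors = filecmp.cmpfiles(
        left, right, comparison.common_files, shallow=False
    )
    deep_diff = sorted(mismatch)

print(shallow_diff)  # []
print(deep_diff)     # ['a.txt']
```

The shallow pass reports no differences because the signatures match, while the deep pass catches the file whose contents changed.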

Note

Did you notice the use of difference.left and difference.right in our example? The dircmp object stores the two directories it was given in its left and right attributes. In this case, "../dir1" was passed in as the first parameter, so it is available as left, and "../dir2" is available as right.

The dircmp class has several other attributes that pertain to the left or right directory, such as left_only, right_only, left_list, and right_list. Check out the Python documentation for a closer look at these attributes and how they can be used in your own programs.
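As a brief illustration of the left and right sides, the sketch below uses the left_only, right_only, and common attributes of a dircmp object. The temporary directories, the helper write, and the file names are hypothetical and exist only for this demonstration.

```python
import filecmp
import os
import tempfile


def write(directory, name, text):
    """Create a small text file inside directory (illustrative helper)."""
    with open(os.path.join(directory, name), "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    write(left, "shared.txt", "same text")
    write(right, "shared.txt", "same text")
    write(left, "old.txt", "only in left")
    write(right, "new.txt", "only in right")

    comparison = filecmp.dircmp(left, right)
    # Attributes are computed lazily, so read them while the
    # directories still exist.
    left_only = comparison.left_only
    right_only = comparison.right_only
    common = comparison.common

print(left_only)   # ['old.txt']
print(right_only)  # ['new.txt']
print(common)      # ['shared.txt']
```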

With this article at OpenGenus, you should now have a good idea of how to use filecmp in Python. Enjoy.