Creating an archive for larger files in Python


There may be times when a lot of disk space is occupied by larger files, so we can create an archive of these files. Archiving them manually can be a tedious task, so we will write a Python script that recursively traverses all the directories in a given path (or the present working directory), finds all the files whose size is greater than a given limit, and archives them.

In this article, you will learn how to use the functionality provided by Python modules to get the size of a file, recursively traverse directories, create a directory, create and write to a file, and create an archive of files.

Modules used:

  • os module: The os module in Python provides a portable way of using operating-system-dependent functionality.

  • shutil module: The shutil module offers a number of high-level operations on files and collections of files. In particular, functions are provided which support file copying and removal.

Stepwise breakdown of the problem (algorithm)

1. Getting the path of the desired directory from the user, or setting the path to the present working directory if no path is entered.
2. Checking whether the directory path entered by the user exists. If the path is not valid, exit the program.
3. Creating the log directory to store all the larger files.
4. Creating a metadata.txt file in the log directory to store the relative paths of all the files that will be moved, and storing its file descriptor.
5. Getting the size limit for files, in kilobytes.
6. Converting the size limit from kilobytes into bytes.
7. Listing all the files in that directory and its subdirectories before archiving.
8. Traversing all the files in the given path or current working directory.
9. Checking whether the size of a file is greater than the given size limit.
10. If it is greater, moving the file to the log folder and writing the relative path of the file to the metadata.txt file.
11. Repeating steps 9 and 10 until all files are checked.
12. Closing the file descriptor of the metadata.txt file.
13. Archiving the log folder.
14. Removing the log folder.
15. Listing all the files in that directory and its subdirectories after archiving.
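
Steps 4, 10 and 12 work with a low-level file descriptor rather than a regular file object. The following is a minimal sketch of that pattern, assuming a temporary directory stands in for the log directory; the names demo_dir and meta_path and the two sample paths are only illustrative.

```python
import os
import tempfile

# A temporary directory stands in for the log directory (illustrative only).
demo_dir = tempfile.mkdtemp()
meta_path = os.path.join(demo_dir, "metadata.txt")

# os.open returns an integer file descriptor, not a file object.
fd = os.open(meta_path, os.O_APPEND | os.O_CREAT | os.O_RDWR)

# os.write expects bytes, so each relative path is encoded before writing.
for rel in ("file5", os.path.join("a", "file1.txt")):
    os.write(fd, (rel + "\n").encode())

# The descriptor must be closed explicitly once all paths are written.
os.close(fd)

# Read the file back with a regular file object to check its contents.
with open(meta_path) as fh:
    lines = fh.read().splitlines()
print(lines)
```

A regular open() call inside a with block would close the file automatically; the descriptor-based calls are shown here because they are the ones the steps describe.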

Now we will learn about some of the important steps which we used in the above algorithm.

  • Recursively traversing the directories.
  • Getting path of the files.
  • Getting the size of the files.
  • Getting relative path of a file.
  • Moving the files.
  • Creating an archive.

Recursively traversing the directories

The given path may contain multiple directories nested inside one another. We use the os.walk() function to recursively traverse all of them and get, for each directory, its root path, its list of subdirectories and its list of files.

os.walk(): Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).

Example:

import os
path="path to the desired directory"
#os.walk yields a 3-tuple (root, directories, files) for every directory it visits.
for (root,directories,files) in os.walk(path):
    print(root)
    print(directories)
    print(files)
    print("___________________")
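
Because os.walk() walks top-down by default, removing a name from the dirnames list in place stops the walk from descending into that directory. The sketch below, built on a small temporary tree, shows how a directory such as log could be skipped this way; the helper name walk_skipping and the tree layout are hypothetical.

```python
import os
import tempfile

def walk_skipping(top, skip_name):
    """Yield file paths under top, never descending into directories named skip_name."""
    for root, dirnames, filenames in os.walk(top):
        # In top-down mode, editing dirnames in place prunes the walk.
        dirnames[:] = [d for d in dirnames if d != skip_name]
        for name in filenames:
            yield os.path.join(root, name)

# Build a tiny tree: top/a/file1.txt and top/log/metadata.txt
top = tempfile.mkdtemp()
for sub, name in (("a", "file1.txt"), ("log", "metadata.txt")):
    os.makedirs(os.path.join(top, sub))
    with open(os.path.join(top, sub, name), "w") as fh:
        fh.write("x")

# Only the file outside the skipped directory is reported.
found = sorted(os.path.relpath(p, top) for p in walk_skipping(top, "log"))
print(found)
```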

Getting path to the files

Now that we have the root and the list of files in the root, we need to join each file name with the root to get the path of the file. We use the os.path.join() function to join two or more path components together. In our case we need to join the file name which we got from the list with the root, so we pass root and the file name as arguments.

os.path.join(): Join one or more path components intelligently. The return value is the concatenation of path and any members of paths with exactly one directory separator (os.sep) following each non-empty part except the last, meaning that the result will only end in a separator if the last part is empty.

Example:

import os
path="Path to the desired directory"
#root contains the path to the directory which the function is traversing.
#directory contains the list of directories in the root.
#files contains the list of files in the root.
for (root,directory,files) in os.walk(path):
    for file in files:
        pth=os.path.join(root,file)
        print("path of file:",pth)
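
The behaviour described for os.path.join() can be checked without touching the file system. This is a small sketch using made-up path components ("projects", "timecheckpy"); it also shows that an absolute component discards everything before it.

```python
import os

root = os.path.join("projects", "timecheckpy")

# Components are joined with exactly one separator between them.
joined = os.path.join(root, "a", "file1.txt")

# An empty last component leaves a trailing separator on the result.
with_trailing = os.path.join(root, "")

# An absolute component discards everything that came before it.
absolute = os.path.abspath(os.sep)  # root of the current drive or file system
reset = os.path.join(root, absolute)

print(joined)
print(with_trailing)
print(reset)
```

The file names returned by os.walk() are plain names without a directory part, so joining them with root behaves like the first case.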

Getting the size of the file

We need to get the size of the file using the path which we created in the last step. To get the size of the file, we use the os.stat() function, which returns the details of the file when its path is passed as the argument.

os.stat(): Get the status of a file or a file descriptor. Perform the equivalent of a stat() system call on the given path. Returns a stat_result object.

The stat_result object can contain the following attributes (some of them are available only on certain platforms):

st_mode
st_ino
st_dev
st_nlink
st_uid
st_gid
st_size
st_atime
st_mtime
st_ctime
st_atime_ns
st_mtime_ns
st_ctime_ns
st_blocks
st_blksize
st_rdev
st_flags
st_gen
st_birthtime
st_fstype
st_rsize
st_creator
st_type
st_file_attributes
st_reparse_tag

We need to access the st_size attribute to get the size of the file in bytes.

Example:

import os
path="Path to the desired directory"
#first we store the stat_result object in filestat
#then we extract the size of the file into the fsize variable
for (root,directory,files) in os.walk(path):
    for file in files:
        pth=os.path.join(root,file)
        filestat=os.stat(pth)
        fsize=filestat.st_size
        print("path of file:",pth,"\tsize:",fsize,"bytes")
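
Combining the traversal with the size check gives the core of the script. The following is a sketch under stated assumptions: the helper name find_large_files and the two test files are hypothetical, and one kilobyte is taken as 1024 bytes.

```python
import os
import tempfile

def find_large_files(top, limit_kb):
    """Return (path, size_in_bytes) pairs for files under top larger than limit_kb kilobytes."""
    limit_bytes = limit_kb * 1024  # 1 kilobyte is taken as 1024 bytes
    large = []
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            size = os.stat(path).st_size  # same value as os.path.getsize(path)
            if size > limit_bytes:
                large.append((path, size))
    return large

# Create one file below and one file above a 1 KB limit.
top = tempfile.mkdtemp()
small = os.path.join(top, "small.bin")
big = os.path.join(top, "big.bin")
with open(small, "wb") as fh:
    fh.write(b"\0" * 512)
with open(big, "wb") as fh:
    fh.write(b"\0" * 4096)

result = find_large_files(top, 1)
print(result)
```

Returning the list instead of moving files inside the loop keeps the search separate from the later move step, which makes each part easier to test.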

Getting relative path of a file

We need the relative path of every file that will be moved, so that when we open the zip file we know which directory each file originally belonged to. We use the os.path.relpath() function for this. It takes the path of the file and the start path from which the relative path should be calculated; in our case the start path is the path given by the user or the current working directory. It returns a string containing the relative path of the file.

os.path.relpath(): Returns a relative file path to the given path, either from the current working directory or from the given start directory.
Syntax: os.path.relpath(path, start)

Example:

import os
fpath="path of the file whose relative path you want to find"
spath="the path from which the relative path needs to be calculated"
rpath=os.path.relpath(fpath,spath)
print("relative path of ",fpath," from ", spath ," is ",rpath)
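
With concrete values the result is easier to see. The sketch below uses made-up relative paths ("projects", "timecheckpy" and so on); os.path.relpath() only manipulates the path strings, so these files do not need to exist.

```python
import os

start = os.path.join("projects", "timecheckpy")
fpath = os.path.join(start, "c", "d", "file4")

# Path of the file as seen from the start directory.
rel = os.path.relpath(fpath, start)

# Joining the start directory and the relative path gives the file path back.
rebuilt = os.path.join(start, rel)

# A path outside the start directory is expressed with ".." components.
sibling = os.path.relpath(os.path.join("projects", "other"), start)

print(rel)
print(rebuilt)
print(sibling)
```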

Moving the files

We now have all the data required to decide whether a file should be moved. To move a file in Python, we use the shutil.move() function, passing the path of the source file and the path of the destination as arguments.

shutil.move(): Recursively move a file or directory (src) to another location (dst) and return the destination. If the destination is an existing directory, then src is moved inside that directory. If the destination already exists but is not a directory, it may be overwritten.

Syntax: shutil.move(spath,dpath)

Example:

import shutil
spath="path of the source file"
dpath="path to the destination directory"
shutil.move(spath,dpath)
print("The file has been moved.")
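
One case worth knowing about when many files are moved into a single directory: if that directory already contains an entry with the same name, shutil.move() raises shutil.Error instead of overwriting it. This is a small sketch on a temporary tree; the directory names a, b and log and the file name data.bin are only illustrative.

```python
import os
import shutil
import tempfile

top = tempfile.mkdtemp()
log_dir = os.path.join(top, "log")
os.makedirs(os.path.join(top, "a"))
os.makedirs(os.path.join(top, "b"))
os.mkdir(log_dir)

# Two different files that happen to share the name "data.bin".
for sub in ("a", "b"):
    with open(os.path.join(top, sub, "data.bin"), "w") as fh:
        fh.write(sub)

# The first move succeeds and returns the new location of the file.
new_path = shutil.move(os.path.join(top, "a", "data.bin"), log_dir)

# The second move targets a name that already exists inside log_dir.
try:
    shutil.move(os.path.join(top, "b", "data.bin"), log_dir)
    collided = False
except shutil.Error:
    collided = True

print(new_path)
print(collided)
```

A script that collects files from many subdirectories may therefore want to catch this error or choose a unique destination name; which of the two fits best depends on the use case.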

Creating an archive

After moving all the larger files into the log folder, we need to create an archive of the log folder. For this we use shutil.make_archive() to create a zip file that stores all the data of the log directory.

shutil.make_archive(): This function of the shutil module creates an archive of the given directory recursively. We provide the name of the archive to be created (without the extension), its format (zip, tar, etc.) and the source directory, i.e., the directory which will be archived. It returns the path of the new archive.

Example:

import shutil
spath="path of the directory you want to archive"
print("Archive ",shutil.make_archive("backup","zip",spath)," is created.")
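
To see what ends up inside the archive, the result can be opened with the zipfile module. The following is a sketch that archives a temporary directory; the names work, src_dir and logbackup are assumptions made for the example.

```python
import os
import shutil
import tempfile
import zipfile

work = tempfile.mkdtemp()
src_dir = os.path.join(work, "log")
os.mkdir(src_dir)
with open(os.path.join(src_dir, "metadata.txt"), "w") as fh:
    fh.write("file5\n")

# base_name has no extension; make_archive appends ".zip" itself.
base_name = os.path.join(work, "logbackup")
archive_path = shutil.make_archive(base_name, "zip", src_dir)

# Names stored in the archive are relative to the archived directory.
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()

print(archive_path)
print(names)
```

The archive is written next to the source directory rather than inside it, so the archive file is not picked up while the directory is being archived.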

Code of the problem statement

#importing os and shutil module
import os
import shutil

#Getting path of file in pth variable
pth=input("Enter path to the directory where you want to zip files or press enter for current working directory:")

#Checking whether pth is empty or not; if empty, store the path of the present working directory
if len(pth)==0:
    pth=os.getcwd()
else:
    #If pth has some value we check whether the path is a directory or not.
    if not os.path.isdir(pth):
        print("Wrong path!!!")
        exit(1)
    else:
        #Using the absolute path so that later joins stay valid after chdir
        pth=os.path.abspath(pth)
        os.chdir(pth)

#creating the log folder to store the files        
lp=os.path.join(pth,"log")
os.mkdir(lp)

#Creating a metadata file to store the relative path of the files that will be moved into the log directory
mp=os.path.join(lp,"metadata.txt")
fd=os.open(mp,os.O_APPEND|os.O_CREAT|os.O_RDWR)

#Getting the expected size limit of file    
n=int(input("Enter the size of file that needs to be zipped:"))

#converting size of file from kilobytes into bytes
n=n*1024

print("\n\t*****\t*****\n")

print("List of files and directories before zipping:")
#Recursively traversing files and directories using os.walk()
for roots,dirs,files in os.walk(pth):
    for f in files:
        print(os.path.join(roots,f))
        
print("\n\t*****\t*****\n")


#Recursively traversing all the files and checking their size
print("\nFiles that are being moved to log directory:")
print("\nFile name\t\t\tSize")
for roots,dirs,files in os.walk(pth):
    for f in files:
        if roots != lp:
            fil=os.path.join(roots,f)
            filstat=os.stat(fil)
            fs=filstat.st_size
            if fs> n:
                print(fil,":",fs/1024,"kb")
                #Getting relative path of the file
                relativepth=str.encode(os.path.relpath(fil,pth)+"\n")
                os.write(fd,relativepth)
                #Moving the file to the log directory
                shutil.move(fil,lp)
                
#Closing file descriptor fd            
os.close(fd)

#Creating a zip of the log directory
shutil.make_archive("logbackup",'zip',lp)


#Deleting log folder
shutil.rmtree(lp)

print("\n\t*****\t*****\n")

print("List of files and directories after zipping:")            
for roots,dirs,files in os.walk(pth):
    for f in files:
        print(os.path.join(roots,f))

Output

Enter path to the directory where you want to zip files or press enter for current working directory:H:\Projects\timecheckpy
Enter the size of file that needs to be zipped:1028

	*****	*****

List of files and directories before zipping:
H:\Projects\timecheckpy\file5
H:\Projects\timecheckpy\file6
H:\Projects\timecheckpy\a\file1.txt
H:\Projects\timecheckpy\b\file2
H:\Projects\timecheckpy\c\file3
H:\Projects\timecheckpy\c\d\file4
H:\Projects\timecheckpy\log\metadata.txt

	*****	*****


Files that are being moved to log directory:

File name			Size
H:\Projects\timecheckpy\file5 : 2048.0 kb
H:\Projects\timecheckpy\a\file1.txt : 10240.0 kb
H:\Projects\timecheckpy\c\d\file4 : 2048.0 kb

	*****	*****

List of files and directories after zipping:
H:\Projects\timecheckpy\file6
H:\Projects\timecheckpy\logbackup.zip
H:\Projects\timecheckpy\b\file2
H:\Projects\timecheckpy\c\file3
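
The relative paths written to metadata.txt can later be read back directly from the zip file. The following is a minimal sketch, assuming the archive has metadata.txt at its top level as produced by the script; the helper name read_original_paths is hypothetical, and a small stand-in archive is built so that the example is self-contained.

```python
import os
import shutil
import tempfile
import zipfile

def read_original_paths(archive_path):
    """Return the relative paths recorded in metadata.txt inside the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        text = zf.read("metadata.txt").decode()
    return text.splitlines()

# Build a stand-in for logbackup.zip with two recorded paths.
work = tempfile.mkdtemp()
log_dir = os.path.join(work, "log")
os.mkdir(log_dir)
recorded = ["file5", os.path.join("a", "file1.txt")]
with open(os.path.join(log_dir, "metadata.txt"), "w") as fh:
    fh.write("\n".join(recorded) + "\n")
archive = shutil.make_archive(os.path.join(work, "logbackup"), "zip", log_dir)

paths = read_original_paths(archive)
print(paths)
```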

With this article at OpenGenus, you should now have a complete idea of how to develop such an application in Python. Enjoy.