linecache module in Python: cache a text file

linecache is one of the most useful modules in the Python standard library. It gives random access to any line of a text file. The traceback module, which extracts, formats and prints the stack traces of Python programs when exceptions are raised, uses it to retrieve source lines and include them in the formatted traceback.

This module is used extensively when dealing with Python source files. It reads a file once, keeps its lines in a list held in an in-memory cache, and returns the requested line(s) by indexing into that list. When many lines are read from the same file, this avoids opening and parsing the file again for every request, which saves time.

Importing the module

The linecache module is imported as follows-

>>import linecache

Usage of linecache module

The following are the functions of the linecache module, with example code for each-

1. getline():

getline() is the most important and commonly used function of the linecache module. It takes two main parameters: the file name and the number of the line to return, i.e. getline(file, n). A third, optional parameter, module_globals, lets it find the source of modules that are loaded through an import loader rather than from a plain file. It returns the nth line of the file that is passed through the file parameter. It returns '' on errors or when the line or file is not found. The terminating newline character is included for lines that are found.

>>linecache.getline('/folder/file.txt', 2)
'Hello World!\n'

In the above code, getline() returns the 2nd line of the text file 'file.txt' in the 'folder' directory.

When a range of lines is needed, the getlines() function is used. It returns all lines of the file as a list, which can then be sliced.

>>linecache.getlines('/folder/file.txt')[1:3]
['Hello World!\n', 'Hello!\n']

The above code returns the 2nd and 3rd lines of the text file 'file.txt' in the 'folder' directory. The slice is [1:3] because list indices start at 0.
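The two calls can be tried end to end with a small sketch. The file name and its contents are made up for illustration, and the file is written to a temporary directory so that the example is self-contained.

```python
import linecache
import os
import tempfile

# Hypothetical sample file, created only so the example can run anywhere.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "file.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("Hi!\nHello World!\nHello!\nBye!\n")

# getline() uses 1-based line numbers and keeps the trailing newline.
second_line = linecache.getline(sample_path, 2)
print(repr(second_line))

# getlines() takes no range argument; it returns every line as a list,
# so lines 2 through 3 are the 0-based slice [1:3].
lines_two_to_three = linecache.getlines(sample_path)[1:3]
print(lines_two_to_three)
```

Slicing the list returned by getlines() is one simple way to get a range; calling getline() in a loop would work as well.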

2. clearcache():

When the lines from files previously read using getline() are no longer needed, we can clear the cache with the clearcache() function. It has no parameters.

>>linecache.clearcache()

3. checkcache():

It checks the validity of the cache. It is used when files in the cache may have changed on disk and the new version is needed. If filename is omitted, it checks all the entries in the cache.

>>linecache.checkcache()
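The effect of checkcache() is easiest to see with a file that changes after it has been cached. The following is a minimal sketch; the file name and contents are hypothetical, and the new content is deliberately a different length so that the file size changes.

```python
import linecache
import os
import tempfile

# Hypothetical file used only for this demonstration.
notes_path = os.path.join(tempfile.mkdtemp(), "notes.txt")

with open(notes_path, "w", encoding="utf-8") as handle:
    handle.write("old\n")
before_edit = linecache.getline(notes_path, 1)  # read from disk, then cached

# Rewrite the file with content of a different length.
with open(notes_path, "w", encoding="utf-8") as handle:
    handle.write("new content\n")

stale = linecache.getline(notes_path, 1)   # still served from the cache

linecache.checkcache(notes_path)           # drops the outdated entry
fresh = linecache.getline(notes_path, 1)   # read from disk again

print(repr(before_edit), repr(stale), repr(fresh))
```

Until checkcache() is called, getline() keeps returning the old line, because the cache is not validated on every read.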

4. lazycache():

It captures enough detail about a non-file-based module to allow its lines to be accessed later by getline(), even if module_globals is None in that later call. This avoids doing I/O until a line is actually needed: the module loader is asked for the source only when getline() or getlines() is called, not immediately.

It returns True if a lazy entry for the filename is registered in the cache, otherwise False. A lazy entry can only be registered when module_globals contains a '__name__' and a '__loader__' that provides a get_source() method. If 'file.txt' is not already in the cache and module_globals is None, there is no loader to ask, so the call returns False.

>>linecache.lazycache('file.txt', module_globals=None)

Output:
False
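A case where lazycache() does register an entry can be sketched with a stand-in loader. The InMemoryLoader class, the module name and the file name below are all hypothetical; real modules get their '__loader__' from the import system.

```python
import linecache
import os
import tempfile

class InMemoryLoader:
    """Hypothetical loader that serves source text from memory."""

    def __init__(self, source):
        self._source = source

    def get_source(self, name):
        return self._source

# A path that does not exist on disk, so only the loader can supply lines.
missing_dir = tempfile.mkdtemp()
fake_path = os.path.join(missing_dir, "demo_module.py")
fake_globals = {
    "__name__": "demo_module",
    "__loader__": InMemoryLoader("x = 1\ny = 2\n"),
}

# Registers a lazy entry; the loader is not called yet.
registered = linecache.lazycache(fake_path, fake_globals)

# module_globals is not passed here, yet the line is found via the lazy entry.
second_line = linecache.getline(fake_path, 2)

# Without module_globals there is no loader, so nothing is registered.
unregistered = linecache.lazycache(os.path.join(missing_dir, "other.py"), None)

print(registered, repr(second_line), unregistered)
```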

Applications:

Let's see some applications of the linecache module-

1. Reading Specific Lines

As we have seen above, this module lets us read specific lines from a text file. Line numbers in the linecache module start at 1, whereas a list of lines, such as the one returned by getlines(), is indexed from 0. We also need to strip the trailing newline from the value returned from the cache. Let's see an example-

Let's take a text file 'Hello.txt', the content of which is shown below-

HelloA
HelloB
HelloC

HelloD

Now let's try to read and display the 3rd line.

import linecache

# getline() returns a single line; strip its trailing newline before printing
line = linecache.getline('Hello.txt', 3).rstrip('\n')

print(line)

Output:
HelloC

2. Handling Empty Lines

Let's see an example of the output when the line to be displayed is empty.

import linecache

# Blank lines include the newline
print(repr(linecache.getline("Hello.txt", 4)))

Output:
'\n'

The 4th line of the file "Hello.txt" is blank, so only its newline character ('\n') is returned.

3. Error Handling

If the requested line number falls outside the range of valid lines in the file, linecache returns an empty string ('') instead of raising an exception.

import linecache

# Hello.txt has only 5 lines, so line 7 is out of range
print(repr(linecache.getline("Hello.txt", 7)))

Output:
''
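Since a blank line and a missing line are easy to confuse, a small wrapper can keep them apart. The read_line() helper below is a hypothetical convenience function, not part of linecache, and the sample file is recreated in a temporary directory.

```python
import linecache
import os
import tempfile

def read_line(path, lineno):
    """Return line `lineno` without its newline, or None if unavailable."""
    raw = linecache.getline(path, lineno)
    if raw == "":
        return None           # file missing or line number out of range
    return raw.rstrip("\n")   # a blank line becomes ""

# Recreate the sample file from above in a temporary directory.
hello_path = os.path.join(tempfile.mkdtemp(), "Hello.txt")
with open(hello_path, "w", encoding="utf-8") as handle:
    handle.write("HelloA\nHelloB\nHelloC\n\nHelloD\n")

third = read_line(hello_path, 3)      # regular line
fourth = read_line(hello_path, 4)     # blank line
seventh = read_line(hello_path, 7)    # past the end of the file

print(repr(third), repr(fourth), repr(seventh))
```

This works because getline() always includes the newline for lines that exist, so an empty return value can only mean that the line was not found.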

Source code:

The linecache module relies heavily on a cache. A cache is a component or temporary storage that stores data so that future requests for that data can be served faster. The data stored in a cache might be the result of an earlier computation or a copy of data that is held in another storage.

Now let's dive into the source code of the different functions of the linecache module. (from https://github.com/python/cpython/blob/master/Lib/linecache.py)

Let's take a sample file with a sample content-

filename.txt

Hello World 1
Hello World 2
Hello World 3

Let's now see the source code of different linecache module methods-

Importing important modules-

import functools
import sys
import os
import tokenize

1. clearcache():

cache = {}  # Initializing the cache

def clearcache():
    """Clear the cache entirely."""
    cache.clear()

The dictionary's .clear() method removes every entry from the cache.

2. getline():

def getline(filename, lineno, module_globals=None):
    
    lines = getlines(filename, module_globals)
    if 1 <= lineno <= len(lines):
        return lines[lineno - 1]
    return ''

It gets the list of lines for the file through getlines(), which serves them from the cache and updates the cache if it doesn't have an entry for this file already. If lineno is greater than or equal to 1 and less than or equal to the total number of lines available, it returns that line as lines[lineno - 1]. Otherwise it returns ''.
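The bounds check can be tried in isolation on a plain list. The pick_line() function below is a hypothetical stand-in that copies only the check from getline(); it is meant as a sketch of why the check matters, not as the module's code.

```python
def pick_line(lines, lineno):
    """Mirror the 1-based bounds check used by getline()."""
    if 1 <= lineno <= len(lines):
        return lines[lineno - 1]
    return ''

# Contents of the sample file, as a list of lines.
sample_lines = ["Hello World 1\n", "Hello World 2\n", "Hello World 3\n"]

first = pick_line(sample_lines, 1)
last = pick_line(sample_lines, 3)
zero = pick_line(sample_lines, 0)       # below the valid range
too_far = pick_line(sample_lines, 4)    # above the valid range

# Without the check, line number 0 would index lines[-1], the last line.
unchecked_zero = sample_lines[0 - 1]

print(repr(first), repr(last), repr(zero), repr(too_far), repr(unchecked_zero))
```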

3. getlines():

def getlines(filename, module_globals=None):

    if filename in cache:
        entry = cache[filename]
        if len(entry) != 1:
            return cache[filename][2]
    try:
        return updatecache(filename, module_globals)
    except MemoryError:
        clearcache()
        return []

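Similarly, getlines() first looks for the file in the cache. If there is an entry and its length is not 1, i.e. it is a complete entry rather than a lazy one, the list of lines stored at index 2 of the entry is returned. If there is no complete entry for the file, it calls updatecache() to read the file and update the cache. If a MemoryError occurs while doing so, the whole cache is cleared and an empty list is returned.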
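What a complete cache entry looks like can be inspected through the module-level cache dictionary. This is a sketch that peeks at an implementation detail, under the assumption that the entry layout matches the source shown in this article; the temporary file path is hypothetical.

```python
import linecache
import os
import tempfile

# Write the sample file to a temporary location.
sample_path = os.path.join(tempfile.mkdtemp(), "filename.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("Hello World 1\nHello World 2\nHello World 3\n")

in_cache_before = sample_path in linecache.cache

linecache.getline(sample_path, 1)   # the first read populates the cache

# A complete entry is a 4-tuple; a lazy entry would have length 1.
entry = linecache.cache[sample_path]
size, mtime, lines, fullname = entry

print(in_cache_before, len(entry), size, lines, fullname)
```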

4. checkcache():

def checkcache(filename=None):
    """Discard cache entries that are out of date.
    (This is not checked upon each call!)"""

    if filename is None:
        filenames = list(cache.keys())
    elif filename in cache:
        filenames = [filename]
    else:
        return

    for filename in filenames:
        entry = cache[filename]
        if len(entry) == 1:
            continue         # lazy cache entry
        size, mtime, lines, fullname = entry
        if mtime is None:
            continue   # no-op for files loaded via a __loader__
        try:
            stat = os.stat(fullname)
        except OSError:
            del cache[filename]
            continue
        if size != stat.st_size or mtime != stat.st_mtime:
            del cache[filename]

If filename is None, list(cache.keys()) collects every entry in the cache to be checked. If filename is given and is in the cache, only that entry is checked; if it is not in the cache, the function returns immediately. An entry of length 1 is a lazy cache entry and is skipped. Otherwise the entry is unpacked into the variables size, mtime, lines and fullname, as shown in the above code. If mtime is None, the entry was loaded through a __loader__ and is skipped as well. If os.stat() fails, or if the size or modification time on disk no longer matches the cached values, the entry is deleted from the cache with the del keyword.

5. updatecache():

def updatecache(filename, module_globals=None):
    if filename in cache:
        if len(cache[filename]) != 1:
            del cache[filename]
    if not filename or (filename.startswith('<') and filename.endswith('>')):
        return []

    fullname = filename
    try:
        stat = os.stat(fullname)
    except OSError:
        basename = filename

        if lazycache(filename, module_globals):
            try:
                data = cache[filename][0]()
            except (ImportError, OSError):
                pass
            else:
                if data is None:
                    # No luck, the PEP302 loader cannot find the source
                    # for this module.
                    return []
                cache[filename] = (
                    len(data),
                    None,
                    [line + '\n' for line in data.splitlines()],
                    fullname
                )
                return cache[filename][2]

        # Try looking through the module search path, which is only useful
        # when handling a relative filename.
        if os.path.isabs(filename):
            return []

        for dirname in sys.path:
            try:
                fullname = os.path.join(dirname, basename)
            except (TypeError, AttributeError):
                # Not sufficiently string-like to do anything useful with.
                continue
            try:
                stat = os.stat(fullname)
                break
            except OSError:
                pass
        else:
            return []
    try:
        with tokenize.open(fullname) as fp:
            lines = fp.readlines()
    except OSError:
        return []
    if lines and not lines[-1].endswith('\n'):
        lines[-1] += '\n'
    size, mtime = stat.st_size, stat.st_mtime
    cache[filename] = size, mtime, lines, fullname
    return lines

It updates the cache entry for the file and returns its list of lines. Any existing complete entry is discarded first. If filename is empty or is a pseudo-filename such as '<string>', it returns an empty list. The stat() function of the os module fetches the file's size and modification time from the operating system. If that fails, updatecache() first tries a lazy entry: when lazycache() succeeds and the loader returns source text, the text is split into lines, stored in the cache with mtime set to None, and returned. Otherwise, if the filename is an absolute path, it returns an empty list; if it is relative, each directory in sys.path is joined with the file name until a file that exists is found, and an empty list is returned if none is found. Once the file is located, tokenize.open() detects its encoding and the lines are read. If an OSError occurs while reading, it returns an empty list again. If at least one line was read but the last line does not end with '\n', a newline is appended with lines[-1] += '\n'. Finally the size, modification time, lines and full name are stored in the cache and the lines are returned.
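Three of these branches can be observed from the outside through getline() and getlines(). The sketch below uses hypothetical file names in temporary directories and a made-up pseudo-filename.

```python
import linecache
import os
import tempfile

# A file whose last line has no trailing newline.
no_newline_path = os.path.join(tempfile.mkdtemp(), "no_newline.txt")
with open(no_newline_path, "w", encoding="utf-8") as handle:
    handle.write("first\nlast")

# The missing newline is added when the lines are cached.
last_line = linecache.getline(no_newline_path, 2)

# Pseudo-filenames wrapped in angle brackets are never looked up on disk.
pseudo_lines = linecache.getlines("<demo-not-a-file>")

# An absolute path that does not exist yields an empty list, not an error.
missing_path = os.path.join(tempfile.mkdtemp(), "missing.txt")
missing_lines = linecache.getlines(missing_path)

print(repr(last_line), pseudo_lines, missing_lines)
```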

6. lazycache():

def lazycache(filename, module_globals):
   
    if filename in cache:
        if len(cache[filename]) == 1:
            return True
        else:
            return False
    if not filename or (filename.startswith('<') and filename.endswith('>')):
        return False
        
    # Try for a __loader__, if available
    if module_globals and '__loader__' in module_globals:
        name = module_globals.get('__name__')
        loader = module_globals['__loader__']
        get_source = getattr(loader, 'get_source', None)

        if name and get_source:
            get_lines = functools.partial(get_source, name)
            cache[filename] = (get_lines,)
            return True
    return False

It seeds the cache for filename with module_globals, so that the module loader is asked for the source only when getline() is later called. If there is already an entry in the cache, it is not altered: the function returns True if that entry is a lazy one and False otherwise. Empty names and pseudo-filenames such as '<string>' are rejected with False. It then checks that module_globals exists and contains a '__loader__', and if so it fetches the module name, the loader and the loader's get_source method. If both the name and get_source exist, a one-element tuple holding the deferred call is stored under filename in the cache and True is returned; in every other case False is returned. functools.partial derives, from a function that takes parameters, a new function with some of those parameters already fixed. Here it fixes the module name so that the source can be fetched later without arguments.
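The role of functools.partial in the lazy entry can be shown on its own. In the sketch below, get_source() is a hypothetical stand-in for a loader's method and 'demo_module' is a made-up module name.

```python
import functools

def get_source(module_name):
    """Hypothetical stand-in for a loader's get_source() method."""
    return "# source of " + module_name + "\n"

# Fix the module name now; the call itself is deferred.
get_lines = functools.partial(get_source, "demo_module")

# A lazy cache entry is a one-element tuple holding the deferred call.
lazy_entry = (get_lines,)

# Later, the entry is realised by calling it with no arguments.
source = lazy_entry[0]()

print(len(lazy_entry), repr(source))
```

Storing a callable instead of the source text is what lets the cache postpone the I/O until a line is actually requested.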

These are some of the functions and uses of the linecache module. See the official documentation of the module for more details.