A Pythonic Way of PDF to Image Conversion


Reading time: 15 minutes

Motivation

The picture sums up the motivation behind this article. Imagine that we have The Invincible Iron Man comic as a PDF and want to identify the pages that show Iron Man in action. Can we automate this work? Yes, through image processing. Is PDF a suitable format for that? No, image processing works best on images. Can we convert a PDF to a sequence of images? Yes, we can, and that is the purpose of this article.

Installation Steps

For accomplishing this task, we are going to utilize certain utilities and libraries.
1. Python
We are going to use a Pythonic way to achieve the conversion. Python 2.7 or 3.3+ is the primary requirement. Refer to Installation-1 to install Python properly.
2. Poppler
Poppler is a PDF rendering library based on the xpdf-3.0 code base. It forms the core of utilities such as pdf2image, pdftotext, and pdftohtml, which deal with PDFs. Refer to Installation-2 to install Poppler.
3. Pdf2image
This is the Python library that calls the pdftoppm command-line utility to convert a PDF to a sequence of PIL image objects. pdftoppm is part of Poppler and relies on it to execute the conversion. The library can be installed with the following pip command: pip install pdf2image
4. Pillow
The pdf2image library returns a list of image objects of type PIL.PpmImagePlugin.PpmImageFile or PIL.PngImagePlugin.PngImageFile for a given PDF, depending on the chosen format. These image objects can be saved in the png or jpg file formats using the Pillow library. To install it, issue the command: pip install Pillow
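
Before running the conversion, it can help to confirm that the pieces listed above are actually visible to Python. The following is a minimal sketch that uses only the standard library; the function name check_dependencies is hypothetical, and it assumes Python 3.6+ and that Poppler's pdftoppm executable is expected on the PATH.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which of the required pieces can be found on this machine."""
    return {
        # pdftoppm is the Poppler command-line tool that pdf2image calls
        "pdftoppm": shutil.which("pdftoppm") is not None,
        # find_spec returns None when a top-level package is not installed
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If pdftoppm is reported as missing even though Poppler is installed, the likely cause is that its bin folder is not on the PATH.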

Implementation


#PDF TO IMAGE CONVERSION
#IMPORT LIBRARIES
import pdf2image
from PIL import Image
import time

#DECLARE CONSTANTS
PDF_PATH = "demo.pdf"
DPI = 200
OUTPUT_FOLDER = None
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'jpg'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    #This method reads a pdf and converts it into a sequence of images
    #PDF_PATH sets the path to the PDF file
    #dpi parameter assists in adjusting the resolution of the image
    #output_folder parameter sets the path to the folder to which the PIL images can be stored (optional)
    #first_page parameter allows you to set a first page to be processed by pdftoppm 
    #last_page parameter allows you to set a last page to be processed by pdftoppm
    #fmt parameter sets the output format of the pdftoppm conversion (ppm, jpeg, png, tiff)
    #thread_count parameter allows you to set how many threads will be used for conversion
    #userpw parameter allows you to set the password needed to open a protected PDF
    #use_cropbox parameter allows you to use the crop box instead of the media box when converting
    #strict parameter allows you to catch pdftoppm syntax error with a custom type PDFSyntaxError

    start_time = time.time()
    pil_images = pdf2image.convert_from_path(
        PDF_PATH,
        dpi=DPI,
        output_folder=OUTPUT_FOLDER,
        first_page=FIRST_PAGE,
        last_page=LAST_PAGE,
        fmt=FORMAT,
        thread_count=THREAD_COUNT,
        userpw=USERPWD,
        use_cropbox=USE_CROPBOX,
        strict=STRICT,
    )
    print ("Time taken : " + str(time.time() - start_time))
    return pil_images
    
def save_images(pil_images):
    #This method saves each PIL image to disk as page_<n>.jpg, numbering pages from 1
    for index, image in enumerate(pil_images, start=1):
        image.save("page_" + str(index) + ".jpg")

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
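
The save step above always writes files with a .jpg extension into the current folder. As a hedged variation, the sketch below takes the extension and output folder as arguments and zero-pads the page number so that files sort correctly. The name save_pages and its parameters are hypothetical, not part of pdf2image or Pillow; the only assumption about each image is that it has a save(path) method, which PIL images do. It assumes Python 3.6+.

```python
from pathlib import Path


def save_pages(images, output_dir, fmt="jpg", prefix="page"):
    """Save each image as <prefix>_<n>.<fmt> inside output_dir and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Pad page numbers to the same width, e.g. page_01 ... page_12
    width = len(str(len(images)))

    paths = []
    for index, image in enumerate(images, start=1):
        path = out / f"{prefix}_{index:0{width}d}.{fmt}"
        # Pillow picks the file format from the extension of the path
        image.save(path)
        paths.append(path)
    return paths
```

With the listing above, this could be called as save_pages(pdftopil(), "output", fmt="png") in place of save_images.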

Performance

In terms of output quality, pdftoppm performs considerably better than its alternative, ImageMagick. The time taken for the conversion depends on the chosen output format.

Pages	Time Taken (ppm)	Time Taken (jpg)	Time Taken (png)
2	0.97	0.78	1.90
13	3.66	2.54	12.19
50	18.96	7.76	32.13

Note: The time taken is measured in seconds.

The table above shows that choosing the jpg format is faster than the other two formats: for the 50-page PDF, jpg is roughly 2.4 times faster than ppm and roughly 4.1 times faster than png. In addition to choosing the right format, the number of threads can be increased to parallelize and speed up the conversion.
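
A comparison like the one in the table can be reproduced with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the method used to produce the figures above. The function time_formats is hypothetical; it takes the converter as an argument so the same code works with pdf2image.convert_from_path or with any function that accepts a path plus an fmt keyword.

```python
import time


def time_formats(convert, pdf_path, formats=("ppm", "jpeg", "png"), **kwargs):
    """Run convert once per format and return {fmt: (seconds, page_count)}."""
    timings = {}
    for fmt in formats:
        start = time.perf_counter()
        # Extra keyword arguments (e.g. dpi, thread_count) are passed through
        pages = convert(pdf_path, fmt=fmt, **kwargs)
        elapsed = time.perf_counter() - start
        timings[fmt] = (elapsed, len(pages))
    return timings


# Example call, assuming pdf2image is installed and demo.pdf exists:
#   from pdf2image import convert_from_path
#   print(time_formats(convert_from_path, "demo.pdf", dpi=200, thread_count=4))
```

Running the same call with different thread_count values gives a rough idea of how much parallel conversion helps on a given machine.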

Applications

PDF to image conversion plays a role in several applications. Some of them include real-time document classification, Optical Character Recognition (OCR), and localization of tables and forms in a document.

Question

Which Python library is used to save a PIL Image object as a JPEG file?

PIL

time

pdf2image

pdf2ppm