The picture sums up the motivation behind this article. Imagine that we have The Invincible Iron Man comic as a PDF and want to identify the pages that show Iron Man in action. Can we automate this work? Yes, through image processing. Is PDF a suitable format? No, image processing works best on image files. Can we convert a PDF to a sequence of images? Yes, we can, and that is the purpose of this article.
To accomplish this task, we are going to use the following utilities and libraries.
We are going to do the conversion in Python. Python 2.7 or 3.3+ is the primary requirement. Refer to Installation-1 to install Python properly.
Poppler is a PDF rendering library based on the xpdf-3.0 code base. It forms the core of utilities such as Pdf2Image, PdfToText, and PDFToHTML, which deal with PDFs. Refer to Installation-2 to install Poppler.
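Before going further, it can help to confirm that the Poppler command-line tools are reachable. The following is a minimal sketch, assuming the Poppler binaries are expected on the system PATH. The helper name `find_poppler_tools` is made up for this illustration and is not part of any library, and `shutil.which` needs Python 3.3 or later.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdftocairo")):
    """Map each tool name to its full path, or None if it is not on PATH.

    Hypothetical helper for illustration only.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print("{}: {}".format(tool, path or "not found on PATH"))
```

If a tool is reported as not found, the Poppler installation (or the PATH setting) is the first thing to check.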
pdf2image is the Python library that calls the pdftoppm utility to convert a PDF to a sequence of PIL image objects. pdftoppm in turn uses Poppler to perform the conversion. The library can be installed with the following pip command: pip install pdf2image
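To make the relationship with pdftoppm more concrete, here is a rough sketch of the kind of command line such a wrapper might assemble. It is not the library's actual internals. The function `build_pdftoppm_command` is a hypothetical helper, and only the commonly documented pdftoppm options `-r`, `-f`, `-l`, `-jpeg`, `-png`, and `-tiff` are used.

```python
def build_pdftoppm_command(pdf_path, output_prefix, dpi=200, fmt="ppm",
                           first_page=None, last_page=None):
    """Return an argument list for a pdftoppm call (hypothetical helper)."""
    # pdftoppm writes PPM by default; the other formats need a flag.
    format_flags = {"ppm": None, "jpg": "-jpeg", "jpeg": "-jpeg",
                    "png": "-png", "tiff": "-tiff"}
    if fmt not in format_flags:
        raise ValueError("unsupported format: " + fmt)

    command = ["pdftoppm", "-r", str(dpi)]
    if format_flags[fmt]:
        command.append(format_flags[fmt])
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    command += [pdf_path, output_prefix]
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so Poppler is not required here.
    print(" ".join(build_pdftoppm_command("demo.pdf", "page", fmt="jpg",
                                          first_page=1, last_page=3)))
```

The resulting list could be passed to `subprocess.run` on a machine where Poppler is installed.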
The pdf2image library returns a list of image objects of type PIL.PpmImagePlugin.PpmImageFile or PIL.PngImagePlugin.PngImageFile for a given PDF, depending on the chosen format. These image objects can be saved in the png or jpg file formats using the Pillow library. To install it, issue the command: pip install Pillow
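As an illustration of the saving step, the sketch below writes a list of PIL image objects to disk with zero-padded file names, so that the pages sort correctly in a file browser. The names `PILLOW_FORMATS`, `page_filename`, and `save_pages` are assumptions made for this example. The only Pillow calls relied on are `Image.save` and `Image.convert`.

```python
import os

# Pillow format names for a few common file extensions.
PILLOW_FORMATS = {"jpg": "JPEG", "jpeg": "JPEG", "png": "PNG",
                  "ppm": "PPM", "tif": "TIFF", "tiff": "TIFF"}


def page_filename(index, ext, prefix="page_", width=3):
    """Build a zero-padded file name such as page_007.jpg."""
    return "{}{:0{}d}.{}".format(prefix, index, width, ext)


def save_pages(images, out_dir, ext="png"):
    """Save PIL images into out_dir and return the list of written paths."""
    fmt = PILLOW_FORMATS[ext.lower()]
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, image in enumerate(images, start=1):
        # JPEG has no alpha channel, so modes such as RGBA or P are
        # converted to RGB before saving.
        if fmt == "JPEG" and image.mode not in ("RGB", "L", "CMYK"):
            image = image.convert("RGB")
        path = os.path.join(out_dir, page_filename(index, ext))
        image.save(path, format=fmt)
        paths.append(path)
    return paths
```

Passing the format explicitly keeps the file extension and the encoded format in step with each other.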
#PDF TO IMAGE CONVERSION

#IMPORT LIBRARIES
import pdf2image
from PIL import Image
import time

#DECLARE CONSTANTS
PDF_PATH = "demo.pdf"
DPI = 200
OUTPUT_FOLDER = None
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'jpg'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    #This method reads a pdf and converts it into a sequence of images
    #PDF_PATH sets the path to the PDF file
    #dpi parameter assists in adjusting the resolution of the image
    #output_folder parameter sets the path to the folder to which the PIL images can be stored (optional)
    #first_page parameter allows you to set a first page to be processed by pdftoppm
    #last_page parameter allows you to set a last page to be processed by pdftoppm
    #fmt parameter allows to set the format of pdftoppm conversion (PpmImageFile, TIFF)
    #thread_count parameter allows you to set how many thread will be used for conversion.
    #userpw parameter allows you to set a password to unlock the converted PDF
    #use_cropbox parameter allows you to use the crop box instead of the media box when converting
    #strict parameter allows you to catch pdftoppm syntax error with a custom type PDFSyntaxError
    start_time = time.time()
    pil_images = pdf2image.convert_from_path(
        PDF_PATH,
        dpi=DPI,
        output_folder=OUTPUT_FOLDER,
        first_page=FIRST_PAGE,
        last_page=LAST_PAGE,
        fmt=FORMAT,
        thread_count=THREAD_COUNT,
        userpw=USERPWD,
        use_cropbox=USE_CROPBOX,
        strict=STRICT)
    print ("Time taken : " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    #This method helps in converting the images in PIL Image file format to the required image format
    index = 1
    for image in pil_images:
        image.save("page_" + str(index) + ".jpg")
        index += 1

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
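The script above keeps every converted page in memory at once, which may become a problem for long documents at high DPI. One possible workaround, sketched here under the assumption that the page count is already known, is to convert the PDF in page ranges using the same first_page and last_page parameters. The helpers `page_chunks` and `convert_in_chunks` are hypothetical names introduced for this example.

```python
def page_chunks(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, total_pages, chunk_size=10, dpi=200, fmt="jpg"):
    """Yield PIL images chunk by chunk (requires pdf2image and Poppler)."""
    import pdf2image  # imported lazily so page_chunks works without it

    for first, last in page_chunks(total_pages, chunk_size):
        images = pdf2image.convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last, fmt=fmt)
        for image in images:
            yield image
```

Each image can be saved as soon as it is yielded, so only one chunk of pages needs to be held in memory at a time.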
In terms of output quality, pdftoppm performs much better than its alternative, ImageMagick. The time taken by the conversion depends on the chosen format.
| Pages | Time Taken (ppm) | Time Taken (jpg) | Time Taken (png) |
Note: The time taken is measured in seconds.
The table above shows that choosing the jpg format is faster than the other two formats. In addition to picking the right format, the number of threads can be increased to parallelize and speed up the conversion.
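Timings like these depend on the document and the machine, so it may be worth repeating the measurement locally. The sketch below is one way to do that. `time_conversions` is a hypothetical helper that accepts any callable taking `fmt` and `thread_count` keyword arguments, which `pdf2image.convert_from_path` does. It uses `time.perf_counter`, so it assumes Python 3.3 or later.

```python
import time


def time_conversions(convert, pdf_path, formats=("ppm", "jpg", "png"),
                     thread_counts=(1,)):
    """Time convert(pdf_path, fmt=..., thread_count=...) for each combination.

    Returns a dict mapping (fmt, thread_count) to elapsed seconds.
    """
    timings = {}
    for fmt in formats:
        for threads in thread_counts:
            start = time.perf_counter()
            convert(pdf_path, fmt=fmt, thread_count=threads)
            timings[(fmt, threads)] = time.perf_counter() - start
    return timings


# Example usage (requires pdf2image and Poppler to be installed):
#   import pdf2image
#   results = time_conversions(pdf2image.convert_from_path, "demo.pdf",
#                              thread_counts=(1, 4))
#   for (fmt, threads), seconds in sorted(results.items()):
#       print("{} with {} thread(s): {:.2f} s".format(fmt, threads, seconds))
```

A single run per combination is noisy, so averaging over a few runs gives a fairer comparison.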
PDF to image conversion plays a role in several applications. Some of them include real-time document classification, Optical Character Recognition (OCR), and localization of tables and forms in a document.