A Pythonic Way of PDF to Image Conversion
Reading time: 15 minutes
Motivation
Imagine that we have The Invincible Iron Man comic as a PDF and want to identify the pages that show Iron Man in action. Can we automate this work? Yes, through image processing. Is PDF a suitable format for that? No, image processing works best on images. Can we convert a PDF to a sequence of images? Yes, we can, and that is the purpose of this article.
Installation Steps
To accomplish this task, we need the following utilities and libraries.
1. Python
We are going to use a Pythonic approach for the conversion. Python 2.7 or 3.3+ is the primary requirement. Refer to Installation-1 to install Python properly.
2. Poppler
Poppler is a PDF rendering library based on the xpdf-3.0 code base. It forms the core of utilities such as Pdf2Image, PdfToText, and PDFToHTML, which deal with PDFs. Refer to Installation-2 to install Poppler.
3. Pdf2image
This is the Python library that calls the pdftoppm command-line utility to convert a PDF to a sequence of PIL image objects. pdftoppm is part of Poppler, which performs the actual rendering. Install the library with the following pip command: pip install pdf2image
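Since pdf2image depends on the Poppler command-line tools being reachable, a quick check before running a conversion can save some debugging. The following is a minimal sketch using only the standard library; the helper name `find_poppler_tools` is hypothetical, and the check assumes the tools are expected on the system PATH.

```python
import shutil

def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each Poppler command-line tool name to its path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

if __name__ == "__main__":
    # Print where each tool was found, if at all.
    for tool, path in find_poppler_tools().items():
        print(tool, "->", path or "not found on PATH")
```

If a tool is reported as not found, revisiting the Poppler installation step is the likely fix.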
4. Pillow
Depending on the chosen format, the Pdf2image library returns a list of image objects of type PIL.PpmImagePlugin.PpmImageFile or PIL.PngImagePlugin.PngImageFile for a given PDF. These image objects can be saved in png or jpg file formats using the Pillow library. To install it, run the command: pip install Pillow
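As a rough illustration of the saving step, the sketch below wraps Pillow's `Image.save` so that the target format is chosen from the file extension. The names `save_page` and `PILLOW_FORMATS` are hypothetical helpers, not part of Pillow or pdf2image. The function expects any Pillow image object, such as one returned by pdf2image.

```python
import os

# Pillow format names for the extensions used in this article.
PILLOW_FORMATS = {"jpg": "JPEG", "jpeg": "JPEG", "png": "PNG"}

def save_page(image, stem, extension="jpg", folder="."):
    """Save one Pillow image as <folder>/<stem>.<extension> and return the path."""
    extension = extension.lower()
    pillow_format = PILLOW_FORMATS[extension]
    # JPEG cannot store an alpha channel or a palette, so convert to RGB first.
    if pillow_format == "JPEG" and image.mode not in ("RGB", "L"):
        image = image.convert("RGB")
    path = os.path.join(folder, stem + "." + extension)
    image.save(path, format=pillow_format)
    return path
```

For example, `save_page(pil_images[0], "page_1", "png")` would write the first converted page to `page_1.png` in the current folder.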
Implementation
#PDF TO IMAGE CONVERSION
#IMPORT LIBRARIES
import pdf2image
from PIL import Image
import time

#DECLARE CONSTANTS
PDF_PATH = "demo.pdf"
DPI = 200
OUTPUT_FOLDER = None
FIRST_PAGE = None
LAST_PAGE = None
FORMAT = 'jpg'
THREAD_COUNT = 1
USERPWD = None
USE_CROPBOX = False
STRICT = False

def pdftopil():
    #This method reads a pdf and converts it into a sequence of images
    #PDF_PATH sets the path to the PDF file
    #dpi parameter assists in adjusting the resolution of the image
    #output_folder parameter sets the path to the folder to which the PIL images can be stored (optional)
    #first_page parameter allows you to set the first page to be processed by pdftoppm
    #last_page parameter allows you to set the last page to be processed by pdftoppm
    #fmt parameter allows you to set the output format of the pdftoppm conversion (ppm, jpeg, png, tiff)
    #thread_count parameter allows you to set how many threads will be used for conversion
    #userpw parameter allows you to set a password to unlock the PDF being converted
    #use_cropbox parameter allows you to use the crop box instead of the media box when converting
    #strict parameter allows you to catch pdftoppm syntax errors with a custom type PDFSyntaxError
    start_time = time.time()
    pil_images = pdf2image.convert_from_path(PDF_PATH, dpi=DPI, output_folder=OUTPUT_FOLDER, first_page=FIRST_PAGE, last_page=LAST_PAGE, fmt=FORMAT, thread_count=THREAD_COUNT, userpw=USERPWD, use_cropbox=USE_CROPBOX, strict=STRICT)
    print("Time taken : " + str(time.time() - start_time))
    return pil_images

def save_images(pil_images):
    #This method saves each PIL image to disk as a jpg file named page_<number>.jpg
    index = 1
    for image in pil_images:
        image.save("page_" + str(index) + ".jpg")
        index += 1

if __name__ == "__main__":
    pil_images = pdftopil()
    save_images(pil_images)
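The script above keeps every page in memory before saving. For long PDFs, one possible variation is to convert a few pages at a time using the first_page and last_page parameters. The sketch below is an illustration under assumptions, not the article's method: `page_chunks` and `convert_in_chunks` are hypothetical helper names, the chunk size of 10 is arbitrary, and the page count is read with pdf2image's `pdfinfo_from_path`.

```python
def page_chunks(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive, covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)

def convert_in_chunks(pdf_path, chunk_size=10, dpi=200):
    """Convert a PDF a few pages at a time and save each page as a jpg file."""
    import pdf2image  # imported here so page_chunks stays usable on its own

    # pdfinfo_from_path returns a dictionary of PDF metadata, including "Pages".
    total_pages = int(pdf2image.pdfinfo_from_path(pdf_path)["Pages"])
    for first, last in page_chunks(total_pages, chunk_size):
        images = pdf2image.convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last, fmt="jpg")
        # Number the files by their page position in the whole document.
        for offset, image in enumerate(images):
            image.save("page_%03d.jpg" % (first + offset))
```

With this layout, only one chunk of pages is held in memory at a time, at the cost of starting pdftoppm once per chunk.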
Performance
pdftoppm produces better-quality output than its alternative, ImageMagick. The conversion time depends on the chosen output format.
| Pages | Time Taken (ppm) | Time Taken (jpg) | Time Taken (png) |
|---|---|---|---|
| 2 | 0.97 | 0.78 | 1.90 |
| 13 | 3.66 | 2.54 | 12.19 |
| 50 | 18.96 | 7.76 | 32.13 |
Note: The time taken is measured in seconds.
The table shows that jpg was the fastest of the three formats for every document tested. Beyond choosing the right format, the number of threads (the thread_count parameter) can be increased to parallelize and speed up the conversion.
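A comparison like the one in the table can be reproduced with a small timing helper. This is a sketch only: `time_call` and `time_formats` are hypothetical names, and the numbers obtained will depend on the machine, the PDF, and the DPI, so they should not be expected to match the table.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def time_formats(pdf_path, formats=("ppm", "jpg", "png"), thread_count=1, dpi=200):
    """Return {format: seconds} for converting pdf_path once per format."""
    import pdf2image  # imported lazily; requires pdf2image and Poppler

    timings = {}
    for fmt in formats:
        _, seconds = time_call(
            pdf2image.convert_from_path,
            pdf_path, dpi=dpi, fmt=fmt, thread_count=thread_count)
        timings[fmt] = seconds
    return timings
```

Calling `time_formats("demo.pdf", thread_count=1)` and then again with a higher thread_count would give a rough sense of how much parallel conversion helps for a given document.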
Applications
PDF to image conversion plays a role in several applications, including real-time document classification, Optical Character Recognition (OCR), and localization of tables and forms in a document.
Question
Which python library is used to save a PIL Image object as a JPEG file?