An Intro to PyPDF2
We take a look at the PyPDF2 package, a powerful library that lets you do some pretty interesting things with your PDF documents. Read on for the details!
Join the DZone community and get the full member experience.
Join For Freethe pypdf2 package is a pure-python pdf library that you can use for splitting, merging, cropping, and transforming pages in your pdfs. according to the pypdf2 website, you can also use pypdf2 to add data, viewing options, and passwords to the pdfs, too. finally, you can use pypdf2 to extract text and metadata from your pdfs.
pypdf2 is actually a fork of the original pypdf, which was written by mathiew fenniak and released in 2005. however, the original pypdf’s last release was in 2014. a company called phaseit, inc. spoke with mathieu and ended up sponsoring pypdf2 as a fork of pypdf.
at the time of writing, the pypdf2 package hasn’t had a release since 2016. however, it is still a solid and useful package that is worth your time to learn.
the following lists what we will be learning in this article:
- extracting metadata
- splitting documents
- merging 2 pdf files into 1
- rotating pages
- overlaying/watermarking pages
- encrypting/decrypting
let’s start by learning how to install pypdf2!
installation
pypdf2 is a pure python package, so you can install it using pip (assuming pip is in your system’s path):
python -m pip install pypdf2
as usual, you should install third-party python packages to a python virtual environment to make sure that it works the way you want it to.
extracting metadata from pdfs
you can use pypdf2 to extract a fair amount of useful data from any pdf. for example, you can learn the author of the document, its title and subject, and how many pages there are. let’s find out how by downloading the sample of this book from leanpub . the sample i downloaded was called reportlab-sample.pdf . i will include this pdf for you to use in the github source code as well.
here’s the code:
# get_doc_info.py
from pypdf2 import pdffilereader
def get_info(path):
with open(path, 'rb') as f:
pdf = pdffilereader(f)
info = pdf.getdocumentinfo()
number_of_pages = pdf.getnumpages()
print(info)
author = info.author
creator = info.creator
producer = info.producer
subject = info.subject
title = info.title
if __name__ == '__main__':
path = 'reportlab-sample.pdf'
get_info(path)
here, we import the pdffilereader class from pypdf2 . this class gives us the ability to read a pdf and extract data from it using various accessor methods. the first thing we do is create our own get_info function that accepts a pdf file path as its only argument. then, we open the file in read-only binary mode. next, we pass that file handler into pdffilereader and create an instance of it.
now, we can extract some information from the pdf by using the getdocumentinfo method. this will return an instance of pypdf2.pdf.documentinformation , which has the following useful attributes, among others:
- author
- creator
- producer
- subject
- title
if you print out the documentinformation object, this is what you will see:
{'/author': 'michael driscoll',
'/creationdate': "d:20180331023901-00'00'",
'/creator': 'latex with hyperref package',
'/producer': 'xetex 0.99998',
'/title': 'reportlab - pdf processing with python'}
we can also get the number of pages in the pdf by calling the getnumpages method.
extracting text from pdfs
pypdf2 has limited support for extracting text from pdfs. it doesn’t have built-in support for extracting images, unfortunately. i have seen some recipes on stackoverflow that use pypdf2 to extract images, but the code examples seem to be pretty hit-or-miss.
let’s try to extract the text from the first page of the pdf that we downloaded in the previous section:
# extracting_text.py
from pypdf2 import pdffilereader
def text_extractor(path):
with open(path, 'rb') as f:
pdf = pdffilereader(f)
# get the first page
page = pdf.getpage(1)
print(page)
print('page type: {}'.format(str(type(page))))
text = page.extracttext()
print(text)
if __name__ == '__main__':
path = 'reportlab-sample.pdf'
text_extractor(path)
you will note that this code starts out in much the same way as our previous example. we still need to create an instance of pdffilereader . but this time, we grab a page using the getpage method. pypdf2 is zero-based, much like most things in python, so when you pass it a one, it actually grabs the second page. the first page, in this case, is just an image, so it wouldn’t have any text.
interestingly, if you run this example, you will find that it doesn’t return any text. instead, all i got was a series of line break characters. unfortunately, pypdf2 has pretty limited support for extracting text. even if it is able to extract text, it may not be in the order you expect and the spacing may be different, as well.
to get this example code to work, you will need to try running it against a different pdf. i found one on the united states internal revenue service website .
this is a w9 form for people who are self-employed or contract employees. it can be used in other situations too. anyway, i downloaded it as w9.pdf . if you use that pdf instead of the sample one, it will happily extract some of the text from page 2. i won’t reproduce the output here as it is kind of lengthy though.
splitting pdfs
the pypdf2 package gives you the ability to split up a single pdf into multiple ones. you just need to tell it how many pages you want. for this example, we will open up the w9 pdf from the previous example and loop over all six of its pages. we will split off each page and turn it into its own standalone pdf.
let’s find out how:
# pdf_splitter.py
import os
from pypdf2 import pdffilereader, pdffilewriter
def pdf_splitter(path):
fname = os.path.splitext(os.path.basename(path))[0]
pdf = pdffilereader(path)
for page in range(pdf.getnumpages()):
pdf_writer = pdffilewriter()
pdf_writer.addpage(pdf.getpage(page))
output_filename = '{}_page_{}.pdf'.format(
fname, page+1)
with open(output_filename, 'wb') as out:
pdf_writer.write(out)
print('created: {}'.format(output_filename))
if __name__ == '__main__':
path = 'w9.pdf'
pdf_splitter(path)
for this example, we need to import both the pdffilereader and the pdffilewriter . then we create a fun little function called pdf_splitter . it accepts the path of the input pdf. the first line of this function will grab the name of the input file, minus the extension. next we open the pdf up and create a reader object. then we loop over all the pages using the reader object’s getnumpages method.
inside of the for loop, we create an instance of pdffilewriter . we then add a page to our writer object using its addpage method. this method accepts a page object, so to get the page object, we call the reader object’s getpage method. now, we had added one page to our writer object. the next step is to create a unique file name, which we do by using the original file name plus the word “page” plus the page number + 1. we add the one because pypdf2’s page numbers are zero-based, so page 0 is actually page 1.
finally, we open the new file name in write-binary mode and use the pdf writer object’s write method to write the object’s contents to disk.
merging multiple pdfs together
now that we have a bunch of pdfs, let’s learn how we might take them and merge them back together. one useful use case for doing this is for businesses to merge their dailies into a single pdf. i have needed to merge pdfs for work and for fun. one project that sticks out in my mind is scanning documents in. depending on the scanner you have, you might end up scanning a document into multiple pdfs, so being able to join them together again can be wonderful.
when the original pypdf came out, the only way to get it to merge multiple pdfs together was like this:
# pdf_merger.py
import glob
from pypdf2 import pdffilewriter, pdffilereader
def merger(output_path, input_paths):
pdf_writer = pdffilewriter()
for path in input_paths:
pdf_reader = pdffilereader(path)
for page in range(pdf_reader.getnumpages()):
pdf_writer.addpage(pdf_reader.getpage(page))
with open(output_path, 'wb') as fh:
pdf_writer.write(fh)
if __name__ == '__main__':
paths = glob.glob('w9_*.pdf')
paths.sort()
merger('pdf_merger.pdf', paths)
here, we create a pdffilewriter object and several pdffilereader objects. for each pdf path, we create a pdffilereader object and then loop over its pages, adding each and every page to our writer object. then we write out the writer object’s contents to disk.
pypdf2 made this a bit simpler by creating a pdffilemerger class:
# pdf_merger2.py
import glob
from pypdf2 import pdffilemerger
def merger(output_path, input_paths):
pdf_merger = pdffilemerger()
file_handles = []
for path in input_paths:
pdf_merger.append(path)
with open(output_path, 'wb') as fileobj:
pdf_merger.write(fileobj)
if __name__ == '__main__':
paths = glob.glob('fw9_*.pdf')
paths.sort()
merger('pdf_merger2.pdf', paths)
here, we just need to create the pdffilemerger object and then loop through the pdf paths, appending them to our merging object. pypdf2 will automatically append the entire document so you don’t need to loop through all the pages of each document yourself. then, we just write it out to disk.
the pdffilemerger class also has a merge method that you can use. its code definition looks like this:
def merge(self, position, fileobj, bookmark=none, pages=none,
import_bookmarks=true):
"""
merges the pages from the given file into the output file at the
specified page number.
:param int position: the *page number* to insert this file. file will
be inserted after the given number.
:param fileobj: a file object or an object that supports the standard
read and seek methods similar to a file object. could also be a
string representing a path to a pdf file.
:param str bookmark: optionally, you may specify a bookmark to be
applied at the beginning of the included file by supplying the
text of the bookmark.
:param pages: can be a :ref:`page range <page-range>` or a
``(start, stop[, step])`` tuple
to merge only the specified range of pages from the source
document into the output document.
:param bool import_bookmarks: you may prevent the source
document's bookmarks from being imported by specifying this as
``false``.
"""
basically, the merge method allows you to tell pypdf where to merge a page by page number. so if you have created a merging object with three pages in it, you can tell the merging object to merge the next document in at a specific position. this allows the developer to do some pretty complex merging operations. give it a try and see what you can do!
rotating pages
pypdf2 gives you the ability to rotate pages. however, you must rotate in 90 degrees increments. you can rotate the pdf pages either clockwise or counterclockwise. here’s a simple example:
# pdf_rotator.py
from pypdf2 import pdffilewriter, pdffilereader
def rotator(path):
pdf_writer = pdffilewriter()
pdf_reader = pdffilereader(path)
page1 = pdf_reader.getpage(0).rotateclockwise(90)
pdf_writer.addpage(page1)
page2 = pdf_reader.getpage(1).rotatecounterclockwise(90)
pdf_writer.addpage(page2)
pdf_writer.addpage(pdf_reader.getpage(2))
with open('pdf_rotator.pdf', 'wb') as fh:
pdf_writer.write(fh)
if __name__ == '__main__':
rotator('reportlab-sample.pdf')
here we create our pdf reader and writer objects as before. then we get the first and second pages of the pdf that we passed in. we then rotate the first page 90 degrees clockwise or to the right. then we rotate the second page 90 degrees counter-clockwise. finally, we add the third page in its normal orientation to the writer object and write out our new three-page pdf file.
if you open the pdf, you will find that the first two pages are now rotated in opposite directions of each other with the third page in its normal orientation.
overlaying/watermarking pages
pypdf2 also supports merging pdf pages together or overlaying pages on top of each other. this can be useful if you want to watermark the pages in your pdf. for example, one of the ebook distributors i use will “watermark” the pdf versions of my book with the buyer’s email address. another use case that i have seen is to add printer control marks to the edge of the page to tell the printer when a certain document has reached its end.
for this example we will take one of the logos i use for my blog, “the mouse vs. the python,” and overlay it on top of the w9 form from earlier:
# watermarker.py
from pypdf2 import pdffilewriter, pdffilereader
def watermark(input_pdf, output_pdf, watermark_pdf):
watermark = pdffilereader(watermark_pdf)
watermark_page = watermark.getpage(0)
pdf = pdffilereader(input_pdf)
pdf_writer = pdffilewriter()
for page in range(pdf.getnumpages()):
pdf_page = pdf.getpage(page)
pdf_page.mergepage(watermark_page)
pdf_writer.addpage(pdf_page)
with open(output_pdf, 'wb') as fh:
pdf_writer.write(fh)
if __name__ == '__main__':
watermark(input_pdf='w9.pdf',
output_pdf='watermarked_w9.pdf',
watermark_pdf='watermark.pdf')
the first thing we do here is extract the watermark page from the pdf. then we open the pdf that we want to apply the watermark to. we use a for loop to iterate over each of its pages and call the page object’s mergepage method to apply the watermark. next we add that watermarked page to our pdf writer object. once the loop finishes, we write our new watermarked version out to disk.
here’s what the first page looked like:
that was pretty easy.
pdf encryption
the pypdf2 package also supports adding a password and encryption to your existing pdfs. as you may recall from chapter 10, pdfs support a user password and an owner password. the user password only allows the user to open and read a pdf, but may have some restrictions applied to the pdf that could prevent the user from printing, for example. as far as i can tell, you can’t actually apply any restrictions using pypdf2 or it’s just not documented well.
here’s how to add a password to a pdf with pypdf2:
# pdf_encryption.py
from pypdf2 import pdffilewriter, pdffilereader
def encrypt(input_pdf, output_pdf, password):
pdf_writer = pdffilewriter()
pdf_reader = pdffilereader(input_pdf)
for page in range(pdf_reader.getnumpages()):
pdf_writer.addpage(pdf_reader.getpage(page))
pdf_writer.encrypt(user_pwd=password, owner_pwd=none,
use_128bit=true)
with open(output_pdf, 'wb') as fh:
pdf_writer.write(fh)
if __name__ == '__main__':
encrypt(input_pdf='reportlab-sample.pdf',
output_pdf='encrypted.pdf',
password='blowfish')
all we did here was create a set of pdf reader and write objects and read all the pages with the reader. then we added those pages out to the specified writer object and added the specified password. if you only set the user password, then the owner password is set to the user password automatically. whenever you add a password, 128-bit encryption is applied by default. if you set that argument to false, then the pdf will be encrypted at 40-bit encryption instead.
wrapping up
we covered a lot of useful information in this article. you learned how to extract metadata and text from your pdfs. we found out how to split and merge pdfs. you also learned how to rotate pages in a pdf and apply watermarks. finally, we discovered that pypdf2 can add encryption and passwords to our pdfs.
related reading
- pypdf2 documentation
- pypdf2 github page
- automate the boring stuff — chapter 13: working with pdf and word documents
Published at DZone with permission of Mike Driscoll, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments