Creating and Manipulating PDFs with pdfrw

If you need to work with PDFs in Python, then you need to read this post.

By Mike Driscoll · Jan. 11, 19 · Tutorial

Patrick Maupin created a package he called pdfrw and released it back in 2012. The pdfrw package is a pure-Python library that you can use to read and write PDF files. At the time of writing, pdfrw was at version 0.4. With that version, it supports subsetting, merging, rotating and modifying data in PDFs. The pdfrw package has been used by the rst2pdf package (see chapter 18) since 2010 because pdfrw can “faithfully reproduce vector formats without rasterization.” You can also use pdfrw in conjunction with ReportLab to re-use portions of existing PDFs in new PDFs that you create with ReportLab.

In this article, we will learn how to do the following:

  • Extract certain types of information from a PDF
  • Split PDFs
  • Merge / concatenate PDFs
  • Rotate pages
  • Create overlays or watermarks
  • Scale pages
  • Combine the use of pdfrw and ReportLab

Let’s get started!

Installation

As you might expect, you can install pdfrw using pip. Let’s get that done so we can start using pdfrw:

python -m pip install pdfrw

Now that we have pdfrw installed, let’s learn how to extract some information from our PDFs.

Extracting Information From PDF

The pdfrw package does not extract data in quite the same way that PyPDF2 does. If you have used PyPDF2 in the past, then you may recall that PyPDF2 lets you extract a document information object that you can use to pull out information like author, title, etc. While pdfrw does let you get the Info object, it displays it in a less friendly way. Let’s take a look:

Note: I am using the standard W9 form from the IRS for this example.

# reader.py

from pdfrw import PdfReader

def get_pdf_info(path):
    pdf = PdfReader(path)

    # Top-level trailer keys, the document Info dictionary, and the Root keys
    print(pdf.keys())
    print(pdf.Info)
    print(pdf.Root.keys())
    print('PDF has {} pages'.format(len(pdf.pages)))

if __name__ == '__main__':
    get_pdf_info('w9.pdf')

Here we import pdfrw’s PdfReader class and instantiate it by passing in the path to the PDF file that we want to read. Then we extract the PDF object’s keys, the information object, and the Root. We also grab how many pages are in the document. The result of running this code is below:

['/ID', '/Root', '/Info', '/Size']
{'/Author': '(SE:W:CAR:MP)',
 '/CreationDate': "(D:20171109144422-05'00')",
 '/Creator': '(Adobe LiveCycle Designer ES 9.0)',
 '/Keywords': '(Fillable)',
 '/ModDate': "(D:20171109144521-05'00')",
 '/Producer': '(Adobe LiveCycle Designer ES 9.0)',
 '/SPDF': '(1112)',
 '/Subject': '(Request for Taxpayer Identification Number and Certification)',
 '/Title': '(Form W-9 \\(Rev. November 2017\\))'}
['/Pages', '/Perms', '/MarkInfo', '/Extensions', '/AcroForm', '/Metadata', '/Type', '/Names', '/StructTreeRoot']
PDF has 6 pages

If you run this against the reportlab-sample.pdf file that I also included in the source code for this book, you will find that the author name that is returned ends up being '' instead of “Michael Driscoll.” I haven’t figured out exactly why that is, but I am assuming that PyPDF2 does some extra data massaging on the PDF trailer information that pdfrw currently does not do.
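The raw values in the Info output keep their PDF string syntax: parentheses around the text, backslash-escaped parentheses inside it, and a D: prefix on dates. As a minimal sketch of how you might tidy them up, the following uses only the standard library. The helper names strip_pdf_string and parse_pdf_date are my own and are not part of pdfrw. The sketch assumes literal strings shaped like the ones shown above and treats a date without an offset as UTC, which is a simplification.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches full dates such as D:20171109144422-05'00' (the offset is optional).
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'?(\d{2})?'?)?"
)


def strip_pdf_string(raw):
    """Drop the outer parentheses and unescape \\( \\) and \\\\ (hypothetical helper)."""
    text = str(raw)
    if text.startswith('(') and text.endswith(')'):
        text = text[1:-1]
    return re.sub(r'\\([()\\])', r'\1', text)


def parse_pdf_date(raw):
    """Turn a PDF date string into an aware datetime (hypothetical helper)."""
    match = _PDF_DATE.match(strip_pdf_string(raw))
    if match is None:
        raise ValueError('not a full PDF date: {!r}'.format(raw))
    *fields, sign, tz_hours, tz_minutes = match.groups()
    # Assumption: a missing offset is treated as UTC.
    offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
    if sign == '-':
        offset = -offset
    return datetime(*(int(f) for f in fields), tzinfo=timezone(offset))


if __name__ == '__main__':
    print(strip_pdf_string('(Form W-9 \\(Rev. November 2017\\))'))
    print(parse_pdf_date("(D:20171109144422-05'00')").isoformat())
```

You could pass values such as pdf.Info.Title or pdf.Info.CreationDate from the reader example through these helpers before displaying them.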

Splitting

You can also use pdfrw to split a PDF up. For example, maybe you want to take the cover off of a book for some reason or you just want to extract the chapters of a book into multiple PDFs instead of storing them in one file. This is fairly trivial to do with pdfrw. For this example, we will use my ReportLab book’s sample chapter PDF that you can download on Leanpub.

# splitter.py

from pdfrw import PdfReader, PdfWriter

def split(path, number_of_pages, output):
    pdf_obj = PdfReader(path)
    total_pages = len(pdf_obj.pages)

    writer = PdfWriter()

    for page in range(number_of_pages):
        # Page indexes are zero-based, so the last valid index is total_pages - 1
        if page < total_pages:
            writer.addpage(pdf_obj.pages[page])

    writer.write(output)

if __name__ == '__main__':
    split('reportlab-sample.pdf', 10, 'subset.pdf')

Here we create a function called split that takes an input PDF file path, the number of pages that you want to extract and the output path. Then we open up the file using pdfrw’s PdfReader class and grab the total number of pages from the input PDF. Then we create a PdfWriter object and loop over the range of pages that we passed in. In each iteration, we check that the page index exists in the input PDF and, if it does, add that page to our writer object. Finally we write the extracted pages to disk.
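If you want to pull out a chapter from the middle of a book rather than the first few pages, the same idea works with a start offset. Here is a minimal sketch under that assumption. The names page_indexes and split_range are hypothetical, and pdfrw is imported inside split_range so that the index helper can be tried on its own.

```python
def page_indexes(total_pages, start, count):
    """Return the zero-based page indexes to copy, clipped to the document."""
    if start < 0 or count < 0:
        raise ValueError('start and count must be non-negative')
    return list(range(start, min(start + count, total_pages)))


def split_range(path, start, count, output):
    """Copy `count` pages beginning at zero-based index `start` into `output`."""
    # Imported here so page_indexes() can be used without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    reader = PdfReader(path)
    writer = PdfWriter()
    for index in page_indexes(len(reader.pages), start, count):
        writer.addpage(reader.pages[index])
    writer.write(output)


if __name__ == '__main__':
    # A 6-page document asked for 10 pages starting at index 4 yields [4, 5]
    print(page_indexes(6, 4, 10))
```

Clipping the range up front means a request that runs past the end of the document simply produces a shorter output file instead of raising an IndexError.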

Merging/Concatenating

The pdfrw package makes merging multiple PDFs together very easy. Let’s write up a simple example that demonstrates how to do it:

# concatenator.py

from pdfrw import PdfReader, PdfWriter, IndirectPdfDict

def concatenate(paths, output):
    writer = PdfWriter()

    for path in paths:
        reader = PdfReader(path)
        writer.addpages(reader.pages)

    # Document information that ends up in the output PDF's trailer
    writer.trailer.Info = IndirectPdfDict(
        Title='Combined PDF Title',
        Author='Michael Driscoll',
        Subject='PDF Combinations',
        Creator='The Concatenator'
    )

    writer.write(output)

if __name__ == '__main__':
    paths = ['reportlab-sample.pdf', 'w9.pdf']
    concatenate(paths, 'concatenate.pdf')

In this example, we create a function called concatenate that accepts a list of paths to PDFs that we want to concatenate together and the output path. Then we iterate over those paths, open each file and add all of its pages to the writer object via the writer’s addpages method. Just for fun, we also import IndirectPdfDict, which allows us to add some trailer information to our PDF. In this case, we add the title, author, subject and creator script information to the PDF. Then we write out the concatenated PDF to disk.

Rotating

The pdfrw package also supports rotating the pages of a PDF. So if you happen to have a PDF that was saved in a weird way or an intern that scanned in some documents upside down, then you can use pdfrw (or PyPDF2) to fix the PDFs. Note that in pdfrw you must rotate clockwise in multiples of 90 degrees.

For this example, I created a function that will extract every page with an odd zero-based index from the input PDF (that is, the second page, the fourth page and so on) and rotate them 90 degrees:

# rotator.py

from pdfrw import PdfReader, PdfWriter

def rotate_odd(path, output):
    reader = PdfReader(path)
    writer = PdfWriter()
    pages = reader.pages

    for page in range(len(pages)):
        # Indexes are zero-based, so this picks the 2nd, 4th, 6th... pages
        if page % 2:
            pages[page].Rotate = 90
            writer.addpage(pages[page])

    writer.write(output)

if __name__ == '__main__':
    rotate_odd('reportlab-sample.pdf', 'rotate_odd.pdf')

Here we just open up the target PDF and create a writer object. Then we grab all the pages and iterate over them. If the page’s index is odd, we rotate it and then add that page to our writer object. This code ran pretty fast on my machine and the output is what you would expect.
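The rotator script overwrites whatever rotation a page already had. If you would rather turn pages relative to their current orientation, one possible approach is sketched below. The function names next_rotation and rotate_pages are hypothetical. The use of page.inheritable.Rotate follows my reading of the rotate example in the pdfrw repository, so treat it as an assumption and check it against your installed version.

```python
def next_rotation(current, delta):
    """Return the new rotation after turning `delta` degrees clockwise.

    `current` may be None (never rotated), an int, or a numeric string.
    """
    if delta % 90:
        raise ValueError('rotation must be a multiple of 90 degrees')
    return (int(current or 0) + delta) % 360


def rotate_pages(path, output, delta=90):
    """Rotate every page of `path` by `delta` degrees and save to `output`."""
    # Imported here so next_rotation() can be used without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    reader = PdfReader(path)
    for page in reader.pages:
        # Assumption: inheritable.Rotate also picks up a rotation set on a
        # parent node and is None when no rotation has been set at all.
        page.Rotate = next_rotation(page.inheritable.Rotate, delta)

    PdfWriter().write(output, reader)


if __name__ == '__main__':
    # A page already at 270 degrees, turned another 180, ends up at 90
    print(next_rotation('270', 180))
```

Keeping the arithmetic in its own function makes the multiple-of-90 rule easy to check before any file is touched.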

Overlaying/Watermarking Pages

You can use pdfrw to watermark your PDF with some kind of information. For example, you might want to watermark a PDF with your buyer’s email address or with your logo. You can also overlay one PDF on top of another PDF. We will actually use the overlay technique for filling in PDF forms in chapter 17.

Let’s create a simple watermarking script to demonstrate how you might use pdfrw to overlay one PDF on top of another.

# watermarker.py

from pdfrw import PdfReader, PdfWriter, PageMerge

def watermarker(path, watermark, output):
    base_pdf = PdfReader(path)
    watermark_pdf = PdfReader(watermark)
    mark = watermark_pdf.pages[0]

    for page in range(len(base_pdf.pages)):
        # Overlay the watermark page on the current page, modifying it in place
        merger = PageMerge(base_pdf.pages[page])
        merger.add(mark).render()

    writer = PdfWriter()
    writer.write(output, base_pdf)

if __name__ == '__main__':
    watermarker('reportlab-sample.pdf',
                'watermark.pdf',
                'watermarked-test.pdf')

Here we create a simple watermarker function that takes an input PDF path, the PDF that contains the watermark and the output path of the end result. Then we open up the base PDF and the watermark PDF. We extract the watermark page and then iterate over the pages in the base PDF. In each iteration, we create a PageMerge object using the current base PDF page that we are on. Then we overlay the watermark on top of that page and render it. After the loop finishes, we create a PdfWriter object and write the merged PDF to disk.

Scaling

The pdfrw package can also manipulate PDFs in memory. In fact, it will allow you to create Form XObjects. These objects can represent any page or rectangle in a PDF. What this means is that once you have one of these objects created, you can then scale, rotate and position pages or sub-pages. There is a fun example on the pdfrw GitHub page called 4up.py that takes pages from a PDF, scales them down to a quarter of their size and positions four pages on a single page.

Here is my version:

# scaler.py

from pdfrw import PdfReader, PdfWriter, PageMerge


def get4(srcpages):
    scale = 0.5
    srcpages = PageMerge() + srcpages
    # Offsets for the right-hand column and the top row
    x_increment, y_increment = (scale * i for i in srcpages.xobj_box[2:])
    for i, page in enumerate(srcpages):
        page.scale(scale)
        page.x = x_increment if i & 1 else 0
        page.y = 0 if i & 2 else y_increment
    return srcpages.render()


def scale_pdf(path, output):
    pages = PdfReader(path).pages
    writer = PdfWriter(output)
    scaled_pages = 4

    # Walk the document four pages at a time
    for i in range(0, len(pages), scaled_pages):
        four_pages = get4(pages[i: i + scaled_pages])
        writer.addpage(four_pages)

    writer.write()

if __name__ == '__main__':
    scale_pdf('reportlab-sample.pdf', 'four-page.pdf')

The get4 function comes from the 4up.py script. This function takes a series of pages and uses pdfrw’s PageMerge class to merge those pages together. We basically loop over the passed-in pages and scale each one to half its width and height, then we position them on the page and render the page series on one page.

The next function is scale_pdf, which takes the input PDF and the path for the output. Then we extract the pages from the input file and create a writer object. Next we loop over the pages of the input document 4 at a time and pass them to the get4 function. Then we take the result of that function and add it to our writer object.
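The bit tests in get4 (i & 1 and i & 2) are compact but easy to misread. The following standalone sketch reproduces just that placement arithmetic with plain numbers so you can see where each of the four pages lands. The function names four_up_positions and groups_of are hypothetical, and the 612 x 792 page size is only an assumed US Letter example, not something taken from the sample PDF.

```python
def four_up_positions(width, height, scale=0.5):
    """Return the (x, y) offset of each of four scaled pages, in get4's order."""
    x_increment, y_increment = scale * width, scale * height
    positions = []
    for i in range(4):
        x = x_increment if i & 1 else 0   # odd indexes go in the right column
        y = 0 if i & 2 else y_increment   # indexes 2 and 3 go in the bottom row
        positions.append((x, y))
    return positions


def groups_of(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == '__main__':
    # Top-left, top-right, bottom-left, bottom-right for a 612 x 792 page:
    # [(0, 396.0), (306.0, 396.0), (0, 0), (306.0, 0)]
    print(four_up_positions(612, 792))
    # Ten pages become groups of 4, 4 and 2
    print([len(group) for group in groups_of(list(range(10)), 4)])
```

Because PDF coordinates start at the bottom-left corner, the first two pages get the larger y offset and therefore appear in the top row.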

Finally we write the document out to disk. Here is a screenshot that kind of shows how it looks:

[Screenshot: four scaled-down pages arranged on a single page]

Now let’s learn how we might combine pdfrw with ReportLab!

Combining pdfrw and ReportLab

One of the neat features of pdfrw is its ability to integrate with the ReportLab toolkit. There are several examples on the pdfrw GitHub page that show different ways to use the two packages together. The creator of pdfrw thinks that you may be able to simulate some of ReportLab’s PageCatcher functionality, which is a part of ReportLab’s paid product. I don’t know if it does or not, but you can definitely do some fun things with pdfrw and ReportLab.

For example, you can use pdfrw to read in pages from a pre-existing PDF and turn them into objects that you can write out in ReportLab. Let’s write a script that will create a subset of a PDF using pdfrw and ReportLab. The following example is based on one from the pdfrw project:

# split_with_rl.py

from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl

from reportlab.pdfgen.canvas import Canvas

def split(path, number_of_pages, output):
    pdf_obj = PdfReader(path)

    my_canvas = Canvas(output)

    # create page objects
    pages = pdf_obj.pages[0: number_of_pages]
    pages = [pagexobj(page) for page in pages]

    for page in pages:
        # Size each canvas page to match the source page
        my_canvas.setPageSize((page.BBox[2], page.BBox[3]))
        my_canvas.doForm(makerl(my_canvas, page))
        my_canvas.showPage()

    # write the new PDF to disk
    my_canvas.save()


if __name__ == '__main__':
    split('reportlab-sample.pdf', 10, 'subset-rl.pdf')

Here we import some new functionality. First, we import pagexobj, which will create a Form XObject from the view that you give it. The view defaults to an entire page, but you could tell pdfrw to just extract a portion of the page. Next, we import the makerl function, which will take a ReportLab canvas object and a pdfrw Form XObject and turn it into a form that ReportLab can add to its canvas object.

So, let’s examine this code a bit and see how it works. Here we create a reader object and a canvas object. Then we create a list of Form XObjects starting with the first page and ending with the last page that we specified. Note that we do not check whether we asked for more pages than the document contains, so that is something that we could add to make this script more robust.

Next, we iterate over the pages that we just created and add them to our ReportLab canvas. You will note that we set the page size using the width and height that we extract from pdfrw’s BBox attribute. Then we add the Form XObjects to the canvas. The call to showPage tells ReportLab that you finished creating a page and to start a new one. Finally, we save the new PDF to disk.

There are some other examples on pdfrw’s site that you should review. For example, there is a neat piece of code that shows how you could take a page from a pre-existing PDF and use it as the background for a new PDF that you create in ReportLab. There is also a really interesting scaling example where you can use pdfrw and ReportLab to scale pages down in much the same way that we did with pdfrw all by itself.

Wrapping Up

The pdfrw package is actually pretty powerful and has features that PyPDF2 does not. Its ability to integrate with ReportLab is one feature that I think is really interesting and could be used to create something original. You can also use pdfrw to do many of the same things that we can do with PyPDF2, such as splitting, merging, rotating and concatenating PDFs together. I actually thought pdfrw was a bit more robust in generating viable PDFs than PyPDF2, but I have not done extensive tests to actually confirm this.

Regardless, I believe that pdfrw is worth adding to your toolkit.

Related Reading

  • GitHub page for pdfrw
  • A quick intro to pdfrw
  • pdfrw and PDF forms: Filling them using Python



Published at DZone with permission of Mike Driscoll, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
