DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Python and Open-Source Libraries for Efficient PDF Management
  • Managing AWS Managed Microsoft Active Directory Objects With AWS Lambda Functions
  • Implementing a RAG Model for PDF Content Extraction and Query Answering
  • Efficient String Formatting With Python f-Strings

Trending

  • Java’s Next Act: Native Speed for a Cloud-Native World
  • The 4 R’s of Pipeline Reliability: Designing Data Systems That Last
  • Event-Driven Architectures: Designing Scalable and Resilient Cloud Solutions
  • Java Virtual Threads and Scaling
  1. DZone
  2. Coding
  3. Languages
  4. Splitting and Merging PDFs With Python

Splitting and Merging PDFs With Python

PyPDF2 is a powerful and useful package. If you need to manipulate existing PDFs, then this package might be right up your alley!

By 
Mike Driscoll user avatar
Mike Driscoll
·
Apr. 13, 18 · Tutorial
Likes (4)
Comment
Save
Tweet
Share
26.4K Views

Join the DZone community and get the full member experience.

Join For Free

The PyPDF2 package allows you to do a lot of useful operations on existing PDFs. In this article, we will learn how to split a single PDF into multiple smaller ones. We will also learn how to take a series of PDFs and join them back together into a single PDF.

Getting Started

PyPDF2 doesn't come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.

pip install pypdf2

Now that we have PyPDF2 installed, let's learn how to split and merge PDFs!

Splitting PDFs

The PyPDF2 package gives you the ability to split up a single PDF into multiple ones. You just need to tell it how many pages you want. For this example, we will download a W9 form from the IRS and loop over all six of its pages. We will split off each page and turn it into its own standalone PDF.

Let's find out how:

# pdf_splitter.py
 
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
 
 
def pdf_splitter(path):
    fname = os.path.splitext(os.path.basename(path))[0]
 
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
 
        output_filename = '{}_page_{}.pdf'.format(
            fname, page+1)
 
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
 
        print('Created: {}'.format(output_filename))
 
if __name__ == '__main__':
    path = 'w9.pdf'
    pdf_splitter(path)

For this example, we need to import both the PdfFileReader and the PdfFileWriter. Then we create a fun little function called pdf_splitter. It accepts the path of the input PDF. The first line of this function will grab the name of the input file, minus the extension. Next, we open the PDF up and create a reader object. Then we loop over all the pages using the reader object's getNumPages method.

Inside of the for loop, we create an instance of  PdfFileWriter. We then add a page to our writer object using its addPage method. This method accepts a page object, so to get the page object, we call the reader object's getPage method. Now, we added one page to our writer object. The next step is to create a unique file name, which we do by using the original file name plus the word "page" plus the page number + 1. We add the one because PyPDF2's page numbers are zero-based, so page 0 is actually page 1.

Finally, we open the new file name in write-binary mode and use the PDF writer object's write method to write the object's contents to disk.

Merging Multiple PDFs Together

Now that we have a bunch of PDFs, let's learn how we might take them and merge them back together. One useful use case for doing this is for businesses to merge their dailies into a single PDF. I have needed to merge PDFs for work and for fun. One project that sticks out in my mind is scanning documents in. Depending on the scanner you have, you might end up scanning a document into multiple PDFs, so being able to join them together again can be wonderful.

When the original PyPdf came out, the only way to get it to merge multiple PDFs together was like this:

# pdf_merger.py
 
import glob
from PyPDF2 import PdfFileWriter, PdfFileReader
 
def merger(output_path, input_paths):
    pdf_writer = PdfFileWriter()
 
    for path in input_paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
 
    with open(output_path, 'wb') as fh:
        pdf_writer.write(fh)
 
 
if __name__ == '__main__':
    paths = glob.glob('w9_*.pdf')
    paths.sort()
    merger('pdf_merger.pdf', paths)

Here, we create a PdfFileWriter object and several PdfFileReader objects. For each PDF path, we create a PdfFileWriter object and then loop over its pages, adding each and every page to our writer object. Then we write out the writer object's contents to disk.

PyPDF2 made this a bit simpler by creating a PdfFileMerger object:

# pdf_merger2.py
 
import glob
from PyPDF2 import PdfFileMerger
 
def merger(output_path, input_paths):
    pdf_merger = PdfFileMerger()
    file_handles = []
 
    for path in input_paths:
        pdf_merger.append(path)
 
    with open(output_path, 'wb') as fileobj:
        pdf_merger.write(fileobj)
 
if __name__ == '__main__':
    paths = glob.glob('w9_*.pdf')
    paths.sort()
    merger('pdf_merger2.pdf', paths)

Here, we just need to create the PdfFileMerger object and then loop through the PDF paths, appending them to our merging object. PyPDF2 will automatically append the entire document, so you don't need to loop through all the pages of each document yourself. Then, we just write it out to disk.

The PdfFileMerger class also has a merge method that you can use. Its code definition looks like this:

def merge( self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True ):
        "" "
        Merges the pages from the given file into the output file at the
        specified page number.

        :param int position: The *page number* to insert this file. File will
            be inserted after the given number.

        :param fileobj: A File Object or an object that supports the standard read
            and seek methods similar to a File Object. Could also be a
            string representing a path to a PDF file.

        :param str bookmark: Optionally, you may specify a bookmark to be applied at
            the beginning of the included file by supplying the text of the bookmark.

        :param pages: can be a :ref:`Page Range <page-range>` or a ``(start, stop[, step])`` tuple
            to merge only the specified range of pages from the source
            document into the output document.

        :param bool import_bookmarks: You may prevent the source document's bookmarks
            from being imported by specifying this as ``False``.
        " ""

Basically, the merge method allows you to tell PyPDF where to merge a page by page number. So, if you have created a merging object with three pages in it, you can tell the merging object to merge the next document in at a specific position. This allows the developer to do some pretty complex merging operations. Give it a try and see what you can do!

Wrapping Up

PyPDF2 is a powerful and useful package. I have been using it off and on for years to work on various home and work projects. If you need to manipulate existing PDFs, then this package might be right up your alley!

Object (computer science) PDF Python (language)

Published at DZone with permission of Mike Driscoll, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Python and Open-Source Libraries for Efficient PDF Management
  • Managing AWS Managed Microsoft Active Directory Objects With AWS Lambda Functions
  • Implementing a RAG Model for PDF Content Extraction and Query Answering
  • Efficient String Formatting With Python f-Strings

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!