How to Split PDF Files into Separate Documents Using Java

In this article, we discuss how PDF file structure manages individual page objects, and we learn how to split those pages into new PDF documents with APIs.

By Brian O'Neill, DZone Core · Jan. 29, 25 · Tutorial

Asking our Java file-processing applications to manipulate PDF documents can only increase their value in the long run. PDF is one of the most popular and widely used document formats in the world today, and that’s unlikely to change any time soon.

Introduction

In this article, we’ll learn how to divide a PDF file into a series of separate PDF documents in Java, producing exactly one new PDF per page of the original file, and we’ll discuss open-source and third-party web API options for implementing that workflow in our code. We’ll start with a high-level overview of how PDF files are structured to make this type of workflow possible.

Distinguishing PDF from Office Open XML File Types

I’ve written a lot about Microsoft Office Open XML (OOXML) files (e.g., DOCX, XLSX, etc.) in recent months, and it’s worth noting right away that PDF files are extremely different.

Where OOXML files are structured as a zip-compressed package of XML files containing document content and formatting instructions, PDF files use a largely binary file format that prioritizes layout fidelity over structured data representation and editability. In other words, PDF files care more about how content looks than about how easily it can be extracted or edited; we might’ve noticed this for ourselves if we’ve tried to copy and paste information directly from a PDF file into another document.

Understanding How PDF Files Manage Individual Pages

The pages of a PDF document are organized in a hierarchical structure called the Page Tree. Within this tree, each page is represented as its own independent object, and each page object references its own content streams (i.e., the instructions for rendering the page when it’s opened) and resources (i.e., the fonts, images, or other objects the page uses). These objects are located through the file’s cross-reference table, a directory that records the byte offset of each object in the file so that a PDF reader can jump straight to it instead of scanning the whole document.
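
To make the Page Tree idea more concrete, here is a minimal sketch in plain Java. It models the tree as intermediate nodes that group children (comparable to a /Pages node with its /Kids) and leaf nodes that each stand for one page, then walks the tree depth-first to recover the pages in document order. The class names (PageTreeSketch, Node, Pages, Page) and the string placeholders for content streams and resources are hypothetical; this is a toy model for illustration, not how any PDF library represents pages internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a PDF Page Tree: intermediate nodes hold kids, leaves are pages. */
final class PageTreeSketch {

    /** A node is either an intermediate Pages node or a leaf Page. */
    interface Node {}

    /** Intermediate node that groups child nodes. */
    static final class Pages implements Node {
        final List<Node> kids;

        Pages(Node... kids) {
            this.kids = Arrays.asList(kids);
        }
    }

    /** Leaf node: one page, naming its content stream and the resources it uses. */
    static final class Page implements Node {
        final String contentStream;
        final List<String> resources;

        Page(String contentStream, String... resources) {
            this.contentStream = contentStream;
            this.resources = Arrays.asList(resources);
        }
    }

    /** Depth-first walk that returns the leaf pages in document order. */
    static List<Page> flatten(Node root) {
        List<Page> pages = new ArrayList<>();
        collect(root, pages);
        return pages;
    }

    private static void collect(Node node, List<Page> out) {
        if (node instanceof Page) {
            out.add((Page) node);
        } else {
            for (Node kid : ((Pages) node).kids) {
                collect(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Pages(
            new Pages(
                new Page("content-1", "FontA"),
                new Page("content-2", "FontA", "Image1")),
            new Page("content-3", "FontB"));

        List<Page> pages = flatten(root);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + " -> " + pages.get(i).contentStream);
        }
    }
}
```

Flattening the tree this way is the conceptual starting point for splitting: once the pages are available as an ordered list, each one can be handled on its own.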

If we’ve spent time looking at any document file structures in the past, this should all sound pretty familiar. What might be less familiar is the path to building a series of new, independent PDF documents using each page object found within a PDF Page Tree.

Creating New PDF Files From Split PDF Pages

The next stage of the process involves extracting and cloning each page’s content, which means retaining its content streams (the page rendering instructions) and the resources they depend on (fonts, images, and other objects), and keeping the object references between them valid. The API we use to handle this stage of the process will often duplicate shared resources from the original PDF document so that each standalone document is self-contained. Handling this part correctly is crucial to ensure the resulting independent PDF documents contain the correct content; this consideration is one of the many reasons why we (probably) wouldn’t enjoy writing a program to handle this workflow from scratch.

Once each page is successfully cloned, a new PDF document must be created for each page object with a Page Tree that defines only one page, and the result of this process must be serialized. The original PDF metadata object (which includes information like the document title, author, creation date, etc.) may be retained or deleted, depending on the API.
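
The cloning step described above can be sketched in plain Java without any PDF library. The toy model below treats a document as a list of pages plus a table of shared resources, and it builds one single-page document per page, copying only the resources that page references. All names here (SplitSketch, Document, Page, splitByPage) are hypothetical and exist only for illustration; a real PDF library does considerably more work, such as renumbering objects and writing a new cross-reference table for each output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of splitting a document into self-contained single-page documents. */
final class SplitSketch {

    /** A page names the resources (fonts, images) it needs. */
    static final class Page {
        final String content;
        final List<String> resourceNames;

        Page(String content, String... resourceNames) {
            this.content = content;
            this.resourceNames = Arrays.asList(resourceNames);
        }
    }

    /** A document holds pages plus the resources shared between them. */
    static final class Document {
        final List<Page> pages = new ArrayList<>();
        final Map<String, String> resources = new LinkedHashMap<>();
    }

    /** Returns one new document per page, each carrying copies of the resources it uses. */
    static List<Document> splitByPage(Document source) {
        List<Document> result = new ArrayList<>();
        for (Page page : source.pages) {
            Document single = new Document();
            single.pages.add(page);
            for (String name : page.resourceNames) {
                String resource = source.resources.get(name);
                if (resource == null) {
                    throw new IllegalStateException("Missing resource: " + name);
                }
                // Duplicate the shared resource so the new document stands alone.
                single.resources.put(name, resource);
            }
            result.add(single);
        }
        return result;
    }

    public static void main(String[] args) {
        Document source = new Document();
        source.resources.put("FontA", "font data");
        source.resources.put("Image1", "image data");
        source.pages.add(new Page("page one text", "FontA"));
        source.pages.add(new Page("page two text", "FontA", "Image1"));

        List<Document> parts = splitByPage(source);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("Document " + (i + 1) + " resources: "
                + parts.get(i).resources.keySet());
        }
    }
}
```

In this sketch, FontA is shared by both pages, so it ends up copied into both output documents, while Image1 only travels with the second page. That duplication is the price of making each output file independent of the original.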

Splitting PDFs With an Open-Source Library

If we’re heading in an open-source direction for our project, we might’ve already guessed that we’d land on an Apache library. Apache PDFBox is a popular, actively maintained, and well-documented Java library for working with PDF files, and it includes a command-line utility called PDFSplit which conveniently facilitates the PDF splitting process.

In code, the logic behind the PDFSplit utility is provided by the Splitter class (org.apache.pdfbox.multipdf.Splitter). After we create a Splitter instance, we can pass our loaded PDDocument to its split() method, which returns a list of independent PDDocument objects; by default, each one contains a single page. Each new document can then be stored with its save() method, and when our process is finished, we should call close() on every new document and on the original to release the resources they hold.

As with any library, we can add Apache PDFBox to our Java project by declaring the required dependency (org.apache.pdfbox:pdfbox) in our pom.xml (for Maven projects) or in our build.gradle (for Gradle projects).

Splitting PDFs With a Web API

One of the challenges we often encounter using open-source libraries for complex file operations is the overhead incurred on the machine running the code. When we split larger PDF files, for example, we consume a significant amount of RAM, CPU, and disk space on our own server. Sometimes, it’s best to issue our file processing operation as a web request and have it take place on another server entirely. This offloads the bulk of our file processing overhead, distributing the workload more effectively.

We could deploy a new server on our own, or we could lean on third-party web API servers with easy accessibility and robust features. This depends entirely on the scope and requirements of our project; we may not have permission to provision a new server or leverage a third-party service. We’ll now look at one example of a simple web API request that can handle PDF splitting and document generation on our behalf.

Demonstration

The solution below is free to use and requires an API key in the configuration step. For a Maven project, we can install it by first adding the following repository reference to our pom.xml:

XML
 
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>


And then adding the following dependency to our pom.xml:

XML
 
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>


Alternatively, for a Gradle project, we’ll add the below in our root build.gradle (at the end of repositories):

Groovy
 
allprojects {
	repositories {
		...
		maven { url 'https://jitpack.io' }
	}
}


We’ll then add the following dependency in build.gradle:

Groovy
 
dependencies {
        implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}


Next, we’ll add the imports at the top of our file:

Java
 
// Import classes:
import java.io.File;

import com.cloudmersive.client.SplitDocumentApi;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
// Assumed package for the SplitPdfResult response model used below
import com.cloudmersive.client.model.*;


Then, we’ll add our API key configuration directly after:

Java
 
ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth apiKeyAuth = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
apiKeyAuth.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//apiKeyAuth.setApiKeyPrefix("Token");


Finally, we’ll create an instance of SplitDocumentApi and call its splitDocumentPdfByPage() method with our input PDF file:

Java
 
SplitDocumentApi apiInstance = new SplitDocumentApi();

// Input file to perform the operation on.
File inputFile = new File("/path/to/inputfile");

// true: return all of the document contents directly in the DocumentContents field.
// false (the default): return contents as temporary URLs, which is more
// efficient for large operations.
Boolean returnDocumentContents = true;

try {
    SplitPdfResult result = apiInstance.splitDocumentPdfByPage(inputFile, returnDocumentContents);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling SplitDocumentApi#splitDocumentPdfByPage");
    e.printStackTrace();
}


For small and medium-sized files, we'll most likely want to keep returnDocumentContents set to true in our code, just like the above example. This tells the API to return the file bytes for each page directly in the response array rather than temporary URLs (which are used to "chain" edits together by referencing modified file content in a cache on the endpoint server, and which are more efficient for large operations). Our try/catch block prints any error, with its stack trace, to the console for easy debugging.

In our API response, we can expect an array of new PDF documents. Here's a JSON response model for reference:

JSON
 
{
  "Successful": true,
  "Documents": [
    {
      "PageNumber": 0,
      "URL": "string",
      "DocumentContents": "string"
    }
  ]
}


And an XML version of the same, if that's more helpful:

XML
 
<?xml version="1.0" encoding="UTF-8"?>
<SplitPdfResult>
	<Successful>true</Successful>
	<Documents>
		<PageNumber>0</PageNumber>
		<URL>string</URL>
		<DocumentContents>string</DocumentContents>
	</Documents>
</SplitPdfResult>
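
Once we have the response, we'll usually want to write each returned page to its own file. The sketch below shows one way to do that with only the Java standard library. It assumes each entry's DocumentContents arrives as a Base64-encoded string, which is a common way to carry binary data in JSON but is an assumption here; the Java client's own model classes may already expose the contents as a byte array, in which case the decoding step isn't needed. The SplitResultWriter and SplitPage names and the page-N.pdf naming scheme are hypothetical stand-ins, not part of the API client.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Writes the pages from a split response to disk, one PDF file per page. */
final class SplitResultWriter {

    /** Hypothetical stand-in for one entry of the "Documents" array. */
    static final class SplitPage {
        final int pageNumber;
        final String base64Contents; // assumed Base64 encoding of the page's PDF bytes

        SplitPage(int pageNumber, String base64Contents) {
            this.pageNumber = pageNumber;
            this.base64Contents = base64Contents;
        }
    }

    /** Decodes each page and writes it as page-<n>.pdf inside outputDir. */
    static List<Path> writePages(List<SplitPage> pages, Path outputDir) {
        List<Path> written = new ArrayList<>();
        try {
            Files.createDirectories(outputDir);
            for (SplitPage page : pages) {
                byte[] bytes = Base64.getDecoder().decode(page.base64Contents);
                Path target = outputDir.resolve("page-" + page.pageNumber + ".pdf");
                Files.write(target, bytes);
                written.add(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not write split pages", e);
        }
        return written;
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for real PDF content.
        String placeholder = Base64.getEncoder()
            .encodeToString("%PDF-1.7".getBytes(StandardCharsets.US_ASCII));
        List<SplitPage> pages = Arrays.asList(
            new SplitPage(1, placeholder),
            new SplitPage(2, placeholder));

        for (Path path : writePages(pages, Paths.get("split-output"))) {
            System.out.println("Wrote " + path);
        }
    }
}
```

If we had set returnDocumentContents to false instead, each entry would carry a temporary URL rather than contents, and we'd need to download the file from that URL before saving it.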


Conclusion

In this article, we learned about how PDF files are structured, and we focused our attention on the way PDF pages are organized within PDF file structure. We learned about the high-level steps involved in splitting a PDF file into a series of separate documents, and we then explored two options for adding this workflow to our own Java project: an open-source library and a third-party web API.
