How to Split PDF Files into Separate Documents Using Java
In this article, we discuss how PDF file structure manages individual page objects, and we learn how to split those pages into new PDF documents with APIs.
Asking our Java file-processing applications to manipulate PDF documents can only increase their value in the long run. PDF is one of the most widely used file types in the world today, and that’s unlikely to change any time soon.
Introduction
In this article, we’ll specifically learn how to divide PDF files into a series of separate PDF documents in Java — resulting in exactly one new PDF per page of the original file — and we’ll discuss open-source and third-party web API options to facilitate implementing that programmatic workflow into our code. We’ll start with a high-level overview of how PDF files are structured to make this type of workflow possible.
Distinguishing PDF from Office Open XML File Types
I’ve written a lot about Microsoft’s Office Open XML (OOXML) file types (e.g., DOCX, XLSX, etc.) in recent months, and it’s worth noting right away that PDF files are extremely different.
Where OOXML files are structured as a zip-compressed package of XML files containing document formatting instructions, PDF files use a binary file format that prioritizes layout fidelity over structured data representation and editability. In other words, PDF files care more about the visual appearance of content than its accessibility; we might’ve noticed this for ourselves if we’ve tried to copy and paste information directly from PDF files into another document.
Understanding How PDF Files Manage Individual Pages
The pages of a PDF document are organized in a hierarchical structure called the Page Tree. Within this tree, each page is represented as its own independent object, and each page object references its own content streams (i.e., the instructions that tell a viewer how to render the page when it’s opened) and resources (i.e., the fonts, images, or other objects used on that page). These objects are stored in the file as indirect objects, and the PDF directory (called a cross-reference table) records the byte offset of each one so that a PDF reader can locate any object a page references without scanning the entire file.
If we’ve spent time looking at any document file structures in the past, this should all sound pretty familiar. What might be less familiar is the path to building a series of new, independent PDF documents using each page object found within a PDF Page Tree.
Creating New PDF Files From Split PDF Pages
The next stage of our process involves extracting and then cloning each page’s content, which includes retaining the necessary resources (the fonts, images, and other objects the page needs in order to render) and maintaining the right object references (so that each cloned page still points to valid objects) for each PDF page. The API we use to handle this stage of the process will often duplicate shared resources from the original PDF document to avoid issues in the subsequent standalone documents. Handling this part correctly is crucial to ensure the resulting independent PDF documents contain the correct content; this consideration is one of the many reasons why we (probably) wouldn’t enjoy writing a program to handle this workflow from scratch.
Once each page is successfully cloned, a new PDF document must be created for each page object with a Page Tree that defines only one page, and the result of this process must be serialized. The original PDF metadata object (which includes information like the document title, author, creation date, etc.) may be retained or deleted, depending on the API.
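To make the cloning steps above more concrete, here is a deliberately simplified in-memory model of the idea. It does not parse real PDF files; the PageSplitModel, Page, and Document names are hypothetical, and the choice to copy the metadata into every output is just one of the two behaviors mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageSplitModel {

    /** Simplified page: its drawing instructions plus the names of the resources it uses. */
    public static final class Page {
        public final String contentStream;
        public final List<String> resourceNames;

        public Page(String contentStream, List<String> resourceNames) {
            this.contentStream = contentStream;
            this.resourceNames = new ArrayList<>(resourceNames);
        }
    }

    /** Simplified document: a flat page list, a shared resource pool, and metadata. */
    public static final class Document {
        public final List<Page> pages;
        public final Map<String, byte[]> resources;
        public final Map<String, String> metadata;

        public Document(List<Page> pages, Map<String, byte[]> resources, Map<String, String> metadata) {
            this.pages = pages;
            this.resources = resources;
            this.metadata = metadata;
        }
    }

    /** Builds one single-page document per page of the source. */
    public static List<Document> splitByPage(Document source) {
        List<Document> result = new ArrayList<>();
        for (Page page : source.pages) {
            // Duplicate only the resources this page references, so each
            // output document is self-sufficient.
            Map<String, byte[]> copied = new LinkedHashMap<>();
            for (String name : page.resourceNames) {
                byte[] data = source.resources.get(name);
                if (data == null) {
                    throw new IllegalStateException("Dangling resource reference: " + name);
                }
                copied.put(name, data.clone());
            }
            // The new "page tree" defines exactly one page.
            List<Page> singlePage = new ArrayList<>();
            singlePage.add(new Page(page.contentStream, page.resourceNames));
            // Metadata is retained here; a real API might drop it instead.
            result.add(new Document(singlePage, copied, new LinkedHashMap<>(source.metadata)));
        }
        return result;
    }
}
```

A real PDF library also has to rewrite object numbers and rebuild the cross-reference table when it serializes each output, which this model leaves out entirely.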
Splitting PDFs With an Open-Source Library
If we’re heading in an open-source API direction for our project, we might’ve already guessed that we’d land on an Apache library. Like most Apache APIs, the Apache PDFBox library is extremely popular thanks to its frequent updates, extensive support, and exhaustive documentation. Apache PDFBox has a utility called PDFSplit which conveniently facilitates the PDF splitting process.
More specifically, the PDFSplit utility is represented by the Splitter class from the Apache PDFBox library. After we create a Splitter instance in our code (this configures logic for splitting a PDF document), we can call the split() method that breaks our loaded PDF into a series of independent PDF document objects. Each new PDF document can then be stored independently with the save() method, and when our process is finished, we can invoke the close() method to prevent memory leaks from occurring in our program.
Like any library, we can add Apache PDFBox to our Java project by adding the required dependencies to our pom.xml (for Maven projects) or to our build.gradle (for Gradle projects).
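As a minimal sketch of the Splitter workflow described above, the following assumes the org.apache.pdfbox:pdfbox artifact is on the classpath. The PdfPageSplitter class, the splitToFiles method, and the page-N.pdf naming scheme are hypothetical choices for illustration, not part of PDFBox itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class PdfPageSplitter {

    /**
     * Writes one single-page PDF per page of the source document into outDir
     * and returns the created files in page order. The caller keeps ownership
     * of the source document and is responsible for closing it afterwards.
     */
    public static List<Path> splitToFiles(PDDocument source, Path outDir) throws IOException {
        Files.createDirectories(outDir);

        // By default, a Splitter starts a new document after every page.
        Splitter splitter = new Splitter();
        List<PDDocument> pages = splitter.split(source);

        List<Path> written = new ArrayList<>();
        try {
            for (int i = 0; i < pages.size(); i++) {
                // Hypothetical naming scheme: page-1.pdf, page-2.pdf, ...
                Path target = outDir.resolve("page-" + (i + 1) + ".pdf");
                pages.get(i).save(target.toFile());
                written.add(target);
            }
        } finally {
            // Close every split document, even if a save failed part-way.
            for (PDDocument page : pages) {
                page.close();
            }
        }
        return written;
    }
}
```

How the source document is opened depends on the PDFBox version: 3.x typically uses Loader.loadPDF(new File(...)), while 2.x uses PDDocument.load(new File(...)). In either case, it is safest to keep the source document open until every split page has been saved, and to close it afterwards.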
Splitting PDFs With a Web API
One of the challenges we often encounter using open-source APIs for complex file operations is the overhead incurred from local resource usage (i.e., on the machine running the code). When we split larger PDF files, for example, we consume a significant amount of local RAM, CPU, and disk space on our server. Sometimes, it’s best to package our file processing action as a web request and ask it to take place on another server entirely. This offloads the bulk of our file processing overhead to another server, distributing the workload more effectively.
We could deploy a new server on our own, or we could lean on third-party web API servers with easy accessibility and robust features. This depends entirely on the scope and requirements of our project; we may not have permission to provision a new server or leverage a third-party service. We’ll now look at one example of a simple web API request that can offload PDF splitting and document generation on our behalf.
Demonstration
The solution below is free to use and requires an API key in the configuration step. For a Maven project, we can install it by first adding the following repository reference to our pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
And then adding the following dependency reference to our pom.xml:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
Alternatively, for a Gradle project, we’ll add the following to our root build.gradle (at the end of repositories):
allprojects {
    repositories {
        ...
        maven { url 'https://jitpack.io' }
    }
}
We’ll then add the following dependency in build.gradle:
dependencies {
    implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}
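Next, we’ll place the import statements at the top of our file:
// Import classes:
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.model.SplitPdfResult; // result type used below; package name assumed from the client's generated layout
import com.cloudmersive.client.SplitDocumentApi;
import java.io.File; // needed for the input file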
Then, we’ll add our API key configuration directly after:
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
Finally, we’ll create an instance of SplitDocumentApi and call its splitDocumentPdfByPage() method with our input PDF file:
SplitDocumentApi apiInstance = new SplitDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
Boolean returnDocumentContents = true; // Boolean | Set to true to directly return all of the document contents in the DocumentContents field; set to false to return contents as temporary URLs (more efficient for large operations). Default is false.
try {
    SplitPdfResult result = apiInstance.splitDocumentPdfByPage(inputFile, returnDocumentContents);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling SplitDocumentApi#splitDocumentPdfByPage");
    e.printStackTrace();
}
We'll most likely want to keep returnDocumentContents set to true in our code, just like the above example. This specifies that the API will return file byte strings in our response array rather than temporary URLs (which are used to "chain" edits together by referencing modified file content in a cache on the endpoint server). Our try/catch block will print errors (with stack trace) to the console for easy debugging.
In our API response, we can expect an array of new PDF documents. Here's a JSON response model for reference:
{
    "Successful": true,
    "Documents": [
        {
            "PageNumber": 0,
            "URL": "string",
            "DocumentContents": "string"
        }
    ]
}
And an XML version of the same, if that's more helpful:
<?xml version="1.0" encoding="UTF-8"?>
<SplitPdfResult>
    <Successful>true</Successful>
    <Documents>
        <PageNumber>0</PageNumber>
        <URL>string</URL>
        <DocumentContents>string</DocumentContents>
    </Documents>
</SplitPdfResult>
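Once we have the response, we’ll usually want to persist each returned page as its own file. The sketch below uses only the Java standard library and assumes that each DocumentContents value reaches us as a Base64-encoded string, which is the usual way binary data travels in JSON but is an assumption here. The SplitResponseWriter class, the writePages method, and the page-N.pdf naming are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class SplitResponseWriter {

    /**
     * Decodes each Base64-encoded page (assumed encoding) and writes it to
     * outDir as its own PDF file, returning the created files in list order.
     */
    public static List<Path> writePages(List<String> base64Pages, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < base64Pages.size(); i++) {
            byte[] pdfBytes = Base64.getDecoder().decode(base64Pages.get(i));
            // Files are named by list position, not by the PageNumber field.
            Path target = outDir.resolve("page-" + (i + 1) + ".pdf");
            Files.write(target, pdfBytes);
            written.add(target);
        }
        return written;
    }
}
```

If we work with the Java client’s result object rather than raw JSON, the contents may already be exposed as decoded bytes, in which case the Base64 step can be skipped and the bytes written directly.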
Conclusion
In this article, we learned about how PDF files are structured, and we focused our attention on the way PDF pages are organized within PDF file structure. We learned about the high-level steps involved in splitting a PDF file into a series of separate documents, and we then explored two Java libraries — one open-source library and one third-party web API — to facilitate adding this workflow into our own Java project.