How to Convert a PDF to Text (TXT) Using Java

This article outlines the difficulties in extracting plain text from regular PDF documents at scale and demonstrates two API solutions that efficiently perform that task.

By Brian O'Neill (DZone Core) · Aug. 21, 22 · Tutorial

There is perhaps no file type more ubiquitous (by design) than the Portable Document Format (PDF). Capable of holding an impressive variety of content and object types and rendering consistently on virtually any operating system, PDFs dominate personal and professional project landscapes as a destination format for bulky and/or specially formatted files. File types like PowerPoint’s PPTX, for example, are often so large that exporting the file as a PDF is the most efficient way to make the project shareable; PDF’s vector and raster graphics capabilities offer an ideal solution, maintaining a faithful representation of the original document while achieving much better compression for sharing. Formats like Microsoft Word DOCX often can’t be opened as intended on other operating systems; the PDF version retains the fonts and formatting of the original, allowing the end viewer to see the document as it was intended. The list of *insert document* to PDF conveniences goes on and on.

If there is one major drawback to PDF documents, it is that they are notoriously difficult to edit. In fact, almost everything that makes PDFs such an ideal solution for reformatting externally or manually generated material also makes them one of the more challenging formats to manipulate. Because PDFs handle so many different content types in one file, they go through extensive compression to achieve an easily portable size, which means opening a PDF document and changing its contents is never a straightforward task. It doesn’t help that the format was never intended to be easy to edit; that resistance to change is part of what makes PDFs a secure and reliable format.

So, what if you just want to extract plain, unformatted text from a PDF and nothing more? There are many reasons why getting pure text is useful, but extracting it in a convenient, scalable way isn’t as simple as it may seem. If you’ve ever attempted to extract text by hastily converting a PDF to an office document format (perhaps using one of the hundreds of free PDF conversion tools available online), especially without knowing what the original document format was, you’ve likely run into formatting inconsistencies, strange spacing issues, missing links or media files, and stray lines or tables floating around where they shouldn’t be. When you just want the plain text portion, that clutter is a big distraction, and you’re still left with the task of separating the text from the new document and normalizing it manually.

If you’ve tried to extract text from a scanned or rasterized PDF (one made up entirely of page images rather than encoded text) using those same tools, you’ve probably noticed that it isn’t possible at all, at least not without a specialized Optical Character Recognition (OCR) service, which is a separate, albeit equally important, solution to the PDF-to-text problem.

When you attempt to get plain text from a regular PDF document, what you’re really trying to do is isolate one of a PDF’s many possible content types and retain only the text. Further, you’re asking for that text, which may carry complex formatting from an authoring application like Microsoft Word, to be normalized so that anyone on any platform can read it.

Because of the relative difficulty associated with performing simple editing tasks on a PDF, it’s common practice to use third-party PDF editors (or premium Adobe tools) to achieve the desired results. These solutions, while effective on a file-by-file basis, aren’t great for achieving results at scale, however — they still require manual navigation through an interface, which takes up time most people don’t have to waste on high-volume conversion tasks.

To edit and process PDFs at scale, third-party API services are often the most efficient solution. That’s because PDF editing APIs work on the file programmatically, without anyone having to open it in an editor; they can make meaningful edits (such as rotating pages, removing comments, etc.) and, on the opposite end of the spectrum, they can extract targeted content without having any impact on the original document at all.

Demonstration

In the demonstration portion of this article, I’ll walk you through two simple and easy-to-use API solutions that are designed to extract plain text from regular PDF documents without having to open or make any changes to the original file. These API solutions include the following:

  1. Convert PDF to Text (TXT)
  2. Convert PDF to Text (TXT) by page

The first solution listed above simply extracts plain text from a PDF document without performing any additional operations (by default); the API response will contain a ‘TextResult’ string with the body of extracted text. Below, I’ve included a response model for reference:

JSON
 
{
  "Successful": true,
  "TextResult": "string"
}
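
Once a response like this has been received, the ‘TextResult’ string typically needs to end up in a TXT file. The following is a minimal sketch of that step, not part of the Cloudmersive SDK: the class name `TextResultWriter` and both of its methods are hypothetical, and it assumes the caller has already read the ‘Successful’ flag and ‘TextResult’ string out of the response.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper (not part of the Cloudmersive SDK): stores extracted text as a .txt file.
final class TextResultWriter {

    // Derives "report.txt" from "report.pdf"; names without a .pdf suffix simply get ".txt" appended.
    static String txtNameFor(String pdfFileName) {
        boolean isPdf = pdfFileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
        String base = isPdf ? pdfFileName.substring(0, pdfFileName.length() - 4) : pdfFileName;
        return base + ".txt";
    }

    // Writes the text as UTF-8 into outputDir and returns the path of the new file.
    // Refuses to write anything when the response did not report success.
    static Path save(boolean successful, String textResult, Path outputDir, String pdfFileName)
            throws IOException {
        if (!successful || textResult == null) {
            throw new IllegalStateException("Conversion did not succeed for " + pdfFileName);
        }
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(txtNameFor(pdfFileName));
        return Files.write(target, textResult.getBytes(StandardCharsets.UTF_8));
    }
}
```

Checking the ‘Successful’ flag before writing keeps failed conversions from silently producing empty TXT files.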


The second solution will, as the name suggests, extract text while including the page number that each portion of text came from in the results. This solution adds a greater level of control over the conversion, ensuring the resulting information can be worked with in roughly the same order as the original PDF document, and making it much easier to catalogue the converted information when we store it in TXT form. The model below shows how this response is formatted:

JSON
 
{
  "Successful": true,
  "Pages": [
    {
      "PageNumber": 0,
      "PageText": "string"
    }
  ]
}
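
To illustrate how the per-page structure helps with cataloguing, here is a small sketch that sorts page entries by number and joins them into one text body with a header per page. The `Page` class is a hypothetical stand-in for one entry of the ‘Pages’ array (it is not the SDK’s own model class), and the `=== Page N ===` header format is an arbitrary choice for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: combines per-page text into a single catalogued string.
final class PageTextCatalog {

    // Minimal stand-in for one element of the "Pages" array in the response model.
    static final class Page {
        final int pageNumber;
        final String pageText;

        Page(int pageNumber, String pageText) {
            this.pageNumber = pageNumber;
            this.pageText = pageText;
        }
    }

    // Sorts a copy of the pages by page number and emits a header line before each page's text.
    static String assemble(List<Page> pages) {
        List<Page> sorted = new ArrayList<>(pages);
        sorted.sort(Comparator.comparingInt(p -> p.pageNumber));
        StringBuilder out = new StringBuilder();
        for (Page page : sorted) {
            out.append("=== Page ").append(page.pageNumber).append(" ===\n");
            out.append(page.pageText == null ? "" : page.pageText).append("\n");
        }
        return out.toString();
    }
}
```

Sorting a copy rather than the input list leaves the caller’s data untouched, which matters if the same response is also used elsewhere.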


Both solutions also provide an optional ‘textFormattingMode’ parameter that can be configured to specify how whitespace should be handled when making the conversion. When using this feature, possible values are ‘preserveWhitespace,’ which will retain whitespace in the document and preserve its relative positioning to the text, and ‘minimizeWhitespace,’ which will not insert any more spaces into the document in most cases. The default setting is ‘preserveWhitespace.’
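
Because the parameter is passed as a free-form string, it can be worth checking the value before making a call. The sketch below is one way to do that; the `TextFormattingModes` class is hypothetical and not part of the SDK, and the case-insensitive matching is a convenience for the caller, not documented API behavior.

```java
// Hypothetical helper: maps user input onto one of the two documented textFormattingMode values.
final class TextFormattingModes {

    static final String PRESERVE = "preserveWhitespace"; // documented default
    static final String MINIMIZE = "minimizeWhitespace";

    // Returns the canonical spelling of the requested mode.
    // Null or blank input falls back to the default; anything unrecognized is rejected.
    static String normalize(String requested) {
        if (requested == null || requested.trim().isEmpty()) {
            return PRESERVE;
        }
        String value = requested.trim();
        if (value.equalsIgnoreCase(PRESERVE)) {
            return PRESERVE;
        }
        if (value.equalsIgnoreCase(MINIMIZE)) {
            return MINIMIZE;
        }
        throw new IllegalArgumentException("Unknown textFormattingMode: " + requested);
    }
}
```

Rejecting unknown values locally surfaces a typo before it costs an API call.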

Below, I’ll walk you through how you can take advantage of either API using ready-to-run, complimentary code examples in Java. Please note that to use either API for free, you’ll just need to register a free account on www.cloudmersive.com to get a secure API key (this account will yield a limit of 800 API calls per month).

Before calling either API, we'll begin with SDK installation as our first step. We can do this with Maven by first adding a reference to the repository in pom.xml:

XML
 
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>


And then adding the dependency to pom.xml:

XML
 
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>


We can also do this with Gradle by adding it in our root build.gradle at the end of repositories:

Groovy
 
allprojects {
	repositories {
		...
		maven { url 'https://jitpack.io' }
	}
}


And then adding the dependency in build.gradle:

Groovy
 
dependencies {
        implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}


Now we can structure our API calls, beginning with the generic PDF to TXT conversion API. Within the code snippet below, include your API key where the documentation indicates (just below the imports), and then include your file path in the inputFile parameter below that:

Java
 
// Import classes:
import java.io.File;
import com.cloudmersive.client.ConvertDocumentApi;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.ApiKeyAuth;
import com.cloudmersive.client.model.TextConversionResult;

public class PdfToTxtExample {
    public static void main(String[] args) {
        ApiClient defaultClient = Configuration.getDefaultApiClient();

        // Configure API key authorization: Apikey
        ApiKeyAuth apiKeyAuth = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
        apiKeyAuth.setApiKey("YOUR API KEY");
        // Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
        //apiKeyAuth.setApiKeyPrefix("Token");

        ConvertDocumentApi apiInstance = new ConvertDocumentApi();
        File inputFile = new File("/path/to/inputfile"); // Input PDF to perform the operation on
        // Optional whitespace handling: 'preserveWhitespace' (default) or 'minimizeWhitespace'
        String textFormattingMode = "preserveWhitespace";
        try {
            TextConversionResult result = apiInstance.convertDocumentPdfToTxt(inputFile, textFormattingMode);
            System.out.println(result);
        } catch (ApiException e) {
            System.err.println("Exception when calling ConvertDocumentApi#convertDocumentPdfToTxt");
            e.printStackTrace();
        }
    }
}


For the solution which converts PDF to Text by page, use the following code snippet instead. This will capture your API key and file input parameters in the same way as the first:

Java
 
// Import classes:
import java.io.File;
import com.cloudmersive.client.EditPdfApi;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.ApiKeyAuth;
import com.cloudmersive.client.model.PdfTextByPageResult;

public class PdfToTxtByPageExample {
    public static void main(String[] args) {
        ApiClient defaultClient = Configuration.getDefaultApiClient();

        // Configure API key authorization: Apikey
        ApiKeyAuth apiKeyAuth = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
        apiKeyAuth.setApiKey("YOUR API KEY");
        // Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
        //apiKeyAuth.setApiKeyPrefix("Token");

        EditPdfApi apiInstance = new EditPdfApi();
        File inputFile = new File("/path/to/inputfile"); // Input PDF to perform the operation on
        // Optional whitespace handling: 'preserveWhitespace' (default) or 'minimizeWhitespace'
        String textFormattingMode = "preserveWhitespace";
        try {
            PdfTextByPageResult result = apiInstance.editPdfGetPdfTextByPages(inputFile, textFormattingMode);
            System.out.println(result);
        } catch (ApiException e) {
            System.err.println("Exception when calling EditPdfApi#editPdfGetPdfTextByPages");
            e.printStackTrace();
        }
    }
}


Remember, the default textFormattingMode setting will preserve whitespace from the input PDF in the output. If you’d prefer to avoid that, make sure to change the example code to ‘minimizeWhitespace’ instead. With this solution in hand, you’ll be able to easily redirect text from PDF documents to a variety of different destinations without ever having to open the document in question.
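
For high-volume conversion, the same call can be wrapped in a loop over a folder of PDFs. The sketch below shows one possible shape for that loop and makes several assumptions: the `PdfBatchExtractor` class is hypothetical, and the text extraction itself is passed in as a plain `Function<Path, String>` so the example stays independent of any SDK. In practice that function would wrap one of the API calls shown above and handle its `ApiException`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: converts every .pdf in a directory to a .txt file with the same base name.
final class PdfBatchExtractor {

    // 'extractor' is supplied by the caller and turns one PDF path into its plain text.
    // Returns the paths of the .txt files that were written, in file-name order.
    static List<Path> convertAll(Path inputDir, Path outputDir, Function<Path, String> extractor)
            throws IOException {
        Files.createDirectories(outputDir);

        // Collect the PDF files first so the directory stream is closed before any writing starts.
        List<Path> pdfs;
        try (Stream<Path> entries = Files.list(inputDir)) {
            pdfs = entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> written = new ArrayList<>();
        for (Path pdf : pdfs) {
            String name = pdf.getFileName().toString();
            String text = extractor.apply(pdf);
            // Swap the four-character ".pdf" suffix for ".txt".
            Path target = outputDir.resolve(name.substring(0, name.length() - 4) + ".txt");
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
            written.add(target);
        }
        return written;
    }
}
```

Keeping the extraction step pluggable also makes the loop easy to try out with a stub function before spending any of the monthly API call allowance.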
