DZone
How to Convert PDF to Text in Java

Utilize OCR technology to convert a PDF to text using an API in Java.

By Brian O'Neill · DZone Core · Mar. 26, 21 · Tutorial

Without the ability to copy, paste, or edit within a PDF document, manually transcribing a PDF to text can be a frustrating task. Fortunately, Optical Character Recognition (OCR) technology can help. We have discussed this in previous articles, but to clarify, optical character recognition (also called optical character reader) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text.

OCR is most popular as a form of data entry for printed paper records, but it is also frequently used to digitize printed texts so that they can be edited, stored compactly, or displayed online. The technology has been refined over time to recognize character patterns and, with the additional assistance of AI, can now provide a high degree of accuracy with little effort.

In the following tutorial, we will show how to use an OCR API to scan a PDF document and convert it to text, automating what would normally be a long and drawn-out process. The operation supports several quality levels and a wide array of languages, so you can customize it to fit your project’s needs.

As usual, our first step is to install the SDK with Maven by adding a reference to the JitPack repository in pom.xml:

```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
```



Next, we will add a reference to the dependency:

```xml
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v3.90</version>
    </dependency>
</dependencies>
```



Once the installation is complete, we can add the imports to the top of the controller and make the API call:

```java
// Import classes:
import java.io.File;
import com.cloudmersive.client.PdfOcrApi;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
// Assumed package for the response model; verify against your SDK version.
import com.cloudmersive.client.model.PdfToTextResponse;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth apiKeyAuth = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
apiKeyAuth.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//apiKeyAuth.setApiKeyPrefix("Token");

PdfOcrApi apiInstance = new PdfOcrApi();
File imageFile = new File("/path/to/inputfile"); // PDF file to perform OCR on.

// Optional; possible values are 'Basic', which provides basic recognition, is not resilient to page
// rotation, skew, or low-quality images, and uses 1-2 API calls per page; 'Normal', which provides highly
// fault-tolerant OCR recognition and uses 26-30 API calls per page; and 'Advanced', which provides the
// highest quality and most fault-tolerant recognition and uses 28-30 API calls per page. Default is 'Basic'.
String recognitionMode = "Basic";

// Optional; language of the input document, default is English (ENG). Possible values are
// ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese),
// AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian),
// BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano),
// CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek),
// ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish),
// FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek),
// HAT (Haitian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut),
// IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese),
// KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz),
// KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam),
// MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch),
// NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian),
// RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish),
// SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili),
// SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai),
// TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek),
// UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
String language = "ENG";

// Optional; preprocessing mode, default is 'Auto'. Possible values are None (no preprocessing of the image)
// and Auto (automatic image enhancement of the image before OCR is applied; this is recommended).
String preprocessing = "Auto";

try {
    PdfToTextResponse result = apiInstance.pdfOcrPost(imageFile, recognitionMode, language, preprocessing);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling PdfOcrApi#pdfOcrPost");
    e.printStackTrace();
}
```
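Because every processed page consumes API calls, it may be worth confirming that the input actually looks like a PDF before uploading it. The following is a minimal sketch of such a check using only the Java standard library; the `PdfPreflight` class and its methods are hypothetical helpers, not part of the Cloudmersive SDK, and the check only inspects the `%PDF-` marker that a conforming PDF file starts with.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical pre-flight helper; not part of the Cloudmersive SDK. */
final class PdfPreflight {
    // A conforming PDF file begins with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfPreflight() {}

    /** Returns true if the given leading bytes start with the PDF header marker. */
    static boolean hasPdfHeader(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Returns true if the path is a readable regular file whose first bytes look like a PDF. */
    static boolean looksLikePdf(Path path) {
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(path)) {
            // readNBytes returns fewer bytes only if the file is shorter than the marker.
            return hasPdfHeader(in.readNBytes(PDF_MAGIC.length));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder path taken from the article; pass a real path as the first argument.
        Path input = Paths.get(args.length > 0 ? args[0] : "/path/to/inputfile");
        System.out.println(input + " looks like a PDF: " + looksLikePdf(input));
    }
}
```

A caller could run `PdfPreflight.looksLikePdf(imageFile.toPath())` before the OCR request and skip the upload when it returns false. The check is deliberately shallow: it does not validate the rest of the file.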



To ensure the process runs smoothly, supply the following inputs:

  • Image File – PDF file to perform OCR on.
  • API Key – your personal API key; this can be obtained by registering for a free account on the Cloudmersive website.
  • Recognition Mode (optional) – three settings are provided; the default is Basic.
    • Basic: base-level recognition that is not resilient to page rotation, skew, or low-quality images; uses 1-2 API calls per page.
    • Normal: provides highly fault-tolerant recognition; uses 26-30 API calls per page.
    • Advanced: provides the highest quality and most fault-tolerant recognition; uses 28-30 API calls per page.
  • Language (optional) – the language of the input text; default is ENG (English).
  • Preprocessing (optional) – two settings are available for preprocessing mode; the default is Auto.
    • None: no preprocessing of the image.
    • Auto: automatic image enhancement before OCR is applied.
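Since the recognition modes differ mainly in how many API calls they consume per page, a rough estimate for a whole document can help when choosing between them. The sketch below multiplies the per-page ranges quoted above by a page count; the `OcrCallEstimator` class and its `Mode` enum are hypothetical names introduced for illustration, and the actual usage billed by the service may differ.

```java
/** Hypothetical helper that scales the per-page API call ranges to a whole document. */
final class OcrCallEstimator {

    /** Recognition modes with the per-page call ranges quoted in the article. */
    enum Mode {
        BASIC("Basic", 1, 2),
        NORMAL("Normal", 26, 30),
        ADVANCED("Advanced", 28, 30);

        final String apiValue;      // value to pass as the recognitionMode argument
        final int minCallsPerPage;
        final int maxCallsPerPage;

        Mode(String apiValue, int minCallsPerPage, int maxCallsPerPage) {
            this.apiValue = apiValue;
            this.minCallsPerPage = minCallsPerPage;
            this.maxCallsPerPage = maxCallsPerPage;
        }

        /** Lower bound on API calls for a document with the given page count. */
        int minCalls(int pages) {
            return requireNonNegative(pages) * minCallsPerPage;
        }

        /** Upper bound on API calls for a document with the given page count. */
        int maxCalls(int pages) {
            return requireNonNegative(pages) * maxCallsPerPage;
        }
    }

    private OcrCallEstimator() {}

    private static int requireNonNegative(int pages) {
        if (pages < 0) {
            throw new IllegalArgumentException("pages must not be negative: " + pages);
        }
        return pages;
    }

    public static void main(String[] args) {
        int pages = 10; // example page count
        for (Mode mode : Mode.values()) {
            System.out.printf("%s: %d-%d API calls for %d pages%n",
                    mode.apiValue, mode.minCalls(pages), mode.maxCalls(pages), pages);
        }
    }
}
```

For a 10-page document this gives 10-20 calls in Basic mode, 260-300 in Normal, and 280-300 in Advanced, which makes the cost gap between Basic and the other two modes easy to see.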

The response lists the recognized text page by page. OCR has come a long way since its humble beginnings in the early 1900s, so with a suitable recognition mode your results should be accurate.
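Because the text comes back page by page, a common follow-up step is to stitch the pages into one string. The sketch below shows one way to do that; `PageText` is a hypothetical stand-in for whatever per-page type the SDK response exposes (its real class and accessor names are not shown in this article), so the values would need to be copied out of the actual response first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper that joins per-page OCR text into a single document string. */
final class OcrPageJoiner {

    /** Hypothetical stand-in for one page of OCR output; the SDK's own type may differ. */
    static final class PageText {
        final int pageNumber;
        final String text;

        PageText(int pageNumber, String text) {
            this.pageNumber = pageNumber;
            this.text = text;
        }
    }

    private OcrPageJoiner() {}

    /** Joins page texts in ascending page order, separated by a blank line. */
    static String join(List<PageText> pages) {
        // Copy before sorting so the caller's list is left untouched.
        List<PageText> sorted = new ArrayList<>(pages);
        sorted.sort(Comparator.comparingInt((PageText p) -> p.pageNumber));

        List<String> texts = new ArrayList<>();
        for (PageText page : sorted) {
            // Treat a missing page text as empty and trim stray whitespace.
            texts.add(page.text == null ? "" : page.text.trim());
        }
        return String.join("\n\n", texts);
    }

    public static void main(String[] args) {
        List<PageText> pages = new ArrayList<>();
        pages.add(new PageText(2, "Second page text. "));
        pages.add(new PageText(1, " First page text."));
        System.out.println(join(pages));
    }
}
```

Sorting by page number is a precaution for the case where pages arrive out of order; if the response is already ordered, the sort simply leaves the order unchanged.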
