How To Convert a Scanned Document Into Text in Java
How to transform a scanned image in common formats such as JPEG and PNG into text via Optical Character Recognition.
Whether you’re looking to minimize storage space, display documents online, or edit a document electronically, Optical Character Recognition (OCR) is a great method for digitizing printed texts. This can be particularly helpful to businesses as a form of data entry for various documentation such as invoices, bank statements, mail, and electronic receipts.
Early versions of OCR technology had to be trained with images of each character. The very first devices were created in 1914 to convert printed text into telegraph code or into audio for the visually impaired. As you may have guessed, OCR has come a long way since the early 1900s, and current versions achieve a high degree of recognition accuracy for most fonts across a wide range of file formats.
This tutorial focuses specifically on an OCR API that will convert a scanned image of a document into text. It’s important to clarify that this specific API is intended to run on scanned documents only; if you want to leverage OCR technology to convert photos to text, be sure to utilize our photo to text function instead, as it’s designed to unskew the image prior to conversion.
To kick off our process, we will first need to install the SDK package with Maven by adding a reference to the repository in pom.xml:
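As a rough sketch, the repository entry could look like the fragment below. It assumes the Java client is distributed through JitPack; the repository id and URL are assumptions to confirm against the vendor's current installation instructions.

```xml
<!-- Assumption: the SDK is served from JitPack; verify before use. -->
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
```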
Next, we’ll add a reference to the dependency:
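A dependency entry along these lines would then pull in the client. The group and artifact IDs follow JitPack's GitHub naming convention and, like the version tag, are assumptions; check the client's release list for the tag you actually want.

```xml
<!-- Assumption: coordinates follow JitPack's com.github.<owner>/<repo> convention. -->
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <!-- Assumed version tag; replace with a current release. -->
        <version>v4.25</version>
    </dependency>
</dependencies>
```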
Once the installation is complete, we’ll be ready to add the imports to the top of the file and call the image to text function with the following code:
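As a dependency-free illustration of what an image to text call involves, the sketch below sends the request directly with Java's built-in `java.net.http` client (Java 11+) rather than through the SDK classes. The endpoint URL, the header names (`Apikey`, `recognitionMode`, `language`, `preprocessing`), the `imageFile` form field, and the `CLOUDMERSIVE_API_KEY` environment variable are all assumptions made for this example; confirm them against the API documentation before relying on them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

final class ScannedImageOcr {
    // Assumed REST endpoint for the image to text function.
    static final String ENDPOINT = "https://api.cloudmersive.com/ocr/image/toText";

    // Builds a multipart/form-data POST carrying the image and the OCR options.
    static HttpRequest buildRequest(String apiKey, String fileName, byte[] imageBytes,
                                    String recognitionMode, String language,
                                    String preprocessing) {
        String boundary = "ocr" + UUID.randomUUID().toString().replace("-", "");
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"imageFile\"; filename=\""
                + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";

        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.writeBytes(partHeader.getBytes(StandardCharsets.UTF_8));
        body.writeBytes(imageBytes);
        body.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));

        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Apikey", apiKey)                    // assumed header name
                .header("recognitionMode", recognitionMode)  // Basic, Normal, or Advanced
                .header("language", language)                // e.g. ENG
                .header("preprocessing", preprocessing)      // None or Auto
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();
    }

    // Sends the request and returns the raw response body.
    static String recognize(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("OCR request failed with HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        Path image = Path.of(args[0]);                         // scanned PNG or JPEG
        String apiKey = System.getenv("CLOUDMERSIVE_API_KEY"); // hypothetical variable name; must be set
        HttpRequest request = buildRequest(apiKey, image.getFileName().toString(),
                Files.readAllBytes(image), "Advanced", "ENG", "Auto");
        System.out.println(recognize(request));
    }
}
```

Keeping `buildRequest` separate from `recognize` means the request can be inspected or unit-tested without making a billable API call.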
This will return a text version of your uploaded image. For the most accurate results, verify that the following parameters are set:
- API Key: this can be obtained via the Cloudmersive website; if you register for a free account, you will receive a personal API key and access to 800 calls/month
- Image file: the file to perform the OCR on; common file formats such as PNG and JPEG are supported
- Recognition mode: this is optional, and the default recognition mode is Advanced; possible values are:
  - Basic: provides basic recognition and is not resilient to page rotation, skew, or low-quality images; uses 1-2 API calls
  - Normal: provides highly fault-tolerant OCR recognition; uses 26-30 API calls
  - Advanced: provides the highest quality and most fault-tolerant recognition; uses 28-30 API calls
- Language: this is optional; the default language is English (ENG). Possible values are ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Haitian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
- Preprocessing: this is optional; the default preprocessing mode is Auto. Possible values are:
  - None: no preprocessing of the image
  - Auto: enhancement of the image before OCR is applied (recommended)
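The optional parameters and their defaults can be captured in a small holder class that fills in Advanced, ENG, and Auto when a value is omitted and rejects anything outside the values listed above. This is a minimal client-side sketch: `OcrOptions` is a hypothetical helper, not part of the SDK, and it only checks the shape of the language code rather than the full list.

```java
import java.util.Set;

// Hypothetical helper (not part of the SDK) holding the optional OCR parameters.
final class OcrOptions {
    static final Set<String> RECOGNITION_MODES = Set.of("Basic", "Normal", "Advanced");
    static final Set<String> PREPROCESSING_MODES = Set.of("None", "Auto");

    final String recognitionMode;
    final String language;
    final String preprocessing;

    // Pass null for any argument to fall back to the documented default.
    OcrOptions(String recognitionMode, String language, String preprocessing) {
        this.recognitionMode = recognitionMode == null ? "Advanced" : recognitionMode;
        this.language = language == null ? "ENG" : language;
        this.preprocessing = preprocessing == null ? "Auto" : preprocessing;

        if (!RECOGNITION_MODES.contains(this.recognitionMode)) {
            throw new IllegalArgumentException("Unknown recognition mode: " + this.recognitionMode);
        }
        if (!PREPROCESSING_MODES.contains(this.preprocessing)) {
            throw new IllegalArgumentException("Unknown preprocessing mode: " + this.preprocessing);
        }
        // Shape check only: three capital letters, optionally a suffix such as -HANT or -OLD.
        if (!this.language.matches("[A-Z]{3}(-[A-Z]{3,4})?")) {
            throw new IllegalArgumentException("Malformed language code: " + this.language);
        }
    }
}
```

For example, `new OcrOptions(null, "DEU", null)` would request German recognition with the Advanced mode and Auto preprocessing, while `new OcrOptions("Fast", null, null)` fails before any API call is spent.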
By using the image to text function, you’ll be able to easily provide text versions of scanned documents whenever the need arises.