Laying the Groundwork for Intelligent Automation — OCR
See how Tesseract can help with document automation.
While hovering my smartphone over a check for mobile deposit, I was struck by the question: “What technology is behind this?” After some googling, I found one critical piece of technology is Optical Character Recognition (OCR). OCR extracts machine-printed or handwritten fields from check images and converts them into text.
An OCR engine is a computer program that uses sets of parameters to discern characters in an image. One of the most popular OCR engines is Tesseract. Tesseract has been around since the 1980s. It was originally developed as proprietary software by Hewlett Packard Labs. In 2005, Tesseract was open sourced by HP. Since 2006, Google has been sponsoring the project and actively contributing to it. Tesseract is considered one of the most accurate open-source OCR engines.
Tesseract v4.0 adds a new OCR engine based on Long Short-Term Memory (LSTM) neural networks. LSTM is a special type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies. In use cases like natural language processing, an LSTM stores information learned from the previous context and applies that knowledge to understand the words that follow.
As the principal evangelist of Automation Anywhere, a Robotic Process Automation company, I decided to put on my developer’s hat and see how Tesseract can help with document automation.
Installing Tesseract 4.0 on Mac
Installing Tesseract 4.0 on Mac is straightforward. From your Terminal Window, type:
$ brew install tesseract
Verifying Tesseract Version
To verify Tesseract 4.0 is installed correctly, use the following command:
$ tesseract --version
tesseract 4.0.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE
To test how Tesseract works, I used the sample utility bill bill.png below.
Run Tesseract from the command line and display the extracted text on the standard output. bill.png is the input filename. The language (-l) is set to English. The Page Segmentation Mode (--psm) supplies additional information about the text structure. Its default value is 3, which uses fully automatic page segmentation.
$ tesseract bill.png stdout -l eng --psm 3
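The same invocation can also be driven from a script. Here is a minimal sketch using Python's subprocess module; the build_tesseract_cmd helper is my own illustration, not part of Tesseract, and it assumes the tesseract binary is on your PATH:

```python
import subprocess
from pathlib import Path

def build_tesseract_cmd(image, lang="eng", psm=3):
    """Build the argv list for a Tesseract run that writes text to stdout."""
    return ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]

# Only invoke the binary if the sample image is actually present.
if Path("bill.png").exists():
    result = subprocess.run(build_tesseract_cmd("bill.png"),
                            capture_output=True, text=True)
    print(result.stdout)
```

Wrapping the flags in a small helper like this makes it easy to re-run the same image with different segmentation modes, which turns out to be useful below.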
As you can see from the output, the top of the bill — including the service address, customer account information, etc. — is generally extracted correctly.
Usage information in the table is processed as plain text. No correlation is established between the data fields and their values.
DESCRIPTION CHARGE Water supply £9,949.02 Water fixed charge £220.52, Water waste disposal £7,628.73 Waste water fixed charge £886.01 Supplementary charge £0.00 Trade effluent disposal £0.00 Trade effluent fixed charge £0.00
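When the layout is simple, some of that field/value correlation can be recovered with post-processing. The sketch below is my own illustration, not Tesseract functionality, and assumes every charge is a pound amount with two decimals immediately following its description:

```python
import re

def pair_charges(text):
    """Pair each description with the pound amount that follows it."""
    # Drop the column headers so they don't get glued onto the first field.
    text = text.replace("DESCRIPTION CHARGE", "")
    # A description is a run of letters/spaces; an amount is £ + digits + 2 decimals.
    return re.findall(r"([A-Za-z][A-Za-z ]*?)\s+(£[\d,]+\.\d\d)", text)

raw = ("DESCRIPTION CHARGE Water supply £9,949.02 Water fixed charge "
       "£220.52, Water waste disposal £7,628.73")
print(pair_charges(raw))
```

A rule like this is exactly the kind of brittle, per-template logic discussed later: it breaks as soon as the document format changes.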
Tesseract expects a page of text when it segments an image. The extracted balance summary part doesn’t match the visual order of the document.
Net total charges £0.00 £0.00 Cancelled charges from previous bill £18,684.28 £42.26
One trick we can use here is to tweak the --psm parameter and apply a different segmentation mode. By switching to --psm 6, we treat the content as a single uniform block of text and get a more structured result for the summary section.
$ tesseract bill.png stdout -l eng --psm 6
Mise credit £0.00 £0.00
Cancelled charges from previous bill £18,684.28
Net total charges £42.26
VAT
Training Tesseract 4.0
In order to use OCR for business document data capture, we need to set up rules and templates for each data field, and these vary with each document format. The process is tedious and expensive, and even a minor document alteration forces us to define new rules.
Tesseract can also be trained to recognize fonts or languages specific to your documents. Detailed documentation can be found on the Tesseract project site.
Even though Tesseract v4 significantly improves the performance and accuracy of the OCR engine, its deep learning model still faces several challenges:
- Low-resolution images or images with noise result in poor accuracy. Tesseract assumes that your input image is relatively clean. However, you may still need to pre-process the images yourself.
- The LSTM model improves context handling, but a general-purpose OCR engine doesn’t understand the differences between document types and cannot extract features specific to invoices or utility bills.
- The engine is limited by the data it was trained on. If your document contains a new font that Tesseract has not been trained on, it is unlikely that the OCR engine will be able to recognize the text.
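The pre-processing mentioned in the first point — binarization in particular — can be sketched in pure Python on a grayscale pixel grid. This is a toy illustration of the idea only; a real pipeline would use an image library such as Pillow or OpenCV, and often an adaptive rather than a fixed threshold:

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value (0-255) to pure black (0) or white (255)."""
    return [[255 if p >= threshold else 0 for p in row] for row in pixels]

# A tiny 2x3 "image": dark text strokes on a light background.
gray = [[34, 200, 180],
        [90, 250, 40]]
print(binarize(gray))  # -> [[0, 255, 255], [0, 255, 0]]
```

Forcing each pixel to pure black or white removes low-level noise and gives the character classifier crisp glyph edges to work with.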
How We Tackle the Challenges at Automation Anywhere
For enterprise document automation, OCR only scratches the surface. At Automation Anywhere, we developed IQ Bot, which uses OCR as the underlying technology and adds cognitive abilities on top of it. IQ Bot applies human logic to document patterns, extracting data fields and values the same way a human would.
- Image pre-processing on low-quality, low-resolution images. IQ Bot applies various image-processing operations, such as binarization and noise removal, before running the actual OCR.
- Context analysis on document types. The OCR-extracted text runs through another neural network, trained on specific document types and document context, for error correction.
- Human in the loop. The data fields and values extracted by the system can be verified by a human for fast feedback. A human worker trains the AI model to eliminate possible errors. The training process is as simple as a few mouse clicks, and business users without a computer science or machine learning background can manage it.
OCR, a technology that has been around for decades, has found new life in intelligent automation. By combining AI and OCR, IQ Bot automatically performs complex data extraction tasks and feeds the results into the business automation pipeline, extending the reach of automation beyond anything we’ve yet experienced.