Laying the Groundwork for Intelligent Automation — OCR

See how Tesseract can help with document automation.

by Wei Xiao · Apr. 11, 19 · Tutorial

While hovering my smartphone over a check for mobile deposit, I was struck by the question: “What technology is behind this?” After some googling, I found that one critical piece of the technology is Optical Character Recognition (OCR). OCR extracts machine-printed or handwritten fields from check images and converts them into text.

An OCR engine is a computer program that uses sets of parameters to discern characters in an image. One of the most popular OCR engines is Tesseract, which has been around since the 1980s. It was originally developed as proprietary software by Hewlett Packard Labs. In 2005, HP open sourced Tesseract. Since 2006, Google has been sponsoring the project and actively contributing to it. Tesseract is considered one of the most accurate open-source OCR engines.

Tesseract v4.0 adds a new OCR engine based on Long Short-Term Memory (LSTM) neural networks. LSTM is a special type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies. In use cases like natural language processing, an LSTM stores information learned from the previous context and applies that knowledge to understand the words that follow.

As the principal evangelist of Automation Anywhere, a Robotic Process Automation company, I decided to put on my developer’s hat and see how Tesseract can help with document automation.

Installing Tesseract 4.0 on Mac

Installing Tesseract 4.0 on a Mac is straightforward if you have Homebrew. In a Terminal window, type:

$ brew install tesseract

Verifying Tesseract Version

To verify Tesseract 4.0 is installed correctly, use the following command:

$ tesseract --version

tesseract 4.0.0

leptonica-1.78.0

  libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1

 Found AVX2

 Found AVX

 Found SSE

Using Tesseract

To test how Tesseract works, I used a sample utility bill, bill.png, shown below.

[Image: bill.png, the sample utility bill]

Run Tesseract from the command line and display the extracted text on standard output. bill.png is the input filename, and stdout in place of an output filename sends the text to standard output. The language (-l) is set to English. The page segmentation mode (--psm) tells Tesseract what kind of text layout to assume. Its default value is 3, which is fully automatic page segmentation without orientation and script detection.

$ tesseract bill.png stdout -l eng --psm 3

As the output shows, the top of the bill (service address, customer account information, and so on) is, for the most part, extracted correctly.

Usage information in the table is processed as plain text. No correlation is established between the data fields and their values.

DESCRIPTION CHARGE

Water supply £9,949.02

Water fixed charge £220.52,

Water waste disposal £7,628.73

Waste water fixed charge £886.01

Supplementary charge £0.00

Trade effluent disposal £0.00

Trade effluent fixed charge £0.00
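
One way to recover the pairing between fields and values from text like this is a small post-processing pass over the OCR output. The sketch below is only an illustration, not part of Tesseract: the parse_charges function and its regular expression are hypothetical, and they assume each charge sits on one line as a description followed by an amount in pounds.

```python
import re
from decimal import Decimal

# Assumed line shape: "<description> £<amount>", e.g. "Water supply £9,949.02".
# Trailing punctuation that OCR sometimes appends (such as a stray comma) is tolerated.
LINE_RE = re.compile(r"^(?P<label>.+?)\s+£(?P<amount>[\d,]+\.\d{2})\W*$")


def parse_charges(ocr_text):
    """Return a list of (description, Decimal amount) pairs found in OCR text."""
    charges = []
    for line in ocr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Drop thousands separators before converting to Decimal.
            amount = Decimal(match.group("amount").replace(",", ""))
            charges.append((match.group("label"), amount))
    return charges


sample = """DESCRIPTION CHARGE
Water supply £9,949.02
Water fixed charge £220.52,
Supplementary charge £0.00"""

for label, amount in parse_charges(sample):
    print(f"{label}: {amount}")
```

Lines that do not fit the assumed shape, such as the header row, are skipped. A rule like this is tied to one layout, which is exactly why template-based capture gets expensive as document formats multiply.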

Tesseract expects a page of text when it segments an image. As a result, the extracted balance summary doesn’t match the visual order of the document: labels and amounts come out separated from each other.

Net total charges

£0.00

£0.00

Cancelled charges from previous bill

£18,684.28

£42.26

One trick we can use here is to tweak the --psm parameter and apply a different segmentation mode. With --psm 6, Tesseract assumes the content is a single uniform block of text, which gives a more structured result for the summary part.

$ tesseract bill.png stdout -l eng --psm 6

Mise credit £0.00

£0.00

Cancelled charges from previous bill

£18,684.28

Net total charges

£42.26

VAT
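
To compare segmentation modes side by side, the two commands above can be scripted. This is a minimal sketch that assumes the tesseract binary is on your PATH and that bill.png is in the working directory. The helper names (build_tesseract_cmd, run_ocr, compare_modes) are made up for illustration; only the command-line flags come from the commands shown earlier.

```python
import subprocess


def build_tesseract_cmd(image_path, psm=3, lang="eng"):
    """Build the argument list for a Tesseract run that prints to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, psm=3, lang="eng"):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract executable is installed and on PATH.
    """
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def compare_modes(image_path, modes=(3, 6)):
    """Return a dict mapping each page segmentation mode to its OCR text."""
    return {psm: run_ocr(image_path, psm=psm) for psm in modes}


# Show the command that would be executed for --psm 6.
print(" ".join(build_tesseract_cmd("bill.png", psm=6)))
```

Calling compare_modes("bill.png") would return the text for modes 3 and 6, so the summary sections can be compared directly.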

Training Tesseract 4.0

To use OCR for data capture from business documents, we need to set up rules and templates for each data field, and these vary with each document format. The process is tedious and expensive. Even a minor document alteration requires new rules.

Tesseract can also be trained to recognize the fonts or languages in your documents. Detailed instructions can be found in the Tesseract training documentation.

Takeaways

Even though Tesseract v4 significantly improves the performance and accuracy of the OCR engine, its deep learning model still faces a lot of challenges.

  1. Low-resolution images or images with noise result in poor accuracy. Tesseract assumes that your input image is already reasonably clean, so you may need to pre-process the images first.
  2. The LSTM model improves the handling of context dependencies, but a general-purpose OCR engine doesn’t understand the differences between document types and cannot extract features specific to invoices or utility bills.
  3. The engine is limited by the data it was trained on. If your document contains a new font that Tesseract has not been trained on, it is unlikely that the OCR engine will be able to recognize the text.
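
As an illustration of the pre-processing mentioned in the first point, the sketch below binarizes a grayscale image with Otsu's method, one common way to pick a black/white threshold. It is a teaching example under stated assumptions: the image is a plain list of rows of 0-255 integers, and the function names are hypothetical. In practice an imaging library would do this work.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a flat list of 0-255 grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); maximize it.
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_variance:
            best_variance, best_threshold = between, level
    return best_threshold


def binarize(image):
    """Map each pixel to 0 (dark) or 255 (light) using the Otsu threshold."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[255 if value > threshold else 0 for value in row] for row in image]


# Tiny made-up image: dark "ink" pixels on a light background.
page = [[30, 200], [220, 40]]
print(binarize(page))
```

The binarized image would then be saved and passed to Tesseract in place of the original scan.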

How We Tackle the Challenges at Automation Anywhere

For enterprise document automation, OCR only scratches the surface. At Automation Anywhere, we developed IQ Bot, which uses OCR as the underlying technology and adds cognitive abilities to it. IQ Bot is skilled at applying human logic to document patterns and extracting data fields and values in the same way that a human would.

  • Image pre-processing on low-quality, low-resolution images. IQ Bot applies image processing operations such as binarization and noise removal before running the actual OCR.
  • Context analysis on document types. The OCR-extracted text runs through another neural network that has been trained on specific document types and document context for error correction.
  • Human in the loop. The data fields and values extracted by the system can be verified by a human for fast feedback. Human workers train the AI model to eliminate possible errors. The training process is as simple as a few mouse clicks, and business users without a computer science or machine learning background can manage it.

Conclusion

OCR, a technology that has been around for decades, has found new life in intelligent automation. By combining AI and OCR, IQ Bot automatically performs complex data extraction tasks and feeds the results into the business automation pipeline. This extends the proficiency of automation beyond anything we’ve yet experienced.
