Reading Text Automatically

Reading typed text automatically requires a high-resolution scan. Luckily, it only takes a few lines of code, even for a language with accents.

by Arthur Charpentier · Mar. 25, 17 · Opinion

It is now very easy to automatically read the text found in a PDF file. For instance, consider the program of the conference we had yesterday and today in Rennes.

> library(pdftools)
> scan_pdf <- pdf_text("http://crem.univ-rennes1.fr/Documents/Docs_sem_divers/2017_03_10-11_JJD/JDD_prog.pdf")
> cat(scan_pdf)
Journées Jeunes Docteurs
Programme du jeudi 9 mars 2017
Faculty of Economics - Rennes - Amphi Henri Krier
9h- 9h30 - Accueil
9h30-10h15 :      Présentation du CREM, de la faculté et des activités de recherche liées du ou laboratoire
10h15-10h50 :     Emmanuel LORENZON (Université de Bordeaux, GREThA)
Collusion with a rent seeking agency in sponsored search auctions
10h50-11h25 :     Julien BERTHOUMIEU (Université de Bordeaux, GREThA)
The Impact of “At-the-Border” and “Behind-the-Border” Policies on Cost-Reducing Research
and Development
Co-écrit avec Antoine Bouët

As you can see, it is working well, even in French, where we have those weird letters (with accents). Here, it is working well because the PDF is vectorized: it was generated properly from an office suite, so the text is stored as text rather than as an image.

But sometimes, we only have a scanned version of a letter, or just a picture with some typed text. I will not cover handwriting here because it is much more complex.
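
For a PDF, a quick way to tell which case applies is to check whether pdf_text() returned any characters at all. The sketch below assumes that a page holding only a scanned image comes back as an empty or whitespace-only string. has_text_layer() is a hypothetical helper, not a pdftools function, and the two-element vector only stands in for real pdf_text() output.

```r
# Hypothetical helper (not part of pdftools): given the character vector
# returned by pdftools::pdf_text(), one element per page, flag the pages
# that contain at least `min_chars` non-whitespace characters.
has_text_layer <- function(pages, min_chars = 1) {
  n_chars <- nchar(gsub("[[:space:]]", "", pages))
  n_chars >= min_chars
}

# Stand-in for pdf_text() output: page 1 has text, page 2 is image-only
# (assumed to come back as whitespace).
pages <- c("Programme du jeudi 9 mars 2017\n9h- 9h30 - Accueil\n", " \n")
print(has_text_layer(pages))
```

Pages flagged FALSE would be the ones to send through OCR instead.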

The other day, my friend Fleur showed me a picture and some very simple lines of code:

> library('tesseract')
> pic1="https://f.hypotheses.org/wp-content/blogs.dir/253/files/2017/03/rapport-expert-fr.png"
> text_fr <- ocr(pic1, engine = tesseract("fra"))
> cat(text_fr)
Conformité
Risque vérifié Conforme
Causes et circonstances
Evénement Catastrophe Naturelle
Cause Inondation
Garanties CATNAT
Date sinistre 03/10/2015

It looks like we've been able to extract typed text from a picture! I wanted to check for myself. I have to admit, first of all, that installation on a Linux machine is tricky. One first has to install leptonica and then follow some guidelines to install tesseract (see also Artem's advice). It took me some time, but I was able to install the package.

The first important step is to download the trained data for French (because the text in my picture is in French):

> library('tesseract')
> tesseract_download("fra")

Then, I tried with the picture that Fleur sent me (the picture was embedded in the body of the message):

> pic2="https://f.hypotheses.org/wp-content/blogs.dir/253/files/2017/03/unnamed.png"
> text_fr <- ocr(pic2, engine = tesseract("fra"))
> cat(text_fr)
Conformité
nuque vérlfli Canforme
Causes et circonstances
Evénement Catastmphe Naturelle
causa Inondation
GIrllfllu CATNAÏ
nm slnlflu 03/10/2015

Clearly, something went wrong here. When I got that output, I thought that I had not set up the French language data properly. But that was not the cause. As described in that post (in French), the picture has to be clean for the text to be read properly.
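
To put a number on how wrong the second output is, the lines of the two outputs can be compared with the edit distance from base R's adist(). This is only a sketch: the two vectors are copied by hand from the outputs above, and the error rate at the end is one possible summary, not something the tesseract package reports.

```r
# Lines copied from the two OCR outputs shown above
clean <- c("Conformité",
           "Risque vérifié Conforme",
           "Causes et circonstances",
           "Evénement Catastrophe Naturelle",
           "Cause Inondation",
           "Garanties CATNAT",
           "Date sinistre 03/10/2015")
noisy <- c("Conformité",
           "nuque vérlfli Canforme",
           "Causes et circonstances",
           "Evénement Catastmphe Naturelle",
           "causa Inondation",
           "GIrllfllu CATNAÏ",
           "nm slnlflu 03/10/2015")

# adist() returns the matrix of Levenshtein distances between every pair;
# the diagonal holds the distance between matching lines.
edits <- diag(adist(clean, noisy))
print(data.frame(clean, noisy, edits))

# Share of characters that would need editing (an assumed summary measure)
print(sum(edits) / sum(nchar(clean)))
```

Lines with a distance of zero were read identically in both pictures. The others show where the second picture misled the engine.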

And actually, if we zoom in on the two pictures, the difference shows: the second one has a lower resolution than the first one, the one Fleur used to show me that package. If you don't see the difference, zoom in further and look more carefully.

So it is necessary to have a high-resolution scan of the typed text. And you have to admit that it is awesome.

The good thing is that I have to work with a judge in France to assess the quality of experts. And since most of the reports are typed and then scanned, I am glad to have such a function. I just have to make sure that the resolution is high enough.
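
One way to check that without looking at every scan, assuming the word-level confidence reported by tesseract drops on low-resolution pictures, is to summarise the confidence column returned by ocr_data(). The helper name, the threshold of 80, and the small data frame (made-up values standing in for real ocr_data() output) are all assumptions for illustration.

```r
# Hypothetical helper: `words` is a data frame shaped like the result of
# tesseract::ocr_data(), with one row per word and a `confidence` column
# on a 0-100 scale. The default threshold is an arbitrary assumption.
needs_rescan <- function(words, threshold = 80) {
  mean(words$confidence) < threshold
}

# Made-up values standing in for real output, which would come from e.g.
# words <- tesseract::ocr_data(pic2, engine = tesseract::tesseract("fra"))
words <- data.frame(word = c("nuque", "Canforme"),
                    confidence = c(40, 65))
print(needs_rescan(words))
```

A scan flagged this way would be a candidate for rescanning at a higher resolution before its text is trusted.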

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
