Reading Text Automatically

It is necessary to have a scan of a typed text with high resolution. Luckily, this can be easy to do with some code — even in a language with accents.

It is now very easy to read (automatically) some text that can be found in a PDF file. For instance, consider the program of the conference we had yesterday – and today – in Rennes.

> library(pdftools)
> scan_pdf <- pdf_text("http://crem.univ-rennes1.fr/Documents/Docs_sem_divers/2017_03_10-11_JJD/JDD_prog.pdf")
> cat(scan_pdf)
Journées Jeunes Docteurs
Programme du jeudi 9 mars 2017
Faculty of Economics - Rennes - Amphi Henri Krier
9h- 9h30 - Accueil
9h30-10h15 :      Présentation du CREM, de la faculté et des activités de recherche liées du ou laboratoire
10h15-10h50 :     Emmanuel LORENZON (Université de Bordeaux, GREThA)
Collusion with a rent seeking agency in sponsored search auctions
10h50-11h25 :     Julien BERTHOUMIEU (Université de Bordeaux, GREThA)
The Impact of “At-the-Border” and “Behind-the-Border” Policies on Cost-Reducing Research
and Development
Co-écrit avec Antoine Bouët

As you can see, it works well, even in French, where we have those weird letters (with accents). Here, it works well because the PDF is vectorized: it was generated properly by an office suite.
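Since `pdf_text()` returns a character vector with one string per page, the extracted program can be split into lines for further processing. Here is a minimal sketch in base R, assuming that layout; `pages_to_lines()` is a hypothetical helper (not part of pdftools) and `sample_pages` is a stand-in for the real output.

```r
# pages_to_lines() is a hypothetical helper, not a pdftools function.
# It takes a character vector with one element per page and returns
# one row per line of text, with the page number it came from.
pages_to_lines <- function(pages) {
  lines_per_page <- strsplit(pages, "\n", fixed = TRUE)
  data.frame(
    page = rep(seq_along(pages), lengths(lines_per_page)),
    text = trimws(unlist(lines_per_page)),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the output of pdf_text(): one string per page.
sample_pages <- c(
  "Journées Jeunes Docteurs\nProgramme du jeudi 9 mars 2017",
  "9h- 9h30 - Accueil"
)
program <- pages_to_lines(sample_pages)
print(program)
```

With the real file, the call would be `pages_to_lines(scan_pdf)`.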

But sometimes we only have a scanned version of a letter, or just a picture with some typed text. I will not discuss handwriting because it is much more complex.

The other day, my friend Fleur showed me a picture and a few very simple lines of code:

> library('tesseract')
> pic1="https://f.hypotheses.org/wp-content/blogs.dir/253/files/2017/03/rapport-expert-fr.png"
> text_fr <- ocr(pic1, engine = tesseract("fra"))
> cat(text_fr)
Risque vérifié Conforme
Causes et circonstances
Evénement Catastrophe Naturelle
Cause Inondation
Garanties CATNAT
Date sinistre 03/10/2015

It looks like we’ve been able to extract typed text from a picture! I wanted to check for myself. I have to admit, first of all, that installation on a Linux machine is tricky. One has to install leptonica first and then follow some guidelines to install tesseract (see also Artem's advice). It took me some time, but I was able to install the package.

The first important step is to download the trained language data for French (because the text in my picture is in French):

> library('tesseract')
> tesseract_download("fra")
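The download only has to happen once. As a sketch, the step can be wrapped so that the French data is fetched only when it is missing; `tesseract_info()` reports the languages already installed. `needs_download()` and `ensure_language()` are hypothetical helper names, and the language list in the last line is purely illustrative.

```r
# needs_download() and ensure_language() are hypothetical helper names.
needs_download <- function(lang, available) {
  !(lang %in% available)
}

ensure_language <- function(lang) {
  # tesseract_info()$available lists the language data already installed.
  installed <- tesseract::tesseract_info()$available
  if (needs_download(lang, installed)) {
    tesseract::tesseract_download(lang)
  }
  # Return an engine for that language, ready to pass to ocr().
  tesseract::tesseract(lang)
}

# Usage (requires the tesseract package and, on first run, network access):
# engine_fr <- ensure_language("fra")
# text_fr <- tesseract::ocr(pic1, engine = engine_fr)

# The check itself, on an illustrative list of installed languages:
print(needs_download("fra", c("eng", "osd")))
```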

Then I tried with the picture that Fleur sent me (the picture was embedded in the body of the message):

> pic2="https://f.hypotheses.org/wp-content/blogs.dir/253/files/2017/03/unnamed.png"
> text_fr <- ocr(pic2, engine = tesseract("fra"))
> cat(text_fr)
nuque vérlfli Canforme
Causes et circonstances
Evénement Catastmphe Naturelle
causa Inondation
GIrllfllu CATNAÏ
nm slnlflu 03/10/2015

Clearly, something went wrong here. When I got that output, I thought that I had not set up the function properly. But that was not the problem. As described in that post (in French), the picture has to be clean to be read properly.

And actually, if we zoom in on our picture (the first one, used by Fleur to show me that package), we have:

[Image: zoom on the first picture]

While for the second one, with a lower resolution, we have:

[Image: zoom on the second picture]

If you don’t see the difference, look more carefully:

[Images: closer zoom on the first picture, then on the second one]

It is necessary to have a high-resolution scan of the typed text. And you have to admit that it is awesome.

The good thing is that I have to work with a judge in France to assess the quality of experts. And since most of the reports are typed and then scanned, I am glad to have such a function. I just have to make sure that the resolution is high enough.
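One possible way to check this automatically, sketched here rather than taken from the original workflow, is to look at the per-word confidence scores that `ocr_data()` from the tesseract package returns alongside the text. `summarise_confidence()` is a hypothetical helper, the threshold of 80 is an arbitrary assumption, and the `toy` values are made up for illustration (they are not real OCR output).

```r
# summarise_confidence() is a hypothetical helper. It expects a data frame
# with a numeric `confidence` column (0-100), like the one returned by
# tesseract::ocr_data(). The default threshold of 80 is an arbitrary choice.
summarise_confidence <- function(words, threshold = 80) {
  mean_conf <- mean(words$confidence)
  list(
    mean_confidence = mean_conf,
    share_low = mean(words$confidence < threshold),
    looks_reliable = mean_conf >= threshold
  )
}

# Made-up values for illustration only, not real OCR output.
toy <- data.frame(
  word = c("Causes", "et", "circonstances", "nm"),
  confidence = c(95, 92, 90, 35),
  stringsAsFactors = FALSE
)
res <- summarise_confidence(toy)
print(res)

# With a real scan (requires the tesseract package):
# words <- tesseract::ocr_data(pic2, engine = tesseract::tesseract("fra"))
# summarise_confidence(words)
```

A report whose summary comes back as unreliable could then be set aside for a rescan at higher resolution.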

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
