How to Recognize Dates in PDFs

Learn what goes into automating the recognition of dates within PDF documents to hack your accounting functions.

One of the “pleasures” of having your own business is dealing with accounting.

Now, to survive, I tried a few things like:

One boring thing to do is to organize the receipts and invoices I had to pay: travel expenses, books, conference fees, etc. You get the idea.

Why Do I Need Software to Find Dates in PDF Files?

Ideally, I would like to have a tool that matches the receipts with lines from my bank statements and generates some report that could make my business consultant very happy.

The first step is a tool that, given a bunch of PDFs, can associate dates with them. Once I have the dates, I know which lines of the bank statement to consider, and I can match them (perhaps also considering the amount).

Step 1: Get Text Out of PDFs

There are two possible situations: The PDF contains the text, or the PDF is a simple image and we need to recognize the text using OCR techniques.

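In the first case, things are relatively easy. We use a PDF library to extract the text (the classes used below, such as PdfReader and PdfContentStreamProcessor, come from iText 5):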

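import com.itextpdf.text.pdf.PdfName
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.ContentByteUtils
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor
import java.io.File
import java.io.FileInputStream

// MyTextRenderListener and PieceOfText are not part of iText:
// they are classes of this project and are not shown here.
fun getTextInPdf(fileName: String): List<PieceOfText> {
    val reader = PdfReader(FileInputStream(File(fileName)))
    val listener = MyTextRenderListener()
    val processor = PdfContentStreamProcessor(listener)
    try {
        // iText page numbers are 1-based
        for (pageNumber in 1..reader.numberOfPages) {
            val resources = reader.getPageN(pageNumber).getAsDict(PdfName.RESOURCES)
            // Feed the raw content stream of the page to the listener
            processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNumber), resources)
        }
    } finally {
        reader.close()
    }
    return listener.process()
}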

But, of course, real life always brings surprises. Some PDFs contain one block per paragraph (great!) while others contain one block per letter. That means we have to look at the position of every single letter and check whether it is contiguous to another one. If it is, we need to merge them to get words.
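The merging step can be sketched roughly as follows. This is a minimal illustration, not the code of the actual tool: the Fragment type, the mergeContiguous function, and the two tolerance parameters are hypothetical names, and the sketch assumes that fragments arrive in reading order with their position already known.

```kotlin
import kotlin.math.abs

// Hypothetical type for illustration: a piece of text with the left edge (x),
// the baseline (y) and the width of its bounding box, in page units.
data class Fragment(val text: String, val x: Float, val y: Float, val width: Float)

// Two fragments are considered contiguous when they sit on (almost) the same
// baseline and the second one starts (almost) where the first one ends.
fun isContiguous(previous: Fragment, next: Fragment, maxGap: Float, maxBaselineShift: Float): Boolean {
    val gap = next.x - (previous.x + previous.width)
    return abs(next.y - previous.y) <= maxBaselineShift && abs(gap) <= maxGap
}

// Merges runs of contiguous fragments. Assumes the input is in reading order.
// The default tolerances are arbitrary and would need tuning on real documents.
fun mergeContiguous(
    fragments: List<Fragment>,
    maxGap: Float = 1.0f,
    maxBaselineShift: Float = 1.0f
): List<Fragment> {
    val merged = mutableListOf<Fragment>()
    for (fragment in fragments) {
        val previous = merged.lastOrNull()
        if (previous != null && isContiguous(previous, fragment, maxGap, maxBaselineShift)) {
            // Extend the previous fragment so that it also covers the new one
            val rightEdge = fragment.x + fragment.width
            merged[merged.size - 1] = previous.copy(
                text = previous.text + fragment.text,
                width = rightEdge - previous.x
            )
        } else {
            merged.add(fragment)
        }
    }
    return merged
}
```

With one fragment per letter, four adjacent digits on the same baseline would come out as a single fragment, while a letter further to the right or on another line would stay separate.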

If the PDF does not contain the text, we need to employ OCR techniques. I used Tesseract, which is written in C++. To use it on the JVM, I used the javacpp-presets.

Once we have the blocks of text, we need to split them into sequences of words. This part is relatively easy, but not as trivial as expected, because we need to keep track of the exact position of every single word. So when splitting a block into words, some math is involved to find the bounds of each word.
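To give an idea of the math involved, here is a rough sketch. It assumes, for simplicity, that every character in a block has the same width, which is only an approximation for proportional fonts; a real implementation would use the glyph widths reported by the PDF library. The Word type and the splitIntoWords function are hypothetical names used for illustration.

```kotlin
// Hypothetical type for illustration: a word with its horizontal bounds.
data class Word(val text: String, val x: Float, val width: Float)

// Splits a block of text into words and estimates the bounds of each word.
// Assumption: all characters of the block share the same width.
fun splitIntoWords(blockText: String, blockX: Float, blockWidth: Float): List<Word> {
    if (blockText.isEmpty()) return emptyList()
    val charWidth = blockWidth / blockText.length
    return Regex("""\S+""").findAll(blockText).map { match ->
        Word(
            text = match.value,
            // Offset of the first character of the word inside the block
            x = blockX + match.range.first * charWidth,
            width = match.value.length * charWidth
        )
    }.toList()
}
```

For a block containing "12 Apr 2016" that starts at x = 100 and is 110 units wide, each character is estimated to be 10 units wide, so "Apr" would start at x = 130 and be 30 units wide.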

Step 2: Recognize Dates

We now have a bunch of words. We need to look at those and find sequences of words corresponding to dates.

Now, dates can come in all sorts of weird formats. Consider also that I have receipts in English (UK & US), French, German, and Italian.

Let’s see some examples.

First, the classical day-month-year order (DD/MM/YYYY), with various separators:

  • 15/04/2016
  • 18-06-2016
  • 01.04.2016

Sometimes you find the year first (YYYY-MM-DD) instead:

  • 2016-05-13

The month could be in letters, in any language:

  • May 7, 2016
  • 18 June 2016
  • 22 Aprile 2016
  • 1 juillet 2016
  • avril 12, 2016

It could also be abbreviated:

  • 2016 Apr 14
  • 12 Apr 2016

Of course, some dates cannot be recognized definitively because Americans at some point decided it was a good idea to invert the month and the day in dates. So if I read 2/3/2016, it could be either the 2nd of March or the 3rd of February. It means that some dates are ambiguous.
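A possible way to handle these formats, ambiguity included, is to return every plausible interpretation of a candidate string instead of a single date. The sketch below is only an illustration under a few assumptions: the names candidateDates, monthOf, and safeDate are hypothetical, the month table is far from complete, abbreviations are handled by simple prefix matching, and two-digit years are ignored.

```kotlin
import java.time.DateTimeException
import java.time.LocalDate

// Month names in English, French, German and Italian (illustrative, not exhaustive).
val monthNames: Map<String, Int> = listOf(
    listOf("january", "janvier", "januar", "gennaio"),
    listOf("february", "février", "februar", "febbraio"),
    listOf("march", "mars", "märz", "marzo"),
    listOf("april", "avril", "aprile"),
    listOf("may", "mai", "maggio"),
    listOf("june", "juin", "juni", "giugno"),
    listOf("july", "juillet", "juli", "luglio"),
    listOf("august", "août", "agosto"),
    listOf("september", "septembre", "settembre"),
    listOf("october", "octobre", "oktober", "ottobre"),
    listOf("november", "novembre"),
    listOf("december", "décembre", "dezember", "dicembre")
).withIndex().flatMap { (index, names) -> names.map { it to index + 1 } }.toMap()

// Returns the month number for a full name, or for an abbreviation of at
// least three letters that is a prefix of names of exactly one month.
fun monthOf(token: String): Int? {
    val key = token.lowercase().trimEnd('.')
    val exact = monthNames[key]
    if (exact != null) return exact
    if (key.length < 3) return null
    return monthNames.filterKeys { it.startsWith(key) }.values.toSet().singleOrNull()
}

// Builds a date, or returns null when the combination is not a valid date.
fun safeDate(year: Int, month: Int, day: Int): LocalDate? =
    try { LocalDate.of(year, month, day) } catch (e: DateTimeException) { null }

val numericDate = Regex("""(\d{1,4})[/.\-](\d{1,2})[/.\-](\d{1,4})""")
val fourDigits = Regex("""\d{4}""")
val oneOrTwoDigits = Regex("""\d{1,2}""")

// Returns all plausible interpretations: an empty list means "not a date",
// more than one element means "ambiguous".
fun candidateDates(text: String): List<LocalDate> {
    val trimmed = text.trim()
    val match = numericDate.matchEntire(trimmed)
    if (match != null) {
        val (a, b, c) = match.destructured
        return when {
            // Year first: 2016-05-13
            a.length == 4 -> listOfNotNull(safeDate(a.toInt(), b.toInt(), c.toInt()))
            // Year last: try day-month first, then month-day
            c.length == 4 -> listOfNotNull(
                safeDate(c.toInt(), b.toInt(), a.toInt()),
                safeDate(c.toInt(), a.toInt(), b.toInt())
            ).distinct()
            else -> emptyList() // two-digit years are not handled in this sketch
        }
    }
    // Textual month: expect exactly three tokens, such as "May 7, 2016"
    val tokens = trimmed.split(Regex("""[\s,]+""")).filter { it.isNotEmpty() }
    if (tokens.size != 3) return emptyList()
    val monthIndex = tokens.indexOfFirst { monthOf(it) != null }
    if (monthIndex < 0) return emptyList()
    val month = monthOf(tokens[monthIndex]) ?: return emptyList()
    val numbers = tokens.filterIndexed { index, _ -> index != monthIndex }
    val year = numbers.firstOrNull { it.matches(fourDigits) } ?: return emptyList()
    val day = numbers.firstOrNull { it.matches(oneOrTwoDigits) } ?: return emptyList()
    return listOfNotNull(safeDate(year.toInt(), month, day.toInt()))
}
```

Under this scheme, candidateDates("15/04/2016") yields a single date, because 15 cannot be a month, while candidateDates("2/3/2016") yields both the 2nd of March and the 3rd of February and leaves the choice to a later step.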

Step 3: Decide Which Date Is the Date of the Order

Now, an invoice can contain many different dates. Consider this example:

[Image: selection_260]

One date is the date of the order (commande in French); another one is the date of registration of the company. The latter is definitely a date we are not interested in.

There are other examples:

[Image: selection_263]

Or again:

[Image: selection_264]

[Image: selection_265]

How do we handle these cases?

Well, we use a series of heuristics which seem to work in most cases. One of the most powerful is looking for meaningful words near the date, such as facture/invoice/fattura or order/commande/ordine. We can also use the size of the font and the position of the date on the page.
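The keyword heuristic could look roughly like this. It is a simplified sketch and not the actual implementation: PositionedText and pickMostLikelyDate are hypothetical names, only keyword proximity is considered (font size and position on the page are left out), and the distance is measured between the top-left corners of the pieces of text.

```kotlin
import kotlin.math.hypot

// Hypothetical type for illustration: a piece of text and its position on the page.
data class PositionedText(val text: String, val x: Float, val y: Float)

// Words that suggest the date next to them is the invoice or order date.
val dateKeywords = setOf("facture", "invoice", "fattura", "order", "commande", "ordine")

// Picks the date closest to any keyword. Falls back to the first date
// when the page contains no keyword at all.
fun pickMostLikelyDate(dates: List<PositionedText>, words: List<PositionedText>): PositionedText? {
    val anchors = words.filter { it.text.lowercase().trim(':', '.', ',') in dateKeywords }
    if (anchors.isEmpty()) return dates.firstOrNull()
    return dates.minByOrNull { date ->
        // Distance from this date to the nearest keyword
        anchors.minOf { anchor -> hypot(anchor.x - date.x, anchor.y - date.y) }
    }
}
```

A date printed right after "Commande:" would win over a date in the footer next to the company registration details, because the footer date is much farther from any keyword.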

Do these tricks always work? No.

Do they work pretty often? Yes, they do.

Conclusions

Let me first share with you my enthusiastic appreciation for the date format chosen by Americans. I find it a really good idea and not at all idiotic and unfortunate. In practice, it is a big source of pain and the main cause of errors.

Most of my receipts are electronic and contain the text. In the few cases in which I need to use OCR, it seems to work decently enough.

In the end, I am quite satisfied with this prototype because, after some tuning, it seems to guess correctly over 90% of the time. Not bad. Not bad at all.

Oh, well, I guess I have to go back to my accounting.

Any idea of things you would like to automate to make your life easier?

Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.