Extract Information From PDF Invoice

In this writing, I will explain the way I used to parse PDF invoice files using regex and PDFBox

Tho Luong

CORE ·

Updated Jul. 06, 20 · Presentation

Likes (3)

Comment

Save

8.6K Views

It's pretty easy to write code to generate PDF files but pretty hard to parse and get back information from it because PDF is complicated. Unfortunately, it's sometimes the input of our system which needs to parse and model before doing further logic on it.

If the template is various, it's nearly impossible to write one abstract parser to understand and extract all information we need such as Order number, quantity, amount, vendor id. But if the number of templates is fixed, yes there's a way to achieve that with PDF box and regex.

In this writing, I will explain the way I used to parse the PDF file below. Hopefully, it can be applied to yours as well.

Check out my code here TestInvoice.java

Extraction requirements

Need to get the following information from the above file:

PO number
Date of the PO
Vendor
{ Barcode, Description, Quantity } in the table

Libs

As you may know, PDF stores strings and characters separately with absolute positioning. Meaning even 2 words look like belong to the same string but the raw data we receive can be a list of concrete strings with position. For example, the result when reading the word Purchase can be:

    JSON
   
xxxxxxxxxx

[{
    { text: "ch", x: 11, y: 4, w: 15, h: 10 },
    { text: "Pur", x: 0, y: 3, w: 10, h: 10},
    { text: "ase", x: 27, y: 4, w: 12, h: 10 }
}]

The difficulty is:

They're not the same y
The order of strings are not the same as they appear in PDF viewers

We need a lib to reorder pieces of words and concatenate them if needed. The lib I use is PDFLayoutTextStripper which helps to transform PDF to plain text but pretty well keep the original layout. Below is the sample output:

    Plain Text
   
xxxxxxxxxx

                                                                                                *PO-003847945*                                           
                                                                                                                                                         
                                                                                      Page.........................: 1    of    1                        
                                                                                                                                                         
                Address...........:     Peera  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 3371                                                                                                     
                                        Dohe,                                      PO-003847945                                                          
                                        QAT                                       TL-00074             EOCE  EELA ALMANNAI   W.L.L.                      
                                                                                                                                                         
                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  
                                                                                                                                                         
               100225                Rawdat  Eqdeem                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               EOCE EELA ALMANAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     
                                                                                                                                                         
                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          
                                                                                                                                                         
                                                1.25                                                                                                                        
(truncated...)

Using Regex

After having PDF content in a single string, we can split it into lines and loop through them, using regex to find the desired information.

Match PO Number

Observing that the PO number is the first substring with the following format

    Plain Text
   
xxxxxxxxxx

PO-{list of digits}

we also see that the PO number stays alone, far from other words so we can make the pattern stronger by adding suffix and prefix spaces. The better pattern should be

    Plain Text
   
xxxxxxxxxx

{at least 5 spaces}PO-{list of digits}{at least 5 spaces}

turn this into Java Regex pattern:

    Plain Text
   
xxxxxxxxxx

\\s{5,}(PO\\-\\d+)\\s{5,}

Match PO Date and Vendor

PO date is the first substring match following pattern

    Plain Text
   
xxxxxxxxxx

Date{list of dots}{anything but not a digit e.g. space}{1 or 2 digits/1 or 2 digits/4 digits}

In Regex:

    Plain Text
   
xxxxxxxxxx

Date\\.+[^\\d]*(\\d+\\/\\d+\\/\\d{4})

with a similar observation we have regex for vendor:

    Plain Text
   
xxxxxxxxxx

Vendor\\s*\\:\\s*([^\\s]+)

Read Table Content

To read table content while looping through all the lines in PDF file, we need to know the following signals:

The signal of the table header line to turn reading mode to reading-table-content. Also, once we know the header line we know bounds to trap column content.
The signal of the first line that not belongs to the table to stop reading-table-content mode otherwise it will keep adding wrong content into the table

Check out my code here TestInvoice.java

There're some important points in this implementation:

I only use some headers not all for header line detection. The reason is that's strong enough for identifying and the Discount header does not stay in the same line as others
Description is multiple lines cell, its content spreads from the line with barcode and before the next barcode line

With these observations we need to find barcode and use it as the anchor cell for the row.

A More Accurate Way to Detect PO Number

Many values in forms is with their labels e.g. Po Number: PO-1234422312446. It will give us higher accuracy if we can find data label and data value together. That's what I applied to find PO Date and Vendor above. But some of the value have the label and value are in the vertical alignment. For example:

    Plain Text
   
xxxxxxxxxx

              PO Number
           PO-1234422312446

For this layout, we can first, detect the position of the label, then scan the next lines at the same x-range as label with tolerance to find the first non-empty value. That should be the value we're finding. The implementation is as below:

    Java
   
xxxxxxxxxx

String poNumberLabel = "PO Number";
String poNumber = null;
boolean foundPONumberLabel = false;
int spaceTolerance = 5;
for (String line : lines) {
    // ...
    // detect PO Number
    if (poNumber != null) {
        continue;
    }
    int start = line.indexOf(poNumberLabel);
    if (start >= 0) {
        foundPONumberLabel = true;
    }
    int end = start + poNumberLabel.length();
    if (foundPONumberLabel) {
        poNumber = match(line.substring(start - spaceTolerance, end + spaceTolerance), "po-regex-here");
    }
}

Design for Multi-Template Parsers

If your system has several PDF templates, the suggested pattern to manage all parsers is factory pattern, the design is as below:

Interfaces

    Java
   
xxxxxxxxxx

class ParsedContent {
    // e.g.
    // private string poNumber;
    // private string date;
    // private Row[] rows;
}
interface Parser {
    public ParsedContent parse(String[] lines);
}
interface ParserFactory {
    public Parser get(String[] lines); // detect Parser from its content
}

Implementation

    Java
   
xxxxxxxxxx

abstract class AbstractParser implements Parser {
    /**
     * Check and determine if the input lines are acceptable for this parser
     */
    protected boolean isValid(String[] lines);
}
class Template1Parser implements AbstractParser {
    // ...
}
class Template2Parser implements AbstractParser {
    // ...
}
class ParserFactoryImpl implements ParserFactory {
    private Parser[] parsers = new Parser[] {
        new Template1Parser(),
        new Template2Parser()
    };
    public Parser get(String[] lines) {
        Parser retVal = null;
        for (Parser p : this.parsers) {
            if (p.isValid(lines)) {
                if (retVal != null) {
                    throw new Found2ParsersException();
                }
                retVal = p;
            }
        }
        if (retVal == null) {
            throw new ParserNotFoundException();
        }
        return retVal;
    }
}

Usage:

    Java
   
xxxxxxxxxx

ParserFactory pf = new ParserFactoryImpl();
// read pdf file and store content in String[] lines
ParsedContent content = pf.get(lines).parse(lines);

Source code

Check out my code here TestInvoice.java

PDF Database Plain text Extract

Opinions expressed by DZone contributors are their own.

Related

Trending

Extract Information From PDF Invoice

In this writing, I will explain the way I used to parse PDF invoice files using regex and PDFBox

Extraction requirements

Libs

Using Regex

Match PO Number

Match PO Date and Vendor

Read Table Content

A More Accurate Way to Detect PO Number

Design for Multi-Template Parsers

Interfaces

Implementation

Usage:

Source code

Related

Partner Resources