Extract Information From PDF Invoice

In this article, I explain how I parsed PDF invoice files using regex and PDFBox.

It's fairly easy to write code that generates PDF files, but much harder to parse them and get the information back out, because the PDF format is complicated. Unfortunately, PDFs are sometimes the input to our system, which then has to parse and model them before applying any further logic.

If the templates vary without limit, it's nearly impossible to write one abstract parser that understands and extracts everything we need, such as order number, quantity, amount, and vendor ID. But if the number of templates is fixed, there is a way to do it with PDFBox and regex.

In this article, I explain how I parsed the PDF file below. Hopefully, the approach can be applied to yours as well.

Check out my code here TestInvoice.java

Extraction requirements

We need to extract the following information from the file above:

  • PO number
  • Date of the PO
  • Vendor
  • { Barcode, Description, Quantity } in the table

Libs

As you may know, PDF stores strings and characters separately, each with an absolute position. This means that even when two words look like they belong to the same string, the raw data we receive can be a list of separate string fragments with positions. For example, reading the word Purchase can yield:

JSON

[
    { text: "ch", x: 11, y: 4, w: 15, h: 10 },
    { text: "Pur", x: 0, y: 3, w: 10, h: 10 },
    { text: "ase", x: 27, y: 4, w: 12, h: 10 }
]

The difficulties are:

  • The fragments do not all share the same y coordinate
  • The order of the fragments is not the order in which they appear in a PDF viewer
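
To make the reordering problem concrete, here is a minimal sketch of the idea: group fragments whose y values are close, then sort each group by x. It is only an illustration under simple assumptions, not how any particular library is implemented; the names FragmentOrdering, Fragment, toLines, and yTolerance are hypothetical, and it does not insert spaces based on the gaps between fragments.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class FragmentOrdering {
    // One positioned piece of text, mirroring the text/x/y fields of the sample above.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups fragments into lines (y within yTolerance of the line's first fragment),
    // then concatenates each line's fragments from left to right.
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : sorted) {
            if (lines.isEmpty() || f.y - lineY > yTolerance) {
                lines.add(new ArrayList<>()); // start a new line
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("ch", 11, 4));
        fragments.add(new Fragment("Pur", 0, 3));
        fragments.add(new Fragment("ase", 27, 4));
        System.out.println(toLines(fragments, 2.0));
    }
}
```

A real layout-preserving stripper also has to decide where spaces go, which this sketch leaves out.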

We need a library that reorders the word fragments and concatenates them where needed. The one I use is PDFLayoutTextStripper, which transforms a PDF into plain text while preserving the original layout quite well. Below is a sample of its output:

Plain Text


                                                                                                 *PO-003847945*                                           

                                                                                      Page.........................: 1    of    1                        





                Address...........:     Peera  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 3371                                                                                                     
                                        Dohe,                                      PO-003847945                                                          
                                        QAT                                       TL-00074             EOCE  EELA ALMANNAI   W.L.L.                      

                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  


               100225                Rawdat  Eqdeem                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               EOCE EELA ALMANAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     


                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          

                                                1.25                                                                                                                        
(truncated...)



Using Regex

Once we have the PDF content as a single string, we can split it into lines and loop through them, using regex to find the information we want.

Match PO Number

Observe that the PO number is the first substring with the following format:

Plain Text

PO-{list of digits}



We can also see that the PO number stands alone, far from other words, so we can make the pattern stricter by requiring leading and trailing spaces. A better pattern is:

Plain Text

{at least 5 spaces}PO-{list of digits}{at least 5 spaces}



Turned into a Java regex pattern (written as it appears inside a Java string literal, hence the doubled backslashes):

Plain Text

\\s{5,}(PO\\-\\d+)\\s{5,}
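
As a quick check of this pattern, the following sketch applies it to two lines shaped like the sample output. The class and method names (PoNumberMatcher, findPoNumber) are hypothetical and the sample text is shortened for readability; only the pattern itself comes from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoNumberMatcher {
    // At least 5 whitespace characters on both sides of PO-{digits}
    private static final Pattern PO_NUMBER = Pattern.compile("\\s{5,}(PO\\-\\d+)\\s{5,}");

    // Returns the first PO number found in the text, or null if there is none.
    static String findPoNumber(String text) {
        Matcher m = PO_NUMBER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The first occurrence is wrapped in asterisks, so the pattern skips it.
        String text = "          *PO-003847945*          \n"
                    + "   Dohe,          PO-003847945          \n";
        System.out.println(findPoNumber(text));
    }
}
```

Because `\s` also matches line breaks, the surrounding whitespace may span lines when the pattern is applied to the whole text rather than line by line.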



Match PO Date and Vendor

The PO date is the first substring that matches the following pattern:

Plain Text

Date{list of dots}{anything that is not a digit, e.g. a space}{1 or 2 digits/1 or 2 digits/4 digits}



In regex:

Plain Text

Date\\.+[^\\d]*(\\d+\\/\\d+\\/\\d{4})



With a similar observation, we get the regex for the vendor:

Plain Text

Vendor\\s*\\:\\s*([^\\s]+)
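
The two patterns can be tried out line by line with a small helper that returns the first captured group. This is a sketch only: DateVendorMatcher, firstMatch, findPoDate, and findVendor are hypothetical names, and the sample lines are shortened versions of the output shown earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateVendorMatcher {
    private static final Pattern PO_DATE = Pattern.compile("Date\\.+[^\\d]*(\\d+\\/\\d+\\/\\d{4})");
    private static final Pattern VENDOR = Pattern.compile("Vendor\\s*\\:\\s*([^\\s]+)");

    // Returns group 1 of the first line that matches, or null if no line matches.
    static String firstMatch(Pattern pattern, String[] lines) {
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    static String findPoDate(String[] lines) {
        return firstMatch(PO_DATE, lines);
    }

    static String findVendor(String[] lines) {
        return firstMatch(VENDOR, lines);
    }

    public static void main(String[] args) {
        String[] lines = {
            "   100225   Rawdat  Eqdeem     Date...................................: 5/10/2020   ",
            "   Vendor :    TL-00074   "
        };
        System.out.println(findPoDate(lines));
        System.out.println(findVendor(lines));
    }
}
```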



Read Table Content

To read the table content while looping through the lines of the PDF file, we need two signals:

  1. The table header line, which switches the loop into reading-table-content mode. Once we know the header line, we also know the column bounds used to capture each cell.
  2. The first line that no longer belongs to the table, which switches reading-table-content mode off; otherwise, the loop keeps adding unrelated content to the table.

Check out my code here TestInvoice.java

There are some important points in this implementation:

  1. I use only some of the headers, not all, to detect the header line. They are distinctive enough to identify it, and the Discount header does not sit on the same line as the others.
  2. Description is a multi-line cell: its content runs from the line containing the barcode up to, but not including, the next barcode line.

Given these observations, we need to find the barcode and use it as the anchor cell for each row.
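
The loop described above can be sketched roughly as follows. This is not the code from TestInvoice.java but a simplified illustration under stated assumptions: it reads only the Barcode, Description, and Quantity columns, it assumes a barcode is 8 to 14 digits at the start of a line, and it uses a line starting with "Total" as a hypothetical end-of-table signal (the real signal depends on your template). TableReader, Row, and slice are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableReader {
    static final class Row {
        final String barcode;
        String description = "";
        String quantity = "";

        Row(String barcode) {
            this.barcode = barcode;
        }
    }

    // Assumption: a table row starts with a barcode of 8 to 14 digits.
    private static final Pattern BARCODE = Pattern.compile("^\\s*(\\d{8,14})\\b");

    static List<Row> readTable(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        int descStart = -1, qtyStart = -1, qtyEnd = -1;
        Row current = null;
        for (String line : lines) {
            if (descStart < 0) {
                // Signal 1: a few headers are enough to recognise the header line.
                if (line.contains("Barcode") && line.contains("Description")
                        && line.contains("Quantity")) {
                    descStart = line.indexOf("Description");
                    qtyStart = line.indexOf("Quantity");
                    qtyEnd = qtyStart + "Quantity".length();
                }
                continue;
            }
            // Signal 2 (hypothetical): the first line after the table starts with "Total".
            if (line.trim().startsWith("Total")) {
                break;
            }
            // The barcode is the anchor cell: it opens a new row.
            Matcher m = BARCODE.matcher(line);
            if (m.find()) {
                current = new Row(m.group(1));
                rows.add(current);
            }
            if (current == null) {
                continue; // content between the header and the first barcode
            }
            // The description may continue on lines without a barcode.
            String desc = slice(line, descStart, qtyStart).trim();
            if (!desc.isEmpty()) {
                current.description = current.description.isEmpty()
                        ? desc : current.description + " " + desc;
            }
            String qty = slice(line, qtyStart, qtyEnd).trim();
            if (!qty.isEmpty() && current.quantity.isEmpty()) {
                current.quantity = qty;
            }
        }
        return rows;
    }

    // Substring that tolerates lines shorter than the column bounds.
    private static String slice(String line, int from, int to) {
        if (from >= line.length()) {
            return "";
        }
        return line.substring(from, Math.min(to, line.length()));
    }
}
```

The column bounds here come from where the header words start, which works when cell values stay inside their header's x-range; columns whose values drift outside the header need wider bounds or a tolerance.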

A More Accurate Way to Detect PO Number

Many values in forms come with a label, e.g. PO Number: PO-1234422312446. Accuracy is higher if we find the label and the value together. That is what I did to find the PO date and vendor above. But for some values, the label and the value are aligned vertically rather than on the same line. For example:

Plain Text

              PO Number

           PO-1234422312446



For this layout, we can first detect the position of the label, then scan the following lines over the same x-range as the label, plus a tolerance, for the first non-empty value. That should be the value we are looking for. The implementation is below:

Java

// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String poNumberLabel = "PO Number";
Pattern poNumberPattern = Pattern.compile("PO-\\d+");
String poNumber = null;
int labelStart = -1; // x-range of the label, remembered once the label is found
int labelEnd = -1;
int spaceTolerance = 5;

for (String line : lines) {
    // ...
    // detect PO Number
    if (poNumber != null) {
        continue;
    }
    if (labelStart < 0) {
        // still looking for the label line
        labelStart = line.indexOf(poNumberLabel);
        if (labelStart >= 0) {
            labelEnd = labelStart + poNumberLabel.length();
        }
    } else {
        // scan the label's x-range, widened by the tolerance and clamped to this line
        int from = Math.max(0, labelStart - spaceTolerance);
        int to = Math.min(line.length(), labelEnd + spaceTolerance);
        if (from < to) {
            Matcher m = poNumberPattern.matcher(line.substring(from, to));
            if (m.find()) {
                poNumber = m.group();
            }
        }
    }
}



Design for Multi-Template Parsers

If your system has to handle several PDF templates, I suggest managing the parsers with the factory pattern. The design is as follows:

Interfaces

Java

class ParsedContent {
    // e.g.
    // private String poNumber;
    // private String date;
    // private Row[] rows;
}

interface Parser {
    public ParsedContent parse(String[] lines);
}

interface ParserFactory {
    public Parser get(String[] lines); // detect the Parser from the content
}



Implementation

Java

abstract class AbstractParser implements Parser {
    /**
     * Check and determine if the input lines are acceptable for this parser
     */
    protected abstract boolean isValid(String[] lines);
}

class Template1Parser extends AbstractParser {
    // ... implement isValid(...) and parse(...) for template 1
}

class Template2Parser extends AbstractParser {
    // ... implement isValid(...) and parse(...) for template 2
}

// Assumed here to be simple unchecked exceptions
class Found2ParsersException extends RuntimeException {}

class ParserNotFoundException extends RuntimeException {}

class ParserFactoryImpl implements ParserFactory {
    private final AbstractParser[] parsers = new AbstractParser[] {
        new Template1Parser(),
        new Template2Parser()
    };

    public Parser get(String[] lines) {
        Parser retVal = null;
        for (AbstractParser p : this.parsers) {
            if (p.isValid(lines)) {
                if (retVal != null) {
                    // more than one template claims this content
                    throw new Found2ParsersException();
                }
                retVal = p;
            }
        }
        if (retVal == null) {
            throw new ParserNotFoundException();
        }
        return retVal;
    }
}



Usage:

Java

ParserFactory pf = new ParserFactoryImpl();

// read pdf file and store content in String[] lines
ParsedContent content = pf.get(lines).parse(lines);



Source code

Check out my code here TestInvoice.java
