Extract Information From PDF Invoice

In this article, I explain how I parse PDF invoice files using regex and PDFBox.

By Tho Luong · Updated Jul. 06, 20

It's fairly easy to write code that generates PDF files, but much harder to parse them and get information back out, because the PDF format is complicated. Unfortunately, PDF files are sometimes the input to our system, and we need to parse and model their content before applying further logic to it.

If the templates vary without limit, it's nearly impossible to write one abstract parser that understands and extracts all the information we need, such as order number, quantity, amount, and vendor ID. But if the number of templates is fixed, there is a way to achieve this with PDFBox and regex.

In this article, I will explain how I parsed the PDF file below. Hopefully, the approach can be applied to yours as well.

Check out my code here TestInvoice.java

Extraction requirements

We need to get the following information from the file above:

  • PO number
  • Date of the PO
  • Vendor
  • { Barcode, Description, Quantity } in the table

Libs

As you may know, a PDF stores strings and characters separately, with absolute positioning. This means that even when two words look like they belong to the same string, the raw data we receive can be a list of separate strings, each with its own position. For example, reading the word Purchase can produce:

JSON

[
    { text: "ch", x: 11, y: 4, w: 15, h: 10 },
    { text: "Pur", x: 0, y: 3, w: 10, h: 10 },
    { text: "ase", x: 27, y: 4, w: 12, h: 10 }
]



The difficulties are:

  • The pieces don't all share the same y
  • The order of the strings is not the same as the order in which they appear in PDF viewers
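To make these two difficulties concrete, here is a minimal sketch of the reordering step, assuming each piece is reduced to its text and its x/y position. The `TextFragment` and `FragmentJoiner` names and the tolerance value are illustrative assumptions and are not part of PDFBox. The sketch also does not rebuild the spacing between words.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of one positioned piece of text (not a PDFBox type).
final class TextFragment {
    final String text;
    final int x;
    final int y;

    TextFragment(String text, int x, int y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class FragmentJoiner {
    // Assumed value: pieces whose y differs by at most this much count as one line.
    static final int Y_TOLERANCE = 2;

    /** Groups pieces into lines by y (with tolerance), then joins each line left to right. */
    static List<String> toLines(List<TextFragment> fragments) {
        List<TextFragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((TextFragment f) -> f.y));

        List<List<TextFragment>> rows = new ArrayList<>();
        for (TextFragment f : sorted) {
            List<TextFragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= Y_TOLERANCE) {
                last.add(f); // close enough to the current line's first piece
            } else {
                List<TextFragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<TextFragment> row : rows) {
            row.sort(Comparator.comparingInt((TextFragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (TextFragment f : row) {
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

Given the three pieces of Purchase from the example above, `toLines` returns a single line containing Purchase.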

We need a library to reorder the pieces of words and concatenate them where needed. The library I use is PDFLayoutTextStripper, which transforms a PDF into plain text while keeping the original layout fairly well. Below is the sample output:

Plain Text


                                                                                                *PO-003847945*                                           

                                                                                      Page.........................: 1    of    1                        





                Address...........:     Peera  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 3371                                                                                                     
                                        Dohe,                                      PO-003847945                                                          
                                        QAT                                       TL-00074             EOCE  EELA ALMANNAI   W.L.L.                      

                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  


               100225                Rawdat  Eqdeem                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               EOCE EELA ALMANAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     


                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          

                                                1.25                                                                                                                        
(truncated...)



Using Regex

After having PDF content in a single string, we can split it into lines and loop through them, using regex to find the desired information.

Match PO Number

Observe that the PO number is the first substring with the following format:

Plain Text

PO-{list of digits}



We also see that the PO number stands alone, far from other words, so we can make the pattern stronger by requiring spaces before and after it. A better pattern is:

Plain Text

{at least 5 spaces}PO-{list of digits}{at least 5 spaces}



Turn this into a Java regex pattern (written as a Java string literal, so each backslash is doubled):

Plain Text

\\s{5,}(PO\\-\\d+)\\s{5,}
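As a sketch of how this pattern might be applied, the class below splits the extracted text into lines and returns the first captured PO number. The `PoNumberFinder` class and its method name are assumptions made for illustration; only the pattern itself comes from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PoNumberFinder {
    // The PO number must be surrounded by at least five whitespace characters.
    private static final Pattern PO_NUMBER = Pattern.compile("\\s{5,}(PO\\-\\d+)\\s{5,}");

    /** Scans the text line by line and returns the first PO number, or null if none is found. */
    static String findPoNumber(String text) {
        for (String line : text.split("\\r?\\n")) {
            Matcher m = PO_NUMBER.matcher(line);
            if (m.find()) {
                return m.group(1); // group 1 is the PO number without the padding
            }
        }
        return null;
    }
}
```

Note that a value wrapped in other characters, such as `*PO-003847945*` in the sample output, is skipped because it is not directly surrounded by spaces.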



Match PO Date and Vendor

The PO date is the first substring that matches the following pattern:

Plain Text

Date{list of dots}{anything but not a digit e.g. space}{1 or 2 digits/1 or 2 digits/4 digits}



In regex:

Plain Text

Date\\.+[^\\d]*(\\d+\\/\\d+\\/\\d{4})



With a similar observation, we have the regex for the vendor:

Plain Text

Vendor\\s*\\:\\s*([^\\s]+)
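The two label-based patterns can be tried out with a small helper like the one below. This is only a sketch: the `HeaderFieldFinder` class and its method names are hypothetical, and the patterns are copied unchanged from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderFieldFinder {
    // Label "Date", a run of dots, any non-digits, then the date itself.
    private static final Pattern PO_DATE =
            Pattern.compile("Date\\.+[^\\d]*(\\d+\\/\\d+\\/\\d{4})");
    // Label "Vendor", a colon with optional spaces, then the first non-space token.
    private static final Pattern VENDOR =
            Pattern.compile("Vendor\\s*\\:\\s*([^\\s]+)");

    static String findPoDate(String[] lines) {
        return firstGroup(PO_DATE, lines);
    }

    static String findVendor(String[] lines) {
        return firstGroup(VENDOR, lines);
    }

    /** Returns group 1 from the first line that matches, or null when no line matches. */
    private static String firstGroup(Pattern pattern, String[] lines) {
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }
}
```

Because the search stops at the first match, the order of the lines matters: in the sample output, the Date line comes before the Expected DeliveryDate line, which the date pattern would also accept.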



Read Table Content

To read the table content while looping through all the lines of the PDF file, we need to know the following signals:

  1. The table header line, which switches the loop into reading-table-content mode. Once we know the header line, we also know the bounds used to trap each column's content.
  2. The first line that does not belong to the table, which switches reading-table-content mode off. Otherwise, the loop keeps adding wrong content to the table.

Check out my code here TestInvoice.java

There are some important points in this implementation:

  1. I use only some of the headers, not all, to detect the header line. They are strong enough to identify it, and the Discount header does not sit on the same line as the others.
  2. Description is a multi-line cell: its content spreads from the line with the barcode up to the line before the next barcode line.

With these observations, we need to find the barcode and use it as the anchor cell for the row.
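The following is a minimal sketch of that loop, not the code from TestInvoice.java. It assumes that a barcode is exactly 13 digits (as in the sample output) and that the table ends at a line starting with a hypothetical `Total` marker, since the end of the sample is truncated. The `InvoiceTableReader` and `Row` names are also illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class InvoiceTableReader {
    /** One table row; the field names are illustrative. */
    static final class Row {
        final String barcode;
        String description = "";
        String quantity = "";

        Row(String barcode) {
            this.barcode = barcode;
        }
    }

    // Assumption: a barcode cell holds exactly 13 digits.
    private static final Pattern BARCODE = Pattern.compile("\\d{13}");
    // Assumption: the first line after the table starts with this hypothetical marker.
    private static final String END_MARKER = "Total";

    static List<Row> readRows(String[] lines) {
        List<Row> rows = new ArrayList<>();
        int[] bounds = null; // column starts: barcode, item number, description, quantity, unit
        Row current = null;
        for (String line : lines) {
            if (bounds == null) {
                bounds = detectHeader(line); // stays null until the header line is seen
                continue;
            }
            if (line.trim().startsWith(END_MARKER)) {
                break; // first line that no longer belongs to the table
            }
            String barcode = cell(line, bounds[0], bounds[1]);
            if (BARCODE.matcher(barcode).matches()) {
                current = new Row(barcode); // the barcode anchors a new row
                rows.add(current);
            }
            if (current == null) {
                continue;
            }
            String description = cell(line, bounds[2], bounds[3]);
            if (!description.isEmpty()) {
                // A description may continue on the lines below its barcode line.
                current.description = (current.description + " " + description).trim();
            }
            String quantity = cell(line, bounds[3], bounds[4]);
            if (!quantity.isEmpty()) {
                // Keep the first token only; a neighbouring cell may overflow into this range.
                current.quantity = quantity.split("\\s+")[0];
            }
        }
        return rows;
    }

    /** Returns the column start positions if the line is the header line, otherwise null. */
    private static int[] detectHeader(String line) {
        String[] headers = {"Barcode", "Item number", "Description", "Quantity", "Unit"};
        int[] starts = new int[headers.length];
        int from = 0;
        for (int i = 0; i < headers.length; i++) {
            starts[i] = line.indexOf(headers[i], from);
            if (starts[i] < 0) {
                return null;
            }
            from = starts[i] + headers[i].length();
        }
        return starts;
    }

    /** Trimmed text between two columns, tolerating lines shorter than the range. */
    private static String cell(String line, int from, int to) {
        if (from >= line.length()) {
            return "";
        }
        return line.substring(from, Math.min(to, line.length())).trim();
    }
}
```

Using the header positions as column bounds is the simplest choice, but real invoices can need some tolerance, because values do not always start exactly under their header.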

A More Accurate Way to Detect PO Number

Many values in forms come with their labels, e.g. PO Number: PO-1234422312446. We get higher accuracy if we find the data label and the data value together. That's what I applied to find the PO date and vendor above. But for some values, the label and the value are vertically aligned. For example:

Plain Text

              PO Number

           PO-1234422312446



For this layout, we can first detect the position of the label, then scan the following lines over the same x-range as the label, with some tolerance, to find the first non-empty value. That should be the value we're looking for. The implementation is as below:

Java

String poNumberLabel = "PO Number";
String poNumber = null;
int labelStart = -1; // column where the label begins; -1 until the label is found
int spaceTolerance = 5;

for (String line : lines) {
    // ...
    // detect PO Number
    if (poNumber != null) {
        continue;
    }
    if (labelStart < 0) {
        // Still looking for the label; the value sits on a later line.
        labelStart = line.indexOf(poNumberLabel);
        continue;
    }
    // Scan the label's x-range, widened by the tolerance and clamped to the line length.
    int from = Math.max(0, labelStart - spaceTolerance);
    int to = Math.min(line.length(), labelStart + poNumberLabel.length() + spaceTolerance);
    if (from < to) {
        // match(...) is assumed to return null when the substring does not match
        poNumber = match(line.substring(from, to), "po-regex-here");
    }
}



Design for Multi-Template Parsers

If your system has several PDF templates, the suggested pattern for managing all the parsers is the factory pattern. The design is as below:

Interfaces

Java

class ParsedContent {
    // e.g.
    // private String poNumber;
    // private String date;
    // private Row[] rows;
}

interface Parser {
    public ParsedContent parse(String[] lines);
}

interface ParserFactory {
    public Parser get(String[] lines); // detect Parser from its content
}



Implementation

Java

abstract class AbstractParser implements Parser {
    /**
     * Check and determine if the input lines are acceptable for this parser
     */
    protected abstract boolean isValid(String[] lines);
}

class Template1Parser extends AbstractParser {
    // ...
}

class Template2Parser extends AbstractParser {
    // ...
}

class ParserFactoryImpl implements ParserFactory {
    // Typed as AbstractParser so that isValid is visible to the factory
    private final AbstractParser[] parsers = new AbstractParser[] {
        new Template1Parser(),
        new Template2Parser()
    };

    @Override
    public Parser get(String[] lines) {
        Parser retVal = null;
        for (AbstractParser p : this.parsers) {
            if (p.isValid(lines)) {
                if (retVal != null) {
                    // More than one template claims this document
                    throw new Found2ParsersException();
                }
                retVal = p;
            }
        }
        if (retVal == null) {
            throw new ParserNotFoundException();
        }
        return retVal;
    }
}



Usage:

Java

ParserFactory pf = new ParserFactoryImpl();

// read pdf file and store content in String[] lines
ParsedContent content = pf.get(lines).parse(lines);
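The design leaves open how `isValid` recognises a template. One possible approach, sketched below, is to give each template a set of marker texts that must all appear in the document, and to accept a document only when exactly one template matches. The `TemplateDetector` and `Signature` names, the marker texts, and the use of `IllegalStateException` are assumptions made for illustration, not part of the original design.

```java
import java.util.List;

final class TemplateDetector {
    /** Hypothetical signature of one template: a name plus marker texts that must all appear. */
    static final class Signature {
        final String name;
        final String[] markers;

        Signature(String name, String... markers) {
            this.name = name;
            this.markers = markers;
        }

        /** True when every marker occurs on at least one line. */
        boolean matches(String[] lines) {
            for (String marker : markers) {
                boolean found = false;
                for (String line : lines) {
                    if (line.contains(marker)) {
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns the name of the single matching template; throws if none or several match. */
    static String detect(List<Signature> signatures, String[] lines) {
        String result = null;
        for (Signature s : signatures) {
            if (s.matches(lines)) {
                if (result != null) {
                    throw new IllegalStateException("Ambiguous templates: " + result + " and " + s.name);
                }
                result = s.name;
            }
        }
        if (result == null) {
            throw new IllegalStateException("No template matches");
        }
        return result;
    }
}
```

Failing loudly on an ambiguous match mirrors the factory above: it is safer to reject a document than to parse it with the wrong template.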



Source code

Check out my code here TestInvoice.java
