
Three Methods to Automatically Validate PDF Data

Want to validate data from a PDF? Sure you do. See three different ways you can handle it, each with its own benefits.

By Bipin Patwardhan · Nov. 18, '16 · Tutorial · 39.3K Views


An insurance customer delivery team wanted to automate, as part of regression testing, the validation of data present in PDF documents. After going through the requirements, we explored multiple options and suggested three solutions, each with its own set of unique features. Two of the options involve a two-step process: the first step converts the PDF document into a text document, and the required data is extracted from that text in the second step. In this article, we elaborate on the problem and share an overview of each option.

Introduction

Recently, we were in a discussion with a project delivery team that deals with PDF documents. The team works for an insurance customer, and one of its activities is to generate customer policies as PDF documents. As a standard process, the generated PDF documents are verified for content and structure and then sent to the customer. After each functionality change, the team needs to perform a regression test using various data sets and multiple templates. Today, the team has to go through each generated PDF document and manually validate information like name, address, policy number, and policy start date. As the number of tests is expected to grow along with the number of PDF templates, the team wanted a solution that would reduce the manual effort involved and work across a large volume of documents.

At first glance, the task of locating data inside a PDF seems straightforward, but it is not as simple as it appears. PDF is a display format, and the data stored in a PDF may not be in the same order in which it is displayed on screen or printed on a page. This is because text and images in a PDF are placed using page coordinates; the file has neither a linear structure (like a text file) nor a hierarchical structure (like an XML file) of the kind we are accustomed to in other formats. In this respect, a PDF is like an HTML document: it specifies how the data is to be displayed visually rather than using a well-defined structure that in turn determines the layout. For example, when the PDF shown in Figure 1 is converted to the text file shown in Figure 2, paragraphs that are placed next to each other visually may end up separated by many other paragraphs in the converted text file.

Figure 1: Sample PDF

Figure 2: PDF converted to text

After considering the delivery team's requirements for an easy, scalable, and automatable process, we explored various options and came up with three viable methods that can address the team's needs. The following sections describe each method. It is important to note that two of these methods work on a text file generated from the PDF document; for the scope of this article, this text file is referred to as the 'extract file'.

Method 1: Extracting Text Using Coordinates

The most commonly mentioned technique for extracting text from a PDF document is the PDFTextStripperByArea class provided by the Apache PDFBox library. To use it, we specify a rectangular area (by its coordinates) which, when placed on the PDF page, defines the region from which PDFBox will extract text. A Java sample using this class is shown in Table 1.

// Extract text from a fixed rectangular region of the first page (PDFBox 1.x API)
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
Rectangle rect = new Rectangle(10, 280, 275, 60); // x, y, width, height of the region
stripper.addRegion("r1", rect);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage) allPages.get(0);
stripper.extractRegions(firstPage);
System.out.println("Rectangle dimensions: " + rect);
System.out.println("Text: " + stripper.getTextForRegion("r1"));

Table 1: Sample for PDFTextStripperByArea

One of the difficulties of this coordinate-based extraction method is that we need to define each rectangular area from which we wish to extract text. For long documents, this task can be time consuming as well as error-prone: we have to guess both the positions and the sizes of the rectangles. In most cases, this becomes an exercise in trial and error that needs multiple iterations.

To make the task of specifying coordinates easier, we developed a helper application, PDFVisualMapper (shown in Figure 3), that loads one PDF page as an image and allows us to specify the rectangles using a rubber-banding technique (click and drag the mouse to define the outline). As output, the application generates the coordinates of the rectangles, which can then be used with PDFTextStripperByArea to extract text, as shown in Figure 4.


Figure 3: PDFVisualMapper showing three rectangular areas from which text is to be extracted for validation

Figure 4: Output generated by PDFVisualFieldMapper, with field names and their coordinates

Using coordinates to define areas and extract text from them is a fairly simple method that works as long as the position of the text elements does not change. For example, if we define an area to extract two lines of text (say, a business address), it will fail to extract the complete address if the address spans three lines (the third line of the address will not be extracted). Similarly, if we have defined other areas for extraction on the assumption that the address spans two lines, the data below may shift down the page when the address spans more than two lines.

While this method is the simplest of the three, its biggest limitation is the fixed nature of the coordinates used for extraction: any change to the position of the data can result in incorrect extraction.

Method 2: Finding Known Values

Method two validates the text present in the extract file by searching for known values. To use this method, we create an input file containing the text we wish to search for (the 'master text') in the extract file. The application searches for each master text entry, and the PDF document is declared valid if all the required master text elements are found.

To validate a document, we create an input file as shown in Table 2. To account for multiple occurrences of the master text (for example, the policy number can appear in multiple places), it is possible to specify that the master text being searched for appears after a given prefix text, between a given prefix and suffix text, or before a given suffix text.

Field Name         Master Text     Prefix           Suffix
Policy Number:     2016-04-1705
Premium Policy     $1,016.00       Premium Policy
Expiration Date    10 Apr 2016

Table 2: Input file

The application searches for the master text one entry at a time and generates output as shown in Table 3. If the master text is found, the entry is marked 'Found'; if not, it is marked 'Not Found'. It is important to note that the solution stops at the first occurrence of a master text entry and does not search for all occurrences.

Field Name         Status      Master Text
Policy Number:     Found       2016-04-1705
Premium Policy     Found       $1,016.00
Expiration Date    Not Found   10 Apr 2016

Table 3: Output generated by the solution
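The prefix/suffix-constrained search described above can be sketched in a few lines of plain Java. This is an illustrative sketch, not the team's actual implementation; the class and method names are ours, and the logic simply stops at the first match, as the solution does.

```java
/** Sketch of the Method 2 search: find a master-text value in the extract
 *  file content, optionally constrained to appear after a prefix and/or
 *  before a suffix. Illustrative only. */
public class MasterTextSearch {

    /** Returns true if masterText occurs in content, honoring optional
     *  prefix/suffix constraints; stops at the first occurrence found. */
    public static boolean isFound(String content, String masterText,
                                  String prefix, String suffix) {
        int from = 0;
        if (prefix != null && !prefix.isEmpty()) {
            int p = content.indexOf(prefix);
            if (p < 0) return false;        // prefix itself is missing
            from = p + prefix.length();     // search only after the prefix
        }
        int m = content.indexOf(masterText, from);
        if (m < 0) return false;
        if (suffix != null && !suffix.isEmpty()) {
            // the suffix must appear somewhere after the master text
            return content.indexOf(suffix, m + masterText.length()) >= 0;
        }
        return true;
    }

    public static void main(String[] args) {
        String extract = "Policy Number: 2016-04-1705\n"
                       + "Premium Policy $1,016.00\n";
        System.out.println(isFound(extract, "2016-04-1705", null, null));          // prints true
        System.out.println(isFound(extract, "$1,016.00", "Premium Policy", null)); // prints true
        System.out.println(isFound(extract, "10 Apr 2016", null, null));           // prints false
    }
}
```

Running `isFound` once per row of the input file, and recording Found/Not Found per entry, yields output of the shape shown in Table 3.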

Method 3: Rule-Based Extraction

Method three is somewhat more involved than the methods described in the earlier sections. We have named it 'rule-based extraction.' Like method one, this method extracts data for validation, but it operates on the extract file and generates an output file whose contents are validated independently against the master data. In contrast to method one, here we define rules that allow us to navigate the extract file and extract data from it. To accommodate various formats, the solution supports multiple rules. As an example, for the sample document shown in Figure 2, we can define the rule file shown in Table 4 (for convenience, the JSON format is used):

{
    "rules": [
        { "command": "skipTill", "terminatingText": "policy number" },
        { "command": "extractText", "fieldName": "policyNumber" },
        { "command": "skipTillEnd" }
    ]
}

Table 4: Rules
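A minimal interpreter for such a rule file might look like the sketch below. This is an assumption about how the rules behave (skipTill advances past the line containing a marker, extractText captures the current line under a field name, skipTillEnd consumes the rest); the names mirror the sample rules, but everything else is illustrative, not the actual solution.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative interpreter for the rule file above; walks the extract
 *  file line by line and collects named fields. */
public class RuleBasedExtractor {

    /** Each rule is a small String array: {command} or {command, argument}. */
    public static Map<String, String> run(List<String[]> rules, List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        for (String[] rule : rules) {
            switch (rule[0]) {
                case "skipTill":    // advance past the line containing the marker text
                    while (pos < lines.size() && !lines.get(pos).contains(rule[1])) pos++;
                    pos++;
                    break;
                case "extractText": // capture the current line under the field name
                    if (pos < lines.size()) fields.put(rule[1], lines.get(pos++).trim());
                    break;
                case "skipTillEnd": // consume the remainder of the file
                    pos = lines.size();
                    break;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String[]> rules = Arrays.asList(
            new String[]{"skipTill", "policy number"},
            new String[]{"extractText", "policyNumber"},
            new String[]{"skipTillEnd"});
        List<String> extract = Arrays.asList(
            "Some header text", "policy number", "2016-04-1705", "trailing text");
        System.out.println(run(rules, extract)); // prints {policyNumber=2016-04-1705}
    }
}
```

In the real solution, commands like these would be resolved to rule classes at runtime (which is where the dynamic class loading mentioned below comes in), rather than being hard-coded in a switch.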

By defining various text operations as rules, this method is more flexible than the others. Due to parameterization, not only are the rules flexible, but the solution itself is extensible, as new rules can easily be added to cater to the specific needs of a team. Because the solution uses the dynamic class loading capabilities of the Java language, adding new rules is as simple as creating a Java archive (JAR file) containing the new rules and adding it to the solution's classpath.

While this method is very flexible, it depends on well-known text elements ('markers,' as we call them) in the extract file for the solution to identify its position in the file and extract data accordingly. If the position of these markers changes, either due to a change in format or due to the use of a different PDF-to-text conversion solution, the rules file has to be updated to account for the changes.

Comparing the Methods

Given three methods, the most logical question to ask is, 'Which of these methods is the best?' Sadly, there is no simple answer; it depends on the input PDF documents from which we wish to extract data. If we wish to find known data in the PDF, method two is preferred: for example, if we know the policy number that needs to be found, we can specify it as master text and the solution will find it in the extract file. If we do not know the exact value of the data we are looking for, methods one and three are preferred. Additionally, if we are guaranteed that the layout of the data in the PDF does not change, method one is preferable to method three. If we wish to extract data from a document and also ensure that the structure of the document is appropriate, method three is preferred, as we can define rules that allow for such validation in addition to data extraction. The biggest difference between method one and the other two is that method one operates directly on the PDF document, whereas methods two and three need the PDF document to be converted into an equivalent text file.

Automation

Today, every business is focused on increasing the effectiveness of its processes through automation. So, how does each method stack up? We are happy to note that each of these methods can be included in an automation workflow. For each method, initial manual effort is needed to define the rectangular areas for extraction (method one), the master file (method two), or the rules for extraction (method three). After creating these input files, we can apply the solution to multiple instances of the same template. Thus, each of the methods is scalable across multiple instances of one template as well as flexible enough to adapt to multiple templates without needing any code changes.

Conclusion

The task of converting a PDF document to a text document is fairly easy using tools like Apache PDFBox (as well as Xpdf). But after the conversion, extracting the required data from the converted text is a challenge, because text is not arranged in a well-defined format inside a PDF: there is no guarantee that text will appear in the same structure when extracted from the PDF (though text on one page does appear on the same page). In this article, we have presented an overview of three methods that we developed to address the problem of extracting data from a PDF and validating it against provided master data. Each method has its own advantages and its own set of limitations. The choice of method depends on the PDF document itself, as well as on the way the extracted data has to be processed.


Opinions expressed by DZone contributors are their own.
