Three Methods to Automatically Validate PDF Data
Want to validate data from a PDF? Sure you do. See three different ways you can handle it, each with its own benefits.
An insurance customer delivery team wanted to automate, as part of regression testing, the validation of data present in PDF documents. After going through the requirements, we explored multiple options and suggested three solutions, each with its own set of unique features. Two of the options involve a two-step process, where the first step converts the PDF document into a text document and the second step extracts the required text from it. In this article, we elaborate on the problem and share an overview of each option.
Introduction
Recently, we were in a discussion with a project delivery team that was dealing with PDF documents. The delivery team works for an insurance customer, and one of its activities is to generate customer policies as PDF documents. As a standard process, the generated PDF documents are verified for content and structure and then sent to the customer. After each functionality change, the team needs to perform a regression test using various data sets and multiple templates. Today, the team has to go through each generated PDF document manually and validate information such as name, address, policy number, and policy start date. As the number of tests is expected to grow along with the number of PDF templates, the team wanted a solution that would reduce the manual effort involved and work across a large volume of documents.
At first glance, the task of locating data inside a PDF seems straightforward, but it is not as simple as it appears. PDF is a display format, and data stored in a PDF may not be in the same order in which it is displayed on screen or appears on a printed page. This is because text and images in a PDF are placed using page coordinates and do not have the linear structure (as in a text file) or the hierarchical structure (as in an XML file) that we are accustomed to in other formats. In this respect, a PDF is like an HTML document, which specifies how the data is to be displayed visually, rather than using a well-defined structure that in turn helps decide the layout of the data. For example, when a PDF (as shown in Figure 1) is converted to a text file (as shown in Figure 2), paragraphs that are visually next to each other may be separated by many other paragraphs in the converted text file.
Figure 1: Sample PDF
Figure 2: PDF converted to text
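To make the ordering problem concrete, here is a minimal sketch that models a page as text fragments with page coordinates, stored in an arbitrary order, and then sorts them top-to-bottom and left-to-right. The `Fragment` class, its fields, and the sample values are hypothetical and exist only for illustration. A real PDF library exposes positions through its own API, and the y-axis is assumed here to grow downward.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of one piece of positioned text on a page.
final class Fragment {
    final double x;    // distance from the left edge (assumed unit: points)
    final double y;    // distance from the top edge (assumed to grow downward)
    final String text;

    Fragment(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class ReadingOrder {
    // Returns the texts in the order the fragments are stored,
    // which is what a naive conversion might emit.
    static List<String> storageOrder(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    // Returns the texts sorted top-to-bottom, then left-to-right.
    static List<String> visualOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        return storageOrder(sorted);
    }

    public static void main(String[] args) {
        // Sample values are made up; note the storage order differs from the layout.
        List<Fragment> page = Arrays.asList(
            new Fragment(300, 100, "Policy Number: 2016-04-1705"),
            new Fragment(50, 700, "Page 1"),
            new Fragment(50, 100, "Insured: Jane Doe"));
        System.out.println(storageOrder(page));
        System.out.println(visualOrder(page));
    }
}
```

Sorting by position recovers a plausible reading order for simple layouts, but multi-column pages and tables generally need more than a two-key sort.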
After considering the delivery team's requirements for an easy, scalable, and automatable process, we explored various options and solutions. We came up with three viable methods that can address the needs of the team. In the following sections, we describe each of the methods. It is important to note that two of these methods work on a text file, which is generated from the PDF document. For the scope of this article, the text file generated from the PDF document is known as the ‘extract file’.
Method 1: Extracting Text Using Coordinates
The most commonly mentioned technique for extracting text from a PDF document uses the PDFTextStripperByArea class provided by the Apache PDFBox library. To use it, we specify a rectangular area (using its coordinates) which, when placed on the PDF page, defines the area from which PDFBox will extract text. A Java sample using this method is shown in Table 1.
Table 1: Sample for PDFTextStripperByArea
One of the difficulties of this coordinate-based extraction method is that we need to define each rectangular area from which we wish to extract text. For long documents, this task can be very time-consuming. Specifying the coordinates manually is also error-prone, because we need to guess the position and the size of each rectangle. In most cases, this becomes an exercise in trial and error that needs multiple iterations.
To make the task of specifying coordinates easier, we developed a helper application, PDFVisualMapper (shown in Figure 3), that loads one PDF page as an image and allows us to specify the rectangles using a rubber-banding technique (click and drag the mouse to define the outline). As output, the application generates the coordinates of the rectangles, which can be used with PDFTextStripperByArea to extract text, as shown in Figure 4.
Figure 3: PDFVisualMapper showing three rectangular areas from which text is to be extracted for validation
Figure 4: Output generated by PDFVisualMapper, with field names and their coordinates
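The article does not show how the helper computes its output, so the following is only a minimal sketch of one way a rubber-band drag could be turned into a rectangle in page units. It assumes the page image was rendered at a known DPI, that PDF user space uses 72 points per inch, and that the image and the page share a top-left origin. The class and method names are hypothetical. The result is a `java.awt.geom.Rectangle2D`, which is the type that PDFBox's `PDFTextStripperByArea.addRegion` accepts.

```java
import java.awt.geom.Rectangle2D;

final class RubberBand {
    // PDF user space is measured in points, 72 to the inch.
    static final double POINTS_PER_INCH = 72.0;

    // Converts the two corner points of a mouse drag (in image pixels) into a
    // rectangle in points. The drag may run in any direction, so the corners
    // are normalized with min/abs before scaling.
    static Rectangle2D.Double toPageRectangle(int x1, int y1, int x2, int y2, double imageDpi) {
        double scale = POINTS_PER_INCH / imageDpi;
        double left = Math.min(x1, x2) * scale;
        double top = Math.min(y1, y2) * scale;
        double width = Math.abs(x2 - x1) * scale;
        double height = Math.abs(y2 - y1) * scale;
        return new Rectangle2D.Double(left, top, width, height);
    }

    public static void main(String[] args) {
        // Drag from bottom-right to top-left on a page image rendered at 144 DPI.
        Rectangle2D.Double r = toPageRectangle(400, 300, 100, 200, 144.0);
        System.out.println("policyNumber: x=" + r.x + " y=" + r.y
            + " width=" + r.width + " height=" + r.height);
    }
}
```

Keeping the DPI as a parameter means the same drag coordinates can be reused if the page image is rendered at a different zoom level.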
Using coordinates to define an area and extract text from it is a fairly simple method that works as long as the position of the text elements does not change. For example, if we define an area to extract two lines of text (say, a business address), it will fail to extract the complete address if the address spans three lines (the third line will not be extracted). Similarly, if we have defined other areas on the assumption that the address spans two lines, the data in those areas may be shifted down the page when the address spans more than two lines, and those areas will then capture the wrong text.
While this method is the simplest of the three, its biggest limitation is the fixed nature of the coordinates used for extraction. Any change to the position of the data can result in incorrect extraction.
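The address example can be sketched in a few lines. This is an illustration only, assuming each text line is represented by a single anchor point and that the region was sized for two lines. The class name, the line spacing, and the address values are hypothetical.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class FixedRegionDemo {
    // Keeps the lines whose anchor point (x, ys[i]) falls inside the region.
    static List<String> linesInside(Rectangle2D region, double x, double[] ys, String[] lines) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            if (region.contains(x, ys[i])) {
                kept.add(lines[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Region sized for two lines of text, 15 points apart, starting at y = 100.
        Rectangle2D region = new Rectangle2D.Double(50, 100, 200, 30);
        String[] address = {"12 Example Street", "Suite 4", "Springfield"};
        double[] ys = {105, 120, 135};
        // The third line sits at y = 135, below the region, so it is dropped.
        System.out.println(linesInside(region, 60, ys, address));
    }
}
```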
Method 2: Finding Known Values
Method two validates the PDF document by searching the extract file for known values. To use this method, we create an input file that contains the text we wish to search for (the ‘master text’). The application searches the extract file for each master text entry, and the PDF document is declared valid if all the required master text entries are found.
The input file has the form shown in Table 2. To account for multiple occurrences of the master text (for example, the policy number can appear in multiple places), it is possible to specify that the master text being searched for appears after a specified prefix text, before a specified suffix text, or between a specified prefix text and suffix text.
| Field name | Master text | Prefix | Suffix |
| Policy Number: | 2016-04-1705 | | |
| Premium Policy | $1,016.00 | Premium Policy | |
| Expiration Date | 10 Apr 2016 | | |
Table 2: Input file
The application searches for the master text one entry at a time and generates output as shown in Table 3. If the master text is found, the entry is marked ‘found’; otherwise, it is marked ‘not found’. It is important to note that the solution stops at the first occurrence of a master text entry and does not search for all of its occurrences.
| Field name | Status | Master text |
| Policy Number: | found | 2016-04-1705 |
| Premium Policy | found | $1,016.00 |
| Expiration Date | not found | 10 Apr 2016 |
Table 3: Output generated by solution
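The search described above could be sketched as follows. This is not the solution's actual code. It is a minimal illustration that assumes the prefix is matched at its first occurrence, the suffix at its first occurrence after the prefix, and the master text at its first occurrence in between. The `MasterEntry` and `KnownValueSearch` names are hypothetical, and the extract text is a made-up sample modeled on Table 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical representation of one row of the input file.
final class MasterEntry {
    final String fieldName;
    final String masterText;
    final String prefix;  // empty when no prefix is specified
    final String suffix;  // empty when no suffix is specified

    MasterEntry(String fieldName, String masterText, String prefix, String suffix) {
        this.fieldName = fieldName;
        this.masterText = masterText;
        this.prefix = prefix;
        this.suffix = suffix;
    }
}

final class KnownValueSearch {
    // Returns true if the master text occurs in the extract, after the prefix
    // (if given) and before the suffix (if given). Stops at the first match.
    static boolean isFound(String extract, MasterEntry e) {
        int from = 0;
        if (!e.prefix.isEmpty()) {
            int p = extract.indexOf(e.prefix);
            if (p < 0) return false;
            from = p + e.prefix.length();
        }
        int to = extract.length();
        if (!e.suffix.isEmpty()) {
            int s = extract.indexOf(e.suffix, from);
            if (s < 0) return false;
            to = s;
        }
        int hit = extract.indexOf(e.masterText, from);
        return hit >= 0 && hit + e.masterText.length() <= to;
    }

    // Builds one report row per entry: field name, status, master text.
    static List<String> report(String extract, List<MasterEntry> entries) {
        List<String> rows = new ArrayList<>();
        for (MasterEntry e : entries) {
            String status = isFound(extract, e) ? "found" : "not found";
            rows.add(e.fieldName + " | " + status + " | " + e.masterText);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up extract in which the expiration date differs from the master text.
        String extract = "Policy Number: 2016-04-1705\n"
            + "Premium Policy $1,016.00\n"
            + "Expiration Date 17 Apr 2017\n";
        List<MasterEntry> entries = Arrays.asList(
            new MasterEntry("Policy Number:", "2016-04-1705", "", ""),
            new MasterEntry("Premium Policy", "$1,016.00", "Premium Policy", ""),
            new MasterEntry("Expiration Date", "10 Apr 2016", "", ""));
        for (String row : report(extract, entries)) {
            System.out.println(row);
        }
    }
}
```

Because the comparison is an exact substring match, differences in case, spacing, or date format between the master text and the extract file would be reported as ‘not found’.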
Method 3: Rule-Based Extraction
Method three is more involved than the methods described in the earlier sections. We have named this method ‘rule-based extraction.’ Like method 1, this method extracts text rather than searching for known values; it reads from the extract file and generates an output file. The contents of the output file are then validated independently against the master data. Unlike method 1, which relies on fixed coordinates, this method uses rules that allow us to navigate the extract file and extract data from it. To accommodate various formats, the solution supports multiple rules. As an example, for the sample document shown in Figure 2, we can define the rule file as shown in Table 4 (for convenience, the JSON format is used):
Table 4: Rules
By defining various text operations as rules, this method is more flexible than the other methods. Because the rules are parameterized, they are flexible, and the solution itself is extensible, as new rules can easily be added to cater to the specific needs of a team. As the solution uses the dynamic class loading capabilities of the Java language, adding new rules is as simple as creating a Java archive (JAR file) containing the new rules and adding it to the classpath of the solution.
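The loading mechanism is not shown in the article, so the following is only a minimal sketch of how a rule class could be instantiated by name using standard Java reflection. The `ExtractionRule` interface, the `TrimRule` class, and the `RuleLoader` helper are hypothetical names. It assumes every rule has a public no-argument constructor and is visible on the classpath.

```java
// Hypothetical contract that every rule implements.
interface ExtractionRule {
    String apply(String input);
}

// Hypothetical rule; in practice it could live in a separate JAR on the classpath.
final class TrimRule implements ExtractionRule {
    public TrimRule() {
    }

    @Override
    public String apply(String input) {
        return input.trim();
    }
}

final class RuleLoader {
    // Loads a rule class by its fully qualified name and creates an instance
    // through its no-argument constructor.
    static ExtractionRule load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return cls.asSubclass(ExtractionRule.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load rule: " + className, e);
        }
    }

    public static void main(String[] args) {
        // The class name would normally come from the rule file.
        ExtractionRule rule = load(TrimRule.class.getName());
        System.out.println("[" + rule.apply("  2016-04-1705  ") + "]");
    }
}
```

With this kind of lookup, the rule file only needs to carry the class name and its parameters, and the core solution does not have to be recompiled when a rule is added.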
While this method is very flexible, it depends on well-known text elements in the extract file (‘markers’, as we call them), which the solution uses to identify its position in the file and extract data accordingly. If the position of these markers changes, either due to a change in format or due to the use of a different PDF-to-text conversion tool, the rules file will have to be updated to account for the changes.
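As a rough illustration of marker-driven extraction, the sketch below implements two hypothetical rules: one takes the text that follows a marker up to the end of the line, and the other takes the text between two markers. The rule names, the class name, and the sample extract are assumptions made for this example and are not the rule set of the solution described here.

```java
final class MarkerRules {
    // Hypothetical rule: returns the text after the marker, up to the end of
    // that line, or null if the marker is not present.
    static String afterMarker(String extract, String marker) {
        int start = extract.indexOf(marker);
        if (start < 0) return null;
        start += marker.length();
        int end = extract.indexOf('\n', start);
        if (end < 0) end = extract.length();
        return extract.substring(start, end).trim();
    }

    // Hypothetical rule: returns the text between two markers, which may span
    // several lines, or null if either marker is missing.
    static String between(String extract, String startMarker, String endMarker) {
        int start = extract.indexOf(startMarker);
        if (start < 0) return null;
        start += startMarker.length();
        int end = extract.indexOf(endMarker, start);
        if (end < 0) return null;
        return extract.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Made-up extract with a three-line address.
        String extract = "Address:\n12 Example Street\nSuite 4\nSpringfield\n"
            + "Policy Number: 2016-04-1705\n";
        System.out.println(between(extract, "Address:", "Policy Number:"));
        System.out.println(afterMarker(extract, "Policy Number:"));
    }
}
```

Because the address is delimited by markers rather than by a fixed rectangle, it is extracted in full whether it spans two lines or three. The trade-off is the one noted above: if a marker is renamed or moves, the rule has to be updated.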
Comparing the Methods
With three methods described, the most logical question to ask is, ‘Which of these methods is the best?’ Sadly, there is no simple answer. It depends on the input PDF documents from which we wish to extract data. If we wish to find known data in the PDF, then method two is preferred. For example, if we know the policy number that needs to be found, we can specify the policy number as master text and the solution will be able to find that text in the extract file. If we do not know the exact value of the data we are looking for, method one and method three are preferred. Additionally, if we are guaranteed that the layout of the data in the PDF does not change, method one is preferable to method three. If we wish to extract data from a document and also ensure that the structure of the document is appropriate, method three is preferred, as we can define rules that allow for such validation in addition to data extraction. The biggest difference between method one and the other two methods is that method one can operate directly on the PDF document, whereas methods two and three need the PDF document to be converted into an equivalent text file.
Automation
Today, every business is focused on increasing the effectiveness of its processes through automation. So, how well does each method lend itself to automation? We are happy to note that each of these methods can be included in an automation workflow. Each method needs some initial manual effort: defining the rectangular areas for extraction (method one), defining the master file (method two), or defining the rules for extraction (method three). After creating the input files, we can apply the solution to multiple instances of the same template. Thus, each method scales across multiple instances of one template and can adapt to multiple templates without needing any code changes.
Conclusion
The task of converting a PDF document to a text document is fairly easy using tools like Apache PDFBox (as well as Xpdf). But, after the conversion, extracting the required data from the converted text is a challenge. This is because text is not arranged in a well-defined format inside a PDF. There is no guarantee that text will appear in the same structure when extracted from the PDF (though text on one page does appear on the same page). In this article, we have presented an overview of three methods that we developed to address the problem of extracting data from a PDF and validating it against the provided master data. Each of the methods has its own advantages and its own set of limitations. The choice of method depends on the PDF document itself, as well as on the way in which data from the PDF has to be processed.