
Visual Mapper for PDF Data Extraction

There's more than one way to extract data from a PDF. Read on to see what alternative method the author found.


Recently, I was talking to a team that had been tasked with extracting data from a PDF. The team was using the PDFTextStripperByArea class of Apache PDFBox. To extract text with this class, we specify a rectangle which, when placed on the PDF page, defines the area from which PDFBox will extract text.

The logical question at this point might be, 'Is there no other way to extract data from a PDF?' It is possible to convert the PDF into a plain text file and extract the data from that, but the method is not without problems. Because PDF is a display format, the data stored in it is not necessarily in the same order as it appears on-screen. For example, when a PDF is converted to a text file, paragraphs that were placed next to each other in the PDF may be separated by many other paragraphs in the converted text file.

Using coordinates to define an area and extracting text from it is a fairly simple and consistent method. It runs into problems only if the layout of the PDF on which the areas were defined changes. For example, if we define an area sized to extract two lines of text (say, a business address), it will fail to extract the complete address when the address spans three lines. But let us leave that discussion for another day.
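The two-line versus three-line address problem can be illustrated with a simplified model. The sketch below is not PDFBox code: it treats each line of text as a single point and checks whether that point falls inside a fixed rectangle. The class name FixedRegionDemo, the method linesInside, and all coordinates used with it are hypothetical.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: which lines of text fall inside a fixed extraction area? */
final class FixedRegionDemo {
    private FixedRegionDemo() {}

    /**
     * Returns the lines whose anchor point (x, firstLineY + i * lineHeight)
     * lies inside the region. This is a simplification that models each
     * line as one point rather than as individual characters.
     */
    static List<String> linesInside(Rectangle2D region, double x, double firstLineY,
                                    double lineHeight, List<String> lines) {
        List<String> captured = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            double y = firstLineY + i * lineHeight;
            if (region.contains(x, y)) {
                captured.add(lines.get(i));
            }
        }
        return captured;
    }
}
```

With a region 24 units tall and lines spaced 12 units apart, only the first two lines of a three-line address are captured, so the third line is silently dropped.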

Obviously, defining the coordinates of the areas by hand is a tedious task, and something was needed to ease it. I looked around for an interactive method that would allow us to define the areas visually. Given the vastness of the Internet and the number of people contributing to it, I was able to find a suitable sample (reference: "Mouse drag and draw: Mouse Draw" on Java2s).
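Defining an area by dragging the mouse comes down to turning a press point and a release point into a rectangle, whichever direction the drag went. The following is a minimal sketch of that step, not the code of the Java2s sample; the class name DragRectangle and the method fromDrag are hypothetical.

```java
import java.awt.Point;
import java.awt.Rectangle;

/** Hypothetical helper: converts a mouse drag into a rectangle. */
final class DragRectangle {
    private DragRectangle() {}

    /**
     * Builds a rectangle from the point where the mouse was pressed and the
     * point where it was released. Using min/abs keeps the width and height
     * non-negative even when the user drags up or to the left.
     */
    static Rectangle fromDrag(Point start, Point end) {
        int x = Math.min(start.x, end.x);
        int y = Math.min(start.y, end.y);
        int width = Math.abs(start.x - end.x);
        int height = Math.abs(start.y - end.y);
        return new Rectangle(x, y, width, height);
    }
}
```

In a Swing component, the two points would typically come from the mousePressed and mouseReleased events of a MouseListener.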

To make the sample suitable for my task, a small change to the PaintSurface class was needed. Instead of clearing the background, I loaded an image of the PDF page from which data was to be extracted. I modified the constructor of PaintSurface as below:

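// 'ii' (ImageIcon) and 'image' (Image) are assumed to be fields of PaintSurface
ii = new ImageIcon(imagePath);
image = ii.getImage();
if (image != null) {
    // Size the drawing surface to match the page image exactly,
    // so that mouse coordinates map directly onto image coordinates
    Dimension size = new Dimension(image.getWidth(null), image.getHeight(null));
    setPreferredSize(size);
    setMinimumSize(size);
    setMaximumSize(size);
    setSize(size);
    setLayout(null);
}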

where 'imagePath' is a String holding the path to the image. Additionally, the 'paint' method was updated to draw the image before drawing the rectangles.
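The article does not show the updated 'paint' method, so the following is only a sketch of the drawing order it describes: the page image first, then the rectangles on top. The class name PageOverlayPainter, the method paintPage, and the choice of red outlines are assumptions for illustration.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.Rectangle;
import java.util.List;

/** Hypothetical helper showing the drawing order: page image, then areas. */
final class PageOverlayPainter {
    private PageOverlayPainter() {}

    /**
     * Draws the page image at the origin and then outlines each area.
     * Drawing the image first ensures the rectangles stay visible on top of it.
     */
    static void paintPage(Graphics2D g, Image pageImage, List<Rectangle> areas) {
        if (pageImage != null) {
            g.drawImage(pageImage, 0, 0, null);
        }
        g.setColor(Color.RED); // outline colour is an arbitrary choice
        for (Rectangle area : areas) {
            g.draw(area);
        }
    }
}
```

A component's paint method could delegate to such a helper by casting its Graphics argument to Graphics2D and passing in the stored image and the list of rectangles.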

Because the application shows the content of the PDF page, the areas defined with it have the exact coordinates needed for text extraction using PDFBox.
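The direct match between on-screen and extraction coordinates presumably relies on the page image having been rendered at 72 DPI, where one pixel corresponds to one PDF point; the article does not state the resolution used. If the image were rendered at a different resolution, the rectangle would need to be scaled first. A minimal sketch of that conversion follows, with RegionScaler and toPagePoints as hypothetical names:

```java
import java.awt.Rectangle;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: scales a pixel rectangle to PDF points (1/72 inch). */
final class RegionScaler {
    static final double POINTS_PER_INCH = 72.0;

    private RegionScaler() {}

    /**
     * Converts a rectangle drawn on a page image rendered at imageDpi
     * into the same rectangle measured in points.
     * At 72 DPI the rectangle is returned with unchanged values.
     */
    static Rectangle2D.Double toPagePoints(Rectangle pixels, double imageDpi) {
        if (imageDpi <= 0) {
            throw new IllegalArgumentException("imageDpi must be positive");
        }
        double scale = POINTS_PER_INCH / imageDpi;
        return new Rectangle2D.Double(
                pixels.x * scale,
                pixels.y * scale,
                pixels.width * scale,
                pixels.height * scale);
    }
}
```

The resulting Rectangle2D can then be registered with PDFTextStripperByArea through its addRegion method, which accepts a region name and a Rectangle2D.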


Topics:
java, pdfbox, apache
