Utilize OpenAI API to Extract Information From PDF Files

A solution that uses PDFBox and the OpenAI API to extract information from PDF files.

By Tho Luong · Jan. 30, 23 · Tutorial

Why It's Hard to Extract Information From PDF Files

PDF, or Portable Document Format, is a popular file format that is widely used for documents such as invoices, purchase orders, and other business documents. However, extracting information from PDFs can be a challenging task for developers.

One reason is that the format carries no logical structure. Unlike HTML, which marks up tables and headers with dedicated elements that developers can easily identify, a PDF typically records only where text is drawn on the page. This makes it harder for developers to know where to find the specific information they need.

Another reason is that documents of the same type do not share a standard layout. Each system generates invoices and purchase orders differently, so developers must often write custom extraction code for each layout. This can be a time-consuming and error-prone process.

Additionally, PDFs can contain both text and images, making it difficult for developers to programmatically extract information from the document. OCR (optical character recognition) can be used to extract text from images, but this adds complexity to the process and may result in errors if the OCR software is not accurate.

Existing Solutions

Existing solutions for extracting information from PDFs include:

  • Regex-based tools: match patterns in the text after converting the PDF to plain text. Examples include invoice2data and traprange-invoice. However, this method requires knowing the format of each data field in advance.
  • AI-based cloud services: use machine learning to extract structured data from PDFs. Examples include pdftables and docparser, but these are not open-source friendly.
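
To make the regex approach concrete, here is a minimal sketch in Java. It is not taken from invoice2data or traprange-invoice; the class name, the method names, and both patterns are assumptions made for illustration. It also shows the limitation mentioned above: each pattern hard-codes the expected format of one field.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex approach; not code from any of the tools named above.
class RegexFieldExtractor {

    // Assumed format: a PO number is "PO-" followed by digits.
    private static final Pattern PO_NUMBER = Pattern.compile("\\bPO-\\d+\\b");

    // Assumed format: a "Date" label, optional dot leaders, a colon, then a M/D/YYYY date.
    private static final Pattern DATE =
            Pattern.compile("Date\\.*:\\s*(\\d{1,2}/\\d{1,2}/\\d{4})");

    static Optional<String> findPoNumber(String text) {
        Matcher m = PO_NUMBER.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    static Optional<String> findDate(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = """
                *PO-003847945*
                Date...................................: 5/10/2020
                """;
        System.out.println(findPoNumber(text).orElse("not found")); // prints PO-003847945
        System.out.println(findDate(text).orElse("not found"));     // prints 5/10/2020
    }
}
```

A document from a different system that writes "Order No. 3847945" or uses another date format would need new patterns, which is exactly the maintenance cost of this approach.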

Yet Another Solution for PDF Data Extraction: Using OpenAI

One solution is to use OpenAI's natural language processing capabilities to understand the content of the document. However, the OpenAI text completion API used here accepts plain text rather than PDF or image files, so the first step is to convert the PDF to text while retaining the relative positions of the text items.

One way to achieve this is to use the PDFLayoutTextStripper library, which uses PDFBox to read all text items in the PDF file and arrange them in lines, keeping their relative positions the same as in the original PDF file. This matters because, for example, if values from the amount column of an invoice's items table end up in the same column as the quantity, queries for the total amount and total quantity will return incorrect values. Here is an example of the output from the stripper:

 
                                                                                                *PO-003847945*                                           
                                                                                                                                                         
                                                                                      Page.........................: 1    of    1                        
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                                                         
                Address...........:     Aeeee  Consumer  Good  Co.(QSC)            Purchase       Order                                                  
                                        P.O.Box 1234                                                                                                     
                                        Dooo,                                      PO-003847945                                                          
                                        ABC                                       TL-00074                                   
                                                                                                                                                         
                Telephone........:                                                 USR\S.Morato         5/10/2020 3:40 PM                                
                Fax...................:                                                                                                                  
                                                                                                                                                         
                                                                                                                                                         
               100225                Aaaaaa  Eeeeee                                 Date...................................: 5/10/2020                   
                                                                                    Expected  DeliveryDate...:  5/10/2020                                
               Phone........:                                                       Attention Information                                                
               Fax.............:                                                                                                                         
               Vendor :    TL-00074                                                                                                                      
               AAAA BBBB CCCCCAAI    W.L.L.                                         Payment  Terms     Current month  plus  60  days                     
                                                                                                                                                         
                                                                                                                                                         
                                                                                                                         Discount                        
          Barcode           Item number     Description                  Quantity   Unit     Unit price       Amount                  Discount           
          5449000165336     304100          CRET ZERO 350ML  PET             5.00 PACK24          54.00        270.00         0.00         0.00          
                                                     350                                                                                                 
          5449000105394     300742          CEEOCE  EOE SOFT DRINKS                                                                                      
                                            1.25LTR                          5.00  PACK6          27.00        135.00         0.00         0.00          
                                                                                                                                                         
                                                1.25                                                                                                                        
(truncated...)
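
The idea of keeping relative positions can be illustrated with a simplified sketch. This is not PDFLayoutTextStripper's actual implementation: the TextItem type, the fixed CHAR_WIDTH, and the assumption that y is measured from the top of the page are all made up for this example. Items are grouped into lines by their y coordinate, and each item is padded with spaces to the character column derived from its x coordinate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified illustration only; not how PDFLayoutTextStripper is implemented.
class LayoutSketch {

    // A text fragment with its page coordinates (hypothetical type for this sketch).
    record TextItem(String text, double x, double y) {}

    // Assumed average glyph width in page units, used to map x to a character column.
    static final double CHAR_WIDTH = 6.0;

    static String render(List<TextItem> items) {
        // Group items into lines by rounded y; TreeMap keeps lines ordered top to bottom
        // (assumes y grows downward from the top of the page).
        TreeMap<Long, List<TextItem>> lines = new TreeMap<>();
        for (TextItem item : items) {
            lines.computeIfAbsent(Math.round(item.y()), k -> new ArrayList<>()).add(item);
        }
        StringBuilder out = new StringBuilder();
        for (List<TextItem> line : lines.values()) {
            line.sort((a, b) -> Double.compare(a.x(), b.x()));
            StringBuilder row = new StringBuilder();
            for (TextItem item : line) {
                int column = (int) Math.round(item.x() / CHAR_WIDTH);
                // If the previous item already overran this column, keep one separating space.
                if (row.length() > 0 && row.length() >= column) {
                    row.append(' ');
                }
                // Otherwise pad with spaces up to the target column.
                while (row.length() < column) {
                    row.append(' ');
                }
                row.append(item.text());
            }
            out.append(row).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextItem> items = List.of(
                new TextItem("Quantity", 0, 10), new TextItem("Amount", 90, 10),
                new TextItem("5.00", 0, 20), new TextItem("270.00", 90, 20));
        // Both "Amount" and "270.00" start at character column 15.
        System.out.print(render(items));
    }
}
```

Because values that share an x coordinate land in the same character column, a reader (human or model) can still tell which number belongs to which header.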


Once the PDF has been converted to text, the next step is to call the OpenAI API and pass the text along with a query such as "Extract fields: 'PO Number', 'Total Amount'". The API response is in JSON format, and Gson can be used to parse it and extract the final results. This two-step process of converting the PDF to text and then applying OpenAI's natural language processing can be an effective way to extract information from PDF files.

The query is simple; %s is replaced by the PO text content:

 
private static final String QUERY = """
    Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
    Return result in JSON format without any explanation. 
    The PO content is as follows:
    %s
    """;


The query consists of two components:

  1. Specifying the desired fields.
  2. Formatting the field values as JSON data for easy retrieval from the API response.
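
As a rough sketch of how such a query could be turned into an API request, the code below fills the template and builds an HTTP request with Java's built-in HttpClient types. The class and method names are hypothetical, the max_tokens and temperature values are assumptions, and the hand-written JSON escaping stands in for what a library such as Gson would normally do. The model name is taken from the sample response in this article and may no longer be offered by OpenAI.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of building the completion request; class and method names are hypothetical.
class CompletionRequestSketch {

    private static final String QUERY = """
            Want to extract fields: "PO Number", "Total Amount" and "Delivery Address".
            Return result in JSON format without any explanation.
            The PO content is as follows:
            %s
            """;

    // Minimal JSON string escaping for the prompt (a JSON library would normally handle this).
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.toString();
    }

    // Fills the query template with the PO text and wraps it in a JSON request body.
    static String buildBody(String poText) {
        String prompt = QUERY.formatted(poText);
        // max_tokens and temperature are assumed values chosen for this sketch.
        return "{\"model\":\"text-davinci-003\",\"prompt\":\"" + escapeJson(prompt)
                + "\",\"max_tokens\":256,\"temperature\":0}";
    }

    // Builds (but does not send) the POST request to the completions endpoint.
    static HttpRequest buildRequest(String apiKey, String poText) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/completions"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(poText)))
                .build();
    }
}
```

Sending the request is one more call, for example HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), which returns the JSON response as a string.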

Here is an example response from OpenAI:

 
{
  "object": "text_completion",
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "\n{\n  \"PO Number\": \"PO-003847945\",\n  \"Total Amount\": \"1,485.00\",\n  \"Delivery Address\": \"Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT\"\n}",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  // ... some more fields
}


Decoding the text field's JSON string yields the following desired fields:

 
{
  "PO Number": "PO-003847945",
  "Total Amount": "1,485.00",
  "Delivery Address": "Peera Consumer Good Co.(QSC), P.O.Box 3371, Dohe, QAT"
}
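
The article's sample code uses Gson for this step. As a dependency-free illustration of what happens after the text field has been decoded, the sketch below reads the key/value pairs with a regular expression and converts the amount to a number. It is a stand-in, not a real JSON parser: it handles only flat objects whose values are strings without escaped quotes, and the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for a real JSON parser such as Gson; flat string-valued objects only.
class FlatJsonSketch {

    // Matches "key": "value" pairs; assumes neither side contains an escaped quote.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    static Map<String, String> parseFields(String json) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // Assumes "," is a thousands separator and "." the decimal point, as in the sample response.
    static BigDecimal parseAmount(String amount) {
        return new BigDecimal(amount.replace(",", ""));
    }

    public static void main(String[] args) {
        String decoded = """
                {
                  "PO Number": "PO-003847945",
                  "Total Amount": "1,485.00"
                }
                """;
        Map<String, String> fields = parseFields(decoded);
        System.out.println(fields.get("PO Number"));                  // prints PO-003847945
        System.out.println(parseAmount(fields.get("Total Amount")));  // prints 1485.00
    }
}
```

Since the model returns every value as a string, numeric fields such as the total amount still need a conversion like parseAmount before they can be used in calculations.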


Run Sample Code

Prerequisites:

  • Java 16+
  • Maven

Steps:

  • Create an OpenAI account.
  • Log in and generate an API key.
  • Replace OPENAI_API_KEY in Main.java with your key.
  • Update SAMPLE_PDF_FILE if needed.
  • Execute the code and view the results from the output.

Published at DZone with permission of Tho Luong. See the original article here.
