DocAI: PDFs/Scanned Docs to Structured Data

In this article, discover a way to chat with AI and ask questions in the context of your scanned documents.

By Kriti B · Oct. 09, 24 · Tutorial

Problem Statement

The problem this AI solution addresses is an important one, and it shows up across multiple fields.

Imagine you have multiple scanned PDF documents in which:

  • Customers have made manual selections or added signatures, dates, or customer information
  • Multiple pages of written documentation have been scanned, and you want a solution that obtains the text from them

OR

  • You are simply looking for an AI-backed avenue that provides an interactive mechanism to query documents that do not have a structured format

Dealing with such scanned, mixed, or unstructured documents can be tricky. Extracting crucial information from them is often a manual task, which makes it error-prone and cumbersome.

The solution below leverages OCR (optical character recognition) and LLMs (large language models) to obtain text from such documents and query them for structured, trustworthy information.

High-Level Architecture

[Figure: High-level architecture]

User Interface

  • The user interface allows for uploading PDF/scanned documents (it can be further expanded to other document types as well).
  • Streamlit is being leveraged for the user interface:
    • It is an open-source Python framework and is extremely easy to use.
    • Changes to the code are reflected in the running app, which makes for a fast testing mechanism.
    • Community support for Streamlit is fairly strong and growing.
  • Conversation chain:
    • This is needed so that the chatbot can answer follow-up questions and provide chat history.
    • We leverage LangChain for interfacing with the AI model we use; for the purpose of this project, we have tested with OpenAI and Mistral AI.
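
The idea behind a conversation chain can be sketched in plain Python: keep the earlier turns and replay them with every new question, so the model sees the follow-up in context. This is only an illustration of the concept, not LangChain's API. `ChatSession`, `build_prompt`, `ask`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real call to OpenAI or Mistral AI.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ChatSession:
    """Keeps the running chat history so follow-up questions have context."""

    model: Callable[[str], str]  # any function that maps a prompt to a reply
    history: List[Tuple[str, str]] = field(default_factory=list)

    def build_prompt(self, question: str) -> str:
        # Replay earlier turns so the model can resolve follow-ups
        # such as "On what date?".
        lines = []
        for user_msg, ai_msg in self.history:
            lines.append(f"User: {user_msg}")
            lines.append(f"Assistant: {ai_msg}")
        lines.append(f"User: {question}")
        lines.append("Assistant:")
        return "\n".join(lines)

    def ask(self, question: str) -> str:
        reply = self.model(self.build_prompt(question))
        self.history.append((question, reply))
        return reply


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM: reports how many user turns it was shown.
    return f"seen {prompt.count('User:')} user turn(s)"


if __name__ == "__main__":
    session = ChatSession(model=fake_model)
    print(session.ask("Who signed the form?"))  # seen 1 user turn(s)
    print(session.ask("On what date?"))         # seen 2 user turn(s)
```

The second call sees both questions because the first turn was replayed from `history`. The same list can be rendered in the UI as the visible chat history.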

Backend Service

Flow of Events

  1. The user uploads a PDF/scanned document, which is then stored in an S3 bucket.
  2. An OCR service then retrieves this file from the S3 bucket and processes it to extract text from this document.
  3. Chunks of text are created from the above output, and a vector embedding is created for each chunk.
    • This step is important because context can be lost when the text is split: a chunk boundary can fall mid-sentence, or it can separate a phrase from the punctuation that gives it meaning.
    • To counter this, we create overlapping chunks.
  4. The large language model that we use takes these embeddings as input, and we support two functionalities:
    1. Generate specific output:
      • If a specific kind of information needs to be pulled out of the documents, we can provide a query in code to the AI model, obtain the data, and store it in a structured format.
      • Reduce AI hallucinations by adding explicit conditions to the in-code queries: do not make up values, and use only the context of the document.
      • We can store it as a file in S3/locally OR write to a database.
    2. Chat
      • Here we provide the avenue for the end user to initiate a chat with AI to obtain specific information in the context of the document.
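
The overlapping chunks from step 3 can be sketched as follows. This version splits by character count, which is an assumption made to keep the example short. The function name `split_with_overlap` and the sizes used are illustrative, and a real pipeline might split on sentences or tokens instead.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size chunks, where each chunk repeats the
    last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Because neighboring chunks share text, a sentence cut at one boundary still appears whole in the adjacent chunk, as long as the overlap is longer than the part that was cut off.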

OCR Job

  • We use Amazon Textract for optical character recognition on these documents.
  • It also works well on documents that contain tables, forms, etc.
  • If working on a POC, leverage the free tier for this service.
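
As a rough sketch of what the OCR step hands to the rest of the pipeline: Textract's text detection returns a list of `Blocks`, and the blocks whose `BlockType` is `LINE` carry the recognized text. The helper name `lines_from_textract` and the sample response are made up for illustration. The commented-out boto3 call uses placeholder bucket and object names and is not executed here.

```python
from typing import Any, Dict, List


def lines_from_textract(response: Dict[str, Any]) -> List[str]:
    """Collect the text of every LINE block, in the order it was returned."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# With boto3 installed and AWS credentials configured, the response could
# come from a call like this (bucket and object names are placeholders):
#   import boto3
#   response = boto3.client("textract").detect_document_text(
#       Document={"S3Object": {"Bucket": "my-bucket", "Name": "scan.png"}}
#   )

# Hand-written sample in the same shape, so the sketch runs offline.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Customer name: Jane Doe"},
        {"BlockType": "WORD", "Text": "Customer"},
        {"BlockType": "LINE", "Text": "Date: 10/09/2024"},
    ]
}

print("\n".join(lines_from_textract(sample_response)))
```

The joined lines are the text that is then chunked and embedded. Documents with tables or forms would likely go through Textract's document analysis operation instead, which returns additional block types.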

Vector Embeddings

  • A simple way to understand vector embeddings: words or sentences are translated into lists of numbers that capture their meaning and their relationships to the surrounding context.
  • Imagine you have the word "ring", which is an ornament. In terms of the word itself, one of its close matches is "sing". But in terms of the meaning of the word, we would want it to match something like "jewelry", "finger", "gemstones", or perhaps something like "hoop", "circle", etc.
    • Thus, when we create a vector embedding of "ring", we are encoding a great deal of information about its meaning and relationships.
    • This information, along with the vector embeddings of other words/statements in a document, ensures that the correct meaning of the word "ring" in context is picked.
  • We used OpenAIEmbeddings to create the vector embeddings.
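
The "ring" example can be made concrete with cosine similarity, a common way to compare embeddings. The three-dimensional vectors below are hand-made toys chosen for illustration. They are not output from OpenAIEmbeddings, whose vectors have far more dimensions. `rank_by_similarity` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(word: str, embeddings: Dict[str, List[float]]) -> List[str]:
    """Return the other words, most similar to `word` first."""
    target = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(
        others,
        key=lambda w: cosine_similarity(target, embeddings[w]),
        reverse=True,
    )


# Toy vectors: "sing" is close to "ring" in spelling, but far in meaning.
toy_embeddings = {
    "ring": [0.9, 0.8, 0.1],
    "jewelry": [0.8, 0.9, 0.2],
    "circle": [0.6, 0.4, 0.5],
    "sing": [0.1, 0.0, 0.9],
}

print(rank_by_similarity("ring", toy_embeddings))
# ['jewelry', 'circle', 'sing']
```

The same ranking, applied to chunk embeddings instead of word embeddings, is how the chunks most relevant to a question can be selected.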

LLM

  • There are multiple large language models that can be used for our scenario.
  • For this project, we tested with OpenAI and Mistral AI.
  • Read more here on API Keys for OpenAI.
  • For Mistral AI, Hugging Face was leveraged.

Use Cases and Tests

We performed the following tests:

  • Reading signatures and handwritten dates/text using OCR
  • Reading hand-selected options in the document
  • Reading digital selections made on top of the document
  • Parsing unstructured data to obtain tabular content (added to a text file, DB, etc.)
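
For the last test, one possible shape is a query that restricts the model to the document text, followed by a parser that turns the reply into a row. The field names in `FIELDS`, the prompt wording, and the sample reply are all assumptions made for illustration. The model call itself is omitted, so `reply` is a hand-written string in the format the prompt asks for.

```python
import csv
import io
import json
from typing import Dict, List, Optional

# Hypothetical fields to pull out of each document.
FIELDS = ["customer_name", "signed_date", "selected_option"]


def build_extraction_prompt(context: str, fields: List[str]) -> str:
    """Build a query that tells the model not to invent missing values."""
    return (
        "Answer using only the document text below. "
        "If a value is not present in the text, use null; do not guess.\n"
        f"Return a JSON object with exactly these keys: {', '.join(fields)}.\n\n"
        f"Document text:\n{context}"
    )


def parse_reply(reply: str, fields: List[str]) -> Dict[str, Optional[str]]:
    """Keep only the requested keys; anything missing becomes None."""
    data = json.loads(reply)
    return {name: data.get(name) for name in fields}


def to_csv(rows: List[Dict[str, Optional[str]]], fields: List[str]) -> str:
    """Render rows as CSV text; None values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


prompt = build_extraction_prompt("Customer name: Jane Doe", FIELDS)
reply = '{"customer_name": "Jane Doe", "signed_date": null}'  # sample reply
print(to_csv([parse_reply(reply, FIELDS)], FIELDS))
```

The CSV text can be written to a local file or to S3, or the parsed dictionary can be inserted into a database instead.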

Future Scope

We can expand the use cases for this project further: incorporate images; integrate with documentation stores such as Confluence or Drive to pull information on a specific topic from multiple sources; and add a stronger avenue for comparative analysis between two documents.

Published at DZone with permission of Kriti B. See the original article here.
