Implementing a RAG Model for PDF Content Extraction and Query Answering

Using Python to extract and process text from a PDF document, generate embeddings, calculate cosine similarity, and answer queries using the extracted content.

By Vaibhavi Tiwari · Nov. 22, 24 · Tutorial

Retrieval-Augmented Generation (RAG) combines two techniques: information retrieval and language generation. Given a query, the system first retrieves the most relevant passages from a body of text, then passes those passages to a language model as context for generating the answer. Grounding the answer in retrieved text makes it more accurate, which is especially useful for detailed questions over large sources such as lengthy PDF files.

This tutorial will walk you through the process of utilizing Python to extract and process text from a PDF document, create embeddings, conduct cosine similarity calculations, and respond to queries derived from the extracted content.

Prerequisites

Ensure you have the following libraries installed in your Python environment:

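  • PyMuPDF (fitz): For extracting text from PDFs.
  • langchain-text-splitters: For splitting page text into overlapping chunks.
  • rake-nltk: For phrase extraction. It relies on NLTK data (stop words and a sentence tokenizer), which may need to be downloaded with nltk.download before first use.
  • openai: To interact with OpenAI's embedding and language models. The code in this tutorial uses the legacy interface (openai.Embedding.create and openai.ChatCompletion.create), which is available in versions of the package before 1.0.
  • pandas: To handle and export data. Writing .xlsx files also requires an Excel engine such as openpyxl.
  • numpy and scipy: For numerical operations and cosine similarity calculations.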

Step-by-Step Guide

Step 1: Import Libraries and Open the PDF

Import the libraries and open the PDF using this code:

Python
 
import fitz  # PyMuPDF

# Open the PDF file
pdf_document = "path/to/your/document.pdf"
document = fitz.open(pdf_document)

# Initialize a dictionary to hold the text for each page
pdf_text = {}

# Loop through each page
for page_number in range(document.page_count):
    # Get a page
    page = document.load_page(page_number)
    # Extract text from the page
    text = page.get_text()
    # Store the extracted text in the dictionary (pages are 1-indexed for readability)
    pdf_text[page_number + 1] = text

# Close the document
document.close()

# Output the dictionary
for page, text in pdf_text.items():
    print(f"Text from page {page}:\n{text}\n")


Step 2: Chunk Text for Embedding

The text needs to be broken down into smaller, manageable chunks. We use RecursiveCharacterTextSplitter (from the langchain-text-splitters package) to split each page's text into overlapping chunks.

Splitting text into overlapping chunks matters for several reasons, especially when working with large documents in natural language processing (NLP) tasks:

1. Improves Context Retention

  • When text is split into overlapping chunks, each chunk repeats some content from its neighbors. This preserves context for models that rely on surrounding text.
  • A detail that straddles a chunk boundary is more likely to appear whole in at least one chunk, so the information stays coherent.

2. Enhances Accuracy in NLP Tasks

  • Question-answering and similar models tend to perform better when the relevant passage arrives with its surrounding context. Overlapping chunks make it more likely that a retrieved chunk contains that context.

3. Manages Memory and Processing Efficiency

  • Breaking large texts into smaller parts keeps memory usage and processing time manageable, making it feasible to handle extensive documents without overwhelming the system.
  • Smaller chunks can be processed in parallel, which speeds up tasks like keyword extraction, summarization, or entity recognition on large texts.

4. Facilitates Chunked Data Storage and Retrieval

  • Chunks can be stored and retrieved individually, so only the relevant portions of the text need to be loaded for further processing, such as a sliding-window analysis or a contextual search.

5. Supports Recursive Splitting for Optimal Size

  • RecursiveCharacterTextSplitter splits text on progressively finer separators until each piece fits the desired chunk size, so you can tailor chunk sizes to model input limits or memory constraints while keeping related text together.

Python
 
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split text into chunks
page_chunks = {}
for page, text in pdf_text.items():
    chunks = text_splitter.split_text(text)
    page_chunks[page] = chunks
# Output chunks for each page 
for page, chunks in page_chunks.items(): 
    print(f"Text chunks from page {page}:") 
    for i, chunk in enumerate(chunks, start=1): 
        print(f"Chunk {i}:\n{chunk}\n")
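
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window written in plain Python. It is a simplification, not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually not exact windows. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the property that keeps boundary-spanning text together.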


Step 3: Extract Key Phrases

To extract meaningful phrases from the text, we use rake-nltk, a Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm.

RAKE is a fast, unsupervised keyword extraction algorithm. It needs no training data: it finds candidate phrases by treating stop words and punctuation as delimiters, then ranks the candidates using word co-occurrence statistics from the document itself. Here's an overview of how it works:

How RAKE Works

  1. Word Segmentation: It splits the text into individual words and marks common stop words (like "and," "the," "is," etc.) and punctuation as delimiters.
  2. Phrase Construction: RAKE groups together contiguous words that are not stop words to form candidate phrases.
  3. Scoring: Each word is scored from its frequency and its degree (how often it co-occurs with other words inside candidate phrases), commonly as the ratio of degree to frequency. A phrase's score is the sum of the scores of its words.
  4. Sorting: The phrases are sorted by score, and the highest-scoring phrases are selected as keywords.

Python
 
from rake_nltk import Rake

rake = Rake()

# Extract phrases from each page and store in a dictionary
page_phrases = {}
for page, text in pdf_text.items():
    rake.extract_keywords_from_text(text)
    phrases = rake.get_ranked_phrases()
    page_phrases[page] = phrases

chunk_phrases = {} 
# Extract phrases for each chunk 
for page, chunks in page_chunks.items(): 
    for chunk_number, chunk in enumerate(chunks, start=1): 
        rake.extract_keywords_from_text(chunk) 
        phrases = rake.get_ranked_phrases() 
        chunk_phrases[(page, chunk_number)] = phrases 
# Output phrases for each chunk 
for (page, chunk_number), phrases in chunk_phrases.items(): 
    print(f"Key phrases from page {page}, chunk {chunk_number}:\n{phrases}\n")
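
The four RAKE steps described above can be sketched in a few lines of plain Python. This is an illustrative approximation, not rake-nltk's implementation: the stop word list is a tiny assumed sample (rake-nltk uses NLTK's full list), and the helper names candidate_phrases and rank_phrases are hypothetical. Scoring uses the degree-to-frequency ratio.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop word list, far shorter than NLTK's.
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is", "of", "the", "to", "with"}


def candidate_phrases(text):
    """Collect runs of consecutive non-stop words, breaking at punctuation as well."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rank_phrases(text):
    """Score each phrase as the sum of degree/frequency over its words, highest first."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences in this phrase, the word itself included
    scores = {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank_phrases(
    "Rapid automatic keyword extraction is fast. Keyword extraction from the text is simple."
)
for phrase, score in ranked[:2]:
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that keep appearing together rise to the top, which is why RAKE tends to favor multi-word technical terms over isolated words.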


Step 4: Generate Embeddings

Generate an embedding for each phrase using OpenAI's text-embedding-ada-002 model and save the results to an Excel file. An embedding is a numerical vector that captures the semantic meaning of a piece of text, which lets you compare texts by meaning rather than by exact wording.

Python
 
import openai
import pandas as pd

openai.api_key = "YOUR-API-KEY"

# Function to get the embedding for a phrase
# (legacy interface, available in openai versions before 1.0)
def get_embedding(phrase):
    response = openai.Embedding.create(input=phrase, model="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Dictionary to hold embeddings
phrase_embeddings = {}

# Generate embeddings for each phrase
for (page, chunk_number), phrases in chunk_phrases.items():
    embeddings = [get_embedding(phrase) for phrase in phrases]
    phrase_embeddings[(page, chunk_number)] = list(zip(phrases, embeddings))

# Prepare data for Excel
excel_data = []
for (page, chunk_number), phrases in phrase_embeddings.items():
    for phrase, embedding in phrases:
        excel_data.append({"Page": page, "Chunk": chunk_number, "Phrase": phrase, "Embedding": embedding})

# Create a DataFrame
df = pd.DataFrame(excel_data)

# Save to Excel (requires an Excel engine such as openpyxl)
excel_filename = "phrases_embeddings.xlsx"
df.to_excel(excel_filename, index=False)
print(f"Embeddings saved to {excel_filename}")
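
Because the chunks overlap, the same phrase can be extracted from more than one chunk, and each repeat costs another embedding request. One possible way to avoid that is to cache embeddings by phrase. The sketch below shows the idea with functools.lru_cache; fake_embed is a hypothetical stand-in for the real embedding request so the example runs without an API key, and it returns a toy two-number vector rather than a real embedding.

```python
from functools import lru_cache

call_log = []  # records every phrase actually sent to the (stand-in) embedding service


def fake_embed(phrase):
    """Hypothetical stand-in for the real embedding request; returns a toy 2-D vector."""
    call_log.append(phrase)
    return (float(len(phrase)), float(phrase.count(" ") + 1))


@lru_cache(maxsize=None)
def get_embedding_cached(phrase):
    """Return the embedding for a phrase, contacting the service only the first time."""
    return fake_embed(phrase)


phrases = ["data recovery", "hash values", "data recovery", "encrypted files", "hash values"]
embeddings = [get_embedding_cached(p) for p in phrases]
print(f"{len(phrases)} phrases embedded with {len(call_log)} service calls")
```

The cached value is a tuple rather than a list so that callers cannot accidentally modify an embedding that later lookups will share.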


Step 5: Query Processing and Similarity Calculation

Generate embeddings for the query phrases and find the most similar chunks using cosine similarity. Cosine similarity measures how similar two vectors are by the angle between them in a multi-dimensional space. It's commonly used in text analysis and information retrieval to compare text embeddings or document vectors, because it quantifies similarity irrespective of the vectors' magnitude. For text embeddings, it helps identify which documents or sentences are closely related in meaning rather than in exact wording or length.
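
As a worked illustration of the definition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements that formula in plain Python and shows the magnitude independence mentioned above. The name cosine_similarity_plain is hypothetical and chosen to avoid clashing with the tutorial's own function; the sketch assumes neither vector is all zeros.

```python
import math


def cosine_similarity_plain(a, b):
    """Cosine of the angle between two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
same_direction = [10.0, 20.0, 30.0]  # v scaled by 10: same angle, larger magnitude
orthogonal = [3.0, 0.0, -1.0]        # dot product with v is zero

print(cosine_similarity_plain(v, same_direction))
print(cosine_similarity_plain(v, orthogonal))
```

Scaling a vector leaves the score essentially unchanged, while vectors at right angles score zero.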

Python
 
def extract_phrases_from_query(query):
    rake.extract_keywords_from_text(query)
    return rake.get_ranked_phrases()

# Example query (replace it with a question about your own PDF)
query = "What are the results of the 2DRA algorithm?"

# Extract phrases from the query
query_phrases = extract_phrases_from_query(query)

# Output query phrases
print(f"Query phrases:\n{query_phrases}\n")

def get_embeddings(phrases):
    return [openai.Embedding.create(input=phrase, model="text-embedding-ada-002")['data'][0]['embedding'] for phrase in phrases]

# Get embeddings for query phrases
query_embeddings = get_embeddings(query_phrases)

import numpy as np
from scipy.spatial.distance import cosine

# Function to calculate cosine similarity
def cosine_similarity(embedding1, embedding2):
    return 1 - cosine(embedding1, embedding2)

# Dictionary to store similarities
chunk_similarities = {}

# Calculate cosine similarity for each chunk
for (page, chunk_number), phrases in phrase_embeddings.items():
    if not phrases:
        continue  # Skip chunks with no key phrases; their average would be undefined
    similarities = []
    for phrase, embedding in phrases:
        phrase_similarities = [cosine_similarity(embedding, query_embedding) for query_embedding in query_embeddings]
        # Keep the highest similarity for each phrase
        similarities.append(max(phrase_similarities))
    # Average similarity for the chunk
    average_similarity = np.mean(similarities)
    chunk_similarities[(page, chunk_number)] = average_similarity

# Get top 5 chunks by similarity
top_chunks = sorted(chunk_similarities.items(), key=lambda x: x[1], reverse=True)[:5]

# Output top 5 chunks
print("Top 5 most relevant chunks:")
selected_chunks = []
for (page, chunk_number), similarity in top_chunks:
    print(f"Page: {page}, Chunk: {chunk_number}, Similarity: {similarity}")
    print(f"Chunk text:\n{page_chunks[page][chunk_number-1]}\n")
    selected_chunks.append(page_chunks[page][chunk_number-1])
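
The ranking above scores a chunk in two stages: each chunk phrase keeps its best similarity to any query phrase, and the chunk's score is the mean of those best values. The sketch below isolates that aggregation on toy two-dimensional vectors so the arithmetic can be checked by hand. The function chunk_score and the sample vectors are hypothetical, and returning 0.0 for empty inputs is an assumption of this sketch.

```python
def chunk_score(phrase_vectors, query_vectors, similarity):
    """Mean over chunk phrases of each phrase's best similarity to any query phrase."""
    if not phrase_vectors or not query_vectors:
        return 0.0  # assumption: chunks or queries without phrases count as non-matching
    best_per_phrase = [
        max(similarity(p, q) for q in query_vectors) for p in phrase_vectors
    ]
    return sum(best_per_phrase) / len(best_per_phrase)


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


query_vectors = [(1.0, 0.0), (0.0, 1.0)]
focused_chunk = [(1.0, 0.0), (0.0, 1.0)]               # every phrase matches a query phrase
mixed_chunk = [(1.0, 0.0), (-1.0, 0.0), (0.0, -1.0)]   # one match, two unrelated phrases

print(chunk_score(focused_chunk, query_vectors, dot))
print(chunk_score(mixed_chunk, query_vectors, dot))
```

Averaging means a chunk with one strong match and many unrelated phrases scores lower than a short, focused chunk, which is worth keeping in mind when choosing the chunk size.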


Step 6: Generate and Retrieve Answer Using OpenAI

Compose the context for the query from the most similar chunks and retrieve the answer using OpenAI’s GPT model.

Python
 
context = "\n\n".join(selected_chunks)
prompt = f"Answer the following query based on the provided text:\n\n{context}\n\nQuery: {query}\nAnswer:"

# Use the OpenAI API to get a response
# (legacy interface, available in openai versions before 1.0)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=300,
    temperature=0.1,
)

# Extract the answer from the response
answer = response['choices'][0]['message']['content'].strip()

# Output the answer
print(f"Answer:\n{answer}")
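
With large chunks, five of them joined together can produce a long prompt. One possible refinement, sketched below, is to add chunks in ranked order only while they fit a size budget. The helper build_prompt and the default budget of 6000 characters are assumptions for illustration; a character count is only a rough proxy for the model's token limit.

```python
def build_prompt(query, chunks, max_context_chars=6000):
    """Join chunks in ranked order until the character budget is used up, then add the query."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = len(chunk) + (2 if selected else 0)  # 2 accounts for the "\n\n" separator
        if used + cost > max_context_chars:
            break  # lower-ranked chunks are dropped once the budget is exhausted
        selected.append(chunk)
        used += cost
    context = "\n\n".join(selected)
    return (
        "Answer the following query based on the provided text:\n\n"
        f"{context}\n\nQuery: {query}\nAnswer:"
    )


ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]
prompt = build_prompt("What changed?", ranked_chunks, max_context_chars=90)
print(prompt)
```

Because the chunks arrive sorted by similarity, stopping at the first chunk that does not fit keeps the most relevant context and discards the least relevant.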


Finally, this is the answer that I received after asking that question:

Answer: The 2DRA model was utilized to perform data recovery on the Virtual Machine (VM) affected by ransomware. It was successful in retrieving all the 14,957 encrypted files. Additionally, an analysis of the encrypted files and their associated hash values on the VM was conducted using the 2DRA model after the execution of WannaCry ransomware. The analysis revealed that the hexadecimal values of the files were distinct prior to encryption, but were altered after the encryption.

This answer reflects the PDF I used in Step 1. When you run the pipeline yourself, the answer will be drawn from the PDF you supply.

This completes a basic RAG pipeline that reads PDF content, extracts meaningful phrases, generates embeddings, calculates similarities, and answers queries based on the most relevant content.
