Building a RAG Model Pipeline Using Python With Online Text Data

In this article, you'll find an end-to-end guide for extracting, embedding, and querying text from online sources like Wikipedia, using OpenAI models to generate answers.

By Vaibhavi Tiwari · Nov. 18, 24 · Tutorial

In this tutorial, I will walk you through the process of constructing a Retrieval-Augmented Generation (RAG) pipeline using Python. The pipeline fetches, processes, and queries content from online publications: text is extracted from a Wikipedia page, split into manageable chunks, and embedded; similarity scores are then calculated so that user queries can be answered with the most relevant information.

Prerequisites

Create a .ipynb file and start following the steps below:

Python
 
!pip3 install requests beautifulsoup4 openai pandas numpy scipy spacy langchain openpyxl

Library Descriptions

Requests

This library allows us to make HTTP requests in Python, which is essential for retrieving online data, such as extracting text from websites (e.g., Wikipedia articles).

BeautifulSoup4

A powerful library for web scraping, BeautifulSoup4 is used here to parse and extract text from HTML, which is particularly helpful for structuring text from online sources.

OpenAI

The OpenAI library enables interaction with OpenAI’s API for tasks like generating text and embeddings or performing language-based tasks using models such as GPT-3 or GPT-4.

pandas

pandas is a versatile data manipulation library that allows for structured data storage and management, making it easy to organize and export data (e.g., to Excel).

NumPy and SciPy

These libraries provide efficient mathematical functions. NumPy is used for numerical operations, while SciPy includes functions for calculating cosine similarity, which is helpful for comparing text embeddings.
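
As a quick illustration of the cosine similarity measure used later in the pipeline (a toy example with made-up vectors, not part of the article's pipeline code):

Python

from scipy.spatial.distance import cosine

# Two small made-up vectors standing in for text embeddings
a = [0.1, 0.3, 0.5]
b = [0.2, 0.2, 0.6]

# scipy's cosine() returns the cosine *distance*, so similarity is 1 minus that
similarity = 1 - cosine(a, b)
print(similarity)  # roughly 0.97 -- vectors pointing in similar directions score close to 1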

spaCy

spaCy is a natural language processing (NLP) library that allows for keyword extraction, entity recognition, and other linguistic processing. Here, it helps us extract key phrases from chunks of text.

LangChain

This library supports the implementation of language model applications. It includes tools like RecursiveCharacterTextSplitter, which enables us to split text into manageable chunks while preserving context.

openpyxl

openpyxl is used to write data into Excel files, allowing us to save embeddings or other structured data for later use.

After installing these libraries, we're ready to set up our data processing pipeline for the RAG model.

Step-by-Step Guide

Step 1: Import Libraries and Fetch Article Content

We begin by importing the requests library and BeautifulSoup, which allow us to fetch and parse content from a Wikipedia article online. We use requests to send an HTTP request to the article's URL, and BeautifulSoup to parse the HTML response and extract the main text. We gather all the paragraphs, tidy up the text, and save each paragraph in a list.

Finally, we merge these paragraphs into a single string, article_text, which is ready for the subsequent processing steps.

Python
 
import requests
from bs4 import BeautifulSoup

# URL of the Wikipedia article (use any topic you prefer)
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

# Fetch the page content
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract paragraphs from the article
text_data = []
for p in soup.find_all('p'):
    text = p.get_text(strip=True)
    text_data.append(text)

# Join all paragraphs into a single text
article_text = "\n\n".join(text_data)
print("Article content extracted.")


Step 2: Chunk Text for Embedding

In this step, we use the RecursiveCharacterTextSplitter from langchain_text_splitters to break the article text into manageable chunks while preserving essential context. Setting a chunk_size of 1,000 characters along with a chunk_overlap of 200 characters lets each chunk share some text with its neighbors, which helps preserve context across chunk boundaries.

Next, we divide the complete article text into these segments and display each segment to verify the separation. This sets up the text for embedding and similarity analysis in the following steps.

Python
 
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split article text into chunks
chunks = text_splitter.split_text(article_text)

# Display chunks
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}:\n{chunk}\n")


Step 3: Extract Key Phrases Using spaCy

Here, we utilize spaCy to extract key phrases from each text chunk, emphasizing significant noun phrases. Once the en_core_web_sm language model is installed and loaded, we proceed to define a function that will help us identify noun chunks in each segment of text. This function retrieves phrases that consist of multiple words, ensuring the extraction of more meaningful keywords. 

Next, we apply this function to each chunk, collecting the extracted phrases in a dictionary and showing them to confirm our results. This step is crucial for identifying key terms and concepts that will be used in the following embedding and similarity calculations.

Python
 
import spacy
!python -m spacy download en_core_web_sm

# Load spaCy's English model (make sure it's installed)
# You may need to run this once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Define a function to extract keywords using noun chunks from spaCy
def extract_keywords_spacy(text):
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks if len(chunk.text.split()) > 1]  # Only keep phrases longer than 1 word

# Initialize a dictionary to store phrases for each chunk
chunk_phrases = {}

# Extract phrases for each chunk
for chunk_number, chunk in enumerate(chunks, start=1):
    phrases = extract_keywords_spacy(chunk)
    chunk_phrases[chunk_number] = phrases

# Display extracted phrases
for chunk_number, phrases in chunk_phrases.items():
    print(f"Key phrases from chunk {chunk_number}:\n{phrases}\n")


Step 4: Generate Embeddings for Key Phrases

In this step, we create embeddings for each extracted key phrase using OpenAI's text-embedding-ada-002 model, which produces numerical representations of text that capture its semantic meaning. We define a get_embedding function to retrieve the embedding for a given phrase, apply it to each phrase in our chunks, and store the results.

Next, we organize the phrases and embeddings from each chunk into a format suitable for export to Excel. Finally, we store the embeddings in an Excel file called phrases_embeddings_article.xlsx, making them convenient to reuse later.

Python
 
import openai
import pandas as pd

openai.api_key = "YOUR-API-KEY"

def get_embedding(phrase):
    response = openai.Embedding.create(input=phrase, model="text-embedding-ada-002")
    return response['data'][0]['embedding']

# Generate embeddings for each phrase
phrase_embeddings = {}
for chunk_number, phrases in chunk_phrases.items():
    embeddings = [get_embedding(phrase) for phrase in phrases]
    phrase_embeddings[chunk_number] = list(zip(phrases, embeddings))

# Prepare data for Excel output
excel_data = []
for chunk_number, phrases in phrase_embeddings.items():
    for phrase, embedding in phrases:
        excel_data.append({"Chunk": chunk_number, "Phrase": phrase, "Embedding": embedding})

# Save embeddings to Excel
df = pd.DataFrame(excel_data)
df.to_excel("phrases_embeddings_article.xlsx", index=False)
print("Embeddings saved to phrases_embeddings_article.xlsx")


Step 5: Query Processing and Similarity Calculation

In this step, we compute similarity scores between the query and each chunk's embeddings to identify the most relevant content. We start by embedding the query with OpenAI's model and defining a cosine_similarity function for comparing embeddings. For each chunk, we calculate the similarity between the query embedding and every phrase embedding in that chunk, keeping the highest score for each phrase.

We then record the average similarity for each chunk, sort the chunks by score, and keep the top five, which represent the content most relevant to the query.

Python
 
from scipy.spatial.distance import cosine
import numpy as np

# Cosine similarity between two embeddings (1 - cosine distance)
def cosine_similarity(embedding1, embedding2):
    return 1 - cosine(embedding1, embedding2)

# Embed the user query
query = "Explain the applications of NLP in healthcare."
query_phrases = [get_embedding(query)]
chunk_similarities = {}

# Calculate similarity for each chunk
for chunk_number, phrases in phrase_embeddings.items():
    similarities = []
    for phrase, embedding in phrases:
        phrase_similarities = [cosine_similarity(embedding, query_embedding) for query_embedding in query_phrases]
        similarities.append(max(phrase_similarities))
    chunk_similarities[chunk_number] = np.mean(similarities)

# Retrieve top 5 most relevant chunks
top_chunks = sorted(chunk_similarities.items(), key=lambda x: x[1], reverse=True)[:5]
selected_chunks = [chunks[chunk_number-1] for chunk_number, _ in top_chunks]
print("Top 5 relevant chunks:", selected_chunks)


Step 6: Generate and Retrieve Answer Using OpenAI

Finally, we combine the most relevant chunks into a single context string and pass it, together with the question, to the OpenAI chat model, which generates an answer grounded in the retrieved text.

Python
 
context = "\n\n".join(selected_chunks)
prompt = f"Answer the following question based on the article:\n\n{context}\n\nQuestion: {query}\nAnswer:"

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}],
    max_tokens=300,
    temperature=0.1
)

answer = response['choices'][0]['message']['content'].strip()
print(f"Answer:\n{answer}")


Once you have completed all of these steps, you can expect an answer similar to the following:

"In healthcare, Natural Language Processing (NLP) is used to analyze notes and text in electronic health records. This data, which would otherwise be inaccessible, is crucial when seeking to improve care or protect patient privacy."

Note: I used the Wikipedia article/page about NLP; feel free to use any other article of your choice.


