Web Scraping With LLMs, ScrapeGraphAI, and LangChain
In this article, learn how to use LLMs for web scraping with ScrapeGraphAI, LangChain, and Pydantic. This guide covers setup, configuration, and structured data extraction.
Now that we can scrape websites using Python and its libraries like BeautifulSoup, Requests, and Pandas, let’s take a step further and learn how to simplify scraping using LLMs. Before we get to the scraping itself, let’s define the terminology, starting with what an LLM is. If you are unfamiliar with LangChain, AI, or NLP, you are in the right place.
What Is an LLM?
LLM stands for large language model. It is a machine learning model trained on a large body of text, referred to as a corpus. "Large" refers to the sheer volume of training data: an LLM may have been trained on terabytes of text, whereas a typical file on your computer is measured in gigabytes (GB). Because of this extensive training, LLMs can answer questions grounded in that textual data. Used wisely, large language models can be applied to a variety of tasks, including summarization, Q&A, and translation. And just as Python has libraries and frameworks, LLMs have a growing ecosystem of tooling around them.
What Is ScrapeGraphAI?
ScrapeGraphAI is an advanced Python package that uses LLMs and configurable graph-based pipelines to transform web scraping. It simplifies structured data extraction from webpages and local documents: users can obtain the information they need with a single prompt. Because ScrapeGraphAI’s underlying language models can comprehend intricate page structures and extract the relevant facts, manual parsing and convoluted rule-based systems become a thing of the past.
The library provides a variety of specialized graph classes, namely SmartScraperGraph for single-page scraping, SearchGraph for multi-page extraction from search results, and ScriptCreatorGraph for generating custom scraping scripts. Using ScrapeGraphAI, you can select the best AI backend for your scraping requirements from a variety of LLM providers, such as OpenAI, Groq, and Azure, in addition to local models via Ollama.
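Switching backends is largely a configuration change. The snippet below is a rough sketch of what a local-model setup might look like; the exact configuration keys and the ollama/llama3 model tag are assumptions that depend on your ScrapeGraphAI version and on which models you have pulled into Ollama:
graph_config = {
    "llm": {
        "model": "ollama/llama3",  # assumed tag; use whatever model you have pulled locally
        "base_url": "http://localhost:11434",  # Ollama's default local endpoint
    },
}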
Let us begin by extracting data from a website using an LLM.
This tutorial will walk you through the fundamentals of using the library, and along the way we will look at a few complementary frameworks, too.
Installation
First, let’s install ScrapeGraphAI along with the other libraries we’ll use. Run the following command in your terminal:
pip install langchain pydantic python-dotenv scrapegraphai
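ScrapeGraphAI fetches pages through a headless browser powered by Playwright, so depending on your environment you may also need to download the browser binaries; treat this extra step as an environment-dependent assumption:
playwright install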
Note: I will be using the Google Colab environment.
The initial setup code does the following (a sketch of these steps appears after the list):
- Install packages. The code updates and installs some tools needed for web scraping and working with OpenAI.
- Import modules. It brings in tools for securely getting your API key and managing settings.
- Set API key. You enter your OpenAI API key safely, and it gets stored for use in your project.
- Allow nested async operations. Colab notebooks already run an asyncio event loop; this tweak lets additional asynchronous tasks run inside it, which ScrapeGraphAI’s pipeline relies on.
If you don’t already have one, learn how to generate an OpenAI API key so you can plug it into the code below.
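Concretely, those steps might look like this in a Colab cell. This is a minimal sketch; the getpass prompt and the nest_asyncio call are assumptions based on the list above:
import os
from getpass import getpass

import nest_asyncio

# Prompt for the key without echoing it, and expose it to downstream libraries.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Colab/Jupyter already runs an asyncio event loop; nest_asyncio lets
# ScrapeGraphAI's async pipeline run inside it.
nest_asyncio.apply()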
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
Here’s a straightforward explanation of what the code below does:
1. Import Modules
- os: Used for interacting with the operating system, like getting environment variables.
- dotenv: Helps load environment variables from a .env file.
- SmartScraperGraph from scrapegraphai.graphs: A tool for web scraping using AI.
2. Load Environment Variables
- load_dotenv(): This reads from a .env file in your project directory to load environment variables like your API key.
3. Get API Key
- os.getenv("OPENAI_API_KEY"): Retrieves your OpenAI API key from the environment variables.
4. Set Up Configuration
- graph_config: Sets up the configuration for SmartScraperGraph, including the API key and the AI model to use (e.g., gpt-4o).
5. Create and Run SmartScraperGraph
- SmartScraperGraph(…): Initializes the web scraper with a prompt, a webpage URL to scrape, and the configuration.
- prompt: Instructions for what you want to scrape from the page.
- source: The URL of the webpage to scrape.
- config: The setup for the AI model and API key.
- smart_scraper_graph.run(): Executes the scraping task and gets the results.
6. Print Results
- print(result): Displays the results of the scraping task.
So, in essence, this script is set up to scrape data from a specific webpage using an AI model, with the API key loaded from a secure file.
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Load environment variables (make sure you have a .env file with your OPENAI_API_KEY)
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the services on this page with their descriptions.",
    source="https://understandingdata.com/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
Output:
{'services': [
{'name': 'Data Engineering', 'description': 'NA'},
{'name': 'React Development', 'description': 'NA'},
{'name': 'Python Programming Development', 'description': 'NA'},
{'name': 'Prompt Engineering', 'description': 'NA'},
{'name': 'Web Scraping', 'description': 'NA'},
{'name': 'SaaS Applications', 'description': 'NA'}
]}
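Note that result comes back as a plain Python dict, so you can persist it like any other structured data. As a minimal sketch, here is how you might save it to a JSON file (the services.json filename is purely illustrative):
import json

# Write the scraped dict to disk as pretty-printed JSON.
with open("services.json", "w") as f:
    json.dump(result, f, indent=2)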
Web Scraping Using LangChain and Pydantic
LangChain
LangChain is a framework for building applications on top of LLMs. Take ChatGPT, for example: it uses an OpenAI language model to generate responses, but ChatGPT itself isn’t an LLM; it’s an application built with one. If you want to build and manage your own LLM-powered applications, LangChain is a great tool. It lets you bring your own data to tailor LLMs to your needs and simplifies much of the plumbing, so developers can focus on the critical tasks while LangChain handles the rest.
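To make that concrete, here is a minimal sketch of calling an LLM through LangChain; it assumes the langchain-openai package is installed and that OPENAI_API_KEY is set in your environment, as above:
from langchain_openai import ChatOpenAI

# A thin LangChain wrapper around the OpenAI chat API.
llm = ChatOpenAI(model="gpt-4o")

# invoke() sends a single prompt and returns an AIMessage; .content holds the text.
response = llm.invoke("In one sentence, what is web scraping?")
print(response.content)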
Now, let’s take a look at another method using LangChain and Pydantic.
Prerequisites
Before we dive into the implementation, make sure you have the following installed:
- Python 3.7+
- LangChain
- Pydantic
- ScrapeGraphAI
Setting Up Pydantic Models
Pydantic is a data validation and settings management library that uses Python type annotations. It is great for defining and validating structured data. Here, we will define the data models that represent the services we wish to extract.
from langchain_core.pydantic_v1 import BaseModel
from typing import List

class ServiceSchema(BaseModel):
    name: str
    description: str

class Services(BaseModel):
    services: List[ServiceSchema]
In this code, ServiceSchema represents a single service with a name and a description, and Services is a container for a list of ServiceSchema objects.
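To see that validation in action, you can instantiate the models directly with illustrative (not scraped) data:
# Valid data becomes typed attributes.
example = Services(services=[ServiceSchema(name="Web Scraping", description="NA")])
print(example.services[0].name)  # -> Web Scraping

# A missing required field raises a validation error instead of failing silently.
try:
    Services(services=[{"name": "Data Engineering"}])  # no description provided
except Exception as e:
    print(e)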
Configuring SmartScraperGraph
SmartScraperGraph
is a powerful tool in LangChain for building and running scraping tasks. Here, we’ll configure it to scrape the services from a webpage.
from scrapegraphai.graphs import SmartScraperGraph

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all of the services that are offered on this page.",
    source="https://understandingdata.com/",
    config=graph_config,  # reuses the LLM configuration defined earlier
    schema=Services,
)

result = smart_scraper_graph.run()
print(result)
With this setup:
- The scraper is instructed to extract services via a prompt.
- The URL of the webpage to be scraped is the source.
- Model details and the OpenAI API key are included in the configuration.
- The schema parameter, our Pydantic model, specifies the structure of the extracted data.
Complete Example
Here’s the complete example in one script:
from langchain_core.pydantic_v1 import BaseModel
from typing import List
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

class ServiceSchema(BaseModel):
    name: str
    description: str

class Services(BaseModel):
    services: List[ServiceSchema]

# Load environment variables
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the services on this page with their descriptions.",
    source="https://understandingdata.com/",
    config=graph_config,
    schema=Services,
)

result = smart_scraper_graph.run()
print(result)

try:
    model = Services(**result)
    print(model)
except Exception as e:
    print(e)
Output:
services=[
ServiceSchema(name='Data Engineering', description='NA'),
ServiceSchema(name='React Development', description='NA'),
ServiceSchema(name='Python Programming Development', description='NA'),
ServiceSchema(name='Prompt Engineering', description='NA'),
ServiceSchema(name='Web Scraping', description='NA'),
ServiceSchema(name='SaaS Applications', description='NA')
]
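Once validation succeeds, model is a typed Services object rather than a raw dict, so you get attribute access (and editor autocompletion) for free:
for service in model.services:
    print(f"{service.name}: {service.description}")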
Conclusion
In this article, we’ve shown how to build a smart scraper that extracts services from a webpage using ScrapeGraphAI, LangChain, and Pydantic. By combining SmartScraperGraph’s capabilities with well-defined data models, you can scrape and organize data from almost any webpage. And because the scraped data is validated against a predetermined structure, the results are easier to consume and evaluate.
Published at DZone with permission of Juveria Dalvi.