Web Scraping With LLMs, ScrapeGraphAI, and LangChain
In this article, learn how to use LLMs for web scraping with ScrapeGraphAI, LangChain, and Pydantic. This guide covers setup, configuration, and structured data extraction.
Now that we can scrape websites using Python and its libraries like BeautifulSoup, Requests, and Pandas, let’s take a step further and learn how to simplify scraping using LLMs. Before we get to the scraping itself, let’s define the terminology, starting with what an LLM is. If you are unfamiliar with LangChain, AI, or NLP, you are in the right place.
What Is an LLM?
LLM stands for large language model. It is a machine learning model trained on a large body of text, referred to as a corpus. "Large" refers to the sheer volume of training data: an LLM may have been trained on terabytes of text, whereas a typical file on your computer is measured in gigabytes (GB). Because of this extensive training, LLMs can answer questions grounded in that textual data. Used wisely, large language models can be applied to a variety of tasks, including summarization, Q&A, and translation. And just as Python has libraries and frameworks, LLMs have a growing ecosystem of tooling around them.
What Is ScrapeGraphAI?
ScrapeGraphAI is an advanced Python package that uses LLMs and configurable graph-based pipelines to transform web scraping. It simplifies structured data extraction from webpages and local documents: users can obtain the information they need with a single prompt. Because ScrapeGraphAI’s underlying language models can comprehend intricate page structures and extract the relevant facts, manual parsing and convoluted rule-based systems become a thing of the past.
The library provides a variety of specialized graph classes, namely SmartScraperGraph for single-page scraping, SearchGraph for multi-page extraction from search results, and ScriptCreatorGraph for generating custom scraping scripts. Using ScrapeGraphAI, you can select the best AI backend for your scraping requirements from a variety of LLM providers, such as OpenAI, Groq, and Azure, in addition to local models via Ollama.
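Switching backends is largely a configuration change. The snippet below is a rough sketch of what a local-model setup might look like; the exact configuration keys and the ollama/llama3 model tag are assumptions that depend on your ScrapeGraphAI version and on which models you have pulled into Ollama:
graph_config = {
    "llm": {
        "model": "ollama/llama3",  # assumed tag; use whatever model you have pulled locally
        "base_url": "http://localhost:11434",  # Ollama's default local endpoint
    },
}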
Let us begin by extracting data from a website using an LLM.
This tutorial will walk you through the fundamentals of using the library, and along the way we will look at a few complementary frameworks, too.
Installation
First, let’s install ScrapeGraphAI along with the other libraries we’ll use. Run the following command in your terminal:
pip install langchain pydantic python-dotenv scrapegraphai
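ScrapeGraphAI fetches pages through a headless browser powered by Playwright, so depending on your environment you may also need to download the browser binaries; treat this extra step as an environment-dependent assumption:
playwright install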
Note: I will be using the Google Colab environment.
The initial setup code does the following (a sketch of these steps appears after the list):
- Install packages. The code updates and installs some tools needed for web scraping and working with OpenAI.
- Import modules. It brings in tools for securely getting your API key and managing settings.
- Set API key. You enter your OpenAI API key safely, and it gets stored for use in your project.
- Allow nested async operations. Colab notebooks already run an asyncio event loop; this tweak lets additional asynchronous tasks run inside it, which ScrapeGraphAI’s pipeline relies on.
If you don’t already have one, learn how to generate an OpenAI API key so you can plug it into the code below.
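Concretely, those steps might look like this in a Colab cell. This is a minimal sketch; the getpass prompt and the nest_asyncio call are assumptions based on the list above:
import os
from getpass import getpass

import nest_asyncio

# Prompt for the key without echoing it, and expose it to downstream libraries.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Colab/Jupyter already runs an asyncio event loop; nest_asyncio lets
# ScrapeGraphAI's async pipeline run inside it.
nest_asyncio.apply()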
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
Here’s a straightforward explanation of what the code below does:
1. Import Modules
- os: Used for interacting with the operating system, like getting environment variables.
- dotenv: Helps load environment variables from a .env file.
- SmartScraperGraph from scrapegraphai.graphs: A tool for web scraping using AI.
2. Load Environment Variables
- load_dotenv(): This reads from a .env file in your project directory to load environment variables like your API key.
3. Get API Key
- os.getenv("OPENAI_API_KEY"): Retrieves your OpenAI API key from the environment variables.
4. Set Up Configuration
- graph_config: Sets up the configuration for SmartScraperGraph, including the API key and the AI model to use (e.g., gpt-4o).
5. Create and Run SmartScraperGraph
- SmartScraperGraph(…): Initializes the web scraper with a prompt, a webpage URL to scrape, and the configuration.
- prompt: Instructions for what you want to scrape from the page.
- source: The URL of the webpage to scrape.
- config: The setup for the AI model and API key.
- smart_scraper_graph.run(): Executes the scraping task and gets the results.
6. Print Results
- print(result): Displays the results of the scraping task.
So, in essence, this script is set up to scrape data from a specific webpage using an AI model, with the API key loaded from a secure file.
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Load environment variables (make sure you have a .env file with your OPENAI_API_KEY)
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the services on this page with their descriptions.",
    source="https://understandingdata.com/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
Output:
{'services': [
{'name': 'Data Engineering', 'description': 'NA'},
{'name': 'React Development', 'description': 'NA'},
{'name': 'Python Programming Development', 'description': 'NA'},
{'name': 'Prompt Engineering', 'description': 'NA'},
{'name': 'Web Scraping', 'description': 'NA'},
{'name': 'SaaS Applications', 'description': 'NA'}
]}
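Note that result comes back as a plain Python dict, so you can persist it like any other structured data. As a minimal sketch, here is how you might save it to a JSON file (the services.json filename is purely illustrative):
import json

# Write the scraped dict to disk as pretty-printed JSON.
with open("services.json", "w") as f:
    json.dump(result, f, indent=2)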
Web Scraping Using LangChain and Pydantic
LangChain
LangChain is a framework for building applications on top of LLMs. Take ChatGPT, for example: it uses an OpenAI language model to generate responses, but ChatGPT itself isn’t an LLM; it’s an application built with one. If you want to build and manage your own LLM-powered applications, LangChain is a great tool. It lets you bring your own data to tailor LLMs to your needs and simplifies much of the plumbing, so developers can focus on the critical tasks while LangChain handles the rest.
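To make that concrete, here is a minimal sketch of calling an LLM through LangChain; it assumes the langchain-openai package is installed and that OPENAI_API_KEY is set in your environment, as above:
from langchain_openai import ChatOpenAI

# A thin LangChain wrapper around the OpenAI chat API.
llm = ChatOpenAI(model="gpt-4o")

# invoke() sends a single prompt and returns an AIMessage; .content holds the text.
response = llm.invoke("In one sentence, what is web scraping?")
print(response.content)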
Now, let’s take a look at another method using LangChain and Pydantic.
Prerequisites
Before we dive into the implementation, make sure you have the following installed:
- Python 3.7+
- LangChain
- Pydantic
- ScrapeGraphAI
Setting Up Pydantic Models
Pydantic is a data validation and settings management library that uses Python type annotations. It is great for defining and validating structured data. Here, we will define the data models that represent the services we wish to extract.
from langchain_core.pydantic_v1 import BaseModel
from typing import List

class ServiceSchema(BaseModel):
    name: str
    description: str

class Services(BaseModel):
    services: List[ServiceSchema]
In this code, ServiceSchema represents a single service with a name and a description, and Services is a container for a list of ServiceSchema objects.
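To see that validation in action, you can instantiate the models directly with illustrative (not scraped) data:
# Valid data becomes typed attributes.
example = Services(services=[ServiceSchema(name="Web Scraping", description="NA")])
print(example.services[0].name)  # -> Web Scraping

# A missing required field raises a validation error instead of failing silently.
try:
    Services(services=[{"name": "Data Engineering"}])  # no description provided
except Exception as e:
    print(e)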
Configuring SmartScraperGraph
SmartScraperGraph
is a powerful tool in LangChain for building and running scraping tasks. Here, we’ll configure it to scrape the services from a webpage.
from scrapegraphai.graphs import SmartScraperGraph

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all of the services that are offered on this page.",
    source="https://understandingdata.com/",
    config=graph_config,  # reuses the LLM configuration defined earlier
    schema=Services,
)

result = smart_scraper_graph.run()
print(result)
With this setup:
- The scraper is instructed to extract services via a prompt.
- The URL of the webpage to be scraped is the source.
- Model details and the OpenAI API key are included in the configuration.
- The schema parameter, our Pydantic model, specifies the structure of the extracted data.
Complete Example
Here’s the complete example in one script:
from langchain_core.pydantic_v1 import BaseModel
from typing import List
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

class ServiceSchema(BaseModel):
    name: str
    description: str

class Services(BaseModel):
    services: List[ServiceSchema]

# Load environment variables
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the services on this page with their descriptions.",
    source="https://understandingdata.com/",
    config=graph_config,
    schema=Services,
)

result = smart_scraper_graph.run()
print(result)

try:
    model = Services(**result)
    print(model)
except Exception as e:
    print(e)
Output:
services=[
ServiceSchema(name='Data Engineering', description='NA'),
ServiceSchema(name='React Development', description='NA'),
ServiceSchema(name='Python Programming Development', description='NA'),
ServiceSchema(name='Prompt Engineering', description='NA'),
ServiceSchema(name='Web Scraping', description='NA'),
ServiceSchema(name='SaaS Applications', description='NA')
]
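Once validation succeeds, model is a typed Services object rather than a raw dict, so you get attribute access (and editor autocompletion) for free:
for service in model.services:
    print(f"{service.name}: {service.description}")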
Conclusion
In this article, we’ve shown how to build a smart scraper that extracts services from a webpage using ScrapeGraphAI, LangChain, and Pydantic. By combining SmartScraperGraph’s capabilities with well-defined data models, you can scrape and organize data from almost any webpage. And because the scraped data is validated against a predetermined structure, the results are easier to consume and evaluate.
Published at DZone with permission of Juveria Dalvi.