Scrape Amazon Product Reviews With Python

Let's learn how to use Python scripts to scrape Amazon in an ethical way and extract product review data.

By Juveria dalvi · Jan. 29, 25 · Tutorial

Amazon is a well-known e-commerce platform with a large amount of data available in various formats on the web. This data can be invaluable for gaining business insights, particularly by analyzing product reviews to understand the quality of products provided by different vendors.

In this guide, we will walk through the web scraping steps needed to extract the Amazon reviews of a particular product and save them to a CSV file, which can also be opened in Excel. Since manually copying information from web pages is tedious, we will automate the process with a script. This hands-on exercise will build a practical understanding of web scraping techniques.

Prerequisite

Before we start, make sure you have Python installed on your system. You can do that from this link. The process is very simple: just install it like you would install any other application.

Now that everything is set, let’s proceed.

How to Scrape Amazon Reviews Using Python

Install Anaconda through this link. Be sure to follow the default settings during installation. For more guidance, you can watch this video:


We can use various IDEs, but to keep it beginner-friendly, let’s start with Jupyter Notebook in Anaconda. You can watch the video linked above to understand and get familiar with the software.

Steps for Web Scraping Amazon Reviews

Create a New Notebook and save it.

Step 1: Import Necessary Modules

Let’s start by importing all the modules we need using the following code:

Python
 
import requests
from bs4 import BeautifulSoup
import pandas as pd


Step 2: Define Headers

To make your requests look like they come from a regular browser, which reduces the chance of being blocked, define custom headers. You can replace the User-agent value with your own user agent, which you can find by searching "my user agent" on Google.

Python
 
custom_headers = {
    "Accept-language": "en-GB,en;q=0.9",
    "User-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
}


Step 3: Fetch Webpage

Create a Python function to fetch the webpage, check for errors, and return a BeautifulSoup object for further processing.

Python
 
# Function to fetch the webpage and return a BeautifulSoup object
def fetch_webpage(url):
    # Send the request with the custom headers defined in Step 2
    response = requests.get(url, headers=custom_headers)
    if response.status_code != 200:
        print("Error in fetching webpage")
        exit(-1)
    # Parse the HTML with the lxml parser
    page_soup = BeautifulSoup(response.text, "lxml")
    return page_soup


Step 4: Extract Reviews

Use your browser's Inspect Element tool to find the element and attribute that hold the data we want to extract. Let's create another function, extract_reviews, that selects every div whose data-hook attribute is "review" and stores the matches in the review_blocks variable. At this point, the function only identifies the review-related elements on the page; the actual review content (review text, ratings, etc.) is extracted in the next step.

Python
 
# Function to extract reviews from the webpage
def extract_reviews(page_soup):
    review_blocks = page_soup.select('div[data-hook="review"]')
    reviews_list = []


Step 5: Process Review Data

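The code below loops over each review element and extracts the customer’s name, rating, review title, review body, review date, and the URL of any attached image. Each value is stored in its own variable (customer, customer_rating, and so on); if an element is not found, the corresponding variable is set to None. This loop belongs inside the extract_reviews function started in Step 4.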

Python
 
    # Continues inside extract_reviews() from Step 4
    for review in review_blocks:

        author_element = review.select_one('span.a-profile-name')
        customer = author_element.text if author_element else None

        rating_element = review.select_one('i.review-rating')
        customer_rating = rating_element.text.replace("out of 5 stars", "").strip() if rating_element else None

        title_element = review.select_one('a[data-hook="review-title"]')
        review_title = title_element.text.split('stars\n', 1)[-1].strip() if title_element else None

        content_element = review.select_one('span[data-hook="review-body"]')
        review_content = content_element.text.strip() if content_element else None

        date_element = review.select_one('span[data-hook="review-date"]')
        review_date = date_element.text.replace("Reviewed in the United States on ", "").strip() if date_element else None

        image_element = review.select_one('img.review-image-tile')
        image_url = image_element.attrs["src"] if image_element else None
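
The rating and date are still display strings at this point (for example, "5.0 out of 5 stars"), so it can help to convert them to typed values before analysis. The following is a minimal sketch using only the standard library. The helper names parse_rating and parse_review_date are hypothetical and not part of the original script, and the date pattern assumes the English "Reviewed in <country> on <Month D, YYYY>" wording.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Turn text such as '5.0 out of 5 stars' into 5.0 (hypothetical helper)."""
    if not raw:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)\s+out of 5", raw)
    return float(match.group(1)) if match else None


def parse_review_date(raw: Optional[str]) -> Optional[date]:
    """Pull the trailing 'Month D, YYYY' out of the review-date text (hypothetical helper)."""
    if not raw:
        return None
    match = re.search(r"on\s+([A-Za-z]+ \d{1,2}, \d{4})\s*$", raw.strip())
    if not match:
        return None
    try:
        # %B expects an English month name under the default locale
        return datetime.strptime(match.group(1), "%B %d, %Y").date()
    except ValueError:
        return None
```

Unlike the plain replace() call in the loop, the date helper does not depend on the country name in the text, so reviews from marketplaces other than the United States would also be handled, provided they use the same wording.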


Step 6: Process Scraped Reviews

Still inside the loop, the code below collects the extracted values (customer, customer_rating, review_title, review_content, review_date, and image_url) into a dictionary and appends it to reviews_list. Once every review has been processed, the function returns the list of processed reviews.

Python
 
        review_data = {
            "customer": customer,
            "customer_rating": customer_rating,
            "review_title": review_title,
            "review_content": review_content,
            "review_date": review_date,
            "image_url": image_url
        }
        reviews_list.append(review_data)

    return reviews_list
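
Because Steps 4 to 6 split one function across three snippets, it may help to see the pieces assembled and run against a small inline HTML sample rather than the live site. This is only a sketch: the sample markup is invented for illustration and merely mimics the data-hook attributes and class names that the selectors rely on, while real Amazon pages are larger and may change. The text_or_none helper is a hypothetical addition, and the built-in html.parser is used here so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the attributes targeted by the selectors.
SAMPLE_HTML = """
<div data-hook="review">
  <span class="a-profile-name">Alex</span>
  <i class="review-rating">5.0 out of 5 stars</i>
  <a data-hook="review-title">5.0 out of 5 stars
Great sound</a>
  <span data-hook="review-body"> Comfortable and clear. </span>
  <span data-hook="review-date">Reviewed in the United States on January 5, 2024</span>
  <img class="review-image-tile" src="https://example.com/photo.jpg" />
</div>
<div data-hook="review">
  <span data-hook="review-body">No author or rating here.</span>
</div>
"""


def text_or_none(element):
    """Return stripped text for a tag, or None when the selector matched nothing."""
    return element.text.strip() if element else None


def extract_reviews(page_soup):
    """Collect one dict per review block, mirroring Steps 4 to 6."""
    reviews_list = []
    for review in page_soup.select('div[data-hook="review"]'):
        rating = text_or_none(review.select_one("i.review-rating"))
        title = text_or_none(review.select_one('a[data-hook="review-title"]'))
        review_date = text_or_none(review.select_one('span[data-hook="review-date"]'))
        image = review.select_one("img.review-image-tile")
        reviews_list.append({
            "customer": text_or_none(review.select_one("span.a-profile-name")),
            "customer_rating": rating.replace("out of 5 stars", "").strip() if rating else None,
            "review_title": title.split("stars\n", 1)[-1].strip() if title else None,
            "review_content": text_or_none(review.select_one('span[data-hook="review-body"]')),
            "review_date": review_date.replace("Reviewed in the United States on ", "") if review_date else None,
            "image_url": image.attrs["src"] if image else None,
        })
    return reviews_list


reviews = extract_reviews(BeautifulSoup(SAMPLE_HTML, "html.parser"))
```

The second sample block has no author, rating, or image, which shows why every lookup falls back to None instead of assuming the element exists.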


Step 7: Initialize Review URL

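Now, let's define a main function and initialize a review_page_url variable with an Amazon product review page URL. The function then fetches the page and passes the result to extract_reviews.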

Python
 
def main():

    review_page_url = "https://www.amazon.com/BERIBES-Cancelling-Transparent-Soft-Earpads-Charging-Black/product-reviews/B0CDC4X65Q/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

    page_soup = fetch_webpage(review_page_url)
    scraped_reviews = extract_reviews(page_soup)
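
A single request returns only one page of reviews. One possible way to cover more pages is to generate a URL per page and call the fetch and extract functions for each one. The sketch below only builds the URLs, using the standard library. The pageNumber query parameter and the helper name build_page_urls are assumptions, so confirm the parameter by checking the link behind the site's own next-page button before relying on it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, last_page, param="pageNumber"):
    """Return one URL per page, setting `param` (an assumed name) to 1..last_page."""
    parts = urlsplit(base_url)
    # Keep the existing query parameters, dropping any previous page value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    urls = []
    for page in range(1, last_page + 1):
        new_query = urlencode(query + [(param, str(page))])
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, parts.fragment)))
    return urls


# Placeholder URL for illustration only.
page_urls = build_page_urls(
    "https://www.example.com/product-reviews/ABC123?reviewerType=all_reviews", 3
)
```

Each generated URL could then be passed to fetch_webpage, ideally with a pause between requests so the site is not flooded.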


Step 8: Verify Scraped Data

Now, let’s print the scraped review data (stored in the scraped_reviews variable) to the console for verification purposes.

Python
 
    # Print the scraped data to verify
    print("Scraped Data:", scraped_reviews)


Step 9: Create a DataFrame

Next, create a DataFrame from the data, which will help organize data into tabular form.

Python
 
    # Create a DataFrame from the scraped reviews
    reviews_df = pd.DataFrame(data=scraped_reviews)


Step 10: Export DataFrame to CSV

Now, export the DataFrame to a CSV file in the current working directory.

Python
 
    reviews_df.to_csv("reviews.csv", index=False)

    print("CSV file has been created.")
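
If pandas is not available, the same list of dictionaries can be written with the standard library's csv module instead. This is an alternative to the DataFrame approach, not what the script in this article does. The function name write_reviews_csv and the sample row are made up for illustration, and an in-memory buffer stands in for the real file.

```python
import csv
import io

FIELDNAMES = ["customer", "customer_rating", "review_title",
              "review_content", "review_date", "image_url"]


def write_reviews_csv(reviews, file_obj):
    """Write review dicts as CSV rows; None values become empty cells."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(reviews)


# Made-up sample row for illustration.
sample = [{"customer": "Alex", "customer_rating": "5.0", "review_title": "Great sound",
           "review_content": "Comfortable, clear audio", "review_date": "January 5, 2024",
           "image_url": None}]

buffer = io.StringIO()  # in-memory stand-in for open("reviews.csv", "w", newline="")
write_reviews_csv(sample, buffer)
csv_text = buffer.getvalue()
```

Fields that contain commas, such as the review text and the date, are quoted automatically, so the file can still be opened in Excel or loaded into pandas later.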


Step 11: Ensure Standalone Execution

The code construct below acts as a protective measure. It ensures that certain code runs only when the script is directly executed as a standalone program rather than being imported as a module by another script.

Python
 
# Ensuring the script runs only when executed directly
if __name__ == '__main__':
    main()


Result

[Image: Result]


Why Scrape Amazon Product Reviews?

Scraping Amazon product reviews can provide valuable insights for businesses. Here’s why they do it:

Feedback Collection

Every business needs feedback to understand customer requirements and implement changes to improve product quality. Scraping reviews allows businesses to gather large volumes of customer feedback quickly and efficiently.

Sentiment Analysis

Analyzing the sentiments expressed in reviews can help identify positive and negative aspects of products, leading to informed business decisions.

Competitor Analysis

Scraping allows businesses to monitor competitors’ pricing and product features, helping them stay competitive in the market.

Business Expansion Opportunities

By understanding customer needs and preferences, businesses can identify opportunities for expanding their product lines or entering new markets.

Manually copying and pasting content is time-consuming and error-prone. This is where web scraping comes in. Using Python to scrape Amazon reviews can automate the process, reduce manual errors, and provide accurate data.

Benefits of Scraping Amazon Reviews

  • Efficiency: Automate data extraction to save time and resources.
  • Accuracy: Reduce human errors with automated scripts.
  • Large data volume: Collect extensive data for comprehensive analysis.
  • Informed decision-making: Use customer feedback to make data-driven business decisions.

Conclusion

Now that we’ve covered how to scrape Amazon reviews using Python, you can apply the same techniques to other websites by inspecting their elements. Here are some key points to remember:

Understanding HTML

Familiarize yourself with the HTML structure. Knowing how elements are nested and how to navigate the Document Object Model (DOM) is crucial for finding the data you want to scrape.

CSS Selectors

Learn how to use CSS selectors to accurately target and extract specific elements from a webpage.

Python Basics

Understand Python programming, especially how to use libraries like requests for making HTTP requests and BeautifulSoup for parsing HTML content.

Inspecting Elements

Practice using browser developer tools (right-click on a webpage and select “Inspect” or press Ctrl+Shift+I) to examine the HTML structure. This helps you find the tags and attributes that hold the data you want to scrape.

Error Handling

Add error handling to your code to deal with possible issues, like network errors or changes in the webpage structure.
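
As one way to act on this, a fetch can be wrapped in a small retry loop instead of giving up on the first failure. The sketch below uses only the standard library and takes the fetch function as a parameter, so it does not assume any particular HTTP client. The names with_retries and flaky_fetch are hypothetical, and the flaky function merely simulates a temporary network error.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky_fetch, attempts=3, base_delay=0.5, sleep=delays.append)
```

To use this with the script above, fetch_webpage would need to raise an exception on a bad status code rather than call exit(-1), because exit() raises SystemExit, which the except Exception clause does not catch.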

Legal and Ethical Considerations

Always check a website’s robots.txt file and terms of service to ensure compliance with legal and ethical rules of web scraping.

By mastering these areas, you’ll be able to confidently scrape data from various websites, allowing you to gather valuable insights and perform detailed analyses.
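
For the robots.txt check mentioned under Legal and Ethical Considerations, Python's standard library includes a parser. The sketch below feeds it a made-up robots.txt so that it runs offline; the rules, the "my-scraper" user agent name, and the example.com URLs are all invented for illustration and say nothing about any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real check would load the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Ask whether a given user agent may fetch a given URL under these rules.
allowed = parser.can_fetch("my-scraper", "https://www.example.com/products/1")
blocked = parser.can_fetch("my-scraper", "https://www.example.com/private/data")
```

Against a live site, set_url() followed by read() would download and parse the real file instead of the sample text. A permissive robots.txt does not replace reading the terms of service.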

Published at DZone with permission of Juveria dalvi. See the original article here.