How to Scrape Walmart Product Pages With Node.js in Under 10 Lines of Code
In this tutorial, I will walk you through the steps required to scrape Walmart product data in less than a dozen lines of code, using Node.js and Scrapezone's web scraping SDK.
What Is Web Scraping?
Web scraping, or web crawling, is the act of collecting publicly available data from the internet. The internet is the richest data source we know, and much of it is user-generated content (reviews, social posts, comments, and more), so the variety of publicly available data for web scraping is endless. In this quick tutorial, I will walk you through the steps required to scrape Walmart product data in less than a dozen lines of code, using Node.js and Scrapezone's web scraping SDK.
Why Is It Hard to Maintain a Dedicated Scraper?
The main problems with web scraping are website changes and bot detection. HTML pages do not look the same every day: a minor change in how a website displays its data requires changes to the scraper, and in some cases even a small change to the website's structure might require a complete rebuild.
Suppose you are building a solution that revolves around analyzing publicly available data. In that case, you will probably require multiple scrapers running in parallel to provide data in a timely, reliable, and accurate way. When writing scrapers in-house, the development team must run daily tests on them, monitor website changes, and adjust the scrapers accordingly.
Using a web scraping SDK like Scrapezone's officially maintained scrapers means you simply get the data at scale, without worrying about any changes to the website's structure.
Scraping Data at a Large Scale
When scraping data at a large scale, you will hit a brick wall in the form of anti-bot detection. Most eCommerce marketplaces, search engines, and travel websites have some form or another of anti-bot protection. The classic way of dealing with bot detection is imitating a real user: using different IP addresses, driving headless browsers with tools like Puppeteer or Selenium, setting up original and believable browser fingerprints, and throttling your request rates. Some proxy providers even promise automatically rotating proxies, so you do not need to maintain and manage a list of IP addresses yourself.
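As a toy illustration of two of those techniques, here is a sketch of round-robin proxy rotation plus a throttling helper. The proxy addresses are placeholders, not real servers:

```javascript
// Round-robin rotation over a small proxy pool, plus a throttling helper.
// The proxy addresses below are placeholders, not real servers.
const proxies = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'];
let next = 0;

// Return the next proxy in the pool, wrapping back to the start.
function nextProxy() {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length;
    return proxy;
}

// Resolve after `ms` milliseconds, to space requests apart.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

console.log(nextProxy()); // 10.0.0.1:8080
console.log(nextProxy()); // 10.0.0.2:8080
```

Each outgoing request would pick its proxy with `nextProxy()` and `await sleep(...)` between requests; real evasion systems layer many more signals on top of this.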
All right, now let us dive in and write some code to scrape the product data. In this example, I will cover how to scrape the product pages of some headphones from Walmart.
Step 1: Install Node.js
If you do not have Node.js installed yet, go ahead and install it from https://nodejs.org/en/download/.
Open a terminal and type the following commands to start a new Node.js project.
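Something like the following (the SDK's npm package name is an assumption here; check Scrapezone's documentation for the exact one):

```shell
# Create the project directory and initialize an npm project inside it.
mkdir scraper
cd scraper
npm init -y

# Install the Scrapezone SDK (package name assumed; check the docs).
npm install scrapezone-node
```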
These commands make a new directory called 'scraper', initialize an npm project in it, and install the Scrapezone SDK using npm, the Node package manager. Please note that this will create a 'node_modules' folder in your project directory and download the Scrapezone SDK code into it.
Step 2: Write the Scraper Code
In this step, as opposed to regular scraping tutorials, we will not need Cheerio, Puppeteer, Selenium, or PhantomJS. No scraping libraries are needed, because the API response already contains the parsed data. A new Scrapezone account comes with 1,000 free API calls, which should suffice for a small to medium proof-of-concept operation.
Create a new file: index.js and paste the following code:
Step 3: Create a Scrapezone Account and Run the Scraper
Register for Scrapezone, or log in if you already have an account. The first 1,000 scrapes are free. Copy your scraping username and password and replace the 'username' and 'password' placeholders in the code with them.
All set. Run the scraper:
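From inside the project directory:

```shell
node index.js
```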
You will see the product details printed as a JSON object.
The results will contain the brand name, product name, product image, description, price, ratings, and a few more fields you might want to use.
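The exact schema depends on the scraper, but the shape is roughly like this. Every field name and value below is an invented placeholder for illustration, not a real result:

```json
{
    "brand": "ExampleBrand",
    "product_name": "Example Wireless Headphones",
    "product_image": "https://example.com/image.jpg",
    "description": "An illustrative placeholder description.",
    "price": 59.99,
    "ratings": 4.5
}
```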
If you want to see the complete documentation and the different scrapers available, visit Scrapezone's documentation page on GitHub.
To view the results as a CSV file, go to the Scrapezone dashboard, select the scrape on the 'Scrapes' page, and click 'Download CSV'.
Many developers and organizations waste a great deal of time and money reinventing the wheel: writing scrapers, dodging anti-bot detection, and monitoring their scrapes. The faster option for business growth is to outsource that burden instead of reimplementing existing solutions; knowing how to integrate existing solutions quickly is an important software architecture skill.
The scraping-as-a-service world is growing rapidly, and it lets businesses focus on their core efforts instead of the tedious tasks that web scraping involves.
Opinions expressed by DZone contributors are their own.