
5 Puppeteer Tricks That Will Make Your Web Scraping Easier and Help You Avoid Detection

This article includes five Puppeteer tricks (with code examples) that I believe will help you better scrape the web and avoid detection.


Puppeteer is probably the best free web scraping tool on the internet. It has many options and is easy to use once you get the hang of it. The problem is that the sheer number of options it offers can overwhelm the average developer.

As a veteran of the web scraping industry and the proxy world, I’ve gathered five Puppeteer tricks (with code examples) that I believe will help you with the daunting task of web scraping and will help you avoid detection.

So, What Is Puppeteer?

Puppeteer is an open-source Node.js library developed and maintained by Google. It is based on Chromium, the open-source version of Chrome, and can do almost any task a human can perform in a regular web browser. It has a headless mode, which allows it to run as code in the background without actually rendering the pages, greatly reducing the resources needed to run it.

Google’s maintenance of this library is fantastic, with new features and security updates added regularly, a clear and easy-to-use API, and user-friendly documentation.

What Is Web Scraping?

Web scraping is the automated version of surfing the web and collecting data. The internet is full of content, much of it user-generated (UGC), so there are countless data points you can scrape.

However, most of the valuable data sits on popular websites that are scraped daily: Google search results, eCommerce platforms like Amazon, Walmart, and Shopify, travel and hotel websites; you get the idea. Most companies or individuals who perform web scraping are looking for data to improve their sales, search rankings, keyword analysis, price comparison, and so on.

What Is the Difference Between Web Crawling and Web Scraping?

Web scraping and web crawling are very similar terms, and the confusion between them is natural. The main difference between web scraping and web crawling revolves around the type of operation/activity that the user is doing. 

Web crawling moves around a website collecting links, and optionally follows those links to collect and aggregate additional data or links. It is called crawling because it works like a spider crawling through a website, which is why developers often call crawlers spiders.

Web scraping, on the other hand, is task-oriented: it targets a predefined link, retrieves the data from it, and sends it to a database.

Usually, a data collection pipeline is built around a combination of these two approaches: getting the links to scrape with a web crawler/spider, and then extracting the data from those pages with a scraper.

5 Tips for Using Puppeteer Like a Pro

Since Puppeteer is rather complicated, there are many configurations a developer needs to learn in order to scrape the web with it properly. I have therefore summarized the top five tips that are often forgotten and can be the difference between a successful and a failed scraping operation.

Headless Mode

Puppeteer, like Selenium, allows the user to run it in headless mode. This prevents the browser from rendering on the screen and saves a lot of resources. In fact, when scraping from a Docker container, it is generally impossible to launch Puppeteer in non-headless (normal) mode; trying will result in an error.

To start Puppeteer in headless mode, we will need to add “headless: true” to the launch arguments.

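A minimal sketch, assuming the standard puppeteer package:

JavaScript
const puppeteer = require('puppeteer');

(async () => {
  // 'headless: true' keeps Chromium from rendering a visible window
  const browser = await puppeteer.launch({ headless: true });

  // ...your scraping logic goes here...

  await browser.close();
})();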

Avoid Using Unnecessary Tabs

The most common mistake that can affect performance in a wide-scale scraping operation is opening a new tab in Puppeteer right after launching the browser. This mistake is so common that it is repeated in many Puppeteer tutorials and StackOverflow answers.

The mistake of opening a second tab looks like this:

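(A sketch of the anti-pattern; the launch options are illustrative.)

JavaScript
const browser = await puppeteer.launch({ headless: true });

// This opens a SECOND tab; the browser already has one open
const page = await browser.newPage();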

When Puppeteer launches a browser, the browser already opens with one page. To get the object of that page, simply do the following:

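(A sketch; browser.pages() resolves to the browser's open tabs.)

JavaScript
// Reuse the tab that launch() already opened instead of creating a new one
const [page] = await browser.pages();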

And that’s it: you have the object of the already-open page.

Using a Proxy

Many websites will try to block your scraping attempts, since they want only human visitors. Most of the blocking is done with IP-based rules: a simple bot-defense mechanism monitors each visitor’s IP address, saves it in a database, and counts that address’s visits per day. Once you visit more than a certain number of pages, this defense mechanism starts blocking you. Websites respond to a recognized scrape attempt in two main ways: some simply return a 400-range status code, and some present a page with a CAPTCHA, or just a message that you are probably a robot.

To use a proxy address, you need to sign up with one of the many proxy providers, who will give you an address, a username, and a password.

When launching Puppeteer, you will need to pass the given address in the args array, using the flag “--proxy-server=<address>”:

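(A sketch; the host and port are placeholders for whatever your provider gives you.)

JavaScript
const browser = await puppeteer.launch({
  headless: true,
  // Placeholder proxy address; substitute your provider's host and port
  args: ['--proxy-server=proxy.example.com:8080'],
});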

The username and password need to be set on the page object itself, using the page.authenticate() method:

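(A sketch with placeholder credentials.)

JavaScript
// Placeholder username/password from your proxy provider
await page.authenticate({
  username: 'proxyUser',
  password: 'proxyPassword',
});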

Set a User-Agent

Setting the User-Agent is a simple tweak to your browser fingerprint that can make a big difference in how often your scraper gets blocked. The user agent is a header sent by the browser when requesting a web page; it gives the website information about the browser version and operating system. Naturally, when you don’t send a valid User-Agent, websites can easily detect your scraper as a bot and block your scraping attempts.

For the best results, I always use my real browser’s user agent. You can get this information by searching Google.

The easiest way to set the user agent is with the page.setUserAgent() function:

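(A sketch; the string below is an example desktop Chrome user agent, so substitute your real browser's value.)

JavaScript
// Example desktop Chrome user agent; use your real browser's value
await page.setUserAgent(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
);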

Set the Correct Screen Resolution

Screen resolution is another simple tweak that can take you a long way. As with the user agent, we need to match the screen resolution of the device we are pretending to use: a popular desktop resolution for desktop websites, and a mobile phone’s resolution for mobile sites.

The most common screen resolution nowadays is 1366x768, so we will set the page accordingly.

Setting the screen resolution is done with the page.setViewport() method:

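(A sketch matching the 1366x768 resolution mentioned above.)

JavaScript
// Match a common desktop screen resolution
await page.setViewport({ width: 1366, height: 768 });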

Our example now looks like this:

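(A sketch combining all five tips; the proxy address, credentials, user agent, and target URL are placeholders.)

JavaScript
const puppeteer = require('puppeteer');

(async () => {
  // Headless launch behind a proxy (placeholder address)
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--proxy-server=proxy.example.com:8080'],
  });

  // Reuse the tab that launch() already opened
  const [page] = await browser.pages();

  // Placeholder proxy credentials
  await page.authenticate({ username: 'proxyUser', password: 'proxyPassword' });

  // Example desktop Chrome user agent; use your real browser's value
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
  );

  // Match a common desktop screen resolution
  await page.setViewport({ width: 1366, height: 768 });

  // Placeholder target URL; your scraping logic goes here
  await page.goto('https://example.com');

  await browser.close();
})();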

I hope this article has cleared up a few common Puppeteer mistakes and will save you from staring at the screen, wondering why your scraper got blocked.

Good luck!
