A Simple Web Crawler with Node.js

A really simple web crawler developed with Node.js that crawls all the URLs of a domain and extracts the required data from the HTML source.

By Denzel Vieta · Updated Sep. 17, 2021 · Tutorial


For business owners, the data publicly available on the internet can be very valuable. It can help service providers generate leads and reach a bigger audience, and it can also be used to train models for deep learning projects.

For an artificial intelligence (AI) project, the more data you have, the more accurately a model can predict. For one of my recent weather app projects, I used historical data and discovered just how important such a large dataset is. For that reason, I decided to share how to develop a simple web crawler that crawls a website and extracts the important data.

There are several npm packages available for web scraping; all you need to do is install and import them. One of them, Cheerio.js, lets you use jQuery-style selectors and DOM traversal in Node.js.
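
To give a feel for it, here is a minimal Cheerio sketch (the hard-coded HTML string is just a stand-in for whatever page body you fetch):

JavaScript
 
const cheerio = require("cheerio");

// Load any HTML string; in a real crawler this would be the fetched page body
const html = `<ul><li><a href="/about">About</a></li><li><a href="/blog">Blog</a></li></ul>`;
const $ = cheerio.load(html);

// jQuery-style selection and iteration
$("a").each((i, el) => {
    console.log($(el).attr("href")); // "/about", "/blog"
});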

Most websites on the internet can be crawled. If you want to keep your own sensitive data away from crawlers, be aware that merely encoding it, for example with a binary translator that converts it into binary or another numbering system, offers only weak protection.

Here, in this article, I'm going to discuss how you can crawl all the pages of a domain. To run this code, you need to install a Node.js package called crawler (npm install crawler).

Create a Node.js project, create a .js file inside it, paste the following code into that file, and run it with node.

JavaScript
 
const Crawler = require("crawler");

const startUrl = "https://example.com"; // Replace with the domain you want to crawl

let obsolete = []; // URLs that have already been queued

let c = new Crawler();

function crawlAllUrls(url) {
    console.log(`Crawling ${url}`);
    c.queue({
        uri: url,
        callback: function (err, res, done) {
            if (err) {
                console.error(err);
                return done();
            }
            let $ = res.$;
            try {
                let urls = $("a");
                Object.keys(urls).forEach((item) => {
                    if (urls[item].type === 'tag') {
                        let href = urls[item].attribs.href;
                        // Queue absolute links on this domain as well as relative links
                        if (href && !obsolete.includes(href) && (href.startsWith(startUrl) || href.startsWith('/'))) {
                            href = href.trim();
                            obsolete.push(href);
                            // Slow down the crawl so the target server is not flooded with requests
                            setTimeout(function () {
                                // Relative links are resolved against the start URL; this assumes
                                // startUrl is a bare domain with no trailing path
                                href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${startUrl}${href}`);
                            }, 5000);
                        }
                    }
                });
            } catch (e) {
                console.error(`Encountered an error crawling ${url}. Aborting crawl.`);
            }
            done();
        }
    });
}

crawlAllUrls(startUrl);


With the help of the above code, you can save all the URLs of a domain in an array; if you are using a database, you can store them there instead. As for how it works, the crawler collects every internal link from the anchor (a href) tags on a page and adds it to the queue to be crawled next. Keep in mind that, because of the five-second delay between requests, a domain with hundreds of URLs will take a few minutes to crawl rather than a few seconds.
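
If you want to persist the results rather than keep them in memory, one minimal option (a sketch that assumes the c and obsolete variables from the code above, with the urls.json filename chosen just for illustration) is to write the array to disk when the crawler's queue empties:

JavaScript
 
const fs = require("fs");

// The crawler emits 'drain' when its request queue is empty
c.on("drain", () => {
    fs.writeFileSync("urls.json", JSON.stringify(obsolete, null, 2));
    console.log(`Saved ${obsolete.length} URLs to urls.json`);
});

Note that the five-second setTimeout delays can momentarily empty the queue before the crawl is truly finished, so treat this as a starting point rather than a complete solution.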

If you want to store other information, like the meta title, description, or canonical URL, you can get it with the following code (it belongs inside the crawler callback, where res and $ are available):

JavaScript
 
const title = $("title").text();
const url = res.options.url; // URL that was requested
const description = $("meta[name=description]").attr("content");
const robots = $("meta[name=robots]").attr("content");
const keywords = $("meta[name=keywords]").attr("content");
const pagesize = res.body.length; // Size of the raw HTML
const h1 = $("h1").text();
const h2 = $("h2").text();
// Request metadata tracked by the crawler package:
const code = res.options.stateData.code;
const protocol = res.options.protocol;
const actualDataSize = res.options.stateData.actualDataSize;
const requestTime = res.options.stateData.requestTime;


If you want to extract any other information from the HTML, then you can simply log the whole response object:

JavaScript
 
console.log(res);

By logging res, you get the HTML body of the web page along with the response metadata. You can then pick out the HTML tags you want and store them in the database.
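
For example, here is a small sketch (again inside the crawler callback, with the tag choice purely illustrative) that collects the source of every image on a page:

JavaScript
 
// Gather the src attribute of every <img> tag on the page
const imageSources = $("img")
    .map((i, el) => $(el).attr("src"))
    .get(); // .get() turns the Cheerio collection into a plain array

console.log(imageSources);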

Note

This code might not work for single-page applications (SPAs) built with Vue, React, or Angular. An SPA serves a single, nearly empty HTML file and renders all of its content dynamically in the browser, so there is little for this crawler to find in the raw HTML.
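
If you do need to crawl an SPA, one common workaround, sketched below with the puppeteer package (this goes beyond the code in this article), is to render the page in a headless browser first and then parse the resulting HTML the same way as above:

JavaScript
 
const puppeteer = require("puppeteer");

async function fetchRenderedHtml(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until network activity settles so dynamically rendered content is present
    await page.goto(url, { waitUntil: "networkidle0" });
    const html = await page.content(); // The fully rendered HTML
    await browser.close();
    return html; // This can be loaded into Cheerio for parsing
}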

So, that's all I've got for a web crawler. If you have any doubts, do share them in the comment box, and I will try my best to find the simplest solution.

Topics: Node.js, Web Service

