A Simple Web Crawler with Node.js

A really simple web crawler developed with Node.js that crawls all the URLs of a domain and extracts the required data from the HTML source.

By Denzel Vieta · Updated Sep. 17, 2021 · Tutorial


For business owners, the data publicly available on the internet can be very valuable. It can help service providers generate leads and reach a bigger audience, and it can also be used to train models for deep learning projects.

For an artificial intelligence (AI) project, the more data you have, the more accurately a model can predict. For one of my recent weather app projects, I used historical data and discovered just how important such a large dataset is. For that reason, I decided to share how to develop a simple web crawler that crawls a website and extracts the important data.

There are several npm packages available for web scraping; all you need to do is install and import them. One of them, Cheerio.js, lets you use jQuery-style selectors and DOM traversal in Node.js.
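
To give a feel for it, here is a minimal Cheerio sketch (the hard-coded HTML string is just a stand-in for whatever page body you fetch):

JavaScript
 
const cheerio = require("cheerio");

// Load any HTML string; in a real crawler this would be the fetched page body
const html = `<ul><li><a href="/about">About</a></li><li><a href="/blog">Blog</a></li></ul>`;
const $ = cheerio.load(html);

// jQuery-style selection and iteration
$("a").each((i, el) => {
    console.log($(el).attr("href")); // "/about", "/blog"
});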

Most websites on the internet can be crawled. If you want to keep your own sensitive data away from crawlers, be aware that merely encoding it, for example with a binary translator that converts it into binary or another numbering system, offers only weak protection.

Here, in this article, I'm going to discuss how you can crawl all the pages of a domain. To run this code, you need to install a Node.js package called crawler (npm install crawler).

Create a Node.js project, create a .js file inside it, paste the following code into that file, and run it with node.

JavaScript
 
const Crawler = require("crawler");

const startUrl = "https://example.com"; // Replace with the domain you want to crawl

let obsolete = []; // URLs that have already been queued

let c = new Crawler();

function crawlAllUrls(url) {
    console.log(`Crawling ${url}`);
    c.queue({
        uri: url,
        callback: function (err, res, done) {
            if (err) {
                console.error(err);
                return done();
            }
            let $ = res.$;
            try {
                let urls = $("a");
                Object.keys(urls).forEach((item) => {
                    if (urls[item].type === 'tag') {
                        let href = urls[item].attribs.href;
                        // Queue absolute links on this domain as well as relative links
                        if (href && !obsolete.includes(href) && (href.startsWith(startUrl) || href.startsWith('/'))) {
                            href = href.trim();
                            obsolete.push(href);
                            // Slow down the crawl so the target server is not flooded with requests
                            setTimeout(function () {
                                // Relative links are resolved against the start URL; this assumes
                                // startUrl is a bare domain with no trailing path
                                href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${startUrl}${href}`);
                            }, 5000);
                        }
                    }
                });
            } catch (e) {
                console.error(`Encountered an error crawling ${url}. Aborting crawl.`);
            }
            done();
        }
    });
}

crawlAllUrls(startUrl);


With the help of the above code, you can save all the URLs of a domain in an array; if you are using a database, you can store them there instead. As for how it works, the crawler collects every internal link from the anchor (a href) tags on a page and adds it to the queue to be crawled next. Keep in mind that, because of the five-second delay between requests, a domain with hundreds of URLs will take a few minutes to crawl rather than a few seconds.
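
If you want to persist the results rather than keep them in memory, one minimal option (a sketch that assumes the c and obsolete variables from the code above, with the urls.json filename chosen just for illustration) is to write the array to disk when the crawler's queue empties:

JavaScript
 
const fs = require("fs");

// The crawler emits 'drain' when its request queue is empty
c.on("drain", () => {
    fs.writeFileSync("urls.json", JSON.stringify(obsolete, null, 2));
    console.log(`Saved ${obsolete.length} URLs to urls.json`);
});

Note that the five-second setTimeout delays can momentarily empty the queue before the crawl is truly finished, so treat this as a starting point rather than a complete solution.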

If you want to store other information, like the meta title, description, or canonical URL, you can get it with the following code (it belongs inside the crawler callback, where res and $ are available):

JavaScript
 
const title = $("title").text();
const url = res.options.url; // URL that was requested
const description = $("meta[name=description]").attr("content");
const robots = $("meta[name=robots]").attr("content");
const keywords = $("meta[name=keywords]").attr("content");
const pagesize = res.body.length; // Size of the raw HTML
const h1 = $("h1").text();
const h2 = $("h2").text();
// Request metadata tracked by the crawler package:
const code = res.options.stateData.code;
const protocol = res.options.protocol;
const actualDataSize = res.options.stateData.actualDataSize;
const requestTime = res.options.stateData.requestTime;


If you want to extract any other information from the HTML, then you can simply log the whole response object:

JavaScript
 
console.log(res);

By logging res, you get the HTML body of the web page along with the response metadata. You can then pick out the HTML tags you want and store them in the database.
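
For example, here is a small sketch (again inside the crawler callback, with the tag choice purely illustrative) that collects the source of every image on a page:

JavaScript
 
// Gather the src attribute of every <img> tag on the page
const imageSources = $("img")
    .map((i, el) => $(el).attr("src"))
    .get(); // .get() turns the Cheerio collection into a plain array

console.log(imageSources);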

Note

This code might not work for single-page applications (SPAs) built with Vue, React, or Angular. An SPA serves a single, nearly empty HTML file and renders all of its content dynamically in the browser, so there is little for this crawler to find in the raw HTML.
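
If you do need to crawl an SPA, one common workaround, sketched below with the puppeteer package (this goes beyond the code in this article), is to render the page in a headless browser first and then parse the resulting HTML the same way as above:

JavaScript
 
const puppeteer = require("puppeteer");

async function fetchRenderedHtml(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until network activity settles so dynamically rendered content is present
    await page.goto(url, { waitUntil: "networkidle0" });
    const html = await page.content(); // The fully rendered HTML
    await browser.close();
    return html; // This can be loaded into Cheerio for parsing
}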

So, that's all I've got for a web crawler. If you have any doubts, do share them in the comment box, and I will try my best to find the simplest solution.

Topics: Node.js, Web Service

