Develop a Scraper With Node.js, Socket.IO, and Vue.js/Nuxt.js

Build a web scraper with Node.js on the back end and Vue.js on the front end, using Socket.IO to deliver results in real time.

By Dwayne O. Smith · Aasshey Triveddi · Updated Apr. 28, 21 · Tutorial

The vast amount of data publicly available on the internet, across every industry, can be useful for market research. You can also use this data in machine learning or big data projects to train a model on tens of thousands of entries.

In this article, I discuss how to develop a web scraper with Node.js and Cheerio.js and how to send the back-end data to a Vue.js front end. Along with that, I use the simplecrawler Node.js package.

Simplecrawler: Used to get all the pages of a domain.

Crawler: Used to get internal links, meta title, description, and content.

Vue.js: Used in the front-end to show data to the users.

Socket.io: Send data from the back-end to the front-end in real-time.

Node.js: Used to run the backend server.

Where Can a Web Scraper Be Used?

A web scraper can be used to extract content from a website, which you can then use for marketing campaigns and data analysis. Many SEO companies use web scrapers to extract data that is publicly available on the internet.

Along with that, many companies that provide data science and machine learning services need a huge amount of data to train their models. They can also use this web scraper to extract data from the internet.

If you are developing a travel portal, you can use this scraper to get data from multiple websites, compare it, and return the most affordable package. To make your own stored data harder to scrape, you can encode it in a different number system before saving it to the database. Popular number systems are binary, decimal, hexadecimal, and octal. For example, a decimal-to-hexadecimal tool can be used to convert your sensitive data, although such a conversion is easily reversed and is not a substitute for encryption.
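
As a small illustration of the number systems mentioned above, the following sketch converts a decimal value to hexadecimal, binary, and octal, and back again, using built-in JavaScript methods. The helper names toBase and fromBase are hypothetical. Because the conversion is fully reversible, it is best treated as a formatting step.

```javascript
// Hypothetical helpers: convert a non-negative integer between number systems.
function toBase(value, radix) {
  return value.toString(radix)
}

function fromBase(text, radix) {
  return parseInt(text, radix)
}

const hex = toBase(255, 16)     // 'ff'
const binary = toBase(255, 2)   // '11111111'
const octal = toBase(255, 8)    // '377'
const back = fromBase(hex, 16)  // 255

console.log(hex, binary, octal, back)
```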

First, you need to install all the required npm packages:

Shell

npm install --save vue-socket.io socket.io-client socket.io simplecrawler nuxt-socket-io crawler

Code for Developing a Web Scraper With Node.js/Vue.js:

First, add the following Vue.js template code, which uses v-model to bind the input to the domain property:

HTML

<div>
  <input type="text" placeholder="Enter a Domain" v-model="domain" @keyup.enter="senddomain">
  <button type="button" class="btn btn-sm btn-info" name="button" @click="senddomain">Check Domain</button>
</div><br>

In the script tag, add the domain property to the object returned by data():

JavaScript

data() {
  return {
    domain: ''
  }
}

Inside the methods object, copy and paste the following code:

JavaScript

// socket is the client exported from plugins/socket.io.js; import it at the
// top of the script section, for example: import socket from '~/plugins/socket.io.js'
senddomain() {
  if (this.domain === '') {
    alert('Please enter a domain')
  } else if (!this.domain.startsWith('http')) {
    alert('Please enter a URL starting with http')
  } else {
    // Ask the server to crawl this domain and report the status of each link
    socket.emit('brokenlinks1', this.domain)
  }
}
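
The check above only looks for the text "http" in the input, so a value such as httpfoo would still be sent to the server. A stricter check can be sketched with the built-in URL class. The helper name parseDomainInput is hypothetical and is not part of the original component.

```javascript
// Hypothetical helper: returns a normalized URL string, or null if the input
// is not an absolute http(s) URL.
function parseDomainInput(input) {
  let parsed
  try {
    parsed = new URL(input.trim())
  } catch (err) {
    return null // not an absolute URL, e.g. 'example.com'
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null // e.g. 'ftp://example.com'
  }
  return parsed.href
}

console.log(parseDomainInput('https://example.com'))
console.log(parseDomainInput('example.com'))
console.log(parseDomainInput('httpfoo'))
```

Inside senddomain, the result could replace the includes check: alert when the helper returns null, otherwise emit the returned string.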

Now, create a folder called IO in the root directory and, inside it, a file called index.js. Copy and paste the following code into index.js:

JavaScript

import http from 'http'
import https from 'https'
import socketIO from 'socket.io'
const SimpleCrawler = require('simplecrawler')
const Crawler = require('crawler')

// Assumption: the original listing does not show how the Socket.IO server is
// created. Here it listens on its own port (WS_PORT is a hypothetical variable);
// attach it to your existing HTTP server instead if you have one. Newer
// socket.io versions may also need CORS options for cross-origin clients.
const io = socketIO(process.env.WS_PORT || 3001)

io.on('connection', function (socket) {
  socket.on('brokenlinks1', function (message) {
    console.log('for broken link: ' + message)

    // Fetches each page found by simplecrawler and checks every link on it
    const pageCrawler = new Crawler({
      maxConnections: 10,
      // This will be called for each crawled page
      callback: function (error, response, done) {
        if (error) {
          console.log(error)
          return done()
        }
        const $ = response.$
        const url = response.options.uri
        // response.$ is only available for HTML responses
        if ($) {
          $('a').each(function (index, anchor) {
            const href = $(anchor).attr('href')
            if (href === undefined) return
            // Pick the client that matches the link's protocol; skip other links
            const client = href.startsWith('https:') ? https : href.startsWith('http:') ? http : null
            if (client === null) return
            client.get(href, function (res) {
              const status = res.statusCode
              res.resume() // discard the body so the connection is released
              socket.emit('brokenlinks', { status, href, url })
            }).on('error', function (err) {
              console.log('request failed: ' + href + ' ' + err.message)
            })
          })
        }
        done()
      }
    })

    // Walks every page of the domain and queues the HTML pages for link checks
    const siteCrawler = new SimpleCrawler(message)
    siteCrawler.on('fetchcomplete', function (queueItem) {
      const contentType = (queueItem.stateData.contentType || '').toLowerCase()
      if (contentType.startsWith('text/html')) {
        pageCrawler.queue(queueItem.url)
      }
    })
    siteCrawler.start()
  })
})
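
The listing above only checks links whose href already starts with http or https, and it requests the same link again each time it appears. As a possible refinement, the sketch below resolves relative links against the page URL and removes duplicates before any requests are made. The function name collectLinks and the sample URLs are hypothetical; only the built-in URL class and Set are used.

```javascript
// Hypothetical helper: turn raw href values from a page into a list of unique,
// absolute http(s) URLs that are worth checking.
function collectLinks(hrefs, pageUrl) {
  const unique = new Set()
  for (const href of hrefs) {
    if (typeof href !== 'string' || href === '') continue
    let resolved
    try {
      // Relative links such as '/about' are resolved against the page URL
      resolved = new URL(href, pageUrl)
    } catch (err) {
      continue // skip values that cannot be parsed as a URL
    }
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') continue
    resolved.hash = '' // '#section' fragments point at the same document
    unique.add(resolved.href)
  }
  return Array.from(unique)
}

const links = collectLinks(
  ['/about', 'https://example.com/about#team', 'mailto:hi@example.com', undefined, 'http://other.test/x'],
  'https://example.com/index.html'
)
console.log(links)
```

In the crawler callback, the href values gathered from the anchors could be passed through such a helper first, so that each distinct link is requested only once per page.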

Create a plugins folder at the root of your app. Inside that folder, create a file named socket.io.js and add the following code to it:

JavaScript

import io from 'socket.io-client'
const socket = io(process.env.WS_URL)
export default socket
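
On the front end, each result arrives as a brokenlinks event carrying status, href, and url. The original article does not show the listener, so the following is only a sketch of how results might be collected. It uses Node's EventEmitter as a stand-in for the Socket.IO client so that it runs without a server, and it assumes that 4xx and 5xx status codes count as broken links.

```javascript
const { EventEmitter } = require('events')

// Stand-in for the Socket.IO client; it offers the same on/emit pattern.
// With the real client, events emitted by the server trigger these listeners.
const socket = new EventEmitter()

// Hypothetical state that a component could render as a table
const results = []
const broken = []

socket.on('brokenlinks', function (payload) {
  results.push(payload)
  // Assumption: 4xx and 5xx responses count as broken links
  if (payload.status >= 400) {
    broken.push(payload)
  }
})

// Simulate two messages arriving from the server
socket.emit('brokenlinks', { status: 200, href: 'https://example.com/', url: 'https://example.com/' })
socket.emit('brokenlinks', { status: 404, href: 'https://example.com/missing', url: 'https://example.com/' })

console.log(results.length, broken.length)
```

In a Vue component, the same listener could be registered in the mounted hook on the socket exported from the plugin, pushing into arrays declared in data().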

Code Explanation

In most cases, developers use Axios to send data to the server, receive a response, and show it to users. Axios is great, but each request returns a single response. When I used Axios for the scraper, the crawler returned just the first result and I did not receive the rest of the data. So I decided to use Socket.IO to get real-time data from the crawler. It works well and has exceeded my expectations.

You can implement this code in your app. If you run into any errors, do not hesitate to share your experience; I will be happy to help solve them!
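
The difference can be illustrated with a simplified analogy that uses only Node built-ins. This is not the article's Axios code: a Promise stands in for the request/response style, an EventEmitter stands in for the socket, and the page names are hypothetical.

```javascript
const { EventEmitter } = require('events')

const pages = ['page-1', 'page-2', 'page-3'] // hypothetical crawl results

// Request/response style: a Promise settles once, so later results are dropped.
function crawlOnce() {
  return new Promise(function (resolve) {
    for (const page of pages) {
      resolve(page) // only the first call has any effect
    }
  })
}

// Event style: every result is delivered as its own message.
function crawlStream(emitter) {
  for (const page of pages) {
    emitter.emit('result', page)
  }
}

const received = []
const emitter = new EventEmitter()
emitter.on('result', function (page) {
  received.push(page)
})
crawlStream(emitter)

crawlOnce().then(function (first) {
  console.log('promise delivered:', first)
})
console.log('events delivered:', received)
```

The Promise delivers a single value no matter how many results the crawl produces, while the listener receives one message per result as it is found.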
