
Searching Social Media Influencers With Node.js

In this article, we'll show you how to quickly create your own web scraping application so you can gather the data you need for your project.

In the past few years, we have seen how social media has drastically changed the way that brands reach their customers. In fact, social media has started to play a big role in purchasing decisions.

The growth of "online celebrities" through social networks caught marketers' attention and that's where the idea of social influencers comes in.

There's a great explanation about the growth of social influencers:

"Consumers have always valued opinions expressed directly to them. Marketers may spend millions of dollars on elaborately conceived advertising campaigns, yet often what really makes up a consumer’s mind is not only simple but also free: a word-of-mouth recommendation from a trusted source. As consumers overwhelmed by product choices tune out the ever-growing barrage of traditional marketing, word of mouth cuts through the noise quickly and effectively." -  Bughin, Jacques. Doogan, Jonathan. Vetvik, Ole Jørgen (2010, April). A new way to measure word-of-mouth marketing

In simple words: people trust people!

Wait, What Is a Social Media Influencer?

Well, here's a good definition:

"An influencer is quite simply someone who carries influence over others. A social media influencer is someone who wields that influence through social media. The form of influence can vary and no two influencers are the same. Celebrity endorsements were the original form of influencer marketing, but in the digital age of online connection, regular people have become online “celebrities” with powerfully engaged social media followings, especially in certain market segments." -  Newberry, Christina (2017, April 19). A Comprehensive Guide to Influencer Marketing on Social Media. 

Cool, Why Should I Care?

The numbers are quite impressive, and they are backed by Twitter's research on its users:

"Nearly 40% of Twitter users say they’ve made a purchase as a direct result of a Tweet from an influencer." -  Karp, Katie (2016, May 10). New research: The value of influencers on Twitter. 

That's only Twitter. Imagine the share of purchases that social influencers can drive through Instagram or YouTube, which have a much stronger visual appeal.

Before We Begin

The code that is presented here is intended for educational and research purposes only. There’s no intention to cause any harm to third-party services.

The Scenario

Let’s imagine we work in the fitness industry. We want to automate the process of finding some of the most influential people in the fitness world through social networks.

We’ll start building a simple scraper to find those people. Let’s break this task into smaller pieces for a better understanding of the scraper:

  • Google Scraper: we need a Google scraper to find news articles about fitness influencers.
  • Page Scraper: we need a scraper to look for Instagram profiles.
  • Instagram Scraper: we need a scraper to extract information about the influencer’s profile.

The Requirements

There are plenty of great web scraping libraries out there, like scrape-it for Node.js or BeautifulSoup for Python, but we’ll keep things as simple as possible and build our own mechanism.

That will require the following dependencies:

  • request: Simplified HTTP request client.
  • request-promise: The simplified HTTP request client ‘request’ with Promise support.
  • cheerio: Fast, flexible, and lean implementation of core jQuery designed specifically for the server.
  • lodash: A modern JavaScript utility library delivering modularity, performance, and extras.
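  • util: The Node.js core module of utility functions; we use util.inherits for prototype inheritance. It ships with Node.js, so there is nothing to install.
  • url: The Node.js core module that provides utilities for URL resolution and parsing. It also ships with Node.js.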

The project structure will look like:

influence
│   index.js
│   package.json    
└───lib
│   │   google.js
│   │   hunter.js
│   │   instagram.js
│   │   scraper.js
└───node_modules
    │   ...

The Scraper

Let’s get started building an abstract scraper (all the scrapers have to perform similar tasks, so let’s reuse some code):

'use strict';

var request = require('request-promise');
var cheerio = require('cheerio');

/**
 * The default constructor
 */
function Scraper() {

}

/**
 * Returns a http get request promise
 * 
 * @param  {Object}  req     The request definition
 * @return {Promise} promise The get request promise
 */
Scraper.prototype.get = function(req) {
  return request(req);
}

/**
 * Returns a cheerio instance loaded with the given html body
 * 
 * @param  {String}   body The html body
 * @return {Function} $    The cheerio instance
 */
Scraper.prototype.load = function(body) {
  return cheerio.load(body);
}

/**
 * Returns a promise of a request call
 * 
 * @param  {String}  uri The request uri
 * @param  {Object}  qs  The request query string 
 * @return {Promise} promise The request promise
 */
Scraper.prototype.prepare = function(uri, qs) {
  var req = {
    uri:             uri,
    qs:              qs,
    simple:          false,
    transform:       this.load,
    followRedirects: false
  };

  return this.get(req).then(this.scrape);
}

/**
 * Executes the scrape chain: 
 * request -> cheerio load -> scrape -> transform
 * 
 * @param  {Array}   promises The promises array
 * @return {Promise} promise  The execution result
 */
Scraper.prototype.execute = function(promises) {
  var self = this;
  return Promise.all(promises).then(function(result) {
    return self.transform(result);
  });
}

/**
 * The scrape method.
 * This is the abstract method that should be implemented by all the scrapers.
 * It performs a html body scrape.
 * 
 * @param  {Function} $ The cheerio instance
 * @return {Array}    result The scrape result
 */
Scraper.prototype.scrape = function($) {
  throw new TypeError('scrape() method must be implemented');
}

/**
 * The transform method.
 * This is the abstract method that should be implemented by all the scrapers.
 * It transforms the scrape result into an array of objects.
 * 
 * @param  {Array} result The scrape result
 * @return {Array} normalized An array of normalized data
 */
Scraper.prototype.transform = function(result) {
  throw new TypeError('transform() method must be implemented');
}

module.exports = Scraper;
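
To make the contract of the two abstract methods concrete, here is a minimal sketch of a subclass. It assumes a stripped-down stand-in for the Scraper above (no request-promise or cheerio, so it runs on its own), and the Words scraper is hypothetical: its "pages" are plain strings rather than cheerio instances.

```javascript
'use strict';

var util = require('util');

// Minimal stand-in for lib/scraper.js so this sketch runs without
// request-promise or cheerio; only the abstract hooks and execute()
// are reproduced.
function Scraper() {}

Scraper.prototype.scrape = function() {
  throw new TypeError('scrape() method must be implemented');
};

Scraper.prototype.transform = function() {
  throw new TypeError('transform() method must be implemented');
};

// Same body as execute() above: wait for every page, then normalize.
Scraper.prototype.execute = function(promises) {
  var self = this;
  return Promise.all(promises).then(function(result) {
    return self.transform(result);
  });
};

// Hypothetical subclass, not part of the project.
function Words() {
  Scraper.call(this);
}
util.inherits(Words, Scraper);

// scrape() turns one "page" (here a plain string) into an array.
Words.prototype.scrape = function(text) {
  return text.split(' ');
};

// transform() flattens the per-page arrays into a single array.
Words.prototype.transform = function(result) {
  return [].concat.apply([], result);
};

var words = new Words();
var pages = ['top fitness', 'instagram accounts'].map(function(page) {
  // Mirrors prepare(): the scrape method is passed unbound to then().
  return Promise.resolve(page).then(words.scrape);
});

words.execute(pages).then(function(all) {
  console.log(all); // [ 'top', 'fitness', 'instagram', 'accounts' ]
});
```

Because prepare() hands this.scrape to then() without binding it, a scrape() implementation should not rely on `this`; transform() is called through `self` and can.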

The Google Scraper

We need Google to help us find news about the influencers. Let’s build a Google scraper:

'use strict';

var url     = require('url');
var util    = require('util');
var Scraper = require('./scraper');

/**
 * The default configuration
 */
var HOST             = 'https://google.com'; // google's host
var PATH             = '/search';            // search path
var LIMIT            = 10;                   // google's page limit
var START_AT_PAGE    = 0;                    // starts at page 0
var URL_ELEMENT      = '.g h3 a';            // html tag to search

/**
 * The default constructor 
 */
function Google() {
  Scraper.call(this);
}

/**
 * The implementation of the abstract method transform()
 * Returns a single array of the given multi array
 * 
 * @param  {Array} result The multi array
 * @return {Array} single The single array
 */
Google.prototype.transform = function(result) {    
  return [].concat.apply([], result);
}

/**
 * The implementation of the abstract method scrape()
 * It performs a scrape of the Google's search page.
 * 
 * @param  {Function} $     The cheerio instance
 * @return {Array}    links The Google's search links
 */
Google.prototype.scrape = function($) {
  var links = [];
  $(URL_ELEMENT).each(function(i, element) {
    var href = $(element).attr('href');
    var parsed = url.parse(href, true);
    links.push(parsed.query.q);
  });

  return links;
}

/**
 * Performs a Google search
 * 
 * @param  {String}  term    The term to search
 * @param  {Number}  limit   The result limit
 * @return {Promise} promise A promise of the execution chain
 */
Google.prototype.search = function(term, limit) {
  var promises = [];
  var size     = limit || LIMIT;

  for (var i = START_AT_PAGE; i < size; i += LIMIT) {        
    var promise = this.prepare(HOST + PATH, { q: term, start: i });
    promises.push(promise);
  }

  return this.execute(promises);
}

util.inherits(Google, Scraper);
module.exports = Google;
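
Two details of the Google scraper are easier to see in isolation: which `start` offsets search() requests for a given limit, and how scrape() pulls the target address out of a result link. The sketch below uses two hypothetical helpers, pageOffsets and targetOf, that are not part of the project, and the '/url?q=...' link shape is an assumption about the markup the scraper expects.

```javascript
'use strict';

var url = require('url');

var LIMIT = 10; // results per page, as in the scraper above

// Hypothetical helper: lists the `start` offsets search() would request.
function pageOffsets(limit) {
  var offsets = [];
  var size = limit || LIMIT;
  for (var i = 0; i < size; i += LIMIT) {
    offsets.push(i);
  }
  return offsets;
}

// Hypothetical helper: reads the `q` query parameter of an href,
// the same way scrape() does with url.parse(href, true).
function targetOf(href) {
  return url.parse(href, true).query.q;
}

console.log(pageOffsets(30)); // [ 0, 10, 20 ]
console.log(pageOffsets());   // [ 0 ]
console.log(targetOf('/url?q=https://example.com/fitness&sa=U'));
// https://example.com/fitness
```

An href without a `q` parameter yields undefined, so if the markup differs from this assumption, the links array may contain undefined entries that are worth filtering out before they reach the next scraper.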

The Page Scraper

Now, we need to look for Instagram profiles in the pages that we will get in a Google search. Let’s build the page scraper - we’ll call it Hunter:

'use strict';

var util    = require('util');
var Scraper = require('./scraper');

/**
 * The default constructor 
 */
function Hunter() {
  Scraper.call(this);
}

/**
 * The implementation of the abstract method transform()
 * Returns a single array of the given multi array
 * 
 * @param  {Array} result The multi array
 * @return {Array} single The single array
 */
Hunter.prototype.transform = function(result) {  
  var profiles = [].concat.apply([], result);
  return Array.from(new Set(profiles));
}

/**
 * The implementation of the abstract method scrape()
 * It searches for instagram profiles in the given html body.
 * 
 * @param  {Function} $     The cheerio instance
 * @return {Array}    links The Instagram account links
 */
Hunter.prototype.scrape = function($) {
  var profiles = [];
  var links = $('a[href^="http://instagram.com/"],a[href^="https://instagram.com/"]')
                .not('[href^="http://instagram.com/p/"]')
                .not('[href^="https://instagram.com/p/"]')
                .not('[href^="http://instagram.com/d/"]')
                .not('[href^="https://instagram.com/d/"]');

  $(links).each(function(i, link) {
    var href = $(link).attr('href');
    if (href.endsWith('/')) {
      href = href.substring(0, href.length - 1);
    }

    profiles.push(href);
  });

  return profiles;
}

/**
 * Searches for instagram profiles links in the given page's url
 *
 * @param  {Array} links The page URLs
 * @return {Promise} promise A promise of the execution chain
 **/
Hunter.prototype.hunt = function(links) {
  var promises = [];

  for (var i = 0; i < links.length; i++) {
    var promise = this.prepare(links[i]);
    promises.push(promise);
  }

  return this.execute(promises);  
}

util.inherits(Hunter, Scraper);
module.exports = Hunter;
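
The Hunter's post-processing can be tried without any network access. This is a small sketch with made-up input arrays; normalize and merge are hypothetical helper names that simply repeat the trailing-slash handling from scrape() and the flatten-and-deduplicate steps from transform().

```javascript
'use strict';

// Hypothetical helper mirroring the trailing-slash handling in scrape().
function normalize(href) {
  return href.endsWith('/') ? href.substring(0, href.length - 1) : href;
}

// Same two steps as transform(): flatten the per-page arrays, then
// drop duplicates with a Set (insertion order is preserved).
function merge(result) {
  var profiles = [].concat.apply([], result);
  return Array.from(new Set(profiles));
}

// Made-up scrape results for two pages.
var perPage = [
  ['https://instagram.com/amandabisk/', 'https://instagram.com/menshealthmag'],
  ['https://instagram.com/amandabisk']
].map(function(links) {
  return links.map(normalize);
});

merge(perPage).forEach(function(profile) {
  console.log(profile);
});
// https://instagram.com/amandabisk
// https://instagram.com/menshealthmag
```

Note that the selectors in scrape() only match hrefs that begin with instagram.com; links written with the www. prefix would need additional selectors.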

The Instagram Scraper

Finally, we need to extract information about the Instagram account of the influencer:

'use strict';

var util    = require('util');
var _       = require('lodash');
var Scraper = require('./scraper');

/**
 * The default constructor
 */
function Instagram() {
  Scraper.call(this);
}

/**
 * The implementation of the abstract method transform()
 * Returns a single array of the given multi array
 * 
 * @param  {Array} result The multi array
 * @return {Array} single The single array
 */
Instagram.prototype.transform = function(result) {
  var profiles = [].concat.apply([], result);
  return _.compact(profiles);
}

/**
 * The implementation of the abstract method scrape()
 * It performs a scrape of the user's Instagram account page.
 * 
 * @param  {Function} $ The cheerio instance
 * @return {Object}   profile The Instagram account profile, or undefined if not found
 */
Instagram.prototype.scrape = function($) {
  var link        = $('link[hreflang="x-default"]').attr('href');
  var picture     = $('meta[property="og:image"]').attr('content');
  var description = $('meta[property="og:description"]').attr('content');

  // profile not found
  if (!link || !picture || !description)
    return;

  if (link.endsWith('/'))
    link = link.substring(0, link.length - 1);

  var splited   = link.split('/');
  var username  = splited[splited.length -1];
  var indexOf   = description.indexOf('-');
  var info      = description.substring(0, indexOf - 1).trim();
  var followers = info.split(',')[0].replace(' Followers', '');
  var title     = $('title').text();
  var index     = title.indexOf('(');
  var name      = title.substring(0, index - 1).trim();

  return {
    link     : link,
    username : username,
    name     : name,
    followers: followers,
    picture  : picture
  }
}

/**
 * Extracts information of the given instagram links
 *
 * @param  {Array}   links The instagram links
 * @return {Promise} promise A promise of the execution chain
 */
Instagram.prototype.profiles = function(links) {
  var promises = [];

  for (var i = 0; i < links.length; i++) {
    var promise = this.prepare(links[i]);
    promises.push(promise);
  }

  return this.execute(promises);
}

util.inherits(Instagram, Scraper);
module.exports = Instagram;
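
The string handling in scrape() depends on the exact wording of the page's description and title. The sketch below replays that logic on sample strings; the strings are assumptions written in the shape the code appears to expect (they were not captured from Instagram), and followersOf and nameOf are hypothetical helper names.

```javascript
'use strict';

// Sample strings in the shape the scraper above appears to expect.
// They are assumptions for illustration, not captured from Instagram.
var description = '680.8k Followers, 312 Following, 1,024 Posts - ' +
  'See Instagram photos and videos from Amanda Bisk (@amandabisk)';
var title = 'Amanda Bisk (@amandabisk) • Instagram photos and videos';

// Hypothetical helper: everything before the first '-' holds the counters;
// the first comma-separated item is the follower count.
function followersOf(description) {
  var dash = description.indexOf('-');
  var info = description.substring(0, dash - 1).trim();
  return info.split(',')[0].replace(' Followers', '');
}

// Hypothetical helper: the display name is whatever precedes the first '('.
function nameOf(title) {
  var paren = title.indexOf('(');
  return title.substring(0, paren - 1).trim();
}

console.log(followersOf(description)); // 680.8k
console.log(nameOf(title));            // Amanda Bisk
```

If the title contains no opening parenthesis, nameOf returns an empty string, so this kind of parsing is best treated as fragile and re-checked whenever the page markup changes.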

Putting it All Together

Let’s export our modules. In the index.js file, we’ll create a class called Influence with a static method to find the influencers:

var Google    = require('./lib/google');
var Hunter    = require('./lib/hunter');
var Instagram = require('./lib/instagram');

/**
 * The default constructor
 */
function Influence() {

}

/**
 * Searches instagram profiles based on the given term
 * 
 * @param  {String}  term  The term to search
 * @param  {Number}  limit The google's search limit
 * @return {Promise} promise A promise to be executed 
 */
Influence.find = function(term, limit) {
  var google = new Google();

  return google.search(term, limit)
    .then(function(links) {
      var hunter = new Hunter();
      return hunter.hunt(links);
    })
    .then(function(profiles) {
      var instagram = new Instagram();
      return instagram.profiles(profiles);
    });
}

module.exports = {
  Google:    Google,
  Hunter:    Hunter,
  Instagram: Instagram,
  Influence: Influence
};
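
One thing to keep in mind with this chain: execute() relies on Promise.all(), which rejects as soon as any single request fails, so one unreachable page can abort the whole search. A possible way around that, sketched here with a hypothetical tolerant() helper and stand-in promises instead of real requests, is to turn a failed page into an empty result.

```javascript
'use strict';

// Hypothetical helper: turns a failed page into an empty result so that
// Promise.all() still resolves with the pages that worked.
function tolerant(promise) {
  return promise.catch(function() {
    return [];
  });
}

// Stand-ins for three prepare() calls; the second one fails.
var pages = [
  Promise.resolve(['https://instagram.com/amandabisk']),
  Promise.reject(new Error('socket hang up')),
  Promise.resolve(['https://instagram.com/menshealthmag'])
];

Promise.all(pages.map(tolerant)).then(function(result) {
  // Flatten as the transform() methods do; the [] entry vanishes.
  var links = [].concat.apply([], result);
  console.log(links.length); // 2
});
```

In the scrapers, this would amount to wrapping each call, for example promises.push(tolerant(this.prepare(link))). Whether silently skipping failed pages is acceptable depends on your project; logging the error inside the catch handler is a reasonable middle ground.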

Finding the Influencers

Let’s test our code. We’ll create a file called “test.js” in the root path of our project and we’ll search for the term, “Top fitness Instagram accounts”:

'use strict';

var Influence = require('./').Influence;

var term = 'top fitness instagram accounts';
Influence.find(term).then(        

  /**
  * Handles the instagram profiles
  */
  function(profiles) {
    console.log(profiles);
  }

).catch(

  /**
  * Handles the error
  */
  function(error) {
    console.error(error);
  }

);

And the result:

[
  {
    "link": "https://www.instagram.com/menshealthmag",
    "username": "menshealthmag",
    "name": "Men's Health",
    "followers": "939k",
    "picture": "https://instagram.fcpq1-1.fna.fbcdn.net/t51.2885-19/11371057_983333891687510_70028928_a.jpg"
  },
  {
    "link": "https://www.instagram.com/harpersbazaarus",
    "username": "harpersbazaarus",
    "name": "Harper's BAZAAR",
    "followers": "3m",
    "picture": "https://instagram.fcpq1-1.fna.fbcdn.net/t51.2885-19/s150x150/18299602_1302688826466727_8338763405586333696_a.jpg"
  },
  {
    "link": "https://www.instagram.com/amandabisk",
    "username": "amandabisk",
    "name": "Amanda Bisk",
    "followers": "680.8k",
    "picture": "https://instagram.fcpq1-1.fna.fbcdn.net/t51.2885-19/s150x150/16123289_158236814672260_164752922544963584_n.jpg"
  },
  {
    "link": "https://www.instagram.com/hannahbronfman",
    "username": "hannahbronfman",
    "name": "Hannah Fallis Bronfman",
    "followers": "388.3k",
    "picture": "https://instagram.fcpq1-1.fna.fbcdn.net/t51.2885-19/s150x150/16464716_949189645214982_1784722712950734848_a.jpg"
  },
  ...
]
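
The followers field comes back as a display string such as "939k" or "3m". If you want to rank the profiles, one option is to convert those strings to numbers first. The sketch below is only an illustration: toNumber is a hypothetical helper, and the meaning of the "k" and "m" suffixes is inferred from the sample output above.

```javascript
'use strict';

// Hypothetical helper: turns strings such as '939k' or '3m' into numbers.
// The 'k' and 'm' suffixes are inferred from the sample output above.
function toNumber(followers) {
  var multipliers = { k: 1e3, m: 1e6 };
  var match = /^([\d.,]+)\s*([km])?$/i.exec(followers.trim());
  if (!match) {
    return NaN;
  }
  var value = parseFloat(match[1].replace(/,/g, ''));
  var suffix = (match[2] || '').toLowerCase();
  return Math.round(value * (multipliers[suffix] || 1));
}

// A subset of the fields from the result above.
var profiles = [
  { username: 'menshealthmag',   followers: '939k' },
  { username: 'harpersbazaarus', followers: '3m' },
  { username: 'amandabisk',      followers: '680.8k' },
  { username: 'hannahbronfman',  followers: '388.3k' }
];

// Sort a copy, largest audience first.
var ranked = profiles.slice().sort(function(a, b) {
  return toNumber(b.followers) - toNumber(a.followers);
});

console.log(ranked.map(function(p) { return p.username; }).join(', '));
// harpersbazaarus, menshealthmag, amandabisk, hannahbronfman
```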

Conclusion

In a few steps, we built a simple scraper that automates the process of finding social media influencers. From here, you could feed these profiles into a machine learning pipeline, or into a cognitive service such as IBM Watson, to analyze them further.

You can find the source code at https://github.com/tommelo/influence

Topics: javascript, web scraping, web dev, node.js