DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. IoT
  4. Finding the Impact of a Tweet using Spark GraphX

Finding the Impact of a Tweet using Spark GraphX

Looking for a great way to do social network analysis? Take a look at this annotated Scala example of using Spark with GraphX to process Twitter data.

Himanshu Gupta user avatar by
Himanshu Gupta
·
Jan. 02, 17 · Tutorial
Like (3)
Save
Tweet
Share
5.02K Views

Join the DZone community and get the full member experience.

Join For Free

Social Network Analysis (SNA), a process of investigating social structures using Networks and Graphs, has become a very hot topic nowadays. Using it, we can answer many questions like:

  • How many connections an individual have?
  • What is the ability of an individual to influence a network?
  • and so on…

These and others can be used for conducting marketing research studies, running ad campaigns, and finding out latest trends. So, it becomes very crucial to identify the impact of an individual or individuals in a social network, so that we can identify key individuals, or Alpha Users (term used in SNA), in a social network.

In this post, we are going to see how to find the impact of an individual in a Social Network like Twitter, i.e., How many Twitter users an individaul can influence via his/her Tweet up to N number of level, i.e., Followers of followers of followers… and so on. For, this post we have downloaded data from Stanford University’s SNAP portal. Also, for anonymity, we have assigned random names to the users in the Twitter data.

Now, let's get on to the task of finding Alpha User. For that first we have to extract data from the text file like this:

val data = sparkContext.textFile("twitter-graph-data.txt")
val followees: RDD[(VertexId, String)] = data.map(_.split(",")).map { arr =>
  val user = arr(0).replace("((", "")
  val id = arr(1).replace(")", "")
  (id.toLong, user)
} 

val followers: RDD[(VertexId, String)] = data.map(_.split(",")).map { arr =>
  val user = arr(2).replace("(", "")
  val id = arr(3).replace("))", "")
  (id.toLong, user)
}

Then we have to create the graph from data extracted above using Spark GraphX API.

val vertices = followees.union(followers)
val edges = data.map(_.split(",")).map { arr =>
  val followeeId = arr(1).replace(")", "").toLong
  val followerId = arr(3).replace("))", "").toLong
  Edge(followeeId, followerId, "follow")
} 

val defaultUser = ("")
val graph = Graph(vertices, edges, defaultUser)

In above code we have created vertices and edges using the Twitter followers & followees RDD. Also, we have to provide a default entry like defaultUser.

Now, comes the trivial task of finding the most influential/alpha user which has maximum influence up to 2 levels, i.e., the one who have maximum followers of followers in total. For this, we are going to use Pregel API of Spark GraphX and Breadth First Traversal algorithm.

val subGraph = graph.pregel("", 2, EdgeDirection.In)((id, attr, msg) =>
attr + "," + msg,
triplet => Iterator((triplet.srcId, triplet.dstAttr)),
(a, b) => (a + "," + b))

The above code is hard to digest at first. So, let's try to understand it.

  • The first argument in pregel is the initial message that is sent to all vertices in the graph which are connected.
  • The second argument is the number of iterations which tells to pregel API as to how many times the message has to be sent among connected vertices.
  • The third argument is the direction of edge in which message has to be sent. We selected inward direction because we have to traverse to root node from leaf nodes. So, that we can get all followers of followers for all users in the graph.
  • Then comes the vprog argument which asks for the function/expression to get vertex argument. For that, we have combined the attribute and message of the previous vertex as to accumulate all users following an individual user on Twitter.
  • After that comes the sendMsg argument which requires an expression for calculating the message to be sent in next iteration. Here we are sending followers name so that we can accumulate all users following an individual user on Twitter.
  • Lastly comes the mergeMsg argument, which reduces/merges the messages sent to a vertex during an iteration of Pregel API.

Now, we just have to find the user with maximum Followers of Followers. For that we can use following code:

val alphaUser = subGraph.vertices.map(vertex => (vertex._1, vertex._2.split(",").distinct.length - 2)).max()(new Ordering[Tuple2[VertexId, Int]]() {
  override def compare(x: (VertexId, Int), y: (VertexId, Int)): Int =
    Ordering[Int].compare(x._2, y._2) 
})

val alphaUserId = graph.vertices.filter(_._1 == alphaUser._1).map(_._2).collect().head

The variable alphaUserId will provide us the Id of Alpha User (as the name suggests). In the same way, if we wanted to find the most influential user up to N level then just increase the number of iterations to N instead of 2 in Pregel API. It is as easy as it gets ;)

I hope you liked this post and want to have a hands-on experience for this example. So, just download or clone the code from here: https://github.com/knoldus/spark-graphx-twitter and run it.

Network twitter API Data (computing) Text file Graph (Unix) POST (HTTP) Alpha (finance) Task (computing)

Published at DZone with permission of Himanshu Gupta, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Deploying Java Serverless Functions as AWS Lambda
  • How to Create a Real-Time Scalable Streaming App Using Apache NiFi, Apache Pulsar, and Apache Flink SQL
  • Kubernetes vs Docker: Differences Explained
  • Better Performance and Security by Monitoring Logs, Metrics, and More

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: