Election 2016: Analyzing Real-Time Twitter Sentiment With MemSQL Pipelines

Using Apache Kafka, MemSQL, Machine Learning and our Pipelines Twitter Demo as a base, we are bringing real-time analytics to Election 2016.

Neil Dahlke · Oct. 22, 2016 · Opinion

November is nearly upon us, with the spotlight on Election 2016. This election has been amplified by millions of digital touchpoints. In particular, Twitter has risen in popularity as a forum for voicing individual opinions as well as tracking statements directly from the candidates. Pew Research Center states that “in January 2016, 44% of U.S. adults reported having learned about the 2016 presidential election in the past week from social media, outpacing both local and national print newspapers.” The first 2016 presidential debate “between Donald Trump and Hillary Clinton was the most-tweeted debate ever. All told, there were 17.1 million interactions on Twitter about the event.”

By now, most people have probably seen both encouraging and disparaging tweets about the two candidates: Hillary Clinton and Donald Trump. Twitter has become a real-time voice for the public watching along with debates and campaign announcements. We wanted to home in on the sentiments expressed in real time. Using Apache Kafka, MemSQL, machine learning, and our Pipelines Twitter Demo as a base, we are bringing real-time analytics to Election 2016.

Hillary vs. Trump Real-Time Twitter Sentiment

Introducing our latest live demonstration, Election 2016: Real-Time Twitter Analytics. We analyze the sentiment (attitude, emotion, or feeling) of every tweet about Clinton and Trump as it is tweeted. Now anyone can see how strongly positive or negative the tweets are trending at any given point. We’re giving everyone access to the broader scope of how each candidate is doing according to the Twittersphere.

How It Works

First, we wrote a Python script to collect tweets and retweets that contain the words Hillary, hillary, Trump, or trump directly from Twitter.com. We picked the words “Hillary” and “Trump” as descriptors since they are the most used for the candidates. The script pushes this content to an Apache Kafka queue in real time. Messages in this Kafka queue are then streamed using MemSQL Pipelines. Released in September 2016 at Strata+Hadoop World, Pipelines features a brand-new SQL command, CREATE PIPELINE, enabling native ingest from Apache Kafka and creation of real-time streaming pipelines.
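The collection script is not reproduced in this article. As a rough sketch only, the keyword filter and the JSON message shape could look like the following. The field names follow the sample row shown later in this article; the function names and the `publish` callback are hypothetical stand-ins for a real Twitter client and Kafka producer, neither of which is shown here.

```python
import json

KEYWORDS = ("hillary", "trump")  # tracked terms, matched case-insensitively


def mentions_candidate(text):
    """Return True if the tweet text contains any tracked keyword."""
    lowered = text.lower()
    return any(word in lowered for word in KEYWORDS)


def to_message(tweet):
    """Serialize a tweet dict to the JSON shape seen in the sample row."""
    fields = ("created_at", "favorite_count", "id",
              "retweet_count", "text", "username")
    return json.dumps({name: tweet[name] for name in fields},
                      separators=(",", ":"))


def forward(tweets, publish):
    """Publish every matching tweet and return how many were sent.

    `publish` is a placeholder for a Kafka producer's send call (assumption).
    """
    sent = 0
    for tweet in tweets:
        if mentions_candidate(tweet["text"]):
            publish(to_message(tweet).encode("utf-8"))
            sent += 1
    return sent
```

Keeping the filter and the serializer separate from the publishing step makes both easy to exercise without a running Kafka cluster.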

The CREATE PIPELINE statement looks like this:


CREATE PIPELINE `twitter_pipeline`
AS LOAD DATA KAFKA 'your-kafka-host-ip:9092/your-kafka-topic'
INTO TABLE `tweets`;

The CREATE TABLE statement for the tweets table in MemSQL is shown below:

CREATE TABLE `tweets` (
  `id` bigint(20) DEFAULT NULL,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `tweet` JSON COLLATE utf8_bin,
  `text` AS tweet::$text PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
  `retweet_count` AS tweet::%retweet_count PERSISTED int(11),
  `candidate` AS CASE
    WHEN (text LIKE '%illary%') THEN 'clinton'
    WHEN (text LIKE '%rump%') THEN 'trump'
    ELSE 'unknown' END PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
  `created` AS FROM_UNIXTIME(`tweet`::$created_at) PERSISTED datetime,
  KEY `id` (`id`) /*!90619 USING CLUSTERED COLUMNSTORE */,
  /*!90618 SHARD */ KEY `id_2` (`id`)
);

Note: We create tweets as a columnstore table so it can handle large amounts of data for analytics. We also use persisted computed columns in MemSQL to parse JSON data and categorize each tweet by candidate. MemSQL natively supports the JSON data format.
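To make the logic of the `candidate` computed column concrete, here is a small Python mirror of its CASE expression. It is only an illustration: the function name is hypothetical, and it assumes the LIKE match is case-insensitive, as the `utf8_general_ci` collation on `text` suggests.

```python
def classify_candidate(text):
    """Mirror of the CASE expression: the first matching branch wins."""
    lowered = text.lower()  # assumption: the SQL comparison ignores case
    if "illary" in lowered:   # LIKE '%illary%'
        return "clinton"
    if "rump" in lowered:     # LIKE '%rump%'
        return "trump"
    return "unknown"          # ELSE 'unknown'
```

Two properties follow from this structure. A tweet that names both candidates is labeled `clinton`, because that branch is tested first. And plain substring matching can over-match: a word such as "grumpy" contains "rump".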

When the twitter_pipeline is run, data in the tweets table looks like this:

memsql> SELECT * FROM tweets LIMIT 1\G
*************************** 1. row ***************************
id: 786409507039485952
ts: 2016-10-13 03:33:53
tweet: {"created_at":1476329611,"favorite_count":0,"id":786409507039485952,"retweet_count":0,"text":"rt @blackwomen4bern: this will be an interesting halloween this year...expect me to tweet some epic hillary costumes...i expect there will…","username":"hankandmya12"}
text: rt @blackwomen4bern: this will be an interesting halloween this year...expect me to tweet some epic hillary costumes...i expect there will…
retweet_count: 0
candidate: clinton
created: 2016-10-13 03:33:31
1 row in set (0.03 sec)

Next, we created a second pipeline that pulls from the same Kafka topic. Instead of storing messages directly in a table, it performs real-time sentiment analysis with a MemSQL Pipelines transform that leverages the Python Natural Language Toolkit (NLTK) VADER module. The CREATE PIPELINE statement for the second pipeline looks like this:

CREATE PIPELINE `twitter_sentiment_pipeline`
AS LOAD DATA KAFKA 'your-kafka-host-ip:9092/your-kafka-topic'
WITH TRANSFORM ('http://download.memsql.com/pipelines-twitter-demo/transform.tar.gz', 'transform.py', '')
INTO TABLE `tweet_sentiment`;
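The transform script lives in the tarball referenced above and is not reproduced in this article. As a hedged sketch of what such a script might do, the code below scores a tweet with NLTK's VADER analyzer and emits one tab-separated row. The output columns (tweet id and compound score), the function names, and the one-message-per-line input handling are assumptions for illustration; the real transform's I/O framing and the `tweet_sentiment` schema may differ.

```python
import json


def vader_scorer():
    """Build a scorer backed by NLTK VADER.

    Requires the nltk package and its vader_lexicon resource.
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns neg/neu/pos/compound; compound lies in [-1, 1].
    return lambda text: analyzer.polarity_scores(text)["compound"]


def transform(raw_message, score):
    """Turn one JSON tweet into a tab-separated (id, sentiment) row."""
    tweet = json.loads(raw_message)
    return "%d\t%.4f\n" % (tweet["id"], score(tweet["text"]))


def run(lines, write, score):
    """Score each non-empty input line and pass the resulting row to `write`."""
    for line in lines:
        if line.strip():
            write(transform(line, score))
```

With NLTK installed, something like `run(sys.stdin, sys.stdout.write, vader_scorer())` (after `import sys`) would wire the pieces together. Passing the scorer in as an argument keeps the row-shaping logic testable without the lexicon.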

Combining data from these two MemSQL pipelines, we can perform analytics using SQL. For example, we can create a histogram of tweet sentiment with the following query:


SELECT
  sentiment_bucket,
  SUM(IF(candidate = "clinton", tweet_volume, 0)) AS clinton_tweets,
  SUM(IF(candidate = "trump", tweet_volume, 0)) AS trump_tweets
FROM tweets_per_sentiment_per_candidate_timeseries t
GROUP BY sentiment_bucket
ORDER BY sentiment_bucket;
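For readers less familiar with conditional aggregation, the SUM(IF(...)) pivot in this query can be mirrored in a few lines of Python. This is an illustrative sketch only: the input tuples stand in for `(sentiment_bucket, candidate, tweet_volume)` rows from the queried table or view, and the function name is hypothetical.

```python
from collections import defaultdict


def sentiment_histogram(rows):
    """Pivot (sentiment_bucket, candidate, tweet_volume) rows into per-bucket totals."""
    totals = defaultdict(lambda: {"clinton_tweets": 0, "trump_tweets": 0})
    for bucket, candidate, volume in rows:
        entry = totals[bucket]  # every bucket gets a row, as with GROUP BY
        if candidate == "clinton":
            entry["clinton_tweets"] += volume
        elif candidate == "trump":
            entry["trump_tweets"] += volume
        # Any other candidate adds 0 to both sums, as the IF(...) calls do.
    # Equivalent of ORDER BY sentiment_bucket.
    return [(bucket, totals[bucket]["clinton_tweets"], totals[bucket]["trump_tweets"])
            for bucket in sorted(totals)]
```

Each output tuple corresponds to one result row: the bucket, then the Clinton and Trump tweet volumes side by side, which is the shape a two-series histogram needs.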

Lastly, we constructed a user interface (UI). We built the graph using WebSockets and React to visualize the rolling average tweet sentiment for both candidates, drawn in real time.

Hillary vs. Trump Real-Time Chart
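The rolling average plotted in the chart is simple to state in code. The sketch below assumes a fixed-size window over the most recent scores; the article does not say which window the demo uses, or whether the average is computed in SQL or in the UI, so the class name and windowing choice are illustrative only.

```python
from collections import deque


class RollingAverage:
    """Mean of the last `window` values, updated in constant time per value."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.values = deque(maxlen=window)
        self.total = 0.0

    def add(self, value):
        """Record a new value and return the current rolling mean."""
        if len(self.values) == self.values.maxlen:
            self.total -= self.values[0]  # oldest value is about to be evicted
        self.values.append(value)
        self.total += value
        return self.total / len(self.values)
```

One instance per candidate, fed with each new sentiment score, would yield the two series a chart like this draws.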

Published at DZone with permission of Neil Dahlke, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
