HDF 2.0 Flow Processing Real-Time Tweets from Strata Hadoop

This hands-on tutorial shows how to ingest tweets in real time and send them to several data sinks, including Slack, Phoenix (HBase), and HDFS. It also shows how to run TensorFlow against images downloaded from Twitter.

By Tim Spann · Oct. 10, 16 · Opinion

I had a few hours in the morning before the Strata + Hadoop World conference schedule kicked in, so I decided to write a little HDF 2.0 flow to grab all the tweets about the conference.

First up, I used GetTwitter to read tweets and filtered on these terms:

  • strata
  • stratahadoop
  • strataconf
  • NiFi
  • FutureOfData
  • ApacheNiFi
  • Hortonworks
  • Hadoop
  • ApacheHive
  • HBase
  • ApacheSpark
  • ApacheTez
  • MachineLearning
  • ApachePhoenix
  • ApacheCalcite
  • ApacheStorm
  • ApacheAtlas
  • ApacheKnox
  • Apache Ranger
  • HDFS
  • Apache Pig
  • Accumulo
  • Apache Flume
  • Sqoop
  • Apache Falcon
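
As far as I know, GetTwitter takes its filter terms as a single comma-separated value in its "Terms to Filter On" property when the Filter Endpoint is selected. A minimal Python sketch for assembling that value from a list like the one above; the helper name build_filter_terms is hypothetical and only a few of the terms are shown:

```python
def build_filter_terms(terms):
    """Join terms into one comma-separated string.

    Blank entries and case-insensitive duplicates are dropped; the first
    spelling of each term is the one that is kept.
    """
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return ",".join(cleaned)


# A subset of the terms listed above.
terms = ["strata", "stratahadoop", "strataconf", "NiFi", "ApacheNiFi", "Hadoop"]
print(build_filter_terms(terms))
```

The resulting string can be pasted into the processor property as is.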

Input

InvokeHTTP: I used this to download the image at the first image URL in each tweet.
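
To make "the first image URL" concrete, here is a rough Python sketch of pulling it out of a raw tweet. It assumes the standard Twitter JSON layout, where attached images appear under entities.media with media_url and media_url_https fields; the function name first_image_url is hypothetical, and the flow itself does this with NiFi processors rather than Python:

```python
import json


def first_image_url(raw_tweet):
    """Return the first media URL in a tweet's JSON, or None if there is none."""
    tweet = json.loads(raw_tweet)
    media = (tweet.get("entities") or {}).get("media") or []
    for item in media:
        # Prefer the HTTPS variant when both are present.
        url = item.get("media_url_https") or item.get("media_url")
        if url:
            return url
    return None


# Illustrative tweet, not real data.
sample = json.dumps({
    "text": "Hello from the conference",
    "entities": {"media": [{"media_url": "http://example.com/a.jpg"}]},
})
print(first_image_url(sample))
```

Tweets without an attached image return None, so they can skip the download step entirely.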

GetTwitter: This is our primary and most important source of data. You must have a Twitter account and a Twitter developer account, and you must create a Twitter application. Then you can access the keywords and hashtags above. So far I’ve ingested 14,211 tweets into Phoenix, even though I’ve shut the flow down many times for testing and moving things around. I’ve had it running live as I’ve added pieces. I do not recommend this development process, but it’s good for exploring data.

Processing

RouteOnAttribute: To process only tweets with an actual message. Sometimes tweets are damaged or the message is missing, and there is no point wasting time on those.
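
The routing rule boils down to "keep the tweet only if it has a non-empty message." A minimal Python sketch of that check, assuming the message lives in the tweet's text field; has_message is a hypothetical name, and in the flow the same test is expressed as a RouteOnAttribute condition rather than code:

```python
import json


def has_message(raw_tweet):
    """True only for parseable tweets whose 'text' field is a non-blank string."""
    try:
        tweet = json.loads(raw_tweet)
    except (ValueError, TypeError):
        # Damaged or missing payloads are dropped.
        return False
    if not isinstance(tweet, dict):
        return False
    text = tweet.get("text")
    return isinstance(text, str) and bool(text.strip())


print(has_message('{"text": "Great keynote"}'))
print(has_message('{"text": "   "}'))
print(has_message("{not valid json"))
```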

ExecuteStreamCommand: To call shell scripts that call TensorFlow C++ binaries and Python scripts. There are many ways to do this, but this is the easiest.

UpdateAttribute: To change the file name for files I downloaded to HDFS.

Output Sinks

PutHDFS: Saved to HDFS in a few different directories (the first attached image): the raw JSON tweet; a limited set of fields such as handle, message, and geolocation; and a fully processed file to which I added TensorFlow Inception v3 image recognition for images attached to Strata tweets and VADER sentiment analysis of the tweet text.
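
As an illustration of the "limited set of fields" file, here is a small Python sketch that projects a raw tweet down to handle, message, and geolocation. The field names (user.screen_name, text, coordinates, geo) follow the standard Twitter JSON layout; the output keys and the function name limited_fields are hypothetical and not necessarily what the flow writes:

```python
import json


def limited_fields(raw_tweet):
    """Reduce a raw tweet to a small record of handle, message and geolocation."""
    tweet = json.loads(raw_tweet)
    user = tweet.get("user") or {}
    return {
        "handle": user.get("screen_name"),
        "message": tweet.get("text"),
        # 'coordinates' is the current field; 'geo' is the older equivalent.
        "geo": tweet.get("coordinates") or tweet.get("geo"),
    }


# Illustrative tweet, not real data.
sample = json.dumps({
    "text": "Live from the keynote",
    "user": {"screen_name": "example_user", "followers_count": 10},
    "coordinates": None,
})
print(json.dumps(limited_fields(sample), sort_keys=True))
```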

PutSQL: I upserted all tweets, enriched by the HDF calls to TensorFlow and the Python sentiment analysis, into a Phoenix table.
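
Phoenix uses UPSERT rather than INSERT, and PutSQL (as I understand it) reads statement parameters from flow file attributes named sql.args.N.type and sql.args.N.value. A hedged Python sketch of what such a statement and its attributes could look like; the table name, column names, and helper names are all hypothetical:

```python
def build_upsert(table, columns):
    """Build a parameterized Phoenix UPSERT statement for the given columns."""
    placeholders = ", ".join("?" for _ in columns)
    return "UPSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)


def sql_args(values):
    """Build PutSQL-style parameter attributes, numbered from 1.

    Every value is sent as a string; 12 is java.sql.Types.VARCHAR.
    """
    attrs = {}
    for i, value in enumerate(values, start=1):
        attrs["sql.args.{}.type".format(i)] = "12"
        attrs["sql.args.{}.value".format(i)] = str(value)
    return attrs


# Hypothetical table and columns, for illustration only.
columns = ["tweet_id", "handle", "msg", "sentiment"]
print(build_upsert("tweets", columns))
print(sql_args(["1", "example_user", "Live from the keynote", "POSITIVE"]))
```

Keeping the statement parameterized avoids quoting problems with tweet text that contains apostrophes.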

PutSlack: https://nifi-se.slack.com/messages/general/

Visualization

There are a ton of ways to look at this data now.

I used Apache Zeppelin since it was part of my HDP 2.5 cluster and it’s so easy to use. I added a few tables and charts and did some quick SQL exploration of the data in Phoenix.

Linux Shell Scripts

source /usr/local/lib/bazel/bin/bazel-complete.bash
export JAVA_HOME=/opt/jdk1.8.0_101/
/bin/rm -rf /tmp/$@
hdfs dfs -get /twitter/rawimage/$@ /tmp/
/opt/demo/tensorflow/bazel-bin/tensorflow/examples/label_image/label_image --image="/tmp/$@" \
  --output_layer="softmax:0" --input_layer="Mul:0" --input_std=128 --input_mean=128 \
  --graph=/opt/demo/tensorflow/tensorflow/examples/label_image/data/tensorflow_inception_graph.pb \
  --labels=/opt/demo/tensorflow/tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt 2>&1 | cut -c48-
/bin/rm -rf /tmp/$@

python /opt/demo/sentiment/sentiment2.py "$@"
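
The cut -c48- at the end of the label_image command keeps only character 48 onward of each output line, which presumably strips a fixed-width logging prefix so that only the label and score remain (that reading is an assumption). For anyone who prefers to do the trimming in Python, a minimal equivalent sketch with a hypothetical function name:

```python
def cut_from_48(text):
    """Mimic `cut -c48-` for ASCII text: drop the first 47 characters of each line.

    Lines shorter than 48 characters become empty lines, as they do with cut.
    """
    return "\n".join(line[47:] for line in text.splitlines())


# Synthetic input: a 47-character stand-in prefix followed by a payload.
prefix = "x" * 47
print(cut_from_48(prefix + "some label: 0.5"))
```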

Python Script

This requires Python 2.7; in previous articles I have shown how to install pip and NLTK. With those in place, it is very easy to do some simple sentiment analysis. I also have a version that just returns the polarity_scores (compound, negative, neutral, and positive).
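
The script itself is not reproduced here, so the following is only a sketch of what a VADER-based scorer could look like, not the author's sentiment2.py. It uses NLTK's SentimentIntensityAnalyzer, whose polarity_scores method returns neg, neu, pos, and compound values, and it assumes the vader_lexicon resource has already been downloaded with nltk.download. The function names and the 0.05 cut-off (a commonly used convention for the compound score) are assumptions:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


def score_text(text):
    """Return VADER polarity scores for text.

    NLTK is imported lazily so the labelling helper works without it.
    Requires: pip install nltk, then nltk.download('vader_lexicon').
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores(text)


# Example use from a shell wrapper, with the tweet text as the argument:
#   scores = score_text("Loving the talks at the conference")
#   print(label_from_compound(scores["compound"]))
print(label_from_compound(0.6))
print(label_from_compound(0.0))
```

Returning the full polarity_scores dictionary instead of a single label corresponds to the second version mentioned above.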

Published at DZone with permission of Tim Spann, DZone MVB. See the original article here.
