Mining Twitter Data and Stashing it in MongoDB

Hey Mongoers!

I recently had the pleasure of joining the MongoLab team. I share this with you for two reasons. First, you can too (we’re hiring!). Second, I remember that when I first heard about MongoDB, I created an account on MongoLab and thought… now what?


With open source technologies proliferating as “Big Data” and analytics explode, we thought it would be beneficial to give our users and friends a script that takes care of the nitty-gritty and lets them explore what makes MongoDB great. We’re excited to present Twitter-Harvest, a Python script that uses the Twitter REST API v1.1 to retrieve tweets from a user’s timeline and insert them into a MongoDB database.

Quick Demo

The details on installation and running the app are located on this GitHub repo. For the impatient, I empathize… we’ve provided some Twitter credentials and an out-of-the-box command that you can run to see that everything works. After you have downloaded/unzipped the repo, run:

Straight out of the box, you’ll notice that the script prints every tweet it harvests to your console. Peruse the help docs and pass arguments accordingly; most notably, you’ll want to tack on a MongoDB URI using the --db flag so that the tweets are stored in your database. Also keep in mind that if you’d like to use this script more than once, you should obtain your own Twitter credentials for security and rate-limiting reasons.

Diving in

Once you have the necessary modules set up, you’ll notice that the run script has quite a few options. Note that Twitter OAuth credentials are required. To help you store the harvested tweets, you can create a free Sandbox database with us! We have included the following options that we thought would be popular with users:

  • harvesting native retweets (-r)
  • printing each tweet the program iterates over (-v)
  • MongoDB URI, which allows insertion into a MongoDB database (--db)
  • setting the number of tweets to be harvested (--numtweets)
  • user timeline that you would like to harvest from; the default is mongolab (--user)

So, let’s say I want to harvest 100 of @mongolab’s tweets (and retweets) and store them in my database. The command and arguments would be:

python twitter-harvest.py --db mongodb-uri --consumer-key consumer-key --consumer-secret consumer-secret --access-token access-token --access-secret access-secret -r --numtweets 100

Just like that, we have 100 tweets in a collection called “mongolab”.
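
If you're curious what the "stash it in MongoDB" step might look like, here is a minimal sketch. It is not the actual Twitter-Harvest code: the function name `store_tweets` is hypothetical, and it assumes each tweet is a dict carrying the `id` field that the Twitter REST API v1.1 returns. The `collection` argument can be any object with a PyMongo-style `update_one` method.

```python
def store_tweets(collection, tweets):
    """Upsert each tweet by its Twitter id and return how many were new.

    Assumption: `collection` behaves like a pymongo Collection, and each
    tweet is a dict with an "id" key (as in Twitter REST API v1.1 payloads).
    """
    new_count = 0
    for tweet in tweets:
        # Match on the tweet id so that re-running a harvest
        # does not create duplicate documents.
        result = collection.update_one(
            {"id": tweet["id"]},
            {"$set": tweet},
            upsert=True,
        )
        # upserted_id is only set when a new document was inserted.
        if result.upserted_id is not None:
            new_count += 1
    return new_count
```

With PyMongo installed, you could pass something like `MongoClient(uri).get_default_database()["mongolab"]` as the collection. Upserting by id is one possible design choice; a plain insert works too if you only ever harvest a timeline once.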

To help you along, we also have help documentation available:

% python twitter-harvest.py -h

optional arguments:

-h, --help                             help
-r, --retweet                          include native retweets
-v, --verbose                          print harvested tweets in shell
--numtweets NUMTWEETS                  set harvest number
--user USER                            choose twitter user timeline
--db DB                                MongoDB URI
--consumer-key CONSUMER_KEY            Twitter Consumer Key
--consumer-secret CONSUMER_SECRET      Twitter Consumer Secret
--access-token ACCESS_TOKEN            Twitter Access Token
--access-secret ACCESS_SECRET          Twitter Access Token Secret
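
For readers who want to build a similar command-line tool, the help output above maps naturally onto Python's standard `argparse` module. The sketch below is a reconstruction from that help text, not the script's real source. Treat these parts as assumptions: the `build_parser` name, making the four credentials `required=True` (based on the note that OAuth credentials are required), and leaving `--numtweets` and `--db` without defaults.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the help output (sketch)."""
    parser = argparse.ArgumentParser(
        description="Harvest tweets from a user timeline (illustrative sketch)."
    )
    parser.add_argument("-r", "--retweet", action="store_true",
                        help="include native retweets")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print harvested tweets in shell")
    # Assumption: no default harvest size is documented, so None is used here.
    parser.add_argument("--numtweets", type=int, default=None,
                        help="set harvest number")
    # The article states that the default timeline is mongolab.
    parser.add_argument("--user", default="mongolab",
                        help="choose twitter user timeline")
    parser.add_argument("--db", default=None, help="MongoDB URI")
    # Assumption: credentials are marked required, per the article's note.
    parser.add_argument("--consumer-key", required=True,
                        help="Twitter Consumer Key")
    parser.add_argument("--consumer-secret", required=True,
                        help="Twitter Consumer Secret")
    parser.add_argument("--access-token", required=True,
                        help="Twitter Access Token")
    parser.add_argument("--access-secret", required=True,
                        help="Twitter Access Token Secret")
    return parser
```

Calling `build_parser().parse_args()` reads the flags from the command line; `argparse` turns `--consumer-key` into the attribute `consumer_key`, and so on for the other dashed names.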

Now, onto the fun stuff. Let’s see what interesting data or projects you can come up with using this tool!

We challenge you!

In case you’re stumped, here are a few challenges we’ve thought up that highlight both Twitter’s vast array of information and MongoDB’s features.

1. Compile a list of “successful” tweets (retweeted and/or favorited) and return only a few of the fields. Hint: Aggregation Framework

2. Harvest from a variety of users (friends, family, athletes) and see who has tweeted near you and with what frequency. Hint: Geospatial Indexes

3. Experiment with text indexes (after all, tweets are text) and examine your queries. Can you make them faster? Hint: Text Search + Cursor Explain

4. Use this as an example to set up a public stream, which is great for data mining! Hint: Twitter Public Streams
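
To get you started on the first challenge, here is one possible aggregation pipeline, written as plain Python data so you can inspect it before running it. It assumes the harvested documents keep the standard Twitter REST API v1.1 field names (`text`, `created_at`, `retweet_count`, `favorite_count`). The function name and its thresholds are hypothetical, so adjust them to taste.

```python
def build_successful_tweets_pipeline(min_retweets=1, min_favorites=1, limit=10):
    """Return an aggregation pipeline for 'successful' tweets (sketch).

    Assumption: documents use Twitter REST API v1.1 field names.
    """
    return [
        # Keep tweets that were retweeted or favorited often enough.
        {"$match": {"$or": [
            {"retweet_count": {"$gte": min_retweets}},
            {"favorite_count": {"$gte": min_favorites}},
        ]}},
        # Return only a few of the fields.
        {"$project": {
            "_id": 0,
            "text": 1,
            "created_at": 1,
            "retweet_count": 1,
            "favorite_count": 1,
        }},
        # Most retweeted first, ties broken by favorites.
        {"$sort": {"retweet_count": -1, "favorite_count": -1}},
        {"$limit": limit},
    ]
```

With PyMongo, the list can be passed straight to `collection.aggregate(...)`. Putting `$match` first keeps the later stages working on as few documents as possible.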

Happy coding, and be sure to keep us posted on your projects. We’re always here to help!



*special thanks to our Swedish friend Gustav Arngården @arngarden over at @aitellu for the harvesting idea!


Published at DZone with permission of Chris Chang, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
