What I Learned About Data Visualization from Hilary Mason of Bit.ly
That's when I decided it was my weekend's goal to get her to hack on something, anything, related to data mining with me. Check her out on Twitter @hmason or her website at http://www.hilarymason.com/
Graciously, she agreed and we set up the time and place. We ended up with around ten people in total hacking for about an hour in a small cafe here in St. Louis. I published the final product here: http://github.com/jcbozonier/Strangeloop-Data-Visualization
and Hilary is hosting the visualization here:
That's the background and this is what came of it for me.
Answers Are Easy, Asking The Right Questions Is Hard
By grouping up with Hilary I was hoping to get some insight into her professional workflow, what tools she uses, and also I wanted to get a feel for her general approach and mindset for answering a given question with her data-fu.
The question we ultimately decided to work on was: "What does the Strangeloop social network look like on Twitter?" In other words, who's talking to whom, and how much? Our shared mental model for the problem was essentially a graph of nodes interconnected by a bunch of undirected edges, where an edge indicated that two people had communicated via Twitter. Hilary had already grabbed Protovis along with a sample showing how to use it to create a force-directed layout, so it was a perfect fit for answering that question.
Three Steps
Today I learned to think about data analysis as three main steps or phases (since the steps can get a little large).
1. Get Data - Get the data in whatever form is easiest; just gather all of the data you'll need and get it on disk. Don't worry about how nice and neat it is.
2. Prune it - Now you can take that mass of data and start to think about what portions of it you can use. The pruning phase is your chance to trim your data down and focus it a bit. This is where you eliminate every aspect of the data except the ones you'll want to visualize.
3. Glam it up - Here's where you figure out what you'll need to do to get your data into a visualizable form.
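To make the three phases concrete, here's a toy sketch of the workflow as three plain methods (the method names and sample tweets are just placeholders I made up, not anything from our actual scripts):

```ruby
# A toy sketch of the three phases, so the pipeline can be
# re-run end to end with one call.

def get_data
  # 1. Get Data: grab everything in whatever form is easiest.
  ["@alice see you at #strangeloop", "lunch was great", "@bob thanks! #strangeloop"]
end

def prune(raw_tweets)
  # 2. Prune it: keep only the records we can actually visualize.
  raw_tweets.select { |tweet| tweet.include?("@") }
end

def glam(pruned_tweets)
  # 3. Glam it up: reshape into the structure the visualization needs.
  pruned_tweets.map { |tweet| { text: tweet, mentions: tweet.scan(/@(\w+)/).flatten } }
end

glam(prune(get_data)).each { |record| puts record.inspect }
```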
1. Getting Data From Twitter
To get our data I wrote a script that used Twitter's search API to download all tweets containing the hashtag #strangeloop. Since the results are paged, my code had to loop through about 15 pages until it had exhausted Twitter's records.
This is the code. It's pretty simple but effective.
```ruby
require 'net/http'

pages_remain = true
number = 1
file_containing_tweets = 'strangeloop_tweets.json'

while pages_remain
  open(file_containing_tweets, 'a') { |f|
    Net::HTTP.start("search.twitter.com") { |http|
      response = http.get("/search.json?q=%23strangeloop&rpp=100&page=#{number}")
      if response.body == '{"error":"page parameter out of range"}'
        pages_remain = false
      else
        f.puts response.body # append the raw JSON for this page
        number += 1
      end
    }
  }
end
```
There may be errors or corner cases and that's fine. None of this is code I would unit test until it became apparent that I should. The main task at hand here is to get data and in this case at least that's a binary result. It's easy to know if some part of that code went wrong. Also, I need to be able to work quickly enough that I can stay in the flow of the problem at hand. I'm really just hacking at Twitter trying to get the data I want to a file on disk. If I have to do it by hand that's fine.
2. Pruning The Data To Fit My Mental Model
I chose to download the data as JSON because I assumed that would be a pretty simple format to integrate with. Now that Ruby 1.9 comes with a JSON module out of the box, it totally was! Well... pretty much.
```ruby
require 'json'

def get_file_as_string(filename)
  data = ''
  f = File.open(filename, "r")
  f.each_line do |line|
    data += line
  end
  return data
end

def get_strangeloop_tweets
  text_file_containing_tweets = 'formatted_tweets.json'
  raw_json_text = get_file_as_string text_file_containing_tweets
  tweets = JSON.parse(raw_json_text)
  return tweets
end
```
My approach once again was very hack-oriented: write a little bit of Ruby script in a way I can verify worked via the command line, then iterate by adding another step or two and repeating. It's like TDD but with much less up-front thought; just hacking and feeling my way around the problem space.
3. Glamming It Up For Protovis
```ruby
Edge = Struct.new(:from, :to)

def get_tweep_connections_from tweets
  tweep_edges = {}
  tweets.each{ |tweet|
    tweep = tweet['from_user']
    to_nodes = extract_all_tweeps_from tweet
    if to_nodes.length > 0
      to_nodes.each{ |node|
        raise "node is blank!!" if node == ''
        edge_a = Edge.new(tweep, node)
        edge_b = Edge.new(node, tweep)
        if tweep_edges.has_key? edge_a
          tweep_edges[edge_a] += 1
        elsif tweep_edges.has_key? edge_b
          tweep_edges[edge_b] += 1
        else
          tweep_edges[edge_a] = 1
        end
      }
    end
  }
  return tweep_edges
end
```
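One thing to note: the script leans on an `extract_all_tweeps_from` helper that isn't shown here. A minimal version, assuming all we want is every @mention in the tweet's text, could look like this (a hypothetical reconstruction, not the code we actually ran):

```ruby
# Hypothetical reconstruction of the missing helper: return every
# @mentioned username in the tweet's text, without the leading '@'.
def extract_all_tweeps_from tweet
  tweet['text'].scan(/@(\w+)/).flatten
end
```

For example, `extract_all_tweeps_from({'text' => '@hmason this was fun'})` returns `["hmason"]`.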
David Joyner was also kind enough to send me his original Python code that essentially does the same thing:
```python
import json, re

RE_MENTION = re.compile(r'@(\w+)')

f = open('formatted_tweets.json')
tweets = json.load(f)
f.close()

graph = {}
for tweet in tweets:
    from_user = tweet['from_user']
    for m in RE_MENTION.finditer(tweet['text']):
        to_user = m.group(0)[1:]
        pair1 = (from_user, to_user)
        pair2 = (to_user, from_user)
        if pair1 in graph:
            graph[pair1] += 1
        elif pair2 in graph:
            graph[pair2] += 1
        else:
            graph[pair1] = 1

for key, value in graph.items():
    print "%s, %s, %d" % (key[0], key[1], value)
```
The thought was that the more active a person was on Twitter, the more they influenced the network. This could cause someone who was really chatty to get over-emphasized in the visualization but in our case it worked out well.
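Concretely, the generator script raises 2 to the power of each edge's exchange count, so every additional exchange between two people doubles the drawn link's value:

```ruby
# How the 2**strength weighting behaves: each extra exchange
# doubles the link value Protovis will draw.
(1..4).each do |strength|
  puts "#{strength} exchange(s) -> link value #{2**strength}"
end
# prints link values 2, 4, 8, 16
```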
So, we had all of this data, but it wasn't in the form that Protovis needed to show our awesome visualization. Hilary figured this out by downloading a sample project from the Protovis website. The data needed to be put in this form:
```javascript
var miserables = {
  nodes:[
    {nodeName:"Myriel", group:1},
    {nodeName:"Napoleon", group:1},
    {nodeName:"Mlle. Baptistine", group:1},
    {nodeName:"Mme. Magloire", group:1},
    {nodeName:"Countess de Lo", group:1},
    {nodeName:"Geborand", group:1},
    {nodeName:"Champtercier", group:1},
    {nodeName:"Cravatte", group:1},
    {nodeName:"Count", group:1},
    {nodeName:"Old Man", group:1}
  ],
  links:[
    {source:1, target:0, value:1},
    {source:2, target:0, value:8},
    {source:3, target:0, value:10},
    {source:3, target:2, value:6},
    {source:4, target:0, value:1},
    {source:5, target:0, value:1},
    {source:6, target:0, value:1},
    {source:7, target:0, value:1},
    {source:8, target:0, value:2},
    {source:9, target:0, value:1}
  ]
};
```
Here's the Ruby that reshapes our node list and edge counts into that structure:
```ruby
def create_protovis_data_from tweeps, tweep_edges
  counter = 0
  tweep_index_lookup = {}
  File.open('strangeloop_words.js', 'w'){|file|
    file.puts 'var miserables = {'
    file.puts 'nodes:['
    tweeps.each{|tweep|
      tweep_index_lookup[tweep] = counter
      file.puts "{nodeName:\"#{tweep}\", group:1}, //#{tweep_index_lookup[tweep]}"
      counter += 1
    }
    file.puts '],'
    file.puts 'links:['
    tweep_edges.each{ |edge, strength|
      from_tweep = edge[:from]
      to_tweep = edge[:to]
      raise "bad to tweep!!" if not tweep_index_lookup.include? to_tweep
      raise "bad from tweep!!" if not tweep_index_lookup.include? from_tweep
      from_index = tweep_index_lookup[from_tweep]
      to_index = tweep_index_lookup[to_tweep]
      file.puts "{source:#{from_index}, target:#{to_index}, value: #{(2)**strength}},"
    }
    file.puts ']};'
  }
end
```
Basically, I create a hash that stores the index number for each Twitter user's name, then look the index up when generating the links portion of the file.
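The lookup itself is simple; here's a toy version with made-up usernames to show the idea:

```ruby
# Each tweep's node index is just its position in the nodes array,
# recorded once up front and reused when emitting the links section.
tweeps = ["hmason", "jcbozonier", "strangeloop"]
tweep_index_lookup = {}
tweeps.each_with_index { |tweep, index| tweep_index_lookup[tweep] = index }
puts tweep_index_lookup["jcbozonier"]  # => 1
```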
Biggest Takeaway: Baby Steps
The Other Biggest Takeaway: Get Data By Any Means Necessary
Next Steps
Published at DZone with permission of Justin Bozonier, DZone MVB. See the original article here.