For the past couple of days I attended the first Strata Conf to be held in London – a conference which seems to bring together people from the data science and big data worlds to talk about the stuff they’re doing.
Since I’ve been playing around with a couple of different things in this area over the last 4/5 months I thought it’d be interesting to come along and see what people much more experienced in this area had to say!
- My favourite talk of the morning was by Jake Porway talking about his company DataKind – “an organisation that matches data from non-profit and government organisations with data scientists”.
In particular he focused on data dives – weekend events DataKind run where they bring together NGOs who have data they want to explore and data scientists/data hackers/statisticians who can help them find some insight in the data.
There was an event in London last weekend and there’s an extensive write up on one that was held in Chicago earlier in the year.
Jake also had some good tips for working with data which he shared:
- Start with a question not with the data
- Team up with someone who knows the story of the data
- Visualisation is a process not an end – need tools that allow you to explore the data
- You don’t need big data to have big insights
Most of those tie up with what Ashok and I have been learning in the stuff we’ve been working on but Jake put it much better than I could!
- Jeni Tennison gave an interesting talk about the Open Data Institute – an organisation that I hadn’t heard about until the talk. Their goal is to help people find value from the Open Government Data that’s now being made available.
There’s an Open Data Hack Day in London on October 25th/26th being run by these guys which sounds like it could be pretty cool.
Jeni had another talk on the second day where I believe she went into more detail about how they are going about making government data publicly available, including the data of legislation.gov.uk.
- Simon Rogers of the Guardian and Kathryn Hurley of Google gave a talk about the Guardian Data blog where Kathryn had recently spent a week working.
Simon started out by talking about the importance of knowing what stories matter to you and your audience before Kathryn rattled through a bunch of useful tools for doing this type of work.
Some of the ones I hadn’t heard of were Google Refine and Data Wrangler for cleaning up data, Google Data Explorer for finding interesting data sets to work with and finally CartoDB, DataWrapper and Tableau for creating data visualisations.
- In the afternoon I saw a very cool presentation demonstrating Emoto 2012 – a bunch of visualisations done using London 2012 Olympics data.
It particularly focused around sentiment analysis – working out the positive/negative sentiment of tweets – which the guys used Lexalytics Salience Engine to do.
One of the more amusing examples showed the emotion of tweets about Ryan Lochte suddenly going very negative when he admitted to peeing in the pool.
- Noel Welsh gave a talk titled ‘Making Big Data Small’ in which he ran through different streaming/online algorithms which we can use to work out things like the most frequent items or to learn classifiers/recommendation systems.
It moved pretty quickly so I didn’t follow everything but he did talk about hash functions, referencing the MurmurHash3 algorithm, and also mentioned the stream-lib library which implements some of the other algorithms he covered.
Alex Smola’s blog was suggested as a good resource for learning more about this topic as well.
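To give a flavour of what a streaming "most frequent items" algorithm looks like, here is a minimal sketch of the Misra-Gries summary in Python. I don't know whether this is one of the algorithms Noel presented or how stream-lib implements its equivalents, so treat it as an illustration only; the function name and the sample stream are made up for the example.

```python
def misra_gries(stream, k):
    """Approximate heavy hitters using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed
    to survive in the returned dict. The stored counts are lower
    bounds on the true counts, not exact frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream: "a" makes up half of the ten items.
sample = "a b a c a b a d a b".split()
candidates = misra_gries(sample, k=3)
print(candidates)
```

The appeal is that memory use depends on k rather than on the length of the stream, which is the 'making big data small' idea. A second pass over the data, if you can afford one, would give exact counts for the surviving candidates.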
- Edmund Jackson then gave an interesting talk about using Clojure to do everything you’d want to do in the data science arena, from quickly hacking something together to building a production-ready piece of machine learning code.
He spent a bit of time at the start of the talk explaining the mathematical and platform problems that we face when working in this area and suggested that Clojure sits nicely at the intersection.
If we need to do anything statistics related we can use Incanter; Weka and Mahout give us machine learning algorithms; we can use JBLAS to do linear algebra; and Cascalog is available to run queries on top of Hadoop.
On top of that if we want to try some code out on a bit of data we have an easily accessible REPL and if we later need to make our code run in parallel it should be reasonably easy to do.
- Jason McFall gave a talk about establishing cause and effect from data which was a good refresher in statistics for me and covered similar ground to some of the Statistics One course on Coursera.
In particular he talked about the danger of going on a fishing expedition where we decide what it is we want to conclude from our data and then go in search of things to support that conclusion.
We also need to make sure we connect all the data sets – sometimes we can make wrong conclusions about something but when we have all the data that conclusion no longer makes sense.
Think Stats was suggested as a good book for learning more in this area.
- The last talk I saw was by Max Gadney talking about the work he’s done for the Government Digital Service (GDS) building a dashboard for departmental data, and for UEFA providing insight to users about what’s happening in a match.
I’d seen some of the GDS stuff before, and Max has written it up pretty extensively on his blog as well, so it was the UEFA stuff that intrigued me more!
In particular he developed an ‘attacking algorithm’ which filtered through the masses of data they had and was able to determine which team had the attacking momentum – it was especially interesting to see how much Real Madrid dominated against Manchester City when they played each other a couple of weeks ago.
Here are some of the things I learned from the second day of talks:
- John Graham-Cumming opened the series of keynotes with a talk describing the problems British Rail had in 1955 when trying to calculate the distances between all train stations and comparing them to the problems we have today.
British Rail were trying to solve a graph problem at a time when people didn’t know about graphs and Dijkstra’s algorithm hadn’t yet been published. It was effectively invented on this project but never publicised. John’s suggestion here was that we need to share the stuff that we’re doing so that people don’t re-invent the wheel.
He then covered the ways they simplified the problem by dumping partial results to punch cards, partitioning the data & writing really tight code – all things we do today when working with data that’s too big to fit in memory. Our one advantage is that we have lots of computers that we can get to do our work – something that wasn’t the case in 1955.
There is a book titled ‘A Computer Called LEO: Lyons Teashops and the World’s First Office Computer’ which was recommended by one of the attendees and covers some of the things from the talk.
The talk is online and worth a watch, he’s pretty entertaining as well as informative!
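For anyone who hasn't come across it, the station distance problem from the talk is what we'd now solve with Dijkstra's algorithm. Below is a small Python sketch of the textbook version using a priority queue. It is not a reconstruction of what British Rail actually ran in 1955, and the station names and mileages are invented for the example.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest distance from source to each node.

    graph maps a node to a dict of {neighbour: non-negative distance}.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


# Hypothetical stations with made-up mileages between them.
network = {
    "A": {"B": 4, "C": 10},
    "B": {"A": 4, "C": 3, "D": 8},
    "C": {"A": 10, "B": 3, "D": 2},
    "D": {"B": 8, "C": 2},
}
print(shortest_distances(network, "A"))
```

Running this once per station gives the full table of distances between all stations, which is the output British Rail were after.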
- Next up was Alasdair Allan who gave a talk showing some of the applications of different data sources that he’s managed to hack together.
For example he showed an application which keeps track of where his credit card is being used via his bank’s transaction record and it sends those details to his phone and compares it to his current GPS coordinates.
If they differ then the card is being used by someone else, and he was actually able to detect fraudulent use of his card more quickly than his bank on one occasion!
He also got access to the data on an RFID chip of his hotel room swipe card and was able to chart the times at which people went into/came out of the room and make inferences about why some times were more popular than others.
The final topic covered was how we leak our location too easily on social media platforms – he referenced a paper by some guys at the University of Rochester titled ‘Finding your friends and following them to where you are’ in which the authors showed that it’s quite easy to work out your location just by looking at where your friends currently are.
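Going back to the credit card example, the core of the check is just comparing two sets of coordinates. Alasdair didn't show his code, so the following Python sketch is my guess at the general shape rather than his implementation: it uses the haversine formula for the distance, and the function names, the 50 km threshold and the sample coordinates are all assumptions of mine.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def looks_suspicious(card_location, phone_location, threshold_km=50.0):
    """Flag a transaction made far away from where the phone currently is.

    threshold_km is an arbitrary illustrative value, not a tuned one.
    """
    return haversine_km(*card_location, *phone_location) > threshold_km


# Approximate coordinates for central London and central Paris.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(looks_suspicious(card_location=paris, phone_location=london))
```

A real version would also need to cope with online purchases and with transaction records that arrive late, neither of which this sketch attempts.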
- Ben Goldacre did the last keynote in which he covered similar ground as in his TED talk about pharmaceuticals not releasing the results of failed trials.
I didn’t write down anything from the talk because it takes all your concentration to keep up with him but he’s well worth watching if you get the chance!
- I attended a panel about how journalists use data and an interesting point was made about being sceptical about data and finding out how the data was actually collected rather than just trusting it.
Another topic discussed was whether the open data movement might be harmed if people come up with misleading data visualisations – something which is very easy to do.
If the data is released and people cause harm by making cause/effect claims that don’t actually exist then people might be less inclined to make their data open in the first place.
We were encouraged to think about where the gaps are in what’s being reported. What isn’t being reported but perhaps should be?
- The coolest thing I saw at Strata was the stuff that Narrative Science are doing – they have developed some software which is able to take in a load of data and convert it into an article describing the data.
We were shown examples of this being done for football matches, company reports and even giving feedback on your performance in an exam and suggesting areas in which you need to focus your future study.
Wired had an article a few months ago where they interviewed Kristian Hammond, one of the co-founders and the guy who gave this talk.
I have no idea how they’re doing what they’re doing but it’s very very clever!
- I’d heard about DataSift before coming to Strata – they are one of the few companies that have access to the Twitter firehose and have previously been written up on the High Scalability blog – but I still wanted to see it in action!
The talk was focused around five challenges DataSift have had:
- Digging through unstructured data volumes – they take tweets and convert them into 94 different files using some NLP wizardry. They use Lexalytics Salience Engine to help them do this.
- Filtering – separating the signal from the noise. Popular hash tags end up getting massively spammed so those tweets need to be excluded.
- Analysing – real time filtering and tagging of data. They use the Cloudera Hadoop distribution.
- Variety – integrating data from different sources, e.g. showing the Facebook stock price vs the Twitter sentiment analysis of the company.
- Make it work 24/7
There was a very meta demo where the presenter showed DataSift’s analysis of the strataconf hash tag which suggested that 60% of tweets showed no emotion but 15% were extremely enthusiastic – ‘that must be the Americans’.
- I then went to watch another talk by Alasdair Allan – this time pairing with a colleague of his, Zena Wood, talking about the work they’re doing at the University of Exeter.
It mostly focused on tracking movement around the campus based on which wifi mast your mobile phone was currently closest to and allowed them to make some fascinating observations.
e.g. Alasdair often took a different route out of the campus which was apparently because that route was more scenic. However, he would only take it if it was sunny!
They discussed some of the questions they want to answer with the work they’re doing such as:
- Do people go to lectures if they’re on the other side of the campus?
- How does the campus develop? Are new buildings helping build the social network of students?
- Is there a way to stop freshers’ flu from spreading?
- The last talk I went to was by Thomas Levine of ScraperWiki talking about different tools he uses to clean up the data he’s working with.
There were references to ‘head’, ‘tail’, ‘tr’ and a few other Unix tools and a Python library called unidecode which is able to convert Unicode data into ASCII.
He suggested saving any data you’re working with into a database rather than working with raw files – CouchDB was his preference and he’s also written a document-like interface over SQLite called DumpTruck.
In the discussion afterwards someone mentioned Apache Tika which is a tool for extracting metadata using parser libraries. It looks neat as well.
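As a rough idea of the Unicode to ASCII clean-up step, here is a Python sketch that uses only the standard library's unicodedata module. To be clear, this is not unidecode and does much less: unidecode transliterates a far wider range of characters, whereas this approach only strips accents and silently drops anything that has no ASCII decomposition. The function name is my own.

```python
import unicodedata


def strip_accents(text):
    """Crude ASCII folding: decompose characters, then drop non-ASCII parts.

    Characters with no ASCII decomposition are removed entirely, so this
    loses information that a proper transliteration library would keep.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("café in Zürich"))
```

For one-off clean-up of names with accents this is often enough, but for anything multilingual a dedicated library is the safer choice.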
- A general trend at this conference was that some of the talks ended up feeling quite salesy and some presenters would only describe what they were doing up to a certain point at which the rest effectively became ‘magic’.
I found this quite strange because at the software conferences I’ve attended people are happy to explain everything to you, but I think here the ‘magic’ is actually how people are making money so it doesn’t make sense to expose it.
Overall it was an enjoyable couple of days and it was especially fascinating to see the different ways that people have come up with for exploring and visualising data and creating useful applications on top of that.