Searching for Big Insights at Big Data Week
This article was written by Craig Wentworth at the MWD blog.
I was at Big Data Week’s London Conference to hear about how bigger and faster data can make our cities smarter, our selves more quantified, and our ‘things’ more interconnected. In fact, the subject of all those billions of instrumented ‘things’, and the deluge of sensor data they threaten to drown us all in, pretty much anchored most of the day’s talks. 2014 is certainly shaping up to be the year of Velocity. Blink and you’ll miss it.
But let’s not mistake data for information (let alone insight). In the haystack of human history, even finding the needle doesn’t necessarily mean you’ll be able to darn your socks properly. It just means you’ve found a needle (well done you!). What you need to do then is find a contextual thread, and make sure you’ve still got your socks to hand (the world will have turned; you may have more pressing garment repairs to attend to).
First up after the morning’s introductions was Kenneth Cukier, Data Editor of The Economist…
Cukier waved the flag for “simple models and a lot of data trump[ing] more elaborate models based on less data” (to quote from Halevy, Norvig, and Pereira’s seminal 2009 paper for Google, The Unreasonable Effectiveness of Data). He came armed with an impressive array of examples to prove his point: everything from speech recognition, through grammar checking and self-driving cars, to accurately determining America’s favourite pie filling (hint: you may be surprised), it seems, benefits from extra helpings of as much data as you can eat. (Apparently, whilst apple is still firmly the king of the family 12” pies (phew!), when you scrutinise more and more of the sales data (and look at smaller, more individual pastry offerings) it drops to seventh in the national rankings.) So what did all this tell us? That when people are free to exercise their personal pudding choice (without the need to pander to the lowest common family fruit denominator) they’d rather eat almost anything but apple? Possibly. But what it definitely does show us is that you don’t really know the full picture until you’ve analysed the fullest dataset you can get your hands on.
Big Data can give you new insights into things you thought you knew all about already… as long as you know what questions to ask.
However, at this point it’s probably worth citing Alistair Croll’s and Tim O’Reilly’s (separate) 2011 ripostes to the earlier Google ‘data über alles’ assertion, namely that less data can trump more data after all – if those in possession of it make up for their data shortcomings by having more of a clue about what to do with the little they do have.
In truth there’s probably room for both schools. Just as Clean Data vs More Data approaches can healthily co-exist, depending on your use case, so Big Data + No Clue doesn’t necessarily equal zero value… it depends, again, on your use case. Sometimes you’ll get what you’re looking for just by sampling bigger; sometimes you’ll need to ask better questions, or take the answer you do get and frame its implications within the context brought from other, parallel interrogations – lest you succumb to the attraction of spurious correlations (murder rates and ice cream sales both climb when the summer comes to New York… but you’re not about to put out an APB on the city’s ice cream trucks in the search for psychosis-inducing confectionery additives… are you?).
We then dived into three trending themes du jour of the Big Data movement (Smart Cities, Quantified Self, and the Internet of Things).
We had more pi(e) from Dave Starling next, but this time of the Raspberry variety (American family filling ranking: unknown). Starling is Chief Architect at Picsolve (purveyors of photo and video sharing systems for the theme park, leisure and tourist industries). Having started with attraction photography in 1993, the company now augments its offer with analytics fed by data gleaned from sensors embedded all over the rides (often on low-cost Raspberry Pis for redundancy). Picsolve’s installations use multiple Couchbase NoSQL instances in Amazon’s cloud to analyse all sorts of variables – scream decibels, g-force, percentage of faces in a souvenir photo showing abject fear vs unbridled joy, etc. He confesses that, at the park level, it isn’t really big data in the volume sense, but its fast bursts of schema-less data certainly satisfy the variety and velocity Vs. Plus it’s run on an architecture similar to the one people might pull together for a big data initiative… and once enough of it’s aggregated together to generate strategic insights across the business it can start to get plenty big enough.
The conference’s second topic (the Quantified Self) was introduced by Ruth Thomson from Cambridge Consultants. She defined her theme as “measurements about me and my surroundings which provides me with actionable information that enables me to become fitter, healthier, [better at sport]”, etc. Her talk took in examples of the myriad multi-instrumented sensors available to ‘log your life’ (or ‘hack your body’, depending on your view) – some embedded in phones and watches; some in dedicated monitoring devices to be worn about your person. The technology’s come on in leaps and bounds (apt, really) as ever-smaller, ever-lower-power measuring devices have hit the market – and all sorts of insights can now be presented at the flick of a wrist (literally, in some cases) to improve your posture / golf swing / exercise regime, etc.
Her “what next, then?” musings were interesting to note for their wider applicability too. Smartphones, claims Thomson, have rather jump-started the market for the smart wearable tech that measures, logs, analyses and advises us on how we walk, run, swim, play sport, etc – specifically with their ability to connect with tiny measuring devices about your person over Bluetooth, and with the cloud for storage and sharing data over wireless; plus their processing power and user interface, harnessed through a marketplace of apps. However, whilst this has facilitated the current wave of innovation, for things to really take off Thomson reckons the devices doing the quantifying about one’s self need to shake off the smartphone shackles and operate autonomously (iPhones are notoriously intolerant of swimming pools, for example). Also, whilst it’s true enough to say the monitoring tech is interconnected with its phone (acting as a hub), it’s often siloed and rarely interconnected with anything else – Thomson refers to this as it behaving like an “Internet of Thing”. And finally, whilst your smartphone hub might share some data via the cloud, permitting social interactions (the Quantified Selfie, anyone?), it’s still relatively Small Data (about individuals). Smart buildings, fitted out with passive nodes, could potentially collect enough to warrant anonymising and aggregating for Big Data applications. So, it sounds like most of those Digital Enterprise shift vectors (mobile, social, cloud, big data analytics) are either at play now – albeit on a smaller scale – or soon could be… stand by for the Digital Individual shift, then?
And now to Smart Cities…
The notion of Future (smart) Cities was introduced in talks by Edward Bryan (IBM’s VP of Smarter City Development) and Karen Lomas (EMEA Director of Smart Cities and Buildings at Intel). Bryan repeated the mantra of needing to bring together data from multiple sources in order to improve decision-making and generate the next best action (something I covered in my MWD report Big Data: What is it and why should I care?). These smarter cities start with data (from sensors everywhere) which gets monitored, visualised, and analysed (and correlated with yet more sources – for instance, social media sentiment, transaction data), and is then used to automate, notify and coordinate (e.g. dispatching street light repair teams, booking in highway maintenance so that the road’s only dug up once for everyone to lay their cables, etc). It’s this flow from data, through information, to (actionable) insight that results in an outcome which can dramatically improve the services citizens depend on.
Lomas’ focus was more on the citizens themselves, and on a design thinking approach that envisages ‘a day in the life of…’ people, because of the different roles we play at different times and places as we interact with a city’s services (home resident, traveller, worker, shopper, leisure customer, etc). She also highlighted the potential for gamification of these interactions when open data is exposed for new applications (for example, the Chromaroma game of “location-based Top Trumps” that allows London’s commuters to gain points for themselves and their team based on their choice of journeys around the capital).
And last, but by no means least…
Many of the remaining speakers were CTOs of tech start-ups, and the tone changed a little from here on. Yes, still some interesting examples of Big Data in action, but lots of specific show and tell (and job ads). At times, I wasn’t sure whether those on the stage wanted to sell to the audience or hire from it.
One highlight for me, though, was Toby Oliver (CTO of Path Intelligence) with a talk loosely aligned to the Quantified Self theme, though one which could probably be more accurately described as the Inadvertently Broadcast Self. His company describes its business as based on “polite RF mining”, and by that he means its systems collect RF signals from the ether as they leak out of mobile phones (smart, dumb – they’re not fussy) as a proxy for where their owners are, where they’re going, and what they’re coming back to. Path Intelligence sells to shops and malls keen to understand customer behaviour in their physical spaces, and it does that by capturing anonymous RF phone signals (identifying each uniquely to track where it is) and from that aggregating up to calculate footfall, dwell time, visitor density, unit correlation (as in “x% of people who visited Gap also visited Starbucks”), etc.
Oliver’s data is certainly big as well as fast. Path Intelligence collects 566Tb of raw RF data per day, but this shrinks to 180Gb of information once identifiers have been extracted, leading to an estimated 5Mb of actionable insights (such as ‘move goods X from aisle A to aisle B’, or ‘deploy some assistants to explain the products in zone C now to convert burgeoning interest into sales’). He’s at pains to stress (as does the company’s website) that this ‘footpath’ technology cannot identify individuals and only aggregated data is shared with clients. Just as well, I suppose, since presumably un-aggregated footpath data overlaid with CCTV footage running facial recognition, and trawling through social media, could probably make a decent fist of doing so!
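To make those retail metrics a little more concrete, here is a minimal Python sketch of how footfall, dwell time and unit correlation might be derived from anonymised sightings. It is purely illustrative: the record layout, the sample data and the function names are my own assumptions, not anything Path Intelligence presented.

```python
from collections import defaultdict

# Hypothetical sample data: (anonymous device id, store unit, seconds spent there).
SIGHTINGS = [
    ("d1", "Gap", 300),
    ("d1", "Starbucks", 600),
    ("d2", "Gap", 120),
    ("d3", "Starbucks", 240),
    ("d4", "Gap", 60),
    ("d4", "Starbucks", 180),
]


def visitors_by_unit(sightings):
    """Map each unit to the set of distinct devices seen there."""
    visitors = defaultdict(set)
    for device, unit, _seconds in sightings:
        visitors[unit].add(device)
    return visitors


def footfall(sightings):
    """Number of distinct devices seen in each unit."""
    return {unit: len(devs) for unit, devs in visitors_by_unit(sightings).items()}


def mean_dwell(sightings):
    """Average seconds spent per distinct device in each unit."""
    total = defaultdict(int)
    for _device, unit, seconds in sightings:
        total[unit] += seconds
    counts = footfall(sightings)
    return {unit: total[unit] / counts[unit] for unit in total}


def unit_correlation(sightings, unit_a, unit_b):
    """Percentage of unit_a's visitors who were also seen in unit_b."""
    visitors = visitors_by_unit(sightings)
    seen_a = visitors.get(unit_a, set())
    if not seen_a:
        return 0.0
    both = seen_a & visitors.get(unit_b, set())
    return 100.0 * len(both) / len(seen_a)


if __name__ == "__main__":
    print(footfall(SIGHTINGS))
    print(mean_dwell(SIGHTINGS))
    print(unit_correlation(SIGHTINGS, "Gap", "Starbucks"))
```

Note that only the aggregated outputs (counts, averages, percentages) would be shared in a scheme like this; the per-device identifiers exist solely to avoid double counting.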
Another notable mention goes to Phil Harvey, CTO of DataShaka – a company focused on the Variety of Big Data. However, not content with Variety or any of Big Data’s other ‘Vs’, he presented us with three Cs of data: Chaordic, Connectionist, and Consilience as characteristics that help it flow “like water” in an organisation.
In short, Chaordic data is in the Goldilocks zone between unstructured (chaotic) and structured (ordered) data – semi-structured, with a foot in each camp (a more relaxed structure, but with a mechanism for layering on queries). Connectionist data is at the nexus of sets vs individuals – focusing on the connections between the latter, rather than the relations between the former (it’s the approach at the heart of the notion of the semantic web and linked-data). Consilient data focuses on “things that jump together”, addressing the combination of data variety at a fundamental level so that it forms a single, valid and unified set (unimpeded by diverse schemas).
So where did all this leave me? The conference’s final speaker, Eva Pascoe from New Urban Informatics, rounded off the day with a plea to balance innovation with safeguards for personal privacy. Yes, you can do all these things and more; but ought you? To earn and retain their customers’ trust, as well as buy their loyalty, companies should be more open about the trade-offs and opt-outs; the entitlements and obligations.
Without that, most of us content to become ‘part of the product’ in return for a better service are probably doing so because we’d rather, say, wade through the 30,066 words in Hamlet than the 36,275 in PayPal’s terms and conditions.
To agree or not to agree, that is the question.
Published at DZone with permission of Angela Ashenden, DZone MVB. See the original article here.