My Reflections on Data Gravity
Dave McCrory introduced his idea of Data Gravity with a blog post back in 2010. The core idea was — and is — interesting, and got some traction from sites like ReadWriteWeb, ZDNet and GigaOM. More recently, Data Gravity featured in this year’s EMC World keynote. But beyond the observation that large or valuable agglomerations of data exert a pull that tends to see them grow in size or value, what is a recognition of Data Gravity actually good for? Stefano Bertolo perhaps summed the question up best, suggesting;
(And for those who don’t know, that’s erudite Italian for the rich getting richer as the poor get poorer.)
As a concept, Data Gravity seems pretty closely associated with current enthusiasm for Big Data. And, like Big Data, the term’s real-world connotations can be unhelpful almost as often as they are helpful.
Big Data is generally accepted to exhibit at least three characteristics, which are Volume, Velocity and Variety. Various other V’s, including Value, also get mentioned from time to time, but with less consistency. And yet, Big Data’s name says it’s all about size. Size (volume) matters. The speed with which data must be ingested, processed or excreted is less important. The complexity and diversity of the data doesn’t matter either. And that’s nonsense, of course.
On its own, the size of a data set is neither here nor there. Coping with lots of data certainly raises some not-insignificant technical challenges, but the community is actually doing a pretty good job of coming up with technically impressive solutions. The interesting aspect of a huge data set isn’t its size, but the very different modes of working that become possible when you begin to unpick the complex interrelationships between data elements. Sometimes, Big Data is the vehicle by which enough data is gathered together about enough aspects of enough things from enough places for those interrelationships to become observable against the background noise. Other times, Big Data is the background noise, and any hope of insight is drowned beneath the unending stream of petabytes.
To a degree, Data Gravity’s name falls into the same trap. More gravity must be good, right? And more mass leads to more gravity. And mass must be connected to volume, in some vague way that was explained when I was 11, and which involves STP. Therefore, bigger data sets have more gravity. Which means that bigger data sets are better data sets. QED. That assertion is clearly nonsense, but luckily it’s not actually what McCrory is suggesting. His arguments are more nuanced than that, and potentially far more useful.
Instinctively, I like that the equation attempts to move attention away from ‘the application’ toward the pools of data that support many, many applications at once. The data is where the potential lies. Applications are merely the means to unlock that potential in various ways. So maybe notions of Potential Energy from elsewhere in Physics need to figure here?
But I’m wary of the emphasis given to real numbers that are simply the underlying technology’s vital statistics; network latency, bandwidth, request sizes, numbers of requests, and the rest. I realise that these are the measurable things that we have, but feel that more abstract notions of value (and even, perhaps, tangible economics) need to figure just as prominently. As Sam Johnston commented this afternoon,
It’s much less clear to me, though, how we could set about assigning numbers to ‘value’ in a way that doesn’t simply end up with every single data provider (miraculously, of course) finding their own little pot of pointless trivia to have the biggest gravitational pull of any resource on the web. That really won’t help us. Numbers of requests may give one measure of one aspect of value, but it’s not the whole story either.
And so I’m left reaffirming my original impression that Data Gravity is “interesting”. It’s also intriguing, and I keep feeling that it should be insightful. I’m just not — yet — sure exactly how. Is a resource with a Data Gravity of 6 twice as good as a resource with a Data Gravity of 3? Does a data set with a Data Gravity of 15 require three times as much investment/infrastructure/love as a data set scoring a humble 5? It’s unlikely to be that simple, but I do look forward to seeing what happens as Dave begins to work with the parts of our industry that can lend empirical credibility to his initial dabblings in mathematics.
If real numbers show the equations to stand up, all we then need to do is work out what the numbers mean. Should an awareness of Data Gravity change our behaviour, should it validate what gut feel led us to do already, or is it just another ‘interesting’ and ultimately self-evident number that doesn’t take us anywhere?
I don’t know, but I look forward to finding out.
Published at DZone with permission of Paul Miller, DZone MVB. See the original article here.