As the NSA PRISM debacle continues to unfold and spread across continents, it’s probably good to stop and think about the technology and philosophy behind it all. Because this is big data and analytics in its most potent and controversial form, and it’s certainly not the last time we’ll see it hit the headlines.
The NSA built Accumulo, a distributed key-value store based on Google’s BigTable design. Written in Java, Accumulo was contributed by the NSA to the Apache Foundation, and in 2012 it was promoted from incubation to a top-level project. Right now the Agency is harvesting petabytes of data into Accumulo, a staggering amount that grows daily. But the cleverest part of all this is the analytics: Accumulo extends the BigTable concept to analyze trillions of data points, creating intelligence that can detect the connections between those points and the strength of those connections.
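To make the “connections and their strength” idea concrete, here is a minimal sketch in plain Python (not the real Accumulo client API) of the BigTable-style layout Accumulo inherits: every cell is keyed by a row, a column family, and a column qualifier. The names and the in-memory dict are invented for illustration; in the sketch, an edge between A and B is a cell in row A, family "edge", qualifier B, whose value counts how often the link was observed.

```python
# Hypothetical in-memory stand-in for a BigTable-style table.
# Real Accumulo stores sorted (row, family, qualifier) -> value cells
# across a cluster; this dict only mirrors the key shape.
from collections import defaultdict

table = defaultdict(int)

def record_contact(a: str, b: str) -> None:
    """Increment the edge weight for a contact between a and b."""
    table[(a, "edge", b)] += 1
    table[(b, "edge", a)] += 1  # store both directions so either row can be scanned

def neighbors(node: str) -> dict:
    """Scan one row: who `node` is connected to, and how strongly."""
    return {qual: v for (row, fam, qual), v in table.items()
            if row == node and fam == "edge"}

record_contact("alice", "bob")
record_contact("alice", "bob")
record_contact("alice", "carol")
print(neighbors("alice"))  # {'bob': 2, 'carol': 1}
```

The point of the row-oriented key design is that answering “who does this person know, and how well?” is a single contiguous row scan, which is what makes this kind of lookup feasible at trillion-cell scale.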
If you thought you had it sussed with something like LinkedIn’s INmaps (my network is too large to generate one, funnily enough) or Facebook’s Social Graph, think again, because through Accumulo the NSA can find out who you are, where you are, who you know and why you know them. It’s graph analysis on steroids, and it’s a hot topic right now for making sense of large datasets, primarily by understanding how tightly different data points are related, or how similar they are to each other.
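One common way of scoring “how similar two data points are” in a graph is Jaccard similarity over their neighbor sets: the overlap of two people’s circles divided by the union. A small hedged sketch, with an entirely invented graph:

```python
# Toy social graph: each node maps to the set of its neighbors.
# All names are made up for illustration.
graph = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "eve":   {"mallory"},
}

def jaccard(a: str, b: str) -> float:
    """|N(a) & N(b)| / |N(a) | N(b)| -- 1.0 means identical circles, 0.0 none shared."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

print(jaccard("alice", "bob"))  # 0.25 -- they share carol out of four distinct contacts
print(jaccard("alice", "eve"))  # 0.0  -- no overlap at all
```

Scores like this, computed across every pair of interest, are how an analyst ranks which connections are strong enough to matter.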
To put this in perspective in terms of the amount of data needed to make this happen: a few years back Yahoo was operating a roughly 42,000-node Hadoop environment holding hundreds of petabytes, and Facebook users are generating more than 500 terabytes of data every day. So the Agency has some infrastructure in place. Just what is that infrastructure built on? I’m sure there are vendors out there keeping a very tight lip. But rest assured, if they are that advanced they’ll have already considered, or even built, a cloud infrastructure, potentially hybrid, to stay ahead.
But behind all the noise and crowd-baiting from the Washington Post, the exposure raises more questions about the power of big data and analytics, and just how large and powerful it gets. Think of all the hype generated about consumer privacy, and about enterprises collating and analyzing information for a more targeted and personal experience: customer segmentation and demographics, location-based and real-time marketing. What the NSA exposure has taught us is that there really is no privacy in the 21st century, and we should just get used to it. Our data is anonymized unless it’s being used specifically for our purpose and benefit, but the fact is we are happily generating it for them to use in any case.
But Big Data is no longer creepy. Sorry, but it’s not. You must live in painful ignorance if you think that every nuance of a digital interaction hasn’t been collected and analyzed by someone. What’s clear is that analytics and big data tend to be labeled as tools only for marketers to hound us with, or for banks to sell us more debt-laden products. We forget, for example, about the medical and scientific boundaries being broken with the help of data analytics and human-generated information.
At some point there will be consumer tools affordable enough for people to make sense of the data they generate themselves, and why not? It’s all part of the equation. Personal graph analysis will become a reality, just as its parent is wielded by enterprises.
So, you see, we have heroes and villains even in data analytics, but it’s all a matter of perspective. The NSA is deemed evil for breaching our liberties and analyzing data without our consent to understand terrorist activities, while medical science is a force for good, helping us cure diseases using data sourced from all manner of places.
And to paraphrase the Man of Steel trailer:
“The analysts believed that if the world found out what big data really was, they’d reject me. They were convinced that the world wasn’t ready. What do you think?”