OK, I get that Big Data is generating a lot of hype right now, and I also get that humans like to give names to phenomena so they have a handle for thinking about them. However, the amount of buzzword bingo around the whole Big Data sphere is just staggering.
ZDNet has a nicely interlinked blog post by Tony Baer on “Fast Data”, apparently the new tag for real-time, low-latency analytics. I recently attended the Berlin Buzzwords conference and can confirm that real-time analytics is in fact a pretty hot topic right now. Over and over again, people admitted that they had no current analytics on key performance indicators of their infrastructure, be it the number of games installed or other kinds of metrics, and they discussed different ways to run batch-processing systems like Hadoop in tiny iterations to get closer to real-time.
Of course, real-time data mining is nothing new, and as I’ve already discussed elsewhere there exists a whole field of research called stream mining to deal with these topics. However, it looks like the industry is just beginning to adopt these techniques.
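To make the stream mining idea concrete, here is a minimal sketch (not from any particular stream mining library) of the kind of one-pass technique that field studies: Welford's online algorithm, which maintains a running mean and variance over a stream without ever storing the data.

```python
# Welford's online algorithm: update mean and variance one element at a
# time, in constant memory -- a textbook example of a stream mining
# building block.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean)        # 5.0
print(stats.variance())
```

The point is that each element is seen exactly once and then discarded, which is what makes such methods usable at real-time rates where batch reprocessing is too slow.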
Another insight I’ve also discussed is that disks are often too slow for real-time. Unless all of your requests can be served from the cache in memory, a disk access is quite slow, and you cannot get beyond a few hundred requests per second. Now from a machine learning point of view, having all your data in memory and running analysis methods on it isn’t anything special either. After all, all the usual frameworks like R, Matlab, or SciPy work that way: read the data, clean the data, run an analysis, write a report. I’d say ML (and also data science or computational statistics, for that matter) is so memory-centric that most of my colleagues view a database as just another storage format for data exchange.
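The read/clean/analyze/report workflow above can be sketched in a few lines of plain Python. The data is inlined here (with a hypothetical installs-per-game metric) so the example is self-contained; in practice the first step would be reading a CSV or querying a database, but everything still lives in RAM the whole time.

```python
# The memory-centric workflow: load, clean, analyze, report.

# Load: here the records are inlined instead of read from disk.
raw = [
    {"game": "a", "installs": 120},
    {"game": "b", "installs": None},  # missing measurement
    {"game": "c", "installs": 300},
    {"game": "d", "installs": 80},
]

# Clean: drop records with missing values.
clean = [row for row in raw if row["installs"] is not None]

# Analyze: a trivial "analysis" -- total and mean installs.
total = sum(row["installs"] for row in clean)
mean = total / len(clean)

# Report.
print(f"{len(clean)} games, {total} installs, mean {mean:.1f}")
```

Nothing here ever touches a database or even a disk after the initial load, which is exactly why calling this style of processing "in-memory analytics" sounds redundant from the ML side.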
However, the idea of using your memory for something other than disk caching seems to be so mind-bogglingly new to database people that they invented a completely new term for it: “in-memory analytics.” Apparently in an effort to jump on the Big Data buzzword bandwagon, companies like SAP, Oracle, or SAS have started to offer “in-memory analytics” products and solutions, which are basically just the way you normally process data, at least in my world ;)
I think we still need a few more buzzwords, so here are a few more suggestions:
Big Data Science: Neither Big Data nor Data Science alone is sufficient; we need Big Data Science!
Real-Time Big Data: IMHO, this sounds better than Fast Data and also has Big Data in it, a clear win.
Small Data: That way, we can bring all the existing algorithms that don’t quite scale back into the Big Data world. The main selling point here is that these methods are often exact and not just approximations, leading to much more accurate results!
I’m only half-joking here. A friend of mine who works at Teradata has told me that classical database vendors have started to interpret NoSQL as “Not Only SQL”.