Over the last 12 months, I’ve had plenty of “conversations” about big data analytics and BI strategies with customers and potential users. The five words below are tell-tale signs of decay in the field, summing up the current state of analytics/BI and demonstrating why it is, by and large, in a sorry state. (Beware: I'm going to use a measure of hyperbole to underline my point.)
This is probably obvious to most industry insiders, but it's worth mentioning: if you have a “batch” process in your big data analytics, you're not processing live data. You're not processing data in a real-time context. Period.
That means you're analyzing stale data, and your smarter, more agile competitors are running circles around you. They can analyze and process live (streaming) data in real time and make appropriate operational BI decisions based on real-time analytics.
Using “batch” in your system design is like running your database off a tape drive. Would you do that when everyone around you is using disks?
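The batch-vs-streaming contrast above can be sketched in a few lines. This is a minimal illustration, not any particular product's API: the batch function must wait for the full data set to land before it can answer, while the streaming version folds each event in as it arrives, so the answer is always current.

```python
def batch_average(events):
    """Batch style: wait until all the data has landed, then scan it all."""
    return sum(events) / len(events)

class StreamingAverage:
    """Streaming style: fold each event in as it arrives.
    The result is always up to date; no batch window has to close first."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # current answer after every event

stream = StreamingAverage()
for v in [10, 20, 30]:
    latest = stream.update(v)

print(latest)                       # 20.0 -- available after every event
print(batch_average([10, 20, 30]))  # 20.0 -- available only once the batch closes
```

Both paths compute the same number; the difference is *when* you get it — and in operational BI, "when" is the whole point.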
A bit controversial. But if you need one, your analytics/BI are probably not driving your business -- since you need a human body between your business and your data. Humans (who sadly need to eat and sleep) saddle any process with massive latency and non-real-time characteristics. In most cases, needing a data scientist simply means:
- The data you're collecting -- and the system collecting it -- are so messy that you need a Data Scientist (i.e. Statistician/Engineer under thirty) to clean it up
- Your process is too hopelessly slow and clunky for real automation
- Your analytics/BI is outdated by definition (i.e. analyzing stale data with no meaningful BI impact on daily operations)
The little brother of “Batch.” It is essentially a built-in failure for any analytics or BI. In the world of hyper-local advertising, geolocation, and up-to-the-second updates on Twitter or Facebook or LinkedIn, you're the proverbial grandma driving a '66 Buick on the highway, turn signal blinking as everyone speeds past you… There’s simply no excuse for having any type of overnight processing (except for some rare legacy financial applications). Overnight processing is not only technical laziness but often a built-in organizational tenet -- and that’s what makes it even more appalling.
The little brother of “Overnight.” ETL is what many people blame for overnight processing. “Look, we’ve got to move this Oracle into Hadoop and it takes 6 hours, and we can only do it at night when no one is online.” Well, I can count only two or three of our clients where no one is online during the night. This is 2012, for God’s sake! Most businesses -- even smallish startups -- are 24/7 operations these days.
ETL is the clearest sign of significant technical debt accumulation. It is, for the most part, indicative of a defensive and lazy approach to system design. It is especially troubling to see this approach in newer, younger companies that don’t have 25 years of legacy to deal with.
And it is equally invigorating to see it being steadily removed in companies with fifty years of history in IT.
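The reason nightly ETL jobs take six hours is that a full dump scales with *total* data size, while an incremental load scales only with the *change rate* -- and so can run continuously instead of overnight. A minimal sketch of the difference, with made-up in-memory "databases" standing in for Oracle and Hadoop:

```python
source = {1: "alice", 2: "bob", 3: "carol"}   # pretend source DB
target = {}                                   # pretend analytics store

def full_nightly_load():
    """Classic ETL: copy everything. Runtime grows with total data size --
    hence 'we can only do it at night.'"""
    target.clear()
    target.update(source)

def incremental_load(changes):
    """Incremental: apply only what changed since the last sync.
    Runtime grows with the change rate, so it can run all day long."""
    for key, value in changes:
        if value is None:
            target.pop(key, None)   # a deletion
        else:
            target[key] = value     # an insert or update

full_nightly_load()
incremental_load([(2, "bobby"), (4, "dave"), (1, None)])
print(target)  # {2: 'bobby', 3: 'carol', 4: 'dave'}
```

Real systems do this with change-data-capture or log shipping rather than a Python dict, but the asymptotics are the same: ship the deltas, and the overnight window disappears.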
This is a bit controversial, too. But I’m getting a bit tired of hearing, “We must design to process Petabytes of data” from companies with twenty employees. Let me break it down:
- 99.99% of companies will NEVER need petabyte scale
- If your business “needs” to process Petabytes of data for its operations, you're likely doing something very wrong
- Most of the “working sets” that we’ve seen, i.e. the data you really need to process, measure in the low teens of terabytes for the absolute majority of use cases
- Given how frequently data is changing (in its structure, content, usefulness, freshness, etc.) I don’t expect that “working set” size will grow nearly as quickly (if at all) -- overall data amount will grow, but not the actual “window” that we need to process.
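The working-set point above can be shown in miniature. Total data grows without bound, but the "hot" window you actually analyze stays fixed -- a bounded buffer over the most recent data. The sizes here are illustrative, not measurements:

```python
from collections import deque

WINDOW = 7  # keep only the last 7 days of data "hot" (illustrative size)

archive = []                         # total data: grows forever (cold storage)
working_set = deque(maxlen=WINDOW)   # bounded hot window for analytics

for day in range(1, 31):             # a month of daily loads
    batch = f"day-{day}"
    archive.append(batch)            # total volume keeps climbing...
    working_set.append(batch)        # ...but the hot window stays fixed

print(len(archive))      # 30 -- overall data amount grew
print(len(working_set))  # 7  -- the window we actually process did not
```

Design your analytics for the window, not the archive, and "petabyte scale" stops being a requirement for a twenty-person company.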