It's Not About the Size of Your Data, but Since We're All Counting…

Engineering your own uses of big data doesn't mean you have to have a certain amount of data for it to be valuable, but it's cool to know how many events per second the big sites and big social networks are handling, for comparison.

Maybe you’re new to the data world. In which case: tl;dr data size doesn’t matter but we still love to compare.

If you’re still chuckling, then perhaps you’ve noticed the proliferating number of #humblebrags on Hacker News about the size of someone’s data pipe. The headline usually goes something like this:

“How I ingested a $#*% ton of data and lived to blog about it.”

Examples include Yelp’s latest article on a 1 billion messages pipeline. Gold star Yelp, thanks for sharing! This seems like a pretty fat pipe, until you remember that a message is essentially just another kind of event — albeit with a slightly larger payload — which the good trolls of H/N were quick to point out. There are plenty of IoT networks that generate that amount of traffic in a single hour! (Ask us how we know.)

Segment recently put up a job posting for an Infrastructure Engineering position, and in it, they linked to the following diagram saying “we ingest somewhere near the total number of Snapchats and Instagrams combined.”

Cool story Segment! That clocks in at about 1.2B events per day. At Treasure Data, we have customers doing that volume every hour! In fact, a few weeks ago we just passed 100 billion events per day, all ingested into the single largest multi-tenant Hadoop cluster in the world. Since then, we’ve grown even further: blowing past 1.5 million events per second last Friday. And that’s not even counting the volume handled by Fluentd — our open source pipeline technology deployed at companies like Atlassian, Microsoft, and thousands of companies running on AWS, Google Cloud Platform, and Microsoft Azure.
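Since these numbers are quoted per day, per hour, and per second, it helps to put them on one scale before comparing. Here is a minimal sketch in Python, assuming a steady rate over a 24-hour day (real traffic is bursty, so treat the results as averages). The function names are made up for illustration; only the input figures come from the paragraph above.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_day_to_per_second(events_per_day):
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


def per_hour_to_per_second(events_per_hour):
    """Average events per second for a given hourly volume."""
    return events_per_hour / SECONDS_PER_HOUR


def per_second_to_per_day(events_per_second):
    """Daily volume implied by a sustained per-second rate."""
    return events_per_second * SECONDS_PER_DAY


# Figures quoted in the article, normalized to events per second.
quoted_rates = {
    "1.2B events per day": per_day_to_per_second(1_200_000_000),
    "1.2B events per hour": per_hour_to_per_second(1_200_000_000),
    "100B events per day": per_day_to_per_second(100_000_000_000),
    "1.5M events per second": 1_500_000,
}

for label, rate in quoted_rates.items():
    print(f"{label}: about {rate:,.0f} events/sec")
```

On those assumptions, 1.2 billion events per day averages out to roughly 13,900 events per second, 100 billion per day to roughly 1.16 million per second, and a sustained 1.5 million per second would come to about 129.6 billion per day.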

I guess you could say we’re pretty serious about our data volumes too. But since the goal of all this data is to glean insights, what can we take away here?

  1. Everyone else’s data is smaller than you think: In the title, I alluded to the size of your data not mattering. Something something about “it’s the way you use it” and such. This seems to be even more true than we expect. When Gartner loudly dropped the term “Big Data” from its Hype Cycle last summer, it would seem they were, for once, ahead of a trend. You don’t need a large amount of data. You just need the right data, in the right places, at the right time.

     If Segment, the darling of the point-to-point integration world, is processing just 1.2 billion events per day across hundreds of accounts, it would seem that people are finding a lot of value in some fairly small data sets, as long as they show up consistently in the right tools without engineering work.

     (Or it could be that Segment gets expensive at volume, at $10,000 for every 1M monthly users, prompting fast-growing startups to scramble for open source alternatives.)
  2. You don’t need big data to get big results: There are applications for Big Data, and there will be many more in the always-kind-of-around-the-corner IoT future. But you don’t need big data volumes to get big insights. In fact, some of your most useful data is probably so small it’s hidden in plain sight. It’s the thousands of contacts in your marketing automation system, which, if you could only join against your Mixpanel product usage data, might actually tell you where your best users are coming from.

     Or it could be locked in your CRM silo, begging to be matched against 3rd party data to build look-a-like models and dramatically improve your PPC ROI. Or maybe it’s just pulling ERP data together with product analytics so you can see your margins broken down account by account. In all these examples, the utility lies not in the size of the data, but in the value that’s unlocked when you bring multiple systems of record together on one analytics platform.
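To make the "join your contacts against product usage" idea concrete, here is a small sketch in plain Python. Everything in it is hypothetical: the records, the field names (`email`, `source`, `sessions`), and the function name are invented for illustration and do not reflect any real Mixpanel or marketing automation export format. It assumes both systems share an email address as the join key.

```python
from collections import defaultdict

# Hypothetical export from a marketing automation system.
contacts = [
    {"email": "ana@example.com", "source": "webinar"},
    {"email": "bo@example.com", "source": "paid_search"},
    {"email": "cy@example.com", "source": "webinar"},
]

# Hypothetical export from a product analytics tool.
usage = [
    {"email": "Ana@example.com", "sessions": 42},
    {"email": "bo@example.com", "sessions": 3},
    {"email": "cy@example.com", "sessions": 17},
    {"email": "dee@example.com", "sessions": 9},  # no matching contact
]


def sessions_by_source(contacts, usage):
    """Join usage rows to contacts on email and total sessions per source."""
    # Lowercase the key so "Ana@..." and "ana@..." match.
    source_by_email = {c["email"].lower(): c["source"] for c in contacts}
    totals = defaultdict(int)
    for row in usage:
        source = source_by_email.get(row["email"].lower())
        if source is not None:  # skip users we have no contact record for
            totals[source] += row["sessions"]
    return dict(totals)


result = sessions_by_source(contacts, usage)
print(result)
```

With a few thousand rows on each side, this kind of join fits comfortably in memory on a laptop, which is rather the point: the value comes from combining the two systems, not from the volume.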

So fear not, ye of considerably smaller data! If you can bring it all together with the right tools, you too can unlock game-changing insights for your business. You don’t need big data expertise or a fancy engineering degree; you just need a tool that’s built from the ground up for business analytics, with friendly connectors into popular SaaS tools and powerful processing engines to run all of your cross-silo analytics. So stop enviously lurking on H/N and get your own data into the action today.

Published at DZone with permission of
