I’m fascinated by big data and all of its applications. The idea of building things that can process petabytes of data and find meaningful insight on relatively cheap commodity hardware really excites me.
I was talking to someone at a networking event recently about the “big data” phenomenon. The guy said to me, “That’s all that MapReduce stuff, I know MapReduce. Maybe I should become a big data architect!” He then went on to discuss how it’s “all a fad.”
Really? Is that all there is to it? Well certainly a complete stranger at a tech meetup knows!
So here’s the major problem: this individual only saw Big Data as “MapReduce.” His mistake was that he hadn’t yet realized that Hadoop is more than just MapReduce. To achieve anything meaningful with big data, you need to see Hadoop as a platform; only then can you really understand its business value.
In this article I’ll be using the HortonWorks Data Platform as a reference platform. You can download a Sandbox VM with a pre-configured version of it to learn from.
The best thing about Hadoop 2.0, other than the fact that it’s state of the art in big data, is that it is all Apache-licensed code, so you are free to use it in your own projects and products. There are a lot of cool components that can be really useful once you understand what each one does.
Big data is exactly what it says, it’s “big data”
Wikipedia states that:
“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”
Big Data in today’s context is driven by the three V’s:
Volume - This is simply a large amount of data. As defined by Wikipedia, the volume side of big data is a set of data that is too large or complex to be dealt with using traditional data processing methods.
Velocity - Velocity is the rate at which data is being introduced into your ecosystem, and it usually increases over time. According to IDC*, mankind will have 40 Zettabytes of data stored by 2020. That’s 40,000,000,000,000,000,000,000 bytes! That represents a steady increase in data velocity.
Variety - Data doesn’t have to be limited to just database tables. According to analysts such as IDC, Gartner and others, up to 80% of all data in the world is “unstructured”: things such as presentations, word processing documents, text files, PDFs, spreadsheets, and so on.
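To sanity-check the size quoted under Velocity, here is a minimal Python sketch. It assumes the decimal (SI) definition of a zettabyte, 10^21 bytes, and the function and constant names are only for illustration.

```python
# Assumption: SI (decimal) units, where 1 zettabyte = 10**21 bytes
# and 1 petabyte = 10**15 bytes.
BYTES_PER_ZETTABYTE = 10 ** 21
BYTES_PER_PETABYTE = 10 ** 15


def zettabytes_to_bytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to bytes."""
    return zettabytes * BYTES_PER_ZETTABYTE


total_bytes = zettabytes_to_bytes(40)

# Print the byte count with thousands separators.
print(f"{total_bytes:,}")

# The same amount expressed in petabytes, the unit used earlier in the article.
print(f"{total_bytes // BYTES_PER_PETABYTE:,} petabytes")
```

Under that assumption, 40 zettabytes works out to 40 million petabytes, which gives a feel for why a single machine or a traditional database is not going to cut it.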
What is a data platform?
So we’ve established that big data is big, comes from many sources, and is growing at a staggering rate. This is the problem that a data platform is trying to solve. It’s more than just a large storage pool, a NoSQL database and MapReduce. A data platform is a set of tools designed to solve the inherent problems of handling, analyzing and getting actionable intelligence from big data.
For example, here’s the HortonWorks 2.1 Platform.
It can be a complex beast, and it’s important to try to understand what each component does. The main takeaway is that it’s more than just MapReduce; it really is a platform. In the HortonWorks platform it is fair to describe YARN as a data operating system, as it is the layer you build your big data solutions on. YARN supports (very simplified):
Containers for your applications (processes in an OS)
Resource Management & scheduling (CPU, Memory etc in a traditional OS)
YARN is way more than just what I’ve listed above. The key idea is that YARN is distributed and is the foundation for a modern, scalable data platform.
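To make the “operating system” analogy a bit more concrete, here is a toy Python sketch of the two ideas listed above: applications ask for containers with a certain amount of CPU and memory, and a resource manager places them on nodes that have room. This is not YARN’s API. Every class and name below is hypothetical, and the first-fit placement rule is an assumption chosen for simplicity; real YARN schedulers are far richer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContainerRequest:
    """A hypothetical request for one container (like a process in an OS)."""
    app: str
    vcores: int
    memory_mb: int


@dataclass
class Node:
    """A hypothetical worker machine with some free CPU and memory."""
    name: str
    free_vcores: int
    free_memory_mb: int
    containers: List[ContainerRequest] = field(default_factory=list)


class ToyResourceManager:
    """Toy first-fit scheduler, for illustration only (not how YARN schedules)."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = list(nodes)

    def allocate(self, request: ContainerRequest) -> Optional[str]:
        """Place the container on the first node with enough free resources."""
        for node in self.nodes:
            fits_cpu = node.free_vcores >= request.vcores
            fits_memory = node.free_memory_mb >= request.memory_mb
            if fits_cpu and fits_memory:
                node.free_vcores -= request.vcores
                node.free_memory_mb -= request.memory_mb
                node.containers.append(request)
                return node.name
        return None  # No node can host this container right now.


cluster = ToyResourceManager([
    Node("node-1", free_vcores=4, free_memory_mb=8192),
    Node("node-2", free_vcores=8, free_memory_mb=16384),
])

placements = [
    cluster.allocate(ContainerRequest("batch-job", vcores=4, memory_mb=4096)),
    cluster.allocate(ContainerRequest("stream-job", vcores=2, memory_mb=2048)),
    cluster.allocate(ContainerRequest("huge-job", vcores=16, memory_mb=65536)),
]
print(placements)
```

The point of the sketch is that the batch job and the streaming job share one pool of machines, and neither needs to know anything about MapReduce to get resources.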
I could write an entire article on YARN in detail, but that is out of scope here. To sum it up, YARN is a re-architecture of the original Hadoop framework to make it more agile and flexible. YARN is also designed to support non-MapReduce and near real time workloads, which better supports the “data platform” idea.
You can find out more about YARN at http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
It’s about applications, mixing and matching technology to meet customer needs.
So the business case for big data is to derive value out of the data that a company possesses or acquires. Here are some examples:
Machine learning, matching and suggestion engines.
Learning algorithms can make suggestions to customers based on their previous activity.
Full text indexing a large document store.
Indexing a large company’s documents and allowing employees to search for documents and items.
Looking through unstructured data and linking it to customers, employees, suppliers.
Looking for patterns in data.
Google Flu Trends (http://www.google.org/flutrends/) used search queries to estimate flu activity ahead of official CDC reports, although its estimates for the 2012-2013 season were later found to overshoot.
Applications of this type of technology are endless and bounded only by your imagination.
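To give a feel for the full text indexing example above, here is a minimal Python sketch of an inverted index, the basic data structure behind document search. It is a toy that lives in memory on one machine. The document names and contents are made up, and on a real data platform you would use a dedicated search component rather than code like this.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, purely for illustration.
documents = {
    "memo-1": "Quarterly sales report for the retail team",
    "memo-2": "Supplier contract for retail packaging",
    "memo-3": "Sales training schedule",
}

index = build_index(documents)
print(sorted(search(index, "retail sales")))
```

The same idea, spread across many machines and many more documents, is what lets employees search an entire company’s document store.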
Identifying what you need from a data platform.
So you’ve decided it’s time to build a big data system on a data platform. Like any process, you need to identify what your requirements will be.
What sort of data do you want to put into your big data platform?
Where is the data coming from?
What do you want to do with it?
What sort of latency is acceptable?
Near real time?
What sort of skills do you need?
Big data is not for the faint of heart. You and your team must be willing to master many disciplines in order to be successful. You’ll need an understanding of code, hardware, virtualization, networking, databases (SQL & NoSQL), ETL, cloud, and more. Don’t fool yourself: you’ll need some serious skills on board.
But it’s definitely achievable for a team that’s willing to educate themselves and apply themselves to the task. And if you don’t see it as a fun challenge, or you don’t have the team in place to execute such an ambitious plan, there are lots of companies with expertise in building big data solutions, such as HortonWorks and others, that you can partner with.
In part 2 of this series I will explore a popular architecture pattern that will allow you to build out a modern data platform to meet both near real time analytics and longer term batch analytics.
Originally Posted on my blog - http://www.kevinedaly.com/blogs/hadoopismore