So, You Want to Be a Tech Visionary: Part II
Here is Part II of an executive guide to data lakes, how they work, and why you want one.
Did you miss Part I? Be sure to read it here!
Now, the Upside of Hadoop and Data Lakes
So we’ve reviewed some of the challenges with Hadoop and data lakes. They are all valid and are all characteristic of a cutting-edge, young technology—as you know, innovation is never without heightened risk. However, these two technologies have massive upside, especially given recent advances that help manage and mitigate the risks. Let’s take a look at some of the key strengths of both Hadoop and data lakes.
Hadoop Is Massively Powerful, Flexible and Scalable to Almost Any Business Problem
Hadoop was originally designed precisely because legacy tools could not handle the volumes of data being produced at the onset of the modern digital age; since then, data has grown exponentially, but so has computing power. Being both open source and parallel-compute-driven, Hadoop is poised to continue leveraging advances in both software and hardware for years to come.
Distributed Computing Is Cost Effective and Future Proof
With Hadoop and a data lake, the bulk of your hardware is low-cost, highly redundant storage and servers. There’s no need for ultra-high-end servers that eat up swaths of budget before you even pay for the services to configure them, and that may need to be replaced in five years. Better still, as compute power and storage get cheaper, you can upgrade as much as you want, when you want.
The Data Lake Leverages Hadoop to Its Full Potential
Instead of rigid, lengthy workflows that have to be defined before you even receive data, the data lake allows you to store data in its raw form by leveraging the distributed, low-cost storage of Hadoop. The flexible nature of Hadoop also allows for easier ingestion than is possible in traditional systems, meaning the data lake can store more data without requiring excessive middleware to translate or normalize it. And, by providing a unified interface for on-demand data preparation, transformation, and enrichment, self-service data preparation tools such as Zaloni Mica offer more efficiency, convenience, and insight to data scientists and other users than traditional RDBMS systems can.
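To make the idea of storing raw data and shaping it on demand more concrete, here is a minimal sketch in plain Python. It is not how Hadoop or any particular product does it; the sample records, the `read_with_schema` helper, and the field names are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no upfront schema).
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "ts": "2016-01-03"}',
    '{"user": "b", "amount": 7, "channel": "web"}',
    'not valid json',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: keep only the requested fields,
    coerce their types, and skip records that cannot be parsed."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the malformed record stays in storage; this read skips it
        if not isinstance(record, dict):
            continue  # ignore JSON values that are not objects
        row = {}
        for field, caster in schema.items():
            value = record.get(field)
            row[field] = caster(value) if value is not None else None
        yield row

# The consumer decides which fields and types it needs, at query time.
rows = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
print(rows)
```

The point of the sketch is that nothing about the stored records had to be agreed on before ingestion; a different consumer could read the same raw lines with a different schema.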
How to Evaluate If You Are Ready for a Data Lake
So, should you jump into the lake and embrace these emerging technologies? Here are a few questions to ask as you evaluate your options.
Just How Big Is Your Data?
Are you dealing with terabytes, petabytes, exabytes or zettabytes? Perhaps more importantly, how much will that data grow in the next five years, and will the solution you choose today be able to scale as your business grows? If you are already dealing with, or expect to deal with, more than a few petabytes of data, a data lake might be the right choice.
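For the five-year growth question, a rough compound-growth estimate is often enough to start the conversation. The sketch below assumes a constant annual growth rate; the 500 TB starting point and 40% rate are made-up illustrative figures, not numbers from this article, and `project_volume_tb` is a hypothetical helper.

```python
def project_volume_tb(current_tb, annual_growth_rate, years):
    """Project data volume assuming constant compound annual growth."""
    return current_tb * (1 + annual_growth_rate) ** years

# Illustrative assumption: 500 TB today, growing 40% per year.
projected = project_volume_tb(500, 0.40, 5)
print(f"Projected volume in 5 years: {projected:,.0f} TB")
```

Under these assumed inputs, the volume grows a little more than fivefold, which would move a sub-petabyte estate into multi-petabyte territory within the planning horizon.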
What Are Your Short-Term and Long-Term Goals?
Are expansion, growth and evolution of your data storage and processing capabilities a priority, or is more weight placed on “run-the-business” processes? What critical business processes may be rendered inefficient if your organization fails to keep up? And, of course, what are the current budget and forecast outlook? Data lake implementations range in size, complexity and cost, but in the long term they will almost always end up being simpler and cheaper to maintain for massive amounts of data than an RDBMS.
What Is the Most Important Service Characteristic You Provide to End-Users?
Do your users expect 24x7 uptime and availability, or is data fidelity the most important part of your services? Are your applications and data mostly transactional, or is data flow largely limited to ingestion, with interaction occurring in limited capacity? Although advances in Hadoop allow more flexibility in transactional applications, traditional RDBMS systems still have an edge over Hadoop when it comes to these implementations. Fortunately, data lakes provide the flexibility to maintain EDW and RDBMS systems where needed and leverage the cost, scalability and redundancy benefits of Hadoop elsewhere.
How Much Risk Does Your Organization Tolerate?
If you’ve answered the previous questions and still believe that a data lake is applicable to your business, the final question revolves around the risk tolerance of your organization. Data lakes can be a tricky business, even with expert guidance.
Your Next Steps
My first recommendation is to continue researching and reading about Hadoop and data lakes. Then, talk to some experts: people who have experience implementing and using data lakes. Also, look into what tools are available to make data lake implementation and management easier. Products such as Zaloni’s Bedrock data management platform exist to help ensure that your first foray into the data lake is not your last, by simplifying management, automating ingestion and processing, reducing manual interaction, and providing proven consulting and data science guidance services.