You’ve heard it time and time again: cloud is the future. Those who don’t adopt modern Big Data practices will fall behind the pack. The next wave of IT disruption is right around the corner. And yet, at the same time, budgets are shrinking, demand is growing and pressure on the IT organization to show value is at an all-time high. As an executive, you have the full force of your business behind you and more options than ever to achieve both short- and long-term goals with business data. So many options, in fact, that the landscape has become a confusing, often contradictory mess of competing solutions.
Hadoop is one of the most widely adopted next-generation big data frameworks, and it is also one of the most confusing. MapR, Cloudera, or Hortonworks? Flume, Sqoop, Kafka, or NiFi? Spark or MapReduce? Each of the many offerings in the Hadoop ecosystem has its strengths and weaknesses, but many are unrealistically sold as a silver bullet for an array of business problems. Likewise, the data lake architecture, although younger than Hadoop, holds great promise. However, this architecture can also confuse business leaders as it becomes more pervasive in the market. So, where do you start when you need concrete, proven big data solutions?
Some Basics About Hadoop and Data Lakes
You may know nothing about data lakes or Hadoop, you might have heard of them in passing, or you might already be rolling out a pilot. This post is meant to serve equally as an introduction and a reminder of some of the strengths and basic uses of Hadoop and data lakes. To level-set, let me first define Hadoop and data lakes.
Hadoop often refers to all of the many interrelated big data software products created under the umbrella of the Apache Software Foundation. Hadoop has also come to refer to bundles of these products sold by third-party vendors such as MapR, Cloudera, and Hortonworks, among many others.
Data lakes are architectures of (usually enterprise-level) data storage, management, and governance. In this architecture, raw data is ingested into the “lake,” where it resides in an unaltered state until it is needed by the organization; it can then be processed, enriched and extracted without losing fidelity or metadata surrounding the raw data.
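For readers who think in code, the ingest-then-process pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local folder stands in for the lake's storage layer, and the function names, zone names (`raw`, `curated`), and metadata fields are hypothetical rather than part of Hadoop or any vendor product.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte in a raw zone, plus a metadata sidecar.

    Hypothetical helper: a local directory stands in for the lake's storage.
    """
    digest = hashlib.sha256(payload).hexdigest()
    raw_dir = lake_root / "raw" / source
    raw_dir.mkdir(parents=True, exist_ok=True)

    # The raw file is written exactly as received, with no parsing or cleanup.
    raw_path = raw_dir / f"{digest}.bin"
    raw_path.write_bytes(payload)

    # Metadata about the raw data is kept next to it, not inside it.
    metadata = {
        "source": source,
        "sha256": digest,
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    (raw_dir / f"{digest}.meta.json").write_text(json.dumps(metadata))
    return raw_path


def derive_curated(lake_root: Path, raw_path: Path) -> Path:
    """Write a processed copy to a separate zone; the raw file is only read."""
    rows = [line.split(",") for line in raw_path.read_text().splitlines() if line]
    curated_dir = lake_root / "curated"
    curated_dir.mkdir(parents=True, exist_ok=True)
    out_path = curated_dir / f"{raw_path.stem}.json"
    out_path.write_text(json.dumps(rows))
    return out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        payload = b"id,name\n1,Ada\n"
        raw = ingest_raw(lake, "crm_export", payload)
        derive_curated(lake, raw)
        # Processing happened, yet the raw copy is unchanged.
        assert raw.read_bytes() == payload
```

The point of the sketch is the separation: processing writes to its own zone, so the original bytes and their metadata remain available for any later use.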
What Hadoop and Data Lakes Are NOT
Before jumping into what Hadoop and data lakes can do, here’s what they can’t do.
Hadoop Is Not a Drop-In Replacement for Traditional Database Systems
For everything that Hadoop is, it is not simple. It is radically different from traditional Oracle and IBM implementations and, although it is amazingly powerful, it is not one-size-fits-all. All of the nuances and subtleties would take a whole series of posts on their own, but for now understand that there is a place and function for both Hadoop and a relational database management system (RDBMS) in a cutting-edge IT organization, especially when highly transactional processes are common.
Data Lakes Are Not a Wholesale Replacement for Data Warehouse Architectures
As tempting as it may be to rip out all of your enterprise data warehouse (EDW) architecture and transition to a data lake, this is equivalent to opening all of the floodgates at once before the dam is built. Like gushing water that floods its surroundings, this approach will flood data owners, IT managers, and other end users, and not in a good way. A steady, carefully planned transition may or may not involve completely removing EDWs, even at full implementation. This is why, in a thoughtful data lake architecture such as in the above diagram, EDWs may still be present. The EDW portion may be significantly smaller in this architecture than in an EDW-only implementation, but it may never be completely eliminated.
Data Lakes and Hadoop Are Not Set-It-and-Forget-It Systems
For very different reasons, Hadoop and a data lake architecture require expert, hands-on management throughout their lifecycles.
- Hadoop is an ecosystem of Apache-managed open-source projects. As such, it is constantly changing, evolving and shifting. Staying abreast of the most current changes in each project is critical to long-term success.
- Data lakes, if left unmanaged, can quickly become messy and unmanageable, creating a lack of transparency into the processes and origins of data, and growing in size and complexity until they are no longer efficient or cost effective. This is where products such as Zaloni’s Bedrock data lake management platform can be leveraged to effectively automate, manage and govern the data lake.
Ready to find out the upsides to data lakes? Stay tuned for Part II.