An Introduction to the Agile Data Lake, Part 1
In this article, the author provides some guidance for your consideration on how to design, build, and use a successful agile data lake.
Let's be honest: the 'data lake' is one of the latest buzzwords everyone is talking about. As with many buzzwords, few really know how to explain what it is, what it is supposed to do, or how to design and build one. As pervasive as data lakes appear to be, you may be surprised to learn that Gartner predicts that only 15% of data lake projects make it into production. Forrester predicts that 33% of enterprises will take their attempted data lake projects off life support. That's scary! Data lakes are about getting value from enterprise data and, given these statistics, that nirvana appears to be quite elusive. I'd like to change that by sharing my thoughts and, hopefully, providing some guidance for your consideration on how to design, build, and use a successful data lake: an agile data lake. Why agile? Because to be successful, it needs to be.
OK, to start, let's look at the Wikipedia definition of a data lake:
"A data lake is a storage repository that holds a vast amount of raw data in its native format, incorporated as structured, semi-structured, and unstructured data."
Not bad. Yet considering that we need to get value from a data lake, this Wikipedia definition is not quite sufficient. Why? The reason is simple: you can put any data in the lake, but you also need to get data out, and that means some structure must exist. The real idea of a data lake is to have a single place to store all enterprise data, ranging from raw data (which implies an exact copy of source system data) through transformed data, which is then used for various business needs including reporting, visualization, analytics, machine learning, data science, and much more.
I like a 'revised' definition from Tamara Dull, Principal Evangelist, Amazon Web Services, who says:
"A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data, where the data structure and requirements are not defined until the data is needed."
Much better! Even Agile-like. This is a better definition because it incorporates both the prerequisite for data structures and the expectation that the stored data will be used in some fashion at some point in the future. From that, we can safely conclude that value is expected and that an Agile approach is required. The data lake therefore includes structured data from relational databases (basic rows and columns), semi-structured data (like CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and even binary data (typically images, audio, and video), thus creating a centralized data store accommodating all forms of data. The data lake then provides an information platform upon which to serve many business use cases when needed. It is not enough that data goes into the lake; data must come out too.
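To make the "structure is not defined until the data is needed" idea concrete, here is a minimal Python sketch of applying a schema at read time. Everything in it is an assumption for illustration: the in-memory RAW_STORE stands in for real lake storage, and the sample payloads, field names, and ORDER_SCHEMA are hypothetical.

```python
import csv
import io
import json

# Raw payloads kept exactly as they arrived (hypothetical sample data).
RAW_STORE = [
    {"format": "json", "payload": '{"customer_id": "42", "total": "19.90", "note": "gift"}'},
    {"format": "csv", "payload": "customer_id,total\n7,5.00\n8,12.50\n"},
]

# A schema declared only when a use case needs it: field name -> converter.
ORDER_SCHEMA = {"customer_id": int, "total": float}


def parse_raw(entry):
    """Yield dict records from one raw payload without interpreting its fields."""
    if entry["format"] == "json":
        yield json.loads(entry["payload"])
    elif entry["format"] == "csv":
        yield from csv.DictReader(io.StringIO(entry["payload"]))
    else:
        raise ValueError(f"unsupported format: {entry['format']}")


def read_with_schema(store, schema):
    """Apply the schema at read time; fields outside the schema are ignored."""
    rows = []
    for entry in store:
        for record in parse_raw(entry):
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


# The same raw store could serve a different schema for a different use case.
orders = read_with_schema(RAW_STORE, ORDER_SCHEMA)
```

The raw payloads are never rewritten; only the reader decides which fields matter and what types they carry, so a new business question needs only a new schema.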
And we want to avoid the 'data swamp,' which is essentially a deteriorated or unmanaged data lake that is inaccessible to or unusable by its intended users, providing little to no business value to the enterprise. Are we on the same page so far? Good.
Data Lakes: In the Beginning
Before we dive deeper, I'd like to share how we got here. Data lakes represent an evolution resulting from an explosion of data (volume-variety-velocity), the growth of legacy business applications plus numerous new data sources (IoT, WSL, RSS, Social Media, etc.), and the movement from on-premise to cloud (and hybrid).
Additionally, business processes have become more complex, and recently introduced technologies have enhanced business insights and data mining while opening up new ways of exploring data, like machine learning and data science. Over the last 30 years we have seen the pioneering of the data warehouse (from the likes of Bill Inmon and Ralph Kimball) for business reporting, all the way through to today's agile data lake (adapted by Dan Linstedt, yours truly, and a few other brave souls) supporting a wide variety of business use cases, as we'll see.
To me, data lakes represent the result of this dramatic data evolution and should ultimately provide a common, foundational information warehouse architecture that can be deployed on-premise, in the cloud, or in a hybrid ecosystem.
Successful data lakes are pattern-based, metadata-driven (for automation) business data repositories that account for data governance and data security requirements (à la GDPR and PII). Data in the lake should present coalesced data and aggregations of the "record of truth," ensuring information accuracy (which is quite hard to accomplish unless you know how) and timeliness. Following an Agile/Scrum methodology and applying metadata management, data profiling, master data management, and the like, I think a data lake must represent a 'Total Quality Management' information system. Still with me? Great!
What Is a Data Lake For?
Essentially a data lake is used for any data-centric, business use case, downstream of System (Enterprise) Applications, that helps to drive corporate insights and operational efficiency. Here are some common examples:
- Business information, systems integration, and real-time data processing.
- Reports, dashboards, and analytics.
- Business insights, data mining, machine learning, and data science.
- Customer, vendor, product, and service 360.
How Do You Build an Agile Data Lake?
As you can see there are many ways to benefit from a successful data lake. My question to you is, are you considering any of these? My bet is that you are. My next questions are: Do you know how to get there? Are you able to build a data lake the RIGHT way and avoid the swamp? I'll presume you are reading this to learn more. Let's continue.
There are three key principles I believe you must first understand and accept:
- A properly implemented ecosystem, data models, architecture, and methodologies.
- The incorporation of exceptional data processing, governance, and security.
- The deliberate use of job design patterns and best practices.
A successful data lake must also be agile, so that it becomes a data processing and information delivery mechanism designed to augment business decisions and enhance domain knowledge. A data lake, therefore, must have a managed life cycle. This life cycle incorporates three key phases: ingestion, adaptation, and consumption.
- Ingestion: Extracting raw source data, accumulating (typically written to flat files) in a landing zone or staging area for downstream processing and archival purposes.
- Adaptation: Loading and transformation of this data into usable formats for further processing and/or use by business users.
- Consumption:
  - Data aggregations (KPIs, data points, or metrics).
  - Analytics (actuals, predictive, and trends).
  - Machine learning, data mining, and data science.
  - Operational system feedback and outbound data feeds.
  - Visualizations and reporting.
The challenge is how to avoid the swamp. I believe you must use the right architecture, data models, and methodology. You really must shift away from your 'legacy' thinking; adapt and adopt a 'modern' approach. This is essential. Don't fall into the trap of thinking you know what a data lake is and how it works until you consider these critical points.
Okay then, let's examine these three phases a bit more. Data ingestion is about capturing data, managing it, and getting it ready for subsequent processing. I think of this as a crate of data dumped onto the sandy beach of the lake: a landing zone called a 'persistent staging area' (PSA). It is persistent because once data arrives, it stays there; for all practical purposes, once processed downstream, it becomes an effective archive (and you don't have to copy it somewhere else). This PSA will contain data, text, voice, video, or whatever else arrives, and it accumulates.
You may notice that I am not talking about technology yet. I will, but let me at least point out that, depending upon the technology used for the PSA, you might need to offload this data at some point. My thinking is that an efficient file storage solution is best suited for this first phase.
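As one way to picture a file-based persistent staging area, here is a small Python sketch of a write-once landing zone. It is an illustration under stated assumptions, not a recommended product or layout: the function name land_raw_file, the source/date directory scheme, and the content-hash file names are all hypothetical choices.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw_file(psa_root, source_name, payload, arrived_at=None):
    """Write one raw payload (bytes) into the staging area and return its path.

    Files are partitioned by source and arrival date and named by a content
    hash (an assumed convention). Existing files are never overwritten, so
    the staging area can double as an archive.
    """
    arrived_at = arrived_at or datetime.now(timezone.utc)
    digest = hashlib.sha256(payload).hexdigest()[:16]
    target_dir = Path(psa_root) / source_name / arrived_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{digest}.raw"
    try:
        # Mode "xb" is exclusive creation: it fails if the file already exists.
        with open(target, "xb") as handle:
            handle.write(payload)
    except FileExistsError:
        pass  # identical bytes already landed; keep the original untouched
    return target


# Usage: landing the same extract twice leaves a single archived file.
with tempfile.TemporaryDirectory() as demo_root:
    arrival = datetime(2024, 1, 2, tzinfo=timezone.utc)
    first = land_raw_file(demo_root, "crm", b"id,name\n1,Ada\n", arrival)
    again = land_raw_file(demo_root, "crm", b"id,name\n1,Ada\n", arrival)
    landed_once = first == again
```

Because nothing is modified after it lands, downstream jobs can be rerun against the same raw files at any time.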
Data adaptation is a comprehensive, intelligent coalescence of the data, which must adapt organically to survive and provide value. These adaptations take several forms (we'll cover them below), yet the data essentially resides first in a raw data model at the lowest level of granularity, which can then be further processed, or as I call it, business purposed, for a variety of domain use cases. The data processing requirements here can be quite involved, so I like to automate as much of this as possible. Automation requires metadata. Metadata management presumes governance. And don't forget security. We'll talk about these more shortly.
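To show what "automation requires metadata" can look like in practice, here is a minimal Python sketch in which the transformation is driven entirely by a mapping table rather than by source-specific code. The MAPPINGS structure, the source name crm_customers, and all field names are hypothetical; a real data lake would keep such metadata in a governed catalog.

```python
# Hypothetical mapping metadata: one entry per source, describing how raw
# fields map onto a target entity at the lowest level of granularity.
MAPPINGS = {
    "crm_customers": {
        "target": "customer",
        "fields": {
            "cust_no": ("customer_id", int),
            "full_nm": ("name", str.strip),
            "email_addr": ("email", str.lower),
        },
    },
}


def adapt(source_name, raw_rows, mappings=MAPPINGS):
    """Transform raw rows into target records using only the metadata."""
    mapping = mappings[source_name]
    adapted = []
    for row in raw_rows:
        # Keep lineage on every record so its origin stays traceable.
        record = {"_entity": mapping["target"], "_source": source_name}
        for raw_field, (target_field, convert) in mapping["fields"].items():
            record[target_field] = convert(row[raw_field])
        adapted.append(record)
    return adapted


raw = [{"cust_no": "101", "full_nm": "  Ada Lovelace ", "email_addr": "ADA@Example.com"}]
customers = adapt("crm_customers", raw)
```

Onboarding another source then means adding a metadata entry, not writing another job, which is what makes the pattern repeatable.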
Data consumption is not just about business users; it is about business information, the knowledge it supports, and, hopefully, the wisdom derived from it. You may be familiar with the DIKW Pyramid: Data > Information > Knowledge > Wisdom. I like to insert 'Understanding' after 'Knowledge,' as it leads to wisdom.
Data should be treated as a corporate asset and invested in as such. Data then becomes a commodity and allows us to focus on the information, knowledge, understanding, and wisdom derived from it. Therefore, it is about the data and getting value from it.
That's all for Part 1. Tune in next time when we'll cover data stores, data security, and more!
Published at DZone with permission of Dale Anderson, DZone MVB. See the original article here.