An Introduction to the Agile Data Lake, Part 2
We wrap up this two-part series by looking at how to add agility, security, and data governance to your data lakes. Read on for more!
Welcome back! If you missed Part 1, you can check it out here.
Data Store Systems: Data Stores
Okay, as we continue to formulate the basis for building a data lake, let's look at how we store data. There are many ways we do this. Here's a review:
ROW: traditional relational database management system (RDBMS), e.g. Oracle, MS SQL Server, and MySQL.
COLUMNAR: relatively unknown; feels like an RDBMS but is optimized for columns (e.g. Snowflake, Presto, Redshift, Infobright, and others).
NoSQL ("Not Only SQL"):
Non-relational, eventual consistency storage and retrieval systems (e.g. Cassandra, MongoDB, and more).
Hadoop:
Distributed data processing framework supporting high data volume, velocity, and variety (e.g. Cloudera, Hortonworks, MapR, EMR, and HDInsight).
Graph ("Triple-Store"):
Subject-Predicate-Object, index-free 'triples', based upon graph theory (e.g. AllegroGraph and Neo4j).
Everything else under the sun (e.g. ASCII/EBCDIC, CSV, XML, JSON, HTML, AVRO, Parquet).
There are many ways to store our data, and many considerations to make, so let's simplify our life a bit and call them all 'data stores,' regardless of them being source, intermediate, archive, or target data storage. Simply pick the technology for each type of data store as needed.
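To make the ROW versus COLUMNAR distinction a little more concrete, here is a minimal sketch in plain Python. The sample records and field names are made up for illustration, and real row and column stores are of course far more sophisticated than a list or a dictionary.

```python
# Row-oriented layout: all fields of one record are kept together.
rows = [
    {"id": 1, "region": "EMEA", "amount": 19.9},
    {"id": 2, "region": "AMER", "amount": 250.0},
    {"id": 3, "region": "EMEA", "amount": 75.5},
]

# Column-oriented layout: all values of one field are kept together.
columns = {
    "id": [1, 2, 3],
    "region": ["EMEA", "AMER", "EMEA"],
    "amount": [19.9, 250.0, 75.5],
}

# Fetching one complete record is natural in the row layout.
record = rows[1]

# Aggregating a single field only has to touch that field in the column layout.
total = sum(columns["amount"])

print(record)
print(total)
```

The same data sits in both layouts; what changes is which access pattern is cheap, and that is usually what drives the choice of data store.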
What is Data Governance? Clearly another industry enigma. Again, Wikipedia to the rescue:
"Data Governance is a defined process that an organization follows to ensure that high quality data exists throughout the complete lifecycle."
Does that help? Not really? I didn't think so. The real idea of data governance is to affirm data as a corporate asset, and to invest in and manage it formally throughout the enterprise, so it can be trusted for accountable and reliable decision making. To achieve these lofty goals, it is essential to appreciate Source through Target lineage. Management of this lineage is a key part of Data Governance and should be well defined and deliberately managed. Separated into three areas, lineage is defined as:
- Schematic Lineage maintains the metadata about the data structures.
- Semantic Lineage maintains the metadata about the meaning of data.
- Data Lineage maintains the metadata of where data originates and its auditability as it changes, allowing ‘current’ and ‘back-in-time’ queries.
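As a rough illustration of the data lineage idea, the sketch below keeps every version of a record together with where it came from and when it was loaded, so that both 'current' and 'back-in-time' queries can be answered. The `Version` and `History` classes, the `crm` source name, and the sample values are hypothetical; this is one simple way to model it, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    """One immutable version of a record, plus its lineage metadata."""
    key: str             # business key of the record
    value: dict          # the record's attributes at this point in time
    source: str          # where the data originated (hypothetical name)
    loaded_at: datetime  # when this version entered the data store


class History:
    """Append-only store: a change adds a version, nothing is overwritten."""

    def __init__(self) -> None:
        self._versions: List[Version] = []

    def record(self, version: Version) -> None:
        self._versions.append(version)

    def as_of(self, key: str, when: datetime) -> Optional[Version]:
        """Return the latest version of `key` loaded at or before `when`."""
        candidates = [v for v in self._versions
                      if v.key == key and v.loaded_at <= when]
        return max(candidates, key=lambda v: v.loaded_at, default=None)

    def current(self, key: str) -> Optional[Version]:
        """A 'current' query is simply an as-of query with no upper bound."""
        return self.as_of(key, datetime.max)


history = History()
history.record(Version("cust-1", {"city": "Austin"}, "crm", datetime(2018, 1, 1)))
history.record(Version("cust-1", {"city": "Denver"}, "crm", datetime(2018, 6, 1)))

print(history.current("cust-1").value)                      # latest version
print(history.as_of("cust-1", datetime(2018, 3, 1)).value)  # back in time
```

Because old versions are never changed, every answer can be traced back to a source and a load time, which is what makes the data auditable.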
It is fair to say that a proper, in-depth discussion on data governance, metadata management, data preparation, data stewardship, and data glossaries is essential, but if I did that here we'd never get to the good stuff. Perhaps another blog? Ok, but later....
Data lakes must also ensure that personal data (GDPR and PII) is secure and can be removed (disabled) or updated upon request. Securing data requires access policies, policy enforcement, encryption, and record maintenance techniques. In fact, all corporate data assets need these features, which should be a cornerstone of any data lake implementation. There are three states of data to consider here:
Talend works with several technologies offering data security features. In particular, 'Protegrity Cloud Security' provides these capabilities using Talend-specific components and integrated features well suited for building an agile data lake. Please feel free to read "BUILDING A SECURE CLOUD DATA LAKE WITH AWS, PROTEGRITY AND TALEND" for more details. We are working together with some of our largest customers using this valuable solution.
Agile Data Lake Technology Options
Processing data into and out of a data lake requires technology (hardware/software) to implement. Grappling with the many, many options can be daunting. It is so easy to take these for granted, picking anything that sounds good. Only after gaining a better understanding of the data involved, the systems chosen, and the development effort does one find that the wrong choice has been made. Isn't this the definition of a data swamp? How do we avoid this?
A successful data lake must incorporate a pliable architecture, data model, and methodology. We've been talking about that already. But picking the right 'technology' is more about the business data requirements and expected use cases. I have some good news here. You can de-couple the data lake designs from the technology stack. To illustrate this, here is a 'Marketecture' diagram depicting the many different technology options crossing through the agile data lake architecture.
As shown above, there are many popular technologies available, and you can choose different capabilities to suit each phase in the data lake life-cycle. Those of you who follow my blogs already know I have a soft spot for data vaults. Since I've detailed this approach before, let me simply point you to some interesting links:
- My blog posts on data vaults have been very popular:
- Kent Graziano, Chief Technical Evangelist @Snowflake and I wrote a joint post:
- My Talend DV Tutorial continues to evolve; watch for updates
- Currently covering the Relational Model with and without using PIT tables.
- I have completed work on a Snowflake deployment.
- I have completed work on a Big Data (Cloudera/Hive) version, as well.
You should know that Dan Linstedt created this approach and has developed considerable content you may find interesting. I recommend these:
- Brief History of the Data Vault
- A short intro to Data Vault 2.0
- Defining a Data Lake
- Data Lake Part 2: Reference Data Architectures
- Defining a Data Lake Part 3: Landing Zones
- Defining a Data Lake: Part 4 data warehouse vs data lake
- Defining a Data Lake: Part 5 - do we need a Data Lake?
- Defining a Data Lake: Part 6 - CDC & Integration
I hope you find all this content helpful. Yes, it is a lot to ingest, digest, and understand (hey, that sounds like a data lake), but take the time. If you are serious about building and using a successful data lake you need this information.
The Agile Data Lake Life Cycle
Ok, whew — a lot of information already and we are not quite done. I have mentioned that a data lake has a life-cycle. A successful Agile Data Lake Life-Cycle incorporates the three phases I've described above, data stores, data governance, data security, metadata management (lineage), and, of course, 'Business Rules.' Notice that what we want to do is de-couple 'Hard' business rules (that transform physical data in some way) from 'Soft' business rules (that adjust result sets based upon adapted queries). This separation contributes to the life-cycle being agile.
Think about it: if you push physical data transformations upstream, then when the inevitable changes occur, the impact on everything downstream is smaller. On the flip side, when the dynamics of business impose new criteria, changing a SQL 'where' clause downstream will have less impact on the data models it pulls from. The Business Vault provides this insulation from the Raw Data Vault, as it can be reconstituted when radical changes occur.
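The split between 'Hard' and 'Soft' business rules can be sketched in a few lines of Python. Everything here is a made-up illustration: the sample orders, the `apply_hard_rules` and `large_orders` functions, and the threshold values are assumptions, and the list comprehension only stands in for a SQL 'where' clause.

```python
from datetime import date

# Raw source rows as they might arrive (hypothetical sample data).
raw_rows = [
    {"order_id": "1", "amount": "19.90", "ordered": "2018-03-01", "region": "EMEA"},
    {"order_id": "2", "amount": "250.00", "ordered": "2018-03-02", "region": "AMER"},
]


def apply_hard_rules(row: dict) -> dict:
    """Hard rule: physically transform the data once, upstream, at load time."""
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"]),
        "ordered": date.fromisoformat(row["ordered"]),
        "region": row["region"],
    }


def large_orders(rows: list, minimum: float) -> list:
    """Soft rule: a query-time filter, playing the role of a 'where' clause."""
    return [r for r in rows if r["amount"] >= minimum]


# The hard rules run once and their output is what gets stored.
stored = [apply_hard_rules(r) for r in raw_rows]

# The soft rule can change as often as the business does, without reloading.
result = large_orders(stored, minimum=100.0)
print(result)
```

When the business decides that a 'large' order starts at a different amount, only the argument to `large_orders` changes; the stored data is left alone.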
Additionally, a data lake is not a data warehouse but, in fact, encapsulates one as a use case. This is a critical takeaway from this post. Taking this further, we are not creating 'data marts' anymore; we want 'information marts.' Did you review the DIKW Pyramid link I mentioned above? Data should, of course, be considered and treated as a business asset. Yet, simultaneously, data is now a commodity leading us to information, knowledge, and, hopefully, wisdom.
This diagram walks through the Agile Data Lake Life-Cycle from Source to Target data stores. Study this. Understand this. You may be glad you did. Ok, let me finish by saying that to be agile a data lake must:
- Data models should be additive, without impact to the existing model when new sources appear.
- Be Insert Only: especially for Big Data technologies, where updates and deletes are expensive.
- Provide Scalable Options: hybrid infrastructures can offer extensive capabilities.
- Allow for Automation: metadata, in many aspects, can drive the automation of data movement.
- Provide Auditable, Historical Data: a key aspect of data lineage.
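Several of the points in the list above (additive models, insert-only loading, metadata-driven automation, and auditable history) can be combined in one small sketch. The `SOURCE_METADATA` mapping, the `load` function, the in-memory `lake` dictionary, and the source names are all hypothetical stand-ins, so treat this as an outline of the idea and not as a loader for any particular tool.

```python
from datetime import datetime, timezone

# Hypothetical mapping metadata: one entry per source. Adding a new source
# means adding an entry here; existing entries and the loader stay untouched.
SOURCE_METADATA = {
    "crm": {"key_field": "customer_id", "target": "customer"},
    "web": {"key_field": "visitor_id", "target": "visitor"},
}

# Stand-in for the lake: target name -> list of appended rows.
lake = {}


def load(source: str, rows: list, loaded_at: datetime) -> int:
    """Append rows for `source`, driven entirely by its metadata entry."""
    meta = SOURCE_METADATA[source]
    target = lake.setdefault(meta["target"], [])
    for row in rows:
        # Insert only: each load appends a new row stamped with audit columns.
        target.append({
            "key": row[meta["key_field"]],
            "payload": row,
            "record_source": source,
            "loaded_at": loaded_at,
        })
    return len(rows)


now = datetime.now(timezone.utc)
load("crm", [{"customer_id": "c-1", "name": "Ada"}], now)
load("crm", [{"customer_id": "c-1", "name": "Ada L."}], now)  # a change is a new row
load("web", [{"visitor_id": "v-9", "page": "/home"}], now)

print({name: len(rows) for name, rows in lake.items()})
```

Nothing is ever updated or deleted in this sketch, so the earlier version of customer `c-1` stays available for audit alongside the newer one.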
And finally, consider that STAR Schemas are, and always were, designed to be 'Information Delivery Mechanisms,' a point some in the industry have misunderstood for many years. For many years we have all built Data Warehouses using STAR schemas to deliver reporting and business insights. These efforts all too often resulted in storing the data warehouse's raw data in rigid data structures, requiring heavy data cleansing and, frankly, causing high impact when upstream systems are changed or added.
The cost in resources and budget has contributed to many delays, failed projects, and inaccurate results. This is a legacy mentality and I believe it is time to shift our thinking to a more modern approach. The Agile Data Lake is that new way of thinking. STAR schemas do not go away, but their role has shifted downstream, where they belong and were always intended to be.
This is just the beginning, yet I hope this blog post gets you thinking about all the possibilities now.
Incorporate all of this as I've shown above and not only will you create an Agile Data Lake, but you will avoid the swamp!
Till next time...
Published at DZone with permission of Dale Anderson, DZone MVB. See the original article here.