Today’s technology and software advances allow us to process and analyze huge amounts of data. Big Data is clearly a hot topic, and organizations are investing heavily in it, but scale is not the only consideration: we also need to take into account the variety of the types of data being analyzed. Data variety means that datasets can be stored in many formats and storage systems, each of which has its own characteristics.
Taking data variety into account is a difficult task, but it provides the benefit of a 360-degree approach that enables a full view of your customers, providers, and operations. To enable this 360-degree approach, we need to implement next-generation data architectures. In doing so, the main question becomes: How do you create an agile data platform that accommodates data variety and scales with future data?
The answer for today’s forward-looking organizations increasingly relies on a data lake. A data lake is a single, common repository that manages transactional databases, operational stores, and data generated outside of the transactional enterprise systems. The data lake supports data from different sources like files, clickstreams, IoT sensor data, social network data, and SaaS application data.
A core tenet of the data lake is the storage of raw, unaltered data. This enables flexibility in the analysis and exploration of data and also allows queries and algorithms to evolve based on both historical and current data instead of a single point-in-time snapshot. A data lake also provides benefits by avoiding information silos and centralizing the data into one common repository. This repository will most likely be distributed across many physical machines but will provide end users transparent access and a unified view of the underlying distributed storage. Moreover, data is not only distributed but also replicated, so access, redundancy, and availability can be ensured.
A data lake stores all types of data, both structured and unstructured, and provides democratized access via a single unified view across the enterprise. With this approach, you can support many different data sources and data types in a single platform. A data lake strengthens an organization’s existing IT infrastructure, integrating with legacy applications, enhancing (or even replacing) an enterprise data warehouse (EDW) environment, and providing support for new applications that can take advantage of the increasing data variety and data volumes experienced today.
Being able to store data from different input types is an important feature of a data lake since this allows your data sources to continue to evolve without discarding potentially valuable metadata or raw attributes. A breadth of different analytical techniques can also be used to execute over the same input data, avoiding limitations that arise from processing data only after it has been aggregated or transformed. The creation of this unified repository that can be queried with different algorithms, including SQL alternatives outside the scope of traditional EDW environments, is the hallmark of a data lake and a fundamental piece of any big data strategy.
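The idea of running different analyses over the same raw input can be illustrated with a minimal Python sketch. The event records, field names, and function names below are hypothetical and chosen only for illustration; the point is that the stored records are parsed at read time and never modified, so each analysis can use whichever raw attributes it needs.

```python
import json
from collections import Counter

# Hypothetical raw clickstream events, kept exactly as they arrived (JSON lines).
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120, "device": "mobile"}',
    '{"user": "u2", "page": "/pricing", "ms": 340, "device": "desktop"}',
    '{"user": "u1", "page": "/pricing", "ms": 200, "device": "mobile"}',
]


def read_raw(lines):
    """Parse raw records at read time; the stored lines are left untouched."""
    return [json.loads(line) for line in lines]


def views_per_page(lines):
    """First analysis: count page views, using only the 'page' attribute."""
    return Counter(event["page"] for event in read_raw(lines))


def mean_latency_by_device(lines):
    """Second analysis: average 'ms' per device, using attributes that a
    pre-aggregated page-view table would have discarded."""
    totals = {}
    for event in read_raw(lines):
        total, count = totals.get(event["device"], (0, 0))
        totals[event["device"]] = (total + event["ms"], count + 1)
    return {device: total / count for device, (total, count) in totals.items()}


if __name__ == "__main__":
    print(views_per_page(RAW_EVENTS))
    print(mean_latency_by_device(RAW_EVENTS))
```

Had only the page-view counts been stored, the second question could not be answered later, which is the limitation of working solely from aggregated data.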
To realize the maximum value of a data lake, it must provide:
The ability to ensure data quality and reliability — that is, ensure the data lake appropriately reflects your business.
Easy access, making it faster for users to identify which data they want to use.
To govern the data lake, it’s critical to have processes in place to cleanse, secure, and operationalize the data. These concepts of data governance and data management are explored later in this report.
Building a data lake is not a simple process, and it is necessary to decide which data to ingest, and how to organize and catalog it. Although it is not an automatic process, there are tools and products to simplify the creation and management of a modern data lake architecture at enterprise scale. These tools allow ingestion of different types of data — including streaming, structured, and unstructured.
They also allow you to apply and catalog metadata, providing a better understanding of the data you have already ingested or plan to ingest. All of this allows you to create the foundation for an agile data lake platform.
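As a rough sketch of what cataloging metadata at ingestion time can look like, the Python example below stores a raw payload unchanged and records a catalog entry alongside it. The DataLake and CatalogEntry classes, their fields, and the in-memory dictionaries are assumptions made for illustration; they do not represent any particular product or the tools the report refers to.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Hypothetical metadata recorded for each ingested dataset."""
    name: str
    source: str
    data_format: str
    size_bytes: int
    sha256: str
    ingested_at: str
    tags: list = field(default_factory=list)


class DataLake:
    """Toy in-memory stand-in for a data lake with a metadata catalog."""

    def __init__(self):
        self.objects = {}   # raw payloads, stored unaltered
        self.catalog = {}   # metadata describing each payload

    def ingest(self, name, payload, source, data_format, tags=()):
        # Keep the raw bytes exactly as received.
        self.objects[name] = payload
        # Record descriptive metadata so the data can be found later.
        entry = CatalogEntry(
            name=name,
            source=source,
            data_format=data_format,
            size_bytes=len(payload),
            sha256=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc).isoformat(),
            tags=list(tags),
        )
        self.catalog[name] = entry
        return entry

    def find(self, tag):
        """Return the names of all datasets carrying the given tag."""
        return [e.name for e in self.catalog.values() if tag in e.tags]


if __name__ == "__main__":
    lake = DataLake()
    lake.ingest("clicks.json", b'{"a":1}', "web", "json", tags=["clickstream"])
    print(lake.find("clickstream"))
```

Keeping the catalog separate from the payloads means users can search by source, format, or tag without opening the underlying data.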
This is an excerpt from Understanding Metadata: Create the Foundation for a Scalable Data Architecture by Federico Castanedo and Scott Gidley.