AWS Lake Formation for Data Lakes
A high-level look at how data lakes can help organizations and how AWS Lake Formation simplifies creating one.
With the growing impetus on digitization, enterprise applications deployed across massive data centers are generating high volumes of data, and that volume grows manifold every year. This raises demands on existing data platforms to scale even further and remain durable. More people are accessing this data than ever before, and new and varied use cases for data processing keep emerging.
It is to meet this demand that the concept of data lakes was conceived. A data lake is a centralized store which keeps all your data, structured or unstructured, and at the scale you need.
Data lakes can ingest data from anywhere, like online transaction processing systems, machine data, system or application logs, or other cloud services. They can store both relational and non-relational data. They integrate with many analytics and machine learning tools. Essentially, they operate on the data without the need for moving data to other external places for processing.
Why AWS Lake Formation
Typically, creating a data lake involves several steps and is time-consuming. Even on a popular cloud platform like AWS, you still need to piece together multiple services. For example, the steps needed on AWS to create a data lake without Lake Formation include the following:
Identify the existing data stores, like an RDBMS or cloud DB service.
Create S3 storage.
Create S3 bucket policies.
Create an ETL job which will transform data from the DB and put it into the S3 buckets.
Allow analytic services to access this data in S3.
Manage ETL jobs, set up audit policies and user permissions, and create a data cleansing strategy.
The above process is effort-intensive and prone to human error.

AWS Lake Formation was created to make building data lakes smooth, convenient, and quick. Its dashboard walks you through the various lifecycle stages, and it takes only three or four easy-to-configure steps to create a data lake.
One of the elegant ways of loading data into the data lake is by using blueprints. You configure a blueprint by specifying the source of the data, a destination within the data lake, and the frequency of load. Data can be loaded in bulk or incrementally. The blueprint will find the source tables, convert the data into the required format, and partition the data based on partitioning schema. All of this is customizable.
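The three choices a blueprint captures — source, destination, and load frequency — can be pictured as a small configuration structure. The sketch below models them as a plain Python dict; the field names are illustrative, not the Lake Formation API (blueprints are configured through the console):

```python
# Illustrative sketch of the inputs a Lake Formation blueprint captures.
# Field names here are hypothetical; blueprints are configured in the console.

def build_blueprint_config(source_path, target_db, schedule):
    """Bundle the main blueprint choices: source, destination, frequency."""
    return {
        "blueprint_type": "INCREMENTAL",           # or "BULK_LOAD"
        "source": {"jdbc_path": source_path},      # e.g. connection/schema/table
        "target": {
            "database": target_db,
            "format": "PARQUET",                   # conversion target format
            "partition_keys": ["load_date"],       # partitioning schema
        },
        "schedule": schedule,                      # cron-style load frequency
    }

config = build_blueprint_config("mysql_conn/sales/orders", "lake_db",
                                "cron(0 */4 * * ? *)")
```

Everything here is customizable per source table, mirroring the choices the console presents.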
Lake Formation can read data from a variety of sources: MySQL, Postgres, SQL Server, or Oracle databases running on AWS or on on-premises servers. You can also import data from other S3 buckets, or from logs generated by CloudTrail, CloudFront, etc.
Lake Formation uses ML Transforms to create machine-learning transforms that can be used in an ETL job. One such transform, FindMatches, helps deduplicate the data.
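FindMatches is created through the AWS Glue `CreateMLTransform` API. The sketch below builds the request parameters for a deduplication transform; the role ARN, database, and table names are placeholders:

```python
# Build kwargs for glue.create_ml_transform() with a FindMatches transform,
# the transform Lake Formation uses for deduplication.
# Role ARN, database, table, and key column are placeholders.

def find_matches_params(database, table, primary_key, role_arn):
    """Request parameters for a FindMatches ML transform on one table."""
    return {
        "Name": f"dedupe-{table}",
        "Role": role_arn,
        "InputRecordTables": [
            {"DatabaseName": database, "TableName": table}
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": primary_key,
                # Bias the model toward precision (fewer false merges).
                "PrecisionRecallTradeoff": 0.9,
            },
        },
    }

params = find_matches_params("lake_db", "customers", "customer_id",
                             "arn:aws:iam::123456789012:role/GlueRole")
# boto3.client("glue").create_ml_transform(**params)  # needs AWS credentials
```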
Lake Formation provides a centralized place to create fine-grained data access policies. It allows you to encrypt data using S3 encryption mechanisms. You may choose server-side or client-side encryption. For encryption, you may use keys provided by AWS KMS or any HSM system. All data access can be audited in this centralized place. It is possible to download the audit logs for further analysis.
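As a sketch of what a fine-grained policy looks like, the snippet below builds the request payload for the Lake Formation `GrantPermissions` API, restricting a hypothetical analyst role to two columns of one table; the ARN and all names are placeholders:

```python
# Build kwargs for lakeformation.grant_permissions(): column-level SELECT
# for a single principal. The ARN, database, table, and columns are placeholders.

def column_level_grant(principal_arn, database, table, columns):
    """Grant SELECT on only the listed columns of one catalog table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,   # only these columns are readable
            }
        },
        "Permissions": ["SELECT"],
    }

grant = column_level_grant("arn:aws:iam::123456789012:role/Analyst",
                           "lake_db", "orders", ["order_id", "order_date"])
# boto3.client("lakeformation").grant_permissions(**grant)  # needs AWS credentials
```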
Lake Formation allows text-based search and also supports faceted search. The data stored in S3 can then be queried using AWS Athena.
Lake Formation uses AWS Glue crawlers to extract technical metadata and creates a catalog out of it. You may then label this information for your custom use, such as marking sensitive information.
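Under the hood, the catalog is populated by Glue crawlers pointed at S3 locations. A minimal sketch of the request parameters for the Glue `CreateCrawler` API follows; the role ARN and bucket path are placeholders:

```python
# Build kwargs for glue.create_crawler() pointing at one S3 location.
# The role ARN and the S3 path are placeholders.

def crawler_params(name, role_arn, database, s3_path):
    """Request parameters for a crawler that catalogs one S3 prefix."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,   # catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

params = crawler_params("orders-crawler",
                        "arn:aws:iam::123456789012:role/GlueRole",
                        "lake_db", "s3://my-lake/orders/")
# boto3.client("glue").create_crawler(**params)  # needs AWS credentials
```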
Lake Formation itself does not cost extra. However, you pay for the underlying AWS services used in the process, such as S3, AWS Glue (for ETL), IAM, Athena, and KMS.
Create a Data Lake
Creating a data lake with Lake Formation involves the following steps:
1. Configure a Blueprint. You specify a blueprint type (Bulk Load or Incremental), create a database connection, and set up an IAM role with access to the data. Trigger the blueprint, and the imported data appears as a table in the data lake. All of this can be done from the AWS console.
2. Optionally, set up permissions for an IAM user, group, or role with which you want to share the data.
3. Finally, go to AWS Athena and run a query to test the setup of the data lake.
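The test query in the last step can also be issued programmatically. The sketch below builds the parameters for Athena's `StartQueryExecution` API; the table, database, and results bucket are placeholders:

```python
# Build kwargs for athena.start_query_execution().
# Database, table, and the S3 results location are placeholders.

def athena_query_params(sql, database, output_s3):
    """Request parameters for one Athena query over the data lake."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params("SELECT COUNT(*) FROM orders",
                             "lake_db", "s3://my-lake/athena-results/")
# boto3.client("athena").start_query_execution(**params)  # needs AWS credentials
```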
Some Use Cases for a Data Lake
Enterprises need the ability to make faster decisions and respond quickly to market dynamics, but achieving this with the huge amount of data available from multiple sources has proved challenging. Data lakes offer a way to analyze large volumes of data from various sources; their benefits derive from the analytics that such a collection of data makes possible.
1. Data lakes can help create a better understanding of customer behavior by churning data available from CRM systems combined with social media and order management systems.
2. The oil and gas industry churns out massive amounts of data, and it is vital for this industry to analyze that data for historical modeling. Data lakes also enable analytics that predict the downtime of drilling systems and thus save costs.
3. Data lakes are an excellent choice for smart city initiatives. They can ingest data from a large number of IoT sensors installed across the city.
Opinions expressed by DZone contributors are their own.