Data Lakes in the Cloud (Part 2)
When it comes to taking your on-premises data lake to the cloud, you have to consider the process, the technology, and the business goals. Read on to find out what to keep in mind.
In the first blog of this series, we discussed some of the key drivers for a Cloud Data Lake such as:
- The cost advantages of the elastic utility model of the cloud, especially for highly variable workloads typical of Data Lake operations and analytics processing.
- Lower administrative and operational cost achieved by delegating the heavy lifting of configuration and platform maintenance to the cloud providers.
- Access to a range of compute and storage options beyond Hadoop, as well as advanced cloud services.
- Geographical coverage and data availability guarantees.
But how do enterprises that already have an on-premises data lake migrate to the cloud to realize those benefits? Every cloud migration project has to begin with a clear statement of business and technical objectives. The top business objective tends to be cost reduction with no loss of service levels and the same or a better user experience. While the current cost of the on-premises data platform may be known, the future cost of a Data Lake in the cloud can be quantified only in the context of architectural decisions, and those are made after sorting through a bewildering array of options across cloud providers.
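To make the cost question concrete, here is a minimal sketch of how a fixed cluster sized for peak load might be compared with an elastic one billed for the node-hours actually used. The demand profile, the two prices, and the function names are all assumptions chosen for illustration; they are not figures from any provider, and a real estimate would also have to cover storage, networking, and staff costs.

```python
# Hypothetical hourly demand, in compute nodes, for one day of a bursty workload.
hourly_nodes_needed = [4, 4, 4, 4, 4, 4, 8, 8, 20, 40, 40, 20,
                       8, 8, 8, 8, 20, 40, 20, 8, 4, 4, 4, 4]

# Assumed prices per node-hour, for illustration only.
ON_PREM_COST_PER_NODE_HOUR = 0.50
CLOUD_COST_PER_NODE_HOUR = 0.80


def fixed_capacity_cost(demand, price):
    """Cost of a cluster sized for peak demand and kept running every hour."""
    return max(demand) * len(demand) * price


def elastic_cost(demand, price):
    """Cost when capacity follows demand and only used node-hours are billed."""
    return sum(demand) * price


fixed = fixed_capacity_cost(hourly_nodes_needed, ON_PREM_COST_PER_NODE_HOUR)
elastic = elastic_cost(hourly_nodes_needed, CLOUD_COST_PER_NODE_HOUR)

# Share of the peak-sized cluster that the workload actually uses.
utilization = sum(hourly_nodes_needed) / (max(hourly_nodes_needed) * len(hourly_nodes_needed))

print(f"fixed: {fixed:.2f}  elastic: {elastic:.2f}  utilization: {utilization:.0%}")
```

With these assumed numbers the elastic model comes out cheaper even at a higher unit price, because the peak-sized cluster sits mostly idle. A steadier workload would narrow or reverse the gap, which is why the comparison depends on the architecture chosen.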
Business objectives further clarify the scope and time frame for the migration, the compliance requirements with respect to data security, physical location, and longevity of data, and the business continuity needs during and after migration. Scope determines which on-premises data sets, from which enterprise functions or departments, from which on-premises Hadoop clusters, and from which data centers, will migrate to the Cloud Data Lake. Compliance requirements are particularly important to highly regulated industries like healthcare and banking. They go hand in hand with security needs founded on a well-researched threat model. Business continuity needs identify the critical applications that cannot tolerate any downtime during migration.
Technical objectives define the use cases that a variety of users will apply to the data. These users may belong to different business functions and come with varying skill sets and data access and processing needs. The enterprise's on-premises data governance practice may have been spread across the infrastructure, application, and security teams, but that governance model will change in the cloud based on the choice of Identity and Access Management services and tools. Similarly, the data security objectives will determine how data will be secured in the cloud while in motion and at rest. Technology choices in the cloud may also be guided by requirements for feature parity and data processing performance relative to the on-premises Data Lake.
A Data Lake management application layer greatly facilitates the realization of the business and technical objectives. It does this by abstracting from the user the underlying data platform technologies, whether on-premises or in the cloud, and by providing a common metadata view.
Cloud Technology Choices and Migration Design
While there are a multitude of technology choices and cloud providers, we see three broad models of Data Lake migration to the cloud:
- Forklift migration of an on-premises Hadoop cluster to the cloud.
- Migration to use Hadoop-based cloud services and cloud-native storage.
- Migration to a hybrid on-premises/cloud model, using a variety of cloud-native services, and establishing a seamless data fabric view with metadata.
These models also reflect increasing levels of maturity in Cloud Data Lake adoption. There are, of course, variations of these models that make more or less use of cloud elasticity with the help of a management framework.
Forklift migration refers to moving an on-premises Hadoop cluster to one built from the ground up on basic compute instances in the cloud. This is the simplest migration model, and it leverages existing staff skill sets. It uses only the IaaS aspect of the cloud, with persistent compute instances and, typically, instance-local storage. Except for infrastructure access, security is entirely the cloud customer's responsibility, as is the creation, configuration, monitoring, and maintenance of the cluster.
The second model of migration is moving from Hadoop on-premises to Hadoop as a service from the cloud provider. Much of the heavy lifting around Hadoop cluster setup and configuration, and around ensuring compatibility of Hadoop ecosystem components, is left to the cloud provider. A Data Lake management application may aid in the creation and use of transient Hadoop clusters on demand and interface directly with cloud-native persistent storage.
The third model of Data Lake migration involves a gradual transition from Hadoop on-premises to hybrid on-premises/cloud architectures. These use a variety of cloud-native storage options and services in addition to the Hadoop ecosystem tools, and adopt cloud service patterns for processing event streams, real-time analytics, and machine learning. This model presupposes a metadata management layer to remove any mismatch between the underlying technologies and to provide a seamless data fabric view across all the data, regardless of storage location.
Between the three migration models above, the major Hadoop distributions (Cloudera, Hortonworks, MapR), the ever-expanding variations of Hadoop ecosystem tools they support, and the big three cloud service providers (AWS, Azure, GCP), each with unique service offerings and pricing, the options for migration are too numerous to list here. Meaningful comparisons have to be made in the context of specific business and technical requirements.
A good migration design requires deep expertise in Data Lake and cloud technologies, and data pipeline design patterns, either developed internally or bought from a service provider.
Migration Planning and Execution
Data Lake migration planning typically starts with a proof of concept pilot to validate the technical choices, feature parity, and performance in the cloud. This is followed by a phased approach consistent with the chosen migration model that takes into account:
- Infrastructure migration decisions — storage and compute, sizing, scaling, networking.
- Security of data, and governance of data access and resource usage in the cloud.
- Retooling data ingestion so that data currently received by the on-premises platform from different sources is sent to the cloud Data Lake.
- A detailed inventory of the on-premises Data Lake, and its mapping to the cloud platform.
- Data transformation pipelines and corresponding translation to cloud mechanisms.
- Application migration — forklift vs rewrite, processes for development, test, and production.
- Data extraction tools and processes in the cloud for visualization, insights, or predictions.
- Migration options for historical data.
- Versions of cloud tools and application compatibility.
- Data Lake management applications.
An execution plan that defines the transition process from on-premises to the Cloud Data Lake, along with testing, performance monitoring, and business continuity during and after the cutover, is critical to a successful migration.
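One way to picture the inventory, mapping, and phasing steps above is a small sketch that records each on-premises data set with its cloud target and then groups the data sets into migration waves. Everything here is hypothetical: the `Dataset` fields, the data set and cluster names, the `object-store://` paths, and the ordering rule (historical data first, downtime-intolerant data last) are assumptions for illustration, not a prescribed method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    source_cluster: str       # on-premises Hadoop cluster holding the data
    cloud_target: str         # hypothetical destination in cloud storage
    tolerates_downtime: bool  # False for business-critical applications
    is_historical: bool       # True for data that is no longer updated


def assign_wave(ds: Dataset) -> int:
    """Assumed rule: lowest-risk data moves first, critical data last."""
    if not ds.tolerates_downtime:
        return 3  # needs a planned cutover with business continuity in place
    return 1 if ds.is_historical else 2


def plan_waves(inventory):
    """Group data set names by migration wave, ordered by wave number."""
    waves = defaultdict(list)
    for ds in inventory:
        waves[assign_wave(ds)].append(ds.name)
    return dict(sorted(waves.items()))


# Hypothetical inventory entries.
inventory = [
    Dataset("orders_live", "cluster-a", "object-store://lake/orders", False, False),
    Dataset("clickstream_2015", "cluster-b", "object-store://lake/archive/clicks", True, True),
    Dataset("marketing_reports", "cluster-a", "object-store://lake/marketing", True, False),
]

plan = plan_waves(inventory)
for wave, names in plan.items():
    print(f"wave {wave}: {', '.join(names)}")
```

Keeping the inventory as structured data like this makes the scope explicit and lets the wave rule be changed in one place when business continuity requirements shift.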
The benefits of migrating a Data Lake from on-premises to the cloud are achieved only through a careful specification of business and technical objectives, a validated set of migration design choices, and planned, phased execution. A metadata management application layer is invaluable during the transition as well as for future-proofing the Data Lake solution in the cloud.
Published at DZone with permission of Kannan Rajagopalan, DZone MVB.
Opinions expressed by DZone contributors are their own.