Over a million developers have joined DZone.

Best Practices for Setting Up a Data Warehouse in the Cloud

DZone's Guide to

Best Practices for Setting Up a Data Warehouse in the Cloud

As data continues to get bigger, data warehousing is becoming more and more necessary. Read on to learn the benefits of keeping your warehouse in the cloud.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Data and analytics have become an important part of today's business. Modern businesses generate large amounts of data from their customers, suppliers, partners, and internal systems. If carefully analyzed and interpreted, this data could provide immense insight into business growth and sustainability. The data collected from these different data sources comes in unstructured, semi-structured, and structured forms. It’s useful to transform the data into a structured format and send it to a central location. This is where businesses rely on data warehousing solutions to work as a central repository, where the data is collected for analytics as well as different tools used for data transformation.

Running a data warehouse is becoming increasingly expensive and difficult to manage and scale. While costs amount to millions of dollars upfront for both software and hardware, it also takes months to plan, architect, and implement the data warehousing solution, which is not always viable for small and medium-sized businesses. There are different types of data warehouse solutions out there which need to be selected depending on the individual business requirements.

Data Warehousing in the Cloud

Many businesses are therefore moving their data warehouses to the cloud to improve performance and decrease costs. With the cloud, scalability and elasticity are built-in, and it’s possible to expand for rapid data growth, both in terms of processing capacity and storage.

Since the cloud provides various tools, managed services, and technologies to reduce the complexity and management overhead of the data warehousing solutions, it is possible for the business to focus on using the data to generate results rather working on the technology and managing the data warehouse itself. Let's look at several best practices in using the cloud for data warehousing and the advantages it provides.

Best Practices in the Cloud

Use the Right Migration Strategy

When moving to the cloud for a data warehousing solution, it is required to migrate data from existing solutions to the cloud. Depending on the migration strategy, it is possible to also move part of the data pipeline to the cloud, in addition to moving structured data from the existing data warehouse. This could even include moving unstructured or semi-structured data to the cloud to store and transform the data, as required by the data warehousing solution.

For example, instead of maintaining a file server locally, it is possible to directly ingest unstructured data such as CSV or Excel files to an object storage service such as Amazon S3. The data could then be transformed and processed using a data pipeline within the cloud for increased performance, reduced costs, and reduced management overhead.

In addition, depending on the data volumes and operational requirements, it is required to set up the migration strategy to move data to the cloud data warehousing solution. For example, if the data volume is low (which doesn’t require a continuous operation), it is possible to set up a one-step migration to extract the data and import it into the cloud data warehousing solution.

However, when dealing with large data volumes with continuous operational requirements, this approach becomes impractical due to the massive amount of data to be moved as well as the new data being added within the migration time frame. Therefore, this requires a two-step process. for the first step, it is important to first extract the data in non-peak usage to minimize the impact to the existing data warehouse and migrate to the cloud using a medium that matches with the extracted data volume. For example, if the data volume is large, and expands to a couple of terabytes or petabytes, it is viable to physically store the data in a storage device and shift it to the cloud, rather than sending it across the wire. There are services such as Amazon Snowball used to simplify the process by leasing the required storage devices. As the second step, it’s required to automate the synchronization of the changed dataset to update the cloud data warehouse to the latest version of the generated data.

Use ELT Over ETL for Batch Processing

When setting up the data warehousing solution in the cloud, the data pipeline can also be optimized. Traditionally, it is common to follow a process of Extract Transform and Load (ETL) to move unstructured and semi-structured data to the data warehouse. This is mostly an ongoing process where the extracted data is cleansed, enriched, transformed, and loaded into a data warehouse.

However, with data warehousing solutions in the cloud such as Amazon Redshift (which is powerful enough to handle transformations), it is possible to move the extracted data directly to the data warehouse and is then highly efficient to perform transformation within the data warehouse. This changes the order in the data pipeline to Extract Load and Transform (ELT) from ETL.

Use Managed Services for Warehousing Workflows

It’s possible to use the cloud as an Infrastructure-as-a-Service (IaaS), install data warehousing software and implement the ETL processes from scratch using cloud servers. But this provides partial value in moving to the cloud. Most of the advantages come with a fully-managed (serverless) or semi-managed data warehousing and ETL/ELT solutions provided by the cloud to lower the total cost of ownership.

If we take AWS as an example, Amazon Redshift provides a petabyte-scale data warehousing solution, which allows for storing data and querying it for analytics.

In addition, there are several managed services for the data pipeline that need to be selected based on the particular use case. For example, the AWS Migration Service can be used to move data from on-premise to the cloud with schema conversions. If the transformation workload is heavy, AWS Data Pipeline is there to process the data in parallel. It can either be plugged in as a node in the pipeline, or perform the ingestion directly using predefined templates. When moving data within Amazon, AWS Glue can be used. Glue is a serverless ETL service used to move data across various services as well as to Redshift.

It is also important to note that when comparing these services, each of comes with a different level of flexibility. Although fully-managed services have the least overhead with regards to setup and overall operations, these could be least flexible in handling different types of data sources and workloads, compared to semi-managed and manual configurations. It’s important to identify individual use cases and design the workflow accordingly to leverage these services.

Use Massively Parallel Processing Architectures

When analyzing data in the cloud, it is possible to utilize massively parallel processing (MPP) nodes available in a cluster to analyze the data in parallel. This not only increases the performance of large data warehouses but also allows you to scale the data warehouse both in size and its processing capability by adding more nodes to the cluster.

Data warehousing solutions such as Amazon Redshift, Teradata Aster, Druid, and Vertica utilize MPP architecture. In addition, open source frameworks such as Hadoop and Spark also support these architectures to perform analytics in parallel.

Separate Real-Time From Batch

When analyzing business data, it is important to derive both real-time and batch analytics, which are used for different levels of decision making and also involve a different level of accuracy.

Understanding the timing requirements of analytics is also required to design the processing workflow accordingly using the correct tools and technologies.

For instance, it would be impractical to run real-time queries on large amounts of data in a data warehouse solution such as Amazon Redshift. Instead, it would be more practical and cost-effective to use streaming analytics solutions which run on top of the data stream (such as Amazon Kinesis Streams and Amazon Kinetic Analytics) before ingesting to the data warehouse. It is also possible to unite real-time and batch analytics by performing the real-time analytics while ingesting the data to the data warehousing solution using serverless services such as Amazon Glue and Lambda.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

cloud ,big data ,data warehousing ,mpp

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}