Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

How to Choose the Right Data Warehouse

DZone's Guide to

How to Choose the Right Data Warehouse

A data warehouse is a foundational technology in analytics. Knowing your needs will help you choose the right solution so you can find success with your analytic efforts.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

To analyze big datasets, you need to store that data someplace that has the scale, performance, and power needed to get the job done: a data warehouse. This article covers key features and benefits of five widely used data warehouse solutions to help you choose the right one.

Making a Choice

When it comes to building your data strategy and architecture, it’s important to understand which data warehouses should be candidates for consideration. Typically, teams will be asking themselves questions like:

  • How do I install and configure a data warehouse?
  • Which data warehouse solution will help me to get the fastest query times?
  • How much data and what formats does it support?
  • What level of security is offered?
  • Which of my analytics tools are supported?
  • How much database administration effort is needed?
  • What are the costs?
  • Are they HIPPA compliant?

The good news is that both Amazon and Google offer HIPPA-compliant solutions. This reflects significant advances over the past few years, as both have taken compliance seriously. Please note that we do not delve into the specifics of compliance for these services here, so we advise mapping your requirements against published standards for both services.

1. Amazon Redshift

Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against petabytes of structured data.

Key Features

  • Amazon Redshift delivers fast query performance by using columnar storage technology.
  • Easily resize your cluster up and down as your performance and capacity needs change with just a few clicks in the console or a simple API call.
  • Store highly structured, frequently accessed data on Redshift, unstructured data in an Amazon S3 “data lake,” and query seamlessly across both with Amazon Redshift Spectrum.
  • You only pay for what you use. You can have an unlimited number of users doing unlimited analytics on all your data.

Things to Consider

  • Data formats: The service works with CSV, DELIMITER, FIXEDWIDTH, Avro, and JSON.
  • Speed and performance: Redshift stores data in a columnar format and leverages parallel processing. Most results come back in seconds. The performance can be always optimized depending on your needs.
  • Supported functions: Amazon Redshift is based on PostgreSQL 8.0.2; however, it has some very important differences. Redshift supports a number of functions that are extensions to the SQL standard, as well as standard aggregate functions, scalar functions, and window functions.
  • Integration with other BI tools: Amazon Redshift natively integrates with a number of BI tools such as Tableau, Looker, Birst, Chartio, and more. You can also configure JDBC and ODBC connections to connect to your cluster from SQL client tools.
  • Security: With Amazon Redshift, you can encrypt data using SSL connections and protect access using virtual networking environment.
  • Cost: Amazon Redshift charges per-hour per-node, which includes both compute and storage. This pricing model is predictable as users are able to run as many queries as necessary without being penalized by a high cost.

When Should You Use Amazon Redshift?

Redshift is recommended if you currently in the Amazon ecosystem and have large sets of structured data. It is a good choice if you query your data frequently or need to perform complex data aggregations. In addition, you can always use Redshift Spectrum by distributing your data between Redshift and S3 bucket depending on your needs to reduce your storage costs. Redshift also has a ton of partners and integrates with a large number of BI tools. This may be a perfect solution to experiment with new analytics tools.

If you have never used Redshift, you might be eligible for their two months offer of free data:

  • dc1.large server with 2 CPU, 15 GB Ram, 160 GB SSD storage
  • 750 hours each month, for two months

2. Amazon Redshift Spectrum

Amazon Redshift Spectrum gives you the freedom to store your data in a multitude of formats and query within your favorite BI tools without needing to load and transform your data stored in Amazon S3.

Key Features

  • Queries within your favorite BI tools without needing to load and transform your data stored in Amazon S3.
  • Scales processing across thousands of nodes with separated cluster storage and computing.
  • Fast results with Amazon Redshift query optimizer that minimizes data scanned in AWS S3 to improve query speed.
  • A pay-per-query price that allows you to pay only for queries you run.

Things to Consider

  • Data formats: Amazon Redshift Spectrum support CSV, TSV, Parquet, Sequence, and RCFile file formats. Amazon recommends using a columnar format because it will allow you to choose only the columns you need to transfer data from S3.
  • Speed and performance: In order to effectively work with complex queries running on large amounts of data, you need to optimize your data for parallel processing. For best performance, Amazon suggests breaking large files into many smaller chunks (from 100 MB to 1 GB) and storing them in the same folder. To learn more read about 10 Best Practices for Amazon Redshift Spectrum.
  • Data compression: To reduce storage space, improve performance, and minimize costs you can compress your data files. Spectrum supports GZIP, SNAPPY, and BZ2 compression files.
  • External tables: To query data on Amazon S3, Spectrum uses external tables, so you’ll need to define those. Syntax to query external tables is the same SELECT syntax that is used to query other Amazon Redshift tables. One thing to mention is that you can join the created external table with other non-external tables residing on Redshift using the JOIN command.
  • Integration with other BI tools: Does Spectrum seamlessly integrates with your existing SQL and BI tools the same as Redshift? You will want to verify this as your current tool(s) may not offer native support.
  • Cost: Users pay for the amount of data scanned by queries they run. Spectrum charges $5 for each TB of data scanned. The usual charges to run your Redshift cluster and to store your data in S3 are applied but if you are not running queries, Spectrum will not charge you.

When Should You Use Amazon Redshift Spectrum?

All Redshift customers should consider Spectrum. Having Spectrum up and running can save you money since you can lifecycle data out of Redshift to S3 and be more efficient with your local storage on Redshift. You can learn more about Redshift Spectrum performance here.

3. Amazon Athena

Amazon Athena is a serverless interactive query service. Check out our intro article to Athena to learn more.

With Amazon Athena, users don’t need to concern themselves with setting up any servers, frameworks, clusters, or other tools. Athena is considered serverless. If you have data loaded to Amazon S3, Athena can leverage it. This can reduce costs for use cases that do not require a traditional data warehouse.

Key Features

  • Automated database and table creation are available with no need for advanced technical training typical in these systems.
  • Rapid query results without having to worry about tuning queries or optimizing database structures.
  • Since Amazon S3 stores the data, there is no need for businesses to invest in physical IT infrastructure to query and store their information.
  • The pay-as-you-go business model means users only need to pay for queries they actually run. This avoids getting locked into fixed rates for a level of service they don’t actually use.

Things to Consider

  • Data formats: The service works with a number of different data formats. These include ORC, JSON, CSV, and Parquet. For better performance, consider converting data to columnar formats using Apache Parquet.
  • Speed and performance: Athena makes it quick and easy to run queries on S3 data without having to set up servers, define clusters, or do any of the housekeeping that similar query systems require. Check out performance tips and many other points of interest at the AWS blog.
  • Supported functions: Athena uses Presto as its SQL query engine. Users can enter ANSI-standard SQL into this tool and interface directly with Amazon S3 data via Athena. See the Facebook Presto function documentation for a full list of functions.
  • Integration with other BI tools: Amazon promotes this service as a way to produce result sets with SQL queries. However, the data can be used with other business intelligence tools for reporting and analysis. One obvious example is Amazon QuickSight. The service has a JDBC driver that can be used to interface with other business intelligence software.
  • Security: Users are able to control who can access data on S3. It’s possible to fine-tune security to allow different people to see different sets of data and also to grant access to other user’s data.
  • Cost: Users will pay for the amount of data scanned by queries they run. Athena charges $5 for each TB of data scanned. Queries are rounded up to the nearest MB, with a 10 MB minimum. Users pay for stored data at regular S3 rates.

Amazon advises users to use compressed data files, have data in columnar formats, and routinely delete old results sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.

When Should You Use Amazon Athena?

You should consider Amazon Athena if your workload is more ad hoc and you want to avoid the costs associated with an infrastructure like Redshift. It doesn’t require any installation or deployment on any cluster. The data is stored on Amazon S3, can be queried as needed using ANSI SQL, and you only will be charged for queries that you actually run. This might help you balance investments for storage and compute resources that might go underutilized in Redshift.

4. Google BigQuery

Google BigQuery is a serverless, fully managed, petabyte-scale, low-cost enterprise data warehouse. With BigQuery, you can query billion of rows in seconds using SQL syntax. There is no infrastructure to manage and you don’t need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL.

Key Features

BigQuery runs on the Google Cloud Storage infrastructure and can be accessed with a REST-oriented application program interface (API):

  • A fully-managed petabyte-scalable system.
  • Leverages parallel processing.
  • Leverages columnar storage.
  • Geared towards interactive reporting on large data sets.
  • Supports integration and connection with various applications, including BI tools.

Things to Consider

  • Data formats: BigQuery supports CSV, JSON, Avro, and Cloud Datastore backups.
  • Speed and performance: BigQuery brings in as many resources as needed to run a query in seconds, and depends on the size of the dataset queried. According to Google, queries that do less work perform better. Check out BigQuery performance tips here for more information.
  • Supported functions: BigQuery uses standard SQL, which is compliant with the SQL 2011 standard and has extensions that support querying nested and repeated data.
  • Integration with other BI tools: In addition to the web interface and command line, you can connect BigQuery popular SQL tools such as SQL Workbench or business intelligence tools using ODBC and JDBC drivers.
  • Security: BigQuery uses Identity and Access Management (IAM) to manage access to resources.
  • Utlilities: Google provides a BigQuery SDK which can be used to run various operations like creating tables, loads, expirations and many other functions.
  • Cost: Users will pay for the amount of data scanned by queries they run. Google charges $5 for each TB of data scanned. The storage is $0.02 per GB. All other operations, such as loading data, export, copy, or metadata, are free.

When Should You Use Google BigQuery?

Google handles scaling all the infrastructure, so you don’t have to worry about the number of instances, capacity, storage and so forth. It is considered a “fully managed” solution which means Google handles all that magic behind the scenes for you.

If you are already a heavy user of Google products like Google Cloud, Google Analytics 360, or DoubleClick, BigQuery might be a good place to start. Google Analytics 360 customers get a $500 credit each month for BigQuery usage. This credit may more than cover your monthly costs for BigQuery. For reference, an export of 1B hits from Google Analytics 360 to BigQuery would be about 1TB of data and cost about $20 per month. Depending on your workload, the $500 credit might cover all your costs. Google has also released an "always" free tier which might be perfect for those that have light usage requirements!

Like Athena and Spectrum, pricing for data scanned introduces variability into your cost estimates. The types and quantities of your queries will largely determine the budget needed. If you (or your extended team) are scanning 10TB of data a day, then your costs will be about $1,500 a month plus the cost of the amount of data stored. This is extremely cost competitive when compared to traditional database licensing costs, but something you should consider in relation to the other offerings.

5. Panoply.io

Panoply is a smart data warehouse. It provides end-to-end data management-as-a-service for analytics, including ingestions with ELT, smart data management via machine learning, and query optimization. Panoply has unique self-optimizing architecture and uses auto-generated schemas that scan through your data to identify underlying schema and metadata to model and streamline the data from source to analysis minimizing the time you need to spend on optimizing your data warehouse. This is perfect for teams that want to be “hands-off” from warehouse management responsibilities.

Key Features

  • Panoply uses three-tier storage architecture that utilizes machine learning and natural language processing (NLP).
  • ELT + data integration allowing you to analyze and transform your data in seconds regardless of scale (see Panoply’s resource on ETL vs. ELT).
  • The platform auto-scales clusters to keep up with the organization’s needs while reducing server costs.

Things to Consider

  • Data formats: Panoply automatically identifies data formats like CSV, TSV, JSON, XML, and many log formats.
  • Speed and performance: Panoply automatically aggregates data as it streams in, so your analytics can be done in seconds.
  • Integration with other BI tools: Panoply supports dozens of third-party analytics and BI tools from external APIs and proprietary SDKs to tools like Periscope, Looker, and R.
  • Security: Panoply uses the latest security patches and encryption capabilities provided by the underlying AWS platform. Panoply also offers an extra layer of security built to enhance data protection and privacy.
  • Cost: Panoply charges based on the amount of data stored and offers several plans. A free plan includes up to 10 million stored rows per month, which is based on the daily average of the number of rows stored, not loaded.

When Should You Use Panoply?

Panoply is good for you if you want to leverage scale benefits of AWS S3, query features of Amazon Redshift and data accessibility through Elasticsearch but not ready to invest time into maintaining your data warehouse.

If you’d like to give Panoply a chance, they offer a 21-day free trial. Learn more about Panoply.io here.

The Next Steps

A data warehouse is a foundational technology in data analytics. Knowing your own needs will help you choose the right solution so you can find success with your analytics efforts. At Openbridge, we support all five data warehouse solutions so you can start leveraging the full potential of your data. When data is available and accessible in your warehouse, teams can focus on the activities that deliver better outcomes for your business.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
data warehouse ,athena ,redshift spectrum ,bigquery ,panoply ,redshift

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}