10 Considerations to Quickly Find Success When Adopting Amazon Redshift Spectrum
This article covers what is important to know when adopting Amazon Redshift Spectrum for interactive queries.
Amazon Redshift Spectrum, an interactive query service for Redshift customers, was introduced in April 2017. The service lets you avoid time-consuming ETL workflows and run queries directly against data stored in Amazon S3. This article covers what is important to know when adopting Amazon Redshift Spectrum for interactive queries, and how to automate certain processes to improve performance and lower query costs.
1. Schema and Table Definitions
Setting up Amazon Redshift Spectrum requires creating an external schema and tables. You can use the Amazon Athena data catalog or a Hive metastore on Amazon EMR as the "metastore" in which to create an external schema. Note that external tables are read-only: they do not support insert, update, or delete operations.
Make sure that the data files in S3 and the Redshift cluster are in the same AWS Region. You also need to authorize access to your external Athena data catalog, which can be done through the IAM console. To run Redshift Spectrum queries, the database user must have permission to create temporary tables in the database. For details, see https://docs.aws.amazon.com/redshift/latest/dg/r_GRANT.html
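As a sketch of what this setup looks like, the statements below create an external schema backed by the Athena/Glue data catalog and define an external table over Parquet files in S3. The schema, database, table, column, role ARN, and bucket names are all placeholders, not anything prescribed by the service.

```sql
-- Create an external schema backed by the data catalog.
-- The IAM role must allow access to S3 and the catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a read-only external table over Parquet files in S3.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
```

Because the table is external, any INSERT, UPDATE, or DELETE against it will fail; you change the data by changing the files in S3.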
This step can be automated. Openbridge has a service that creates schemas and tables automatically based on a user's storage configuration and the structure of the processed files.
2. Optimized Data Formats
Amazon Redshift Spectrum supports the following formats: AVRO, PARQUET, TEXTFILE, SEQUENCEFILE, RCFILE, RegexSerDe, ORC, Grok, CSV, Ion, and JSON.
As a best practice to improve performance and lower costs, Amazon suggests using columnar data formats such as Apache Parquet. Because Amazon Redshift Spectrum charges per query based on the amount of data scanned from S3, it pays to scan only the data you need. Columnar formats like Parquet make this possible: when you select only the columns you need, only those columns are read and transferred from Amazon S3. This is not possible with row-based formats like CSV or JSON.
To benefit from this optimization, query for the fewest columns possible. If, however, you typically need whole rows (for example, looking up a record by ID), columnar storage can be suboptimal, so run some tests against your own workload.
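To illustrate the difference, here are two hypothetical queries against a Parquet-backed external table (the table and column names are placeholder examples, not part of the service):

```sql
-- Scans only the two referenced columns of the Parquet files,
-- so only those column chunks are read (and billed) from S3.
SELECT sale_date, SUM(amount) AS daily_total
FROM spectrum_schema.sales
GROUP BY sale_date;

-- By contrast, SELECT * forces every column to be scanned,
-- which raises both cost and query time.
SELECT * FROM spectrum_schema.sales;
```

With a row-based format like CSV, both queries would scan the same amount of data, because entire rows must be read either way.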
3. Data Compression
Amazon Redshift Spectrum allows you to run queries on S3 data without having to set up servers, define clusters, or do any system maintenance. However, to improve query speed and performance, it is recommended to compress data files. Compressed files are recognized by their file extensions. The following are supported:
- gzip — .gz
- Snappy — .snappy
- bzip2 — .bz2
Openbridge defaults to Snappy with Apache Parquet, as it offers a good trade-off between the CPU used to process files and the reduction in S3 storage and I/O.
4. Preparing Files for Massively Parallel Processing
To query external data, Redshift Spectrum uses multiple instances to scan files in parallel. This massively parallel processing (MPP) lets you run complex queries on large amounts of data faster. To get the most out of MPP, place the data files for each table in a separate folder and keep the files the right size. Amazon recommends breaking large files into many smaller files of equal size (no larger than 64 MB each) to distribute the workload evenly.
To align with this optimization, Openbridge splits each incoming file into smaller chunks of roughly 200 MB. Metadata adds about 12 MB per file, and after Parquet + Snappy compression each chunk ends up between 10 and 50 MB. The exact file size depends on how compressible the original file is, but is fairly consistent within a single incoming file.
5. Data Partitioning
Data partitioning is another practice that improves query performance. Partitioning can reduce scan time by a factor of 2 to 4. Be sure to choose the right columns for partitioning, for example, columns you frequently use for filtering. Keep in mind that excessively granular partitioning adds time for retrieving partition information, though it can reduce the amount of data scanned from S3. Actual performance also varies with file placement, query pattern, file size distribution, the number of files in a partition, the number of qualified partitions, and so on.
Openbridge partitions data by date. For files uploaded by Batch or API, this is the processing date. For data generated by other applications such as Facebook, YouTube, DoubleClick, and Google Analytics 360, Openbridge uses the date when the data was fetched.
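A partitioned external table might be set up as sketched below; the table, column, and S3 path names are illustrative placeholders. Each partition maps to its own S3 prefix and is registered with the catalog, so a filter on the partition column lets Spectrum skip all non-matching prefixes entirely.

```sql
-- Hypothetical table partitioned by date: files for each day live
-- under their own S3 prefix.
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id BIGINT,
    payload  VARCHAR(4000)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Register a partition explicitly, pointing at its prefix.
ALTER TABLE spectrum_schema.events
ADD IF NOT EXISTS PARTITION (event_date = '2018-01-01')
LOCATION 's3://my-bucket/events/event_date=2018-01-01/';

-- Filtering on the partition column prunes all other partitions,
-- so only one day's files are scanned (and billed).
SELECT COUNT(*)
FROM spectrum_schema.events
WHERE event_date = '2018-01-01';
```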
6. Supported Functions
Once an external table is defined, you can query it just like any other Redshift table. External data can also be joined with data in regular (non-external) tables, and the work is distributed across all nodes in the cluster.
There are some limitations to consider. Redshift Spectrum doesn't support nested data types such as STRUCT, ARRAY, and MAP. Also, when using the Athena data catalog, there are some limits on queries: each account is limited to 100 databases, each database is limited to 100 tables, and each table can have a maximum of 20,000 partitions. However, you can request a limit increase by contacting AWS Support.
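A join between S3-resident and cluster-resident data might look like the following sketch; both table names and the join key are hypothetical examples:

```sql
-- Join an external (S3) fact table with a regular Redshift
-- dimension table in a single query.
SELECT c.customer_name,
       SUM(s.amount) AS total_spent
FROM spectrum_schema.sales AS s    -- external table scanned from S3
JOIN public.customers AS c         -- local Redshift table
  ON s.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY total_spent DESC;
```

Spectrum handles the S3 scan, aggregation, and filtering close to the data, while the join against the local table runs on the cluster nodes.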
8. Integration with Other BI Tools
Amazon Redshift Spectrum can be used on its own to run queries. However, for visual reporting and analysis, you can integrate Spectrum with BI tools such as Tableau, Looker, MicroStrategy, Microsoft Power BI, and others. In general, if a tool supports Redshift, it should be able to see Spectrum data. If this doesn't happen automatically, Amazon suggests configuring your application to query SVV_EXTERNAL_TABLES and SVV_EXTERNAL_COLUMNS.
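For example, a BI tool (or you, when debugging discovery problems) can enumerate external tables and their columns through these system views; the schema filter below is a placeholder for your own schema name:

```sql
-- List all external tables visible to the current user,
-- including the S3 location backing each one.
SELECT schemaname, tablename, location
FROM SVV_EXTERNAL_TABLES;

-- List the columns and their types for tables in one external schema.
SELECT schemaname, tablename, columnname, external_type
FROM SVV_EXTERNAL_COLUMNS
WHERE schemaname = 'spectrum_schema';
```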
9. Amazon Redshift Spectrum Security
For data security, Redshift Spectrum supports server-side encryption with Amazon S3-managed keys (SSE-S3, using AES-256 encryption) and server-side encryption with keys managed by AWS Key Management Service (SSE-KMS).
Amazon S3 client-side encryption is not supported. For more information, see Protecting Data Using Server-Side Encryption.
Also, note that with Amazon Redshift Spectrum you cannot control user permissions on individual external tables. Instead, you grant and revoke permissions on the external schema.
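In practice that means access control looks like the following sketch, where the schema, database, and user names are placeholders:

```sql
-- Permissions are managed at the schema level, not per table.
GRANT USAGE ON SCHEMA spectrum_schema TO analyst_user;
REVOKE USAGE ON SCHEMA spectrum_schema FROM analyst_user;

-- Spectrum queries also require the right to create temporary
-- tables in the database (see the setup section above).
GRANT TEMP ON DATABASE mydb TO analyst_user;
```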
Openbridge will store data on S3 using AES-256 encryption.
10. Amazon Redshift Spectrum Pricing
With Amazon Redshift Spectrum, users pay for the amount of data scanned by their queries, in addition to the Redshift instance and S3 storage costs:
- $5 per TB of data scanned; each query is rounded up to the nearest MB, with a 10 MB minimum. For example, a query that scans 200 GB costs about $1.
- Amazon Redshift instance rates for the cluster used
- Standard Amazon S3 rates for storage and requests
- If you use the AWS Glue Data Catalog as a metastore, you will also pay standard AWS Glue Data Catalog rates
For more performance tips, check out this AWS post.
When Should You Consider Amazon Redshift Spectrum?
If you are already a Redshift customer, using Amazon Redshift Spectrum can help you balance the need for adding capacity to the system. With Redshift Spectrum, you no longer need to treat data storage as a constraint: you can run SQL queries directly on vast amounts of data collected in Amazon S3. This can save you money, since you can lifecycle data out of Redshift to S3 instead of keeping it in local storage. For example, suppose you have a 100 GB transactional table of infrequently accessed data in Redshift. Why pay to store it in Redshift when you could move it to S3 and query it with Spectrum? This approach can save you hundreds or thousands of dollars.
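The lifecycle move described above can be sketched with UNLOAD, assuming a Redshift release that supports Parquet output for UNLOAD; the table name, bucket path, and role ARN are placeholders. After the unload, the archived data can be exposed through an external table and queried with Spectrum.

```sql
-- Archive an infrequently accessed table from cluster storage to S3
-- as Parquet, ready to be registered as an external table.
UNLOAD ('SELECT * FROM public.old_transactions')
TO 's3://my-bucket/archive/old_transactions/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
FORMAT AS PARQUET;
```

Once the external table over `s3://my-bucket/archive/old_transactions/` is in place and verified, the original table can be dropped from the cluster, freeing local storage.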
If you are attracted to a solution with no infrastructure to maintain and pay-per-query pricing, getting started with Spectrum is simple and easy. The move to serverless database and query technologies is certainly gaining momentum, and with Spectrum, AWS has delivered.
Published at DZone with permission of Thomas Spicer. See the original article here.
Opinions expressed by DZone contributors are their own.