In this post, we take a look at the architecture of this popular data warehouse and how to ensure Redshift's peak performance.
Join the DZone community and get the full member experience.Join For Free
Amazon Redshift is (for the most part) a Data Warehouse as a service, and there’s no need to provision hardware, install databases or patches with few options to tune the system. While there are few options available to tune or customize the database, it’s absolutely critical to correctly design the physical table layout to maximize performance.
Before diving into the detail it’s worth giving an overview of how Redshift is internally architected. The diagram below illustrates how every query is submitted to the Leader Node which is responsible for parsing the query, determining the best execution plan, and coordinating and aggregating results.
When data is loaded, it’s distributed across each compute node in the cluster as a series of slices, where each slice corresponds to a CPU core, memory allocation, and disk space. This method maximizes parallel execution and supports scalability as the system can be migrated to a larger cluster with additional nodes.
When a query is executed, the leader node breaks up the task into a number of parallel steps, executed by the Compute Nodes which actually store the data, and perform the heavy lifting. This means any given query can be executed in parallel across multiple cores reading multiple disks, thus maximizing throughput.
Of course, the extent to which the query slice can run independently in parallel depends upon the extent to which the workload can be balanced, and the remainder of this article explains how this can be achieved using Sort Keys and Distribution Keys.
Sort Keys and Indexes
Ever since Bayer and McCreight first proposed the B-Tree index in 1972 it has been the primary indexing method used by almost every database, although database designers must carefully balance a trade-off of better read performance and write throughput.
While a B-Tree supports rapid access for both direct lookup and scan operations, it’s a major cause of locking contention issues when bulk loading data, which can lead to performance issues. Even the Bitmap Index, specifically designed for analytic query performance leads to significant concurrency issues when maintained by multiple writers, and is often disabled prior to bulk load operations.
On Redshift, there’s no need to devise an indexing strategy or drop and rebuild indexes around batch ETL loads, as Redshift does not support traditional indexes. Instead, the data is physically stored to maximize query performance using SORT KEYS.
The diagram above illustrates the method used by Redshift which is based on sorting data during load to maximize read performance, in this case by TEAM and then CITY.
As data is loaded it’s sorted by a SORT KEY, and the minimum and maximum value recorded for each 1Mb block. This is used by the optimizer to skip over blocks based upon the query where clause. For example, in the above table, a query filtering by TEAM = ‘Web’ would only read block 3, as all others are automatically eliminated. Equally filtering by CONTINENT = ‘USA’ would read only block 1.
Without a sort key, the same query would potentially read every block in the entire table (potentially millions of rows), with a consequent impact upon performance.
In some database systems (eg. Oracle) this is achieved by declaring a PARTITION and SUB-PARTITION on the table, and the effect is the same – improvements to query performance by partition elimination.
To demonstrate the potential gains, we ran a simple benchmark summary query across a billion rows on a cluster of 8 dc2.large nodes, and simply adding sort keys (without any filters) meant a count with a group by query ran twice as fast. Including a filter in the query where clause, produced sub-second results.
A sort key improves performance:
- By reducing disk I/O by skipping over blocks when filtering data using a query where clause.
- By reducing the need to physically sort data for ORDER BY or GROUP BY operations.
- By facilitating a MERGE JOIN – the fastest of the three join methods supported by Redshift.
Types of Sort Key
There are currently two types of sort keys:
- Composite: The default which can include multiple columns, and must be defined in priority order with the column most often queried starting first.
- Interleaved: Which places equal priority to each sort column. Best used for large tables which are infrequently updated where no single column is frequently filtered.
By default, a composite key will probably give a better query performance, but be sure to sequence the columns correctly to maximize row elimination.
Interleaved keys should be considered for relatively static large tables in which single columns appear as highly selective predicates by themselves, but no single column is frequently used to filter results. They should be avoided for time-series fact tables as they can lead to excessive VACUUM effort.
The Need to Vacuum
As Redshift does not reclaim free space automatically, updates and delete operations can frequently lead to table growth. Equally, it’s important as new entries are added, that the data is maintained in a sorted sequence.
The VACUUM command is used to re-sequence data, and reclaim disk space as a result of DELETE and UPDATE operations. Although it won’t block other processes, it can be a resource-intensive operation, especially for data stored using interleaved sort keys.
It should be run periodically to ensure consistent performance and to reduce disk usage.
The selection of a SORT KEY should be based upon a knowledge of the data, and how values appear as a predicate in a query where clause.
- All Tables: Should have a sort key defined.
- Use Composite Keys: Based upon an understanding of how data is filtered in queries. Don’t assume an interleaved key works better if you don’t understand the data query patterns.
- Select: The columns which are most often used in the where, group by and Order By clause.
- Large Fact Tables: Tables that are frequently queried by DATE or TIMESTAMP should be defined as a composite sort key to maximize data elimination.
- Distribution Keys: Columns which are selected as a distribution key are also candidates for a sort key. This improves join performance between large Fact and Dimension tables as both tables are sorted by the same sequence, and the join becomes a simple sort-merge operation.
- Sequence: Sort keys carefully. If a composite sort key is used, ensure the first column appears most often in the where clause for maximum performance.
- Compression: Must be avoided on the sort keys as the query engine cannot use the full potential of sort keys if they are compressed.
To summarize, the best practices in selecting sort keys include:
Distribution Keys: The Challenge
The diagram below illustrates the challenge whereby data is automatically distributed across nodes in the cluster and queries are executed in parallel on every node. This works well to maximize performance, except when tables are joined. If the related data is held on different nodes, it causes inter-node data transfers which significantly impact performance.
In the example below, data is badly distributed, and therefore needs to be transferred between nodes to complete join operations.
The Solution: Distribution Keys
Any given table can only have only one distribution key, and it determines the physical location of the data across each node in the cluster. The aim of selecting a sensible distribution key is to balance a number of (sometimes conflicting) priorities:
- Eliminate data transfer: Joining data held on different nodes leads to data being copied across the network which can significantly impact performance.
- Reduced disk space: To avoid data transfers, it’s possible to replicate the data across all nodes in the cluster, but this increases disk space requirements, and data loads take longer, so it must be replicated to each machine. This method improves query performance at the expense of disk space and load rate.
- Ensuring workload balance: One option is to use distribution by KEY, as all rows the with the same key are located on the same machine. However, if poorly selected this can lead to an unbalanced workload, as either the data is skewed, or the majority of queries are performed on a single machine.
There are three options for a distribution key, illustrated in the diagram below:
- ALL: Replicates the data across every node in the cluster. Best applied to relatively small dimension tables, this uses more disk space but ensures every join operation is quickly performed locally, avoiding data transfers. It also increases the time taken to execute the Redshift COPY command, and may ultimately need a larger cluster. Replicating tables to ALL nodes also affects load times, as any updates need to be written to all nodes in the cluster.
- KEY: Distributes the data across the nodes, and ensures every row with the same key is located on the same node. This means for example, if distributed by STATE, queries limited to a small number of states will execute on a subset of the nodes. There is, however, a risk of overloading one or more nodes for very large states. A potentially better use-case is where a large SALES fact table is joined to an equally large CUSTOMERS dimension table. In this case, sales for each customer are physically stored on the same node, and assuming a relatively even distribution, queries will be executed across all nodes in the cluster.
- EVEN: Distributes rows evenly in a round-robin fashion. This works well for tables (eg. Audit logs) which are infrequently joined to others as queries are executed in parallel across all nodes in the cluster. Equally, this is an excellent option for any Fact table for which all the joining Dimension tables are distributed using the ALL method.
Recommendation: Design With Extreme Care
Care must be taken to agree on a data distribution strategy as follows:
- Small Dimension Tables: Should be replicated across all nodes in the cluster. Replicating dimension tables which appear in join operations is a good way to reduce inter-node data transfers at the expense of disk space and load rate.
- Fact Tables: Those that are purely joined to dimension tables with a distribution type of ALL should be distributed using an EVEN method. This ensures a balanced distribution to support fast inserts and queries while avoiding data transfers.
- Large Dimension Tables: A candidate for KEY distribution, but the join keys must be carefully selected along with the Fact tables as each table can have only one distribution key. For example, a large CUSTOMER table could be distributed by CUSTOMER_ID along with the SALES table which must be distributed using the same key. This ensures SALES data for the same customer is co-located on the same machine. Assuming all other dimensions are distributed using the ALL method, this will ensure all joins are performed locally.
- Fact Tables joined to KEY Dimensions: Must be distributed with the same KEY as the corresponding dimension table.
Where a large fact table joins to more than one very large dimension table, the designer must decide the best way to balance the conflicting demands. Once the KEY dimension is selected, the remaining dimensions must be distributed by either the EVEN or ALL method based upon the balance of disk space, data load rate, and query performance.
Thanks for reading this far. If you found this helpful, you can view more articles on Big Data, Cloud Computing, Database Architecture and the future of data warehousing on my web site www.Analytics.Today.
Published at DZone with permission of John Ryan, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.