An Overview of Amazon Redshift
This article will define the Amazon Redshift cloud data warehouse and provide a few tips for those looking into Redshift as a potential solution.
Join the DZone community and get the full member experience.Join For Free
Data warehouses provide businesses with the ability to slice and dice data and extract valuable insights from that data to make better business decisions. Used for reporting and data analysis, data warehouses act as a central repository for all or for portions of the data collected by an enterprise's various systems.
Data warehouses are "fed" data from different data sources, such as relational/NoSQL databases or third-party APIs. All these data sources need to be combined into a coherent data set that is optimized for fast database queries.
Initially, data warehousing was only available as an on-premise solution — until Amazon Web Services launched Redshift in November of 2012. On-premises data warehouses are appliance-based, making them difficult to expand, while cloud data warehouses offer elasticity, scalability, and the ability to handle big data volumes while still using familiar models (such as the SQL/relational model used by Redshift).
This article will define the Amazon Redshift cloud data warehouse and provide a few tips for those looking into Redshift as a potential solution. This is not a Redshift database design tutorial but a primer to give you an idea of what you would need to learn if you choose Redshift.
What Is Amazon Redshift?
First of all, let's take a very quick look at Redshift and see how it differs from a normal relational database.
Amazon Redshift is a cloud data warehouse that allows enterprises to scale from a few hundred gigabytes of data to a petabyte or more (see the official documentation). This enables you to use your data to acquire new insights for your business and customers. Redshift is comprised of nodes called Amazon Redshift clusters. After provisioning the clusters, you can upload datasets to the data warehouse. You can then perform analysis queries on the data.
Redshift Nodes, Slices, and Table Distribution Style
In Redshift, a slice is a further subdivision of the data, and each node can have multiple slices. When you load data into Redshift, the rows of that data are distributed across the cluster's slices according to the table distribution style. How data is distributed affects query performance. When performing joins and aggregations, data often needs to be sent over the internal network to other nodes for processing. Minimizing these redistributions is important in order to gain better performance. Also, we want to avoid some nodes doing a lot more work than the others, so it is also about distributing the computational load.
For each table you must specify the distribution style which can be ALL, EVEN, or KEY:
ALL distributes the data to all nodes. This can be great for read performance, as data is always local to the compute node, but it comes at the price of greater write cost and maintenance operations.
EVEN distributes the data evenly across nodes, making sure that each node gets its fair share of the data. This is suitable when the table does not need to be joined with other tables.
KEY distributes the data by a distribution key (DistKey). This key is one of the columns of the table. Distribution via a column is important for efficient joins with another table. By making the foreign key the DistKey of both tables, we ensure that the joined data of each row is always on the same node, thereby reducing the redistribution of data.
Each table must have a SortKey. Redshift always stores data sorted by the SortKey.
So while it is true that star and snowflake schemas are not always necessary anymore, there are new distributed system-level considerations that you must design for. Data locality and distribution are critical to getting the best performance. Ideally, we want joined rows to be as close as possible. If the data can exist on the same node, then even better. Data that needs to be moved over the internal network for joins means slower queries. You also have to take into account the cardinality of your filters. If you always perform filters based on today's date and you have the date column as the DistKey, then you can end up with uneven loads. For example, that might mean that of your ten nodes, you always end up using only one node and hammering it while the others stay idle. So ideally, we want data locality at an individual row level, but data distribution of all the rows that match a given filter.
Redshift supports the use of primary and foreign keys but does not enforce them. What does that mean, exactly? Well, Redshift really cares about the DistKey (when using the KEY distribution style) and the SortKey. But Primary Keys and Foreign Keys can help the query optimizer to make better decisions and it can help people understand the data model better. But these constraints must be enforced by the applications feeding Redshift, as Redshift itself treats them as informational only.
Redshift is priced according to the amount of data you store and by the number of nodes. The number of nodes is expandable. Depending on the amount of stored data, teams can set up anywhere from a single node (160 GB or 0.016 TB) to a 128 node cluster (with 16 TB capacity on a hard disk drive).
While we won’t be diving deep into the technical configurations of Amazon Redshift, there are technical considerations for its pricing model. Understanding of nodes versus clusters, the differences between data warehousing on solid state disks versus hard disk drives, and the part that virtual cores play in data processing is helpful for examining Redshift’s cost-effectiveness. For more details, see this article on Redshift pricing for data engineers.
Another useful article is this Hitchhiker’s Guide, showing how ironSource uses Redshift as raw data storage and how the pricing was worked out for that specific use case.
It's pretty simple to get started and start learning how to use Redshift. There are a number of additional ways of getting your data into Redshift, and that might be your next step. If you have an existing data warehouse that you want to move to Redshift, you might be thinking about whether it will still perform well, given that it has a star or snowflake schema. The answer is yes, it can still perform well, and Amazon provides guidance on doing that. But you might want to consider going for a different schema that is easier to write to and leverage the columnar storage and MPP of the platform.
Redshift is a great PaaS offering for data warehousing and Amazon just keeps making it better. It can now even read directly from S3. Most vendors in the data space have integrations with it, which make it a good choice for the future. But there are also the SaaS options that abstract away even more work for you. With these tools, you don't need to think about data distribution and data locality anymore. They can reduce the time necessary to get from identifying a data source to producing value in the form of business insights and reduce the time spent optimizing table schema designs. So if you are new to data warehousing in general or new to data warehousing in the cloud, then take a look at the various options open to you in this new world of cheap storage and massively parallel computing.
Opinions expressed by DZone contributors are their own.