Everything You Need to Know About Redshift Spectrum, Athena, and S3
Everything You Need to Know About Redshift Spectrum, Athena, and S3
Last week, Amazon announced Redshift Spectrum — a feature that helps Redshift users seamlessly query arbitrary files stored in S3. Learn more about it here.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Last week at the AWS San Francisco Summit, Amazon announced a powerful new feature: Redshift Spectrum. Spectrum offers a set of new capabilities that allow Redshift users to seamlessly query arbitrary files stored in S3 as though they were normal Redshift tables, delivering on the long-awaited requests for separation of storage and compute within Redshift. This option opens up a ton of new use-cases that were either impossible or prohibitively costly before.
In essence, Spectrum feels like a powerful integration between Redshift and Athena that provides the following features:
- New SQL Commands to create external schemas and tables.
- Ability to query these external tables and join them with the rest of your Redshift cluster.
This is simple but very powerful. Amazon just made Redshift much bigger without compromising on performance or other database semantics.
So, how does it all work? It starts by defining external tables. One limitation this setup currently has is that you can’t split a single table between Redshift and S3. One use case that we cover in Panoply where such separation would be necessary is when you have a massive table (think clickstream time series) but only want the most recent events, i.e. the past three months, to reside in Redshift, as that covers most of your queries. Then, you might want to have the rest of the data in S3 and have the capability to seamlessly query this table. While this is not yet part of the new Redshift features, I hope that it will be something that Redshift team will consider in the future. In the meantime, Panoply’s auto-archiving feature provides an (almost) similar result for our customers.
This means that every table can either reside on Redshift normally or be marked as an external table. “External table” is a term from the realm of Data Lakes and query engines, like Apache Presto, to indicate that the data in the table is stored externally — either with an S3 bucket or Hive metastore. Effectively, the table is virtual. It’s only a link with some metadata. Let’s consider the following table definition:
CREATE EXTERNAL TABLE external_schema.click_stream ( time timestamp, user_id int ) STORED AS TEXTFILE LOCATION 's3://myevents/clicks/'
See: SQL Reference for CREATE EXTERNAL TABLE.
Basically, what we’ve told Redshift is to create a new external table — a read-only table that contains the specified columns and has its data located in the provided S3 path as text files. That’s it. We can start querying it as if it had all of the data pre-inserted into Redshift via normal COPY commands.
This has several important implications:
- Performance. Querying this table is bound to be slower than querying data that resides within Redshift as it involves reading the files and parsing them on every query. See below to understand how it’s done.
- Scale. Since we’re not storing the data in Redshift, there’s a clear separation of storage and compute. You can scale as far as you’d like, both in data size and in computation power, independently of one another. This can easily be as big as S3 allows — exabytes in scale.
- Cost. If you’re not querying this table, you wouldn’t pay for it (well, beyond the S3 costs). Also, see below more about cost.
The more experienced reader will notice that I’ve skipped the schema/database creation phase. There’s one technical detail I’ve skipped: external schemas. As you might’ve noticed, in no place did we provide Redshift with the relevant credentials for accessing the S3 file. Quite cleverly, instead of having to define it on every table (like we do for every COPY command), these details are provided once by creating an External Schema and then assigning all tables to that schema. I will not elaborate on it here, as it’s just a one-time technical setup step, but you can read more about it here.
As stated above, this new feature-set is merely an integration between Redshift and AWS Athena. The latter is AWS’s smart wrapper around Apache Presto, which is a query engine that allows SQL code against arbitrary data files — be it HDFS, S3, local file-system, etc.
Virtually you can think of it this way: when you run a query against an external Redshift table it goes through Redshift => Athena => Presto. That means that at least some of the computation, especially the low-level table scans, happens within Presto, which needs to parse the raw data files into a tabular format. In order to do that, Presto needs to know ahead of time how the data is structured, is it a Parquet file? A CSV or TSV file?
That’s where the aforementioned “STORED AS” clause comes in. When we initially create the external table, we let Redshift (and subsequently Athena and Presto) know how the data files are structured.
Now that the table is defined. We can query it just like any other Redshift table. But more importantly, we can join it with other non-external tables. So if we have our massive click stream external table and we want to join it with a smaller & faster users table that resides on Redshift, we can issue a query like:
SELECT clicks.time, clicks.user_id, users.user_name FROM external_schema.click_stream as clicks JOIN users ON (clicks.user_id = users.users_id);
Redshift will construct a query plan that joins these two tables like so:
Basically what happens is that the User table is scanned normally within Redshift by distributing the work among all nodes in the cluster. In parallel, Redshift will ask Athena to scan the clicks table. Athena itself is an automatically distributed system that will assign more nodes to participate in the workload based on how much data you’re querying. Finally, the data is collected from both scans, joined and returned.
Performance and Cost
Naturally, queries running against S3 are bound to be a bit slower. It’s still interactively fast, as the power of Presto and the elastic scaling of Athena allows great parallelism, but it’s not going to be as fast as having your data pre-compressed, pre-analyzed data stored within Redshift. That’s not just because of S3 I/O speed compared to EBS or local disk reads, but also due to the lack of caching, ad-hoc parsing on query-time and the fact that there are no sort keys.
But that’s fine. External tables cover a different use-case. You use them for data your need to query infrequently, or as part of an ELT process that generates views and aggregations. In fact, in Panoply we’ve simulated these use-cases in the past similarly — we would take raw arbitrary data from S3 and periodically aggregate/transform it into small, well-optimized materialized views within Panoply.
As for the cost... this is a tricky one. At the get-go, external tables cost nothing (beyond the S3 storage cost), as they don’t actually store or manipulate data in any way. It’s just a bunch of Metadata. But as you start querying, you’re basically using Athena’s cost model of paying per scanned data size. So if, for example, you run a query that needs to process 1TB of data, you’d be billed for $5 for that query. This model isn’t unique, as is quite convenient when you indeed query these external tables infrequently but can become problematic and unpredictable when your team query it often.
Finally, using a columnar data format like Parquet can improve both performance and cost tremendously, as Presto wouldn’t need to read and parse the whole table, but only the specific columns that are part of the query. This saves the costs of I/O, due to file size, especially when compressed, but also the cost of parsing.
It’s clear that the world of data analysis is undergoing a revolution. It started out with Presto, which was arguably the first tool to allow interactive queries on arbitrary data lakes. Then, Google’s Big Query provided a similar solution except with automatic scaling. And finally, AWS Athena and Spectrum bring these same capabilities to AWS.
But here at Panoply, we still believe the best is yet to come. These new awesome technologies illustrate the possibilities, but the performance is still a bit off compared to classic data warehouses like Redshift and Vertica that had decades to evolve and perfect. Technically, there’s little reason for these new systems to not provide competitive query performance despite their limitations and conceptual differences from the classic data warehouses. (Yeah, I said it. More on this topic to come...)
In any case, we’ve been already simulating some of these features for our customers internally for the past year and a half. Having these new capabilities baked into Redshift makes it easier for us to deliver more value like auto-archiving faster and easier. In the near future, we can expect to see teams learn more from their data and utilize it better than ever before by using capabilities that, until very recently, were outside of their reach.
Published at DZone with permission of Roi Avinoam , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.