When it comes to data warehousing, Amazon Redshift holds its head high. Redshift is one of the many database services offered within AWS: a fast, relational, petabyte-scale data warehouse. Its biggest draw is cost-effectiveness, and it also delivers strong performance along with integrations for several BI tools.
Figure 1: Listing out the benefits of Redshift
Quite recently, Redshift has been compared head-to-head with Azure's data lake offerings, so it is worth learning about each of them in detail. This article explains the benefits and architecture of Amazon Redshift.
We have already described Amazon Redshift as a performant data warehousing service, best known for its cost-effectiveness. It is inexpensive because upfront costs are negligible: charges are levied only for the resources you provision.
In terms of scalability, Amazon Redshift is extremely capable, allowing users to configure databases with ease through the console UI, scripts, or APIs.
Several data clusters can run on Amazon Redshift. When new servers are provisioned during a resize, the existing cluster is marked 'read only', its data sets are copied over to the new servers, and DNS is then switched to point at them. The same cycle repeats on every subsequent resize.
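The resize flow just described can be sketched as a simple sequence of steps. The function and state names below are illustrative only, not actual AWS API values.

```python
# Illustrative sketch of the Redshift resize flow described above.
# State labels are hypothetical, not real AWS API states.

def resize_cluster(source_data):
    """Model the resize steps: the source cluster goes read-only, data is
    copied to the newly provisioned cluster, then DNS is switched over."""
    events = []

    # 1. The existing cluster is marked read-only; reads keep working.
    events.append("source: read-only")

    # 2. Existing data sets are copied to the new servers.
    target_data = list(source_data)
    events.append("copy: %d rows" % len(target_data))

    # 3. DNS is flipped so clients transparently reach the new cluster.
    events.append("dns: switched to new endpoint")
    return target_data, events

data, log = resize_cluster([("2024-01-01", 42), ("2024-01-02", 7)])
```

The point of the model is that clients never lose read access: the old cluster keeps serving queries until the DNS switch completes.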
Figure 2: Overviewing the basic structure
Amazon Redshift takes care of provisioning, applying software patches, and even upgrading the underlying OS of a fully functional database. Server health is monitored continuously, and failed nodes are replaced as soon as they are detected.
Most importantly, data is kept secure by enabling SSL over the wire, i.e. for data in transit. Apart from this, Amazon Redshift's disk storage can also be encrypted. Encryption thus offers welcome flexibility by securing both backed-up data and data active over the wire.
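To enforce SSL in transit from the client side, the connection settings can require it explicitly. The sketch below just assembles libpq-style parameters; the host and credentials are placeholders, and in practice the dict could be passed to a driver such as `psycopg2.connect(**params)`.

```python
# Minimal sketch of Redshift connection settings that enforce SSL.
# Host, database, and credentials below are placeholders.

def redshift_conn_params(host, dbname, user, password):
    """Return libpq-style parameters; sslmode='require' refuses any
    non-SSL connection, so data on the wire stays encrypted."""
    return {
        "host": host,
        "port": 5439,          # Redshift's default port
        "dbname": dbname,
        "user": user,
        "password": password,
        "sslmode": "require",  # reject plaintext connections
    }

params = redshift_conn_params(
    "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "dev", "analyst", "secret")
```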
The basic component of Amazon Redshift is the cluster: a group of compute nodes. When a cluster has more than one compute node, a 'leader node' is also added to the hierarchy.
It is the leader node that coordinates query execution. Where it exists, all communication with the outside world is handled by the leader; client BI tools, for instance, talk only to the 'leader node'.
Upon receiving a query request, the leader generates the entire query plan and distributes the work among the compute nodes.
Figure 3: Architecture for Dummies
Coming back to the compute nodes: each has its own processing unit and disk. Every node executes the query steps it receives and transfers its results to the leader node, which consolidates the partial results into the final outcome. Query execution is parallel because each compute node is divided into slices, each with its own allocated portion of memory; the number of slices is determined by the number of CPUs on the node.
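The distribute-then-consolidate pattern above can be illustrated with a toy model: rows are hash-distributed across slices, each slice computes a partial result, and the "leader" merges them. This is purely illustrative; real Redshift does this internally, not in application code.

```python
# Toy model of leader/compute-node query distribution.

def distribute(rows, num_slices):
    """Hash-distribute (key, value) rows across slices, the way a
    distribution key spreads rows across compute nodes."""
    slices = [[] for _ in range(num_slices)]
    for key, value in rows:
        slices[hash(key) % num_slices].append((key, value))
    return slices

def run_query(rows, num_slices):
    """Each slice sums its local values in parallel (conceptually);
    the leader then adds the partial sums together."""
    partials = [sum(v for _, v in s) for s in distribute(rows, num_slices)]
    return sum(partials)  # leader-node consolidation step

rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
total = run_query(rows, num_slices=2)
```

Note that the final answer is independent of the slice count; only the degree of parallelism changes.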
Amazon Redshift boasts columnar data storage, with the minimum and maximum values of each block recorded as metadata. Data is retrieved on request using sort keys, which are extremely helpful when the database is searched for exact results. Blocks, each 1 MB in size, are saved in sort-key order, so key ranges do not overlap between blocks; on a request, the engine can jump directly to the desired block and skip the rest.
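Block skipping via per-block min/max metadata can be sketched as follows. The block size and helper names are made up for the example; only the pruning idea mirrors Redshift.

```python
# Sketch of block pruning using per-block min/max metadata.
# Because data is stored in sort-key order, each block covers a
# disjoint key range and whole blocks can be skipped.

def build_blocks(sorted_values, rows_per_block):
    """Split sorted data into blocks, recording min/max for each."""
    blocks = []
    for i in range(0, len(sorted_values), rows_per_block):
        chunk = sorted_values[i:i + rows_per_block]
        blocks.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return blocks

def scan(blocks, lo, hi):
    """Read only blocks whose [min, max] range overlaps [lo, hi]."""
    hits, blocks_read = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # pruned: this block cannot contain a match
        blocks_read += 1
        hits.extend(v for v in b["rows"] if lo <= v <= hi)
    return hits, blocks_read

blocks = build_blocks(list(range(100)), rows_per_block=10)  # 10 blocks
hits, read = scan(blocks, 42, 45)  # only one block needs reading
```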
Interleaved and compound sort keys are featured to help the database segregate data into blocks. The compound sort key comprises multiple columns, ordered from primary to secondary. However, query performance drops whenever the primary (leading) column is missing from the filter.
Next in line is the interleaved sort key, which is helpful when many different query patterns are concerned. Larger tables often put this sort key to use, since it gives every column equal priority.
Both these sort keys bank on periodic re-sorting (re-indexing) for better query performance; the compound sort key is assigned by default. Applications, whether on desktops or smartphones, that need predictable query performance are well served by Redshift as their data warehousing service, since data can be retrieved easily on request.
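The compound-key caveat above, that performance drops when the leading column is absent from the filter, can be demonstrated with a toy example. The column names and data are invented; the function simply counts how many blocks contain at least one matching row.

```python
# Toy illustration: a compound sort key only clusters matches together
# when the filter uses the leading (primary) column.

# Hypothetical table: (region, day), compound-sorted lexicographically.
rows = [("r%d" % i, d) for i in range(10) for d in range(10)]
rows.sort()  # compound sort key order: region first, then day

def blocks_touched(rows, rows_per_block, predicate):
    """Count blocks containing at least one matching row; pruning
    could skip the rest."""
    touched = 0
    for i in range(0, len(rows), rows_per_block):
        chunk = rows[i:i + rows_per_block]
        if any(predicate(r) for r in chunk):
            touched += 1
    return touched

# Filter on the leading column: matches sit in a single block.
lead = blocks_touched(rows, 10, lambda r: r[0] == "r3")
# Filter on the trailing column only: matches scatter across all blocks.
trail = blocks_touched(rows, 10, lambda r: r[1] == 3)
```

With the leading column in the filter only one block is touched, while filtering on the trailing column alone touches every block; an interleaved key trades some of the former to improve the latter.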
Suppose you are looking to import data from SQL Server into Amazon Redshift. The first step is to create the target tables. Data is then extracted from SQL Server into delimited text files, which are uploaded to an 'AWS S3' bucket. Finally, the COPY command is run to import the files into Redshift.
Figure 4: Importing data into Redshift
The featured syntax is as follows:
copy <table name> from '<S3 file path>' credentials 'aws_access_key_id=<key id>;aws_secret_access_key=<access key>' delimiter '<delimiter character>'
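As a hedged sketch, the COPY statement above can be assembled from parameters like so. Building the string keeps the example self-contained; in practice it would be executed through a Redshift SQL client, and credentials are better supplied via an IAM role than embedded keys.

```python
# Sketch: assemble the Redshift COPY command shown above.
# Table name, bucket path, and keys here are placeholders.

def build_copy(table, s3_path, key_id, secret_key, delimiter="|"):
    """Return a Redshift COPY command for a delimited file in S3."""
    creds = "aws_access_key_id=%s;aws_secret_access_key=%s" % (key_id, secret_key)
    return ("copy %s from '%s' credentials '%s' delimiter '%s'"
            % (table, s3_path, creds, delimiter))

sql = build_copy("sales", "s3://my-bucket/sales.txt", "AKIA...", "SECRET...")
```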
Concept of Client ‘BI Tools’
In terms of client tools, Tableau is one that connects directly to Redshift. Once the connection details are keyed in, Tableau works against the Redshift data live. Dashboards can be generated from the Redshift table schema, and the resulting reports can then be viewed across multiple browsers.
Redshift is a compelling choice in terms of performance, cost, and even scalability. Most importantly, client 'BI tools' integrate with it easily. Overall, Amazon Redshift is a well-oiled machine, with leader nodes and compute nodes working in cohesion. Sort keys form the backbone of its storage, and the service handles software patches and even the quintessential OS upgrade on its own.