Azure Data Lake vs. Amazon Redshift: Data Warehousing for Professionals
The differences between Azure and Amazon for data warehousing needs.
Join the DZone community and get the full member experience.Join For Free
Take a closer look at the biggest data warehousing giants and you will find a host of similarities and differences. Here we are going to pit Amazon’s Redshift against Microsoft’s Azure— segregating each into five differentiating factors.
Figure 1: Basic Hierarchy of Data Warehousing
Putting forth the SimilaritiesAs for similarities, both these entities render support for the ‘petabyte’ scale system— making it easier for us to manage large quantities of data. Master nodes or rather leader nodes can be associated with both of them besides the compute nodes which in turn can take care of the exported storage from the former, via the NFS. Again, it is the leader node with responsibilities vested upon them for distributing workload to the wide-array of compute notes. Once processed, using the storage, compute nodes readily return back the data. Both the data warehousing channels resort to columnar storage for enabling parallel processing. Here are some of the differentiating factors:
Instant scaling of data clusters is initiated in Redshift, via the API or the management console. Once processed, immediate changes are applied to the set of information. On encountering new clusters, Redshift changes the orientation of existing ones into Read-only mode. Data copying takes place only when the newer set of clusters are readily provisioned. The entire operation can take somewhere between a couple of hours and even days— based on the size of data clusters. For this entity, compute and storage nodes are grouped in as the node definition.
Figure 2: Detailed Analysis of an Approach using Azure Data Lake
The Azure SQL data lake can scale clusters in a matter of minutes. This delay is due to the separate processing of storage and compute units. Unlike Redshift, SQL DW can pause any operating compute with immediate effect. This minimizes the storage costs as in the paused state, compute units aren’t chargeable.
In Redshift, data can be easily integrated from the elastic map reduce, ‘Amazon S3’ storage, DynamoDB and a few more. For something called as ‘on-premises’ database, Redshift allows seamless integration to the file and then importing the same to S3. Better performances in terms of query can only be achieved via Re-Indexing.
Figure 3: Example of Data Storage, via Azure Blob Storage and Mirrored DC
For SQL DW, it’s the Azure Blob storage offering data integrations. The approach, however, is slightly similar to the Redshift when it comes to importing data from the server. File residing on the server is first exported into a text file which is then copied onto the Blob storage. DWs can then import the processed file using the syntax ‘CREATE EXTERNAL TABLE’ that is synonymous to the Polybase commands. The next command in the process is ‘CREATE TABLE…AS SELECT’. These steps will now import the data and re-create preferred indexes— suitable for optimal query performance.
Coming to Azure first, the SQL DW renders support for all the concepts including defined functions, stored procedures and indexes. Redshift, however, supports two formats of sort keys: interleaved and compound. The interleaved key is a default entity, giving equal precedence to each and every column — based on weight. Compound sort key brings in multiple columns — including the secondary ones. In case of broken links, the compound sort key helps with joining. However, the query performance drops when the secondary columns are working sans the primary ones. Enterprises need to re-index both the sort keys for optimal query performance as in the case of kik messenger for pc which is a great example of interleaved processing of data. The messenger is then rendered useful and can work across multiple environments.
Security and Backup
When it comes to data warehousing, data backup is sacred. Redshift allows backing up in the VPC (virtual private cloud). Data encryption is initiated during client access and even at rest. When active, data clusters are encrypted via SSL and during rest the modus operandi changes to the AES-256 i.e. a hardware accelerated entity. Amazon doesn’t allow us to access the compute nodes directly but via the leader node. S3 storage keeps backup using snapshots and this can be retained there for at least a day. The time period can extend up to 35 days at most. Azure still doesn’t support hosting over a virtual network but it does offer encryption at rest and even ‘over the wire’. Rest encryption occurs via ‘ALTER DATABASE <> SET ENCRYPTION ON’ syntax whereas the latter can be initiated, via the SSL. Backup to the blob storage is automatic in every 8 hours. The featured clusters of information can be retained by the blob for a period of 7 days.
Featured BI Tools
While Redshift supports integration for popular tools like Tableau, Azure SQL DW can work in cohesion with Power BI and even the Tableau. Redshift also offers support for ODBC and JDBC drivers. Bottom LineThe featured data warehousing techniques look great when the Business Intelligence Environment is considered. While the data lake takes the lead when it comes to scalability and compute decoupling, Redshift offers better security — enabling VPC hosting.
Opinions expressed by DZone contributors are their own.