Redefining Data Warehousing With Azure Data Lake for Dummies
Redefining Data Warehousing With Azure Data Lake for Dummies
Comparing the features of Microsoft's Azure Data Lake to Amazon Redshift, and how Microsoft's query language operates as a mix of C# and SQL.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Data warehousing has taken center stage in the era of cloud computing. With several tech giants clamoring for the online visibility, it’s time we start looking at the options at hand concerning analytics and storage. Microsoft has made a move in the positive direction with ‘Azure Data Lake’, a typical data warehousing tool which readily focuses on the analysis of non-structured and structured data. Amidst the popularity of DW services, it is only appropriate that we take a closer look at the emerging concept of Data Lake which has been tough competition for Amazon’s Redshift.
Figure 1: Sneak Peek into the New Data Lake Services
Data Lake is more of a cloud-based offering from Microsoft, which is extremely scalable and cost-effective. Three major components constitute the hierarchy of this tool: HDInsight, Data Lake Analytics, and the Data Lake Store. Again, Data Lake boasts a pretty sizeable analytics section — comprising of HDInsight and Data Lake Analytics. Moreover, there is a preview section, showcasing Data Analytics Store and the Data Lake Store.
Figure 2: Describing the World of Possibilities
Companies can also integrate the same with ‘Visual Studio’— adding new and improved UI enhancements to the otherwise bland interface of Data Lake. Another feature is available with this tool which helps in determining the modus operandi of code execution. This can be visualized with the 30 second ‘video playback’, irrespective of the code length. Job queries are displayed along with the time taken by each. This feature helps fine tune the nature of queries and zero in upon the ones which are hard to initiate.
Most importantly, Azure Data Lake is highly cost-effective as you need to pay only for the provisioned services. No upfront costs are involved, and prices are incurred on the basis of provisioned nodes. Computing time is also taken into account while calculating the costs. For code to be executed, nodes are assigned based on a ‘degree of parallelism’. Moreover, Azure Data Lake only charges you for the working nodes and also the time under process. That said, even if the functional nodes aren’t working, charges will still be incurred. A clearer picture is available in Visual Studio, which is also capable of providing detailed information and the user can then determine the unutilized nodes. The user interface offered by Visual Studio will therefore help revisit the total costs as users can bring in or push out the nodes, recalculating the query time and approximating the overall costs.
The three components of Azure Data Lake have been previously mentioned. However, this data warehousing tool works on Apache YARN — helping distribute the queries and their execution. Many data warehousing tools are available over the cloud and can be integrated using the smartphones. However, premium services like Data Lake and other tools from Microsoft can only work if the gadgets are rooted and offer complete functionality. Other than smartphones attached to the cloud, multiple databases can also make use of the Data Lake Store which is communicated to using WebHDFS. This entity offers REST APIs, helping users retrieve data from the source. Be it unstructured or structured data, the Data Lake Store can accommodate almost anything.
Figure 3: Data Lake Architecture in Detail
Decoding the ‘Data Lake Store’
Here comes the storage layer for your Data Lake, accessible to analytics and HDInsight. As per the explained architecture, this unit uses ‘WebHDFS REST APIs’ in order to retrieve data. Unlimited storage is available and it offers petabyte scale functionality, similar to Redshift. When the data is processed in parallel, Data Lake Store distributes it into storage servers. The read operations are therefore improved. Azure AD is integrated for seamless authentication, rendering features like multi-factor and role-based security checks.
Associated UI works like a charm and upon creating an account, users need to click on ‘Data Explorer’. This helps them navigate through the user interface and upload files. Once uploaded, it is easy to preview the data over the web and even in the file. Lastly, copy the URL for handling analytics.
Figure 4: Differentiating between the usual data warehousing techniques and the Data Lake
Decoding ‘Data Lake Analytics’
Microsoft brings in U-SQL with data lake analytics. Note that analytics is still in preview and can support Data Lake Store and even the Azure storage. Security is handled by Azure AD and the new query is created over the portal using the designated UI. Parallelism is often taken into account while considering compute nodes that can handle a given query.
Insights into U-SQL
This is a query language from Microsoft, dedicated to big data. It is a proper mix of C# and SQL. Firstly, if you are talking about WHERE conditions, this language follows C# in terms of syntax. To be precise, the comparison operator will be like a= = b.
U-SQL queries comprise of three steps:
- Acquiring data from source which can be anything between Azure SQL databases and u-SQL tables.
- Transformation needs to applied to the retrieved rows.
- Save the processed result onto the file.
“DECLARE @in string = “/TestData /SalesTrans.txt”; DECLARE @out string = “/result /results.tsv”; @salesTrans = EXTRACT SalesId int, Region string, Units int?, EmployeeId USING Extractors.Tsv (); OUTPUT @salesTrans TO @out USING Outputters.Tsv (); “
The ‘DECLARE’ Clause is case-sensitive and ‘EXTRACT’ reads data from files. The given schema features a nullable column, including ‘?’ that resembles C# in every aspect. This piece of code can easily copy processed data into the desired output file.
Figure 5: Microsoft’s take of the functionality of Azure Data Lake
Azure Data Lake comes as a new yet tested DW tool, offering exclusives like playback video and query performance identifiers. The interactive UI, in unison with Visual Studio helps determine the functional nodes and scale them accordingly, as per requirements. Costs can therefore be modified as node spin-off is possible. Finally, it is possible to check out the transition of your project, using Preview for Data Lake Analytics and Data Lake Store.
Opinions expressed by DZone contributors are their own.