How to Do a Snowflake Query Pushdown in Talend
In this article, see how Talend leverages Snowflake query pushdown via ELT.
Join the DZone community and get the full member experience.Join For Free
In a typical/traditional data warehouse solution, the data is read into ETL memory, processed/transformed in the memory before loading into the target database. With the growing data, the cost of compute is also increasing and hence it becomes vital to look for an alternate design.
Welcome to pushdown query processing. The basic idea of pushdown is that certain parts of SQL queries or the transformation logic can be "Pushed" to where the data resides in the form of generated SQL statements. So instead of bringing the data to processing logic, we take the logic to where data resides. This is very important for performance reasons.
You may also like: A Deep Dive Into Couchbase N1QL Query Optimization
Snowflake supports Query pushdown with v2.1 and later. This pushdown can help you transition from a traditional ETL process to a more flexible and powerful ELT model. In this blog, I will be showcasing how Talend leverages Snowflake query pushdown via ELT.
ETL vs ELT
Before we get into advance details, let's rejuvenate the basics. With traditional ETL, Extract Transform Load, the data is first extracted, transformed, and then loaded into the target like Snowflake. Here, most of the data transformation like filtering, sorting, aggregation, etc. takes place at ETL tool memory before loading it into the target.
With ELT, Extract Load and Transform, the data is first Extracted, then loaded and then data transformations are performed. With the ELT model, all the data is loaded into Snowflake and then the date transformations are performed directly in Snowflake.
Snowflake offers powerful SQL capabilities via query pushdown thereby enabling data transformation to a more effective ELT model. During development using ELT, it is possible to view the code as it will be executed by Snowflake.
Talend ELT Job Design
In Talend, there are native components to configure pushdown optimization. These components would convert the transformation logic to an SQL query and also send the query to the snowflake database. The Snowflake database runs the query. In Talend, query pushdown can be leveraged using ELT components tELTInput, tELTMap, and tELTOutput. These components are available under ELT -> Map -> DB JDBC
Let's take a quick look at these components
- tELTInput: This component adds input tables for the SQL statement. There is no restriction on the number of input tables. One can add as many Input tables as required by the SQL statement to be executed.
- tELTMap: this is the mapping component where the transformation is defined. This component uses the table(s) provided as input to feed the parameter in the built SQL statement. This component converts the transformation into SQL statements.
- tELTOutput: Carries out the action on the table specified along with the action on the data as specified according to the output schema defined.
Now, let's build a job to use these components and to utilize Snowflake query pushdown. I will explain it with an example. Let's assume that I have two tables in Snowflake named SALES and CITY. Sales table contains details of item sold, unit sold, sales channel (Online or offline) cost of unit, total revenue, total profit as per Region, country. City table is a dimension table which has Country code, population of country. Now the metric which I need to calculate is the total profit for online sales for each item at Region, country level. The result must be inserted into ONLINE_AGG table.
Now to this logic in ELT format, my job would look as given below:
Let's look at this job in more detail. As a best practice, I have used tPrejob to open the snowflake connection and tPostjob to close the connection. I have also used tDie to handle exceptions at various components. The next few section explains in detail the sections marked in the image above (A, B, C, and D).
After adding the input tables, I perform an INNER JOIN on the SALES and CITY table. As shown in the image below. This editor can also be used for providing additional where clause, group by clause and order by clause. In the example, I have given and performed the following transformation:
- where condition as PUBLIC.SALES.SALES_CHANNEL ='Online'
- as I am doing aggregation on Total_Profit, I have given group by on group by PUBLIC.SALES.REGION , SALES.COUNTRY , PUBLIC.SALES.ITEM_TYPE , PUBLIC.SALES.SALES_CHANNEL columns
- Null handling for column ITEM_TYPE
These transformations are highlighted in the image below.
Now, the beauty of this component is that as you write the transformation, the SQL gets generated. Click on the 'Generated SQL Select query for table2 output' to see the generated SQL
To validate the results, I copied this SQL and ran it in the Snowflake worksheet.
Now that the job design is completed, let's run the job. At runtime, you will see that the records are not bought into Talend memory.
Instead, the query is executed at Snowflake.
To confirm the execution, let's query the history at Snowflake.
Expanded view of the query executed
In this article, we saw how we could leverage the power of query pushdown with Talend while working with Snowflake. This job design method enables high utilization of snowflake clusters for processing data. Well, that's all for now. Keep watching this space for more blogs, and until then, happy reading!
Published at DZone with permission of Rekha Sree, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.