Data Exploration Using Serverless SQL Pool In Azure Synapse

In this detailed article, you will learn how to carry out data ingestion and analysis using Azure Synapse Serverless SQL Pool along with sample datasets.

By Mohan Krishna Mangamuri · May. 31, 23 · Tutorial

Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, is a cloud-based analytics service provided by Microsoft Azure. It provides capabilities such as data exploration, data analysis, data integration, advanced analytics, and machine learning on top of Azure Data Lake Storage.

In this article, we will take a deep dive into ingesting data files into Azure Synapse and then analyzing and transforming those files using a serverless SQL pool. The article assumes a basic understanding of Azure fundamentals. Open your Synapse workspace and have it ready. Let's go.

Ingest/View Files in the Azure Data Lake

1. On the left side of Synapse Studio, use the >> icon to expand the menu. This reveals the different pages within Synapse Studio that we'll use to manage resources and perform data analytics tasks.

2. On the Data page, view the Linked tab and verify that your workspace includes a link to your Azure Data Lake Storage Gen2 storage account.

Data page

3. For this example, we are going to use the free CSV files from NZ Govt Stats.

4. Under Azure Data Lake Storage Gen2, you should see the default storage account that was provisioned with your workspace. Create a files/csv folder in it and upload the .csv files from the link above. For this example, we have uploaded the 'Business-employment-data-september-2022-quarter-csv.csv' file.

5. Once you have uploaded a .csv file, right-click the csv folder, and then in the 'New SQL script' list on the toolbar, select 'Select TOP 100 rows.' Review the Transact-SQL code that is generated.

SQL

-- This is auto-generated code
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/csv/Business-employment-data-september-2022-quarter-csv.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]

6. The above code uses the OPENROWSET function to read data from the CSV file in the files folder and retrieve the first 100 rows. In the 'Connect to' list, ensure Built-in is selected; it represents the built-in serverless SQL pool that was created with the workspace. (In the real world, the administration team at your organization would have created a named SQL pool in your workspace.) Then, on the toolbar, use the Run button to run the SQL code and review the results, which should look like this:

review the results

In this case, the data file includes the column names in the first row, so modify the query to add a HEADER_ROW = TRUE parameter to the OPENROWSET function, as shown here (don't forget to add a comma after the previous parameter):

SQL

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/csv/Business-employment-data-september-2022-quarter-csv.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [result]


Run the above query, and you should see the results below.

query results

You can publish the script using the "Publish" button on the toolbar to save it.
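By default, the serverless SQL pool infers column names and types from the file. If you want to control the schema explicitly, OPENROWSET also accepts a WITH clause that defines the columns to return. A minimal sketch is below; the two column names come from the dataset as referenced later in this article, while the data types are assumptions rather than values taken from the source file.

SQL

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakexxxxxxx.dfs.core.windows.net/files/csv/Business-employment-data-september-2022-quarter-csv.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        -- Column names match the CSV header row; the types are illustrative assumptions
        Series_reference VARCHAR(50),
        Period VARCHAR(10)
    ) AS [result]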

Transform Data Using CREATE EXTERNAL TABLE AS SELECT Statements

A simple way to use SQL to transform data in a file and persist the results in another file is to use a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement. This statement creates a table based on the results of a query, but the data for the table is stored as files in a data lake. The transformed data can then be queried through the external table or accessed directly in the file system (for example, for inclusion in a downstream process to load the transformed data into a data warehouse).
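The general shape of a CETAS statement in a serverless SQL pool is sketched below. The object names here are placeholders (not part of this article's example), and the statement assumes an external data source and file format have already been created in the current database.

SQL

-- Generic CETAS sketch with placeholder names
CREATE EXTERNAL TABLE TransformedData
WITH (
    LOCATION = 'transformed/',          -- folder (relative to the data source) where result files are written
    DATA_SOURCE = my_data_source,       -- an existing external data source
    FILE_FORMAT = my_parquet_format     -- an existing external file format
)
AS
SELECT SomeColumn, COUNT(*) AS Row_count
FROM
    OPENROWSET(
        BULK 'raw/*.csv',
        DATA_SOURCE = 'my_data_source',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS src
GROUP BY SomeColumn;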

Create an External Data Source and File Format

By defining an external data source in a database, you can use it to reference the data lake location where you want to store files for external tables. An external file format lets you define the format for those files, such as Parquet or CSV. To use these objects with external tables, you need to create them in a database other than the default master database.

1. In Synapse Studio, on the Develop page, in the '+' menu, select SQL script.

Develop

2. In the new script pane, add the following code (replacing datalakexxxxxxx with the name of your data lake storage account) to create a new database and add an external data source to it.

SQL

-- Database for Employment data
CREATE DATABASE Employment
    COLLATE Latin1_General_100_BIN2_UTF8;
GO

USE Employment;
GO

-- External data is in the files container in the data lake.
-- The data source points at the container root so that BULK paths can be relative to it.
CREATE EXTERNAL DATA SOURCE employment_data WITH (
    LOCATION = 'https://datalakexxxxxxx.dfs.core.windows.net/files/'
);
GO

-- Format for table files
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO

Modify the script properties to change its name to Create Employment DB, and publish it. Ensure that the script is connected to the built-in SQL pool and the master database, and then run it. Switch back to the Data page and use the ↻ button at the top right of Synapse Studio to refresh the page. Then view the Workspace tab in the Data pane, where a SQL database list is now displayed. Expand this list to verify that the Employment database has been created.

Synapse Studio

Expand the Employment database, its External Resources folder, and the External data sources folder under that to see the employment_data external data source you created.

Expand the Employment database
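If you prefer to verify these objects from a script rather than the UI, the standard catalog views should list what you just created. A quick sketch:

SQL

USE Employment;
GO

-- Confirm the external data source and file format exist in this database
SELECT name, location FROM sys.external_data_sources;
SELECT name, format_type FROM sys.external_file_formats;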

In the Properties pane, you can name the script as shown below and publish it. Publishing saves the script so that you can come back and edit it later.

script

Create an External Table

In Synapse Studio, on the Develop page, in the '+' menu, select SQL script, and add the following code to retrieve and aggregate data from the CSV file by using the external data source. Note that the BULK path is now relative to the folder location (the files folder) on which the data source is defined.

SQL

USE Employment;
GO

SELECT COUNT(Series_reference) AS Series_count,
       Period AS period
FROM
    OPENROWSET(
        BULK 'csv/Business*.csv',
        DATA_SOURCE = 'employment_data',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS series_count
GROUP BY Period;


Run the script, and the results should show as below.

script result

Modify the SQL code to save the results of the query in an external table like this:

SQL

USE Employment;
GO

CREATE EXTERNAL TABLE Series_by_period
WITH (
    -- Folder is relative to the employment_data data source (the files container)
    LOCATION = 'SeriesbyPeriod/',
    DATA_SOURCE = employment_data,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT COUNT(Series_reference) AS Series_count,
       Period AS period
FROM
    OPENROWSET(
        BULK 'csv/Business*.csv',
        DATA_SOURCE = 'employment_data',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS employment
GROUP BY Period;


Run the script. This time there's no output, but the code should have created an external table based on the results of the query. Name the script Create SeriesByPeriod Table and publish it.

Create SeriesByPeriod Table
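One note if you need to re-run the script above: CREATE EXTERNAL TABLE AS SELECT will fail if the Series_by_period table already exists, so drop it first. The Parquet files already written to the SeriesbyPeriod folder also need to be deleted separately in the data lake, since dropping the external table removes only its metadata. A minimal cleanup sketch:

SQL

USE Employment;
GO

-- Removes the table definition only; the files in the data lake are not deleted
DROP EXTERNAL TABLE Series_by_period;
GO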

On the Data page, in the Workspace tab, open the menu for the Series_by_period table and select New SQL script > Select TOP 100 rows. Then run the resulting script and verify that it returns the aggregated Series count by Period data.
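The generated script should be roughly equivalent to the following (assuming the table was created in the default dbo schema):

SQL

USE Employment;
GO

SELECT TOP 100 *
FROM dbo.Series_by_period;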

Series_by_period

Employment Period data

On the files tab containing the file system for your data lake, view the contents of the files folder (refreshing the view if necessary) and verify that a new SeriesbyPeriod folder has been created.

SeriesbyPeriod

In the SeriesbyPeriod folder, observe that one or more files with names similar to F5C***.parquet have been created. These files contain the aggregated Series by Period data.

SeriesbyPeriod folder

To verify this, you can select one of the files and use the New SQL script > Select TOP 100 rows menu to query it directly.

SQL

-- This is auto-generated code
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://datalakerg2vh64.dfs.core.windows.net/files/SeriesbyPeriod/F5C90A74-1FA5-4F41-919D-BC5C09F2B2F0_24_0-1.parquet',
        FORMAT = 'PARQUET'
    ) AS [result]


query executed successfully


Conclusion

We have demonstrated how to ingest data into Azure Data Lake Storage Gen2 and how to perform data analysis and transformation activities in Azure Synapse Studio using a serverless SQL pool. This is useful for data analysts performing exploratory analysis, developers checking their code before automating it, architects and data engineers exploring datasets and building data pipelines, and test engineers validating data.

Tags: Azure Data Lake, Data Warehouse, Azure SQL

Opinions expressed by DZone contributors are their own.
