Microsoft Azure Synapse Analytics: Scaling Hurdles and Limitations
Azure Synapse faces serious challenges limiting its use in the Enterprise Data space, impacting performance and functionality.
Azure Synapse Analytics is a strong tool for processing large amounts of data, but it has some scaling challenges that can slow things down as your data grows, along with a few built-in restrictions that can limit what you’re able to do and affect both performance and overall functionality. So, while Synapse is powerful, it’s important to be aware of these potential roadblocks as you plan your projects.
Data Distribution and Skew
Data skew remains a significant performance bottleneck in Synapse Analytics. Poor distribution key selection can lead to:
- 80-90% of data concentrated on 10% of nodes
- Hotspots during query execution
- Excessive data movement via TempDB
You can check for data skew by examining how rows spread across the 60 distributions, identified by `distribution_id` (only at maximum scale do distributions map 1:1 to compute nodes).
Note that `distribution_id` is not a queryable column on user tables; the simplest documented way to see per-distribution row counts is `DBCC PDW_SHOWSPACEUSED`:

```sql
-- Returns one row per distribution for the table,
-- including ROW_COUNT and DISTRIBUTION_ID columns.
DBCC PDW_SHOWSPACEUSED('dbo.table_name');
```
If a few `distribution_id` values show a much higher `row_count` than others, this indicates skew.
To mitigate this:
- Use high-cardinality columns as the distribution key for even distribution
- Monitor skew using `DBCC PDW_SHOWSPACEUSED`
- Redistribute tables with `CREATE TABLE AS SELECT` (CTAS)
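The CTAS redistribution in the last bullet can be sketched as follows; the table name `dbo.FactSales` and the distribution key `OrderId` are hypothetical placeholders for your own skewed table and a higher-cardinality column:

```sql
-- Rebuild the table distributed on a higher-cardinality key via CTAS,
-- then swap the old and new tables with a rename.
CREATE TABLE dbo.FactSales_new
WITH (DISTRIBUTION = HASH(OrderId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

RENAME OBJECT dbo.FactSales     TO FactSales_old;
RENAME OBJECT dbo.FactSales_new TO FactSales;
```

CTAS is a fully parallelized operation in dedicated SQL pools, so rebuilding the table this way is usually far faster than row-by-row reloads.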
Resource Management and Scaling
1. SQL Pools
You do not have any control over the built-in serverless pool configuration. For a dedicated SQL pool, the defaults are:
- Maximum DWU: Gen1: DW6000, Gen2: DW30000c
- Scaling requires manual intervention using SQL commands
To manually scale your dedicated SQL pool, use the following `ALTER DATABASE` command:

```sql
ALTER DATABASE [your_database]
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');
```
When you scale a Synapse pool, it goes into “Scaling” mode for a little while, and once it’s done, it switches back to “Online” and is ready to use.
Key Points
- Scaling is not automatic, so you must run the command yourself.
- The SQL pool must be online to scale.
- You can also scale using PowerShell or the Azure portal, but the SQL command is a direct way to do it.
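One way to check from T-SQL whether the scale operation has finished is the database status property (the database name below is a placeholder):

```sql
-- Returns 'Scaling' while the operation runs and 'Online' when complete.
SELECT DATABASEPROPERTYEX('your_database', 'Status') AS pool_status;
```

This is handy in scripts that need to wait for the pool to come back online before submitting queries.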
2. Apache Spark Pools
- Scale-up triggers if resource utilization exceeds capacity for 1 minute.
- Scale-down requires 2 minutes of underutilization.
3. Integration Runtimes
- Manual scaling is done through the Azure portal, not from the Synapse workspace.
4. Concurrency Limits
- Maximum of 128 concurrent queries; excess queries are queued.
- Concurrent open sessions: 1024 for DWU1000c and higher, 512 for DWU500c and lower.
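You can watch the concurrency queue with a dedicated-pool DMV; queued requests show up with a `Suspended` status. A minimal monitoring sketch:

```sql
-- Requests waiting in the concurrency queue appear as 'Suspended'.
SELECT request_id, status, submit_time, command
FROM sys.dm_pdw_exec_requests
WHERE status = 'Suspended'
ORDER BY submit_time;
```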
Query and Data Limitations
1. SQL Feature Gaps
- No support for triggers, cross-database queries, or geospatial data types
- Limited use of expressions such as `GETDATE()` or `SUSER_SNAME()`
- No `FOR XML`/`FOR JSON` clauses or cursor support
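Because cursors are unsupported, row-by-row logic usually needs a set-based rewrite. A hypothetical example (table and column names are made up): instead of looping over customers with a cursor to flag heavy buyers, a single set-based `UPDATE` does the same work:

```sql
-- Set-based alternative to a cursor loop: flag every customer
-- with more than 100 orders in one statement.
UPDATE dbo.Customers
SET IsHighVolume = 1
WHERE CustomerId IN (
    SELECT CustomerId
    FROM dbo.Orders
    GROUP BY CustomerId
    HAVING COUNT(*) > 100
);
```

Set-based statements also parallelize across distributions, which a cursor never would.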
2. Data Size Restrictions
- Source table row size limited to 7,500 bytes for Azure Synapse Link for SQL
- LOB data > 1 MB not supported in initial snapshots for certain data types
3. Query Constraints
- Maximum 4,096 columns per row in SELECT results
- Up to 32 nested subqueries in a SELECT statement
- JOIN limited to 1,024 columns
4. View Limitations
- Maximum of 1,024 columns in a view. If you have more columns, the view must be restructured.
Error: CREATE TABLE failed because column 'VolumeLable' in table 'QTable' exceeds the maximum of 1024 columns.
To get around this, break your view up into a few smaller ones, each within the 1,024-column limit, and include the key column in each so they can be joined back together. For example:

```sql
-- First view: the key plus columns 1 to 1023
CREATE VIEW dbo.BigTable_Part1 AS
SELECT PrimaryKey, col1, col2, ..., col1023
FROM dbo.BigTable;

-- Second view: the key plus the remaining columns
CREATE VIEW dbo.BigTable_Part2 AS
SELECT PrimaryKey, col1024, col1025, ..., col1100
FROM dbo.BigTable;

-- Combine the views
SELECT *
FROM dbo.BigTable_Part1 p1
JOIN dbo.BigTable_Part2 p2
    ON p1.PrimaryKey = p2.PrimaryKey;
```
Limited Data Format Support
- ORC and Avro, common file formats in enterprise data, are not supported. Moving to Parquet or Delta Lake format is recommended.
- Synapse integrates a very old version of Delta Lake, which does not support critical features like column mapping and column renaming.
Access Limitations
When you try to set up Azure Synapse Link for SQL, you might run into an error if the database owner isn’t linked to a valid login. Basically, the system needs the database owner to be tied to a real user account to work properly. If it’s not, Synapse Link can’t connect and throws an error.
Workaround
To fix this, make sure the database owner is set to a real user that actually has a login. You can do this with a quick command:

```sql
ALTER AUTHORIZATION ON DATABASE::[YourDatabaseName] TO [ValidLogin];
```

Replace `YourDatabaseName` with your actual database name and `ValidLogin` with the name of a valid server-level login. This command changes the ownership of the database to the specified login, ensuring that the database owner is properly mapped and authenticated.
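You can confirm whether the owner is properly mapped before (or after) changing it with a quick join against the principal catalogs:

```sql
-- A NULL login_name for dbo indicates the orphaned-owner problem.
SELECT dp.name AS owner_name, sp.name AS login_name
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
    ON dp.sid = sp.sid
WHERE dp.name = 'dbo';
```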
Performance Optimization Challenges
1. Indexing Issues
- Clustered Columnstore Index (CCI) degradation due to frequent updates or low memory
- Outdated statistics leading to suboptimal query plans
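Both issues are addressed with routine maintenance. A minimal sketch, assuming a hypothetical table `dbo.FactSales`:

```sql
-- Rebuild the clustered columnstore index to re-compress
-- open and fragmented rowgroups after heavy updates.
ALTER INDEX ALL ON dbo.FactSales REBUILD;

-- Refresh statistics so the optimizer sees current row counts.
UPDATE STATISTICS dbo.FactSales;
```

Scheduling these after large loads keeps CCI compression healthy and query plans accurate.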
2. TempDB Pressure
- Data movement from skew or incompatible joins can quickly fill TempDB
- Maximum TempDB size: 399 GB per DW100c
3. IDENTITY Column Behavior
- Values are generated independently across the 60 distributions, so they are unique but non-sequential
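A minimal sketch (table and column names are hypothetical): the `IDENTITY` values below stay unique, but because each of the 60 distributions allocates from its own range, you cannot rely on them being contiguous or ordered by insert time:

```sql
-- OrderId values will be unique but not sequential,
-- since each distribution draws from its own number range.
CREATE TABLE dbo.Orders
(
    OrderId    INT IDENTITY(1, 1) NOT NULL,
    CustomerId INT NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId));
```

If you need a strictly sequential surrogate key, it has to be assigned outside the pool or computed after load.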
Backup and Recovery Limitations
- No offline .BAK or .BACPAC backups with data
- Limited to 7-day retention or creating database copies (incurring costs)
Conclusion
Azure Synapse Analytics is a powerful tool for handling big data, but it’s not without its quirks. You’ll run into some scaling headaches and built-in limits that can slow things down or make certain tasks tricky. To get the best performance, you’ve got to be smart about how you distribute your data, manage resources, and optimize your queries.
Keeping an eye on things and tuning regularly helps avoid bottlenecks and keeps everything running smoothly. Basically, it’s great — but you’ll need to work around some bumps along the way to make it really shine.
Published at DZone with permission of Vamshidhar Morusu. See the original article here.