When to Use Row or Page Compression in SQL Server
When to Use Row or Page Compression in SQL Server
The internet is littered with posts describing row and page compression, how they work, the performance gains, etc. But few posts clarify row vs. page compression.
Join the DZone community and get the full member experience.Join For Free
RavenDB vs MongoDB: Which is Better? This White Paper compares the two leading NoSQL Document Databases on 9 features to find out which is the best solution for your next project.
Introduced with SQL Server 2008, page and row compression are turning ten years old this year. In that time, the internet became littered with posts describing both features, how they work, the performance gains, etc.
Despite digesting all of that information, a colleague asked me a very simple question those posts did not answer:
"How do I know to use row or page compression in SQL Server?"
That's what this post is for: to help provide some clarity on row versus page compression. You're welcome.
Types of Data Compression in SQL Server
Row and page compression are dirty secrets among many SQL consultants. I know more than one person who walked into a shop, enabled compression, showed a performance gain on a critical report, and walked out with a fat contract that same day. I'm not writing this post to out them or their secret. I'm writing to help you get a fat contract, or a fat bonus, by understanding compression a bit better.
Let's start with the basics. SQL Server 2017 has the following compression features:
The COMPRESS function uses GZIP to store data in a compressed format. Columnstore indexes store compressed data by default and also offer archival compression using XPRESS. We aren't going to talk about COMPRESS or Columnstore today. Instead, we will focus on row and page compression, because that's the question asked. But I thought you should know what types of data compression are available in SQL Server 2017.
How They Work
The first thing to understand is that row compression is a subset of page compression. Put another way, you cannot have page compression without first having row compression performed. When you go to enable page compression, the engine will do the following actions in this order:
Row compression is defined here. The simple explanation is that it take fixed-length columns and makes them variable length, adding additional bytes for the overhead of tracking the changes being made. The link provided has a table that references the savings for the datatypes used. It's interesting reading. And by "interesting," I mean "it won't matter for my summary later."
Prefix and dictionary compression come next, as part of page compression. You can read all the details here. The simple explanation is that the process looks for repeated patterns. First it looks for repeated values by column (that's the prefix part), then it looks for patterns on the entire page (that's the dictionary part). The link provided has the details and is more interesting than the row compression article for reasons that will be apparent soon-ish, I promise.
How to Enable Row or Page Compression in SQL Server
Enabling row or page compression on a table is easy. (However, enabling row or page compression for a database is not easy, because you have to enable it for each object manually. For large schemas, this can be a PITA and make you wish for the days of Microsoft Access and the Compact and Repair Database option. LOL, no, it doesn't. Nothing makes anyone wish for Access. #justsayin.)
Let's set up a test table and enable compression. Here's the example I've been using in my session Database Design: Size DOES Matter! for years. In that session, I use sys.messages to populate a table:
CREATE TABLE [dbo].[TestCompression]( [m_id] [int] NULL, [text] [nvarchar](3000) NULL ) ON [PRIMARY] INSERT INTO [dbo].[TestCompression] SELECT message_id, text from sys.messages
After the table is created, we estimate the space savings for row compression like this:
EXEC sp_estimate_data_compression_savings 'dbo', 'TestCompression', NULL, NULL, 'ROW' ; GO
Here's my result set:
The results show an expected gain of 54.8% (taking the column size_with_requested_compression_setting (KB) and dividing by the column size_with_current_compression_setting (KB), or 31880/58112). I will leave estimating the space savings for page compression as an exercise for the reader. Besides, I don't want to spoil the surprise ending.
One last item before we enable row compression for this table. I want to know how many logical reads it takes to return all the rows. I will use STATISTICSIO to get this information:
SET STATISTICS IO ON SELECT m_id, text FROM dbo.TestCompression SET STATISTICS IO OFF
This returns the following message (formatted to fit on this page):
(232056 rows affected) Table 'TestCompression'. Scan count 1, logical reads 7226, physical reads 0, read-ahead reads 7172, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
We now enable row compression with the following statement:
ALTER TABLE dbo.TestCompression REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = ROW) GO
And now rerun the SELECT, and we observe the following message returned with the results:
(232056 rows affected) Table 'TestCompression'. Scan count 1, logical reads 3982, physical reads 0, read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
So, we have 55% fewer logical reads (3982/7226). Similar to the expected savings estimate of 54.8% found earlier.
(I love it when I check my work and the answers match.)
When to Use Row or Page Compression in SQL Server
OK, we've reviewed details on compression in SQL Server, let's answer the original question: "How do I know to use row or page compression in SQL Server?"
The MSDN article I referenced and linked to at the beginning of the article has a wonderful summary for this purpose. If you are too lazy to scroll up, I will link to it again for you here. If you are too lazy to read the article, let me repeat the pertinent details for you now:
A more detailed approach to deciding what to compress involves analyzing the workload characteristics for each table and index. It is based on the following two metrics:U: The percentage of update operations on a specific table, index, or partition, relative to total operations on that object. The lower the value of U (that is, the table, index, or partition is infrequently updated), the better candidate it is for page compression. S: The percentage of scan operations on a table, index, or partition, relative to total operations on that object. The higher the value of S (that is, the table, index, or partition is mostly scanned), the better candidate it is for page compression.
That sounds great except we don't know the percentage of update and scan operations. Yet.
Lucky for us the whitepaper includes the T-SQL needed to determine such metrics. For the U-value, the percentage of updates, we are given this code:
/* To compute U, use the statistics in the DMV sys.dm_db_index_operational_stats. U is the ratio (expressed in percent) of updates performed on a table or index to the sum of all operations (scans + DMLs + lookups) on that table or index. The following query reports U for each table and index in the database. */ SELECT o.name AS [Table_Name], x.name AS [Index_Name], i.partition_number AS [Partition], i.index_id AS [Index_ID], x.type_desc AS [Index_Type], i.leaf_update_count * 100.0 / (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count ) AS [Percent_Update] FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i JOIN sys.objects o ON o.object_id = i.object_id JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id WHERE (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) != 0 AND objectproperty(i.object_id,'IsUserTable') = 1 ORDER BY [Percent_Update] ASC
For S, the percentage of scans, we are given this code:
/* To compute S, use the statistics in the DMV sys.dm_db_index_operational_stats. S is the ratio (expressed in percent) of scans performed on a table or index to the sum of all operations (scans + DMLs + lookups) on that table or index. In other words, S represents how heavily the table or index is scanned. The following query reports S for each table, index, and partition in the database. */ SELECT o.name AS [Table_Name], x.name AS [Index_Name], i.partition_number AS [Partition], i.index_id AS [Index_ID], x.type_desc AS [Index_Type], i.range_scan_count * 100.0 / (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + i.leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count ) AS [Percent_Scan] FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i JOIN sys.objects o ON o.object_id = i.object_id JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id WHERE (i.range_scan_count + i.leaf_insert_count + i.leaf_delete_count + leaf_update_count + i.leaf_page_merge_count + i.singleton_lookup_count) != 0 AND objectproperty(i.object_id,'IsUserTable') = 1 ORDER BY [Percent_Scan] DESC
OK, let's put it all together.
All of the above information can be found on hundreds of blog posts over the past ten years. Here's the part that I haven't found mentioned anywhere else.
When your table has few duplicate values, as our test table does here, row compression is the right answer no matter what your U-value or S-value. The reason for this is detailed above. Page compression looks for repeated patterns. Therefore, if your data does not have repeated patterns, you don't get much extra benefit from page compression. You will see extra CPU utilization, but probably not much of a performance gain to make it worthwhile. Our test table has few repeated patterns, therefore page compression is likely to be more overhead than what is needed.
(Of course, you should test everything and not just listen to me ramble. Oh, and now would be a good time to do that exercise I left for you above.)
And that's the decision point I want you to take away from this post. When you need to decide to use either row or page compression look at the data and determine if it is likely to have repeated values or not. If yes, then it is a candidate for page compression provided the U and S-values are OK. If there are few repeated values, then row compression is likely to be sufficient.
(BONUS: I once wrote a post on how to find the objects in your database that store the fewest number of rows per page. It may be relevant for your interests here, too. Have a look.)
Published at DZone with permission of Thomas LaRock , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.