An Intro to Correlated Datetime Columns
An Intro to Correlated Datetime Columns
Although it's a situational tool, Correlated Datetime Columns in SQL Server can lend a boost to your performance when dealing with datetime columns.
Join the DZone community and get the full member experience.Join For Free
Built by the engineers behind Netezza and the technology behind Amazon Redshift, AnzoGraph™ is a native, Massively Parallel Processing (MPP) distributed Graph OLAP (GOLAP) database that executes queries more than 100x faster than other vendors.
SQL Server is a deep and complex product. There’s always more to learn. For example, I had never heard of Correlated Datetime Columns. They were evidently introduced as a database option in SQL Server 2005 to help support data warehousing style queries (frequently using dates and times as join criteria or filter criteria). You can read up on the concept here from this older article from 2008 on MSDN. However, doing a search online I didn’t find much else explaining how this stuff worked (one article here, that didn’t break this down in a way I could easily understand). Time for me to get my learn on.
The concept is simple, turning this on for your database means that dates which have a relationship, the example from MSDN uses OrderDate and DueDate from the Purchasing.PurchaseOrderHeader and Purchasing.PurchaseOrderDetail tables respectively. Clearly, the DueDate would be near the OrderDate in general terms. Cool. Makes sense. Let’s see how it works, if it works.
Correlated Datetime Columns at Work
I’m going to use the example code from the Microsoft article, partly because there are aspects of it I don’t understand and exploring it will help me learn. Let’s start with the (slightly modified) base query that we hope will benefit from the Correlated Datetime setting:
SELECT * FROM Purchasing.PurchaseOrderHeader AS poh JOIN Purchasing.PurchaseOrderDetail AS pod ON poh.PurchaseOrderID = pod.PurchaseOrderID WHERE poh.OrderDate BETWEEN '20130801' AND '20130901';
The changes I made were the aliases and the values in the WHERE clause. I left everything else the same. Performance on this was 115ms on average with 59 reads. This is the execution plan:
Click to embiggen.
The missing index suggestion is on OrderDate using an INCLUDE of all the columns (the joy of SELECT * and Missing Index Hints, I should post a rant on Missing Index Hints at some point). The plan itself uses two Clustered Index Scan operations to retrieve the data. Since the scans are Ordered (check the properties in each operator to validate this, but the lack of a Sort operator is also a good hint), a Merge Join was used to put the data from the two tables together.
The example code drops and recreates the clustered index on the PurchaseOrderDetail table because one of the tables must have a clustered index with a datetime column as the first (or only) column in the key for Correlated Datetime Columns. However, I like to see how things behave. Let’s first create a non-clustered index on PurchaseOrderDetail.
CREATE INDEX IX_PurchaseOrderDetail_DueDate ON Purchasing.PurchaseOrderDetail(DueDate); GO
I don’t expect this to do anything for the above query, but I want to test each step. The non-clustered index doesn’t change anything. The execution plan and performance are the same as above. Fine. Let’s enable Correlated Datetime Columns:
ALTER DATABASE AdventureWorks2014 SET DATE_CORRELATION_OPTIMIZATION ON; GO
Running the query again, I can’t detect any differences. The execution plans are identical and the performance is the same. In case the plan wasn’t removed from cache because of the database setting change, I sure would expect it to be, but I could be wrong, I’m going to remove it from the cache:
DECLARE @PlanHandle VARBINARY(64); SELECT @PlanHandle = deps.plan_handle FROM sys.dm_exec_procedure_stats AS deps CROSS APPLY sys.dm_exec_sql_text(deps.sql_handle) AS dest WHERE dest.text LIKE 'SELECT * FROM Purchasing.PurchaseOrderHeader AS poh%'; IF @PlanHandle IS NOT NULL BEGIN DBCC FREEPROCCACHE(@PlanHandle); END GO
Nope. It’s not because of a cached plan. The Correlated Datetime Column setting just isn’t digging how I’ve got the clustered index set up. However, let’s also test this. I’ll create an index on the PurchaseOrderHeader.OrderDate as well:
CREATE INDEX IX_PurchaseOrderHeader_OrderDate ON Purchasing.PurchaseOrderHeader (OrderDate); GO
Nope. No joy. All right. Clean up the test indexes and follow Microsoft’s lead:
DROP INDEX IX_PurchaseOrderDetail_DueDate ON Purchasing.PurchaseOrderDetail; GO DROP INDEX IX_PurchaseOrderHeader_OrderDate ON Purchasing.PurchaseOrderHeader; GO CREATE UNIQUE NONCLUSTERED INDEX IX_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID ON Purchasing.PurchaseOrderDetail(PurchaseOrderID,PurchaseOrderDetailID); GO ALTER TABLE Purchasing.PurchaseOrderDetail DROP CONSTRAINT PK_PurchaseOrderDetail_PurchaseOrderID_PurchaseOrderDetailID; GO CREATE CLUSTERED INDEX IX_PurchaseOrderDetail_DueDate ON Purchasing.PurchaseOrderDetail(DueDate); GO
Rerunning the original query, I immediately see a difference. The reads dropped from 59 to 57. The execution time is a little better at 105ms. The execution plan is very different:
We’re making use of that clustered index, and it’s changed to a Hash Join. Let’s see the predicate for the Seek operation:
Seek Keys: Start: [AdventureWorks2014].[Purchasing].[PurchaseOrderDetail].DueDate >= Scalar Operator(‘2013-07-07 00:00:00.000’), End: [AdventureWorks2014].[Purchasing].[PurchaseOrderDetail].DueDate < Scalar Operator(‘2013-10-05 00:00:00.000’)
So, the optimizer has picked a range of values that are near to the values that have been passed in, assuming a correlation between the data. According to the documentation, the optimizer also ensures that using this correlation will still result in the same data being returned.
The original scan of PurchaseOrderHeader is still the same. In fact, we can see that when comparing the plans:
You’ll note that I have the similarities highlighted, so you can tell that the Scan is the same. Yes, its cost percentage within the plan has changed, but it’s still the same basic operation.
The magic comes from the optimizer creating statistics in the form of a materialized view that are created and maintained around the data in tables that qualify for the behavior of the Correlated Datetime Columns. You can even see these in your views. They have a common naming standard as outlined in the documentation: _MPStats_Sys_<constraint_object_id>_<GUID>_<FK_constraint_name> (no word on if MP is anything as crazy as the WA of system generated statistics).
In short, Correlated Datetime Columns worked… Or did it? We’re comparing apples to hammers at the moment. I’ve got a new clustered index on one of the tables. That changes all the choices, whether or not I’ve changed some database setting. Let’s remove Correlated Datetime Columns, pull the plan from cache, and then rerun the query.
Oops. Performance is now at 145ms on average with 118 reads. The execution plan has changed yet again:
In short, it works.
I could go further and look to replace the clustered index on the PurchaseOrderHeader table, but you get the idea.
Correlated Datetime Columns works. Clearly, it’s not something you’re going to enable on all your databases. Probably most of your databases don’t have clustered indexes on datetime columns, let alone enough tables with correlation between the data stored in them. However, when you do have that type of data correlation, enabling Correlated Datetime Columns and ensuring you have a clustered index on the datetime column is a viable tuning mechanism. Further, this is a mechanism that has been around since 2005. Just so you know, I did all my testing in SQL Server 2016, so this something that anyone in the right situation can take advantage of. Just remember that TANSTAAFL always applies. Maintaining the statistics needed for the Correlated Datetime Columns is done through materialized views that are automatically created through the optimization process. You can see the views in SSMS and any queries against the objects. You’ll need to take this into account during your statistics maintenance. However, if Correlated Datetime Columns is something you need, this is really going to help with this, fairly narrow, aspect of query tuning.
Published at DZone with permission of Grant Fritchey , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.