Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

CASE Statement in GROUP BY

DZone's Guide to

CASE Statement in GROUP BY

Sometimes tackling a problem in a single query isn't the best approach. We have a look at a scenario where a CASE statement is needed in a GROUP BY clause. What's the best way to handle this?

· Database Zone
Free Resource

Find out how Database DevOps helps your team deliver value quicker while keeping your data safe and your organization compliant. Align DevOps for your applications with DevOps for your SQL Server databases to discover the advantages of true Database DevOps, brought to you in partnership with Redgate

Set-based operations means you should put everything into a single statement, right?

Well, not really. People seem to think that having two queries is really bad, so when faced with logical gaps, they just cram them into the query they have. This is partly because SQL Server and T-SQL supports letting you do this, and it's partly because it looks like a logical extension of code reuse to arrive at a query structure that supports multiple logic chains. However, let's explore what happens when you do this in a particular situation: a CASE statement in a GROUP BY clause.

You see this a lot because a given set of data may be needed in slightly different context by different groups within the company. Like many of my example queries, this could be better written. Like many of my example queries, it mirrors what I see in the wild (and for those following along at home, I'm using the WideWorldImporters database for tests now):

CREATE PROCEDURE dbo.InvoiceGrouping (@x INT)
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY CASE WHEN @x = 7 THEN i.ContactPersonID
              WHEN @x = 15 THEN i.AccountsPersonID
              ELSE i.SalespersonPersonID
         END;
GO


Running this for any given value above, 7, 15, or other, you'll get the same execution plan, regardless of the column used in the GROUP BY. However, Parameter Sniffing is still something of a factor. When you group this data by SalesPersonID, you only get 10 rows back. This will be shown as the estimated number of rows returned if some value other than 7 or 15 is used as a parameter. However, this is always the plan:

coreplan
 

You can click on that to expand it into something readable. We can eliminate the Parameter Sniffing from the equation if we want to by modifying the query thus:

CREATE PROCEDURE dbo.InvoiceGrouping_NoSniff (@x INT)
AS
DECLARE @x2 INT;
SET @x2 = @x;

SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY CASE WHEN @x2 = 7 THEN i.ContactPersonID
              WHEN @x2 = 15 THEN i.AccountsPersonID
              ELSE i.SalespersonPersonID
         END;
GO


However, except for some deviation on the estimated rows (since it's averaging the rows returned), the execution plan is the same.

What's the big deal right? Well, let's break down the code into three different procedures:

CREATE PROCEDURE dbo.InvoiceGrouping_Contact
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.ContactPersonID;
GO

CREATE PROCEDURE dbo.InvoiceGrouping_Sales
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.SalespersonPersonID;
GO

CREATE PROCEDURE dbo.InvoiceGrouping_Account
AS
SELECT  SUM(il.UnitPrice),
        COUNT(i.ContactPersonID),
        COUNT(i.AccountsPersonID),
        COUNT(i.SalespersonPersonID)
FROM    Sales.Invoices AS i
JOIN    Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.AccountsPersonID;
GO


Interestingly enough, these three queries produce a nearly identical execution plan. The one big difference is the Compute Scalar operator that is used to generate a value for the Hash Match Aggregate is no longer in the query:

specificplan

The same basic set of structures, scans against both tables, to arrive at the data. Cost estimates between the two plans are very different though, with the targeted queries having a much lower estimated cost.

Performance-wise, interestingly enough, the average execution time of the first query, only returning the 10 rows, is 157ms on average, while the query grouping directly on the SalesPersonID averages about 190ms. Now, the reads tell a slightly different story with 17428 on the generic query and 5721 on the specific query. So, maybe a server under load will see a significant performance increase. However, let's deal with what we have in front of us and say that, at least for these tests, the catch-all GROUP BY query performs well.

Now let's change the paradigm slightly. Let's add an index:

CREATE INDEX TestingGroupBy ON Sales.Invoices (SalespersonPersonID);


Frankly, this isn't a very useful index. However, after adding it, the execution plan for the InvoiceGrouping_Sales query changes. Instead of scanning the table, it's now scanning the index. Despite recompiles and attempts to force it using hints, the original InvoiceGrouping query will not use this index. Duration of the InvoiceGrouping_Sales query drops to 140ms on average and the reads drop a little further to 5021. Getting an 11% increase on performance is a win.

This is a pretty simplified example, however. Making the CASE statement more complex won't improve performance or further assist the optimizer to make good choices. Instead of trying to cram multiple different logical groupings into a single query, a better approach would be to create the three new procedures that I did above, and make the original InvoiceGrouping procedure into a wrapping procedure that chooses which of the individual procedures to call. This way, if you do add indexes in support of each of the different possible groupings, you would realize a positive outcome in your performance.

Want to talk more about execution plans and query tuning? In August, I'll be doing an all day pre-con at SQLServer Geeks Annual Summit in Bangalore India.

I'm also going to be in Oslo Norway for a pre-con before SQL Saturday Oslo in September.

Align DevOps for your applications with DevOps for your SQL Server databases to increase speed of delivery and keep data safe. Discover true Database DevOps, brought to you in partnership with Redgate

Topics:
query ,database ,sql ,aggregate ,plan

Published at DZone with permission of Grant Fritchey, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}