Finding Code Smells Using SQL Prompt
Finding Code Smells Using SQL Prompt
Let's at some creative SQL statements you can use to help root out code smells and get the data you really want.
Join the DZone community and get the full member experience.Join For Free
New whitepaper: Database DevOps – 6 Tips for Achieving Continuous Delivery. Discover 6 tips for continuous delivery with Database DevOps in this new whitepaper from Redgate. In 9 pages, it covers version control for databases and configurations, branching and testing, automation, using NuGet packages, and advice for how to start a pioneering Database DevOps project. Also includes further research on the industry-wide state of Database DevOps, how application and database development compare, plus practical steps for bringing DevOps to your database. Read it now free.
TOP in a
SELECT statement without a subsequent
ORDERBY clause is legal in SQL Server, but meaningless because asking for the
TOP10 rows implies that the data is guaranteed to be in a certain order, and tables have no implicit logical order. You must specify the order.
SELECT statement, you should always use an
ORDERBY clause with the
TOP clause, to specify which rows are affected by the
TOP filter. If you need to implement a paging solution in an application widget, to send chunks or “pages” of data to the client so a user can scroll through data, it is better and easier to use the
FETCH subclause in the
ORDERBY clause, instead of the
A recommendation to avoid use
TOP in a
SELECT statement, without an
ORDERBY, is included as a “Best Practice” code analysis rule in SQL Prompt (BP006).
Limiting Rows With TOP
TOP isn’t standard SQL, but it is intuitive. If you want just a few sample rows from a table source, then it is tempting to use the
TOP keyword without the
ORDERBY clause. A single table is likely to conform to the order of the clustered index but even that isn’t guaranteed due to parallelism.
If we go beyond the query for a single table and do a few joins, the ‘natural’ order is less obvious. Maybe you are in AdventureWorks and want just five customers, any five customers, and their addresses. It is perfectly legal to do this, but it is slightly dangerous if you subsequently forget why you are doing it.
SELECT TOP 5 Person.Title, Person.FirstName, Person.MiddleName, Person.LastName, Address.AddressLine1, Address.AddressLine2, Address.City, Address.PostalCode, AddressType.Name FROM Sales.Customer INNER JOIN Person.Person ON Customer.PersonID = Person.BusinessEntityID INNER JOIN Person.BusinessEntityAddress ON Person.BusinessEntityID = BusinessEntityAddress.BusinessEntityID INNER JOIN Person.Address ON BusinessEntityAddress.AddressID = Address.AddressID INNER JOIN Person.AddressType ON BusinessEntityAddress.AddressTypeID = AddressType.AddressTypeID;
You get what you expect, just the first five customers that are returned by the query. The order I get is that of the clustered index of the
PK_Customer_CustomerID, starting with the customer with the lowest
customer_id that is a person rather than a store. A different execution strategy could well change that. You are not guaranteed a deterministic result. This may be fine if you just want a sample during development, but in production systems what you really want are the top five customers and addresses according to their rank order, as determined by some attribute such as how much they spend. You really need that
In short, SQL tables have no guaranteed consistency in its intrinsic order. You may have set a
PRIMARYKEYthat gives your table values some fundamental ordering, but this isn’t guaranteed to happen consistently. SQL Server reserves the right to introduce any optimisation it wants, in creating an execution plan that will return the results, even if it means delivering those results in a different order. In short, you can’t guarantee that the results will be returned in the order you expect, unless you make it explicit with an
So, we return to the perfectly reasonable requirement that the developer has of just being able to eyeball a representative sample of rows from the query. How should this be done?
SET ROWCOUNT and TABLESAMPLE: Do They Help?
There was a time when we had to use the
SETROWCOUNT statement to put a limit on the number of rows returned. One drawback to that was that the query optimiser couldn’t create an effective plan in the light of the number of rows requested, because the
ROWCOUNT is a session or procedure/trigger-wide setting that wasn’t visible to the Query Optimiser within the query.
Also, it was possible to forget that you’d set the
ROWCOUNT and neglected to ‘unset’ it. Another disadvantage was that you couldn’t pass the value a variable.
TOP is better, because it works at the statement level, and you can pass either a row value, or a percentage, as a variable or expression.
You might think that you can reliably get a limited number of rows from a table with the
TABLESAMPLE clause. The only problem with it is that it doesn’t work as advertised, and even if it did work as advertised it only works on tables, rather than the whole rich variety of table sources.
SELECT * FROM Sales.Customer TABLESAMPLE SYSTEM (5);
This is supposed to limit the number of rows returned from a table in the
FROM clause to a sample number or
PERCENT of rows. A quick test will show you why nobody uses it.
DROP TABLE IF EXISTS #Result; CREATE TABLE #Result (TheOrder INT IDENTITY, TheRowsReturned INT); GO INSERT INTO #Result (TheRowsReturned) SELECT Count(*) FROM Sales.Customer TABLESAMPLE(200 ROWS); GO 30 SELECT #Result.TheOrder, #Result.TheRowsReturned FROM #Result;
Getting Meaningful Table Samples With TOP…ORDER BY
So far, because
TABLESAMPLE is broken, we are stuck with the slight awkwardness of this for getting a sample from a table.
SELECT TOP 5 * FROM Sales.Customer
Why is this awkward? Listing 4 will give you back five rows, but you cannot entirely rely on which rows are returned, though it is likely to be in the order of the
PRIMARYKEY, as we are just accessing a single table. However,
sales.customer is a bit of a ‘trick’ table in that it uses polymorphic association, and the first 700 of the 19820 rows represent stores, not people. So, Listing 4 is likely to give a very unrepresentative sample of this table because You could easily get the incorrect impression of the data in the table, thinking that customers are stores when the majority are people!
What most developers will want is to see a few rows of the table they are investigating, taken at random, but if you want the sample in random order, then you have to make that explicit.
SELECT TOP 5 * FROM Sales.Customer ORDER BY NewId()
This will return you five rows in a random order, but it takes more resources to return the result. If you aren’t dealing with a ‘trick’ table and just don’t care about order, and need to indicate that fact in the code, then SQL Server will accept any system function, such as
SELECT NULL). This trick is often seen where SQL Server refuses to accept code in a window function that requires an
ORDERBY. It means 'I know, I know but I’m doing it deliberately.'
SELECT TOP 10 * FROM Sales.Customer ORDER BY @@identity
If you are happy with the records you get by using
TOP without the
ORDERBY, then it is best to be entirely explicit, and point out that you really do want it by the
SELECT TOP 5 * FROM Sales.Customer ORDER BY Customer.CustomerID;
Using TOP With ORDER BY for Reporting Queries
TOP comes into its own for reporting purposes. Managers love lists of top customers and top salesmen. At this point, the
ORDERBY part becomes crucial.
SELECT TOP 10 Person.BusinessEntityID, Sum(SalesOrderHeader.TotalDue) AS expenditure FROM Sales.SalesPerson INNER JOIN Sales.SalesOrderHeader ON SalesPerson.BusinessEntityID = SalesOrderHeader.SalesPersonID INNER JOIN Person.Person ON SalesPerson.BusinessEntityID = Person.BusinessEntityID GROUP BY Person.BusinessEntityID ORDER BY Sum(SalesOrderHeader.TotalDue) DESC;
This gives you the top ten performing salespeople.
We might decide that this report doesn’t really tell us who the salesperson was, so we’d tweak it.
SELECT SalesPerformance.SalesValue, Coalesce(Person.Title + ' ', '') + Person.FirstName + Coalesce(' ' + Person.MiddleName, '') + ' ' + Person.LastName + Coalesce(' ' + Person.Suffix, '') AS SalesPerson FROM ( SELECT TOP 10 SalesPerson.BusinessEntityID AS salesPerson, Sum(SalesOrderHeader.TotalDue) AS SalesValue FROM Sales.SalesPerson INNER JOIN Sales.SalesOrderHeader ON SalesPerson.BusinessEntityID = SalesOrderHeader.SalesPersonID INNER JOIN Person.Person ON SalesPerson.BusinessEntityID = Person.BusinessEntityID GROUP BY SalesPerson.BusinessEntityID ORDER BY Sum(SalesOrderHeader.TotalDue) DESC ) AS SalesPerformance(SalesPerson, SalesValue) INNER JOIN Person.Person ON SalesPerformance.SalesPerson = Person.BusinessEntityID ORDER BY SalesPerformance.SalesValue DESC
Why would we need to do that second
ORDERBY? The original SQL was an aggregate query and we needed the top 10 aggregate sales totals, so we had to impose an order on it. This was passed, with no fixed order, to the outer query that added the person’s name. To be certain of the order of this outer query, it will need an explicit
ORDERBY clause too. This is sometimes referred to as a 'presentation
ORDERBY', or 'presentation ordering'.
Using FETCH-OFFSET Instead of TOP
The better way of doing this is, of course, to use the
ORDERBY with the optional
FETCH clause, available in SQL Server 2012 onwards, instead of
TOP. It is much more versatile and is standard ANSII SQL too. Here are the twenty longest-serving employees of AdventureWorks.
SELECT Employee.JobTitle, Employee.HireDate, Coalesce(Person.Title + ' ', '') + Person.FirstName + Coalesce(' ' + Person.MiddleName, '') + ' ' + Person.LastName + Coalesce(' ' + Person.Suffix, '') AS Name FROM HumanResources.Employee INNER JOIN Person.Person ON Person.BusinessEntityID = Employee.BusinessEntityID ORDER BY Employee.HireDate ASC OFFSET 0 ROWS FETCH FIRST 20 ROWS ONLY;
ORDERBY…OFFSET…ROWSFETCHFIRST…ROWSONLY, you can provide a means of scrolling or paging through this hall of fame.
Using TOP With INSERT, UPDATE, MERGE, or DELETE
It seems odd that you are discouraged from using
TOP without and
ORDERBY, when in certain circumstances you are actively prevented from doing so. As well as the
SELECT statement, the
UPDATE statements all have a
TOP clause. In contrast to
SELECT, you can’t have an associated
ORDERBYclause. Let’s take this example.
DROP TABLE IF EXISTS #tempCustomer; --in case it exists SELECT Customer.CustomerID, Customer.PersonID, Customer.StoreID, Customer.TerritoryID, Customer.AccountNumber, Customer.rowguid, Customer.ModifiedDate INTO #tempCustomer FROM Sales.Customer --just for the test UPDATE TOP (10) #tempCustomer SET #tempCustomer.AccountNumber = Replace(#tempCustomer.AccountNumber, 'AW', 'PF') OUTPUT Deleted.CustomerID, Deleted.AccountNumber AS before, Inserted.AccountNumber AS after
Now try putting in an
ORDERBY clause! It won’t allow it. As the documentation says:
“The rows referenced in the
TOP expression used with
DELETE are not arranged in any order”.
No, you must do something like this.
UPDATE #tempCustomer SET #tempCustomer.AccountNumber = -- Replace(#tempCustomer.AccountNumber, 'AW', 'PF') OUTPUT Deleted.CustomerID, Deleted.AccountNumber AS before, Inserted.AccountNumber AS AFTER FROM ( SELECT TOP 10 CustomerID FROM #tempCustomer ORDER BY #tempCustomer.CustomerID DESC ) AS ordered WHERE #tempCustomer.CustomerID = ordered.CustomerID GO
INSERT statement. We can’t use
TOP to insert rows in a meaningful chronological order. As the book says:
TOP is used with
INSERT the referenced rows are not arranged in any order and the
ORDERBY clause cannot be directly specified in this statement.”
If you need to do this, you must use
TOP together with an
ORDERBY clause that is specified in a sub-select statement.
DELETE has a
TOP clause, but we can’t use it either. What if you wanted to clean out old purchase order details? You’d want to be certain to clean out the oldest ones first. We can’t put an
ORDERBY in the delete statement, but then we don’t have to.
Let’s set up the test.
DROP TABLE IF EXISTS #tempPurchaseOrderDetail; --in case it exists SELECT POD.PurchaseOrderID, POD.PurchaseOrderDetailID, POD.DueDate, POD.OrderQty, POD.ProductID, POD.UnitPrice, POD.LineTotal, POD.ReceivedQty, POD.RejectedQty, POD.StockedQty, POD.ModifiedDate INTO #tempPurchaseOrderDetail FROM Purchasing.PurchaseOrderDetail AS POD
We now delete the ten oldest purchase order details.
DELETE FROM #tempPurchaseOrderDetail OUTPUT Deleted.DueDate, Deleted.LineTotal, Deleted.PurchaseOrderID WHERE PurchaseOrderDetailID IN ( SELECT TOP 10 PurchaseOrderDetailID FROM #tempPurchaseOrderDetail ORDER BY DueDate ASC ); GO
So, what is the point of having that
TOP filter if it can’t be used to
UPDATE? Well, in fact, it can be used, in circumstances where a particular set of records don’t need to be deleted in a specific order, as long as they eventually get deleted.
The use of the
TOP filter without the
ORDERBY is a life-saver if you need, for example, to delete lots of rows, regularly, from an operational production system. Deletes are logged, and can also result in escalating locks. I once had to design a system that regularly cleared out a million rows from a SQL Server database. The optimum method involved taking a lot of successive bites when eating the elephant rather than attempting it in a mouthful.
We can illustrate this very easily, though you wouldn’t see the advantage until you have a working system, especially one that was needing to access the table while you were deleting, updating, inserting or merging. Again, we will illustrate the point with a temporary table so as not to interfere with the good working of AdventureWorks.
DROP TABLE IF EXISTS #tempPurchaseOrderDetail; --in case it exists SELECT POD.PurchaseOrderID, POD.PurchaseOrderDetailID, POD.DueDate, POD.OrderQty, POD.ProductID, POD.UnitPrice, POD.LineTotal, POD.ReceivedQty, POD.RejectedQty, POD.StockedQty, POD.ModifiedDate INTO #tempPurchaseOrderDetail FROM Purchasing.PurchaseOrderDetail AS POD --we delete rows successively DECLARE @rowcount INT = 1 WHILE @rowcount > 0 BEGIN DELETE TOP (200) FROM #tempPurchaseOrderDetail WHERE #tempPurchaseOrderDetail.DueDate < DateAdd(YEAR, -2, GetDate()) SELECT @rowcount = @@RowCount END
I’ve found, in the past, that large-scale operations like this often benefit from being done in chunks, and the size of the chunk is a matter of fine-tuning with the operational system to get it right. For this sort of work, the
TOPclause without the
ORDERBY in a
UPDATE can be very valuable for doing large-scale changes, a short step at a time, in a hard-working transactional system
TOP clause in a
SELECT statement is very useful and intuitive, but it allows you to leave out the associated
ORDERBY clause that clarifies what you had in mind:
TOP by what aspect? After all, your
TOP ten songs aren’t the ten loudest, or the ten sung in the highest voice. They are the ten most popular in terms of the records sold. You may, fortuitously get the right result in development work, but in production, the workload, the server and the data size could result in the queries being optimised in very different ways, resulting in a different result.
For a more versatile way of dealing with such thing, I’d suggest using the
FETCH syntax introduced in SQL Server 2012 because it is more versatile and conformant. It isn’t too much harder to remember than the
TOP filter either.
Published at DZone with permission of Phil Factor . See the original article here.
Opinions expressed by DZone contributors are their own.