The Top 5 Most Common SQL Server Performance Problems
Let's see the top five most common SQL Server performance problems.
The most effective database administrators don’t make rash decisions based on what seems apparent on the surface. They look more deeply at problems to better understand the root cause before taking action.
This is an incredibly important approach for optimizing your SQL Server. Here are the top five SQL Server performance problems I see, and why it's important never to make hasty assumptions and instead to apply a critical lens to everything.
1. Tempdb PAGELATCH Contention
This increasingly prevalent problem is usually due to a system making heavy use of tempdb for some type of extract, transform, and load (ETL) process. This is especially common if it is an ongoing "real-time" style ETL process.
The symptoms can vary, but some things are always the same: high PAGELATCH waits in tempdb and poor performance recorded for processes using tempdb. I'll typically follow the waits to Top SQL in Performance Advisor and see lots of queries that use temp tables listed in Top SQL. These queries usually run in milliseconds and should never be counted among the "Top SQL" for the server. This can leave people feeling that these queries are a large part of the problem, but that is not necessarily the case at all; the queries are the victims of the real problem.
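If you want to confirm the diagnosis directly rather than inferring it from Top SQL, you can look at what sessions are actually waiting on right now. A minimal diagnostic sketch (it assumes `VIEW SERVER STATE` permission; tempdb is always `database_id = 2`):

```sql
-- Tasks currently waiting on PAGELATCH_* for pages in tempdb.
-- resource_description is formatted as dbid:fileid:pageid.
SELECT owt.session_id,
       owt.wait_type,
       owt.wait_duration_ms,
       owt.resource_description
FROM sys.dm_os_waiting_tasks AS owt
WHERE owt.wait_type LIKE N'PAGELATCH%'
  AND owt.resource_description LIKE N'2:%';
```

If this query keeps returning rows pointing at the same small set of tempdb pages, allocation contention in tempdb is a likely culprit.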
Once I suspect this to be the case, I will usually jump to the Disk Activity tab in Performance Advisor to see how tempdb is configured. Most of the time I see the same thing: a busy tempdb with a single data file defined. From here I'll usually recommend reconfiguring tempdb.
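The usual remedy is to spread tempdb across multiple equally sized data files. A sketch, not a definitive sizing recommendation (the file sizes, names, and path below are placeholders for your environment; a common starting point is one file per CPU core, up to eight):

```sql
-- Resize the existing tempdb data file, then add more files of the same size.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 4096MB, FILEGROWTH = 512MB);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\TempDB\tempdev2.ndf',
          SIZE = 4096MB, FILEGROWTH = 512MB);
-- Repeat ADD FILE for tempdev3, tempdev4, ... keeping every file the same size
-- so SQL Server's proportional-fill algorithm spreads allocations evenly.
```

Equal sizing matters: if one file is larger, proportional fill will favor it and the contention can simply move there.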
2. Expecting Auto Update Statistics to Keep Statistics Updated
The problem here is that the thresholds for triggering auto statistics updates end up being the same in most cases, even for a very large table. The classic threshold is roughly 500 rows plus 20% of the rows in the table. On a really big table, it takes a lot of data change to trigger an update.
The reason this makes the list is that database administrators are often surprised to find out that the auto-update isn't taking care of things the way the name implies. There are also many DBAs who believe statistics should be handled by their maintenance job. But on closer inspection, those jobs are usually doing index reorgs most of the time, and a reorg won't update the statistics either (though a rebuild will).
The lesson is to keep an eye on statistics and make sure they're updated regularly, especially on large tables, which are becoming more and more common. Another option is to use trace flag 2371 to change the formula used to trigger the update.
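Both options can be sketched in a few statements. The table name below is a placeholder, and note that on newer versions of SQL Server (2016+, compatibility level 130 and up) the sliding threshold that trace flag 2371 enables is already the default behavior:

```sql
-- Manually update statistics on a specific large table with a full scan
-- (dbo.BigTable is a placeholder name):
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;

-- Or refresh statistics database-wide using the default sampling:
EXEC sp_updatestats;

-- Enable trace flag 2371 so the auto-update threshold shrinks as tables grow.
-- This is server-wide and lasts until restart; add -T2371 as a startup
-- parameter to persist it.
DBCC TRACEON (2371, -1);
```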
3. The CXPACKET Wait Type
This is hands down the single most common wait type I see on larger SQL Server systems when someone asks me to look into query performance with them.
Sadly, I still see a lot of people make the initial assumption that the problem should be solved by setting max degree of parallelism (MAXDOP) to 1, either for the query or for the entire SQL Server instance. Often, the problem can be handled by proper indexing or statistics maintenance. It could also be that the plan cached for this query is not optimal, and you can mark it for recompile using sp_recompile, set recompile at the query level, or just evict the plan using DBCC FREEPROCCACHE with a plan handle. It is best to exhaust these options before deciding to change MAXDOP to 1, because you could be throwing away a lot of processing power without realizing it.
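The three plan-level alternatives mentioned above look like this in practice. The procedure name, query, and plan handle are placeholders; the plan handle would come from `sys.dm_exec_cached_plans` on your own server:

```sql
-- 1. Mark a stored procedure so its plan is recompiled on next execution
--    (dbo.SlowProc is a placeholder):
EXEC sp_recompile N'dbo.SlowProc';

-- 2. Request a fresh plan for a single statement
--    (placeholder query):
SELECT o.OrderID
FROM dbo.Orders AS o
WHERE o.CustomerID = @CustomerID
OPTION (RECOMPILE);

-- 3. Evict one cached plan by its handle rather than flushing the whole cache
--    (the handle below is a truncated placeholder):
DBCC FREEPROCCACHE (0x060007001ECA...);
```

Passing a specific plan handle to DBCC FREEPROCCACHE is important; running it with no argument clears the entire plan cache and forces every query on the server to recompile.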
4. Misunderstood "Timeout Expired Prior to the Completion of…"
This one is huge. Outside of some very edge-case behavior, there are two basic types of timeouts you might deal with for SQL Server: connection timeouts and operation (or query) timeouts. In both cases, these are values set by the client connecting to the SQL Server. On the server side, there is a remote query timeout setting, but that covers only the edge case.
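To make the distinction concrete, here is the one server-side knob involved, which governs only queries this instance issues against remote servers (linked servers and the like), not queries your applications run against it. A sketch; the 600-second value is an arbitrary example:

```sql
-- Server-side edge case: how long THIS instance waits on a remote query.
-- 0 means wait indefinitely. This has no effect on client query timeouts.
EXEC sp_configure 'remote query timeout', 600;
RECONFIGURE;
```

Everything else, including the common "Timeout expired" error, is controlled by the client's own connection-timeout and command-timeout settings, not by anything you can change on the server.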
Operation timeouts are the most common and might be the most misunderstood situation I come across. The cause boils down to one simple factor: the client executing the command has set a maximum amount of time that it will wait for the command to complete. If this maximum is reached prior to completion, the command is aborted and an error is raised from the client.
Often, the timeout error will induce a panic mode, because the error can look intimidating. The reality is that this is not much different than hitting the stop button in SQL Server Management Studio because the query was taking too long. It will show up exactly the same in a profiler trace with Error = 2 (Aborted).
A timeout like this tells us that queries are taking longer than expected. We should go into "performance tuning" mode rather than "something is broken" mode. The error information from the client is good information on where you might start to focus your tuning efforts.
This is true for any system that uses an RDBMS as a repository. Your database needs some TLC now and again. Without it, you may indeed experience some timeouts from your client. We spend a lot of time tuning our queries for performance before they ever make it out the door, but proper maintenance will ensure they continue to run as intended.
5. Memory Pressure
This is the big one because I see it so often, and also because it is so often mistaken for poor disk performance.
There are lots of caches in SQL Server, but the most well-known is the data cache (aka buffer pool). The easiest way to describe the data cache is that it is the data stored in memory, rather than persisted to disk. It’s desirable to store lots of data in memory long term because working with data in memory is generally much quicker than having to perform physical I/Os.
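If you want to see what the data cache actually holds at a given moment, you can inspect the buffer descriptors. A quick sketch (each buffer page is 8 KB, hence the arithmetic; the internal resource database shows up with a NULL name):

```sql
-- How much of the buffer pool each database is currently using, in MB.
SELECT DB_NAME(database_id)  AS database_name,
       COUNT(*) * 8 / 1024   AS cached_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;
```

Note that this query itself touches a lot of metadata on large servers, so it's best used for spot checks rather than continuous monitoring.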
Usually, memory pressure manifests as a few different symptoms. When viewed individually, some of these symptoms can lead you to incorrect, and sometimes costly, conclusions.
The two misleading symptoms are that you may start to see higher than normal latency across the disk subsystem, and you may start to see abnormally high waits related to disk activity. If you only look at these two symptoms, you may conclude that you need to work on your disk system.
This is why being presented with all relevant metrics on one dashboard is so important. You have to look at the bigger picture, and having the memory-related data available along with the disk activity and waits helps to paint a clearer picture of what is really going on.
Typically, I’ll see a PLE (Page Life Expectancy) that is fairly low for this server. The larger your buffer cache is, the higher your "critical" threshold will be for PLE. The more data there is to churn in and out of the buffer, the worse off you will be when the "churn" happens. Another consideration is non-uniform memory access (NUMA). The way the PLE counter is calculated can cause this value to be very misleading when multiple NUMA nodes are involved.
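Because the server-level PLE number is an aggregate, on NUMA hardware it helps to look at the counter per node. A sketch using the performance-counter DMV:

```sql
-- Page Life Expectancy per NUMA node. On multi-node servers the single
-- 'Buffer Manager' value blends the nodes together and can hide one node
-- under pressure, so check each 'Buffer Node' instance individually.
SELECT RTRIM([object_name]) AS counter_object,
       instance_name        AS numa_node,
       cntr_value           AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND [object_name] LIKE N'%Buffer Node%';
```

One node reporting a PLE far below its siblings is exactly the kind of detail the blended server-level counter will smooth over.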
I'll usually also see consistently higher lazy writer activity and SQL Server page faults (SQL Server going to disk). Sometimes I'll see what I call buffer tearing: the data buffer moving up and down frequently, creating a jagged (or torn) edge on the history chart in Performance Advisor. I may also see an abnormally large plan cache reducing the memory available for the data cache.
All of these factors together spell memory pressure. There are various ways to deal with them, but the important point is that this is not a disk issue. I wouldn't call up your storage area network contact and order new hardware based on this situation. Once you get the memory pressure under control, SQL Server will not need to go to disk as much, and the symptoms related to disk may disappear entirely!
The top SQL Server performance issues/topics will continue to evolve as organizations face new database requirements. If there is one major takeaway from all of these potential problems, it’s to always consider the full picture of performance because looking at one factor out of context could severely limit your options for a solution.