In this article, we look at how to keep queries fast as database tables grow. You have probably met this topic before in other resources on database development.
However, it becomes a pressing issue when data grows continuously: tables become so large that performance suffers.
This usually happens because the database schema was not originally designed to handle large volumes of data.
To avoid performance loss as data keeps growing, you should stick to certain rules when designing a database schema.
Rule # 1 — Minimum Redundancy of Data Types
The fundamental unit of SQL Server data storage is the page. The disk space intended for a data file in a database is logically divided into pages numbered contiguously from 0 to n. In SQL Server, the page size is 8 KB. This means SQL Server databases have 128 pages per megabyte.
Disk I/O operations are performed at the page level. That is, SQL Server reads or writes whole data pages. The more compact the data types you use, the fewer pages are required to store the data and, as a result, the fewer I/O operations are needed.
The SQL Server buffer pool significantly improves I/O throughput. Its primary purpose is to reduce database file I/O and improve the response time for data retrieval.
Thus, when compact data types are used, the buffer pool holds more data in the same number of pages. As a result, you do not waste memory, and you reduce the number of logical reads.
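To get a feel for how row width translates into pages, the relationship can be sketched with a rough calculation. The snippet below is only an approximation under stated assumptions: the row count and the two row sizes are hypothetical, about 8,096 bytes of each 8 KB page are assumed to be available for rows, and each row is assumed to cost 2 extra bytes for its slot entry, as in SQL Server's published size-estimation formulas.

```sql
-- Rough page-count estimate; all three input values are hypothetical.
DECLARE @num_rows     BIGINT = 1000000; -- assumed number of rows
DECLARE @wide_bytes   INT    = 50;      -- assumed row size with oversized data types
DECLARE @narrow_bytes INT    = 30;      -- assumed row size with compact data types

SELECT
      wide_rows_per_page   = 8096 / (@wide_bytes + 2)     -- integer division rounds down
    , narrow_rows_per_page = 8096 / (@narrow_bytes + 2)
    , wide_pages           = CEILING(1.0 * @num_rows / (8096 / (@wide_bytes + 2)))
    , narrow_pages         = CEILING(1.0 * @num_rows / (8096 / (@narrow_bytes + 2)));
```

With these assumed inputs, 155 wide rows or 253 narrow rows fit on a page, so the same million rows need about 6,452 pages in one case and about 3,953 in the other.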
Consider the following example — a table that stores working days of employees.
CREATE TABLE dbo.WorkOut1 (
      DateOut DATETIME
    , EmployeeID BIGINT
    , WorkShiftCD NVARCHAR(10)
    , WorkHours DECIMAL(24,2)
    , CONSTRAINT PK_WorkOut1 PRIMARY KEY (DateOut, EmployeeID)
)
Are the selected data types correct? Most probably not. It is unlikely that an enterprise has 2^63-1 employees, so BIGINT is an unsuitable data type in this case. The other columns can be narrowed in the same way, as long as the smaller types still cover the values you expect to store.
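One way to see where the redundancy comes from is to compare the storage size of each data type. The query below is a sketch that applies DATALENGTH to sample values; the date, the shift code 'DAY', and the hours value are made up for illustration. According to the SQL Server documentation, DATETIME takes 8 bytes and SMALLDATETIME 4, BIGINT takes 8 bytes and INT 4, DECIMAL takes 13 bytes for precision 20 to 28 and 5 bytes for precision 1 to 9, and NVARCHAR stores 2 bytes per character where VARCHAR stores 1 (for single-byte code pages).

```sql
-- Compare the storage size (in bytes) of wide and compact types on sample values.
-- All literal values are hypothetical.
SELECT
      datetime_bytes      = DATALENGTH(CAST('20240101' AS DATETIME))
    , smalldatetime_bytes = DATALENGTH(CAST('20240101' AS SMALLDATETIME))
    , bigint_bytes        = DATALENGTH(CAST(1 AS BIGINT))
    , int_bytes           = DATALENGTH(CAST(1 AS INT))
    , nvarchar_bytes      = DATALENGTH(CAST(N'DAY' AS NVARCHAR(10)))
    , varchar_bytes       = DATALENGTH(CAST('DAY' AS VARCHAR(10)))
    , decimal_24_2_bytes  = DATALENGTH(CAST(8.00 AS DECIMAL(24,2)))
    , decimal_8_2_bytes   = DATALENGTH(CAST(8.00 AS DECIMAL(8,2)));
```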
We can remove this redundancy and compare the query execution times.
CREATE TABLE dbo.WorkOut2 (
      DateOut SMALLDATETIME
    , EmployeeID INT
    , WorkShiftCD VARCHAR(10)
    , WorkHours DECIMAL(8,2)
    , CONSTRAINT PK_WorkOut2 PRIMARY KEY (DateOut, EmployeeID)
)
The execution plans for the two tables show a cost difference that depends on the row size and the expected number of rows.
The less data you need to retrieve, the faster the query will run.
(3492294 row(s) affected)
 SQL Server Execution Times:
   CPU time = 1919 ms, elapsed time = 33606 ms.

(3492294 row(s) affected)
 SQL Server Execution Times:
   CPU time = 1420 ms, elapsed time = 29694 ms.
As you can see, using non-redundant data types is key to good query performance. It also reduces the size of problem tables. By the way, you can run the following query to measure the size of a table:
SELECT
      table_name = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
    , data_size_mb = CAST(do.pages * 8. / 1024 AS DECIMAL(8,4))
FROM sys.objects o
JOIN (
    SELECT
          p.[object_id]
        , total_rows = SUM(p.[rows])
        , total_pages = SUM(a.total_pages)
        , usedpages = SUM(a.used_pages)
        , pages = SUM(
            CASE
                WHEN it.internal_type IN (202, 204, 207, 211, 212, 213, 214, 215, 216, 221, 222) THEN 0
                WHEN a.[type] != 1 AND p.index_id < 2 THEN a.used_pages
                WHEN p.index_id < 2 THEN a.data_pages
                ELSE 0
            END
          )
    FROM sys.partitions p
    JOIN sys.allocation_units a ON p.[partition_id] = a.container_id
    LEFT JOIN sys.internal_tables it ON p.[object_id] = it.[object_id]
    GROUP BY p.[object_id]
) do ON o.[object_id] = do.[object_id]
WHERE o.[type] = 'U'
For the above-considered tables, the query returns the following results:
table_name      data_size_mb
--------------- ------------
dbo.WorkOut1    167.2578
dbo.WorkOut2    97.1250
Rule # 2 — Use Database Normalization and Avoid Data Duplication
Recently, I analyzed the database of a free web service for formatting T-SQL code. The server side is quite simple and consists of a single table:
CREATE TABLE dbo.format_history (
      session_id BIGINT
    , format_date DATETIME
    , format_options XML
)
Every time SQL code was formatted, the following parameters were saved to the database: the current session ID, the server time, and the settings that were applied while formatting the user’s SQL code.
This data was subsequently used to determine the most popular formatting styles. There were plans to add these styles to the default formatting styles of SQL Complete.
However, as the service grew more popular, the number of rows in the table increased significantly, and processing the profiles became slow. The settings had the following XML structure:
<FormatProfile>
  <FormatOptions>
    <PropertyValue Name="Select_SelectList_IndentColumnList">true</PropertyValue>
    <PropertyValue Name="Select_SelectList_SingleLineColumns">false</PropertyValue>
    <PropertyValue Name="Select_SelectList_StackColumns">true</PropertyValue>
    <PropertyValue Name="Select_SelectList_StackColumnsMode">1</PropertyValue>
    <PropertyValue Name="Select_Into_LineBreakBeforeInto">true</PropertyValue>
    ...
    <PropertyValue Name="UnionExceptIntersect_LineBreakBeforeUnion">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_LineBreakAfterUnion">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_IndentKeyword">true</PropertyValue>
    <PropertyValue Name="UnionExceptIntersect_IndentSubquery">false</PropertyValue>
    ...
  </FormatOptions>
</FormatProfile>
There were 450 formatting options in total. Each row took 33 KB in the table, and the daily data growth exceeded 100 MB. As an obvious outcome, the database kept expanding day by day, making data analysis ever more complicated.
Surprisingly, the solution turned out to be quite easy: all unique profiles were placed into a separate table, where a hash was defined for every set of options. As of SQL Server 2008, you can use the sys.fn_repl_hash_binary function for this.
So the DB schema has been normalized:
CREATE TABLE dbo.format_profile (
      format_hash BINARY(16) PRIMARY KEY
    , format_profile XML NOT NULL
)

CREATE TABLE dbo.format_history (
      session_id BIGINT
    , format_date SMALLDATETIME
    , format_hash BINARY(16) NOT NULL
    , CONSTRAINT PK_format_history PRIMARY KEY CLUSTERED (session_id, format_date)
)
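The migration itself is not shown in the article, but the profile table could be filled along the following lines. This is a sketch rather than the service's actual code: it assumes the old data lives in SQLF.dbo.format_history and the new tables in SQLF_v2 (the database names used by the queries in this section), skips rows without options, and keeps the earliest row for each hash.

```sql
-- Sketch: copy one profile per distinct hash into the normalized table.
-- Assumes the source and target databases are named SQLF and SQLF_v2.
INSERT INTO SQLF_v2.dbo.format_profile (format_hash, format_profile)
SELECT t.format_hash, t.format_options
FROM (
    SELECT
          h.format_hash
        , h.format_options
          -- number the rows within each hash so that only one copy is kept
        , rn = ROW_NUMBER() OVER (PARTITION BY h.format_hash ORDER BY h.format_date)
    FROM (
        SELECT
              fh.format_options
            , fh.format_date
            , format_hash = sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX)))
        FROM SQLF.dbo.format_history fh
        WHERE fh.format_options IS NOT NULL
    ) h
) t
WHERE t.rn = 1;
```

Because the hash is computed the same way as in the analysis query in this section, two rows end up sharing a profile only when their serialized options are byte-for-byte identical (barring a hash collision).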
Previously, the query looked like this:
SELECT fh.session_id, fh.format_date, fh.format_options
FROM SQLF.dbo.format_history fh
The new schema requires a JOIN to retrieve the same data:
SELECT fh.session_id, fh.format_date, fp.format_profile
FROM SQLF_v2.dbo.format_history fh
JOIN SQLF_v2.dbo.format_profile fp ON fh.format_hash = fp.format_hash
If we compare the execution times of the two queries, we can hardly see any advantage from the schema changes.
(3090 row(s) affected)
 SQL Server Execution Times:
   CPU time = 203 ms, elapsed time = 4698 ms.

(3090 row(s) affected)
 SQL Server Execution Times:
   CPU time = 125 ms, elapsed time = 4479 ms.
But in this case, the goal was to reduce the time spent on analysis. Previously, we had to write an intricate query to get the list of popular formatting profiles:
;WITH cte AS (
    SELECT
          fh.format_options
        , hsh = sys.fn_repl_hash_binary(CAST(fh.format_options AS VARBINARY(MAX)))
        , rn = ROW_NUMBER() OVER (ORDER BY 1/0)
    FROM SQLF.dbo.format_history fh
)
SELECT c2.format_options, c1.cnt
FROM (
    SELECT TOP (10) hsh, rn = MIN(rn), cnt = COUNT(1)
    FROM cte
    GROUP BY hsh
    ORDER BY cnt DESC
) c1
JOIN cte c2 ON c1.rn = c2.rn
ORDER BY c1.cnt DESC
Now, thanks to data normalization, the query is much simpler:
SELECT
      fp.format_profile
    , t.cnt
FROM (
    SELECT TOP (10)
          fh.format_hash
        , cnt = COUNT(1)
    FROM SQLF_v2.dbo.format_history fh
    GROUP BY fh.format_hash
    ORDER BY cnt DESC
) t
JOIN SQLF_v2.dbo.format_profile fp ON t.format_hash = fp.format_hash
The query execution time has decreased as well:
(10 row(s) affected)
 SQL Server Execution Times:
   CPU time = 2684 ms, elapsed time = 2774 ms.

(10 row(s) affected)
 SQL Server Execution Times:
   CPU time = 15 ms, elapsed time = 379 ms.
In addition, the database size has decreased:
database_name   row_size_mb
--------------- -----------
SQLF            123.50
SQLF_v2         7.88
To retrieve a file size, the following query can be used:
SELECT
      database_name = DB_NAME(database_id)
    , row_size_mb = CAST(SUM(CASE WHEN type_desc = 'ROWS' THEN size END) * 8. / 1024 AS DECIMAL(8,2))
FROM sys.master_files
WHERE database_id IN (DB_ID('SQLF'), DB_ID('SQLF_v2'))
GROUP BY database_id
I hope this demonstrates the importance of data normalization.
Rule # 3 — Be Careful While Selecting Indexed Columns.
An index is an on-disk structure associated with a table or view that speeds up the retrieval of rows from that table or view. Indexes are stored on pages, so the fewer pages an index needs, the faster the search is. Be especially careful when choosing the columns of the clustered index, because all clustered index columns are included in every nonclustered index. A wide clustered key can therefore increase the database size dramatically.
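The effect of the clustered key on nonclustered indexes can be observed directly. The following sketch assumes that the dbo.WorkOut2 table from Rule # 1 exists and contains data; the index name IX_WorkOut2_WorkShiftCD is made up for this example. It creates a nonclustered index on a single column and then lists the size of every index on the table.

```sql
-- Hypothetical nonclustered index on one narrow column.
-- Each of its rows also carries the clustered key (DateOut, EmployeeID).
CREATE NONCLUSTERED INDEX IX_WorkOut2_WorkShiftCD
    ON dbo.WorkOut2 (WorkShiftCD);

-- List the size of each index on the table.
SELECT
      index_name = i.name
    , i.type_desc
    , size_mb = CAST(SUM(ps.used_page_count) * 8. / 1024 AS DECIMAL(18,2))
FROM sys.indexes i
JOIN sys.dm_db_partition_stats ps
    ON ps.[object_id] = i.[object_id]
   AND ps.index_id = i.index_id
WHERE i.[object_id] = OBJECT_ID('dbo.WorkOut2')
GROUP BY i.name, i.type_desc;
```

The nonclustered index should come out larger than the WorkShiftCD column alone would suggest, and the gap widens as the clustered key gets wider.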
Rule # 4 — Use Consolidated Tables.
You do not need to execute a complex query on a large table. Instead, you can execute a simple query on a small table.
For instance, we have the following consolidation query:
SELECT
      WorkOutID
    , CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
    , DE = SUM(CASE WHEN WorkKeyCD = 'DE' THEN Value END)
    , RE = SUM(CASE WHEN WorkKeyCD = 'RE' THEN Value END)
    , FD = SUM(CASE WHEN WorkKeyCD = 'FD' THEN Value END)
    , TR = SUM(CASE WHEN WorkKeyCD = 'TR' THEN Value END)
    , FF = SUM(CASE WHEN WorkKeyCD = 'FF' THEN Value END)
    , PF = SUM(CASE WHEN WorkKeyCD = 'PF' THEN Value END)
    , QW = SUM(CASE WHEN WorkKeyCD = 'QW' THEN Value END)
    , FH = SUM(CASE WHEN WorkKeyCD = 'FH' THEN Value END)
    , UH = SUM(CASE WHEN WorkKeyCD = 'UH' THEN Value END)
    , NU = SUM(CASE WHEN WorkKeyCD = 'NU' THEN Value END)
    , CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID
If the table data does not change often, we can save the consolidated result in a separate table and query it directly:
SELECT * FROM dbo.WorkOutFactorCache
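The article does not show how dbo.WorkOutFactorCache is built. One possible approach, sketched here under the assumption that a periodic full rebuild is acceptable, is to materialize the consolidation query with SELECT INTO. The index name is hypothetical, and the rebuild would have to be rerun (for example, from a scheduled job) whenever dbo.WorkOutFactor changes.

```sql
-- Drop the previous copy of the cache table, if there is one.
IF OBJECT_ID('dbo.WorkOutFactorCache', 'U') IS NOT NULL
    DROP TABLE dbo.WorkOutFactorCache;

-- Materialize the consolidation query into a new table.
SELECT
      WorkOutID
    , CE = SUM(CASE WHEN WorkKeyCD = 'CE' THEN Value END)
    , DE = SUM(CASE WHEN WorkKeyCD = 'DE' THEN Value END)
    , RE = SUM(CASE WHEN WorkKeyCD = 'RE' THEN Value END)
    , FD = SUM(CASE WHEN WorkKeyCD = 'FD' THEN Value END)
    , TR = SUM(CASE WHEN WorkKeyCD = 'TR' THEN Value END)
    , FF = SUM(CASE WHEN WorkKeyCD = 'FF' THEN Value END)
    , PF = SUM(CASE WHEN WorkKeyCD = 'PF' THEN Value END)
    , QW = SUM(CASE WHEN WorkKeyCD = 'QW' THEN Value END)
    , FH = SUM(CASE WHEN WorkKeyCD = 'FH' THEN Value END)
    , UH = SUM(CASE WHEN WorkKeyCD = 'UH' THEN Value END)
    , NU = SUM(CASE WHEN WorkKeyCD = 'NU' THEN Value END)
    , CS = SUM(CASE WHEN WorkKeyCD = 'CS' THEN Value END)
INTO dbo.WorkOutFactorCache
FROM dbo.WorkOutFactor
WHERE Value > 0
GROUP BY WorkOutID;

-- One row per WorkOutID, so a unique clustered index fits (hypothetical name).
CREATE UNIQUE CLUSTERED INDEX IX_WorkOutFactorCache
    ON dbo.WorkOutFactorCache (WorkOutID);
```

The trade-off is freshness: the cache reflects the source table only as of the last rebuild.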
Retrieving data from such a consolidated table is much faster:
(185916 row(s) affected)
 SQL Server Execution Times:
   CPU time = 3448 ms, elapsed time = 3116 ms.

(185916 row(s) affected)
 SQL Server Execution Times:
   CPU time = 1410 ms, elapsed time = 1202 ms.
Rule # 5 — Every Rule Has an Exception
I’ve shown a couple of examples of how eliminating redundant data types shortens query execution time. But this is not always the case.
For instance, the BIT data type has a peculiarity: SQL Server optimizes the on-disk storage of groups of such columns. If a table contains 8 or fewer BIT columns, they are stored in the page as 1 byte; if it contains from 9 to 16 BIT columns, they are stored as 2 bytes, and so on. The good news is that the table takes up little space, which reduces disk I/O.
The bad news is that implicit decoding takes place when the data is retrieved, and this process is very demanding in terms of CPU resources.
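The storage rule is simple arithmetic: n BIT columns occupy n/8 bytes, rounded up. The sketch below compares this with INT (4 bytes per column) and TINYINT (1 byte per column) for an assumed 31 flag columns; it counts only the column data and ignores row overhead.

```sql
DECLARE @cols INT = 31; -- assumed number of flag columns

SELECT
      int_bytes     = @cols * 4         -- 4 bytes per INT column
    , tinyint_bytes = @cols * 1         -- 1 byte per TINYINT column
    , bit_bytes     = (@cols + 7) / 8;  -- BIT columns are packed 8 per byte, rounded up
```

For 31 columns this gives 124, 31, and 4 bytes respectively.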
Here is an example. Assume we have 3 identical tables containing information about the employees' work schedule (31 + 2 PK columns). The only difference between the tables is the data type of the consolidated values (1: presence, 2: absence):
SELECT * FROM dbo.E31_INT
SELECT * FROM dbo.E32_TINYINT
SELECT * FROM dbo.E33_BIT
With less redundant data types, the table size decreases considerably (especially for the last table):
table_name        data_size_mb
----------------- ------------
dbo.E31_INT       150.2578
dbo.E32_TINYINT   50.4141
dbo.E33_BIT       24.1953
However, there is no significant speed gain from using the BIT type:
(1000000 row(s) affected)
Table 'E31_INT'. Scan count 1, logical reads 19296, physical reads 1, read-ahead reads 19260, …
 SQL Server Execution Times:
   CPU time = 1607 ms, elapsed time = 19962 ms.

(1000000 row(s) affected)
Table 'E32_TINYINT'. Scan count 1, logical reads 6471, physical reads 1, read-ahead reads 6477, …
 SQL Server Execution Times:
   CPU time = 1029 ms, elapsed time = 16533 ms.

(1000000 row(s) affected)
Table 'E33_BIT'. Scan count 1, logical reads 3109, physical reads 1, read-ahead reads 3096, …
 SQL Server Execution Times:
   CPU time = 1820 ms, elapsed time = 17121 ms.
But the execution plan will show the opposite.
So the negative effect of decoding will not appear if a table contains fewer than 8 BIT columns. Note that the BIT data type is hardly used in SQL Server metadata. The BINARY data type is used more often, but it requires manual manipulation to obtain specific values.
Rule # 6 — Delete Data That Is No Longer Required.
SQL Server supports a performance optimization mechanism called read-ahead. This mechanism anticipates the data and index pages needed to fulfill a query execution plan and brings them into the buffer cache before they are actually used by the query.
So if a table contains a lot of unneeded data, this may lead to unnecessary disk I/O. Besides, getting rid of unneeded data reduces the number of logical reads from the buffer pool.
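Obsolete rows are often removed in small batches so that each transaction stays short. The loop below is a minimal sketch, not a recommendation for any particular system: it uses the dbo.format_history table from Rule # 2, and the one-year retention period and the batch size of 10,000 rows are arbitrary assumptions.

```sql
DECLARE @cutoff SMALLDATETIME = DATEADD(YEAR, -1, GETDATE()); -- assumed retention: one year
DECLARE @batch  INT = 10000;                                  -- assumed batch size

WHILE 1 = 1
BEGIN
    -- Delete at most @batch obsolete rows per iteration.
    DELETE TOP (@batch)
    FROM dbo.format_history
    WHERE format_date < @cutoff;

    -- A partial (or empty) batch means nothing older than the cutoff is left.
    IF @@ROWCOUNT < @batch BREAK;
END;
```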
In conclusion, my advice is to be extremely careful when selecting data types for columns and to try to predict future data loads.