SQL Query Optimization and Normalization
SQL Query Optimization and Normalization
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
The same data can be retrieved from a database using different SQL queries, but for better performance the user needs to understand data normalization and query optimization. Below are a few concepts to improve overall performance.
Normalization: Normalizing data means eliminating redundant information from a table and organizing the data so that future changes to the table are easier.
De-normalization: This means allowing redundancy in a table. The main benefit of de-normalization is improved performance with simplified data retrieval and manipulation. This is accomplished by reduction in the number of joins needed for data processing.
What structure can the user implement for the database to speed up table reads?
The user needs to follow the rules of DB tuning:
- Properly use indexes (different types of indexes)
- Properly locate different DB objects across different table spaces, files and so on
- Create a special space (table space) to locate some of the data with special data type (for example CLOB, LOB etc.)
What are the tradeoffs of indexes?
- Faster selects, slower updates.
- Extra storage space to store indexes. Updates are slower because in addition to updating the table you have to update the index.
Why can a "group by" or "order by" clause be expensive to process?
Processing of "group by" or "order by" clause often requires creation of Temporary tables to process the results of the query, which can be very expensive for large data.
Always try to avoid NOT operator because positive operators (LIKE, IN, EXIST, =) has better performance than negative operators (NOT LIKE, NOT IN, NOT EXIST, !=). Negative operators search into each row to identify the data where positive operators stop searching once the result is found.
Use Default Value
While designing the database, assign default value to columns. This helps to improve performance while inserting the data.
Query can be optimized by using proper symbol operator such as >,<,=,!=, etc.
1. SELECT * FROM table_name
2. SELECT * FROM table_name WHERE column_name > 100
3. SELECT * From table_name WHERE column_name >=99
4. SELECT column_name1, column_name2 From table_name WHERE column_name >=99
Query 1 will return all records from database and if database has few billion records then it will kill the database. Query 2 will look for value 100 in database then will perform action on value 100. Query 3 will not look for value 100. Query 4 is more optimized than query 3 because it is fetching only required data. For Better performance always limit the result using WHERE clause.
Wildcard plays an important role in SQL query performance. It slows down for large tables. Performance can be improved by using postfix instead of pre and full wild card.
1. SELECT * FROM table_name WHERE column_name LIKE '%test%';
2. SELECT * FROM table_name WHERE column_name LIKE '%test; (column from where clause should be indexed)
3. SELECT * FROM table_name WHERE column_name LIKE test%';
Also try to avoid SUBSTR in SQL query, rather use Wildcard.
We always write sub queries in main query. Most of the time sub queries are very expensive because outer query executes before inner query.
1. SELECT * FROM table_name1 WHERE column_name1 IN (SELECT column_name2 FROM table_name2)
2. SELECT * FROM table_name1 t1, table_name2 t2 WHERE t1.column_name1 = t2.column_name2;
Union operator runs more faster and has better performance than OR operator
1. SELECT * FROM table_name WHERE column_name1 = 'test1' OR column_name2 = 'test2'
2. SELECT * FROM table_name WHERE column_name1= 'test1'
SELECT * FROM table_name WHERE column_name2 = 'test2'
HAVING and GROUP BY
GROUP BY gives better performance than using HAVING clause because HAVING clause operates after retrieving data from database.
1. SELECT column_name, COUNT(column_name) FROM table_name GROUP BY column_name
2. SELECT column_name, COUNT(column_name) FROM table_name WHERE column_name >=100
GROUP BY column_name
If you want to learn more, or if you're interested in any of Imaginovation's service offerings, which include: Web Design, Mobile Application Development, Search Engine Optimization, please visit our website at www.imaginovation.net, or feel free to give us a call at (888) 723-8643.
Opinions expressed by DZone contributors are their own.