There are cases where one needs an index, and in particular a unique constraint, on a subset of the rows of a database table. In other words the index is required within a predicate.
Many relational databases supply a technique to address the problem without adding further structures.
The Case Study
In order to illustrate the topic and the concepts of this essay we avail ourselves of the following case.
Suppose to have the following requirements:
- store a transaction involving a date (year, month, day), a type of operation and an amount;
- for each type of operation only an operation a day is allowed;
- store all changes of the amount.
The first two requirements suggest the spontaneous creation of two related entities as from the following E/R diagram:
The requirements say nothing about storing audit data such as when the row has been inserted, but experience teaches us that it is a worthy piece of information, so we have added the "WHEN" field expecting this use (1), but it will play a more important role.
The last requirement complicates matters and there is not a master path to the solution: the extra information may be managed either in the "OPERATION" structure or in an auxiliary one. In the next paragraphs we examine the alternatives highlighting pros and contras.
The Naive Solution
It seems that adding the "WHEN" field to the primary key solves the problem:
In a sense it is true, but from a semantic point of view this is not satisfactory as there is a more or less desirable implicit assumption: for a given operation type and date the current amount value is given by the row with the maximum "WHEN" value.
This supposal has at least two consequences:
- returning back to the value of a previous inserted amount means adding a new row with a later "WHEN" value: in other words all other data is duplicated;
- retrieving of information is complex.
Perhaps the first point is not an issue as, in this way, we preserve the history of changes, but retrieving the current value is unnatural and requires joining the "OPERATION" table with a subset of itself.
For example the view of all current value operations is given by the following SQL statement:
select t1.OP_CODE, t1.OP_DATE, t1.AMOUNT from OPERATION t1 inner join ( select OP_CODE, OP_DATE, max(WHEN) from OPERATION group by OP_CODE, OP_DATE) t2 on t1.OP_CODE = t2.OP_CODE and t1.OP_DATE = t2.OP_DATE
The use of the aggregate function max, moreover in a nested sub-query, for a so simple exigency seems excessive.
Are there simpler solutions?
Solutions Involving Auxiliary Structures
Before analyzing solutions that involve the only "OPERATION" table, it pays to understand how to deal with the problem with auxiliary structures: we can make more informed decisions.
In the following E/R diagrams we hide the "OPERATION_TYPE" table as it is never necessary to explain the representation of the solution.
Solutions Involving Auxiliary Structures
Going back to the previous figure we can mark the current amount referencing it in a specific table as in the following E/R diagram:
Note that the primary key in the "CURRENT_OPERATION" table is "OP_CODE" + "OP_DATE".
This expedient allows to manage the data integrity and assures that there will be only a current amount with a given type and date.
We still need a join in order to get all current amounts, but the SQL statement is fairly obvious:
select t1.OP_CODE, t1.OP_DATE, t1.AMOUNT from OPERATION t1 inner join (select OP_CODE, OP_DATE from OPERATION) t2 on t1.OP_CODE = t2.OP_CODE and t1.OP_DATE = t2.OP_DATE
The join is required as the amount is only stored in the main table and not replicated in the current one.
Why not storing also the "AMOUNT" in the "CURRENT_OPERATION" table?
Well doing so we need to guarantee data consistency and it introduces unpleasant sophistication.
Auxiliary History Table
A possible answer to the previous question is depicted in the following E/R diagram:
The "OPERATION" table stores the current operations and the history table all the operations: in other words the "OPERATION" table is an appropriate subset of the "OPERATION_HISTORY" table. Now the SQL statement cannot be simpler:
select OP_CODE, OP_DATE, AMOUNT from OPERATION
Note that the "WHEN" field has been moved in the history table: it makes the "OPERATION" table as clean as possible.
This solution unfortunately introduces two difficulties:
- Consistency: how can we guarantee that the amount is the same in the related rows?
- Integrity: how can we guarantee that not exist rows in "OPERATION" that are not in "OPERATION_HISTORY"?
These problems may be addressed using database triggers in environments that allow them, but in this way we introduce more complexity instead of reducing it.
Adding the "WHEN" field to the "OPERATION" table is of little help:
The relation between the two tables has been removed as the two sets of rows shall be disjoint: it makes consistency evident, but we still need database triggers to check that the intersection is void (2).
The use of auxiliary tables reduces the complexity of the queries by transferring it in the management statements (insert, update and delete): this shift is welcome as typically reading occurs many more times than writing.
The technique of the "OPERATION_HISTORY" table seems more natural, but it hides pitfalls.
The technique of the "CURRENT_OPERATION" is effective from a relational point of view.
The Revisited Naive Solution
As the use of auxiliary tables may raise new challenges, it is instinctive to try to do without them.
We explored other possibilities in order to simplify a SQL statement, but going back to the offending query the problem was the identification of the current amount: why not make it explicit adding a boolean flag to the "OPERATION" table?
In this way the complexity has moved in the management statements (insert, update, delete) that have to handle appropriately the "CURRENT" field.
But now the intricate query can be reduced to:
select OP_CODE, OP_DATE, AMOUNT from OPERATION where CURRENT = 'Y'
As managing insertion complexity is more desirable that query complexity, this solution would end this analysis except that we should also ensure data consistency.
Unfortunately with the introduction of the "CURRENT" flag, how can we guarantee that there is at most one current amount for a given operation type and date?
Before trying to answer the previous question and explore some alternatives, we introduce a real world improvement to the modeling.
Primary Keys and Unique Constraints
In the last E/R diagram the primary key of the "OPERATION" table is composed by three fields: "OP_CODE", "OP_DATE" and "WHEN".
Compound primary keys have at least two drawbacks:
- they are very inconvenient to manipulate especially when joining other tables;
- they often convey business information and/or rules that may change over the time.
A simple and effective workaround is to create a surrogate key, a numeric identifier - unrelated with any business meaning - to use as primary and, in particular, relational key.
Such numeric identifiers - very commonly used today - typically exploit some special features provided by the database to generate (4) incremental and unique values to each new request.
Adding such independent numeric identifier does not eliminate the need to handle the uniqueness of the business key (the natural key of the data). All relational databases allow to creating unique constraints on set of fields, so the problem is easily addressed.
We gain flexibility with such enhancement:
By the way, there is another possibility for defining the unique constraint, less evident but perhaps technically better, that is: "OP_CODE" + "OP_DATE" + "OP_ID"
We do not deepen pros and contras of this variant as not relevant to the present thesis.
Solution with Special Index
With reference to the model in previous figure there is a last problem to face: the data consistency.
We need to find a way to guarantee that only a row with "CURRENT" = 'Y' for each operation type and date is present in the table.
If only unique indexes may be applied on a subset of the table data ...
But wait! Perhaps some database environments can supply the sought functionality. And the answer is yes! They are called partial indexes, filtered indexes, conditional indexes, or with some other adjective.
There is no agreement on the name and syntax but the need was met and implemented.
If such a special index is provided then the unique constraint shall be defined on the "OP_CODE" and "OP_DATE" fields only filtering for "CURRENT" = 'Y'; the "WHEN" field in effect become superfluous from a structural point of view and we leave it as remainder of the original audit purpose:
When applicable, this is clearly a very clean and effective strategy from a modeling point of view (5): readability, minimal structure, self-evidence, simple queries.
Microsoft SQL Server, PostgreSQL, SQLite RDBMS
Recent versions of the Microsoft SQL Server RDBMS, the PostgreSQL RDBMS, the SQLite RDBMS supply the possibility to create filtered indexes by adding the clause where.
Actually PostgreSQL and SQLite call it partial indexes instead of filtered indexes.
The syntax to define an index on a table subset is crystal clear as we can appreciate by means of the following SQL statement:
create unique index OPERATION_UK on OPERATION (OP_CODE, OP_DATE) where CURRENT = 'Y'
The Oracle RDBMS does not support the where clause to create filtered indexes, but it provides a flexible though complex syntax to create functional indexes able to get the same result.
This is the SQL statement to build the sought index:
create unique index OPERATION_UK on OPERATION ( (case when CURRENT = 'Y' then OP_CODE else NULL end), (case when CURRENT = 'Y' then TRUNC(OP_DATE, 'DD') else NULL end) )
The use of the TRUNC function is a technical expedient that allows to treating the Oracle Date type as a Date and not as a DateTime.
No other RDBMS - with the possible exception of IBM DB2 - seem to provide filtered index, at least in a so direct way as shown in the previous two paragraphs.
- MySQL calls partial the indexes with functions applied to part of a column (such as substrings); in other words it does not supply indexes for table row subsets;
- IBM DB2 supplies a fairly complex syntax for creating indexes, but they do not look like supporting predicate to filter rows;
- SAP Sybase ASE (Adaptive Server Enterprise) - even though shares the same origin with Microsoft SQL Server - supports partial indexes but requires partitioned (6) tables.
Perhaps an in-depth analysis of the SQL dialect of specific database systems may highlight other opportunities, but for our case study this is not necessary as a smart alternative exists and it works with near all the RDBMS.
The Last but not the Least resource: The Tricky NULL Value
We recall a couple of facts:
- near all relational databases allow to assign the null value to column fields;
- near all relational databases require the 'IS NULL' predicate to check if the value of a column field is null, in other words the use of '=' operator with nulls does not work.
Well, the answer is at first sight surprising: the SQL standard states that unknown is the result of the expression
|NULL = NULL
We cannot compare a field value with a NULL value because unknown is the result of the expression
|FieldName = NULL
This behavior derives by the three value logic (3VL) implemented in SQL (8). The 3VL is based on the three values TRUE, FALSE and UNKNOWN:
- all expressions involving only the values TRUE and FALSE are resolved in the classical boolean way;
- all expressions including at least one UNKNOWN value are resolved as UNKNOWN.
Obviously in SQL the UNKNOWN role is taken by the NULL value.
In this scenario NULL values shall be handled accordingly in the relational database indexes, and in particular with the unique constraints!
Near all relation databases allow to index columns that accept NULL values: check specific RDBMS documentation for details and possible exceptions.
Armed with this knowledge, it is not difficult to figure out how to adapt our model in order to exploit the new logic: it is sufficient to simply force the "CURRENT" field to accept the values 'Y' and NULL instead of 'Y' and 'N':
Now the following unique constraint magically works!
create unique index OPERATION_UK on OPERATION (OP_CODE, OP_DATE, CURRENT)
For a given "OP_CODE" and "OP_DATE" we can have only a row with "CURRENT" = 'Y' and any number of rows with the "CURRENT" value null as they are all different pursuant of the fact that
|"OP_CODE" + "OP_DATE" + NULL
||cannot ever be equal to
||"OP_CODE" + "OP_DATE" + NULL
due to the NULL presence in the expression!
The main purpose of this article is to highlight the flexibility of the relational database paradigm and the data integrity problems that also a simple case study may raise. We showed various techniques to approach the difficulties trying to evaluate pros and contras steadily seeking simplicity.
In summary we have compared two different kind of database extensions: the first based on auxiliary tables and the second based on auxiliary indexes (unique constraints to be more precise).
The strong point of using auxiliary tables is that it works with any relational database, while the strong point of using indexes is the neatness.
We focused our efforts on filtered indexes as they are a powerful resource when available, but eventually we introduced an expedient worthy of being understood as it is based on 3VL that is at the core of the relational database theory.
The last techniques unfortunately may not suitable or convenient: the interest for filtered (or partial) indexes is due to the fact they provide extra adaptability.
|CURRENT = 'Y'||==>||REMOVED = 'N' or REMOVED is NULL|
|CURRENT = 'N' or CURRENT is NULL||==>||REMOVED = 'Y'|
The use of this virtual deletion of a record does not allow the application of the unique constraints with NULL values unless to codifying REMOVED is NULL as logically equivalent to REMOVED = 'Y', but it is strongly counterintuitive.
|1||In the real world some other fields will be required (such as the user that performed the operation), but they are not necessary in this abstract and simplified context.|
|2||The intersection involves not the all columns, but only the key fields "OP_CODE", "OP_DATE" and "WHEN".|
|3||The exact where condition depends on the SQL dialect as some databases does not handle directly boolean values and they require to treat boolean values with integers (1 and 0) or characters ('Y' and 'N', or 'T' and 'F', or '1' and '0').|
|4||Some databases supply auto-incremental numeric types, others the concept of sequence: a sequence is an auto-incremental resource more general than auto-incremental types.|
|5||Well, somebody may object that separating current values from other values putting them in different tables (see solution with history table) is good for performance and maintainability. Perhaps this may be true, but if we have such troublesome then we may define partitioned tables (when supported) on "CURRENT" flag!|
|6||Table partitions address important issues in supporting large collections of rows by decomposing them in smaller and more manageable blocks. The RDBMSs that support partitions often allow to create specific indexes for each partition.|
|7||The programming languages take generally a different position: in Java for example the expression variable == null is a standard Boolean expression that it is true when variable is null and false otherwise. Be aware that languages related to database - such as the PL/SQL, the Transact SQL - follow the database paradigm.|
|8||3VL has advantages, but also drawbacks, for example, writing SQL where conditions is complex and insidious when columns with null values are involved.|