What Goes Around Comes Around is a fascinating paper about the (cyclical) history of data modeling. It was written by two database experts: Joseph Hellerstein, a computer science professor at UC Berkeley, and Michael Stonebraker, founder of Ingres and Postgres and winner of the 2014 Turing award.
This article came to my attention as the first paper discussed in Readings in Database Systems (or the "Red Book," also in part by Michael Stonebraker), linked to me by a co-worker at Grakn. The book has proven to be a very good reference point for tried-and-tested techniques and architectures in database management systems.
Written in 2005, it presents a surprisingly relevant view on the current state of databases and data models. The trends it outlines are still holding true, with models such as JSON and property graphs sitting firmly in the "semi-structured" era it describes.
The paper takes us through nine eras of data models:
- Hierarchical (IMS): Late 1960s and 1970s
- Network (CODASYL): 1970s
- Relational: 1970s and early 1980s
- Entity-Relationship: 1970s
- Extended relational: 1980s
- Semantic: Late 1970s and 1980s
- Object-oriented: Late 1980s and early 1990s
- Object-relational: Late 1980s and early 1990s
- Semi-structured (XML): Late 1990s to the present
In reading the paper, it's clear that there are certain lessons to be learned from each era. Certain things clearly worked and other things clearly did not, and it shows that innovations don't always get adopted in a straightforward way. Moreover, it's also evident that the same sorts of issues appear and re-appear and the same lessons keep being relearned.
In order to illustrate these historical patterns in database theory and innovation, the authors use the classic example of suppliers and parts from Codd:
Supplier (sno, sname, scity, sstate) Part (pno, pname, psize, pcolor) Supply (sno, pno , qty, price )
This is presented as a relational schema. Each line is a relation (or table), with the attributes in brackets. So, in this case, we have a set of suppliers and a set of parts. There is also a "supply" relation, indicating that a particular part is a supplied by a particular supplier, with the given quantity and price.
IMS (Information Management System) was released in 1968. It structured data into record types with fields, similar to the relational table above. Each record type (except the root) has one parent record type. Similarly, every instance of a record type needs a parent that is an instance of the parent record type. This is called a hierarchical model because it forms a hierarchy of record types and instances.
This model is so limited that we can't even represent our example properly!
Our options are either to make
Part a child of
Supplier (left), or
Supplier a child of
Part (right). Notice that in the former case, we end up duplicating part information if the same part is sold by different suppliers. In the latter case, we will duplicate supplier information if the same supplier sells different parts.
This redundancy is an issue in terms of storage efficiency as well as consistency. If the same data is stored in two places, special care has to be made to make sure that both pieces of data do not go out of sync.
The former case also cannot model a part that is not sold by any supplier. Similarly, the latter case cannot model a supplier who sells no parts.
IMS was queried record-at-a-time by navigating the hierarchy. For example, to find all the red parts supplied by Supplier 16:
get unique Supplier (sno = 16) until failure: get next within parent (pcolor = red)
This reads a lot like an imperative programming language; we have step-by-step instructions, state, and control flow. We essentially have to describe explicitly the algorithm to execute in order to complete a query. Working out how to run a query quickly can be a challenge. Even something this simple may have a faster execution method in certain circumstances (for example, if we have an index to look up red parts, but not one for supplier numbers).
Additionally, IMS supported several ways to store records based on the key(unique, ordered identifier for a record): you could store them sequentially, or indexed with a B-tree, or hashed. This meant you could choose whatever would provide the best performance for your use case.
However, the storage method chosen would actually disable certain operations. For example, you could not insert a single record when they were stored sequentially. If using the hash method, then you could not use "get next," such as in the query above. This demonstrates a lack of physical data independence — we cannot change the physical data level without also changing the logical data level. This becomes very important with databases that are maintained for a long time: business requirements change, the data stored changes and increases. At some point, a different physical data representation could be necessary to improve performance. This was a motivator behind the relational model, which we'll discuss later.
IMS did offer a little support for logical data independence — meaning we can extend the logical data level without impacting applications built on it. It allowed exposing only a subset of record types (essentially a sub-tree of the hierarchy). This meant new record types could be added without disrupting the view of the data.
IMS has seen several extensions to its model such that it can (sort of) handle more complicated models such as our example. It is actually still in use and maintained today (the latest release was in 2015). In general, users appear to be legacy systems where the costs outweigh the benefits of switching to something more modern.
So these are the lessons from IMS:
- Physical and logical data independence are highly desirable.
- Tree-structured data models are very restrictive.
- It is a challenge to provide sophisticated logical reorganizations of tree-structured data.
- A record-at-a-time user interface forces the programmer to do manual query optimisation, and this is often hard.
Independently of IMS, CODASYL (Committee on Data Systems Languages) created several reports between 1969 and 1973 describing the network model, a more flexible model that allows a record type to have multiple parents.
In our example, we introduce two arrows:
Supplied_by. These are called set types. The arrow goes from the owner record type to the child record type. An instance of the owner type has a set of instances of the child type (this is a little distinct from a mathematical set because the set must have an owner). So in our example above, a
Supplier may have any number of
Supply children via the set
This extra flexibility has allowed us to eliminate all redundancy. Additionally, it's perfectly possible to have a
Supplier without any
Parts and vice-versa - they would just have no
Like IMS, CODASYL uses a record-at-a-time query language:
find Supplier (SNO = 16) until no-more: find next Supply record in Supplies find owner Part record in Supplied_by get current record check current record (color = red)
In this case, we have several "cursors" pointing to records — a cursor for the
Supplier, a cursor for the
Supply, a cursor for the
Part and a cursor for the current record. When writing a CODASYL query, the programmer has to consider the location of all these different cursors in their head. The paper explains it best: "One way to think of a CODASYL programmer is that he should program looking at a wall map of the CODASYL network that is decorated with various colored pins." As you can imagine, this isn't terribly easy or intuitive!
An additional issue with the network model is loading the data. Building the sets of records can be time-consuming and often the order things are loaded can make a big difference in terms of performance. Working out the order things should be loaded is non-trivial.
There are still some databases floating around based on CODASYL, such as IDMS; however, they don't seem to be widespread.
The lessons from CODASYL are:
- Networks are more flexible than hierarchies but more complex.
- Loading and recovering networks is more complex than hierarchies.
The relational model was proposed by Ted Codd in 1970, motivated in part by producing a better model than IMS for data independence. Data is structured as relations — sets of tuples. The query language is set-at-a-time, performed on entire relations using operators such as
select σ and
join ⋈ :
σ[sno=16 ∧ pcolor=red](Supply ⋈ Supplier ⋈ Part)
This combination of a simple model and a high-level language allows for much better logical and physical data independence: using the relational operators, we can describe logical relations in terms of physical ones (the ones that are actually stored to disc). Applications do not need to know the difference.
In contrast to the previous models discussed, the relational model took a more theoretical approach: it is (deliberately) less focused on implementation and therefore does not have a physical storage proposal. You can use whatever storage may be appropriate (whether that involves storing them unordered, indexed with a B-tree, or both).
Not only that, but this higher-level algebra allows a system to use rules of replacement to optimise the query. For example, using rules like the commutativity of ⋈ we can rewrite the query as:
σ[sno=16](Supplier) ⋈ Supply ⋈ σ[pcolor=red](Part)
The database system can determine whether this is faster using knowledge about the storage, such as any indices available and the size of the relations in question.
This new model kicked off the "great debate" between Ted Codd (supporting the relational model and mostly backed by academics) and Charlie Bachman (supporting CODASYL and mostly backed by DBMS users).
CODASYL was criticised for being simultaneously too complex and too inflexible to represent common situations. Record-at-a-time programming was also considered too hard to optimise.
CODASYL advocates responded to this criticism by introducing their own set-at-a-time query languages that would also allow better physical and logical data independence. There were also efforts to clean the model to make it easier to understand.
The relational model was criticised for being too academic: relational algebra queries were not easy to read or understand. Additionally, the relational model did not have a physical storage proposal so it was not initially clear that an efficient implementation could even be built.
Relational advocates responded to this by creating "natural" query languages based on the relational algebra such as SQL. Implementations such as System R and INGRES eventually proved that the model was viable and that automatic query optimisation could be competitive even with very skilled programmers.
So the two camps responded to the criticisms of the other and attempted to arrive at solutions. In practice, these differences didn't matter as much as you might expect. Having previously exclusively backed IMS, IBM changed its tune in 1984 and announced dual support for IMS and DB/2 (an early relational database). DB/2 was the newer and easier-to-use technology, so naturally was selected over IMS. This advantage was enough to settle the issue.
Tangentially, this also established SQL as the relational query language. Nowadays, "relational" and "SQL" are often treated as synonymous, but at the time SQL had competition (such as QUEL). There are people of the opinion that SQL is not a good language for the relational model, such as Hugh Darwen and C. J. Date, writers of The Third Manifesto. One of their criticisms of SQL is that it is not strictly the same as the relational model. For example, SQL allows an additional NULL value for absent or unknown information. SQL relations also allow duplicate rows, meaning they are not sets.
So, our lessons:
- Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence.
- Logical data independence is easier with a simple data model than with a complex one.
- Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.
- Query optimizers can beat all but the best record-at-a-time DBMS application programmers.
We've looked at three eras of data models: IMS, CODASYL, and relational. We've seen the importance of physical and logical data independence, the limitations of hierarchical models, and the advantages of set-at-a-time high-level query languages.
At Grakn, we're building a knowledge base and attempting to follow these lessons. We have a high-level set-at-a-time query language, with built-in reasoning allowing high logical data independence. Additionally, we use a flexible entity-relationship model (which you will see in the next post) to structure our data.
Relational databases continue to dominate and remain a hugely popular data model even thirty years later. Nonetheless, more recent competition from other models such as graph databases and NoSQL has made the contemporary database landscape much more diverse and has provided a new set of tools to developers. In the next blog post, we'll look at some of the key "post-relational" models:
- Extended relational
- Semi-structured (XML)