The Unreasonable Effectiveness of SQL
''SQL is dead. Long live SQL.''
Two score and five years ago, two young IBM researchers brought forth databases; a new language conceived in relational and dedicated to the proposition that data can be manipulated declaratively and easily. In the years since Don Chamberlin and Ramond Boyce published SQL: A Structured English Query Language, relational models and SQL have been extended and adapted to a significant number of technologies: OLTP, OLAP, object databases, object-relational databases, and even NoSQL. SQL has inspired query language design for non-relational databases: SQL for object databases, SQL for object-relational, SQL for XML, SQL for spatial, SQL for search, SQL for JSON, SQL for timeseries, SQL for streams, and so on. Every BI tool interacts with the data using a variety of SQL. In fact, SQL is the most successful 4th generation language.
“SQL is a device whose mystery is only exceeded by its power.”
As Don said recently, SQL was based on the foundation of relational algebra with the goal to make it easier by providing an English-like query language with the following goals:
- A declarative language and processing (instead of procedural)
- Make the language composable to help write complex queries easily
- Work with the relational model developed by Edger F Codd.
While big data tried to compliment and replace relational systems for data warehousing, they tried to speak the same language: SQL. Hive, Impala, drill, and BigSQL all speak SQL inspired language, optimizer and execute similar to the MPP execution of SQL. They're also adding new SQL features regularly. All this on every type of data store and model you think of. The separation of the data storage formats, data model, and query processing in SQL has yielded significant benefits. In the forty-five years since SQL was introduced, many databases have come and gone; many data processing have come and gone. Some in the NoSQL movement implied, even if inadvertently, the death of SQL and SQL databases. SQL camp has taken this in stride and Don Chamberlin recently said: “When a language is so well recognized that other languages start defining themselves as not that one, it must be doing pretty good.”
Databases, on the other hand, simply went No-SQL. While the current definition of this is “Not Only SQL,” the original approaches were to go without SQL and try alternative languages and frameworks like map-reduce. A decade later, every popular NoSQL databases had a variation of SQL: N1QL in Couchbase, CQL in Cassandra, ElasticSearch SQL in Elastic. You say, “MongoDB doesn’t have SQL.” I say, “Squint! You’ll see a very simplistic SQL implementation.” By using a simplistic, somewhat procedural and ad-hoc design in MongoDB, queries loose composability, optimizations, and many of the innovations done with SQL.
While the relational model has been very successful, databases support a variety of data models: JSON, Graph, XML, Timeseries, spatial, wide-column, columnar, document and more. Most, if not all of these databases have their version of SQL. N1QL is SQL for JSON; SQL/XML, SQL from InfluxDB, SQL/Spatial, CQL in Cassandra, etc. Even the NoSQL databases have implemented SQL and SQL-inspired query languages. Even in the new cool “data science” world, SQL skills are highly recommended. Lukas Eder makes this point in his must-see talks. See the links for his talks in the reference section below.
Now, there are more SQL projects in NOSQL databases than SQL databases.
|Data models/formats||SQL implementation|
|JSON||Couchbase N1QL: SQL for JSON|
|Wide column||Cassandra CQL|
|Hadoop/Big Data||Hive, Impala, Drill, BigSQL|
|Graph||SQL Graph Database, Oracle Graph|
|NoSQL database||Apache Phoenix|
Why Is SQL so Successful?
- Declarative: You declare the output and the query engines figure out the optimal way to execute the query. The optimizer, especially the cost based optimizer invented Pat Selinger, et al in 1979, has helped to continuously improve the performance. This set a high bar for every new entrant to live up to. A recent paper on Apache Hive is an example of complexity and sophistication involved.
- SQL was used not just for “querying,” but to update the data doing transactions. Stored procedures, UDFs, have extended the reach by combining procedural language with the declarative SQL.
- SQL has been malleable. It has been standardized many times, each time adding a book full of features, a store full of syntaxes, and a dictionary full of keywords. For sure, not all SQL is the same. Even the traditional SQL implementations on RDBMS are not exactly compatible unless you're careful to write your SQL to be compatible. Through all these, the original spirit of SQL has remained. An example of SQL lending itself to evolution is SQL++. Don Chamberlin and Prof. Mike Carey discuss the need for supporting complex data models, making the data in JSON easily accessible to users and developers. Don's book SQL++ For SQL Users: A Tutorial teaches you the recent developments in SQL++, language designed for data processing on the flexible JSON data model, but in a SQL compatible way.
- SQL, like the English language it borrowed from, has been open to new ideas and extensions for new data types, access methods, use cases.
- SQL’s independence from the data representation has allowed itself to be used on non-relational data: CSV, JSON, and all the big data format. Some folks conflate rigidity of relational model representation to the rigidity of SQL. In fact, on a given schema, SQL allows you to select-join-group-aggregate-project any of the data formats.
Evaluating SQL Support
Now that SQL is everywhere, you need to do the due diligence on the level of support.
- Find out the workload characteristics and goal of each workload. E.g. interactive applications or interactive analytics or batch analytics or BI workload, etc.
- Statements supported reflects operational capabilities
- Language capabilities in terms of expressions (scalar, aggregate, boolean), joins (inner, left/right/full outer), subqueries, derived tables, sorting and pagination (LIMIT/OFFSET)
- Indexing: SQL without the right indexes is just a Turing machine prototype
- Optimizer: Query rewrite, choosing the right access path, creating the optimal query execution path is what makes SQL a successful 4GL. Some have a rule-based optimizer, some have a cost-based optimizer, some others have both. Evaluating the quality of the optimizer is critical. Typical benchmarks (TPC-C, TPC-DS, YCSB, YCSB-JSON) won't help you here
- As the saying goes: "There are three things important in databases: performance, performance, and performance." It's important to measure the performance of your workload. YCSB and the extended YCSB-JSON will make this evaluation easier
- SDKs: Rich SDKs and language support speed up your development
- BI tool support: For large data analysis, support from BI tools usually via standard database connectivity drivers is important
Gerald Sangudi, the creator of N1QL, once remarked SQL is successful because it represents the fundamental operations of data processing. SQL supports a rich set of operations select-join-group-aggregate-having-window-order-paginate-setops. Is this how we (or machines) think when specifying data operations? While that remains to be seen, other languages like python and java are adding operators for these operations on data. Maybe others will follow suit. SQL has gone where the relational model didn't. It's not an exaggeration to say:
SQL is dead. Long live SQL.
- SQL++ For SQL Users: A Tutorial
- Lukas Eder - 2000 Lines of Java? Or 50 Lines of SQL? The Choice is Yours!
- Ten SQL Tricks that You Didn’t Think Were Possible
- How Modern SQL Databases Come up with Algorithms that You Would Have Never Dreamed Of by Lukas Eder
- SQL History
- Eugene Wigner: The unreasonable effectiveness of mathematics in the natural sciences
- Alon Halevy, Peter Norvig, and Fernando Pereira: The Unreasonable Effectiveness of Data
- Difference between 3GL and 4GL