The Coming SQL Collapse
I looked at neo4j briefly the other day, and quite predictably thought ‘wow, this looks like a serious tinkertoy: it‘s basically a bunch of nodes where you just blob your attributes.‘ Worse than that, to wrap objects around it, you have to have them explicitly incorporate their node class, which is ugly, smelly, violates every law of separation of concerns and logical vs. physical models. On the plus side, as I started to look at it more, I realized that it was the perfect way to implement a backend for a bayesian inference engine (more on that later). Why? Because inference doesn‘t care particularly about all the droll requirements that are settled for you by SQL, and there are no real set operations to speak of. One of the things the status quo defenders of SQL almost always forget is that they have to write themselves a pass to violate the most basic laws of their preferred screed. The NOSQL thing is going to bring that whole reality to the fore and my prediction is that you are going to see a larger exodus away from SQL than we have seen in the past 2 decades.
Here are some of the specific reasons:
- Most interesting models in the future will involve discovery, SQL cannot handle discovery – Have you ever worked on a project that was based on some element of discovery? Let‘s take a monitoring app. You don‘t know what type of devices you are going to encounter or what properties they are going to expose. How do you model that with SQL? If you sat 10 programmers down in an interview and gave them this problem, 10 of them would resort to making a model where they stored device names in one table and attributes in another and then they joined them. ‘Gee, how can you account for types that way?‘ Yeah, you can‘t. But that won‘t stop most programmers. They‘ll add a type field and go about their business. Actually, I have come across this in the wild a few times. Believe it or not, the developers will start patting themselves on the back for doing ‘metaprogramming.‘ When in fact, they are doing nothing of the sort. They are taking a language that has a small number of rules and predicates and removing most of them. As Joe Celko is fond of saying, if you can‘t write SQL against your database when you are done, it‘s not really a SQL database now is it? And of course, you can‘t write SQL against a bunch of name/value pairs that are joined in some arbitrary fashion.
- Reasoning requires discovery and also heterogeneity, neither of which SQL supports – ok, so now I want to do some reasoning about something. Suppose I want to predict whether or not you will attend an event that is being put on. Give that problem to a bunch of programmers and the vast majority of them will schematize what they know about you. Now let‘s throw a wrench into the works. Suppose I contract to buy some data from outside but only 1/3 of my population is in that database? SQL is going to require me to create a schema that is a superset of anything I could ever know about you. Then I can juggle all the nulls that will occur for the people who didn‘t fill out those surveys. As we all know, null referencing is the largest source of defects in programming. Things are starting to point to a big mess.
- Inference requires representations that SQL is useless for – now when I start training, I am going to have to store the features samples into distributions, either graphs, e.g. a Gaussian Kernel Estimator, or discretized representations like Histograms or clustered representations. SQL will offer us almost nothing worthwhile in this regard. But further, having been down this road, invariably, in the SQL solution, you will end up storing the native values then recreating the distribution each time. Once someone comes along and taps on your shoulder and says ‘well actually, you are killing the database pounding it so ruthlessly‘ or ‘if it takes 10s to decide, nevermind,‘ you will invariably be off to make a cache of the distributions. In the graph database, you won‘t really have to do that. Someone logs in, I navigate right to that person‘s node, then I have their distributions right there, ready to be used to compute probabilities.
- Schemas in SQL are strong on typing and set operations, weak on relationships – the relationship semantics in SQL are pretty lame. You can basically aggregate things. In discovery environments, and reasoning ones, relationship semantics are much more complex. For instance, I showed some level of interest in event x, a different level in event y. How can I learn from these differences? In a graph or object database, I can easily incorporate that into the model and navigation of the nodes is all I need to do. Can I do that in SQL? Sure, but again, I would have to do it by bolting semantics onto the language that are not there. Remember RUP? The thinness of its relationship semantics caught up to it pretty quickly and after a short while, everything was using archetypes all over the place. Relationships are of greater and greater importance in models, and will continue to become more important.
- SQL is about tables mostly, can do a Tree, Graphs are a nuisance – If I were interviewing you, could you implement a tree in SQL? Properly I mean? I have come across almost no programmers who have ever done it. Joe Celko wrote a whole book on trees. Really there are two ways to implement them: Adjacency and Nested Sets. The latter are very useful and powerful. That said, graphs are much more complicated and the additional amount of mess that has to be managed quickly turns to goo in SQL. Clearly, graph databases have no problem with this.
- SQL has retained its position of supremacy out of performance anxiety FUD – Larry Ellison is flying around in jets and sailing yachts because he has been able to keep people very scared that if they don‘t use his db, they will suffer a huge performance failure when their big night comes. The problem is that now that the cloud is here, people can scale in other ways, and they don‘t have to spend to prevent something out of fear, because they can test different setups, and use free ones until they fail. This will materially change the whole dynamic of how/why people will be willing to spend so much money on databases.
- SQL portrays itself as the only adult performance solution, then relies on caching to deliver the goods – kinda says it all.
- Properly implemented object databases outperform SQL at loading complex graphs, and have for 2 decades – which is what your program does a lot of, if you wrote it properly. Furthermore, those graphs are going to get bigger and more complex as more context is called upon to make the abilities of the app greater.
I still think that an object database would be preferable and will be looking at a few at the same time, but NOSQL is going to make serious inroads and those who are calling these guys simpletons and morons should come out and show us how they can handle some of the cases I have presented here. Also note, in my prior posts in the last month or so, the SQL camp has gotten fat and lazy. They have done nothing about ETL, the amount of work required to evaluate the performance of SQL solutions is still a huge drag, and no attempt has been made to support more sophisticate models. Add to that the fact that languages like R are growing rapidly, and you have a situation where SQL is going to, at the very least, wake to a new Copernican reality: they are not the center of the universe. That should have dawned on them when ORMs took hold, but in fact, those have largely preserved the notion among programmers that the database is the center of the universe, unfortunately.
The SQL defense against NoSQL is that it‘s a flash in the pan and real work can‘t be done with it. That‘s silly. The dumbest part of arguments like this is they assume that the people who are making the solutions are too stupid to know whether or when they will be out of gas. It‘s really pretty simple: when you stop getting the performance you need.