If anything at all, our jOOQ talks at various JUGs and conferences have revealed mostly one thing:
Java developers don’t know SQL.
And it isn’t even necessarily our fault. We’re just not exposed to SQL nowadays.
But consider this: We developers (or our customers) are paying millions of dollars every year to Oracle, Microsoft, IBM, SAP for their excellent RDBMS, only to ignore 90% of their database features and to perform a little bit of CRUD and a little bit of ACID with ORMs like Hibernate. We have forgotten about why those RDBMS were so expensive in the first place. We haven’t paid attention to the various improvements to the SQL standards, including SQL:1999, SQL:2003, SQL:2008, and the recent SQL:2011, which are all mostly left unsupported by JPA.
If only we Java developers knew how easy and how much fun it can be to replace thousands of lines of erroneous Java code with five lines of SQL. Don’t believe it? Check this out:
Promoting SQL, the Language
Instead of simply promoting jOOQ, we started to help Java developers appreciate actual SQL, regardless of the access pattern that they’re using. Because true SQL can be appreciated through any of these APIs:
And believe us, most developers were astonished by what was possible in SQL, when they saw our NoSQL? No, SQL! talk:
Calculating a Running Total
So let’s delve into the essence of the talk and calculate a running total with SQL. What’s a running total? It’s easy. Imagine you have these bank account transaction data in your database:
You’ll notice immediately, that the balance per account transaction is missing. Yes, we want to calculate that balance as such:
If we’re assuming that we know the current balance on the bank account, we can use each account transaction’s
AMOUNT value and subtract it from that current balance. Alternatively, we could assume an initial balance of zero and add up all the
AMOUNT value till today. This is illustrated here:
Each transaction’s balance can be calculated through either one of these formulas:
So, that’s a running total. Easy, right?
But how can we do it in SQL?
Most of us would probably pull out a little Java programme out of their sleeves, keeping all the amounts in memory, writing unit tests, fixing all sorts of bugs (we’re not mathematicians, after all), wrestling with
BigDecimals, etc. Few of us would probably go through the hassle of doing the same in PL/SQL or T-SQL, or whatever other procedural language you have at disposition, and possibly update each balance directly into the table when inserting / updating new transactions.
But as you might have guessed so far, the solution we’re looking for here is a solution in SQL. Please bear with us as we’re going through the examples. They’re getting better and better. And if you want to play around with the examples, download Oracle XE and the running-totals.sql script and get up and running!
What we’ve learned from college / SQL-92 would probably involve a…
Using Nested SELECT
Let’s assume that we have a view like
v_transactions, which already joins the accounts table to the account transactions table in order to access the
current_balance. Here’s how we would write this query, then:
Notice how the nested SELECT uses row value expression predicates to express the filtering criteria. If your database doesn’t support the SQL standard row value expression predicates (and you’re not using jOOQ to emulate them), you can factor them out yourself to form this equivalent query, instead:
So in essence, for any given account transaction, your nested SELECT simply fetches the sum of all
AMOUNT values for account transactions that are more recent than the currently projected account transaction.
Does it perform?
Nope. As you can see, for the relatively simple sample data set (only 1101 records filtered from
account_id = 1 in line 9), there is an INDEX RANGE SCAN materialising a whopping total of 1212K rows in memory. This looks like we have
O(n2) complexity. I.e. a very naïve algorithm is being applied.
While you could probably tune this query slightly, we should still feel that Oracle should be able to devise an
O(n) algorithm for this simple task.
Using Recursive SQL
No one enjoys writing recursive SQL. No one. Let me convince you.
For simplicity, we’re assuming that we also have a
TRANSACTION_NR column enumerating transactions in their sort order, which can be used to simplify recursion:
Ready? Check out this gorgeous piece of SQL!
Ach… How to read this beauty?
Essentially, we’re self-joining the view (common table expression) that we’re about to declare:
In the first subselect of the
UNION ALL expression, we’re projecting the
current_balance of the account, only for the first
In the second subselect of the
UNION ALL expression, we’re projecting the difference of the balance of the
previous account transaction and the
AMOUNT of the current account transaction.
And because we’re recursing into the
ordered_with_balance common table expression, this will continue until we reach the “last” transaction.
Now let’s make an educated guess, whether this performs well…
Well. It doesn’t. We get even more rows in memory, namely 11M rows for what should be at most 1101. Parts of this plan are due to the fact that the
TRANSACTION_NUMBER utility column is another calculated column that couldn’t be optimised by Oracle. But the essence here is the fact that it is already very hard to get it right, it’s even harder to get it fast.
Using Window Functions
So, we’ve suffered enough. Let’s hear some good news.
The best solution for this problem is this one:
Essentially, we’re doing exactly the same thing as with the nested SELECT. We’re subtracting the
SUM() of all
AMOUNT values “over” the subset of rows that is:
- in the same
PARTITIONas the current row (i.e. has the same
- ordered by the same ordering criteria as the account transactions (from the outer query)
- positioned strictly before the current row in the sense of the above ordering
Or, again, visually:
And now, does this perform?
It couldn’t be much faster! Window functions are probably the most underestimated SQL feature.
Using the Oracle MODEL Clause
Now, this is more of a special treat for those SQL nerds among you who want to piss off your fellow developers with eerie, weird-looking SQL. The MODEL clause (only available in Oracle).
Now, how to read this beast? We’re taking the sample data and transforming it to:
PARTITION‘ed by the usual criteria
DIMENSION‘ed along the sort order, i.e. the transaction row number
MEASURE‘ed, i.e. to provide calculated values for date, amount, balance (where date and amount remain untouched, original data)
- be calculated according to
RULES, which define the balance of each transaction (except for the first one) to be the balance of the previous transaction minus the amount of the previous transaction
Still too abstract? I know. But think about it this way:
MS Excel! Every time you have a problem that your project manager thinks is peanuts to solve with his fancy MS Excel spreadsheets, then the MODEL clause is your friend!
And does it perform?
… pretty much so (although the above shouldn’t be confused with an actual benchmark).
If you haven’t seen enough, see another great use-case for the MODEL clause here. And for all the details, consider reading Oracle’s official MODEL clause whitepaper
Do it yourself
Did you like the above examples? Don’t worry. When you’ve seen these things for the first time, they can be pretty confusing. But in essence, they’re really not that complicated. And once you have those features in your tool-chain, you will be indefinitely more productive than if you had to write out all those algorithms in an imperative programming language.
At Data Geekery, we always say:
SQL is a device whose mystery is only exceeded by its power
And sometimes, we also quote Winston Churchill for having said something along the lines of:
Indeed, as the MODEL clause has shown us, SQL can become very extreme. But once you know the tricks and expressions (and most importantly, window functions), you’ll be incredibly more productive with SQL than with many other technologies, if the task at hand is a bulk calculation operation on a simple or complex data set. And your query is often faster than if you had hand-written it, at least when you’re using a decent database.
So let’s put SQL to action in our software!