Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

How to Use Recursive SQL for Data Normalization

DZone's Guide to

How to Use Recursive SQL for Data Normalization

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Recursive SQL can be awesome, although a bit hard to read in its SQL standard beauty. Let’s assume you have some aggregated data with dates and a number of events per date:

|                           DATE | COUNT |
|--------------------------------|-------|
| October, 01 2013 00:00:00+0000 |     2 |
| October, 02 2013 00:00:00+0000 |     1 |
| October, 03 2013 00:00:00+0000 |     3 |
| October, 04 2013 00:00:00+0000 |     4 |
| October, 05 2013 00:00:00+0000 |     2 |
| October, 06 2013 00:00:00+0000 |     0 |
| October, 07 2013 00:00:00+0000 |     2 |

Now let’s assume you want to normalise or “unaggregate” this data, generating “COUNT” records per date. The desired output is this:

| DATE | EVENT_NUMBER | |--------------------------------|--------------| | October, 01 2013 00:00:00+0000 | 1 | | October, 01 2013 00:00:00+0000 | 2 | | October, 02 2013 00:00:00+0000 | 1 | | October, 03 2013 00:00:00+0000 | 1 | | October, 03 2013 00:00:00+0000 | 2 | | October, 03 2013 00:00:00+0000 | 3 | | October, 04 2013 00:00:00+0000 | 1 | | October, 04 2013 00:00:00+0000 | 2 | | October, 04 2013 00:00:00+0000 | 3 | | October, 04 2013 00:00:00+0000 | 4 | | October, 05 2013 00:00:00+0000 | 1 | | October, 05 2013 00:00:00+0000 | 2 | | October, 07 2013 00:00:00+0000 | 1 | | October, 07 2013 00:00:00+0000 | 2 |

As you may have noticed, there are no records for those dates with zero events (October 06). With recursive SQL, this is rather simple to achieve.

with recursive
 
-- Data could also be a regular table containing
-- the actual data
data(date, count) as (
  select date '2013-10-01', 2 union all
  select date '2013-10-02', 1 union all
  select date '2013-10-03', 3 union all
  select date '2013-10-04', 4 union all
  select date '2013-10-05', 2 union all
  select date '2013-10-06', 0 union all
  select date '2013-10-07', 2
),
 
-- This is the recursive common table expression
-- It starts with all data where count > 0
-- ... and then recurses by subtracting one
recurse(date, count) as (
  select date, count
  from data
  where count > 0
  union all
  select date, count - 1
  from recurse
  where count > 1
)
select date, count event_number from recurse
order by date asc, event_number asc;

See also this SQLFiddle to see the above CTE in action.

Incredibly, Oracle’s CONNECT BY clause doesn’t seem to be an option here. I challenge you to find a better solution, though! For instance, this beautiful solution that works with PostgreSQL:

with recursive
data(date, count) as (
  select date '2013-10-01', 2 union all
  select date '2013-10-02', 1 union all
  select date '2013-10-03', 3 union all
  select date '2013-10-04', 4 union all
  select date '2013-10-05', 2 union all
  select date '2013-10-06', 0 union all
  select date '2013-10-07', 2
)
select date, generate_series(1, count) event_number
from data
where count > 0
order by date asc, event_number asc;

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}