DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Data Privacy and Security: A Developer's Guide to Handling Sensitive Data With DuckDB
  • What Is SQL Injection and How Can It Be Avoided?
  • Snowflake Data Time Travel
  • SQL Server From Zero To Advanced Level: Leveraging nProbe Data

Trending

  • Issue and Present Verifiable Credentials With Spring Boot and Android
  • AI, ML, and Data Science: Shaping the Future of Automation
  • Java Virtual Threads and Scaling
  • Evolution of Cloud Services for MCP/A2A Protocols in AI Agents
  1. DZone
  2. Data Engineering
  3. Databases
  4. Filling Sparse Data With the Previous Non-Empty Value in SQL

Filling Sparse Data With the Previous Non-Empty Value in SQL

Want to learn how to fill sparse data sets with the "previous non-empty value"? Take a look at these two lean, SQL-based solutions.

By 
Lukas Eder user avatar
Lukas Eder
·
Dec. 21, 15 · Tutorial
Likes (3)
Comment
Save
Tweet
Share
12.4K Views

Join the DZone community and get the full member experience.

Join For Free

How do I fill the cells of a sparse data set with the “previous non-empty value”?

This is a very common problem in all data-related technologies, and we’re going to look into two very lean, SQL-based solutions for it.

The Problem

The problem is really simple, and I’m reusing the example provided by Stack Overflow user aljassi to answer this question:

We have a table containing “sparse” data:

Col1  Col2  Col3  Col4
----------------------
A     0     1     5
B     0     4     0
C     2     0     0
D     0     0     0
E     3     5     0
F     0     3     0
G     0     3     1
H     0     1     5
I     3     5     0

The above data set contains a couple of interesting data points that are non-zero, and some gaps modelled by the value zero. In other examples, we could replace zero by NULL, but it would still be the same problem. The desired result is the following:

Col1  Col2  Col3  Col4
----------------------
A     0     1     5
B     0     4     5
C     2     4     5
D     2     4     5
E     3     5     5
F     3     3     5
G     3     3     1
H     3     1     5
I     3     5     5

Note that all the generated values are highlighted in red, and they correspond to the most recent blue value.

How to do it with SQL? We’ll be looking at two solutions...

A Solution Using Window Functions

This is the solution you should be looking for, and there are two answers in the linked Stack Overflow question that both make use of window functions:

  • User nop77svk’s solution using LAST_VALUE()
  • User MT0’s solution using LAG()

Both solutions are roughly equivalent. Here’s how they work (using Oracle syntax):

WITH t(col1, col2, col3, col4) AS (
  SELECT 'A', 0, 1, 5 FROM DUAL UNION ALL
  SELECT 'B', 0, 4, 0 FROM DUAL UNION ALL
  SELECT 'C', 2, 0, 0 FROM DUAL UNION ALL
  SELECT 'D', 0, 0, 0 FROM DUAL UNION ALL
  SELECT 'E', 3, 5, 0 FROM DUAL UNION ALL
  SELECT 'F', 0, 3, 0 FROM DUAL UNION ALL
  SELECT 'G', 0, 3, 1 FROM DUAL UNION ALL
  SELECT 'H', 0, 1, 5 FROM DUAL UNION ALL
  SELECT 'I', 3, 5, 0 FROM DUAL
)
SELECT
  col1,

  nvl(last_value(nullif(col2, 0)) 
      IGNORE NULLS OVER (ORDER BY col1), 0) col2,

  nvl(last_value(nullif(col3, 0)) 
      IGNORE NULLS OVER (ORDER BY col1), 0) col3,

  nvl(last_value(nullif(col4, 0)) 
      IGNORE NULLS OVER (ORDER BY col1), 0) col4
FROM t

Now, let’s decompose these window functions:

NULLIF(colx, 0)

This is just an easy way of producing NULL values whenever we have what is an accepted “empty” value in our data set. So, instead of zeros, we just get NULL. Applying this function to our data, we’re getting:

Col1  Col2  Col3  Col4
----------------------
A     NULL  1     5
B     NULL  4     NULL
C     2     NULL  NULL
D     NULL  NULL  NULL
E     3     5     NULL
F     NULL  3     NULL
G     NULL  3     1
H     NULL  1     5
I     3     5     NULL

We’re doing this because now, we can make use of the useful IGNORE NULLS clause that is available to some ranking functions, specifically LAST_VALUE(), or LAG(). We can now write:

last_value(...) IGNORE NULLS OVER (ORDER BY col1)

Where we take the last non-NULL value that precedes the current row when ordering rows by col1:

  • If the current row contains a non-NULL value, we’re taking that value.
  • If the current row contains a NULL value, we’re going “up” until we reach a non-NULL value.
  • If we’re going “up” and we haven’t reached any non-NULL value, well, we get NULL.

This leads to the following result:

Col1  Col2  Col3  Col4
----------------------
A     NULL  1     5
B     NULL  4     5
C     2     4     5
D     2     4     5
E     3     5     5
F     3     3     5
G     3     3     1
H     3     1     5
I     3     5     5

Note that with most window functions, once you specify an ORDER BY clause, then the following frame clause is taken as a default:

last_value(...) IGNORE NULLS OVER (
  ORDER BY col1
  ROWS BETWEEN UNBOUNDED PRECEEDING AND CURRENT ROW
)

That’s a lot of keywords, but their meaning is not really that obscure once you get a hang of window functions. We suggest reading the following blog posts to learn more about them:

  • The Difference Between ROW_NUMBER(), RANK(), and DENSE_RANK()
  • Don’t Miss out on Awesome SQL Power with FIRST_VALUE(), LAST_VALUE(), LEAD(), and LAG()

Finally, because we don’t want those NULL values to remain in our results, we simply remove them using NVL() (or COALESCE() in other databases):

nvl(last_value(...) IGNORE NULLS OVER (...), 0)

Easy, isn’t it? Note, that in this particular case, LAG() and LAST_VALUE() will have the same effect.

A Solution Using the MODEL Clause

Whenever you have a problem in (Oracle) SQL that becomes hard to solve with window functions, the Oracle MODEL clause might offer an “easy” solution to it. I’m using quotes on “easy”, because the syntax is a bit hard to remember, but the essence of it is really not that hard.

The MODEL clause is nothing else than an Oracle-specific dialect for implementing spreadsheet-like logic in the database. I highly recommend reading the relevant Whitepaper by Oracle, which explains the functionality very well:

http://www.oracle.com/technetwork/middleware/bi-foundation/10gr1-twp-bi-dw-sqlmodel-131067.pdf

Here’s how you could tackle the problem with MODEL (and bear with me):

WITH t(col1, col2, col3, col4) AS (
  SELECT 'A', 0, 1, 5 FROM DUAL UNION ALL
  SELECT 'B', 0, 4, 0 FROM DUAL UNION ALL
  SELECT 'C', 2, 0, 0 FROM DUAL UNION ALL
  SELECT 'D', 0, 0, 0 FROM DUAL UNION ALL
  SELECT 'E', 3, 5, 0 FROM DUAL UNION ALL
  SELECT 'F', 0, 3, 0 FROM DUAL UNION ALL
  SELECT 'G', 0, 3, 1 FROM DUAL UNION ALL
  SELECT 'H', 0, 1, 5 FROM DUAL UNION ALL
  SELECT 'I', 3, 5, 0 FROM DUAL
)
SELECT * FROM t
MODEL
  DIMENSION BY (row_number() OVER (ORDER BY col1) rn)
  MEASURES (col1, col2, col3, col4)
  RULES (
    col2[any] = DECODE(col2[cv(rn)], 0, NVL(col2[cv(rn) - 1], 0), col2[cv(rn)]),
    col3[any] = DECODE(col3[cv(rn)], 0, NVL(col3[cv(rn) - 1], 0), col3[cv(rn)]),
    col4[any] = DECODE(col4[cv(rn)], 0, NVL(col4[cv(rn) - 1], 0), col4[cv(rn)])
  )

There are three clauses that are of interest here:

The DIMENSION BY Clause

Like in a Microsoft Excel spreadsheet, DIMENSION corresponds to the consecutive, distinct index of each spreadsheet cell, by which we want to access the cell. In Excel, there are always two dimensions (one written with letters A..Z, AA..ZZ, …) and the other one written with numbers (1..infinity).

Using MODEL, you can specify as many dimensions as you want. In our example, we’ll only use one, the row number of each row, ordered by col1 (another use-case for a window function).

The MEASURES Clause

The MEASURES clause specifies the individual cell values for each “cell”. In Microsoft Excel, a cell can have only one value. In Oracle’s MODEL clause, we can operate on many values at once, within a “cell”.

In this case, we’ll just make all of the columns our cells.

The RULES Clause

This is the really interesting part in the MODEL clause. Here, we specify by what rules we want to calculate the values of each individual cell. The syntax is simple:

RULES (
  <rule 1>,
  <rule 2>,
  ...,
  <rule N>
)

Each individual rule can implement an assignment of the form:

RULES (
  cell[dimension(s)] = rule
)

In our case, we’ll repeat the same rule for cells col2, col3, and col4, and for any value of the dimension rn (for row number). So, the left-hand side of the assignment is:

RULES (
  col2[any] = rule,
  col3[any] = rule,
  col4[any] = rule,
)

The right hand side is a trivial (though not so trivial-looking) expression:

1 DECODE(col2[cv(rn)], 0, NVL(col2[cv(rn) - 1], 0), col2[cv(rn)])

Let’s decompose again.

DECODE

DECODE is a simple and useful Oracle function that takes a first argument, compares it with argument 2, and if they’re the same, returns argument 3, otherwise argument 4. It works like a CASE, which is a bit more verbose:

1 2 3 4 5 DECODE(A, B, C, D)   -- The same as:   CASE A WHEN B THEN C ELSE D END

cv(rn)

cv() is a MODEL specific “function” that means “current value”. On the left-hand side of the assignment, we used "any" as the dimension specifier, so we’re applying this rule for “any” value of rn. In order to access a specific rnvalue, we’ll simply write cv(rn), or the “current value of rn”.

Recursiveness

The RULES of the MODEL clause are allowed to span a recursive tree (although not a graph, so no cycles are allowed), where each cell can be defined based on a previous cell, which is again defined based on its predecessor. We’re doing this via col2[cv(rn) - 1], where cv(rn) - 1means the “current row number minus one”.

Easy, right? Granted. The syntax isn’t straight-forward and we’re only scratching the surface of what’s possible with MODEL.

Conclusion

SQL provides cool ways to implementing data-driven, declarative specifications of what your data should be like. The MODEL clause is a bit eerie, but at the same time extremely powerful. Much easier and also a bit faster are window functions, a tool that should be in the tool chain of every developer working with SQL.

In this article, we’ve shown how to fill gaps in sparse data using window functions or MODEL. A similar use-case are running totals. If this article has triggered your interest, I suggest reading about different approaches of calculating a running total in SQL.

sql Data (computing) Database

Published at DZone with permission of Lukas Eder. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Data Privacy and Security: A Developer's Guide to Handling Sensitive Data With DuckDB
  • What Is SQL Injection and How Can It Be Avoided?
  • Snowflake Data Time Travel
  • SQL Server From Zero To Advanced Level: Leveraging nProbe Data

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: