DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Simplify Big Data Analytics With AirMettle
  • Building Real-Time Weather Dashboards With Apache Pinot
  • Why We Chose Iceberg Over Delta After Evaluating Both at Scale
  • Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory

Trending

  • Detecting Bugs and Vulnerabilities in Java With SonarQube
  • Working With Cowork: Don’t Be Confused
  • Can Claude Skills Replace Playwright Agents? A Practical View for QA Engineers
  • From Data Movement to Local Intelligence: The Shift from Centralized to Federated AI
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Efficient Long-Term Trend Analysis in Presto Using Datelists

Efficient Long-Term Trend Analysis in Presto Using Datelists

To make long-term trend analysis easier, we can leverage datelists, where we store each metric value corresponding to a date in an array in a sequential manner.

By 
Ajay Krishnan Prabhakaran user avatar
Ajay Krishnan Prabhakaran
·
Jan. 15, 25 · Tutorial
Likes (0)
Comment
Save
Tweet
Share
4.1K Views

Join the DZone community and get the full member experience.

Join For Free

Data analytics teams, plenty of times, would have to do long-term trend analysis to study patterns over time. Some of the common analyses are WoW (week over week), MoM (month over month), and YoY (year over year). This would usually require data to be stored across multiple years. 

However, this takes up a lot of storage and querying across years worth of partitions is inefficient and expensive. On top of this, if we have to do user attribute cuts, it will be more cumbersome. To overcome this issue, we can implement an efficient solution using datelists.

What Are Datelists?

The most commonly used data formats in Hive are the simple types like int, bigint, varchar, and boolean. However, there are other complex types like Array<int>, Array<boolean>, and dict<varchar, varchar>, which give us more flexibility in terms of what we can achieve. 

To create a datelist, we simply store the metric values from different date partitions in each index position of an array with a start_date column to indicate the date corresponding to index 0 in the array. E.g., a datelist array would look like [5, 3, 4] with start_date column value as, say, 10/1, which means the first value 5 in index 0 corresponds to the metric value that was recorded on 10/1 and so on. 

If you look at the table below, you will see how traditional systems store data, where each row corresponds to each transaction that occurred. This causes redundancy, which can be avoided by transforming this data into a datelist format.

Traditional Data Storage

Traditional Data Storage

Datelist Data Storage

Datelist Data Storage

As you must have noticed, the number of rows has significantly reduced as there is no redundancy, i.e., each row within a particular date partition would have only one row per user. This is because we have aggregated all the different metric values corresponding to a user into a single array.

Designing a Datelist

Designing a datelist involves joining the metric values of a user from today’s source table with yesterday’s target table and storing the corresponding results again into the target table but into today’s partition value. 

If it's a new user who has not yet been active, then we will create an empty array with all zeroes whose length would be the difference between start_date and today. Then, we would append today’s metric value to this newly created array. If the user already exists in yesterday’s partition, we simply append it to the already existing array.

For example, if start_date is 10/1, and if a user first appears on 10/3, their array would be initialized as [0, 0], and the value for 10/3 would be appended, resulting in [0, 0, 7].

Designing a datelist

Sequence of Events

From 10/1 to 10/4, the datelist grows as follows:

On 10/1:

On 10/1

On 10/2:

On 10/2

Each day, as the data pipeline runs, the array would keep growing in length indefinitely. You can put some limits as to how long you want the array to be, i.e., if you are only interested in doing WoW analysis for the last 6 months, the array can be trimmed to fit those needs and also update the start_date value accordingly every time the job runs. But this is not really necessary as arrays are generally very efficient, so even if it's a long query, it shouldn’t cause any performance issues.

Here is a simple example of how to set up the SQL query to create a datelist in Presto:

SQL
 
WITH today_source AS (
    SELECT
        *
    FROM (
        VALUES
            ('2024-10-02', 123, 10),
            ('2024-10-02', 234, 45)
    ) AS nodes (ds, userid, time_spent)
),
yest_target AS (
    SELECT
        *
    FROM (
        VALUES
            ('2024-10-01', 123, ARRAY[4])
    ) AS nodes (ds, userid, time_spent)
)
SELECT
    '2024-10-01' AS dateid,
    userid,
    COALESCE(y.time_spent, ARRAY[0]) || t.time_spent AS time_spent_datelist
FROM today_source t
FULL OUTER JOIN yest_target y
    USING (userid)


Which would yield an output like this:

Output of the datelist


Querying the Datelist Table to Calculate Ln(n = 1, 7, 28, …) Metrics

SQL
 
SELECT
    id,
    ARRAY_SUM(SLICE(metric_values, -1, 1)) AS L1,
    ARRAY_SUM(SLICE(metric_values, -7, 7)) AS L7,
    ARRAY_SUM(SLICE(metric_values, -28, 28)) AS L28,
    *
FROM dim_table_a
WHERE
    ds = '<LATEST_DS>'


In the above example, we look at a sample SQL query where we can easily calculate the L1, L7, and L28 results of a metric by simply querying the latest partition from the table and using a slice to get the subset of the array that we need and sum it. This helps in reducing the retention of a table just by maintaining as little as 7 days of partitions. We would be able to do an analysis that spans across years.

Benefits

  • Storage savings: We get considerable storage savings as we don't have to store partitions beyond 7-10 days, as we have all the data we need to be compressed into the array, which could span across years as we store only one user per row per date partition.
  • Long-term trend analysis: Simpler query, as we just fetch the data from the latest partition and sum the subset of values needed from the datelist array for long-term analysis.
  • Privacy compliance: If we need to delete a user record (for example, they deactivated their account, so we can’t store/use their data anymore), then we just have to delete it from a few partitions instead of having to clean it up across various partitions, especially if it's a tenured user.
  • Fast processing and reduced compute: The time complexity would be O(n), and storage would be ‘retention value’ * O(n), where n is the number of users active on the app.

Conclusion

Datelists are a valuable tool that every data engineer can take advantage of. They are easy to implement and maintain, and the benefits we get out of them are vast. However, we need to be cautious about backfills. We need to build a framework that can properly update the right index values in the array. However, once this is tested and implemented, we can simply reuse it whenever backfills are required.

sql Analytics Big data

Opinions expressed by DZone contributors are their own.

Related

  • Simplify Big Data Analytics With AirMettle
  • Building Real-Time Weather Dashboards With Apache Pinot
  • Why We Chose Iceberg Over Delta After Evaluating Both at Scale
  • Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook