The Data Journey

During the last few years, I've observed that many organizations are in the middle of a three-step process that typically takes years. This process is the Data Journey.

By Alejandro Martin · Updated Sep. 23, 22 · Opinion

Over the last few years, I have talked to many retailers and other large tech companies dealing with data.

I've observed the same pattern repeatedly: these organizations are in the middle of a three-step process, which usually takes more than 5 years.

1. Data Consolidation

Migrating a platform to a microservices ecosystem means you'll generate many silos of information: every service has its own private repository.

Most analytical use cases require merging information from multiple services, and this usually gets challenging, given that each service may even use a different database technology for its data repository. The difficulty of solving analytical use cases grows quickly under this paradigm, since every additional service adds another source to reconcile.

That's why the first step of this process is consolidating this data in a data lake, ideally in the cloud, for scalability and flexibility.

2. Data Modeling

Once the information is all together in a data lake, organizations need to model and normalize it so that analysts can get insights and run queries.

The same entity may be referenced using different codes across services, and services may follow other non-standard conventions as well.

This task is ideally implemented following a domain-entity organization, where every vertical domain is accountable for modeling the entities it owns for the rest of the company. This approach fits quite well with the Data Mesh principles.

Ideally, teams will have tools, such as dbt, that guarantee a single source of truth for definitions and logic and provide some level of automation and observability. It's quite easy to imagine why this tool is growing in popularity, given these premises.
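The code mismatch described above can be made concrete with a minimal Python sketch. Everything named here is hypothetical and only for illustration: the `orders` and `inventory` services, their local codes, and the `CANONICAL_IDS` mapping that the owning domain would publish. In practice this logic would more likely live in the modeling layer of the data platform than in application code.

```python
from dataclasses import dataclass

# Hypothetical mapping published by the owning domain:
# (service name, service-local code) -> canonical product id.
CANONICAL_IDS = {
    ("orders", "SKU-001"): "product-1",
    ("inventory", "0001"): "product-1",
    ("orders", "SKU-002"): "product-2",
    ("inventory", "0002"): "product-2",
}


@dataclass(frozen=True)
class Record:
    service: str   # which service produced the row
    code: str      # the service-local identifier
    quantity: int


def normalize(records):
    """Sum quantities per (canonical id, service), resolving local codes."""
    totals = {}
    for r in records:
        canonical = CANONICAL_IDS.get((r.service, r.code))
        if canonical is None:
            # Surface unknown codes instead of silently dropping rows.
            raise KeyError(f"no canonical id for {r.service}:{r.code}")
        key = (canonical, r.service)
        totals[key] = totals.get(key, 0) + r.quantity
    return totals
```

Once both services resolve to the same canonical id, an analyst can compare units ordered against units in stock for one product without knowing either service's internal code.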

Completing this step enables analytical and ML teams to provide huge value. Now your organization has a single source of truth and can run on-demand queries with "the power of the cloud." A huge leap forward.

3. Publication

Most companies are already somewhere between steps one and two, but they are really struggling with this last one: "Ok, now you have your data modeled on, say, Snowflake; how do I build a dashboard on top of that? How do I consume these metrics or insights in near real-time?"

The first attempt is typically building a REST data service on top of Snowflake, BigQuery, Redshift, etc. I've seen this countless times, always with the same poor results. These products are great for analytical, long-running queries, but they're just not meant for interactive use cases: by design, they don't offer the low latency and high concurrency those use cases need.

And that's the exact situation where a product like Tinybird provides the most value: making the data you have already modeled available in real-time, with the concurrency and latency you need.

Operational Analytics

When you complete the three steps, the ultimate benefit you get as an organization is enabling real-time Operational Analytics.

This means you can operate your business in real-time based on actual facts and insights you get from your Data Platform.

For example, if you're a retailer running a big promotion during Black Friday, you'll be able to re-arrange the items on your website depending on their performance in real-time or hide out-of-stock products. Supplying the demand in an optimal way during a timely event is huge for increasing revenue.

Another common use case that's also unlocked is real-time personalization of the user experience, where the website adapts to users based on their interactions.
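The Black Friday example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ProductStats` fields, including the `units_sold_last_hour` metric, are hypothetical stand-ins for whatever real-time figures your data platform exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductStats:
    product_id: str
    units_sold_last_hour: int  # hypothetical real-time performance metric
    stock: int                 # units currently available


def arrange_listing(stats):
    """Return product ids to display: in-stock only, best sellers first."""
    in_stock = [s for s in stats if s.stock > 0]
    # Sort by sales descending; break ties by id so the order is stable.
    ranked = sorted(in_stock, key=lambda s: (-s.units_sold_last_hour, s.product_id))
    return [s.product_id for s in ranked]
```

The function itself is trivial. The hard part, and the point of the three steps, is having fresh and trustworthy values for those fields at the moment the page is rendered.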

A Note on Streaming Analytics

There's general agreement that distributed systems increase complexity exponentially. So does asynchronous communication between services.

Once you've shifted to an event-driven paradigm, you'll typically want to ingest all your events using a common hub, for example, a Kafka cluster. But unlocking the value and getting insights from the information you already have on the platform in near real-time is challenging.

There are already a few products for streaming analytics, such as Rockset, ksqlDB, Apache Druid, Imply, etc. They work great for some streaming use cases, but they all fall a bit short when it comes to high volume and concurrency, complex logic or multiple joins.

That's because they don't leverage a full OLAP database like ClickHouse, which enables arbitrary time spans (vs window functions), advanced joins for complex use cases, managed materialized views (MVs) for rollups, and many other benefits.
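To illustrate the idea of rollups queried over arbitrary time spans, here is a toy Python sketch. It is not how ClickHouse or any other product implements materialized views. The `MinuteRollup` class and its methods are hypothetical, timestamps are assumed to be integer epoch seconds, and queries resolve at minute granularity.

```python
from collections import defaultdict


class MinuteRollup:
    """Toy stand-in for a rollup materialized view: totals per (minute, key)."""

    def __init__(self):
        self._totals = defaultdict(int)

    def ingest(self, timestamp, key, value):
        # Update the pre-aggregated bucket at write time, not at query time.
        minute = timestamp - timestamp % 60
        self._totals[(minute, key)] += value

    def total(self, key, start, end):
        """Sum for `key` over all minute buckets that start in [start, end)."""
        return sum(
            v
            for (minute, k), v in self._totals.items()
            if k == key and start <= minute < end
        )
```

Because events are aggregated as they arrive, a query reads a handful of buckets instead of scanning raw events, and the caller can pick any start and end instead of a window fixed in advance.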

Published at DZone with permission of Alejandro Martin. See the original article here.

Opinions expressed by DZone contributors are their own.
