Intro to Soda SQL: Open Data Testing and Monitoring

How to use Soda SQL, an open data testing, monitoring and profiling tool for data-intensive environments

By Tom Baeyens · Updated Feb. 17, 21 · News

On behalf of the Soda team, I am pleased to announce the availability of Soda SQL, Soda’s first open source data testing, monitoring and profiling tool for data-intensive environments. You can download Soda SQL today for free on GitHub.

With more and more products being built using data as the core input, it’s never been more important to test and monitor the quality of data being used. For data engineers this usually requires extra capacity and the development of a homegrown data testing framework. As we know, these solutions become unwieldy as the volumes of data and size of teams grow.

That is why we’re excited to release Soda SQL, the first release from Soda as we develop open tools to support data engineers working in data-intensive environments.

Highlights of Soda SQL's capabilities include:

  • Stopping your pipeline when bad data is detected
  • Extracting metrics and column profiles through efficient SQL
  • Full control over metrics and queries through declarative configuration files

Why Are We Launching Soda SQL?

In software, as in so many other areas, what you don’t know can hurt you. At Soda, we call these the silent data issues. Left unchecked, they cause ripple effects across an entire application ecosystem.

Soda SQL works with your existing data engineering workflows to create a quick and easy way to redefine what good quality data means to your business. It provides an open data monitoring tool for data engineers to define tests and protect against the silent data issues that go undetected in datasets, data lakes, and data warehouses.

Soda SQL profiles and tests your data:

  • As it lands in your warehouse
  • After every important data processing step
  • And right before consumption.

This prevents delivery of bad data to downstream consumers within your organisation and means you don’t have to spend any more late nights firefighting issues with your data.

How Does Soda SQL Work?

It's easy (and free!) to download, and straightforward to set up and go.

Soda SQL uses a simple Command Line Interface (CLI) and Python library to test and monitor your data through metric collection. As input, it uses YAML configuration files that include: 1) SQL connection details, 2) what metrics to compute, and 3) what tests to run on the measurements. Based on these config files, Soda SQL performs scans - typically after new data has arrived - and each scan runs the tests associated with one table. Once you’re happy with the datasets and tests, you can add them to any modern data orchestration tool.
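To make the scan flow above more concrete, here is a minimal sketch in plain Python. It does not use Soda SQL's own API or YAML schema: the `SCAN_CONFIG` dictionary, the `run_scan` and `demo_connection` functions, the `orders` table, and the metric and test names are all assumptions made for illustration, and an in-memory SQLite database stands in for a warehouse. It only shows the general idea of computing several metrics in a single SQL query and then evaluating tests on the resulting measurements.

```python
import sqlite3

# Hypothetical scan configuration (not Soda SQL's real YAML schema).
# It mirrors the parts named above: which table, which metrics, which tests.
SCAN_CONFIG = {
    "table": "orders",
    "metrics": {
        "row_count": "COUNT(*)",
        "missing_customer": "SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)",
    },
    "tests": {
        "has_rows": lambda m: m["row_count"] > 0,
        "no_missing_customer": lambda m: m["missing_customer"] == 0,
    },
}


def run_scan(conn, config):
    """Compute all metrics in one SQL query, then evaluate each test."""
    names = list(config["metrics"])
    select_list = ", ".join(
        f"{expr} AS {name}" for name, expr in config["metrics"].items()
    )
    row = conn.execute(f"SELECT {select_list} FROM {config['table']}").fetchone()
    measurements = dict(zip(names, row))
    results = {
        name: bool(test(measurements)) for name, test in config["tests"].items()
    }
    return measurements, results


def demo_connection():
    """Build a tiny in-memory table with one missing customer_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, None), (3, 12)]
    )
    return conn


if __name__ == "__main__":
    measurements, results = run_scan(demo_connection(), SCAN_CONFIG)
    print(measurements)
    print(results)
```

Keeping the metrics and tests in a declarative structure, separate from the code that runs them, is what lets the same scan logic be reused across tables.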

Check out this 5-minute tutorial for a more in-depth explanation: https://docs.soda.io/soda-sql/getting-started/5_min_tutorial.html

In the meantime, here's a quick overview.

Simple metrics and tests are configured in scan YAML configuration files.

Based on these configuration files, Soda SQL scans your data each time new data arrives.

The next step is to add Soda SQL scans to your favourite data pipeline orchestration solution, such as:

  • Airflow
  • AWS Glue
  • Prefect
  • Dagster
  • Fivetran
  • Matillion
  • Luigi
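Whatever orchestrator you pick, the pattern is the same: run a scan as a pipeline step and fail that step when a test fails, so downstream steps never see the bad data. The sketch below illustrates that pattern in plain Python without depending on any of the tools listed above. The names `BadDataError`, `scan_gate`, and `run_pipeline` are hypothetical, and the test results are hard-coded rather than produced by a real scan.

```python
class BadDataError(Exception):
    """Raised when a data test fails, so the pipeline can stop."""


def scan_gate(test_results):
    """Raise if any test failed; test_results maps test name -> passed (bool)."""
    failed = sorted(name for name, passed in test_results.items() if not passed)
    if failed:
        raise BadDataError("failed tests: " + ", ".join(failed))


def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first bad-data error."""
    completed = []
    for name, step in steps:
        try:
            step()
        except BadDataError as err:
            return completed, f"stopped at {name}: {err}"
        completed.append(name)
    return completed, "ok"


if __name__ == "__main__":
    steps = [
        ("load", lambda: None),
        # Hard-coded results stand in for the outcome of a real scan.
        ("scan", lambda: scan_gate({"has_rows": True, "no_missing_ids": False})),
        ("publish", lambda: None),
    ]
    print(run_pipeline(steps))
```

In a real orchestrator, the raised exception (or a non-zero exit code from a command-line step) is typically what marks the task as failed and blocks the tasks that depend on it.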

That’s it!

This is the very first of our community releases designed to support data engineers working in environments where data quality is important. We are also developing a library of developer tools for data testing and monitoring that will cover data frames and streaming data, and that will operate across all major data workloads, engines, and environments, including Kafka, Spark, AWS S3, Azure Blob Storage, Google Cloud Datastore, Presto, Snowflake, Azure Synapse, Google BigQuery, and AWS Redshift.

To test drive Soda SQL, please download it from GitHub. Your feedback is appreciated - use our Issues or join the community on Slack!
