Intro to Soda SQL: Open Data Testing and Monitoring
How to use Soda SQL, an open data testing, monitoring and profiling tool for data-intensive environments
Join the DZone community and get the full member experience.
Join For FreeOn behalf of the Soda team, I am pleased to announce the availability of Soda SQL, Soda’s first open source data testing, monitoring and profiling tool for data-intensive environments. You can download Soda SQL today for free on GitHub.
With more and more products being built using data as the core input, it’s never been more important to test and monitor the quality of data being used. For data engineers this usually requires extra capacity and the development of a homegrown data testing framework. As we know, these solutions become unwieldy as the volumes of data and size of teams grow.
Which is why we’re excited to release Soda SQL; the first release for Soda as we develop open tools to support data engineers working in data-intensive environments.
Highlights of the capabilities included in Soda SQL include:
- Stopping your pipeline when bad data is detected
- Extracting metrics and column profiles through efficient SQL
- Full control over metrics and queries through declarative configuration files
Why Are We Launching Soda SQL?
In software, as in so many other areas, what you don’t know can hurt you. At Soda, we call these the silent data issues. Left unchecked, they cause ripple effects across an entire application ecosystem.
Soda SQL works with your existing data engineering workflows to create a quick and easy way to redefine what good quality data means to your business. It provides an open data monitoring tool for data engineers to define tests and protect against the silent data issues that go undetected in datasets, data lakes, and data warehouses.
Soda SQL profiles and tests your data:
- As it lands in your warehouse
- After every important data processing step
- And right before consumption.
This prevents delivery of bad data to downstream consumers within your organisation and means you don’t have to spend anymore late nights firefighting issues with your data.
How Does Soda SQL Work?
It's easy (and free!) to download, straightforward to set up and go.
Soda SQL uses a simple Command Line Interface (CLI) and Python library to test and monitor your data through metric collection. As an input, it uses YAML configuration files that includes: 1) SQL connection details, 2) What metrics to compute, and 3) What tests to run on the measurements. Based on these config files, Soda SQL performs scans - typically after new data has arrived - and runs tests associated with one table. Once you’re happy with the datasets and tests, you can add them to any modern data orchestration tool.
Check out this 5 minute tutorial for a more in-depth explanation: https://docs.soda.io/soda-sql/getting-started/5_min_tutorial.html
In the meantime, here's a quick example.
Simple metrics and tests can be configured in scan YAML configuration files. An example of the contents of such a file is as follows:
Based on these configuration files, Soda SQL will scan your data each time new data arrived like this:
The next step is to add Soda SQL scans in your favourite data pipeline orchestration solution like:
- Airflow
- AWS Glue
- Prefect
- Dagster
- Fivetran
- Matillion
- Luigi
That’s it!
This is the very first of our community releases designed to support data engineers working in environments where data quality is important. We are also developing a library of developer tools for data testing and monitoring that will include data frames and streaming data which will operate across all major data workloads, engines and environments including Kafka, Spark, AWS S3, Azure Blob Storage, Google Cloud Datastore, Presto, Snowflake, Azure Synapse, Google BigQuery, and AWS Redshift.
To test drive Soda SQL, please download it from GitHub. Your feedback is appreciated - use our Issues or join the community on Slack!
Opinions expressed by DZone contributors are their own.
Comments