Data Quality and Validation

Examples of data quality and validation checks and how easy it is to programmatically ensure data quality with the help of Apache Spark and Scala.

By Vinod Kumar · Jul. 24, 18 · Tutorial

Big data and machine learning systems run on data, so it's important to keep that data correct. Inaccurate data not only reduces the efficiency of the system but also leads to misleading insights. One of the big steps toward ensuring the correctness of data is data quality and validation. With an increasing volume of data, and the noise that comes along with it, new checks are being added every day to ensure data quality. Since the amount of data is huge, one more thing to consider is how to process all of these checks and validations quickly; i.e., a system that can go through each and every ingested record in a highly distributed way. This post walks through some examples of data quality and validation checks and shows how easy it is to programmatically ensure data quality with the help of Apache Spark and Scala.

Data accuracy refers to how close observed values are to the true values, or to values accepted as being true. Some basic accuracy checks:

  • Null Value: Count records that contain a null value in a column. For example, a gender column that should only contain male/female but also holds nulls:
    sampledataframe.where(sampledataframe.col("columnname").isNull).count()
  • Specific Value: Count records where a column matches a specific value, such as a particular company ID:
    sampledataframe.where(sampledataframe.col("columnname") === "StringToMatch").count()

Schema Validation: Every batch of data should follow the same column names and data types.

    import org.apache.spark.sql.types.StringType

    for (field <- sampledataframe.schema) {
      if (field.dataType != StringType) { // compare against the expected DataType for the column (StringType as an example)
        println(s"Unexpected type for column ${field.name}: ${field.dataType}") // print/flag the error
      }
    }
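
A slightly fuller sketch, assuming a hand-maintained reference map of expected column names to data types (the column names and types below are placeholders, not from the original article), could compare each incoming batch against that reference:

    import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

    // Reference schema for each batch; names and types here are placeholder assumptions.
    val expectedSchema: Map[String, DataType] = Map(
      "columnname" -> StringType,
      "companyid"  -> IntegerType
    )

    for (field <- sampledataframe.schema) {
      expectedSchema.get(field.name) match {
        case Some(expected) if expected != field.dataType =>
          println(s"Type mismatch for ${field.name}: expected $expected, found ${field.dataType}")
        case None =>
          println(s"Unexpected column in batch: ${field.name}")
        case _ => // column matches the reference schema
      }
    }

    // Columns that are expected but missing from the batch
    val missing = expectedSchema.keySet -- sampledataframe.schema.fieldNames.toSet
    missing.foreach(name => println(s"Missing column: $name"))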

Column Value Duplicates (e.g., duplicate email addresses across records)

    val dataframe1 = sampledataframe.groupBy("columnname").count()
    val dataframe2 = dataframe1.filter("count = 1")
    println("No of duplicate records: " + (dataframe1.count() - dataframe2.count()).toString())

Uniqueness Check: Records should be unique with respect to a given column.

This is similar to the duplicate check above.

val dataframe1 = sampledataframe.groupBy("columnname").count()
dataframe1.filter("count = 1").count() // this gives the count of unique values

Accuracy Check: Regular expressions can be used. For example, we can look for email IDs that contain '@', or check that a gender column only holds the expected values:

sampledataframe.where(sampledataframe.col("columnname") === "female").count()
// or, with a regular expression
sampledataframe.where(sampledataframe.col("columnname").rlike("f.*l.*e")).count()
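
For the email example mentioned above, a minimal sketch (the column name emailcolumn is an assumption for illustration) could count records whose email ID does not contain an '@':

    import org.apache.spark.sql.functions.not

    // "emailcolumn" is an assumed column name holding email IDs.
    val invalidEmails = sampledataframe
      .where(not(sampledataframe.col("emailcolumn").rlike("@")))
      .count()
    println(s"Records with malformed email IDs: $invalidEmails")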

Data currency: How up-to-date is your data? Here the assumption is that data arrives on a daily basis and each batch is checked and timestamped.
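
As a minimal sketch of a freshness check (the ingest_timestamp column and the 24-hour threshold are assumptions, not from the original article), the newest timestamp in the data can be compared against the current time:

    import org.apache.spark.sql.functions.{col, max}

    // "ingest_timestamp" is an assumed column holding each record's ingestion time.
    val latest = sampledataframe.agg(max(col("ingest_timestamp"))).first().getTimestamp(0)
    val ageInHours = (System.currentTimeMillis() - latest.getTime) / (1000 * 60 * 60)
    if (ageInHours > 24) {
      println(s"Data is stale: the newest record is $ageInHours hours old")
    }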

This list can go on and on, but the good thing about this Spark- and Scala-based approach is that a lot can be achieved with very little code, even on huge amounts of data.

Sometimes, a system may have specific requirements related to who is consuming the data and in what form, and the consumer may have its own assumptions about the data.

Data usability: Consumer applications may apply certain expectations, like:

  • column1.value should not be equal to column2.value
  • column3.value should always be column1.value + column2.value
  • No value in a given column should appear more than x% of the time. Spark's freqItems can surface such frequent values (a sketch of the first two checks follows this list):

    val arr = Array("ColumnName1", "ColumnName2", "ColumnName3")
    val freq = sampledataframe.stat.freqItems(arr, 0.4)
    freq.show()
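
As a rough sketch of the first two expectations (column1, column2, and column3 are placeholder column names, and numeric columns are assumed for the sum check), the violating records can be counted directly:

    import org.apache.spark.sql.functions.col

    // Expectation 1: column1 should never equal column2.
    val equalValues = sampledataframe.where(col("column1") === col("column2")).count()

    // Expectation 2: column3 should always equal column1 + column2 (numeric columns assumed).
    val badSums = sampledataframe.where(col("column3") =!= (col("column1") + col("column2"))).count()

    println(s"Records violating column1 != column2: $equalValues")
    println(s"Records violating column3 = column1 + column2: $badSums")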

While these are considered basic validations, there are also some more advanced checks to ensure data quality, like:

  • Anomaly Detection: A common case is a time-based anomaly, where a dimension such as time is given: within any timeframe (slice period), the number of records should not be more than x% above the average. To achieve this with Spark (see the first sketch after this list):
    • Let's assume the slice period is 1 minute.
    • First, the timestamp column needs to be truncated/formatted so that the unit of the timestamp is a minute. This will produce duplicates, but that is not an issue here.
    • Next, use groupBy, like so: sampledataframe.groupBy("timestamp").count().
    • Get the average of that count and flag any slice period (if one exists) that has x% more records than the average.
  • Ordering
    • Records should follow a certain order. For example, within a day the records for a particular consumer should start with impressions, then clicks, landing page, and cart, and end with purchases. There may be partial records, but they should follow the general order. To check this with Spark (see the second sketch after this list):
      • Group the records with groupBy("ID").
      • Run the order check within each group.
  • Circular Dependency: Let me explain this with an example.
    • Suppose two columns are related such that column A => column B, and the records look like:

      ID | Name  | Father's Name
      1  | Alpha | Bravo
      2  | Bravo | Gamma
      3  | Gamma | Alpha

    • If the consuming application tries to print the family hierarchy, it may fall into an infinite loop.
  • Failure Trend
    • Consider that data is coming into the system every day. Let's assume it's behavioral/touchpoint data. For simplicity, let's call each day's data a 'batch.' If every batch contains exactly the same set of failures, then there is a failure trend running across batches.
    • If the failures keep occurring for the same set of email_ids (email_id being one of the columns), it might be a symptom of bot behavior.
  • Data Bias: This means a consistent shift in the data. For example:
    • If 30 minutes is being added to the timestamp, then every record carries this implicit 30-minute bias. If a prediction algorithm uses this data, the bias will affect its results.
    • If the algorithm producing the data has a learning bias, it will produce more default values for one set of data than for another. For example, based on buying behavior, it might predict the wrong gender.
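
Below is a minimal sketch of the time-based anomaly check described above, assuming an event_time timestamp column and a 50% threshold (both are illustrative assumptions):

    import org.apache.spark.sql.functions.{avg, col, date_trunc}

    // Truncate each timestamp to the minute (the assumed slice period).
    val perMinute = sampledataframe
      .withColumn("slice", date_trunc("minute", col("event_time")))
      .groupBy("slice")
      .count()

    // Average number of records per slice across the whole batch.
    val avgCount = perMinute.agg(avg("count")).first().getDouble(0)

    // Flag slices that are more than 50% above the average (the x% threshold is an assumption).
    val anomalies = perMinute.where(col("count") > avgCount * 1.5)
    anomalies.show()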
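
Similarly, a rough sketch of the ordering check (the stage, event_time, and ID column names and the funnel stage values are placeholders) can use a window function to compare each record with the previous one for the same ID:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, udf}

    // Expected funnel order; the stage names here are illustrative assumptions.
    val stageRank = Map("impression" -> 1, "click" -> 2, "landing_page" -> 3, "cart" -> 4, "purchase" -> 5)
    val rankOf = udf((stage: String) => stageRank.getOrElse(stage, Int.MaxValue))

    val w = Window.partitionBy("ID").orderBy("event_time")

    // A record is out of order if its stage rank is lower than the previous record's rank for the same ID.
    val outOfOrder = sampledataframe
      .withColumn("rank", rankOf(col("stage")))
      .withColumn("prev_rank", lag(col("rank"), 1).over(w))
      .where(col("prev_rank").isNotNull && col("rank") < col("prev_rank"))

    println(s"Out-of-order records: ${outOfOrder.count()}")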

Bot Behavior: Usually, a bot's behavior looks something like this:

  • It generates records with the same set of unique identifiers, like the same set of email_ids.
  • It generates bursts of website traffic at particular times: a time-based anomaly.
  • It generates records in a defined order: ordering checks across data batches.