
Data Quality and Validation


Examples of data quality and validation checks and how easy it is to programmatically ensure data quality with the help of Apache Spark and Scala.


Big data and machine learning deal with data, so it's important to keep the data in the system correct. Inaccurate data not only reduces the efficiency of the system, but can also lead to misleading insights. One of the big steps toward ensuring the correctness of data is data quality and validation. With an increasing volume of data, and the noise that comes along with it, new checks are being added every day to ensure data quality. Since the amount of data is huge, another thing that needs to be considered is how to process all of these checks and validations quickly; i.e., a system that can go through each and every ingested record in a highly distributed way. This post walks through some examples of data quality and validation checks and shows how easy it is to programmatically ensure data quality with the help of Apache Spark and Scala.

Data accuracy refers to the closeness of observed values to the true values, or to values accepted as being true. Common accuracy problems include:

  • Null values: records that contain a null where a value is expected. For example: male/female/null.
  • Specific values: fields that must match a known value, such as a company ID.

Schema Validation: Every batch of data should follow the same column names and data types.

    // expectedSchema is the agreed StructType for this feed.
    for (elem <- sampledataframe.schema) {
        if (elem.dataType != expectedSchema(elem.name).dataType) {
            println("Schema mismatch for column: " + elem.name)
        }
    }

Column Value Duplicates: for example, a duplicate email across records.

    val dataframe1 = sampledataframe.groupBy("columnname").count()
    val dataframe2 = dataframe1.filter("count = 1")
    // Number of column values that occur more than once.
    println("No of duplicate records : "
        + (dataframe1.count() - dataframe2.count()).toString())

Uniqueness Check: Records should be unique with respect to a given column.

This is similar to the duplicate check.

    val dataframe1 = sampledataframe.groupBy("columnname").count()
    dataframe1.filter("count = 1").count() // this gives the count of unique values

Accuracy Check: Regular expressions can be used. For example, we can check that email IDs contain an @.
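As a sketch of such a check, assuming an "email" column and a deliberately loose pattern (not a full RFC 5322 validator):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AccuracyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AccuracyCheck")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sampledataframe = Seq(
      "alpha@example.com",
      "bravo.example.com", // no @, should be flagged
      "gamma@example.com"
    ).toDF("email")

    // rlike applies a regular expression to each value of the column.
    val invalid = sampledataframe.filter(!col("email").rlike("^[^@]+@[^@]+\\.[^@]+$"))
    println("No of invalid emails : " + invalid.count()) // 1 here

    spark.stop()
  }
}
```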


Data currency: How up-to-date is your data? Here the assumption is that data is coming in on a daily basis and is then checked and timestamped.
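A minimal currency check can compare the newest timestamp in the batch against the current time; the "event_time" column name and the one-day freshness window below are illustrative assumptions:

```scala
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

object CurrencyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CurrencyCheck")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two sample records: one two hours old, one thirty hours old.
    val sampledataframe = Seq(
      Timestamp.from(Instant.now().minus(2, ChronoUnit.HOURS)),
      Timestamp.from(Instant.now().minus(30, ChronoUnit.HOURS))
    ).toDF("event_time")

    // How old is the newest record in the batch?
    val newest = sampledataframe.agg(max(col("event_time"))).first().getTimestamp(0)
    val ageHours = ChronoUnit.HOURS.between(newest.toInstant, Instant.now())

    // Fail the check when even the newest record is older than a day.
    if (ageHours > 24) println("Batch is stale") else println("Batch is current")

    spark.stop()
  }
}
```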

This list can go on and on, but the good thing about this Spark-and-Scala-based approach is that a lot can be achieved with very little code, even on a huge amount of data.

Sometimes, a system may have specific requirements related to who is consuming the data and in what form, and the consumer may have assumptions about the data.

Data usability: Consumer applications may apply certain expectations, like:

  • column1.value should not be equal to column2.value
  • column3.value should always be column1.value + column2.value
  • No value in a column should appear more than x% of the time

The last expectation maps directly to Spark's frequent-items helper, which returns values whose frequency exceeds the given support:

    val arr = Array("ColumnName1", "ColumnName2", "ColumnName3")
    val freq = sampledataframe.stat.freqItems(arr, 0.4)
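The first two expectations can be expressed as simple column comparisons; a minimal sketch, with illustrative column names and sample rows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object UsabilityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UsabilityCheck")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows; the middle row violates both expectations below.
    val sampledataframe = Seq((1, 2, 3), (4, 4, 9), (2, 5, 7))
      .toDF("column1", "column2", "column3")

    // Expectation: column1 should not equal column2.
    val equalPairs = sampledataframe.filter(col("column1") === col("column2")).count()

    // Expectation: column3 should always be column1 + column2.
    val badSums = sampledataframe
      .filter(col("column3") =!= col("column1") + col("column2"))
      .count()

    println("Rows where column1 = column2 : " + equalPairs)         // 1
    println("Rows where column3 != column1 + column2 : " + badSums) // 1

    spark.stop()
  }
}
```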

While these are considered basic validations, we also have some advanced level checks to ensure data quality, like:

  • Anomaly Detection: This includes two major points:
    • If the dimension is given, like a time-based anomaly. This means that within any timeframe (slice period), the number of records should not be more than x% above the average. To achieve this with Spark:
      • Let's assume the slice period is 1 minute.
      • First, the timestamp column needs to be truncated/formatted so that the unit of the timestamp is a minute. This will produce duplicates, but that is not an issue here.
      • Next, use groupBy, like so: sampledataframe.groupBy("timestamp").count().
      • Get the average of that count and find any slice period (if one exists) that has x% more records than the average.
  • Ordering
    • Records should follow a certain order. For example, within a day the records for a particular consumer should start with impressions, then clicks, landing page, and cart, and end with purchases. There may be partial records, but they should follow this general order. To check this with Spark:
      • Group the records with groupBy("ID").
      • Run the order check within each group.
  • Circular dependency: Let me explain this with an example.
    • If two columns are taken up where column A => column B, and the records are like:

      ID  Name   Father's Name
      1   Alpha  Bravo
      2   Bravo  Gamma
      3   Gamma  Alpha

    • If the consuming application tries to print the family hierarchy, it may fall into a loop (Alpha -> Bravo -> Gamma -> Alpha).
  • Failure Trend
    • Consider that data is coming into the system every day. Let's assume it's behavioral/touchpoint data. For simplicity, let's call each day's data a 'batch.' If we are getting exactly the same set of failures in every batch, then there is a failure trend running across batches.
    • If the failures keep coming for the same set of email_ids (email_id being one column), it might be a symptom of bot behaviour.
  • Data Bias: This means a consistent shift in the data. For example:
    • If 30 minutes is getting added to every timestamp, then all the records carry a 30-minute implicit bias. So, if a prediction algorithm uses this data, the bias will impact its results.
    • If the algorithm producing this data has a learning bias, then for one set of data it produces more default values than for another. For example, based on buying behaviour, it can predict the wrong gender.
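The time-based anomaly steps above can be sketched as follows; the "timestamp" column name and the 50% threshold (x = 50) are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, count, date_trunc}

object AnomalyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AnomalyCheck")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample events: three records fall in the 10:00 slice, one each in 10:01 and 10:02.
    val sampledataframe = Seq(
      "2018-01-01 10:00:05", "2018-01-01 10:00:40", "2018-01-01 10:00:55",
      "2018-01-01 10:01:10", "2018-01-01 10:02:20"
    ).toDF("ts").withColumn("timestamp", col("ts").cast("timestamp"))

    // Truncate to the 1-minute slice period and count records per slice.
    val sliced = sampledataframe
      .withColumn("slice", date_trunc("minute", col("timestamp")))
      .groupBy("slice")
      .agg(count("*").as("cnt"))

    // Compare each slice against the average; flag slices more than 50% above it.
    val mean = sliced.agg(avg(col("cnt"))).first().getDouble(0)
    val anomalies = sliced.filter(col("cnt") > mean * 1.5)

    println("Anomalous slices : " + anomalies.count()) // the 10:00 slice here

    spark.stop()
  }
}
```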

Bot Behaviour: Usually, a bot's behaviour looks something like this:

  • It generates records with the same set of unique identifiers, like the same set of email_ids.
  • It generates website traffic spikes at particular times: a time-based anomaly.
  • It generates records in a defined order: ordering checks across data batches can catch this.



