
Data Normalization and Filtration Using Drools


Learn about normalizing and filtering data using Drools by looking at an example of oil well drilling datasets from Arkansas and Oklahoma.


Drools is a rule engine used to implement an expert system with a rule-based approach. It converts both structured and unstructured data into clean transient data by applying business logic for normalization and filtering in DRL files.

In this blog, let's discuss normalizing and filtering data using Drools.

Prerequisites

Download and install the following:

Use Case

Oil well drilling datasets from two different states — Arkansas and Oklahoma — are taken as the input data for processing based on API numbers. The processing steps are to:

  • Filter invalid drill types data.
  • Remove null date values.
  • Normalize API numbers to have correct digits.
  • Format dates to the required format.
  • Remove duplicate well information by keeping only the record with the maximum modified date.

Data Description

The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.

The raw data contains invalid values such as nulls, invalid dates, and invalid drill types, as well as duplicate and invalid well information with differing modified dates.

This raw data is loaded from the source into MS SQL for further filtering and normalization. To download the raw data, see the Reference section.

Arkansas Dataset

Null values for date_of_1st_prod:

Invalid values for initial_production:

Incorrect digits in well API numbers:


Oklahoma Dataset

Duplicate well data:

Invalid date values in test date:

  • MS SQL is used to transform the input data into transient data.

  • Java Database Connectivity (JDBC) is used for interaction between Java and MS SQL in order to get input and to write output into MS SQL after transforming the data.
LOGGER.info("Creating MSSQL connection ................");
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
conn = DriverManager.getConnection(url, userName, passWord);
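
As a rough sketch of how this connection might be used, the raw Arkansas rows could be read into fact objects before being inserted into the KieSession. The table and column names below, and the facts list, are illustrative assumptions rather than details from the original article:

// Illustrative sketch only: the table and column names are assumed, not taken from the article.
// Requires java.sql.* and java.util.* imports, plus the Arkanas fact class sketched later.
List<Arkanas> facts = new ArrayList<>();
String query = "SELECT api_number, date_of_1st_prod, initial_production FROM arkansas_raw";
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(query)) {
    while (rs.next()) {
        Arkanas fact = new Arkanas();
        fact.setApiNumber(rs.getString("api_number"));
        fact.setFirstProdDate(rs.getString("date_of_1st_prod"));
        fact.setInitialProcuction(rs.getString("initial_production"));
        facts.add(fact); // later inserted into the KieSession
    }
}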

KIE File System

The KIE File System is used to load DRL files and to reduce the dependency on the KIE module configuration file. It allows us to change the business logic without redeployment and to keep the DRL files separate from the JAR and deployment files.

All the rule files in the given path are loaded into the KIE file system so that the rules can be applied to the inserted facts. Facts are inserted into the KieSession for which the corresponding rules are loaded.

public void createKnowledgeSession(String rulePath) {
    // Gets the factory class for KIE.
    KieServices ks = KieServices.Factory.get();
    KieFileSystem kfs = ks.newKieFileSystem();
    KieRepository kr = ks.getRepository();

    // Collect all DRL files from the given directory.
    File rulePathFile = new File(rulePath);
    File[] ruleFiles = null;
    if (rulePathFile.isDirectory()) {
        ruleFiles = rulePathFile.listFiles();
    }
    if (ruleFiles == null) {
        LOGGER.info("No rule files found in path: " + rulePath);
        return;
    }

    // Write each DRL file into the KIE file system.
    for (File drlFile : ruleFiles) {
        LOGGER.info("File path is " + drlFile.getAbsolutePath());
        kfs.write(ResourceFactory.newFileResource(drlFile));
    }

    // Build the rules and create a session from the default release.
    KieBuilder kb = ks.newKieBuilder(kfs);
    kb.buildAll();
    KieContainer kContainer = ks.newKieContainer(kr.getDefaultReleaseId());
    this.kSession = kContainer.newKieSession();
}
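
Once the session is built, the facts loaded from MS SQL are inserted and the rules are fired. A minimal sketch of the assumed driver code (the rule path and the facts list are placeholders, not values from the article):

// Assumed driver code: load the DRL files, insert the facts, and fire the rules.
createKnowledgeSession("/path/to/drl/rules");   // directory containing the DRL files
for (Arkanas fact : facts) {
    kSession.insert(fact);                      // each record becomes a fact in working memory
}
kSession.fireAllRules();                        // applies the normalization and filtering rules
kSession.dispose();                             // release the session once processing is done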

Applying Rules

Multiple rules are applied to both datasets to remove duplicate and invalid data, and to normalize and filter the remaining records.
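
The rules match against plain Java fact classes that are not shown in the article. From the field bindings used in the DRL files, the Arkansas fact presumably looks roughly like this (a sketch, not the project's actual class):

import java.util.Date;

// Sketch of the Arkansas fact class inferred from the DRL field bindings; the real class may differ.
public class Arkanas {
    private String apiNumber;          // well API number, later normalized to fourteen digits
    private String firstProdDate;      // raw first production date string
    private String initialProcuction;  // raw initial production value (spelling as used in the rules)
    private Date dateTime;             // parsed production date
    private float gasVol;              // initial production mapped to gas volume
    private boolean filteredData;      // marks records that passed the filters

    public String getApiNumber() { return apiNumber; }
    public void setApiNumber(String apiNumber) { this.apiNumber = apiNumber; }
    public String getFirstProdDate() { return firstProdDate; }
    public void setFirstProdDate(String firstProdDate) { this.firstProdDate = firstProdDate; }
    public String getInitialProcuction() { return initialProcuction; }
    public void setInitialProcuction(String initialProcuction) { this.initialProcuction = initialProcuction; }
    public Date getDateTime() { return dateTime; }
    public void setDateTime(Date dateTime) { this.dateTime = dateTime; }
    public float getGasVol() { return gasVol; }
    public void setGasVol(float gasVol) { this.gasVol = gasVol; }
    public boolean isFilteredData() { return filteredData; }
    public void setFilteredData(boolean filteredData) { this.filteredData = filteredData; }
}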

Applying Rules in Arkansas Dataset

The rules are:

  1. Remove invalid values.
  2. Set initial production and unify the date format.

Rule 1: Removing Invalid Values

This rule is applied to remove records with invalid production date and initial production values using the retract keyword.

rule "remove invalid initial production and first production values" salience 2
 when
 $arkanas : Arkanas( $first_prod_date : firstProdDate,$initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || 
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || 
$initial_production.contains("See Remarks") )
 then
 retract($arkanas);
 end

The filtered Arkansas data, after removing invalid production date and initial production values, is as follows:


Rule 2: Setting Initial Production and Unifying Date Format

This rule is applied to set gas_vol (the initial production value is treated as gas_vol) and to convert the production date string into a Date of the prescribed format by supplying the date pattern.

API numbers are padded with zeros to normalize them to fourteen digits.

rule "set initial production to gas_vol and first_prod_date to correct format" salience 1
 when 
$arkanas : Arkanas( $initial_production : initialProcuction,$first_prod_date : firstProdDate,$api_number : apiNumber )
eval( CommonUtil.isValidDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy") )
 then
 try {
$arkanas.setDateTime(CommonUtil.getDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy"));
 $arkanas.setGasVol(Float.parseFloat($initial_production));
 $arkanas.setFilteredData(true);
 $arkanas.setApiNumber($api_number+"0000");
 } catch (Exception e) {
 e.printStackTrace();
 }
end

API numbers are normalized after applying the normalizing rule:

First production dates (first_prod_date) are converted into the prescribed format in the DRL file and mapped to the date_time column:
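
The rule relies on CommonUtil.isValidDate and CommonUtil.getDate, whose implementations are not shown in the article. A plausible minimal version using SimpleDateFormat (an assumption, not the original code):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Plausible sketch of the date helpers referenced in the DRL files; not the original implementation.
public class CommonUtil {

    // Returns true if the string parses as a date in the given format, e.g. "EEE MMM dd HH:mm:ss z yyyy".
    public static boolean isValidDate(String value, String format) {
        return getDate(value, format) != null;
    }

    // Parses the string into a Date using the given format; returns null if it cannot be parsed.
    public static Date getDate(String value, String format) {
        if (value == null) {
            return null;
        }
        try {
            return new SimpleDateFormat(format).parse(value);
        } catch (ParseException e) {
            return null;
        }
    }
}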

Applying Rules in Oklahoma Dataset

The rules are:

  1. Apply date filter.
  2. Get max modify date values.
  3. Filter max modify date values.

Rule 1: Applying Date Filter

This rule keeps only the records whose test date falls within the last seven years and whose drill type is horizontal; all other records are retracted.

rule "data between given range"
 when
 $oklahoma : Oklahoma( $api_number : apiNumber,$test_date : testDate,$modify_date : 
 modifyDate,$drill_type : drillType)
eval( !(CommonUtil.isDateWithinRange(CommonUtil.get
Date("2010-01-01","yyyy-MM-dd"),new 
Date(),$test_date) && StringUtil.isValidString($drill_type) && ( 
$drill_type.toUpperCase().startsWith("HORIZONTAL") || 
$drill_type.toUpperCase().equalsIgnoreCase("H") ) ) )
 then
 retract($oklahoma);
 end

Data with a test date before 2010-01-01 is filtered out to remove invalid test date values, and the remaining test dates are mapped to date_time as mentioned in the data description section:
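
The range check uses another helper, CommonUtil.isDateWithinRange, which is also not shown in the article. A plausible sketch, assuming it takes a start date, an end date, and the value to test:

// Plausible sketch of the range check; the actual signature in the project may differ.
public static boolean isDateWithinRange(Date start, Date end, Date value) {
    if (start == null || end == null || value == null) {
        return false;
    }
    return !value.before(start) && !value.after(end);
}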

Rule 2: Getting Max Modify Date Values

This rule is applied to get the max modify date value for each API number and store it in a temporary fact called MaxValue by using accumulate.

Accumulate gets the max modify date by grouping the data by well API numbers.

rule "selecting ok data with max modify date"
 when
 $oklahoma : Oklahoma($api_number : getApiNumber()) and not MaxValue($api_number == apinumber)
accumulate(ok : Oklahoma(getApiNumber()==$api_number),$maxDateValue : 
max(ok.getModifyDate().getTime()))
 then
 insert(new MaxValue($api_number,$maxDateValue));
 end

Rule 3: Filtering Max Modify Date Values

This rule is applied to keep only the record whose modify date matches the max modify date for its API number, storing each unique (API number, modify date) pair in UniqVal so that data is not replicated.

rule "Filters max modify date values"
 when
 $oklahoma : Oklahoma($api_number : getApiNumber(),$max_date : getModifyDate().getTime()) 
and not UniqVal($api_number == apinumber,$max_date == max_modify_date)
$maxValue : MaxValue($api_number==apinumber && $max_date==max_modify_date)
 then
 insert(new UniqVal($api_number,$max_date));
 $oklahoma.setFilteredData(true);
 end
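
Both rules rely on small helper facts, MaxValue and UniqVal, whose classes are not shown in the article. From the constructors and field constraints used above, they presumably pair an API number with a modify-date timestamp, roughly like this (each class in its own source file):

// Sketch of the temporary facts inferred from the rules; the real classes may differ.
public class MaxValue {
    private String apinumber;       // well API number used for grouping
    private Long max_modify_date;   // max modify date for that API number, as epoch milliseconds

    public MaxValue(String apinumber, Long max_modify_date) {
        this.apinumber = apinumber;
        this.max_modify_date = max_modify_date;
    }
    public String getApinumber() { return apinumber; }
    public Long getMax_modify_date() { return max_modify_date; }
}

public class UniqVal {
    private String apinumber;       // well API number already emitted
    private Long max_modify_date;   // modify date kept for that API number

    public UniqVal(String apinumber, Long max_modify_date) {
        this.apinumber = apinumber;
        this.max_modify_date = max_modify_date;
    }
    public String getApinumber() { return apinumber; }
    public Long getMax_modify_date() { return max_modify_date; }
}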

After removing duplicate well information by grouping on API number and keeping only the max modify date, the number of records is reduced to fifty:


Salience

Salience is used to set the order in which rules are applied, as certain rules need to execute only after other rules have run.

For example, in the Arkansas dataset, the rule that removes invalid initial production and first production values (salience 2) has to execute before the rule that formats the date and gas_vol (salience 1), since rules with higher salience fire first.

rule "remove invalid initial production and first production values" salience 2
 when
 $arkanas : Arkanas( $first_prod_date : firstProdDate, $initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || 
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || 
$initial_production.contains("See Remarks") )
 then
 retract($arkanas);
 end

Conclusion

Business rules are separated from application code by keeping the normalization and filtration logic in DRL files, so the business logic can be changed without redeployment.

In this blog, the test date was used for getting data from the last seven years. Business analysts can change this date range in the future without changing the code and without performing redeployment.


Topics:
drools ,normalization ,big data ,tutorial ,data filter

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

Opinions expressed by DZone contributors are their own.
