Data Normalization and Filtration Using Drools
Learn about normalizing and filtering data using Drools by looking at an example of oil well drilling datasets from Arkansas and Oklahoma.
Join the DZone community and get the full member experience.
Join For FreeDrools, a rule engine, is used to implement an expert system using a rule-based approach. It's used to convert both structured and unstructured data into transient data by applying business logic for normalizing and filtering data in DRL files.
In this blog, let's discuss normalizing and filtering data using Drools.
Prerequisites
Download and install the following:
Use Case
Oil well drilling datasets from two different states — Arkansas and Oklahoma — are taken as the input data for processing based on API numbers.
- Filter invalid drill types data.
- Remove null date values.
- Normalize API numbers to have correct digits.
- Format dates to the required format.
- Remove duplicate well information by taking only maximum modified date value.
Data Description
The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6,040 records and the Oklahoma dataset has 2,559 records.
The raw data contains invalid values such as null, invalid date, invalid drill type, and duplicate well and invalid well information with modified dates.
This raw data from the source is transformed to MS SQL for further filtering and normalization. To download raw data, look at the Reference section.
Arkansas Dataset
Null values for date_of_1st_prod
:
Invalid values for initial_production
:
Incorrect digits in well API numbers:
Oklahoma Dataset
Duplicate well data:
Invalid date values in test date:
MS SQL is used to transform the input data into transient data.
- Java Database Connectivity (JDBC) is used for interaction between Java and MS SQL in order to get input and to write output into MS SQL after transforming the data.
LOGGER.info("Creating MSSQL connection ................");
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
conn = DriverManager.getConnection(url,userName,passWord);
KIE File System
The KIE File System is used to load DRL files and to reduce the dependency from the KIE module configuration file. It allows us to change the business logic without redeployment and to keep the DRL files separately from Jar and deployment files.
All the rule files in the given path are loaded into MS SQL to apply the rules on facts inserted. Facts are inserted to ksession
for which particular rules are loaded.
public void createKnowledgeSession(String rulePath) {
//Gets the factory class for KIE.
KieServices ks = KieServices.Factory.get();
KieFileSystem kfs = ks.newKieFileSystem();
KieRepository kr = ks.getRepository();
File rulePathFile = new File(rulePath);
File ruleFiles[] = null;
if(rulePathFile.isDirectory()) {
ruleFiles = rulePathFile.listFiles();
}
for(File drlFile : ruleFiles) {
LOGGER.info("File path is "+drlFile.getAbsolutePath());
kfs.write(ResourceFactory.newFileResource(drlFile));
}
KieBuilder kb = ks.newKieBuilder(kfs);
kb.buildAll();
KieContainer kContainer = ks.newKieContainer(kr.getDefaultReleaseId());
this.kSession = kContainer.newKieSession();
}
Applying Rules
Multiple rules were applied to both the datasets to process data, remove duplicate and invalid data, normalize data, and filter data.
Applying Rules in Arkansas Dataset
The rules are:
- Remove invalid values.
- Set initial production and unifying date format.
Rule 1: Removing Invalid Values
This rule is applied to remove invalid production date and initial production values using retract
key words.
rule "remove invalid initial production and first production values" salience 2
when
$arkanas : Arkanas( $first_prod_date : firstProdDate,$initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) ||
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") ||
$initial_production.contains("See Remarks") )
then
retract($arkanas);
end
The filtered Arkansas data, after removing invalid production date and initial production values, is as follows:
Rule 2: Setting Initial Production and Unifying Date Format
This rule is applied to format gas_vol
values (initial production value is considered as gas_vol
) and to format the production date to prescribed the date type from the string by giving the date format.
API numbers are normalized with zeros for formatting the API numbers to fourteen digits.
rule "set initial production to gas_vol and first_prod_date to correct format" salience 1
when
$arkanas : Arkanas( $initial_production : initialProcuction,$first_prod_date : firstProdDate,$api_number : apiNumber )
eval( CommonUtil.isValidDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy") )
then
try {
$arkanas.setDateTime(CommonUtil.getDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy"));
$arkanas.setGasVol(Float.parseFloat($initial_production));
$arkanas.setFilteredData(true);
$arkanas.setApiNumber($api_number+"0000");
} catch (Exception e) {
e.printStackTrace();
}
end
API numbers are normalized after applying the normalizing rule:
First, production dates are converted into the prescribed format in DRL file and mapped with the date_time
column as shown in the below diagram:
Applying Rules in Oklahoma Dataset
The rules are:
- Apply
date
filter. - Get max modify date values.
- Filter max modify date values.
Rule 1: Applying Date Filter
This rule is applied to filter the values based on their test date for last seven years and to filter horizontal drill type.
rule "data between given range"
when
$oklahoma : Oklahoma( $api_number : apiNumber,$test_date : testDate,$modify_date :
modifyDate,$drill_type : drillType)
eval( !(CommonUtil.isDateWithinRange(CommonUtil.get
Date("2010-01-01","yyyy-MM-dd"),new
Date(),$test_date) && StringUtil.isValidString($drill_type) && (
$drill_type.toUpperCase().startsWith("HORIZONTAL") ||
$drill_type.toUpperCase().equalsIgnoreCase("H") ) ) )
then
retract($oklahoma);
end
Data with test date before 2010-01-01 is filtered to remove invalid test date values and is mapped with date_time
as mentioned in the data description section:
Rule 2: Getting Max Modify Date Values
This rule is applied to get the max modify date values by grouping API numbers in a temp storage object called MaxValue
by using accumulate
.
Accumulate gets the max modify date by grouping the data by well API numbers.
rule "selecting ok data with max modify date"
when
$oklahoma : Oklahoma($api_number : getApiNumber()) and not MaxValue($api_number == apinumber)
accumulate(ok : Oklahoma(getApiNumber()==$api_number),$maxDateValue :
max(ok.getModifyDate().getTime()))
then
insert(new MaxValue($api_number,$maxDateValue));
end
Rule 3: Filtering Max Modify Date Values
This rule is applied to filter max modify date values by storing the unique values such that data is not replicated.
rule "Filters max modify date values"
when
$oklahoma : Oklahoma($api_number : getApiNumber(),$max_date : getModifyDate().getTime())
and not UniqVal($api_number == apinumber,$max_date == max_modify_date)
$maxValue : MaxValue($api_number==apinumber && $max_date==max_modify_date)
then
insert(new UniqVal($api_number,$max_date));
$oklahoma.setFilteredData(true);
end
After removing the duplicate well information by grouping the API number with max modify date, the number of records is reduced to fifty:
Salience
Salience is used to set orders for applying rules as certain rules need to be executed after execution of certain rules.
For example, in the Arkansas dataset, rules for removing invalid initial production and first production values had to be executed first to format date
and gas_vol
.
rule "remove invalid initial production and first production values" salience 2
when
$arkanas : Arkanas( $first_prod_date : firstProdDate, $initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) ||
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") ||
$initial_production.contains("See Remarks") )
then
retract($arkanas);
end
Conclusion
Business rules are separated from business code by applying the business logic for normalization and filtration in the DRL file, thus easily changing business logic without redeployments.
In this blog, the test date was used for getting data from the last seven years. Business analysts can change this date range in the future without changing the code and without performing redeployment.
Published at DZone with permission of Rathnadevi Manivannan. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments