Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Solving a Classification Problem Using the Decision Tree Algorithm With Oracle

DZone's Guide to

Solving a Classification Problem Using the Decision Tree Algorithm With Oracle

Learn how to solve a classification problem in Oracle by looking at an example that uses an interesting HR analytics dataset.

· AI Zone
Free Resource

Find out how AI-Fueled APIs from Neura can make interesting products more exciting and engaging. 

In this article, I will solve a classification problem with Oracle data mining. 

Data science and machine learning are very popular today. But these subjects require extensive knowledge and application expertise. We can solve these problems with various products and software that have been developed by various companies. In Oracle, methods and algorithms for solving these problems are presented to users with the DBMS_DATA_MINING package.

With the DBMS_DATA_MINING package, we can create models such as clustering, classification, regression, anomaly detection, feature extraction, and association. We can interpret efficiency with the models we create. The results we obtain from these models can be put into our business scenario.

The DBMS_DATA_MINING package does not come up by default on the Oracle database. For this reason, it's necessary to install this package first. You can set up your database Oracle Data Mining by following this link.

With the installation of the Oracle Data Mining package, three new dictionary tables are created:

SELECT * FROM  ALL_MINING_MODELS;
SELECT * FROM ALL_MINING_MODEL_SETTINGS;
SELECT * FROM ALL_MINING_MODEL_ATTRIBUTES;

The ALL_MINING_MODELS table contains information about all the models.

The ALL_MINING_MODELS_SETTINGS and ALL_MINING_MODELS_ATTRIBUTES tables contain parameters and specific details about these models.

Now, let's prepare an easily understood data set to solve a sample classification problem.

We need a dataset to do the example. We will do this example through the HR_ANALYTICS dataset we downloaded from kaggle.

Let's first examine the HR_ANALYTICS dataset.

The HR_ANALYTICS data set consists of ten (eight numeric, two string) columns and contains 15,000 records. The description of each column in the HR_ANALYTICS dataset is given below.

Column Name Description Data Type
satisfaction_level 
Level of satisfaction (0-1) Numeric
last_evaluation 
Time since last performance evaluation (in years) Numeric
number_project
Number of projects completed while at work Numeric
average_montly_hours
Average monthly hours at workplace Numeric
time_spend_company
Number of years spent in the company Numeric
work_accident
Whether the employee had a workplace accident Numeric
left
Whether the employee left the workplace (1 or 0 factor) Numeric
promotion_last_5years
Whether the employee was promoted in the last five years Numeric
sales
Department in which they work String
salary
Relative level of salary (high)

String

Now that we have reviewed the details with our dataset, let's load the HR_ANALYTICS that we downloaded to our Oracle database.

First, create the Oracle table in which we will load the data set (CSV) that we downloaded.

CREATE TABLE hr_data
(
   satisfaction_level      NUMBER,
   last_evaluation         NUMBER,
   number_project          NUMBER,
   average_montly_hours    NUMBER,
   time_spend_company      NUMBER,
   Work_accident           NUMBER,
   left                    NUMBER,
   promotion_last_5years   NUMBER,
   sales                   VARCHAR2 (20),
   salary                  VARCHAR2 (20)
);

Now that we have created our table, we will load the dataset we downloaded as CSV into the table; we have multiple methods to do this:

  • Using Oracle External Table

  • Using Oracle SQL Loader

  • Using SQL-PL/SQL editors (Oracle SQL Developer, Toad, PL/SQL Developer...)

I will load the dataset with the editor that I am using. I use Toad as an editor. With Toad, you can perform the data loading process by following the steps below.

Database > Import > Import Table Data

You may not be using this editor because Toad is paid. This feature is available to other editors, so you can easily do this with other editors. For example, with Oracle SQL Developer for free, you can load data as follows using Oracle SQL Developer.

SELECT * FROM HR_DATA;

Image title

We have completed the dataset installation process. Now let's determine which columns we will use for classification.

When we examine our data set, we see that there is some information about each employee. At the end of this information, the employee's salary (SALARY) was given (low-medium-high). We will produce our model using a large part of the data set (training set). We will then test the model using the rest of the data (test set). Finally, we will compare the actual results with the results produced by the model we build.

First, let's separate our data set into two separate clusters: training (70%) and test (30%).

CREATE TABLE HR_DATA_MAIN AS
SELECT *
  FROM (SELECT ROWNUM RN,
               SATISFACTION_LEVEL,
               LAST_EVALUATION,
               NUMBER_PROJECT,
               AVERAGE_MONTLY_HOURS,
               TIME_SPEND_COMPANY,
               WORK_ACCIDENT,
               LEFT,
               PROMOTION_LAST_5YEARS,
               SALES,
               SALARY
          FROM HR_DATA); 



CREATE TABLE HR_DATA_TRAINING
AS
   SELECT SATISFACTION_LEVEL,
          LAST_EVALUATION,
          NUMBER_PROJECT,
          AVERAGE_MONTLY_HOURS,
          TIME_SPEND_COMPANY,
          WORK_ACCIDENT,
          LEFT,
          PROMOTION_LAST_5YEARS,
          SALES,
          SALARY
     FROM HR_DATA_MAIN
    WHERE RN < 10500;


CREATE TABLE HR_DATA_TEST
AS
   SELECT SATISFACTION_LEVEL,
          LAST_EVALUATION,
          NUMBER_PROJECT,
          AVERAGE_MONTLY_HOURS,
          TIME_SPEND_COMPANY,
          WORK_ACCIDENT,
          LEFT,
          PROMOTION_LAST_5YEARS,
          SALES,
          SALARY
     FROM HR_DATA_MAIN
    WHERE RN >= 10500; 

We prepared our training and test data set. Now let's create our table that we will read the algorithm settings and enter the algorithm parameters we will use in this table. We do not have to enter all the parameters necessary for the operation of the algorithm. If we do not define the optional parameters, Oracle uses these parameters with default values in the system. You can find the parameters defined for the decision tree algorithm and the default values for these parameters below.

Image title

Source

CREATE TABLE DTSETTINGS
(
   SETTING_NAME    VARCHAR2 (200),
   SETTING_VALUE   VARCHAR2 (200)
);


BEGIN
INSERT INTO DTSETTINGS
VALUES ('ALGO_NAME', 'ALGO_DECISION_TREE');

INSERT INTO DTSETTINGS
VALUES (dbms_data_mining.tree_impurity_metric, 'TREE_IMPURITY_ENTROPY');

COMMIT;
END;

We separated our data set as training and testing, and then we created our algorithm settings. Now, we can move on to the step of creating our model.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL (
      model_name            => 'DT_MODEL',
      mining_function       => DBMS_DATA_MINING.CLASSIFICATION,
      data_table_name       => 'HR_DATA_TRAINING',
      case_id_column_name   => NULL,
      target_column_name    => 'SALARY',
      settings_table_name   => 'DTSETTINGS');
END;

Now that we have our model, let's look at the details about the model now from the dictionary.

SELECT MODEL_NAME,
       ALGORITHM,
       COMMENTS,
       CREATION_DATE,
       MINING_FUNCTION,
       MODEL_SIZE
  FROM ALL_MINING_MODELS
 WHERE MODEL_NAME = 'DT_MODEL';

SELECT SETTING_NAME, SETTING_VALUE
FROM ALL_MINING_MODEL_SETTINGS
WHERE MODEL_NAME = 'DT_MODEL';

Now that we have seen the details about our model, we can now use our model to make predictions through our test data set.

SELECT T.SALARY ACTUAL,
 PREDICTION (DT_MODEL USING *) MODEL_PREDICT_RESPONSE,
 PREDICTION_PROBABILITY (DT_MODEL USING *) MODEL_PROBABILTY_RESPONSE
FROM HR_DATA_TEST T;

We applied our model to our test data set and observed the estimates that our model produced.

We can calculate the performance of the model we build by ourselves, as well as use the DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX method.

DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
      accuracy                     OUT NUMBER,
      apply_result_table_name      IN  VARCHAR2,
      target_table_name            IN  VARCHAR2,
      case_id_column_name          IN  VARCHAR2,
      target_column_name           IN  VARCHAR2,
      confusion_matrix_table_name  IN  VARCHAR2,
      score_column_name            IN  VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name  IN  VARCHAR2 DEFAULT 'PROBABILITY',
      cost_matrix_table_name       IN  VARCHAR2 DEFAULT NULL,
      apply_result_schema_name     IN  VARCHAR2 DEFAULT NULL,
      target_schema_name           IN  VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name      IN  VARCHAR2 DEFAULT NULL,
      score_criterion_type         IN  VARCHAR2 DEFAULT 'PROBABILITY');

Detailed explanations of the use of DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX method can be found here.

To find out how AI-Fueled APIs can increase engagement and retention, download Six Ways to Boost Engagement for Your IoT Device or App with AI today.

Topics:
oracle ,classifcation ,ai ,classification ,decision trees ,tutorial ,algorithm ,data mining

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}