Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Solving a Clustering Problem Using the k-Means Algorithm With Oracle

DZone's Guide to

Solving a Clustering Problem Using the k-Means Algorithm With Oracle

Clustering algorithms let machines group data points or items into groups with similar characteristics. See how to use the k-means algorithm with Oracle to do clustering.

· AI Zone
Free Resource

Find out how AI-Fueled APIs from Neura can make interesting products more exciting and engaging. 

In this article, I will solve a clustering problem with Oracle data mining. 

Data science and machine learning are very popular today. But these subjects require extensive knowledge and application expertise. We can solve these problems with various products and software that have been developed by various companies. In Oracle, methods and algorithms for solving these problems are presented to users with the DBMS_DATA_MINING package.

With the DBMS_DATA_MINING package, we can create models such as clustering, classification, regression, anomaly detection, feature extraction, and association. We can interpret the efficiency with the models we create. The results we obtain from these models can be put into our business scenario.

The DBMS_DATA_MINING package does not come up by default on the Oracle database. For this reason, it's necessary to install this package first. You can set up your database Oracle Data Mining by following this link.

With the installation of the Oracle Data Mining package, three new dictionary tables are created:

SELECT * FROM  ALL_MINING_MODELS;

SELECT * FROM ALL_MINING_MODEL_SETTINGS;

SELECT * FROM ALL_MINING_MODEL_ATTRIBUTES;

The ALL_MINING_MODELS table contains information about all the models.

The ALL_MINING_MODELS_SETTINGS and ALL_MINING_MODELS_ATTRIBUTES tables contain parameters and specific details about these models.

Now, let's prepare an easily understood data set to solve a sample clustering problem.

CREATE TABLE KMEANSDATA
(
   INSTANCE   NUMBER,
   X_AXIS     NUMBER,
   Y_AXIS     NUMBER
);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (1, 3, 5);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (2, 2, 1);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (3, 1, 1);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (4, 4, 3);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (5, 6, 1);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (6, 7, 5);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (7, 4, 4);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (8, 5, 6);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (9, 3, 8);

INSERT INTO KMEANSDATA (INSTANCE, X_AXIS, Y_AXIS)
     VALUES (10, 5, 6);

COMMIT;

I created a simple dataset to understand the problem. My goal is to cluster the closest one in this data set with the k-Means algorithm.

k-Means is one of the simplest unsupervised learning algorithms. It solves the well-known clustering problem. It's a simple way to classify a given dataset through a certain number of clusters.

The following links provide detailed descriptions of the k-Means algorithm.

There are 10 records in the dataset. I want to collect these records in three different clusters, so I will choose a k value of 3. First, I look at the default parameter settings of my algorithm.

SELECT *
  FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'KMNS%'OR setting_name like '%CLUS_NUM_CLUSTERS%';

The default k parameter (CLUS_NUM_CLUSTERS) of the algorithm is set to 10, as seen in the default settings. I am creating a new table to set this to 3 to solve my problem and updating the relevant parameters.

CREATE TABLE kmeanssettings
AS
   SELECT *
     FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
    WHERE setting_name LIKE 'KMNS%' OR setting_name like '%CLUS_NUM_CLUSTERS%';

UPDATE kmeanssettings
   SET setting_value = 3
 WHERE setting_name = 'CLUS_NUM_CLUSTERS';

UPDATE kmeanssettings
   SET setting_value = 10
 WHERE setting_name = 'KMNS_ITERATIONS';

COMMIT;

SELECT * FROM kmeanssettings;

I have updated the relevant parameters as shown. Now, I will create my model using these settings:

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL (
      model_name            => 'K_MEANNS_MODEL',
      mining_function       => DBMS_DATA_MINING.CLUSTERING,
      data_table_name       => 'KMEANSDATA',
      case_id_column_name   => 'INSTANCE',
      target_column_name    => NULL,
      settings_table_name   => 'KMEANSSETTINGS');
END;

Our code runs and we have created our model according to the current settings. Now, let's take this model data set and examine the clustering results:

BEGIN
   DBMS_DATA_MINING.APPLY (model_name            => 'K_MEANNS_MODEL',
                           data_table_name       => 'KMEANSDATA',
                           case_id_column_name   => 'INSTANCE',
                           result_table_name     => 'KMEANS_RESULT');
END;

Now, let's examine the table:

SELECT * FROM KMEANS_RESULT;

When we look at the result set, we see a total of 30 records. For this reason, we had 10 records in our total data set, and we gave the algorithm the number of clusters. In this case, the probability of each record separately for each set in the output of the algorithm is calculated (3X10 = 30). To see the exact result, find the maximum probability record of each element and list it.

SELECT t1.instance,
       t1.CLUSTER_ID,
       t1.probability,
       t2.x_axis,
       t2.y_axis
  FROM (SELECT INSTANCE, CLUSTER_ID, PROBABILITY
          FROM (SELECT T.*,
                       MAX (PROBABILITY)
                       OVER (PARTITION BY INSTANCE ORDER BY PROBABILITY DESC)
                          MAXP
                  FROM KMEANS_RESULT T)
         WHERE MAXP = PROBABILITY) t1,
       KMEANSDATA t2
 WHERE t1.instance = t2.instance order by cluster_id;

We see that our data set is distributed to three different clusters. Finally, let's look at the last situation on the chart.

As can be understood from the graph, our model is logically clustered according to the k-Means algorithm.

To find out how AI-Fueled APIs can increase engagement and retention, download Six Ways to Boost Engagement for Your IoT Device or App with AI today.

Topics:
machine learning ,oracle ,clustering ,k-means clustering ,ai ,data mining ,algorithms ,tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}