
Prepare Your Data for ML Training


Learn the process of how to prepare data for machine learning model training.


Preparing data for machine learning model training is somewhat like preparing ingredients to cook dinner. In both cases it takes time, but then you are rewarded with a tasty dinner, or in this case, a great ML model.

I will not be diving into data science or discussing how to structure and transform data. It all depends on the use case, and there are so many ways to reformat data to get the most out of it. I would rather focus on a simple but practical example — how to split data into training and test datasets with Python.

Today's example is based on a notebook from this post: Jupyter Notebook — Forget CSV, fetch data from DB with Python. It explains how to load data from the DB and construct a data frame.
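As a minimal, self-contained sketch of that loading step, the snippet below uses an in-memory SQLite table with made-up rows and column names in place of the database from that post:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the database in the referenced post:
# an in-memory SQLite table with made-up rows and column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (glucose REAL, bmi REAL, class_var INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(148, 33.6, 1), (85, 26.6, 0), (183, 23.3, 1)],
)

# Load the query result directly into a pandas data frame
df = pd.read_sql("SELECT * FROM patients", conn)
print(df.shape)   # (3, 3)
print(list(df))   # ['glucose', 'bmi', 'class_var']
```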

This Python code snippet builds the train/test datasets:

# split data into X and Y
X = df.iloc[:, 0:8]
Y = df.iloc[:, 8:9]

headers = list(X)
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.33, stratify=Y, random_state=0)

print(train_X.shape, test_X.shape)
print()
print(f'Number of rows in Train dataset: {train_X.shape[0]}')
print(train_Y['class_var'].value_counts())
print()
print(f'Number of rows in Test dataset: {test_X.shape[0]}')
print(test_Y['class_var'].value_counts())

The output:

(514, 8) (254, 8)

Number of rows in Train dataset: 514
0    335
1    179
Name: class_var, dtype: int64

Number of rows in Test dataset: 254
0    165
1     89
Name: class_var, dtype: int64

The first step is to assign X and Y. The columns assigned to the X array are the input features; the column assigned to the Y array encodes the decision they predict. We assign X and Y by extracting columns from the data frame.
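To make the column slicing concrete, here is a toy data frame (with made-up columns) split the same way with iloc:

```python
import pandas as pd

# Toy frame standing in for the article's data: two feature columns plus a label
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "class_var": [0, 1, 0]})

X = df.iloc[:, 0:2]   # all rows, feature columns by position
Y = df.iloc[:, 2:3]   # all rows, the label column (kept as a one-column DataFrame)

print(list(X))        # ['a', 'b']
print(Y.shape)        # (3, 1)
```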

In the next step, the train X/Y and test X/Y sets are constructed with the train_test_split function from scikit-learn. You must import this function in the Python script:

from sklearn.model_selection import train_test_split

One of the train_test_split parameters is test_size. It controls the proportion of the entire dataset allocated to the test set (~33 percent in this example).
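For example, on a toy list of 100 items, test_size=0.33 sends 33 items to the test set and the remaining 67 to the train set:

```python
from sklearn.model_selection import train_test_split

data = list(range(100))
train, test = train_test_split(data, test_size=0.33, random_state=0)
print(len(train), len(test))   # 67 33
```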

The stratify parameter preserves the class distribution of Y across the train and test datasets, so both keep (approximately) the same proportion of each class as the full dataset.
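A quick illustration with an imbalanced toy label (80 zeros, 20 ones): with stratify, a 25-percent test set keeps the same 4:1 class ratio:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 80 + [1] * 20   # imbalanced labels: 80% zeros, 20% ones

_, _, _, test_y = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
print(test_y.count(0), test_y.count(1))   # 20 5 -- the 4:1 ratio is preserved
```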

The random_state parameter makes the data split reproducible: rerunning with the same value yields the same split, and changing the value produces a different split.
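A minimal demonstration: two calls with the same seed produce identical splits:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
a_train, a_test = train_test_split(data, test_size=0.3, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=0)

print(a_test == b_test)   # True: same random_state, same split
# Passing a different random_state value would, in general, give a different split.
```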

The train_test_split function returns four arrays. The train X/Y and test X/Y pairs can be used to train and test the ML model. The shape and structure of the datasets can be printed out for convenience.
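As a sketch of how the four arrays are then used, the snippet below fits a logistic regression model on the train pair and scores it on the test pair; the synthetic 8-feature dataset and the model choice are stand-ins, not part of the original article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 8-feature data standing in for the article's data frame
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

# Fit on the train pair, evaluate on the held-out test pair
model = LogisticRegression(max_iter=1000).fit(train_X, train_Y)
print(f"Held-out accuracy: {model.score(test_X, test_Y):.2f}")
```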

A sample of the Jupyter notebook is available on GitHub. The sample credentials can be found in the JSON file.



Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
