
A Beginner's Guide to Automated Machine Learning: 4 Maturity Models to Understand


In this article, see a beginner's guide to automated machine learning and explore four maturity models.

· AI Zone ·


The concepts of artificial intelligence and machine learning are becoming popular among data scientists. Using these concepts, many human tasks can now be automated, and both the efficiency and accuracy of those tasks have increased. As technology trends change, more business requirements are being met, and the need for solutions that can cater to industry demand has grown. Automated machine learning eases many of these tasks, saving time while delivering efficient results.

Automated Machine Learning: Automating the Training Process

Machine learning aims to train a machine to process real-world data and produce outputs accordingly. It enables a machine to learn from experience and deliver increasingly accurate outputs. Automated machine learning (autoML) goes further and aims to automate this entire process from beginning to end.

Automated machine learning uses a machine learning model to train the machine from existing data (or experiences) and generate useful outputs. By reducing the human support required, automated machine learning saves time while providing accurate outputs through complete data processing.

You may also want to read:  Automated Machine Learning: Is It the Holy Grail?

Why Is It a Challenge for Data Scientists?

To automate tasks through machines, the concepts of artificial intelligence and machine learning are put into effect. The difficulty lies in letting a machine start learning by itself without any external input or commands from humans. It becomes a challenge because only part of the machine learning process is automated, which requires data scientists to choose the best approach for completing the task.

Automated Machine Learning Maturity Models

Approaches to automated ML can be classified by maturity. A higher-maturity model offers better support for automated tasks and covers more of the functions involved in training a model from a dataset.

Source: ZELROS


1. Hyperparameter Optimization

Once a dataset is submitted, an autoML system at this maturity level tries to fit various selected models to the (structured) data, e.g., random forest, linear regression, and more. It then optimizes the hyperparameters of each model according to the provided needs. Optimization techniques include manual search, random search, grid search, and many more.

For example, Auto-sklearn uses Bayesian optimization for hyperparameter tuning. At this maturity level, the autoML system performs a limited set of tasks, e.g., cross-validation, machine learning algorithm selection, and hyperparameter optimization. As the maturity level increases, more functions are handled automatically, and better results from autoML are observed.
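To make the idea concrete, here is a minimal sketch of hyperparameter optimization via grid search, using scikit-learn's GridSearchCV on a random forest. The dataset, parameter grid, and values are illustrative assumptions, not part of any particular autoML tool:

```python
# A minimal sketch of hyperparameter optimization via grid search.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# The search space: every combination of these values is cross-validated.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

# The combination with the best cross-validated score wins.
print(search.best_params_, search.best_score_)
```

Random search and Bayesian optimization explore the same kind of search space differently: instead of exhaustively trying every combination, they sample it, which scales better as the grid grows.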

2. Level 1 + Preprocessing of Data

At level 1, the autoML process excludes data preprocessing, which users must implement on their own. At level 2, a more mature model handles data preprocessing itself before completing the remaining steps.

Detecting column types, transforming all data to numeric form, and replacing missing values are performed by the system. Data completion and other measures are also applied to the data before processing it. However, advanced data preprocessing is still absent here: data scientists must perform it themselves and then send the data on for further operations.

The task of searching for and selecting an appropriate machine learning algorithm is handled by the system. For example, consider a dataset built to estimate the budget and time required for a mobile app development project. The autoML model completes the preprocessing, and the data is then processed to provide accurate results.

An autoML system that also implements advanced data preprocessing goes beyond this maturity level. Systems that can perform feature selection, dimensionality reduction, data compression, and more can eliminate the need for manual preprocessing and carry out training tasks seamlessly.
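As a rough sketch of the kind of preprocessing a level 2 system automates, here is a hand-built scikit-learn pipeline performing two of the steps named above: missing-value replacement and conversion of categorical data to numeric form. The toy columns and values are invented for illustration:

```python
# A hand-built sketch of the preprocessing a level 2 autoML system automates.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age": [25, np.nan, 40],          # numeric column with a missing value
    "city": ["NY", "SF", np.nan],     # categorical column with a missing value
})

preprocess = ColumnTransformer([
    # numeric: fill missing values with the column mean
    ("num", SimpleImputer(strategy="mean"), ["age"]),
    # categorical: fill missing values, then one-hot encode to numeric
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (3, 3): imputed age plus one column per city category
```

A level 2 system would additionally infer the column types on its own; here they are declared by hand.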

3. Find Suitable ML Architecture

The autoML systems implementing level 1 and level 2 have their machine learning architecture fixed already. However, the systems falling under this level discover and find the machine learning architecture according to the nature of the data and apply it to ensure excellent outputs. Open-source autoML library AutoKeras implements a neural architecture search (NAS) that is popular for implementing machine learning algorithms efficiently on the image, voice, or text and is one of the examples of ML architecture.

There are different neural architecture search algorithms available for data scientists to use, and autoML implementing them can provide enhanced support and experience when implementing machine learning concepts. The level 3 autoML systems can be listed as a self-driven car, automated consumer services, and more.
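Real NAS algorithms are far more sophisticated, but the core loop — propose an architecture, evaluate it, keep the best — can be sketched with plain scikit-learn. The candidate hidden-layer shapes below are arbitrary illustrative choices, not what any NAS library actually searches:

```python
# A toy illustration of the architecture-search loop (not real NAS):
# cross-validate each candidate network shape and keep the best.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Candidate architectures: here the "search space" is just hidden-layer shapes.
candidates = [(16,), (32,)]

best_arch, best_score = None, -1.0
for arch in candidates:
    clf = MLPClassifier(hidden_layer_sizes=arch, max_iter=100, random_state=0)
    score = cross_val_score(clf, X, y, cv=3).mean()
    if score > best_score:
        best_arch, best_score = arch, score

print(best_arch, round(best_score, 3))
```

NAS replaces this exhaustive loop with a guided search (reinforcement learning, evolutionary methods, or gradient-based relaxations) over a far larger space of layer types and connections.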

4. Use of Domain Knowledge

What is required to construct a machine learning system that provides accurate outputs? Knowing the data very well. It is important to understand the domain of the data and the requirements of the system. The most sophisticated AI implementations use domain knowledge and keep all required criteria in mind.

The accuracy of the final results increases when the data in use is backed by existing knowledge of the domain. This increase in accuracy drives excellent predictive ability and provides thorough support for automating machine learning tasks. It is therefore important to incorporate background domain knowledge; autoML systems implementing this maturity level are highly result-oriented and record a significant accuracy gain over other systems.
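As a hypothetical illustration of encoding domain knowledge, suppose experts in a lending domain know that the ratio of debt to income is more predictive than either value alone; adding it as an engineered feature hands that knowledge to the model. The column names and numbers below are invented:

```python
# Hypothetical example: encoding domain knowledge as an engineered feature.
import pandas as pd

df = pd.DataFrame({
    "debt":   [5000, 20000, 1000],
    "income": [50000, 40000, 60000],
})

# Domain-derived feature: experts say this ratio carries the real signal.
df["debt_to_income"] = df["debt"] / df["income"]
print(df["debt_to_income"].tolist())  # [0.1, 0.5, 0.016666...]
```

A model trained on the raw columns would have to rediscover this relationship from data; the engineered feature supplies it directly.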

Practical Examples of Automated ML (AutoML)

There are tools and software libraries available for researchers to put automated machine learning into effect. These tools are developed with machine requirements in mind, and they help generate the best outputs when used to automate processes.

Open-source Libraries for Automated ML

There are plenty of open-source libraries supporting developers who want to implement autoML in their systems.

1. AutoKeras

This library is available on GitHub for developers to use. Developed by Data Lab, it aims to provide access to deep learning tools and make the process of building deep learning models easier. Here is a small example of AutoKeras in action:

Python

import autokeras as ak

clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)

(source)

2. MLBox

MLBox is another open-source library, written in Python, for faster and easier development of autoML functions. It includes functions for data preprocessing, cleaning, formatting, and more. Here is an illustration of how it starts data preprocessing once imported and used:

Python

from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

paths = ["../input/train.csv", "../input/test.csv"]
target_name = "Survived"
rd = Reader(sep=",")
df = rd.train_test_split(paths, target_name)  # reading and preprocessing (dates, ...)

(source)

3. Auto-sklearn

Auto-sklearn is another open-source autoML library that works by choosing an appropriate machine learning algorithm for the data's patterns and requirements. It eliminates the need for hyperparameter tuning on the user's end and handles that work on its own. Here is a small example of Auto-sklearn applied to a dataset:

Python

import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

X, y = sklearn.datasets.load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

(source)

Automated Machine Learning Tools

These tools have been released for commercial use, and their increasing popularity signals success in the field of automated machine learning.

DataRobot

The first automated machine learning tool, DataRobot offers an advanced platform for implementing AI concepts without having to worry about execution: it handles the process end to end and provides the required results. The DataRobot API supports prediction and enables the machine to automatically process data and provide outputs by selecting an appropriate approach.

Here is a small example of how the DataRobot API can be used from Python. The dataset used here predicts the possible readmission of patients to their respective hospitals within 30 days.

Python

import datarobot as dr
import pandas as pd
pd.options.display.max_columns = 1000
import numpy as np
import time
import matplotlib.pyplot as plt

from jupyterthemes import jtplot

# currently installed theme will be used to set plot style if no arguments provided
jtplot.style()
get_ipython().magic('matplotlib inline')

# load input data
df = pd.read_csv('../demo_data/10kDiabetes.csv')

# initialize datarobot client instance
dr.Client(config_path='/Users/benjamin.miller/.config/datarobot/my_drconfig.yaml')

# create 100 samples with replacement from the original 10K diabetes dataset
samples = []
for i in range(100):
    samples.append(df.sample(10000, replace=True))

# loop through each sample dataframe and initialize a project
for i, s in enumerate(samples):
    project = dr.Project.start(
        project_name='API_Test_{}'.format(i+20),
        sourcedata=s,
        target='readmitted',
        worker_count=2
    )

# get all projects
projects = []
for project in dr.Project.list():
    if "API_Test" in project.project_name:
        projects.append(project)

# for each project, make predictions on the original dataset
# using the most accurate model

# initialize list of all predictions for consolidating results
bootstrap_predictions = []

# loop through each relevant project to get predictions on original input dataset
for project in projects:
    # get best performing model
    model = dr.Model.get(project=project.id, model_id=project.get_models()[0].id)

    # upload dataset
    new_data = project.upload_dataset(df)

    # start a predict job
    predict_job = model.request_predictions(new_data.id)

    # get job status every 5 seconds and move on once the job has finished
    for i in range(100):
        time.sleep(5)
        try:
            job_status = dr.PredictJob.get(
                project_id=project.id,
                predict_job_id=predict_job.id
            ).status
        except:  # normally job_status produces an error once the job is completed
            break

    # now the predictions are finished
    predictions = dr.PredictJob.get_predictions(
        project_id=project.id,
        predict_job_id=predict_job.id
    )

    # extract row ids and positive probabilities for all records into a dictionary
    pred_dict = {k: v for k, v in zip(predictions.row_id, predictions.positive_probability)}

    # append prediction dictionary to bootstrap predictions
    bootstrap_predictions.append(pred_dict)

# combine all predictions into a single dataframe with keys as ids:
# each record is a row, each column is a set of predictions pertaining to
# a model created from a bootstrapped dataset
df_predictions = pd.DataFrame(bootstrap_predictions).T

# add mean predictions for each observation in df_predictions
df_predictions['mean'] = df_predictions.mean(axis=1)

# place each record into equal sized probability groups using the mean
df_predictions['probability_group'] = pd.qcut(df_predictions['mean'], 10)

# aggregate all predictions for each probability group
d = {}  # dictionary to contain {Interval(probability_group): array([predictions])}
for pg in set(df_predictions.probability_group):
    # combine all predictions for a given group
    frame = df_predictions[df_predictions.probability_group == pg].iloc[:, 0:100]
    d[str(pg)] = frame.as_matrix().flatten()

# create dataframe from all probability group predictions
df_pg = pd.DataFrame(d)

# create boxplots in order of increasing probability ranges
props = dict(boxes='slategray', medians='black', whiskers='slategray')
viz = df_pg.plot.box(color=props, figsize=(15, 7), patch_artist=True, rot=45)
grid = viz.grid(False, axis='x')
ylab = viz.set_ylabel('Readmission Probability')
xlab = viz.set_xlabel('Mean Prediction Probability Ranges')
title = viz.set_title(
    label='Expected Prediction Distributions by Readmission Prediction Range',
    fontsize=18
)


(source)

H2O.ai

H2O, another AI-enabling service platform, has introduced remarkable tools dedicated to completing many machine learning tasks. For example, its Driverless AI product provides excellent results.

Conclusion

Implementing machine learning concepts to drive automated training is made possible by these tools and libraries. Other commercial solutions, such as Google AutoML, are also available on the market; a firm can choose the one that suits its requirements and delivers excellent results.

Automated machine learning will become increasingly common, and the results it delivers can bring many benefits to a business. Ultimately, it will help automate the entire technology stream and advance the use of artificial intelligence.

Further Reading

Intelligently Automate Machine Learning, Artificial Intelligence, and Data Science

18 Machine Learning Platforms for Developers

Topics:
automated machine learning, ML model, ML, artificial intelligence

Opinions expressed by DZone contributors are their own.
