
How to Build a Simple Machine Learning Pipeline

Read this step-by-step tutorial to learn how to build a simple machine learning pipeline with scikit-learn.

By Sibanjan Das and Umit Mert Cakmak · May 17, 2018 · Tutorial


The following blog post, which explains the concepts of building a simple pipeline, is an excerpt from the book Hands-On Automated Machine Learning, written by Sibanjan Das and Umit Mert Cakmak.


There are many moving parts in a Machine Learning (ML) model that have to be tied together for the model to execute and produce results successfully. This process of tying together the different pieces of the ML process is known as a pipeline. A pipeline is a generalized but very important concept for a Data Scientist. In software engineering, people build pipelines to take software from source code to deployment. Similarly, in ML, a pipeline is created to allow data to flow from its raw format to some useful information. It also provides a mechanism to build several ML pipelines in parallel so that the results of different ML methods can be compared.

Each stage of a pipeline is fed data processed from its preceding stage; that is, the output of a processing unit is supplied as the input to the next step. The data flows through the pipeline just as water flows in a pipe. Mastering the pipeline concept is a powerful way to create error-free ML models, and pipelines are a crucial element of an AutoML system.
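
As a quick illustration of that idea of comparing methods (our own sketch, not part of the book excerpt), the snippet below builds two pipelines that share the same scaling step but end in different estimators, then compares them with cross-validation. The dataset and estimator choices here are arbitrary and purely illustrative; the tutorial that follows builds a single pipeline step by step.

# A minimal sketch (our own illustration, not from the book excerpt):
# two pipelines share the same scaling step but end in different estimators,
# so their cross-validated scores can be compared on an equal footing.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

candidates = {
    'logistic_regression': Pipeline([('scale', MinMaxScaler()),
                                     ('model', LogisticRegression())]),
    'decision_tree': Pipeline([('scale', MinMaxScaler()),
                               ('model', DecisionTreeClassifier())]),
}

for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X, y, cv=5)  # 5-fold cross-validation
    print('%s: mean accuracy %.3f' % (name, scores.mean()))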

A Simple Pipeline

We will first import a dataset known as Iris, which is already available in scikit-learn's sample dataset library (http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). The dataset consists of four features and 150 rows. We will develop the following steps in a pipeline to train our model on the Iris dataset. The problem statement is to predict the species of an Iris flower using its four features, as shown in the following flowchart:

[Flowchart: the pipeline steps for predicting the Iris species]

In this pipeline, we will use a MinMaxScaler method to scale the input data and logistic regression to predict the species of the Iris. The model will then be evaluated based on the accuracy measure:

1.  The first step is to import the various modules from scikit-learn that provide the methods needed to accomplish the task. We also import the Pipeline class from sklearn.pipeline, which gives us what we need to create an ML pipeline:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

2.  The next step is to load the Iris data and split it into training and test datasets. In this example, we will use 80% of the dataset to train the model and the remaining 20% to test its accuracy. We can use the shape attribute to view the dimensions of the dataset:

# Load and split the data

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train.shape

3.  The following result shows that the training dataset has 4 columns and 120 rows, which equates to 80% of the Iris dataset and is as expected:

(120, 4)

4.  Next, we print the dataset:

print(X_train)

The above code yields the following output:

[Output: the array of 120 training samples, each with 4 feature values]

5.  The next step is to create a pipeline. The pipeline object takes the form of (key, value) pairs. The key is a string naming a particular step, and the value is the corresponding transformer or estimator object. In the following code snippet, we have named the MinMaxScaler() step minmax and the LogisticRegression() step lr:

pipe_lr = Pipeline([('minmax', MinMaxScaler()),
                    ('lr', LogisticRegression())])

6.  Then, we fit the pipeline object, pipe_lr, to the training dataset:

pipe_lr.fit(X_train, y_train)

7.  On executing the preceding code, you’ll get the following output, which shows the final structure of the fitted model that was built:

[Output: the fitted Pipeline object, listing its minmax and lr steps]

8.  The last step is to score the model on the test dataset using the score method:

score = pipe_lr.score(X_test, y_test)
print('Logistic Regression pipeline test accuracy: %.3f' % score)

As we can see from the following result, the accuracy of the model is 0.900, which is 90%:

Logistic Regression pipeline test accuracy: 0.900

In this example, we created a pipeline consisting of two steps, that is, MinMax scaling and LogisticRegression. When we executed the fit method on pipe_lr, the MinMaxScaler performed fit and transform on the input data, and the transformed data was passed on to the estimator, which is a logistic regression model. These intermediate steps in a pipeline are known as transformers, and the last step is an estimator.
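
To make the transformer/estimator distinction concrete, here is a short sketch (our addition, not from the book excerpt) that continues from the fitted pipe_lr object above. Calling predict on the pipeline applies the fitted MinMaxScaler's transform and then the logistic regression's predict, and the individual fitted steps can be inspected through the pipeline's named_steps attribute:

# A short sketch (our addition) continuing from the fitted pipe_lr above.
# Predicting through the pipeline first transforms the data with the fitted
# MinMaxScaler and then calls predict on the logistic regression step.
y_pred = pipe_lr.predict(X_test)

# The same result, done manually step by step via named_steps:
scaler = pipe_lr.named_steps['minmax']      # the fitted transformer
classifier = pipe_lr.named_steps['lr']      # the fitted estimator
X_test_scaled = scaler.transform(X_test)
y_pred_manual = classifier.predict(X_test_scaled)

print((y_pred == y_pred_manual).all())      # True: both routes agree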
