Google Colab: Create Predictive Models in No Time

In this article, we will see how to use this cloud-based platform to build a Random Forest model that predicts customer churn in fewer than 200 lines of code.

By Sunil Kappal · Apr. 12, 19 · Tutorial

To democratize data analytics and take care of the data-munging heavy lifting, let's explore Google's Colaboratory, a Jupyter notebook environment that requires no setup and runs entirely in the cloud. Google Colaboratory is a perfect fit for today's data analysts and engineers. In this article, we will see how to use this cloud-based platform to build a Random Forest model that predicts customer churn in fewer than 200 lines of code.

Before we start, I would like to point out some great capabilities that the Google Colab environment has in store for its users.

  • No more dependency on the dependencies: Whatever programming language you work with, Google Colab already has the required packages and dependencies installed. This saves a lot of time and effort, given that there are thousands of such dependencies that make data analytics a breeze.
  • CPU to GPU to TPU: This is one of the most exciting features of Google Colab. As stated above, Google Colab does all the data processing and crunching-related heavy lifting on its own, without the user having to worry about the capacity of their physical machine. Since Google Colab is a cloud-based platform that lets its users experience the true power of a cloud application, it is worth stating the key differences between CPU, GPU, and TPU (a quick way to check which accelerator your runtime exposes is sketched after this list).
  • CPU: Going by its textbook definition, a Central Processing Unit is the electronic circuitry considered the brain of the computer; it performs the basic arithmetic, logical, control, and input/output operations specified by the instructions of a computer application.
  • GPU: A Graphics Processing Unit is high-end circuitry designed to render 2D and 3D graphics together with the CPU. Nowadays, however, GPUs are also used for heavy data crunching to accelerate computational workloads when developing models.
  • TPU: A Tensor Processing Unit is custom-made circuitry developed specifically to accelerate machine learning workloads built with TensorFlow, Google's open-source machine learning framework.
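
As a quick aside (not part of the original walkthrough), here is a minimal sketch for checking which accelerator the current Colab runtime exposes after switching it under Runtime > Change runtime type. It assumes TensorFlow 2.x is preinstalled, which it normally is in Colab.

import tensorflow as tf

# Returns something like '/device:GPU:0' when a GPU runtime is active, or an empty string on CPU-only runtimes
print("GPU device:", tf.test.gpu_device_name() or "none (CPU runtime)")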

Now that we have a basic understanding of the Google Colaboratory environment and its key features, we can get started with a very simple routine that we will run in it. To demonstrate the ease of use, we will be using our all-time favorite data science language, Python.

Before we get started, I just want to call out that this is not a tutorial article on the Python language. This article is just to demonstrate the ease of use and simplicity of the Google Colab environment.

Let's get started!

Once you have fired up your Google Colab environment, it's time to import all the required libraries for this modeling routine.

import pandas as pd
import numpy as np  # needed later for np.arange in the feature-importance plot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
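
One practical note the original snippet glosses over: the CSV file has to be available to the Colab runtime before pandas can read it. A minimal sketch, assuming you upload it from your local machine with Colab's built-in files helper (the file name is simply the one used in the read_csv call below):

from google.colab import files

# Opens a file picker in the notebook; the chosen file is saved to the runtime's working directory
uploaded = files.upload()  # e.g. select "Customer Churn Data.csv"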

Once all the dependencies have been imported, it is time to use them. Using the pandas read_csv function, we will read the customer data file. With the code below, we should be able to see the first 5 rows of our customer data set.

df = pd.read_csv("Customer Churn Data.csv")
display(df.head(5))

[Image: preview of the customer churn data set in Google Colab]

Though the data I am using is made-up (dummy) data and all the values are binary, I will still go ahead and use the pandas DataFrame's shape attribute and describe function to look at the row and column counts.

print("Number of rows: ", df.shape[0])
counts = df.describe().iloc[0]
display(
pd.DataFrame(
counts.tolist(), 
columns=["Count of values"], 
index=counts.index.values
).transpose()
)

Now let's split the data into training and test sets:

df_train, df_test = train_test_split(df, test_size=0.25)
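
Churn data sets are frequently imbalanced, so a small refinement of the line above (not in the original article) is to stratify the split on the target and fix the random seed for reproducibility:

# Same 75/25 split, but stratified on the target and seeded (assumed refinement)
df_train, df_test = train_test_split(df, test_size=0.25, stratify=df["Churn"], random_state=42)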

Fire up the Random Forest model by calling it! (The feature list is defined first; here it is assumed to be every column except the target, Churn.)

features = [col for col in df.columns if col != "Churn"]  # assumption: all columns except the target
clf = RandomForestClassifier(n_estimators=30)
clf.fit(df_train[features], df_train["Churn"])

Make some predictions:

predictions = clf.predict(df_test[features])
probs = clf.predict_proba(df_test[features])
display(predictions)

Evaluate the model:

score = clf.score(df_test[features], df_test["Churn"])
print("Accuracy: ", score)

Create a confusion matrix and ROC:

get_ipython().magic('matplotlib inline')

# Store the matrix under a different name so it does not shadow the imported confusion_matrix function
cm = pd.DataFrame(
    confusion_matrix(df_test["Churn"], predictions),
    columns=["Predicted False", "Predicted True"],
    index=["Actual False", "Actual True"]
)
display(cm)

# Calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = roc_curve(df_test["Churn"], probs[:,1])
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
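
A single summary number for the curve above can help when comparing models; a small addition (assuming scikit-learn's roc_auc_score) computes the area under the ROC curve:

from sklearn.metrics import roc_auc_score

# 0.5 corresponds to a coin flip, 1.0 to perfect separation of churners and non-churners
print("ROC AUC: ", roc_auc_score(df_test["Churn"], probs[:, 1]))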

Now you can inspect the results by applying the model you have created to the test data and flagging customers with a predicted churn probability above 0.9:

df_test["prob_true"] = probs[:, 1]
df_risky = df_test[df_test["prob_true"] > 0.9]
display(df_risky.head(5)[["prob_true"]])

When I create the very first version of any model, I go full throttle, meaning I use all the variables. If I see that the model is too accurate (often a warning sign) or totally useless from the perspective of its accuracy score, then I apply a feature selection method. Some may agree with this approach and some may not, but it is mostly a personal preference.

fig = plt.figure(figsize=(20, 18))
ax = fig.add_subplot(111)

# Pair each feature's importance score with its column name and sort descending
df_f = pd.DataFrame(clf.feature_importances_, columns=["importance"])
df_f["labels"] = features
df_f.sort_values("importance", inplace=True, ascending=False)
display(df_f.head(5))

# Horizontal bar chart of the importance scores
index = np.arange(len(clf.feature_importances_))
bar_width = 0.5
rects = plt.barh(index, df_f["importance"], bar_width, alpha=0.4, color='b', label='Main')
plt.yticks(index, df_f["labels"])
plt.show()
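
If the importance plot shows that a handful of variables carry most of the signal, one way to act on it (an assumed approach, not necessarily the author's) is to retrain on the top features only and compare the score:

# Keep only the top 5 features by importance (hypothetical cutoff) and retrain
top_features = df_f["labels"].head(5).tolist()
clf_small = RandomForestClassifier(n_estimators=30)
clf_small.fit(df_train[top_features], df_train["Churn"])
print("Accuracy with top features: ", clf_small.score(df_test[top_features], df_test["Churn"]))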

So, in fewer than 200 lines of code, we have been able to create a pretty decent Random Forest model on dummy data, with an accuracy better than a flip of a coin.

The original model can be accessed on my GitHub site - Access it here!


Published at DZone with permission of Sunil Kappal, DZone MVB. See the original article here.

