DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Python Packages for Validating Database Migration Projects
  • dovpanda: Unlock Pandas Efficiency With Automated Insights
  • A Guide to Regression Analysis Forecasting in Python
  • Data Analytics Using Python

Trending

  • Scaling in Practice: Caching and Rate-Limiting With Redis and Next.js
  • How to Format Articles for DZone
  • Scalable System Design: Core Concepts for Building Reliable Software
  • Stateless vs Stateful Stream Processing With Kafka Streams and Apache Flink
  1. DZone
  2. Coding
  3. Languages
  4. Python: Making scikit-learn and Pandas Play Nice

Python: Making scikit-learn and Pandas Play Nice

By 
Mark Needham user avatar
Mark Needham
·
Nov. 14, 13 · Interview
Likes (1)
Comment
Save
Tweet
Share
30.4K Views

Join the DZone community and get the full member experience.

Join For Free

In the last post I wrote about Nathan and my attempts at the Kaggle Titanic Problem, I mentioned that our next step was to try out scikit-learn, so I thought I should summarize where we’ve got up to.

We needed to write a classification algorithm to work out whether a person onboard the Titanic survived, and luckily, scikit-learn has extensive documentation on each of the algorithms.

Unfortunately almost all those examples use numpy data structures and we’d loaded the data using pandas and didn’t particularly want to switch back!

Luckily it was really easy to get the data into numpy format by calling ‘values’ on the pandas data structure, something we learnt from a reply on Stack Overflow.

For example if we were to wire up an ExtraTreesClassifier which worked out survival rate based on the ‘Fare’ and ‘Pclass’ attributes we could write the following code:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score
 
train_df = pd.read_csv("train.csv")
 
et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0)
 
columns = ["Fare", "Pclass"]
 
labels = train_df["Survived"].values
features = train_df[list(columns)].values
 
et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
 
print("{0} -> ET: {1})".format(columns, et_score))

To start with with read in the CSV file which looks like this:

$ head -n5 train.csv 
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S

Next we create our classifier which “fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.“ i.e. a better version of a random forest.

On the next line we describe the features we want the classifier to use, then we convert the labels and features into numpy format so we can pass the to the classifier.

Finally we call the cross_val_score function which splits our training data set into training and test components and trains the classifier against the former and checks its accuracy using the latter.

If we run this code we’ll get roughly the following output:

$ python et.py
 
['Fare', 'Pclass'] -> ET: 0.687991021324)

This is actually a worse accuracy than we’d get by saying that females survived and males didn’t.

We can introduce ‘Sex’ into the classifier by adding it to the list of columns:

columns = ["Fare", "Pclass", "Sex"]

If we re-run the code we’ll get the following error:

$ python et.py
 
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (514, 0))
...
Traceback (most recent call last):
  File "et.py", line 14, in <module>
    et_score = cross_val_score(et, features, labels, n_jobs=-1).mean()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
    self.retrieve()
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 450, in retrieve
    raise exception_type(report)
sklearn.externals.joblib.my_exceptions.JoblibValueError/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/my_exceptions.py:26: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  self.message,
: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...
ValueError: could not convert string to float: male
___________________________________________________________________________

This is a slightly verbose way of telling us that we can’t pass non numeric features to the classifier – in this case ‘Sex’ has the values ‘female’ and ‘male’. We’ll need to write a function to replace those values with numeric equivalents.

train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)

Now if we re-run the classifier we’ll get a slightly more accurate prediction:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)

The next step is to use the classifier against the test data set, so let’s load the data and run the prediction:

test_df = pd.read_csv("test.csv")
 
et.fit(features, labels)
et.predict(test_df[columns].values)

Now if we run that:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
Traceback (most recent call last):
  File "et.py", line 22, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 91, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: male

which is the same problem we had earlier! We need to replace the ‘male’ and ‘female’ values in the test set too so we’ll pull out a function to do that now.

def replace_non_numeric(df):
	df["Sex"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1)
	return df

Now we’ll call that function with our training and test data frames:

train_df = replace_non_numeric(pd.read_csv("train.csv"))
...
test_df = replace_non_numeric(pd.read_csv("test.csv"))

If we run the program again:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
Traceback (most recent call last):
  File "et.py", line 26, in <module>
    et.predict(test_df[columns].values)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict
    proba = self.predict_proba(X)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba
    X = array2d(X, dtype=DTYPE)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 93, in array2d
    _assert_all_finite(X_2d)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

There are missing values in the test set so we’ll replace those with average values from our training set using an Imputer:

from sklearn.preprocessing import Imputer
 
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(features)
 
test_df = replace_non_numeric(pd.read_csv("test.csv"))
 
et.fit(features, labels)
print et.predict(imp.transform(test_df[columns].values))

If we run that it completes successfully:

$ python et.py 
['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359)
[0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1
 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
 0 1 1 1 1 0 0 1 0 0 0]

The final step is to add these values to our test data frame and then write that to a file so we can submit it to Kaggle.

The type of those values is ‘numpy.ndarray’ which we can convert to a pandas Series quite easily:

predictions = et.predict(imp.transform(test_df[columns].values))
test_df["Survived"] = pd.Series(predictions)

We can then write the ‘PassengerId’ and ‘Survived’ columns to a file:

test_df.to_csv("foo.csv", cols=['PassengerId', 'Survived'], index=False)

Then output file looks like this:

$ head -n5 foo.csv 
PassengerId,Survived
892,0
893,1
894,0

The code we’ve written is on github in case it’s useful to anyone.

Scikit-learn Pandas Test data file IO Python (language)

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Python Packages for Validating Database Migration Projects
  • dovpanda: Unlock Pandas Efficiency With Automated Insights
  • A Guide to Regression Analysis Forecasting in Python
  • Data Analytics Using Python

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!