Where to Start With a New Data Problem
Procrastination is a problem for everyone, even data scientists. So, one of our own gives some tips to quickly dive into a Python-based project.
So I get a data file (CSV, text, etc.), and my usual first step is to stare at the file in my downloads folder for a few minutes. Maybe then change the name. Then go make some coffee. Then come back and read the name of the file again. Maybe change it back.
I’ll open some IDE and make a new Python file. Save it. Stare at that. Import some libraries… that name sucks, I should change it.
The news is on, I should probably watch.
My point is that it’s hard to start. And the best way to start is just to start. Here’s a good list of things to put in your .py file to at least start getting a handle on what you’re dealing with.
You might not need them all, but you can always remove them later:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# load_boston was removed in scikit-learn 1.2, so it is left out here
Import Your Data
customers = pd.read_csv("Ecommerce Customers.csv")
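If you don't have the file handy yet, here is a minimal sketch for trying out the loading step anyway. pd.read_csv accepts any file-like object, so an in-memory string can stand in for the download. The column names and values below are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real downloaded file.
raw = io.StringIO(
    "Email,Avg Session Length,Yearly Amount Spent\n"
    "a@example.com,34.5,587.95\n"
    "b@example.com,31.9,392.20\n"
)

# Same call as with a path on disk, just pointed at the in-memory buffer.
demo = pd.read_csv(raw)
print(demo.shape)
```

Once the real file is in place, swap the buffer for the path and the rest of the script stays the same.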
Get a First Look at the Data
print(customers.head())
customers.info()  # info() prints its own report and returns None
print(customers.describe())
print(customers.columns)
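If you want those first-look numbers in one table instead of four separate printouts, something like the sketch below might help. The helper name first_look and the sample frame are my own inventions for illustration. The pandas calls inside it (dtypes, isna, nunique) are standard.

```python
import pandas as pd


def first_look(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # nunique() ignores missing values by default
    })


# Hypothetical stand-in for whatever read_csv returned.
sample = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", None],
    "session_minutes": [12.5, 11.0, 12.5],
})

summary = first_look(sample)
print(summary)
```

Columns with lots of missing values or only one distinct value tend to be the first things worth a closer look.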
Grab Some Coefficients
See if anything stands out:
# Assumes lm is an already-fitted LinearRegression and X is the feature DataFrame it was fit on
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)
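The coefficient table needs a fitted model and a feature DataFrame to exist first. Here is a rough sketch of how lm and X could come about, using synthetic data where the true coefficients are known (3 and -2), so you can see whether the table looks sane. The column names a and b, the noise level, and the split settings are all arbitrary choices of mine, not anything from a real dataset.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative: y = 3*a - 2*b + a little noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y = 3 * X["a"] - 2 * X["b"] + rng.normal(scale=0.1, size=200)

# Hold out 30% of the rows so the error is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

lm = LinearRegression()
lm.fit(X_train, y_train)

# One row per feature, labelled with the feature's column name.
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=["Coefficient"])
print(coeff_df)

# Mean absolute error on the held-out rows as a quick sanity check.
mae = metrics.mean_absolute_error(y_test, lm.predict(X_test))
print("MAE:", mae)
```

With real data you won't know the true coefficients, but the same pattern applies: split, fit, then read the table with the held-out error next to it.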
Print Some Nice Plots
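snsData = sns.load_dataset('tips')
print(snsData.head())
sns.pairplot(snsData)
plt.figure()  # start a new figure so the histogram doesn't land on the pair plot
sns.histplot(snsData['total_bill'])  # distplot is deprecated; swap in whichever column you care about
plt.figure()
sns.heatmap(snsData.corr(numeric_only=True), annot=True)  # tips has text columns, so correlate the numeric ones only
plt.show()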
Although these won't answer any of your main problems, they will at least get the process going and the juices flowing.
Published at DZone with permission of Matt Hughes. See the original article here.