Getting Insights From Survey Results Using Data Science in Python

To ask specific and deeper questions from your data, custom coding is usually your best bet.

By Tim Ojo · Aug. 22, 2016 · Opinion

I'm neither a statistician nor a professional data scientist. But when I started talking with Jared Menard, a friend and fellow Chattanooga-based developer, about the salary survey he put together, I was intrigued by the insights that could be gleaned from the data. The survey asked questions like:

  • How much money do you make a year?
  • How many years of experience do you have?
  • How many books have you read in the last year?
  • Do you have a degree? Is it related to your career?
  • Name some things that you use at work.

Of the roughly 500 developers targeted, 67 responded to the survey, putting us short of the 80 respondents needed for a 10% margin of error, per SurveyMonkey. Also, the survey was not designed by a professional researcher or survey designer. So while I readily acknowledge that there may be some potential flaws in the findings, I was more interested in the process of using data science and machine learning techniques to get deeper insights from the data than could be provided by the basic statistical tools of mean, mode, standard deviation, scatter plots, and column charts.

Two of the questions I was hoping to answer were:

  1. What kinds/groups of developers answered the survey, and what are the general characteristics of those groupings?
  2. Of all the features we have, which are the most important when it comes to determining salary, and given those features, can we predict an expected salary?

Data Preparation

To get started, I had to do what every data scientist is forced to spend 80% of their time doing: data cleaning and preparation. Due to the small size of the dataset, it was easy enough to pull up the CSV file in a text editor or spreadsheet program and visually examine the data to see what needed fixing. For larger datasets, you would have to rely on tools to find missing values and outliers, generate histograms, etc. For example, the DataFrame class from the pandas library provides methods like isnull() and fillna() that can be used to manage missing values.

When it comes to missing values, there are a number of strategies for handling them. You could drop the record with the missing value entirely, you could set the missing value to 0 / 'NA' / -1, or you could impute the missing value. Some imputation methods are calculating the mean value, carrying forward the last observation, and using a predicted value from a regression analysis. For this survey, one record was dropped because the respondent failed to complete the survey, a couple of missing values in other records were set to 0 or NA, and a couple of other missing values were imputed based on the previous answers given by the respondent.
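All three strategies map onto one-liners in pandas; here is a minimal sketch (the CSV file name is hypothetical, but the column names match the survey data):

import pandas as pd

survey_df = pd.read_csv('survey_results.csv')  # hypothetical file name

# Strategy 1: drop records with missing values entirely
survey_df = survey_df.dropna(subset=['SALARY'])

# Strategy 2: set missing values to a sentinel such as 0 or 'NA'
survey_df['CONF_BUDGET'] = survey_df['CONF_BUDGET'].fillna(0)

# Strategy 3: impute missing values, e.g. with the column mean
survey_df['DAYS_OFF'] = survey_df['DAYS_OFF'].fillna(survey_df['DAYS_OFF'].mean())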

Another data preparation task was converting a continuous variable (number of employees at your company) into a categorical one (micro, small, medium, large, enterprise). This is easily accomplished with a few lines of Python code.

def companySizeCat(numEmployees):
    if numEmployees < 7: # micro
        return 1
    elif numEmployees < 250: # small
        return 2
    elif numEmployees < 500: # medium
        return 3
    elif numEmployees < 1000: # large
        return 4
    else: # enterprise
        return 5

survey_df['COMP_SIZE'] = survey_df['COMP_SIZE'].apply(companySizeCat)

Finally, the survey had a free-text field with the question "Name some things that you use at work." In order to accurately do some form of text mining on that field, I opted to perform a number of transformations: I lowercased all the words in the field, standardized/stemmed certain terms (e.g., 'osx' and 'os x' both get standardized to 'osx'), and deduced broader tools from specific ones.
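As a rough sketch of what those transformations looked like (the synonym map here is an illustrative subset, not the full list I used):

# Map variant spellings to a canonical term (illustrative subset)
synonyms = {'os x': 'osx', 'postgresql': 'postgres', 'node.js': 'node'}

def clean_stack(text):
    text = text.lower()
    for variant, canonical in synonyms.items():
        text = text.replace(variant, canonical)
    return text

survey_df['STACK'] = survey_df['STACK'].apply(clean_stack)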

Examining the Data

Once you get your data into a pandas DataFrame, examining it is quite trivial, especially if you are working in a Jupyter notebook environment. Pandas gives you lots of methods, such as describe, head, tail, hist, info, median, mean, mode, std, nlargest, and nsmallest, that you can use to gain an understanding of what your data looks like.

For example, calling the describe() method will show you various summary statistics for every column in your dataset.

survey_df.describe()

|       | YEARS_EXP | NUM_HOURS_WORK | DAYS_OFF | SALARY    | COMP_SIZE | HEALTH_INS | COMP_MEALS | CONF_BUDGET | 401K  | TRAINING_BUDGET | BOOKS_READ | DEGREE | CAREER_DEGREE | WORK_HIST_COUNT |
|-------|-----------|----------------|----------|-----------|-----------|------------|------------|-------------|-------|-----------------|------------|--------|---------------|-----------------|
| count | 67        | 67             | 67       | 67        | 67        | 67         | 67         | 67          | 67    | 67              | 67         | 67     | 67            | 67              |
| mean  | 9.31      | 43.04          | 21.06    | 82020.15  | 198.93    | 0.88       | 1.34       | 5302.24     | 0.63  | 0.37            | 8.83       | 0.78   | 0.61          | 5.33            |
| std   | 7.34      | 6.88           | 16.74    | 28621.55  | 329.75    | 0.33       | 2.42       | 27144.66    | 0.49  | 0.49            | 10.46      | 0.42   | 0.49          | 4.25            |
| min   | 0.5       | 20             | 0        | 20000     | 1         | False      | 0          | 0           | False | False           | 0          | False  | False         | 1               |
| 25%   | 4         | 40             | 14       | 65000     | 10        | 1          | 0          | 0           | 0     | 0               | 2.5        | 1      | 0             | 3               |
| 50%   | 7         | 40             | 20       | 77000     | 30        | 1          | 0          | 0           | 1     | 0               | 5          | 1      | 1             | 4               |
| 75%   | 12        | 47             | 23.5     | 95000     | 250       | 1          | 2          | 475         | 1     | 1               | 10         | 1      | 1             | 6               |
| max   | 30        | 65             | 100      | 185000    | 1001      | True       | 10         | 200000      | True  | True            | 50         | True   | True          | 27              |

Answering the question "Amongst the survey respondents, what is the most common company size, years of experience, number of jobs held, number of books read, and salary?" becomes as easy as:

sel_columns_df = survey_df[['COMP_SIZE','YEARS_EXP', 'WORK_HIST_COUNT', 'BOOKS_READ', 'SALARY']]
sel_columns_df.mode()

|   | COMP_SIZE | YEARS_EXP | WORK_HIST_COUNT | BOOKS_READ | SALARY |
|---|-----------|-----------|-----------------|------------|--------|
| 0 | 2         | 10        | 3               | 3          | 75000  |
| 1 | NaN       | NaN       | 4               | NaN        | NaN    |

Note: For the work history count there is a tie: 3 and 4 are both the most frequently reported number of jobs held.

Another question that Python's data science tools (scikit-learn and pandas) enable me to answer is: what are the most common tools mentioned by the respondents?

To answer this, first I need to split up the raw entered text into units representing each item. This is called tokenization. Then I need to create a matrix of the count of the occurrences of each token for every entry. This matrix could feed other algorithms, but here I'm simply summing the counts and displaying the top ones.

# Define a function to tokenize the raw text entered by the users
def tokenizr(text):
    tokens = text.split(";")
    # strip whitespace and drop empty/whitespace-only tokens
    return [token.strip() for token in tokens if token.strip()]
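The vectorization step was elided from the listing above; here is a minimal sketch that would produce the two outputs below, assuming scikit-learn's CountVectorizer (on current scikit-learn releases, get_feature_names is get_feature_names_out):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Build the token count matrix from the free-text STACK field
vectorizer = CountVectorizer(tokenizer=tokenizr)
count_matrix = vectorizer.fit_transform(survey_df['STACK'])

# The learned vocabulary (first entries shown below)
vectorizer.get_feature_names()

# Sum the occurrences of each token and display the top 10
token_counts = pd.Series(count_matrix.sum(axis=0).A1,
                         index=vectorizer.get_feature_names())
token_counts.nlargest(10)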
>>>[u'.net',
 u'airflow',
 u'android',
 u'anger',
 u'angular',
 u'ansible',
 u'apache',
 u'apex',
 u'asp .net',
 u'asp mvc',
 u'atom',
 u'aws',
 u'baby bottles & burping cloths.',
 u'bash',
 u'bootstrap',
 u'brainpower',
 u'c++',
 u'centos',
 ...]

javascript    39
linux         23
node          22
ruby          22
rails         19
postgres      19
nginx         16
php           15
mysql         14
angular       12

The results show that many of the survey respondents are front-end developers, or at the very least full-stack developers, as JavaScript, Ruby on Rails, Nginx, PHP, and Angular dominate the list.

Clustering the Respondent Data

The first question I had that requires the use of machine learning algorithms is: what kinds/groups of developers answered the survey, and what are the general characteristics of those groupings?

Being able to cluster respondents into groups may be useful for understanding and expecting similar behavior from the members of a certain group. It allows us to make assumptions about certain populations that are founded in data. For instance, the data may show that infrastructure engineers tend to make certain salaries, work 40 hours a week, work for companies of a certain size, and have X number of prior jobs, in contrast to front-end engineers, who may have a different salary range, work slightly more hours, have X years of experience, and have read Y books on average.

To get started with clustering, I used the K-Means implementation provided by scikit-learn. I also performed a version of the clustering using scikit-learn's agglomerative clustering implementation, but since the results were similar to the K-Means solution, I chose to focus on iterating on my K-Means solution. K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. My first iteration was simply fed the DataFrame and asked to come up with 5 clusters:

# Using a basic KMeans algorithm
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, max_iter=500)

# Attempt 1: Just give it all the data in its raw, naked glory and see what it comes up with
kmeans.fit(survey_df)
# Taking a look at what records were assigned to what cluster, I notice a disturbing trend: Salary is outweighing all other features
survey_df_c1 = survey_df.copy()
survey_df_c1.insert(0, 'CLUSTER1_LABEL', kmeans.labels_)
survey_df_c1.head(10)

[Table: the first 10 survey records with the assigned CLUSTER1_LABEL column prepended; the wide block of one-hot tool columns is omitted here as unreadable.]
# To prove that out a bit more I group by the cluster label and run some stats
cl1_groups = survey_df_c1.groupby(['CLUSTER1_LABEL'])
cl1_groups.agg([np.mean, np.min, np.max, np.std])

[Table: mean/min/max/std of every feature, grouped by CLUSTER1_LABEL; omitted here as unreadable.]

This naive solution failed, in my opinion. Because a naive K-Means implementation is sensitive to scale, the salary feature vastly outweighed the other features in determining the clusters, so my result was clusters dictated by salary ranges: respondents with salaries from 20K to 65K were in cluster 3, 68K to 100K in cluster 0, 105K to 185K in cluster 2, etc.

In my second iteration of the K-Means solution, I normalized (scaled) the data so that all the fields contained values between 0 and 1.

def normalize(df, columns):
    result = df.copy()
    for feature_name in columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

survey_df_normalized = normalize(survey_df, ['YEARS_EXP', 'NUM_HOURS_WORK', 'DAYS_OFF', 'SALARY', 'CONF_BUDGET', 'BOOKS_READ', 'WORK_HIST_COUNT'])
survey_df_normalized.drop(['STACK', 'COMP_MEALS'], axis=1, inplace=True)
kmeans.fit(survey_df_normalized)

survey_df_c2 = survey_df.copy()
survey_df_c2.insert(0, 'CLUSTER2_LABEL', kmeans.labels_)

cl2_groups = survey_df_c2.groupby(['CLUSTER2_LABEL'])
cl2_groups.agg([np.mean, np.min, np.max, np.std])

[Table: mean/min/max/std of every feature, grouped by CLUSTER2_LABEL; omitted here as unreadable.]

Thanks to the feature scaling, my second iteration produced good clusters. Or did it? While scanning the tables revealed no noticeable skew from over-weighted features, it's very difficult for humans to understand and visualize clusters in 174 dimensions. So how do you validate clustering results without visualizing them? A full answer is beyond the scope of this article, but a succinct and easy-to-understand document that explores this topic further can be found here.
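For a taste of what that document covers, one widely used internal measure is the silhouette coefficient, which scores how well each point sits within its assigned cluster without requiring any visualization. A minimal sketch using scikit-learn:

from sklearn.metrics import silhouette_score

# Ranges from -1 (likely wrong cluster) to +1 (dense, well-separated clusters)
score = silhouette_score(survey_df_normalized, kmeans.labels_)
print("silhouette coefficient: %.3f" % score)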

In order to sidestep the difficulties of visualizing cluster data in high dimensions, for my third iteration I chose three features to cluster on. This allowed me to produce a 3D scatter plot of the respondents and verify that the clustering was successful.

survey_df_selected = survey_df_normalized[['SALARY', 'DAYS_OFF', 'WORK_HIST_COUNT']]

kmeans = KMeans(n_clusters=5, max_iter=500)
kmeans.fit(survey_df_selected)
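The plotting code isn't shown here; a minimal matplotlib sketch of the 3D scatter plot would look something like this:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Color each respondent by its assigned cluster label
ax.scatter(survey_df_selected['SALARY'],
           survey_df_selected['DAYS_OFF'],
           survey_df_selected['WORK_HIST_COUNT'],
           c=kmeans.labels_)
ax.set_xlabel('SALARY')
ax.set_ylabel('DAYS_OFF')
ax.set_zlabel('WORK_HIST_COUNT')
plt.show()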

Principal Component Analysis (PCA) is a dimensionality reduction method that could prove useful here, reducing the N-dimensional vector space to the two or three components that are most significant (show the most variance). While I didn't implement PCA for this project, I was able to make small steps toward marginal success in clustering the survey respondents.
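Had I gone that route, scikit-learn's PCA implementation makes it only a few lines (a sketch, not something I ran for this project):

from sklearn.decomposition import PCA

# Project the high-dimensional feature space onto its 3 largest
# principal components before clustering
pca = PCA(n_components=3)
reduced = pca.fit_transform(survey_df_normalized)
print(pca.explained_variance_ratio_)  # variance captured per component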

Predicting Salary

The second question I had that requires the use of machine learning algorithms is: what are the important features in this dataset when it comes to determining salary, and given those features, can we predict an expected salary?

This distills down to a classification problem, and there are many algorithms for solving these kinds of problems: logistic regression, naive Bayes classifiers, support vector machines, neural networks, etc.

I chose a decision tree algorithm because decision trees have built-in feature selection, are easy to understand and interpret, work on both categorical and continuous variables, and require little data preparation. That made it a perfect match for my use case: it automatically answers the question "What are the most important features in this dataset when it comes to determining salary?" And because this is a survey with an expected margin of error, it makes sense for me to optimize for usability and demo-ability over accuracy.

The decision tree algorithm works by recursively splitting a set of data observations into subsets based on the attribute that most effectively splits the data set into groups (subsets) of observations with similar outcomes. This results in a tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label or regression value.

from sklearn.tree import DecisionTreeClassifier, export_graphviz

y = survey_df["SALARY"]
X = survey_df.drop(["SALARY", "STACK"], axis=1, inplace=False)
tree = DecisionTreeClassifier(min_samples_split=4)
tree.fit(X, y)
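The figure below was presumably rendered via the export_graphviz import above; something along these lines (the output file name is my own choice):

# Write the fitted tree to a .dot file for rendering with Graphviz
export_graphviz(tree, out_file='salary_tree.dot',
                feature_names=list(X.columns))
# Then, on the command line: dot -Tpng salary_tree.dot -o salary_tree.png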


[Figure: the resulting decision tree (see the original article for the full-size version).]


For my dataset, the decision tree algorithm found the test "does your company provide more than 8.5 meals per month?" to be the most effective first split. If the answer is no, the left branch is taken and the next test becomes "do you have more or less than 8.5 years of experience?" If the answer to that test is "less than 8.5 years," the left branch is taken; otherwise, the right branch is taken. The tests answer the question of what the most important features are in this dataset when it comes to determining salary. The leaf (terminal) nodes show a prediction based on the observations in the dataset. So, for instance, if your company provides fewer than 8.5 meals per month, you have less than 8.5 years of experience, you've read fewer than 1.5 books, and your company does not provide a 401K, you are likely to earn $70,000. This prediction is based on the fact that three observations in the dataset with those same attributes earn that salary.

A couple of things to note: like many other machine learning algorithms, decision trees are vulnerable to overfitting, meaning the model fits a limited set of observations too closely and describes the random error instead of the underlying relationship. There are many ways to combat the overfitting problem that are better covered in a machine learning course. One interesting approach is to use random forests in place of decision trees (sketched below). For the decision tree presented in this post, I excluded the "STACK" feature (i.e., the survey respondents' skills) from the model for simplicity's sake. I have a version of the model that takes skills into account, but it has significantly more tests, and although it appears to be more accurate, it is also more susceptible to overfitting. The final thing to note is that confidence in a decision tree's classification prediction can be quantified by outputting class probabilities. For regression trees, the confidence measure is the error. This blog post has more information.
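Swapping in a random forest is nearly a drop-in change with scikit-learn; a sketch (n_estimators is my choice, not tuned):

from sklearn.ensemble import RandomForestClassifier

# An ensemble of trees trained on bootstrap samples and random feature
# subsets, which averages away much of a single tree's variance
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X, y)
print(forest.feature_importances_)  # feature importances survive the swap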

After you train your decision tree model in Python using scikit-learn, predicting outcomes is as easy as calling the predict method on the model object. If you would like to create a rules engine from the decision tree, or you simply want a textual representation of the tree, check out the last section of this blog post or this Stack Overflow question.
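For example (the new respondent here is fabricated for illustration by reusing the shape of an existing row):

# Predict the salary class for a new observation; the row must carry
# the same columns, in the same order, as the training matrix X
new_respondent = X.iloc[[0]].copy()
print(tree.predict(new_respondent))

# Class probabilities quantify confidence in the prediction
print(tree.predict_proba(new_respondent))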

Conclusion

Here's what I learned in the process:

  • To ask specific and deeper questions from your data, custom coding is usually your best bet. Most tools generalize to meet a broader set of needs. Your custom Python code is the best way to meet your needs and handle the specific data that you have.
  • The Pandas library makes it easy to perform your standard statistical analysis on data tables with methods like std, var, mean, quantile, etc. (See full list)
  • Clustering high-dimensional data is easy enough to do with the scikit-learn clustering implementations, but validating those clusters is a harder problem that requires more research.
  • Decision trees are a very robust tool for both feature importance and prediction, but care must be taken to ensure you don't overfit your model.

There are a lot more insights that could be extracted from the survey data, given adequate amounts of data and the right machine learning tools and techniques. If anyone is interested in this project, check out the GitHub repository: github.com/tim-ojo/chadev_survey.


Published at DZone with permission of Tim Ojo, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
