Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Building a Data Science Portfolio: Storytelling with Data - Part II

DZone's Guide to

Building a Data Science Portfolio: Storytelling with Data - Part II

In this second part of the series, learn how to polish your data science portfolio.

Free Resource

Effortlessly power IoT, predictive analytics, and machine learning applications with an elastic, resilient data infrastructure. Learn how with Mesosphere DC/OS.

Originally written by Vik Paruchuri, the founder of Dataquest, a platform that teaches data science interactively in your browser. Dataquest’s unique approach to learning blends theory and practice, then helps you build your portfolio with projects.

Before we dive into exploring the data [see Part I for steps relating to data preparation], we’ll want to set the context, both for ourselves, and anyone else that reads our analysis. One good way to do this is with exploratory charts or maps. In this case, we’ll map out the positions of the schools, which will help readers understand the problem we’re exploring.

In the below code, we:

  • Setup a map centered on New York City.
  • Add a marker to the map for each high school in the city.
  • Display the map.
In [82]: import folium
         from folium import plugins

         schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
         marker_cluster = folium.MarkerCluster().add_to(schools_map)
         for name, row in full.iterrows():
             folium.Marker([row["lat"], row["lon"]], popup="{0}: {1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
         schools_map.create_map('schools.html')
         schools_map

Out[82]:

datastory-f1

This map is helpful, but it’s hard to see where the most schools are in NYC. Instead, we’ll make a heatmap:

In [84]: schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
         schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in full.iterrows()]))
         schools_heatmap.save("heatmap.html")
         schools_heatmap

Out[84]:

datastory-f2

District-Level Mapping

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

  • Group full by school district.
  • Compute the average of each column for each school district.
  • Convert the school_dist field to remove leading 0s, so we can match our geographic district data.
In [ ]: district_data = full.groupby("school_dist").agg(np.mean)
        district_data.reset_index(inplace=True)
        district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))

We’ll now we able to plot the average SAT score in each school district. In order to do this, we’ll read in data in GeoJSON format to get the shapes of each district, then match each district shape with the SAT score using the school_dist column, then finally create the plot:

In [85]: def show_district_map(col):
         geo_path = 'schools/districts.geojson'
         districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
         districts.geo_json(
             geo_path=geo_path,
             data=district_data,
             columns=['school_dist', col],
             key_on='feature.properties.school_dist',
             fill_color='YlGn',
             fill_opacity=0.7,
             line_opacity=0.2,
         )
         districts.save("districts.html")
         return districts

show_district_map("sat_score")

datastory-f3

Exploring Enrollment and SAT Scores

Now that we’ve set the context by plotting out where the schools are, and SAT score by district, people viewing our analysis have a better idea of the context behind the dataset. Now that we’ve set the stage, we can move into exploring the angles we identified earlier when we were finding correlations. The first angle to explore is the relationship between the number of students enrolled in a school and SAT score.

We can explore this with a scatter plot that compares total enrollment across all schools to SAT scores across all schools.

In [87]: %matplotlib inline

         full.plot.scatter(x='total_enrollment', y='sat_score')

Out[87]:

datastory-f4

As you can see, there’s a cluster at the bottom left with low total enrollment and low SAT scores. Other than this cluster, there appears to only be a slight positive correlation between SAT scores and total enrollment. Graphing out correlations can reveal unexpected patterns.

We can explore this further by getting the names of the schools with low enrollment and low SAT scores:

In [88]: full[(full["total_enrollment"] < 1000) & (full["sat_score"] < 1000)]["School Name"]

Out[88]: 34     INTERNATIONAL SCHOOL FOR LIBERAL ARTS
         143                                      NaN
         148    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
         203                MULTICULTURAL HIGH SCHOOL
         294      INTERNATIONAL COMMUNITY HIGH SCHOOL
         304          BRONX INTERNATIONAL HIGH SCHOOL
         314                                      NaN
         317            HIGH SCHOOL OF WORLD CULTURES
         320       BROOKLYN INTERNATIONAL HIGH SCHOOL
         329    INTERNATIONAL HIGH SCHOOL AT PROSPECT
         331               IT TAKES A VILLAGE ACADEMY
         351    PAN AMERICAN INTERNATIONAL HIGH SCHOO
         Name: School Name, dtype: object

Some searching on Google shows that most of these schools are for students who are learning English, and are low enrollment as a result. This exploration showed us that it’s not total enrollment that’s correlated to SAT score – it’s whether or not students in the school are learning English as a second language.

Exploring English Language Learners and SAT Scores

Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The ell_percent column is the percentage of students in each school who are learning English. We can make a scatterplot of this relationship:

In [89]: full.plot.scatter(x='ell_percent', y='sat_score')

Out[89]: <matplotlib.axes._subplots.AxesSubplot at 0x10fe824e0>

datastory-f5

It looks like there are a group of schools with a high ell_percentage that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:

In [90]: show_district_map("ell_percent")

Out[90]:

datastory-f6

As we can see by looking at the two district level maps, districts with a low proportion of ELL learners tend to have high SAT scores, and vice versa.

Correlating Survey Scores and SAT Scores

It would be fair to assume that the results of student, parent, and teacher surveys would have a large correlation with SAT scores. It makes sense that schools with high academic expectations, for instance, would tend to have higher SAT scores. To test this theory let's plot out SAT scores and the various survey metrics:

In [91]: full.corr()["sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()

Out[91]: <matplotlib.axes._subplots.AxesSubplot at 0x114652400>

datastory-f7

Surprisingly, the two factors that correlate the most are N_p and N_s, which are the counts of parents and students who responded to the surveys. Both strongly correlate with total enrollment, so are likely biased by the ell_learners. The other metric that correlates most is saf_t_11. That is how safe students, parents, and teachers perceived the school to be. It makes sense that the safer the school, the more comfortable students feel learning in the environment. However, none of the other factors, like engagement, communication, and academic expectations, correlated with SAT scores. This may indicate that NYC is asking the wrong questions in surveys, or thinking about the wrong factors (if their goal is to improve SAT scores, it may not be).

Exploring Race and SAT Scores

One of the other angles to investigate involves race and SAT scores. There was a large correlation differential, and plotting it out will help us understand what’s happening:

In [92]: full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()

Out[92]: <matplotlib.axes._subplots.AxesSubplot at 0x108166ba8>

datastory-f8

It looks like the higher percentages of white and Asian students correlate with higher SAT scores, but higher percentages of black and Hispanic students correlate with lower SAT scores. For Hispanic students, this may be due to the fact that there are more recent immigrants who are ELL learners. We can map the hispanic percentage by district to eyeball the correlation:

In [93]: show_district_map("hispanic_per")

Out[93]:

datastory-f9

It looks like there is some correlation with ELL percentage, but it will be necessary to do some more digging into this and other racial differences in SAT scores.

Gender Differences in SAT scores

The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of females in a school tends to correlate with higher SAT scores. We can visualize this with a bar graph:

In [94]: full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()

Out[94]: <matplotlib.axes._subplots.AxesSubplot at 0x10774d0f0>

datastory-f10

To dig more into the correlation, we can make a scatterplot of female_per and sat_score:

In [95]: full.plot.scatter(x='female_per', y='sat_score')

Out[95]: <matplotlib.axes._subplots.AxesSubplot at 0x104715160>

datastory-f11

It looks like there’s a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster:

In [96]: full[(full["female_per"] > 65) & (full["sat_score"] > 1400)]["School Name"]

Out[96]: 3             PROFESSIONAL PERFORMING ARTS HIGH SCH
         92                    ELEANOR ROOSEVELT HIGH SCHOOL
         100                    TALENT UNLIMITED HIGH SCHOOL
         111            FIORELLO H. LAGUARDIA HIGH SCHOOL OF
         229                     TOWNSEND HARRIS HIGH SCHOOL
         250    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL
         265                  BARD HIGH SCHOOL EARLY COLLEGE
         Name: School Name, dtype: object

Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females, and higher SAT scores. This likely accounts for the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and lower SAT scores.

AP Scores

So far, we’ve looked at demographic angles. One angle that we have the data to look at is the relationship between more students taking Advanced Placement exams and higher SAT scores. It makes sense that they would be correlated since students who are high academic achievers tend to do better on the SAT.

In [98]: full["ap_avg"] = full["AP Test Takers "] / full["total_enrollment"]

         full.plot.scatter(x='ap_avg', y='sat_score')

Out[98]: <matplotlib.axes._subplots.AxesSubplot at 0x11463a908>

datastory-f12

It looks like there is indeed a strong correlation between the two. An interesting cluster of schools is the one at the top right, which has high SAT scores and a high proportion of students that take the AP exams:

In [99]: full[(full["ap_avg"] > .3) & (full["sat_score"] > 1700)]["School Name"]

Out[99]: 92             ELEANOR ROOSEVELT HIGH SCHOOL
         98                    STUYVESANT HIGH SCHOOL
         157             BRONX HIGH SCHOOL OF SCIENCE
         161    HIGH SCHOOL OF AMERICAN STUDIES AT LE
         176           BROOKLYN TECHNICAL HIGH SCHOOL
         229              TOWNSEND HARRIS HIGH SCHOOL
         243    QUEENS HIGH SCHOOL FOR THE SCIENCES A
         260      STATEN ISLAND TECHNICAL HIGH SCHOOL
         Name: School Name, dtype: object

Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.

Wrapping Up the Story

With data science, the story is never truly finished. By releasing analysis to others, you enable them to extend and shape your analysis in whatever direction interests them. For example, in this post, there are quite a few angles that we explored incompletely, and could have dived into more.

One of the best ways to get started with telling stories using data is to try to extend or replicate the analysis someone else has done. If you decide to take this route, you’re welcome to extend the analysis in this post and see what you can find. If you do this, make sure to comment below so I can take a look.

Next Steps

If you’ve made it this far, you hopefully have a good understanding of how to tell a story with data, and how to build your first data science portfolio piece. Once you’re done with your data science project, it’s a good idea to post it on Github so others can collaborate with you on it.

Learn to design and build better data-rich applications with this free eBook from O’Reilly. Brought to you by Mesosphere DC/OS.

Topics:
data science ,python

Published at DZone with permission of Vik Paruchuri, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}