# Building a Data Science Portfolio: Storytelling with Data - Part II

# Building a Data Science Portfolio: Storytelling with Data - Part II

### In this second part of the series, learn how to polish your data science portfolio.

Join the DZone community and get the full member experience.

Join For FreeHow to Simplify Apache Kafka. Get eBook.

*Originally written by Vik Paruchuri, the founder of Dataquest, a platform that teaches data science interactively in your browser. Dataquest’s unique approach to learning blends theory and practice, then helps you build your portfolio with projects.*

Before we dive into exploring the data [see Part I for steps relating to data preparation], we’ll want to set the context, both for ourselves, and anyone else that reads our analysis. One good way to do this is with exploratory charts or maps. In this case, we’ll map out the positions of the schools, which will help readers understand the problem we’re exploring.

In the below code, we:

- Setup a map centered on New York City.
- Add a marker to the map for each high school in the city.
- Display the map.

```
In [82]: import folium
from folium import plugins
schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
marker_cluster = folium.MarkerCluster().add_to(schools_map)
for name, row in full.iterrows():
folium.Marker([row["lat"], row["lon"]], popup="{0}: {1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
schools_map.create_map('schools.html')
schools_map
Out[82]:
```

This map is helpful, but it’s hard to see where the most schools are in NYC. Instead, we’ll make a heatmap:

```
In [84]: schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
Out[84]:
```

## District-Level Mapping

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

- Group full by school district.
- Compute the average of each column for each school district.
- Convert the school_dist field to remove leading 0s, so we can match our geographic district data.

```
In [ ]: district_data = full.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
```

We’ll now we able to plot the average SAT score in each school district. In order to do this, we’ll read in data in GeoJSON format to get the shapes of each district, then match each district shape with the SAT score using the `school_dist`

column, then finally create the plot:

```
In [85]: def show_district_map(col):
geo_path = 'schools/districts.geojson'
districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
districts.geo_json(
geo_path=geo_path,
data=district_data,
columns=['school_dist', col],
key_on='feature.properties.school_dist',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
)
districts.save("districts.html")
return districts
show_district_map("sat_score")
```

## Exploring Enrollment and SAT Scores

Now that we’ve set the context by plotting out where the schools are, and SAT score by district, people viewing our analysis have a better idea of the context behind the dataset. Now that we’ve set the stage, we can move into exploring the angles we identified earlier when we were finding correlations. The first angle to explore is the relationship between the number of students enrolled in a school and SAT score.

We can explore this with a scatter plot that compares total enrollment across all schools to SAT scores across all schools.

```
In [87]: %matplotlib inline
full.plot.scatter(x='total_enrollment', y='sat_score')
Out[87]:
```

As you can see, there’s a cluster at the bottom left with low total enrollment and low SAT scores. Other than this cluster, there appears to only be a slight positive correlation between SAT scores and total enrollment. Graphing out correlations can reveal unexpected patterns.

We can explore this further by getting the names of the schools with low enrollment and low SAT scores:

```
In [88]: full[(full["total_enrollment"] < 1000) & (full["sat_score"] < 1000)]["School Name"]
Out[88]: 34 INTERNATIONAL SCHOOL FOR LIBERAL ARTS
143 NaN
148 KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
203 MULTICULTURAL HIGH SCHOOL
294 INTERNATIONAL COMMUNITY HIGH SCHOOL
304 BRONX INTERNATIONAL HIGH SCHOOL
314 NaN
317 HIGH SCHOOL OF WORLD CULTURES
320 BROOKLYN INTERNATIONAL HIGH SCHOOL
329 INTERNATIONAL HIGH SCHOOL AT PROSPECT
331 IT TAKES A VILLAGE ACADEMY
351 PAN AMERICAN INTERNATIONAL HIGH SCHOO
Name: School Name, dtype: object
```

Some searching on Google shows that most of these schools are for students who are learning English, and are low enrollment as a result. This exploration showed us that it’s not total enrollment that’s correlated to SAT score – it’s whether or not students in the school are learning English as a second language.

## Exploring English Language Learners and SAT Scores

Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The `ell_percent`

column is the percentage of students in each school who are learning English. We can make a scatterplot of this relationship:

```
In [89]: full.plot.scatter(x='ell_percent', y='sat_score')
Out[89]: <matplotlib.axes._subplots.AxesSubplot at 0x10fe824e0>
```

It looks like there are a group of schools with a high `ell_percentage`

that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:

```
In [90]: show_district_map("ell_percent")
Out[90]:
```

As we can see by looking at the two district level maps, districts with a low proportion of ELL learners tend to have high SAT scores, and vice versa.

## Correlating Survey Scores and SAT Scores

It would be fair to assume that the results of student, parent, and teacher surveys would have a large correlation with SAT scores. It makes sense that schools with high academic expectations, for instance, would tend to have higher SAT scores. To test this theory let's plot out SAT scores and the various survey metrics:

```
In [91]: full.corr()["sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()
Out[91]: <matplotlib.axes._subplots.AxesSubplot at 0x114652400>
```

Surprisingly, the two factors that correlate the most are `N_p`

and `N_s`

, which are the counts of parents and students who responded to the surveys. Both strongly correlate with total enrollment, so are likely biased by the `ell_learners`

. The other metric that correlates most is `saf_t_11`

. That is how safe students, parents, and teachers perceived the school to be. It makes sense that the safer the school, the more comfortable students feel learning in the environment. However, none of the other factors, like engagement, communication, and academic expectations, correlated with SAT scores. This may indicate that NYC is asking the wrong questions in surveys, or thinking about the wrong factors (if their goal is to improve SAT scores, it may not be).

## Exploring Race and SAT Scores

One of the other angles to investigate involves race and SAT scores. There was a large correlation differential, and plotting it out will help us understand what’s happening:

```
In [92]: full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
Out[92]: <matplotlib.axes._subplots.AxesSubplot at 0x108166ba8>
```

It looks like the higher percentages of white and Asian students correlate with higher SAT scores, but higher percentages of black and Hispanic students correlate with lower SAT scores. For Hispanic students, this may be due to the fact that there are more recent immigrants who are ELL learners. We can map the hispanic percentage by district to eyeball the correlation:

```
In [93]: show_district_map("hispanic_per")
Out[93]:
```

It looks like there is some correlation with ELL percentage, but it will be necessary to do some more digging into this and other racial differences in SAT scores.

## Gender Differences in SAT scores

The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of females in a school tends to correlate with higher SAT scores. We can visualize this with a bar graph:

```
In [94]: full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()
Out[94]: <matplotlib.axes._subplots.AxesSubplot at 0x10774d0f0>
```

To dig more into the correlation, we can make a scatterplot of `female_per`

and `sat_score`

:

```
In [95]: full.plot.scatter(x='female_per', y='sat_score')
Out[95]: <matplotlib.axes._subplots.AxesSubplot at 0x104715160>
```

It looks like there’s a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster:

```
In [96]: full[(full["female_per"] > 65) & (full["sat_score"] > 1400)]["School Name"]
Out[96]: 3 PROFESSIONAL PERFORMING ARTS HIGH SCH
92 ELEANOR ROOSEVELT HIGH SCHOOL
100 TALENT UNLIMITED HIGH SCHOOL
111 FIORELLO H. LAGUARDIA HIGH SCHOOL OF
229 TOWNSEND HARRIS HIGH SCHOOL
250 FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL
265 BARD HIGH SCHOOL EARLY COLLEGE
Name: School Name, dtype: object
```

Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females, and higher SAT scores. This likely accounts for the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and lower SAT scores.

## AP Scores

So far, we’ve looked at demographic angles. One angle that we have the data to look at is the relationship between more students taking Advanced Placement exams and higher SAT scores. It makes sense that they would be correlated since students who are high academic achievers tend to do better on the SAT.

```
In [98]: full["ap_avg"] = full["AP Test Takers "] / full["total_enrollment"]
full.plot.scatter(x='ap_avg', y='sat_score')
Out[98]: <matplotlib.axes._subplots.AxesSubplot at 0x11463a908>
```

It looks like there is indeed a strong correlation between the two. An interesting cluster of schools is the one at the top right, which has high SAT scores and a high proportion of students that take the AP exams:

```
In [99]: full[(full["ap_avg"] > .3) & (full["sat_score"] > 1700)]["School Name"]
Out[99]: 92 ELEANOR ROOSEVELT HIGH SCHOOL
98 STUYVESANT HIGH SCHOOL
157 BRONX HIGH SCHOOL OF SCIENCE
161 HIGH SCHOOL OF AMERICAN STUDIES AT LE
176 BROOKLYN TECHNICAL HIGH SCHOOL
229 TOWNSEND HARRIS HIGH SCHOOL
243 QUEENS HIGH SCHOOL FOR THE SCIENCES A
260 STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: School Name, dtype: object
```

Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.

## Wrapping Up the Story

With data science, the story is never truly finished. By releasing analysis to others, you enable them to extend and shape your analysis in whatever direction interests them. For example, in this post, there are quite a few angles that we explored incompletely, and could have dived into more.

One of the best ways to get started with telling stories using data is to try to extend or replicate the analysis someone else has done. If you decide to take this route, you’re welcome to extend the analysis in this post and see what you can find. If you do this, make sure to comment below so I can take a look.

## Next Steps

If you’ve made it this far, you hopefully have a good understanding of how to tell a story with data, and how to build your first data science portfolio piece. Once you’re done with your data science project, it’s a good idea to post it on Github so others can collaborate with you on it.

Published at DZone with permission of Vik Paruchuri , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}