The Art of Storytelling in Data Science
Take a deep dive into the wonderful world of data visualization and learn all of the best practices for storytelling in data science.
There's a built-in author in all of us. This is evident from our day-to-day lives. Whether we narrate a funny incident or our findings, stories have always been the “go-to” to draw interest from listeners and readers alike. They bring an idea to life and make it more interesting. The idea of storytelling is fascinating: to take an idea or an incident and turn it into a story.
For instance, when we talk of how one of our friends got scolded by a teacher, we tend to narrate the incident from the beginning so that a flow is maintained. I'd do the same if it were a friend's wedding or a girl proposing to me.
Let’s take an example of the most common driving distractions by gender. There are two ways to tell this. The first is that I give you some statistics as follows:
- 6% of men believe texting is a distraction as compared to 4.2% of women.
- Kids in the car cause 9.8% of men to be distracted as compared to 26.3% of women.
Another way to present the same statistics is this visual from kids4kars.org.
Which one do you think tells a better story?
The Need for Storytelling
The art of storytelling is simple and complex at the same time. Stories provoke thought and bring out insights that could not have been understood or explained before. Storytelling is often overlooked in data-driven operations because we believe it’s a trivial task. What we fail to understand is that even the best stories, when not presented well, end up being useless!
In several firms, the first step towards analyzing anything is storyboarding it. This helps answer questions like, Why do we have to analyze it? and What decisions can we make out of it? Sometimes, data alone tells such visual and intricate stories that we don’t need to run complex correlations to confirm them.
The best example of needing stories and visuals to explain data is Anscombe’s Quartet, a set of four datasets that share nearly identical summary statistics yet look very different when plotted. You can learn more about Anscombe's Quartet here.
You would be surprised to see the results and visuals of Anscombe's Quartet.
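To make the point about Anscombe's Quartet concrete, here is a minimal Python sketch that computes the summary statistics of the four datasets. The values are the published quartet (Anscombe, 1973); the names `QUARTET`, `pearson_r`, and `summarize` are chosen here for illustration only.

```python
from math import sqrt

# Anscombe's Quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics rounded to two decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "r": round(pearson_r(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, stats)
```

All four datasets print the same rounded summary, which is exactly why the numbers alone cannot tell their stories apart; only a plot of each dataset reveals how different they are.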
How to Create Stories
To create a story or a plot is the first step to selling your ideas with your best foot forward. Most people fail to think their stories through and cannot differentiate themselves from mediocrity. Let me take an example and guide you through the steps of creating stories.
We will be exploring a dataset that has news headlines and details of every stock price from the NASDAQ 100 tech companies. The columns selected are as follows.
1. Begin With a Pen-Paper Approach
Visually engaging presentations will inspire your audience, but they definitely need more work to be put in. Some of the best presentations have been created on rough pages and napkins.
Writing down your ideas and flow before you start structuring your story is essential to your final product.
The single most important thing you can do to dramatically improve your analytics is to have a story to tell. Without a planned flow, your end result can carry a lot of friction.
Aristotle’s classic five-point plan, which helps deliver a strong impact, is a must for any storyteller or writer. Here is Aristotle's five-point plan in detail.
The way I structured my report was by involving plots that would give me a better understanding of my data. The first question that I had was, How can I make better business decisions about stocks by using the data that I have?
Involving a line graph would help me analyze trend lines of specific stock prices.
As you can see, February 2016 experienced a drop in all stocks. This knowledge can help me scrape news articles from only that period to identify what caused the drop. Now, how do I select which news source to scrape from?
By identifying which news source reported most information about a particular stock, we would have reason to believe that this is a good source for the specific stock.
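The source-selection idea above can be sketched as a simple count of headlines per source for each stock. The rows, tickers, and source names below are made up for illustration, and `top_source_per_stock` is a hypothetical helper, not part of the original report.

```python
from collections import Counter, defaultdict

# Hypothetical (ticker, news source) pairs, one per headline.
HEADLINES = [
    ("TSLA", "Source A"),
    ("TSLA", "Source A"),
    ("TSLA", "Source B"),
    ("TSLA", "Source A"),
    ("AAPL", "Source B"),
    ("AAPL", "Source B"),
    ("AAPL", "Source C"),
]


def top_source_per_stock(rows):
    """Return {ticker: (source, headline_count)} for the most frequent source."""
    counts = defaultdict(Counter)
    for ticker, source in rows:
        counts[ticker][source] += 1
    return {ticker: c.most_common(1)[0] for ticker, c in counts.items()}


TOP_SOURCES = top_source_per_stock(HEADLINES)
print(TOP_SOURCES)
```

A bar chart of these counts per stock would make the same comparison visually.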
2. Dig Deeper to Identify the Sole Purpose of Your Story
Identify closely what the idea of your story is. Ask yourself, What am I really giving with this story? It’s never the story alone, but what the story can do to make decision-making better. What you’re displaying is the idea of better decision-making or analytics.
Develop a personal “passion statement.” In one sentence, tell your prospects why you are genuinely excited about working with them. Your passion statement will be remembered for a long time to come.
3. Use Powerful Headings
Create your heading: a one-sentence statement for your story, visual, or analysis. The most effective headlines are concise, specific, and offer a personal benefit.
Remember, your heading is a statement that offers your audience a vision of a better understanding. It’s not about you. It’s about them.
4. Design a Roadmap
Create a list of all the key points you want your audience to know about your story, visual, or analysis.
Categorize the list until you are left with only three major message points. This group of three will provide the verbal roadmap for your story.
Under each of your three key messages, add supporting evidence to enhance the narrative. These could be personal stories, facts, examples, or analogies.
5. Conclude With Brevity
Now that you have put forward all the points of your story, your conclusion should be short and powerful. In my report, I used short three-to-four-line summaries to conclude why to buy a particular stock.
Types of Data and Suitable Charts
Let's see the common types of data we encounter and how to tell stories from those, by selecting the best-fit charts.
Commonly encountered types of data include the following.
When data is found in textual form, it’s usually good to find out how often a word has been used or what the sentiment of the text is. Stories can be told best using this form of data.
One of the best-suited visualizations for textual data is the word cloud. The word cloud brings the more frequent words to the center and enlarges them, giving us a clear picture of the general idea of the text.
For example, the word cloud displayed above represents a Twitter dataset. It shows that love is the most frequent positive term used in the tweets.
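A word cloud is driven by word frequencies, so the counting step behind it can be sketched in a few lines. The tweets and the stopword list below are invented for illustration; they are not the Twitter dataset mentioned above.

```python
import re
from collections import Counter

# Hypothetical tweets, made up for this sketch.
TWEETS = [
    "I love this sunny morning",
    "Love the new phone, great camera",
    "Great coffee and I love the view",
    "The traffic today is terrible",
]

# A tiny, assumed stopword list; real projects use a much longer one.
STOPWORDS = {"i", "the", "this", "and", "is", "a", "new", "today"}


def word_frequencies(texts, stopwords):
    """Count lowercase words across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


FREQUENCIES = word_frequencies(TWEETS, STOPWORDS)
print(FREQUENCIES.most_common(3))
```

A word cloud library would then scale each word's font size by its count.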
When our data consists of numeric or any other variety of data formats, we need to know which ones are important and give us better insights from our dataset.
The preferred visual for this kind of data can vary; you can use Titanic passenger data to show how powerful visualization can be.
As this plot shows us, females and first-class passengers tend to have a higher survival chance than men who are a part of the crew or in lower boarding classes.
Isn’t that what really happened on the Titanic? Wow!
Here is a detailed visualization of the Titanic passenger dataset.
Another way to visualize this kind of data is with a multivariate plot.
Here is the detailed dataset on car performance and specifications used in the visualization above.
Here, we can see how cars that have a heavier build are slower than the ones with lighter bodies. Makes sense, right?
When we encounter numeric data, we’re usually looking for trends or lines that depict the numbers. The visual that suits numeric data best is a line or a step graph.
Here, we can very clearly see the rise of prices at a local attraction for adults and children. See how easy it is to see the growth at each year interval?
One of the datasets that we also encounter is related to stocks. Stock market data is primarily time series data of numeric values, but as a trader or an investor, I would like to understand each date and drop carefully.
The most visually captivating chart in this regard is the Candlestick chart:
Here, we take the example of Tesla’s stocks. Candlestick charts can be used to maneuver across each date and see the lows and highs of stocks individually. This can help us make better investment decisions based on current or past market trends. To learn more about the power of machine learning in stocks analysis, read this article on Bollinger Bands.
As the graph shows us, February 2016 experienced a drop in Tesla’s stocks. We can now use this information to understand other market conditions and economic situations to make decisions about their stock.
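Each candle in a candlestick chart summarizes one period as four numbers: open, high, low, and close. The sketch below shows one way to derive them from a chronological list of prices. The dates and prices are invented and are not Tesla's; `daily_candles` is a hypothetical helper name.

```python
# Hypothetical (date, price) ticks in chronological order.
TICKS = [
    ("day-1", 100.0),
    ("day-1", 104.5),
    ("day-1", 98.2),
    ("day-1", 101.0),
    ("day-2", 101.5),
    ("day-2", 95.0),
    ("day-2", 96.3),
]


def daily_candles(ticks):
    """Aggregate chronological ticks into open/high/low/close per day."""
    candles = {}
    for day, price in ticks:
        if day not in candles:
            # The first tick of the day sets all four values.
            candles[day] = {"open": price, "high": price, "low": price, "close": price}
        else:
            candle = candles[day]
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # the last tick seen so far
    return candles


CANDLES = daily_candles(TICKS)
for day, candle in CANDLES.items():
    print(day, candle)
```

A day whose close is below its open, like the second day here, is the kind of drop a candlestick chart makes visible at a glance.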
When we have data pertaining to specific locations and areas, we use maps to add clarity and meaning to our analysis.
In this example, we can see how countries fared during and after the 2002 World Cup. Germany has scored the maximum number of goals, being one of the most dominant teams in world football (soccer) ever since. Further, we can predict for whom the world is cheering in the FIFA WC using geographical Twitter data.
Storytelling During the Steps of Predictive Modeling
Often, we are questioned about how our stories and visuals can work or help when it’s time to create mathematical models. During all stages of predictive modeling, storytelling can be a vital addition to your analysis.
Let's understand the basic steps involved in creating models out of our data and go through telling stories within them.
The first step of model building is understanding your data. I’ll give you instances and show you how you can explore your data without computing complex statistics.
Let’s consider a dataset on wine quality. The structure of the dataset is as follows:
Here, we can see the associated summary statistics of the dataset in use:
So, if we need to see whether there is any correlation between alcohol volume and wine quality, how do we do it? To learn about correlation, read this article.
We could compute Pearson’s r. It would help us in building a model... but would not help us in analyzing much.
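For reference, Pearson's r for n paired observations is defined as follows, where x could stand for alcohol volume and y for wine quality (the symbol assignment is an assumption made here for illustration):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The result always lies between -1 and 1. It condenses the whole relationship into a single number, which is why it says little about outliers or the shape of the data.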
The resulting coefficient shows a very strong correlation between alcohol content and wine quality. But does it tell you anything else?
Ideally, it doesn’t. So, what does?
Let’s see how we can visualize these and tell a lot more from them. First, we’ll begin by seeing how wine quality relates to alcohol content.
Here, we can see that higher alcohol volumes relate to better wine quality, which helps us come to a better understanding of our data. We can also spot outliers better in this scenario.
Next, you might wonder how the acid content of your wine affects its quality. Learn more and dive into this example violin plot here.
After you generate features, how do you see how well each one is predicting?
Graphs tell us how far away our predicted points are from our fitted line.
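The distances such a graph shows are residuals: the gap between each observed value and the fitted line. Here is a minimal sketch using an ordinary least-squares fit on made-up numbers; `fit_line` and `residuals` are illustrative names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(xs, ys, slope, intercept):
    """Observed value minus the value predicted by the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical feature values and targets, made up for illustration.
XS = [1, 2, 3, 4, 5]
YS = [2.1, 3.9, 6.2, 7.8, 10.1]

SLOPE, INTERCEPT = fit_line(XS, YS)
RESIDUALS = residuals(XS, YS, SLOPE, INTERCEPT)

# Index of the point that lies farthest from the fitted line.
WORST = max(range(len(RESIDUALS)), key=lambda i: abs(RESIDUALS[i]))
print(SLOPE, INTERCEPT, RESIDUALS, WORST)
```

Plotting these residuals against the feature is a quick way to see where a feature predicts well and where it does not.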
Another example where we might have to visualize newly created features is principal component analysis (PCA). If you want to get an in-depth understanding of PCA, you can go through this article.
This is the Iris dataset found in RStudio:
When we run the principal component analysis on this dataset, we find these statistics:
However, when we plot this, we find that the resulting visual is much more informative than the statistics themselves:
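The main statistic a principal component analysis reports is the proportion of variance explained by each component. The sketch below computes it for only two features, so the eigenvalues of the 2x2 covariance matrix can be found in closed form. The measurements are made up and are not the Iris data; a real analysis of four features would use a linear algebra library.

```python
from math import sqrt


def covariance_2d(xs, ys):
    """Sample variances and covariance of two features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def explained_variance_2d(xs, ys):
    """Proportion of variance explained by each of the two components."""
    sxx, sxy, syy = covariance_2d(xs, ys)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    spread = sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eig1, eig2 = half_trace + spread, half_trace - spread
    total = eig1 + eig2
    return eig1 / total, eig2 / total


# Hypothetical, strongly correlated measurements.
XS = [1, 2, 3, 4, 5]
YS = [1.1, 1.9, 3.2, 3.8, 5.0]

PC1_SHARE, PC2_SHARE = explained_variance_2d(XS, YS)
print(PC1_SHARE, PC2_SHARE)
```

Here the first component carries almost all of the variance. That tells us one axis is enough, but not what the data looks like along it, which is where the plot helps.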
Model Creation and Comparison
Coming to the model creation phase, we usually find the need to understand how our data is being fitted.
This is a model that predicts whether the car should go fast or slow based on the grade of the road and its bumpiness.
As you can see, the decision boundary clearly classifies most of the data, but an accuracy of 88.21% doesn’t tell much of a story. Here, we can even see how far the misclassified points are from the decision boundary.
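To show what accuracy leaves out, this sketch assumes a linear decision boundary and reports, next to the accuracy, how far each misclassified point lies from it. The weights, the points, and the label coding (1 = slow, 0 = fast) are assumptions made for illustration, not the model from the plot.

```python
from math import hypot

# Hypothetical linear boundary: W[0] * grade + W[1] * bumpiness + B = 0.
W = (1.0, 1.0)
B = -1.0

# Made-up (grade, bumpiness, label) rows; label 1 = slow, 0 = fast.
POINTS = [
    (0.2, 0.3, 0), (0.1, 0.6, 0), (0.4, 0.4, 0), (0.7, 0.5, 0),
    (0.8, 0.9, 1), (0.6, 0.7, 1), (0.9, 0.4, 1), (0.3, 0.6, 1),
]


def signed_distance(grade, bumpiness):
    """Signed distance from a point to the boundary; positive means 'slow'."""
    return (W[0] * grade + W[1] * bumpiness + B) / hypot(*W)


def predict(grade, bumpiness):
    """Classify a point by the side of the boundary it falls on."""
    return 1 if signed_distance(grade, bumpiness) > 0 else 0


# Collect each misclassified point with its distance to the boundary.
MISSES = [
    (g, b, signed_distance(g, b))
    for g, b, label in POINTS
    if predict(g, b) != label
]
ACCURACY = 1 - len(MISSES) / len(POINTS)

print(ACCURACY)
for grade, bumpiness, distance in MISSES:
    print(grade, bumpiness, round(distance, 3))
```

Two models with the same accuracy can differ a lot in these distances: near-misses close to the boundary tell a different story from points that sit deep in the wrong region.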
We can also compare certain algorithms and techniques by looking at their decision boundaries as we did above.
Another example using the Iris dataset is shown below:
Here, there’s not much information to derive valuable insights about our model.
To learn more about support vector machines, you can go through this article.
This plot shows us a clear classification boundary where the species separate from each other.
Best Practices for Storytelling
Now that you know the scenarios where we can use storytelling to explain our point, I will give you a few practical tips for when you take this up on your own.
- Always label your axes and give a heading to your plot.
- Use legends where necessary.
- Use colors that are lighter on the eye and in proportion.
- Avoid adding unnecessary detail to your visualization, like backgrounds or themes that don’t afford good readability.
- A single point can simultaneously encode two quantitative values through its horizontal and vertical position.
- Never use points for visualization if you are doing time series encoding.
Published at DZone with permission of Shantanu Kumar . See the original article here.