Too Big Data: Coping with Overplotting
As we add more and more data points to a scatter plot, we can start to lose patterns and clusters.
By Tim Brock.
Scatter plots are a wonderful way of showing (apparent) relationships in bivariate data. Patterns and clusters that you wouldn't spot in a huge block of data in a table can become instantly visible on a page or screen. With all the hype around Big Data in recent years, it's easy to assume that having more data is always an advantage. But as we add more and more data points to a scatter plot, we can start to lose these patterns and clusters. This problem, a result of overplotting, is demonstrated in the animation below.
The data in the animation above is randomly generated from a pair of simple bivariate distributions. The distinction between the two distributions becomes less and less clear as we add more and more data. So what can we do about overplotting?
One simple option is to make the data points smaller. (Note this is a poor "solution" if many data points share exactly the same values.) We can also make them semi-transparent. And we can combine these two options:
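These two refinements can be sketched in a few lines of matplotlib. The two-cluster data below is invented for illustration (it is not the article's dataset); the key arguments are `s`, which shrinks the markers, and `alpha`, which makes them semi-transparent.

```python
# Sketch: shrink markers (s) and add transparency (alpha) to fight overplotting.
# The two overlapping clusters here are made up for illustration.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100_000
# Two overlapping bivariate normal clusters
a = rng.normal(loc=(-1.0, 1.0), scale=0.8, size=(n, 2))
b = rng.normal(loc=(1.0, -1.0), scale=0.8, size=(n, 2))
xy = np.vstack([a, b])

fig, ax = plt.subplots()
# s sets marker area; alpha sets opacity (0 = invisible, 1 = opaque)
ax.scatter(xy[:, 0], xy[:, 1], s=1, alpha=0.05, edgecolors="none")
fig.savefig("scatter_small_alpha.png", dpi=150)
```

With heavy overlap, low alpha means single points are barely visible while dense regions build up to full opacity, which is exactly the effect we want.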
These refinements certainly help when we have ten thousand data points. However, by the time we've reached a million points the two distributions have seemingly merged into one again. Making the points still smaller and more transparent might help; at some point, though, we may have to consider a change of visualization. We'll get to that later. First, let's try to supplement our visualization with some extra information. Specifically, let's visualize the marginal distributions. We have several options. There's far too much data for a rug plot, but we can bin the data and show histograms. Or we can use a smoother alternative: a kernel density plot. Finally, we could use the empirical cumulative distribution. This last option avoids any binning or smoothing, but the results are probably less intuitive. I'll go with the kernel density option here, though you might prefer a histogram. The animated gif below is the same as the gif above but with the smoothed marginal distributions added. I've left the scales off to avoid clutter and because we're only really interested in rough judgements of relative height.
Adding marginal distributions, particularly the distribution of Variable 2, helps clarify that two different distributions are present in the bivariate data. The twin-peaked nature of Variable 2 is evident whether there are a thousand data points or a million. The relative sizes of the two components are also clear. By contrast, the marginal distribution of Variable 1 has only a single peak, despite coming from two distinct distributions. This should make it clear that adding marginal distributions is by no means a universal solution to overplotting in scatter plots. To reinforce this point, the animation below shows a completely different set of (generated) data points in a scatter plot with marginal distributions. The data again comes from a random sample of two different 2D distributions, but both marginal distributions of the complete dataset fail to highlight this separation. As previously, when the number of data points is large the distinction between the two clusters can't be seen from the scatter plot either.
Returning to point size and opacity, what do we get if we make the data points very small and almost completely transparent?
We can now clearly distinguish two clusters in each dataset. It's difficult to make out any fine detail though.
Since we've lost that fine detail anyway, it seems apt to question whether we really want to draw a million data points: doing so can be tediously slow, and in some contexts impossible. 2D histograms are an alternative. By binning the data we can reduce the number of points to plot and, given an appropriate color scale, recover some of the features that were lost in the clutter of the scatter plot. After some experimenting I picked a color scale that runs from black through green to white at the high end. Note that this is (almost) the reverse of the effect created by overplotting in the scatter plots above.
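A 2D histogram of this kind can be sketched with `numpy.histogram2d` and a custom colormap. The black-to-green-to-white scale below approximates the one described; the exact colors, bin count, and data are my own assumptions.

```python
# Sketch: bin a million points into a 2D histogram with a
# black -> green -> white colour scale. Data is illustrative.
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

rng = np.random.default_rng(2)
n = 1_000_000
x = np.concatenate([rng.normal(-1, 0.7, n // 2), rng.normal(1, 0.7, n // 2)])
y = np.concatenate([rng.normal(1, 0.7, n // 2), rng.normal(-1, 0.7, n // 2)])

# Low counts render dark, high counts bright: roughly the reverse of
# overplotting, where dense regions turn into a uniform dark mass.
cmap = LinearSegmentedColormap.from_list("bgw", ["black", "green", "white"])

counts, xedges, yedges = np.histogram2d(x, y, bins=100)

fig, ax = plt.subplots()
# counts is indexed [x, y], so transpose for pcolormesh's row-major layout
ax.pcolormesh(xedges, yedges, counts.T, cmap=cmap)
fig.savefig("hist2d.png", dpi=150)
```

However many raw points there are, only `bins × bins` cells get drawn, so rendering cost no longer grows with the dataset.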
In both 2D histograms we can clearly see the two clusters representing the two distributions from which the data was drawn. In the first case we can also see that there are more counts in the upper-left cluster than in the bottom-right cluster, a detail that is lost in the scatter plot with a million data points (but more obvious from the marginal distributions). Conversely, in the case of the second dataset we can see that the "heights" of the two clusters are roughly comparable.
3D charts are overused, but here (see below) I think they actually work quite well in terms of providing a broad picture of where the data is and isn't concentrated. Feature occlusion is a problem with 3D charts so if you're going to go down this route when exploring your own data I highly recommend using software that allows for user interaction through rotation and zooming.
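The 3D view can be sketched by plotting the binned counts as surface heights with matplotlib's mplot3d toolkit. This off-screen sketch (again on invented data) skips the interactive rotation and zooming recommended above, which would need a GUI backend.

```python
# Sketch: render 2D-histogram counts as a 3D surface.
# Data, bin count, and colormap are illustrative assumptions.
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 0.7, 50_000), rng.normal(1, 0.7, 50_000)])
y = np.concatenate([rng.normal(1, 0.7, 50_000), rng.normal(-1, 0.7, 50_000)])

counts, xedges, yedges = np.histogram2d(x, y, bins=60)
xc = (xedges[:-1] + xedges[1:]) / 2  # bin centres
yc = (yedges[:-1] + yedges[1:]) / 2
X, Y = np.meshgrid(xc, yc)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # mplot3d axes
ax.plot_surface(X, Y, counts.T, cmap="viridis")
ax.view_init(elev=35, azim=-60)  # a fixed viewpoint; interaction is better
fig.savefig("surface.png", dpi=150)
```

With an interactive backend (e.g. running the same code in a desktop session), dragging to rotate the axes helps work around the occlusion problem.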
In summary, scatter plots are a simple and often effective way of visualizing bivariate data. If, however, your chart suffers from overplotting, try reducing point size and opacity. Failing that, a 2D histogram or even a 3D surface plot may be helpful. In the latter case be wary of occlusion.
Published at DZone with permission of Josh Anderson, DZone MVB.