Data Science vs. Data Analysis: What's the Difference?
The short version is that data science includes and goes beyond data analysis. If you contrast data scientists with data analysts, the data scientists' goals are deeper and their area of concern is larger.
Data science is hot right now. A report from the McKinsey Global Institute estimates a shortage of 190,000 data scientists in 2018, driven by demand from tech companies ranging from Apple to Zendesk. Courses teaching data science have popped up. Languages used in data science, such as Python and R, have grown wildly popular.
One common question in this field is, "What's the difference between data science and data analysis?" To answer that, we need to understand why the confusion exists in the first place.
Why the Confusion?
Most people are confused about the difference between data science and data analysis because the most visible part of a data scientist's job is data analysis.
Data scientists need to be familiar with many techniques to do their job well. I'll cover just a few below.
- A/B testing. Also known as split testing, this is a technique in which you compare a variety of test groups against one another to find what changes will improve a given objective variable. For example, you might measure the marketing response rate (the objective variable) from two or more different email campaigns.
- Association rule learning. Amazon product pages often state, "Customers who bought the product you're looking at also bought these other products." That's the power of association rule learning. Without human supervision, it discovers local patterns within data that express hidden relationships between input variables.
- Classification. This refers to the problem of assigning a new data point to the right category or group. Typically, the technique involves a training set of existing data points that are already categorized, and then applying machine learning to decide which category the new data point belongs to.
- Cluster analysis. This is similar to classification, but cluster analysis splits a diverse group of data points into smaller groups based on how similar they are. The difference between this statistical method and classification is that the characteristics that define similarity aren't known in advance, so there's no training set to use.
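To make the A/B testing idea from the list above a little more concrete, here's a minimal sketch of one common way to compare two email campaigns: a two-proportion z-test. The function name and the campaign numbers are made up for illustration, and a real test would also need sound experiment design (random assignment, a sample size chosen in advance, and so on).

```python
import math

def two_proportion_z_test(responses_a, sent_a, responses_b, sent_b):
    """Return (z, two-sided p-value) for the difference in response rates."""
    rate_a = responses_a / sent_a
    rate_b = responses_b / sent_b
    # Pooled rate under the assumption that both campaigns perform equally.
    pooled = (responses_a + responses_b) / (sent_a + sent_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: campaign A drew 120 responses from 2,000 emails,
# campaign B drew 165 responses from 2,000 emails.
z, p = two_proportion_z_test(120, 2000, 165, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the gap in response rates is unlikely to be chance alone. Deciding what to test, and what to do with the answer, is still up to you.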
More Data Analysis Techniques
Other data analysis techniques data scientists need to be familiar with include the following:
- Data mining
- Ensemble learning
- Genetic algorithms
- Machine learning
- Natural language processing (NLP)
- Neural networks
- Network analysis
- Pattern recognition
- Predictive modeling
- Sentiment analysis
- Signal processing
- Spatial analysis
- Supervised learning
- Time series analysis
- Time series forecasting
- Unsupervised learning
Some techniques, such as regression, are related more to statistics. Others are broad umbrella terms under which more specific techniques can be grouped. For example, cluster analysis is a form of unsupervised learning.
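Here's a rough sketch of what "unsupervised" means in practice, using a bare-bones one-dimensional k-means. Notice that the function never sees a label or a training set; it only groups values by how close they are. The function name and the order totals are invented for illustration, and real projects would normally reach for a library rather than hand-rolled code like this.

```python
def k_means_1d(values, k=2, iterations=10):
    """Split numbers into k groups by similarity; no labels are supplied."""
    ordered = sorted(values)
    # Deterministic starting centroids picked from across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            # Assign each value to the nearest centroid.
            nearest = min(range(k), key=lambda c: abs(value - centroids[c]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [
            sum(group) / len(group) if group else centroids[i]
            for i, group in enumerate(clusters)
        ]
    return clusters

# Hypothetical order totals in dollars: a few small baskets and a few large ones.
order_totals = [12, 15, 14, 95, 102, 98]
small, large = k_means_1d(order_totals)
print(small, large)
```

The algorithm finds the two groups on its own. Naming them "small baskets" and "large baskets" is an interpretation added afterward.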
You can perform data analysis without a deep knowledge of these techniques. This is possible because there are software packages to help you. Here's a simple example: let's say you need to run a regression analysis on your data points. All you need is Microsoft Excel and its built-in regression formula... though you probably can't get away with calling yourself a data scientist if that's all you do.
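For comparison, here's a minimal sketch of the arithmetic behind a simple linear regression: an ordinary least-squares fit of a straight line. The function name and the data are hypothetical, and I'm assuming a single input variable. Spreadsheet regression features do essentially this kind of calculation for you.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (x) and resulting sales (y), in thousands.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = fit_line(ad_spend, sales)
print(f"sales = {slope:.1f} * ad_spend + {intercept:.1f}")
```

Running the numbers is the easy part. Knowing whether a straight line is even the right model for your data is where the deeper knowledge comes in.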
So, What Are the Differences?
Now that we have a feel for what data analysis covers and how it's confused with data science, we can get down to discussing the differences. The short version is that data science includes and goes beyond data analysis. If you contrast data scientists with data analysts, the data scientists' goals are deeper and their area of concern is typically larger.
The Deeper Goal
A data scientist's ultimate goal is to discover new knowledge. In business, these insights may mean a huge edge for the company. Or it may mean a breakthrough in current methods, like a brand new analysis technique. Or it may mean a different paradigm altogether; maybe the data scientist discovers how to apply existing techniques in a novel way.
A data analyst need not go as deep. It would be nice for analysts to pursue excellence to such depths, but it's not really their goal. While data scientists think about what questions to ask or hypotheses they have before they do their analysis, a data analyst is primarily concerned with simply answering those questions.
In summary, data science is both the crafting of questions and the answering. Data analysis is mostly about the answering.
Furthermore, good data scientists need to constantly monitor how effective their techniques are. They need to think about how to increase the accuracy of their algorithms and how to integrate multiple data sources with platforms inside and outside their organization. In other words, a data scientist's day-to-day work is tied more closely to the organization's goals. This is true regardless of whether the organization is an online retail company or a research unit.
A Bigger Area of Concern
If you Google for information about data science, you get these commonly used Venn diagrams.
Data scientist Venn diagram by Drew Conway.
First off, let's set aside the question of which diagram is right and look for what they have in common. In both diagrams, data science sits at the intersection of multiple domains. In other words, data science is an interdisciplinary field. You need some programming and database skills. You need to master the specifics of the domain or business you're in. For example, if you worked in e-commerce, you'd need to learn the purchasing behavior of online shoppers and the mechanics of e-commerce itself. Statistics knowledge is a must. On top of this, you probably need to explain your ideas to other people, at the very least to fellow data scientists in your group.
The Deeper Goal Drives the Areas of Concern
Earlier, I mentioned that a data scientist works toward providing new insights. This deeper goal drives the data scientist to branch out into many areas, and in turn, that increases the chances of doing notable work, like having creative insights and inventing new analysis techniques. The data scientist may even repurpose existing technologies in novel ways.
A data analyst, on the other hand, doesn't have to be skilled in so many areas. In fact, depending on the nature of the analysis, it can be performed without any knowledge of specialties like programming, statistics, or even business fundamentals.
Let's use an imaginary scenario as an example. In it, the analysis tasks are straightforward and the data points are pristine. And in this scenario, you can even get an intern to perform data analysis using off-the-shelf software with minimal instructions. You wouldn't expect the intern to interpret the results. And guess what? Making sense of the results wouldn't be part of the data analyst's job. That's part of data science.
What Matters Is What's Next for You
So we've reached the end of this post. In it, we've examined the similarities and differences between data analysis and data science. The more important question now is what that means for your career in this field. Recall the shortage of data scientists mentioned earlier. That's not going to last forever.
To go far in this area, you need to acquire more knowledge and training. In fact, it may be wise to do some self-appraisal in the areas featured in the Venn diagrams.
Here are some useful steps for you to get ahead.
- Look at the three or four areas needed for data science: areas such as statistics, programming, communications, and business domain expertise. What are your strengths? What are your weaknesses?
- Pick your greatest strength among those four areas. Now, look for positions that require that trait. Suppose you're strongest in statistics. Use that to get into an analyst role for which being good at statistics is non-negotiable.
- From there, seek feedback from peers and your manager. Ask them which of the four traits they perceive as your strengths and weaknesses. Are your weaknesses holding you back? Or do you need to get even stronger in your competencies? Be prepared for either case.
- Keep leveling up after that. Rinse and repeat.
You don't need to be good at all of these areas to start your data science career today. Given the huge demand and easy access to data science courses such as ours at ASPE Training, this might be the golden age of data science you've been waiting for.
Published at DZone with permission of Erik Dietrich, DZone MVB.