An Introduction to Data Science
The definitions, skills required, and job roles around the emerging and rapidly growing field of data science.
Whenever someone talks about Data Science, the following questions come to mind:
What is Data Science?
What is the role of a Data Scientist?
What skills are required to enter the field of Data Science?
This article answers each of these questions.
According to Wikipedia, Data Science is defined as:
An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
In general, Data Science is a field that studies how information is gathered, what it conveys, and how it can be turned into meaningful insights that benefit a business as a whole.
You can see another explanation of Data Science in the Venn Diagram below:
The white area signifies Data Science, which is the intersection of the following skills: software engineering, domain expertise, and operational research.
When people talk about Data Science, you will often hear the buzzword "Data Scientist".
The role of a Data Scientist can be summarized as follows:
They are people who are good at coding and also have strong analytical skills.
In general, they are skilled in coding, statistics, data mining, and data visualization.
Data scientists broadly fall into two categories, which are explained in more detail at the end of this article:
Data Engineers
Analysts / Statisticians
Several programming languages are used for Data Science. Here are a few of them:
R
It is one of the most popular languages for Data Science.
It is an open source language.
Its IDE is also open source.
It has an excellent set of libraries.
Its main drawback is limited scalability.
Python
It is as popular as R.
It can serve a variety of purposes other than Data Science, such as developing Web applications.
It has a good set of libraries, packages, and tools built around it.
It scales better than R.
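As a small illustration of the kind of statistical work these languages are used for, here is a minimal Python sketch that relies only on the standard library's statistics module. The sales figures and the summarize helper are made up for illustration and are not part of any particular workflow.

```python
import statistics

# Hypothetical monthly sales figures, invented purely for illustration.
monthly_sales = [120, 135, 150, 110, 160, 145]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarize(monthly_sales)
print(summary)
```

In practice, larger data sets would usually be handled with dedicated data analysis libraries, but the idea of summarizing data before drawing conclusions stays the same.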
There are other tools such as SAS, SPSS, and MATLAB, but the main drawback of these packages is that they are not open source.
Apart from learning these languages, you also need domain expertise and good presentation skills.
Domain expertise is needed to understand what the data communicates. This knowledge can only be built through experience over time.
You should be able to visualize data and present it in such a way that even a layperson can understand it.
Broadly, the professional roles in the field of data science are:
Data Engineer: Writes code to perform ETL (Extract, Transform, Load) tasks and to implement the selected algorithms.
Analyst: Wrangles and analyzes data in order to identify patterns.
Statistician: Recommends the appropriate algorithms and strategies that fit the use case.
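To make the ETL and analysis tasks above more concrete, here is a minimal Python sketch using only the standard library (csv, io, and sqlite3). The sample data, the purchases table, and all function names are assumptions made for illustration. A real pipeline would typically read from files or external systems rather than an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """customer,amount
alice, 10.50
bob,20
alice,4.50
carol,not_a_number
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, convert amounts, skip unparseable rows."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        cleaned.append((row["customer"].strip().lower(), amount))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()


def totals_by_customer(conn):
    """Analysis step: total spend per customer, largest first."""
    query = (
        "SELECT customer, SUM(amount) FROM purchases "
        "GROUP BY customer ORDER BY SUM(amount) DESC"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
load(transform(extract(RAW_CSV)), conn)
print(totals_by_customer(conn))
```

The first three functions correspond to the Data Engineer's ETL work, while totals_by_customer stands in for the kind of pattern-finding query an Analyst might run on the loaded data.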
I hope this article helps anyone who is planning to enter the field of data science!