The 7 Types of Data Scientists
Are you an R-using number-cruncher, a domain specialist, or an old-school ML-er? Are you somewhere in between? Read on to find out.
I recently got a question on Quora asking, more or less, what exact skills companies look for when recruiting a Data Scientist and whether there is a definition of what exactly a Data Scientist is. Unsurprisingly, there is no single definition, as every company is solving its own set of problems. But I tried to put together a few generic job profiles that roughly fit the job descriptions at different companies.
There's way more variety, but I've narrowed it down to a few general profiles.
1. The R-Using Number-Cruncher
This type of Data Scientist can run quick group counts in R and Python. He or she is the coding counterpart of the Data Analyst from earlier days, and is mostly involved in automated report generation in the more analytical organizations.
Tools used: R (dataframes) and SQL.
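To give a feel for what a "quick group count" looks like, here is a minimal sketch using only the Python standard library. The `orders` records and their field names are made up for illustration; in practice this kind of count is more often written as a dataframe group-by or a SQL `GROUP BY`.

```python
from collections import Counter

# Hypothetical (region, product) records, invented for this example.
orders = [
    ("north", "widget"),
    ("south", "widget"),
    ("north", "gadget"),
    ("north", "widget"),
    ("south", "gadget"),
]

# Rows per region, roughly equivalent to:
#   SELECT region, COUNT(*) FROM orders GROUP BY region
counts_by_region = Counter(region for region, _ in orders)

# Rows per (region, product) pair for a two-level breakdown.
counts_by_pair = Counter(orders)

# Print regions from most to least frequent.
for region, n in counts_by_region.most_common():
    print(f"{region}: {n}")
```

The same counts could feed straight into an automated report, which is the day-to-day output of this profile.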
2. The Modeller
This type of Data Scientist has a deeply mathematical mind and can apply Bayesian or Frequentist inference and hierarchical models. I'm probably lumping too many people into a single group here, but the common theme is that mathematics forms the base of the work.
Tools used: R, Fortran, C++, and sometimes, functional languages.
3. The Data Engineer Who's an Occasional Data Scientist
Take a library from here, take some code from there, and make something good enough while you manage the data pipeline. This is a very common type of Data Scientist. Tasks include writing programs to automate report generation in Pandas, trying out simple Machine Learning models, and (nowadays) running a pre-trained neural network on the data.
Tools used: Python toolchain, Pandas, NLTK, and Keras.
4. The Tabular ML’er (AKA the XGBoost Specialist)
This type of Data Scientist can train multiple algorithms and stack models and optimize the heck out of them. These guys have deep expertise with running and optimizing standard algorithms like XGBoost, Ridge Regression, and (nowadays) Keras models.
Tools used: Python, R, XGB, and Keras.
5. The Old School ML-er
He or she is similar to the one above, but is not limited to categorical models, and is very good at feature engineering. This was the only kind of Machine Learning expertise around until the newer Deep Learning stuff came up.
Tools used: C++, Python, and scikit-learn.
6. The Deep Learning Guy
This type of Data Scientist needs a GPU system and a well-tagged dataset. They'll spend a lot of time trying out architectures and little to none on feature engineering, but the accuracy will be insane!
Tools used: Python, Theano, TensorFlow, and high-level libraries like Keras.
7. The Domain Specialist
He or she knows a lot about the domain and knows some things about linear models. This Data Scientist codes the domain information and trains a linear algorithm on top. This description includes mechanical engineers, analysts at different firms, and scientists in pure and applied sciences.
Tools used: Matlab, C++/Fortran, and R/Python — but different specialists will use very different tools.
8. The Newbie
The intern. Will evolve into whichever of the seven categories his or her mentor belongs to.
Most data companies will have all of these types of Data Scientists. Which category do you fall into?
Published at DZone with permission of Muktabh Srivastava. See the original article here.