With the current “big data” hype, there is huge demand for skilled and knowledgeable data scientists. Because demand is so high and the pool of skilled data scientists so limited, rates are skyrocketing, attracting more and more people from various backgrounds into this exciting field.
With clients and business owners who have very limited knowledge of the field, it is easy for so-called “fake data scientists,” lured by those high rates, to jump in and quite possibly ruin a project, because data science is often tied to the core business functionality of an application or site.
Here are a few real-life examples from my experience. Actual names are omitted because we, as a young company, are more interested in hiring developers than lawyers at this stage.
Person 1 had created a classification model and claimed it had 99 percent accuracy. Accuracy that good is always a red flag. Reviewing person 1's code made it clear they had simply loaded the whole dataset into Weka and run a single classification model. No data tidying, no exploratory analysis, no test set, no validation set: just one big, juicy dataset with a highly overfitted model on top of it. And ta-da! This was presented to the client as a huge success, and everybody was happy until new data arrived and the model's predictions turned out to be terrible.
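The pitfall is easy to demonstrate. Here is a minimal sketch (synthetic data, and a simple nearest-neighbour memorizer standing in for the overfitted Weka model) showing why accuracy measured on the training data itself is meaningless: a model that memorizes its training set scores perfectly on it, while a held-out test set reveals the real performance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two noisy, heavily overlapping classes,
# so no model can honestly reach near-perfect accuracy.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

# Hold out a test set instead of evaluating on the training data.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def predict_1nn(X_ref, y_ref, X_query):
    """1-nearest-neighbour classifier: it simply memorizes the reference set."""
    dists = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    return y_ref[np.argmin(dists, axis=1)]

# Evaluated on its own training data, a memorizing model looks perfect...
train_acc = np.mean(predict_1nn(X_train, y_train, X_train) == y_train)
# ...but the held-out test set reveals the real performance.
test_acc = np.mean(predict_1nn(X_train, y_train, X_test) == y_test)

print(f"train accuracy: {train_acc:.2f}")  # 1.00 by construction
print(f"test accuracy:  {test_acc:.2f}")   # substantially lower
```

This is exactly the gap that surfaced only when the client's new data arrived.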
Person 2 had an hourly rate most of us would kill for. His background was in financial trading, with a nice portfolio of interesting projects, so it was relatively easy for him to land a price-prediction project. After he burned through a huge chunk of that startup's budget, the predictions were still very poor, and on top of that, daily model creation was a long process running on a 32-core AWS machine, which is quite expensive.
When reviewing person 2's code, I noticed that the actual R code was written by a very skilled developer, but some feature-engineering decisions were really strange. For example, a date-time feature was simply cast to an integer value (!) and used in the model. I would expect the date-time value to be decomposed into features like day, month, and year, which are then used in model creation. Of course, after fixing this, the accuracy rose significantly. Finally, and most shocking of all, I found that 95 percent of the code was copy/pasted from another person's GitHub account. So, in this case, the formula was: take someone else's code from the net, add something that looks like data science work but is actually bad, and charge an astronomical fee to the unsuspecting client.
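The original work was in R, but the same decomposition is a one-liner per feature in pandas. A small sketch (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical price data with a raw timestamp column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-03-14", "2015-03-15", "2015-03-16"]),
    "price": [10.2, 10.5, 9.8],
})

# Casting the timestamp to an integer throws its structure away;
# decomposing it gives the model features it can actually learn from,
# such as weekly or monthly seasonality.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6

print(df[["year", "month", "day", "dayofweek"]])
```

These derived columns then go into model training in place of the raw integer cast.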
Losing money is not the worst thing that can happen. Time is a precious resource, and missing the famous “window of opportunity” can ruin your business. How to filter for and hire only great data scientists is probably a million-dollar question. However, there are a few ideas that could help you:
Check out some of the great interview questions for candidates given here. They will definitely stop a lot of the fakes.
Ask whether they know what Kaggle is, whether they have an account, and whether they have ever participated in a competition.
Check their Stack Overflow and GitHub accounts to see what their focus is and which technologies they use.
Ask about previous projects: find out what role the candidate played, how they solved issues like the “cold start” problem, how the process flowed from proof of concept to production, etc.
Finally, give them a short test project. For example, if your project involves classification, find a freely available dataset and ask them to perform data analysis and create a classification model. Once you get the report and code, look for the following:
Check for preprocessing steps on the dataset: cleaning, normalization, discretization, etc.
Is any exploratory analysis present? For example, are different plots used to visualize the data, are outliers identified, etc.?
Is the code book created?
Is a dimensionality reduction step present? This is quite important because “script kiddies” often know nothing about this step.
Are there separate training and test sets? How was the test set created: simple random sampling or stratified sampling?
Are any metrics and a reference (baseline) model present?
What about ensemble methods for improving classification results?
Are ML pipelines used? These are currently available in Apache Spark MLlib and in Python (e.g. scikit-learn). If yes, it's a big plus.
Is cross-validation used during model training? This is important because it helps find the best model parameters for the dataset.
Hopefully, the entire process explained above will be relatively easy and fun for real data scientists and a nightmare for fake ones. I tried it while working with one of my previous clients, and we were very happy with the results. In fact, one of the people who did very well on the test project became my boss on the data science team just two months after joining the company. Compare the situation where the whole team benefits and learns from a new data scientist with the one where the new hire is practically useless and ruins the whole project. I guess this is a pretty easy choice to make.