There is this tendency on the internet to pretend that everything is easy. You want to become a rocket scientist? Just read a few articles! You want to become president? Just start spewing your opinions online! You want to be a data scientist? Well, if you can add 1 and 1 you’re halfway there!
The thing is, not only is that untrue, but by pretending otherwise we end up steering a lot of people down the wrong path. Not everybody can be everything. If you’re just over four foot, you probably shouldn’t try to play professional basketball. Similarly, if you don’t have the right intellectual skill set, then you’re better off investing your efforts in a different direction than in chasing something that will ultimately feel like banging your head against a wall.
So what skills do you need in order to become a data scientist?
You Need to Be Able to Program
Sure, some companies might use a software package like MATLAB or Octave, but the majority of companies instead have their own data analysis software programmed in Java, Python, Scala or Ruby (here’s some great advice about how to program in Java). In order to use these correctly, you really should know how to program in them.
After all, when a client or your boss wants you to adjust a program or integrate a new algorithm into it and you don’t have the know-how to do so, you’re not going to look very good. In fact, at that moment you’re failing as a data scientist. Not because you don’t know the numbers, but because you can’t apply them.
Think you’ll pick these languages up on the job? It’s possible, but unless you’re a genius, it’s not likely. In fact, you should probably start learning them now, because otherwise, when they give you a program that is ten thousand lines long and expect you to modify it, it’s probably already too late to learn what you need to.
Big Data Software
The current data boom is hugely beneficial for finding out what is going on, why people do what they do, and how to either make money from that or deliver better services. That said, it does mean that data scientists suddenly have to work with an entirely new realm of software packages and computers.
The standard desktops that you are generally going to be familiar with won’t be able to deal with the data or the software packages that are out there. And so you’ll have to learn how to do distributed processing, and that means understanding MapReduce and distributed file systems and being able to use Hadoop. Don’t know what that is? Then you’d better find out!
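To give a feel for the idea behind MapReduce, here is a minimal sketch in plain Python that counts words on a single machine. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for illustration. What it does show is the shape of the pattern: a map step that emits key/value pairs, a grouping step, and a reduce step that combines the values for each key.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine every value seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a very large file.
    sample = ["big data is big", "data is ugly"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run on many machines at once, with the data sitting in a distributed file system. That is the part a desktop cannot do, and the part you would need to learn a tool like Hadoop for.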
Then there’s data cleaning. You see, when you’re following a course to learn how to become a data scientist, they’ll generally provide you with the data package right there, scrubbed of strange anomalies and weird entries.
That’s not how the real world works. There, the data is ugly. There, people and machines have done strange things. For that reason, you’ve got to know how to clean up your data: what you can eliminate without creating a problem, and when you would be corrupting your data simply by erasing or altering an entry.
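Here is a small sketch of what that can look like in Python. The record layout (an "id" and an "age" field), the age limits and the rejection reasons are all assumptions made up for this example. The point is the habit it demonstrates: instead of silently erasing odd entries, set them aside with a reason, so you can later check whether dropping them distorts the data.

```python
def clean_records(records, min_age=0, max_age=120):
    """Split raw records into usable rows and rejected rows with a reason."""
    kept, rejected, seen_ids = [], [], set()
    for rec in records:
        rec_id = rec.get("id")
        raw = rec.get("age")
        # Normalise the age to a stripped string so " 27 " and 27 look alike.
        raw_age = "" if raw is None else str(raw).strip()
        if rec_id is None:
            rejected.append((rec, "missing id"))
        elif rec_id in seen_ids:
            rejected.append((rec, "duplicate id"))
        elif not raw_age:
            rejected.append((rec, "missing age"))
        elif not raw_age.isdecimal():
            rejected.append((rec, "age is not a whole number"))
        elif not min_age <= int(raw_age) <= max_age:
            rejected.append((rec, "age out of range"))
        else:
            seen_ids.add(rec_id)
            kept.append({"id": rec_id, "age": int(raw_age)})
    return kept, rejected


if __name__ == "__main__":
    # Hypothetical messy input of the kind a course would have scrubbed for you.
    raw_rows = [
        {"id": 1, "age": "34"},
        {"id": 2, "age": ""},
        {"id": 3, "age": "430"},
        {"id": 1, "age": "34"},
        {"id": 4, "age": " 27 "},
    ]
    good, bad = clean_records(raw_rows)
    print(good)
    for row, reason in bad:
        print(reason, row)
```

Whether an age of 430 is a typo for 43 or plain garbage is a judgment call that no code can make for you, which is exactly why the rejected rows are kept around instead of deleted.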
Of course, you’ll also need to know how to tweak the data. And that means learning yet more, as that’s best done in UNIX. Are you familiar with it? Have you worked with commands like sed, grep, tr, cut, sort and awk, or with map/reduce jobs? If not, then add that to your list of things that you really want to have mastered before you actually get a data analysis job.
Probability and Statistics
What is a p-value? Is your feature dependent or independent? What are your confidence intervals and how do you set them up? Can you do an F-test? What is your standard error? What is the difference between mediators and moderators? How do you set up your hypotheses beforehand and how do you test them correctly?
Statistics is not easy and – at least initially – not a hell of a lot of fun. (It actually becomes incredibly interesting once you understand it, but you’ll probably have to take my word for it).
And yet it is absolutely vital if you want to be a data scientist, as this is how you get a big-picture understanding of a dataset and know whether you’re running your tests the way they are supposed to be run.
Why should you do that? Because otherwise other data scientists are going to poke holes in every theory you’ve come up with and in every way you’ve tested it.
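To make two of the questions above concrete, here is a minimal sketch that computes a standard error and a confidence interval for a sample mean, using only Python’s standard library. The function name mean_confidence_interval and the sample numbers are made up for illustration. It also rests on an assumption you should be able to defend: it uses the normal approximation, which is reasonable for large samples, while for a small sample a t-distribution would usually be the better choice.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, standard error, (low, high)) for a sample mean.

    Assumes the normal approximation is acceptable for this sample.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, std_err, (mean - z * std_err, mean + z * std_err)


if __name__ == "__main__":
    # Hypothetical measurements.
    observations = [2, 4, 4, 4, 5, 5, 7, 9]
    mean, std_err, (low, high) = mean_confidence_interval(observations)
    print(f"mean={mean:.2f} se={std_err:.3f} ci=({low:.2f}, {high:.2f})")
```

Being able to type this is the easy part. Knowing when the approximation breaks down is the part other data scientists will poke holes in.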
Don't Let Me Discourage You
Nothing I’ve said here is meant to discourage you. It is just meant to make you aware of what you’ll actually have to know. You see, most article writers aren’t really writing to educate you. Oh sure, they’ll take that as a bonus, but what they’re really trying to do is attract readers.
And you don’t attract readers by saying something is hard. For example, how many articles say that it is really difficult to develop leadership skills? Sure, they’re out there, but they’re very rarely popular. Instead, it’s the ones that make it seem easy that do well.
Don’t let that fool you. If it were easy, everybody would be doing it and nobody would be making any money.
Nothing in life that is worth doing is done easily. That’s what makes it worth doing. So, if you’re going to become a data scientist, more power to you! I’m happy to hear it. Just be ready to roll up your sleeves and really learn everything you need to know. Then, after that, you can join the chorus of people saying how easy it is, even though you know better.