
Am I a data scientist?



Last night I gave a very short talk (less than 5 minutes) at the Melbourne Analytics Charity Christmas Gala, a combined event of the Statistical Society of Australia, Data Science Melbourne, Big Data Analytics and Melbourne Users of R Network.

This is (roughly) what I said.

Statisticians seem to go through regular periods of existential crisis as they worry about other groups of people who do data analysis. A common theme is: all these other people (usually computer scientists) are doing our job! Don’t they know that statisticians are the best people to do data analysis? How dare they take over our discipline!

I take a completely different view. I think our discipline is in the best position it has ever been in. The demand for data analysis skills is greater than ever. Our graduates are highly sought after, and well paid. Being a statistician has even been described as a sexy profession (which presumably is a good thing to be!).

The different perspectives are all about inclusiveness. If we treat statistics as a narrow discipline, fitting models to data, and studying the properties of those models, then statistics is in trouble. But if we treat what we do as a broad discipline involving data analysis and understanding uncertainty, then the future is incredibly bright.

Here are two quotes from well-known bloggers in the last year or two:

April 2013: Larry Wasserman blog
Data science: the end of statistics?
If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

November 2013: Andrew Gelman blog
Statistics is the least important part of data science
There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics as a subset of data science …

Statistics is important – don’t get me wrong – statistics helps us correct biases … estimate causal effects … regularize so that we’re not overwhelmed by noise … fit models … visualize data … I love statistics! But it’s not the most important part of data science, or even close.

How can two professors of statistics have such different views on their discipline? The same perspectives can be seen in the following two diagrams (both reproduced with permission).

Source: Drew Conway, Sept 2010. Reproduced under a Creative Commons Licence.

In the first narrow view, to be a data scientist you have to know a great deal about statistics, mathematics, computer science, programming, and the application discipline. If that’s true, I’ve never met a data scientist. I don’t believe they exist.

In the second broader view, everyone here is a data scientist, although we have different specializations and different perspectives and training.

I take the broad inclusive view. I am a data scientist because I do data analysis, and I do research on the methodology of data analysis. The way I would express it is that I’m a data scientist with a statistical perspective and training. Other data scientists will have different perspectives and different training.

We are comfortable with having medical specialists, and we will go to a GP, endocrinologist, physiotherapist, etc., when we have medical problems. We also need to take a team perspective on data science.

None of us can realistically cover the whole field, and so we specialise in certain problems and techniques. It is crazy to think that a doctor must know everything, and it is just as crazy to think a data scientist should be an expert in statistics, mathematics, computing, programming, the application discipline, etc. Instead, we need teams of data scientists with different skills, each aware of the boundaries of their own expertise and of who to call in for help when required.

Let’s not be too sectarian about our disciplines, thinking everyone not trained in the same way we were is a heretic.

It reminds me of a famous joke, written by comedian Emo Philips:

I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. I immediately ran over and said “Stop! Don’t do it!”
“Why shouldn’t I?” he said.
I said, “Well, there’s so much to live for!”
“Like what?”
“Well … are you religious or atheist?”
“Religious.”
“Me too! Are you Christian or Jewish?”
“Christian.”
“Me too! Are you Catholic or Protestant?”
“Protestant.”
“Me too! What franchise?”
“Baptist.”
“Wow! Me too! Northern Baptist or Southern Baptist?”
“Northern Baptist.”
“Me too! Are you Northern Conservative Baptist or Northern Liberal Baptist?”
“Northern Conservative Baptist.”
“Me too! Are you Northern Conservative Fundamentalist Baptist or Northern Conservative Reformed Baptist?”
“Northern Conservative Fundamentalist Baptist.”
To which I said, “Die, heretic scum!” and pushed him off.



Published at DZone with permission of
