Hubris and the Data Scientist

Joe Brockmeier captures a recurring issue from the recent O’Reilly Strata conference, asking “Can Big Data replace domain expertise?” According to Brockmeier, the audience of data scientists narrowly agreed that their arsenal of tools and algorithms trumped the knowledge and experience of the meteorologists, financiers, and retailers to whose domains data scientists are increasingly turning.

This is a worrying attitude, and I can only hope that those who hold it realise the error of their ways before they make a catastrophic mistake that adversely affects the rest of us.

Data scientists are an increasingly capable bunch, and the tools at their disposal sometimes appear almost magical in their capability to derive insight. Competitions such as those run by Kaggle (more on them in a moment) clearly show that an aptitude for numbers and analysis can deliver some remarkable results, even when that analysis is being undertaken by individuals who lack specific domain expertise.

But to suggest that simply “letting the numbers speak for themselves” is an effective way to make real decisions is, quite simply, bonkers. Data is merely one input to an effective decision-making process. Prior knowledge, policy considerations, and an awareness of experimental bias, sampling error, and quaint notions such as ground truth continue to play a fundamental part.

Data scientists undeniably bring a wealth of skills to the table, but so do domain specialists. The domain specialists would be unwise to presume that they can continue to keep pace with exploding data volumes without judicious application of data science. But for data scientists to presume, even for a moment, that they and their algorithms can replace domain expertise is laughable.

Moving forward, we need both domain skills and data skills. Sometimes those skills may be present within a single individual, especially as practitioners within more data-intensive domains equip themselves with the skills required to continue functioning as data volumes blossom. At other times, they will be brought together in the makeup of a team that comprises domain experts and data scientists. It remains unclear whether it would normally be quicker or easier to teach a domain specialist data skills, or vice versa.

One is not ‘better’ than the other, and instead we need to concentrate on finding ways to make it as easy as possible for both groups to work together. How, for example, do we set about ensuring that conflicting use of superficially ‘obvious’ technical terms does not derail the process from the outset? How do we package and convey deep-seated presumptions from one group to the other, and how do we create the common space within which number-cruncher and specialist can work together?

Mike Driscoll, who chaired a debate on the issue at Strata, offers more, including one of the many ways in which domain knowledge remains a vital aspect of the process:

“One of the conclusions reached was that, when a problem is well-structured (or to Drew Conway’s point, when a good question is posed), it is much easier for machine learning to succeed. Kaggle’s strength as a contest platform is that domain experts have already framed the problem: they choose the features of the data to use (feature engineering or “feature creation”, as Monica Rogati calls it) as well as the criteria for success. This is the first, hardest step in any data science project. After this, machine learners can step in and develop the best algorithms for classifying and predicting new data (or, less usefully, explaining old data).”

In responding to Brockmeier’s post, Strata co-chair Alistair Croll also makes an important point:

“Of course, understanding which data to apply to a problem, and when to listen to the numbers, is a nuanced thing.

One thing about data is that it often has non-obvious, and disruptive, nuggets within it that threaten the status quo. And many ‘domain experts’ thrive on their political skills rather than their actual results. So part of the debate is really about housecleaning to replace anecdote with evidence—an uncomfortable cultural shift.”

Too true.

Data science and the data scientist are here to stay, and they bring tremendous value with them. But they’re an adjunct to domain knowledge, not a replacement for it.

Published at DZone with permission of
