What Are Data Scientists, and Are They Here to Stay?

With the data landscape evolving so fast, exactly what role do data scientists play in today’s organizations and how will technological advances affect this?

· Big Data Zone ·

Data scientists might have been awarded Harvard Business Review’s “sexiest job title” just five years ago, but with the data landscape evolving so fast, just what role do they play in today’s organizations and how will technological advances affect this?

To find out more about the so-called “unicorns” of BI, I caught up with Nir Regev, who heads up the Data Science and Algorithms Team here at Sisense.

Nir’s role is split between creating ever-more effective algorithms to support Sisense’s machine learning and AI capabilities and handling internal data projects at Sisense.

Okay Nir, let’s clear this up: just what is data science?

Data science is emerging into the spotlight in high tech. It covers so many issues! Basically, I see data science as a collection of methods to solve problems with data — this is as generic as you can get, but it’s true.

You start with a business problem described in the context of a business scenario or a use case, then data scientists start to collect data, and then there’s a series of steps that we apply to eventually develop a model or some kind of advanced statistical analysis to tackle the problem and provide a solution.

Thinking specifically about high-tech companies, what is the role of a data scientist there?

That really depends on the company. If they develop products, like Sisense, it usually means the data scientist contributes machine learning features to the product. That’s a very challenging job, as the product can be sold to many diverse clients with different use cases, and typically data scientists in this situation do not necessarily have their clients’ data. This makes it a non-typical and challenging job for a data scientist.

So what is a “typical” data science job?

Usually, you start with a business problem. Let’s say there’s a drop in one of the KPIs for sales or revenue, and you get a dataset describing the different KPIs. You are then asked to understand why that KPI dropped suddenly last month and, for example, to provide tools to predict it in the future.

This is a much more concrete problem because you get a clear definition of the problem. You have the use case, and you have access to the data.

What kind of projects are you working on right now?

One of our recent projects deals with the win rate. We saw some interesting changes in our win rate and wanted to understand more about the influencing factors, so we collected a lot of data from many different sources, including SalesForce and Gong [which records sales calls]. We also analyzed unstructured data, such as text from a free-text notes field in SalesForce. In the end, we had a bunch of different datasets that we combined to understand the prospect’s journey through the sales and marketing funnel, and we tried to predict, based on these data items, whether each opportunity would turn into a client.

And that’s fed back to the Sales Team?

That is a good question! Who is going to use it, and how? This is something that should be answered at the beginning of a data project, but data scientists are usually so enthusiastic about solving the problem that we don’t think ahead about how the solution will be implemented. How a data science solution will be used is part of the productization process, and it should usually be managed by the product team in collaboration with the internal client and the data science team.

In this case, we came up with a great prediction model that predicted the chance of an opportunity becoming a client with over 85% balanced accuracy. It has been validated in a number of different ways and it did quite well. But now we’re asking the question: how are we going to use it to increase the win rate? A prediction model can predict the future, but it can’t change it.
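For readers unfamiliar with the metric: balanced accuracy is usually defined, for a two-class problem, as the mean of the recall on each class, which keeps a model from looking good simply by predicting the majority class. The interview does not say how the 85% figure was computed, so the following is only a minimal sketch of that standard definition. The labels and the function name `balanced_accuracy` are hypothetical illustrations, not Sisense data or code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class.

    Assumes binary labels (1 or 0) and that both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of actual wins predicted as wins
    specificity = tn / (tn + fp)  # share of actual losses predicted as losses
    return (sensitivity + specificity) / 2


# Hypothetical labels: 1 = opportunity became a client, 0 = it did not.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this made-up example the plain accuracy is higher than the balanced accuracy, because the model misses half of the rare positive class.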

In this case, however, we realized that sales could use the model to prioritize their workload. So, if you have a thousand opportunities, maybe you want to take the 100 opportunities with the highest probability, or propensity, of becoming a client. It doesn’t completely solve the problem, but it’s a step in the right direction.
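The prioritization idea can be sketched in a few lines, assuming the model's output is available as a mapping from opportunity to predicted win probability. The identifiers, the scores, and the helper name `top_opportunities` are hypothetical.

```python
import heapq


def top_opportunities(scores, k):
    """Return the k opportunity ids with the highest predicted win probability."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical model output: opportunity id -> predicted probability of a win.
scores = {"opp-1": 0.12, "opp-2": 0.91, "opp-3": 0.47, "opp-4": 0.78}

shortlist = top_opportunities(scores, k=2)
```

With a thousand scored opportunities, the same call with `k=100` would give the shortlist described above.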

What are the biggest challenges for a data scientist?

I think, in general, the most important obstacle a data scientist encounters is not having a clear enough problem definition or use case definition. Sometimes, a manager will come to me and say: "Please uncover insights in the data." We have to say, "Okay, why do you want to uncover these insights? Do you have any hypothesis about what’s going on, like a drop or an increase in some KPI?"

The second is being able to sufficiently collect and process the right data.

Often, we face problems for which we don’t necessarily have the right data to feed the machine learning algorithms. For example, the data may contain collinearities, or it may lack important predictors that we haven’t been able to develop.

So, we need to ask: Do we have the data? What data items do we have? What kind of data sources do we have? Can we collect data from different sources and merge and combine them? What kind of pre-processing do we need to apply to the data (normalization, handling missing values, text processing, etc.)? Do we have a hypothesis about what the data is trying to tell us with regard to our problem?
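Two of the pre-processing steps named above, handling missing values and normalization, can be sketched as follows. This is one common choice among many (mean imputation and min-max scaling), not necessarily what the team used, and the column values and function names are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing value.
deal_sizes = [10.0, None, 30.0, 50.0]

filled = impute_mean(deal_sizes)
scaled = min_max_normalize(filled)
```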

A data scientist should have some intuition on how to solve the problem because you can’t just collect every single piece of data in the organization — it’s inefficient. You need to have some kind of hypothesis, some kind of working assumptions, in order to know what to focus on.

Can you give me an example?

In the case of the win rate, I was positive that the answers would come from SalesForce and Gong, since most of the funnel interactions are captured in these systems. We had a bunch of other data sources, but their data was incomplete and less related to the problem. In addition, I suspected that the text fields in SalesForce might yield powerful features, so I took the initiative to run some Natural Language Processing (NLP) algorithms to extract insights and features from the text. It turned out that a few of the most powerful predictors were hidden in there.

Ah, so you used NLP on the free text in SalesForce to pull certain phrases from there?

Exactly. The NLP packages in R, for example, can extract key phrases from the free text field. A package can calculate how frequently each key phrase occurs, and whether it is frequent enough to be considered a good classifier feature.

For instance, if something is mentioned across the entire set of free text fields, maybe it’s not very beneficial to use it, because it can’t really help separate opportunities from clients. We need something that’s mentioned enough but not in too many free text fields over our population of opportunities. This specific feature is called TF (Term Frequency).
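The filtering idea, keeping terms that appear in enough notes but not in nearly all of them, can be sketched with the standard library. Counting occurrences within one note is the term frequency; counting how many notes contain a term is usually called document frequency, and down-weighting terms with a high document frequency is the basis of the common TF-IDF weighting. The interview mentions R; this sketch uses Python, and the notes, thresholds, and function names are hypothetical.

```python
from collections import Counter


def term_frequency(doc):
    """Count how often each lowercase word occurs in a single note."""
    return Counter(doc.lower().split())


def document_frequency(docs):
    """Count in how many notes each word appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def candidate_terms(docs, min_share, max_share):
    """Keep words whose share of notes lies between the two bounds."""
    df = document_frequency(docs)
    n = len(docs)
    return sorted(t for t, c in df.items() if min_share <= c / n <= max_share)


# Hypothetical free-text notes, one per opportunity.
notes = [
    "call with budget owner",
    "call scheduled no budget",
    "call went well",
    "call again next week",
]

# "call" appears in every note, so it cannot separate outcomes and is dropped;
# words that appear in only one note are dropped as too rare.
terms = candidate_terms(notes, min_share=0.5, max_share=0.75)
```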

You mentioned that the data scientist is becoming a huge role, especially in tech companies. Where do you see the role going next?

I feel the data scientist role will shift from developing predictive or descriptive models to developing a meta machine learning model that can take data and a problem definition, plug in the right set of models, extract the right set of features, and automate a production-scale solution. Eventually, meta machine learning platforms will emerge. These platforms will autonomously develop machine learning-based solutions.

Data scientists will develop these platforms, and business users will use them to design, develop, and productize their machine learning solutions without any data science knowledge. It’s part of the democratization of data science through platforms and products.

Does that mean that the internal data scientist could die out?

Oh, this process of developing platforms and products to democratize data science is not going to happen so fast. Data scientists will still have plenty of job opportunities.

But eventually, data scientists will evolve into a job where they don’t solve a specific problem using specific data, but instead design generic algorithms that let users specify a business problem, plug in their raw datasets, and get a solution automatically, without the supervision of a data scientist.

Are there any other projects within Sisense that are heading in that direction?

At Sisense, we’re striving to democratize our data science and "gift wrap" it for customers, so that they can just choose a problem, specify their data and algorithms and the relevant features, and consume it as a self-service.

We’re currently working on an engine that allows you to uncover insights relating to a specific KPI in an automated, seamless manner, and that then provides features to present these insights to a wide range of roles in the organization in an interesting and effective way. The client’s users can then take these insights and make them actionable.

big data, data science, business intelligence

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
