Roles in Data Science Teams
In today’s world, it’s actually harder to say what cannot generate data rather than what can. Even no data can tell us something.
In 2017, Netflix changed its five-star rating system to a simple thumbs-up, thumbs-down. Now the service was recommending movies based on the match percentage, and people hated it. How can we reduce all the nuance that lives in cinematic art to a primitive binary reaction?
In reality, what Netflix found was that people were giving high ratings to the movies they believed were good, not necessarily the ones they really enjoyed watching. At least that’s what the data said. So how does data analysis work in organizations like Netflix, and what are the roles in data science teams?
Netflix Feedback System
Gibson Biddle is the former VP and chief product officer at Netflix. When talking about consumer insights, he explained an unexpected customer behavior that led to changing the whole rating system. In shifting to percentage match, Netflix acknowledged that while you may give a leave-your-brains-at-the-door Adam Sandler comedy only three stars, you enjoy watching it. And as much as you feel good about watching “Schindler’s List” and giving it five stars, it doesn’t increase your overall enjoyment, and keeping subscribers entertained is critical for Netflix. So, they simplified the feedback system to avoid bias. These insights into customers are impressive by themselves, and they wouldn’t be possible without two things: a culture that fosters the use of data and a powerful data infrastructure. In tech jargon, this is called a data-driven organization.
You have likely heard this buzz phrase hundreds of times, but what does it really mean? Netflix alone records more than 700 billion events every day, from logins and clicks on movie thumbnails to pausing the video and turning on subtitles. All this data is available to thousands of users inside the organization. Anyone can access it using visualization tools like Tableau or Jupyter, or they can get to it via a big data portal: an environment that lets users check reports, generate them, or query any information they need. Then this data is used to make business decisions, from smaller ones, like which thumbnails to show you, to really serious ones, like which shows Netflix should invest in next.
However, Netflix isn’t alone. According to some estimates, about 97% of Fortune 1000 businesses invest in initiatives including artificial intelligence and big data. Let’s have a look at the data infrastructure technology and the data engineers who make it work.
Data Infrastructure Technology
To describe how data infrastructure works, technicians borrowed the term “pipeline” from liquid and gas transportation. Similar to physical pipelines, data pipelines have their own origins, destinations, and intermediate stations, so it’s a pretty apt metaphor. The origin of data may be anything from clicks on a reserve button and pulling to refresh to conversation records with customer support, from vehicle tracking devices to turbine vibration sensors at power plants. In today’s world, it’s actually harder to say what cannot generate data rather than what can. Even no data can tell us something.
Once a data item is generated, it travels down its pipe to a staging area. This is the place where all raw data is kept. Raw data isn’t yet ready to be used. It must be prepared: you have to remove the errors from it, fill in the gaps, change its format, or merge data from different sources to get a more nuanced view. As soon as these operations are done, the data, now structured and clean, can continue on its journey. All these operations happen automatically. They are described in three words.
- Extract: pulling data from its origin and getting it to a staging area
- Transform: preparing data for use
- Load: pushing prepared data further
Together, these are known as ETL for short.
All prepared data falls into another storage, a data warehouse.
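The extract, transform, and load steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular company’s pipeline works: the sample events, the field names, and the in-memory SQLite database standing in for the warehouse are all hypothetical.

```python
import sqlite3

# Hypothetical raw events as they might arrive from different origins.
RAW_EVENTS = [
    {"user_id": "u1", "action": "click", "ts": "2021-03-01T10:00:00"},
    {"user_id": "u2", "action": "CLICK ", "ts": "2021-03-01T10:05:00"},
    {"user_id": None, "action": "click", "ts": "2021-03-01T10:06:00"},  # broken record
    {"user_id": "u1", "action": "pause", "ts": None},                   # gap to fill
]


def extract(source):
    """Extract: copy raw records from their origin into a staging area."""
    return list(source)


def transform(staged):
    """Transform: drop broken records, fill gaps, and unify formats."""
    clean = []
    for row in staged:
        if row["user_id"] is None:  # remove records we cannot attribute
            continue
        clean.append((
            row["user_id"],
            row["action"].strip().lower(),  # unify the format
            row["ts"] or "unknown",         # fill the gap
        ))
    return clean


def load(rows, warehouse):
    """Load: push prepared rows into a warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT, ts TEXT)"
    )
    warehouse.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    warehouse.commit()


# An in-memory SQLite database stands in for the warehouse (an assumption).
warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_EVENTS)), warehouse)

clicks_per_user = warehouse.execute(
    "SELECT user_id, COUNT(*) FROM events "
    "WHERE action = 'click' GROUP BY user_id ORDER BY user_id"
).fetchall()
print(clicks_per_user)
```

Keeping each step as a separate function mirrors the pipeline idea: every stage can be tested, scheduled, or replaced on its own.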
Unlike the staging area, a warehouse is a place where all stored records are structured and prepared for use, just like in a library with its classification system. Finally, you can query, visualize, and download information from a warehouse. To do that, you need business intelligence (BI) software. It presents data to final users.
These users, typically business analysts and data analysts, carry out essential tasks. They access data, explore it, visualize it, and try to make business sense of it. Did our marketing campaign work out well? What’s our worst-performing channel? They act like a sensory system, supporting an organization with historical data and delivering insights to management and, ultimately, anyone who makes decisions.
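A question like “What’s our worst-performing channel?” ultimately becomes a query against the warehouse, whether an analyst writes it by hand or a BI tool generates it. Here is a minimal sketch, assuming a hypothetical `campaign_results` table with made-up figures and, again, an in-memory SQLite database in place of a real warehouse.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; table and numbers are hypothetical.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE campaign_results (channel TEXT, spend REAL, revenue REAL)"
)
warehouse.executemany(
    "INSERT INTO campaign_results VALUES (?, ?, ?)",
    [
        ("email", 1000.0, 4000.0),
        ("search_ads", 5000.0, 9000.0),
        ("social", 3000.0, 2400.0),
        ("email", 500.0, 1500.0),
    ],
)

# Rank channels by revenue earned per unit of spend, worst first.
rows = warehouse.execute(
    """
    SELECT channel, SUM(revenue) / SUM(spend) AS return_on_spend
    FROM campaign_results
    GROUP BY channel
    ORDER BY return_on_spend ASC
    """
).fetchall()

worst_channel, worst_return = rows[0]
print(worst_channel, round(worst_return, 2))
```

Return on spend is only one possible definition of “worst-performing”; an analyst might just as well rank channels by conversions or customer lifetime value.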
Data engineers are in charge of building this whole pipeline. They are mostly tech people adept at what’s known as plumbing: moving data from its origins to destinations across the pipeline and transforming it on the way. They design pipeline architecture, set up ETL processes, configure the warehouse, and connect it with reporting tools. Airbnb, for instance, has about 50 data engineers. Sometimes you might encounter a more granular approach with several extra roles involved. Data quality engineers, for instance, make sure that data is captured and transformed correctly, because biased or incorrect data is too expensive when you’re trying to derive decisions from it. There may be a separate engineer responsible for ETL only, as well as a business intelligence developer who focuses solely on integrating reporting and visualization tools. However, reporting tools don’t make headlines, and data engineer wasn’t called the best job of the 21st century. But machine learning does, and data scientist was.
Machine Learning and Data Science
What everybody knows is that data science is particularly good at taking data and answering complex questions about it. How much will the company earn in the next quarter? How soon will your Uber driver arrive? How likely is it that you’ll enjoy “Schindler’s List” as much as “Uncut Gems”?
There are actually two ways of answering such questions. In the first, data scientists make use of BI tools and warehouse data, just as business analysts and data analysts do. Sometimes they also use a data lake: another type of storage that keeps unstructured raw data. They create a predictive model and produce a forecast that will be used by management. This is one-time reporting. It works for revenue estimates, but it doesn’t help with predicting the Uber arrival time.
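As an illustration of this kind of one-off forecast, the sketch below fits a straight line to past quarterly revenue with ordinary least squares and extrapolates one quarter ahead. The revenue figures are invented, and a real forecast would rely on a richer model and far more data; the point is only to show the fit-once, report-once workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical revenue (in millions) for six past quarters.
quarters = [1, 2, 3, 4, 5, 6]
revenue = [10.0, 11.5, 13.5, 14.0, 16.5, 18.0]

slope, intercept = fit_line(quarters, revenue)

# Extrapolate the trend to the next quarter and hand the number to management.
next_quarter = 7
forecast = slope * next_quarter + intercept
print(f"Forecast for quarter {next_quarter}: {forecast:.2f}M")
```

Nothing here runs again unless someone reruns the script, which is exactly why this approach suits a quarterly estimate and not an arrival-time prediction.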
The real value of machine learning is production models that work automatically and generate answers to complex questions regularly, sometimes thousands of times per second, and things are much more complicated with them.
Production ML Model
To make the model work, you also need an infrastructure, sometimes a big one. Data scientists explore data from warehouses and lakes, experiment with it, choose algorithms, and train models to come up with the final ML code. It takes a deep understanding of statistics, databases, machine learning algorithms, and the subject field.
Josh Wills, the former head of data engineering at Slack, once tweeted that a data scientist is a “person who is better at statistics than any software engineer and better at software engineering than any statistician.”
Imagine yourself self-isolating and ordering food on Uber Eats. Once you confirm your order, the app must estimate the time of delivery. Your phone sends your location, restaurant, and order data to a server with a delivery prediction ML model deployed. This data isn’t enough, so the model also gets additional data from a separate database that contains the average time it takes your restaurant to prepare a meal and a wealth of other details. Once all the data is there, the model returns a prediction to you. However, the process doesn’t stop there. The prediction itself gets saved in a separate database. When your delivery person shows up, the real time of arrival is also captured to record the ground truth. The model’s performance is monitored against it, and the model is explored via analysis tools so it can be updated later. All this data will eventually appear in a data lake and a warehouse.
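The loop described above (predict, log the prediction, record the ground truth, monitor the error) can be sketched as follows. Everything here is a toy assumption made for illustration: the `DeliveryPredictor` class, the fixed minutes-per-kilometer rule standing in for a trained model, and the dictionaries standing in for the separate databases. It says nothing about how Uber Eats actually implements this.

```python
from dataclasses import dataclass, field


@dataclass
class DeliveryPredictor:
    """Toy stand-in for a deployed delivery-time model (hypothetical)."""

    # Stand-in for the separate database of per-restaurant details.
    avg_prep_minutes: dict
    # Stand-in for learned model parameters.
    minutes_per_km: float = 3.0
    # Stand-ins for the prediction store and the ground-truth store.
    prediction_log: dict = field(default_factory=dict)
    ground_truth: dict = field(default_factory=dict)

    def predict(self, order_id, restaurant, distance_km):
        """Combine request data with stored features and log the result."""
        prep = self.avg_prep_minutes[restaurant]
        eta = prep + self.minutes_per_km * distance_km
        self.prediction_log[order_id] = eta
        return eta

    def record_arrival(self, order_id, actual_minutes):
        """Capture the real delivery time as ground truth."""
        self.ground_truth[order_id] = actual_minutes

    def mean_absolute_error(self):
        """Monitor the model: average gap between predicted and real times."""
        errors = [
            abs(self.prediction_log[order_id] - actual)
            for order_id, actual in self.ground_truth.items()
        ]
        return sum(errors) / len(errors)


predictor = DeliveryPredictor({"pizza_place": 15.0, "sushi_bar": 20.0})
eta_1 = predictor.predict("order-1", "pizza_place", distance_km=4.0)
eta_2 = predictor.predict("order-2", "sushi_bar", distance_km=2.0)

# Later, the couriers arrive and the actual times are recorded.
predictor.record_arrival("order-1", 30.0)
predictor.record_arrival("order-2", 25.0)
mae = predictor.mean_absolute_error()
print(eta_1, eta_2, mae)
```

In a production setting the monitored error would feed back into retraining, which is the “update it later” step from the paragraph above.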
In reality, the Uber Eats service alone uses hundreds of different models working simultaneously to score recommendations, rank restaurants in search, and estimate delivery time.
However, Adam Waxman, head of core technology at Foursquare, believes that there won’t be data scientists or ML engineers anymore, since we will keep automating model training and the building of production environments. Much of the data science work will become a common function inside software development.