
The Art of Feature Engineering in the World of Data Science


A look at how data science and big data teams can use data analysis, data modeling, and other techniques to solve real-world problems.


Shubham Agrawal, CEO of Balludas Information Technology, talks with Hari Shrawgi, manager of the Talent Acquisition team.

The Problem

Shubham: Hari, we have a lot of open positions to be backfilled, and our engineering teams are unhappy with the hiring turnaround.

Hari: But boss, we are doing our best to bring in good candidates and schedule interviews for the earliest possible date. The delay lies with the engineering team, which is slow to interview candidates and to provide feedback for the next rounds.

Shubham: Hari, we are one team and I don’t like excuses. You claim to have the best Data Scientists in your team. Why don’t you work with them to come up with some level of analysis and recommendation? It will help us set expectations.

Hari meets his team, Bhawna Bhardwaj, Mohit Jain, Preeti Patel, and Vasu Sharma. The team has expertise in building various forecasting and prediction models leveraging Data Science tools and techniques.

Brainstorming

Hari: Let’s discuss a high-level approach on how we can tackle the problem that Shubham brought up this morning.

Vasu: We have tons of data in our system. Let’s process it and perform time series analyses to build a forecast.

Preeti: Yes! Also, we can use decision trees or rule-based classifiers for recommendations.

Bhawna: I think we need to take a step back before we rush into a solution. Hari and a couple of us know the entire hiring process, so we have enough domain knowledge among ourselves. Let’s think about what to do next.

Mohit: Before we get into modeling, how about brainstorming the attributes we need and getting them ready for further processing?

Vasu: You mean feature engineering. But isn’t that time-consuming? You know our CEO, he wants results yesterday!

Preeti: Let’s bring a method to the madness. How about a fishbone diagram to identify all possible causes that could potentially influence the problem? This shouldn't take much time.

Feature Selection

[Figure: Hiring Cycle Time fishbone diagram]

Hari: It’s amazing that in such a short time we were able to cover all possible influencers.

Bhawna: Though we have identified all the attributes, we may not be able to get the data for every one of them.

Mohit: Yes, we have constraints such as data privacy, sanity of collected data, and more. But we will use data transformation techniques to minimize the challenges.

Vasu: Hari, we need to set the expectation that if we are unable to get key features, even after the data transformation that Mohit is referring to, the accuracy and precision of the model may be lower.

Preeti: We also need to be careful about over-fitting the model since we have all possible attributes and data.

Transforming Features


Bhawna: I just ran descriptive statistics on the data set we got. It looks like we have a few records where the expected salary and the job location are missing.

Mohit: For the salary, since only a few records are affected, we can work with the respective hiring team to fill it in. For the missing location, we can do similarity-based imputation, since we have data for the same hiring manager, similar jobs, responsibilities, and other matching criteria.
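Mohit's similarity-based imputation could be sketched as follows. This is a minimal illustration using only the Python standard library, not the team's actual method: the records, the field names (`manager`, `job`, `location`), and the choice of "most common location among matching records" as the similarity rule are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical requisition records; None marks a missing location.
records = [
    {"manager": "A", "job": "SAP Consultant", "location": "Pune"},
    {"manager": "A", "job": "SAP Consultant", "location": "Pune"},
    {"manager": "A", "job": "SAP Consultant", "location": None},
    {"manager": "B", "job": "Data Engineer", "location": "Chennai"},
    {"manager": "B", "job": "Data Engineer", "location": None},
    {"manager": "C", "job": "QA Analyst", "location": None},
]


def impute_location(rows, keys=("manager", "job")):
    """Fill a missing location with the most common location among
    rows that share the same values for `keys`."""
    # Count the known locations for each (manager, job) group.
    seen = {}
    for row in rows:
        if row["location"] is not None:
            group = tuple(row[k] for k in keys)
            seen.setdefault(group, Counter())[row["location"]] += 1

    # Copy each row and fill the gap only when a matching group exists.
    filled = []
    for row in rows:
        new_row = dict(row)
        group = tuple(row[k] for k in keys)
        if new_row["location"] is None and group in seen:
            new_row["location"] = seen[group].most_common(1)[0][0]
        filled.append(new_row)
    return filled


imputed = impute_location(records)
```

Records with no similar neighbor (manager "C" here) stay missing rather than receiving a guess, so they can still be routed back to the hiring team.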

Vasu: There are some rows where the class label itself, such as the time taken to shortlist the profile, is missing. We can’t do any imputation here, so it’s better to drop those records.

Preeti: We can convert the expected salary into 10 bins such as 25-50k, 50-75k, etc., and replace the numeric value with the categorical ID. This should help reduce noise and make the feature easier for the models to fit.
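The binning Preeti describes might look like the sketch below. The dialogue only gives "10 bins such as 25-50k, 50-75k", so the lower bound of 25k, the fixed width of 25k, and the decision to clamp out-of-range salaries into the first or last bin are assumptions; the function names are hypothetical.

```python
def salary_bin(salary, low=25_000, width=25_000, n_bins=10):
    """Return a bin ID in 0..n_bins-1 for an expected salary.
    Salaries below or above the covered range are clamped to the end bins."""
    index = int((salary - low) // width)
    return max(0, min(n_bins - 1, index))


def bin_label(bin_id, low=25_000, width=25_000):
    """Human-readable label for a bin ID, e.g. 0 -> '25-50k'."""
    start = (low + bin_id * width) // 1000
    return f"{start}-{start + width // 1000}k"


# Hypothetical expected salaries.
salaries = [30_000, 50_000, 74_999, 120_000, 400_000]
bin_ids = [salary_bin(s) for s in salaries]   # [0, 1, 1, 3, 9]
labels = [bin_label(b) for b in bin_ids]
```

Each bin includes its lower edge and excludes its upper edge, so 50,000 falls in the 50-75k bin.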

Identifying Key Features

Bhawna: We have identified so many features. It’s important that we select the vital few to improve the goodness of fit of our model and to limit over-fitting.

Mohit: Well, we have a couple of options. We can identify variables that are related to each other through correlation analysis.
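A correlation analysis of this kind can be sketched with the Pearson coefficient computed over every pair of numeric features. The feature names and values below are invented for illustration, and the 0.9 cutoff for calling a pair "strongly related" is an arbitrary assumption, not a recommendation from the team.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical numeric features, one value per requisition.
features = {
    "interview_rounds": [2, 3, 3, 4, 5, 5],
    "feedback_delay_days": [1, 2, 4, 3, 6, 7],
    "cycle_time_days": [20, 28, 35, 33, 50, 55],
}

# Correlation for every unordered pair of features.
correlations = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}

# Pairs above the (assumed) threshold are candidates for pruning or closer study.
strong_pairs = [pair for pair, r in correlations.items() if abs(r) >= 0.9]
```

Two input features that are strongly correlated with each other carry overlapping information, so one of them is often a candidate for removal.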

Vasu: Yes, but correlation doesn’t mean causation. To confirm the impact, we can conduct statistical hypothesis testing.
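Vasu does not say which test he has in mind. As one possible illustration, the sketch below uses a two-sided permutation test to check whether the mean cycle time differs between two groups of requisitions. The vendor data is made up, and the choice of a permutation test (rather than, say, a t-test) is an assumption made so the example needs only the standard library. A small p-value indicates a difference unlikely to arise by chance; on its own it still does not establish causation.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns an approximate p-value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical cycle times (in days) for requisitions served by two vendors.
vendor_x_days = [42, 55, 48, 60, 51, 58]
vendor_y_days = [30, 28, 35, 33, 27, 31]

p_value = permutation_test(vendor_x_days, vendor_y_days)
vendor_matters = p_value < 0.05  # 0.05 is a conventional, assumed threshold
```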

Preeti: We can also apply Design of Experiments. This helps not only in identifying key features but also in optimizing the outcome, which in our case is the cycle time.

Bhawna: Principal Component Analysis also helps in reducing the dimensionality. Like correlation analysis, it works only for numeric data. That’s why feature transformation helps: it converts categorical data into numeric data. For example, in our problem, instead of categorizing the level as ‘Associate, Senior, Principal’, we can number them.
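The numbering Bhawna mentions is an ordinal encoding. A minimal sketch follows, covering only the encoding step and not PCA itself. The 1-to-3 codes and the names `LEVEL_ORDER` and `LEVEL_CODE` are assumptions for illustration; note that any such numbering implicitly treats the gap between adjacent levels as equal.

```python
# Levels listed from most junior to most senior, as named in the dialogue.
LEVEL_ORDER = ["Associate", "Senior", "Principal"]

# Map each level to its rank: Associate -> 1, Senior -> 2, Principal -> 3.
LEVEL_CODE = {level: rank for rank, level in enumerate(LEVEL_ORDER, start=1)}

# Hypothetical level column from the hiring data set.
levels = ["Senior", "Associate", "Principal", "Senior"]

# A level outside LEVEL_ORDER raises KeyError instead of being silently mis-coded.
encoded = [LEVEL_CODE[level] for level in levels]   # [2, 1, 3, 2]
```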

Exploratory Data Analysis and Modeling

Mohit: Wow, even before we apply a model, we can visualize the trend and oscillation of individual features as well as the relationships between them.

Vasu: That’s the power of Exploratory Data Analysis. Visualization helps explain not just ‘why it happened’ but also points toward some level of recommendation.

Preeti: Very true. From our analysis, it seems to be difficult to get profiles for the Senior SAP Consultant role, especially from this vendor.

Bhawna: Maybe we can look at alternative vendors, and also work with the hiring manager to confirm whether a senior-level hire is really needed or whether we can adjust to the next level.

Hari: Awesome work, folks. I never knew feature engineering by itself could help identify causes and recommendations. While we continue working on implementing a suitable model, I’ll share this and the next steps with our CEO.

Conclusion

CEO Appreciation

Shubham: This is absolutely fantastic, team! It looks like a lot of work has already been done to identify the key challenges. While we explore this further, I'll work with the Engineering team on addressing some of the root causes. Good job again!

We have great predictive modeling algorithms and techniques available, but let’s not forget the foundation. Feature engineering helps us understand the problem from the perspective of the customer for whom we are building those models. After all, “Well begun is half done.”


Topics:
feature engineering, data science, modeling, correlation, big data

