The Art of Feature Engineering in the World of Data Science
A look at how data science and big data teams can use data analysis, data modeling, and other techniques to solve real-world problems.
Shubham Agrawal, CEO of Balludas Information Technology, talks with Hari Shrawgi, manager of the Talent Acquisition team.
Shubham: Hari, we have a lot of open positions to be backfilled, and our engineering teams are unhappy with the hiring turnaround.
Hari: But boss, we are doing our best to bring in good candidates and schedule interviews for the earliest possible date. The bottleneck is on the engineering side: interviews and feedback for the next rounds are not happening quickly enough.
Shubham: Hari, we are one team and I don’t like excuses. You claim to have the best Data Scientists in your team. Why don’t you work with them to come up with some level of analysis and recommendation? It will help us set expectations.
Hari meets his team, Bhawna Bhardwaj, Mohit Jain, Preeti Patel, and Vasu Sharma. The team has expertise in building various forecasting and prediction models leveraging Data Science tools and techniques.
Hari: Let’s discuss a high-level approach on how we can tackle the problem that Shubham brought up this morning.
Vasu: We have tons of data in our system. Let’s process it and perform time series analyses to build a forecast.
Preeti: Yes! Also, we can use decision trees or rule-based classifiers for recommendations.
Bhawna: I guess we need to take a step back before we rush into a solution. With Hari and a couple of us knowing the entire hiring process, we have enough domain knowledge amongst ourselves. Let’s think of what to do next.
Mohit: Before we get into modeling, how about brainstorming the attributes required and having them ready for further processing?
Vasu: You mean feature engineering. But isn’t it time-consuming? You know our CEO, he wants results yesterday!
Preeti: Let’s bring a method to the madness. How about a fishbone diagram to identify all possible causes that could potentially influence the problem? This shouldn't take much time.
Hari: It’s amazing that we were able to cover all possible influencers in such a short time.
Bhawna: Though we have identified all attributes, we may not be able to get the data.
Mohit: Yes, we have constraints such as data privacy, sanity of collected data, and more. But we will use data transformation techniques to minimize the challenges.
Vasu: Hari, we need to set the expectation that if we are unable to get key features, even after the data transformation that Mohit is referring to, the accuracy and precision of the model may be lower.
Preeti: We also need to be careful about over-fitting the model since we have all possible attributes and data.
Bhawna: I just ran descriptive statistics on the data set we got. Looks like we have a few records where the expected salary and the job location are missing.
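A check like the one Bhawna describes might look roughly like the sketch below. It uses only the Python standard library; the column names (`expected_salary`, `location`, `days_to_shortlist`) and the sample rows are invented for illustration and do not come from the team's data.

```python
from statistics import mean, median

# Hypothetical hiring records; column names and values are invented for illustration.
rows = [
    {"expected_salary": 60_000, "location": "Pune", "days_to_shortlist": 12},
    {"expected_salary": None, "location": "Chennai", "days_to_shortlist": 9},
    {"expected_salary": 85_000, "location": None, "days_to_shortlist": 15},
    {"expected_salary": 72_000, "location": "Pune", "days_to_shortlist": 10},
]

def missing_counts(rows):
    """Count missing (None) values per column."""
    columns = sorted({key for row in rows for key in row})
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}

def numeric_summary(rows, column):
    """Basic descriptive statistics for one numeric column, skipping missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

print(missing_counts(rows))
print(numeric_summary(rows, "expected_salary"))
```

On a real data set, a data-frame library would usually do this in a line or two; the plain-Python version is only meant to show what is being counted.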
Mohit: For the salary, since only a few records are affected, we can work with the respective hiring team to fill it in. Regarding the missing location, we can do similarity-based imputation since we have data for the same hiring manager, similar job, responsibilities, and other matching criteria.
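One simple reading of the similarity-based imputation Mohit mentions is to fill a missing location with the most common location among records that share the same hiring manager and role. The sketch below assumes that reading; the field names, the matching keys, and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical requisition records; field names and values are made up for illustration.
records = [
    {"manager": "A", "role": "SAP Consultant", "location": "Pune"},
    {"manager": "A", "role": "SAP Consultant", "location": "Pune"},
    {"manager": "A", "role": "Java Developer", "location": "Bengaluru"},
    {"manager": "B", "role": "SAP Consultant", "location": "Chennai"},
    {"manager": "A", "role": "SAP Consultant", "location": None},
]

def impute_location(records, keys=("manager", "role")):
    """Fill a missing location with the most common location among
    records that match on every key in `keys`. Returns new records;
    the input list is left unchanged."""
    filled = []
    for rec in records:
        if rec["location"] is None:
            matches = [
                other["location"]
                for other in records
                if other["location"] is not None
                and all(other[k] == rec[k] for k in keys)
            ]
            if matches:
                most_common = Counter(matches).most_common(1)[0][0]
                rec = {**rec, "location": most_common}
        filled.append(rec)
    return filled

imputed = impute_location(records)
print(imputed[-1]["location"])
```

Records with no similar neighbour are left as missing rather than guessed, which keeps the imputation conservative.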
Vasu: There are some rows where the class label itself, such as the time taken to shortlist the profile, is missing. We can’t do any imputation here. Hence, it’s better to drop those records.
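Dropping rows without a label is a simple filter. In this small sketch, `days_to_shortlist` is a made-up name standing in for the class label.

```python
# Hypothetical rows; "days_to_shortlist" stands in for the class label.
rows = [
    {"role": "SAP Consultant", "days_to_shortlist": 12},
    {"role": "Java Developer", "days_to_shortlist": None},
    {"role": "Data Analyst", "days_to_shortlist": 7},
]

def drop_unlabeled(rows, label="days_to_shortlist"):
    """Keep only rows whose label value is present."""
    return [row for row in rows if row.get(label) is not None]

labeled = drop_unlabeled(rows)
print(len(rows) - len(labeled), "row(s) dropped")
```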
Preeti: We can convert the expected salary into 10 bins such as 25-50k, 50-75k, etc., and replace the numeric value with the categorical bin ID. This will help reduce noise and improve the model fit.
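The binning Preeti proposes could be sketched as follows. The 25k starting point and 25k width follow her example; treating each bin as closed on the left and clamping out-of-range salaries into the first or last bin are assumptions made here for illustration.

```python
def salary_bin(salary, low=25_000, width=25_000, n_bins=10):
    """Map a salary to a bin ID in 0..n_bins-1.

    Bin 0 covers [25k, 50k), bin 1 covers [50k, 75k), and so on.
    Salaries outside the covered range are clamped to the nearest bin
    (an assumption; they could also be flagged as outliers).
    """
    index = int((salary - low) // width)
    return max(0, min(n_bins - 1, index))

for salary in (30_000, 50_000, 74_999, 300_000):
    print(salary, "->", salary_bin(salary))
```

Replacing the raw number with the bin ID trades precision for robustness, so the bin width is worth revisiting if salary turns out to be a key feature.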
Identifying Key Features
Bhawna: We have so many features identified. It’s important we select the vital few to improve our model’s goodness of fit and to limit over-fitting.
Mohit: Well, we have a couple of options. We can identify variables that are related to each other through correlation analysis.
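As a rough illustration of the correlation analysis Mohit suggests, the sketch below computes the Pearson coefficient for every pair of numeric features. The feature names and values are invented; strongly correlated pairs are candidates for keeping only one of the two.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (spread_x * spread_y)

# Hypothetical numeric features, one value per requisition.
features = {
    "interview_rounds": [2, 3, 3, 4, 5, 6],
    "open_positions": [10, 4, 8, 6, 9, 5],
    "days_to_hire": [18, 24, 26, 33, 40, 47],
}

for name_a, name_b in combinations(features, 2):
    r = pearson(features[name_a], features[name_b])
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```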
Vasu: Yes, but correlation doesn’t mean causation. To confirm that a feature has a real effect, we can conduct statistical hypothesis testing.
Preeti: We can also apply Design of Experiments. This helps not only in identifying key features but also in optimizing the outcome, in our case the cycle time.
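Vasu does not name a specific test. One option that needs few assumptions is a two-sided permutation test on the difference in mean cycle time between two groups, sketched below with made-up numbers for two hypothetical vendors. A small p-value suggests the difference is unlikely to be chance alone; on its own it still does not establish causation.

```python
import random

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns an approximate p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:size_a], pooled[size_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical days-to-shortlist for profiles from two vendors.
vendor_x = [30, 32, 35, 31, 34, 33]
vendor_y = [20, 22, 21, 19, 23, 24]

p_value = permutation_test(vendor_x, vendor_y)
print(f"p-value: {p_value:.4f}")
```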
Bhawna: Principal Component Analysis also helps in reducing the dimensionality. Like correlation analysis, it works only on numeric data. That’s why feature transformation helps: it converts categorical data into numeric form. For example, in our problem, instead of categorizing the level as ‘Associate, Senior, Principal’ we can number them.
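The numbering Bhawna describes is an ordinal encoding. A minimal sketch, assuming the levels are ordered Associate < Senior < Principal and numbered from 1 (the starting number is an arbitrary choice):

```python
# Assumed seniority order, lowest to highest.
LEVEL_ORDER = ["Associate", "Senior", "Principal"]
LEVEL_TO_NUMBER = {name: rank for rank, name in enumerate(LEVEL_ORDER, start=1)}

def encode_level(level):
    """Return the numeric rank for a job level; unknown levels raise KeyError."""
    return LEVEL_TO_NUMBER[level]

levels = ["Senior", "Associate", "Principal", "Senior"]
print([encode_level(level) for level in levels])
```

This only makes sense because the levels have a natural order. For unordered categories such as location, plain numbering would imply a ranking that does not exist.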
Exploratory Data Analysis and Modeling
Mohit: Wow, even before we apply a model, we can visualize the trend and oscillation for individual features as well as the relationships between them.
Vasu: That’s the power of Exploratory Data Analysis. Visualization helps not just in explaining ‘why it happened’ but also in suggesting some level of recommendation.
Preeti: Very true. From our analysis, it seems difficult to get profiles for the Senior SAP Consultant role, especially from this vendor.
Bhawna: Maybe we can look at alternative vendors, and also work with the hiring manager to check whether a senior level is really needed or whether we can adjust to the next level.
Hari: Awesome work, folks. I never knew feature engineering by itself could help identify causes and recommendations. While we continue working on implementing a suitable model, I’ll share this and the next steps with our CEO.
Shubham: This is absolutely fantastic, team! Looks like a lot of work has already been done to identify the key challenges. While we explore this further, I'll work with the Engineering team on addressing some of the root causes. Good job again!
We have great predictive modeling algorithms and techniques available, but let’s not forget the foundation. Feature engineering helps us understand the problem from the perspective of the customer for whom we are building those models. After all, “Well begun is half done.”
Opinions expressed by DZone contributors are their own.