Feature Engineering Transforming Predictive Models

Delve into the transformative power of feature engineering in applied machine learning, and learn how carefully crafted features can elevate your models.

By Sundeep Goud Katta · Sep. 25, 24 · Tutorial

Imagine you’re building a model to predict house prices with two models that are identical in every respect except one: the first uses raw data, while the second leverages thoughtfully engineered features like the age of the house, proximity to schools, and seasonal price trends. Which model do you think performs better? The answer is intuitive: the latter.

Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work more effectively. It bridges the gap between raw data and the insights needed to drive decision-making. In this article, we’ll explore how feature engineering can significantly impact the performance of your predictive models.

Predictive models are algorithms that forecast future outcomes based on historical data. They leverage various techniques such as regression (for predicting continuous outcomes), classification (for categorizing data), clustering (for grouping similar data), time series analysis (for sequential data), and more advanced methods like neural networks, reinforcement learning, and ensemble methods. These models identify patterns in past data to make informed predictions about new or unseen data.
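
To make this concrete, here is a minimal sketch (with made-up house-price data) of the basic loop: fit a model on historical records, then predict an unseen case.

Python
 
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical historical records: square footage and sale price
history = pd.DataFrame({
    'sqft':  [1200, 1500, 1800, 2100, 2400],
    'price': [200000, 245000, 290000, 335000, 380000]
})

# Learn the pattern from past data
model = LinearRegression()
model.fit(history[['sqft']], history['price'])

# Predict the price of a new, unseen house
print(model.predict(pd.DataFrame({'sqft': [2000]})))  # ~320000 on this perfectly linear toy data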

"Prediction is very difficult, especially if it's about the future."
- Niels Bohr

The Basics of Feature Engineering

What Is Feature Engineering?

At its core, feature engineering involves transforming raw data into meaningful features that better represent the underlying problem to the predictive models. These features help algorithms discern patterns and make accurate predictions.

Raw Data vs. Engineered Features

Raw Data is original, unprocessed data collected from various sources. It often contains noise and inconsistencies and lacks the structure required for effective modeling.

Engineered Features are derived attributes created by processing raw data. They encapsulate domain-specific knowledge and highlight relevant aspects of the data.

The Feature Engineering Workflow

  1. Data collection: Gather raw data from various sources.
  2. Data cleaning: Handle missing values, remove duplicates, and correct inconsistencies.
  3. Feature creation: Generate new features through transformations, aggregations, or domain-specific computations.
  4. Feature transformation: Apply scaling, encoding, or normalization techniques.
  5. Feature selection: Identify and retain the most relevant features for the model.
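
A compressed, hypothetical sketch of steps 2 through 5 on a tiny DataFrame (the column names, reference year, and thresholds are purely illustrative):

Python
 
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# Step 2 - Data cleaning: remove duplicates, fill missing values
raw = pd.DataFrame({'sqft': [1200, 1200, 1500, None, 2400],
                    'year_built': [1990, 1990, 2005, 2010, 2018]})
clean = raw.drop_duplicates()
clean = clean.fillna(clean.mean(numeric_only=True))

# Step 3 - Feature creation: derive a domain-specific attribute
clean['age'] = 2024 - clean['year_built']

# Step 4 - Feature transformation: scale everything to [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(clean), columns=clean.columns)

# Step 5 - Feature selection: drop near-constant columns
selected = VarianceThreshold(threshold=0.01).fit_transform(scaled)
print(selected.shape)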

Feature Transformation Techniques

Effective feature transformation can reveal hidden patterns and enhance model performance. Let’s explore some common techniques with practical examples using Python’s pandas and scikit-learn libraries.

1. Normalization and Scaling

Normalization and scaling are crucial techniques in preprocessing numerical data to ensure that features with different units or ranges don’t disproportionately influence the model. Normalization typically rescales values to a specific range, often [0, 1], making all features comparable and minimizing bias caused by large differences in magnitude. Scaling, particularly standardization, adjusts the distribution of values by centering the data around the mean and scaling it based on standard deviation, resulting in a mean of 0 and a standard deviation of 1. This is especially important for models that rely on distance metrics (like KNN or SVM) or gradient-based optimization (like neural networks) to avoid skewed results due to differing ranges in feature values.

Python
 
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample DataFrame
data = {'Age': [25, 32, 47, 51, 62],
        'Income': [50000, 60000, 80000, 90000, 120000]}
df = pd.DataFrame(data)

# Min-Max Scaling
scaler = MinMaxScaler()
df[['Age_scaled', 'Income_scaled']] = scaler.fit_transform(df[['Age', 'Income']])

print(df)

  • Output:
Python
 
   Age  Income  Age_scaled  Income_scaled
0   25   50000    0.000000        0.000000
1   32   60000    0.142857        0.142857
2   47   80000    0.428571        0.428571
3   51   90000    0.500000        0.500000
4   62  120000    1.000000        1.000000

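The snippet above imports StandardScaler but applies only min-max scaling; continuing with the same DataFrame, a minimal standardization sketch looks like this:

Python
 
# Standardization: center on the mean, scale by the standard deviation
std_scaler = StandardScaler()
df[['Age_std', 'Income_std']] = std_scaler.fit_transform(df[['Age', 'Income']])

print(df[['Age_std', 'Income_std']])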

2. Polynomial Features

Polynomial features expand the input feature set by adding higher-degree terms, such as squares, cubes, or interactions between features. This technique is particularly useful when the relationship between the features and the target variable is non-linear. For instance, in linear regression, adding polynomial terms allows the model to fit more complex curves rather than straight lines, improving the model’s ability to capture intricate patterns in the data. While polynomial features can significantly enhance the model’s performance on non-linear problems, they can also increase the complexity of the model and risk overfitting, so careful use and regularization are often necessary.

Python
 
from sklearn.preprocessing import PolynomialFeatures

# Original features
X = df[['Age_scaled']]

# Generate polynomial features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

poly_features = poly.get_feature_names_out(['Age_scaled'])
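# X_poly has two columns, [Age_scaled, Age_scaled^2]; keep the squared term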
df['Age_scaled_squared'] = X_poly[:, 1]

print(df)


  • Output:
Python
 
   Age  Income  Age_scaled  Income_scaled  Age_scaled_squared
0   25   50000    0.000000        0.000000             0.000000
1   32   60000    0.142857        0.142857             0.020408
2   47   80000    0.428571        0.428571             0.183673
3   51   90000    0.500000        0.500000             0.250000
4   62  120000    1.000000        1.000000             1.000000

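The example above squares a single feature; with two or more inputs, PolynomialFeatures also generates interaction terms. A short sketch reusing the scaled columns from the earlier DataFrame:

Python
 
# Degree-2 expansion of two features: x1, x2, x1^2, x1*x2, x2^2
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_inter = poly2.fit_transform(df[['Age_scaled', 'Income_scaled']])

print(poly2.get_feature_names_out(['Age_scaled', 'Income_scaled']))
# ['Age_scaled' 'Income_scaled' 'Age_scaled^2' 'Age_scaled Income_scaled' 'Income_scaled^2']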

3. Encoding Categorical Variables

In machine learning, most algorithms require numerical input, but real-world datasets often contain categorical variables — variables that represent categories or groups (e.g., color, city names, product type). Encoding categorical variables involves converting these text-based categories into numerical values so that machine learning models can process them. There are various methods for encoding, with two common techniques being one-hot encoding and label encoding. One-hot encoding creates new binary columns for each category, which is useful when categories have no ordinal relationship. Label encoding, on the other hand, assigns a unique integer to each category but may introduce unintended ordinal relationships. Choosing the appropriate encoding method is crucial for improving the performance and accuracy of the model, as poorly encoded categorical variables can negatively impact predictions.

  • One-hot encoding:
Python
 
# Sample DataFrame with categorical feature
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']}
df = pd.DataFrame(data)

# One-Hot Encoding using pandas
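# Note: pandas 2.0+ returns boolean dummy columns by default; pass dtype=int for the 0/1 integers shown in the output below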
df_encoded = pd.get_dummies(df, columns=['City'])

print(df_encoded)


  • Output:
Python
 
   City_Chicago  City_Los Angeles  City_New York
0             0                  0               1
1             0                  1               0
2             1                  0               0
3             0                  0               1
4             1                  0               0


  • Label Encoding:
Python
 
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['City_encoded'] = le.fit_transform(df['City'])

print(df)


  • Output:
Python
 
          City  City_encoded
0     New York             2
1  Los Angeles             1
2      Chicago             0
3     New York             2
4      Chicago             0

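When the same encoding must be fit on training data and reapplied to new rows (for example, inside a pipeline), scikit-learn's OneHotEncoder is the usual choice. A minimal sketch, assuming scikit-learn 1.2+ for the sparse_output argument:

Python
 
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder once, then reuse it on unseen data; unknown categories become all-zero rows
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['City']])

print(ohe.get_feature_names_out(['City']))
print(encoded)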

4. Log Transformations for Skewed Data

Log transformations are commonly used when dealing with skewed data — data that exhibits a long tail in one direction, either left (negatively skewed) or right (positively skewed). Skewed data can lead to models that perform poorly because they are overly influenced by extreme values. By applying a log transformation, you can compress the range of the data, making it more normally distributed, which helps certain algorithms (like linear regression) perform better. This technique is particularly helpful when dealing with variables like income or sales, where a small number of high values can disproportionately impact the model. It stabilizes variance and reduces the impact of outliers.

Python
 
import numpy as np

# Sample skewed data
data = {'Sales': [100, 150, 200, 250, 300, 1000]}
df = pd.DataFrame(data)

# Apply log transformation
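# Note: np.log requires strictly positive values; np.log1p (log of 1 + x) is a common alternative when zeros are present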
df['Sales_log'] = np.log(df['Sales'])

print(df)


  • Output:
Python
 
   Sales  Sales_log
0    100    4.605170
1    150    5.010635
2    200    5.298317
3    250    5.521461
4    300    5.703782
5   1000    6.907755


Feature Selection

Not all features contribute positively to the model’s performance. Feature selection involves identifying and retaining the most relevant features while discarding the rest. This process can prevent overfitting, reduce complexity, and improve model interpretability.

1. Variance Threshold

A variance threshold is a simple feature selection technique used to remove features with low variability, which typically contribute little to model performance. Features with zero or near-zero variance have nearly identical values across all data points, meaning they provide minimal information and are unlikely to help the model make distinctions between different classes or predict the target variable. By applying a variance threshold, we can filter out these low-variance features, reducing model complexity and potentially improving both training speed and prediction accuracy.

Python
 
from sklearn.feature_selection import VarianceThreshold

# Sample DataFrame
data = {'Feature1': [0, 0, 0, 0, 0],
        'Feature2': [1, 2, 3, 4, 5],
        'Feature3': [10, 10, 10, 10, 10]}
df = pd.DataFrame(data)

# Apply Variance Threshold
selector = VarianceThreshold(threshold=0.1)
selector.fit(df)

# Get columns to keep
cols = df.columns[selector.get_support()]
df_selected = df[cols]

print(df_selected)


  • Output:
Python
 
   Feature2
0         1
1         2
2         3
3         4
4         5


2. Correlation Matrix

A correlation matrix is a table that shows the correlation coefficients between multiple variables in a dataset. Correlation values range from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation. In feature selection, a correlation matrix helps identify highly correlated features, which can introduce redundancy and multicollinearity in models like linear regression. By examining the matrix, you can remove one of the features that are highly correlated (typically above 0.95) to simplify the model without losing much predictive power. This step helps reduce overfitting and improves the interpretability of the model.

Python
 
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [2, 4, 6, 8, 10],
        'C': [5, 3, 6, 9, 12],
        'D': [5, 3, 6, 9, 12]}
df = pd.DataFrame(data)

# Compute correlation matrix
corr_matrix = df.corr().abs()
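
# Optional: visualize the matrix as a heatmap (this is what the seaborn/matplotlib imports are for)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()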

# Select upper triangle
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation > 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features
df_reduced = df.drop(columns=to_drop)

print("Features to drop:", to_drop)
print(df_reduced)


  • Output:
Python
 
Features to drop: ['B', 'D']
   A  C
0  1  5
1  2  3
2  3  6
3  4  9
4  5 12


3. Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a feature selection technique that works by recursively fitting a model and eliminating the least important features based on the model's coefficients or importance scores. It starts with all features and systematically removes the least significant one, retraining the model each time until the desired number of features is reached. RFE is commonly used with linear models, decision trees, or random forests to rank and retain the most relevant features for the problem at hand. This method ensures that the model only uses the most valuable features, which can improve performance, reduce overfitting, and enhance interpretability.

Python
 
from sklearn.datasets import load_boston
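# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions,
# substitute another regression dataset (e.g., fetch_california_housing) and adjust feature names accordingly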
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Load dataset
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

# Initialize model
model = LinearRegression()

# Initialize RFE
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)

# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected Features:", selected_features.tolist())


  • Output:
Python
 
Selected Features: ['RM', 'PTRATIO', 'LSTAT', 'DIS', 'NOX']


Automation With Feature Engineering Tools

Manually crafting features can be time-consuming, especially with large datasets. Automation tools can expedite this process, though they come with their own set of advantages and limitations.

FeatureTools: Automated Feature Engineering

FeatureTools is an open-source Python library for automated feature engineering.

Python
 
import pandas as pd
import featuretools as ft

# Sample data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'join_date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-20'])
})

transactions = pd.DataFrame({
    'transaction_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 2],
    'amount': [250, 450, 300, 150, 500],
    'transaction_date': pd.to_datetime(['2023-01-10', '2023-02-20', '2023-01-15', '2023-03-25', '2023-02-22'])
})

# Create EntitySet
es = ft.EntitySet(id='Customers')

# Add entities
es = es.add_dataframe(dataframe_name='customers',
                      dataframe=customers,
                      index='customer_id',
                      time_index='join_date')

es = es.add_dataframe(dataframe_name='transactions',
                      dataframe=transactions,
                      index='transaction_id',
                      time_index='transaction_date')

# Define relationship
relationship = ft.Relationship(es, 
                               parent_dataframe_name='customers', 
                               parent_column_name='customer_id',
                               child_dataframe_name='transactions',
                               child_column_name='customer_id')

es = es.add_relationship(relationship)

# Automatically generate features
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name='customers',
                                      agg_primitives=['sum', 'mean', 'count'],
                                      trans_primitives=['year', 'month'])

print(feature_matrix)


  • Output:
customer_id  join_date   transactions.sum(amount)  transactions.mean(amount)  transactions.count(transaction_id)  transactions.year(transaction_date)  transactions.month(transaction_date)
1            2023-01-01  550                       275                        2                                    2023                                 1
2            2023-02-15  950                       475                        2                                    2023                                 2
3            2023-03-20  150                       150                        1                                    2023                                 3


Pros

  • Efficiency: Rapidly generates a large number of features
  • Consistency: Applies standardized transformations
  • Scalability: Handles large datasets with ease

Cons

  • Overfitting risk: Automated features might introduce noise or redundant information.
  • Lack of domain insight: May miss domain-specific nuances that manual feature engineering can capture.
  • Computational overhead: Generating numerous features can be resource-intensive.

Conclusion

Feature engineering stands as a cornerstone of applied machine learning. By transforming and selecting the right features, you empower your models to recognize intricate patterns and deliver accurate predictions. Whether through manual ingenuity or automation tools, the essence remains the same: understanding your data and domain is paramount.

As machine learning continues to permeate diverse industries, the ability to craft meaningful features will distinguish adept practitioners from the rest. Embrace feature engineering not just as a task, but as an art that blends data science with domain expertise to sculpt models that truly resonate with real-world complexities.

Happy Feature Engineering!

Data science Feature engineering Feature selection Machine learning Python (language)

Opinions expressed by DZone contributors are their own.
