Make ML Models Work: A Real-World Take on Size and Imbalance
To overcome common NLP hurdles such as large model sizes and imbalanced data, this project used TF-IDF, class weighting, and simpler models to balance accuracy and efficiency.
The Initial Hurdle
In NLP classification tasks, especially those involving text descriptions, it's common to encounter two significant hurdles: large model sizes and imbalanced datasets. A large model can be difficult to deploy and manage, while class imbalance can lead to poor predictive performance, particularly for minority classes.
Imagine you are building a system to automatically categorize product descriptions into various product categories. The project began with a dataset of close to 40,000 records, where each record contained a short product title and a longer product description, along with its corresponding category. The initial Random Forest model, while achieving a decent accuracy of around 70%, ballooned to a whopping 11 GB.
Further experimentation revealed that tuning the model parameters caused a drastic drop in accuracy (to 14%), rendering the model practically useless. This situation highlights the core challenges:
- Large model size: An 11 GB model is unwieldy and resource-intensive.
- Class imbalance: Uneven distribution of product categories makes it hard for the model to learn effectively.
- Overfitting: Parameter tuning leads to a significant drop in accuracy, indicating that the model is overfitting the training data.
The Solutions Explored
To address these issues, the focus was on reducing the model size and improving its accuracy. Here's a rundown of the techniques tried, the challenges faced, and what ultimately worked:
1. Addressing Class Imbalance
Resampling Techniques (SMOTE and ADASYN)
Oversampling the minority classes using SMOTE (Synthetic Minority Oversampling Technique) and ADASYN (Adaptive Synthetic Sampling Approach) was explored. The goal was to artificially increase the number of samples in the less frequent categories. However, several issues were encountered:
- `ValueError: Expected n_neighbors <= n_samples_fit`. This error occurred when the number of nearest neighbors required by SMOTE/ADASYN exceeded the number of available samples in a minority class.
- Low accuracy. Even when the error was avoided, the resulting models often had lower accuracy than the baseline, suggesting that the synthetic samples were introducing noise rather than improving the model's ability to generalize. The oversampled data sometimes led to the model "memorizing" the synthetic examples, leading to poor performance on unseen data.
Here is sample code using SMOTE:
```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Assuming X is your features and y is your labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Optionally pass k_neighbors=... for very small minority classes
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Shape of X_train before SMOTE: {X_train.shape}")
print(f"Shape of X_train after SMOTE: {X_train_resampled.shape}")
```
Here are some key considerations for the SMOTE method:
- Apply resampling only to the training data (not the testing set) to prevent data leakage.
- Carefully tune the `k_neighbors` parameter in SMOTE/ADASYN. A value that is too high can trigger the error above, while a value that is too low might not generate effective synthetic samples.
- Use stratified sampling during the train/test split to maintain class proportions. This ensures that the class distribution in the training and testing sets is representative of the overall dataset.
Cost-Sensitive Learning
Higher misclassification costs were assigned to minority classes by using the `class_weight='balanced'` or `class_weight='balanced_subsample'` parameter in the Random Forest model. This tells the model to penalize misclassifying minority-class examples more heavily.
```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,                   # Or your chosen number of trees
    class_weight='balanced_subsample',  # Or 'balanced'
    random_state=42,
    n_jobs=-1                           # Use all available cores
)
```
Threshold Adjustment
Adjusting the classification threshold to favor the minority class was considered. This involves analyzing the predicted probabilities and choosing a threshold that works well for your specific problem. However, this requires careful evaluation on a validation set to avoid overfitting.
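To illustrate, here is a minimal sketch of threshold adjustment, assuming a fitted classifier `clf` that exposes `predict_proba` and a held-out validation set `X_val`; the class index and the 0.3 threshold are hypothetical values you would tune on validation data, not settings from the project:

```python
import numpy as np

# Assumes clf is a fitted classifier exposing predict_proba
# and X_val is a held-out validation set.
proba = clf.predict_proba(X_val)

# Hypothetical example: favor the minority class at index 2 by
# accepting it whenever its probability clears a lower threshold.
minority_idx = 2
threshold = 0.3  # tune on the validation set, never the test set

preds = np.argmax(proba, axis=1)
preds[proba[:, minority_idx] >= threshold] = minority_idx
```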
2. Improving Model Accuracy
Feature Engineering
Given the text data, more informative features were created:
- TF-IDF (Term Frequency-Inverse Document Frequency) with N-grams: This technique captures the importance of words in a document relative to the entire corpus, while N-grams capture phrases and word order information.
- Word embeddings (Word2Vec, GloVe, FastText): These represent words as dense vectors, capturing semantic relationships between words.
- Length of description: The length of the text description itself could be a useful feature.
- Number of keywords: The occurrence of specific keywords related to different product categories was counted.
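As a rough sketch of combining several of these ideas, TF-IDF n-gram features can be stacked with handcrafted columns such as description length and keyword counts; the `texts` variable and the keyword list below are illustrative assumptions, not the project's actual values:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# texts is assumed to hold the raw description strings
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_tfidf = tfidf.fit_transform(texts)

# Hypothetical handcrafted features: text length and keyword hits
keywords = ["laptop", "pot", "fan", "bag", "shelf"]  # illustrative only
lengths = np.array([[len(t)] for t in texts])
keyword_counts = np.array(
    [[sum(t.lower().count(k) for k in keywords)] for t in texts]
)

# Stack the sparse TF-IDF columns with the dense handcrafted columns
X = hstack([X_tfidf, lengths, keyword_counts]).tocsr()
```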
Model Complexity and Regularization
The complexity of the Random Forest model was reduced by:
- Reducing `n_estimators`: The number of trees in the Random Forest is a major contributor to its size.
- Limiting `max_depth`: Restricting the maximum depth of the trees prevents them from becoming overly complex.
- Increasing `min_samples_split` and `min_samples_leaf`: These parameters control the minimum number of samples required to split a node and to sit at a leaf node, respectively. Increasing them forces the trees to be less specific.
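A minimal sketch of a Random Forest constrained along these lines; the specific values are illustrative starting points rather than tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only -- tune against a validation set
compact_rf = RandomForestClassifier(
    n_estimators=50,        # fewer trees -> smaller model
    max_depth=20,           # cap tree depth
    min_samples_split=10,   # require more samples before splitting
    min_samples_leaf=5,     # require more samples per leaf
    class_weight='balanced_subsample',
    random_state=42,
    n_jobs=-1
)
```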
Model Selection
Simpler models like Multinomial Naive Bayes, Logistic Regression, and Linear SVM were considered, as Random Forest can be prone to overfitting. Gradient Boosting Machines (GBM) and XGBoost were also tried.
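A hedged sketch of how such candidates might be compared with cross-validation, assuming `X_train` and `y_train` hold the raw text and labels from the earlier split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000,
                                              class_weight='balanced'),
    "linear_svm": LinearSVC(class_weight='balanced'),
}

for name, clf in candidates.items():
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words='english', max_features=5000)),
        ("classifier", clf),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=3, scoring='accuracy')
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```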
Hyperparameter Optimization
Cross-validation was used to evaluate model performance and tune hyperparameters. Randomized search and Bayesian optimization were employed to efficiently explore the hyperparameter space.
```python
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),  # Adjust max_features
    ('classifier', LogisticRegression(solver='liblinear', multi_class='ovr',
                                      random_state=42, class_weight='balanced'))
])

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2']
}

randomized_search = RandomizedSearchCV(pipeline,
                                       param_distributions=param_grid,
                                       n_iter=10,
                                       cv=3,
                                       scoring='accuracy',
                                       random_state=42,
                                       n_jobs=-1)
randomized_search.fit(X_train, y_train)
```
3. Reducing Model Size
Reducing Vocabulary Size
The vocabulary size of the text vectorizer was limited using `max_features` in `TfidfVectorizer`.
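For instance, capping the vocabulary directly bounds the width of the feature matrix (a small sketch, assuming `texts` holds the raw documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 5,000 highest-scoring terms across the corpus
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(texts)
print(f"Feature matrix shape: {X.shape}")  # columns capped at 5000
```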
Quantization
Model quantization techniques were explored to reduce model size, but this required using a different framework like TensorFlow Lite or ONNX Runtime, which was too much work for such a small task.
What Worked Ultimately
Ultimately, a significantly smaller model size was achieved while maintaining acceptable accuracy by:
- Removing infrequent classes. Removing categories with fewer than X occurrences (the exact threshold will depend on your data) significantly improved model performance. This created a more stable and manageable dataset, despite reducing its overall size.
- Combining textual information. The product title and product description were combined into a single text field. This allowed the model to learn relationships between the concise title and the more detailed description.
- TF-IDF with a limited vocabulary. Using TF-IDF with a limited vocabulary size (`max_features=5000`) struck a good balance between accuracy and model size, and prevented the model from overfitting to rare or irrelevant words.
A sketch combining these three steps follows the sample data below.
Here's a sample of the data:
| product_title | product_description | product_category |
|---|---|---|
| Laptop | RAM size foldable screen, touch | Electronics & Computers |
| Nonstick pot | Scratched after a month, but maybe that’s my fault for using a metal spoon | Kitchen & Cookware |
| Ceiling Fan | Remote works from my bed—life is good. | Home & Appliances |
| Laptop Bag | Water-resistant? Yes. Waterproof? Nope. | Accessories & Bags |
| BookShelf | Assembly took me an hour and my sanity, but it looks great! | Furniture & Storage |
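Putting the pieces together, here is a minimal sketch of the final approach, assuming a pandas DataFrame with the columns shown above; the file name and the minimum-occurrence threshold are hypothetical placeholders:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("products.csv")  # hypothetical file with the columns above

# 1. Remove infrequent categories (placeholder threshold -- tune per dataset)
min_occurrences = 50
counts = df["product_category"].value_counts()
df = df[df["product_category"].isin(counts[counts >= min_occurrences].index)]

# 2. Combine title and description into a single text field
df["text"] = df["product_title"] + " " + df["product_description"]

# 3. TF-IDF with a capped vocabulary feeding a simple linear model
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["product_category"],
    test_size=0.2, random_state=42, stratify=df["product_category"]
)
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("classifier", LogisticRegression(solver="liblinear",
                                      class_weight="balanced",
                                      random_state=42)),
])
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```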
Lessons Learned
This experience underscored several key lessons:
- Data cleaning is paramount. Thoroughly examine your data, clean it, and handle missing values appropriately. Inconsistent data can significantly hinder model performance.
- Class imbalance requires careful consideration. While resampling techniques can be helpful, they're not always the best solution. Sometimes, a more aggressive filtering approach can be more effective.
- Feature engineering matters. Combining existing features or creating new ones can significantly improve model accuracy.
- Model complexity is a trade-off. Simpler models are often more robust and easier to manage than complex ones.
- Experimentation is essential. Don't be afraid to try different approaches and evaluate their performance.
By systematically addressing these issues, you can build NLP classification models that are both accurate and manageable.