Make ML Models Work: A Real-World Take on Size and Imbalance
To overcome common NLP hurdles such as large model sizes and imbalanced data, this project used TF-IDF, class weighting, and simpler models to balance accuracy and efficiency.
The Initial Hurdle
In NLP classification tasks, especially those involving text descriptions, it's common to encounter two significant hurdles: large model sizes and imbalanced datasets. A large model can be difficult to deploy and manage, while class imbalance can lead to poor predictive performance, particularly for minority classes.
Imagine you are building a system to automatically categorize product descriptions into various product categories. The project began with a dataset of close to 40,000 records, where each record contained a short product title and a longer product description, along with its corresponding category. The initial Random Forest model, while achieving a decent accuracy of around 70%, ballooned to a whopping 11 GB.
Further experimentation revealed that tuning the model parameters caused a drastic drop in accuracy (to 14%), rendering the model practically useless. This situation highlights the core challenges:
- Large model size: An 11 GB model is unwieldy and resource-intensive.
- Class imbalance: Uneven distribution of product categories makes it hard for the model to learn effectively.
- Overfitting: Parameter tuning leads to a significant drop in accuracy, indicating that the model is overfitting the training data.
The Solutions Explored
To address these issues, the focus was on reducing the model size and improving its accuracy. Here's a rundown of the techniques tried, the challenges faced, and what ultimately worked:
1. Addressing Class Imbalance
Resampling Techniques (SMOTE and ADASYN)
Oversampling the minority classes using SMOTE (Synthetic Minority Oversampling Technique) and ADASYN (Adaptive Synthetic Sampling Approach) was explored. The goal was to artificially increase the number of samples in the less frequent categories. However, several issues were encountered:
- `ValueError: Expected n_neighbors <= n_samples_fit`. This error occurred when the number of nearest neighbors required by SMOTE/ADASYN exceeded the number of available samples in a minority class.
- Low accuracy. Even when the error was avoided, the resulting models often had lower accuracy than the baseline, suggesting that the synthetic samples were introducing noise rather than improving the model's ability to generalize. The oversampled data sometimes led to the model "memorizing" the synthetic examples, leading to poor performance on unseen data.
Here is sample code using SMOTE:
```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Assuming X is your features and y is your labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Optionally pass k_neighbors=... for very small minority classes
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Shape of X_train before SMOTE: {X_train.shape}")
print(f"Shape of X_train after SMOTE: {X_train_resampled.shape}")
```
Here are some key considerations for the SMOTE method:
- Apply resampling only to the training data (not the testing set) to prevent data leakage.
- Carefully tune the `k_neighbors` parameter in SMOTE/ADASYN. A value that is too high can trigger the error above, while a value that is too low might not generate effective synthetic samples.
- Use stratified sampling during the train/test split to maintain class proportions. This ensures that the class distribution in the training and testing sets is representative of the overall dataset.
Cost-Sensitive Learning
Higher misclassification costs were assigned to minority classes by using the `class_weight='balanced'` or `class_weight='balanced_subsample'` parameter in the Random Forest model. This tells the model to penalize misclassifying minority-class examples more heavily.
```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,                   # Or your chosen number of trees
    class_weight='balanced_subsample',  # Or 'balanced'
    random_state=42,
    n_jobs=-1                           # Use all available cores
)
```
Threshold Adjustment
Adjusting the classification threshold to favor the minority class was considered. This involves analyzing the predicted probabilities and choosing a threshold that works well for your specific problem. However, this requires careful evaluation on a validation set to avoid overfitting.
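To illustrate, here is a minimal sketch of threshold adjustment, assuming a fitted classifier `clf` that exposes `predict_proba` and a held-out validation set `X_val`; the class index and the 0.3 threshold are hypothetical values you would tune on validation data, not settings from the project:

```python
import numpy as np

# Assumes clf is a fitted classifier exposing predict_proba
# and X_val is a held-out validation set.
proba = clf.predict_proba(X_val)

# Hypothetical example: favor the minority class at index 2 by
# accepting it whenever its probability clears a lower threshold.
minority_idx = 2
threshold = 0.3  # tune on the validation set, never the test set

preds = np.argmax(proba, axis=1)
preds[proba[:, minority_idx] >= threshold] = minority_idx
```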
2. Improving Model Accuracy
Feature Engineering
Given the text data, more informative features were created:
- TF-IDF (Term Frequency-Inverse Document Frequency) with N-grams: This technique captures the importance of words in a document relative to the entire corpus, while N-grams capture phrases and word order information.
- Word embeddings (Word2Vec, GloVe, FastText): These represent words as dense vectors, capturing semantic relationships between words.
- Length of description: The length of the text description itself could be a useful feature.
- Number of keywords: The occurrence of specific keywords related to different product categories was counted.
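As a rough sketch of combining several of these ideas, TF-IDF n-gram features can be stacked with handcrafted columns such as description length and keyword counts; the `texts` variable and the keyword list below are illustrative assumptions, not the project's actual values:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# texts is assumed to hold the raw description strings
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_tfidf = tfidf.fit_transform(texts)

# Hypothetical handcrafted features: text length and keyword hits
keywords = ["laptop", "pot", "fan", "bag", "shelf"]  # illustrative only
lengths = np.array([[len(t)] for t in texts])
keyword_counts = np.array(
    [[sum(t.lower().count(k) for k in keywords)] for t in texts]
)

# Stack the sparse TF-IDF columns with the dense handcrafted columns
X = hstack([X_tfidf, lengths, keyword_counts]).tocsr()
```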
Model Complexity and Regularization
The complexity of the Random Forest model was reduced by:
- Reducing `n_estimators`: The number of trees in the Random Forest is a major contributor to its size.
- Limiting `max_depth`: Restricting the maximum depth of the trees prevents them from becoming overly complex.
- Increasing `min_samples_split` and `min_samples_leaf`: These parameters control the minimum number of samples required to split a node and to sit at a leaf node, respectively. Increasing them forces the trees to be less specific.
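A minimal sketch of a Random Forest constrained along these lines; the specific values are illustrative starting points rather than tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only -- tune against a validation set
compact_rf = RandomForestClassifier(
    n_estimators=50,        # fewer trees -> smaller model
    max_depth=20,           # cap tree depth
    min_samples_split=10,   # require more samples before splitting
    min_samples_leaf=5,     # require more samples per leaf
    class_weight='balanced_subsample',
    random_state=42,
    n_jobs=-1
)
```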
Model Selection
Simpler models like Multinomial Naive Bayes, Logistic Regression, and Linear SVM were considered, as Random Forest can be prone to overfitting. Gradient Boosting Machines (GBM) and XGBoost were also tried.
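A hedged sketch of how such candidates might be compared with cross-validation, assuming `X_train` and `y_train` hold the raw text and labels from the earlier split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000,
                                              class_weight='balanced'),
    "linear_svm": LinearSVC(class_weight='balanced'),
}

for name, clf in candidates.items():
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words='english', max_features=5000)),
        ("classifier", clf),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=3, scoring='accuracy')
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```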
Hyperparameter Optimization
Cross-validation was used to evaluate model performance and tune hyperparameters. Randomized search and Bayesian optimization were employed to efficiently explore the hyperparameter space.
```python
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),  # Adjust max_features
    ('classifier', LogisticRegression(solver='liblinear', multi_class='ovr',
                                      random_state=42, class_weight='balanced'))
])

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2']
}

randomized_search = RandomizedSearchCV(pipeline,
                                       param_distributions=param_grid,
                                       n_iter=10,
                                       cv=3,
                                       scoring='accuracy',
                                       random_state=42,
                                       n_jobs=-1)
randomized_search.fit(X_train, y_train)
```
3. Reducing Model Size
Reducing Vocabulary Size
The vocabulary size of the text vectorizer was limited using `max_features` in `TfidfVectorizer`.
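For instance, capping the vocabulary directly bounds the width of the feature matrix (a small sketch, assuming `texts` holds the raw documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 5,000 highest-scoring terms across the corpus
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(texts)
print(f"Feature matrix shape: {X.shape}")  # columns capped at 5000
```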
Quantization
Model quantization techniques were explored to reduce model size, but this required using a different framework like TensorFlow Lite or ONNX Runtime, which was too much work for such a small task.
What Worked Ultimately
Ultimately, a significantly smaller model size was achieved while maintaining acceptable accuracy by:
- Removing infrequent classes. Removing categories with fewer than X occurrences (the exact threshold will depend on your data) significantly improved model performance. This created a more stable and manageable dataset, despite reducing its overall size.
- Combining textual information. The product title and product description were combined into a single text field. This allowed the model to learn relationships between the concise title and the more detailed description.
- TF-IDF with a limited vocabulary. Using TF-IDF with a limited vocabulary size (`max_features=5000`) struck a good balance between accuracy and model size, and prevented the model from overfitting to rare or irrelevant words.
A sketch combining these three steps follows the sample data below.
Here's a sample of the data:
| product_title | product_description | product_category |
|---|---|---|
| Laptop | RAM size foldable screen, touch | Electronics & Computers |
| Nonstick pot | Scratched after a month, but maybe that’s my fault for using a metal spoon | Kitchen & Cookware |
| Ceiling Fan | Remote works from my bed—life is good. | Home & Appliances |
| Laptop Bag | Water-resistant? Yes. Waterproof? Nope. | Accessories & Bags |
| BookShelf | Assembly took me an hour and my sanity, but it looks great! | Furniture & Storage |
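Putting the pieces together, here is a minimal sketch of the final approach, assuming a pandas DataFrame with the columns shown above; the file name and the minimum-occurrence threshold are hypothetical placeholders:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("products.csv")  # hypothetical file with the columns above

# 1. Remove infrequent categories (placeholder threshold -- tune per dataset)
min_occurrences = 50
counts = df["product_category"].value_counts()
df = df[df["product_category"].isin(counts[counts >= min_occurrences].index)]

# 2. Combine title and description into a single text field
df["text"] = df["product_title"] + " " + df["product_description"]

# 3. TF-IDF with a capped vocabulary feeding a simple linear model
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["product_category"],
    test_size=0.2, random_state=42, stratify=df["product_category"]
)
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("classifier", LogisticRegression(solver="liblinear",
                                      class_weight="balanced",
                                      random_state=42)),
])
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```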
Lessons Learned
This experience underscored several key lessons:
- Data cleaning is paramount. Thoroughly examine your data, clean it, and handle missing values appropriately. Inconsistent data can significantly hinder model performance.
- Class imbalance requires careful consideration. While resampling techniques can be helpful, they're not always the best solution. Sometimes, a more aggressive filtering approach can be more effective.
- Feature engineering matters. Combining existing features or creating new ones can significantly improve model accuracy.
- Model complexity is a trade-off. Simpler models are often more robust and easier to manage than complex ones.
- Experimentation is essential. Don't be afraid to try different approaches and evaluate their performance.
By systematically addressing these issues, you can build NLP classification models that are both accurate and manageable.