Make ML Models Work: A Real-World Take on Size and Imbalance

To overcome NLP hurdles such as large model size and imbalanced data, this project used TF-IDF, class weighting, and simpler models to balance accuracy and efficiency.

By Gowsiya Syednoor Shek · Mar. 25, 25 · Analysis

The Initial Hurdle

In NLP classification tasks, especially those involving text descriptions, it's common to encounter two significant hurdles: large model sizes and imbalanced datasets. A large model can be difficult to deploy and manage, while class imbalance can lead to poor predictive performance, particularly for minority classes.

Imagine you are building a system to automatically categorize product descriptions into product categories. This project began with a dataset of close to 40,000 records, each containing a short product title, a longer product description, and the corresponding category. The initial Random Forest model achieved a decent accuracy of around 70%, but it ballooned to 11 GB.

Further experimentation revealed that tuning the model parameters caused a drastic drop in accuracy (to 14%), rendering the model practically useless. This situation highlights the core challenges:

  • Large model size: An 11 GB model is unwieldy and resource-intensive.
  • Class imbalance: Uneven distribution of product categories makes it hard for the model to learn effectively.
  • Overfitting: Parameter tuning led to a significant drop in accuracy, which suggested that the model was not generalizing beyond the training data.

The Solutions Explored

To address these issues, the focus was on reducing the model size and improving its accuracy. Here's a rundown of the techniques tried, the challenges faced, and what ultimately worked:

1. Addressing Class Imbalance

Resampling Techniques (SMOTE and ADASYN)

Oversampling the minority classes using SMOTE (Synthetic Minority Oversampling Technique) and ADASYN (Adaptive Synthetic Sampling Approach) was explored. The goal was to artificially increase the number of samples in the less frequent categories. However, several issues were encountered:

  • ValueError: Expected n_neighbors <= n_samples_fit. This error occurred when the number of nearest neighbors required by SMOTE/ADASYN exceeded the number of available samples in a minority class.
  • Low accuracy. Even when the error was avoided, the resulting models often had lower accuracy than the baseline, suggesting that the synthetic samples were introducing noise or not effectively improving the model's ability to generalize. The oversampled data sometimes led to the model "memorizing" the synthetic examples, leading to poor performance on unseen data.

Here is sample code using SMOTE:

Python
 
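from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Assuming X holds numeric features (e.g., TF-IDF vectors) and y holds the labels
# Stratify so that class proportions are preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# k_neighbors (default 5) must be smaller than the smallest class in y_train;
# lower it if you hit "Expected n_neighbors <= n_samples_fit"
smote = SMOTE(k_neighbors=5, random_state=42)
# Resample the training split only, never the test split
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print(f"Shape of X_train before SMOTE: {X_train.shape}")
print(f"Shape of X_train after SMOTE: {X_train_resampled.shape}")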


Here are some key considerations for the SMOTE method:

  • Apply resampling only to the training data (not the testing set) to prevent data leakage.
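  • Carefully tune the k_neighbors parameter in SMOTE (n_neighbors in ADASYN). A value that is too high (equal to or larger than the number of samples in the smallest class) triggers the ValueError described above, while a value that is too low might not generate effective synthetic samples.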
  • Use stratified sampling during train/test split to maintain class proportions. This ensures that the class distribution in the training and testing sets is representative of the overall dataset.
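
One way to avoid the n_neighbors error is to derive an upper bound for k_neighbors from the class counts before resampling. The helper below is a minimal sketch, not part of imbalanced-learn: the function name safe_k_neighbors and the example labels are hypothetical, and it assumes the training labels are available as a plain Python sequence.

```python
from collections import Counter


def safe_k_neighbors(labels, preferred=5):
    """Return a k_neighbors value that every class can satisfy.

    SMOTE searches for k_neighbors neighbours inside each minority class,
    so k_neighbors can be at most (smallest class size - 1).
    """
    smallest = min(Counter(labels).values())
    if smallest < 2:
        # A single sample has no neighbour to interpolate with.
        raise ValueError(
            "A class with a single sample cannot be oversampled with SMOTE; "
            "drop or merge it first."
        )
    return min(preferred, smallest - 1)


# Hypothetical training labels: one category has only three examples.
y_train_example = ["Electronics"] * 10 + ["Furniture"] * 6 + ["Accessories"] * 3
k = safe_k_neighbors(y_train_example)
print(k)  # 2
```

The resulting value could then be passed as SMOTE(k_neighbors=k, random_state=42). Classes that are too small even for k_neighbors=1 are better handled by filtering or merging them.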

Cost-Sensitive Learning

Higher misclassification costs were assigned to minority classes by using the class_weight='balanced' or class_weight='balanced_subsample' parameter in the Random Forest model. This tells the model to penalize misclassifying minority class examples more heavily.

Python
 
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=100,  # Or your chosen number of trees
    class_weight='balanced_subsample',  # Or 'balanced'
    random_state=42,
    n_jobs=-1  # Use all available cores
)
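
To make the effect of class_weight='balanced' more concrete, the sketch below reproduces the weighting formula documented by scikit-learn, n_samples / (n_classes * class_count), in plain Python. The function name and the example labels are hypothetical and only serve as an illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute weights as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels with an 8:2 imbalance.
y_example = ["Electronics"] * 8 + ["Furniture"] * 2
weights = balanced_class_weights(y_example)
print(weights)  # {'Electronics': 0.625, 'Furniture': 2.5}
```

In this example, an error on the rarer class costs four times as much as an error on the frequent class, which is what pushes the model to pay attention to minority categories.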


Threshold Adjustment

Adjusting the classification threshold to favor the minority class was considered. This involves analyzing the predicted probabilities and choosing a threshold that works well for your specific problem. However, this requires careful evaluation on a validation set to avoid overfitting.
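
As a rough illustration of threshold adjustment, the sketch below picks the threshold that maximizes F1 for one minority category on a validation set. It assumes a one-vs-rest view (label 1 means "belongs to the minority category") and predicted probabilities that have already been computed; the function names, probabilities, and candidate thresholds are all hypothetical.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score of the positive (minority) class at a given threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(probs, labels, candidates):
    """Return the candidate threshold with the highest validation F1."""
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Hypothetical validation-set probabilities for the minority category.
val_probs = [0.9, 0.4, 0.35, 0.2, 0.1, 0.05]
val_labels = [1, 1, 1, 0, 0, 0]

best = pick_threshold(val_probs, val_labels, candidates=[0.5, 0.3, 0.15])
print(best)  # 0.3
```

Here the default threshold of 0.5 would miss two of the three minority examples, while 0.3 recovers all of them without adding false positives. The chosen value should still be checked on data that was not used to select it.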

2. Improving Model Accuracy

Feature Engineering

Given the text data, more informative features were created:

  • TF-IDF (Term Frequency-Inverse Document Frequency) with N-grams: This technique captures the importance of words in a document relative to the entire corpus, while N-grams capture phrases and word order information.
  • Word embeddings (Word2Vec, GloVe, FastText): These represent words as dense vectors, capturing semantic relationships between words.
  • Length of description: The length of the text description itself could be a useful feature.
  • Number of keywords: The occurrence of specific keywords related to different product categories was counted.
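
The last two features in the list can be sketched in a few lines of plain Python. The keyword sets below are hypothetical placeholders rather than the ones used in the project, and the tokenizer is a deliberately simple regular expression.

```python
import re

# Hypothetical keyword lists; real ones would come from domain knowledge.
CATEGORY_KEYWORDS = {
    "electronics": {"ram", "screen", "laptop"},
    "kitchen": {"pot", "pan", "spoon"},
}


def handcrafted_features(text):
    """Return length and keyword-count features for one description."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    features = {"n_chars": len(text), "n_tokens": len(tokens)}
    for category, words in CATEGORY_KEYWORDS.items():
        # Count how many tokens belong to this category's keyword set.
        features[f"kw_{category}"] = sum(1 for token in tokens if token in words)
    return features


example = handcrafted_features("Nonstick pot, scratched by a metal spoon")
print(example)
```

Features like these could be concatenated with the TF-IDF matrix, although whether they help depends on the data.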

Model Complexity and Regularization

The complexity of the Random Forest model was reduced by:

  • Reducing n_estimators: The number of trees in the Random Forest is a major contributor to its size.
  • Limiting max_depth: Restricting the maximum depth of the trees prevents them from becoming overly complex.
  • Increasing min_samples_split and min_samples_leaf: These parameters control the minimum number of samples required to split a node and to be at a leaf node, respectively. Increasing them forces the trees to be less specific.
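
The effect of these parameters on model size can be checked by comparing the serialized size of two forests. The sketch below uses a small synthetic dataset from make_classification rather than the project's data, and the specific parameter values are illustrative assumptions.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, n_classes=3, random_state=42
)


def pickled_size_bytes(model):
    """Size of the model when serialized with pickle."""
    return len(pickle.dumps(model))


# Unconstrained forest: many trees, grown until the leaves are pure.
large = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Constrained forest: fewer, shallower trees with larger leaves.
small = RandomForestClassifier(
    n_estimators=20, max_depth=8, min_samples_leaf=5, random_state=42
).fit(X, y)

large_size = pickled_size_bytes(large)
small_size = pickled_size_bytes(small)
print(f"unconstrained: {large_size / 1e6:.2f} MB, constrained: {small_size / 1e6:.2f} MB")
```

Accuracy should be measured alongside size, since constraining the trees too aggressively can cost more accuracy than the size reduction is worth.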

Model Selection

Simpler models like Multinomial Naive Bayes, Logistic Regression, and Linear SVM were considered, as Random Forest can be prone to overfitting. Gradient Boosting Machines (GBM) and XGBoost were also tried.
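
A comparison of the simpler models might look like the sketch below. The toy corpus, labels, and model settings are hypothetical and far too small to say anything about real accuracy; the point is only the shape of the comparison loop.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: six documents per category.
texts = [
    "laptop with touch screen and fast ram",
    "wireless mouse with usb receiver",
    "monitor with hdmi input and wide screen",
    "laptop charger with usb cable",
    "bluetooth keyboard for tablet and laptop",
    "gaming laptop with fast screen",
    "nonstick pot with glass lid",
    "stainless steel frying pan",
    "wooden spoon set for cooking",
    "cast iron pan for cooking steak",
    "nonstick pan with metal handle",
    "soup pot with steel lid",
]
labels = ["Electronics"] * 6 + ["Kitchen"] * 6

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "linear_svm": LinearSVC(class_weight="balanced"),
}

scores = {}
for name, model in candidates.items():
    # Vectorize inside the pipeline so each fold fits its own vocabulary.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores[name] = cross_val_score(
        pipeline, texts, labels, cv=3, scoring="accuracy"
    ).mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Linear models store one coefficient per feature and class, so with a capped vocabulary they tend to be much smaller than a forest of deep trees.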

Hyperparameter Optimization

Cross-validation was used to evaluate model performance and tune hyperparameters. Randomized search and Bayesian optimization were employed to efficiently explore the hyperparameter space.

Python
 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# X_train is expected to hold the raw text here, since the pipeline does the vectorizing
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),  # Adjust max_features
    # liblinear supports both l1 and l2 penalties and fits multiclass problems one-vs-rest
    ('classifier', LogisticRegression(solver='liblinear', random_state=42,
                                      class_weight='balanced'))  # Use class_weight
])

# Search space that RandomizedSearchCV samples from
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2']
}

randomized_search = RandomizedSearchCV(pipeline,
                                       param_distributions=param_grid,
                                       n_iter=10,
                                       cv=3,
                                       scoring='accuracy',
                                       random_state=42,
                                       n_jobs=-1)

randomized_search.fit(X_train, y_train)
print(f"Best parameters: {randomized_search.best_params_}")
print(f"Best cross-validated accuracy: {randomized_search.best_score_:.3f}")

                            

3. Reducing Model Size

Reducing Vocabulary Size

The vocabulary size of the text vectorizer was limited using max_features in TfidfVectorizer.
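
The effect of max_features can be seen by fitting the same vectorizer with and without the cap. The documents and the cap of 10 below are hypothetical and chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product texts.
docs = [
    "laptop with foldable touch screen and plenty of ram",
    "nonstick pot scratched by a metal spoon",
    "ceiling fan with a remote that works from bed",
    "water resistant laptop bag with padded straps",
    "bookshelf that takes an hour to assemble",
]

# Unigrams and bigrams, no cap on the vocabulary.
full = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# Same settings, but only the 10 most frequent terms are kept.
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=10).fit(docs)

print(f"full vocabulary: {len(full.vocabulary_)} terms")
print(f"capped vocabulary: {len(capped.vocabulary_)} terms")
print(f"capped matrix shape: {capped.transform(docs).shape}")
```

Every downstream model sees one column per vocabulary term, so the cap bounds both the size of the feature matrix and the number of parameters in linear models.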

Quantization

Model quantization techniques were explored to reduce model size, but this required using a different framework like TensorFlow Lite or ONNX Runtime, which was too much work for such a small task.

What Worked Ultimately

Ultimately, a significantly smaller model size was achieved while maintaining acceptable accuracy by:

  • Removing infrequent classes. Removing categories with fewer than X occurrences (the right value of X will depend on your data) significantly improved model performance. This created a more stable and manageable dataset, even though it reduced the overall number of records.
  • Combining textual information. The product title and product description were combined into a single text field. This allowed the model to learn relationships between the concise title and the more detailed description.
  • TF-IDF with limited vocabulary. Using TF-IDF with a limited vocabulary size (max_features=5000) struck a good balance between accuracy and model size. This prevented the model from overfitting to rare or irrelevant words.

Here's a sample of the data:

product_title | product_description | product_category
Laptop | RAM size foldable screen, touch | Electronics & Computers
Nonstick pot | Scratched after a month, but maybe that’s my fault for using a metal spoon | Kitchen & Cookware
Ceiling Fan | Remote works from my bed—life is good. | Home & Appliances
Laptop Bag | Water-resistant? Yes. Waterproof? Nope. | Accessories & Bags
BookShelf | Assembly took me an hour and my sanity, but it looks great! | Furniture & Storage
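
The first two steps that worked, filtering rare categories and combining the title with the description, can be sketched in plain Python on records shaped like the sample. The function name prepare_records, the example records, and the min_count value are hypothetical; the project's actual threshold is not stated.

```python
from collections import Counter


def prepare_records(records, min_count):
    """Drop rare categories and merge title and description into one text field."""
    counts = Counter(r["product_category"] for r in records)
    prepared = []
    for r in records:
        if counts[r["product_category"]] < min_count:
            continue  # Category is too rare to learn from reliably.
        text = f'{r["product_title"]} {r["product_description"]}'.strip()
        prepared.append({"text": text, "product_category": r["product_category"]})
    return prepared


# Hypothetical records using the column names from the sample.
records = [
    {"product_title": "Laptop", "product_description": "RAM size foldable screen, touch",
     "product_category": "Electronics & Computers"},
    {"product_title": "Monitor", "product_description": "Bright display, thin bezel",
     "product_category": "Electronics & Computers"},
    {"product_title": "BookShelf", "product_description": "Looks great once assembled",
     "product_category": "Furniture & Storage"},
]

prepared = prepare_records(records, min_count=2)
print(len(prepared))  # 2
print(prepared[0]["text"])  # Laptop RAM size foldable screen, touch
```

The combined text field can then be passed to a TF-IDF vectorizer with a capped vocabulary.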


Lessons Learned

This experience underscored several key lessons:

  • Data cleaning is paramount. Thoroughly examine your data, clean it, and handle missing values appropriately. Inconsistent data can significantly hinder model performance.
  • Class imbalance requires careful consideration. While resampling techniques can be helpful, they're not always the best solution. Sometimes, a more aggressive filtering approach can be more effective.
  • Feature engineering matters. Combining existing features or creating new ones can significantly improve model accuracy.
  • Model complexity is a trade-off. Simpler models are often more robust and easier to manage than complex ones.
  • Experimentation is essential. Don't be afraid to try different approaches and evaluate their performance.

By systematically addressing these issues, you can build NLP classification models that are both accurate and manageable.
