Can Machine Learning Predict Poverty?
We entered the World Bank's poverty prediction competition as a chance to step outside our usual work in visual analytics on images and videos, healthcare AI, and NLP, and to apply machine learning to tabular data.
The World Bank hosted a poverty prediction competition on the competition-hosting website drivendata.org. We decided to try out our machine learning skills on this dataset. Most regular work at ParallelDots revolves around three themes: visual analytics on images and videos, healthcare AI, and NLP, all three of which we solve using deep learning techniques. This competition was a chance to try out something new and build our internal codebase to handle tabular datasets like the one in the competition.
What we wanted to achieve from the competition was to:
- Try out a multitude of machine learning models that might be able to solve the problem.
- Try out existing AutoML methods. (With AutoML methods, you only need to do the feature engineering; they figure out the rest of the pipeline on their own.)
- Create one best model to solve the problem rather than ensembling many models to squeeze out a better score. Since AIaaS is our day-to-day job, optimizing for one good model matters more to us, as ensembles are hard to deploy as services.
- Build a code repository to tackle data science and machine learning problems in the future.
Analyzing the Dataset (Without a Lot of Sweat)
The first task, as in any machine learning project, is to analyze the datasets and see their properties. Some information we can derive just by looking at the dataset is:
- There are data files for three different countries.
- All the fields are anonymized and coded, so you do not know what the fields mean. This reduces any chance of domain-specific feature engineering to zero.
- Data for all three countries are totally different, so one needs to build three models: one for each country.
One way to dive deeper into the data (quickly) is to use the new package pandas-profiling (available on GitHub). This package does a lot of primary analysis and saves the results as HTML reports that can be viewed in a browser. We ran pandas-profiling on the data of all three countries to understand more about the datatypes, frequencies, correlations, etc.
Sample output for one of the countries can be seen in the following image:
Correlations amongst different features
Some more conclusions we can draw are:
- There seems to be a default value for most categorical fields, which is the most common value for the field. (In the above picture, for example, you can see that the field AOSWkWKB has a default value that it takes more than 80% of the time.)
- The datasets are highly imbalanced; we need to take care of that while training.
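Both observations can also be checked directly with pandas. The snippet below is a minimal sketch on a made-up toy table: the column names, the values, and the helper `dominant_value_share` are hypothetical and are not part of the competition data or of pandas-profiling.

```python
import pandas as pd


def dominant_value_share(df: pd.DataFrame, columns) -> pd.DataFrame:
    """For each listed column, report its most common value and that value's share of rows."""
    rows = []
    for col in columns:
        # value_counts sorts by frequency, so the first entry is the most common value.
        freq = df[col].value_counts(normalize=True, dropna=False)
        rows.append({"column": col, "top_value": freq.index[0], "share": float(freq.iloc[0])})
    return pd.DataFrame(rows, columns=["column", "top_value", "share"]).set_index("column")


# Toy stand-in for one country's household file (hypothetical names and values).
toy = pd.DataFrame({
    "feat_a": ["x", "x", "x", "x", "y"],
    "feat_b": ["p", "q", "p", "r", "p"],
    "poor": [0, 0, 0, 1, 0],
})

summary = dominant_value_share(toy, ["feat_a", "feat_b"])
class_balance = toy["poor"].value_counts(normalize=True)

print(summary)        # which value dominates each categorical field, and by how much
print(class_balance)  # share of non-poor (0) vs. poor (1) rows
```

A column whose `share` is very high is a candidate for the "default value" pattern described above, and a lopsided `class_balance` is the signal that resampling may be needed.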
2 Ways to Model the Data
If one looks at the datatypes of the objects, they can see that the data is a mix of categorical values (attributes that can take one out of a fixed number of enumerable values) and numerical values (both floats and integers). In fact, that is how the random forest benchmark provided by the World Bank models the data. However, when you look at the numerical quantities, they are not that numerous and might represent quantities like date of birth, etc. (If you have taken the Coursera course, Dmitry talks about a similar set of fields in the Handling Anonymized Datasets section.) So, another approach we wanted to try was to treat all the fields as categorical attributes. We ended up trying both.
Another important property of the dataset is the imbalance between the +ve and -ve classes (non-poor people vastly outnumber poor people). For Country A, the data is still fairly balanced, but for B and C, the data has a very skewed distribution. To train models on such skewed data, we tried different approaches using the imbalanced-learn library in Python:
- Training on the skewed dataset (which worked okay, but not too great).
- Training on a dataset with the negative class undersampled (which performed very badly; even the best machine learning models could only do about as well as the baseline with this dataset).
- Oversampling the +ve class (which worked quite well).
- Oversampling using SMOTE algorithms (did not work as well as normal oversampling, primarily because the SMOTE algorithm is not really defined for categorical attributes).
- Oversampling using ADASYN (did not work as well as normal oversampling).
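To make "normal oversampling" concrete, here is a minimal NumPy sketch that duplicates randomly chosen minority rows until the classes are balanced. It is a plain stand-in for what a library such as imbalanced-learn does, not the code we ran; the function name `oversample_minority` and the toy arrays are hypothetical.

```python
import numpy as np


def oversample_minority(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0 or len(minority_idx) == 0:
        return X, y  # already balanced, or nothing to duplicate
    # Sample minority rows with replacement to fill the gap.
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Toy data: 8 negative rows and 2 positive rows.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))  # [8 8]
```

Resampling of this kind should be applied to the training split only, so that duplicated rows never leak into the validation data.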
The dataset was preprocessed as follows:
- All categorical features were converted to binary features.
- The numerical values were normalized. Both max-min and mean-std normalization were tested.
- Household-level and individual-level data were merged (the individual-level data had a separate row for each member of every household). For attributes common to both, only the household-level data was retained. Numerical features from the individual data were averaged across the members of each household (which might not have been the best way), and categorical values were aggregated to the oddest value in the household (so, for example, if Feature X had the values 1,1,1,0 in a household, we would take the combined value for the household as 0). The reason is that a lot of categorical variables hold the default value, and we expected the odd value to carry more information.
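The preprocessing steps above can be sketched with pandas as follows. This is an illustration under assumptions, not our actual pipeline: the column names (`hh_id`, `num_feat`, `cat_feat`, `hh_cat`, `hh_num`) and the helper `oddest_value` are hypothetical, "oddest" is interpreted as the least frequent value within the household, and ties are broken arbitrarily.

```python
import pandas as pd


def oddest_value(values: pd.Series):
    """Return the least frequent value in the group (ties broken arbitrarily)."""
    counts = values.value_counts()  # sorted from most to least frequent
    return counts.index[-1]


# Toy individual-level data: one row per household member.
indiv = pd.DataFrame({
    "hh_id": [1, 1, 1, 1, 2, 2],
    "num_feat": [10.0, 20.0, 30.0, 40.0, 5.0, 7.0],
    "cat_feat": [1, 1, 1, 0, 2, 2],
})

# Toy household-level data: one row per household.
hhold = pd.DataFrame({
    "hh_id": [1, 2],
    "hh_cat": ["a", "b"],
    "hh_num": [3.0, 9.0],
}).set_index("hh_id")

# Aggregate members to one row per household: mean for numbers, oddest value for categories.
agg = indiv.groupby("hh_id").agg(
    num_feat=("num_feat", "mean"),
    cat_feat=("cat_feat", oddest_value),
)

# Keep the household-level version of any attribute present in both tables.
agg = agg.drop(columns=[c for c in agg.columns if c in hhold.columns])
merged = hhold.join(agg)

# Convert categorical features to binary (one-hot) columns.
features = pd.get_dummies(merged, columns=["hh_cat", "cat_feat"])

# Mean-std normalization of the numerical columns (max-min would be analogous).
for col in ["hh_num", "num_feat"]:
    features[col] = (features[col] - features[col].mean()) / features[col].std()

print(agg)
print(features)
```

For household 1, the members' `cat_feat` values 1,1,1,0 collapse to 0, matching the example in the list above.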
Approaches We Tried
We now talk about multiple approaches that we tried.
Things that did not work:
- We thought that the default attributes for categorical fields might not be useful for modeling. To check this, we trained machine learning models both with and without the default attributes. Models that were not fed the default attributes consistently performed worse than ones that were.
- SMOTE and ADASYN oversampling didn’t give better results than normal oversampling.
- Two-stage machine learning: a first stage that fits a decision tree to get the importance of features, and a second stage that trains on only the most important features. We did not get any gains from this technique.
- Trying different methods to normalize numerical data did not change accuracy. However, leaving numerical attributes unnormalized gave worse accuracy.
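For reference, the two-stage idea from the list above looks roughly like the following scikit-learn sketch. The synthetic data, the choice of `top_k = 8`, and the use of a random forest in the second stage are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Stage 1: fit a decision tree and rank features by importance.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
top_k = 8  # hypothetical cut-off
selected = np.argsort(tree.feature_importances_)[::-1][:top_k]

# Stage 2: train the final model on the selected features only.
reduced = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr[:, selected], y_tr)

# Baseline for comparison: the same model on all features.
full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("all features :", log_loss(y_va, full.predict_proba(X_va)[:, 1]))
print("top features :", log_loss(y_va, reduced.predict_proba(X_va[:, selected])[:, 1]))
```

Comparing the two validation log losses is the check we used to decide whether the extra stage was worth keeping.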
Tricks that helped us increase our score:
- The combination of numerical and categorical features worked better to train algorithms than all categorical attributes — at least for decision trees.
- The choice of default values for missing data helped us improve our accuracy. We started by setting all missing values to zero, but later used -999, which worked better.
- Grid search across machine learning hyperparameters got us 2-4% better on the validation set with no effort.
- A strong AutoML baseline helped us get started well.
Tricks that we wanted to try but couldn’t/didn’t/were too lazy to code:
- Feature-engineering by taking Cartesian products of non-default categorical values and then choosing important features to train the model on.
- Feature-engineering by combining numerical features in different ways and doing feature selection on generated features.
- Trying ensembles of multiple models. We had earlier fixed the goal to get one good model but still ended up training many methods. We could have combined them as an ensemble like stacking.
Machine Learning Algorithms
The libraries we used were SKLearn, XGBOOST, and TPOT.
We will now go through the machine learning approaches in the chronological order in which we tried them. Please note that not all the tricks that worked for us were in place on our first try; we added them one by one. Please see the points for each trial to understand what the pipeline was at that time. All machine learning models used were from the scikit-learn library unless otherwise stated.
The Usual Suspects With Default Parameters
- We started by trying out the usual suspects with default parameters: logistic regression, SVM, and random forests. We also tried a new library called CATBOOST, but we couldn’t find much documentation about its hyperparameters, nor could we fit it well on the data, so we decided to replace it with the better-known XGBOOST. We also had experience with XGBOOST hyperparameter tuning (which we knew we had to do in later stages).
- The first attempt treated all the columns as categorical data and trained on the imbalanced dataset.
- All the models fit okay and gave us way better accuracy than a coin toss on validation data. That tells us that the data extraction pipeline is okay (it has no obvious bugs but needs more fine-tuning).
- Like the baseline provided by competition providers, random forests and XGBOOST with default hyperparameters show good results.
- LR and SVM can model the data well (though not as well as RF and XGBOOST, due to less variance with the default hyperparameters). SVM (SKLearn SVC) had good accuracy too, but the probabilities it returns in SKLearn are not really usable (which we found is a common issue with default hyperparameters): what SVC returns is not exactly a probability but some type of score. Because the competition was judged on mean log loss, using SVM would have required extra effort to make sure that the probability numbers were right, so we dropped it.
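The difference between SVM scores and probabilities can be seen in a small scikit-learn sketch. We dropped SVM rather than pursuing this, so the calibration step shown here (`CalibratedClassifierCV` with sigmoid scaling) is one generic option and not part of our pipeline; the data is synthetic.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the competition data.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# A plain SVC exposes signed margin scores, which are not bounded to [0, 1].
svm = SVC().fit(X_tr, y_tr)
scores = svm.decision_function(X_va)
print("score range:", scores.min(), scores.max())

# Wrapping the SVC in a calibrator maps those scores to probabilities.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_va)
print("validation log loss:", log_loss(y_va, proba[:, 1]))
```

Log loss heavily penalizes confident wrong answers, which is why raw scores cannot simply be submitted in place of probabilities.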
TPOT: AutoML Makes a Good Baseline
- Still continuing with all features being taken as categorical, we tried to fit a baseline using an AutoML method called TPOT.
- TPOT uses genetic algorithms to figure out a good machine learning pipeline for the problem at hand along with what hyperparameters to use with it.
- This got us in the Top 100 in the competition public leaderboard at the time we submitted it.
- TPOT takes time to figure out the pipeline; it converged in a few hours for the entire dataset.
Neural Networks, Anyone?
The love we have for deep learning made our hands itch to try something neural networks-y. We set out to train a good neural network that could solve this problem, too. Please note that at this time, we were doing experiments considering all columns as categorical. What is a problem that has many categorical variables and needs to predict a label? Text classification. That is one place where neural networks shine a lot. However, unlike text, this dataset has no concept of sequence, so we decided to use a neural network that's common in text classification but doesn’t take order into account. That algorithm is FastText. We wrote a (deep) version of a FastText-like algorithm in Keras to train on the dataset. Another thing we did to train the neural network was oversampling the minority class, as it did not train well on imbalanced data.
We tried training using the recently proposed self-normalized neural networks. This gave us a free bump of accuracy on the validation set.
Self-normalized FFNN (SELU) we used
We got gains in accuracy on the validation set when we used deep neural networks, especially for Country B, where the highest accuracy we ever achieved (even better than our best-performing model) came from a self-normalized deep neural network. However, the results didn’t translate to the leaderboard, where we kept getting poor scores (high logloss).
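To make the order-free, self-normalizing architecture concrete, here is a NumPy sketch of the forward pass only. It is not our Keras model: the layer sizes, the helper names `init_params` and `forward`, and the encoding of each (field, value) pair as one integer id are assumptions, and the network is left untrained.

```python
import numpy as np

# SELU constants from the self-normalizing networks paper.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))


def init_params(vocab_size, dim, hidden_sizes, seed=0):
    """Random weights; hidden layers use a 1/fan_in variance, as recommended for SELU."""
    rng = np.random.default_rng(seed)
    sizes = [dim, *hidden_sizes]
    hidden = [
        (rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]
    return {
        "embedding": rng.normal(0.0, 0.1, size=(vocab_size, dim)),
        "hidden": hidden,
        "w_out": rng.normal(0.0, np.sqrt(1.0 / sizes[-1]), size=sizes[-1]),
        "b_out": 0.0,
    }


def forward(token_ids, params):
    """token_ids: (batch, n_fields) integer ids, one id per (field, value) pair."""
    emb = params["embedding"][token_ids]  # (batch, n_fields, dim)
    h = emb.mean(axis=1)                  # average pooling ignores field order
    for W, b in params["hidden"]:
        h = selu(h @ W + b)               # stacked SELU layers
    logit = h @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))   # probability of the positive class


params = init_params(vocab_size=50, dim=16, hidden_sizes=[32, 32])
ids = np.array([[3, 17, 42, 8], [1, 1, 30, 49]])
print(forward(ids, params))
```

Because the embeddings are averaged before the dense layers, shuffling the columns of `ids` leaves the output unchanged, which is the property that makes a FastText-style model a natural fit for unordered categorical fields.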
Towards Improving the AutoML Baseline and Tuning XGBOOST
The AutoML baseline we created still stared us in the face, as all our handcrafted methods were still worse. We hence decided to switch to the tried and tested XGBOOST models to better the scores. We wrote a data pipeline for trying out different tricks we mentioned (successful/unsuccessful) and a pipeline to grid search over different hyperparameters and try a five-fold cross-validation.
Grid search example for the single validation set
The tricks that worked above, combined with grid search, gave massive boosts to our scores, and we could beat 0.2 logloss and then 0.9 logloss score too. We ran TPOT AutoML again on a dataset generated with our successful tricks, but the best pipeline it found only got close to 0.2 logloss on the leaderboard. So ultimately, the XGBOOST model turned out to be the best one. We couldn’t get accuracy of the same order when we grid searched over the parameters of a random forest algorithm.
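A grid search with five-fold cross-validation scored on log loss can be sketched with scikit-learn as below. To keep the example dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in for XGBOOST; the synthetic data and the parameter grid are assumptions, not the values we searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for one country's training data.
X, y = make_classification(n_samples=300, n_features=15, weights=[0.8, 0.2], random_state=0)

# Hypothetical grid; real searches usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="neg_log_loss",  # the competition metric, negated so that higher is better
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print("best params  :", search.best_params_)
print("best log loss:", -search.best_score_)
```

Since `xgboost.XGBClassifier` follows the scikit-learn estimator interface, it can be passed to `GridSearchCV` in the same way, with a grid over its own hyperparameters.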
Our score/rank got slightly worse on the private leaderboard as compared to public competition leaderboard. We ended the competition at around the 90th percentile.
Opinions expressed by DZone contributors are their own.