Recurrent Neural Networks for Email List Churn Prediction
Learn how recurrent neural networks can be used to analyze MailChimp data to predict and prevent high levels of email churn.
Not long after writing up my lessons learned from building a Hello World neural network, I decided to move on from a simple MLP to a more sophisticated neural net. It's probably thanks to Karpathy's blog post about the unreasonable effectiveness of recurrent neural networks that I chose RNNs as the next step.
To be honest, though, this wasn't my only motive. Those who read my posts will already know that one of the problems I have studied extensively during the past few months is mailing list churn prediction using data from MailChimp. I covered it in a series of posts:
- How to Predict Churn: When Do Email Recipients Unsubscribe?
- How to Predict Churn: A Model Can Get You as Far as Your Data Goes
- Predicting Email Churn With NBD/Pareto
- Lessons Learned From Building a Hello World Neural Network
So, churn prediction boils down to time series analysis, and RNNs tend to do well on this kind of problem.
Recurrent Neural Networks
I'm not going to dive into a lot of technical details about what RNNs are and how they work, as there are plenty of sources online explaining them in detail.
I'm just going to mention that the distinguishing feature between RNNs and the other types of neural networks is that only RNNs have a feedback loop. That means that at each time step, a recurrent neuron receives the input as well as the output of the previous step. That's exactly the reason why they are suitable for time series predictions.
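To make the feedback loop concrete, here is a minimal NumPy sketch of a vanilla recurrent layer. It is only an illustration and not the code used in this post: the function name `rnn_forward`, the weight shapes, and the `tanh` activation are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a single vanilla recurrent layer over a sequence.

    Returns the hidden state produced at every time step.
    """
    hidden = np.zeros(w_rec.shape[0])  # initial state: nothing remembered yet
    states = []
    for x_t in inputs:
        # The feedback loop: the new state depends on the current input
        # and on the state produced at the previous time step.
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden + bias)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 3))      # 5 time steps, 3 features each
w_in = rng.normal(size=(4, 3)) * 0.5    # input-to-hidden weights
w_rec = rng.normal(size=(4, 4)) * 0.5   # hidden-to-hidden (feedback) weights
bias = np.zeros(4)

states = rnn_forward(sequence, w_in, w_rec, bias)
print(states.shape)  # (5, 4): one 4-unit hidden state per time step
```

The only thing that separates this from a plain feed-forward layer is the `w_rec @ hidden` term, which carries information from earlier steps forward.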
A simple diagram representing exactly this process is the following:
Due to this feedback loop, we can claim that recurrent neurons have memory cells of some kind, and so when making a decision, they can take previous states into consideration, too.
A specific type of neuron with a memory cell is the LSTM (long short-term memory) unit. No need to worry much about what LSTMs are or how they work. Just keep in mind that they are ordinary neurons with some extra properties that allow them to perform even better.
Collecting and Transforming Data
The first step was data collection and preprocessing.
Collecting the data wasn't that difficult, but figuring out how it had to be formatted in order to be fed into the RNN was the most painful part of the process.
As in all previous posts, the data I chose to work with came directly from mailing campaigns launched through MailChimp. I used Blendo, a data integration tool, to sync MailChimp data with PostgreSQL in a few minutes. Of the tables the service provides, I used only the subset describing the actions that a recipient performed in a particular month. Here is a sample data model of the MailChimp schema that was built for me in my database.
Based on this, my output would be a prediction of who is going to perform at least one action in the next month so that those who aren't going to open an email or click a link will be considered high-risk.
What was most troubling to me was the fact that I wanted my input to be a large number of uncorrelated time series, each corresponding to a recipient. Well, a correlation might actually exist, but in any case, I didn't want to consider the time series of a recipient as a feature for predicting another recipient's behavior.
After a lot of googling, multiple experiments, and just as many failures, I ended up with the following format. Each column of my input represents a recipient and each row represents a month. The value for a month is set to 1 when that recipient has opened an email or clicked a link at least once during the month, and to 0 otherwise.
Another problem I faced was the fact that the length of each time series wasn't the same, as not all recipients were subscribed during the same calendar months. I overcame this by padding shorter time series with zeros.
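The format described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original preprocessing code: the helper `build_activity_matrix`, the `events` dictionary, and the email addresses are hypothetical, and months are assumed to be already mapped to integer indices.

```python
import numpy as np

def build_activity_matrix(events, n_months):
    """Rows are months, columns are recipients.

    A cell is 1 if the recipient opened or clicked at least once that month.
    Months outside a recipient's subscription simply stay 0 (zero padding).
    """
    recipients = sorted(events)
    matrix = np.zeros((n_months, len(recipients)), dtype=np.int8)
    for col, recipient in enumerate(recipients):
        for month_index in events[recipient]:
            matrix[month_index, col] = 1
    return recipients, matrix

# Hypothetical input: the month indices in which each recipient was active.
events = {
    "alice@example.com": [0, 1, 3],
    "bob@example.com": [2, 3],    # subscribed later, so months 0-1 stay zero
    "carol@example.com": [],      # never opened or clicked
}

recipients, matrix = build_activity_matrix(events, n_months=4)
print(recipients)
print(matrix)
```

Starting from an all-zero matrix means the padding comes for free: any month in which a recipient was not subscribed looks the same as a month with no activity, which is a simplification worth keeping in mind.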
But how many months will you have available? Three years? Five? Ten? Even ten years is only 120 months, i.e., 120 data points. Studying the data at a more granular level (for example, days) seemed a better option.
Even then, the clean data I had at my disposal covered about 2.5 years, resulting in fewer than 150 data points.
Although training an RNN with so little data seemed like an impossible mission, I gave it a try. Here are the results.
Building the Network
For building the RNN, Keras was my choice, as it is very easy and straightforward to use. I built a linear stack of three layer instances:
- LSTM: The main recurrent layer of the constructed network. Refer here for more information.
- Dense: A regular densely connected layer. Refer here for more information.
- Activation: Specifies the activation function that will be applied to the output. I chose it to be linear, which is probably the simplest and seemed to behave quite well.
The number of input neurons was set to 1,349, the size of the dataset, and the number of neurons per hidden layer was set to 300.
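A stack like the one described above might look roughly like the sketch below. It should be read as an approximation rather than the exact code from this post: the `build_model` helper, the `window` length, the output size, and the `adam` optimizer are assumptions made for illustration, while the three layer types, the linear activation, and the 1,349 and 300 sizes come from the text.

```python
from tensorflow import keras

def build_model(n_recipients, hidden_units, window):
    """Linear stack of LSTM -> Dense -> linear Activation.

    Each input sample is assumed to be `window` consecutive time steps,
    with one feature per recipient at every step.
    """
    model = keras.Sequential([
        keras.Input(shape=(window, n_recipients)),
        keras.layers.LSTM(hidden_units),       # main recurrent layer
        keras.layers.Dense(n_recipients),      # one output per recipient (assumption)
        keras.layers.Activation("linear"),     # linear output activation
    ])
    # Mean squared error as the loss; the optimizer is an assumption.
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model

# Sizes from the text: 1,349 inputs and 300 hidden units.
# The window of 30 time steps is a placeholder, not a value from the post.
model = build_model(n_recipients=1349, hidden_units=300, window=30)
model.summary()
```

Wrapping the stack in a function makes it easy to try smaller sizes first, which helps when the dataset is as small as the one used here.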
Training and Evaluating
Before moving to the actual training, I had to split the data into "train" and "test." In the evaluation of the model's performance, this is an important step that cannot be omitted even when the dataset is already too small.
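For time series, one common way to split is chronologically, so that the test set only contains time steps that come after the training ones. The post does not say exactly how the split was done, so the sketch below is an assumption, including the `chronological_split` helper and the 80/20 ratio.

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Split a (time_steps, recipients) array into an earlier and a later part.

    No shuffling: the test rows always come after the training rows in time.
    """
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

# Toy data: 10 time steps for 2 recipients.
series = np.arange(20).reshape(10, 2)
train, test = chronological_split(series)
print(train.shape, test.shape)  # (8, 2) (2, 2)
```

Shuffling before splitting would leak future behavior into the training set, which tends to make the evaluation look better than it really is.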
The chosen loss function was the mean squared error, i.e., the average of the squares of the errors between predictions and actual values, which training tries to minimize through backpropagation. More information about the available loss functions can be found in the Keras documentation.
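As a quick illustration of what this loss measures, here is a small NumPy version. Keras computes it internally, so the `mean_squared_error` function below is only a hand-rolled sketch, and the sample values are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up activity labels (1 = active, 0 = inactive) and predictions.
actual = [1, 0, 1, 0]
predicted = [0.5, 0.5, 1.0, 0.0]
print(mean_squared_error(actual, predicted))  # 0.125
```

Two predictions are off by 0.5 and two are exact, so the loss is (0.25 + 0.25 + 0 + 0) / 4 = 0.125.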
After training the model for 15 epochs, the following graphs were generated for the train dataset and the test dataset.
The good news is that the loss on the training set decreases as the epochs go by. The bad news is that, on the test dataset, the loss does not decrease steadily but instead has its ups and downs.
This behavior is expected, and it indicates that the network isn't actually learning. Instead, it "memorizes" the training points and cannot generalize to new, unseen data. This is a consequence of the insufficient amount of data I had available for this problem. Given a larger amount of data, the network might be able to generalize and generate better predictions on the test dataset.
On the bright side, according to my intuition and not some super scientific explanation, the fact that the network overfits the training data as expected suggests that the code is more or less sane and can probably be used, with a few modifications, on other similar problems.
In this post, I tried to tackle a churn prediction task, made mistakes, and learned from them. I also had the chance to play with recurrent neural networks, a technique of immense value to intelligent systems, although I only scratched the surface of how they work.
From the results, it seemed that the amount of data available was insufficient, so in order to move forward, one suggestion would be data enrichment. Data from MailChimp can be combined with behavioral data from other sources like Mixpanel and Intercom. This way, the time series will be expanded with more data points for each user, which should help the model perform better at predicting churn.
But again, success is not guaranteed, as training neural networks indeed requires a large amount of data.
In any case, I would encourage you to get the code, which is available on GitHub, and try to implement it on your own data. I would love to hear your comments regarding how the model performed on different churn tasks, so please share your experience in the comments!
Published at DZone with permission of Eleni Markou, DZone MVB. See the original article here.