This is the first in a series of blog posts in which we'll take a single use case, predicting flight cancellations, and build a model for it on a few different machine learning platforms. In part one, we'll talk about the use case, how and why we limited the scenario, and the data we gathered to start the data science/machine learning process.
For our use case, we chose flight cancellations and weather data for a few different reasons. We wanted a project that would:
- Already have a reasonably large amount of data, but not so much that we'd need more than just our laptop to do the data processing.
- Require federating data from more than just one source.
- Require the various steps of a real data science/machine learning project, as laid out in a process such as CRISP-DM.
Figure: CRISP-DM process diagram. Image credit: Kenneth Jensen, Wikimedia Commons.
Many people think that "training" a model is all that a machine learning project consists of. It doesn't take much reading about data science to know that things like data collection, data preparation, data exploration, and data engineering can take up the largest amount of your time on such a project. So, we wanted a use case and data set that required all of this.
And so we decided to see how well we can predict airline flight cancellations if we include weather data with historical flight data. This required all of the things we were looking for but also ended up including something we didn't think about ahead of time: the fact that this data is heavily imbalanced. Specifically, of all the flights in the data set, only a small percentage are actually canceled.
This fact pushed us to a greater understanding of how to deal with heavily imbalanced classes in our data. First, for this problem, "accuracy" is a terrible measure. A model that simply predicts that no flight will ever be canceled gets great accuracy, yet it is useless, because it never identifies a single cancellation. We needed to look at measures like the confusion matrix, precision, recall, and ROC curves. Next, we wanted to try different algorithms and techniques: oversampling and undersampling, penalizing misclassification of our rare class, and a few other things like the SMOTE algorithm. Heavily imbalanced data makes analysis difficult, but we realized that it's also pretty common in real-life scenarios.
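To make the point about accuracy concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (980 flights flown, 20 canceled) and are not figures from our data set; the function names are our own, not part of any library.

```python
# Illustrative only: synthetic labels, not figures from the flight data.
# 1 = canceled (the rare class), 0 = not canceled.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating 1 (canceled) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Compute accuracy, precision, and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted positive
    # or when there are no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up sample: 980 flights that flew, 20 that were canceled.
y_true = [0] * 980 + [1] * 20

# A "model" that always predicts "not canceled".
always_fly = [0] * len(y_true)

baseline = scores(y_true, always_fly)
print(baseline)
```

On this toy sample the always-"not canceled" baseline is right for every flight that flew and wrong for every cancellation, so its accuracy looks high while its recall on the class we care about is zero. That is why we look at the confusion matrix rather than a single accuracy number.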
Limiting the Scope
We decided that running our analysis for every airport in the world would be too big in scope. Even limiting ourselves to airports in the USA would be more than we needed for our project. So we decided to limit the analysis to the ten US airports most affected by weather. That left us with a manageable amount of data, and we suspected the data itself would be less imbalanced, since airports with more weather trouble should see a larger share of cancellations. A quick search gave us this site, 10 Most Weather-Delayed US Major Airports, and the ten airports we would use.
To get US flight data, we used this United States Department of Transportation site, whose filters let us isolate the features we wanted. Unfortunately, the site can only deliver data for one month at a time. So we had to gather twelve separate files for the twelve months of 2016, which increased the complexity of the data engineering: we first had to merge the twelve files and then filter out all but the ten desired airports. Not difficult, but a real-world task. The twelve files held over five million records, well beyond Excel's limit of 1,048,576 rows per worksheet, so it wasn't something that could be done in Excel.
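The merge-then-filter step can be sketched with nothing more than Python's standard `csv` module. This is only an illustration under assumptions: the column names (`FL_DATE`, `ORIGIN`, `CANCELLED`), the choice to filter on the origin airport, and the airport codes `AAA`/`BBB` are placeholders we made up, and two tiny in-memory "files" stand in for the twelve monthly downloads.

```python
import csv
import io

# Placeholder codes; the real set would hold the ten airports chosen above.
AIRPORTS = {"AAA", "BBB"}


def merge_and_filter(monthly_files, airports, origin_column="ORIGIN"):
    """Concatenate monthly CSV files, keeping rows whose origin is in `airports`.

    `monthly_files` is an iterable of open text files (or file-like objects)
    that all share the same header row. The column name is an assumption.
    """
    kept = []
    for handle in monthly_files:
        for row in csv.DictReader(handle):
            if row[origin_column] in airports:
                kept.append(row)
    return kept


# Tiny in-memory stand-ins for two of the monthly downloads.
january = io.StringIO(
    "FL_DATE,ORIGIN,CANCELLED\n"
    "2016-01-01,AAA,0\n"
    "2016-01-01,ZZZ,1\n"
)
february = io.StringIO(
    "FL_DATE,ORIGIN,CANCELLED\n"
    "2016-02-01,BBB,1\n"
    "2016-02-02,ZZZ,0\n"
)

flights = merge_and_filter([january, february], AIRPORTS)
print(len(flights), "rows kept")
```

With real downloads you would pass open file handles instead of the `io.StringIO` objects. Filtering while reading, as above, means the rows for unwanted airports never have to be held in memory, which matters once the input runs to millions of records.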
Next, we used The Weather Company API to get historical weather data for those ten airport sites for 2016. Our plan was to combine these two data sources as a part of the data preparation and data engineering.
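As a rough sketch of what combining the two sources could look like, the snippet below joins each flight to the weather recorded at its origin airport on the same date. Everything here is hypothetical: the field names, the daily granularity, and the sample values are illustrative and do not reflect the real schemas of either data source.

```python
# Hypothetical records; field names are illustrative, not the real schemas.
flights = [
    {"date": "2016-01-01", "origin": "AAA", "cancelled": 0},
    {"date": "2016-01-02", "origin": "AAA", "cancelled": 1},
    {"date": "2016-01-02", "origin": "BBB", "cancelled": 0},
]
weather = [
    {"date": "2016-01-01", "airport": "AAA", "wind_speed": 5, "snow": 0},
    {"date": "2016-01-02", "airport": "AAA", "wind_speed": 30, "snow": 1},
]


def join_weather(flights, weather):
    """Left-join each flight to the weather at its origin on that date.

    Flights with no matching observation are kept, with None in the
    weather fields, so missing data stays visible for later cleaning.
    """
    # Index weather observations by (airport, date) for constant-time lookup.
    by_key = {(w["airport"], w["date"]): w for w in weather}
    # Every weather field except the two join keys.
    value_fields = sorted({k for w in weather for k in w} - {"airport", "date"})

    joined = []
    for flight in flights:
        obs = by_key.get((flight["origin"], flight["date"]), {})
        row = dict(flight)
        for field in value_fields:
            row[field] = obs.get(field)
        joined.append(row)
    return joined


combined = join_weather(flights, weather)
for row in combined:
    print(row)
```

A left join is used here so that no flight is silently dropped when a weather observation is missing; whether to drop, impute, or keep those rows is a data preparation decision in its own right.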
Our goal for this use case was to come up with an exercise for creating a machine learning model using a few different platforms.
In the next post in the series, we'll use IBM's SPSS Modeler, which is ideal for beginners because of its visual interface, its many machine learning algorithms (including one that finds the best algorithm to use), and its easy ways to explore, prepare, and transform data.
In the third post, we'll try replicating our efforts using IBM's DSX (Data Science Experience) Cloud platform with Watson Machine Learning (WML). Creating a Jupyter notebook in Python might give us more flexibility in code than the GUI in SPSS. Admittedly, it's also likely a harder task if you're not a wizard with Python, so it could take a bit longer. WML is still in beta, but we'll see what it can do.
For the final post, we'll try converting our original SPSS model into a "flow", a new capability coming soon to IBM's DSX that provides SPSS Modeler capabilities directly within DSX. Recreating the model as a flow in the cloud should prove interesting.
To be clear, we aren't trying to create a production-quality model. That would require a lot more work and time. Instead, we want to create something that works reasonably well and that can be built on each of the platforms described. At the same time, with a little more work and expertise, the project could possibly be tuned to the point of being production quality. If so, we can imagine hotels using it to generate real-time advertising in the airports where it predicts flights will be canceled. Or Uber might use it to gear up more cars for stranded passengers. Or perhaps the airport itself could use the model to prepare better for cancellations and offer better experiences for flyers. Let us know of any other ideas that come to mind.