Unfriendly Skies: Predicting Flight Cancellations Using Weather Data (Part 3)
Flight cancellations can be predicted with the magic of AI and predictive analytics. Check out how to do it with IBM's Data Science Experience!
In Part 1 of this series, we wrote about our goal to explore a use case and use various machine learning platforms to see how we might build classification models with those platforms to predict flight cancellations. Specifically, we hoped to predict the probability of the cancellation of flights between the ten U.S. airports most affected by weather. We used historical flight data and historical weather data to make predictions for upcoming flights.
Tools Used in This Use Case Solution
DSX (IBM's Data Science Experience) is a collaborative platform for data scientists, built on open-source components plus IBM-added value, and available in the cloud or on-premises. In the simplest terms, DSX is a managed Apache Spark cluster with a notebook front-end. By default, it includes integration with data tools like a data catalog and data refinery, Watson Machine Learning services, collaboration capability, model management, and the ability to automatically review a model's performance and refresh/retrain the model with new data. And IBM is quickly adding more capabilities. Read here to see what IBM is doing lately for data science.
A Python Notebook Solution
In this case, we followed roughly the same steps we used in the SPSS model from Part 2 — only this time, we wrote Python code in a Jupyter notebook to get similar results. We encourage readers to come up with their own solutions. Let us know. We'd love to feature your approaches in future blog posts.
The first step of the iterative process is gathering and understanding the data needed to train and test our model. Since we did this work for Part 2, we made use of the analysis here.
- Flights data: We gathered data for 2016 flights from the U.S. Bureau of Transportation Statistics website. The website allowed us to export one month at a time, so we ended up with twelve CSV (comma-separated values) files. Importing those as DataFrames and merging them into a single DataFrame was straightforward.
Figure 1: Gathering and preparing flight data in IBM DSX
- Weather data: With the latitude and longitude of the 10 Most Weather-Delayed U.S. Major Airports, we used one of the Weather Company's APIs to get the historical hourly weather data for all of 2016 for each of the ten airport locations and created a CSV file that became our data set in the notebook.
- Combined flights and weather data: To each flight in the first dataset, we added two new columns, one for the origin airport and DEST for the destination, containing the respective airport codes. Next, we merged the flight data and the weather data so that the resulting DataFrame contained the flight data along with the weather for the corresponding origin and destination airports.
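The gathering steps above can be sketched roughly as follows. This is a minimal illustration with tiny in-memory stand-ins, not the notebook's actual code: every column name (`FL_DATE`, `HOUR`, `ORIGIN`, `AIRPORT`, `WIND_SPEED`, and so on) and the helper `add_weather` are assumptions made for the example, and matching destination weather on the departure hour is a simplification.

```python
import pandas as pd

# Stand-ins for two of the twelve monthly exports. With real files, each
# frame would come from pd.read_csv(path). All column names are hypothetical.
jan = pd.DataFrame({
    "FL_DATE": ["2016-01-01", "2016-01-01"],
    "HOUR": [8, 9],
    "ORIGIN": ["ORD", "LGA"],
    "DEST": ["LGA", "ORD"],
    "CANCELLED": [0, 1],
})
feb = pd.DataFrame({
    "FL_DATE": ["2016-02-01"],
    "HOUR": [8],
    "ORIGIN": ["ORD"],
    "DEST": ["LGA"],
    "CANCELLED": [0],
})

# Stack the monthly frames into a single flights DataFrame.
flights = pd.concat([jan, feb], ignore_index=True)

# Hourly weather per airport (hypothetical layout and values).
weather = pd.DataFrame({
    "AIRPORT": ["ORD", "LGA", "ORD", "LGA", "ORD", "LGA"],
    "FL_DATE": ["2016-01-01"] * 4 + ["2016-02-01"] * 2,
    "HOUR": [8, 8, 9, 9, 8, 8],
    "WIND_SPEED": [12, 30, 15, 35, 5, 7],
})

def add_weather(flights_df, weather_df, airport_col):
    """Left-join weather onto flights for the airport named in airport_col,
    prefixing the weather columns so origin and destination do not collide."""
    keys = ["FL_DATE", "HOUR"]
    value_cols = [c for c in weather_df.columns if c not in keys + ["AIRPORT"]]
    renamed = weather_df.rename(
        columns={"AIRPORT": airport_col,
                 **{c: f"{airport_col}_{c}" for c in value_cols}}
    )
    return flights_df.merge(renamed, on=keys + [airport_col], how="left")

# Join the weather twice: once for the origin, once for the destination.
combined = add_weather(flights, weather, "ORIGIN")
combined = add_weather(combined, weather, "DEST")
print(combined)
```

A left join keeps every flight even when a weather reading is missing, which leaves nulls to handle during data preparation.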
Data Preparation, Modeling, and Evaluation
To start preparing the data, we used the combined flights and weather data from the previous step and performed some cleanup. We deleted columns of features that we didn't need and replaced null values in rows where flight cancellations were not related to weather conditions.
Next, we took the features we discovered when we created a model using SPSS (such as flight date, hour, the day of the week, origin and destination airport codes, and weather conditions) and we used them as inputs to our Python model. We also chose the target feature for the model to predict: the cancellation status. We deleted the remaining features.
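As a rough sketch of this step, the snippet below keeps a set of input features plus the target and drops everything else. The column names are hypothetical, and filling a missing weather reading with the column median is only one possible way to handle nulls; it is an assumption for illustration, not the rule we describe above.

```python
import pandas as pd

# Hypothetical combined frame; TAIL_NUM stands in for a column we do not need.
combined = pd.DataFrame({
    "HOUR": [8, 9, 8],
    "DAY_OF_WEEK": [5, 5, 1],
    "ORIGIN": ["ORD", "LGA", "ORD"],
    "DEST": ["LGA", "ORD", "LGA"],
    "ORIGIN_WIND_SPEED": [12.0, None, 5.0],
    "TAIL_NUM": ["N101", "N202", "N303"],
    "CANCELLED": [0, 1, 0],
})

feature_cols = ["HOUR", "DAY_OF_WEEK", "ORIGIN", "DEST", "ORIGIN_WIND_SPEED"]
target_col = "CANCELLED"

# Keep only the chosen inputs and the target; all other columns are dropped.
model_df = combined[feature_cols + [target_col]].copy()

# Assumed null handling: fill a missing reading with the column median.
median_wind = model_df["ORIGIN_WIND_SPEED"].median()
model_df["ORIGIN_WIND_SPEED"] = model_df["ORIGIN_WIND_SPEED"].fillna(median_wind)

X = model_df[feature_cols]
y = model_df[target_col]
print(X)
```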
Next, we ran OneHotEncoder on the four categorical features. One-hot encoding converts categorical features into a numeric format that works better with many classification and regression algorithms. Figure 2 shows how one-hot encoding significantly expands the number of feature columns.
Figure 2: One-hot encoding expands four feature columns into many more
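To make the expansion concrete, here is a small sketch using scikit-learn's `OneHotEncoder`. The choice of these four columns as the categorical features is an assumption for the example; the data is made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical inputs (which four columns were encoded in the
# notebook is an assumption here).
X = pd.DataFrame({
    "ORIGIN": ["ORD", "LGA", "SFO", "ORD"],
    "DEST": ["LGA", "ORD", "ORD", "SFO"],
    "DAY_OF_WEEK": [1, 2, 2, 7],
    "HOUR": [8, 9, 17, 8],
})
categorical_cols = ["ORIGIN", "DEST", "DAY_OF_WEEK", "HOUR"]

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X[categorical_cols]).toarray()

# Each input column becomes one 0/1 column per distinct value.
print(len(categorical_cols), "columns ->", encoded.shape[1], "columns")
```

With three distinct values in each of the four columns, the four inputs become twelve indicator columns; on a full year of flights the expansion is much larger.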
Interestingly, the flight data is heavily imbalanced. Specifically, as seen in Figure 3, of all the flights in the data set only a small percentage are actually canceled.
Figure 3: Historical data; distribution of canceled (1) and non-canceled (0) flights
To address that skew in the original data, we tried oversampling the minority class, undersampling the majority class, and a combination of both, but none of these approaches worked well. We then tried SMOTE (Synthetic Minority Over-sampling Technique), a more advanced oversampling algorithm for imbalanced datasets. Because it generates synthetic examples rather than replicating existing ones, it helped our selected model work more effectively by mitigating the overfitting that random oversampling can cause. SMOTE isn't considered effective for high-dimensional data, but that isn't the case here.
In Figure 4, we notice a balanced distribution between canceled and non-canceled flights after running the data through SMOTE.
Figure 4: Distribution of canceled and non-canceled flights after using SMOTE
It's important to mention that we applied SMOTE only to the training dataset, not the test dataset. A detailed blog by Nick Becker guided our choices in the notebook.
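The core idea behind SMOTE can be sketched in a few lines of NumPy: pick a minority row, pick one of its nearest minority neighbors, and create a new row somewhere on the line between them. The function `smote_sketch` below is a simplified illustration written for this post, not the library implementation; in a real notebook you would more likely use the `SMOTE` class from the imbalanced-learn package. The data is synthetic.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic imbalanced data: 10% "canceled" (1), 90% "not canceled" (0).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.zeros(200, dtype=int)
y[:20] = 1
X[y == 1] += 2.0

# Split first, then oversample ONLY the training portion.
order = rng.permutation(200)
train_idx, test_idx = order[:150], order[150:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

n_needed = int((y_train == 0).sum() - (y_train == 1).sum())
X_new = smote_sketch(X_train[y_train == 1], n_needed)
X_train_bal = np.vstack([X_train, X_new])
y_train_bal = np.concatenate([y_train, np.ones(n_needed, dtype=int)])

print("train before:", np.bincount(y_train), "after:", np.bincount(y_train_bal))
print("test (untouched):", np.bincount(y_test))
```

Keeping the test set untouched means the evaluation still reflects the real, imbalanced distribution of cancellations.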
At this point, we used the Random Forest Classifier for our model. It performed best when we used SPSS, so we used it again in our notebook. We have several ideas for the second iteration of our model in order to tune it, one of which is to try multiple algorithms to see how they compare.
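For readers who want a starting point, a minimal scikit-learn version of this step might look like the following. The data is synthetic and the hyperparameters (`n_estimators=100`, a 25% test split) are illustrative assumptions, not the settings from our notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the cancellation target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Stratify so train and test keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Probability of the positive class ("canceled") for each test row.
cancel_proba = model.predict_proba(X_test)[:, 1]
accuracy = model.score(X_test, y_test)
print("test accuracy:", round(accuracy, 3))
```

Using `predict_proba` rather than `predict` matches the goal stated in Part 1: predicting the probability of a cancellation, not just a yes/no label.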
Since this use case deals with classification analysis, we used some of the common ways to evaluate the performance of the model: the confusion matrix, F1 score, and ROC curve, among some others. Figures 5 and 6 show the results.
Figure 5: Test/validation results
Figure 6: ROC curve for training dataset
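The metrics named above are all available in scikit-learn. The sketch below computes them on a tiny hand-made set of labels and scores (not our results) so the numbers are easy to verify by hand; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve

# Hand-made example: true labels and predicted cancellation probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.8, 0.7, 0.35, 0.9])

# Turn probabilities into labels with an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points for plotting

print(cm)
print("F1:", f1, "AUC:", round(auc, 4))
```

Here the confusion matrix is `[[5, 1], [1, 3]]`, the F1 score is 0.75, and the AUC is 22/24 (about 0.9167). On imbalanced data like ours, F1 and the ROC curve are more informative than plain accuracy, which a model can inflate by always predicting "not canceled."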
Figure 6 is the ROC curve from the training dataset. Figure 5 shows us that the results of the training and test datasets are pretty close, which is a good indication of consistency, though we realize that with some tuning it could get better. Nevertheless, we decided that the results were still good for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to refine the model further or even to use other models to solve this use case.
This was a project to compare creating a model in IBM's SPSS with IBM's Data Science Experience. SPSS offers a no-code experience while DSX offers the best of open-source coding capability with many IBM value-adds. SPSS is an amazing product and gets better with every release, adding many new capabilities.
IBM's Data Science Experience is a great platform for both beginning and experienced data scientists. Anyone can log in and have immediate access to a managed Spark cluster with a choice of front-ends: a Jupyter notebook using Scala, Python, or R; SPSS; or a visual data modeler (no coding). It offers easy collaboration with other users, including adding other data scientists who could then look over our shoulders and make suggestions. The community is active and has already contributed dozens of tutorials, datasets, and notebooks. If we had added Watson Machine Learning (WML), we could very easily have deployed and managed our model with an instant REST endpoint to call from any application. If our data were changing, we could have WML review our model periodically and retrain it with any new data if our metric (the area under the ROC curve) fell below a given threshold. That, along with the data cataloging and data refinery tooling added recently, makes this a platform worth checking out for any data science project.
SPSS has a lot, but not everything. Writing the Python code in a notebook was a bit more time-consuming than what we did in SPSS, but it also gave us quite a bit more flexibility and freedom. We had access to everything in the Python libraries, and of course, one of the benefits of Python as an open-source language is the trove of helpful examples.
I would say both platforms have their place, and neither can claim to be better for everything. Those doing data science for the first time will probably find SPSS an easier place to start given its drag-and-drop user interface. Those who have come out of school as programming wizards will want to write code, and DSX will give them a great way to do that without worrying about installing, configuring, and correctly integrating various product versions.
The IBM notebook and data that form the basis for this blog are available on GitHub.
Published at DZone with permission of Tim Bohn, DZone MVB. See the original article here.