Unfriendly Skies: Predicting Flight Cancellations Using Weather Data (Part 2)
Learn how to use the IBM SPSS Modeler and APIs from The Weather Company to predict flight cancellations. And oh, yeah — not one line of code was written in the making of this blog.
As we described in Part 1 of this series, our objective is to help predict the probability of the cancellation of a flight between two of the ten U.S. airports most affected by weather conditions. We use historical flights data and historical weather data to make predictions for upcoming flights.
Tools Used in This Use Case Solution
IBM SPSS Modeler is designed to help discover patterns and trends in structured and unstructured data with an intuitive visual interface supported by advanced analytics. It provides a range of advanced algorithms and analysis techniques, including text analytics, entity analytics, decision management, and optimization to deliver insights in near real-time. For this use case, we used SPSS Modeler 18.1 to create a visual representation of the solution, or in SPSS terms, a stream. That's right: not one line of code was written in the making of this blog.
We also used The Weather Company APIs to retrieve historical weather data for the ten airports over the year 2016. IBM SPSS Modeler supports calling the weather APIs from within a stream. That is accomplished by adding extensions to SPSS, available on the IBM SPSS Predictive Analytics resources page, a.k.a. the Extensions Hub.
A Proposed Solution
In this blog, we propose one possible solution for this problem. It's not meant to be the only or the best possible solution, or a production-level solution for that matter, but the discussion presented here covers the typical iterative process (described in the sections below) that helps us accumulate insights and refine the predictive model across iterations. We encourage readers to come up with different solutions and share their feedback for future blogs.
Business and Data Understanding
The first step of the iterative process includes understanding and gathering the data needed to train and test our model later.
We gathered 2016 flights data from the US Bureau of Transportation Statistics website. The website allows us to export one month at a time, so we ended up with 12 CSV (comma-separated value) files. We used IBM SPSS Modeler to merge all the CSV files into one set and to select the ten airports in our scope. Some data clean-up and formatting was done to validate dates and hours for each flight, as seen in Figure 1.
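The merge-and-filter step above was done visually in SPSS Modeler, but for readers who want to reproduce it programmatically, here is a minimal sketch using Python's standard csv module. The file contents, column names, and airport codes are illustrative assumptions, not the actual BTS export schema.

```python
# Sketch of merging monthly BTS exports and keeping only in-scope airports.
# Column names and values are assumptions for illustration.
import csv
import io

# Stand-ins for two of the twelve monthly CSV exports.
january = "FL_DATE,ORIGIN,DEST,CANCELLED\n2016-01-05,EWR,ORD,0\n2016-01-06,ATL,DFW,1\n"
february = "FL_DATE,ORIGIN,DEST,CANCELLED\n2016-02-02,EWR,SFO,0\n2016-02-03,MIA,JFK,0\n"

AIRPORTS_IN_SCOPE = {"EWR", "ORD", "SFO", "ATL", "DFW"}  # subset for the example

def merge_and_filter(monthly_csvs, airports):
    """Concatenate monthly CSV exports and keep flights whose origin
    and destination are both in scope."""
    rows = []
    for text in monthly_csvs:
        for row in csv.DictReader(io.StringIO(text)):
            if row["ORIGIN"] in airports and row["DEST"] in airports:
                rows.append(row)
    return rows

flights = merge_and_filter([january, february], AIRPORTS_IN_SCOPE)
```

In a real reproduction you would read the twelve downloaded files from disk instead of in-memory strings.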
Figure 1: Gathering and preparing flights data in IBM SPSS Modeler
From the Extensions Hub, we added the TWCHistoricalGridded extension to SPSS Modeler, which made the extension available as a node in the tool. That node took a CSV file listing the ten airports' latitude and longitude coordinates as input, and generated historical hourly weather data for the entire year of 2016, for each airport location, as seen in Figure 2.
Figure 2: Gathering and preparing weather data in IBM SPSS Modeler
Combined Flights and Weather Data
To each flight in the first data set, we added two new columns, ORIGIN and DEST, containing the respective airport codes. Next, the flight data and the weather data were merged together. Note: The "stars," or SPSS supernodes, in Figure 3 are placeholders for the diagrams in Figures 1 and 2 above.
Figure 3: Combining flights and weather data in IBM SPSS Modeler
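Conceptually, the supernodes in Figure 3 perform a keyed join between flight records and hourly weather observations. Here is a hedged sketch of that merge in plain Python; the join keys (airport code, date, departure hour) and the field names are assumptions for illustration.

```python
# Minimal sketch of joining each flight with the hourly weather at its
# origin airport. Field names and values are illustrative assumptions.
flights = [
    {"FL_DATE": "2016-01-05", "DEP_HOUR": 9, "ORIGIN": "EWR", "DEST": "ORD"},
    {"FL_DATE": "2016-01-05", "DEP_HOUR": 17, "ORIGIN": "ORD", "DEST": "EWR"},
]

# Hourly historical weather keyed by (airport, date, hour).
weather = {
    ("EWR", "2016-01-05", 9): {"temp_f": 28, "wind_mph": 22, "snow_in": 1.5},
    ("ORD", "2016-01-05", 17): {"temp_f": 12, "wind_mph": 30, "snow_in": 3.0},
}

def join_weather(flights, weather):
    """Attach origin-airport weather observations to each flight record."""
    combined = []
    for f in flights:
        key = (f["ORIGIN"], f["FL_DATE"], f["DEP_HOUR"])
        obs = weather.get(key)
        if obs is not None:  # drop flights with no matching observation
            combined.append({**f, **obs})
    return combined

merged = join_weather(flights, weather)
```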
Data Preparation, Modeling, and Evaluation
We iteratively performed the following steps until the desired model qualities were reached:

- Prepare the data
- Run the models
- Evaluate the models
Figure 4 shows the first and second iterations of our process in IBM SPSS Modeler.
Figure 4: Iterations (prepare data, run models, evaluate) — and do it again
To start preparing the data, we used the combined flights and weather data from the previous step and performed some data cleanup (for example, handling null values). To better train the model later on, we filtered out rows where flight cancellations were not related to weather conditions (such as cancellations due to technical or security issues).
Figure 5: Imbalanced data found in our input data set
This is an interesting use case, and often a hard one to solve, due to the imbalanced data it presents, as seen in Figure 5. By "imbalanced" we mean that there were far more non-canceled flights in the historical data than canceled ones. We will discuss how we dealt with imbalanced data in the following iteration.
Next, we defined which features were required as inputs to the model (such as flight date, hour, day of the week, origin and destination airport codes, and weather conditions), and which one was the target to be generated by the model (i.e., the cancellation status to be predicted). We then partitioned the data into training and testing sets, using an 85/15 ratio.
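The 85/15 partition can be sketched in a few lines of Python; the seeded shuffle below is an assumption to make the split reproducible, mirroring what the SPSS Partition node does internally.

```python
# Hedged sketch of an 85/15 train/test partition with a seeded shuffle.
import random

records = list(range(1000))  # stand-ins for the prepared flight rows

def partition(rows, train_ratio=0.85, seed=42):
    """Shuffle a copy of the rows and split at the given ratio."""
    rows = rows[:]                     # don't mutate the caller's list
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

train, test = partition(records)
```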
The partitioned data was fed into an SPSS node called Auto Classifier. This node allowed us to run multiple models at once and preview their outputs, such as the area under the ROC curve, as seen in Figure 6.
Figure 6: Models output provided by the Auto Classifier node
That was a useful step in making an initial selection of a model for further refinement during subsequent iterations. We decided to use the Random Trees model since the initial analysis showed it has the best area under the curve as compared to the other models in the list.
During the second iteration, we addressed the skewness of the original data. For that purpose, we chose one of the SPSS nodes called SMOTE (Synthetic Minority Over-sampling Technique). This node provides an advanced over-sampling algorithm that deals with imbalanced datasets, which helped our selected model work more effectively.
Figure 7: Distribution of canceled and non-canceled flights after using SMOTE
In Figure 7, we notice a more balanced distribution between canceled and non-canceled flights after running the data through SMOTE.
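To make the idea behind SMOTE concrete, here is a deliberately simplified sketch: new minority-class samples are interpolated between existing ones. Real SMOTE (as in the SPSS node) interpolates toward each sample's k nearest neighbours; this sketch picks a random minority partner instead, purely to keep the illustration short.

```python
# Simplified SMOTE-style oversampling: synthesize minority samples by
# interpolating between existing minority samples. All data is invented.
import random

rng = random.Random(0)

majority = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(95)]  # non-cancelled
minority = [[rng.uniform(5, 6), rng.uniform(5, 6)] for _ in range(5)]   # cancelled

def oversample(minority, target_count, rng):
    """Grow the minority class to target_count by linear interpolation
    between randomly chosen pairs of minority samples."""
    synthetic = list(minority)
    while len(synthetic) < target_count:
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

balanced_minority = oversample(minority, len(majority), rng)
```

After oversampling, both classes contribute the same number of rows, which is the balanced distribution Figure 7 shows.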
As mentioned earlier, we picked the Random Trees model for this sample solution. This SPSS node provides a model for tree-based classification and prediction that is built on Classification and Regression Tree methodology. Due to its characteristics, this model is much less prone to overfitting, which gives a higher likelihood of repeating the same test results when you use new data, that is, data that was not part of the original training and testing data sets. Another advantage of this method — particularly for our use case — is its ability to handle imbalanced data.
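The Random Trees node is a random-forest-style ensemble built on CART trees. As a rough open-source stand-in (not the SPSS implementation itself), here is scikit-learn's RandomForestClassifier on a tiny synthetic weather-like data set; the feature values and the cancellation pattern are invented for illustration.

```python
# Rough stand-in for the SPSS Random Trees node: a random forest on
# invented weather features. Features: [wind_mph, snow_in]; label: 1 = cancelled.
from sklearn.ensemble import RandomForestClassifier

X = [[5, 0], [8, 0], [10, 0.1], [12, 0],        # calm weather, flights operated
     [35, 3.0], [40, 4.5], [38, 2.5], [45, 5.0]]  # storms, flights cancelled
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

predictions = model.predict([[7, 0], [42, 4.0]])
```

Because each tree sees a bootstrap sample and a random subset of features, the ensemble averages out individual trees' overfitting, which is the robustness property the paragraph above describes.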
Since in this use case we are dealing with classification analysis, we used two common ways to evaluate the performance of the model: the confusion matrix and the ROC curve. One of the outputs of running the Random Trees model in SPSS is the confusion matrix seen in Figure 8. The table shows the precision achieved by the model during training.
Figure 8: Confusion Matrix for canceled vs. non-canceled flights
In this case, the model's precision was about 95% for predicting canceled flights (true positives), and about 94% for predicting non-canceled flights (true negatives). That means the model was correct most of the time, but also made wrong predictions about 4-5% of the time (false negatives and false positives).
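The per-class percentages are read straight off the confusion matrix. The counts below are made up for illustration (chosen to reproduce the roughly 95%/94% figures quoted above), but the arithmetic is exactly what the evaluation does.

```python
# Reading per-class accuracy off a confusion matrix.
# Keys are (actual, predicted); counts are invented for illustration.
matrix = {
    ("cancelled", "cancelled"): 950,          # true positives
    ("cancelled", "not_cancelled"): 50,       # false negatives
    ("not_cancelled", "not_cancelled"): 940,  # true negatives
    ("not_cancelled", "cancelled"): 60,       # false positives
}

def class_accuracy(matrix, label):
    """Fraction of rows with actual class `label` predicted correctly."""
    correct = matrix[(label, label)]
    total = sum(v for (actual, _), v in matrix.items() if actual == label)
    return correct / total

cancelled_rate = class_accuracy(matrix, "cancelled")          # 950/1000 = 0.95
not_cancelled_rate = class_accuracy(matrix, "not_cancelled")  # 940/1000 = 0.94
```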
That was the precision given by the model using the training dataset. This is also represented by the ROC curve on the left side of Figure 9. We can see, however, that the area under the curve for the training data set was better than the area under the curve for the testing dataset (right side of Figure 9), which means that during testing the model did not perform as well as during training (i.e., it produced a higher rate of false negatives and false positives).
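The area under the ROC curve has a useful probabilistic reading: it equals the probability that a randomly chosen canceled flight receives a higher model score than a randomly chosen non-canceled one. The scores below are invented to illustrate a model doing better on training data than on test data, as in Figure 9.

```python
# Rank-based AUC: probability a random positive outscores a random negative.
def auc(positive_scores, negative_scores):
    """Pairwise comparison AUC; ties count as half a win."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

train_auc = auc([0.9, 0.8, 0.85], [0.1, 0.2, 0.3])  # perfectly separated
test_auc = auc([0.7, 0.4], [0.5, 0.2])              # some overlap, lower AUC
```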
Figure 9: ROC curves for the training and testing data sets
Nevertheless, we decided that the results were still good for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to further refine this model or even to use other models that could solve this use case.
Deploying the Model
Finally, we deployed the model as a REST API that developers can call from their applications. For that, we created a "deployment branch" in the SPSS stream. Then, we used the IBM Watson Machine Learning service available on IBM Bluemix. We imported the SPSS stream into the Bluemix service, which generated a scoring endpoint (or URL) that application developers can call. Developers can also call The Weather Company APIs directly from their application code to retrieve the forecast data for the next day, week, and so on, in order to pass the required data to the scoring endpoint and make the prediction.
A typical scoring endpoint provided by the Watson Machine Learning service would look like the URL shown below.
https://ibm-watson-ml.mybluemix.net/pm/v1/score/flights-cancellation?accesskey=<provided by WML service>
By passing the expected JSON body that includes the required inputs for scoring (such as the future flight data and forecast weather data), the scoring endpoint above returns whether a given flight is likely to be canceled. This is seen in Figure 10, which shows a call being made to the scoring endpoint — and its response — using an HTTP requester tool available in a web browser.
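From application code, the same call could be made as sketched below. The payload field names are hypothetical — the real ones must match the inputs defined in the deployed SPSS stream — and the access key remains the placeholder from the URL above; the actual network call is shown but not executed here.

```python
# Sketch of assembling a scoring request for the Watson ML endpoint.
# Field names in the body are hypothetical assumptions.
import json

SCORING_URL = (
    "https://ibm-watson-ml.mybluemix.net/pm/v1/score/flights-cancellation"
    "?accesskey=<provided by WML service>"
)

def build_scoring_request(origin, dest, fl_date, dep_hour, forecast):
    """Assemble the URL and JSON body for one flight to be scored."""
    body = {
        "tablename": "scoringInput",  # hypothetical table name
        "header": ["ORIGIN", "DEST", "FL_DATE", "DEP_HOUR",
                   "WIND_MPH", "SNOW_IN"],
        "data": [[origin, dest, fl_date, dep_hour,
                  forecast["wind_mph"], forecast["snow_in"]]],
    }
    return SCORING_URL, json.dumps(body)

url, payload = build_scoring_request(
    "EWR", "ORD", "2016-12-17", 18, {"wind_mph": 35, "snow_in": 4.0})

# In an application you would then POST it, e.g. with the requests library:
# response = requests.post(url, data=payload,
#                          headers={"Content-Type": "application/json"})
```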
Figure 10: Actual request URL, JSON body, and response from scoring endpoint
Notice in the JSON response above that the deployed model predicted this particular flight from Newark to Chicago would be 88.8% likely to be canceled, based on forecast weather conditions.
IBM SPSS Modeler is a powerful tool that helped us visually create a solution for this use case without writing a single line of code. We were able to follow an iterative process that helped us understand and prepare the data, then model and evaluate the solution to finally deploy the model as an API for consumption by application developers.
The IBM SPSS stream and data used as the basis for this blog are available on GitHub. There you can also find instructions on how to download IBM SPSS Modeler, get a key for The Weather Company APIs, and much more.
Published at DZone with permission of Ricardo Balduino, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.