Modeling Airline Flights in Neo4j
Let us take a closer dive into data modeling by looking at how one might model airline flight data in Neo4j.
If you’ve come to any of the Neo4j Data Modeling classes I’ve taught, you must have heard me say “your model depends on both your data and your queries” about a million times. Let us take a closer dive into what this means by looking at how one might model airline flight data in Neo4j.
So what is our data? Airports and the flights between them. Let’s start our model with that:
Right away this model feels a bit off. The concept of a Flight is expressed as a relationship, but if we want to connect Customers or Staff to a flight, or say that a flight was REROUTED to another Airport due to weather or some other problem, we can’t. Given some of the queries we imagine for our data, a flight really should be an instance or an event and thus a Node, so let’s try this model instead:
It’s not a bad model, but we will have very dense nodes. Think about major hubs like Atlanta, Beijing, Dubai, London Heathrow, or even my local Chicago O’Hare. These would be massive nodes, with no quick way to filter routes without checking multiple properties, which will slow our traversal down quite a bit.
Let us use the queries to guide our model. When a user is trying to book a flight, they know where they are starting from, where they want to go, and what day they want to fly. So let’s reimagine our model to introduce the concept of days. There are a variety of ways to model dates and times in a graph, but our queries are telling us we should find a way to limit our traversals to a small subgraph, so we will create nodes to identify the subgraph we want. Each Airport would have 365 AirportDay nodes so we can book and schedule flights up to a year in advance.
We added days, but our model didn’t really improve our queries just yet. We are still checking the date property on all the AirportDay nodes. We can move the date property to the HAS_DAY relationship to save ourselves from having to traverse all the way to the AirportDay node when starting from an Airport, but there is another way:
We are using the date as an actual relationship type, so we could start from an Airport node and quickly jump to the AirportDay node by relationship type without having to check the date property of 365 relationships. This is an important concept to understand when designing your model: the less work the traversal has to do, the faster it will be. Checking a few hundred relationship properties is more expensive than traversing a single relationship type. This, however, raises the question “why start at Airport, when we can just start at AirportDay”? Indeed:
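To make the difference concrete, here is a small Python sketch. It is only an analogy, not a picture of how Neo4j stores relationships: a list of dictionaries stands in for HAS_DAY relationships that carry a date property, and a dictionary keyed by date stands in for relationships whose type is the date. The node names and the date format are made up for illustration.

```python
# Property approach: every HAS_DAY relationship carries a date property,
# so finding one day means checking relationships until one matches.
has_day_rels = [
    {"type": "HAS_DAY", "date": f"2015-09-{d:02d}", "target": f"ORD-2015-09-{d:02d}"}
    for d in range(1, 31)
]


def find_by_property(rels, wanted_date):
    """Scan the relationships; return (target, number of relationships checked)."""
    checked = 0
    for rel in rels:
        checked += 1
        if rel["date"] == wanted_date:
            return rel["target"], checked
    return None, checked


# Relationship-type approach: the date itself is the relationship type,
# modeled here as a dictionary keyed by type.
rels_by_type = {f"2015-09-{d:02d}": f"ORD-2015-09-{d:02d}" for d in range(1, 31)}


def find_by_type(rels, wanted_date):
    """Jump straight to the target for the wanted type, with no scan."""
    return rels.get(wanted_date)


print(find_by_property(has_day_rels, "2015-09-15"))  # ('ORD-2015-09-15', 15)
print(find_by_type(rels_by_type, "2015-09-15"))      # ORD-2015-09-15
```

The property scan grows with the number of days attached to the airport, while the lookup by type does the same amount of work however many days there are.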
We can use a key like “ORD-1441065600” to quickly get to the Chicago O’Hare Airport for 9/1/2015 via an index and start our traversal there. The key is made up of our departing airport code and the Unix epoch time representation of the day of our flight. I think this is as good as we are going to get for finding our starting point in the traversal. Let’s now start thinking about when our traversal would finish. This is, of course, once we have found a flight that reaches our intended destination. However, with the current model, in order to check if we reached our destination we have to go through every flight an Airport flies to that day, and that is just not good for performance. Imagine your traversal finding flight paths with one stop or two stops via major hubs; that would be quite painful. This is where the best part of working with Neo4j comes in… you have to get creative.
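As a quick illustration, such a key can be built like this in Python. The function name `airport_day_key` is hypothetical, and the sketch assumes the day is represented by midnight UTC, which is what the example value 1441065600 corresponds to.

```python
from datetime import date, datetime, timezone


def airport_day_key(airport_code: str, day: date) -> str:
    """Build an AirportDay key from an airport code and a calendar day."""
    # Assumption: the day is identified by midnight UTC, in seconds since the epoch.
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return f"{airport_code}-{int(midnight.timestamp())}"


print(airport_day_key("ORD", date(2015, 9, 1)))  # ORD-1441065600
```

Whatever convention is chosen for the timestamp, the code that writes the AirportDay nodes and the code that looks them up have to agree on it, or the index lookup will miss.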
So let’s try something crazy. We know there may be a couple thousand flights at an airport on any day, but very few airports have more than 200 destinations, so let’s add Destination nodes to every AirportDay. We may have 100 flights from Chicago to Atlanta per day, but we should only have to check the destinations once. If we can’t find the destination we want, we can stop traversing through this AirportDay node right away and try a different route. We don’t have to check all the flights of every AirportDay node we encounter in a multi-stop traversal. Most of the time we are booking flights way in advance, but the other travelers that need to book a flight are those who missed their flight, missed their connection, or have to deal with a cancelled flight or some other change in their plans. I am tempted to make destinations an array property and just check there, but I left it as a node because this model has the added benefit that if all flights from Chicago to Atlanta are delayed or cancelled due to weather, we can edit just the Destination node and affect them all. Let the queries guide your model. So our model is now:
I like this model, but maybe we can do a little bit better. Frequent flyers tend to book flights with their favorite airline whenever possible to earn Flight Miles or Points. People also tend to take the return route on the same airline they took to get there. So let’s take the “date as relationship type” idea we saw earlier and try “airline as relationship type”:
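The pruning idea can be sketched in a few lines of Python. This is a toy stand-in for the graph, not Neo4j code: the `AirportDay` class, its `destinations` mapping, and the flight codes are all hypothetical, chosen only to show that a missing destination is detected without looking at any flight.

```python
from dataclasses import dataclass, field


@dataclass
class AirportDay:
    key: str
    # Destination airport code -> flights leaving for that destination on this day.
    destinations: dict = field(default_factory=dict)


def flights_to(airport_day, destination_code):
    """Return the flights to a destination, or [] if it is not served that day."""
    # One check against the (at most a few hundred) destinations decides whether
    # the possibly thousands of flights on this AirportDay are worth visiting.
    return airport_day.destinations.get(destination_code, [])


ord_day = AirportDay(
    key="ORD-1441065600",
    destinations={"ATL": ["DL100", "UA200"], "DEN": ["UA300"]},
)

print(flights_to(ord_day, "ATL"))  # ['DL100', 'UA200']
print(flights_to(ord_day, "MIA"))  # []
```

Because the flights hang off the destination entry, marking every Chicago to Atlanta flight as delayed would mean touching one entry rather than each flight.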
With this new model we can start from an AirportDay, check the Destinations to see if a non-stop or direct flight is available right away. If not, we can look at routes with one hop, and quickly check the Destination nodes of those AirportDays to see if they can reach our destination at all from our hop. If the user is willing to make a two-hop flight, we can then check those the same way. We can check the flights in our preferred Airline order, or even limit our traversal to just the Airlines we choose.
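Here is a rough Python sketch of that search order, under simplifying assumptions: a single day is modeled as a plain dictionary from airport code to its destinations and the airlines serving each one, and departure times, connection windows, and overnight connections are ignored. The function name `find_routes` and the sample data are hypothetical.

```python
def find_routes(destinations, origin, target, max_stops=2, airlines=None):
    """Return the routes with the fewest stops (as lists of airport codes).

    destinations maps an airport code to {destination code: set of airline codes}
    for a single day. If airlines is given, only legs flown by one of those
    airlines are followed.
    """
    def reachable(airport):
        # Destination codes served from this airport by an allowed airline.
        return [
            dest
            for dest, carriers in destinations.get(airport, {}).items()
            if airlines is None or carriers & set(airlines)
        ]

    frontier = [[origin]]
    for _ in range(max_stops + 1):
        # Check only the destination lists to see whether the target is one leg away.
        found = [path + [target] for path in frontier if target in reachable(path[-1])]
        if found:
            return found
        # Nothing at this depth: extend each path by one hop, never revisiting an airport.
        frontier = [
            path + [dest]
            for path in frontier
            for dest in reachable(path[-1])
            if dest not in path
        ]
    return []


day = {
    "ORD": {"ATL": {"DL", "UA"}, "DEN": {"UA"}},
    "ATL": {"MIA": {"DL"}},
    "DEN": {"MIA": {"UA"}, "ATL": {"UA"}},
}

print(find_routes(day, "ORD", "ATL"))                   # [['ORD', 'ATL']]
print(find_routes(day, "ORD", "MIA"))                   # [['ORD', 'ATL', 'MIA'], ['ORD', 'DEN', 'MIA']]
print(find_routes(day, "ORD", "MIA", airlines=["UA"]))  # [['ORD', 'DEN', 'MIA']]
```

Non-stop routes are tried first, and deeper routes are explored only when no shallower one exists, which mirrors the order described above. Ranking the results by preferred airline could be layered on top by sorting them.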
I think this is a good model, but maybe you can come up with a better one. Think about it, and if you do, please let me know. If you want to see many different models for other industries and use cases, be sure to take a look at the GraphGists Collection.
Coming up with the right model and iterating through various scenarios like we did above is one of the things we can help you with in our Neo4j Technical Bootcamp, a week-long program where we come on site and help you build a Proof of Concept so you can be confident your Neo4j project will be a success. So if you are on the fence about adding a graph database to your next project, let’s chat. Make the investment. Give us a week and we’ll put you on the right path.