
Using Spring Data Neo4j for the Hubway Data Challenge


Editor's Note: This post was originally authored by Michael Hunger of the Neo4j Blog.


Using Spring Data Neo4j it was incredibly easy to model and import the Hubway Challenge dataset into a Neo4j graph database, to make it available for advanced querying and visualization.

The Challenge and Data

Tonight @graphmaven pointed me to the boston.com article about the Hubway Data Challenge.

Hubway is a bike sharing service which is currently expanding worldwide. For the Data Challenge they offer the CSV data of their 95 Boston stations and about half a million bike rides up to the end of September. The challenge is to provide answers to some posted questions and to develop great visualizations (or UIs) for the Hubway dataset. The challenge is also supported by the MAPC (Metropolitan Area Planning Council).

Getting Started

As midnight had just passed and Spring Data Neo4j 2.1.0.RELEASE had been built unofficially during the day, I thought it would be a good exercise to model the data as entities and import it into Neo4j. So the first step was the domain model, which is pretty straightforward:

Based on the Spring Data book example project, I created the pom.xml with the dependencies (org.springframework.data:spring-data-neo4j:2.1.0.RELEASE) and the Spring application context files.
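For reference, reconstructed from the coordinates given above, the relevant dependency declaration in that pom.xml would look roughly like this:

```xml
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-neo4j</artifactId>
    <version>2.1.0.RELEASE</version>
</dependency>
```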

Import Stations

Starting with the Station, modelling and importing was easiest. The entity has several names, one of which is the unique identifier (terminalName); the station name itself can be searched with a fulltext index. As Hubway also provides geo-information for the stations, we use the Neo4j Spatial index provider to later integrate spatial searches (near, bounding box, etc.).

@NodeEntity
public class Station {
    @GraphId Long id;
    @Indexed(numeric = false)
    private Short stationId;
    private String terminalName;
    @Indexed(indexType = IndexType.FULLTEXT, indexName = "stations")
    private String name;
    boolean installed, locked, temporary;
    double lat, lon;
    @Indexed(indexType = IndexType.POINT, indexName = "locations")
    String wkt;

    protected Station() { }

    public Station(Short stationId, String terminalName, String name,
                   double lat, double lon) {
        this.stationId = stationId;
        this.name = name;
        this.terminalName = terminalName;
        this.lon = lon;
        this.lat = lat;
        this.wkt = String.format("POINT(%f %f)", lon, lat).replace(",", ".");
    }
}
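One detail in that constructor deserves a note: `String.format("%f", ...)` is locale-sensitive, so on e.g. a German JVM it emits `-71,06` with a comma, which the `replace(",", ".")` call patches up afterwards. A locale-safe alternative (a standalone sketch, not the project's code) is to pass `Locale.US` explicitly:

```java
import java.util.Locale;

public class WktDemo {
    // Formatting with an explicit Locale.US guarantees a dot as the decimal
    // separator, making the replace(",", ".") workaround unnecessary.
    static String toWkt(double lon, double lat) {
        return String.format(Locale.US, "POINT(%f %f)", lon, lat);
    }

    public static void main(String[] args) {
        System.out.println(toWkt(-71.0603, 42.3601)); // POINT(-71.060300 42.360100)
    }
}
```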

I used the JavaCSV library for reading the data files. The importer just creates a Spring context and retrieves the service, which comes with injected dependencies and declarative transaction management. The actual import is then as simple as creating entity instances and passing them to the Neo4jTemplate for saving.

ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("classpath:META-INF/spring/application-context.xml");
ImportService importer = ctx.getBean(ImportService.class);
CsvReader stationsFile = new CsvReader(stationsCsv);
importer.importStations(stationsFile);

public class ImportService {
    @Autowired private Neo4jTemplate template;
    private final Map<Short, Station> stations = new HashMap<Short, Station>();

    public void importStations(CsvReader stationsFile) throws IOException {
        // id,terminalName,name,installed,locked,temporary,lat,lng
        while (stationsFile.readRecord()) {
            Station station = new Station(asShort(stationsFile, "id"),
                                          stationsFile.get("terminalName"),
                                          stationsFile.get("name"),
                                          asDouble(stationsFile, "lat"),
                                          asDouble(stationsFile, "lng"));
            template.save(station);
            stations.put(station.getStationId(), station);
        }
    }
}

Import Trips

Importing the trips themselves is only a little more involved. In modelling the trip I chose to create a RelationshipEntity called Action to represent the start or end of a trip. That entity connects the trip to a station and holds the date at which it happened. During the import I found a number of data rows to be inconsistent (missing stations), so those were skipped. As half a million entries are a bit too much for a single transaction, I split the import up into batches of 5k trips each.

public boolean importTrips(CsvReader trips, int count) throws IOException {
    // "start_date","start_station_id","end_date","end_station_id",
    // "bike_nr","subscription_type","zip_code","birth_date","gender"
    while (trips.readRecord()) {
        Station start = findStation(trips, "start_station_id");
        Station end = findStation(trips, "end_station_id");
        if (start == null || end == null) continue;
        Member member = obtainMember(trips);
        Bike bike = obtainBike(trips);
        Trip trip = new Trip(member, bike)
                        .from(start, date(trips.get("start_date")))
                        .to(end, date(trips.get("end_date")));
        template.save(trip);
        if (--count == 0) return true;
    }
    return false;
}
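The batch contract above (return true while records may remain, false once the file is exhausted) can be exercised without Neo4j at all. Here is a minimal standalone sketch, with a hypothetical importBatch standing in for importTrips and plain strings standing in for CSV records:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchDemo {
    // Mimics importTrips: consume up to 'count' records, return true when the
    // batch was filled (more records may follow), false when the source ran dry.
    static boolean importBatch(Iterator<String> records, int count) {
        while (records.hasNext()) {
            records.next(); // stand-in for creating and saving a Trip
            if (--count == 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6", "r7");
        Iterator<String> it = rows.iterator();
        int fullBatches = 0;
        while (importBatch(it, 3)) fullBatches++; // each call = one transaction
        System.out.println(fullBatches); // prints 2: two full batches, remainder in the last call
    }
}
```

In the real importer the driver keeps calling until false is returned, and each call runs in its own transaction of 5k trips.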

First look at the data

After running the import, which takes about two minutes, we have a Neo4j database (227 MB) that contains all those connections. I uploaded it to our sample dataset site. Grab a Neo4j server, put the content of the zip file into data/graph.db, and it is easy to visualize the graph and run some interesting queries. I list a few here, but those should only be seen as a starting point; feel free to explore and find new and interesting insights.

Stations most often used by a user

 START n=node(205) 
 MATCH n-[:TRIP]->(t)-[:`START`|END]->stat 
 RETURN stat.name,count(*) 
 ORDER BY count(*) desc LIMIT 5; 

| stat.name                           | count(*) |
|-------------------------------------|----------|
| "South Station - 700 Atlantic Ave." | 22       |
| "Post Office Square"                | 21       |
| "TD Garden - Legends Way"           | 10       |
| "Boylston St. at Arlington St."     | 5        |
| "Rowes Wharf - Atlantic Ave"        | 5        |

5 rows
31 ms

Most beloved bikes

  START bike=node:Bike("bikeId:*")
  MATCH bike<-[:BIKE]-trip
  RETURN bike.bikeId, count(*)
  ORDER BY count(*) DESC LIMIT 5;

| bike.bikeId | count(*) |
|-------------|----------|
| "B00145"    | 1074     |
| "B00114"    | 1065     |
| "B00538"    | 1061     |
| "B00490"    | 1059     |
| "B00401"    | 1057     |

5 rows
2906 ms


The data can also easily be added to a Heroku Neo4j add-on, and from there you can use any programming language and rendering framework (d3, jsPlumb, Raphaël, Processing) to visualize the dataset.

What's next

Our next steps are to import the supplied shapefile for Boston, as well as the stations, into the Neo4j database, connect them with the data, and create a cool visualization. I rely on @maxdemarzi for it to be awesome. Another path to follow is to craft more advanced Cypher queries for exploring the dataset and to make them and their results available.

Boston Hubway Data Challenge Hackathon

Hubway will host a Hack Day at The Bocoup Loft in Downtown Boston on Saturday, October 27, 2012. Register here and spread some graph love.





Published at DZone with permission.
