Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

NYC Taxi Ride Analysis With Couchbase and Cloud9 Charts

DZone's Guide to

NYC Taxi Ride Analysis With Couchbase and Cloud9 Charts

A step-by-step guide to analyzing NYC Green Cab Rides using N1QL, Couchbase, and Cloud9 charts.

· Big Data Zone
Free Resource

See how the beta release of Kubernetes on DC/OS 1.10 delivers the most robust platform for building & operating data-intensive, containerized apps. Register now for tech preview.

Couchbase N1QL is a query processing engine on Couchbase to provide SQL for JSON on distributed data. Cloud9 Charts is an analytics platform built for the modern data stack, with native support for NoSQL databases, traditional SQL databases, REST APIs, and other sources.

Native NoSQL Analytics

Traditionally, analytics on NoSQL databases typically means one of the following:

  1. Shoehorn the data into a relational form using an ODBC driver that traditional BI tools can understand. This usually requires an ODBC driver, a traditional (typically desktop based) BI tool and schemas defined upfront.

  2. ETL processes to load up relevant data into a relational database for analysis. This negates the schema flexibility of NoSQL databases like Couchbase, given that schemas must be defined upfront along with what gets moved where. This typically implies relatively long lead times and coordination for any changes to the data structure to propagate.

Native NoSQL Analytics unshackles the dependency on ODBC drivers and ETL processes, thus enabling business and technical users to leverage the underlying database to the fullest extent to derive actionable insights quickly.

Briefly, the Couchbase-Cloud9 Charts Integration features the following:

  • Fully Native N1QL integration without ODBC drivers

  • Point and click N1QL query generator

  • Support for nested objects and Arrays

  • Join between Couchbase and other SQL/NoSQL or REST API based sources

  • Instant visualizations & embeddable dashboards

  • Predictions & Advanced Analytics

Dataset

The dataset we'll use for our analysis consists of 45 million green cab taxi rides, made available by the NYC Taxi and Limousine Commission.

Green cabs were launched in 2013 in New York city, targeted towards taxi rides in the outer boroughs of NYC that were traditionally underserved by yellow cabs.

The analysis focuses on the following:

  • Geospatial analysis of pickup areas and drop offs

  • Trip durations, by hour by day across neighborhoods

  • Fare analysis

  • Ride Predictions

The raw dataset in CSV form can be found here (2 GB zipped, 15GB unzipped).

Couchbase Cluster

A 3 node Couchbase cluster was provisioned by our friends at Couchbase. cbtransfertool was used to load up the data into Couchbase from CSV files.

Raw data looks like this (truncated for the sake of brevity):

{
"greentaxi": {
  "Dropoff_latitude": "40.824813842773438",
  "Dropoff_longitude": "-73.902938842773438",
  "Ehail_fee": "",
  "Extra": 0.5,
  "Fare_amount": 10.5,
  "Lpep_dropoff_datetime": "2013-12-22 02:17:35",
     …..
   "dropoff_city": "New York City-Bronx",
   "dropoff_county": "Bronx",
       "pickup_city": "New York City-Manhattan",
   "pickup_county": "New York”,  
     }
}

Indexes and Performance Considerations

Memory optimized secondary indexes were added for dropoff_city, pickup_city, dropoff_city, pickup_date and dropoff_date (in addition to the primary index).

Aggregations of the raw data to track rides by hour/day/pickup/dropoff was created into another bucket to enable fast query executions:

INSERT INTO rides_agg (KEY UUID())  
select 
    count(*) as rides, 
round(avg(Fare_amount),0) as fare,
    round(avg(duration),0) as duration, 
    day,
    hour,
    pickup_region,
    dropoff_region
    from 
     (select date_part_str(millis_to_zone_name(dropoff_ts, "EST"),"day_of_week") as day,
        date_part_str(millis_to_zone_name(dropoff_ts, "EST"),"hour") as hour,
        pickup_region,
        dropoff_region,
        fare_amount,
        date_diff_millis(dropoff_ts,pickup_ts,"minute") as duration
        from greentaxi 
      ) as tmp
group by 
    day,
    hour,
    dropoff_region,
pickup_region
order by day,hour

Analysis

The full dashboard of the following analysis can be accessed here.

Pickup and Dropoffs

The following N1QL query clusters the pickup geo locations, from which a geospatial view of the pickup locations can be derived using Cloud9 Charts:

select 
    round(to_number(Pickup_latitude),3) as latitude, 
    round(to_number(Pickup_longitude),3) as longitude, 
    count(*) as count 
from cloud9 
GROUP BY round(to_number(Pickup_latitude),3),round(to_number(Pickup_longitude),3) 
order BY count DESC

See the dead zone within the Manhattan area on the map?

Here’s why: Green cabs are only allowed pick ups from the north of East 96th St and West 110 St.

Contrast this with Yellow cab pickups below where the majority of the pickups are concentrated around the Manhattan area.

While pickup zones are restricted, there are no limitations on passenger drop off areas for the green cab service. The drop-off heatmap looks like this:

Ride Trends and Predictions

Let’s look at the overall monthly ride trends since the Green Cab service was launched.

Trends show that service ramped up in late 2013, and from 2014 onwards, the rides per month trend have since stabilized somewhat.

By applying predictive models to it, we can determine the total rides over the next few months (show in yellow in the graph):

Neighborhood Comparisons

Pickup restrictions pose interesting conundrums from a cab owner’s standpoint. Dropoffs into restricted areas imply that driver must get back into a pickup region for the next pickup.

What are the most productive areas for a driver/owner operator? To answer this question, let’s look at some neighborhood analysis.

Following shows a Chord diagram of the relationships between start regions vs end regions. For example, there are far more (43% more) rides from Harlem → Hamilton Heights, vs Hamilton Heights → Harlem.

N1QL Query

select pickup_region,dropoff_region,sum(rides) as rides
from rides_agg
where pickup_region !='' and dropoff_region !=''
group by pickup_region, dropoff_region
order by rides desc 

Fare Analysis

If the pickup is in Harlem and the drop-off is in Chelsea, a prohibited pickup area. During 5-6 PM on a Wednesday, it’s a 35-minute ride as indicated by the grid heatmap below, for an average fare of $28. But it means that the driver must get back to a pick-up area for the next ride. From a driver’s standpoint, not ideal.

Contrast that to another dropoff location, say the Fort Green neighborhood in Brooklyn (from Harlem). This is a $47 dollar ride taking an average of 57 mins during rush hour and also provides ample pickup opportunities within the Fort Green area as indicated by the geo heatmap.

To take this one step further, which areas should a cab operator deploy their asset?

The following chart helps to answer that question, providing the top average fares across locations for a given day and hour of the day.

Top Fares By Locations By Date/Time


So it looks like, for Tuesdays from 8-9 AM, Nkew Gardens neighborhood yields the highest average pickup fare, whereas Saturday night at 11 PM, the pole position goes to the Jamaica neighborhood.

Summary

Gone are the days of long drawn out ETL processes or shoehorning semi-structured data into relational formats for analytics purposes. With a native NoSQL analytics solution, you can leverage N1QL to derive immediate, actionable insights that can be shared and embedded in a jiffy.

Special thanks to Prasad Varakur, Chin Hong, and others at Couchbase for their hands-on support with Couchbase deployment and query optimization.

Resources

New Mesosphere DC/OS 1.10: Production-proven reliability, security & scalability for fast-data, modern apps. Register now for a live demo.

Topics:
couchbase ,nosql ,analytics ,data visualization

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}