Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Joining JSON: Comparing Couchbase N1QL and MongoDB

DZone's Guide to

Joining JSON: Comparing Couchbase N1QL and MongoDB

Let's take a look at a high-level comparison between Couchbase N1QL and MongoDB.

· Database Zone ·
Free Resource

Download the Altoros NoSQL Performance Benchmark 2018. Compare top NoSQL solutions – Couchbase Server v5.5, MongoDB v3.6, and DataStax Enterprise v6 (Cassandra).

Image title


As NoSQL databases evolved, each added higher level APIs or languages to help programmers to complex things easily. SQL, having done that for relational data, showed the way. In SQL, developers say"what" needs to be done and the database engine figures out the "how." "How" is the efficient procedure/algorithm to execute the statement. Select, join, and project are the basic operations SQL processing. Even in NoSQL systems, when you model data without much normalization, you still need to join a collection of objects. Customers with orders, orders with inventory, inventory with suppliers, suppliers with credits, and so forth. Hence, Couchbase N1QL has supported join operations since its first release. Following that, MongoDB, in version 3.2, added $lookup operator to the aggregation framework to perform the join operations.

Without an expressive and high-performance query feature, application developers have to do it within the application or export the data to a system that does it. Both expensive propositions.

In this article, do a comparative study of joins between MongoDB and Couchbase. Joins are unsupported in Cassandra CQL, DynamoDB natively. Applications have to do it themselves or have to use other layers like Spark or Amazon EMR do it separately. So, we won’t cover them in this article.

Joins in Couchbase

Couchbase introduced INNER and LEFT OUTER joins starting with Couchbase 4.0 (2015). This supported joins of in a child-to-parent relationship. Children documents (e.g. Orders) can be joined with parent documents (e.g. customer). In 4.5 (2016), Couchbase introduced index joins to query from parent to child joins. In both cases, there was an implied attribute-value to document-key equality predicate, specified by the ON KEY clause.

Couchbase 5.5 has ANSI standard SQL extended for JSON. It supports INNER JOIN, LEFT OUTER JOIN, and limited RIGHT OUTER join. We’ll be using examples based on Couchbase 5.5.

Couchbase joins documentation.

Joins in MongoDB:

Joins are supported via the $lookup operator within the aggregation framework.

Following is the excerpt from the MongoDB documentation.

New in version 3.2.

Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. To each input document, the $lookup stage adds a new array field whose elements are the matching documents from the “joined” collection. The $lookup stage passes these reshaped documents to the next stage.

Eliot Horowitz, MongoDB CTO, said: “MongoDB aggregation is similar to Unix pipeline. The output of one stage goes into another...[it’s] very procedural. Lets you think about in a very procedural way.”

MongoDB $lookup.

Interested in learning more on JOINs? Read this article by Lukas Eder. 

High-Level Comparison Between Couchbase N1Ql and MongoDB

Couchbase N1QL: Supports INNER JOIN, LEFT OUTER JOIN and limited RIGHT OUTER JOIN. The query language, like SQL, is declarative. Developers write, tools generate the query to the N1QL syntax. The engine figures out the plan and executes the query.

MongoDB: Supports LEFT OUTER JOIN for scalar values only. The design of the joins into the MongoDB query language is done help write the query and process data in a procedural way.

Implication:

  1. Left outer join resultset is a superset of inner join resultset. It’s possible to add additional predicates to eliminate the non-matching (null-projected or missing subservient side of the join) documents after the left outer join is performed. That’s like going from San Francisco to Chicago via London. You can do it, but expensive. For the query execution, it takes time, memory, cpu resources affecting the overall performance of the system.
  2. N1QL support for joins is declarative. MongoDB language is somewhat procedural. You’ve to separate the predicates, think about the join order between collections, think about when to group, sort, etc. Writing queries with MongoDB aggregation is like writing query plans, step by step.

Examples:

We use the simple travel-sample model and data. Here are the details of the model data.

We simply exported the data from Couchbase and imported to a mongo database called travel-sample. In MongoDB, the 5 different types of the document (landmark, route, airline, airport, hotel) are stored in 5 collections with respective names.

Example 1: LEFT OUTER JOIN with ON clause on scalar values.

Couchbase N1QL

SELECT count(*)
FROM `travel-sample` route
         LEFT OUTER JOIN `travel-sample` airline
         ON (route.airlineid = META(airline).id)
WHERE route.type = 'route';

Converted query in MongoDB

db.route.aggregate([
   {
     $lookup:
       {
         from:"airline",
         localField: "airlineid",
         foreignField: "_id",
         as: "airline_docs"
       }
   },
   { $group: { _id: null, myCount: { $sum: 1 } } },
   { $project: { _id: 0 } }
 ]);

Observations:

This is a fairly simple left outer join query joining two collections and then simply counting the total number of documents produced. Notice, unlike N1QL (and SQL), in MongoDB, you’d still have to group the resultset to get the count, even if you have a single group.

Example 2: List the airports and landmarks in the same city, ordered by the airports.

Couchbase N1QL:

SELECT landmark.name AS Landmark_Name,
       MIN(airport.airportname) AS Airport_Name,
       MIN(airport.tz) AS Landmark_Time
FROM `travel-sample` airport INNER JOIN `travel-sample` landmark
      ON airport.city = landmark.city
WHERE landmark.country = "United States"
      AND airport.type = "airport"
      AND landmark.type = "landmark"
GROUP BY landmark.name
ORDER BY Airport_Name

Converted query in MongoDB

db.airport.aggregate([
   {
     $lookup:
       {
         from:"landmark",
         localField: "city",
         foreignField: "city",
         as: "aplm_docs"
       }
   },
   {
      $match: {"airline_docs": {$ne: []}}
   },
   {  $unwind: { path: "$aplm_docs", preserveNullAndEmptyArrays: true }},
   {  $group: {
      _id: "$aplm_docs.name",
      Airport_Name: { $min: "$airportname" } ,
      Landmark_Time: { $min: "$tz"}
     }
   },
   { $sort : { Airport_Name: 1 } },
   { $project: { _id: 1, Airport_Name:1, Landmark_Time:1  } }
 ]);

Observations:

  1. This query uses INNER JOIN which MongoDB does not have. So, in MongoDB, you first do the lookup join to get the LEFT OUTER JOIN, and then eliminate non-matching, but projected documents (because of the left outer) using the match stage (code: $match: {“airline_docs”: {$ne: []}}).
  2. Then, you’ve got to remember the matched documents are in an array data structure, unwind them before you group them by the landmark.name. Then do the sort and final projection.

As expected, the MongoDB join query is procedural and you’ve to understand the execution plan and write code for each stage.

Example 3: Starting from San Francisco, find all the destination airports (those have routes from SFO).

Couchbase N1QL

SELECT DISTINCT route.destinationairport
FROM `travel-sample` airport JOIN `travel-sample` route
     ON (airport.faa = route.sourceairport AND route.type = "route")
WHERE airport.type = "airport"
  AND airport.city = "San Francisco"
  AND airport.country = "United States"
ORDER BY route.destinationairport

Converted query in MongoDB:

db.airport.aggregate([
   {
      $match: {
          $and: [
            {"type": "airport"},
            { city: "San Francisco"},
            { "country": "United States"}
          ]
      }
   },
   {
     $lookup:
       {
         from:"route",
         let: { rfaa : "$faa"},
         pipeline: [
           { $match:
               { $expr:
                 { $and:
                   [
                     { $eq: ["$sourceairport", "$$rfaa"]} ,
                     { $eq: ["$type", "route"] }
                   ]

                }
              }
           }
         ],
         as: "airline_docs"
       }
   },
   { $match: {"airline_docs": {$ne: []}} },
   {  $unwind: { path: "$airline_docs", preserveNullAndEmptyArrays: true }},
   { $project: { _id:0,  "airline_docs.destinationairport" : 1 }},
   { $group: {
      _id : "$airline_docs.destinationairport"
     }
   },
   { $sort: { _id : 1 }},
 ]);

Observations:

  1. The join clause for this query is a bit more complex, with two predicates (airport.faa = (route.sourceairport AND route.type = “route”). This requires a cumbersome pipeline syntax on the MongoDB query.
  2. And because you need to differentiate between the two collections, you need another let stage to create the local variables for airport attributes.
  3. Like before, it requires an additional match clause to eliminate non-matching (empty) airline docs, followed by grouping and sorting.
  4. As you can see visually, the MongoDB query is getting larger and larger to do the same job as Couchbase N1QL.

Example 4: Find all the hotels and landmarks in Yosemite. Hotels should have at least 5 likes.

Couchbase N1QL

SELECT hotel.name hotel_name, landmark.name landmark_name, landmark.activity
FROM `travel-sample` hotel INNER JOIN `travel-sample` landmark
    ON (hotel.city = landmark.city 
        AND hotel.country = landmark.country 
        AND landmark.type = "landmark")
WHERE hotel.type = "hotel" 
  AND hotel.title like "Yosemite%" 
  AND array_length(hotel.public_likes) > 5;

The converted query in MongoDB

db.hotel.aggregate([
   { $match:  { title: { $regex: /^Yosemite/ } }, },
   {
     $lookup:
       {
         from:"landmark",
         let: { hcity : "$city", hcountry : "$country"},
         pipeline: [
           { $match:
               { $expr:
                 { $and:
                   [
                     { $eq: ["$city", "$$hcity"]} ,
                     { $eq: ["$country", "$$hcountry"] }
                   ]
                }
              }
           }
         ],
         as: "hotel_lm_docs"
       }
   },
   { $match : {"hotel_lm_docs": { $ne: [] }}},
   { $project: {_id:0, hname: "$name", public_likes: 1, hotel_lm_docs:1}},
   { $unwind: { path: "$hotel_lm_docs", preserveNullAndEmptyArrays: true }},
   { $project: { _id: 1, hname : 1 , "hotel_lm_docs.name" : 1, "hotel_lm_docs.name" : 1, "hotel_lm_docs.activity" : 1, mt5 : {$gt: [ {$size: "$public_likes"}, 5]}}},
   { $match: { mt5 : true } },
   { $project: {_$id:0}}
 ]);

Observation:

  1. Translating the LIKE predicate into a regular expression was straightforward, but determining if there were atleast five public_likes was not. Needed additional projection and matching phase to calculate the size of the public_likes at the end.
  2. When you have many attributes to match, manipulate and project, you’d have to rename them properly at appropriate stage otherwise, the query can’t reference it. For example, hotel.name had to be renamed to hname before the unwind. Maybe there’s a better way to write this stage!
  3. N1QL expressed the query in 370 characters. MongoDB required 956 characters. All this for a two table join. As the complexity increases, the ratio increases as well since the MongoDB query is written in a procedural way.

Example 5: Find all the hotels and landmarks in Yosemite. Hotels should have at least 5 likes.

This is just like Example 4, but do it faster!

Couchbase N1QL

SELECT hotel.name hotel_name, landmark.name landmark_name, landmark.activity
FROM `travel-sample` hotel INNER JOIN `travel-sample` landmark USE HASH(build)
    ON (hotel.city = landmark.city 
        AND hotel.country = landmark.country 
        AND landmark.type = "landmark")
WHERE hotel.type = "hotel"
  AND hotel.title like "Yosemite%" 
  AND array_length(hotel.public_likes) > 5;

Observation:

The default join method in Couchbase N1QL is nested loop join. This works fine when you have a smaller number of documents involved on each side of the join. When you have a larger data set, typically in reporting queries, nested loop join slows down. Couchbase N1QL has hash joins and this speeds up joins significantly. When each side of join has thousands of documents to millions of documents, the speed increase can be 2x to 20x or more. See the detailed Couchbase blog on ANSI Joins for more information.

From the documentation and explain plan, it’s unclear what join method MongoDB uses. Some of the blogs indicate that they’ve used a nested loop join to implement the $lookup operator.

Summary



  

Couchbase N1QL

MongoDB

JOIN approach Declarative, like SQL.

Allows joining between any sized and distributed data set.


Procedural with some declarative aspects (e.g. index selection).

Can only join a sharded collection in an unsharded collection. To join two sharded collections, applications will have to write the join algorithm.

JOINs supported LEFT OUTER JOIN

INNER JOIN

RIGHT OUTER JOIN

$lookup implements the LEFT OUTER JOIN on scalar values.
ON-clause support Full expressions.

Scalars

Arrays

Implicit equality

Pipeline expression

Arrays should to be $unwind before the $lookup

JOIN implementation Block Nested Loop

Hash join with user defined build and probes.

Nested Loop
ON Clause ON clause with any expression. $pipeline expression
Array expressions in ON clause Use ANY, IN expressions.

Supports UNNEST

Pipeline with $unwind before $match
Explain Visual explain and

JSON explain

Visual explain and

JSON explain

JOIN order Left to right, as specified by the user. The optimizer is rule based. As specified in the pipeline.
Nested JOINs Supported via derived tables.

FROM clause can have subselects which can have joins or subselects in turn.

No
JOIN predicate processing Optimizer processes the join predicates, constant predicates and pushes the predicates to the index automatically. Manual design of predicates for each collection, careful ordering of pipeline stages without full help from the optimizer.





How about performance? Good question. That’s for a future article!

And now, a quote:

“A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts.”

— William Strunk, Jr. Elements of Style.

References:

  1. Couchbase Documentation
  2. MongoDB Documentation
  3. ANSI Joins in Couchbase N1QL

Download the whitepaper, Moving From Relational to NoSQL: How to Get Started. We’ll take you step by step through your first NoSQL project.

Topics:
couchbase ,mongodb ,sql ,syntax ,nosql ,json ,database ,rdbms ,n1ql ,comparison

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}