ElasticSearch: Parent and Child Joins — Game of Thrones Edition

ElasticSearch is not a relational database; it is built for search efficiency rather than storage efficiency.

Sohan Ganapathy

Jun. 17, 19 · Tutorial


In a relational database, a child table references its parent through a foreign key, and queries combine the two tables with a join. The design typically involves normalizing the data.

ElasticSearch is not a relational database; it is built for search efficiency rather than storage efficiency. The data stored is denormalized and mostly flat. ElasticSearch is all about speed, and traditional joins would run too slowly, so joins cannot span indexes: both the child and parent documents must be in the same index and on the same shard.

Example Parent/Child Relationship

Let’s consider two famous houses from the HBO series Game of Thrones (For those worried about spoilers, I have faked the isAlive status of the characters). The family tree depicted in Image 1 has four Parents and nine Children. Each character has a gender and an isAlive status.

*Image 1: The Starks and Lannister family tree with Parent and Child relationships.*

Creating the “game_of_thrones” Index

The code below creates an index for the above relationship. (Setup guide for Elastic Search). Starting with ElasticSearch 7, a mapping type is no longer required for indexes, unlike previous versions.

createIndex.sh — Create the game_of_thrones Index

Line 23: relation_type is the name of the join field.

Line 24: The join type is a special field type that creates a parent/child relation between documents of the same index.

Line 25: Parent-child joins use global ordinals to speed up joins.

Lines 26–28: The relations section defines the set of possible relations within the documents, each relation being a parent name and a child name.
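The mapping those notes describe might look roughly like the sketch below. It is not the original createIndex.sh: the types chosen for the non-join fields (keyword, boolean) and the localhost:9200 address are assumptions, and the body is only built and printed here rather than sent to a cluster.

```python
import json

# Assumed endpoint: a local cluster and the index name used in this article.
INDEX_URL = "http://localhost:9200/game_of_thrones"

index_body = {
    "mappings": {
        "properties": {
            # These field types are assumptions, not taken from the original script.
            "name": {"type": "keyword"},
            "house": {"type": "keyword"},
            "gender": {"type": "keyword"},
            "isAlive": {"type": "boolean"},
            # The join field: its name, its type, and the allowed relations.
            "relation_type": {
                "type": "join",
                "eager_global_ordinals": True,
                "relations": {"parent": "child"},
            },
        }
    }
}

# Sending this body with an HTTP PUT to INDEX_URL would create the index.
print("PUT", INDEX_URL)
print(json.dumps(index_body, indent=2))
```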

Inserting the Parent Data

Let’s walk through the code for one parent insert before running a script to insert the other parents depicted in Image 1.

Create Eddard Stark
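A rough Python sketch of that insert follows; it is not the author's original script. The helper name parent_request and the localhost:9200 address are hypothetical, and the isAlive value is a placeholder because the article does not state it for Eddard.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200/game_of_thrones"  # assumed local cluster

def parent_request(doc_id, name, house, gender, is_alive):
    """Return (url, body) for indexing one parent document (hypothetical helper)."""
    # Each parent is routed by its own name.
    query = urlencode({"routing": name})
    url = f"{BASE_URL}/_doc/{doc_id}?{query}"
    body = {
        "name": name,
        "house": house,
        "gender": gender,
        "isAlive": is_alive,
        # Marks the document as the parent side of the join.
        "relation_type": {"name": "parent"},
    }
    return url, body

# is_alive=False is a placeholder value, not a fact from the article.
url, body = parent_request("1", "Eddard", "Stark", "Male", False)
print("PUT", url)
print(json.dumps(body))
```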

The above code creates a new document for Eddard Stark and marks it as a parent document using the relation_type field, whose name is set to the value parent. Along with the relation, it also adds the other fields we need: house, gender, and isAlive.

One key thing to notice here is the routing query parameter. Each parent assigns its own name to the parameter. The routing field helps us control which shard the document is going to be indexed on. The shard is identified using the below equation:

shard = hash(routing_value) % number_of_primary_shards
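To make the formula concrete, here is a small illustration. It is only a sketch: zlib.crc32 stands in for the hash function ElasticSearch uses internally, so the shard numbers it produces will not match a real cluster, and the shard count of 5 is an arbitrary assumption. What carries over is the property that equal routing values always map to the same shard.

```python
import zlib

def shard_for(routing_value: str, number_of_primary_shards: int) -> int:
    """Illustrative only: crc32 is a stand-in for ElasticSearch's internal hash."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

# Eddard is indexed with routing=Eddard, and so are his children,
# so parent and child resolve to the same shard.
parent_shard = shard_for("Eddard", 5)
child_shard = shard_for("Eddard", 5)
print(parent_shard == child_shard)  # True
```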

We can insert the remaining parents using the script here.

Inserting the Children Data

Similarly, let’s walk through one child insert before running a bulk insert of the 9 Children depicted in Image 1.

Create Arya Stark
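A minimal sketch of the child insert, assuming a local cluster. The document id "3" is hypothetical (the article does not give Arya's id), and the gender value is assumed from the character; the parent id "1" and routing=Eddard come from the text.

```python
import json

BASE_URL = "http://localhost:9200/game_of_thrones"  # assumed local cluster
child_id = "3"  # hypothetical id, not taken from the article

# Same routing value as the parent, so both documents share a shard.
url = f"{BASE_URL}/_doc/{child_id}?routing=Eddard"
body = {
    "name": "Arya",
    "house": "Stark",
    "gender": "Female",  # assumed value
    "isAlive": True,
    # "child" is the relation name; "parent" is the id Eddard was created with.
    "relation_type": {"name": "child", "parent": "1"},
}

print("PUT", url)
print(json.dumps(body))
```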

In our example, Arya Stark is a child of Eddard Stark. Notice that we use the same routing query parameter that we used to create a record for Eddard. This is because of the restriction where both the child and parent documents must be on the same shard.

The join between this record and Eddard’s is made by the relation_type field: we set the relation name to child and the parent to “1” (the same Id we created Eddard with), making Arya Stark a child of that parent document.

We can insert the remaining children using the script here.

Querying Our Data

Now for the fun part: executing and understanding the queries we can run on the relationship we just created.

Searching and Filtering Specific Parents

Get all children of Lyanna Stark: The parent_id query can be used to find child documents which belong to a particular parent.

Get all children of Lyanna Stark
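A sketch of the request body for this search, assuming it is sent to the game_of_thrones/_search endpoint. The parent id "2" is Lyanna's, as seen in the response below.

```python
import json

# parent_id takes the child relation name and the id of the parent document.
query_body = {
    "query": {
        "parent_id": {
            "type": "child",
            "id": "2",
        }
    }
}

print("GET game_of_thrones/_search")
print(json.dumps(query_body, indent=2))
```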

Executing the above query gets the John Snow document.

{
    "took": 2,
    ..."hits": [{
        "_index": "game_of_thrones",
        "_type": "_doc",
        "_id": "10",
        "_routing": "Lyanna",
        "_source": {
            "name": "John",
            "house": "Snow",
            "gender": "Male",
            "isAlive": true,
            "relation_type": {
                "name": "child",
                "parent": "2"
            }
        }
    }]...
}

Get All children of Eddard who are alive: A bool query with must clauses can be used to fetch the records.

Get All children of Eddard who are alive
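One way to write this search body is sketched below. Using a term clause for the isAlive check is an assumption about how the original query was written; the parent id "1" is Eddard's.

```python
import json

query_body = {
    "query": {
        "bool": {
            # Both clauses must match: child of document 1, and alive.
            "must": [
                {"parent_id": {"type": "child", "id": "1"}},
                {"term": {"isAlive": True}},
            ]
        }
    }
}

print("GET game_of_thrones/_search")
print(json.dumps(query_body, indent=2))
```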

Executing the above query will get the records for Arya, Sansa, Bran, and Rickon Stark.

Has Child and Has Parent Queries

The query keywords has_child and has_parent help query data with parent-child relationships.

Get All parents who have daughters who are dead: The has_child keyword fetches the parent records whose children match a filter.

Get All parents who have daughters who are dead
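A hedged sketch of such a search body. The inner clauses (a match on gender and a term on isAlive) are one plausible way to express "dead daughters", not necessarily the author's exact query.

```python
import json

query_body = {
    "query": {
        "has_child": {
            # Relation name of the child side of the join.
            "type": "child",
            # Filter applied to the children; matching children select their parents.
            "query": {
                "bool": {
                    "must": [
                        {"match": {"gender": "Female"}},
                        {"term": {"isAlive": False}},
                    ]
                }
            },
        }
    }
}

print("GET game_of_thrones/_search")
print(json.dumps(query_body, indent=2))
```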

Executing the above query gets the record of Tywin Lannister, who is the only parent with a dead daughter Cersei.

Get All Children whose Parent has gender as Female: The has_parent keyword fetches the child records whose parents match a filter.

Get All Children whose Parent has gender as Female
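A sketch of the corresponding search body. The match clause on gender is an assumption about how the filter was written.

```python
import json

query_body = {
    "query": {
        "has_parent": {
            # Relation name of the parent side of the join.
            "parent_type": "parent",
            # Filter applied to the parents; matching parents select their children.
            "query": {"match": {"gender": "Female"}},
        }
    }
}

print("GET game_of_thrones/_search")
print(json.dumps(query_body, indent=2))
```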

Executing the above query gets the record of John Snow, whose parent is Lyanna Stark; all other parents are male.

Having Multiple Children per Parent

Let us add Catelyn Stark as a wife to Eddard Stark, as depicted in Image 2 below. Eddard now has Children and Wife documents attached.

*Image 2: The Starks and Lannister family tree with Parent, Wife and Child relationships.*

The Index can be changed using the code below:

Modify Index Adding a New Child to Parent — Wife.

Line 9: We now have an array of relationships associated with the Parent which are “child” and “wife”.
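The mapping update might look like the sketch below, assuming it is sent with a PUT to the index's _mapping endpoint. Only the join field is shown, and the layout is not the author's original script.

```python
import json

mapping_update = {
    "properties": {
        "relation_type": {
            "type": "join",
            "eager_global_ordinals": True,
            # One parent name can map to a list of child relation names.
            "relations": {"parent": ["child", "wife"]},
        }
    }
}

print("PUT game_of_thrones/_mapping")
print(json.dumps(mapping_update, indent=2))
```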

Inserting a “Catelyn Stark” document is similar to inserting the child record we created earlier. It uses the same routing parameter as the parent (routing=Eddard) and “wife” as the relation_type name.

Creating Catelyn Stark Record

Query the wife data:

Get the Lords who have a wife: The query uses the has_child keyword and filters by the relation type “wife”.

Get the Lords who have a wife
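A sketch of this search body. Using match_all as the inner query is an assumption: it simply accepts any document of the “wife” relation.

```python
import json

query_body = {
    "query": {
        "has_child": {
            # Only children of the "wife" relation are considered.
            "type": "wife",
            "query": {"match_all": {}},
        }
    }
}

print("GET game_of_thrones/_search")
print(json.dumps(query_body, indent=2))
```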

Executing the above query gets the record of Eddard Stark.

Multiple Levels of Relationship (Grandchildren)

Let us add Grandchildren to the Starks and Lannisters as depicted in the below Image 3.

The Index needs to be recreated here. This is because of another restriction: a child can be added to an existing element only if that element is already a parent. Since the “child” type was not a parent when we created the index earlier, we need to drop the earlier index, create a new one with the below code, and re-insert all the data.

Line 16: The child is also made a parent here, with grandchild as its child type. This lets us have the relationship PARENT → CHILD → GRANDCHILD.
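The join field of the recreated index might look like this sketch. Keeping the “wife” relation from the previous section is an assumption; the part the text confirms is that child becomes the parent of grandchild.

```python
import json

join_field = {
    "type": "join",
    "eager_global_ordinals": True,
    "relations": {
        # First level; keeping "wife" here is an assumption.
        "parent": ["child", "wife"],
        # Second level: a child can now have grandchildren.
        "child": "grandchild",
    },
}

# This would replace the relation_type definition in the index mapping.
print(json.dumps({"relation_type": join_field}, indent=2))
```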

Inserting Grandchildren documents is very similar to inserting child records.

In our example, “Ned Jr Something” is a child of Sansa Stark and a grandchild of Eddard Stark. Notice that we use the same routing query parameter that we used to create a record for Eddard. This is to ensure all the children associated with the super parent, Eddard, are indexed on the same shard.

The join between this record and Sansa’s is made by the relation_type field, where we add the name of the relation as a “grandchild” making “Ned Jr” a grandchild of the parent whose Id is “6” (The same Id we created Sansa with).
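A sketch of the grandchild insert, assuming a local cluster. The document id "14" and the name, house, gender, and isAlive values are placeholders that the article does not state; the parent id "6" and routing=Eddard come from the text.

```python
import json

BASE_URL = "http://localhost:9200/game_of_thrones"  # assumed local cluster
grandchild_id = "14"  # hypothetical id, not taken from the article

# Routed by the grandparent's value so all three generations share a shard.
url = f"{BASE_URL}/_doc/{grandchild_id}?routing=Eddard"
body = {
    "name": "Ned Jr",      # placeholder field values
    "house": "Something",
    "gender": "Male",
    "isAlive": True,
    # The direct parent is Sansa (id 6), not Eddard.
    "relation_type": {"name": "grandchild", "parent": "6"},
}

print("PUT", url)
print(json.dumps(body))
```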

We can insert the remaining grandchildren using the bulk script here.

Querying GrandParent Data

Get All Grandparents who have grand-daughters:
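One way to express this is to nest one has_child query inside another, as sketched below. The match clause on gender is an assumption about how the filter was written.

```python
import json

query_body = {
    "query": {
        # Outer level: grandparents that have a matching child...
        "has_child": {
            "type": "child",
            "query": {
                # ...where that child itself has a female grandchild document.
                "has_child": {
                    "type": "grandchild",
                    "query": {"match": {"gender": "Female"}},
                }
            },
        }
    }
}

print("GET game_of_thrones/_search")
print(json.dumps(query_body, indent=2))
```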

Executing this query gets us the “Tywin Lannister” record, since he is the only grandparent with a granddaughter Myrcella, as depicted in Image 3.

Using multiple levels of relations to replicate a relational model is not recommended. Each level of relation adds an overhead at query time in terms of memory and computation. You should de-normalize your data if you care about performance. — elastic.co

Restrictions of joins in ElasticSearch

Now that we have seen the join feature in action, let’s go over the restrictions noticed above.

Parent and child documents must be indexed on the same shard
Only one join field mapping is allowed per index
An element can have multiple children but only one parent
It is possible to add a new relation to an existing join field
It is also possible to add a child to an existing element but only if the element is already a parent

Conclusion

Parent-child joins can be a useful technique for managing relationships when index-time performance is more important than search-time performance, but they come at a significant cost. One must be aware of the tradeoffs, such as the physical storage constraint on parent and child documents and the added complexity. Another precaution is to avoid multi-layered parent-child relationships, since these consume more memory and computation.
