Using JSONB in PostgreSQL: How to Effectively Store and Index JSON Data in PostgreSQL
In this post, we show you tips and techniques on how to effectively store and index JSON data in PostgreSQL. Learn more about JSONB PostgreSQL.
Join the DZone community and get the full member experience.Join For Free
You can also check out our Working with JSON Data in PostgreSQL vs. MongoDB webinar in partnership with PostgresConf to learn more on the topic, and check out our SlideShare page to download the slides.
Why should a relational database even care about unstructured data? It turns out that there are a few scenarios where it is useful.
One of the main reasons to store data using the JSON format is schema flexibility. Storing your data in JSON is useful when your schema is fluid and is changing frequently. If you store each of the keys as columns, it will result in frequent DML operations - this can be difficult when your data set is large - for example, event tracking, analytics, tags, etc. Note: If a particular key is always present in your document, it might make sense to store it as a first class column. We discuss more about this approach in section "JSON Patterns & Antipatterns" below.
If your data set has nested objects (single or multi-level), in some cases, it is easier to handle them in JSON instead of denormalizing the data into columns or multiple tables.
Syncing with External Data Sources
Often times an external system is providing data as JSON, so it might be a temporary store before data is ingested into other parts of the system. For example, Stripe transactions.
JSON support in PostgreSQL was introduced in 9.2 and has steadily improved in every release going forward.
Wave 1: PostgreSQL 9.2 (2012) added support for JSON data typeJSON database in 9.2 was fairly limited (and probably overhyped at that point) - basically a glorified string with some JSON validation thrown in. It is useful to validate incoming JSON and store in the database. More details are provided below.
Wave 2: PostgreSQL 9.4 (2014) added support for JSONB data typeJSONB stands for "JSON Binary" or "JSON better" depending on whom you ask. It is a decomposed binary format to store JSON. JSONB supports indexing the JSON data, and is very efficient at parsing and querying the JSON data. In most cases, when you work with JSON in PostgreSQL, you should be using JSONB.
Wave 3: PostgreSQL 12 (2019) added support for SQL/JSON standard and JSONPATH queriesJSONPath brings a powerful JSON query engine to PostgreSQL.
When Should You Use JSON vs. JSONB?
In most cases, JSONB is what you should be using. However, there are some specific cases where JSON works better:
- JSON preserves the original formatting (a.k.a whitespace) and ordering of the keys.
- JSON preserves duplicate keys.
- JSON is faster to ingest vs. JSONB - however, if you do any further processing, JSONB will be faster.
For example, if you're just ingesting JSON logs and not querying them in any way, then JSON might be a better option for you. For the purposes of this blog, when we refer to JSON support in PostgreSQL, we will refer to JSONB going forward.
If PostgreSQL has great support for JSONB, why do we need columns anymore? Why not just create a table with a JSONB blob and get rid of all columns like the schema below:
At the end of the day, columns are still the most efficient technique to work with your data. JSONB storage has some drawbacks vs. traditional columns:
PostreSQL Does Not Store Column Statistics For JSONB columnsPostgreSQL maintains statistics about the distributions of values in each column of the table - most common values (MCV), NULL entries, histogram of distribution. Based on this data, the PostgreSQL query planner makes smart decisions on the plan to use for the query. At this point, PostgreSQL does not store any stats for JSONB columns or keys. This can sometimes result in poor choices like using nested loop joins vs. hash joins, etc. A more detailed example of this is provided in this blog post - When To Avoid JSONB In A PostgreSQL Schema.
JSONB Storage Results in A Larger Storage FootprintJSONB storage does not deduplicate the key names in the JSON. This can result in a considerably larger storage footprint compared to MongoDB BSON on WiredTiger or traditional column storage. I ran a simple test with the below JSONB model storing about 10 million rows of data, and here are the results. In some ways this is similar to the MongoDB MMAPV1 storage model where the keys in JSONB were stored as-is without any compression. One long-term fix is to move the key names to a table level dictionary and refer this dictionary instead of storing the key names repeatedly. Until then, the workaround might be to use more compact names (unix-style) instead of more descriptive names. For example, if you're storing millions of instances of a particular key, it would be better storage-wise to name it "pb" instead of "publisherName".
The most efficient way to leverage JSONB in PostgreSQL is to combine columns and JSONB. If a key appears very frequently in your JSONB blobs, it is probably better off being stored as a column. Use JSONB as a "catch all" to handle the variable parts of your schema while leveraging traditional columns for fields that are more stable.
Both JSONB and MongoDB BSON are essentially tree structures, using multi-level nodes to store the parsed JSONB data. MongoDB BSON has a very similar structure.
JSONB & TOAST
Another important consideration for storage is how JSONB interacts with TOAST (The Oversize Attribute Storage Technique). Typically, when the size of your column exceeds the TOAST_TUPLE_THRESHOLD (2kb default), PostgreSQL will attempt to compress the data and fit in 2kb. If that doesn't work, the data is moved to out-of-line storage. This is what they call "TOASTing" the data. When the data is fetched, the reverse process "deTOASTting" needs to happen. You can also control the TOAST storage strategy:
- Extended - Allows for out-of-line storage and compression (using pglz). This is the default option.
- External - Allows for out-of-line storage, but not compression.
If you're experiencing delays due to the TOAST compression or decompression, one option is to proactively set the column storage to 'EXTENDED'. For all of the details, please refer to this PostgreSQL doc.
PostgreSQL provides a variety of operators to work on JSONB. From the docs:
|->||Get JSON array element (indexed from zero, negative integers count from the end)|
|->||Get JSON object field by key|
|->>||Get JSON array element as text|
|->>||Get JSON object field as text|
|#>||Get JSON object at the specified path|
|#>>||Get JSON object at the specified path as text|
|@>||Does the left JSON value contain the right JSON path/value entries at the top level?|
|<@||Are the left JSON path/value entries contained at the top level within the right JSON value?|
|?||Does the string exist as a top-level key within the JSON value?|
|?|||Do any of these array strings exist as top-level keys?|
|?&||Do all of these array strings exist as top-level keys?|
|||||Concatenate two jsonb values into a new jsonb value|
|-||Delete key/value pair or string element from left operand. Key/value pairs are matched based on their key value.|
|-||Delete multiple key/value pairs or string elements from left operand. Key/value pairs are matched based on their key value.|
|-||Delete the array element with specified index (Negative integers count from the end). Throws an error if top level container is not an array.|
|#-||Delete the field or element with specified path (for JSON arrays, negative integers count from the end)|
|@?||Does JSON path return any item for the specified JSON value?|
|@@||Returns the result of JSON path predicate check for the specified JSON value. Only the first item of the result is taken into account. If the result is not Boolean, then null is returned.|
JSONB provides a wide array of options to index your JSON data. At a high-level, we are going to dig into 3 different types of indexes - GIN, BTREE and HASH. Not all index types support all operator classes, so planning is needed to design your indexes based on the type of operators and queries that you plan on using.
GIN stands for "Generalized Inverted indexes". From the docs:
"GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words."
GIN supports two operator classes:
- jsonb_ops (default) - ?, ?|, ?&, @>, @@, @? [Index each key and value in the JSONB element]
- jsonb_pathops - @>, @@, @? [Index only the values in the JSONB element]
Existence Operators (?, ?|, ?& )
These operators can be used to check for the existence of top-level keys in the JSONB. Let's create a GIN index on the data JSONB column. For example, find all books that are available in braille. The JSON looks something like this:
As you can see from the explain output, the GIN index that we created is being used for the search. What if we wanted to find books that were in braille or in hardcover?
The GIN index supports the "existence" operators only on "top-level" keys. If the key is not at the top level, then the index will not be used. It will result in a sequential scan:
The way to check for existence in nested docs is to use "expression indexes". Let's create an index on data->tags:
Note: An alternative here is to use the @> operator:
However, this only works if the value is an object. So, if you're unsure if the value is an object or a primitive value, it could lead to incorrect results.
Path Operators @>, <@
The "path" operator can be used for multi-level queries of your JSONB data. Let's use it similar to the ? operator above:
The path operators support querying nested objects or top-level objects:
The queries can be multi-level as well:
GIN Index "pathops" Operator Class
GIN also supports a "pathops" option to reduce the size of the GIN index. When you use the pathops option, the only operator support is the "@>" so you need to be careful with your queries. From the docs:
"The technical difference between a jsonb_ops and a jsonb_path_ops GIN index is that the former creates independent index items for each key and value in the data, while the latter creates index items only for each value in the data"
You can create a GIN pathops index as follows:
On my small dataset of 1 million books, you can see that the pathops GIN index is smaller - you should test with your dataset to understand the savings:
Let's rerun our query from before with the pathops index:
However, as mentioned above, the "pathops" option does not support all of the scenarios that the default operator class supports. With a "pathops" GIN index, all these queries are not able to leverage the GIN index. To summarize, you have a smaller index but it supports a more limited use case.
B-tree indexes are the most common index type in relational databases. However, if you index an entire JSONB column with a B-tree index, the only useful operators are "=", <, <=, >, >=. Essentially, this can only be used for whole object comparisons, which has a very limited use case.
A more common scenario is to use B-tree "expression indexes". For a primer, refer here - Indexes on Expressions. B-tree expression indexes can support the common comparison operators '=', '<', '>', '>=', '<='. As you might recall, GIN indexes don't support these operators. Let's consider the case when we want to retrieve all books with a data->criticrating > 4. So, you would build a query something like this:
Well, that doesn't work since the '->' operator returns a JSONB type. So we need to use something like this:
If you're using a version prior to PostgreSQL 11, it gets more ugly. You need to first query as text and then cast it to integer:
For expression indexes, the index needs to be an exact match with the query expression. So, our index would look something like this:
If you are only interested in the "=" operator, then Hash indexes become interesting. For example, consider the case when we are looking for a particular tag on a book. The element to be indexed can be a top level element or deeply nested.
E.g. tags->publisher = XlekfkLOtL
Hash indexes also tend to be smaller in size than B-tree or GIN indexes. Of course, this ultimately depends on your data set.
Special Mention: GIN Trigram Indexes
PostgreSQL supports string matching using trigram indexes. Trigram indexes work by breaking up text into trigrams. Trigrams are basically words broken up into sequences of 3 letters. More information can be found in the documentation. GIN indexes support the “gin_trgm_ops” class that can be used to index the data in JSONB. You can choose to use expression indexes to build the trigram index on a particular column.
As you can see in the query above, we can search for any arbitrary string occurring at any potion. Unlike the B-tree indexes, we are not restricted to left anchored expressions.
Special Mention: GIN Array Indexes
JSONB has great built-in support for indexing arrays. Let's consider an example of indexing an array of strings using a GIN index in the case when our JSONB data contains a "keyword" element and we would like to find rows with particular keywords:
The order of the items in the array on the right does not matter. For example, the following query would return the same result as the previous:
All elements in the right side array of the containment operator need to be present - basically like an "AND" operator. If you want "OR" behavior, you can construct it in the WHERE clause:
More details on the behavior of the containment operators with arrays can be found in the documentation.
SQL standard added support for JSON in SQL - SQL/JSON Standard-2016. With the PostgreSQL 12/13 releases, PostgreSQL has one of the best implementations of the SQL/JSON standard. For more details refer to the PostgreSQL 12 announcement.
|.key||Returns an object member with the specified key.|
|[*]||Wildcard array element accessor that returns all array elements.|
|.*||Wildcard member accessor that returns the values of all members located at the top level of the current object.|
|.**||Recursive wildcard member accessor that processes all levels of the JSON hierarchy of the current object and returns all the member values, regardless of their nesting level.|
Refer to JSONPath documentation for the full list of operators. JSONPath also supports a variety of filter expressions.
PostgreSQL 12 provides several functions to use JSONPath to query your JSONB data. From the docs:
- jsonb_path_exists - Checks whether JSONB path returns any item for the specified JSON value.
- jsonb_path_match - Returns the result of JSONB path predicate check for the specified JSONB value. Only the first item of the result is taken into account. If the result is not Boolean, then null is returned.
- jsonb_path_query - Gets all JSONB items returned by JSONB path for the specified JSONB value. There are also a couple of other variants of this function that handle arrays of objects.
Let's start with a simple query - finding books by publisher:
You can rewrite this expression as a JSONPath filter:
You can also use very complex query expressions. For example, let's select books where print style = hardcover and price = 100:
However, index support for JSONPath is very limited at this point - this makes it dangerous to use JSONPath in the where clause. JSONPath support for indexes will be improved in subsequent releases.
Projecting Partial JSON
Another great use case for JSONPath is projecting partial JSONB from the row that matches. Consider the following sample JSONB:
Select only the publisher field:
Select the prints field (which is an array of objects):
Select the first element in the array prints:
Select the last element in the array prints:
Select only the hardcover prints from the array:
We can also chain the filters:
In summary, PostgreSQL provides a powerful and versatile platform to store and process JSON data. There are several gotcha's that you need to be aware of, but we are optimistic that it will be fixed in future releases.
Published at DZone with permission of Dharshan Rangegowda, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.