Probing Text Data Using PostgreSQL Full-Text Search
Learn how to leverage the power of full-text search in PostgreSQL to search through large text content in a fast and accurate way.
Join the DZone community and get the full member experience.Join For Free
In the previous article, we saw how we could use fuzzy search in PostgreSQL. This article is more about using the power of full-text indexes to search through large text data. This is different from any other PostgreSQL indexes, which work either with alphanumeric or smaller textual content.
To understand how full-text search works, we are going to utilize the same data set we saw in the PostgreSQL Fuzzy Search blog -
The table has around 79091 rows like so:
We will focus more on the column
episode_narrative, which has more text content as it narrates all of the things that happened during that particular weather event.
Fuzzy Search vs. Full-Text Search
Let's first understand the difference between fuzzy search and full-text search. Below is sample content from the
Hurricane Irene formed east of the Caribbean island of Dominica, part of the Lesser Antilles region, on the afternoon of August 20. Irene moved through the Caribbean and up the east coast of the United States making landfall twice. She first made landfall as a Category 1 Hurricane near Cape Lookout, North Carolina around 7:30am on August 27, then moved offshore again during the evening. She then made a 2nd landfall, again as a Category 1 Hurricane at 540am on August 28 near Little Egg Inlet in New Jersey. She moved over New York City and then into southeastern New York State and Connecticut as a Tropical Storm a few hours later. By the end of the evening of the 28th, Irene was crossing the U.S./Canada border having produced significant amounts of rain, storm surge, inland and coastal flooding, and wind damage across southern New England and much of the east coast of the United States. ||In Southern New England, the minimum surface pressure recorded was 976.9mb taken at Barnes Municipal Airport in Westfield, Massachusetts. The storm surge experienced along the coast was generally in the two to four foot range with a high of 4.78 feet at Fox Point in Providence, Rhode Island. The highest sustained windspeed was 54 knots (62 mph) at the Physical Oceanographic Real Time System station at Conimicut Light in Narragansett Bay, RI. The highest sustained wind speed on land was 38 knots (44 mph) recorded on the Automated Surface Observing Systems at both Barnstable Municipal Airport in Hyannis, MA (KHYA) and Logan International Airport in Boston, MA (KBOS). Rainfall amounts ranged from nearly zero (0.03 at Nantucket Memorial Airport - ACK) to nearly 10 inches (9.92 in Conway, MA). ||Despite the relatively low wind speeds, sustained winds over a 6 to 12 hour long duration resulted in widespread tree damage and resulted in power outages to roughly half a million customers throughout the state. Some of these customers did not get their power back until the Friday following the storm (some five days later). Durring the passage of Tropical Storm Irene, the winds resulted in $25,000 in property damages. ||The Connecticut River at Walpole reached its highest level since the 1938 hurricane.||The collective effects of Tropical Storm Irene on August 28, resulted in 1 fatality, 0 injuries, and $127.3M in property damage in the following counties: Barnstable, Bristol, Essex, Franklin, Hampden, Hampshire, Middlesex, Nantucket, Norfolk, Plymouth, Suffolk, and Worcester (all in MA), Hartford, Tolland, and Windham (all in CT), Cheshire and Hillsborough (all in NH), and Bristol, Providence, Kent, Washington, and Newport (all in RI).
As you can see, the text is huge. It is like a few paragraphs combined, at least. Let's say we want to search this column for the content
Logan International Airport i.e., we want all of the storm events involving that airport.
A query might look like this:
SELECT episode_narrative from storm_events WHERE episode_narrative ILIKE '%Logan International Airport%'
This query and its results look great, but the query takes a lot of time (763 ms) for a mere 79 thousand rows. A realistic production use case would involve millions of rows.
To solve this, we need to move beyond the text data type and enter the world of documents.
Before we jump in and use PostgreSQL's abilities, we need to understand specific basic terminologies.
The text that we store in PostgreSQL needs to be converted into a document structure for faster searching. From the documentation,
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
This can be done by storing the data type as
tsvector, a particular data type that stores text structure in the document format.
tsvector stands for text search vector. We can use the
to_tsvector to convert any arbitrary text to
tsvector similar to how we typecast other data types.
tsvector type not only stores the text-like documents, but it stores the value as a sorted list of distinct lexemes. A lexeme is a basic unit of language, which in our case is English. Let's understand this with an example taken from the documentation.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
If we run this, then we get:
Notice that the
a character is present only once, and other words are separated. This is a pre-processing step that is required to speed up our search. We will explore more of this once we get to our actual content.
Also, if we give a set of words with verbs, then the
tsvector will normalize them like below.
scare and the word
the got removed because it is a stop word. Usually, we would not search the content/words if it contains the
the word. Also, normalizing words makes it much easier to search for variations because we want usually search for
scare is also a good enough match as well. Go ahead and play around with the different words to see how they behave. As usual, you can find more literature about the complete behavior of these functions in the Postgres documentation.
Searching Against the Vector
By now, we know that the
tsvector is not a standard text column, and there is a lot of pre-processing that happens for easier search. We need to use a particular function called
tsquery to search against the
The output as we see is simply
false. In this case, note that we searched for the word
scares, but still, the result was true. That was because of the normalization we saw earlier. The
@@ operator is used for direct matches. We have other operators that perform more sophisticated matches.
Now that we have a hang of the basics. Let's move on to our actual table to perform some neat searches.
Implementing Full-Text Search
We can choose to implement our search either on the fly or by constructing a new column of the type
tsvector. Converting our data on the fly and then comparing as we did above will be extremely slow. Let's create a new column of the type
tsvector by converting our existing
ALTER TABLE storm_events ADD COLUMN episode_narrative_tsv tsvector; UPDATE storm_events SET episode_narrative_tsv = to_tsvector(episode_narrative);
Then, we can query for
Logan International Airport using a
SELECT episode_narrative from storm_events WHERE episode_narrative_tsv @@ to_tsquery('logan & international & airport')
This returns the same 85 rows as before and takes almost the same time (Around 800 milliseconds). These numbers are not practical for production use cases. In the next section, we see how to create an index to speed up the search.
We can create a
GIN index for the
tsvector column we created above.
CREATE INDEX tsv_idx ON storm_events USING gin(episode_narrative_tsv);
The time taken is pretty fast, i.e., under 10 ms almost every time (the 377ms indicated above is for rendering and not the actual query time). An
Explain Analyze makes things much more apparent.
Great! Now we have a fully functional search index that rivals most text search engines out there. But there is always a trade-off which we will get to at a later part of the article. We can also use plurals of the words in our search queries (not for obvious nouns), and our index/query would still work because of the normalization.
SELECT episode_narrative from storm_events WHERE episode_narrative_tsv @@ to_tsquery('logan & international & airports')
Notice the word
airports with an
s in the end, so the query works fine.
GIN Indexes are known for their large size compared to the actual content itself because they have to index in a format called inverted index. Let's compare the size of the index with respect to the actual column size.
SELECT PG_SIZE_PRETTY(SUM(PG_COLUMN_SIZE('episode_narrative'))) as column_size, PG_SIZE_PRETTY(PG_RELATION_SIZE('tsv_idx')) as index_size from storm_events
The index is almost ten times the size of the actual content. Our dataset is not huge here, but it can get pretty crazy if we attempt to index real-world text data such as instant messages, emails, etc. We should be mindful of what content we are indexing and its size, as these indexes come and sit in the
shared_buffers —also known as cache—and can affect the performance of other queries if we are not careful.
There are a considerable number of operators present in the documentation. I would highly encourage the readers to import the data set and experiment with these different operators to understand the power of search fully. In the next section, we will see a few special operators that can be helpful in our search.
Advanced Search Techniques
So far, we have searched for the existence of words and found that they can be searched in a very efficient fashion.
JSON is a data type that PostgreSQL natively supports. We can create native data types such as
JSONB. JSON indexing and its functionality in PostgreSQL have been covered extensively in a previous article. Check it out here.
Using the GIN index, we can search entire JSON documents with the same performance as a B-Tree index.
PostgreSQL provides a dictionary facility for common languages, which is used in the backend for many tasks such as,
- Stop word removal (a, an, the, etc.).
- Normalization (scared -> scare as we saw above).
And, a lot more.
We can customize this dictionary to add more stop words, add/remove certain words and even create our own custom dictionary for use cases where each word means differently in a different business context.
If our results contain more than one match, it makes sense to sort it by the best match to least match. This can be done using the function.
SELECT event_type, event_source FROM storm_events WHERE episode_narrative_tsv @@ to_tsquery('logan & international & airports') ORDER BY ts_rank( episode_narrative_tsv, to_tsquery('logan & international & airports') ) DESC
Here, we are ranking in descending order by the most relevant results so that the most matched one will come on top.
The underlying technology of all search engines is similar from a data structure perspective: an inverted index, but the datastore is very different. Industry-standard solutions for search such as Elastic Search, Apache Solr are built for horizontal scaling.
We can use the search in PostgreSQL if:
- Our data fit in a relational structure, and there is a need to do a full-text search.
- It would be overkill to choose a full-blown search solution.
- The entire data model needs ACID consistency.
We must keep in mind that when we say search with a relational use case, it means a search for content limited to operational databases. These data stores typically move their data which is older than a year or more, depending on the business policy, to data warehouses or other systems for analytics and do not retain data indefinitely/longer durations. As always, performance test as closely as possible to production before deploying to minimize the impact.
Published at DZone with permission of Zach Naimon. See the original article here.
Opinions expressed by DZone contributors are their own.