Mastering RediSearch (Part 3)
Learn how to dig into the TMDB dataset, start playing with real data, and further expand your client library for RediSearch.


Today, we're going to dive quite a bit deeper and make something useful with Node.js, RediSearch, and the client library we started in Part 2.

While RediSearch is a great full-text search engine, it's much more than that and has extensive power as a secondary index for Redis. With that in mind, let's get a dataset that contains more field-based data. I'll be using the TMDB (The Movie Database) dataset, which you can download from the Kaggle TMDB page. The data is formatted as CSV across two files and has a few features that we're not going to use yet, so we'll need to build an ingestion script.

To follow along, it would be best to get the chapter-3 branch of the GitHub repo.

Understanding the Dataset Structure

The first file is tmdb_5000_credits.csv, which contains the cast and crew information. Though cast and crew entries are modeled a bit differently, they do share some fields. Initially, it's not a very usable CSV file since two columns (cast and crew) contain JSON.

  • movie_id (column): This correlates with a movie row in the other file.
  • title (column): The title of the movie identified with the movie_id.
  • cast (column with JSON):
    • character: The name of the character.
    • cast_id: An identifier of the character across multiple movies.
    • name: The name of the actor or actress.
    • credit_id: A unique identifier of this credit.
  • crew (column with JSON):
    • department: The department or category of this role.
    • job: The name of the role on set.
    • name: The name of the crew member.
    • credit_id: A unique identifier of this credit.

Another problem with the CSV is that it's quite large for a file of its type: 40 MB (for comparison, the complete works of Shakespeare come to just 5.5 MB). While Node.js can handle files of this size, it's certainly not all that efficient. To ingest this data optimally, we'll be using csv-parse, a streaming parser. This means that as the CSV file is read in, it is parsed and events are emitted. Streamed parsing is a fitting companion to the already high-performing RediSearch.
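As a minimal sketch (assuming csv-parse's streaming interface, where calling the module with options returns a transform stream; the file path is illustrative), wiring up the parser looks something like this:

const fs    = require('fs');
const parse = require('csv-parse');

// columns: true keys each record by the CSV header row instead of by array position.
const parser = parse({ columns: true });

parser.on('error', function(err) {
  console.error(err);
});

// Pipe the (large) file through the parser; rows become available for
// reading in the 'readable' handler shown further down.
fs.createReadStream('./tmdb_5000_credits.csv').pipe(parser);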

Each row in the CSV represents a single movie, with about 4,800 movies in the file. Each movie, as you might imagine, has dozens to hundreds of cast and crew members. All in all, you can expect to index 235,838 cast and crew members, each represented by nine fields.

The other file in the TMDB dataset is the movie data. This is quite a bit more straightforward. Each movie is a row with a few columns that contain JSON data. We can ignore those JSON data columns for this part of the series. Here is how the fields are represented:

  • movie_id (column): A unique ID for each movie.

  • budget (column): The total film budget.

  • original_language (column): The ISO 639-1 code of the original language.

  • original_title (column): The original title of the film.

  • overview (column): A few sentences about the film.

  • popularity (column): The film's popularity ranking.

  • release_date (column): The film's release date (in YYYY-MM-DD format).

  • revenue (column): The amount of money it earned (USD).

  • runtime (column): The film's runtime (in minutes).

  • status (column): The film's release status ("released" or "unreleased").

  • Ignored columns: genres, production_companies, keywords, production_countries, spoken_languages

Importing Data From TMDB

Now that we've explored both the fields and how to create an index, let's move forward with creating our actual indexes and putting data into them.

Thankfully, most of the data in these files is fairly clean, but that doesn't mean we don't need to make adjustments. In the cast/crew file, we have the challenge of cast and crew entries having slightly different sets of data. In the data, each row represents a movie: the cast column has all the cast members while the crew column has all the crew members. So, when we're representing this in the schema, we're effectively creating a union of the fields (since there is overlap). cast and crew are numeric fields that are set to 1 for each kind of credit; think of them as flags.

For the movie database, we're going to convert release_date to a number. We'll simply parse the date into a JavaScript timestamp and store it in a numeric field. Finally, we'll ignore a number of fields: we'll just compare each column's key against an array of columns to skip (ignoreFields).
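As a hedged illustration (column names follow the dataset above; the exact record shapes and helper names are assumptions about the import scripts), the per-record massaging might look like this:

// Sketch: shaping one movie record before indexing.
const ignoreFields = [
  'genres', 'production_companies', 'keywords',
  'production_countries', 'spoken_languages'
];

function csvRecord(record) {
  const out = {};
  Object.keys(record).forEach(function(key) {
    if (ignoreFields.indexOf(key) !== -1) { return; }   // skip the JSON columns
    if (key === 'release_date') {
      out.release_date = (new Date(record.release_date)).getTime();  // numeric timestamp
    } else {
      out[key] = record[key];
    }
  });
  return out;
}

// For the credits file, a cast credit gets a numeric `cast` flag set to 1
// (a crew credit would get a `crew` flag the same way).
function castCredit(movieId, title, member) {
  return {
    movie_id  : movieId,
    title     : title,
    name      : member.name,
    character : member.character,
    credit_id : member.credit_id,
    cast      : 1
  };
}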

From a project-structure standpoint, we may end up doing more with our fieldDefinitions later on, so we'll store both schemas in a Node.js module. This is purely optional, but it's a clean pattern that reduces the likelihood of having to duplicate your code later on.

module.exports = {
  movies    : function(search) {
    return [
      //homepage is just stored, not indexed
      search.fieldDefinition.text('original_language', true, { noStem : true }),
      search.fieldDefinition.text('original_title', true, { weight : 4.0 }),
      search.fieldDefinition.text('status', true, { noStem : true }),
      search.fieldDefinition.text('title', true, { weight : 5.0 })
    ];
  },
  castCrew  : function(search) {
    return [
      search.fieldDefinition.text('title', true, { noStem : true }),
      search.fieldDefinition.text('name', true, { noStem : true }),
      //cast only
      search.fieldDefinition.text('character', true, { noStem : true }),
      //crew only
      search.fieldDefinition.text('department', true, { noStem : true }),
      search.fieldDefinition.text('job', true, { noStem : true })
    ];
  }
};
For importing both movies and credits, we'll be using an Async queue and the above-mentioned streaming CSV parser. These two modules work well together, though they use somewhat different terminology. First, the CSV parser reads a chunk of data (parser.on('readable',...)) and then reads one full row at a time (while(record = parser.read()){ ... }). Each row is manipulated and readied for RediSearch (csvRecord(record)). In this function, a few fields are formatted while some are ignored, and finally, the item is pushed into the queue (q.push(...)).
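Put together, the core of the loop looks roughly like this (a sketch only: total and parsed are shared counters used again below, csvRecord is the formatting helper sketched earlier, and q is the Async queue defined in the next snippet):

let total  = 0;      // rows pushed into the queue
let parsed = false;  // set once the CSV has been fully read

parser.on('readable', function() {
  let record;
  // read() hands back one full row at a time until the current chunk is drained
  while ((record = parser.read())) {
    total += 1;
    q.push(csvRecord(record));   // format the row and queue it for indexing
  }
});

parser.on('end', function() {
  parsed = true;
});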

Async is a very useful JavaScript library that provides a huge number of metaphors for handling asynchronous behavior. The queue implementation is pretty fun: items are pushed into a queue and processed at a given concurrency by a single worker function defined at instantiation. The worker function has two arguments: the item to be processed and a callback. Once the callback has run, the next item is available for processing (up to the given concurrency). There is a great animation that explains it:

[Queue animation from the async documentation]
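The queue itself can be sketched like this (the worker body is illustrative: search.add stands in for whatever the client library's document-add method is called, and the concurrency of 10 is arbitrary):

const async = require('async');

let processed = 0;   // documents confirmed indexed by RediSearch

// One worker function processes queued items, up to 10 at a time.
const q = async.queue(function(movie, done) {
  // search.add(docId, fields, cb) is a placeholder for the library call that
  // writes a single document into the RediSearch index (FT.ADD under the hood).
  search.add(movie.movie_id, movie, function(err) {
    if (err) { return done(err); }
    processed += 1;
    done();          // tells the queue this worker slot is free again
  });
}, 10);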

The other feature of queue that we'll be using is the drain function. This function executes when there are no items left in the queue, i.e. the queue was in a working state but is no longer processing anything. It's important to understand that a queue is never "finished"; it just becomes idle. Since we're using a streaming parser, it's possible that RediSearch ingests faster than the CSV parser can supply rows, leaving the Async queue momentarily empty and triggering drain prematurely. To address this, we increment a total counter for each record added to the queue and a processed counter for each record successfully indexed, and the CSV parser flips a parsed variable from false to true when it finishes. So when drain is called, we check whether parsed is true and whether processed matches total. If both conditions hold, we know that every value has been parsed from the CSV and that everything has been added to our RediSearch index. After an item is successfully added to the index, we invoke the worker function's callback and the queue manages the rest.
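In sketch form (async 2.x style, where drain is assigned as a property; client stands in for the underlying Redis connection, and the counters come from the snippets above):

// drain fires every time the queue goes idle; the queue itself is never "finished".
q.drain = function() {
  // Only treat an idle queue as completion once the CSV has been fully
  // parsed and every queued record has actually been indexed.
  if (parsed && processed === total) {
    console.log('Indexed ' + processed + ' documents');
    client.quit();   // close the connection so the script can exit
  }
};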

As mentioned earlier, the credits CSV is more complex, with each row representing multiple cast/crew members. To manage this complexity, I'll be using a batch. We'll use the same overall structure with the CSV parser and Async queue, but each row will generate multiple calls to RediSearch via the batch (one for each cast or crew member). Instead of pushing a plain object into the queue, we'll push a RediSearch batch; in the worker function, we'll call exec on it and then the callback. While in the movies CSV we have a single Redis (RediSearch) command per movie, in the credits CSV we have a single batch (made up of dozens of individual commands) per movie.
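A rough sketch of that pattern in the credits import script (client.batch() is node_redis' pipelined batch, used here as a stand-in for the library's batch support; addToBatch and castCredit are illustrative helpers, the latter sketched earlier):

// For the credits file, each queued item is a whole batch of commands
// (one per cast or crew member), not a single document.
const q = async.queue(function(movieBatch, done) {
  movieBatch.exec(function(err) {   // run every queued command in one round trip
    if (err) { return done(err); }
    processed += 1;
    done();
  });
}, 10);

function queueCredits(record) {
  const movieBatch = client.batch();
  JSON.parse(record.cast).forEach(function(member) {
    // addToBatch is a placeholder for however the library attaches an
    // FT.ADD for one credit document onto an existing batch.
    addToBatch(movieBatch, castCredit(record.movie_id, record.title, member));
  });
  // ...crew members from JSON.parse(record.crew) are handled the same way.
  total += 1;
  q.push(movieBatch);               // push the whole batch, not a plain object
}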

The two imports are different enough to warrant separate import scripts. To import the movies, you'll run the script like this:

$ node import-tmdb-movies.node.js --connection ./your-connection-object-as.json --data ./path-to-your-movie-csv-file/tmdb_5000_movies.csv

And the credits will be imported like this:

$ node import-tmdb-credits.node.js --connection ./your-connection-object-as.json --data ./path-to-your-credits-csv-file/tmdb_5000_credits.csv

A killer feature of RediSearch is that as soon as your data is indexed, it's available for query. True real-time stuff!

Searching the Data

Now that we've ingested all this data, let's write some scripts to get it back out of RediSearch. Despite the data being quite different between the cast/crew and the movie datasets, we can write a single, short script to do the searching.

This little script will allow you to pass in a search string from the command line (--search="your search string") and designate the index you're searching (--searchtype movies or --searchtype castCrew). The other command-line arguments are the connection JSON file (--connection path-to-your-connection-file.json) and optional arguments to set the offset and number of results (--offset and --resultsize, respectively).

After we instantiate the module by passing in the argv.searchtype, we'll just need to use the search method to send the search query and options as an object to RediSearch. Our library from the last section takes care of building the arguments that will be passed through to the FT.SEARCH command.

In the callback, we get a very standard-looking error-first callback. The second argument has the results of our search, pre-parsed and formatted in a useful way: each document has its own object with two properties, docId and doc. docId contains the unique identifier and doc contains the document itself as an object.

All we need to do is JSON.stringify the results (so console.log won't display [object Object]) and then quit the client.
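Sketched out (argument parsing via yargs, the instantiation call, and the option names are all assumptions; the search(query, options, callback) shape follows the library from Part 2):

const argv = require('yargs').argv;   // --search, --searchtype, --offset, --resultsize

// rediSearch(...) is shorthand for however the Part 2 library is instantiated
// with a connection and an index name (argv.searchtype).
const search = rediSearch(client, argv.searchtype);

search.search(
  argv.search,                                                       // the query string
  { offset : argv.offset || 0, numResults : argv.resultsize || 10 }, // option names illustrative
  function show(err, results) {
    if (err) { throw err; }
    console.log(JSON.stringify(results, null, 2));   // avoid "[object Object]" output
    client.quit();
  }
);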

You can try it out by running the following command:

$ node search.node.js --connection ./your-connection-object-as.json --search="star" --searchtype movies --offset 0 --resultsize 1

This should return an entry about the movie Lone Star (anyone seen it? No, I didn't think so). Now, let's look in the cast/crew index for anything with the same movie_id:

$ node search.node.js --connection ./your-connection-object-as.json --search="@movie_id:[26748 26748]" --searchtype castCrew

This will give you the first ten items of the cast and crew for Lone Star. Looks straightforward, except for the search string: why do you have to write 26748 twice? In this case, it's because the movie_id field in the index is numeric, and numeric fields can only be filtered by a range.

Grabbing a Document

Getting a document from RediSearch is even easier. We instantiate everything the same way as we did with the search script, but we don't need to supply any options, and instead of a search string, we pass a docId.

We just need to pass the docId to the getDoc method (abstracting the FT.GET command) along with a callback and we're in business! The show function is the same as in the search.
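In sketch form (assuming getDoc takes just the docId and an error-first callback, with the same pretty-printing as the search script):

// Fetch a single document by its ID; getDoc wraps FT.GET.
search.getDoc(argv.docid, function(err, doc) {
  if (err) { throw err; }
  console.log(JSON.stringify(doc, null, 2));
  client.quit();   // close the connection so the script exits
});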

This will work equally well for both cast/crew or movie documents:

$ node by-docid.node.js --connection ./your-connection-object-as.json --docid="56479e8dc3a3682614004934" --searchtype castCrew

Dropping the Index

If you try to import a CSV file twice, you'll get an error similar to this:

if (err) { throw err; }
ReplyError: Index already exists. Drop it first!

This is because you can't just create an index over a pre-existing one. There is a script included to quickly drop one of your indexes. You can run it like this:

$ node drop-index.node.js --connection ./your-connection-object-as.json --searchtype movies
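Under the hood, that script amounts to little more than this (a sketch, assuming the library exposes the FT.DROP command as a dropIndex method taking a callback):

// Drop the index so the import script can recreate it from a clean slate.
search.dropIndex(function(err) {
  if (err) { throw err; }
  console.log('Index dropped');
  client.quit();
});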

Status and Next Steps

In this installment, we've covered how to parse large CSV files and import them into RediSearch efficiently with a couple of scripts tailored to their different structures. Then, we built a script to run search queries over this data, grab individual documents, and drop an index. We've now learned most of the steps involved in managing the lifecycle of a dataset.

In our next installment, we'll build out a few more features in our library to better abstract searching and add in a few more options. Then, we'll start building a web UI to search through our data. Stay tuned to the Redis Labs blog!


