Solr Powered ISFDB – Part #2: Basic Schema Improvements
This is Part 2 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.
When we left off last time, I had configs that enabled me to build a very basic index of all the title+author pairs in the ISFDB using the DataImportHandler, but everything is being treated as a raw string — which means it’s not very useful for searching (we don’t have any tokenization, or handling for numeric fields at all). The goal for today is to improve the schema, so we it can be used to execute some interesting queries.
(If you are interested in following along at home, you can checkout the code from github. I’m starting at the blog_1 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_2 tag containing the end result of this article.)
The first step is to audit the data we have, and think about how best to model it in our schema.xml. For now I’m going to punt on the larger issue of modeling the ISFDB data as a whole (Authors, Titles, Publications, etc…) and focus just on dealing with the fields I’m currently getting back with my simplistic MySQL query. There are a couple of ways to go about this — I could start by looking at the database, but Solr also gives me some handy tools for doing this.
(Note: If you’ve ever seen me give an “Apache Solr: Out of The Box” Demo, reading this blog may feel redundant)
I’m going to start by loading the Schema Browser. Even though my schema.xml file is really simplistic and only has one dynamic field declared, the Schema Browser can tell me all sorts of interesting things about the concrete fields that exist because of that dynamic field.
The first thing that jumps out at me is that two fields include “wikipedia” in their name: author_wikipedia and title_wikipedia. When I click on either of those fields, I see stats telling me how many documents have terms indexed in that field, how many uniquer terms exist in that field, what some of the top terms are, and a histogram of term frequencies across the index. In both cases, it’s pretty obvious from the lists of top terms that these fields contain URLs of wikipedia pages about the author (and title) respectively. I can also see from that histograms (and frequencies of the top terms) that the “title_wikipedia” URL is generally unique per document, while the author_wikipedia URL can be very common. Since each document corresponds to a title+author pair, this makes sense — because authors write many books, but most books don’t have multiple authors. Since these are URL fields, and the only words in those URLs that are relevant to my search are the names of the author and title (which I can clearly see are the names of other fields) I don’t need to bother “indexing” these fields (ie: I won’t search, sort, or facet on them) but I will continue to “store” them, so people can access those links after executing a search.
Rather then explicitly list each field as being indexed=false, I declare a fieldType for dealing with URLs (it’s still implemented as a StrField) and set the property there. This is just my preference, since I’m going to want this behavior for any other URL type fields I wind up adding.
Once I change my schema.xml, there a lot of little tricks I could use to reload it on the fly, but I tend to prefer the simplest approach — stop Solr, change the file, start Solr (It’s fast and easy). They way I’ve setup the github project, if you edit the files in src/, they will automatically be copied over and used every time you run “ant run-solr”, so all that’s left is to tell DIH to reindex…
From here we iterate and look at more fields in the schema browser, and deal with a few more simple changes to our schema.xml, noting things that could be useful later when we want to consider our search experience…
- author_imdb and author_image are also URL fields.
- Every document in the index has the same value for title_ctl, so lets ignore it.
- author_id and title_id are clearly the numeric identifiers for the author and title corresponding to the current row, so we can safely ignore them for now — but there are also fields like series_id, title_parent and title_synopsis that are numeric, and their histograms suggest they are probably foreign keys to other tables in the database — so for now we ignore them too, but we should keep them in mind for the future when we improve our domain model.
- ca_status and author_marque are also numeric values, but have an extremely low number of distinct terms — so they are probably codes for something and not foreign keys. They may be useful for faceting later once we know what they mean.
- title_ttype is an obvious enumeration – keep as string for now, but consider for faceting
- author_annualviews, author_views, title_views, title_annualviews and title_rating are all numeric values, and suggest that they might be useful for “weighting” results later (to bias relevancy).
- author_birthdate and author_deathdate are (obviously) date fields.
The remaining fields are all “Text” in one form or another. In Solr the distinction between “Strings” and “Text” is that string fields are indexed verbatim, and searching on a string field requires an exact match of the entire value (capitalization, punctuation, whitespace, everything…) while text fields can have whatever custom tokenization and analysis you want at index and query time. I’ll save for another day the topic of why/how/when I might want different text analysis for different fields, and just use a very basic approach for all of them. With that done, I can now remove the “*” dynamicField I was using (because I’ve identified all of my fields and given them explicit types) and I’m done for the day.
Something I want to stress at this point: Once you get the hang of it, using the Schema Browser to inspect the fields, and iteratively editing the schema file and reindexing is a really simple and straight forward process — even when you’ve never looked at the underlying DB schema before. It took at least ten times longer to write about the things I did in this blog then it did to actually do them.
That brings us up to the blog_2 tag. I’ve now got a much more useful index then I had last week, and I can do all sorts of very interesting queries, like…
- All records where the authors last name is smith
- All records for Novels by authors born in Paris
- records about titles with “space” in their name, with faecet counts based on the type of work
- A break down showing how many novels contain “space” vs “robot” vs “robots” in their titles
That’s it for this week. I haven’t decided what to focus on next time, but there are endless improvements to be made, and many different ways to go about each of them — if you have suggestions, please post them in the comments.