When we left off on Jan 28, I had configs that enabled me to build a very basic index of all the title+author pairs in the ISFDB using the DataImportHandler, indexing into meaningful fields with useful data types. The goal last Friday (When I managed to squeeze in a bit of work on this, but I didn’t get a chance to blog about it until today) was to improve the “Document Modeling” of the index, so that queries could be used to answer meaningful questions about “Titles”.
(If you are interested in following along at home, you can checkout the code from github. I’m starting at the blog_2 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_3 tag containing the end result of this article.)
What does “Document Modeling” mean
“Document Modeling” is just my fancy way of describing “Data Modeling” from a search perspective – I use the term “Document Modeling” because unlike traditional RDBMS or OO Data Modeling, when using an IR Engine like Solr, you need to think “flat”. Complete relationships need to be flattened into individual “Documents” that form the basis of all actions.
At the end of last weeks blog, I mentioned some queries we could do with the data we had…
The key part of these query descriptions to notice is the term “record” … what i meant by that is that because of how the data was being index, each Document in our index corresponded to a “record” of a title+author pair — which is fairly arbitrary. This isn’t the type of information most people are looking for, people want to search for “Books” or “People” – not “Instances where a Person was an author of a Book”.
So today we’re going to tweak our index so that each Document models a “Title” and contains info about all of the Authors that collaborated on it.
To start things off, I took a look at the ISFDB.org DB Schema documentation.
(It says a lot about Solr that this is really the first time I had to
look at any documentation on the tables I was indexing). This lead me
to a few pieces of information that I took advantage of…
- constraining ca_status=1 in our DB query is how we eliminate reviews and just get real authors of titles
- verified that title_ctl is a bogus field
With those tweaks in mind, the first thing to do to make progress on the configs was to spell out explicitly the fields I wanted in our SQL so I could keep track of what was coming from where and completely skip the fields i was already ignoring.
Next up, was to switch to making our top level “Entity” in the DIH config be “title” and add a “nested entity” that corresponds to the “author” data. Using nested entities like this allows for one query that fetches the main details about each row (“title”), with a sub query that returns multiple values for multivalued fields (the “author” related fields).
Using nested entities is a lot slower (~6.5 minutes on my laptop vs ~2 minutes in the old version) but that’s just because we need to execute a lot more sql queries (600k+ SQL queries instead of 1) — it’s the nature of the DB data model. If the datamodel had the notion of a “primary” author, then we could include that in our main select for the “title” entity, and then most of the queries for the “title_author” sub-entity would be no-ops. (not sure if it would actually be much faster though)
The upside is now we have a single document per “title” with multivalued fields containing the author details — since these documents are title-centric, I remove some of the author fields since they don’t really add any value. (I’ll add them back when we also include author-centric documents in our index)
Gotcha: DIH and Multiple Values for Single Valued Fields
At this point while doing some testing, I discovered something interesting about DIH that I didn’t know before. If you configure it to produce multiple values for a field (like all the author_* fields in my current isfdb-dih.xml) but some of those fields are multiValued=”false” in your schema (I forgot to add it to author_canonical) DIH will silently throw away the duplicates for you and just give you one value. I would have expected it to error (if it had I would have caught my mistake before my last commit) but it doesn’t — live and learn (and fix my schema.xml mistake).
Speed Things Up?
At this point, I tried adding the CachedSqlEntityProcessor to my sub-entity, but it didn’t help speed things up. In hindsight this wasn’t too surprising given the number of authors, but then I realized because I was using canonical_authors in the sql for the nested entity, of course every select would be unique.
Then I noticed the option on CachedSqlEntityProcessor to load all rows from an entity into memory as part of a single upfront query. This is a feature I wasn’t aware of, that sounded like it could be great for dealing with authors (assuming you have the RAM). The trick was to move canonical_authors out of the entity (so we only have one row per author, not one per author_title) which leads to a triple nested entity – requiring only 3 DB SQL calls to get all the data.
this got the indexing time down to only ~70 seconds — way better then my expectations (I figured it might be close to as fast as the single select, but I didn’t expect it to be faster). Using CachedSqlEntityProcessor won’t help in all situations (it really depends on the volume of data you’re trying to cache, and how much RAM you have to work with) but in this case it turned out awesome.
Conclusion (For Now)
And that wraps up this latest installment with the blog_3
tag. Now queries like the ones I mentioned last week all give us
results that actually mean something interesting about “Titles” in the
- All titles by an author whose name contains “Smith”
- Titles with “space” in their name, with facet counts based on the type of work
- A break down showing how many novels contain “space” vs “robot” vs “robots” in their titles
Check back at the end of this week, when I plan to add “Author Centric” Documents to our Document Model, letting users search for Titles or Authors and still get meaningful results.