When we left off last time, I had basic support for multiple types of documents: Title documents (that were fairly well fleshed out) and Author documents (that were not). In this installment, I do some housekeeping and improve my modeling of Author-related data.
(If you are interested in following along at home, you can check out the code from GitHub. I’m starting at the blog_4 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_5 tag containing the end result of this article.)
Step #0: Cleaning Up My Messes
As I mentioned last time, I took a shortcut when adding my Author documents by reusing several fields in the schema.xml that existed for the Title documents. The first thing I wanted to do before adding more fields to my Author documents was to clean up the fields I already had, so there was a nice clear separation and I could feel comfortable tweaking fields without risk of breaking my other doc_type.
So I started by renaming all of my fields in the DIH and schema.xml config files so that Title and Author entities have distinct field names except in the rare situations where the concept behind the fields really is the same for both types of entity…
- imdb URLs
- wikipedia URLs
For the most part the new naming convention is just a simplification of the DB column name, dropping the “title_” and “author_” prefixes in most cases because they are unambiguous (i.e., “Authors” don’t have a “ttype”), and switching to things like “author_*s” as the author list fields for titles (since they’ll be multi-valued, but the corresponding fields for authors aren’t).
While I was at it, I made all of the Date and URL fields use *_date and *_url (and *_urls) naming conventions to help make their usage/meaning more obvious. A nice side effect of this change is that the schema.xml gets simpler because I can use more dynamic field declarations.
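As a rough illustration of why the suffix conventions simplify things, here is a minimal sketch of what such dynamic field declarations could look like in schema.xml. The field type names ("date", "string") and the indexed/stored settings are assumptions for the example, not the values from the actual commit.

```xml
<!-- Hypothetical schema.xml fragment: one declaration covers every field
     whose name ends with the given suffix. The types "date" and "string"
     are assumed to be defined elsewhere in the schema. -->
<dynamicField name="*_date" type="date"   indexed="true" stored="true"/>
<dynamicField name="*_url"  type="string" indexed="true" stored="true"/>
<dynamicField name="*_urls" type="string" indexed="true" stored="true"
              multiValued="true"/>
```

With declarations along these lines, a new field such as a hypothetical "birth_date" needs no schema change at all; only the DIH query has to return a column with that name.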
While reviewing those changes, I noticed two little oddities in my docs…
- All Author docs were getting a “seriesnum” field
- All Title docs were getting an “author_id” field
The seriesnum field was really no surprise once I remembered that I had configured this field with a default value of “-1” back when I was only using title based docs. Fixing this was a simple matter of moving that default logic into the SQL using a MySQL “IF” trick.
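A minimal sketch of that kind of change, assuming a "titles" table with "title_id" and "title_seriesnum" columns (these names, and the exact expression, are assumptions for illustration rather than a copy of the real config): the default is dropped from the schema field, and the DIH query for the title entity supplies it instead.

```xml
<!-- Hypothetical DIH entity: the -1 default now lives in the SQL, so only
     Title docs ever receive a seriesnum value. Table and column names are
     assumed. -->
<entity name="title"
        query="SELECT title_id AS id,
                      IF(title_seriesnum IS NULL, -1, title_seriesnum) AS seriesnum
               FROM titles"/>
```

Because the Author entity's query never returns a "seriesnum" column, Author docs no longer pick up the field.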
The “author_id” field was more interesting: it was apparently coming from the “second tier” nested entity used to get the title/author mappings for title docs. This is one of those edge cases where DIH’s goal of making it easy for fields returned by DataSources to become document fields is counterproductive. Yes, I have an “author_id” field in my schema, but I don’t want DIH using it for this type of entity. I spent a few minutes trying to figure out how to avoid this. My initial guess was that I might need to use the <field ... /> syntax to declare every field I wanted in order to prevent these implicit fields from springing into existence, but before I got that far it occurred to me that if I used an alternate ‘name’ in my SQL that did not exist in my schema.xml, DIH would (probably) ignore it (similar to how it happily deals with multiple values for single value fields). It worked like a charm.
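The aliasing idea can be sketched roughly as follows. Everything specific here is an assumption made for the example: the table names, the column names, the "author_names" target field, and the alias "map_author_id", which is only chosen because no schema field has that name.

```xml
<!-- Hypothetical nested DIH entities. The mapping query still selects the
     author id, but under an alias that does not match any schema field,
     so it is not copied into the Title doc. -->
<entity name="title" query="SELECT title_id AS id FROM titles">
  <entity name="title_authors"
          query="SELECT a.author_id AS map_author_id,
                        a.author_canonical AS author_names
                 FROM canonical_author ca
                 JOIN authors a ON a.author_id = ca.author_id
                 WHERE ca.title_id = '${title.id}'"/>
</entity>
```

The aliased column remains available to deeper nested entities through DIH variable substitution (for example '${title_authors.map_author_id}'), so nothing is lost by renaming it.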
Step #0.5: Tidy Up My Workspace
It may seem like a trivial little thing, but once I had these field names cleaned up and working, I really wanted to re-organize my schema.xml file to group the field declarations more closely together based on the type of documents they are used in. My physical workspace is a mess, but I like having my virtual workspace neat, orderly, and easy to find things in. While I was re-grouping these field declarations I noticed that my “catchall” field was still declared as a string type from my original schema (something I probably would have spotted had I looked at the Schema Browser recently). Switching this to “simpletext” gives me a much better experience when trying to quickly execute some basic searches.
On With The Show
Once things were a little neater, I could get on with improving how I was modeling Author data. The trivial first step was to add a few more fields to my Author docs: some simple ones from the “authors” table as well as new “webpages” and “emails” fields (using more nested entities). While doing this I also realized I still had some fields marked “multiValued” that shouldn’t be (no matter how much you clean, some dust always sneaks through).
I also improved how author data was modeled in Title documents a bit, by updating my “author_*” fields to specifically index a blank value when a piece of metadata was NULL for an author. This is crucial in cases where a title has multiple authors but some data (e.g., a photo) is missing for one of the authors. It lets us correlate each field value with the correct author by position in the list. (Solr always returns the values of a multi-valued field in the same order they were added, and DIH always adds fields from nested entities in the same order for each doc.)
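One way to get such a blank placeholder is MySQL's IFNULL function in the nested entity's query. The sketch below is only an illustration under assumed names ("author_image" column, "author_image_urls" field, the join tables); the real config may do this differently.

```xml
<!-- Hypothetical nested entity: each author row contributes exactly one
     value to every author_* field, using '' when the column is NULL, so
     the multi-valued fields stay aligned by position. -->
<entity name="title_authors"
        query="SELECT a.author_canonical AS author_names,
                      IFNULL(a.author_image, '') AS author_image_urls
               FROM canonical_author ca
               JOIN authors a ON a.author_id = ca.author_id
               WHERE ca.title_id = '${title.id}'"/>
```

For a title with two authors where only the second has a photo, the doc would then carry two values in each field, with the first "author_image_urls" value blank, so the value at a given position in one list belongs to the author at the same position in the other.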
Conclusion (For Now)
And that wraps up this latest installment with the blog_5 tag. The index is now in much better shape, returning much more useful information when you search for authors or titles. I suspect that next time I will have to bite the bullet and deal head-on with the big elephant in the room: Pseudonyms.