Solr Powered ISFDB – Part #6: Pseudonyms

by Chris Hostetter · Nov. 23, 11 · Interview

This is Part 6 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.

When we left last time, I had some decent modeling of Titles and Authors in distinct documents, but Pseudonyms were being treated as distinct Authors. Today I set out to deal with that.

(If you are interested in following along at home, you can check out the code from GitHub. I’m starting at the blog_5 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_6 tag containing the end result of this article.)

What’s a Pseudonym?

In the real world, a pseudonym is just an alternate name someone uses. In the ISFDB, pseudonyms are actually modeled as real Author Objects, with metadata indicating that they have a pseudonym relationship with some other Author Object.

This affects our existing Solr Document model in a couple of annoying ways:

  • In an Author-centric document, there is no indication of whether that Author has any pseudonyms, nor any way to search for that Author by a pseudonym.
  • In an Author-centric document, there is no indication of whether that Author is a pseudonym for another Author, nor any way to search for that pseudonym by the Author’s real name.
  • In a Title-centric document, there is no indication of whether any of the listed authors are pseudonyms for other (real) Authors, nor any way to search for that Title by an Author’s real name.

As mentioned back in Blog #3, document modeling is all about thinking flat, and denormalizing the data, and that’s how we’re going to try and deal with pseudonyms.

Indexing Pseudonyms For Real Authors

Tackling the first issue was relatively straightforward. It was basically no different than when we added a list of email addresses for each author using a nested entity (and since the list of pseudonyms is once again relatively small, we can cache the entire thing in memory using the CachedSqlEntityProcessor).

So now when we do a search for Authors named Asimov, we not only see “Isaac Asimov” in the list, we also see that there are 6 pseudonyms for him, and we have the ID for each if we want to look up one of those records or see which Titles were written under that pseudonym.
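The cached nested-entity idea can be pictured outside of DIH. What follows is only a rough Python sketch, not the DIH configuration used in this series: the IDs, the `PSEUDONYMS` pairs, and the `id`, `pseudonym_ids`, and `pseudonym_names` field names are invented for illustration. It reads the whole (small) pseudonym mapping into memory once, which is the role the cache plays, and then looks up each author's pseudonyms while building a flat document.

```python
from collections import defaultdict

# Hypothetical rows standing in for the ISFDB tables; the IDs are invented.
AUTHORS = [
    {"author_id": 1, "name": "Isaac Asimov"},
    {"author_id": 2, "name": "Paul French"},
]
# (real_author_id, pseudonym_author_id) pairs, also invented.
PSEUDONYMS = [(1, 2)]


def build_pseudonym_cache(pairs):
    """Read the whole mapping once and key it by the real author's ID."""
    cache = defaultdict(list)
    for real_id, pseudonym_id in pairs:
        cache[real_id].append(pseudonym_id)
    return cache


def build_author_docs(authors, pairs):
    """Build one flat document per author, denormalizing pseudonym IDs and names."""
    names_by_id = {a["author_id"]: a["name"] for a in authors}
    cache = build_pseudonym_cache(pairs)
    docs = []
    for author in authors:
        pseudonym_ids = cache.get(author["author_id"], [])
        docs.append({
            "doc_type": "AUTHOR",
            "id": author["author_id"],
            "name": author["name"],
            "pseudonym_ids": list(pseudonym_ids),
            "pseudonym_names": [names_by_id[i] for i in pseudonym_ids],
        })
    return docs


docs = build_author_docs(AUTHORS, PSEUDONYMS)
```

Because the mapping is loaded once, each per-author lookup is a dictionary access rather than a fresh database query, which is the appeal of caching a small table.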

Of course to really make this useful, we also want a simple “names” field for Author docs, that lists all of the different names that (real) Author is known by. <copyField /> makes this trivial, so now a search for authors named “French” will not only return the alias “Paul French” but also the real author “Isaac Asimov”.
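To make the effect of copying several fields into one “names” field concrete, here is a small Python imitation. It is not Solr code: `copy_field` and `matches` are hypothetical helpers, the `name` and `pseudonym_names` source fields are assumed, and the substring match is only a stand-in for whatever analysis the real field type applies.

```python
def copy_field(doc, source, dest):
    """Imitate a copy rule: append every value of `source` to the multi-valued `dest`."""
    values = doc.get(source, [])
    if not isinstance(values, list):
        values = [values]
    doc.setdefault(dest, []).extend(values)
    return doc


def matches(doc, term):
    """Crude stand-in for a search against the combined field (case-insensitive substring)."""
    return any(term.lower() in name.lower() for name in doc.get("names", []))


# Hypothetical author document with an assumed field layout.
asimov = {"name": "Isaac Asimov", "pseudonym_names": ["Paul French"]}
for source in ("name", "pseudonym_names"):
    copy_field(asimov, source, "names")
```

After the two copies, `asimov["names"]` holds both the real name and the alias, so a lookup for “French” finds the real author's document as well.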

Indexing Real Names For Pseudonym Authors

My approach for adding the “real” names (and ids) to pseudonym Author documents was basically the same as adding pseudonym names/ids to “real” author documents. I would like to say there was an easy way to tweak and reuse the previously cached nested entity in DIH to do a reverse lookup, but I certainly couldn’t find one. Now when searching for authors named “Isaac Asimov” we not only get the “real” Isaac, but also his various pseudonyms. If we want to exclude pseudonyms from an author search, we can add a simple filter query: fq=-real_author_id:[* TO *].
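As a hedged illustration of sending that filter query, the Python below builds a URL-encoded query string. The `author_search_params` helper is hypothetical, and searching the `names` field in `q` is an assumption about how the main query would be phrased; only the `fq` value comes from the text above.

```python
from urllib.parse import parse_qs, urlencode


def author_search_params(name, exclude_pseudonyms=False):
    """Build an encoded query string for an author search (hypothetical helper)."""
    params = [("q", f'names:"{name}"')]
    if exclude_pseudonyms:
        # Negated open-ended range: keep only docs with no real_author_id value.
        params.append(("fq", "-real_author_id:[* TO *]"))
    return urlencode(params)


query_string = author_search_params("Isaac Asimov", exclude_pseudonyms=True)
decoded = parse_qs(query_string)
```

Encoding matters here because the brackets, colon, asterisks, and spaces in the range query are not safe to paste into a URL unescaped.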

I wanted to make that pseudonym filtering easier to do (and easier to facet on) by adding an “is_pseudonym” field. One way to do this might have been to use the TemplateTransformer on my nested “pseudonym_real_author” entity, but I was pretty sure that would only have set it for Authors that had a mapping to that nested entity; I want the boolean to be set for all Author docs. So instead I used my first ScriptTransformer to set the value of the field.

My first attempt didn’t work the way I expected at all. For every author, the row never contained a value for “real_author_id” when my script was executed, so is_pseudonym was always false. As near as I can tell, since the Transformer was specified on the top level “author” entity, the script was being executed as soon as the “row” got populated with data from the top level “query”. (Disclaimer: I didn’t dig into the code to verify this, but reading the wiki again, it makes sense.) I couldn’t figure out an easy way around this, so for now I’ve ripped it out and will look into it more later.
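The suspected ordering problem can be shown with a toy model. This Python sketch does not reproduce DIH internals; it only encodes the unverified hypothesis above, namely that the transformer runs on the top-level row before any nested-entity fields are attached. All function names, IDs, and the lookup table are invented.

```python
def is_pseudonym_transformer(row):
    """Set is_pseudonym based on whether the row already carries a real_author_id."""
    row["is_pseudonym"] = "real_author_id" in row
    return row


def index_author(author_row, real_author_lookup, transformer):
    """Toy pipeline: transform the top-level row first, then add nested-entity fields."""
    # Step 1: the transformer only sees columns from the top-level query.
    row = transformer(dict(author_row))
    # Step 2: fields from the nested entity arrive afterwards (assumed ordering).
    real_id = real_author_lookup.get(row["author_id"])
    if real_id is not None:
        row["real_author_id"] = real_id
    return row


# Hypothetical data: author 2 is a pseudonym of author 1.
doc = index_author(
    {"author_id": 2, "name": "Paul French"},
    {2: 1},
    is_pseudonym_transformer,
)
```

In this model the finished document has a `real_author_id` and yet `is_pseudonym` is false, which matches the symptom described. Whether DIH really orders things this way is exactly what was not verified.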

Digression: Document Modeling Choices

A quick digression before we move on to adding pseudonym info to Title documents: I want to point out that I made a conscious Document Modeling Decision in the previous section, where I decided that “Pseudonym Authors” would still be indexed just like any other author; they would just have a few extra fields. In particular, they still have a “doc_type” of “AUTHOR”. Another possible choice would have been to introduce a new “PSEUDO_AUTHOR” doc_type. I can’t really explain why I made the choice I did; it just felt more right given the vague notions I have about how I want to use this index. I want to be able to easily tell when an Author is really a pseudonym for another Author, but I don’t really need/want to treat those pseudonym documents as second class citizens. Maybe down the road I’ll run into a particular use case that will change my mind, but for now it made sense to continue to treat them the same as regular authors.

Indexing Real Names For Titles

My first attempt at indexing the real name/id for each author of a Title was basically the same as it was for the Author documents, just adding a nested entity using the pseudonyms table. The problem with this approach was that since it only added fields for authors that were pseudonyms, the list of “real_author_names” and “real_author_ids” in each title would be shorter than the list of credited authors when some were “real” and some were pseudonyms. For example, “The Lost” has four credited Authors, but one of them (“J. D. Robb”) is a pseudonym (for “Nora Roberts”). This is how those fields looked in the results…

  <arr name="author_ids">
    <str>2857</str>
    <str>36103</str>
    <str>136275</str>
    <str>35293</str>
  </arr>
  <arr name="author_names">
    <str>J. D. Robb</str>
    <str>Mary Blayney</str>
    <str>Patricia Gaffney</str>
    <str>Ruth Ryan Langan</str>
  </arr>
  <arr name="real_author_ids">
    <str>4853</str>
  </arr>
  <arr name="real_author_names">
    <str>Nora Roberts</str>
  </arr>

Can’t really tell who’s who there, can we?

So to improve on this, I removed the special sub-entity for pseudonym relationships, and instead I modified the existing “author” sub-entity to do a LEFT JOIN on the pseudonym table to populate the same fields. (Since LEFT JOIN is really a DB concept, and not anything special to Solr or DIH, I’m not going to bother explaining it here, but you can read about it online.) So now those same fields look like…

  <arr name="author_ids">
    <str>2857</str>
    <str>36103</str>
    <str>136275</str>
    <str>35293</str>
  </arr>
  <arr name="author_names">
    <str>J. D. Robb</str>
    <str>Mary Blayney</str>
    <str>Patricia Gaffney</str>
    <str>Ruth Ryan Langan</str>
  </arr>
  <arr name="real_author_ids">
    <str>4853</str>
    <str>36103</str>
    <str>136275</str>
    <str>35293</str>
  </arr>
  <arr name="real_author_names">
    <str>Nora Roberts</str>
    <str>Mary Blayney</str>
    <str>Patricia Gaffney</str>
    <str>Ruth Ryan Langan</str>
  </arr>

Which makes them much more useful.
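For readers who want to see the LEFT JOIN idea in action, here is a self-contained Python sketch using an in-memory SQLite database. It is not the query from this project: the table and column names are invented rather than taken from the ISFDB schema, and falling back to the credited author with COALESCE is an assumption about how the “real” fields get filled for non-pseudonyms. The IDs and names are the ones shown in the results above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables (not the real ISFDB schema).
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE title_authors (title TEXT, author_id INTEGER);
    CREATE TABLE pseudonyms (pseudonym_id INTEGER, real_author_id INTEGER);
    INSERT INTO authors VALUES
        (2857, 'J. D. Robb'), (36103, 'Mary Blayney'),
        (136275, 'Patricia Gaffney'), (35293, 'Ruth Ryan Langan'),
        (4853, 'Nora Roberts');
    INSERT INTO title_authors VALUES
        ('The Lost', 2857), ('The Lost', 36103),
        ('The Lost', 136275), ('The Lost', 35293);
    INSERT INTO pseudonyms VALUES (2857, 4853);
""")

# LEFT JOIN keeps every credited author; COALESCE falls back to the credited
# author's own id/name when there is no pseudonym row to join against.
rows = conn.execute("""
    SELECT a.author_id,
           a.name,
           COALESCE(r.author_id, a.author_id) AS real_author_id,
           COALESCE(r.name, a.name) AS real_author_name
    FROM title_authors ta
    JOIN authors a ON a.author_id = ta.author_id
    LEFT JOIN pseudonyms p ON p.pseudonym_id = a.author_id
    LEFT JOIN authors r ON r.author_id = p.real_author_id
    WHERE ta.title = ?
    ORDER BY ta.rowid
""", ("The Lost",)).fetchall()

author_names = [row[1] for row in rows]
real_author_ids = [row[2] for row in rows]
real_author_names = [row[3] for row in rows]
```

Each credited author yields exactly one row here, so the “real” lists stay the same length as the credited lists and line up position by position.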

Conclusion (For Now)

OK, that’s going to wrap up this latest installment with the blog_6 tag. The index is in pretty good shape: we can now do some pretty interesting queries on Titles and Authors using either the real names of authors or the pseudonyms they use. I think next week I may really get my hands dirty and do some UI work so I can show off some screenshots.


Source:  http://www.lucidimagination.com/blog/2011/02/27/solr-powered-isfdb-part-6/
