Solr Powered ISFDB – Part #4: Multiple Doc Types

By Chris Hostetter · Nov. 21, 11 · Interview

This is Part 4 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.

When we left off last time, I had a nice index of “Title Centric” documents: one document for each title in the ISFDB, with multi-valued fields containing the basic data about each Author that worked on a title. This is the first week I didn’t have time to work on the project on Friday, so I squeezed in a bit of work today (Saturday) to just get the basics in place for Author Centric documents.

(If you are interested in following along at home, you can check out the code from github. I’m starting at the blog_3 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_4 tag containing the end result of this article.)

Why Not Use Multiple Indexes?

A quick digression: Today I’m going to start putting multiple different types of documents all in one index. You might ask (and rightly so) “Why not just use multiple indexes? Wouldn’t using multiple SolrCores make sense for this?” The answer is “it depends”. If I had completely different use cases for title searching and author searching, then it would certainly make sense to use different cores with different schemas. But ultimately I want to be able to support a single simple search box where you can just type in anything (a name, a title, a keyword) and get “good” results, with authors and titles intermixed in one set of results. At that point faceting can be used to say “no, I really just meant I was looking for titles…” or “actually I just want authors…”.

Now that we’ve cleared that up…

Supporting Multiple Document Types

The first thing to do to support multiple types of documents in a single index is to make our uniqueKey something that can be distinct across all types of documents, so we don’t risk collisions. So I’ve added a “doc_id” field to my schema, and made it the uniqueKey field. To populate this for each of my documents, I’m using the “TemplateTransformer” to construct an artificial field out of the “title_id” field, with a “TITLE_” prefix put in front of each value to make them unique (so title_id #4321 and author_id #4321 don’t overwrite each other in the index).
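
On the schema side, that change might look roughly like the fragment below. This is only a sketch: the “string” field type and the indexed/stored/required attributes are my assumptions, not copied from the project’s schema.xml (the real one is in the github repo).

```xml
<!-- schema.xml (sketch): attribute values here are assumptions -->
<!-- A plain string field that holds the prefixed ID, e.g. "TITLE_4321" -->
<field name="doc_id" type="string" indexed="true" stored="true" required="true" />

<!-- uniqueKey is declared as a sibling of the <fields> section, not inside it -->
<uniqueKey>doc_id</uniqueKey>
```

A non-tokenized string type is the usual choice for a uniqueKey, since the value has to match exactly when a document is replaced.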

This is the first time I’ve had to explicitly use DIH’s <field …> syntax to declare a field (because the field name isn’t exactly the same as the column name from the DB and I need to generate it using a transformer) and it exposes an eccentricity about how DIH deals with fields: it refers to the field name you want to use in Solr as a “column”…

 

<field column="doc_id" template="TITLE_${title.title_id}" />



This thoroughly confused me when I looked at the examples. I really wanted to write something like this…

 

<field name="doc_id" template="TITLE_${title.title_id}" />



…but as you can see from other examples in the DIH docs, the “column” form is how the <field …> tag is used, even when the source of the data is something that isn’t a DB. (It doesn’t make a lot of sense to me, but it is what it is.)

Now that we’ve got a new uniqueKey field with a generated value for all of our title docs, we’ll also want a “doc_type” field that keeps track of (surprise!) the type of document we’re dealing with. As the DIH FAQ mentions, this is trivial to do by abusing the template transformer…

<field column="doc_type" template="TITLE" />
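
Put together, the two generated fields sit inside the title entity something like this. Treat it as a sketch: the entity name “title” is implied by the ${title.title_id} variable above, but the query is a placeholder I made up, and the real entity in the repo has more to it.

```xml
<!-- DIH config (sketch): the query is a hypothetical placeholder -->
<entity name="title"
        transformer="TemplateTransformer"
        query="SELECT * FROM titles">
  <!-- "column" names the Solr field to create; "template" supplies its value -->
  <field column="doc_id" template="TITLE_${title.title_id}" />
  <!-- a template with no variables just yields the same constant for every row -->
  <field column="doc_type" template="TITLE" />
</entity>
```

The transformer has to be named on the entity itself; without the transformer attribute the template attributes are not evaluated.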



Adding Authors

With the foundation in place, adding Author Centric documents is just a matter of adding a new “top level” entity for them in our DIH config. For now I’m just using simple fields from the Author table that I can expand on later. I’m also reusing overlapping fields in the schema for properties of an author that are already in my title centric docs. This is a shortcut to save time now, but ultimately I’ll want to better distinguish these fields for two reasons:

  • Schema Properties: Even though it will probably make sense to use the same field type for these fields in the author docs and the title docs, in a good document model, properties like “multiValued” should be different between them — because a title can have multiple author names, but an author can only have one canonical name.
  • Field Stats: Things like the IDF of a field span the entire index. They don’t know about “doc_type”, so for fields like “author_canonical” they will get seriously out of whack the way I’m reusing them. (i.e., not a lot of authors have “asimov” in their name, but lots of title documents do.)
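
A new top level entity for authors could be sketched along these lines. The “AUTHOR_” prefix, the “AUTHOR” doc_type value, and the query are my assumptions, mirroring the title entity rather than quoting the actual commit.

```xml
<!-- DIH config (sketch): prefix, doc_type value, and query are assumptions -->
<entity name="author"
        transformer="TemplateTransformer"
        query="SELECT author_id, author_canonical FROM authors">
  <!-- a different prefix keeps author #4321 from colliding with title #4321 -->
  <field column="doc_id" template="AUTHOR_${author.author_id}" />
  <field column="doc_type" template="AUTHOR" />
  <!-- columns whose names match schema fields (e.g. author_canonical)
       need no explicit <field> declaration -->
</entity>
```

Because this entity sits next to the title entity directly under the document element, each row it returns becomes its own Solr document instead of being folded into a title.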

Conclusion (For Now)

And that wraps up this latest installment with the blog_4 tag. Now we can not only query for “books by people named asimov” but also “people named asimov”.
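
As a rough illustration of where this is heading, a search handler could facet on the new “doc_type” field so one result set can be narrowed to just titles or just authors. This is a sketch only: the handler name “/search” is hypothetical and nothing like it has been added to the project’s solrconfig.xml yet.

```xml
<!-- solrconfig.xml (sketch): the handler name is hypothetical -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- return counts of matching docs per doc_type alongside the results -->
    <str name="facet">true</str>
    <str name="facet.field">doc_type</str>
  </lst>
</requestHandler>
```

A client could then add a filter such as fq=doc_type:AUTHOR to a request to keep only the author documents (again assuming “AUTHOR” is the value used for that type).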

Sorry this was such a short post — check back at the end of this week, when I should hopefully have some more time to clean up the schema a bit, and improve on the modeling of the “Author Centric” Documents.

Source: http://www.lucidimagination.com/blog/2011/02/12/solr-powered-isfdb-part-4/

 
