DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report
  1. DZone
  2. Data Engineering
  3. Databases
  4. Solr: Experimenting with Join Performance

Solr: Experimenting with Join Performance

Erick Erickson user avatar by
Erick Erickson
·
Jun. 25, 12 · Interview
Like (0)
Save
Tweet
Share
13.63K Views

Join the DZone community and get the full member experience.

Join For Free

We recently had a client who wanted some up-front sense of how Solr joins performed. Naturally, the client wanted to use joins in the most painful way, so I set out to make a prototype. Of course I ran into some issues, but one of the delights of working for Lucid is that I have ready access to many of the people who wrote the code, something to treasure! Being able to access these folks makes me look waaaay smarter than I am….

Anyway, on my 2009 Macbook pro I ran some rather unscientific experiments, but enough to give me a sense of joins in one particular case. I’ll outline what I did and what the results were.

The setup

For this experiment, I created an index consisting of 26M documents. They were divided up into groups, one text document and 5 metadata documents. The text document contained 1K of semi-random English words (just chosen from “some list I got from the internet”). Semi-random because I weighted them a bit to have more common words than rare words, but it turns out that the searching part of the process isn’t where the time is spent so we can pretty much ignore that.

There are 5 metadata documents related to the text doc by Solr’s <uniqueKey>. Think of this as the metadata documents having a foreign key into the text doc <uniqueKey>. The metadata documents also had an integer field in the range 0-10,000. The whole purpose of this setup was to form queries that returned the text docs for which a metadata doc existed granting access. The complexity of granting access is…er…low, I just did a range query. “Not realistic” you say. You’re right. I didn’t want any complex processing to get in the way of looking at joins, so I kept all this simple.

The form of the join query was:

 q=text_all:(1 to 3 random words)&fl=id,score&sort=score desc&fq={!join from=join_id to=id}access:[7434 TO 7514]

See: http://wiki.apache.org/solr/Join. I wrote a little harness to fire off HTTP queries at the instance of Solr (4.x from a couple of months back). I could configure the number of simultaneous threads firing off queries. Note that I was testing this form because it applied to the customer, but I suspect that the other forms have the same issue.

Small dislcaimer

As I mentioned, one of the pleasures of working for Lucid is having access to people who deeply understand the code. So I chatted with the join author (Yonik Seeley) and discovered, of course, that the scenario I was testing was the worst performance wise. Joins are O(num_terms_in_fields), and using the <uniqueKey> as my join field guarantees that there are lots and lots and lots of terms. So these results are worst-case. Unfortunately, they’re also one of the most common.

Threads Queries (total) Avg QTime(seconds) Elapsed Clocktime Queries/second
1 20 4.9 98 0.2
2 40 5.9 123 0.3
5 100 15.3 310 0.3
10 200 31.5 649 0.3

 

A note about these rather counter-intuitive numbers. Once the CPU maxes out, the QTime starts to increase, but the QPS rate stays rather constant. On a dual-core machine, we see that with 2 threads. The 5 and 10 thread (client) rows simply show that each individual request takes longer, end-to-end, but there are more queries being served by Solr simultaneously.

When I took the join part out, performance went up about 15x. I was monitoring the CPU, and it was pegged with 2 threads, which makes sense. I had jConsole running and didn’t see any anything odd with memory/garbage collection, but it was just a cursory examination.

But what does it mean?

Well, the take-away is that you really, really should experiment with the join performance in your situation before deciding on it as a solution for all your problems. I’d expect the numbers to be much better for fields with fewer unique values. But Solr makes a lousy RDBMS, and every time you think of using it as one, you should make an effort to re-think your problem in a way that doesn’t try to make Solr behave as one. These numbers, assuming that they are representative of your particular situation could well be killers. On the other hand, they may be fine if your particular situation is serving a small community of users for whom the time spent waiting for a query to return is well-spent. It Depends ™.

It might also mean that the case that Solr join functionality was meant to solve takes an unnecessarily restrictive approach for this particular problem. I suspect it’s quite possible that specializing the join code for the to-id was a <uniqueKey> might change the performance radically. One of the characteristics of open-source code is that solutions for the immediate problem get implemented and then refined for other cases as necessary.

“Rethinking” often involves at least three phases.

  1. Think hard about the problem. Can it be solved by clever indexing? DB folks really don’t like to flatten data, but that’s often a viable approach.
  2. Ask if the functionality is really and truly something that’ll help the user experience. Often faceting and filtering will be “good enough”. There’s nothing particularly “natural” about RDBMS concepts as far as your users are concerned, so ask your UI design experts what would really help the user.
  3. Prototype in your situation and talk to your product managers before irrevocably deciding to go down this route. The eXtreme Programming people emphasize over and over that making your PMs aware of the costs of a feature they’re insisting on will help them make better decisions. Asking “What you want will require 5 times as much hardware and take an additional month to implement, will XXX be good enough?” gives them some information.
  4. Ask “Is Solr the right solution?”. I love Solr/Lucene. Working with this ecosystem pays my bills. I admire the work that people do in the nitty-gritty parts of the code. But Solr and Lucene aren’t suited for some tasks. It may be that the problem you’re trying to solve would be better served by an RDBMS. It may be that some kind of hybrid between Solr and an <insert your favorite solution here> works better. It may be that Solr shouldn’t be part of the solution to this problem at all. Not all nails should be driven with Solr.

As I understand it, this behavior is inherent in how the join code is implemented and the number of matching documents isn’t the limiting factor (and this was borne out in my experiments). I wonder if one could make use of the fact that the join field is a <uniqueKey> to implement a specialization. Hmmm, I’ll have to talk to Yonik, but I suspect it’s one of those things that seems simple but quickly gets untenable. And, here we go again trying to make Solr behave like a DB…..

You can still use joins!

Don’t interpret this as saying “Don’t use joins”. Rather, you should be aware that they were implemented to solve a specific problem, not the general many-to-many relationship. The algorithm does what it needs to do to solve that problem, but when applied to different problems may not be performant enough to apply to your situation. Test, test, test!!!

 

Joins (concurrency library) Database Relational database Document

Published at DZone with permission of Erick Erickson. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • 10 Most Popular Frameworks for Building RESTful APIs
  • Spring Boot vs Eclipse MicroProfile: Resident Set Size (RSS) and Time to First Request (TFR) Comparative
  • Scaling Your Testing Efforts With Cloud-Based Testing Tools
  • How To Handle Secrets in Docker

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: