
Estimating Memory and Storage for Lucene/Solr


Clients often ask us to help them estimate memory or disk usage, or to share benchmarks, as they build out their search system. Doing so is always an interesting process. I’ve always been wary of benchmark claims (for instance, one of the old tricks of performance benchmark hacking is to “cat XXX > /dev/null” to load everything into memory first, which isn’t what most people do when running their system) and of estimates, because so many variables are involved that the results can vary quite significantly depending on marketing goals. Thus, I tend to be pragmatic (as I think the Lucene/Solr community is as well) and focus on what my tests show for my specific data and my specific use cases.

For instance, for testing memory, it’s pretty easy to set up a series of tests that start with a small heap size and successively grow it until no Out Of Memory Errors (OOMEs) occur. Then, to be on the safe side, add 1 GB of memory to the heap. It works for the large majority of people. Ironically, for Solr at least, this usually ends up with a heap size somewhere between 6 and 12 GB for a system doing “consumer search” with faceting, etc. and reasonably sized caches on an index in the 10-50 million docs range. Sure, there are systems that go beyond this or need significantly less (I just saw one the other day that has around 200M docs in less than 3 GB of RAM while handling decent search load), but 6-12 GB seems to be a nice sweet spot for the application and the JVM, especially when it comes to garbage collection, while still giving the operating system enough room to do its job. Too much heap and garbage may pile up and give you ohmygodstoptheworld full garbage collections at some point in the future. Too little heap and you get the dreaded OOME. Also, too much heap relative to total RAM and you choke off the OS. Besides, that range has a nice business/operations side effect in that 16 GB of RAM offers a good performance/cost trade-off for many people.
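The procedure above is easy to automate. The following is a minimal sketch in Python, not a tool that ships with Lucene or Solr. `survives` is a hypothetical callback that you would implement yourself: it should start your application with the given heap size (for example via the JVM's `-Xmx` option), run your own load test, and report whether the run finished without an OOME. The starting size, step, and upper limit are illustrative defaults.

```python
from typing import Callable


def find_heap_size_gb(
    survives: Callable[[int], bool],
    start_gb: int = 1,
    step_gb: int = 1,
    max_gb: int = 32,
    safety_gb: int = 1,
) -> int:
    """Grow the heap until the test run survives, then add a safety margin.

    `survives(heap_gb)` is a hypothetical callback: it must return True when
    a full test run with a heap of `heap_gb` GB completes without an OOME.
    """
    heap_gb = start_gb
    while heap_gb <= max_gb:
        if survives(heap_gb):
            # First heap size with no OOME, plus the extra 1 GB of headroom.
            return heap_gb + safety_gb
        heap_gb += step_gb
    raise RuntimeError(f"no heap size up to {max_gb} GB survived the test run")


# Stand-in for a real load test: pretend the workload needs at least 7 GB.
print(find_heap_size_gb(lambda heap_gb: heap_gb >= 7))  # prints 8
```

The stand-in lambda only demonstrates the control flow. The value of the approach comes entirely from running your real data and queries inside the callback.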

Recently, however, I thought it would be good to get beyond the inherent hand waving above and attempt to come up with a theoretical (with a little bit of empiricism thrown in) model for estimating memory usage and disk space. After a few discussions on IRC with McCandless and others, I put together a DRAFT Excel spreadsheet that allows people to model both memory and disk space (the latter based on the formula in Lucene in Action 2nd ed. – LIA2) after filling in some assumptions about their applications (I put in defaults). First, a few caveats:

  1. This is just an estimate; don’t mistake it for what you are actually seeing in your system.
  2. It is a DRAFT.  It is likely missing a few things, but I am putting it up here and in Subversion as a means to gather feedback.  I reserve the right to have messed up the calculations.
  3. I feel the values might be a little bit low for the memory estimator, especially the Lucene section.
  4. It’s only good for trunk.  I don’t think it will be right for 3.3 or 3.4.
  5. The goal is to try to establish a model for the “high water mark” of memory and disk, not necessarily the typical case.
  6. It inherently assumes you are searching and indexing on the same machine, which is often not the case.
  7. There are still a couple of TODOs in the model.  More to come later.

As for using the memory estimator, the primary things to fill in are the number of documents, the number of unique terms, and information on sorting and indexed fields, but you can also adjust all of the other assumptions. For Solr, there are entries for estimating cache memory usage. Keep in mind that the cache estimates assume the caches are full, which is often not the case and may not even be feasible. For instance, your system may only ever employ 5 or 6 different filters.
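To illustrate why a full cache is a high water mark and not the typical case, here is a rough sketch of filter cache memory. It assumes each cached filter is stored as a bit set with one bit per document in the index. That representation is an assumption made for this example and is not taken from the spreadsheet; small result sets may well be stored more compactly, and per-entry object overhead is ignored. The function name and the numbers (50 million documents, a cache sized for 512 entries) are illustrative only.

```python
def filter_cache_bytes(max_doc: int, entries: int) -> int:
    """Approximate memory for `entries` cached filters over `max_doc` docs.

    Assumption for this sketch: every entry is a bit set with one bit per
    document, so it costs max_doc / 8 bytes, rounded up.
    """
    bytes_per_entry = (max_doc + 7) // 8
    return entries * bytes_per_entry


max_doc = 50_000_000

# Model assumption: the cache is completely full (512 entries).
print(filter_cache_bytes(max_doc, 512))  # prints 3200000000

# A system that only ever uses 6 distinct filters.
print(filter_cache_bytes(max_doc, 6))  # prints 37500000
```

Under these assumptions the full cache costs about 3.2 GB while six filters cost under 40 MB, which is why the estimate should be read as an upper bound.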

The disk space estimator is much more straightforward and based on LIA2’s fairly simple formula of:

disk space used(original) = 1/3 original for each indexed field + 1 * original for stored + 2 * original per field with term vectors


It will be interesting to see how some of the new flexible indexing capabilities in trunk affect the results of this equation. Also note, I’ve seen some applications where the size of the indexed fields is as low as 20% of the original.
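The LIA2 formula above can be sketched in a few lines of Python. This is one reading of the formula, not code from the spreadsheet: "original" is taken to mean the raw size in bytes of the text in each field, and the function and parameter names are invented for illustration. The `indexed_ratio` parameter defaults to the formula's 1/3 and can be lowered to 0.2 to model the 20% case mentioned above.

```python
from typing import Iterable


def estimate_index_bytes(
    indexed_field_bytes: Iterable[float],
    stored_bytes: float,
    term_vector_field_bytes: Iterable[float],
    indexed_ratio: float = 1 / 3,
) -> float:
    """Estimate on-disk index size from original (raw) field sizes.

    indexed_field_bytes: original size of each indexed field
    stored_bytes: original size of all stored content
    term_vector_field_bytes: original size of each field with term vectors
    """
    indexed = indexed_ratio * sum(indexed_field_bytes)  # 1/3 per indexed field
    stored = 1.0 * stored_bytes                         # 1x for stored content
    vectors = 2.0 * sum(term_vector_field_bytes)        # 2x per term-vector field
    return indexed + stored + vectors


# Hypothetical corpus: 30 GB body and 3 GB title, both indexed; the title
# and body are stored; only the title has term vectors.
GB = 1024 ** 3
estimate = estimate_index_bytes(
    indexed_field_bytes=[30 * GB, 3 * GB],
    stored_bytes=33 * GB,
    term_vector_field_bytes=[3 * GB],
)
print(f"{estimate / GB:.1f} GB")
```

Keeping the three terms separate makes it easy to see which one dominates for your data and to swap in a measured ratio once you have indexed a sample.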

Hopefully, people will find this useful as well as enhance it and fix any bugs in it.  In other words, feedback is welcome.  As with any model like this, YMMV!



Published at DZone with permission of Grant Ingersoll. See the original article here.

