Estimating Memory and Storage for Lucene/Solr

By Grant Ingersoll · Sep. 19, 2011

Many times, clients ask us to help them estimate memory usage or disk space usage, or to share benchmarks, as they build out their search system. Doing so is always an interesting process. I've always been wary of claims about benchmarks (for instance, one of the old tricks of performance benchmark hacking is to "cat XXX > /dev/null" to load everything into memory first, which isn't what most people do when running their system) and of estimates, because there are so many variables involved that the results can vary quite significantly depending on marketing goals. Thus, I tend to be pragmatic (which I think the Lucene/Solr community is as well) and focus on what my tests show for my specific data and my specific use cases.

For instance, for testing memory, it's pretty easy to set up a series of tests that start with a small heap size and successively grow it until no Out Of Memory Errors (OOMEs) occur. Then, to be on the safe side, add 1 GB of memory to the heap. This works for the large majority of people. Ironically, for Solr at least, this usually ends up with a heap size somewhere between 6 and 12 GB for a system doing "consumer search" with faceting, etc., and reasonably sized caches on an index in the 10-50 million document range. Sure, there are systems that go beyond this or need significantly less (I just saw one the other day that handles around 200M docs in less than 3 GB of RAM while serving a decent search load), but 6-12 GB seems to be a nice sweet spot for the application and the JVM, especially when it comes to garbage collection, while still giving the operating system enough room to do its job. Too much heap and garbage may pile up, giving you ohmygodstoptheworld full garbage collections at some point in the future. Too little heap and you get the dreaded OOME. Too much heap relative to total RAM and you choke off the OS. Besides, that range also has a nice business/operations side effect, in that 16 GB of RAM has a nice performance/cost benefit for many people.
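If you want to watch how close a given -Xmx setting gets you to the ceiling while a load test runs, something as simple as polling the Runtime API will do. Here is a minimal sketch (the class name and the ten-second interval are arbitrary choices of mine, not anything shipped with Lucene or Solr):

```java
// Minimal heap-headroom logger to run alongside a load test while stepping
// -Xmx up (e.g. -Xmx4g, -Xmx6g, ...). Illustrative only; the class name and
// polling interval are arbitrary, not part of Lucene or Solr.
public class HeapHeadroomLogger {
    public static void main(String[] args) throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        while (true) {
            long max = rt.maxMemory();          // the -Xmx ceiling
            long committed = rt.totalMemory();  // heap the JVM has actually reserved
            long used = committed - rt.freeMemory();
            System.out.printf("used=%dMB committed=%dMB max=%dMB headroom=%dMB%n",
                    used >> 20, committed >> 20, max >> 20, (max - used) >> 20);
            Thread.sleep(10000);
        }
    }
}
```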

Recently, however, I thought it would be good to get beyond the inherent hand-waving above and attempt to come up with a theoretical (with a little bit of empiricism thrown in) model for estimating memory usage and disk space. After a few discussions on IRC with McCandless and others, I put together a DRAFT Excel spreadsheet that allows people to model both memory and disk space (based on the formula in Lucene in Action, 2nd ed. – LIA2) after filling in some assumptions about their applications (I put in defaults). First, a few caveats:

  1. This is just an estimate; don't confuse it with what you are actually seeing in your system.
  2. It is a DRAFT. It is likely missing a few things, but I am putting it up here and in Subversion as a means to gather feedback. I reserve the right to have messed up the calculations.
  3. I feel the values might be a little bit low for the memory estimator, especially in the Lucene section.
  4. It's only good for trunk. I don't think it will be right for 3.3 or 3.4.
  5. The goal is to establish a model for the "high water mark" of memory and disk usage, not necessarily the typical case.
  6. It inherently assumes you are searching and indexing on the same machine, which is often not the case.
  7. There are still a couple of TODOs in the model. More to come later.


As for using the memory estimator, the primary things to fill in are the number of documents, the number of unique terms, and information on sorting and indexed fields, but you can also mess with all of the other assumptions. For Solr, there are entries for estimating cache memory usage. Keep in mind that the assumption is that the caches are full, which often is not the case and may not even be feasible. For instance, your system may only ever employ 5 or 6 different filters.
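To give a feel for the kind of arithmetic the spreadsheet is doing, here is a rough back-of-the-envelope version. Every per-entry size in it is a simplifying assumption (a bitset of roughly maxDoc/8 bytes per filterCache entry, a 4-byte ord per document plus the unique term bytes for each sort field), not an exact account of Lucene/Solr internals:

```java
// Rough back-of-the-envelope memory arithmetic, in the spirit of the
// spreadsheet. All per-entry sizes below are simplifying assumptions,
// not exact Lucene/Solr internals.
public class RoughMemoryEstimate {
    public static void main(String[] args) {
        long maxDoc = 20000000L;                   // documents in the index
        int filterCacheSize = 512;                 // filterCache entries, assumed full
        int sortFields = 2;                        // string fields used for sorting
        long termBytesPerSortField = 200000000L;   // unique term bytes per sort field (assumed)

        // Assume each cached filter is a bitset of about maxDoc/8 bytes.
        long filterCacheBytes = (long) filterCacheSize * (maxDoc / 8);
        // Assume each sort field needs a 4-byte ord per document plus its term bytes.
        long sortFieldBytes = sortFields * (4L * maxDoc + termBytesPerSortField);

        System.out.printf("filterCache ~%d MB, sort fields ~%d MB%n",
                filterCacheBytes >> 20, sortFieldBytes >> 20);
    }
}
```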

The disk space estimator is much more straightforward and is based on LIA2's fairly simple formula:

disk space used ≈ 1/3 * original size for each indexed field + 1 * original size for each stored field + 2 * original size for each field with term vectors

It will be interesting to see how some of the new flexible indexing capabilities in trunk affect the results of this equation. Also note, I've seen some applications where the size of the indexed fields is as low as 20% of the original.
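As a worked example of the rule of thumb above, with made-up numbers (a 10 GB corpus in which every field is indexed and stored, plus term vectors on one field holding about 20% of the text):

```java
// Worked example of the LIA2 rule of thumb with made-up inputs: a 10 GB
// corpus, all fields indexed and stored, term vectors on a field that
// holds roughly 20% of the original text.
public class RoughDiskEstimate {
    public static void main(String[] args) {
        double originalGb = 10.0;
        double indexed = originalGb / 3.0;            // ~1/3 of original for the inverted index
        double stored  = originalGb * 1.0;            // ~1x original for stored fields
        double vectors = 2.0 * (0.20 * originalGb);   // ~2x the text carrying term vectors
        System.out.printf("estimated index size: about %.1f GB%n", indexed + stored + vectors);
    }
}
```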

Hopefully, people will find this useful, and will enhance it and fix any bugs in it. In other words, feedback is welcome. As with any model like this, YMMV!


Published at DZone with permission of Grant Ingersoll. See the original article here.
