What is New in RavenDB 3.0: Indexing Enhancements

By Oren Eini · Sep. 26, 14

We talked previously about the kind of improvements we have in RavenDB 3.0 for the indexing backend. In this post, I want to go over a few features that are much more visible.

Attachment indexing. This is a feature that I am not so hot about, mostly because we want to move all attachment usages to RavenFS. But in the meantime, you can reference the contents of an attachment during indexing. That lets you do things like store large text data in an attachment but still make it available to the indexes. That said, there is no tracking of the attachment, so if it changes, the document that referred to it won't be re-indexed. But for the common case where both the attachments and the documents are always changed together, that can be a pretty nice thing to have.

Optimized new index creation. In RavenDB 2.5, creating a new index would force us to go over all of the documents in the database, not just the documents in the collections that the index covers. In many cases, that surprised users, because they expected there to be some sort of physical separation between the collections. In RavenDB 3.0, we changed things so that creating a new index on a small collection (by default, less than 131,072 items) only touches the documents that belong to the collections covered by that index. This alone represents a pretty significant change in the way we are processing indexes.

In practice, this means that creating a new index on a small collection completes much more rapidly. For example, I reset an index on a production instance; it covers about 7,583 documents out of 19,191. RavenDB was able to index that in just 690 ms, out of about 3 seconds overall that the index reset took.
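
The decision described above can be sketched roughly as follows. This is only an illustration, not RavenDB's code: `NewIndexPlan`, `CanScanCoveredCollectionsOnly`, and the collection names and sizes are made up, and comparing the limit against the combined size of the covered collections is an assumption (the post only says "a small collection").

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up collection sizes for the demo.
var sizes = new Dictionary<string, long> { ["Posts"] = 7_583, ["Logs"] = 2_000_000 };

Console.WriteLine(NewIndexPlan.CanScanCoveredCollectionsOnly(new[] { "Posts" }, sizes));         // True
Console.WriteLine(NewIndexPlan.CanScanCoveredCollectionsOnly(new[] { "Posts", "Logs" }, sizes)); // False

// Hypothetical helper, not a RavenDB type.
static class NewIndexPlan
{
    // Default threshold quoted in the post.
    public const long SmallCollectionLimit = 131_072;

    // Assumption: the limit is checked against the total number of documents
    // in the collections the new index covers.
    public static bool CanScanCoveredCollectionsOnly(
        IEnumerable<string> coveredCollections,
        IReadOnlyDictionary<string, long> collectionSizes) =>
        coveredCollections.Sum(name => collectionSizes[name]) < SmallCollectionLimit;
}
```

When the check fails, the new index falls into the large-collection case, where it is indexed alongside the existing indexes.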

What about the cases where we have new indexes on large collections? In 2.5, we would do round-robin indexing between the new index and the existing ones. The problem was that 2.5 was biased toward the new index. That meant that it was busy indexing the new stuff, while the existing indexes (which you are actually using) took longer to run. Another problem was that in 2.5 creating a new index would effectively poison a lot of performance heuristics. Those were built on the assumption that all indexes run pretty much in tandem. And when one or more of them weren't doing so… well, that caused things to be more expensive.

In 3.0, we have changed how this works. We'll have separate performance optimization pipelines for each group of indexes, based on its rough indexing position. That lets us take advantage of batching many indexes together. We are also not going to try to interleave the indexes (running first the new index and then the existing ones). Instead, we'll be running all of them in parallel, to reduce stalls and to shorten the time it takes for everything to come up to speed.

This uses our scheduling engine to ensure that we aren't actually overloading the machine with computation work (concurrent indexing) or memory (number of items to index at once). I'm very proud of what we have done here, and even though this is actually a backend feature, it is too important to get lost in the minutiae of all the other backend indexing changes we talked about in my previous post.

Explicit Cartesian/fanout indexing. A Cartesian index (we usually call them fanout indexes) is an index that outputs multiple index entries for each document. Here is an example of such an index:

from postComment in docs.PostComments
from comment in postComment.Comments
where comment.IsSpam == false
select new {
    CreatedAt = comment.CreatedAt,
    CommentId = comment.Id,
    PostCommentsId = postComment.__document_id,
    PostId = postComment.Post.Id,
    PostPublishAt = postComment.Post.PublishAt
}

For a large post with a lot of comments, we are going to get an entry per comment. That means that a single document can generate hundreds of index entries. Now, in this case, that is actually what I want, so that is fine.

But there is a problem here. RavenDB has no way of knowing upfront how many index entries a document will generate. That means that it is very hard to reserve the appropriate amount of memory for this, and it is possible to get into situations where we simply run out of memory. In RavenDB 3.0, we have added explicit instructions for this. An index has a budget: by default, each document is allowed to output up to 15 entries. If it tries to output more than 15 entries, indexing of that document is aborted, and it won't be indexed by this index.

You can override this option either globally or on an index-by-index basis, to increase the number of index entries per document that are allowed for an index. (Old indexes will have a limit of 16,384 entries, to avoid breaking existing indexes.)

The reasoning is that there are only two cases. Either you didn't specify a value, in which case the index is limited to the default 15 index entries per document, or you did specify what you believe is the maximum number of index entries output per document, in which case we can take advantage of that when doing capacity planning for memory during indexing.
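
To make the budget rule concrete, here is a minimal sketch of how a per-document limit could be enforced. It is not RavenDB's implementation: `FanoutBudget` and `TryCollect` are hypothetical names, and the demo entries are made up. The only facts taken from the post are the default of 15 and the behavior that a document exceeding the budget is not indexed by that index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up demo data: one document that fans out into 20 index entries.
var entries = Enumerable.Range(1, 20).Select(i => $"entry-{i}").ToList();

// With the default budget of 15, the document is skipped entirely.
Console.WriteLine(FanoutBudget.TryCollect(entries, 15, out var withDefault));   // False
Console.WriteLine(withDefault.Count);                                           // 0

// With an explicitly raised budget, all 20 entries are kept.
Console.WriteLine(FanoutBudget.TryCollect(entries, 100, out var withOverride)); // True
Console.WriteLine(withOverride.Count);                                          // 20

// Hypothetical helper, not a RavenDB type.
static class FanoutBudget
{
    // Collects the entries one document produces for one index. Stops as soon
    // as the budget would be exceeded and reports the document as skipped.
    public static bool TryCollect<T>(
        IEnumerable<T> entries, int maxEntriesPerDocument, out List<T> collected)
    {
        collected = new List<T>();
        foreach (var entry in entries)
        {
            if (collected.Count >= maxEntriesPerDocument)
            {
                collected.Clear(); // a skipped document contributes nothing
                return false;
            }
            collected.Add(entry);
        }
        return true;
    }
}
```

Because the limit is known before indexing starts, the worst-case memory for a batch is bounded by the number of documents times the budget, which is the capacity-planning benefit described above.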

Simpler auto indexes. This feature is closely related to the previous one. Let us say that we want to find all users that have an admin role and an unexpired credit card. We do that using the following query:

var q = from u in session.Query<User>()
        where u.Roles.Any(x => x.Name == "admin") && u.CreditCards.Any(x => x.Expired == false)
        select u;

In RavenDB 2.5, we would generate the following index to answer this query:

from doc in docs.Users
from docCreditCardsItem in ((IEnumerable<dynamic>)doc.CreditCards).DefaultIfEmpty()
from docRolesItem in ((IEnumerable<dynamic>)doc.Roles).DefaultIfEmpty()
select new {
    CreditCards_Expired = docCreditCardsItem.Expired,
    Roles_Name = docRolesItem.Name
}

And in RavenDB 3.0 we generate this:

from doc in docs.Users
select new {
    CreditCards_Expired = (
        from docCreditCardsItem in ((IEnumerable<dynamic>)doc.CreditCards).DefaultIfEmpty()
        select docCreditCardsItem.Expired).ToArray(),
    Roles_Name = (
        from docRolesItem in ((IEnumerable<dynamic>)doc.Roles).DefaultIfEmpty()
        select docRolesItem.Name).ToArray()
}

Note the difference between the two. 2.5 would generate multiple index entries per document, while RavenDB 3.0 generates just one. What is worse is that 2.5 would generate a Cartesian product, so the number of index entries output in 2.5 would be the number of roles a user has times the number of credit cards they have. In RavenDB 3.0, we have just one entry, and the overall cost is much reduced. It was a big change, but I think it was well worth it, considering the alternative.
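
The difference in entry counts is easy to see with plain LINQ over in-memory data. The sketch below is only an illustration of the two shapes, not code that RavenDB runs: `AutoIndexShapes` and the sample user are made up, and the `DefaultIfEmpty()` handling for empty collections is left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up user: 3 roles and 4 credit cards (true means expired).
var roles = new[] { "admin", "editor", "billing" };
var cardsExpired = new[] { false, false, true, false };

var fanout = AutoIndexShapes.Fanout(roles, cardsExpired);
var single = AutoIndexShapes.Single(roles, cardsExpired);

Console.WriteLine(fanout.Count);          // 12 entries: 4 cards times 3 roles
Console.WriteLine(single.Expired.Length); // 4 values inside the one entry
Console.WriteLine(single.Roles.Length);   // 3 values inside the one entry

// Hypothetical helper, not a RavenDB type.
static class AutoIndexShapes
{
    // 2.5 shape: two nested "from" clauses yield one entry per (card, role) pair.
    public static List<(bool Expired, string Role)> Fanout(string[] roles, bool[] cardsExpired) =>
        (from expired in cardsExpired
         from role in roles
         select (expired, role)).ToList();

    // 3.0 shape: a single entry whose fields hold all the values.
    public static (bool[] Expired, string[] Roles) Single(string[] roles, bool[] cardsExpired) =>
        (cardsExpired.ToArray(), roles.ToArray());
}
```

With a third collection in the query, the first shape multiplies again, while the second only gains one more array field.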

In my next post, I'll talk about the other side of indexing: queries. Hang on, we still have a lot to go through.


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.