DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
Securing Your Software Supply Chain with JFrog and Azure
Register Today

Trending

  • SRE vs. DevOps
  • 5 Key Concepts for MQTT Broker in Sparkplug Specification
  • Getting Started With the YugabyteDB Managed REST API
  • 10 Traits That Separate the Best Devs From the Crowd

Trending

  • SRE vs. DevOps
  • 5 Key Concepts for MQTT Broker in Sparkplug Specification
  • Getting Started With the YugabyteDB Managed REST API
  • 10 Traits That Separate the Best Devs From the Crowd

Design Practice: Building a Search Engine Library

Oren Eini user avatar by
Oren Eini
·
Apr. 28, 14 · Interview
Like (0)
Save
Tweet
Share
5.47K Views

Join the DZone community and get the full member experience.

Join For Free

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene. Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

   var index = new Corax.Index(storageOptions, new StandardAnalyzer());
    
   var scope = index.CreateIndexingScope();
    
   long docId = scope.Add(new Document
   {
       {"Name", "Oren Eini", Analysis.Analyzer},
       {"Name", "Ayende Rahien", Analysis.Analyzer},
       {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
   });
    
   docId = scope.Add(new Document
   {
       {"Name", "Arava Eini", Analysis.Analyzer},
       {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
   });
    
   index.Sumbit(index);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single threaded. And you can have multiple indexing scopes running a the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to deal with all of the work associated with indexing the document into a single threaded work, so that make it easier for us. Note that we immediately get the generated document id, but that the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that documents, or the term positions, if required, etc. We’ll also have a a Fields tree, which will have a mapping of all the field names to a integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in that tree being the terms, and the multi values being the document ids that has those threes. In other words, the code above is going to generate the following data in Voron.

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

    var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
    foreach(var result in queryResult.Matches)
    {
        Console.WriteLine("{0}: {1}", result.Id, result["Email"] )
    }

How does querying works? The query above would actually be something like:

    new BooleanQuery(
        Match.Or,
        new TermQuery("Name", "Oren"),
        new TermQuery("Email", "ayende@ayende.com")
        )

The actual implementation of this query would just access the relevant field tree and load the values in the particular key. The result is a set of ids for both parts of the query, and since we are using an OR here, we will just union them and return the list of results back.

Optimizations here can be to run those queries in parallel, or to just cache the results of particular term query, so we can speed things even more.  This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much complete, we only need to tell it what to search with.

Search engine (computing) Design Engine Library

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • SRE vs. DevOps
  • 5 Key Concepts for MQTT Broker in Sparkplug Specification
  • Getting Started With the YugabyteDB Managed REST API
  • 10 Traits That Separate the Best Devs From the Crowd

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: