Distributed Lucene: Full Text Searching in .NET for Scalability

DZone 's Guide to

Distributed Lucene: Full Text Searching in .NET for Scalability

Full-text search with .NET for scalability.

· Web Dev Zone ·
Free Resource


As data has lately been coined as “the new currency,” Apache Lucene has gained traction as a popular full-text search engine, widely used in applications to incorporate flexible text search over huge amounts of textual data. Lucene uses inverted indexing, drastically cutting down the time to find documents related to a particular term.

However, it is a stand-alone solution that does not scale as your data grows — you need to rebuild entire Lucene indexes to search data which is an expensive and slow task, becoming a performance bottleneck. While a few Java and REST-based solutions now exist to cater to scalable full-text search, there is still a lack of a scalable full-text searching solution that may naturally fit in the .NET stack.

Using Distributed Lucene With NCache for .NET

NCache, being a powerful and popular .NET In-Memory Datastore, has implemented the native Lucene.NET API over its distributed architecture. This is the standard Lucene.NET API due to which there is no code change required to your Lucene application in order to use it in a scalable manner with NCache.

NCache also utilizes this Lucene.NET to create indexes in a dynamically scalable environment to allow distributed full-text searches. The results of these searches are then merged before being sent back to your application.

This enhances the stand-alone Lucene into a fast and linearly scalable full-text searching solution.

Figure 1: Distributed Lucene in .NET

Figure 1: Distributed Lucene in .NET

For further details about Distributed Lucene: Distributed Lucene for Enterprise Search.  

Using Lucene in .NET Apps

Let us consider an E-commerce site that holds information for thousands of products, orders and customer details. Hence, indexing all attributes, especially non-textual fields (which are not used while searching), is not a wise approach, as it puts a burden on the cache memory.

For example, our document for a product looks like this:

    “ID”: “ABC34”,
    “Name”: “Tupperware”,
    “Description”: “Microwaveable, dishwasher-friendly, reusable 
                      Tupperware in three sizes”,
    “RetailPrice”: 15.00,
    “Discount”: 3.00

Now, we know that our clients perform full-text searches on the product description (one field of the document). So, what if we index only those fields that can be searched on and have a key that refers to its corresponding document in the persistence store, e.g. a database or file system? This way, once you query for a certain type of product, let’s say “dishwasher-friendly Tupperware,” all products that have a description that matches these terms are returned with their ProductID as a document key; the whole document can then be fetched from the persisted index.

To use distributed Lucene in your existing applications, all you need is to specify NCacheDirectory when opening a directory. This requires the NCache cache name and the index name. The following code snippet opens a directory on a cache LuceneCache in NCache and an index named ProductIndex.

// Specify the cache name and index path for Lucene
string cache = "LuceneCache";
string index = "ProductIndex";

// Create directory and open it on the cache
Directory directory = NCacheDirectory.Open(cache, index);

Lucene ships an extensive query language, which interprets a given string into a Lucene query. This can be done either on a term, multiple terms, wildcards, or even fuzzy words. To learn more about Lucene queries, read Lucene Query Docs.

The following code snippet creates an IndexReader on the directory, which is used by the IndexSearcher. The data is analyzed and tokenized on the basis of StandardAnalyzer. The first 50 hits from the result are returned to the application. Note that the analyzer must be the same as the one used during index creation.

// The 'applyAllDeletes' is true so all enqueued deletes are applied on writer
IndexReader reader = DirectoryReader.Open(indexWriter, true);

// A searcher is opened to perform searching
IndexSearcher indexSearcher = new IndexSearcher(reader);

// Specify the searchTerm, fieldName and analyzer
string searchTerm = "Beverages";
string fieldName = "Category";

// Note that the analyzer should be same as the one used during index creation
Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

// Create a query parser to parse the query
QueryParser parser = new QueryParser(LuceneVersion.LUCENE_48, fieldName, analyzer);
Query query = parser.Parse(searchTerm);

// Returns the top 50 hits from the result set
ScoreDoc[] docsFound = indexSearcher.Search(query, 50).ScoreDocs;

Load Data to Build Distributed Index

With Lucene, you can build indexes and load data into them as needed. Indexes require an analyzer, that analyzes and tokenizes data according to your needs – it could be whitespace, non-letters, punctuation, etc. Once you create a writer for your Lucene index, you can create documents and add fields to it.

This document is then indexed in NCache as a distributed index once you call  Commit(). For more details on Lucene analyzers, have a look at Lucene Analyzer Docs.

// Specify the cache name and index path for Lucene
string cache = "LuceneCache";
string indexPath = "ProductIndex";

// Create directory and open it on the cache
Directory directory = NCacheDirectory.Open(cache, indexPath);

// The same analyzer is used as for the reader
Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
IndexWriterConfig config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

// Create indexWriter on NCache directory
IndexWriter indexWriter = new IndexWriter(directory, config);

Product[] products = FetchProductsToIndex();
foreach (var product in products)
    Document doc = new Document
        new StoredField("id", product.ID),
        new TextField("name", product.Name, Field.Store.YES),
        new TextField("description", product.Description, Field.Store.YES),
        new StringField("category", product.Category, Field.Store.No),
        new StoredField("retail_price", product.RetailPrice),

Why NCache for Distributed Lucene?

Using NCache for distributed Lucene provides you with the following benefits:

  • Extremely Fast and Linearly Scalable: NCache is an in-memory distributed data store, so building distributed Lucene on top of it provides the same optimum performance for your full-text searches. Moreover, because of NCache’s distributed architecture, the Lucene index is partitioned across all the servers of the cluster.

    This makes it scalable as you can add more servers on the go as your data load increases, and Lucene indexes are automatically redistributed without any client intervention.
  • Data Replication for Reliability and High Availability: With NCache’s Partition-of-Replica topology, the Lucene index is not only partitioned across all servers, but each partition is also replicated to another server of the cluster. Hence, if any server goes down, the replica of the partition serves all the queries for that index, ensuring reliability.

    Similarly, if a server node goes down, NCache dynamically self-heals by readjusting the data within the remaining nodes, without any downtime or impact to your Lucene index, ensuring high availability.


To sum it up, full-text searching has now become fundamental in almost every business, owing to the powerful search engine Lucene. But as data grows, rebuilding indexes can cause more damage than gain and this where an in-memory, distributed .NET solution such as NCache steps in. All it requires is a one-line code change in your existing Lucene application, and voila, you have the best of both worlds as an in-memory, distributed full-text search mechanism.

Additional resources on Lucene: 

Further Reading

asp.net ,caching ,databaese ,full-text search ,ncache ,tutorial ,web dev

Published at DZone with permission of Iqbal Khan . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}