How Does Elasticsearch Provide Real-Time Search?

Real-time search is undoubtedly one of the most important features of Elasticsearch. Today we'll look closely at how Elasticsearch provides it.

Real time

First, what do we mean by real time? In general, a real-time system is one in which the delay between data coming in and results going out is small. In other words, data is processed as it arrives rather than being accumulated first.

Elasticsearch is among the best-known solutions for real-time search today: by default, a record added to it becomes searchable within about one second.

How?

As is well known, disks can become an I/O bottleneck at the data persistence step. In addition, some of the mechanisms used to prevent data loss add to the time cost.

To avoid this bottleneck and make a new document searchable in real time, Elasticsearch relies on the file-system cache that sits between itself and the disk.

A new segment is written to the file-system cache first and only later flushed to disk. This lightweight process of writing and opening a new segment is called a refresh in Elasticsearch. By default, every shard is refreshed automatically once per second. This is how Elasticsearch supports real-time search.

Test time

The shard refresh interval described above may bring the following questions to mind:

  1. What happens when a new document is searched for less than one second after it is added?
  2. Can a document be made searchable without waiting for the refresh cycle that Elasticsearch manages for the shards?

Short answers:

  1. A search does not return the document until the next refresh.
  2. Yes.

Now let’s clarify this with a simple example.

hakdogan$ curl -XPUT localhost:9200/kodcucom/document/1 -d'{
> "title": "Document A"
> }'

We sent a document to Elasticsearch. The index name is kodcucom, the type is document, and the id is 1. The document's only field is title, with the value "Document A". Now let’s retrieve this document from Elasticsearch.

hakdogan$ curl -XGET localhost:9200/kodcucom/document/1?pretty
{
  "_index" : "kodcucom",
  "_type" : "document",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":{
"title": "Document A"
}
}

As expected, the document was returned. But what happens if the time between indexing a document and searching for it is shorter than the default shard refresh interval?

Let’s see.

hakdogan$ curl -XPUT localhost:9200/kodcucom/document/2 -d'{"title": "Document B"}'; curl -XGET localhost:9200/kodcucom/_search?pretty
{"_index":"kodcucom","_type":"document","_id":"2","_version":1,"created":true}{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "kodcucom",
      "_type" : "document",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{
"title": "Document A"
}
    } ]
  }
}

As can be seen, when the create and search requests are issued back to back, Elasticsearch returns only the previous document. So how can we get the new document right away?

Let’s see.

hakdogan$ curl -XPUT localhost:9200/kodcucom/document/3 -d'{"title": "Document C"}'; curl -XGET localhost:9200/kodcucom/_refresh; curl -XGET localhost:9200/kodcucom/_search?pretty
{"_index":"kodcucom","_type":"document","_id":"3","_version":1,"created":true}{"_shards":{"total":10,"successful":5,"failed":0}}{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "kodcucom",
      "_type" : "document",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{
"title": "Document A"
}
    }, {
      "_index" : "kodcucom",
      "_type" : "document",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"title": "Document B"}
    }, {
      "_index" : "kodcucom",
      "_type" : "document",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{"title": "Document C"}
    } ]
  }
}

In this command, we run a refresh on the kodcucom index before the search request, so the new document is returned.
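
As an alternative to refreshing the whole index, the index API also accepts a refresh parameter on the write request itself, which refreshes the affected shard once the write completes. The sketch below assumes Elasticsearch on localhost:9200 and the kodcucom index and document type used in this article. The doc_url and index_and_refresh helper names are hypothetical, and the Content-Type header is included because newer Elasticsearch releases require it.

```shell
# Hypothetical helper: build the document URL used throughout this article.
doc_url() {
  printf 'localhost:9200/kodcucom/document/%s' "$1"
}

# Index a document and refresh the affected shard as part of the same request.
# $1 = document id, $2 = JSON body
index_and_refresh() {
  curl -XPUT "$(doc_url "$1")?refresh=true" \
    -H 'Content-Type: application/json' \
    -d "$2"
}

# Usage (needs a running node):
#   index_and_refresh 4 '{"title": "Document D"}'
#   curl -XGET 'localhost:9200/kodcucom/_search?pretty'
```

Refreshing on every write carries the same resource cost as any other refresh, so this is likely better suited to tests and low-volume writes than to heavy indexing.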

The automatic refresh interval can be changed in two ways:

  1. By setting the index.refresh_interval parameter in the configuration file, which applies to all indices in the cluster.
  2. On a per-index basis, by updating the index settings.
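
The per-index option can be sketched with the update index settings API. This assumes Elasticsearch on localhost:9200; the helper names refresh_settings_body and set_refresh_interval are hypothetical, 30s is an arbitrary example value, and the Content-Type header is included because newer Elasticsearch releases require it.

```shell
# Hypothetical helper: print the JSON body for a refresh_interval update.
# $1 = interval, for example 30s
refresh_settings_body() {
  printf '{"index": {"refresh_interval": "%s"}}' "$1"
}

# Apply the interval to a single index.
# $1 = index name, $2 = interval
set_refresh_interval() {
  curl -XPUT "localhost:9200/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(refresh_settings_body "$2")"
}

# Usage (needs a running node):
#   set_refresh_interval kodcucom 30s
```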

In addition, you can turn off automatic refresh entirely. One important point to keep in mind: the refresh operation is costly in terms of system resources, so take this into account if you change the automatic refresh interval.

Lengthening the automatic refresh interval enables faster indexing, but new documents and changes to existing documents will not appear in searches until the next refresh.
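
One common way to apply this trade-off is to turn automatic refresh off during a large import and restore it afterwards. The following is a minimal sketch, assuming Elasticsearch on localhost:9200 and the kodcucom index from the examples. The put_settings and with_refresh_disabled helpers are hypothetical names; -1 is the value that disables automatic refresh, and 1s restores the default mentioned earlier.

```shell
# Assumptions: Elasticsearch on localhost:9200 and the kodcucom index from the examples.
INDEX_URL='localhost:9200/kodcucom'
DISABLE_REFRESH='{"index": {"refresh_interval": "-1"}}'   # -1 turns automatic refresh off
RESTORE_REFRESH='{"index": {"refresh_interval": "1s"}}'   # back to the one-second default

# Hypothetical helper: send a settings body to the index.
put_settings() {
  curl -XPUT "$INDEX_URL/_settings" -H 'Content-Type: application/json' -d "$1"
}

# Run the given indexing command with automatic refresh disabled,
# then restore the interval and refresh once so the new documents become searchable.
with_refresh_disabled() {
  put_settings "$DISABLE_REFRESH"
  "$@"
  put_settings "$RESTORE_REFRESH"
  curl -XPOST "$INDEX_URL/_refresh"
}

# Usage (needs a running node; load_documents is a placeholder for your own import):
#   with_refresh_disabled load_documents
```

Until the interval is restored, searches will not see any of the newly indexed documents, so this only makes sense when that delay is acceptable.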

Published at DZone with permission of Hüseyin Akdoğan. See the original article here.

Opinions expressed by DZone contributors are their own.
