Elasticsearch has taken the world by storm. It is a distributed, multitenant-capable full-text search engine. It stores data as JSON documents, exposes native Java APIs and REST APIs for integration, and uses an inverted index structure under the hood. Along with search, it also provides good analytics capabilities. Together with Logstash, its open source product for server-side log tailing, and Kibana, its popular open source visualization tool, Elastic’s ELK analytics stack is gaining momentum in the market.
Elasticsearch is a great tool for document indexing and powerful full-text search. However, it has a few limitations that leave you no option other than reindexing. A major one is that once an index has been created and data has been indexed, changing the mapping/schema can have an adverse impact. For example, if you switch a field type from string to date, all of the data that you have already indexed for that field becomes useless.
The biggest problem that clients face with Elasticsearch is that they have to reindex whenever the schema changes. Reindexing is an expensive operation and, depending on the volume of data, it may take a few seconds, a few hours or even days. If you don’t care about old data then there is no issue, but most applications do. Applications should therefore employ mechanisms to overcome this limitation. Besides schema changes, other scenarios also require reindexing, for example a corrupt index.
In this two-article series, I am going to discuss a solution that overcomes this limitation of Elasticsearch. The solution belongs to the fault tolerance category: a change in schema or a corrupt index is identified as a fault, and reindexing is triggered as the corresponding action. It is completely automated and involves the following two parts.
- Identify scenarios in which it's essential to do reindexing
- Zero downtime reindexing solution
In this article, we are going to see the scenarios in which reindexing is a must. In many applications, Elasticsearch is the secondary storage rather than the primary storage. The primary storage can be a relational database or a NoSQL database, so reindexing may mean either copying from the existing index to a new index or indexing again from the primary data storage. In some scenarios, reindexing can be done from snapshots. The right choice depends on the scenario.
The following scenarios do not require reindexing:
- Removal of a field from the existing schema – There is no need to reindex.
- Addition of a new field – An Elasticsearch segment only contains indices for fields that actually exist in the documents of that segment. The put mapping API takes care of the new field.
- A new index is created after a fixed time period – In some application domains, such as a “Log Analyzer”, an index is created every day or after some other fixed time interval. In such cases, reindexing is not required.
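For the time-based case, here is a minimal sketch of how a per-day index name could be derived, so that each day's documents land in a fresh index. The class name, the "logs" prefix and the "yyyy.MM.dd" suffix are assumptions made for illustration; they are not taken from any particular application.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class DailyIndexName {
    // Assumed naming scheme: "<prefix>-yyyy.MM.dd", one index per calendar day.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    /** Returns the name of the index that should receive documents for the given day. */
    public static String indexFor(String prefix, LocalDate day) {
        return prefix + "-" + day.format(DAY);
    }

    public static void main(String[] args) {
        // prints logs-2015.06.01
        System.out.println(indexFor("logs", LocalDate.of(2015, 6, 1)));
    }
}
```

Because every day gets its own index, a mapping change only has to apply to the indices created from then on, which is why this pattern avoids reindexing.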
The following scenarios require reindexing:
- Schema change from a more restrictive to a less restrictive data type – For example, the data type is changed from Integer to String or from Integer to Double. In this case, re-indexing can be done from the old index to the new index. Before re-indexing starts, the new index has to be created with the new schema/mapping.
- Schema change from a less restrictive to a more restrictive data type – For example, the data type is changed from String to Double. The schema check can be performed as follows:
    // Fetch the mappings currently stored in Elasticsearch for the schema check
    ElasticSearchPlatform elasticSearchPlatform =
            (ElasticSearchPlatform) context.getBean("searchPlatform");
    GetMappingsResponse mappingResponse = elasticSearchPlatform.admin()
            .indices()
            .prepareGetMappings(alias)
            .setTypes(index)
            .execute().actionGet();
- Corrupt index – This is a very odd and rare scenario that requires reindexing from the primary storage. It is very annoying because it can happen for various reasons that one cannot anticipate in advance. In our case, index corruption happened after an Elasticsearch node crash or, rarely, after a system restart. This happens because not all primary shards may recover fully and a few shards remain unassigned.
Elasticsearch throws the following exception in case of a corrupted index:
    org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [test_index] failed to fetch index version after copying it over
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:158)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
    Caused by: org.apache.lucene.index.CorruptIndexException: [test_index] Preexisting corrupted index [corrupted_HXhLgNexRJevJjm02WLiDw] caused by: CorruptIndexException[codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: ..
    org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=0 vs expected footer=...
In this scenario, the old index has to be deleted, and indexing can then be done from snapshots or from the primary storage. Reindexing from a snapshot is the preferred way as it is very quick.
- Change in number of primary shards – There is no way other than reindexing to change the number of primary shards of an index after it has been created. By default, Elasticsearch allocates 5 shards to an index (this was the default in versions before 7.0). You can change this with the "number_of_shards" setting when creating the index. You may wonder why I have listed this scenario: why would anyone need to change the number of primary shards, and is it a common case once an application is already in production? It is indeed a rare scenario, but not unthinkable!
Consider a SaaS multitenant product with different categories of tenants; let’s say Trial, Premium and Platinum. For trial users, the product offers a few capabilities with limited resources, and one of those resources is Elasticsearch. You may have a shared index for all tenants in the Trial category, or a single-shard index per tenant with a Trial license. You may also prefer to keep such indexes on less powerful hardware. Now imagine that one of the Trial customers upgrades to another category. In that case, you will have to make changes to the index, such as changing the number of primary shards or moving the index to more powerful hardware by copying the data. And what is the solution for this? Well, you know the answer: it’s reindexing!
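The schema and shard scenarios above can be sketched as a small check that compares the current field types with the desired ones. This is only an illustration under stated assumptions: MappingDiff, its methods and the plain field-to-type maps are hypothetical stand-ins, not the Elasticsearch API. In a real application the current types would come from a mapping lookup.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class MappingDiff {
    enum Change { ADDED, REMOVED, TYPE_CHANGED }

    /** Compares two field-name -> type-name maps and reports what changed per field. */
    public static Map<String, Change> diff(Map<String, String> current, Map<String, String> desired) {
        Map<String, Change> changes = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : current.entrySet()) {
            String newType = desired.get(field.getKey());
            if (newType == null) {
                changes.put(field.getKey(), Change.REMOVED);
            } else if (!newType.equals(field.getValue())) {
                changes.put(field.getKey(), Change.TYPE_CHANGED);
            }
        }
        for (String name : desired.keySet()) {
            if (!current.containsKey(name)) {
                changes.put(name, Change.ADDED);
            }
        }
        return changes;
    }

    /**
     * Added or removed fields do not force a reindex; a changed field type
     * or a different primary shard count does.
     */
    public static boolean requiresReindex(Map<String, String> current, Map<String, String> desired,
                                          int currentShards, int desiredShards) {
        return currentShards != desiredShards
                || diff(current, desired).containsValue(Change.TYPE_CHANGED);
    }

    public static void main(String[] args) {
        Map<String, String> current = Collections.singletonMap("price", "integer");
        Map<String, String> desired = Collections.singletonMap("price", "double");
        System.out.println(requiresReindex(current, desired, 5, 5)); // prints true
    }
}
```

Keeping the diff separate from the decision makes it easy to log exactly which fields triggered the reindex.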
If you are using Elasticsearch in your application, you need to think about possible faults and the corresponding fault tolerance strategies. In this article, we discussed the scenarios in which reindexing is absolutely essential. Depending on the volume of data, reindexing might take a lot of time, so the application should be designed in advance to cover all such scenarios and provide zero downtime.
In Part II, I explain different reindexing strategies along with their design and implementation.
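As a rough sketch of pairing a fault with a strategy, the routine below recognises a corrupt index from an exception's cause chain and picks where to reindex from. ReindexPlanner, its enums and the nested CorruptIndexException are hypothetical; the nested class only stands in for Lucene's exception so that the example compiles on its own. Sending schema and shard-count changes to the existing index is a simplifying assumption, since some type changes may need the primary storage instead.

```java
import java.io.IOException;

final class ReindexPlanner {
    enum Fault { SCHEMA_CHANGE, CORRUPT_INDEX, SHARD_COUNT_CHANGE }
    enum Source { EXISTING_INDEX, SNAPSHOT, PRIMARY_STORAGE }

    /** Stand-in for Lucene's exception so the sketch needs no Lucene dependency. */
    static class CorruptIndexException extends IOException {
        CorruptIndexException(String message) { super(message); }
    }

    /** Walks the cause chain and matches on the simple class name. */
    public static boolean isCorruptIndex(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if ("CorruptIndexException".equals(t.getClass().getSimpleName())) {
                return true;
            }
        }
        return false;
    }

    /**
     * A corrupt index cannot be read back, so prefer a snapshot and fall back
     * to the primary storage. Other faults copy from the existing index
     * (a simplifying assumption of this sketch).
     */
    public static Source chooseSource(Fault fault, boolean snapshotAvailable) {
        if (fault == Fault.CORRUPT_INDEX) {
            return snapshotAvailable ? Source.SNAPSHOT : Source.PRIMARY_STORAGE;
        }
        return Source.EXISTING_INDEX;
    }

    public static void main(String[] args) {
        Throwable failure = new RuntimeException("shard recovery failed",
                new CorruptIndexException("codec footer mismatch"));
        if (isCorruptIndex(failure)) {
            System.out.println(chooseSource(Fault.CORRUPT_INDEX, true)); // prints SNAPSHOT
        }
    }
}
```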