Inside the Elastic Shard
Elasticsearch clusters have been around for a while, scaling to petabytes of data and billions of requests. Most industrial use cases leverage Elasticsearch for observability of high-volume user traffic or server-side interactions.
Paired with the full package of Beats to collate data and Kibana dashboards to visualize it, it's quick to integrate Elasticsearch and start collecting metrics, events, or logs into this powerhouse. However, once the scale increases and retention windows grow longer, things go topsy-turvy. That's when you need to look deeper to understand how the system works and what it takes to scale the cluster.
Let's take a closer look.
Like other database architectures, Elasticsearch at the unit level depends on the shard and on how the cluster is configured for performance. Shards are partitions of data and are either primary or replica; they are used to handle load, recovery, and availability of the data in the cluster.
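To make the primary/replica split concrete, here is a minimal sketch of creating an index with an explicit shard layout through the REST API. It assumes a local cluster at `localhost:9200` with security disabled; the index name `logs-demo` is purely illustrative.

```python
import requests

BASE = "http://localhost:9200"  # assumed local dev cluster, security disabled

# Partition the data across 3 primary shards, each with 1 replica copy.
# Primaries are fixed at index creation; the replica count is dynamic
# and can be raised later to absorb more search load.
resp = requests.put(
    f"{BASE}/logs-demo",
    json={
        "settings": {
            "index": {
                "number_of_shards": 3,
                "number_of_replicas": 1,
            }
        }
    },
)
print(resp.json())  # {'acknowledged': True, ...} on success
```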
The figure above illustrates what happens inside the shard and the corresponding resource interactions on the node during indexing, search, and restart operations. It is color-coded to distinguish the different components and the calls to replicas wherever required. The sections below go over the individual operations in more depth.
During an indexing operation, where you are ingesting a high volume of data into the cluster, or into shard(s) at the unit level, each document is dual-written to the transaction log and to the indexing buffer, which is part of the JVM heap. The writes to the transaction log keep data in sync between replicas, enable recovery during failures, and avoid data loss through frequent checkpoints. The indexing buffer size is configurable and can be adjusted based on indexing needs.
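The buffer is controlled by the node-level setting `indices.memory.index_buffer_size` (10% of heap by default), which is static and therefore set in `elasticsearch.yml` rather than through an API. A small sketch, under the same local-cluster assumption, to read back what each node is actually running with:

```python
import requests

BASE = "http://localhost:9200"  # assumed local cluster

# The buffer is sized per node in elasticsearch.yml, e.g.:
#   indices.memory.index_buffer_size: 20%
# Here we only inspect the effective node settings; if the key is
# absent, the node is on the 10% default.
resp = requests.get(f"{BASE}/_nodes/settings", params={"flat_settings": "true"})
for node in resp.json()["nodes"].values():
    size = node["settings"].get("indices.memory.index_buffer_size", "10% (default)")
    print(node["name"], "->", size)
```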
For reliability and recovery scenarios, the transaction log persists the operation details; the data documents/segments are synced every 5 seconds and recovered from replica shards if the primary shard's node fails. Based on the refresh interval and size-limit configuration, data is periodically written from the indexing buffer into Lucene segments; Lucene's reopen operation makes the new segments visible to search without a full commit.
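Both knobs are dynamic index settings: `index.refresh_interval` controls how often the buffer is turned into searchable segments, and `index.translog.sync_interval` controls the 5-second fsync cadence when durability is set to `async`. A sketch of relaxing them for a bulk-ingest window; the values are illustrative, not recommendations:

```python
import requests

BASE = "http://localhost:9200"

# Trade some freshness and durability for indexing throughput
# during a bulk load; revert afterwards.
resp = requests.put(
    f"{BASE}/logs-demo/_settings",
    json={
        "index": {
            "refresh_interval": "30s",   # default 1s: fewer, larger segments
            "translog": {
                "durability": "async",   # fsync on an interval, not per request
                "sync_interval": "5s",   # the 5-second cadence described above
            },
        }
    },
)
print(resp.json())
```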
After the refresh interval elapses, the buffer is cleared and the data is retained in the OS file-system cache, where it persists as segments (Lucene indices). These segments are only committed to disk by an associated fsync operation, which is triggered as part of a flush.
Whenever a flush operation is triggered, the transaction log is deleted and the file-system buffers are cleared, as the fsync operation ensures the segment data is durable on disk. Clearing the transaction log also ensures that only pending operations are retained, which helps with quick restarts and replication in case of failures.
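A flush can also be invoked explicitly, though in practice Elasticsearch triggers it automatically once the translog passes `index.translog.flush_threshold_size` (512 MB by default). A minimal sketch, again assuming the illustrative `logs-demo` index:

```python
import requests

BASE = "http://localhost:9200"

# Force an fsync of segments to disk and truncate the translog.
resp = requests.post(f"{BASE}/logs-demo/_flush")
print(resp.json())  # shard-level success/failure counts

# Automatic flushes are governed by the translog size threshold (dynamic).
requests.put(
    f"{BASE}/logs-demo/_settings",
    json={"index": {"translog": {"flush_threshold_size": "512mb"}}},
)
```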
Also, whenever there is no indexing for 5 minutes, the shard is marked inactive, a commit ID is assigned, and the shard is flushed to disk. This avoids revisiting the transaction log even though there has been no update, reducing segment copies for quicker boot-up. Based on indexing, delete, and update operations, segments are merged into larger segments and flushed to disk. In these scenarios, the merge produces a new segment on disk whose references are maintained in the OS file-system cache during search.
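Merging normally happens in the background, but for indices that will receive no further writes (say, yesterday's log index), it can be driven explicitly with the force-merge API. A hedged sketch:

```python
import requests

BASE = "http://localhost:9200"

# Collapse the segments of a read-only index into one large segment,
# cutting per-segment overhead (file handles, heap) at search time.
# Only do this on indices that receive no further writes.
resp = requests.post(
    f"{BASE}/logs-demo/_forcemerge",
    params={"max_num_segments": 1},
)
print(resp.json())
```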
The typical golden rule for scale is to keep the shard size to around 20-30 GB, with the hard constraint of no more than roughly 2.1 billion documents per shard (a Lucene limit); exceeding either limit will cause cluster performance issues on the hardware configurations currently on the market. Since the indexing buffer is configured as a percentage of the heap size, pay special attention to how much heap remains available for the search data structures that enable quicker retrieval. On the resource side, open file handles, sockets, and memory utilization, along with disk I/O, are vital indicators of indexing performance apart from CPU.
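The `_cat/shards` API makes it easy to audit a cluster against these two limits. A small sketch that flags offending shards, with the 30 GB and 2.1 billion thresholds hard-coded from the rule of thumb above:

```python
import requests

BASE = "http://localhost:9200"
MAX_SIZE_GB = 30           # rule-of-thumb shard size ceiling
MAX_DOCS = 2_100_000_000   # Lucene's per-shard document limit (approx.)

resp = requests.get(
    f"{BASE}/_cat/shards",
    params={"h": "index,shard,prirep,docs,store", "bytes": "gb", "format": "json"},
)
for shard in resp.json():
    size_gb = float(shard["store"] or 0)   # unassigned shards report null
    docs = int(shard["docs"] or 0)
    if size_gb > MAX_SIZE_GB or docs > MAX_DOCS * 0.8:  # warn at 80% of doc limit
        print(f"{shard['index']} shard {shard['shard']} ({shard['prirep']}): "
              f"{size_gb} GB, {docs} docs")
```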
When documents are indexed, they are stored as Lucene segments, each consisting of multiple documents with an associated docId, and each docId referencing the actual document along with its fields and associated term vector information. Fundamentally, it's Lucene at its core.
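You can peek at this structure directly: the term-vectors API returns the terms Lucene holds for a document's fields (computed on the fly if vectors aren't stored in the mapping). A sketch against a hypothetical document; the field name and document ID are illustrative:

```python
import requests

BASE = "http://localhost:9200"

# Inspect the terms Lucene extracted for the "message" field of document 1.
resp = requests.get(
    f"{BASE}/logs-demo/_termvectors/1",
    params={"fields": "message", "term_statistics": "true"},
)
for term, stats in resp.json()["term_vectors"]["message"]["terms"].items():
    print(term, stats["term_freq"])
```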
Whenever a search operation is triggered, it runs across the segments in RAM plus any delta that persists in the transaction log for operations not yet flushed; this ensures no data-loss scenario during the search. The search load is also spread across both primaries and replicas.
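At the API level all of this is transparent: a query fans out across the shards, balanced over primary and replica copies, and the per-shard results are merged before the response. A minimal sketch with an illustrative match query:

```python
import requests

BASE = "http://localhost:9200"

# A simple match query; Elasticsearch fans it out across the shard copies
# and merges the per-shard results before responding.
resp = requests.post(
    f"{BASE}/logs-demo/_search",
    json={"query": {"match": {"message": "timeout"}}, "size": 5},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```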
Depending on the search operation, the field data and associated field cache per segment are loaded into the heap for quicker retrieval. But there are scenarios where a query builds search data structures beyond a memory limit, which could result in an out-of-memory error. To avoid this, the circuit breaker in Elasticsearch ensures that field-data-driven queries are bounded and managed against available memory.
Whenever there is not enough memory to keep the field data resident in memory, Elasticsearch will constantly reload data from disk and evict other data to free memory. These evictions cause heavy disk I/O and generate a large amount of garbage in the heap. The circuit breaker estimates the memory requirements of a query by introspecting the fields involved (type, cardinality, size). It then checks whether loading the required field data would push the total field data size over the configured percentage of the heap, and it breaks the circuit where required to avoid cluster havoc due to inefficient search queries or global searches on larger data sets.
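The breaker thresholds are dynamic cluster settings. A hedged sketch of tightening the field-data breaker below its default of 40% of heap; the 30% figure is illustrative, not a recommendation:

```python
import requests

BASE = "http://localhost:9200"

# Trip the field-data breaker earlier so a single expensive aggregation
# is rejected with an exception instead of destabilizing the node.
resp = requests.put(
    f"{BASE}/_cluster/settings",
    json={
        "persistent": {
            "indices.breaker.fielddata.limit": "30%",  # default 40% of heap
        }
    },
)
print(resp.json())
```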
All of the search data structures, plus the field and filter caches per segment, live in the heap. Segments, file buffers, and file handles occupy the rest of the RAM. The tokenizers, analyzers, and term vectors inside the Lucene segments are vital for search relevance, but for improving search latency the major factors are how the data is routed to the different shards, the number of shards, how the load is balanced across replicas, the memory configuration, the underlying disk performance, and how the data is stored.
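Routing is the one factor in that list you control per request: by supplying a routing key, related documents land on (and are searched from) a single shard instead of fanning out to all of them. A sketch with a hypothetical per-user key:

```python
import requests

BASE = "http://localhost:9200"

# Index a document with a custom routing key so all of this user's
# documents hash to the same shard.
requests.put(
    f"{BASE}/logs-demo/_doc/1",
    params={"routing": "user-42"},
    json={"user": "user-42", "message": "login ok"},
)

# A search carrying the same key touches only that one shard.
resp = requests.post(
    f"{BASE}/logs-demo/_search",
    params={"routing": "user-42"},
    json={"query": {"match": {"user": "user-42"}}},
)
print(resp.json()["hits"]["total"])
```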