Tips on ElasticSearch Configuration for High Performance: Part II
Tips on ElasticSearch Configuration for High Performance: Part II
This is Part II of this short series designed to help you optimize your performance with ElasticSearch.
Join the DZone community and get the full member experience.Join For Free
Sensu is an open source monitoring event pipeline. Try it today.
Here are the last four tips from Manoj Chaudhardy about configuring ElasticSearch to yield high performance. Did you miss Part I? Check it out here.
Tip #6: Optimizing Index Requests
At Loggly, we built our own index management system since the nature of log management means that we have frequent updates and mapping changes. This index manager’s responsibility is to manage indices on our ES cluster. It detects when the index needs to be created or closed based on the configured policies. There are many policies in the index manager. For example, if the index grows beyond a certain size or lives for more than a certain time, the index manager will close the index and create a new one.
When the index manager sends a node an index request to process, the node updates its own mapping and then sends that mapping to the master. While the master processes it, that node receives a state that includes an older version of the mapping. If there’s a conflict, it’s not bad (i.e. the cluster state will eventually have the correct mapping), but we send a refresh just in case from that node to the master. In order to make the index request more efficient, we have set this property on our data nodes.
Sending refresh mapping is more important when the reverse happens, and for some reason, the mapping in the master is ahead (or in conflict) with the actual parsing of it in the actual node on which the index exists. In this case, the refresh mapping will result in a warning being logged on the master node.
Tip #7: Navigating ElasticSearch’s Allocation-related Properties
Shard allocation is the process of allocating shards to nodes. This can happen during initial recovery, replica allocation, or rebalancing. Or it can happen when handling nodes that are being added or removed.
The cluster.routing.allocation.cluster_concurrent_rebalance property determines the number of shards allowed for concurrent rebalance. This property needs to be set appropriately depending on the hardware being used, for example, the number of CPUs, IO capacity, etc. If this property is not set appropriately, it can impact the ElasticSearch performance with indexing.
By default the value is set at 2, meaning that at any point in time only 2 shards are allowed to be moving. It is good to set this property low so that the rebalance of shards is throttled and doesn’t affect indexing.
The other shard allocation property is cluster.routing.allocation.disk.threshold_enabled. If this property is set to true, the shard allocation will take free disk space into account while allocating shards to a node.
When enabled (i.e when it is set to true), the shard allocation takes two watermark properties into account: low and high.
- The low watermark dictates the disk usage point that which ES won’t allocate new shards. In the example below, ES stops allocating shards for a node once disk usage reaches 97%.
- The high watermark dedicates the disk usage value at which the shards will start moving out of the node (99% in the example below).
cluster.routing.allocation.disk.threshold_enabled:true cluster.routing.allocation.disk.watermark.low:.97 cluster.routing.allocation.disk.watermark.high:.99
Tip #8: Recovery Properties Allow for Faster Restart Times
ES includes several recovery properties which improve both ElasticSearch cluster recovery and restart times. We have shown some sample values below. The value that will work best for you depends on the hardware you have in use, and the best advice we can give is to test, test, and test again.
This property is how many shards per node are allowed for recovery at any moment in time. Recovering shards is a very IO-intensive operation, so you should set this value with real caution.
This controls the number of primary shards initialized concurrently on a single node. The number of parallel stream of data transfer from node to recover shard from peer node is controlled by indices.recovery.concurrent_streams. The value below is setup for the Amazon instance, but if you have your own hardware you might be able to set this value much higher. The property max_bytes_per_sec (as its name suggests) determines how many bytes to transfer per second. This value again needs to be configured according to your hardware.
indices.recovery.concurrent_streams: 4 indices.recovery.max_bytes_per_sec: 40mb
All of the properties described above get used only when the cluster is restarted.
Tip #9: Threadpool Properties Prevent Data Loss
ElasticSearch node has several thread pools in order to improve how threads are managed within a node. At Loggly, we use bulk request extensively, and we have found that setting the right value for bulk thread pool using threadpool.bulk.queue_size property is crucial in order to avoid data loss or _bulk retries:
This property value is for the bulk request. This tells ES the number of requests that can be queued for execution in the node when there is no thread available to execute a bulk request. This value should be set according to your bulk request load. If your bulk request number goes higher than queue size, you will get a RemoteTransportException as shown below.
Note that in ES the bulk requests queue contains one item per shard, so this number needs to be higher than the number of concurrent bulk requests you want to send if those requests contain data for many shards. For example, a single bulk request may contain data for 10 shards, so even if you only send one bulk request, you must have a queue size of at least 10. Setting this value “too high” will chew up heap in your JVM, but does let you hand off queuing to ES, which simplifies your clients.
You either need to keep the property value higher than your accepted load or gracefully handle RemoteTransportException in your client code. If you don’t handle the exception, you will end up losing data. We simulated the exception shown below by sending more than 10 bulk requests with a queue size of 10.
RemoteTransportException[[<Bantam>][inet[/192.168.76.1:9300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 10) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13fe9be];
In Summary: ElasticSearch Configuration Properties are Key to Its Elasticity
The depth of configuration properties available in ElasticSearch as been a huge benefit to Loggly since our use cases take ElasticSearch to the edge of its design parameters (and sometimes beyond, as we’ll be sharing in several upcoming posts). If the ES default configurations are working perfectly adequately for you in the current state of your application’s evolution, rest assured that you’ll have plenty of levers available to you as your application grows.
Published at DZone with permission of Manoj Chaudhardy , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.