Elasticsearch Client: Handling Scale and Backpressure

When ingesting a high volume of records, an Elasticsearch client can run into backpressure and data loss. Learn about methods for dealing with these and related issues.

By Shruthi MS · Sep. 25, 2018 · Tutorial

Elasticsearch is a full-text search engine in which JSON documents are stored, indexed, and made searchable.

Our Elasticsearch cluster consists of 14 nodes. We currently operate at around 54,000 documents per second, which works out to 4.6 billion records, or roughly 5 TB of data, per day.

Components

  1. Data source
  2. Application: data processing and enrichment
  3. Data store: Elasticsearch

Problem Statement

a. If you are trying to scale to thousands of records per second, you might experience data loss. The application might not be able to keep up with thousands of records per second and can end up dropping them, or crashing after running out of heap memory.

b. If messages are sent much faster than the cluster's indexing rate, Elasticsearch rejects them and responds with a TOO_MANY_REQUESTS (429) response code.

Solution

The solution needs to:

  1. Handle thousands of records per second of input data in real time
  2. Avoid any data loss
  3. Scale as the input increases
  4. Require zero application downtime
  5. Handle backpressure

Use a Distributed Queue

Introducing a distributed queue helps deal with the huge number of incoming records. The application dequeues only as many records as it can process at a time, so it is never handed an overwhelming amount of data, as the consumer sketch below illustrates.
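Below is a minimal sketch of such a consumer, assuming Apache Kafka as the distributed queue (the setup adopted in the next section); the broker addresses, topic name, group ID, and property values are illustrative. The max.poll.records setting caps how many records the application pulls per poll, so everything it cannot yet handle stays buffered in Kafka.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BoundedConsumer implements Runnable {

    @Override
    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");  // illustrative brokers
        props.put("group.id", "es-indexer");                        // shared by every app instance
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "500");                       // cap on records pulled per poll

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events"));  // illustrative topic
            while (true) {
                // Pull at most max.poll.records; the rest stays safely buffered in Kafka.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());  // enrich, then hand off for bulk indexing
                }
            }
        }
    }

    private void process(String message) {
        // Placeholder for the enrichment and indexing hand-off.
    }

    public static void main(String[] args) {
        new BoundedConsumer().run();
    }
}
```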

Use Multiple Instances of an Application

Using a distributed queue like Apache Kafka, we can scale up by spawning additional instances of the application to handle an increase in data load. As long as they share the same consumer group ID, every newly spawned instance starts dequeuing messages from Apache Kafka.

If the application is down, messages can be queued up until the system is ready to process them and push them into Elasticsearch for storing and indexing.
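As a sketch of this scaling model, the snippet below (building on the hypothetical BoundedConsumer above) runs several consumers inside one JVM; in production each instance would more likely be a separate process or container, but the consumer-group mechanics are the same. Keep in mind that Kafka assigns each partition to at most one consumer in a group, so useful parallelism is capped by the topic's partition count.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConsumerScaler {
    public static void main(String[] args) {
        int instances = 4;  // illustrative; capped by the topic's partition count
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        for (int i = 0; i < instances; i++) {
            // Each task creates its own KafkaConsumer (the client is not thread-safe).
            // Because they all share the same group.id, Kafka splits the topic's
            // partitions among them, and adding tasks spreads the load further.
            pool.submit(new BoundedConsumer());
        }
    }
}
```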

Use Bulk Requests

While dealing with huge volumes of data, bulk requests are preferred. Start by pushing batches of roughly 5 MB and keep increasing the size until there is no further performance gain. If Elasticsearch sends a TOO_MANY_REQUESTS (429) response code, surfaced as an EsRejectedExecutionException in the Java client, that is an indication that the indexing rate is higher than what the cluster can handle. When this happens, pause indexing for a bit and retry with exponential backoff.
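One way to put this into practice with the Java high-level REST client (the 6.4+ APIs) is a BulkProcessor, which batches documents by count and size, keeps a configurable number of bulk requests in flight, and retries bulks rejected with EsRejectedExecutionException using exponential backoff. The sketch below is illustrative rather than the exact code used here; the host, index name, and thresholds are assumptions.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkIndexer {
    public static void main(String[] args) {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-node1", 9200, "http")));  // illustrative host

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    // Per-document failures (including 429 rejections) surface here.
                    System.err.println(response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.err.println("Bulk request failed entirely: " + failure);
            }
        };

        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
                .setBulkActions(5_000)                               // flush every 5,000 documents
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))  // or every ~5 MB
                .setConcurrentRequests(20)                           // bulk requests in flight
                .setBackoffPolicy(BackoffPolicy.exponentialBackoff(  // retry rejected bulks
                        TimeValue.timeValueMillis(100), 3))
                .build();

        // Documents handed to the processor are batched and flushed automatically.
        bulkProcessor.add(new IndexRequest("events", "_doc")
                .source("{\"message\":\"hello\"}", XContentType.JSON));

        // On shutdown, drain outstanding bulks with bulkProcessor.awaitClose(...) and close the client.
    }
}
```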

Use Multiple Workers/Threads to Send Data to Elasticsearch

In each application instance, use multiple threads to push bulk requests concurrently.

For example, you might push 20 parallel bulk requests of 5,000 messages each, for 20 * 5,000 = 100,000 messages in flight at a time. The worker threads automatically pick up messages from their respective queues and push them to Elasticsearch.

Use blocking queues. If a queue is empty, its consumer threads wait until data is pushed into it; if a queue is full, the producer threads wait until there is space to push new messages.
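The sketch below shows one way to wire this up, using a single shared, bounded ArrayBlockingQueue feeding a pool of bulk-writer threads (per-worker queues, as described above, would behave the same way); the queue capacity, worker count, and batch size are illustrative. put() blocks the enqueuing side when the queue is full and take() blocks the workers when it is empty, which is exactly the backpressure behavior described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IndexingPipeline {
    private static final int WORKERS = 20;        // parallel bulk writers (illustrative)
    private static final int BATCH_SIZE = 5_000;  // messages per bulk request (illustrative)

    // Bounded queue: producers block when it is full, workers block when it is empty.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(WORKERS);

    /** Called by the Kafka-consuming side; blocks whenever the workers fall behind. */
    public void enqueue(String message) throws InterruptedException {
        queue.put(message);
    }

    public void start() {
        for (int i = 0; i < WORKERS; i++) {
            workers.submit(() -> {
                List<String> batch = new ArrayList<>(BATCH_SIZE);
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        // Block until at least one message is available...
                        batch.add(queue.take());
                        // ...then top the batch up without blocking and flush it.
                        queue.drainTo(batch, BATCH_SIZE - batch.size());
                        sendBulk(batch);
                        batch.clear();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    private void sendBulk(List<String> batch) {
        // Placeholder: hand the batch to the bulk client (for example, the BulkProcessor above).
    }
}
```

With bounded queues at both ends, Kafka in front of the application and blocking queues in front of the bulk writers, backpressure propagates upstream instead of turning into dropped records or an out-of-memory crash.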
