When Search Started Breaking at Scale: How We Chose the Right Search Engine

Learn how we scaled our search system, evaluated Solr and Elasticsearch, and redesigned the architecture for better performance and reliability.

Sunil Paidi

May. 12, 26 · Analysis

When we first built our search system, everything worked fine.

The data size was manageable, search responses were fast, and updates arrived as expected. Like many teams, we assumed that once a search engine was set up, it would keep working as the system grew.

But that didn’t last long.

As traffic increased and data volume grew, we started seeing issues that were hard to ignore. Search became slower, updates were delayed, and maintaining the system required more effort.

At that point, choosing a search engine was no longer just a technical choice. It had become a decision that directly affected user experience and system reliability.

In this article, I want to share how we approached this problem and how we evaluated search engines once our system started breaking at scale.

The Problem We Faced

As the system grew, four problems stood out:

Search responses became slower during peak traffic
Newly updated data did not show up in search results immediately
Indexing pipelines started lagging behind
Scaling the search cluster required manual effort and tuning

These were not small issues. Users expect fast and accurate results, and delays started affecting their experience.

This forced us to step back and ask: should we keep scaling our current setup, or is it time to move to a different search engine?

Working through that question, we found that the issue wasn't just the search engine. Our architecture had not been designed for scale.

The Decision Moment

Eventually, keeping the existing setup running took more effort than we had planned for.

We had to decide whether to keep optimizing the current system or invest time in redesigning the search architecture.

This was not just a technical decision — it was about choosing the right long-term direction.

What We Needed to Solve

We were not just looking for a better tool.

We needed a system that could:

Handle increasing data without performance drops
Support near real-time indexing
Deliver low-latency search responses
Reduce operational overhead
Be ready for future improvements like AI-based search

This changed how we evaluated different options.

Our Existing Search Architecture

To understand where the problems were coming from, here’s a simplified view of how our system was structured:

                ┌───────────────────────┐
                │   Source Systems      │
                │  DB / APIs / Events   │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │ Event / Update Layer  │
                │  CDC / Queue / Stream │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │   Indexing Service    │
                │ Transform + Enrich    │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │     Solr Cluster      │
                │   (Primary Search)    │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │   Search API Layer    │
                │ Query + Ranking       │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │   Users / Frontend    │
                └───────────────────────┘
  

This setup worked well initially, but as scale increased, indexing delays and query latency became bottlenecks.

How We Evaluated Our Options

Instead of just comparing features, we focused on real production needs.

1. Data Scale

At smaller scale, most systems work well. But at larger scale, architecture matters.

2. Real-Time Indexing

Delays in indexing meant users were seeing outdated data.
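To make "outdated data" measurable, one option is to track how long each update takes to become searchable and what share of updates miss a freshness budget. The sketch below is a minimal illustration, not our production code: the function names, the 30-second budget, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def indexing_lag(updated_at: datetime, searchable_at: datetime) -> timedelta:
    """Time between a source update and the moment it became searchable."""
    # Clamp at zero so clock skew between systems never yields a negative lag.
    return max(searchable_at - updated_at, timedelta(0))


def stale_fraction(lags: list[timedelta], budget: timedelta) -> float:
    """Share of updates whose lag exceeded the freshness budget."""
    if not lags:
        return 0.0
    return sum(lag > budget for lag in lags) / len(lags)


# Hypothetical sample: (time of source update, time it became searchable).
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
samples = [
    (t0, t0 + timedelta(seconds=2)),
    (t0, t0 + timedelta(seconds=45)),
    (t0, t0 + timedelta(minutes=5)),
]
lags = [indexing_lag(updated, searchable) for updated, searchable in samples]
print(stale_fraction(lags, budget=timedelta(seconds=30)))
```

Tracking the fraction over a budget, rather than only the average lag, keeps a handful of very late updates from being hidden by many fast ones.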

3. Query Performance

Users expect results instantly, especially under heavy traffic.
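Query performance under load is usually better described by tail percentiles than by an average, because a few slow requests can hide behind a healthy-looking mean. The following is a small sketch of a nearest-rank percentile; the latency values are made up for illustration and do not come from our system.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical query latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 14, 13, 400, 16, 11, 18, 17, 19]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

In this sample the mean is pulled far above the median by a single slow request, which is the kind of gap that shows up during peak traffic.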

4. Operational Complexity

Managing clusters and tuning performance required significant effort.

5. Cost Beyond Infrastructure

We considered engineering effort and maintenance, not just infrastructure cost.

6. Future Readiness

We evaluated support for AI search, vector search, and ML integration.

Comparing the Options

Feature            Solr         Elasticsearch  OpenSearch  Cloud Search
Setup              Complex      Easier         Easier      Very easy
Scaling            Good         Very good      Very good   Managed
Real-time updates  Good         Very good      Very good   Excellent
Maintenance        High         Medium         Medium      Low
Cost               Lower infra  Medium         Medium      Higher
AI support         Limited      Good           Good        Strong

At scale, the real difference between these systems is not features — it’s how much operational effort they require and how consistently they perform under load.
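One way to turn a qualitative comparison like the table above into a decision is a weighted scoring matrix, where each criterion gets a weight that reflects how much it matters to your team. The sketch below shows the mechanics only. The option names, scores, and weights are hypothetical placeholders, not the numbers we used.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (higher is better)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical weights: operational effort matters most in this example.
weights = {"maintenance": 3, "realtime": 2, "cost": 1}

# Hypothetical scores on a 1 (worst) to 3 (best) scale.
options = {
    "self_managed": {"maintenance": 1, "realtime": 2, "cost": 3},
    "managed_service": {"maintenance": 3, "realtime": 3, "cost": 1},
}

ranked = sorted(
    ((name, weighted_score(scores, weights)) for name, scores in options.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Changing the weights changes the winner, which is the point: the ranking encodes your priorities and is not an objective verdict on the tools.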

Understanding the Trade-Offs

Each option comes with trade-offs.

More flexible systems provide control but require more tuning. Managed solutions reduce operational effort but may increase cost.

There is no perfect choice — only the right choice for your system.

What Our Future Architecture Looked Like

We realized that fixing individual components wouldn’t solve the problem — we needed to rethink the architecture itself.

After evaluating different options, we moved toward a more scalable and flexible architecture.

                ┌───────────────────────┐
                │   Source Systems      │
                │  DB / APIs / Events   │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │ Event Streaming Layer │
                │ Kafka / Queue / CDC   │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │   Indexing Service    │
                │ Async + Scalable      │
                └──────────┬────────────┘
                           │
                           ▼
                ┌────────────────────────────┐
                │ Distributed / Managed      │
                │ Search Engine              │
                │ (Elastic / Cloud Search)   │
                └──────────┬─────────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │   Search API Layer    │
                │ Caching + Ranking     │
                └──────────┬────────────┘
                           │
                           ▼
                ┌───────────────────────┐
                │   Users / Frontend    │
                └───────────────────────┘
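The "Caching" step in the Search API layer can be as simple as a short-lived cache keyed by the query, so that repeated queries during a traffic spike do not all reach the search cluster. This is a minimal in-memory sketch under stated assumptions: the `TTLCache` class and `fake_search` function are hypothetical, the cache is unbounded, and a real deployment would more likely use a shared cache with size limits.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-to-live cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve from cache
        value = compute()  # missing or expired: ask the backend
        self._entries[key] = (now + self._ttl, value)
        return value


backend_calls = 0


def fake_search(query: str) -> list[str]:
    """Stand-in for a call to the search cluster (hypothetical)."""
    global backend_calls
    backend_calls += 1
    return [f"result for {query}"]


cache = TTLCache(ttl_seconds=5)
for _ in range(3):
    cache.get_or_compute("laptop", lambda: fake_search("laptop"))
print("backend calls:", backend_calls)
```

The TTL is a trade-off against freshness: a cached result can be stale for up to `ttl_seconds`, so it should stay well below the indexing delay users are willing to tolerate.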
  

What Changed in the New Architecture

The key improvements were:

Moving to an event-driven indexing pipeline
Introducing asynchronous processing
Using a more scalable distributed search system
Reducing manual operational effort
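The first two changes can be illustrated with a small asynchronous worker that drains update events from a queue and writes them to the search engine in batches, flushing when a batch is full or when a short wait expires. This is a sketch only: it uses an in-process `asyncio.Queue` as a stand-in for Kafka or another stream, and `bulk_index` is a hypothetical callback rather than a real client API.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Batch = List[dict]


async def batch_worker(
    queue: "asyncio.Queue[Optional[dict]]",
    bulk_index: Callable[[Batch], Awaitable[None]],
    max_batch: int = 100,
    max_wait: float = 0.5,
) -> None:
    """Group queued events into batches; a None item tells the worker to stop."""
    loop = asyncio.get_running_loop()
    running = True
    while running:
        first = await queue.get()  # block until there is something to index
        if first is None:
            break
        batch: Batch = [first]
        deadline = loop.time() + max_wait
        while len(batch) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break  # waited long enough: flush a partial batch
            try:
                item = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            if item is None:
                running = False  # flush what we have, then stop
                break
            batch.append(item)
        await bulk_index(batch)


async def demo() -> List[int]:
    queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
    batch_sizes: List[int] = []

    async def fake_bulk_index(batch: Batch) -> None:
        batch_sizes.append(len(batch))  # a real version would call the engine

    worker = asyncio.create_task(
        batch_worker(queue, fake_bulk_index, max_batch=3, max_wait=0.5)
    )
    for doc_id in range(7):
        await queue.put({"id": doc_id})
    await queue.put(None)  # stop signal
    await worker
    return batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Batching reduces the number of write requests the cluster has to handle, while `max_wait` caps how long a single update can sit in the buffer, which keeps indexing close to real time.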

Impact After Changes

After moving to this approach, we started seeing noticeable improvements:

Faster indexing updates
More consistent query response times
Better handling of peak traffic
Reduced operational overhead

In our case, indexing delays were reduced significantly, and query performance became more stable as the system scaled.

A Common Mistake We Made

One mistake we made early on was focusing too much on initial setup and not enough on long-term scalability.

We also considered continuing with the existing setup and optimizing it further. However, we realized that incremental fixes would not solve the underlying scaling challenges.

What We Learned

The best search engine is not the one that works today — it’s the one that continues to work as your system grows.

How I Would Approach This Today

If I had to make this decision again today, I would start by evaluating:

Expected data size in the next 1–2 years
Real-time vs batch indexing requirements
Operational ownership (team size and expertise)
Need for AI or semantic search

This would help avoid rework later and make the system easier to scale from the beginning.
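For the first item, even a rough compound-growth estimate helps. The sketch below projects index size forward and estimates how many months remain before a capacity limit is reached. The growth rate, sizes, and limit are hypothetical inputs you would replace with your own measurements.

```python
def projected_size(current: float, monthly_growth_rate: float, months: int) -> float:
    """Size after compounding a fixed monthly growth rate (0.05 means 5%)."""
    return current * (1 + monthly_growth_rate) ** months


def months_until(current: float, limit: float, monthly_growth_rate: float) -> int:
    """Number of whole months until the size reaches or exceeds the limit."""
    if current <= 0 or monthly_growth_rate <= 0:
        raise ValueError("current size and growth rate must be positive")
    months = 0
    size = current
    while size < limit:
        size *= 1 + monthly_growth_rate
        months += 1
    return months


# Hypothetical inputs: 200 GB index today, 5% growth per month.
current_gb = 200.0
growth = 0.05

print(f"size in 24 months: {projected_size(current_gb, growth, 24):.0f} GB")
print("months until 1000 GB:", months_until(current_gb, 1000.0, growth))
```

A constant growth rate is a simplification, so the result is best read as an order-of-magnitude check on whether the current setup can last the next one to two years.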

When to Choose What

Solr → good for controlled enterprise environments
Elasticsearch/OpenSearch → flexible and scalable
Cloud search → low operational overhead and AI-ready

Final Thoughts

Choosing a search engine is not just about features — it’s about making a decision that will hold up as your system grows.

In my experience, it’s better to think about scale and future requirements early, rather than trying to fix limitations later.

The right decision early can save a lot of time and effort down the road.

Opinions expressed by DZone contributors are their own.
