When Search Started Breaking at Scale: How We Chose the Right Search Engine
Learn how we scaled our search system, evaluated Solr and Elasticsearch, and redesigned the architecture for better performance and reliability.
When we first built our search system, everything worked fine.
The data size was manageable, search responses were fast, and updates landed as expected. Like many teams, we assumed that once a search engine was set up, it would keep working as the system grew.
But that didn’t last long.
As traffic increased and data volume grew, we started seeing issues that were hard to ignore. Search became slower, updates were delayed, and maintaining the system required more effort.
At that point, picking a search engine was no longer just a technical choice. It had become a decision that directly affected user experience and system reliability.
In this article, I want to share how we approached this problem and how we evaluated the right search engine when our system started breaking at scale.
The Problem We Faced
Our search system ran without major issues at first, but as it grew we started noticing several problems:
- Search responses became slower during peak traffic
- Newly updated data did not show up in search results immediately
- Indexing pipelines started lagging behind
- Scaling the search cluster required manual effort and tuning
These were not small issues. Users expect fast and accurate results, and delays started affecting their experience.
This forced us to step back and ask: should we keep scaling our current setup, or was it time to move to a different search engine? We came to see that the issue wasn't just the search engine; our architecture wasn't designed for scale.
The Decision Moment
There came a point where keeping the existing setup running required more effort than we had expected.
We had to decide whether to keep optimizing the current system or invest time in redesigning the search architecture.
This was not just a technical decision — it was about choosing the right long-term direction.
What We Needed to Solve
We were not just looking for a better tool.
We needed a system that could:
- Handle increasing data without performance drops
- Support near real-time indexing
- Deliver low-latency search responses
- Reduce operational overhead
- Be ready for future improvements like AI-based search
This changed how we evaluated different options.
Our Existing Search Architecture
To understand where the problems were coming from, here’s a simplified view of how our system was structured:
┌───────────────────────┐
│    Source Systems     │
│  DB / APIs / Events   │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│ Event / Update Layer  │
│ CDC / Queue / Stream  │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│   Indexing Service    │
│  Transform + Enrich   │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│     Solr Cluster      │
│   (Primary Search)    │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│   Search API Layer    │
│    Query + Ranking    │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│   Users / Frontend    │
└───────────────────────┘
This setup worked well initially, but as scale increased, indexing delays and query latency became bottlenecks.
How We Evaluated Our Options
Instead of just comparing features, we focused on real production needs.
1. Data Scale
At smaller scale, most systems work well. But at larger scale, architecture matters.
2. Real-Time Indexing
Delays in indexing meant users were seeing outdated data.
3. Query Performance
Users expect results instantly, especially under heavy traffic.
4. Operational Complexity
Managing clusters and tuning performance required significant effort.
5. Cost Beyond Infrastructure
We considered engineering effort and ongoing maintenance, not just infrastructure cost.
6. Future Readiness
We evaluated support for AI search, vector search, and ML integration.
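Criteria 2 and 3 are easier to compare across engines once they are turned into numbers. The following is a minimal sketch, assuming you can record two timestamps per update (when the source changed and when the change became searchable) and a list of query latencies. The `IndexEvent` type, its field names, and the helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class IndexEvent:
    """One document update, with hypothetical timestamps in seconds."""
    doc_id: str
    updated_at: float     # when the source system changed the record
    searchable_at: float  # when the change became visible in search


def indexing_lags(events: list[IndexEvent]) -> list[float]:
    """Indexing lag per event: time from source update to searchable."""
    return [e.searchable_at - e.updated_at for e in events]


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1 to 99) of the samples."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # quantiles() sorts the data and returns the 99 cut points between
    # percentiles when n=100; index pct - 1 is the pct-th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


def summarize(samples: list[float]) -> dict[str, float]:
    """Median and tail values; tails matter most under peak traffic."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }
```

Running `summarize` on both the indexing lags and the query latencies, once during normal traffic and once during peak traffic, gives a like-for-like baseline before any engine is swapped.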
Comparing the Options
| Feature | Solr | Elasticsearch | OpenSearch | Cloud Search |
|---|---|---|---|---|
| Setup | Complex | Easier | Easier | Very easy |
| Scaling | Good | Very good | Very good | Managed |
| Real-time updates | Good | Very good | Very good | Excellent |
| Maintenance | High | Medium | Medium | Low |
| Cost | Lower infra | Medium | Medium | Higher |
| AI support | Limited | Good | Good | Strong |
At scale, the real difference between these systems is not features — it’s how much operational effort they require and how consistently they perform under load.
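A table like this can be turned into a rough weighted score so that the team's priorities are explicit. This is only a sketch: the numeric ratings are an assumed translation of the labels in the table (4 = best, 1 = worst), and the weights are hypothetical and should be replaced with your own.

```python
# Assumed numeric translation of the comparison table (4 = best, 1 = worst).
# These are illustrative judgments, not measurements.
RATINGS = {
    "setup":       {"Solr": 1, "Elasticsearch": 3, "OpenSearch": 3, "Cloud Search": 4},
    "scaling":     {"Solr": 2, "Elasticsearch": 3, "OpenSearch": 3, "Cloud Search": 4},
    "real_time":   {"Solr": 2, "Elasticsearch": 3, "OpenSearch": 3, "Cloud Search": 4},
    "maintenance": {"Solr": 1, "Elasticsearch": 2, "OpenSearch": 2, "Cloud Search": 4},
    "cost":        {"Solr": 3, "Elasticsearch": 2, "OpenSearch": 2, "Cloud Search": 1},
    "ai_support":  {"Solr": 1, "Elasticsearch": 3, "OpenSearch": 3, "Cloud Search": 4},
}

# Hypothetical weights reflecting how much each criterion matters (sum to 1.0).
WEIGHTS = {
    "setup": 0.10,
    "scaling": 0.20,
    "real_time": 0.20,
    "maintenance": 0.25,
    "cost": 0.15,
    "ai_support": 0.10,
}


def weighted_scores(ratings: dict, weights: dict) -> dict[str, float]:
    """Combine per-criterion ratings into one score per engine."""
    engines = next(iter(ratings.values())).keys()
    return {
        engine: sum(weights[c] * ratings[c][engine] for c in ratings)
        for engine in engines
    }


def rank(scores: dict[str, float]) -> list[str]:
    """Engines ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    scores = weighted_scores(RATINGS, WEIGHTS)
    for engine in rank(scores):
        print(f"{engine}: {scores[engine]:.2f}")
```

With these assumed numbers the managed option ranks first, but the result is only as good as the weights. Shifting most of the weight to cost, for example, would favor Solr instead.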
Understanding the Trade-Offs
Each option comes with trade-offs.
More flexible systems provide control but require more tuning. Managed solutions reduce operational effort but may increase cost.
There is no perfect choice — only the right choice for your system.
What Our Future Architecture Looked Like
We realized that fixing individual components wouldn’t solve the problem — we needed to rethink the architecture itself.
After evaluating different options, we moved toward a more scalable and flexible architecture.
┌───────────────────────┐
│    Source Systems     │
│  DB / APIs / Events   │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│ Event Streaming Layer │
│  Kafka / Queue / CDC  │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│   Indexing Service    │
│   Async + Scalable    │
└──────────┬────────────┘
           │
           ▼
┌────────────────────────────┐
│   Distributed / Managed    │
│       Search Engine        │
│  (Elastic / Cloud Search)  │
└──────────┬─────────────────┘
           │
           ▼
┌───────────────────────┐
│   Search API Layer    │
│   Caching + Ranking   │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│   Users / Frontend    │
└───────────────────────┘
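The "Caching" part of the Search API layer can be as simple as a short-lived result cache in front of the engine. Here is a minimal sketch, assuming an in-process cache with a fixed time-to-live. `TTLCache`, `cached_search`, and the `backend` callable are hypothetical names and do not correspond to any specific library.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get(self, key: str) -> Optional[list[str]]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: list[str]) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def cached_search(query: str, cache: TTLCache,
                  backend: Callable[[str], list[str]]) -> list[str]:
    """Serve repeated queries from the cache; otherwise ask the engine."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    hit = cache.get(key)
    if hit is not None:
        return hit
    results = backend(query)
    cache.put(key, results)
    return results
```

The TTL is the trade-off to watch: a longer TTL absorbs more peak traffic, but cached results can be stale for up to that long, which works against near real-time updates. This sketch also has no size limit, so a production cache would need eviction.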
What Changed in the New Architecture
The key improvements were:
- Moving to an event-driven indexing pipeline
- Introducing asynchronous processing
- Using a more scalable distributed search system
- Reducing manual operational effort
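To make the first two points concrete, here is a minimal sketch of an asynchronous, event-driven indexer. It assumes update events arrive on a queue and are written to the search engine in batches. `InMemoryIndex` is a stand-in for a real search client, and `bulk_upsert`, `indexing_worker`, and the event shape are hypothetical names used only for illustration.

```python
import asyncio


class InMemoryIndex:
    """Stand-in for a real search client; stores documents in a dict."""

    def __init__(self) -> None:
        self.docs: dict[str, dict] = {}
        self.bulk_calls = 0

    async def bulk_upsert(self, batch: list[dict]) -> None:
        await asyncio.sleep(0)  # yield control, as a network call would
        self.bulk_calls += 1
        for doc in batch:
            self.docs[doc["id"]] = doc


async def indexing_worker(queue: asyncio.Queue, index: InMemoryIndex,
                          batch_size: int = 100) -> None:
    """Drain update events from the queue and write them in batches."""
    while True:
        batch = [await queue.get()]  # wait for at least one event
        while len(batch) < batch_size:
            try:
                batch.append(queue.get_nowait())  # take whatever is pending
            except asyncio.QueueEmpty:
                break
        await index.bulk_upsert(batch)
        for _ in batch:
            queue.task_done()  # mark each event as fully processed


async def main() -> InMemoryIndex:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)  # bounded for backpressure
    index = InMemoryIndex()
    worker = asyncio.create_task(indexing_worker(queue, index, batch_size=50))

    # Producer side: in a real system these events would come from CDC or a stream.
    for i in range(120):
        await queue.put({"id": f"doc-{i}", "title": f"Item {i}"})

    await queue.join()  # wait until every event has been indexed
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return index


if __name__ == "__main__":
    result = asyncio.run(main())
    print(len(result.docs), "documents indexed in", result.bulk_calls, "bulk calls")
```

The design choice here is that producers never wait on the search engine directly. They only wait when the bounded queue is full, which turns an overloaded indexer into visible backpressure instead of silent lag. A real pipeline would also need retries and handling for events that repeatedly fail.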
Impact After Changes
After moving to this approach, we started seeing noticeable improvements:
- Faster indexing updates
- More consistent query response times
- Better handling of peak traffic
- Reduced operational overhead
In our case, indexing delays were reduced significantly, and query performance became more stable as the system scaled.
A Common Mistake We Made
One mistake we made early on was focusing too much on initial setup and not enough on long-term scalability.
We also considered continuing with the existing setup and optimizing it further. However, we realized that incremental fixes would not solve the underlying scaling challenges.
What We Learned
The best search engine is not the one that works today — it’s the one that continues to work as your system grows.
How I Would Approach This Today
If I had to make this decision again today, I would start by evaluating:
- Expected data size in the next 1–2 years
- Real-time vs batch indexing requirements
- Operational ownership (team size and expertise)
- Need for AI or semantic search
This would help avoid rework later and make the system easier to scale from the beginning.
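The first item on that list is mostly arithmetic. The following is a rough sketch, assuming steady compound monthly growth and a target size per shard. Both function names are hypothetical, and the growth rate and shard size are inputs you would take from your own metrics and your engine's sizing guidance.

```python
import math


def project_size(current_gb: float, monthly_growth_rate: float,
                 months: int) -> float:
    """Projected index size assuming steady compound monthly growth."""
    return current_gb * (1 + monthly_growth_rate) ** months


def shards_needed(index_gb: float, target_shard_gb: float) -> int:
    """Primary shard count needed to keep each shard near the target size."""
    return max(1, math.ceil(index_gb / target_shard_gb))


if __name__ == "__main__":
    # Hypothetical inputs: 200 GB today, growing 5% per month, 40 GB per shard.
    for months in (12, 24):
        size = project_size(200, 0.05, months)
        print(f"{months} months: {size:.0f} GB, "
              f"{shards_needed(size, 40)} shards")
```

Even a rough projection like this shows early whether the current cluster layout will still fit in two years or whether resharding or a different architecture needs to be planned now.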
When to Choose What
- Solr → good for controlled enterprise environments
- Elasticsearch/OpenSearch → flexible and scalable
- Cloud search → low operational overhead and AI-ready
Final Thoughts
Choosing a search engine is not just about features — it’s about making a decision that will hold up as your system grows.
In my experience, it’s better to think about scale and future requirements early, rather than trying to fix limitations later.
The right decision early can save a lot of time and effort down the road.