Building Gateway Analytics: My Journey to Making API Traffic Data Useful
Learn how to use OpenSearch to turn raw API gateway logs into insights, monitor performance, detect failures, and optimize distributed systems.
APIs are everywhere today. Whether it's buying something online, logging into a mobile app, or streaming a movie, an API is always working behind the scenes. Over the last decade, APIs have become the backbone of modern software systems. As an application scales, the volume of API calls increases rapidly, and managing them becomes more complex. This is where API gateways come into action.
An API gateway acts as an entry point for all internal or external API traffic. It sits in front of the backend services and handles responsibilities such as authentication, routing, rate limiting, logging, performance monitoring, and more.
But simply deploying gateways is not enough. When managing large, distributed systems, multiple gateways may be deployed across regions or clusters. Each of them processes millions of API calls, generates logs, and produces metrics. Over time, these logs pile up into large amounts of raw data, and if this data is left unused, it becomes nothing more than a storage cost. However, if this data is transformed into insights, it becomes extremely powerful.
I wanted to utilize these raw logs to answer questions like:
- Which gateway handles the most traffic?
- Are errors or failures increasing over time?
- Which regions or services are experiencing high latency?
- Are certain APIs causing bottlenecks?
To do this, I needed a system that could process large volumes of log data and quickly run aggregations. That is where OpenSearch came into the picture.
Using OpenSearch to Power the Analytics
OpenSearch, the open-source search and analytics engine, was ideal for this. It supports full-text search, real-time indexing, and, most importantly, it can perform quick aggregations. Since gateways generate a large volume of structured and semi-structured JSON logs, OpenSearch could store them efficiently and help query them at scale.
An example of the API event data with gateway details:
{
  "gateway_service": "service-east-1",
  "catalog": "production",
  "status_code": 200,
  "latency_ms": 123,
  "bytes_sent": 512,
  "bytes_received": 2048,
  "timestamp": "2025-10-20T12:34:56Z"
}
A single payload like this means little on its own, but when millions of such events are combined with aggregation queries, they help answer real operational questions.
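Before events like this can be aggregated, they have to be indexed. A minimal sketch of preparing a batch for the OpenSearch `_bulk` endpoint, which takes newline-delimited JSON with an action line before each document, might look like this. The index name `gateway-events` and the helper `to_bulk_body` are assumptions for illustration; the article does not name either.

```python
import json

# Hypothetical index name; the article does not specify one.
INDEX_NAME = "gateway-events"


def to_bulk_body(events, index_name=INDEX_NAME):
    """Serialize events into the newline-delimited JSON used by the _bulk API."""
    lines = []
    for event in events:
        # Action line: tells OpenSearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the event itself.
        lines.append(json.dumps(event))
    # A bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Illustrative events shaped like the example payload above.
events = [
    {"gateway_service": "service-east-1", "status_code": 200, "latency_ms": 123,
     "timestamp": "2025-10-20T12:34:56Z"},
    {"gateway_service": "service-west-2", "status_code": 502, "latency_ms": 950,
     "timestamp": "2025-10-20T12:35:01Z"},
]
bulk_body = to_bulk_body(events)
```

The resulting string can be sent as the body of a `POST /_bulk` request with whichever HTTP or OpenSearch client is already in use.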
Turning Raw Data Into Insights Using Aggregations
OpenSearch aggregations allow grouping, counting, summing, and computing statistical values. For example, identifying which gateway handles the most traffic is as simple as grouping the logs by gateway name and counting the requests:
{
  "aggs": {
    "gateway_services": {
      "terms": {
        "field": "gateway_service.keyword"
      },
      "aggs": {
        "total_calls": {
          "value_count": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
This aggregation returns a ranked list of gateways sorted by the number of calls. In a distributed system, this is crucial as it helps highlight hotspots. A gateway receiving extremely high traffic may require scaling, caching, or rate limiting to avoid overload.
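To show how the result of this aggregation could feed a leaderboard, here is a small sketch that flattens the response into ranked rows. The helper name `gateway_leaderboard` and the sample response values are made up for illustration; only the `aggregations` / `buckets` shape follows what a terms aggregation with a nested `value_count` returns.

```python
def gateway_leaderboard(response):
    """Turn a terms-aggregation response into (gateway, calls) rows, busiest first."""
    buckets = response["aggregations"]["gateway_services"]["buckets"]
    rows = [(bucket["key"], bucket["total_calls"]["value"]) for bucket in buckets]
    # Terms buckets normally arrive ordered by document count already;
    # sorting again keeps the ranking explicit and safe.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative response with invented numbers, not real measurements.
sample_response = {
    "aggregations": {
        "gateway_services": {
            "buckets": [
                {"key": "service-west-2", "doc_count": 800,
                 "total_calls": {"value": 800}},
                {"key": "service-east-1", "doc_count": 1200,
                 "total_calls": {"value": 1200}},
            ]
        }
    }
}
leaderboard = gateway_leaderboard(sample_response)
```

Two details are worth keeping in mind when running the query itself: a terms aggregation returns only the top 10 buckets unless `size` is set inside `terms`, and adding a top-level `"size": 0` skips returning the matching documents when only the aggregation is needed.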
Another common question is gateway reliability. The following aggregation inspects the status code of each API call and groups the calls into successes and failures, treating any status code of 400 or above as a failure:
{
  "aggs": {
    "status_groups": {
      "terms": {
        "script": {
          "source": "doc['status_code'].value >= 400 ? 'Failed' : 'Success'"
        }
      }
    }
  }
}
A high number of failures often indicates misconfigured services or networking issues. Without aggregations, someone would have to manually search through thousands of logs for this information.
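The same success/failure split can also be computed per gateway. The sketch below swaps the script for a `filters` aggregation with two range queries, nested under the gateway terms aggregation; this is an alternative to the script-based query above, not the approach the article used. The function names and the sample response numbers are assumptions for illustration.

```python
def build_status_query():
    """Query body: per-gateway success/failure counts via a filters aggregation."""
    return {
        "size": 0,  # only aggregation results are needed, not the documents
        "aggs": {
            "gateway_services": {
                "terms": {"field": "gateway_service.keyword"},
                "aggs": {
                    "status_groups": {
                        "filters": {
                            "filters": {
                                "Failed": {"range": {"status_code": {"gte": 400}}},
                                "Success": {"range": {"status_code": {"lt": 400}}},
                            }
                        }
                    }
                },
            }
        },
    }


def failure_rates(response):
    """Map each gateway to its share of failed calls (0.0 when it has no calls)."""
    rates = {}
    for bucket in response["aggregations"]["gateway_services"]["buckets"]:
        groups = bucket["status_groups"]["buckets"]
        failed = groups["Failed"]["doc_count"]
        total = failed + groups["Success"]["doc_count"]
        rates[bucket["key"]] = failed / total if total else 0.0
    return rates


# Illustrative response with invented counts.
sample_response = {
    "aggregations": {"gateway_services": {"buckets": [
        {"key": "service-east-1", "doc_count": 1000, "status_groups": {"buckets": {
            "Failed": {"doc_count": 50}, "Success": {"doc_count": 950}}}},
        {"key": "service-west-2", "doc_count": 100, "status_groups": {"buckets": {
            "Failed": {"doc_count": 20}, "Success": {"doc_count": 80}}}},
    ]}}
}
rates = failure_rates(sample_response)
```

Expressing the split as range filters avoids running a script against every document, and a rate per gateway is usually easier to compare across gateways than raw failure counts.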
Designing Reports That Tell a Story
Once the data was aggregated, it needed to be presented visually so that developers and SREs could immediately spot when something was going wrong.
A gateway service leaderboard was designed as a table listing gateways ranked by number of calls. It gave a quick overview of high-traffic gateways. Clicking on a particular gateway service displayed its detailed report, showing:
- Total API calls – the total number of API requests processed by each gateway.
- Success and failure breakdown – a breakdown of API requests based on success and failure rates.
- Bytes sent and bytes received charts – data volume metrics to track inbound and outbound traffic.
- Usage chart – a visualisation of API call distribution categorised by HTTP status code over a specified period.
- Latency chart – a graphical representation of response times for API calls passing through gateways.
- Top APIs and products – a list of the most frequently used APIs and products associated with the selected gateway.
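As a rough sketch of how the headline numbers in such a detailed report could be fetched for one gateway, the query below filters on the gateway name and sums the byte fields from the example event. The function names and the sample response are hypothetical; the article does not show the queries behind its report.

```python
def build_gateway_report_query(gateway):
    """Query body: total calls and byte totals for a single gateway."""
    return {
        "size": 0,
        # Ask for an exact hit count instead of the default capped estimate.
        "track_total_hits": True,
        "query": {"term": {"gateway_service.keyword": gateway}},
        "aggs": {
            "bytes_sent": {"sum": {"field": "bytes_sent"}},
            "bytes_received": {"sum": {"field": "bytes_received"}},
        },
    }


def summarize_report(response):
    """Pull the headline numbers out of the search response."""
    aggs = response["aggregations"]
    return {
        "total_calls": response["hits"]["total"]["value"],
        "bytes_sent": aggs["bytes_sent"]["value"],
        "bytes_received": aggs["bytes_received"]["value"],
    }


report_query = build_gateway_report_query("service-east-1")

# Illustrative response with invented totals.
sample_response = {
    "hits": {"total": {"value": 1200, "relation": "eq"}, "hits": []},
    "aggregations": {
        "bytes_sent": {"value": 614400.0},
        "bytes_received": {"value": 2457600.0},
    },
}
report = summarize_report(sample_response)
```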
Each visualisation told part of the story behind the transformed data. The latency chart used a line chart to reveal gradual slowdowns that wouldn't be evident from the raw logs.
The success and failure breakdown was displayed as a stacked bar graph that highlighted gateways where error rates were creeping up.
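One way to produce the data behind a latency line chart is a `date_histogram` over the event timestamp with latency percentiles in each bucket. This is a sketch under assumptions: the hourly interval, the choice of the 50th and 95th percentiles, the function names, and the sample response are all illustrative, and it assumes `timestamp` is mapped as a date field.

```python
def build_latency_query(interval="1h"):
    """Query body: latency percentiles per time bucket."""
    return {
        "size": 0,
        "aggs": {
            "latency_over_time": {
                "date_histogram": {"field": "timestamp", "fixed_interval": interval},
                "aggs": {
                    "latency_percentiles": {
                        "percentiles": {"field": "latency_ms", "percents": [50, 95]}
                    }
                },
            }
        },
    }


def latency_series(response):
    """Flatten the response into (bucket_start, p50_ms, p95_ms) points for plotting."""
    series = []
    for bucket in response["aggregations"]["latency_over_time"]["buckets"]:
        values = bucket["latency_percentiles"]["values"]
        series.append((bucket["key_as_string"], values["50.0"], values["95.0"]))
    return series


# Illustrative response with invented latencies.
sample_response = {
    "aggregations": {"latency_over_time": {"buckets": [
        {"key_as_string": "2025-10-20T12:00:00.000Z", "doc_count": 500,
         "latency_percentiles": {"values": {"50.0": 110.0, "95.0": 340.0}}},
        {"key_as_string": "2025-10-20T13:00:00.000Z", "doc_count": 520,
         "latency_percentiles": {"values": {"50.0": 125.0, "95.0": 410.0}}},
    ]}}
}
points = latency_series(sample_response)
```

Plotting a high percentile alongside the median tends to surface slowdowns earlier than an average would, since a slow tail can grow while the typical request still looks healthy.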
With aggregations and dashboards:
- Performance bottlenecks become visible
- High-traffic gateways can be scaled more proactively
- Failures can be detected and fixed earlier
- Product owners know which APIs customers rely on the most
Even though the dashboard is simple, its value is enormous: it serves as a single pane of glass for gateway health and performance.
Conclusion
If you are working with APIs or distributed gateways, start small by indexing logs, experimenting with aggregations, and building simple dashboards that answer useful questions. It need not be perfect; it just needs to be useful. In the end, analytics isn't just about pretty graphs or charts. It's about clarity and understanding your system deeply enough to make it better.