Cracking the Code: How To Decode Vague API Errors Like a Pro
In this article, we’ll look at some common reasons for 500-level errors and how hard it is to pinpoint the issue with traditional debugging tools.
HTTP status codes are three-digit numbers returned by the server to indicate the status of the request. These status codes provide valuable information to help debug various states of HTTP requests.
Responses are grouped into five classes:
- 1xx: informational responses indicating that the server has received the request and is continuing to process it
- 2xx: successful responses indicating that the request was successfully received and accepted
- 3xx: redirection messages to indicate that further action needs to be taken to complete the request
- 4xx: client error responses to indicate that the client made an error in the request
- 5xx: server error responses to indicate that the server ran into an issue while fulfilling the request
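Since the class is determined by the first digit of the code, the grouping above can be sketched in a few lines of Python. This is only an illustration: `classify_status` and `describe_status` are names made up for this example, while `http.HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus

# Map the leading digit of a status code to its response class.
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def classify_status(code: int) -> str:
    """Return the response class for a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


def describe_status(code: int) -> str:
    """Combine the standard reason phrase (if known) with the response class."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not define still have a class.
        phrase = "unknown"
    return f"{code} {phrase} ({classify_status(code)})"
```

For example, `describe_status(502)` yields a string naming the code, the reason phrase "Bad Gateway", and the "server error" class, which is often enough to decide whether to start looking at the client or at the server.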
When we are trying to troubleshoot API errors, we are usually dealing with 400- or 500-level errors. 400-level errors are a bit easier to debug since the most common ones, such as 401 (Unauthorized) or 404 (Not Found), point directly at what is wrong with the request. However, 500-level errors are usually harder to troubleshoot as they encompass a much broader range of underlying causes.
Some common server error responses include:
- 500: Internal server error, a generic error for a situation that the server does not know how to handle
- 502: Bad gateway to indicate that the server, acting as a gateway or proxy, got an invalid response from an upstream server while handling the request
- 503: Service unavailable to indicate that the server is not ready to handle the request
In this article, we’ll look at some common reasons for 500-level errors and how hard it is to pinpoint the issue with traditional debugging tools. Then we’ll take a look at Lightrun’s Dynamic Logging and Snapshot features that can help to better resolve these issues.
Common Reasons for 500 Errors
Since 500-level status codes encompass any errors on the server, the reasons for these errors are varied. But most commonly, these errors stem from:
- Server overload: if the server is experiencing high traffic, it may not be able to handle all requests in time
- Resource exhaustion: the server may also be running out of resources such as CPU or memory due to a memory leak or bad logic
- Connection issues: the server may have encountered an unexpected error trying to connect to an external service like a database
- Configuration issues: incorrect configurations such as insufficient permissions, missing IAM roles, or missing environment variables may also cause the server to throw an error
- Logical errors: if the application code hits a runtime error, the unhandled exception may cause the server to return a 500 response
Besides these common server-side issues, there may be other reasons that the client receives a 500-level response. For example, the error could be thrown at the reverse proxy or gateway level due to firewall or security policy issues. Other times the server may intentionally give a vague 500-level error to mask the underlying issue for external clients. Finally, errors could be thrown due to an outage from the cloud provider or external vendor for network management.
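To illustrate how a server might intentionally mask the underlying issue, here is a minimal sketch in Python. It is not tied to any particular framework: `handle_request`, `handler`, `request`, and the `error_id` field are hypothetical names chosen for this example. The idea is that the full detail goes to the server log, while the client only receives a generic 500 and an ID that can be used to find the matching log entry.

```python
import logging
import uuid

logger = logging.getLogger("api")


def handle_request(handler, request):
    """Run a handler and turn any unexpected exception into a generic 500.

    Returns a (status_code, body) tuple. `handler` and `request` are
    hypothetical stand-ins for whatever the real framework provides.
    """
    try:
        return 200, handler(request)
    except Exception:
        # The full detail, including the stack trace, stays in the server
        # log, keyed by a randomly generated error ID.
        error_id = uuid.uuid4().hex
        logger.exception("unhandled error %s while handling request", error_id)
        # The client only sees a vague message plus the ID to report back.
        return 500, {"error": "Internal Server Error", "error_id": error_id}
```

From the client's point of view, a missing environment variable and a failed database connection would look identical here, which is exactly why these responses are hard to decode without access to the server-side logs.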
Difficulty in Debugging 500 Errors
Because the underlying reasons for 500-level errors are so varied, it is often very difficult to pinpoint the cause of the issue. The error could be thrown by the network, proxy, infrastructure, or the application itself. So to debug, it is essential to review not only the server logs but also the logs of other infrastructure components, such as load balancers or NAT gateways, to see where the issue originates.
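As a rough sketch of that cross-layer review, suppose the logs from each component have already been reduced to one record per layer for a single request. The `Hop` record and the layer names below are assumptions made for this example, not the format of any real load balancer or gateway log. The function walks the layers from the outside in and reports the deepest one that returned a 5xx, which is a reasonable first place to look.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    """One layer's view of a request (hypothetical record format)."""
    layer: str              # e.g. "load_balancer", "gateway", "app"
    status: Optional[int]   # status the layer returned, None if never reached


def deepest_failing_layer(hops: List[Hop]) -> Optional[str]:
    """Return the innermost layer that reported a 5xx, or None.

    `hops` must be ordered from the outermost layer to the innermost one.
    """
    origin = None
    for hop in hops:
        if hop.status is not None and 500 <= hop.status <= 599:
            # Later (deeper) layers overwrite earlier ones.
            origin = hop.layer
    return origin
```

If the application itself logged a 500, it is reported as the origin. If the request never reached the application and the gateway answered with a 502, the gateway is reported instead, which suggests an infrastructure problem rather than a programming error.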
With traditional observability tooling, this can be a daunting task unless everything is already set up. First, to catch application errors, you would need sufficient application logs to expose the error and the stack trace. For infrastructure-induced errors, you would need to collect application metrics as well as infrastructure resource metrics to determine whether the cause is a programming error, faulty infrastructure, or heavy load. Finally, to debug 500 errors that the API server returns to mask an underlying error, you would need a sophisticated tracing system to track down the root cause of the issue.
In an ideal world, every piece of software would have these observability components available for developers to debug. However, in practice, at least one of these components (i.e., logging, monitoring, tracing) is often missing. This could be due to an immature codebase, an inherited legacy codebase, or incorrect configuration. Without these tools in place, it becomes very difficult to debug a live incident in production. In a best-case scenario, developers might be able to piece together a hypothesis from the logs and try to replicate it in a lower environment for a deeper analysis. In other cases, a new deployment may be needed simply to add more logging statements or to collect more metrics.
Dynamic Instrumentation With Lightrun
Given this reality, Lightrun takes a different path toward observability. After injecting an agent into the application code, Lightrun allows developers to dynamically instrument logs, traces, and metrics at runtime. In other words, no code changes, redeployments, or even restarts are needed to start gathering valuable information on the fly.
With dynamic logs, developers can add new logging statements anywhere in the code to provide more context. This can be helpful to check the logic of the program and test out the flows with live traffic. Also, it can help surface the internal state if the API error is being caught by another service and masked as a generic 500-level error before returning to the client.
Next, there is the snapshot feature that allows developers to create virtual breakpoints without stopping or blocking the application. The snapshot exposes stack traces and variables so that developers can investigate bugs more quickly without adding logging statements that simply print out the state of the system. Developers can leverage snapshots to step through the code and rule out configuration or programming errors.
Troubleshooting Made Easy
Troubleshooting API errors is not an easy task. However, with Lightrun, you can dynamically create logs or snapshots to get to the root of the issue faster. Together with other existing observability tooling, developers can diagnose the issue and roll out a fix more quickly, without spending valuable time on a redeploy just to add new logs or diagnostic info.