To break silos and improve availability, DevOps and Ops should regularly collect useful feedback from maintaining production environments. They should also make this feedback easy for developers to access, so that improving the feedback loop becomes a team effort.
How to Provide Developers with Meaningful Feedback
Image from http://dennyzhang.com/continuous_feedback
1. Monitor at Both the OS and Process Levels
- Nagios Plugin: Monitor Service CPU
- Nagios Plugin: Monitor Process FD
- Nagios Plugin: Monitor Service Memory
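As a rough illustration of what such a plugin does, here is a minimal sketch of a process-level file descriptor check. It assumes a Linux host with /proc, and the function names and the "open_fds" label are made up for this example. The only convention it relies on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_open_fds(pid):
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def evaluate(value, warn, crit, label="open_fds"):
    """Map a measured value to a Nagios exit code and a one-line message."""
    if value >= crit:
        code, state = CRITICAL, "CRITICAL"
    elif value >= warn:
        code, state = WARNING, "WARNING"
    else:
        code, state = OK, "OK"
    # Text after '|' is performance data in the usual plugin output format.
    return code, f"{state}: {label}={value} | {label}={value};{warn};{crit}"


def main(argv):
    """Hypothetical CLI arguments: <pid> <warn> <crit>. Returns the exit code."""
    try:
        pid, warn, crit = (int(arg) for arg in argv)
        code, message = evaluate(count_open_fds(pid), warn, crit)
    except (ValueError, OSError) as exc:
        # Bad arguments or an unreadable /proc entry are reported as UNKNOWN.
        code, message = UNKNOWN, f"UNKNOWN: {exc}"
    print(message)
    return code
```

To use it as a plugin, a wrapper script would call sys.exit(main(sys.argv[1:])) so that Nagios sees the exit code. The same evaluate() helper could be reused for CPU or memory values.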
2. Detect Resource Leaks in Your Applications
- Memory leak: This defect is a close friend of service outages. If memory usage keeps rising steadily, alert your dev team.
- Stale file handles: Files may have been deleted somehow, but your application still holds the file handles, or even keeps reading and writing those files. Detect it with "pmap -x $pid | grep deleted".
- Overwhelming network sockets: Either your application can't serve requests fast enough or it has issues reclaiming socket fds. Check this with "lsof -p $pid | grep -iE 'TCP|pipe|socket|anon_inode'". If lots of TCP sockets are in the CLOSE_WAIT state, it's a bad sign.
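Two of these checks are easy to script. The sketch below is only an illustration: the helper names and the 20% growth threshold are assumptions, and the "never drops" rule is deliberately simplistic (a garbage-collected runtime will show dips even while leaking). The second helper works on captured lsof text, where the TCP state appears in parentheses at the end of each socket line.

```python
import re
from collections import Counter


def is_steadily_rising(samples, min_growth_ratio=0.2):
    """Return True if no sample is lower than the previous one and total
    growth reaches min_growth_ratio (an arbitrary illustrative threshold)."""
    if len(samples) < 3 or samples[0] <= 0:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    growth = (samples[-1] - samples[0]) / samples[0]
    return never_drops and growth >= min_growth_ratio


# lsof prints the TCP state in parentheses at the end of a socket line,
# e.g. "... TCP 10.0.0.5:8080->10.0.0.9:51234 (CLOSE_WAIT)".
_STATE = re.compile(r"\(([A-Z_0-9]+)\)\s*$")


def count_tcp_states(lsof_output):
    """Count TCP connection states found in captured `lsof -p $pid` text."""
    states = Counter()
    for line in lsof_output.splitlines():
        if "TCP" not in line:
            continue  # skip pipes, regular files, and other descriptors
        match = _STATE.search(line)
        if match:
            states[match.group(1)] += 1
    return states
```

Feeding is_steadily_rising() with periodic memory readings of a service (for example, resident memory in MB) and alerting when count_tcp_states(...)["CLOSE_WAIT"] exceeds a threshold you choose gives developers something concrete to look at.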
3. Always Be on Top of Logfiles
Believe it or not, I've seen applications diligently writing hundreds of messages to logfiles every second. This eats up disk space quickly, even before logrotate takes effect.
For application logging, alert developers about any major errors or exceptions found. For syslogs, DevOps/Ops are usually the only gatekeepers.
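A small script can do the first pass over application logs. The following is a minimal sketch under two assumptions that will not hold for every application: severe lines contain a level such as ERROR or FATAL or a Java-style class name ending in Exception or Error, and each line starts with a "YYYY-MM-DD HH:MM:SS" timestamp. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed markers of a "major" problem; tune these for the real log format.
_SEVERE = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b|\b\w+(Exception|Error)\b")
# Assumed timestamp prefix: "YYYY-MM-DD HH:MM:SS".
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def find_severe_lines(lines):
    """Return the log lines worth forwarding to developers."""
    return [line.rstrip("\n") for line in lines if _SEVERE.search(line)]


def busiest_second(lines):
    """Return (timestamp, count) for the second with the most log lines,
    or None if no line carries the assumed timestamp prefix."""
    per_second = Counter(line[:19] for line in lines if _TIMESTAMP.match(line))
    return per_second.most_common(1)[0] if per_second else None
```

Running busiest_second() over a logfile shows quickly whether an application is logging hundreds of messages per second, and find_severe_lines() gives a short list to attach to an alert.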
4. Monitor DB Slow Queries
Slow queries usually incur random or constant performance penalties on your applications. If we can collect this information for developers, it becomes very valuable input for their troubleshooting.
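How to collect slow queries depends on the database. As one illustration, the sketch below assumes a MySQL-style slow query log, where each entry has a "# Query_time: ..." header line followed by the SQL text; the parser and its function names are made up for this example and ignore many details of real logs.

```python
import re

# Header line of one slow-log entry (MySQL-style format, assumed here).
_QUERY_TIME = re.compile(r"^# Query_time: ([\d.]+)")


def parse_slow_log(text):
    """Return a list of {"seconds": float, "sql": str} entries."""
    entries = []
    for line in text.splitlines():
        match = _QUERY_TIME.match(line)
        if match:
            # A new entry starts at each Query_time header.
            entries.append({"seconds": float(match.group(1)), "sql": ""})
        elif line.startswith("#") or line.startswith("SET timestamp=") or not entries:
            continue  # other headers, bookkeeping lines, or file preamble
        else:
            entry = entries[-1]
            entry["sql"] = (entry["sql"] + " " + line.strip()).strip()
    return entries


def slowest(entries, limit=5):
    """Return the entries with the longest query time first."""
    return sorted(entries, key=lambda e: e["seconds"], reverse=True)[:limit]
```

A daily report of slowest(parse_slow_log(...)) is usually enough to start a conversation with developers about missing indexes or expensive joins.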
5. Change History of Production Environments
A clear and full change list of production environments may empower developers to identify root causes quickly. See how to Automatically Track All Change History.
6. Observe Machine Reboots and Service Restarts
Not all developers know or remember that files under the /tmp directory usually won't survive a machine reboot. This causes problems once the machine does reboot. Scan the source code for /tmp and alert developers if necessary.
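The /tmp scan can be automated with a few lines. This is a minimal sketch: the function names and the default list of file extensions are assumptions, and the pattern only looks for a literal /tmp path while skipping things like /var/tmp or /tmpfile.

```python
import os
import re

# Match "/tmp" as a path of its own, not as part of "/var/tmp" or "/tmpfile".
_TMP_PATH = re.compile(r"(?<![\w.])/tmp(?![\w-])")


def mentions_tmp(line):
    """Return True if the line references the /tmp directory."""
    return _TMP_PATH.search(line) is not None


def find_tmp_references(root, extensions=(".py", ".sh", ".java", ".conf")):
    """Walk a source tree and return (path, line number, text) for each hit.

    The extension list is an illustrative default, not a recommendation.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if mentions_tmp(line):
                        hits.append((path, lineno, line.strip()))
    return hits
```

Not every hit is a bug, since scratch files in /tmp are fine. The point is to give developers a list to review for data that must outlive a reboot.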
Restarting services can be scary. A service stop is expected to perform a clean shutdown: finish or close in-flight requests, flush data to disk, and so on. A service start might be slow, or even fail, because of complicated service dependencies. The actual behavior does not always match developers' assumptions. For example, a stop or start might take too long, or the service might stay stuck for a long time without anyone noticing. DevOps/Ops should therefore observe restarts carefully and pass what they see on to developers.
7. Enable Coredump When Applications Crash
A coredump helps developers understand which thread and which function caused a crash.
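On Unix-like systems, whether a crashing process writes a core file is controlled by its core-size resource limit. As a minimal sketch, a launcher script could raise the soft limit to the hard limit before starting the application; the function name is hypothetical, and on Linux the location of the resulting core file additionally depends on the kernel's core_pattern setting.

```python
import resource


def enable_core_dumps():
    """Raise this process's soft core-size limit to its hard limit.

    Child processes started afterwards inherit the limit. Returns the new
    (soft, hard) pair; a soft limit of 0 means core dumps stay disabled
    because the hard limit itself is 0.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

This has the same effect as running "ulimit -c" with the hard limit value in the shell that starts the service. If the returned soft limit is 0, the hard limit must be raised by an administrator first.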
8. Examine JVM for Key Metrics
For Java application operation, the JVM toolkit can help detect suspicious issues. Be familiar with tools like jps, jstack, jmap, etc.
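As one example of turning these tools into feedback, the sketch below summarizes a captured jstack thread dump by thread state. It relies on the "java.lang.Thread.State: ..." line that jstack prints for each Java thread; the function name is made up for this example.

```python
import re
from collections import Counter

# jstack prints a line such as "   java.lang.Thread.State: BLOCKED (on object monitor)".
_THREAD_STATE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(jstack_output):
    """Count threads per state (RUNNABLE, BLOCKED, WAITING, ...) in a dump."""
    return Counter(_THREAD_STATE.findall(jstack_output))
```

A sudden jump in BLOCKED threads between two dumps is a useful hint to pass on, together with the dumps themselves.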
9. Simulate Production Environments at a Reasonable Cost
Last but not least: if DevOps can simulate production environments quickly, developers get a safe playground to run tests or dry-run patches. Some common obstacles to achieving this are:
- Budget concerns: We may need to start quite a few VMs just to get a minimal production-like environment.
- Automating the automation: We need to automate not only cluster deployment, but also data export and import.
- Simulate production environments as much as possible: This is the most difficult part, and it varies across projects.
More Reading: Generate Common DB Data Report By ELK