9 Key Tips for Production Environment Maintenance

Denny Zhang explains his nine key tips for maintaining production environments and providing developers with meaningful feedback.

To break down silos and improve availability, DevOps and Ops teams should regularly collect useful feedback from production environment maintenance. They should also make this feedback easy for developers to access, so that the whole team can improve the feedback loop together.

How to Provide Developers with Meaningful Feedback

Figure: continuous feedback loop (image from http://dennyzhang.com/continuous_feedback)

1. Monitor at Both the OS and Process Levels

2. Detect Resource Leaks in Your Applications

  • Memory leak: This defect is a close friend of service outages. If a process's memory usage keeps rising steadily, alert your dev team.
  • Stale file handles: Files may have been deleted, but your application still holds open handles to them, or even keeps reading and writing them. Detect this with "pmap -x $pid | grep deleted".
  • Overwhelming network sockets: Either your application can't serve requests fast enough or it has trouble reclaiming socket file descriptors. Check with "lsof -p $pid | grep -iE 'TCP|pipe|socket|anon_inode'". If lots of TCP sockets are in the CLOSE_WAIT state, it's a bad sign.
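
The socket check above can be turned into a number you can track over time. The following is a minimal sketch, assuming you have already captured the text printed by lsof for one process; the function name count_tcp_states and the sample lines are illustrative and not from the original article.

```python
import re
from collections import Counter

# Matches the trailing "(STATE)" that lsof appends to TCP socket lines.
_STATE_RE = re.compile(r"\((\w+)\)\s*$")


def count_tcp_states(lsof_output: str) -> Counter:
    """Count TCP connection states in text captured from `lsof -p $pid`."""
    states = Counter()
    for line in lsof_output.splitlines():
        if " TCP " not in line:
            continue  # skip pipes, regular files, and other descriptors
        match = _STATE_RE.search(line)
        if match:
            states[match.group(1)] += 1
    return states


# Hypothetical sample, shaped like lsof output for a single process.
SAMPLE = """\
java 1234 app 45u IPv4 11111 0t0 TCP 10.0.0.1:8080->10.0.0.2:51234 (CLOSE_WAIT)
java 1234 app 46u IPv4 11112 0t0 TCP 10.0.0.1:8080->10.0.0.3:51300 (CLOSE_WAIT)
java 1234 app 47u IPv4 11113 0t0 TCP 10.0.0.1:8080->10.0.0.4:51301 (ESTABLISHED)
java 1234 app 48u IPv4 11114 0t0 TCP *:8080 (LISTEN)
java 1234 app 49r FIFO 0,12 0t0 11115 pipe
"""
```

A count of CLOSE_WAIT sockets that keeps growing between samples is usually more telling than any single snapshot, so it may be worth recording the result periodically rather than checking once.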

3. Always Be on Top of Logfiles

Believe it or not, I've seen applications diligently writing hundreds of messages to logfiles every second. This eats up disk space quickly, even before logrotate kicks in.

For application logging, alert developers about any major errors or exceptions found. For syslogs, DevOps/Ops are usually the only gatekeepers.
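
One way to surface major errors for developers is to summarize them per marker instead of forwarding raw logs. This is a rough sketch only: the markers ERROR, FATAL, and names ending in Exception are assumptions about the log format, and summarize_errors is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed markers; adjust to whatever your application's log format uses.
_PATTERN = re.compile(r"\b(ERROR|FATAL|\w*Exception)\b")


def summarize_errors(lines):
    """Return a Counter keyed by the first error marker found on each line."""
    hits = Counter()
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Hypothetical log lines for illustration.
SAMPLE_LINES = [
    "2016-01-01 10:00:00 INFO service started",
    "2016-01-01 10:00:05 ERROR db timeout",
    "java.lang.NullPointerException: null",
    "2016-01-01 10:00:09 ERROR db timeout",
]
```

Calling summarize_errors on an open file object works too, since it only needs an iterable of lines.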

4. Monitor DB Slow Queries

Slow queries usually incur random or constant performance penalties on your applications. Capturing them gives developers valuable input for troubleshooting.
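
The article does not name a database, so the sketch below assumes a MySQL-style slow query log, where each entry carries a "# Query_time:" header followed by the statement. The function name slow_queries, the one-second threshold, and the sample entries are illustrative assumptions.

```python
import re

# Header line as written in a MySQL-style slow query log (assumed format).
_QUERY_TIME_RE = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")


def slow_queries(log_text, threshold_seconds=1.0):
    """Return (seconds, statement) pairs for entries at or above the threshold."""
    results = []
    current_time = None
    for line in log_text.splitlines():
        match = _QUERY_TIME_RE.match(line)
        if match:
            current_time = float(match.group(1))
        elif (current_time is not None and line.strip()
              and not line.startswith(("#", "SET timestamp"))):
            if current_time >= threshold_seconds:
                results.append((current_time, line.strip()))
            current_time = None  # keep only the first statement line per entry
    return results


# Hypothetical log excerpt for illustration.
SAMPLE_LOG = """\
# Time: 2016-05-01T10:00:00.000000Z
# User@Host: app[app] @ localhost []
# Query_time: 3.200000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 500000
SET timestamp=1462096800;
SELECT * FROM orders WHERE note LIKE '%refund%';
# Time: 2016-05-01T10:00:05.000000Z
# User@Host: app[app] @ localhost []
# Query_time: 0.400000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 10
SET timestamp=1462096805;
SELECT 1;
"""
```

Multi-line statements are truncated to their first line here, which is usually enough to tell developers which query to look at.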

5. Track the Change History of Production Environments

A clear and complete change list for production environments can help developers identify root causes quickly. See how to Automatically Track All Change History.

6. Observe Machine Reboots and Service Restarts

Not all developers know or remember that files under /tmp may not survive a machine reboot. This causes problems when the machine does reboot. Scan the source code for /tmp and alert developers if necessary.
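
Scanning the source for /tmp can be scripted. The following is a minimal sketch, not the author's tooling: scan_text and find_tmp_references are hypothetical helpers, and the list of file extensions is an assumption you would adapt to your codebase.

```python
import os


def scan_text(path, text):
    """Return (path, line_number, line) for each line of text mentioning /tmp."""
    return [
        (path, number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if "/tmp" in line
    ]


def find_tmp_references(root, extensions=(".py", ".sh", ".java", ".conf")):
    """Walk root and collect /tmp references from files with the given extensions."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue  # assumed extension filter; widen it as needed
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                hits.extend(scan_text(path, handle.read()))
    return sorted(hits)
```

Not every hit is a bug, since scratch files in /tmp are fine; the output is a review list for developers rather than a verdict.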

Restarting services can be scary. A clean service stop does a lot behind the scenes: it may need to finish or drop in-flight requests, flush data to disk, etc. A service start might be slow or even fail due to complicated service dependencies. Some behaviors might not align with developers' assumptions. For example, a stop or start might take too long, or the service might get stuck for a long time with nothing handling it. Thus, DevOps/Ops should observe restarts carefully and pass what they see on to developers.
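
One simple way to observe stop and start behavior is to time the command and flag it when it exceeds a deadline. This sketch assumes the stop or start action is available as a command line; time_command and the 60-second default are illustrative choices, and the demo uses a harmless stand-in command rather than a real service.

```python
import subprocess
import sys
import time


def time_command(command, timeout_seconds=60.0):
    """Run command; return (elapsed_seconds, return_code), with None on timeout."""
    started = time.monotonic()
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        code = completed.returncode
    except subprocess.TimeoutExpired:
        code = None  # still running at the deadline; subprocess.run killed it
    return time.monotonic() - started, code


if __name__ == "__main__":
    # Harmless stand-in; in practice pass your service's stop or start command.
    print(time_command([sys.executable, "-c", "pass"]))
```

Recording the elapsed time for every restart gives developers a baseline, so a stop that suddenly takes several times longer stands out.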

7. Enable Core Dumps When Applications Crash

A core dump helps developers identify which thread and which function caused a crash.
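
The article does not say how to enable core dumps. On Unix-like systems one common prerequisite is a non-zero core file size limit, which is what the shell's "ulimit -c" controls. The sketch below raises that limit for the current process, and for child processes it starts afterwards, using Python's standard resource module; enable_core_dumps is a hypothetical helper name.

```python
import resource


def enable_core_dumps():
    """Raise this process's soft core-file limit to its hard limit.

    Returns the resulting (soft, hard) pair. Child processes started
    afterwards inherit the limit.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

If the hard limit is itself zero, this changes nothing and an administrator has to raise it. Where the dump file ends up is decided by the operating system (on Linux, by the kernel.core_pattern setting), not by this code.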

8. Examine JVM for Key Metrics

For Java applications, the JDK's command-line tools can help detect suspicious issues. Be familiar with tools like jps, jstack, jmap, etc.
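
As one example of turning these tools into a metric, the sketch below counts thread states in a thread dump captured with jstack. It assumes the dump contains the usual "java.lang.Thread.State: ..." lines; count_thread_states and the sample dump are illustrative.

```python
import re
from collections import Counter

# jstack prints one such line per Java thread, e.g. "BLOCKED (on object monitor)".
_THREAD_STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(jstack_output):
    """Count Java thread states in text captured from `jstack <pid>`."""
    return Counter(_THREAD_STATE_RE.findall(jstack_output))


# Hypothetical, shortened thread dump for illustration.
SAMPLE_DUMP = """\
"main" #1 prio=5 os_prio=0 tid=0x00007f01 nid=0x1a03 runnable [0x00007f02]
   java.lang.Thread.State: RUNNABLE
"worker-1" #12 prio=5 os_prio=0 tid=0x00007f03 nid=0x1a10 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-2" #13 prio=5 os_prio=0 tid=0x00007f04 nid=0x1a11 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"timer" #14 prio=5 os_prio=0 tid=0x00007f05 nid=0x1a12 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"""
```

A rising share of BLOCKED threads across successive dumps is a reasonable signal to pass to developers, along with the dumps themselves.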

9. Simulate Production Environments at a Reasonable Cost

Last but not least: if DevOps can simulate production environments quickly, developers get a safe playground to run tests or dry-run patches. Some common obstacles to achieving this are:

  • Budget concerns: We may need to start enough VMs to build even a minimal production-like environment.
  • Automating the automation: We need to automate not only cluster deployment, but also data export and import.
  • Simulating production environments as closely as possible: This is the most difficult part, and it varies across projects.

More Reading: Generate Common DB Data Report By ELK

Published at DZone with permission of Denny Zhang, DZone MVB. See the original article here.
