The following is my grouping of tools that I have learned and used as a sysadmin and DevOps dude at ThoughtWorks while maintaining our distributed infrastructure, setting up our private cloud installations, and working on many different client gigs.
You can add these tools one at a time, as needed, as your infrastructure/deployment/app grows.
- Provisioner: Abstracts your VM/environment provisioning
mechanism. Mostly relevant if you are on cloud infrastructure. Examples are boto and fog.
Very important if you plan to do something like auto scaling. Gives you
- Configuration Management system: Lets you create reusable
environments by expressing packages, services, files, and other components
via a DSL. It also addresses cross-platform issues. Puppet, Chef,
CFEngine, and Salt are examples. A mature CM setup gives you context-aware
infrastructure: your web server can automatically recognize the DB server, and the load balancer can automatically recognize your
web servers. A mature CM setup also incorporates the notion of
environments and versioned infrastructure, so UAT can run app
version 1.3 while production runs 1.1 and staging runs 2.0.
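To make "expressing state" a bit more concrete, here is a minimal Python sketch of an idempotent file resource plus a per-environment version table. It does not use Puppet, Chef, or any real CM tool. The names `ensure_file`, `render_release_file`, and `APP_VERSIONS` are made up for illustration, and the version numbers are simply the ones from the example above.

```python
import os
import tempfile

# Hypothetical per-environment app versions (values taken from the example above).
APP_VERSIONS = {"uat": "1.3", "production": "1.1", "staging": "2.0"}


def ensure_file(path, content):
    """Make sure `path` holds exactly `content`; return True only if a change was made."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            if handle.read() == content:
                return False  # already in the desired state, nothing to do
    except FileNotFoundError:
        pass  # file is missing, so it has to be created
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return True


def render_release_file(environment):
    """Build the desired content of a (hypothetical) release marker file."""
    return "app_version=" + APP_VERSIONS[environment] + "\n"


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "release.conf")
    print(ensure_file(target, render_release_file("uat")))  # first run changes the file
    print(ensure_file(target, render_release_file("uat")))  # second run is a no-op
```

The point of the sketch is the return value: running the same declaration twice only changes the system once, which is roughly what CM tools mean by convergence.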
- Application deployers: Let you deploy your application. CM tools can do
this too, but there are dedicated tools for it: Vlad the Deployer (Ruby), Capistrano (Ruby), Func, the Fedora Unified Network Controller (Bash, Python), and Fabric (Python) come to mind. They also help you create ad hoc
system automation. Most of these are SSH in a loop (or using GNU
- Orchestrator: Functions similarly to app deployers but incorporates middleware-like facilities such as async command dispatching. MCollective and Salt are
examples. Both of them use middleware (Salt uses ZeroMQ, while MCollective can use any STOMP-compliant message broker) to broker 1>N, 1>1,
N>M, and N>N dispatching, both real time and async. They can be used with
platforms that don't have SSH and are massively scalable.
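As a rough illustration of 1>N async dispatching, here is a small Python sketch with an in-process broker. It is only a stand-in for real middleware: it does not speak ZeroMQ or STOMP, and the `Broker` class, the `webfarm` topic, and the node names are all assumptions made up for the example.

```python
import asyncio


class Broker:
    """In-process stand-in for real middleware: routes a message to every subscriber of a topic."""

    def __init__(self):
        self.subscribers = {}  # topic -> list of async handlers

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    async def publish(self, topic, message):
        handlers = self.subscribers.get(topic, [])
        # Run all handlers concurrently and collect their replies (1>N dispatch).
        return await asyncio.gather(*(handler(message) for handler in handlers))


def make_agent(name):
    """Build a hypothetical node agent that pretends to run a command."""

    async def handle(command):
        await asyncio.sleep(0)  # yield control, standing in for real work or latency
        return name + ": ran " + command

    return handle


async def main():
    broker = Broker()
    for node in ("web1", "web2", "db1"):
        broker.subscribe("webfarm", make_agent(node))
    return await broker.publish("webfarm", "uptime")


if __name__ == "__main__":
    for reply in asyncio.run(main()):
        print(reply)
```

The sender only knows the topic, not the list of hosts, which is the main difference from the SSH-in-a-loop style of the deployers.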
- Monitoring solution: Keeps tabs on performance. There are mostly three kinds of monitoring you'll need:
System (disk, CPU load, memory)
Services (web server, DB server etc)
App (I use a cuke script that checks how the whole app is working).
A good monitoring solution integrates easily with all other infra services and lets you define metrics (app response time, free memory, cached memory etc). It can also include customizable notifications (email, Jabber, SMS etc) and escalations. How the tool charts your metrics is also very important for understanding trends. Reporting and event handlers are two more important features (use the event handlers in conjunction with the provisioner to get auto scaling).
Examples are Nagios, Zabbix, Zenoss, and many more. None of them is complete, but all of them can be complemented with plugins and other tools (like Graphite, an awesome charting tool). Nagios uses text-based configs and does not need a DB. It's easy to install, scales well, and is mature, so integration with other apps is very easy.
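A system check can be very small. The sketch below is a hypothetical disk check in Python that follows the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL). The 80/90 percent thresholds and the function names are assumptions picked for illustration, not defaults of any tool.

```python
import shutil
import sys

OK, WARNING, CRITICAL = 0, 1, 2  # exit codes used by the Nagios plugin convention


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code; thresholds are illustrative defaults."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    percent = disk_used_percent("/")
    status = classify(percent)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print("DISK %s - %.1f%% used" % (label, percent))
    sys.exit(status)
```

An event handler could react to the CRITICAL state, for example by asking the provisioner for more capacity.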
- Log management and log analytics: For tracking your logs. Again, there are three parts to this solution:
Forwarding: A client that sits in every VM and forwards the logs to a central location. Options are rsyslog, syslog, syslog-ng, Graylog agents, Logstash, Splunk forwarders etc.
Gathering: A server that accepts all logs. Options are syslog-ng, rsyslog, Splunk, and Graylog2.
Analytics: In most cases you will be indexing your logs and searching them for particular patterns. Options are Graylog2 and Logstash (both use ElasticSearch as the engine) and Splunk (very powerful, very costly). A mature log management solution will let you set up alerts based on patterns (like failed transactions, 50Xs, 40Xs etc).
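Pattern-based alerting can be sketched in a few lines of Python. This is not how Graylog2, Logstash, or Splunk do it internally; it only shows the idea. The regex assumes the status code comes right after the quoted request, as in the common access log format, and the `SAMPLE` lines and the `max_5xx` threshold are made up for the example.

```python
import re
from collections import Counter

# Assumes the status code follows the quoted request, as in the common access log format.
STATUS_PATTERN = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines):
    """Count responses per status class ("2xx", "4xx", "5xx", ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def should_alert(counts, max_5xx=1):
    """Alert when server errors exceed an illustrative threshold."""
    return counts["5xx"] > max_5xx


# Made-up sample lines, only here to exercise the functions.
SAMPLE = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /pay HTTP/1.1" 502 0',
    '10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "POST /pay HTTP/1.1" 503 0',
    '10.0.0.4 - - [01/Jan/2024:10:00:03 +0000] "GET /x HTTP/1.1" 404 0',
]

if __name__ == "__main__":
    totals = count_status_classes(SAMPLE)
    print(dict(totals), "alert:", should_alert(totals))
```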
- Supervisors: They observe a service and take appropriate action to
bring it back up whenever it goes down, bypassing the whole network
monitoring > event handler loop. Supervisors are very helpful for a shaky service.
Monit, Bluepill, Godrb, etc. are some examples. A good supervisor has a low
memory/CPU footprint, heals services quickly, and provides a rich DSL for
expressing a service state (like which port should be responsive, which
process should be running, how to fix the process if it dies, or whether to send an alert when
it takes an action etc).
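The check, fix, and alert cycle of a supervisor can be sketched like this in Python. It is not the DSL of Monit, Bluepill, or any other tool; `port_is_responsive` and `supervise_once` are hypothetical helpers, and port 8080 and the printed restart action are placeholders for whatever your service actually needs.

```python
import socket


def port_is_responsive(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def supervise_once(is_healthy, restart, notify):
    """Run one check; if the service is down, restart it and send an alert."""
    if is_healthy():
        return "healthy"
    restart()
    notify("service was down, restart triggered")
    return "restarted"


if __name__ == "__main__":
    # Hypothetical wiring: the restart action only prints instead of calling an init system.
    state = supervise_once(
        is_healthy=lambda: port_is_responsive("127.0.0.1", 8080),
        restart=lambda: print("would restart the service here"),
        notify=print,
    )
    print(state)
```

A real supervisor would run this in a loop with a short sleep, which is why a low memory/CPU footprint matters.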
- Security, Hardening, and Auditing Tools: Specialized tools for strengthening system security. Tools like Bastille ensure you have done the basic OS-level
hardening. It can also assess your infrastructure and lock it down if
needed. Tools like PSAD and Snort detect intruders (PSAD by analyzing iptables logs, Snort by inspecting network traffic) and can be set up to automatically
block them. Some of the CM tools like Puppet or Chef can be used
Any other good additions to this list by category are welcome and encouraged in the comments!