Following is my grouping of tools that I have learned/used as a sysadmin and DevOps dude at ThoughtWorks while maintaining our distributed infrastructure, setting up our private cloud installations, and in many different client gigs.
You can add some of these tools when required as your infrastructure/deployment/app grows.
- Provisioner: Abstracts your VM/environment provisioning mechanism. Mostly relevant if you are on cloud infrastructure. Examples are boto and fog. It is very important if you plan to do something like auto scaling, because it is what gives you elastic infrastructure.
- Configuration management system (CMS): Lets you create reusable environments by expressing packages, services, files and other components via a DSL. It also addresses cross-platform issues. Puppet, Chef, CFEngine, and Salt are examples. A mature CMS setup will give you context-aware infrastructure: your web server can automatically recognize the DB server, and the load balancer can automatically recognize your web servers. A mature CMS setup will also incorporate the notion of environments and versioned infrastructure, so that UAT can run app version 1.3 while production runs 1.1 and staging runs 2.0.
- Application deployers: Let you deploy your application. CM tools can do this too, but there are dedicated tools for it: Vlad the Deployer (Ruby), Capistrano (Ruby), Func: Fedora Unified Network Controller (Bash, Python) and Fabric (Python) come to mind. They also help you create ad hoc system automation. Most of these are essentially SSH in a loop (or run through GNU parallel).
- Orchestrator: Functions similarly to app deployers but incorporates middleware-like facilities such as async command dispatching. MCollective and Salt are examples. Both of them use middleware (Salt uses ZeroMQ, while MCollective can use any STOMP-compliant message broker) to broker 1>N, 1>1, N>M, and N>N dispatching, both real time and async. They can be used with platforms that don't have SSH and are massively scalable.
- Monitoring solution: Keep tabs on performance. There are 3 kinds of monitoring you'll need mostly.
System (disk, CPU load, memory)
Services (web server, DB server etc)
App (I use a Cucumber, or "cuke", script that checks how the whole app is working)
A good monitoring solution is one which easily integrates with all other infra services and lets you define metrics (app response time, free memory, cached memory etc). It can also include customizable notifications (email, Jabber, SMS etc) and escalations. How the tool charts your metrics is also very important for understanding trends. Reporting and event handlers are two important features here as well (use the event handlers in conjunction with the provisioner to get auto scaling features).
Examples are Nagios, Zabbix, Zenoss and many more. None of them are complete, but all of them can be complemented with plugins (for Graphite, for example, an awesome charting tool). Nagios has text-based configs and does not use any DB. It's easy to install, scales well, and is mature, hence integration with other apps is very easy.
- Log management and log analytics: For tracking your logs. Three parts again for this solution:
Forwarding: a client that will sit in every VM and forward the log to a central location. Options are rsyslog, syslog, syslog-ng, Graylog agents, Logstash, Splunk forwarders etc.
Gathering: a server that will accept all logs. Options are syslog-ng, rsyslog, Splunk, and Graylog2.
Analytics: In most cases you will be indexing your logs and searching them for particular patterns. Options are Graylog2 and Logstash (both use ElasticSearch as the engine) and Splunk (very powerful, very costly). A mature log management solution will let you set up alerts based on patterns (like failed transactions, 5xx and 4xx responses etc).
- Supervisors: They observe a service and take appropriate action to bring it back up whenever it goes down, bypassing the whole network monitoring > event handler loop. Supervisors are very helpful for a shaky service. Monit, Bluepill, and God (godrb) are some examples. A good supervisor has a low memory/CPU footprint, provides fast healing capacity, and offers a rich DSL for expressing a service state (like which port should be responsive, which process should be running, how to fix the process if it dies, or whether to alert when it takes an action etc).
- Security, Hardening, and Auditing Tools: Specialized tools for strengthening system security. Tools like Bastille ensure you have done the basic OS-level hardening. It can also assess your infrastructure and lock it down if needed. PSAD analyzes iptables logs to detect and automatically block intruders, while Snort inspects network traffic for intrusion signatures. Some of the CM tools like Puppet or Chef can be used to audit.
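To make the pattern-based alerting from the log analytics item a bit more concrete, here is a minimal Python sketch of the idea. It is not how Graylog2, Logstash, or Splunk implement alerts. The rule names, regexes, and thresholds are all hypothetical, and the status-code patterns assume an access log where the code follows the quoted request line.

```python
import re
from collections import Counter

# Hypothetical alert rules: name -> regex applied to each log line.
# The status patterns assume a common/combined style access log, where the
# HTTP status code follows the quoted request. Adjust to your own format.
ALERT_PATTERNS = {
    "server_error_5xx": re.compile(r'"\s(5\d{2})\s'),
    "client_error_4xx": re.compile(r'"\s(4\d{2})\s'),
    "failed_transaction": re.compile(r"transaction failed", re.IGNORECASE),
}

# Hypothetical thresholds: alert once a pattern matches at least this often.
THRESHOLDS = {
    "server_error_5xx": 2,
    "client_error_4xx": 3,
    "failed_transaction": 1,
}


def count_matches(lines):
    """Count how many lines match each alert pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in ALERT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def triggered_alerts(lines):
    """Return the sorted names of patterns whose count reached its threshold."""
    counts = count_matches(lines)
    return sorted(
        name for name, limit in THRESHOLDS.items() if counts[name] >= limit
    )


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "POST /pay HTTP/1.1" 502 0',
        '10.0.0.2 - - [01/Jan/2024:10:00:02 +0000] "POST /pay HTTP/1.1" 503 0',
        "app: Transaction failed for order 42",
    ]
    for alert in triggered_alerts(sample):
        print("ALERT:", alert)
```

In a real setup the lines would come from the central log server rather than a list, and the alert would go out through the same notification channels as your monitoring.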
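The supervisor behaviour described above (check that a port is responsive, fix the service if it is not, alert when an action is taken) can be sketched in a few lines of Python. This is only an illustration of the loop, not how Monit, Bluepill, or God work internally. The function names, the port 8080, and the placeholder restart command are assumptions made for the example.

```python
import socket
import subprocess
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def supervise_once(is_healthy, restart, notify):
    """Run one health check; restart and notify if the service is down.

    Returns True if a restart was attempted.
    """
    if is_healthy():
        return False
    restart()
    notify("service was down; restart attempted")
    return True


def supervise(is_healthy, restart, notify, interval=5.0, max_checks=None):
    """Check the service every `interval` seconds and heal it when needed.

    Runs forever when max_checks is None. Returns the number of restarts.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if supervise_once(is_healthy, restart, notify):
            restarts += 1
        checks += 1
        time.sleep(interval)
    return restarts


if __name__ == "__main__":
    # Hypothetical service on localhost:8080 with a placeholder restart
    # command. Replace both with whatever fits your own service.
    supervise(
        is_healthy=lambda: port_is_open("127.0.0.1", 8080),
        restart=lambda: subprocess.run(
            ["echo", "restarting my-service"], check=False
        ),
        notify=print,
        max_checks=3,
    )
```

Passing the check, the fix, and the alert in as plain callables is a rough stand-in for the DSL a real supervisor gives you: the loop stays the same while the service state you care about changes.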
Any other good additions to this list by category are welcome and encouraged in the comments!