This year, we've been automating as much as we can, including network automation, server provisioning, and application deployments.
Every part has its own limitations and requirements. For network automation, we use Ansible because it has a flexible template mechanism and many third-party modules, it’s fast, and it really fits our needs. Servers are provisioned with Ansible too, but configurations are pushed using our first-class citizen, Chef, and some parts are handled with Consul + consul-template.
Why not run a single tool for all things? Because every tool has its own limitations:
- Ansible doesn’t have a mechanism to run itself periodically and it’s not as flexible as Chef;
- Chef doesn’t have an orchestration layer;
- Consul...okay, but who would be responsible for bootstrapping the Consul cluster and the other dependencies around it?
We make changes on GitHub by creating a pull request, and Jenkins pulls these changes. Before GitHub introduced its "Code better with Reviews" feature, we just used :+1: or :-1: in a comment to approve or reject changes. Thanks to this feature, we can do it more transparently. Our most maintained Chef repository has the "Require pull request reviews before merging" protection enabled, which means code review is mandatory.
After pulling changes from GitHub, Jenkins starts the build and does the rest. Some builds use multiple Jenkins slaves for specific tasks, like building LXC containers or Docker containers, which require different Linux distributions. As explained above, we use Jenkins for every kind of automation. In most cases, we have two or three different builds per GitHub repository:
- ansible-network—apply changes in production from the master branch;
- ansible-network-pr—bootstrap development environment and apply changes from pull request;
- ansible-repo—check the syntax for every playbook.
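A minimal sketch of what a syntax-check build like ansible-repo might run, assuming playbooks are the *.yml files at the top of the repository (that layout, and the helper names below, are assumptions for illustration, not the actual Jenkins job). It relies only on the standard `ansible-playbook --syntax-check` flag.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def syntax_check_command(playbook: Path) -> List[str]:
    """Build the command that checks one playbook without executing it."""
    return ["ansible-playbook", "--syntax-check", str(playbook)]


def find_playbooks(repo_root: Path, pattern: str = "*.yml") -> List[Path]:
    """Assumed layout: every top-level *.yml file is a playbook."""
    return sorted(repo_root.glob(pattern))


def check_all(playbooks: Iterable[Path]) -> List[Path]:
    """Run the syntax check for each playbook and return the ones that failed."""
    failed = []
    for playbook in playbooks:
        result = subprocess.run(syntax_check_command(playbook))
        if result.returncode != 0:
            failed.append(playbook)
    return failed
```

A build step could call `check_all(find_playbooks(Path(".")))` and fail the build if the returned list is not empty.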
Automation is the way we work at Hostinger. We don’t SSH into servers to make changes. We make changes locally on a personal laptop first, then push the code to GitHub as a pull request. SSH access to a server is necessary only for ad-hoc debugging; everything else is dynamically adjusted by Chef, so there is no point in logging in. Most of our user-facing servers are identical (depending on role), so we just pick a single one to verify the configuration quickly.
Before any new feature is deployed into our current stack, we have to think about automation first. We have two environments: development and production. Changes go to the development environment first, where we have more or less identical (virtual) infrastructure and changes are visible quickly after the merge. When everything is fine in the development environment, we are free to use the same cookbook versions in the production environment: we just open another pull request with the increased versions.
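The promotion step can be sketched as follows, assuming cookbook versions are pinned per environment in a mapping shaped like the `cookbook_versions` section of a Chef environment file. Whether the pins are actually stored that way is an assumption, and the cookbook names in the usage note are hypothetical.

```python
from typing import Dict, Tuple


def promote_versions(development: Dict[str, str], production: Dict[str, str]) -> Dict[str, str]:
    """Return production pins updated to whatever development currently pins.

    Only cookbooks already pinned in production are touched, so the resulting
    pull request contains nothing but version bumps.
    """
    promoted = dict(production)  # copy, so the input mapping is left unchanged
    for cookbook, constraint in development.items():
        if cookbook in promoted:
            promoted[cookbook] = constraint
    return promoted


def changed_pins(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """List the cookbooks whose pin differs, as (old, new) pairs for review."""
    return {
        name: (before[name], after[name])
        for name in after
        if name in before and before[name] != after[name]
    }
```

For example, with development pinning `{"nginx": "= 2.1.0", "php": "= 1.4.2"}` and production pinning `{"nginx": "= 2.0.3", "php": "= 1.4.2"}`, only the nginx pin changes, and that single bump is what the production pull request would contain.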
When we provision a new server, it is automatically detected by our monitoring platform, Prometheus, clusters are reconfigured according to the appropriate cluster size, and so on; nothing else has to be done manually. If someone breaks or changes something in configuration files, everything is reverted automatically by chef-client, which runs every seven minutes in the background. Some services need to react faster than every seven minutes, and consul-template helps here. We have one Consul cluster per region, and consul-template runs as a client where it’s needed. As an example, we use consul-template to regenerate upstreams for OpenResty, which requires near real-time operation.
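To illustrate the idea of regenerating upstreams, here is a rough Python sketch that renders an nginx-style upstream block from a list of service instances. It is not consul-template's own template syntax, and the `Instance` shape, the upstream name, and the addresses are hypothetical.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    address: str
    port: int
    healthy: bool


def render_upstream(name: str, instances: List[Instance]) -> str:
    """Render an upstream block that lists only the healthy instances."""
    servers = sorted(
        f"    server {i.address}:{i.port};" for i in instances if i.healthy
    )
    if not servers:
        # nginx rejects an upstream block without servers, so refuse to
        # render one and let the caller keep the previous configuration.
        raise ValueError(f"no healthy instances for upstream {name!r}")
    return "\n".join([f"upstream {name} {{", *servers, "}"]) + "\n"
```

Sorting the server lines keeps the output stable, so the file only changes (and the web server only needs a reload) when the set of healthy instances actually changes.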
Ansible is our primary tool for network automation. We use the Cumulus network operating system, which allows us to have a fully automated network where BGP neighbors, firewall rules, ports, bridges, etc. are reconfigured on changes. Nothing is changed directly on the switch. Cumulus has a virtual instance called Cumulus VX, which allows us to converge all changes before pushing them to production. A Jenkins build converges Cumulus VX locally, applies the Ansible playbook, and runs tests. If everything is fine, then we are happy, too. For instance, when we add a new node, Ansible automatically sees the changes in the Chef inventory by looking at LLDP attributes and regenerates the network configuration for the particular switch. If we want to add a new BGP upstream or firewall rule, we just create a pull request to our GitHub repo and everything is done automatically, including checking syntax and deploying changes to production. You can find more information about our network stack in this previous blog post.
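The LLDP-driven step can be pictured with a small sketch: group the nodes in the inventory by the switch and port they report as LLDP neighbors, then produce per-switch variables that a template could consume. The node dictionary layout and the names `ports_by_switch` and `switch_vars` are assumptions for illustration; the real inventory attributes may look different.

```python
from collections import defaultdict
from typing import Dict, List


def ports_by_switch(nodes: List[dict]) -> Dict[str, Dict[str, str]]:
    """Map switch name -> {switch port -> server name} from LLDP neighbor data."""
    topology: Dict[str, Dict[str, str]] = defaultdict(dict)
    for node in nodes:
        # Nodes without LLDP data are simply skipped.
        for neighbor in node.get("lldp", []):
            topology[neighbor["switch"]][neighbor["port"]] = node["name"]
    return dict(topology)


def switch_vars(switch: str, topology: Dict[str, Dict[str, str]]) -> dict:
    """Build the variables for one switch, with ports in a stable order."""
    ports = topology.get(switch, {})
    return {
        "switch": switch,
        "ports": [{"name": port, "neighbor": ports[port]} for port in sorted(ports)],
    }
```

When a new node appears in the inventory, only the switches it is attached to get different variables, so only their configuration is regenerated.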
Other Small, Yet Nice-to-Have Automation
We use Slack internally, and it’s common sense to do as much work as possible in the chat. We have automated Jenkins builds and GitHub hooks. For example, when a new issue or pull request is created, a notification is sent to the Slack channel. Or we can start a Jenkins build directly from a channel by typing ada j b 22, put a website into a sleeping state with ada sleep <url>, and so forth.
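A chat bot along these lines needs little more than a command parser. The sketch below handles the two commands mentioned above. It assumes that "j b" is shorthand for a Jenkins build and treats the trailing value as an opaque argument; both the interpretation and the action names are guesses for illustration, not the bot's real implementation.

```python
from typing import NamedTuple, Optional


class Command(NamedTuple):
    action: str
    argument: str


def parse_command(message: str, bot_name: str = "ada") -> Optional[Command]:
    """Turn a chat message into a Command, or None if it is not addressed to the bot."""
    words = message.split()
    if not words or words[0] != bot_name:
        return None
    rest = words[1:]
    # Assumed meaning: "j b <argument>" starts a Jenkins build.
    if len(rest) == 3 and rest[:2] == ["j", "b"]:
        return Command("jenkins_build", rest[2])
    # "sleep <url>" puts the given website into a sleeping state.
    if len(rest) == 2 and rest[0] == "sleep":
        return Command("sleep_site", rest[1])
    return None
```

Returning None for anything unrecognized means ordinary conversation in the channel is ignored rather than triggering an action.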
- You can automate 80% of tasks; it’s not necessary to cover 100%.
- You have more time to spend on more interesting tasks instead of copy-pasting the same thing across a fleet of servers.
- Knowledge sharing: all organization members are able to see, comment on, and make changes freely.
- There are no secrets between teammates; infrastructure as code is visible to everyone.
- A server that is not under automation costs more time and money than an automated one.