Speed Up Ansible
A call to action from a frustrated developer! No, actually, it's more of a tutorial to show you how to speed up Ansible yourself. Probably much more useful.
Under the hood of the d2c.io service, we use Ansible a lot: from cloud VM creation and provisioning to orchestrating Docker containers and user apps.
Ansible is a convenient tool that doesn't require complex setup because of its agentless nature (you don't need to preinstall any software or agents on managed hosts). In most cases, you use an SSH connection to configure servers. One of the downsides of this simplicity is speed. Depending on your environment and playbook workflow, Ansible can be alarmingly slow: it does all the logic locally, generates a task "package," sends it to the remote host, executes it, waits for the results, reads and analyzes them, and moves on to the next task. In this article, we describe several ways to increase that speed.
Test Methodology
If you can't measure it, you can't improve it. So, we are going to write a small script file for counting execution time.
Test playbook test.yml:
---
- hosts: all
  # gather_facts: no
  tasks:
    - name: Create directory
      file:
        path: /tmp/ansible_speed
        state: directory
    - name: Create file
      copy:
        content: SPEED
        dest: /tmp/ansible_speed/speed
    - name: Remove directory
      file:
        path: /tmp/ansible_speed
        state: absent
Time measurement script time_test.sh:
#!/bin/bash
# Calculate the mean wall-clock time from multiple /usr/bin/time results.
# Credits to https://stackoverflow.com/a/8216082/2795592
file=time.log
: > "${file}"   # start with an empty log
for i in $(seq 1 10); do
    echo "Iteration $i: $*"
    # -p: POSIX output format, -a: append, -o: write results to the log file
    /usr/bin/time -p -a -o "${file}" "$@"
    # Remove SSH control sockets so every iteration starts with a cold connection
    rm -rf /home/ubuntu/.ansible/cp/*
done
# Collect the "real" samples (seconds) written by /usr/bin/time -p
samples=($(grep --color=never real "${file}" | awk '{print $2}'))
if [ ${#samples[@]} -lt 1 ]; then
    echo "no /usr/bin/time results found in ${file}"
    exit 1
fi
cnt=0
for sample in "${samples[@]}"; do
    cnt=$(echo "${cnt}+${sample}" | bc -l)
done
# Calculate the mean (sum / samples) and keep the first 6 characters.
mean_avg=$(echo "${cnt}/${#samples[@]}" | bc -l)
mean_avg=$(echo "${mean_avg}" | cut -b1-6)
printf "\tSamples:\t%s \n\tMean Avg:\t%s\n\n" "${#samples[@]}" "${mean_avg}"
grep --color=never real "${file}"
So we execute our playbook 10 times and take the mean execution time.
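If you want to cross-check the averaging step outside of bash, the same calculation can be sketched in Python. This is only an illustration: it assumes the log holds `/usr/bin/time -p` output ("real", "user" and "sys" lines), and the function name and sample values are made up for the example.

```python
import re
from statistics import mean


def mean_real_time(log_text: str) -> float:
    """Return the mean of all 'real <seconds>' samples in /usr/bin/time -p output."""
    samples = [
        float(match.group(1))
        for match in re.finditer(r"^real\s+([0-9.]+)$", log_text, re.MULTILINE)
    ]
    if not samples:
        raise ValueError("no 'real' samples found")
    return mean(samples)


# Hypothetical log with two iterations (values are invented, not measurements).
example_log = "real 2.00\nuser 0.50\nsys 0.10\nreal 4.00\nuser 0.60\nsys 0.10\n"
print(mean_real_time(example_log))
```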
SSH multiplexing
LAN connection: before 7.68s, after 2.38s
WAN connection: before 26.64s, after 10.85s
The first thing to check is whether SSH multiplexing is enabled and used. This gives a tremendous speed boost because Ansible can reuse open SSH sessions instead of negotiating a new one (actually more than one) for every task. Ansible has this setting turned on by default. It can be set in the configuration file as follows:
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
But be careful when overriding ssh_args: if you don't set ControlMaster and ControlPersist while overriding, Ansible will "forget" to use them.
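A quick way to catch this mistake is to inspect the configuration before running anything. The sketch below is a hypothetical helper, not part of Ansible: it reads ansible.cfg text with Python's standard configparser and reports which of the two multiplexing options are missing from an overridden ssh_args value.

```python
import configparser

REQUIRED_OPTIONS = ("ControlMaster", "ControlPersist")


def missing_multiplexing_options(cfg_text: str) -> list:
    """Return the multiplexing options absent from an overridden ssh_args.

    An empty list means either ssh_args is not overridden (so Ansible's
    defaults apply) or the override still mentions both options.
    """
    # interpolation=None: ansible.cfg values may contain literal '%' characters.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    ssh_args = parser.get("ssh_connection", "ssh_args", fallback=None)
    if ssh_args is None:
        return []
    # Simple substring check; it does not validate the option values.
    return [option for option in REQUIRED_OPTIONS if option not in ssh_args]


risky = "[ssh_connection]\nssh_args = -o PreferredAuthentications=publickey\n"
print(missing_multiplexing_options(risky))
```

The check is deliberately naive: it only looks for the option names, so it is a reminder rather than a validator.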
To check whether SSH multiplexing is used, start Ansible with the -vvvv option:
ansible test -vvvv -m ping
You should see the required settings in the output:
SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s ... -o ControlPath=/home/ubuntu/.ansible/cp/7c223265ce
This is followed by setting up the multiplex master socket:
Trying existing master
Control socket "/home/ubuntu/.ansible/cp/7c223265ce" does not exist
setting up multiplex master socket
You can also check that socket files are present in ControlPath for 60 seconds after the connection (in our example: /home/ubuntu/.ansible/cp/7c223265ce).
Warning: if you work with several identical environments from one Ansible control host (for example, blue/green or stage/prod), be careful not to shoot yourself in the foot. Say you have checked something on production servers (e.g. executed configuration steps in check mode). Now you have open master sockets that point to production servers. Then you decide to update your staging environment (which has the same host names as production), and boom! Your production is blown up. To prevent this, always close or clean up master sessions when switching environments, or set a unique "ControlPath" setting for each environment.
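One way to keep the sockets of different environments apart is to give each environment its own socket directory. The sketch below assumes that your Ansible version honours the ANSIBLE_SSH_CONTROL_PATH_DIR environment variable (the counterpart of control_path_dir in the [ssh_connection] section); verify that against your version's documentation. The helper name and the environment names are hypothetical.

```python
import os


def env_for(environment: str, base_dir: str = "~/.ansible/cp") -> dict:
    """Return a copy of os.environ with a control-socket directory unique to `environment`."""
    env = dict(os.environ)
    # Assumption: Ansible reads this variable to decide where master sockets live.
    env["ANSIBLE_SSH_CONTROL_PATH_DIR"] = os.path.join(
        os.path.expanduser(base_dir), environment
    )
    return env


# The result can be passed as env=... to subprocess.run() when launching ansible-playbook.
print(env_for("staging")["ANSIBLE_SSH_CONTROL_PATH_DIR"])
```

With separate directories, a master socket opened while checking production cannot be picked up by a later staging run, even when the host names match.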
Pipelining
LAN: before 2.38s, after 1.96s
WAN: before 10.85s, after 5.23s
Here is the default workflow of module execution:
Generate Python-file with module and its parameters for remote execution
Connect via SSH to detect remote user home directory
Connect via SSH to create temporary work directory
Connect via SSH to upload Python-file via SFTP
Connect via SSH to execute Python-file and cleanup temp dir
Get module's result from SSH standard output
Now multiply this by the number of tasks and loop iterations in your playbook to imagine the overhead. This workflow reliably executes different types of modules on a variety of target systems. But if you use only native Ansible modules and modern target boxes, you can enable pipelining mode. Here's the parameter:
[ssh_connection]
pipelining = true
Here is the workflow with pipelining mode enabled:
Generate Python-file with module and its parameters for remote execution
Connect via SSH to execute Python interpreter
Send Python-file content to interpreter's standard input
Get module's result from standard output
As a result: one SSH connection instead of four! The speed boost is significant, especially over WAN connections.
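To get a feel for how this adds up across a playbook, here is a back-of-the-envelope sketch. It simply takes the four-versus-one connection counts from the workflows above and multiplies them by the number of module executions; the function name is hypothetical, and real runs involve other factors such as fact gathering and connection reuse.

```python
def ssh_invocations(module_runs: int, pipelining: bool = False) -> int:
    """Estimate SSH invocations for a number of module executions.

    Uses the counts from the workflows above: four connections per module
    run by default, one with pipelining enabled.
    """
    per_module = 1 if pipelining else 4
    return module_runs * per_module


# The test playbook has three tasks, each executing one module.
print(ssh_invocations(3))
print(ssh_invocations(3, pipelining=True))
```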
To check whether pipelining is in use, call Ansible with verbose output, for example:
ansible test -vvv -m ping
If you see several ssh calls:
SSH: EXEC ssh ...
SSH: EXEC ssh ...
SSH: EXEC sftp ...
SSH: EXEC ssh ... python ... ping.py
Then pipelining is NOT in use. If there is a single `ssh` call:
SSH: EXEC ssh ... python && sleep 0
Then pipelining is working.
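The same check can be automated for a single ad-hoc module run against one host. The sketch below assumes the verbose output marks each connection with "SSH: EXEC", as in the samples above; the exact log format may differ between Ansible versions, and the function name is hypothetical.

```python
def looks_pipelined(verbose_output: str) -> bool:
    """Guess whether one module run on one host used pipelining.

    Counts the 'SSH: EXEC' lines: a single ssh call (and no sftp upload)
    suggests that pipelining is in use.
    """
    exec_lines = [line for line in verbose_output.splitlines() if "SSH: EXEC" in line]
    if not exec_lines:
        raise ValueError("no 'SSH: EXEC' lines found; was Ansible run with -vvv?")
    return len(exec_lines) == 1 and "sftp" not in exec_lines[0]


without_pipelining = (
    "SSH: EXEC ssh ...\n"
    "SSH: EXEC ssh ...\n"
    "SSH: EXEC sftp ...\n"
    "SSH: EXEC ssh ... python ... ping.py\n"
)
print(looks_pipelined(without_pipelining))
```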
By default, this setting is turned off in Ansible because of a possible conflict with the requiretty setting for sudo. At the time of writing, requiretty is disabled on recent Ubuntu and RHEL images in the Amazon EC2 cloud, so you can safely enable pipelining on these distributions.
PreferredAuthentications vs. UseDNS
LAN: before 1.96s, after 1.92s
WAN: before 5.23s, after 4.92s
UseDNS
UseDNS is an SSH-server setting (/etc/ssh/sshd_config file) which forces the server to check the client's PTR record upon connection. It may cause connection delays, especially with slow DNS servers on the server side. In modern Linux distributions, this setting is turned off by default, which is correct.
PreferredAuthentications
This is an SSH-client setting which informs the server about preferred authentication methods. By default, Ansible uses:
-o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
So if GSSAPIAuthentication is enabled on the server (at the time of writing, it is turned on in the RHEL EC2 AMI), it will be tried as the first option, forcing the client and server to make PTR-record lookups. But in most cases, we want to use only public key auth. We can force Ansible to do so by changing ansible.cfg:
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey
This eliminates unnecessary steps and speeds up the initial SSH master connection.
Facts Gathering
LAN: before 1.96s, after 1.47s
WAN: before 4.92s, after 4.77s
At the start of playbook execution, Ansible collects facts about the remote system (this is the default behaviour for ansible-playbook, but it does not apply to ansible ad-hoc commands). It is similar to calling the "setup" module and thus requires another SSH communication step. If you don't need any facts in your playbook (as in our test playbook), you can disable fact gathering:
gather_facts: no
If you often run playbooks that depend on facts, but fact gathering slows your runs, consider setting up an external fact-caching backend. For example, you can define a Redis backend, collect facts with an hourly cron job, and disable fact gathering in your playbooks in favor of cached facts.
WAN → LAN
Before 4.77s, after 1.47s
Your WAN connection can have good bandwidth and latency, but a LAN connection is better. If you manage a multitude of hosts in, say, the Amazon EC2 eu-west-1 region, you can expect a significant speed boost if your Ansible control machine is also in that region. The rule of thumb is to move the control host closer to the managed systems.
Pull-mode
Before 1.47s, after 1.25s
Need even more speed? Execute playbooks locally on remote servers. There is the ansible-pull tool for that. You can read about it in the official Ansible docs. It works as follows:
Clones the specified repo into a local subdirectory
Executes the specified playbook with a local connection (`-c local` option)
If the playbook name is omitted, tries to execute, in order:
<fqdn>.yml
<hostname>.yml
local.yml
One of the workflows is to execute ansible-pull --only-if-changed as a cron job: it will monitor the target repository and, if there is a change, execute the playbook.
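The fallback order for an omitted playbook name can be illustrated with a few lines of Python. This is a sketch of the lookup order listed above, not ansible-pull's actual code; the function name and the example host name are hypothetical.

```python
import socket
from typing import List, Optional


def pull_playbook_candidates(fqdn: Optional[str] = None) -> List[str]:
    """Return the playbook names tried, in order, when none is specified."""
    fqdn = fqdn or socket.getfqdn()
    hostname = fqdn.split(".")[0]  # short host name without the domain part
    return [f"{fqdn}.yml", f"{hostname}.yml", "local.yml"]


print(pull_playbook_candidates("web1.example.com"))
```

This ordering lets you keep host-specific playbooks in the same repository next to a generic local.yml that every other host falls back to.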
Fork
Up to this point, we have discussed how to speed up playbook execution on a given remote host. But if you run a playbook against tens or hundreds of hosts, Ansible's internal performance becomes a bottleneck. For example, there is a preconfigured number of forks: the number of hosts Ansible can work with simultaneously. You can change this value in the ansible.cfg file:
[defaults]
forks = 20
The default value is 5, which is quite conservative. You can experiment with this setting depending on your local CPU and network bandwidth resources.
Another thing about forks: if you have a lot of servers to work with and a low number of available forks, your master SSH sessions may expire between tasks. Ansible uses the linear strategy by default, which executes one task on every host and then proceeds to the next task. So if the time between task execution on the first server and on the last one is greater than ControlPersist, the master socket will expire by the time Ansible starts executing the following task on the first server, and a new SSH connection will be required.
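A rough estimate shows when this starts to hurt. The sketch below uses a simplified model in which hosts are processed in batches of `forks` and every batch takes the same time; the function names and all numbers are hypothetical, so treat the result as an order-of-magnitude hint only.

```python
import math


def idle_gap_seconds(hosts: int, forks: int, seconds_per_task: float) -> float:
    """Estimate how long the first host waits between two consecutive tasks.

    Simplified model: hosts run in batches of `forks`, and each batch takes
    `seconds_per_task`. The first batch idles while the other batches run.
    """
    batches = math.ceil(hosts / forks)
    return (batches - 1) * seconds_per_task


def master_socket_expires(hosts: int, forks: int, seconds_per_task: float,
                          control_persist: float = 60) -> bool:
    """True if the estimated idle gap exceeds ControlPersist (in seconds)."""
    return idle_gap_seconds(hosts, forks, seconds_per_task) > control_persist


# 500 hosts, one second per task: compare the default 5 forks with 20 forks.
print(master_socket_expires(500, 5, 1.0))
print(master_socket_expires(500, 20, 1.0))
```

In this model, you can avoid the extra connections either by raising forks or by choosing a longer ControlPersist value.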
Poll Interval
When a module is executed on a remote host, Ansible starts to poll for its result. The shorter the interval between poll attempts, the higher the CPU load on the Ansible control host. But we want to keep CPU available for a greater number of forks (see above). You can tweak the poll interval in ansible.cfg:
[defaults]
internal_poll_interval = 0.001
If you run "slow" jobs (like backups) on multiple hosts, you may want to increase the interval to 0.05 to use less CPU.
Hope this helps you speed up your setup. It seems there are no more items on the environment checklist, and further speed gains are only possible by optimizing your playbook code.
Published at DZone with permission of Konstantin Suvorov. See the original article here.