
Speed Up Ansible


A call to action from a frustrated developer! No, actually, it's more of a tutorial to show you how to speed up Ansible yourself. Probably much more useful.


Under the hood of the d2c.io service, we use Ansible a lot: from cloud VM creation and provisioning to Docker container and user app orchestration.

Ansible is a convenient tool that doesn't require complex setup because of its agentless nature (you don't need to preinstall any software, or agents, on managed hosts). In most cases, you use an SSH connection to configure servers. The price of this simplicity is speed. Depending on your environment and playbook workflow, Ansible can operate alarmingly slowly: it does all the logic locally, generates a task "package," sends it to the remote host, executes it, waits for the results, reads and analyzes them, and moves on to the next task. In this article, we describe several ways to increase that speed.

Test Methodology

If you can't measure it, you can't improve it. So we are going to write a small script that measures execution time.

Test playbook test.yml:

---
- hosts: all
  # gather_facts: no
  tasks:
    - name: Create directory
      file:
        path: /tmp/ansible_speed
        state: directory
    - name: Create file
      copy:
        content: SPEED
        dest: /tmp/ansible_speed/speed
    - name: Remove directory
      file:
        path: /tmp/ansible_speed
        state: absent


Time measurement script time_test.sh:

#!/bin/bash
# Calculate the mean average of wall clock time from multiple /usr/bin/time results.
# Credits to https://stackoverflow.com/a/8216082/2795592
cat /dev/null > time.log
for i in `seq 1 10`; do
    echo "Iteration $i: $@"
    /usr/bin/time -p -a -o time.log $@
    rm -rf /home/ubuntu/.ansible/cp/*
done
file=time.log
cnt=0
if [ ${#file} -lt 1 ]; then
    echo "you must specify a file containing output of /usr/bin/time results"
    exit 1
elif [ ${#file} -gt 1 ]; then
    samples=(`grep --color=never real ${file} | awk '{print $2}' | cut -dm -f2 | cut -ds -f1`)
    for sample in `grep --color=never real ${file} | awk '{print $2}' | cut -dm -f2 | cut -ds -f1`; do
        cnt=$(echo ${cnt}+${sample} | bc -l)
    done
    # Calculate the 'Mean' average (sum / samples).
    mean_avg=$(echo ${cnt}/${#samples[@]} | bc -l)
    mean_avg=$(echo ${mean_avg} | cut -b1-6)
    printf "\tSamples:\t%s \n\tMean Avg:\t%s\n\n" ${#samples[@]} ${mean_avg}
    grep --color=never real ${file}
fi


So we execute our playbook 10 times and take the mean execution time.

SSH multiplexing

LAN connection: before 7.68s, after 2.38s  

WAN connection: before 26.64s, after 10.85s  

The first thing to check is whether SSH multiplexing is enabled and used. It gives a tremendous speed boost because Ansible can reuse opened SSH sessions instead of negotiating a new one (actually, more than one) for every task. Ansible has this setting turned on by default. It can be set in the configuration file as follows:

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s


But be careful when overriding ssh_args: if you don't set ControlMaster and ControlPersist yourself, Ansible will "forget" to use them.

To check whether SSH multiplexing is used, start Ansible with the -vvvv option:

ansible test -vvvv -m ping


You should see required settings in the output:

SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s ... -o ControlPath=/home/ubuntu/.ansible/cp/7c223265ce


That is followed by setting up the multiplex master socket:

Trying existing master

Control socket "/home/ubuntu/.ansible/cp/7c223265ce" does not exist

setting up multiplex master socket


You can also check that socket files are present in ControlPath for 60 seconds after the connection (in our example, /home/ubuntu/.ansible/cp/7c223265ce).

Warning: if you work with several identical environments from one Ansible control host (for example, blue/green or stage/prod), be careful not to shoot yourself in the foot. Say you have checked something on production servers (e.g., executed configuration steps in check mode). You now have open master sockets that point to production servers. Then you decide to update your staging environment (which has the same host names as production), and boom! Your production is blown up. To prevent this, always close/clean master sessions when switching environments, or set a unique ControlPath for each environment.
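For example, a per-environment ansible.cfg can point at its own socket directory, so production and staging sockets can never collide (the directory name below is a hypothetical choice, and this assumes an Ansible version recent enough to support control_path_dir):

```
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
# Keep this environment's master sockets separate from other environments'
control_path_dir = ~/.ansible/cp-production
```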


Pipelining


LAN: before 2.38s, after 1.96s

WAN: before 10.85s, after 5.23s

Here is the default workflow of module execution:

  • Generate a Python file with the module and its parameters for remote execution

  • Connect via SSH to detect the remote user's home directory

  • Connect via SSH to create a temporary work directory

  • Connect via SSH to upload the Python file via SFTP

  • Connect via SSH to execute the Python file and clean up the temp dir

  • Get the module's result from SSH standard output

Now multiply this by the number of tasks and loop iterations in your playbook to imagine the overhead. This is a 100% effective way to execute different types of modules on a variety of target systems. But if you use only Ansible-native modules and modern target boxes, you can enable pipelining mode. Here's the parameter:

[ssh_connection]
pipelining = true


Here is the workflow with pipelining mode enabled:

  • Generate a Python file with the module and its parameters for remote execution

  • Connect via SSH to execute the Python interpreter

  • Send the Python file's content to the interpreter's standard input

  • Get the module's result from standard output

As a result: one SSH connection instead of four! The speed boost is significant, especially over WAN connections.

To check whether pipelining is in use, call Ansible with verbose output, for example:

ansible test -vvv -m ping


If you see several ssh calls:

SSH: EXEC ssh ...

SSH: EXEC ssh ...

SSH: EXEC sftp ...

SSH: EXEC ssh ... python ... ping.py


then pipelining is NOT in use. If there is a single `ssh` call:

SSH: EXEC ssh ... python && sleep 0


then pipelining is working.

By default, this setting is turned off in Ansible because of a possible conflict with the requiretty setting for sudo. At the time of writing this article, requiretty is disabled on recent Ubuntu and RHEL images in the Amazon EC2 cloud, so you can safely enable pipelining on these distributions.
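If your image does have it enabled, one way to relax requiretty for the automation account only is a sudoers drop-in (a sketch; the `deploy` username and file name are hypothetical, and sudoers files should always be edited via visudo):

```
# /etc/sudoers.d/ansible: allow the deploy user to sudo without a TTY
# so that Ansible pipelining can work
Defaults:deploy !requiretty
```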

PreferredAuthentications vs. UseDNS

LAN: before 1.96s, after 1.92s 

WAN: before 5.23s, after 4.92s 

UseDNS

UseDNS is an SSH server setting (in the /etc/ssh/sshd_config file) that forces the server to check the client's PTR record upon connection. It may cause connection delays, especially with slow DNS servers on the server side. In modern Linux distributions, this setting is turned off by default, which is correct.

PreferredAuthentications

This is an SSH client setting which informs the server about preferred authentication methods. By default, Ansible uses:

-o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey


So if GSSAPIAuthentication is enabled on the server (at the time of writing, it is turned on in the RHEL EC2 AMI), it will be tried first, forcing the client and server to make PTR record lookups. But in most cases, we want to use only public key authentication. We can force Ansible to do so by changing ansible.cfg:

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey


This eliminates unnecessary steps and speeds up the initial SSH master connection.

Facts Gathering

LAN: before 1.96s, after 1.47s 

WAN: before 4.92s, after 4.77s 

At the start of playbook execution, Ansible collects facts about the remote system (this is the default behavior for ansible-playbook, but not relevant to ansible ad-hoc commands). It is similar to calling the "setup" module, and thus requires another SSH communication step. If you don't need any facts in your playbook (e.g., our test playbook), you can disable fact gathering:

gather_facts: no


If you often run playbooks that depend on facts, but fact gathering slows your runs, consider setting up an external fact-caching backend. For example, you could define a Redis backend, collect facts with an hourly cron job, and disable fact gathering in your playbooks in favor of cached facts.
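A minimal sketch of such a setup in ansible.cfg, assuming a Redis instance running on the control host's default port:

```
[defaults]
# "smart" gathers facts only when they are not already cached
gathering = smart
fact_caching = redis
# cache lifetime in seconds
fact_caching_timeout = 3600
fact_caching_connection = localhost:6379:0
```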

WAN → LAN

Before 4.77s, after 1.47s 

Your WAN connection can have good bandwidth and latency, but a LAN connection is better. If you manage a multitude of hosts in, say, the Amazon EC2 eu-west-1 region, you can expect a significant speed boost if your Ansible control machine is also in that region. The rule of thumb is to move the control host closer to the managed systems.


Pull-mode

Before 1.47s, after 1.25s 

Need even more speed? Execute playbooks locally on the remote servers. There is the ansible-pull tool for that; you can read about it in the official Ansible docs. It works as follows:

  • Clone the specified repo into a local subdirectory

  • Execute the specified playbook with a local connection (the `-c local` option)

  • If the playbook name is omitted, try to execute:

      • <fqdn>.yml

      • <hostname>.yml

      • local.yml

One common workflow is to execute ansible-pull --only-if-changed as a cron job: it monitors the target repository and, if there is a change, executes the playbook.
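A sketch of such a crontab entry (the repository URL and the checkout/log paths are hypothetical):

```
# Every 15 minutes: fetch the repo and run its playbook only if it changed
*/15 * * * * ansible-pull --only-if-changed -U https://example.com/ops/playbooks.git -d /opt/ansible-pull >> /var/log/ansible-pull.log 2>&1
```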

Fork

Until now, we have discussed how to speed up playbook execution on a given remote host. But if you run a playbook against tens or hundreds of hosts, Ansible's internal performance becomes a bottleneck. For example, there's a preconfigured number of forks: the number of hosts that can be interacted with simultaneously. You can change this value in the ansible.cfg file:

[defaults]
forks = 20


The default value is 5, which is quite conservative. You can experiment with this setting depending on your local CPU and network bandwidth resources.

Another thing about forks: if you have a lot of servers to work with and a low number of available forks, your master SSH sessions may expire between tasks. Ansible uses the linear strategy by default, which executes one task for every host and then proceeds to the next task. If the time between task execution on the first server and on the last one is greater than ControlPersist, the master socket will expire by the time Ansible starts the following task on the first server, and a new SSH connection will be required.
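One mitigation is to raise ControlPersist well above the expected gap between tasks on any single host, for example (the 600s value is an illustrative choice):

```
[ssh_connection]
# keep idle master sockets alive for 10 minutes instead of 60 seconds
ssh_args = -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey
```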

Poll Interval

When a module is executed on a remote host, Ansible polls for its result. The lower the interval between poll attempts, the higher the CPU load on the Ansible control host. But we want to keep CPU available for a greater number of forks (see above). You can tweak the poll interval in ansible.cfg:

[defaults]
internal_poll_interval = 0.001


If you run "slow" jobs (like backups) on multiple hosts, you may want to increase the interval to 0.05 to use less CPU.

We hope this helps you speed up your setup. It seems there are no more items on the environment checklist, and further speed gains are only possible by optimizing your playbook code.



Published at DZone with permission of
