After a lot of effort and communication, the system deployment finally works! To guarantee a smooth deployment at any time, the next step is to enforce a daily deployment test.
Surprisingly, the daily deployment doesn't always succeed as we expect, even when there are no major changes. Interestingly, many failed tests are false negatives. So what are the obstacles, and how can we avoid them?
Permanent Link: http://dennyzhang.com/false_negative
What Does False Negative Mean?
Ideally each test failure should be an improvement opportunity. But if you tend to respond to a certain failure with repetitive blind retries, we can call it a false negative. Why? It indicates that either you don't care about this failure or you have no way to improve it. Furthermore, false negatives bring two bad consequences: they take time to check and retry, and they break or pollute normal test results.
The success of the basic deployment logic is only the beginning. Besides constant changes initiated by the Dev team, there are multiple things you need to pay close attention to, if you're ambitious and want to deliver a reliable and smooth deployment.
Here are several typical false negatives from my first-hand experience.
Outage Of External Services
We try to download files from external websites, but they're in maintenance mode. We pull source code from GitHub/Bitbucket for build and deployment, but the service is temporarily down or unreachable. The same happens when Node.js needs to fetch community packages, or when Java needs to download jar modules from a public Nexus server: the external servers flip from time to time.
It's always good practice to replicate files to servers under our own control and serve them from there, instead of relying on 3rd-party websites. A useful first step is to detect all outbound traffic during deployment. Another improvement is to allow no hidden dependencies and make every dependency crystal clear.
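One way to find the hidden external dependencies is to capture outbound connection attempts while the deployment runs and summarize the destinations. A minimal sketch, assuming tcpdump is available and using a small helper to parse its one-line text output (the capture commands in the comment are illustrative):

```bash
#!/bin/bash
# list_outbound_hosts: extract the unique destination host.port values
# from tcpdump's one-line text output (fed via stdin). Each distinct
# destination is an external dependency of the deployment.
list_outbound_hosts() {
  awk '{print $5}' | sed 's/:$//' | sort -u
}

# Typical usage during a deployment run (requires root; "./deploy.sh"
# is a placeholder for your real deployment command):
#   sudo tcpdump -l -nn 'tcp[tcpflags] & tcp-syn != 0' > /tmp/outbound.txt &
#   ./deploy.sh
#   kill %1
#   list_outbound_hosts < /tmp/outbound.txt
```

Any destination in the summary that is not one of our own servers is a candidate to replicate internally.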
Always Download Latest Version
You may be familiar with actions like the below.
    # Install package
    package 'XXX' do
      action :install
    end

    # Download raw file from GitHub
    remote_file '/opt/devops/bin/backup_dir.sh' do
      source 'https://raw.githubusercontent.com/' \
             'XXX/backup_dir/master/backup_dir.sh'
      mode '0755'
      retries 3
      action :create_if_missing
    end
Quite often, changes in the latest version bring incompatibilities. This surprises our deployment test, or produces issues that are hard to detect and diagnose.
It's better to use a stable tag/branch/version instead of the head revision. For our own code, this is easy to enforce. For community and open-source code, however, the story is different.
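In shell terms, pinning means building the download URL from a fixed tag rather than master. A minimal sketch, reusing the placeholder repo "XXX/backup_dir" from the recipe above (the tag "v1.2.0" is illustrative):

```bash
#!/bin/bash
# pinned_url: build the raw.githubusercontent.com URL for a pinned
# tag of the (placeholder) XXX/backup_dir repo, instead of master.
pinned_url() {
  local tag="$1"
  printf 'https://raw.githubusercontent.com/XXX/backup_dir/%s/backup_dir.sh' "$tag"
}

# Typical usage -- download the known-good release, not the head:
#   curl -fsSL --retry 3 -o /opt/devops/bin/backup_dir.sh "$(pinned_url v1.2.0)"
#   chmod 0755 /opt/devops/bin/backup_dir.sh
```

Upgrading then becomes a deliberate, reviewable change of the tag, not a surprise from upstream.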
Low Hardware Resource
To better utilize test machines, we may keep lots of simultaneous test jobs running all the time. The machine may run into low memory, and this will fail our tests even though our code has nothing to do with it.
Even worse, the OS may run into an OOM (Out of Memory) issue. This may crash critical services like Jenkins or the DB, which demands human intervention. Or it runs into a kernel panic, which blocks us from ssh'ing in, and a machine reboot is our last resort.
To avoid this, we can add precheck logic before launching new test jobs: when the OS is low on hardware resources, stop launching any more jobs.
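A minimal sketch of such a precheck on Linux, reading available memory from /proc/meminfo and free disk from df; the thresholds and the test-job command are illustrative:

```bash
#!/bin/bash
# precheck: refuse to launch a new test job when the box is already
# low on memory or disk. Thresholds (in MB) are illustrative defaults.
precheck() {
  local min_mem_mb="${1:-512}" min_disk_mb="${2:-1024}"
  local free_mem free_disk
  free_mem=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
  free_disk=$(df -m / | awk 'NR==2 {print $4}')
  if [ "$free_mem" -lt "$min_mem_mb" ]; then
    echo "SKIP: only ${free_mem}MB memory free"; return 1
  fi
  if [ "$free_disk" -lt "$min_disk_mb" ]; then
    echo "SKIP: only ${free_disk}MB disk free"; return 1
  fi
  return 0
}

# Typical usage ("./run_test_job.sh" is a placeholder):
#   precheck 512 1024 && ./run_test_job.sh
```

A skipped job can be retried later; a job killed by the OOM killer costs a human investigation.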
Conflict Of Shared Resources
Two parallel jobs may fail when both want exclusive access to the same resources. Typically there are two major conflicts among deployment test jobs: 1. Running the same job in parallel 2. Running different jobs with shared resources.
Common shared resources are:
- Global Environments, like JDK versions or global variables
- Docker specific resources: container name, mounted volume, NAT TCP ports
- ssh private key files of different Jenkins jobs
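One common remedy is to serialize access with a file lock, so two jobs that need the same resource never run at the same time. A minimal sketch using flock(1); the lock-file path and test command are placeholders, and in practice you would use one lock file per shared resource (port range, Docker volume, ssh key, ...):

```bash
#!/bin/bash
# run_with_lock: run a command while holding an exclusive lock on a
# well-known lock file. A second job asking for the same lock waits
# (up to 600s here) instead of colliding.
run_with_lock() {
  local lockfile="$1"; shift
  (
    flock -w 600 9 || { echo "TIMEOUT waiting for $lockfile"; return 1; }
    "$@"
  ) 9>"$lockfile"
}

# Typical usage ("./deploy_test.sh" is a placeholder):
#   run_with_lock /var/lock/deploy-port-8080.lock ./deploy_test.sh
```

The lock is released automatically when the subshell exits, even if the job crashes.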
Run Tests On Unclean Envs
The deployment test is invalid, and the troubleshooting effort wasted, if the test env is not clean. For example, if "apt-get update" fails, "apt-get install" is doomed to fail as well. If people have deliberately removed some files or packages in advance, deployment may fail too.
It's better to perform tests on envs with a fresh start.
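A lightweight sketch of that idea, assuming a throw-away scratch directory gives the test enough isolation (a disposable container or VM is even stronger; the test command is a placeholder):

```bash
#!/bin/bash
# fresh_workdir: create a brand-new scratch directory for each test
# run, so leftovers from earlier runs can't poison the result.
fresh_workdir() {
  mktemp -d /tmp/deploy-test.XXXXXX
}

# Typical usage ("./deploy_test.sh" is a placeholder):
#   WORKDIR=$(fresh_workdir)
#   ./deploy_test.sh "$WORKDIR"
#   rm -rf "$WORKDIR"
# For stronger isolation, run the whole test inside a disposable
# container, e.g. "docker run --rm <image> ./deploy_test.sh".
```

Whatever the mechanism, the key property is that no state survives from one test run to the next.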
Slow Service Start
An application may take several minutes to start while performing system initialization or waiting for a DB cluster to come up. We need to wait and verify our assumptions before testing; otherwise we get a false alarm again. Tips: How To Avoid Blind Wait.
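Instead of a blind "sleep 300", we can poll until the service actually answers, with a hard timeout. A minimal sketch using bash's /dev/tcp redirection; host, port, and the follow-up command are placeholders:

```bash
#!/bin/bash
# wait_for_port: poll until a TCP service answers, instead of a blind
# fixed-length sleep. Gives up after the timeout (seconds) expires.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-300}"
  local waited=0
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    sleep 5; waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      echo "ERROR: ${host}:${port} not up after ${timeout}s"; return 1
    fi
  done
  echo "${host}:${port} is up after ${waited}s"
}

# Typical usage ("./run_db_tests.sh" is a placeholder):
#   wait_for_port 127.0.0.1 5432 600 && ./run_db_tests.sh
```

A fast service proceeds immediately, a slow one gets its full grace period, and a dead one fails loudly with a clear message rather than a mysterious downstream test error.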
Like our blog posts? Discuss with us on LinkedIn, Wechat, or our newsletter.