In DevOps, installation is one of our major tasks. Package installation looks pretty straightforward and easy now: you simply run commands like apt-get, yum, or brew, or just leave it all to containers.
Is it really that easy? Maybe not. Here is a list of headaches and hidden costs. Discuss with us, DevOps gurus!
Admit It. We All Have Unexpected Installation Failures.
We have wrapped up multiple scripts which install and configure all required services and components. The test looks good: services are running correctly and the GUI opens nicely. It feels just great. Maybe we're even a bit proud of our achievement. Shouldn't we be?
Now more and more people are starting to use our code to do the deployment. That's when the real fun comes. Oh, yes, surprises and embarrassments, too. Package installation fails with endless issues. The process mysteriously gets stuck somewhere with few clues as to why, or the installation itself seems to be fine but the system just doesn't behave the same as in our testing environments.
At first, people won't complain. They understand; it happens.
However, as the issues pile up, the smell changes. You feel the pressure! You tell yourself the failure won't and shouldn't happen again, but do you really have 100% confidence?
Your boss and colleagues have their concerns, too. This part hasn't changed that much and the task seems quite straightforward. Why does it take so long, and how much longer will you need to stabilize the installation?
Does this situation sound familiar? It's exactly how I've felt in the past years. So, what are the moving parts and obstacles in terms of system installation? We want to deliver the installation feature quickly, and it has to be reliable and stable.
Problem 1: Tools in Rapid Development and Complicated Package Dependencies With Incompatible Versions
Linux is powerful because it follows the philosophy of simplicity: each tool serves one simple purpose. Then we combine different tools into bigger ones for bigger missions.
That's so-called integration. Yeah, the integration!
If we only integrate stable and well-known tools, we're in luck; things will probably go smoothly. Otherwise, the situation is quite different.
A tool under rapid development usually means issues, limitations, and workarounds.
Even worse, the error messages can be confusing. Check out the Chef error below. How could we guess at first sight that it's a locale issue, not a bug?
Installing yum-epel (0.6.0) from https://supermarket.getchef.com ([opscode] https://supermarket.chef.io/api/v1)
Installing yum (3.5.3) from https://supermarket.getchef.com ([opscode] https://supermarket.chef.io/api/v1)
/var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `encode': "\xC2" on US-ASCII (Encoding::InvalidByteSequenceError)
	from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `initialize'
	from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `new'
	from /var/lib/gems/1.9.1/gems/json-1.8.2/lib/json/common.rb:155:in `parse'
	from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook/metadata.rb:473:in `from_json'
	from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook/metadata.rb:29:in `from_json'
	from /var/lib/gems/1.9.1/gems/ridley-4.1.2/lib/ridley/chef/cookbook.rb:36:in `from_path'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cached_cookbook.rb:15:in `from_store_path'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:86:in `cookbook'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:67:in `import'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/cookbook_store.rb:30:in `import'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/installer.rb:106:in `block in install'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:38:in `block in download'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:35:in `each'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/downloader.rb:35:in `download'
	from /var/lib/gems/1.9.1/gems/berkshelf-3.2.3/lib/berkshelf/installer.rb:105:in `install'
	from /var/lib/gems/1.9.1/gems/celluloid-0.16.0/lib/celluloid/calls.rb:26:in `public_send'
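The root cause here is the shell locale, not the cookbooks: with no UTF-8 locale set, Ruby parses the metadata as US-ASCII and chokes on a multi-byte character. A common fix is to export a UTF-8 locale before running the tool; this sketch assumes `en_US.UTF-8` has already been generated on the host (on Debian/Ubuntu, `sudo locale-gen en_US.UTF-8`):

```shell
# Assumption: the en_US.UTF-8 locale exists on this machine.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

# Verify what the current shell will hand to child processes.
locale
```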
Issues caused by incompatible versions frequently happen in system integration. Usually, using the latest released version of every tool works, but not always. Sometimes our development team has its own preferences, which makes things a bit more complicated.
We see issues like the one below constantly. Yes, I know: I need to upgrade Ruby, Python, or whatever. It just takes time. Unplanned work, again.
sudo gem install rack -v '2.0.1'
ERROR:  Error installing rack:
	rack requires Ruby version >= 2.2.2.
Tip: Record the exact version of every component, including the OS. After a successful deployment, I usually dump versions automatically via the trick listed in another post of mine, Compare Difference of Two Envs.
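As a minimal sketch of that idea (the output path and exact commands are my choices for illustration, not from the original post), a post-deploy hook can snapshot the environment like this:

```shell
#!/bin/sh
# Snapshot OS and package versions into a baseline file after a
# successful deployment, so a broken environment can be diffed later.
out="/tmp/versions-baseline.txt"
{
  echo "== kernel =="
  uname -a
  echo "== distro =="
  cat /etc/os-release 2>/dev/null || true
  echo "== packages =="
  # Use whichever package manager exists; ignore the one that doesn't.
  dpkg -l 2>/dev/null || rpm -qa 2>/dev/null || true
} > "$out"
echo "Wrote $out"
```

Run the same script on a failing host and diff the two files to spot drifted versions.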
Problem 2: Every Network Request Is a Vulnerable Failing Point
It's quite common for an installation to run commands like apt-get/yum or curl/wget, which launch outgoing requests.
Well, watch out for any network requests, my friends.
- The external server may run into a 5XX error, time out, or be slower than before.
- Files are removed from the server, which results in an HTTP 404 error.
- The corporate firewall blocks the requests out of concern about security or data leaks.
Each outgoing network request is a failure point. Consequently, our deployment fails or suffers.
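One cheap mitigation is to wrap every network call in bounded retries with backoff. The `retry` helper below is a sketch of my own, not a standard tool, and the download URL in the usage comment is a placeholder:

```shell
#!/bin/sh
# retry MAX CMD [ARGS...] -- rerun CMD until it succeeds or MAX attempts
# are exhausted, sleeping a little longer between each attempt.
retry() {
  max="$1"; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "retry: giving up after $max attempts: $*" >&2
      return 1
    fi
    sleep "$n"   # linear backoff: 1s, 2s, 3s, ...
    n=$((n + 1))
  done
}

# Usage (placeholder URL):
#   retry 5 curl -fsSL -o pkg.tar.gz https://example.com/pkg.tar.gz
```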
Tip: Replicate as much as possible on servers under your control, e.g., a local HTTP server, an apt repo server, etc.
People might try to pre-cache all internet downloads by building customized OS images or Docker images. This is meaningful for installations with no network access, but it comes at a cost: things are now more complicated, and it takes a significant amount of effort.
Tip: Record all outgoing network requests during deployment. Yes, the issue is still there, but this gives us valuable input on what to improve or what to check. Tracking requests can be done easily; learn more in another post of mine, Monitor Outbound Traffic in Deployment.
Problem 3: Always Installing the Latest Versions Could Be Troublesome
People quite often install packages like this:
apt-get -y update && \
  apt-get -y install ruby
What version will we get? Today, we get Ruby 1.9.5, but months later it could be Ruby 2.0.0 or 2.2.2. You do see the potential risk, don't you?
Tip: Only install packages with fixed versions.
|Tool|Unpinned|Pinned|
|---|---|---|
|Ubuntu|apt-get install docker-engine|apt-get install docker-engine=1.12.1-0~trusty|
|CentOS|yum install kernel-debuginfo|yum install kernel-debuginfo-2.6.18-238.19.1.el5|
|Ruby|gem install rubocop|gem install rubocop -v "0.44.1"|
|Python|pip install flake8|pip install flake8==2.0|
|Node.js|npm install express|npm install express@&lt;version&gt;|
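Pinning only helps if drift is caught. A tiny guard in the deploy script can fail fast when the installed version is not the pinned one; `assert_version` below is a hypothetical helper of mine, not part of any tool above:

```shell
#!/bin/sh
# assert_version CMD EXPECTED -- fail fast when CMD's reported version
# does not contain the pinned version string.
assert_version() {
  cmd="$1"; expected="$2"
  actual="$("$cmd" --version 2>/dev/null | head -n 1)"
  case "$actual" in
    *"$expected"*) echo "OK: $cmd is $expected" ;;
    *)
      echo "FAIL: $cmd reports '$actual', expected '$expected'" >&2
      return 1
      ;;
  esac
}

# Usage right after installation, e.g.:
#   assert_version ruby 2.2.2
```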
Problem 4: Avoid Installation From a Third-Party Repo
Let's say we want to install HAProxy 1.6, but the official Ubuntu repo only provides HAProxy 1.4 or 1.5. So we finally find a nice snippet like this:
sudo apt-get install software-properties-common
add-apt-repository ppa:vbernat/haproxy-1.6
apt-get update
apt-get dist-upgrade
apt-get install haproxy
It works like a charm, but wait; does this really put an end to the problem? Not quite.
The availability of a third-party repo is usually lower than that of the official repo.
---- Begin output of apt-key adv --keyserver keyserver.ubuntu.com --recv 1C61B9CD ----
STDOUT: Executing: gpg --ignore-time-conflict --no-options --no-default-keyring --homedir /tmp/tmp.VTYpQ40FG8 --no-auto-check-trustdb --trust-model always --keyring /etc/apt/trusted.gpg --primary-keyring /etc/apt/trusted.gpg --keyring /etc/apt/trusted.gpg.d/brightbox-ruby-ng.gpg --keyring /etc/apt/trusted.gpg.d/oreste-notelli-ppa.gpg --keyring /etc/apt/trusted.gpg.d/webupd8team-java.gpg --keyserver keyserver.ubuntu.com --recv 1C61B9CD
gpgkeys: key 1C61B9CD can't be retrieved
STDERR: gpg: requesting key 1C61B9CD from hkp server keyserver.ubuntu.com
gpg: no valid OpenPGP data found.
gpg: Total number processed: 0
---- End output of apt-key adv --keyserver keyserver.ubuntu.com --recv 1C61B9CD ----
A third-party repo is also the most likely to change. Now you get 1.6.5 and are happy with that, but suddenly, days later, it starts to install 1.6.6 or 1.6.7. Surprise!
Tip: Avoid third-party repos as much as possible. If there's no way around it, track and examine the installed version closely.
Problem 5: Installation by Source Code Could Be Painful
If we can install directly from source code, it's much more reliable. However, there are a few problems with this.
- It's usually harder. Try building Linux from scratch and you will feel the disaster and mess: too many weird errors, missing packages, conflicting versions, etc. It feels like flying a plane without a manual.
- Compiling from source takes much longer. For example, compiling Node.js can take around 30 minutes, while apt-get only takes seconds.
- Service management facilities are missing. We want to manage a service via service XXX status/stop/start and configure it to autostart. With a source code installation, this might be missing.
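For the last point, the gap can be closed by writing a service unit by hand. The sketch below targets systemd hosts; the service name `myapp` and the binary path are placeholders of mine, not from the article:

```shell
#!/bin/sh
# Write a minimal systemd unit for a source-built binary, so that
# systemctl start/stop/status and autostart work as with packaged software.
cat > /tmp/myapp.service <<'EOF'
[Unit]
Description=myapp (installed from source)
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Then, as root:
#   install -m 644 /tmp/myapp.service /etc/systemd/system/myapp.service
#   systemctl daemon-reload && systemctl enable --now myapp
```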
Do Containers Cure the Pain?
Nowadays, more and more people are using containers to avoid installation failures, and yes, this largely reduces failures for end users.
But it doesn't solve the problem completely, especially for DevOps. We're the ones who provide the Docker image, right?
To build images from a Dockerfile, we still face the five common failures listed above. In short, containers shift the failure risks from the actual deployment to the image build process.
Further reading: 5 Tips for Building Docker Image.
Bringing It All Together
Improvement suggestions for package installation:
- List all versions and hidden dependencies.
- Replicate as many services as possible on servers under your control.
- Monitor all external outgoing traffic.
- Only install packages with fixed versions.
- Try your best to avoid third-party repos.
Containers greatly reduce installation failures, but for DevOps people like us, all of the above possible failures still need to be dealt with in the image build process.
More reading: How to Check Linux Process Deeply With Common Sense.