Maintainable Servers with Docker

July 27, 2017 – tagged Docker, Admin

Setting up a Linux server to host some piece of software is easier than ever. More free software is available than ever before, it is easier to install than ever thanks to huge package repositories, and there is no lack of blog posts that explain “how to get X running in five minutes”. Potential problems in the configuration of this software aside, one challenge that the average user will most likely not be able to solve as easily is the continued maintenance of such systems: how do you get the latest security updates, how do you make sure your configuration still works after an upgrade, what do you do once the underlying OS does not receive updates any more, …?

This blog post does not describe a complete solution to this problem, but rather presents a set of related ideas and one possible approach to how it could be done.

The Challenge

Some years ago I was responsible for the complete IT infrastructure of a small company, including not only the web shop but everything: email, XMPP, ticket system, database, ERP, OpenLDAP, wiki, backup and so on. We had decided early on that, in order to achieve a high degree of integration between systems and to avoid having to worry about storing customer data in cloud services, we would host as much as possible ourselves. We rented a server, I installed Ubuntu 10.04 LTS on it, created a number of virtual machines, each running Ubuntu 10.04 LTS, and installed the software that we needed.

While I kept a detailed log of what I did on every VM, what software I installed and which settings I changed, the whole system turned into what Chad Fowler described as “one of the scariest things”: “a server that’s been running for ages which has seen multiple upgrades of system and application software. […] Cron jobs spring up in unexpected places, running obscure but critical functions that only one person knows about. Application code is deployed outside of the normal straight-from-source-control process.”

Luckily, the server ran fine for many years, but there are various issues with setups of this kind.

Transferability

At some point every server gets old, the software it runs gets old, performance degrades as data grows and grows, and one may want to switch to a more powerful system. Getting a new server is easy, but what about the OS it runs, the installed software packages, and the configuration? How do you migrate all the running software to the new hardware?

Software End of Life

With an Ubuntu LTS release there is relative peace of mind: an occasional apt-get upgrade (which can be automated simply by installing the unattended-upgrades package) installs all important security updates and bugfixes, while (unless PPAs are used) versions do not change so much that configuration changes become necessary. However, when that LTS release reaches the end of its support life, the running system must be upgraded to a new Ubuntu version, which is very likely to break many of the customizations that may have been made on the system.
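
On Debian and Ubuntu, enabling this is usually just a matter of installing the package and switching on the periodic job; a minimal sketch (the exact contents of the generated configuration file may vary between releases):

    # Install automatic security updates and enable the periodic job.
    sudo apt-get install unattended-upgrades
    sudo dpkg-reconfigure -plow unattended-upgrades

    # The second command writes /etc/apt/apt.conf.d/20auto-upgrades,
    # which contains roughly:
    #   APT::Periodic::Update-Package-Lists "1";
    #   APT::Periodic::Unattended-Upgrade "1";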

Maintainability

As the quote above describes nicely, such a system is very fragile and prone to breaking. Changing one thing may have strange effects in other places, and the administrator may become afraid of touching anything at all.

Restorability from Backup

Backup is another difficult topic. Sure, all the data (SQL databases etc.) may be backed up every day and can hopefully be restored. But configuration is a different beast. I do have /etc under version control, but is that enough? Aren't some config files just symlinks to /var/lib/something? Are permissions and owners preserved properly? And so on and so on. It would feel better if there were a simple way to rebuild a system from backup files with a couple of commands.
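
As a stopgap, a plain archive of /etc that records numeric owners and permissions at least removes some of that doubt; a minimal sketch (the backup path is made up, and this still does not cover files living outside /etc):

    # Git tracks content, but not file owners or special permission bits,
    # so archive /etc with numeric ownership information as well.
    tar -czf /var/backups/etc-$(date +%F).tar.gz --numeric-owner /etc

    # Restore with the same flag plus -p so permissions and owners survive:
    #   tar -xzpf etc-<date>.tar.gz --numeric-owner -C /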

Testability

Finally, every configuration change is like open-heart surgery. There is no testbed, no way to check whether a change does what it is supposed to do, or whether it is even syntactically correct. When building software, I am usually quite confident that the code I wrote does what it is supposed to do. Still, I add tests so that I can be sure it will still do so two years later, after components have been added, removed, and changed. Having that, plus the possibility to check the effect of a configuration change, would give much more confidence when changing things.

These five aspects are surely what full-time system administrators deal with every day, and I am sure that there are good solutions and best practices for each of them. However, everyone who hosts any software (be it WordPress or Postfix or anything else) on their own small server has these problems and is most likely not a full-time system administrator. So let's look at a couple of ideas that relate to the problem.

Some Related Concepts & Technologies

Upgrade Early, Upgrade Often

I used to like the Ubuntu LTS versions because of how long you can go without worrying about upgrading to a newer OS version and potentially breaking the system. Of course, on a developer machine you may occasionally want newer versions of some package (which can often be addressed by using PPAs), but on a server that is rarely the case once your software is deployed. However, when the five-year support period is over, good luck upgrading.

Quite a while ago I stumbled upon a blog post called long term support considered harmful by OpenBSD developer Ted Unangst. Its main message is that not all security bugfixes get backported, because they are often simply not identified as security bugfixes. (In the case of OpenBSD, there is also the problem that there are not enough developers to backport all security bugfixes to old versions; something that Ubuntu aims to do.) The point he makes is that, even for LTS Linux distributions, the assumption that you will receive all security bugfixes for the included software for years is simply wrong; that is, LTS does not work.

However, he also addresses a consequence of that: “Now on the one hand, this forces users to upgrade at least once per year. On the other hand, this forces users to upgrade at least once per year. Upgrades only get harder and more painful (and more fragile) the longer one goes between them.” And that is actually the point I find almost more interesting than the security aspects: with a long release cycle, it is virtually impossible to read through the release notes, learn about deprecated configuration options, changed file locations, and so on. With a short release cycle and appropriate tool support (for example, equery or dispatch-conf), upgrading a system and understanding what has changed and which configuration changes are required may actually become feasible.
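
To make that concrete, this is roughly what reviewing an upgrade looks like on a Gentoo system with the tools mentioned above (the nginx path is just an example):

    # Walk through pending configuration file updates one by one,
    # showing a diff for each and letting you merge or discard it.
    dispatch-conf

    # Find out which package a configuration file you are unsure about belongs to.
    equery belongs /etc/nginx/nginx.conf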

Memo: Maybe it is better to upgrade early and often rather than late and as rarely as possible.

Immutable Infrastructure

The term “Immutable Infrastructure” was coined by the above-mentioned Chad Fowler in the blog post Trash Your Servers and Burn Your Code: Immutable Infrastructure and Disposable Components, which was already quoted above. The core problem of reliable systems administration he identified is that it deals with manipulating state: it becomes difficult to reason about that state, to reproduce it, and to predict how a system will behave in the state it is believed to be in.

A similar issue exists in programming: when you have code that updates data structures in place, like calling obj.setFoobar() or list.remove(elem), or when you call obj.foo() and foo() changes the internal state of obj, then as the codebase grows bigger and older it becomes difficult to reason about the state of obj at a certain point in time, in particular if it has been passed around to various actors and everyone may hold a reference to it and do things with it. In certain areas this has led to a revival of functional programming languages such as Scala that embrace immutability. Instead of writing list.remove(elem) you would write something like val list2 = list.filterNot(_ == elem) and get a new list, while the old one stays untouched. It becomes a lot easier to reason about the state of list if it has never been changed after initialization.

Immutable Infrastructure is similar in the sense that you never change a system after it has been set up. If you need to change the behavior of a system, throw it away and set up a new one. Of course the only realistic way to do that is to have an automated way to set up a system, such as a long shell script or configuration management tools.

Note, however, that just using configuration management tools does not give you Immutable Infrastructure. For example, Ansible playbooks encourage an approach where you describe the target state of a system instead of the steps to get there (so in particular, executing a playbook should be an idempotent operation, i.e., calling it n times has the same effect as calling it once), but it is not trivial to maintain a playbook that works both for a system in state A and for a system in an earlier state A'. Just think of a requirement such as “I want the data directory of this PHP application to be in location X instead of Y”: the steps to achieve this may differ significantly depending on whether the application is running, has been running before, has already stored data in X, and so on. Covering all those situations in a single playbook becomes a big, big mess. In case of doubt, you will ignore earlier states A' and get into big trouble once you really need to set up a system from scratch using that playbook.
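
To illustrate the point, here is a small, entirely hypothetical Ansible excerpt (host group, paths, and owner are made up): it declares the new location X in an idempotent way, but says nothing about data that an older system already stored in Y.

    # Hypothetical playbook excerpt.
    - hosts: appservers
      tasks:
        - name: Ensure the data directory exists at the new location X
          file:
            path: /srv/app/data
            state: directory
            owner: www-data
            mode: "0750"
        # Running this twice is idempotent, but it will not move anything out
        # of the old location /var/lib/app/data (Y); that migration logic would
        # need extra tasks that only make sense for systems starting in state A'.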

Immutable Infrastructure employs an approach where you always set up your system from zero. Of course this requires a lot of thought about, for example, how to switch from a running system to a new one (cf. Blue-Green Deployment), how to deal with data as opposed to configuration (because your users will most likely not be happy to see their uploaded files disappear every time you redeploy, and you will most likely not be happy to add all uploaded files to your setup script), or how to actually describe your setup routine, but the idea itself is well worth thinking about.

Memo: Don't touch a running system, rather set up a new one and then switch over.

Docker

Docker is a set of tools to build and run containers (think “lightweight virtual machines” if you are not familiar with the term, although that is technically not correct) on Linux and various other operating systems. There are various aspects to that:

A “container” is actually a simple process, but it lives in its own namespaces for process IDs, users, and possibly network interfaces, devices, and so on, which allows for a high level of isolation between processes without the overhead of launching and managing a virtual machine. A container is launched from an “image” that is built from some base image containing a basic Linux environment and then extended either through a series of manual operations followed by a “commit” operation, or through a series of commands described in a so-called Dockerfile. Using the docker-compose tool, multiple containers can be launched together and linked to each other; for example, one container could run a PostgreSQL server and another one the web application that uses this database.
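
As a small, hypothetical example of the last point, a docker-compose.yml for the PostgreSQL-plus-web-application scenario could look roughly like this (the image tag, service names, ports, and credentials are assumptions, and ./web is expected to contain a Dockerfile for the application):

    version: "3"
    services:
      db:
        image: postgres:9.6              # official image, pinned to a version
        environment:
          POSTGRES_PASSWORD: example     # placeholder credentials
        volumes:
          - db-data:/var/lib/postgresql/data
      web:
        build: ./web                     # built from ./web/Dockerfile
        depends_on:
          - db
        ports:
          - "8080:8080"
    volumes:
      db-data: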

Now, Docker has a number of issues: the occasional product renames, breaking changes, and the whole idea of downloading binary images that someone else has created and running them on your machines (I love this rant about the container ecosystem from a sysadmin perspective). Running Docker in production also requires some experience and can be tricky. Maybe rkt is a better choice; I have never tried it. However, the idea of describing your environment in a plain text file, including all the steps you need to take to get from the base installation to a working system, and being able to run a copy of this environment wherever you want and as often as you want, is brilliant.

Besides being a natural building block for Immutable Infrastructure, describing your environment in Dockerfiles also provides a great way to test your components. Say you have a web application that sends notification emails. In many cases that functionality will already be covered by tests that launch a dummy SMTP server and check whether it receives the emails correctly. However, if you have a web application container and an SMTP server container, you can also launch both of them on your local development laptop (!) and test whether user name, password, port settings, etc. are all correct: you can test the configuration of your environment, and you can check whether everything still works if you change the Postfix settings, the sender address of your emails, and so on.
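
A very simple version of such a check can even be scripted around docker-compose itself; a sketch, assuming services named web and mail and that the web image ships a netcat binary:

    #!/bin/sh
    # Hypothetical smoke test: bring up the environment locally and check
    # that the application container can reach the SMTP container.
    set -e
    docker-compose up -d
    sleep 5                                   # crude wait for startup
    docker-compose exec -T web nc -z mail 25  # can web reach mail on port 25?
    docker-compose down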

Memo: Containers can be used to provide reproducible, portable, and testable builds of your application environment.

Putting the Pieces Together

The situation I described in the introduction is that you often want to host one or more applications on a server (whether it is a real server, a VPS, or an EC2 machine does not matter that much). The straightforward way I (and, I guess, many other people) did this in the past was something like:

  1. Get a server and install Ubuntu on it.
  2. apt-get install the software you want to use.
  3. Edit configuration files, restart any daemon if necessary.
  4. Test if it works manually.
  5. Set up a cron job to back up some files somewhere, wondering whether that is actually enough to restore the service.
  6. Install security upgrades regularly, but always wonder what you will do the day the OS support expires.

An approach that I think is more sustainable and, while introducing more overhead to get started, may reduce trouble in the long run looks as follows:

  1. Get a server and install some minimal Linux on it, just enough to run Docker. This can be an LTS release of some Linux distribution, maybe managed via Ansible or similar; nothing except Docker will run here, and you can easily throw it away and switch to a new server.
  2. Describe the services you want to run with Dockerfiles and a docker-compose.yml. Since one of the goals of this setup is not to rely on promises about security fixes being backported but rather to “minimize the amount of time old, unfixed code remains in the wild” (from Ted Unangst's post), it is recommended not to choose the distribution with the longest support window, but the one that makes upgrades between releases as simple as possible.
  3. Write tests (using serverspec or any other tool that you are familiar with) that launch your containers and test whether they can interact as intended.
  4. Write a backup script for the data only and test whether you can restore your services from that backup (a small sketch of such a script follows after this list).
  5. Deploy your Docker files and backup script to the server and make sure the containers are launched when the server boots.
  6. If there is a mechanism to install non-breaking updates in your containers (e.g., apt-get upgrade if it is configured not to install breaking updates), do so regularly. For distribution upgrades, change your Dockerfiles, test all changes locally, then redeploy them to your server.

On a conceptual level, this approach uses (1) Docker as a means to realize Immutable Infrastructure in order to achieve reproducibility, testability, and maintainability, and (2) frequent upgrades with small changesets to avoid getting stuck in a situation where an OS upgrade becomes too big a task to manage. There are still a lot of open questions about how to realize each step in the list above, and while I doubt there are best practices that suit everyone's use case, I will try to follow up on those questions in a future blog post.