This article is an overview of how you can manage Docker-like containers using systemd.
As you may know, the famous container manager Docker is not the only container technology. It was originally based on LXC, though it embeds additional functionality; mostly, it provides a high-level API for container management. And it has actually been made possible by a bunch of quite recent kernel features such as namespaces, control groups and union filesystems.
As I was reading Lennart Poettering's excellent series of articles about systemd for administrators, I found out that systemd is actually able to handle its own containers.
This is not so surprising, since systemd is close to the kernel and so provides an abstraction over many of its features. But anyway, I did not know it was so easy to use!
I present in this article the basics of systemd-nspawn, the enhanced chroot that systemd brought us to build containers.
Our first container
Let's start with a practical example, directly inspired by one of Poettering's articles. We are going to run a simple Debian inside the most basic container.
Set up files
We use debootstrap to pull a minimal Debian onto our filesystem. This command is available on most Linux distributions (well, I checked for apt, yum and yaourt, using the AUR).
debootstrap --arch=amd64 unstable debian-tree/
It downloads Debian unstable files into the local debian-tree directory.
NB: I use Debian as an example, but using yum or pacstrap, you can install Fedora or Arch just as easily.
Run container
Just run systemd-nspawn (possibly as root):
systemd-nspawn --directory=debian-tree/
NB: You can use -D instead of --directory=.
It performs some kind of chroot and opens up a shell inside the hosted system. It not only isolates the local filesystem from its host, but also abstracts its interfaces. It is much more powerful than a simple chroot.
NB: By default, /bin/sh is called inside the container, but you can specify the command you want systemd to run in the container as an extra argument. In that case, arguments specified after the command are forwarded to it and not processed by systemd-nspawn.
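For instance, assuming coreutils is installed in the guest tree, the following runs a single command with its arguments inside the container and exits (a minimal sketch):
# Run ls inside the container; -l and /etc are forwarded to ls, not to systemd-nspawn
systemd-nspawn -D debian-tree/ /bin/ls -l /etc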
But actually we just mounted the Debian filesystem, and nothing more. If you call ps, you will not see any process but the shell (even if you are root).
root@debian-tree:~# ps
PID TTY TIME CMD
1 ? 00:00:00 bash
6 ? 00:00:00 ps
The Debian OS has not really started. Before booting it, think about setting a password for your root user; you will need it later:
root@debian-tree:~# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Booting container
Call systemd-nspawn with the --boot option (or -b). Alternatively, you can ask systemd-nspawn to run /sbin/init:
systemd-nspawn -D debian-tree/ --boot
# or
systemd-nspawn -D debian-tree/ /sbin/init
You should see boot messages, with lines beginning with [ OK ]. Also, if you are hosting a distribution relying itself on systemd, e.g. a recent release of Debian, you will see interesting messages at early boot:
Spawning container debian-tree on /your/current/path/debian-tree.
Press ^] three times within 1s to kill container.
systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR)
Detected virtualization 'systemd-nspawn'.
Detected architecture 'x86-64'.
[...]
You can notice that the hosted systemd detects that it has been run inside a systemd-nspawn container! You can use systemd-detect-virt at any time to get this information.
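For instance, from within the container, it should report the nspawn backend:
guest$ systemd-detect-virt
systemd-nspawn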
Also note that you can escape the container (and so kill it) by pressing ^] (Control plus ]) three times quickly, which can be useful, especially if you forgot to set a root password.
Managing containers
Docker provides a nice API to manage your containers. While systemd's is not as exhaustive, it also has an interface for this, through the machinectl command (as in "machine control").
$ machinectl
MACHINE CONTAINER SERVICE
debian-tree container nspawn
1 machines listed.
You can get more information about one particular container using machinectl status <machine>:
$ machinectl status debian-tree
debian-tree
Since: mar. 2015-04-07 08:46:37 CEST; 55min ago
Leader: 24553 (systemd)
Service: nspawn; class container
Root: /home/system/local/debian-tree
Address: 192.168.1.108
fe80::2ae3:47ff:fe04:f134
OS: Debian GNU/Linux 8 (jessie)
Unit: machine-debian\x2dtree.scope
├─24553 /lib/systemd/systemd
└─system.slice
├─cron.service
│ └─24618 /usr/sbin/cron -f
├─systemd-journald.service
│ └─24571 /lib/systemd/systemd-journald
├─console-getty.service
│ ├─14910 man machinectl
│ ├─14921 pager -s
│ ├─24625 /bin/login --
│ └─24656 -bash
└─rsyslog.service
└─24620 /usr/sbin/rsyslogd -n
NB: If you want to process machinectl status output programmatically, please consider using machinectl show instead. It has been designed with this goal in mind.
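For instance, machinectl show prints the machine properties as key=value pairs, which are easy to parse (output trimmed; the exact keys may vary with your systemd version):
$ machinectl show debian-tree
Name=debian-tree
Class=container
Service=nspawn
[...]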
If the container's contents support it, you can use machinectl reboot, machinectl poweroff or machinectl login, for example. You can always use machinectl terminate to simply kill the whole container, whatever is running inside it.
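For instance, assuming the debian-tree container is running:
# Ask the container's init system to shut down cleanly
machinectl poweroff debian-tree
# Or kill the whole container immediately
machinectl terminate debian-tree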
About security
It is well known that LXC containers are still not secure enough to be used as absolute jails. This is due to possible exploits of kernel weaknesses that exist because container technologies have been introduced only recently. If you need full isolation, consider using a solution such as [KVM].
And as you can see using systemd-cgtop (systemd's real-time cgroup visualizer), some information about the host system leaks into the guest machine:
On the guest
Path Tasks %CPU Memory Input/s Output/s
/ - 144.6 2.3G 0B 94.5K
/machine.slice - 1.5 30.9M - -
/machine.sli...e-debian\x2dtree.scope 7 1.5 12.3M - -
/system.slice - - 71.0M - -
On the host
Path Tasks %CPU Memory Input/s Output/s
/ - 144.6 2.3G 0B 94.5K
/machine.slice - 1.5 30.9M - -
/machine.sli...e-debian\x2dtree.scope 7 1.5 12.3M - -
/system.slice - - 71.0M - -
/system.slice/NetworkManager.service 2 - - - -
/system.slice/accounts-daemon.service 1 - - - -
/system.slice/avahi-daemon.service 2 - - - -
/system.slice/bluetooth.service 1 - - - -
/system.slice/colord.service 1 - - - -
/system.slice/dbus.service 1 - - - -
/system.slice/gdm.service 2 - - - -
/system.slice/geoclue.service 1 - - - -
/system.slice/httpd.service 7 - - - -
/system.slice/ipython-notebook.service 1 - - - -
/system.slice/itorch-notebook.service 2 - - - -
/system.slice/polkit.service 1 - - - -
/system.slice/postgresql.service 6 - - - -
/system.slice/redis.service 1 - - - -
/system.slice/rtkit-daemon.service 1 - - - -
/system.slic...lice/getty@tty2.service 1 - - - -
For instance, you can see from the contained system the resources used by the host! You can also see all the other started machines, since they are by default in the same cgroup slice, namely /machine.slice.
Another important point is that by default the network interface is not virtualized: the container accesses the same interfaces as the host (try some ip link). But the good news is that we can easily isolate it, and we can even add bridges between host and guest interfaces.
Network isolation
Complete isolation
If you do not need any network connection, you can completely isolate the container from its host's network using the --private-network option:
systemd-nspawn -bD debian-tree/ --private-network
If you look at the available network interfaces, you will see only a loopback:
guest$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
And actually, this loopback itself is isolated from the host's loopback interface, so there is absolutely no communication between host and guest (within the previously mentioned security limitations, of course).
Adding bridges
Anyway, it is common to need the container to listen on some port. For instance, the container may host a web server, or a service exposed over a socket. So we need to be able to expose some interface to the container.
Providing a network interface
The simplest solution is to provide the container with a whole network interface.
systemd-nspawn -bD debian-tree/ --network-interface=eth0
Now you can see eth0 from within the container. But the main drawback is that this interface is no longer accessible from the host: it is moved from the host namespace to the guest namespace.
guest$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether **:**:**:**:**:** brd ff:ff:ff:ff:ff:ff
Note that the host loopback cannot be moved to the guest. Also note that you can pass a second --network-interface argument to use multiple network interfaces.
NB: As soon as you specify a network interface to share, the other ones are automatically hidden, as if --private-network was specified.
You can then create bridged or VLAN interfaces on the host and provide them to the container. Fortunately, systemd-nspawn comes with some options simplifying this process.
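For instance, assuming you already configured a bridge br0 on the host, --network-bridge creates a virtual Ethernet pair and attaches its host side to that bridge (a sketch; br0 must exist beforehand):
systemd-nspawn -bD debian-tree/ --network-bridge=br0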
Virtual Ethernet link
You can use --network-veth to create a virtual Ethernet link between host and container.
systemd-nspawn -bD debian-tree/ --network-veth
You will see a new interface on the host machine named ve-debian-tree, as in "virtual Ethernet to the debian-tree container", and the container will be provided a host0 interface.
In order to communicate, you must then add an IP address to these new interfaces. For instance, you could use the reserved 10.0.0.0/8 IP block:
# Add address 10.0.0.1 to ve-debian-tree interface of host system
host$ ip addr add 10.0.0.1/24 broadcast 10.0.0.255 dev ve-debian-tree
host$ ip link set dev ve-debian-tree up
# Add address 10.0.0.2 to host0 interface of guest system
guest$ ip addr add 10.0.0.2/24 broadcast 10.0.0.255 dev host0
guest$ ip link set dev host0 up
You can then check your setup using ping 10.0.0.1 and ping 10.0.0.2 on the guest and the host respectively.
Now that host and guest can communicate, you may want to forward some of the host's external ports toward the guest system and block the other ones, using for instance iptables.
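Here is a minimal sketch, assuming the guest serves HTTP on 10.0.0.2 port 80 and external traffic arrives on the host's eth0 (interface and port are illustrative):
# Enable IPv4 forwarding on the host
host$ sysctl -w net.ipv4.ip_forward=1
# Redirect incoming TCP port 80 to the guest
host$ iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 10.0.0.2:80
# Let the forwarded packets through
host$ iptables -A FORWARD -p tcp -d 10.0.0.2 --dport 80 -j ACCEPT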
Mount host directory within guest system
Although, unlike Docker containers, a systemd-nspawn container is persistent (modifications are not forgotten at reboot), it can be made non-persistent using --volatile=yes. Either way, it can be useful to bind some host directory to some guest directory.
You can achieve this with --bind=/my/host/directory:/my/guest/directory.
# Boot container and mount host's /home into it
systemd-nspawn -bD debian-tree --bind=/home
Note that when both host and guest paths are identical, you can omit the second one.
You can also mount it read-only, using --bind-ro. Alternatively, you can use machinectl bind.
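For instance, with a recent enough systemd, machinectl bind can mount a host directory into an already-running container (both paths here are hypothetical):
# Mount host's /srv/data at /data inside the running debian-tree container
machinectl bind debian-tree /srv/data /data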
Dealing with images
I will end this overview by talking about images, although reading [the man page][man:systemd-nspawn] of systemd-nspawn would show you a lot more settings.
I did not investigate images much, but there seems to be complete image management available.
systemd-nspawn can take a container image as argument instead of a filesystem directory, using --image (or -i). These images can be simple tarballs, for instance, but also Docker images, and machinectl can help you pull images with commands such as machinectl pull-tar.
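For instance, here is a sketch of pulling a tarball image and booting it by name instead of by directory (the URL is purely illustrative; pull-tar stores the image under /var/lib/machines):
# Download and unpack a tarball image
machinectl pull-tar https://example.com/images/debian-unstable.tar.xz debian-unstable
# Boot the machine by name
systemd-nspawn -M debian-unstable -b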
So I still have to look further into how I can use this power, but it seems that systemd provides us with a simple yet powerful container solution that can be deeply integrated with its other units.
References