Diving deep: How Docker achieves container isolation using the underlying OS [Part 1]
Hello everyone! This article is going to be a deep dive into how Docker achieves isolation using containers. When you spin up a container that has its own filesystem, network, processes, etc., what exactly happens behind the scenes?
This article covers some advanced topics in the Linux operating system (namespaces are a Linux kernel feature rather than a general Unix one), but I'll try to explain everything as clearly as possible.
Namespaces
Docker relies heavily on Linux namespaces. They are one of the core kernel features that made Docker possible.
Linux namespaces are a mechanism that allows the kernel to provide the illusion that a set of processes has its own isolated instance of a particular resource, even though the processes might be sharing the same underlying global resource. This illusion is achieved by creating a separate namespace for each set of processes, effectively giving each set its own isolated view of the resource.
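To make this less abstract: on a Linux machine with /proc mounted, every process exposes the namespaces it belongs to under /proc/<pid>/ns. The sketch below prints them for the current shell (the function name list_namespaces is only an illustration, not something Docker provides). Two processes share a namespace when the identifiers in brackets match.

```shell
#!/bin/sh
# Print each namespace the given process belongs to, with its kernel identifier.
# Assumes a Linux system with /proc mounted.
list_namespaces() {
    pid="$1"
    for ns in "/proc/$pid/ns"/*; do
        # Skip the unexpanded glob if the directory is missing or empty.
        [ -e "$ns" ] || [ -L "$ns" ] || continue
        # Each entry is a symlink whose target looks like "pid:[<number>]".
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

list_namespaces "$$"
```

Running it in two different terminals on the same host should normally print the same identifiers, because both shells live in the host's initial namespaces.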
There are several types of namespaces Docker uses, and we'll get into them one by one. But before that, let's talk more about the concept in general.
Imagine a large public library with multiple floors, each dedicated to a different subject area like science, literature, history, and so on. Within each floor, there are various sections and bookshelves containing books related to that subject. Each floor is like a separate namespace.
In this analogy:
The library itself is the physical computer.
Each floor of the library is a separate namespace.
Each section on a floor is like a resource or aspect of the system being isolated (e.g., network, processes, filesystems).
Docker uses six namespaces to achieve isolation:
PID namespace for process isolation.
USER namespace for managing user permissions.
MNT namespace for managing filesystem mount points.
NET namespace for managing network interfaces.
IPC namespace for managing access to IPC resources.
UTS namespace for isolating the hostname and domain name (Docker's documentation describes these as kernel and version identifiers).
We'll get into the first 3 namespaces in this article and the last 3 in the next part.
PID Namespace
This namespace is all about process isolation. In a newly created PID namespace, process IDs start from 1. The first process is called the init process (just as on a system without namespaces). If PID 1 dies, the kernel sends SIGKILL to all the other processes in that namespace, effectively terminating the namespace.
The kernel maintains a mapping between the PID of the process inside the namespace and the PID of the same process outside the namespace realm.
Processes inside the namespace can only see and interact with processes inside the same namespace. They are completely isolated from any other processes in the system.
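The inside/outside PID mapping can be observed directly. On Linux 4.1 and later, /proc/<pid>/status carries an NSpid line that lists the PID of a process in every PID namespace it belongs to, outermost first. The helper below is a small sketch (nspid_from is a made-up name) that extracts that line from a status file.

```shell
#!/bin/sh
# Print the PIDs from the NSpid line of a status file, outermost namespace first.
# $1 is a status file such as /proc/1234/status.
nspid_from() {
    sed -n 's/^NSpid:[[:space:]]*//p' "$1" | tr '\t' ' '
}

# For a process that is not in a nested PID namespace this prints one number.
if [ -r "/proc/$$/status" ]; then
    nspid_from "/proc/$$/status"
fi
```

For a process running inside one nested PID namespace, reading its status file from the outer namespace would show two numbers: the PID as seen outside, followed by the PID as seen inside.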
Hands-on
Let's start by creating a new PID namespace.
sudo unshare -fp /bin/bash
sleep 90000 &
Executing the unshare command creates a new PID namespace. If we examine the processes, we'll find that the parent of the sleep command is the /bin/bash process:
ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:24 pts/0 00:00:00 /bin/bash
root 47 1 0 15:42 pts/0 00:00:00 unshare -fp /bin/bash
root 48 47 0 15:42 pts/0 00:00:00 /bin/bash
root 51 48 0 15:42 pts/0 00:00:00 sleep 90000
root 52 48 0 15:42 pts/0 00:00:00 ps -ef
The sleep command has a parent PID (PPID) that is the same as the PID of /bin/bash.
As you might have noticed, when executing ps -ef we can still see processes outside of the namespace. Why is that?
All the processes in the system are tracked in a special virtual filesystem called procfs, which is usually mounted at /proc. This directory contains one entry for every process running on the system, and tools like ps read their process list from it.
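As a rough illustration of what ps does with that directory, the sketch below lists the numeric entries of a procfs mount. The function name visible_pids is hypothetical, and the optional argument only exists so the function can be pointed at another directory.

```shell
#!/bin/sh
# List the process IDs visible through a procfs mount (default: /proc).
# Each numeric directory corresponds to one process.
visible_pids() {
    root="${1:-/proc}"
    for d in "$root"/[0-9]*; do
        if [ -d "$d" ]; then
            echo "${d##*/}"
        fi
    done
}

# Show the five lowest PIDs that this /proc mount exposes.
visible_pids | sort -n | head -n 5
```

Whatever shows up here is what ps will report, regardless of which PID namespace the calling shell is in.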
We'll get into mount points later on in the article, but for now we need to understand that creating a new PID namespace does not change which filesystems are mounted. The existing /proc mount is still shared, so a process in the newly created namespace sees the same /proc as processes in other namespaces do.
To only see the processes in the namespace inside the proc directory, we'll need to create a mnt namespace. We'll also need to mount /proc specifically for this newly created namespace. The new mount point will not be visible to other namespaces; it is only visible in the newly created mnt namespace.
So the final command will be as follows:
unshare -Urpf --mount-proc
-U creates a new user namespace (we'll get into these shortly), which maps privileges and permissions into an isolated namespace.
-f forks a new process from the unshare command; this process will act as the init process for the new PID namespace.
-p creates a new PID namespace.
-r maps the outside user to the root user inside the new user namespace (we'll get into this shortly).
--mount-proc creates a new mount namespace and mounts a fresh /proc in it.
As you might've noticed we can combine multiple namespaces where the main goal is achieving complete isolation from the outside system.
If we run ps -ef, we'll get the following:
# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 16:01 pts/0 00:00:00 -sh
root 5 1 0 16:01 pts/0 00:00:00 ps -ef
The sh process is the newly forked process with a PID of 1. The ps process is its child, with PID 5.
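The same experiment can be scripted. This is a minimal sketch, assuming the unshare tool from util-linux is installed and that the kernel allows unprivileged user namespaces; where it does not (some hardened hosts and container sandboxes), the function just prints a note. The function name pid_ns_demo is made up for this example.

```shell
#!/bin/sh
# Run a shell in new user, PID and mount namespaces and report its own PID.
pid_ns_demo() {
    # Probe first so that a restricted environment yields a message, not an error.
    if unshare -Urpf --mount-proc true 2>/dev/null; then
        # Single quotes keep $$ from expanding until the inner shell runs.
        unshare -Urpf --mount-proc sh -c 'echo "PID inside: $$"'
    else
        echo "unshare is unavailable or not permitted here"
    fi
}

pid_ns_demo
```

When namespaces are available, the inner shell is the first process in its PID namespace, so it should report PID 1 even though the host sees it under a much larger PID.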
User Namespace
The user namespace is a way for a container (a set of isolated processes) to have a different set of permissions than the system itself. Every container inherits its permissions from the user who created the new user namespace.
For example, let's say you have a user named foo-bar. This user doesn't have root privileges but has a set of privileges assigned to it. When this user creates a new user namespace, the root user of the newly created namespace maps to foo-bar. This means that root inside the namespace has the same privileges on the host as foo-bar does; it is simply named root in the new namespace.
The biggest advantage to the user namespace is the ability to run containers without root privileges. Additionally, depending on how you set up the UID mapping, you can completely avoid having a superuser inside a given user namespace. This means it is not possible to run any privileged processes inside of this type of namespace.
Hands-on
Let's start by creating a new user:
sudo useradd foo-bar
sudo passwd foo-bar
Let's create a new user namespace as follows:
su foo-bar
unshare -Ur
The root user of the new namespace will map to the user that invoked the command above.
cat /proc/$$/uid_map # $$ refers to the current process ID
inside-ns outside-ns range
0 1000 1
As we can see, the new foo-bar user has an ID of 1000 on the host. When the new namespace was created, the root user inside it (ID 0) got mapped to ID 1000 outside the namespace. The range of 1 means that only this single ID is mapped.
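The arithmetic behind a uid_map line is simple enough to sketch. Given the three columns (first ID inside, first ID outside, range), an inside UID maps to the outside start plus its offset from the inside start, as long as it falls within the range. The function translate_uid below is purely illustrative and is not part of any tool.

```shell
#!/bin/sh
# Translate a UID seen inside a user namespace to the UID on the outside,
# given one uid_map line: <inside-start> <outside-start> <range>.
translate_uid() {
    uid="$1"; inside="$2"; outside="$3"; range="$4"
    if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + range)) ]; then
        echo $((outside + uid - inside))
    else
        echo "unmapped"
    fi
}

translate_uid 0 0 1000 1   # root inside the namespace -> prints 1000
translate_uid 1 0 1000 1   # outside the mapped range  -> prints unmapped
```

With the mapping shown above, only UID 0 has a counterpart on the host; every other UID inside the namespace has no mapping at all.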
MNT Namespace
A mount namespace is a concept in Linux that provides an isolated view of the filesystem hierarchy to processes within a namespace. In simpler terms, it allows different processes to have their own separate "virtual" filesystem views that are distinct from each other and the main filesystem on the host system.
Each mount namespace has its own set of mounted filesystems, which can include local disk partitions, network shares, and other types of filesystems. This isolation allows processes in different namespaces to have different views of the filesystem. For example, one process might see a specific directory as its root directory, while another process might see a completely different directory as its root.
By default, if you were to create a new mount namespace with unshare -m, your view of the system would remain largely unchanged and unconfined. That's because whenever you create a new mount namespace, a copy of the mount points from the parent namespace is created in the new mount namespace. That means that any action taken on files inside a poorly configured mount namespace will impact the host.
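One way to see the "copy of the parent's mount points" behaviour is to compare the mount namespace identifier and the size of the mount table outside and inside a fresh namespace. This is a sketch under the same assumptions as before (util-linux unshare, unprivileged user namespaces permitted); mnt_ns_demo is a hypothetical name.

```shell
#!/bin/sh
# Compare the mount namespace ID and number of mounts outside and inside
# a freshly created mount namespace.
mnt_ns_demo() {
    if ! unshare -Urm true 2>/dev/null; then
        echo "unshare is unavailable or not permitted here"
        return 0
    fi
    # The same report command is evaluated once outside and once inside.
    report='echo "$(readlink /proc/self/ns/mnt) $(wc -l < /proc/self/mountinfo) mounts"'
    echo "outside: $(sh -c "$report")"
    echo "inside:  $(unshare -Urm sh -c "$report")"
}

mnt_ns_demo
```

The namespace identifiers should differ, while the mount table inside starts out as a copy of the one outside, which is why the view looks unchanged until something is mounted or unmounted.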
Focus on the difference between a mount point and the mount namespace because it confused me for some time.
When creating a new mount namespace and mounting anything in there, it isn't visible from the host root.
Mounts can propagate between namespaces because of a kernel feature called shared subtrees. It gives every mount point its own propagation type. This metadata determines whether new mounts under a given path are propagated to other mount points.
In simpler terms, imagine you have a bunch of folders on your computer, like /folder1, /folder2, and so on. Each of these folders can hold different things.
Now, let's say you connect a USB drive and it appears as a new folder called /usb. If your computer is set up in a certain way, when you put files in the /usb folder, those files might also show up in other folders like /folder1 or /folder2.
This is like how information can travel between folders. The "propagation type" is like a rule that says whether changes in one folder (like adding files to /usb) should automatically affect other folders (like /folder1).
A mount state determines whether a member can receive the event. According to the kernel's shared subtrees documentation, there are five mount states:
shared - A mount that belongs to a peer group. Any changes that occur will propagate through all members of the peer group.
slave - One-way propagation. The master mount point will propagate events to a slave, but the master will not see any actions the slave takes.
shared and slave - Indicates that the mount point has a master, but it also has its own peer group. The master will not be notified of changes to a mount point, but any peer group members downstream will.
private - Does not receive or forward any propagation events.
unbindable - Does not receive or forward any propagation events and cannot be bind mounted.
Most container engines use private mount states when mounting a volume inside a container. That is why mount points created inside a container's namespace do not show up in the root namespace.
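The state of each mount can be read from /proc/self/mountinfo, where the optional fields between the mount options and the "-" separator carry tags such as shared:N (member of peer group N), master:N (slave of peer group N) and unbindable. A mount with none of these tags is private. The following sketch classifies every line of mountinfo-style input; propagation_states is a name invented for this example.

```shell
#!/bin/sh
# Classify each mount in mountinfo-format input by its propagation state.
# Field 5 is the mount point; optional fields start at field 7 and end at "-".
propagation_states() {
    awk '{
        shared = 0; slave = 0; unbindable = 0
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) shared = 1
            else if ($i ~ /^master:/) slave = 1
            else if ($i == "unbindable") unbindable = 1
        }
        if (shared && slave) state = "shared and slave"
        else if (shared) state = "shared"
        else if (slave) state = "slave"
        else if (unbindable) state = "unbindable"
        else state = "private"
        print $5 ": " state
    }'
}

# Show the state of the first few mounts visible to this process.
if [ -r /proc/self/mountinfo ]; then
    propagation_states < /proc/self/mountinfo | head -n 5
fi
```

Running this on the host and again inside a namespace created with unshare -m is a quick way to check which mounts would forward events and which would not.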
In Docker, mnt namespaces are the reason you can see a whole new filesystem inside an Ubuntu container, for example. The filesystem shown inside the container wouldn't be achievable without mnt namespaces.
Hands-on
Let's mimic exactly what happens in docker. A container sees a completely different root filesystem from the host.
Let's create a new folder newroot and download the Alpine Linux filesystem into it.
This assumes we added the user foo-bar from above (it will be the user that maps to the root user of the soon-to-be-created namespace).
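# Create /example/newroot at the filesystem root, matching the absolute paths used below
sudo mkdir -p /example/newroot
cd /example
# Download and unpack the Alpine mini root filesystem into newroot
sudo wget https://dl-cdn.alpinelinux.org/alpine/v3.13/releases/x86_64/alpine-minirootfs-3.13.1-x86_64.tar.gz
sudo tar xvf alpine-minirootfs-3.13.1-x86_64.tar.gz -C newroot
# Hand the extracted tree over to foo-bar
sudo chown -R foo-bar /example/newroot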
Now foo-bar (new namespace's root user) will own the new filesystem.
Now let's start a new mount namespace:
su foo-bar
unshare -Umr
Since we didn't specify a propagation flag, any mount point created in this namespace will be in the private mount state, which is the default for unshare. That means it won't propagate or receive any events. It is completely isolated from any other mounts on the same directory.
That's exactly what we're going to do. We are going to create a new mount point of the Alpine filesystem just for this namespace. Then when changing the mount point for the root filesystem for the process, we'll use this one as our new mount point.
cd newroot
mount --bind /example/newroot /example/newroot
This is a self-bind: it binds a directory to itself. This might be confusing, but I'll explain.
When you perform a self-binding mount, you are creating a new mount point that points to the same directory. In simpler terms, you're saying, "Let's look at this directory from a different angle, but it's still the same directory." Because the directory is now a mount point of its own, its mount properties can be changed within that context without touching the original.
Imagine you have a room in a building. Now, you decide to install a mirror on one of the walls. When you look at the mirror, you're not creating a new room. It's still the same room, but you're seeing it differently.
Similarly, self-binding creates a "mirror" or alternate perspective of a directory. This can have some practical applications:
Security Isolation: You can set different permissions or attributes on the "mirrored" directory, effectively isolating it from the original directory. This might be useful for creating a more controlled environment for specific operations.
Access Control: You could have different access rules for the original directory and the bind-mounted version. For example, you might want to allow read-only access through the bind mount, even if the original directory has broader access permissions.
Sandboxing: If you're running specific applications, you might want to provide them a view of a directory that's slightly different from the actual directory, ensuring they interact with the data in a controlled manner.
Temporary Changes: You can temporarily apply different properties or settings to the bind-mounted directory without affecting the original.
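A self-bind is easy to verify: after it, the directory appears as its own entry in the mount table. The sketch below checks whether a path is listed as a mount point in a mountinfo file. The helper name is_mount_point and its optional second argument are assumptions made for this example; by default it reads the calling process's own view.

```shell
#!/bin/sh
# Succeed if the given path appears as a mount point in a mountinfo file
# (default: the calling process's own view, /proc/self/mountinfo).
is_mount_point() {
    awk -v path="$1" '$5 == path { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}

if is_mount_point /example/newroot; then
    echo "/example/newroot is a mount point"
else
    echo "/example/newroot is not a mount point"
fi
```

Inside the namespace where the bind mount was made, the check should succeed; run from the host, it should fail, because the private mount never propagated out.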
After explaining everything, all that's left is changing the root filesystem for the process in the new namespace.
This is usually done with the pivot_root command, which changes what the current process views as the root filesystem. pivot_root requires the new root to be a mount point, which is why the self-bind on /example/newroot was needed. I won't be getting into it here, but if you need a step-by-step, head over to this article.
Summary
Linux namespaces are complex, and they shine when used together to achieve things such as isolation. Diving deep into this complexity makes you appreciate products like Docker and how far we've come in understanding these concepts. This article covered the first 3 of the 6 namespaces used in Docker. In the next article, I'll briefly cover the remaining namespaces. Hope you got a glimpse of how everything works behind the scenes!