Containers are Linux
You have probably heard this expression before. In today’s post we are going to demystify container technologies by breaking them down piece by piece and describing the Linux technologies that make containers possible.
We can describe a container as an isolated process running on a host. To isolate the process, container runtimes leverage Linux kernel technologies such as namespaces, chroots and cgroups, plus security layers like SELinux.
We will see how we can leverage these technologies on Linux in order to build and run our own containers.
Container File Systems (a.k.a rootfs)
Whenever you pull a container image from a container registry, what you download is essentially a set of tarballs plus some metadata. For our purposes, we can say container images are just tarballs.
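To make the "just a tarball" idea concrete, here is a minimal sketch that packs a toy directory tree into a gzip-compressed tarball and lists its top-level entries. The helper name `list_top_level` and the toy paths are made up for illustration and are not part of any container tool.

```shell
# Print the unique top-level entries of a gzip-compressed tarball.
# list_top_level is a hypothetical helper used only for this illustration.
list_top_level() {
    tar -tzf "$1" | sed -e 's|^\./||' -e '/^$/d' | cut -d/ -f1 | sort -u
}

# Build a toy "rootfs" in a temporary directory and pack it.
workdir=$(mktemp -d)
mkdir -p "$workdir/rootfs/bin" "$workdir/rootfs/etc"
echo 'ID=toy' > "$workdir/rootfs/etc/os-release"
tar -czf "$workdir/rootfs.tar.gz" -C "$workdir/rootfs" .

# The archive lists like a tiny Linux root directory.
list_top_level "$workdir/rootfs.tar.gz"
rm -rf "$workdir"
```

For this toy tree the listing should contain just bin and etc. Pointing the same helper at a downloaded rootfs tarball would be expected to show the familiar Linux top-level directories.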
There are multiple ways to get a rootfs that we can use to run our containers. For this blog post we’re going to download a prebuilt rootfs for Alpine Linux.
Tools such as buildroot make it really easy to create our own rootfs. We will see how to create one using buildroot in a future post.
As mentioned earlier, let’s download the x86_64 rootfs for Alpine Linux:
mkdir /var/tmp/alpine-rootfs/ && cd $_
curl https://dl-cdn.alpinelinux.org/alpine/v3.12/releases/x86_64/alpine-minirootfs-3.12.3-x86_64.tar.gz -o rootfs.tar.gz
We can extract the rootfs on the temporary folder we just created:
tar xfz rootfs.tar.gz && rm -f rootfs.tar.gz
If we take a look at the extracted files:
tree -L 1
As you can see, the result looks like a Linux system. We have some well known directories in the Linux Filesystem Hierarchy Standard such as: bin, tmp, dev, opt, etc.
.
├── bin
├── dev
├── etc
├── home
├── lib
├── media
├── mnt
├── opt
├── proc
├── root
├── run
├── sbin
├── srv
├── sys
├── tmp
├── usr
└── var
chroot
Chroot is an operation that changes the root directory for the current running process and its children. A process that runs inside a chroot cannot access files and commands outside the chrooted directory tree.
That being said, we can now chroot into the rootfs environment we extracted in the previous step and run a shell to poke around:
Create the chroot jail
sudo chroot /var/tmp/alpine-rootfs/ /bin/sh
Check the os-release
cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.12.3
PRETTY_NAME="Alpine Linux v3.12"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
Try to list the /var/tmp/alpine-rootfs folder
ls /var/tmp/alpine-rootfs
ls: /var/tmp/alpine-rootfs: No such file or directory
As you can see, we only have visibility of the contents of the rootfs we’ve chrooted into.
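The kernel tracks the root directory of every process, and on Linux it is visible as the `/proc/<pid>/root` symlink. The sketch below reads that link. Both `process_root` and the `PROC_DIR` override are hypothetical names introduced here; the override only exists so the function can be pointed at a fake tree.

```shell
# Print the root directory the kernel has recorded for a process.
# PROC_DIR is a hypothetical override for testing; it defaults to /proc.
process_root() {
    readlink "${PROC_DIR:-/proc}/$1/root"
}

# For an ordinary, non-chrooted shell this is expected to be "/".
if [ -e /proc/self/root ]; then
    process_root "$$"
fi
```

Run as root on the host against the PID of the chrooted /bin/sh, the same function should report the rootfs directory rather than "/".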
We can now install Python and run a simple HTTP server, for example:
Install python3
apk add python3
Run a simple http server
NOTE: When we execute the Python interpreter we’re actually running it from /var/tmp/alpine-rootfs/usr/bin/python3
python3 -m http.server
If you open a new shell on your system (even if it’s outside of the chroot) you will be able to reach the http server we just created:
curl http://127.0.0.1:8000
namespaces
At this point we were able to work with a tarball as if it were a different system, but we’re not isolating the processes from the host system like containers do.
Let’s check the level of isolation:
In a shell outside the chroot run a ping command:
ping 127.0.0.1
Mount the proc filesystem inside the chrooted shell
NOTE: If you’re still running the python http server you can kill the process
mount -t proc proc /proc
Run a ps command inside the chroot and try to find the ping command:
ps -ef | grep "ping 127.0.0.1"
387870 1000 0:00 ping 127.0.0.1
388204 root 0:00 grep ping 127.0.0.1
We have visibility over the host system processes, that’s not great. On top of that, our chroot is running as root so we can even kill the process:
pkill -f "ping 127.0.0.1"
Now is when we introduce namespaces.
Linux namespaces are a feature of the Linux kernel that partitions kernel resources so one process will only see a set of resources while a different process can see a different set of resources.
These resources may exist in multiple spaces. The existing namespaces are:
| Namespace | Isolates |
|---|---|
| Cgroup | Cgroup root directory |
| IPC | System V IPC, POSIX message queues |
| Network | Network devices, stacks, ports, etc. |
| Mount | Mount points |
| PID | Process IDs |
| Time | Boot and monotonic clocks |
| User | User and Group IDs |
| UTS | Hostname and NIS domain name |
Creating namespaces with unshare
Creating namespaces takes just a single syscall (unshare). There is also an unshare command line tool that provides a nice wrapper around the syscall.
We are going to use the unshare command line tool to create namespaces manually. The example below creates a PID namespace for the chrooted shell:
Exit the chroot we have already running
NOTE: Run below command on the chrooted shell
exit
Create the PID namespace and run the chrooted shell inside the namespace
sudo unshare -p -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh
Now that we have created our new process namespace, we will see that our shell thinks its PID is 1:
ps -ef
NOTE: As you can see, we no longer see the host system processes
PID USER TIME COMMAND
1 root 0:00 /bin/sh
2 root 0:00 ps -ef
Since we didn’t create a namespace for the network we can still see the whole network stack from the host system:
ip -o a
NOTE: Below output might vary on your system
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
4: wlp82s0 inet 192.168.0.160/24 brd 192.168.0.255 scope global dynamic wlp82s0\ valid_lft 6555sec preferred_lft 6555sec
4: wlp82s0 inet6 fe80::4e03:6176:40f0:b862/64 scope link \ valid_lft forever preferred_lft forever
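Each namespace in the table has a matching option in the unshare tool, so isolating more than the PID namespace is a matter of adding flags. The sketch below maps namespace names to the long options of util-linux unshare and prints the resulting command instead of running it, since running it needs root. The helper names `unshare_flag` and `print_unshare_cmd` are hypothetical.

```shell
# Map a namespace name to the matching unshare(1) long option.
unshare_flag() {
    case "$1" in
        cgroup)  echo "--cgroup" ;;
        ipc)     echo "--ipc" ;;
        network) echo "--net" ;;
        mount)   echo "--mount" ;;
        pid)     echo "--pid" ;;
        time)    echo "--time" ;;
        user)    echo "--user" ;;
        uts)     echo "--uts" ;;
        *)       echo "unknown namespace: $1" >&2; return 1 ;;
    esac
}

# Print (not run) an unshare command for a rootfs and a list of namespaces.
print_unshare_cmd() {
    rootfs=$1
    shift
    flags=""
    for ns in "$@"; do
        flag=$(unshare_flag "$ns") || return 1
        flags="$flags $flag"
    done
    echo "sudo unshare$flags --fork --mount-proc=$rootfs/proc chroot $rootfs /bin/sh"
}

# Same setup as before, plus a new network namespace.
print_unshare_cmd /var/tmp/alpine-rootfs pid network
```

With the network namespace added, ip -o a inside the new shell would be expected to show only a loopback interface, because a fresh network namespace starts without the host's devices.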
Entering namespaces with nsenter
One powerful thing about namespaces is that they’re pretty flexible. For example, you can have processes with some separated namespaces and some shared namespaces. One example in the Kubernetes world is containers running in pods: containers have different PID namespaces but they share the Network namespace.
There is a syscall (setns) that can be used to reassociate a thread with a namespace. The nsenter command line tool will help with that.
We can check the namespaces for a given process by querying the /proc filesystem:
From a shell outside the chroot get the PID for the chrooted shell
UNSHARE_PPID=$(ps -ef | grep "sudo unshare" | grep chroot | awk '{print $2}')
UNSHARE_PID=$(ps -ef | grep ${UNSHARE_PPID} | grep chroot | grep -v sudo | awk '{print $2}')
SHELL_PID=$(ps -ef | grep ${UNSHARE_PID} | grep -v chroot | grep /bin/sh | awk '{print $2}')
ps -ef | grep ${UNSHARE_PID} | grep -v chroot | grep /bin/sh
root 390072 390071 0 12:32 pts/1 00:00:00 /bin/sh
From a shell outside the chroot get the namespaces for the shell process:
sudo ls -l /proc/${SHELL_PID}/ns
total 0
lrwxrwxrwx. 1 root root 0 mar 25 12:54 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 mnt -> 'mnt:[4026532266]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 net -> 'net:[4026532008]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 pid -> 'pid:[4026532489]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 pid_for_children -> 'pid:[4026532489]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 time -> 'time:[4026531834]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 uts -> 'uts:[4026531838]'
Earlier we only set a different PID namespace. Let’s see the difference between the PID namespace configured for our chrooted shell and for the regular shell:
NOTE: The commands below must be run from a shell outside the chroot:
Get the PID namespace for the chrooted shell:
sudo ls -l /proc/${SHELL_PID}/ns/pid
lrwxrwxrwx. 1 root root 0 mar 25 12:54 pid -> pid:[4026532489]
Get the PID namespace for the regular shell:
sudo ls -l /proc/$$/ns/pid
lrwxrwxrwx. 1 mario mario 0 mar 25 12:55 pid -> pid:[4026531836]
As you can see, the two processes are using different PID namespaces. We saw that the network stack was still visible, so let’s see if there is any difference in the Network namespace for both processes. Let’s start with the chrooted shell:
sudo ls -l /proc/${SHELL_PID}/ns/net
lrwxrwxrwx. 1 root root 0 mar 25 12:54 net -> net:[4026532008]
Now, time to get the one for the regular shell:
sudo ls -l /proc/$$/ns/net
lrwxrwxrwx. 1 mario mario 0 mar 25 12:55 net -> net:[4026532008]
As you can see from the outputs above, both processes are using the same Network namespace.
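The comparison done by eye above can be scripted: two processes share a namespace exactly when their `/proc/<pid>/ns/<type>` links point at the same identifier. The following is a minimal sketch of that check. `same_namespace` and the `PROC_DIR` override are hypothetical names, and reading the links of another user's process usually requires root.

```shell
# Succeed if two processes share the namespace of the given type
# (cgroup, ipc, mnt, net, pid, time, user or uts).
# Exit status: 0 = shared, 1 = different, 2 = a link could not be read.
# PROC_DIR is a hypothetical override for testing; it defaults to /proc.
same_namespace() {
    ns_a=$(readlink "${PROC_DIR:-/proc}/$1/ns/$3") || return 2
    ns_b=$(readlink "${PROC_DIR:-/proc}/$2/ns/$3") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Trivial demo: a process always shares namespaces with itself.
if [ -e /proc/self/ns/pid ]; then
    same_namespace "$$" "$$" pid && echo "same PID namespace"
fi
```

Called from a root shell with the chrooted shell's PID and the regular shell's PID, it would be expected to fail for pid and succeed for net, matching the outputs above.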
If we want a process to join an existing namespace, we can do that using nsenter, as we said before. Let’s do that.
Open a new shell outside the chroot
We want to run a new chrooted shell and join the already existing PID namespace we created earlier:
# Get the previous unshare PPID
UNSHARE_PPID=$(ps -ef | grep "sudo unshare" | grep chroot | awk '{print $2}')
# Get the previous unshare PID
UNSHARE_PID=$(ps -ef | grep ${UNSHARE_PPID} | grep chroot | grep -v sudo | awk '{print $2}')
# Get the previous chrooted shell PID
SHELL_PID=$(ps -ef | grep ${UNSHARE_PID} | grep -v chroot | grep /bin/sh | awk '{print $2}')
# We will enter the previous PID namespace, remount the /proc filesystem and run a new chrooted shell
sudo nsenter --pid=/proc/${SHELL_PID}/ns/pid unshare -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh
From the new chrooted shell we can run a ps command and we should see the existing processes from the previous chrooted shell:
ps -ef
PID USER TIME COMMAND
1 root 0:00 /bin/sh
4 root 0:00 unshare -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh
5 root 0:00 /bin/sh
6 root 0:00 ps -ef
We have entered the already existing PID namespace used by our previous chrooted shell, and running a ps command from the new shell (PID 5) shows the first shell (PID 1).
Injecting files or directories into the chroot
Containers are usually immutable, which means that we cannot create or edit directories or files inside the chroot. Sometimes we will need to inject files or directories, either for storage or configuration. We are going to show how we can create some files on the host system and expose them as read-only to the chrooted shell using mount.
Create a folder in the host system to host some read-only config files:
sudo mkdir -p /var/tmp/alpine-container-configs/
echo "Test" | sudo tee -a /var/tmp/alpine-container-configs/app-config
echo "Test2" | sudo tee -a /var/tmp/alpine-container-configs/srv-config
Create a folder in the rootfs directory to use it as mount point:
sudo mkdir -p /var/tmp/alpine-rootfs/etc/myconfigs
Run a bind mount:
sudo mount --bind -o ro /var/tmp/alpine-container-configs /var/tmp/alpine-rootfs/etc/myconfigs
Run a chrooted shell and check the mounted files:
NOTE: You can exit from the already existing chrooted shells before creating this one
sudo unshare -p -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh
ls -l /etc/myconfigs/
total 8
-rw-r--r-- 1 root root 5 Mar 25 13:28 app-config
-rw-r--r-- 1 root root 6 Mar 25 13:28 srv-config
If we try to edit the files from the chrooted shell, this is what happens:
echo "test3" >> /etc/myconfigs/app-config
NOTE: We cannot edit/create files since the mount is read-only
/bin/sh: can't create /etc/myconfigs/app-config: Read-only file system
If we want to unmount the files we can run the command below from the host system:
sudo umount /var/tmp/alpine-rootfs/etc/myconfigs
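While a bind mount is in place, its options can be checked in the kernel's mount table, where a read-only mount carries the ro option. The sketch below parses that table; `is_readonly_mount` is a hypothetical helper, and its optional second argument is an assumption added so it can be run against a sample file instead of `/proc/self/mounts`.

```shell
# Succeed if the given mount point is listed as read-only.
# Exit status: 0 = read-only, 1 = not read-only, 2 = mount point not listed.
# The second argument is a hypothetical override for testing; it defaults to
# /proc/self/mounts. Mount points containing spaces are not handled.
is_readonly_mount() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }    # keep the last (topmost) matching entry
        END {
            if (opts == "") exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }
    ' "${2:-/proc/self/mounts}"
}

# Demo against a small sample table written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
    '/dev/sda1 / ext4 rw,relatime 0 0' \
    '/dev/sda1 /var/tmp/alpine-rootfs/etc/myconfigs ext4 ro,relatime 0 0' > "$sample"
is_readonly_mount /var/tmp/alpine-rootfs/etc/myconfigs "$sample" && echo "read-only"
rm -f "$sample"
```

On the host, calling it with only the mount point while the bind mount exists would be expected to succeed, and to return 2 once the mount has been removed.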
CGroups
Control groups allow the kernel to restrict resources like memory and CPU for specific processes. In this case we are going to create a new CGroup for our chrooted shell so it cannot use more than 200MB of RAM.
The kernel exposes cgroups at the /sys/fs/cgroup directory:
ls /sys/fs/cgroup/
cgroup.controllers cgroup.stat cpuset.cpus.effective io.cost.model machine.slice system.slice
cgroup.max.depth cgroup.subtree_control cpuset.mems.effective io.cost.qos memory.numa_stat user.slice
cgroup.max.descendants cgroup.threads cpu.stat io.pressure memory.pressure
cgroup.procs cpu.pressure init.scope io.stat memory.stat
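Every process belongs to a cgroup, and on a cgroup v2 system (which the listing above suggests) the kernel reports it in `/proc/<pid>/cgroup` as a single line of the form `0::/some/path`. The sketch below turns that line into the matching directory under `/sys/fs/cgroup`. The helper names and the optional file argument are hypothetical, added here for illustration and testing.

```shell
# Print the cgroup v2 path of a process as recorded in /proc/<pid>/cgroup.
# The second argument is a hypothetical override for testing.
cgroup_v2_path() {
    sed -n 's|^0::||p' "${2:-/proc/$1/cgroup}"
}

# Map that path to its directory under the cgroup2 mount point.
cgroup_v2_dir() {
    path=$(cgroup_v2_path "$@") || return 1
    [ -n "$path" ] || return 1
    echo "/sys/fs/cgroup${path%/}"
}

# Demo: show the cgroup directory of the current shell, if available.
if [ -r /proc/self/cgroup ]; then
    cgroup_v2_dir "$$" || echo "no cgroup v2 entry found"
fi
```

Once a process has been moved into a new cgroup, the same call against its PID should point at that cgroup's directory.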
Let’s create a new cgroup. We just need to create a folder for that to happen:
sudo mkdir /sys/fs/cgroup/alpinecgroup
ls /sys/fs/cgroup/alpinecgroup/
NOTE: The kernel automatically populated the folder
cgroup.controllers cgroup.stat io.pressure memory.max memory.swap.current pids.max
cgroup.events cgroup.subtree_control memory.current memory.min memory.swap.events
cgroup.freeze cgroup.threads memory.events memory.numa_stat memory.swap.high
cgroup.max.depth cgroup.type memory.events.local memory.oom.group memory.swap.max
cgroup.max.descendants cpu.pressure memory.high memory.pressure pids.current
cgroup.procs cpu.stat memory.low memory.stat pids.events
Now, we just need to adjust the memory value by modifying the required files:
# Set a limit of 200MB of RAM
echo "200000000" | sudo tee -a /sys/fs/cgroup/alpinecgroup/memory.max
# Disable swap
echo "0" | sudo tee -a /sys/fs/cgroup/alpinecgroup/memory.swap.max
Finally, we need to assign this CGroup to our chrooted shell:
# Get the previous unshare PPID
UNSHARE_PPID=$(ps -ef | grep "sudo unshare" | grep chroot | awk '{print $2}')
# Get the previous unshare PID
UNSHARE_PID=$(ps -ef | grep ${UNSHARE_PPID} | grep chroot | grep -v sudo | awk '{print $2}')
# Get the previous chrooted shell PID
SHELL_PID=$(ps -ef | grep ${UNSHARE_PID} | grep -v chroot | grep /bin/sh | awk '{print $2}')
# Assign the shell process to the cgroup
echo ${SHELL_PID} | sudo tee -a /sys/fs/cgroup/alpinecgroup/cgroup.procs
In order to test the cgroup we will create a dumb python script in the chrooted shell:
# Mount the /dev fs since we need to read data from urandom
mount -t devtmpfs dev /dev
# Create the python script
cat <<EOF > /opt/dumb.py
f = open("/dev/urandom", "r", encoding = "ISO-8859-1")
data = ""
i=0
while i < 20:
    data += f.read(10000000) # 10mb
    i += 1
    print("Used %d MB" % (i * 10))
EOF
Run the script:
python3 /opt/dumb.py
NOTE: The process was killed once the cgroup hit its memory limit, before the script’s own counter reached 200 MB.
Used 10 MB
Used 20 MB
Used 30 MB
Used 40 MB
Used 50 MB
Used 60 MB
Used 70 MB
Used 80 MB
Used 90 MB
Used 100 MB
Used 110 MB
Used 120 MB
Used 130 MB
Used 140 MB
Used 150 MB
Used 160 MB
Used 170 MB
Killed
We can now close the chrooted shell and remove the cgroup
Exit the chrooted shell:
exit
NOTE: A CGroup cannot be removed until all the attached processes are finished.
Remove the cgroup:
sudo rmdir /sys/fs/cgroup/alpinecgroup/
Container security and capabilities
As you know, Linux containers run directly on top of the host system and share multiple resources like the kernel, filesystems, network stack, etc. If an attacker breaks out of the container confinement, security risks will arise.
To limit the attack surface, several technologies restrict the power of processes running in the container, such as SELinux, seccomp (secure computing) profiles and Linux capabilities.
You can learn more in this blogpost.
Closing Thoughts
Containers are not new; they use technologies that have been present in the Linux kernel for a long time. Tools like Podman or Docker make running containers easy for everyone by abstracting the different Linux technologies used under the hood from the user.
I hope that now you have a better understanding of what technologies are used when you run containers on your systems.