Containers are Linux

You have probably heard this expression already. In today’s post we are going to demystify container technologies by breaking them down piece by piece and describing which Linux technologies make containers possible.

We can describe a container as an isolated process running on a host. To isolate the process, container runtimes rely on Linux kernel features such as namespaces, chroot and cgroups, plus security layers like SELinux.

We will see how we can leverage these technologies on Linux in order to build and run our own containers.

Container File Systems (a.k.a rootfs)

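Whenever you pull a container image from a container registry, what you download is essentially a set of tarballs (one per image layer) plus some metadata. Unpacked and stacked, those tarballs form the root filesystem of the container, so in practice we can say container images are just tarballs.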

There are multiple ways to get a rootfs that we can use to run our containers. For this blog post we’re going to download a prebuilt rootfs for Alpine Linux.

There are tools such as buildroot that make it really easy to create our own rootfs. We will see how to create our own rootfs using buildroot in a future post.

As mentioned earlier, let’s download the x86_64 rootfs for Alpine Linux:

mkdir /var/tmp/alpine-rootfs/ && cd $_
curl https://dl-cdn.alpinelinux.org/alpine/v3.12/releases/x86_64/alpine-minirootfs-3.12.3-x86_64.tar.gz -o rootfs.tar.gz

We can extract the rootfs in the temporary folder we just created:

tar xfz rootfs.tar.gz && rm -f rootfs.tar.gz

If we take a look at the extracted files:

tree -L 1

As you can see, the result looks like a Linux system. We have some well-known directories from the Filesystem Hierarchy Standard, such as bin, tmp, dev, opt and others.

.
├── bin
├── dev
├── etc
├── home
├── lib
├── media
├── mnt
├── opt
├── proc
├── root
├── run
├── sbin
├── srv
├── sys
├── tmp
├── usr
└── var
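Since the rootfs is an ordinary tar archive, the same top-level view can be produced without extracting anything. The sketch below uses Python’s standard tarfile module. The helper names are ours, not part of any tool, and it assumes you kept rootfs.tar.gz around (we removed it in the extraction step above).

```python
import tarfile


def top_level(member_names):
    """Return the sorted first path components of archive member names."""
    names = set()
    for name in member_names:
        # Member names look like "./bin/busybox" or "bin/busybox":
        # ignore "." and empty parts, keep the first real component.
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            names.add(parts[0])
    return sorted(names)


def top_level_of_archive(tar_path):
    """Open a (possibly compressed) tarball and list its top-level entries."""
    with tarfile.open(tar_path, "r:*") as archive:
        return top_level(archive.getnames())
```

Calling top_level_of_archive("rootfs.tar.gz") should give a list similar to the tree output above.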

chroot

Chroot is an operation that changes the root directory for the current running process and its children. A process that runs inside a chroot cannot access files and commands outside the chrooted directory tree.

That being said, we can now chroot into the rootfs environment we extracted in the previous step and run a shell to poke around:

Create the chroot jail

sudo chroot /var/tmp/alpine-rootfs/ /bin/sh

Check the os-release

cat /etc/os-release

NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.12.3
PRETTY_NAME="Alpine Linux v3.12"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"

Try to list the /var/tmp/alpine-rootfs folder

ls /var/tmp/alpine-rootfs

ls: /var/tmp/alpine-rootfs: No such file or directory

As you can see, we can only see the contents of the rootfs we’ve chrooted into.
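One way to picture what just happened: every absolute path used inside the chroot is resolved relative to the new root. The small sketch below maps a path as seen inside the chroot to the corresponding path on the host. The function name host_path is hypothetical, and the model is deliberately simplified (symlinks inside the rootfs are not followed).

```python
import posixpath


def host_path(rootfs, path_in_chroot):
    """Map an absolute path seen inside the chroot to its path on the host."""
    # Normalising from "/" collapses ".." components, so "/../.." still
    # resolves to "/": the chrooted process cannot walk above its new root.
    inside = posixpath.normpath(posixpath.join("/", path_in_chroot))
    root = rootfs.rstrip("/") or "/"
    return posixpath.join(root, inside.lstrip("/"))
```

For example, host_path("/var/tmp/alpine-rootfs", "/etc/os-release") points at the Alpine file we just printed, and a path such as "/../../etc/os-release" maps to that very same file rather than to the host’s /etc/os-release.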

We can now, for example, install Python and run a simple HTTP server:

Install python3
```
apk add python3
```
Run a simple http server
NOTE: When we execute the Python interpreter we’re actually running it from /var/tmp/alpine-rootfs/usr/bin/python3
```
python3 -m http.server
```
If you open a new shell on your system (even if it’s outside of the chroot) you will be able to reach the http server we just created:
```
curl http://127.0.0.1:8000
```

namespaces

At this point we have been able to work with a tarball as if it were a different system, but we’re not isolating the processes from the host system like containers do.

Let’s check the level of isolation:

In a shell outside the chroot run a ping command:
```
ping 127.0.0.1
```
Mount the proc filesystem inside the chrooted shell
NOTE: If you’re still running the python http server you can kill the process
```
mount -t proc proc /proc
```

Run a ps command inside the chroot and try to find the ping command:

ps -ef | grep "ping 127.0.0.1"

387870 1000      0:00 ping 127.0.0.1
388204 root      0:00 grep ping 127.0.0.1

We have visibility over the host system processes, and that’s not great. On top of that, our chroot is running as root, so we can even kill the process:
```
pkill -f "ping 127.0.0.1"
```

This is where namespaces come in.

Linux namespaces are a feature of the Linux kernel that partitions kernel resources so one process will only see a set of resources while a different process can see a different set of resources.

These resources may exist in multiple spaces. The existing namespaces are:

| Namespace | Isolates |
|-----------|----------|
| Cgroup | Cgroup root directory |
| IPC | System V IPC, POSIX message queues |
| Network | Network devices, stacks, ports, etc. |
| Mount | Mount points |
| PID | Process IDs |
| Time | Boot and monotonic clocks |
| User | User and Group IDs |
| UTS | Hostname and NIS domain name |

Creating namespaces with unshare

Creating namespaces is just a single syscall (unshare). There is also an unshare command line tool that provides a nice wrapper around the syscall.
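To make the "single syscall" point concrete, here is a minimal sketch that calls unshare(2) through libc using ctypes. The CLONE_NEW* values are the ones defined in the kernel’s linux/sched.h. The function names are ours, and actually calling unshare() normally requires root (CAP_SYS_ADMIN), so treat it as an illustration rather than a replacement for the command line tool.

```python
import ctypes
import os

# CLONE_NEW* flag values from <linux/sched.h>, one per namespace type.
CLONE_FLAGS = {
    "time": 0x00000080,
    "mnt": 0x00020000,
    "cgroup": 0x02000000,
    "uts": 0x04000000,
    "ipc": 0x08000000,
    "user": 0x10000000,
    "pid": 0x20000000,
    "net": 0x40000000,
}


def unshare_flags(namespaces):
    """OR together the CLONE_NEW* flags for the given namespace names."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare(namespaces):
    """Call unshare(2) via libc; raises OSError if the kernel refuses."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(unshare_flags(namespaces)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Note that unsharing the PID namespace does not move the calling process itself: only the children it creates afterwards land in the new namespace, which is why the command line tool is typically run with -f (fork) when -p is used.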

We are going to use the unshare command line tool to create namespaces manually. The example below creates a PID namespace for the chrooted shell:

Exit the chroot we already have running
NOTE: Run the command below in the chrooted shell
```
exit
```

Create the PID namespace and run the chrooted shell inside the namespace

sudo unshare -p -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh

Now that we have created our new process namespace, we will see that our shell thinks its PID is 1:
```
ps -ef
```
NOTE: As you can see, we no longer see the host system processes
```
PID   USER     TIME  COMMAND
1 root      0:00 /bin/sh
2 root      0:00 ps -ef
```

Since we didn’t create a namespace for the network we can still see the whole network stack from the host system:

ip -o a

NOTE: The output below may vary on your system

1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
4: wlp82s0    inet 192.168.0.160/24 brd 192.168.0.255 scope global dynamic wlp82s0\       valid_lft 6555sec preferred_lft 6555sec
4: wlp82s0    inet6 fe80::4e03:6176:40f0:b862/64 scope link \       valid_lft forever preferred_lft forever

Entering namespaces with nsenter

One powerful thing about namespaces is that they’re pretty flexible: for example, you can have processes that share some namespaces and keep others separate. One example in the Kubernetes world is containers running in pods: by default the containers in a pod have different PID namespaces, but they share the Network namespace.

There is a syscall (setns) that can be used to reassociate a thread with a namespace. The nsenter command line tool will help with that.

We can check the namespaces for a given process by querying the /proc filesystem:

From a shell outside the chroot get the PID for the chrooted shell

UNSHARE_PPID=$(ps -ef | grep "sudo unshare" | grep chroot | awk '{print $2}')
UNSHARE_PID=$(ps -ef | grep ${UNSHARE_PPID} | grep chroot | grep -v sudo | awk '{print $2}')
SHELL_PID=$(ps -ef | grep ${UNSHARE_PID} | grep -v chroot |  grep /bin/sh | awk '{print $2}')
ps -ef | grep ${UNSHARE_PID} | grep -v chroot |  grep /bin/sh

root      390072  390071  0 12:32 pts/1    00:00:00 /bin/sh
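The three chained grep commands above follow the process tree by hand: sudo is the parent of unshare, which in turn is the parent of the chrooted /bin/sh. The sketch below does the same walk over a list of (pid, ppid, command) tuples, which could be parsed from `ps -eo pid,ppid,args`. The function name and the tuple layout are assumptions made for illustration.

```python
def find_shell(processes, root_marker="sudo unshare", shell="/bin/sh"):
    """Walk parent -> child links from the sudo process down to the shell.

    processes: iterable of (pid, ppid, command) tuples.
    Returns the PID of the first matching shell, or None if there is none.
    """
    processes = list(processes)

    # Index the children of every process by parent PID.
    children = {}
    for pid, ppid, command in processes:
        children.setdefault(ppid, []).append((pid, command))

    # Start at the "sudo unshare ..." process(es) and walk downwards.
    pending = [pid for pid, _, command in processes
               if command.startswith(root_marker)]
    while pending:
        parent = pending.pop()
        for pid, command in children.get(parent, []):
            if command == shell:
                return pid
            pending.append(pid)
    return None
```

Because it follows parent/child links instead of matching PIDs as text, this approach does not depend on exactly how many intermediate processes sit between sudo and the shell.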

From a shell outside the chroot get the namespaces for the shell process:

sudo ls -l /proc/${SHELL_PID}/ns

total 0
lrwxrwxrwx. 1 root root 0 mar 25 12:54 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 mnt -> 'mnt:[4026532266]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 net -> 'net:[4026532008]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 pid -> 'pid:[4026532489]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 pid_for_children -> 'pid:[4026532489]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 time -> 'time:[4026531834]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 mar 25 12:54 uts -> 'uts:[4026531838]'

Earlier we asked unshare only for a new PID namespace (the --mount-proc option also implies a new mount namespace). Let’s compare the PID namespace of our chrooted shell with the one of the regular shell:
NOTE: The commands below must be run from a shell outside the chroot:
1. Get PID namespace for the chrooted shell:
```
sudo ls -l /proc/${SHELL_PID}/ns/pid
```
```
lrwxrwxrwx. 1 root root 0 mar 25 12:54 pid -> pid:[4026532489]
```
2. Get PID namespace for the regular shell:
```
sudo ls -l /proc/$$/ns/pid
```
```
lrwxrwxrwx. 1 mario mario 0 mar 25 12:55 pid -> pid:[4026531836]
```
3. As you can see, the two processes are using different PID namespaces. We saw that the network stack was still visible, so let’s see if there is any difference in the Network namespace for both processes. Let’s start with the chrooted shell:
```
sudo ls -l /proc/${SHELL_PID}/ns/net
```
```
lrwxrwxrwx. 1 root root 0 mar 25 12:54 net -> net:[4026532008]
```
4. Now, time to get the one for the regular shell:
```
sudo ls -l /proc/$$/ns/net
```
```
lrwxrwxrwx. 1 mario mario 0 mar 25 12:55 net -> net:[4026532008]
```
5. As you can see from the outputs above, both processes are using the same Network namespace.
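The comparison we just did by eye can also be scripted. The sketch below reads the symlinks under /proc/<pid>/ns and reports which namespaces differ between two processes. The helper names are ours, and reading another user’s /proc/<pid>/ns usually requires root, just like the sudo ls commands above.

```python
import os


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026532489]' into ('pid', 4026532489)."""
    kind, _, inode = target.partition(":")
    return kind, int(inode.strip("[]"))


def read_namespaces(pid="self"):
    """Return {entry name: namespace inode} for /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))[1]
        for name in sorted(os.listdir(ns_dir))
    }


def diff_namespaces(a, b):
    """Names of the namespaces whose inode differs between two processes."""
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

With the values shown in the outputs above, diff_namespaces would report pid as different and leave net out, since both shells point at the same Network namespace inode.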

If we want a process to join an existing namespace we can use nsenter, as we said before. Let’s do that.

Open a new shell outside the chroot

We want to run a new chrooted shell and join the existing PID namespace we created earlier:

# Get the previous unshare PPID
UNSHARE_PPID=$(ps -ef | grep "sudo unshare" | grep chroot | awk '{print $2}')
# Get the previous unshare PID
UNSHARE_PID=$(ps -ef | grep ${UNSHARE_PPID} | grep chroot | grep -v sudo | awk '{print $2}')
# Get the previous chrooted shell PID
SHELL_PID=$(ps -ef | grep ${UNSHARE_PID} | grep -v chroot |  grep /bin/sh | awk '{print $2}')
# We will enter the previous PID namespace, remount the /proc filesystem and run a new chrooted shell
sudo nsenter --pid=/proc/${SHELL_PID}/ns/pid unshare -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh

From the new chrooted shell we can run a ps command and we should see the existing processes from the previous chrooted shell:

ps -ef

PID   USER     TIME  COMMAND
  1   root     0:00  /bin/sh
  4   root     0:00  unshare -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh
  5   root     0:00  /bin/sh
  6   root     0:00  ps -ef

We have entered the existing PID namespace used by our previous chrooted shell: running ps from the new shell (PID 5) shows the first shell (PID 1).

Injecting files or directories into the chroot

Containers are usually immutable, which means we are not supposed to create or edit directories or files inside the chroot. Sometimes we will need to inject files or directories, either for storage or configuration. We are going to show how we can create some files on the host system and expose them as read-only to the chrooted shell using mount.

Create a folder in the host system to host some read-only config files:

sudo mkdir -p /var/tmp/alpine-container-configs/
echo "Test" | sudo tee -a /var/tmp/alpine-container-configs/app-config
echo "Test2" | sudo tee -a /var/tmp/alpine-container-configs/srv-config

Create a folder in the rootfs directory to use it as mount point:
```
sudo mkdir -p /var/tmp/alpine-rootfs/etc/myconfigs
```

Run a bind mount:

sudo mount --bind -o ro /var/tmp/alpine-container-configs /var/tmp/alpine-rootfs/etc/myconfigs

Run a chrooted shell and check the mounted files:

NOTE: You can exit from the already existing chrooted shells before creating this one

sudo unshare -p -f --mount-proc=/var/tmp/alpine-rootfs/proc chroot /var/tmp/alpine-rootfs/ /bin/sh

ls -l /etc/myconfigs/

total 8
-rw-r--r--    1 root     root             5 Mar 25 13:28 app-config
-rw-r--r--    1 root     root             6 Mar 25 13:28 srv-config

If we try to edit the files from the chrooted shell, this is what happens:
```
echo "test3" >> /etc/myconfigs/app-config
```
NOTE: We cannot edit/create files since the mount is read-only
```
/bin/sh: can't create /etc/myconfigs/app-config: Read-only file system
```
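If you want to confirm from the host that the bind mount really is read-only, its options are listed in /proc/mounts. The sketch below parses text in that format. The function names are ours, and it ignores the octal escaping the kernel applies to mount points containing spaces.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # the last matching entry wins
    return options


def is_read_only(mounts_text, mount_point):
    """True when mount_point is listed and carries the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options
```

A possible use would be is_read_only(open("/proc/mounts").read(), "/var/tmp/alpine-rootfs/etc/myconfigs"), run on the host while the bind mount is in place.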
If we want to unmount the files we can run the command below from the host system:
```
sudo umount /var/tmp/alpine-rootfs/etc/myconfigs
```

CGroups

Control groups allow the kernel to restrict resources like memory and CPU for specific processes. In this case we are going to create a new CGroup for our chrooted shell so it cannot use more than 200MB of RAM.

The kernel exposes cgroups under the /sys/fs/cgroup directory (the listing below is from a system using cgroup v2):

ls /sys/fs/cgroup/

cgroup.controllers      cgroup.stat             cpuset.cpus.effective  io.cost.model  machine.slice     system.slice
cgroup.max.depth        cgroup.subtree_control  cpuset.mems.effective  io.cost.qos    memory.numa_stat  user.slice
cgroup.max.descendants  cgroup.threads          cpu.stat               io.pressure    memory.pressure
cgroup.procs            cpu.pressure            init.scope             io.stat        memory.stat

Let’s create a new cgroup. We just need to create a folder for that to happen:

sudo mkdir /sys/fs/cgroup/alpinecgroup

ls /sys/fs/cgroup/alpinecgroup/

NOTE: The kernel automatically populated the folder

cgroup.controllers      cgroup.stat             io.pressure          memory.max        memory.swap.current  pids.max
cgroup.events           cgroup.subtree_control  memory.current       memory.min        memory.swap.events
cgroup.freeze           cgroup.threads          memory.events        memory.numa_stat  memory.swap.high
cgroup.max.depth        cgroup.type             memory.events.local  memory.oom.group  memory.swap.max
cgroup.max.descendants  cpu.pressure            memory.high          memory.pressure   pids.current
cgroup.procs            cpu.stat                memory.low           memory.stat       pids.events

Now, we just need to adjust the memory value by modifying the required files:

# Set a limit of 200MB of RAM
echo "200000000" | sudo tee -a /sys/fs/cgroup/alpinecgroup/memory.max
# Disable swap
echo "0" | sudo tee -a /sys/fs/cgroup/alpinecgroup/memory.swap.max
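To double check what the kernel stored, we can read the files back. In cgroup v2 these limit files contain either a number of bytes or the literal string max, meaning no limit. The helpers below are a small sketch (names are ours) for reading such a file.

```python
import os


def parse_limit(text):
    """Parse the contents of a cgroup v2 limit file such as memory.max.

    Returns the limit in bytes, or None when the file says 'max' (no limit).
    """
    value = text.strip()
    return None if value == "max" else int(value)


def read_limit(cgroup_dir, filename="memory.max"):
    """Read and parse a limit file from a cgroup directory."""
    with open(os.path.join(cgroup_dir, filename)) as f:
        return parse_limit(f.read())
```

Do not be surprised if read_limit("/sys/fs/cgroup/alpinecgroup") returns a value slightly below 200000000: as far as we know the kernel keeps the limit in whole memory pages, so the value is rounded down to a multiple of the page size.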

Finally, we need to move our chrooted shell into this cgroup:

# Get the previous unshare PPID
UNSHARE_PPID=$(ps -ef | grep "sudo unshare" | grep chroot | awk '{print $2}')
# Get the previous unshare PID
UNSHARE_PID=$(ps -ef | grep ${UNSHARE_PPID} | grep chroot | grep -v sudo | awk '{print $2}')
# Get the previous chrooted shell PID
SHELL_PID=$(ps -ef | grep ${UNSHARE_PID} | grep -v chroot |  grep /bin/sh | awk '{print $2}')
# Assign the shell process to the cgroup
echo ${SHELL_PID} | sudo tee -a /sys/fs/cgroup/alpinecgroup/cgroup.procs
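We can verify that the shell really moved by looking at /proc/<pid>/cgroup, where the cgroup v2 membership shows up as a line of the form 0::/path. The sketch below extracts that path. The function name is hypothetical and the sample values used to exercise it are made up.

```python
def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        if not line:
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        # The unified (v2) hierarchy has ID 0 and an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return path
    return None
```

After the assignment above, reading /proc/${SHELL_PID}/cgroup from the host and passing its contents to cgroup_v2_path should return /alpinecgroup.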

In order to test the cgroup we will create a dumb Python script in the chrooted shell:

# Mount the /dev fs since we need to read data from urandom
mount -t devtmpfs dev /dev
# Create the Python script (quoting EOF stops the shell from expanding anything inside it)
cat <<'EOF' > /opt/dumb.py
# Read random data as text; ISO-8859-1 maps every byte to exactly one character
f = open("/dev/urandom", "r", encoding="ISO-8859-1")
data = ""
i = 0
while i < 20:
    data += f.read(10000000)  # keep another 10 MB in memory
    i += 1
    print("Used %d MB" % (i * 10))
EOF

Run the script:

python3 /opt/dumb.py

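NOTE: The script never gets to report 200 MB. The kernel kills the process as soon as the cgroup reaches its memory limit, and the interpreter itself, plus any temporary copies made while the string grows, most likely count towards that limit too.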

python3 /opt/dumb.py
Used 10 MB
Used 20 MB
Used 30 MB
Used 40 MB
Used 50 MB
Used 60 MB
Used 70 MB
Used 80 MB
Used 90 MB
Used 100 MB
Used 110 MB
Used 120 MB
Used 130 MB
Used 140 MB
Used 150 MB
Used 160 MB
Used 170 MB
Killed
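To confirm that the kill came from the cgroup memory limit, we can look at the memory.events file that the kernel created inside the cgroup folder. It holds one counter per line, and oom_kill counts the processes killed in the cgroup because of the limit. The parser below is a small sketch, and the numbers used to exercise it are invented.

```python
def parse_memory_events(text):
    """Turn the 'key value' lines of memory.events into a dict of ints."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events
```

From the host, parse_memory_events(open("/sys/fs/cgroup/alpinecgroup/memory.events").read()) should show an oom_kill value greater than zero after running the script.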

We can now close the chrooted shell and remove the cgroup:
1. Exit the chrooted shell:
```
exit
```
NOTE: A cgroup cannot be removed until all the processes attached to it have exited.
2. Remove the cgroup:
```
sudo rmdir /sys/fs/cgroup/alpinecgroup/
```

Container security and capabilities

As you know, Linux containers run directly on top of the host system and share multiple resources with it, like the kernel, filesystems, network stack, etc. If an attacker manages to break out of the container confinement, the host itself is at risk.

To limit the attack surface, several technologies restrict what the processes running in a container can do, such as SELinux, seccomp (secure computing mode) profiles and Linux capabilities.
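As a small taste of capabilities: the kernel reports the capability sets of a process as hexadecimal bitmasks in /proc/<pid>/status (for example the CapEff line). The sketch below decodes a few bits of such a mask. Only a handful of capabilities are listed (bit numbers taken from linux/capability.h), and the function name is ours.

```python
# A few capability bit numbers from <linux/capability.h> (not the full list).
CAPABILITY_BITS = {
    0: "CAP_CHOWN",
    5: "CAP_KILL",
    12: "CAP_NET_ADMIN",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """List the known capabilities that are set in a hexadecimal mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAPABILITY_BITS.items())
            if mask & (1 << bit)]
```

Our chrooted shell ran as full root, which is why it could kill the ping process on the host earlier. Container runtimes typically start containers with a reduced capability set instead.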

You can learn more in this blogpost.

Closing Thoughts

Containers are not new: they use technologies that have been present in the Linux kernel for a long time. Tools like Podman or Docker make running containers easy for everyone by hiding the different Linux technologies used under the hood from the user.

I hope that now you have a better understanding of what technologies are used when you run containers on your systems.
