Capabilities and Seccomp Profiles on Kubernetes
In a previous post we talked about Linux Capabilities and Secure Compute (seccomp) profiles; in this post we are going to see how we can leverage them on Kubernetes.
We will need a Kubernetes cluster; I’m going to use kcli to get one. The command below will deploy a Kubernetes cluster on VMs:
NOTE: You can create a parameters file with the cluster configuration as well.
# Create a Kubernetes 1.20 cluster with 1 master and 1 worker using calico as SDN, nginx as ingress controller, metallb for loadbalancer services and CRI-O as container runtime
kcli create kube generic -P ctlplanes=1 -P workers=1 -P ctlplane_memory=4096 -P numcpus=2 -P worker_memory=4096 -P sdn=calico -P version=1.20 -P ingress=true -P ingress_method=nginx -P metallb=true -P engine=crio -P domain=linuxera.org caps-cluster
After a few moments we will get the kubeconfig for accessing our cluster:
Kubernetes cluster caps-cluster deployed!!!
INFO export KUBECONFIG=$HOME/.kcli/clusters/caps-cluster/auth/kubeconfig
INFO export PATH=$PWD:$PATH
We can start using it right away:
export KUBECONFIG=$HOME/.kcli/clusters/caps-cluster/auth/kubeconfig
kubectl get nodes
NAME STATUS ROLES AGE VERSION
caps-cluster-master-0.linuxera.org Ready control-plane,master 8m19s v1.20.5
caps-cluster-worker-0.linuxera.org Ready worker 3m33s v1.20.5
Capabilities on Kubernetes
Capabilities on Kubernetes are configured for pods or containers via the SecurityContext.
In the next scenarios we are going to see how we can configure different capabilities for our containers and how they behave depending on the user running our container.
We will be using a demo application that listens on a given port; by default, the application image uses a non-root user. In a previous post we mentioned how capabilities behave differently depending on the user that runs the process; we will see how that plays out when running in containers.
Container Runtime Default Capabilities
As previously mentioned, container runtimes come with a set of enabled capabilities that will be assigned to every container if not otherwise specified. We’re using CRI-O in our Kubernetes cluster and we can find the default capabilities in the CRI-O configuration file at /etc/crio/crio.conf present in the nodes:
default_capabilities = [
"CHOWN",
"DAC_OVERRIDE",
"FSETID",
"FOWNER",
"SETGID",
"SETUID",
"SETPCAP",
"NET_BIND_SERVICE",
"KILL",
]
The capabilities in the list above will be the ones added to containers by default.
Pod running with root UID
Create a namespace:
NAMESPACE=test-capabilities
kubectl create ns ${NAMESPACE}

Create a pod running our test application with UID 0:
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest-root
spec:
  containers:
  - image: quay.io/mavazque/reversewords:ubi8
    name: reversewords
    securityContext:
      runAsUser: 0
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

Review the capability sets for the application process:
kubectl -n ${NAMESPACE} exec -ti reversewords-app-captest-root -- grep Cap /proc/1/status

CapInh: 00000000000005fb
CapPrm: 00000000000005fb
CapEff: 00000000000005fb
CapBnd: 00000000000005fb
CapAmb: 0000000000000000

If we decode the effective set this is what we get:

capsh --decode=00000000000005fb

0x00000000000005fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service

NOTE: You can see how the pod got assigned the runtime’s default capabilities.
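The decoding that capsh does can be reproduced with a short script. Below is a minimal sketch in Python, assuming the capability bit numbers defined in linux/capability.h (only the bits relevant to this post are listed):

```python
# Decode a capability mask (as printed in /proc/<pid>/status) into
# capability names, mirroring `capsh --decode`. Bit numbers come from
# linux/capability.h; only the bits relevant here are listed.
CAP_BITS = {
    0: "cap_chown",
    1: "cap_dac_override",
    3: "cap_fowner",
    4: "cap_fsetid",
    5: "cap_kill",
    6: "cap_setgid",
    7: "cap_setuid",
    8: "cap_setpcap",
    10: "cap_net_bind_service",
}

def decode_caps(mask: int) -> list[str]:
    # A capability is present if its bit is set in the mask
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]

print(",".join(decode_caps(0x00000000000005fb)))
# cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service
```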
Pod running with non-root UID
Create a pod running our test application with a non-root UID:

NAMESPACE=test-capabilities
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest-nonroot
spec:
  containers:
  - image: quay.io/mavazque/reversewords:ubi8
    name: reversewords
    securityContext:
      runAsUser: 1024
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

Review the capability sets for the application process:

kubectl -n ${NAMESPACE} exec -ti reversewords-app-captest-nonroot -- grep Cap /proc/1/status

CapInh: 00000000000005fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000000005fb
CapAmb: 0000000000000000
You can see how the effective and permitted sets were cleared. We explained this behaviour in our previous post: since we execve into a process running with a non-root UID, those capability sets get cleared.
This has consequences when running our workloads on Kubernetes. Outside Kubernetes we could rely on ambient capabilities, but at the time of this writing ambient capabilities are not supported on Kubernetes. This means that we can only use file capabilities or capability-aware programs in order to grant capabilities to programs running as non-root on Kubernetes.
Configuring capabilities for our workloads
At this point we know how capabilities differ when running our workloads with a root or a non-root UID. In the next scenarios we are going to see how we can configure our workloads so they only get the capabilities they require in order to run.
Workload running with root UID
Create a deployment for our workload:
NOTE: We are dropping all of the runtime’s default capabilities; on top of that we add the NET_BIND_SERVICE capability and request the app to run with a root UID. In the environment variables we configure our app to listen on port 80.

NAMESPACE=test-capabilities
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: reversewords-app-rootuid
  name: reversewords-app-rootuid
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reversewords-app-rootuid
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-rootuid
    spec:
      containers:
      - image: quay.io/mavazque/reversewords:ubi8
        name: reversewords
        resources: {}
        env:
        - name: APP_PORT
          value: "80"
        securityContext:
          runAsUser: 0
          capabilities:
            drop:
            - CHOWN
            - DAC_OVERRIDE
            - FSETID
            - FOWNER
            - SETGID
            - SETUID
            - SETPCAP
            - KILL
            add:
            - NET_BIND_SERVICE
status: {}
EOF

We can check the logs for our application and see that it’s working fine:
kubectl -n ${NAMESPACE} logs deployment/reversewords-app-rootuid

2021/04/01 09:59:39 Starting Reverse Api v0.0.18 Release: NotSet
2021/04/01 09:59:39 Listening on port 80

If we look at the capability sets this is what we get:

kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-rootuid -- grep Cap /proc/1/status

CapInh: 0000000000000400
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 0000000000000400
CapAmb: 0000000000000000

As expected, only the NET_BIND_SERVICE capability is available:

capsh --decode=0000000000000400

0x0000000000000400=cap_net_bind_service
The workload worked as expected when running with a root UID; in the next scenario we will try the same app, this time running with a non-root UID.
Workload running with non-root UID
Create a deployment for our workload:
NOTE: We are dropping all of the runtime’s default capabilities; on top of that we add the NET_BIND_SERVICE capability and request the app to run with a non-root UID. In the environment variables we configure our app to listen on port 80.

NAMESPACE=test-capabilities
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: reversewords-app-nonrootuid
  name: reversewords-app-nonrootuid
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reversewords-app-nonrootuid
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-nonrootuid
    spec:
      containers:
      - image: quay.io/mavazque/reversewords:ubi8
        name: reversewords
        resources: {}
        env:
        - name: APP_PORT
          value: "80"
        securityContext:
          runAsUser: 1024
          capabilities:
            drop:
            - CHOWN
            - DAC_OVERRIDE
            - FSETID
            - FOWNER
            - SETGID
            - SETUID
            - SETPCAP
            - KILL
            add:
            - NET_BIND_SERVICE
status: {}
EOF

We can check the logs for our application and see if it’s working:
kubectl -n ${NAMESPACE} logs deployment/reversewords-app-nonrootuid

2021/04/01 10:09:10 Starting Reverse Api v0.0.18 Release: NotSet
2021/04/01 10:09:10 Listening on port 80
2021/04/01 10:09:10 listen tcp :80: bind: permission denied

This time the application couldn’t bind to port 80. Let’s update the app configuration so it binds to port 8080, and then we will review the capability sets:
# Patch the app so it binds to port 8080
kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid -p '{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"reversewords"}],"containers":[{"$setElementOrder/env":[{"name":"APP_PORT"}],"env":[{"name":"APP_PORT","value":"8080"}],"name":"reversewords"}]}}}}'

# Get capability sets
kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- grep Cap /proc/1/status

CapInh: 0000000000000400
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000400
CapAmb: 0000000000000000

We don’t have NET_BIND_SERVICE in the effective set. If you remember from our previous post, we would need the capability in the ambient set for our application to work, but as we said, Kubernetes doesn’t support ambient capabilities yet, so our only option is to make use of file capabilities.

We have created a new image for our application; its binary now has the NET_BIND_SERVICE capability in the effective and permitted file capability sets. Let’s update the deployment configuration.

NOTE: We configured the app to bind to port 80 and changed the container image to the one that has the required changes.
kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid -p '{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"reversewords"}],"containers":[{"$setElementOrder/env":[{"name":"APP_PORT"}],"env":[{"name":"APP_PORT","value":"80"}],"image":"quay.io/mavazque/reversewords-captest:latest","name":"reversewords"}]}}}}'

We can check the logs for our application and see if it’s working:
kubectl -n ${NAMESPACE} logs deployment/reversewords-app-nonrootuid

2021/04/01 10:18:42 Starting Reverse Api v0.0.21 Release: NotSet
2021/04/01 10:18:42 Listening on port 80

This time the application was able to bind to port 80. Let’s review the capability sets:
kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- grep Cap /proc/1/status

CapInh: 0000000000000400
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 0000000000000400
CapAmb: 0000000000000000

NOTE: Since our application binary has the required capability in its file capability sets, the process thread was able to gain that capability.

We can check the file capability configured in our application binary:

kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- getcap /usr/bin/reverse-words

/usr/bin/reverse-words = cap_net_bind_service+eip
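For reference, an image like this can be produced by setting the file capability at build time with setcap. The following is a hypothetical Dockerfile sketch, not the actual build used for quay.io/mavazque/reversewords-captest; the base image, package name, and binary path are assumptions:

```dockerfile
FROM registry.access.redhat.com/ubi8/ubi:latest
# libcap provides the setcap utility (assumed package name on UBI8)
RUN yum -y install libcap
COPY reverse-words /usr/bin/reverse-words
# Add NET_BIND_SERVICE to the binary's effective, inheritable, and
# permitted file capability sets so a non-root process can gain it on execve
RUN setcap cap_net_bind_service+eip /usr/bin/reverse-words
USER 1024
ENTRYPOINT ["/usr/bin/reverse-words"]
```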
Seccomp Profiles on Kubernetes
In this scenario we’re going to reuse the Secure Compute profile we created in the previous post.
Configuring Seccomp Profiles on the cluster nodes
By default, the kubelet looks for seccomp profiles in the /var/lib/kubelet/seccomp/ path. This path can be changed in the kubelet configuration.
We are going to create the seccomp profile that we will be using on the nodes.

Create the file below on every node that can run workloads, as /var/lib/kubelet/seccomp/centos8-ls.json:
NOTE: This is the seccomp profile that allows a centos8 image to run ls /, as we saw in the previous post.
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64"
],
"syscalls": [
{
"names": [
"access",
"arch_prctl",
"brk",
"capget",
"capset",
"chdir",
"close",
"epoll_ctl",
"epoll_pwait",
"execve",
"exit_group",
"fchown",
"fcntl",
"fstat",
"fstatfs",
"futex",
"getdents64",
"getpid",
"getppid",
"ioctl",
"mmap",
"mprotect",
"munmap",
"nanosleep",
"newfstatat",
"openat",
"prctl",
"pread64",
"prlimit64",
"read",
"rt_sigaction",
"rt_sigprocmask",
"rt_sigreturn",
"sched_yield",
"seccomp",
"set_robust_list",
"set_tid_address",
"setgid",
"setgroups",
"setuid",
"stat",
"statfs",
"tgkill",
"write"
],
"action": "SCMP_ACT_ALLOW",
"args": [],
"comment": "",
"includes": {},
"excludes": {}
}
]
}
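The way a runtime evaluates a profile of this shape can be sketched in a few lines: a syscall listed in a rule gets that rule's action, and anything else falls back to defaultAction. The snippet below is a simplified model for illustration (it ignores architectures and argument filters), not the real libseccomp logic, and the reduced allow-list in it is an assumption:

```python
import json

# Simplified model of seccomp profile evaluation: the first rule that
# lists the syscall decides the action; otherwise defaultAction applies.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "openat", "getdents64", "close"],
     "action": "SCMP_ACT_ALLOW"}
  ]
}
""")

def action_for(profile: dict, syscall: str) -> str:
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]

print(action_for(PROFILE, "openat"))  # SCMP_ACT_ALLOW
print(action_for(PROFILE, "mount"))   # SCMP_ACT_ERRNO (the syscall fails)
```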
Configuring seccomp profiles for our workloads
Create a namespace:
NAMESPACE=test-seccomp
kubectl create ns ${NAMESPACE}

Seccomp profiles can be configured at the pod or container level; this time we’re going to configure it at the pod level:
NOTE: We configured the centos8-ls.json seccomp profile.

cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-ls-test
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: centos8-ls.json
  containers:
  - image: registry.centos.org/centos:8
    name: seccomp-ls-test
    command: ["ls", "/"]
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

The pod ran with no issues:

kubectl -n ${NAMESPACE} logs seccomp-ls-test

bin
dev
...

Let’s try to create a new pod that runs ls -l instead. On top of that, we will configure the seccomp profile at the container level:

cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-lsl-test
spec:
  containers:
  - image: registry.centos.org/centos:8
    name: seccomp-lsl-test
    command: ["ls", "-l", "/"]
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: centos8-ls.json
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

As expected, the pod failed, since the seccomp profile doesn’t permit all the syscalls the command requires:

kubectl -n ${NAMESPACE} logs seccomp-lsl-test

ls: cannot access '/': Operation not permitted
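To make the ls -l pod work, the profile's allow-list would need the extra metadata-related syscalls that the -l flag pulls in. The exact list should be captured with strace against the same image; the syscall names below are assumptions for illustration only. A small sketch that merges extra syscalls into a profile without duplicating existing entries:

```python
import json

# Reduced stand-in for the centos8-ls.json allow-list (assumption)
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "openat", "getdents64", "close"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

# Hypothetical extra syscalls needed by `ls -l`; verify with strace
extra = ["lstat", "readlink", "getxattr", "lgetxattr"]

# Merge, skipping names already present in the allow-list
names = profile["syscalls"][0]["names"]
names.extend(s for s in extra if s not in names)

print(json.dumps(profile, indent=2))
```

After dumping the merged profile back to /var/lib/kubelet/seccomp/ on every node, the failing pod can be recreated against it.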
Closing Thoughts
At this point you should have a clear understanding of when your workloads will benefit from using capabilities or seccomp profiles.
We have not covered how to control which capabilities or seccomp profiles a specific user can use; on Kubernetes, PodSecurityPolicies can be used for that, and on OpenShift you can use SecurityContextConstraints.
If you want to learn more about these topics, feel free to take a look at the following SCCs lab: https://github.com/mvazquezc/scc-fun/blob/main/README.md