Capabilities and Seccomp Profiles on Kubernetes
In a previous post we talked about Linux Capabilities and Secure Compute (seccomp) profiles; in this post we are going to see how we can leverage them on Kubernetes.
We will need a Kubernetes cluster; I’m going to use kcli to get one. The command below will deploy a Kubernetes cluster on VMs:
NOTE: You can create a parameters file with the cluster configuration as well.
# Create a Kubernetes 1.20 cluster with 1 master and 1 worker using calico as SDN, nginx as ingress controller, metallb for loadbalancer services and CRI-O as container runtime
kcli create kube generic -P ctlplanes=1 -P workers=1 -P ctlplane_memory=4096 -P numcpus=2 -P worker_memory=4096 -P sdn=calico -P version=1.20 -P ingress=true -P ingress_method=nginx -P metallb=true -P engine=crio -P domain=linuxera.org caps-cluster
After a few moments we will get the kubeconfig for accessing our cluster:
Kubernetes cluster caps-cluster deployed!!!
INFO export KUBECONFIG=$HOME/.kcli/clusters/caps-cluster/auth/kubeconfig
INFO export PATH=$PWD:$PATH
We can start using it right away:
export KUBECONFIG=$HOME/.kcli/clusters/caps-cluster/auth/kubeconfig
kubectl get nodes
NAME STATUS ROLES AGE VERSION
caps-cluster-master-0.linuxera.org Ready control-plane,master 8m19s v1.20.5
caps-cluster-worker-0.linuxera.org Ready worker 3m33s v1.20.5
Capabilities on Kubernetes
Capabilities on Kubernetes are configured for pods or containers via the SecurityContext.
In the next scenarios we are going to see how we can configure different capabilities for our containers and how they behave depending on the user running our container.
We will be using a demo application that listens on a given port; by default, the application image uses a non-root user. In a previous post we mentioned how capabilities behave differently depending on the user that runs the process; we will see how that plays out when running in containers.
Container Runtime Default Capabilities
As previously mentioned, container runtimes come with a set of enabled capabilities that will be assigned to every container if not otherwise specified. We’re using CRI-O in our Kubernetes cluster and we can find the default capabilities in the CRI-O configuration file at /etc/crio/crio.conf present in the nodes:
default_capabilities = [
"CHOWN",
"DAC_OVERRIDE",
"FSETID",
"FOWNER",
"SETGID",
"SETUID",
"SETPCAP",
"NET_BIND_SERVICE",
"KILL",
]
The capabilities in the list above will be the ones added to containers by default.
Pod running with root UID
Create a namespace:
NAMESPACE=test-capabilities
kubectl create ns ${NAMESPACE}

Create a pod running our test application with UID 0:
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest-root
spec:
  containers:
  - image: quay.io/mavazque/reversewords:ubi8
    name: reversewords
    securityContext:
      runAsUser: 0
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

Review the capability sets for the application process:
kubectl -n ${NAMESPACE} exec -ti reversewords-app-captest-root -- grep Cap /proc/1/status

CapInh: 00000000000005fb
CapPrm: 00000000000005fb
CapEff: 00000000000005fb
CapBnd: 00000000000005fb
CapAmb: 0000000000000000

If we decode the effective set this is what we get:

capsh --decode=00000000000005fb

0x00000000000005fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service

NOTE: You can see how the pod got assigned the runtime’s default capabilities.
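The decoding that capsh does can be reproduced with a short script. Below is a minimal sketch in Python, assuming the capability bit numbers defined in linux/capability.h (only the bits relevant to this post are listed):

```python
# Decode a capability mask (as printed in /proc/<pid>/status) into
# capability names, mirroring `capsh --decode`. Bit numbers come from
# linux/capability.h; only the bits relevant here are listed.
CAP_BITS = {
    0: "cap_chown",
    1: "cap_dac_override",
    3: "cap_fowner",
    4: "cap_fsetid",
    5: "cap_kill",
    6: "cap_setgid",
    7: "cap_setuid",
    8: "cap_setpcap",
    10: "cap_net_bind_service",
}

def decode_caps(mask: int) -> list[str]:
    # A capability is present if its bit is set in the mask
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]

print(",".join(decode_caps(0x00000000000005fb)))
# cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service
```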
Pod running with non-root UID
Create a pod running our test application with a non-root UID:

NAMESPACE=test-capabilities
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: reversewords-app-captest-nonroot
spec:
  containers:
  - image: quay.io/mavazque/reversewords:ubi8
    name: reversewords
    securityContext:
      runAsUser: 1024
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

Review the capability sets for the application process:

kubectl -n ${NAMESPACE} exec -ti reversewords-app-captest-nonroot -- grep Cap /proc/1/status

CapInh: 00000000000005fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000000005fb
CapAmb: 0000000000000000
You can see how the effective and permitted sets were cleared. We explained this behaviour in our previous post: since we execve into a process running with a non-root UID, those capability sets get cleared.
This has consequences when running our workloads on Kubernetes. Outside Kubernetes we could rely on ambient capabilities, but at the time of this writing ambient capabilities are not supported on Kubernetes. This means that we can only use file capabilities or capability-aware programs in order to grant capabilities to programs running as non-root on Kubernetes.
Configuring capabilities for our workloads
At this point we know how capabilities differ when running our workloads with a root or a non-root UID. In the next scenarios we are going to see how we can configure our workloads so they only get the capabilities they require in order to run.
Workload running with root UID
Create a deployment for our workload:
NOTE: We are dropping all of the runtime’s default capabilities; on top of that we add the NET_BIND_SERVICE capability and request the app to run with a root UID. In the environment variables we configure our app to listen on port 80.

NAMESPACE=test-capabilities
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: reversewords-app-rootuid
  name: reversewords-app-rootuid
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reversewords-app-rootuid
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-rootuid
    spec:
      containers:
      - image: quay.io/mavazque/reversewords:ubi8
        name: reversewords
        resources: {}
        env:
        - name: APP_PORT
          value: "80"
        securityContext:
          runAsUser: 0
          capabilities:
            drop:
            - CHOWN
            - DAC_OVERRIDE
            - FSETID
            - FOWNER
            - SETGID
            - SETUID
            - SETPCAP
            - KILL
            add:
            - NET_BIND_SERVICE
status: {}
EOF

We can check the logs for our application and see that it’s working fine:
kubectl -n ${NAMESPACE} logs deployment/reversewords-app-rootuid

2021/04/01 09:59:39 Starting Reverse Api v0.0.18 Release: NotSet
2021/04/01 09:59:39 Listening on port 80

If we look at the capability sets this is what we get:

kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-rootuid -- grep Cap /proc/1/status

CapInh: 0000000000000400
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 0000000000000400
CapAmb: 0000000000000000

As expected, only the NET_BIND_SERVICE capability is available:

capsh --decode=0000000000000400

0x0000000000000400=cap_net_bind_service
The workload worked as expected when running with a root UID; in the next scenario we will try the same app, this time running with a non-root UID.
Workload running with non-root UID
Create a deployment for our workload:
NOTE: We are dropping all of the runtime’s default capabilities; on top of that we add the NET_BIND_SERVICE capability and request the app to run with a non-root UID. In the environment variables we configure our app to listen on port 80.

NAMESPACE=test-capabilities
cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: reversewords-app-nonrootuid
  name: reversewords-app-nonrootuid
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reversewords-app-nonrootuid
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-nonrootuid
    spec:
      containers:
      - image: quay.io/mavazque/reversewords:ubi8
        name: reversewords
        resources: {}
        env:
        - name: APP_PORT
          value: "80"
        securityContext:
          runAsUser: 1024
          capabilities:
            drop:
            - CHOWN
            - DAC_OVERRIDE
            - FSETID
            - FOWNER
            - SETGID
            - SETUID
            - SETPCAP
            - KILL
            add:
            - NET_BIND_SERVICE
status: {}
EOF

We can check the logs for our application and see if it’s working:
kubectl -n ${NAMESPACE} logs deployment/reversewords-app-nonrootuid

2021/04/01 10:09:10 Starting Reverse Api v0.0.18 Release: NotSet
2021/04/01 10:09:10 Listening on port 80
2021/04/01 10:09:10 listen tcp :80: bind: permission denied

This time the application couldn’t bind to port 80. Let’s update the app configuration so it binds to port 8080, and then we will review the capability sets:
# Patch the app so it binds to port 8080
kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid -p '{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"reversewords"}],"containers":[{"$setElementOrder/env":[{"name":"APP_PORT"}],"env":[{"name":"APP_PORT","value":"8080"}],"name":"reversewords"}]}}}}'

# Get capability sets
kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- grep Cap /proc/1/status

CapInh: 0000000000000400
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000400
CapAmb: 0000000000000000

We don’t have NET_BIND_SERVICE in the effective set. If you remember from our previous post, we would need the capability in the ambient set for our application to work, but as we said, Kubernetes doesn’t support ambient capabilities yet, so our only option is to make use of file capabilities.

We have created a new image for our application; its binary now has the NET_BIND_SERVICE capability in the effective and permitted file capability sets. Let’s update the deployment configuration.

NOTE: We configured the app to bind to port 80 and changed the container image to the one that has the required changes.
kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid -p '{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"reversewords"}],"containers":[{"$setElementOrder/env":[{"name":"APP_PORT"}],"env":[{"name":"APP_PORT","value":"80"}],"image":"quay.io/mavazque/reversewords-captest:latest","name":"reversewords"}]}}}}'

We can check the logs for our application and see if it’s working:
kubectl -n ${NAMESPACE} logs deployment/reversewords-app-nonrootuid

2021/04/01 10:18:42 Starting Reverse Api v0.0.21 Release: NotSet
2021/04/01 10:18:42 Listening on port 80

This time the application was able to bind to port 80. Let’s review the capability sets:
kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- grep Cap /proc/1/status

CapInh: 0000000000000400
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 0000000000000400
CapAmb: 0000000000000000

NOTE: Since our application binary has the required capability in its file capability sets, the process thread was able to gain that capability.

We can check the file capability configured in our application binary:

kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- getcap /usr/bin/reverse-words

/usr/bin/reverse-words = cap_net_bind_service+eip
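For reference, an image like this can be produced by setting the file capability at build time with setcap. The following is a hypothetical Dockerfile sketch, not the actual build used for quay.io/mavazque/reversewords-captest; the base image, package name, and binary path are assumptions:

```dockerfile
FROM registry.access.redhat.com/ubi8/ubi:latest
# libcap provides the setcap utility (assumed package name on UBI8)
RUN yum -y install libcap
COPY reverse-words /usr/bin/reverse-words
# Add NET_BIND_SERVICE to the binary's effective, inheritable, and
# permitted file capability sets so a non-root process can gain it on execve
RUN setcap cap_net_bind_service+eip /usr/bin/reverse-words
USER 1024
ENTRYPOINT ["/usr/bin/reverse-words"]
```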
Seccomp Profiles on Kubernetes
In this scenario we’re going to reuse the Secure Compute profile we created in the previous post.
Configuring Seccomp Profiles on the cluster nodes
By default, the kubelet looks for seccomp profiles in the /var/lib/kubelet/seccomp/ path. This path can be changed in the kubelet configuration.
We are going to create the seccomp profile that we will be using on the nodes.

Create the file below on every node that can run workloads, as /var/lib/kubelet/seccomp/centos8-ls.json:
NOTE: This is the seccomp profile that allows a centos8 image to run ls /, as we saw in the previous post.
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64"
],
"syscalls": [
{
"names": [
"access",
"arch_prctl",
"brk",
"capget",
"capset",
"chdir",
"close",
"epoll_ctl",
"epoll_pwait",
"execve",
"exit_group",
"fchown",
"fcntl",
"fstat",
"fstatfs",
"futex",
"getdents64",
"getpid",
"getppid",
"ioctl",
"mmap",
"mprotect",
"munmap",
"nanosleep",
"newfstatat",
"openat",
"prctl",
"pread64",
"prlimit64",
"read",
"rt_sigaction",
"rt_sigprocmask",
"rt_sigreturn",
"sched_yield",
"seccomp",
"set_robust_list",
"set_tid_address",
"setgid",
"setgroups",
"setuid",
"stat",
"statfs",
"tgkill",
"write"
],
"action": "SCMP_ACT_ALLOW",
"args": [],
"comment": "",
"includes": {},
"excludes": {}
}
]
}
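The way a runtime evaluates a profile of this shape can be sketched in a few lines: a syscall listed in a rule gets that rule's action, and anything else falls back to defaultAction. The snippet below is a simplified model for illustration (it ignores architectures and argument filters), not the real libseccomp logic, and the reduced allow-list in it is an assumption:

```python
import json

# Simplified model of seccomp profile evaluation: the first rule that
# lists the syscall decides the action; otherwise defaultAction applies.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "openat", "getdents64", "close"],
     "action": "SCMP_ACT_ALLOW"}
  ]
}
""")

def action_for(profile: dict, syscall: str) -> str:
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]

print(action_for(PROFILE, "openat"))  # SCMP_ACT_ALLOW
print(action_for(PROFILE, "mount"))   # SCMP_ACT_ERRNO (the syscall fails)
```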
Configuring seccomp profiles for our workloads
Create a namespace:
NAMESPACE=test-seccomp
kubectl create ns ${NAMESPACE}

Seccomp profiles can be configured at the pod or container level; this time we’re going to configure it at the pod level:
NOTE: We configured the centos8-ls.json seccomp profile.

cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-ls-test
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: centos8-ls.json
  containers:
  - image: registry.centos.org/centos:8
    name: seccomp-ls-test
    command: ["ls", "/"]
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

The pod ran with no issues:

kubectl -n ${NAMESPACE} logs seccomp-ls-test

bin
dev
...

Let’s try to create a new pod that runs ls -l instead. On top of that, we will configure the seccomp profile at the container level:

cat <<EOF | kubectl -n ${NAMESPACE} create -f -
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-lsl-test
spec:
  containers:
  - image: registry.centos.org/centos:8
    name: seccomp-lsl-test
    command: ["ls", "-l", "/"]
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: centos8-ls.json
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
EOF

As expected, the pod failed, since the seccomp profile doesn’t permit all the syscalls the command requires:

kubectl -n ${NAMESPACE} logs seccomp-lsl-test

ls: cannot access '/': Operation not permitted
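To make the ls -l pod work, the profile's allow-list would need the extra metadata-related syscalls that the -l flag pulls in. The exact list should be captured with strace against the same image; the syscall names below are assumptions for illustration only. A small sketch that merges extra syscalls into a profile without duplicating existing entries:

```python
import json

# Reduced stand-in for the centos8-ls.json allow-list (assumption)
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "openat", "getdents64", "close"],
         "action": "SCMP_ACT_ALLOW"},
    ],
}

# Hypothetical extra syscalls needed by `ls -l`; verify with strace
extra = ["lstat", "readlink", "getxattr", "lgetxattr"]

# Merge, skipping names already present in the allow-list
names = profile["syscalls"][0]["names"]
names.extend(s for s in extra if s not in names)

print(json.dumps(profile, indent=2))
```

After dumping the merged profile back to /var/lib/kubelet/seccomp/ on every node, the failing pod can be recreated against it.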
Closing Thoughts
At this point you should have a clear understanding of when your workloads will benefit from using capabilities or seccomp profiles.
We have not covered how to control which capabilities or seccomp profiles a specific user can use; on Kubernetes, PodSecurityPolicies can be used for that, and on OpenShift you can use SecurityContextConstraints.
If you want to learn more about these topics, feel free to take a look at the following SCCs lab: https://github.com/mvazquezc/scc-fun/blob/main/README.md