CPU and Memory Management on Kubernetes with Cgroupsv2
In this post I’ll try to explain how CPU and memory management works under the hood on Kubernetes. If you ever wondered what happens when you set requests and limits for your pods, keep reading!
Attention
This is the result of my exploratory work around cgroupsv2 and their application to Kubernetes. Even though I tried really hard to make sure the information in this post is accurate, I’m far from being an expert on the topic and some information may not be 100% accurate. If you detect something that is missing / wrong, please comment on the post!
I’ll be using Kubernetes v1.26 (the latest at the time of this writing) on an operating system with support for cgroupsv2, like Fedora 37. The tool used to create the cluster is kcli and the command used was:
kcli create kube generic -P ctlplanes=1 -P workers=1 -P ctlplane_memory=4096 -P numcpus=8 -P worker_memory=8192 -P image=fedora37 -P sdn=calico -P version=1.26 -P ingress=true -P ingress_method=nginx -P metallb=true -P domain=linuxera.org resource-mgmt-cluster
Introduction to Cgroups
As we explained in a previous post, Cgroups can be used to limit what resources are available to processes on the system; since containers are processes, this applies to them as well. Kubernetes is no different.
Cgroups version 2 introduces improvements and new features on top of Cgroups version 1; you can read more about what changed in this link.
In the next sections we will see how we can limit memory and cpu for processes.
Limiting memory using Cgroupsv2
An evolved memory controller is available in Cgroupsv2; it allows for better management of memory resources for the processes inside the cgroup. In this section we will cover how to hard-limit a process to a given amount of memory, and how to use the new controls to make our programs work in memory-restricted environments.
Hard limiting memory
Hard limiting memory is pretty straightforward: we just set memory.max, and since memory is a resource that cannot be compressed, once the process reaches the limit it will be killed.
We will be using this python script:
cat <<EOF > /opt/dumb.py
f = open("/dev/urandom", "r", encoding = "ISO-8859-1")
data = ""
i=0
while i < 20:
data += f.read(10485760) # 10MiB
i += 1
print("Used %d MiB" % (i * 10))
EOF
Let’s create a new cgroup under the system.slice:
sudo mkdir -p /sys/fs/cgroup/system.slice/memorytest
Set a limit of 200MiB of RAM for this cgroup and disable swap:
echo "200M" > /sys/fs/cgroup/system.slice/memorytest/memory.max echo "0" > /sys/fs/cgroup/system.slice/memorytest/memory.swap.max
Add the current shell process to the cgroup:
echo $$ > /sys/fs/cgroup/system.slice/memorytest/cgroup.procs
Run the python script:
python3 /opt/dumb.py
Attention
Even though the script stopped at 80 MiB, that's because the Python interpreter and shared libraries also consume memory. We can check the current memory usage in the cgroup with the
systemd-cgtop system.slice/memorytest
command, or with something like this:
MEMORY=$(cat /sys/fs/cgroup/system.slice/memorytest/memory.current); echo $(( $MEMORY / 1024 / 1024 ))MiB
Used 10 MiB
Used 20 MiB
Used 30 MiB
Used 40 MiB
Used 50 MiB
Used 60 MiB
Used 70 MiB
Used 80 MiB
Killed
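If you want to confirm that the kill came from the cgroup limit, the cgroup keeps per-cgroup counters in memory.events. Here is a minimal Python sketch that reads them (assuming the memorytest cgroup from above still exists):
# Read the per-cgroup OOM counters for the memorytest cgroup created above.
events_path = "/sys/fs/cgroup/system.slice/memorytest/memory.events"
with open(events_path) as f:
    events = dict(line.split() for line in f)
# 'max' counts how many times memory.max was hit, 'oom_kill' how many
# processes were killed as a result.
print("hit memory.max:", events.get("max", "0"))
print("oom kills     :", events.get("oom_kill", "0"))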
Remove the cgroup:
Warning
Make sure you closed the shell attached to the cgroup before running the command below, otherwise it will fail.
sudo rmdir /sys/fs/cgroup/system.slice/memorytest/
Better memory management
In the previous section we saw how to hard-limit our processes to a given amount of memory; in this section we will use new controls to help our program run in memory-restricted scenarios.
As we said, memory cannot be compressed, and as such, when a process reaches the configured limit it will be OOMKilled. While this remains true, some of the memory in use by our program can be reclaimed by the kernel, freeing memory that is no longer in use.
In Cgroupsv2 we can work with the following memory configurations:
memory.high: Memory usage throttle limit. If the cgroup goes over this limit, the cgroup's processes will be throttled and put under heavy reclaim pressure.
memory.max: As we saw earlier, this is the memory usage hard limit. Anything going beyond this number gets OOMKilled.
memory.low: Best-effort memory protection. While processes in this cgroup or its child cgroups are below this threshold, the cgroup's memory won't be reclaimed unless memory cannot be reclaimed from other, unprotected cgroups.
memory.min: Specifies a minimum amount of memory that the cgroup must always retain and that won't be reclaimed by the system under any conditions, as long as memory usage is below the defined threshold.
memory.swap.high: Same as memory.high but for swap.
memory.swap.max: Same as memory.max but for swap.
Note
Memory throttling is a resource-control mechanism that limits the amount of memory a process can use; when a cgroup is throttled, the kernel will try to reclaim memory from it. Keep in mind that memory reclaim is an I/O-expensive process.
In order to demonstrate how this works, we will be using the same python script we used previously.
Let’s create a new cgroup under the system.slice:
sudo mkdir -p /sys/fs/cgroup/system.slice/memorytest2
Set a limit of 200MiB of RAM for this cgroup and disable swap:
echo "200M" > /sys/fs/cgroup/system.slice/memorytest2/memory.max echo "0" > /sys/fs/cgroup/system.slice/memorytest2/memory.swap.max
Set a throttle limit of 150MiB:
echo "150M" > /sys/fs/cgroup/system.slice/memorytest2/memory.high
Add the current shell process to the cgroup:
echo $$ > /sys/fs/cgroup/system.slice/memorytest2/cgroup.procs
Run the python script:
python3 /opt/dumb.py
Used 10 MiB
Used 20 MiB
Used 30 MiB
Used 40 MiB
Used 50 MiB
Used 60 MiB
Used 70 MiB
<Hangs here>
Delete the cgroup:
Warning
Make sure you closed the shell attached to the cgroup before running the command below, otherwise it will fail.
sudo rmdir /sys/fs/cgroup/system.slice/memorytest2/
We tried to limit the memory consumption for our process, and we failed. Determining the exact amount of memory required by an application is a difficult and error-prone task. Luckily for us, the Facebook folks created senpai. Let's see how we can use it to better determine the configuration for our process.
Download senpai:
curl -L https://raw.githubusercontent.com/facebookincubator/senpai/main/senpai.py -o /tmp/senpai.py
Create a new cgroup under the system.slice:
sudo mkdir -p /sys/fs/cgroup/system.slice/memorytest3
Add the current shell process to the cgroup:
echo $$ > /sys/fs/cgroup/system.slice/memorytest3/cgroup.procs
Run senpai in a different shell with the following command:
python3 /tmp/senpai.py
Run the python script:
python3 /opt/dumb.py
At this point senpai should've set the memory.high restriction for our cgroup based on the usage of our python script:
cat /sys/fs/cgroup/system.slice/memorytest3/memory.high
437448704
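That value is in bytes. A quick sketch to convert it to MiB and pick a memory.max with a bit of headroom (the 450M used below is just that rounded-up figure):
# senpai adjusts memory.high in bytes; convert it and add some headroom for memory.max.
observed_high = 437448704                   # value read from memory.high above
observed_mib = observed_high / (1024 * 1024)
print("observed ~%d MiB -> setting memory.max to 450M leaves some headroom" % observed_mib)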
We can stop senpai. We need around 420MiB of memory to run our python script, so a better configuration for it would be:
Attention
We are adding a max swap usage of 50M to ease memory reclaim.
echo "450M" > /sys/fs/cgroup/system.slice/memorytest3/memory.max
echo "50M" > /sys/fs/cgroup/system.slice/memorytest3/memory.swap.max
At this point we should be able to run the program with no issues:
python3 /opt/dumb.py
Used 10 MiB
Used 20 MiB
...
Used 190 MiB
Used 200 MiB
Delete the cgroup:
Warning
Make sure you closed the shell attached to the cgroup before running the command below, otherwise it will fail.
sudo rmdir /sys/fs/cgroup/system.slice/memorytest3/
Now that we have seen how to limit memory, let’s see how to limit CPU.
Limiting CPU using Cgroupsv2
Limiting CPU is not as straightforward as limiting memory. Since CPU can be compressed, we can make sure a process doesn't use more CPU than allowed without having to kill it.
We need to configure the parent cgroup so that it has the cpu and cpuset controllers enabled for its children. The example below configures the controllers for the system.slice cgroup, which is the parent group we will be using. By default, only the memory and pids controllers are enabled.
Enable the cpu and cpuset controllers for the children of /sys/fs/cgroup/ and /sys/fs/cgroup/system.slice:
echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
Limiting CPU — Pin process to CPU and limit CPU bandwidth
Let’s create a new cgroup under the system.slice:
sudo mkdir -p /sys/fs/cgroup/system.slice/cputest
Assign only 1 core to this cgroup:
Attention
Below command assigns core 0 to our cgroup.
echo "0" > /sys/fs/cgroup/system.slice/cputest/cpuset.cpus
Set a limit of half-cpu for this cgroup:
Attention
The value written to cpu.max is the allowed CPU time quota in microseconds per period (the default period is 100000 microseconds), so 50000 represents 50% of a single core.
echo 50000 > /sys/fs/cgroup/system.slice/cputest/cpu.max
Add the current shell process to the cgroup:
echo $$ > /sys/fs/cgroup/system.slice/cputest/cgroup.procs
Download the cpuload utility:
curl -L https://github.com/vikyd/go-cpu-load/releases/download/0.0.1/go-cpu-load-linux-amd64 -o /tmp/cpuload && chmod +x /tmp/cpuload
Run the cpu load:
Attention
We're requesting 1 core at 50% CPU; this should fit within the cpu.max setting.
/tmp/cpuload -p 50 -c 1
If we check the usage with systemd-cgtop system.slice/cputest we will see something like this:
Control Group            Tasks  %CPU  Memory  Input/s  Output/s
system.slice/cputest         6  47.7  856.0K        -         -
Since we’re within the budget, we shouldn’t see any throttling happening:
Note
CPU throttling is a resource control mechanism that limits the amount of CPU time a process can use, preventing it from consuming excessive CPU resources and affecting the performance of other processes.
grep throttled /sys/fs/cgroup/system.slice/cputest/cpu.stat
nr_throttled 0
throttled_usec 0
If we stop the cpuload command and request 100% of 1 core we will see throttling:
/tmp/cpuload -p 100 -c 1
Control Group            Tasks  %CPU  Memory  Input/s  Output/s
system.slice/cputest         6  50.0  720.0K        -         -
grep throttled /sys/fs/cgroup/system.slice/cputest/cpu.stat
nr_throttled 336
throttled_usec 16640745
Remove the cgroup:
Warning
Make sure you closed the shell attached to the cgroup before running the command below, otherwise it will fail.
sudo rmdir /sys/fs/cgroup/system.slice/cputest/
This use case is very simple: we pinned our process to one core and limited the CPU to half a core. Let's see what happens when multiple processes compete for the CPU.
Limiting CPU — Pin processes to CPU and limit CPU bandwidth
Let’s create a new cgroup under the system.slice:
sudo mkdir -p /sys/fs/cgroup/system.slice/compitingcputest
Assign only 1 core to this cgroup:
Attention
Below command assigns core 0 to our cgroup.
echo "0" > /sys/fs/cgroup/system.slice/compitingcputest/cpuset.cpus
Set a limit of one cpu for this cgroup:
Attention
The value written to cpu.max is the allowed CPU time quota in microseconds per period (the default period is 100000 microseconds), so 100000 represents 100% of a single core.
echo 100000 > /sys/fs/cgroup/system.slice/compitingcputest/cpu.max
Open two shells and attach their processes to the cgroup (run the command below in both shells):
echo $$ > /sys/fs/cgroup/system.slice/compitingcputest/cgroup.procs
Run the cpu load in one of the shells:
Attention
We're requesting 1 core at 100% CPU; this should fit within the cpu.max setting.
/tmp/cpuload -p 100 -c 1
If we check for throttling we will see that no throttling is happening.
grep throttled /sys/fs/cgroup/system.slice/compitingcputest/cpu.stat
nr_throttled 0
throttled_usec 0
Run another instance of cpuload on the other shell:
/tmp/cpuload -p 100 -c 1
At this point we still shouldn't see throttling, but the CPU time will be shared by the two processes; in the top output below we can see that each process is consuming half a CPU.
PID     USER  PR  NI  VIRT  RES   SHR   S  %CPU  %MEM  TIME+    COMMAND
822742  root  20   0  4104  2004  1680  S  49.8   0.0  0:24.30  cpuload
822717  root  20   0  4104  2008  1680  S  49.5   0.1  6:28.51  cpuload
Close the shells and remove the cgroup:
Warning
Make sure you closed the shell attached to the cgroup before running the command below, otherwise it will fail.
sudo rmdir /sys/fs/cgroup/system.slice/compitingcputest/
In this use case, we pinned our processes to one core and limited the CPU to one full core. On top of that, we spawned two processes that competed for CPU. Since no CPU bandwidth distribution was set, each process got half a CPU. In the next section we will see how to distribute CPU across processes using weights.
Limiting CPU — Pin processes to CPU, limit and distribute CPU bandwidth
Let’s create a new cgroup under the system.slice with two sub-groups (appA and appB):
sudo mkdir -p /sys/fs/cgroup/system.slice/distributedbandwidthtest/{appA,appB}
Enable the cpu and cpuset controllers for the children of /sys/fs/cgroup/system.slice/distributedbandwidthtest:
echo "+cpu" >> /sys/fs/cgroup/system.slice/distributedbandwidthtest/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/system.slice/distributedbandwidthtest/cgroup.subtree_control
Assign only 1 core to the parent cgroup:
Attention
Below command assigns core 0 to our cgroup.
echo "0" > /sys/fs/cgroup/system.slice/distributedbandwidthtest/cpuset.cpus
Set a limit of one cpu for this cgroup:
Attention
The value written to cpu.max is the allowed CPU time quota in microseconds per period (the default period is 100000 microseconds), so 100000 represents 100% of a single core.
echo 100000 > /sys/fs/cgroup/system.slice/distributedbandwidthtest/cpu.max
Open two shells and attach their processes to the different child cgroups, then run cpuload:
Shell 1
echo $$ > /sys/fs/cgroup/system.slice/distributedbandwidthtest/appA/cgroup.procs
/tmp/cpuload -p 100 -c 1
Shell 2
echo $$ > /sys/fs/cgroup/system.slice/distributedbandwidthtest/appB/cgroup.procs
/tmp/cpuload -p 100 -c 1
If you check the top output, you will see that the CPU is evenly distributed across both processes. Let's modify the weights to give more CPU to the appB cgroup.
In cgroupsv1 there was the concept of cpu shares; in cgroupsv2 this changed and now we use cpu weights. All weights are in the range [1, 10000] with the default at 100. This allows symmetric multiplicative biases in both directions at fine enough granularity while staying in an intuitive range. If we wanted to give appA 30% of the CPU and appB the other 70%, provided that the parent cgroup's CPU weight is set to 100, this is the configuration we would apply:
cat /sys/fs/cgroup/system.slice/distributedbandwidthtest/cpu.weight
100
Assign 30% of the CPU to appA:
echo 30 > /sys/fs/cgroup/system.slice/distributedbandwidthtest/appA/cpu.weight
Assign 70% of the CPU to appB:
echo 70 > /sys/fs/cgroup/system.slice/distributedbandwidthtest/appB/cpu.weight
If we look at the top output we will see something like this:
Attention
You can see how one of the cpuload processes is getting 70% of the cpu while the other is getting the other 30%.
PID   USER  PR  NI  VIRT  RES   SHR   S  %CPU  %MEM  TIME+     COMMAND
1077  root  20   0  4104  2008  1680  S  70.0   0.1  12:41.27  cpuload
1071  root  20   0  4104  2008  1680  S  30.0   0.1  12:24.14  cpuload
Close the shells and remove the cgroups:
Warning
Make sure you closed the shell attached to the cgroup before running the command below, otherwise it will fail.
sudo rmdir /sys/fs/cgroup/system.slice/distributedbandwidthtest/appA/
sudo rmdir /sys/fs/cgroup/system.slice/distributedbandwidthtest/appB/
sudo rmdir /sys/fs/cgroup/system.slice/distributedbandwidthtest/
At this point we should have a clear understanding of how the basics work. The next section will apply these concepts to Kubernetes.
Resource Management on Kubernetes
We won't be covering the basics here; I recommend reading the official docs for that. We will be focusing on CPU/Memory requests and limits.
Cgroupsv2 configuration for a Kubernetes node
Before describing the cgroupsv2 configuration, we need to understand how Kubelet configurations will impact cgroupsv2 configurations. In our test cluster, we have the following Kubelet settings in order to reserve resources for system daemons:
systemReserved:
cpu: 500m
memory: 500Mi
If we describe the node this is what we will see:
kubectl describe node <compute-node>
Attention
You can see how half cpu (500m) and 500Mi of memory have been subtracted from the allocatable capacity.
Capacity:
cpu: 4
<omitted>
memory: 6069552Ki
Allocatable:
cpu: 3500m
<omitted>
memory: 5455152Ki
Even if we remove resources from the allocatable capacity, depending on the QoS of our pods we could still overcommit on resources. At that point eviction may happen, and cgroups will make sure that higher-priority pods get the resources they requested.
Cgroupsv2 configuration on the node
In a regular Kubernetes node we will have at least three main parent cgroups:
kubepods.slice: Parent cgroup used by Kubernetes to place pod processes. It has two child cgroups named after pod QoS classes: kubepods-besteffort.slice and kubepods-burstable.slice. Guaranteed pods get created directly inside this parent cgroup.
system.slice: Parent cgroup used by the OS to place system processes. Kubelet, sshd, etc. run here.
user.slice: Parent cgroup used by the OS to place user processes. When you run a regular command, it runs here.
Note
In Systemd, a Slice is a concept for hierarchically managing the resources of a group of processes; this management is done by creating a cgroup. Scopes manage a set of externally created processes; the main purpose of a scope is grouping worker processes for resource management.
/sys/fs/cgroup/
├── kubepods.slice
│ ├── kubepods-besteffort.slice
│ │ └── kubepods-besteffort-pod7589d90f_83af_4a05_a4ee_8bb078db72b8.slice
│ │ ├── cri-containerd-2be6af51555a1d9ebb8678f3254e81b5f3547dfc230b07a2c1067f5d430b7221.scope
│ │ └── cri-containerd-cbce8911226299472976f069f20afe0ba20c80037f9fd8394c0a8f8aaac60bee.scope
│ ├── kubepods-burstable.slice
│ │ └── kubepods-burstable-pode00fb079_24be_4039_b2cb_f68876881d70.slice
│ │ ├── cri-containerd-a0c611e1b04856e9d565dfef25746d7bdcaaf12bb92fff6221aa6b89a12fbb31.scope
│ │ └── cri-containerd-ea6361278865134bd9d52e718baa556e7693d766ab38d28d64941a1935fae004.scope
│ └── kubepods-podbe70a1c9_81c5_4764_b28f_0965edee08d0.slice
│ ├── cri-containerd-208bf4e7ddeef45a3bd3acff96ff0747b35e9204cea418082b586df6adf022ad.scope
│ └── cri-containerd-71305184cec893cd21cfef2cbe699453ad89a51e4f60586670f194574f287a53.scope
├── system.slice
│ ├── kubelet.service
│ └── sshd.service
└── user.slice
└── user-1000.slice
In order to get these cgroups created, Kubelet uses one of the two available cgroup drivers: systemd or cgroupfs. Cgroupsv2 is only supported by the systemd driver.
The root cgroup kubepods.slice and the QoS cgroups kubepods-besteffort.slice and kubepods-burstable.slice are created by Kubelet when it starts. On top of that, Kubelet will create a cgroup (using the driver) as soon as a new Pod gets created. A pod has from 1 to N containers, and the cgroups for these containers are created by the container runtime, also using the driver.
In the output above you can see cgroups for pods, like kubepods-besteffort-pod7589d90f_83af_4a05_a4ee_8bb078db72b8.slice, and for containers, like cri-containerd-2be6af51555a1d9ebb8678f3254e81b5f3547dfc230b07a2c1067f5d430b7221.scope.
So far, we have been looking at the configuration of cgroups via the filesystem. Systemd tooling can be used for that as well:
systemctl show --no-pager cri-containerd-2be6af51555a1d9ebb8678f3254e81b5f3547dfc230b07a2c1067f5d430b7221.scope
<OMITTED_OUTPUT>
CPUWeight=1
MemoryMax=infinity
<OMITTED_OUTPUT>
CPU Bandwidth configuration on the node
In the previous sections we talked about how cpu.weight works for distributing CPU bandwidth to processes. The parent cgroups in a Kubernetes node will be configured as follows:
system.slice: a cpu.weight of 100.
user.slice: a cpu.weight of 100.
In a Kubernetes node we won't have many (if any) user processes running, so in the end the two cgroups competing for resources will be system.slice and kubepods.slice. But wait, what cpu.weight is configured for kubepods.slice?
When Kubelet starts, it detects the number of CPUs available on the node and reads the systemReserved.cpu configuration. That gives it the number of millicores available for Kubernetes to use on that node.
For example, if I have a 4 CPU node, that's 4000 millicores; if I reserve 500m for system resources (kubelet, sshd, etc.), that leaves Kubernetes with 3500 millicores that can be assigned to workloads.
Now, Kubelet knows that 3500 millicores is the amount of CPU that can be assigned to workloads (and assigned means it is more or less assured in case workloads request it). The cgroups' cpu.weight needs to be configured so CPU gets distributed accordingly. Let's see how that's done:
- In the past (cgroupsv1), CPU shares were used and every CPU was represented by 1024 shares. Now we need to translate from shares to weight, and the community has a formula for that (more info here).
- In cgroupsv2, Kubernetes still computes shares under the hood, but only because the translation formula (created to avoid changing the pod specification) requires them. There is a constant that sets the shares per CPU to 1024 and a function that translates millicores to shares.
- Finally, there is a function that translates CPU shares to CPU weight using the formula from the first point.
Once we know the weight that needs to be applied to kubepods.slice, the relevant code that applies it is here and here.
Continuing with the example, the cpu.weight for our 4 CPU node with 500 millicores reserved for system resources would be:
Formula being used: (((cpuShares - 2) * 9999) / 262142) + 1
cpuShares = 3.5 Cores * 1024 = 3584
cpu.weight = (((3584 - 2) * 9999) / 262142) + 1 = 137.62 (rounded down to 137 with integer math)
If we check our node:
cat /sys/fs/cgroup/kubepods.slice/cpu.weight
137
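Putting the whole chain together, here is a small Python sketch of the millicores -> shares -> weight conversion described above. The constants mirror the formula from this section; treat it as an illustration rather than the actual Kubelet code:
def milli_cpu_to_shares(milli_cpu):
    # 1 CPU == 1024 shares, so convert millicores proportionally.
    return milli_cpu * 1024 // 1000

def shares_to_weight(shares):
    # Map the cgroupsv1 shares range to the cgroupsv2 weight range [1, 10000].
    return 1 + ((shares - 2) * 9999) // 262142

# 4 CPU node with 500m reserved for the system -> 3500m left for kubepods.slice.
shares = milli_cpu_to_shares(4000 - 500)   # 3584
weight = shares_to_weight(shares)          # 137 with integer math
print("cpuShares=%d cpu.weight=%d" % (shares, weight))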
At this point we know how the different cgroups get configured on the node. Next, let's see what happens when kubepods.slice and system.slice compete for CPU.
kubepods.slice and system.slice competing for CPU
In the previous section we saw how the different cgroups get configured on our 4 CPU node; in this section we will see what happens when the two slices compete for CPU.
Let's say that we have two processes, the sshd service and a guaranteed pod. Both processes have access to all 4 CPUs and they're trying to use 100% of the 4 CPUs.
To calculate the percentage of CPU allocated to each process, we can use the following formulas:
- Pod Process: (cpu.weight of pod / total cpu.weight) * number of CPUs
- Ssh Process: (cpu.weight of ssh / total cpu.weight) * number of CPUs
In this case, the total cpu.weight is 237 (137 from kubepods.slice + 100 from system.slice), so:
- Pod Process: (137 / 237) * 4 = 2.31 CPUs or ~231%
- Ssh Process: (100 / 237) * 4 = 1.69 CPUs or ~169%
So the pod process would get around 231% of the available CPU (400% -> 4 cores x 100) and the ssh process would get around 169% of the available CPU.
Danger
Keep in mind that these calculations are not 100% accurate, since the CFS scheduler will try to assign CPU in the fairest way possible and results may vary depending on the system load and other processes running on the system.
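The same split can be computed with a tiny helper; a rough sketch under the assumptions above (only these two slices are hungry, 4 CPUs, no pinning or bandwidth limits involved):
def cpu_split(weights, num_cpus):
    # Distribute the node's CPUs proportionally to the competing cpu.weight values.
    total = sum(weights.values())
    return {name: w / total * num_cpus for name, w in weights.items()}

for name, cpus in cpu_split({"kubepods.slice": 137, "system.slice": 100}, 4).items():
    print("%s: %.2f CPUs (~%.0f%%)" % (name, cpus, cpus * 100))
# kubepods.slice: 2.31 CPUs (~231%)
# system.slice: 1.69 CPUs (~169%)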
Cgroupsv2 configuration for a Pod
In the previous sections we have focused on the configuration at the node level, but let’s see what happens when we create a pod on the different QoS.
Cgroup configuration for a BestEffort Pod
We will be using this pod definition:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: cputest
name: cputest-besteffort
spec:
containers:
- image: quay.io/mavazque/trbsht:latest
name: cputest
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
Once created, in the node where the pod gets scheduled we can find the cgroup that was created by using these commands:
Get container id:
crictl ps | grep cputest-besteffort
2be6af51555a1 b67fff43d1e61 4 minutes ago Running cputest 0 cbce891122629 cputest-besteffort
Get the cgroups path:
crictl inspect 2be6af51555a1 | jq '.info.runtimeSpec.linux.cgroupsPath'
"kubepods-besteffort-pod7589d90f_83af_4a05_a4ee_8bb078db72b8.slice:cri-containerd:2be6af51555a1d9ebb8678f3254e81b5f3547dfc230b07a2c1067f5d430b7221"
With the above information, the full path for the pod-level cgroup will be:
/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod7589d90f_83af_4a05_a4ee_8bb078db72b8.slice
If we check the cpu.max, cpu.weight and memory.max configuration, this is what we see:
- cpu.max is set to max 100000.
- cpu.weight is set to 1.
- memory.max is set to max.
As we can see, the pod is allowed to use as much CPU as it wants, but it has the lowest weight possible, which means it will only get CPU when processes with a higher weight yield some; expect these pods to be starved of CPU when the system is under load. On the memory side, it can use as much memory as it wants, but if the cluster needs to evict this pod to reclaim memory in order to schedule higher-priority pods, the container will be OOMKilled. The max in the cpu.max config means the processes can use all the CPU time available on the system (which varies depending on the speed of your CPU).
Cgroup configuration for a Burstable Pod
We will be using this pod definition:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: cputest
name: cputest-burstable
spec:
containers:
- image: quay.io/mavazque/trbsht:latest
name: cputest
resources:
requests:
cpu: 2
memory: 100Mi
dnsPolicy: ClusterFirst
restartPolicy: Always
Once created, on the node where the pod got scheduled, get the cgroup configuration by following the steps described previously. This is the configuration we see:
- cpu.max is set to max 100000.
- cpu.weight is set to 79.
- memory.max is set to max.
The pod will be allowed to use as much CPU as it wants, and the weight has been set so it has a certain priority over other processes running on the system. On the memory side it can use as much memory as it wants, but if the cluster needs to evict this pod to reclaim memory in order to schedule higher-priority pods, the container will be OOMKilled. The cpu.weight value 79 comes from the formula we saw earlier ((((cpuShares - 2) * 9999) / 262142) + 1):
cpuShares = 2 Cores * 1024 = 2048
cpu.weight = (((2048 - 2) * 9999) / 262142) + 1 = 79.04 (rounded down to 79 with integer math)
Cgroup configuration for a Guaranteed Pod
We will be using this pod definition:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: cputest
name: cputest-guaranteed
spec:
containers:
- image: quay.io/mavazque/trbsht:latest
name: cputest
resources:
requests:
cpu: 2
memory: 100Mi
limits:
cpu: 2
memory: 100Mi
dnsPolicy: ClusterFirst
restartPolicy: Always
Once created, on the node where the pod got scheduled, get the cgroup configuration by following the steps described previously. This is the configuration we see:
- cpu.max is set to 200000 100000.
- cpu.weight is set to 79.
- memory.max is set to 104857600 (100Mi = 104857600 bytes).
The cpu.max value is different from what we have seen so far. The first value, 200000, is the allowed CPU time quota in microseconds that the cgroup's processes can consume during one period; the second value, 100000, specifies the length of the period in microseconds. Once the processes consume the quota, they are throttled for the remainder of the period and won't be allowed to run until the next period starts. This specific configuration allows the processes in the cgroup to consume 0.2 seconds of CPU time in every 0.1-second period, i.e. up to two full CPUs, which matches the limit of 2 CPUs we set. On the memory side, the container can use up to 100Mi; once it reaches this value the kernel will try to reclaim some memory, and if it cannot be reclaimed the container will be OOMKilled.
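To make the quota/period relation concrete, here is a small sketch that converts a pod CPU limit into the pair written to cpu.max, assuming the default 100000-microsecond period used by Kubernetes:
def cpu_limit_to_cpu_max(limit_cores, period_us=100000):
    # quota is the CPU time in microseconds the cgroup may consume per period.
    quota_us = int(limit_cores * period_us)
    return "%d %d" % (quota_us, period_us)

print(cpu_limit_to_cpu_max(2))    # "200000 100000" -> limits.cpu: 2
print(cpu_limit_to_cpu_max(0.5))  # "50000 100000"  -> limits.cpu: 500m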
Even if the guaranteed QoS ensures that your application gets the CPU it requested, sometimes your application may benefit from burstable capabilities, since the CPU won't be throttled during peaks (e.g., more visits to a web server).
How Kubepods Cgroups compete for resources
In the previous examples we have seen how the different pods get different CPU configurations. But what happens when they compete against each other for resources?
In order for guaranteed pods to have more priority than burstable pods, and for those to have more priority than besteffort pods, different weights get set for the three slices. In a 4 CPU node these are the settings we get:
- Guaranteed pods will run under kubepods.slice, which has a cpu.weight of 137.
- Burstable pods will run under kubepods.slice/kubepods-burstable.slice, which has a cpu.weight of 86.
- BestEffort pods will run under kubepods.slice/kubepods-besteffort.slice, which has a cpu.weight of 1.
As we can see from the above configuration, the weights define the CPU priority. Keep in mind that pods running inside the same parent slice also compete for resources. In that situation, the total cpu.weight is the sum of the cpu weights of all the CPU-hungry processes inside that parent cgroup. For example:
We have two burstable pods; these are the cpu weights that will be configured (based on the formulas we have seen so far):
burstable1 requests 2 CPUs and gets a cpu.weight of 79
burstable2 requests 1 CPU and gets a cpu.weight of 39
So this is the CPU each one will get (formula: (cpu.weight of pod / total cpu.weight) * 100 * number of CPUs):
Danger
Keep in mind that these calculations are not 100% accurate, since the CFS will try to assign CPU in the fairest way possible and results may vary depending on the system load and other processes running on the system. These calculations also assume that there are no guaranteed pods demanding CPU. The 118 value comes from summing the weights of all the CPU-hungry processes in the burstable cgroup (in this case only two pods: burstable1 with cpu.weight=79 and burstable2 with cpu.weight=39).
burstable1: (79/118) * 100 * 4 = ~268% (or 2.68 CPU)
burstable2: (39/118) * 100 * 4 = ~132% (or 1.32 CPU)
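Reusing the shares/weight helper from the earlier sketch, we can check those weights and the resulting split for the two hypothetical pods above:
def shares_to_weight(shares):
    return 1 + ((shares - 2) * 9999) // 262142

requests_milli = {"burstable1": 2000, "burstable2": 1000}
weights = {pod: shares_to_weight(m * 1024 // 1000) for pod, m in requests_milli.items()}
print(weights)  # {'burstable1': 79, 'burstable2': 39}

total = sum(weights.values())  # 118
for pod, w in weights.items():
    print("%s: ~%.0f%% of the node CPU" % (pod, w / total * 4 * 100))
# burstable1: ~268%   burstable2: ~132%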
Closing Thoughts
Even if knowing the low-level details about resource management on Kubernetes may not be needed on a day-to-day basis, it's great to know how the different pieces tie together. If you're working in environments where performance and latency are critical, like telco environments, knowing this information can make the difference!
On top of that, some of the new features that cgroupsv2 enable are:
- Container-aware OOM killing: useful when you have sidecars; this could be used to OOMKill the sidecar container rather than your application container.
- Running Kubernetes system components rootless: more secure Kubernetes environments.
- Kubernetes Memory QoS: Better overall control of the memory used by pods.
The Kubernetes Memory QoS kind of relates to this post, so I’ll be writing a new post covering that in the future.
Finally, in the next section I'll list interesting resources around the topic; some of them were my sources when learning all this.
Useful Resources
- KubeCon NA 2022 - Cgroupv2 is coming soon to a cluster near you talk. Slides and Recording.
- FOSDEM 2023 - 7 years of cgroup v2 talk. Slides and Recording.
- Lisa 2021 - 5 years of cgroup v2 talk. Slides and Recording.
- KubeCon EU 2020 - Kubernetes On Cgroup v2. Slides and Recording.
- cgroups man page and kernel docs.
- RHEL8 cgroupv2 docs.
- Martin Heinz blog on kubernetes cgroups.
- Kubernetes cgroups docs.
- Kubernetes manage resources for containers docs.
- Kubernetes reserve compute resources docs.
- Runc Systemd driver docs.
- Systemd scope and slice docs.