tl;dr: please jump directly to the conclusion if you think you already have some knowledge about memory and just want the recap. The conclusion links back to the other sections for more details.

Disclaimer

This (long) article is a Frankenstein 🧟 compilation of personal digest notes on various topics. Some sections are direct digests or extracts of very good resources I found on the topic and could not improve upon:

If what you find here is not enough for the specific sections marked as extract, digest, or inspiration, read those articles directly instead. In that case, I also recommend following most of this blog post's hyperlinks; they are the best resources I could find so far.


Introduction

To start this post about memory, let’s define a few terms and quickly explain how paging works on Linux. The following are extracts from the excellent Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.

Glossary

This section is an extract from Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.

  • Main memory: Also referred to as physical memory, this describes the fast data storage area of a computer, commonly provided as DRAM.
  • Virtual memory: An abstraction of main memory that is (almost) infinite and non-contended. Virtual memory is not real memory.
  • Resident memory: Memory that currently resides in main memory.
  • Anonymous memory: Memory with no file system location or path name. It includes the working data of a process address space, called the heap.
  • Address space: A memory context. There are virtual address spaces for each process, and for the kernel.
  • Segment: An area of virtual memory flagged for a particular purpose, such as for storing executable or writeable pages.
  • Instruction text: Refers to CPU instructions in memory, usually in a segment.
  • OOM: Out of memory, when the kernel detects low available memory.
  • Page: A unit of memory, as used by the OS and CPUs. Historically it is either 4 or 8 Kbytes. Modern processors have multiple page size support for larger sizes.
  • Page fault: An invalid memory access. These are normal occurrences when using on-demand virtual memory.
  • Paging: The transfer of pages between main memory and the storage devices.
  • Swapping: Linux uses the term swapping to refer to anonymous paging to the swap device (the transfer of swap pages). In Unix and other operating systems, swapping is the transfer of entire processes between main memory and the swap devices. This book uses the Linux version of the term.
  • Swap: An on-disk area for paged anonymous data. It may be an area on a storage device, also called a physical swap device, or a file system file, called a swap file. Some tools use the term swap to refer to virtual memory (which is confusing and incorrect).

Memory overcommit and the OOM killer

This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Overcommit by Mihai Albert.

Linux overcommits memory (while Windows, for the record, does not; the original author wrote another post on that topic). However, you can tune this behavior with sysctl vm.overcommit_memory=<value>, which supports different modes: for example, 1 is “always overcommit” and 2 is “prevent overcommit” (behaving similarly to Windows). Overcommitting memory allows accommodating applications that allocate a lot of memory but don’t use all of it, and it is also consistent with the nature of forking, where the kernel copy-on-writes the parent’s memory to the child. Note that there are heated debates about whether overcommit is a good or a bad thing.

The OOM killer steps in when memory hits a critical level: the kernel eventually has to back virtual memory with physical memory and must choose a process to kill in order to free memory. The documentation of its principles is clear and concise. Note that even by tuning the vm.overcommit_memory sysctl, it’s hard to actually disable the OOM killer completely.

The Paging Mechanism

This section is an extract from Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.

Figure 7.2: Page fault example

The result of the virtual memory model and demand allocation is that any page of virtual memory may be in one of the following states:

  1. Unallocated
  2. Allocated, but unmapped (unpopulated and not yet faulted)
  3. Allocated, and mapped to main memory (RAM)
  4. Allocated, and mapped to the physical swap device (disk)

State (4) is reached if the page is paged out due to system memory pressure. A transition from (2) to (3) is a page fault. If it requires disk I/O, it is a major page fault; otherwise, a minor page fault.

From these states, two memory usage terms can also be defined:

  • Resident set size (RSS): The size of allocated main memory pages (3)
  • Virtual memory size: The size of all allocated areas (2 + 3 + 4)
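
To make states (2) and (3) concrete, here is a minimal, Linux-only Go sketch (my own illustration, not from the book) that reserves a large anonymous mapping and then touches only part of it: the whole reservation counts toward virtual memory size, while only the touched pages become resident through minor page faults.

package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

func main() {
	const size = 1 << 30 // reserve 1 GiB of anonymous virtual memory

	// State (2): allocated but unmapped, no physical memory is used yet.
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(mem)

	fmt.Printf("mapped 1 GiB, compare VmSize and VmRSS in /proc/%d/status\n", os.Getpid())
	time.Sleep(10 * time.Second)

	// State (3): touching one byte per 4 KiB page triggers minor page
	// faults and makes roughly 256 MiB of the mapping resident.
	for off := 0; off < size/4; off += 4096 {
		mem[off] = 1
	}

	fmt.Println("touched 256 MiB worth of pages, VmRSS should have grown")
	time.Sleep(10 * time.Second)
}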

Process Memory

Understanding a process’s use of memory can be tricky: it may be counter-intuitive, but it is hard to give a universal definition of “memory use” and get a perfect statistic. In that regard, for more details, Brendan Gregg has been trying to define and estimate a metric: the Working Set Size. Here are a few basic actions to read memory usage, and some more elaborate methods for Golang and eBPF programs.

Reading a Linux process memory use

First, you can read the RSS stat of the process, as detailed in the proc(5) manpage; the read is fast but inaccurate. You can find the same information in a parsable form at /proc/pid/stat, or measured in pages at /proc/pid/statm. Let’s read the “human” form at /proc/pid/status:

grep -i rss /proc/$(pidof <process>)/status
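
If you prefer to read this value programmatically rather than with grep, here is a minimal Go sketch; readVmRSS is a hypothetical helper of mine, not a standard function, and it simply parses the VmRSS line (reported in kB) from /proc/<pid>/status.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readVmRSS returns the VmRSS value (as reported, in kB) from
// /proc/<pid>/status for the given PID.
func readVmRSS(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found for pid %s", pid)
}

func main() {
	rss, err := readVmRSS(os.Args[1])
	if err != nil {
		panic(err)
	}
	fmt.Println("VmRSS:", rss)
}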

For slower but more accurate results, one can use /proc/pid/smaps_rollup as per the proc(5) manpage.

sudo grep -i rss /proc/$(pidof <process>)/smaps_rollup

To get the details of RSS consumption by memory segment, you can use:

sudo grep -e '^[^A-Z]' -e Rss /proc/$(pidof <process>)/smaps | less

Or better, you can use pmap(1) to get a nicely formatted version of /proc/pid/smaps:

sudo pmap $(pidof <process>) -xp

If looking at the segments, especially the anonymous ones, isn’t helpful, you can try to trace the memory operations of the process to see what’s happening.

strace -e trace=memory -o out.trace <cmd>

If you are looking at a runtime allocating memory, the traces can be rather confusing, and you are better off using the runtime’s tools to analyze anonymous memory consumption directly.

Golang memory use and the garbage collector

This section is a cut-out from the Golang Documentation: A Guide to the Go Garbage Collector. It might be outdated; it was written as of Go 1.22.

Here are some important resources:

Garbage collection

[…] In the context of this document, garbage collection refers to tracing garbage collection, which identifies in-use, so-called live, objects by following pointers transitively.

Together, objects and pointers to other objects form the object graph. To identify live memory, the GC walks the object graph starting at the program’s roots, pointers that identify objects that are definitely in-use by the program. Two examples of roots are local variables and global variables. The process of walking the object graph is referred to as scanning.

[…] Go’s GC uses the mark-sweep technique, which means that in order to keep track of its progress, the GC also marks the values it encounters as live. Once tracing is complete, the GC then walks over all memory in the heap and makes all memory that is not marked available for allocation. This process is called sweeping.

The GOGC parameter

GOGC determines the trade-off between GC CPU and memory

It works by determining the target heap size after each GC cycle, a target value for the total heap size in the next cycle. The GC’s goal is to finish a collection cycle before the total heap size exceeds the target heap size. Total heap size is defined as the live heap size at the end of the previous cycle, plus any new heap memory allocated by the application since the previous cycle. Meanwhile, target heap memory is defined as:

Target heap memory = Live heap + (Live heap + GC roots) * GOGC / 100
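
As a quick sanity check of that formula, here is a tiny Go sketch with illustrative numbers of my own (not from the guide): with a live heap of 100 MiB, negligible GC roots, and the default GOGC=100, the next collection is targeted around 200 MiB.

package main

import "fmt"

// targetHeap applies the formula above; sizes are in MiB and the
// numbers used below are purely illustrative.
func targetHeap(liveHeap, gcRoots float64, gogc int) float64 {
	return liveHeap + (liveHeap+gcRoots)*float64(gogc)/100
}

func main() {
	fmt.Println(targetHeap(100, 0, 100)) // 200: default trade-off
	fmt.Println(targetHeap(100, 0, 50))  // 150: more GC CPU, less memory
	fmt.Println(targetHeap(100, 0, 200)) // 300: less GC CPU, more memory
}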

The Go Memory Limit

For workloads running in constrained environments (typically containers), Go 1.19 introduced GOMEMLIMIT, to take advantage of a high GOGC while not getting OOM-killed during transient memory spikes.

See the following situation with GOGC = 100:

Memory trace with GOGC = 100

 Now with GOGC = 100 and GOMEMLIMIT = 35 MB: 

Memory trace with GOGC = 100 and GOMEMLIMIT = 35 MB

Now, while the memory limit is clearly a powerful tool, the use of a memory limit does not come without a cost, and certainly doesn’t invalidate the utility of GOGC.

[…] This situation, where the program fails to make reasonable progress due to constant GC cycles, is called thrashing. It’s particularly dangerous because it effectively stalls the program. Even worse, it can happen for exactly the same situation we were trying to avoid with GOGC: a large enough transient heap spike can cause a program to stall indefinitely!

[…] In many cases, an indefinite stall is worse than an out-of-memory condition, which tends to result in a much faster failure.

For this reason, the memory limit is defined to be soft. The Go runtime makes no guarantees that it will maintain this memory limit under all circumstances; it only promises some reasonable amount of effort.

The guide thus recommends using the memory limit when executing in an environment you control or one that is restricted, typically containers.

[…] A good example is the deployment of a web service into containers with a fixed amount of available memory.
In this case, a good rule of thumb is to leave an additional 5-10% of headroom to account for memory sources the Go runtime is unaware of.
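
As an illustration of that rule of thumb, here is a hedged Go sketch that reads the container’s memory limit from the cgroup v2 file memory.max (assuming the unified hierarchy is mounted at /sys/fs/cgroup and a limit is actually set) and hands roughly 90% of it to the runtime via debug.SetMemoryLimit. In practice you would more likely set GOMEMLIMIT directly in the Pod spec, but the effect is the same.

package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

func main() {
	// Assumption: cgroup v2, with the limit visible at this path;
	// the file contains "max" when no limit is set.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("no cgroup v2 memory limit found:", err)
		return
	}
	s := strings.TrimSpace(string(raw))
	if s == "max" {
		return // unlimited: keep the default behavior
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		panic(err)
	}

	// Leave ~10% of headroom for memory the Go runtime is unaware of.
	softLimit := limit * 90 / 100
	debug.SetMemoryLimit(softLimit)
	fmt.Printf("GOMEMLIMIT set to %d bytes\n", softLimit)
}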

Go runtime and virtual memory

Because virtual memory is just a mapping maintained by the operating system, it is typically very cheap to make large virtual memory reservations that don’t map to physical memory.

The Go runtime generally relies upon this view of the cost of virtual memory in a few ways:

  • The Go runtime never deletes virtual memory that it maps. Instead, it uses special operations that most operating systems provide to explicitly release any physical memory resources associated with some virtual memory range.
This technique is used explicitly to manage the memory limit and return memory to the operating system that the Go runtime no longer needs. The Go runtime also releases memory it no longer needs continuously in the background. See the additional resources for more information.
  • On 32-bit platforms, the Go runtime reserves between 128 MiB and 512 MiB of address space up-front for the heap to limit fragmentation issues.
  • The Go runtime uses large virtual memory address space reservations in the implementation of several internal data structures. On 64-bit platforms, these typically have a minimum virtual memory footprint of about 700 MiB. On 32-bit platforms, their footprint is negligible. As a result, virtual memory metrics such as “VSS” in top are typically not very useful in understanding a Go program’s memory footprint. Instead, focus on “RSS” and similar measurements, which more directly reflect physical memory usage.

Understand and reduce Go heap use

The best way to reduce memory consumption is to actually diagnose what consumes the most memory and change or optimize the program using profiling.

See two official Golang documentation resources: the Optimization guide and Diagnostics. Using the Go embedded profiler pprof is crucial; Julia Evans proposes a good article on Profiling Go programs with pprof.

Running gops can also help to retrieve runtime memory values at runtime; you can find the Go memstats documentation in the runtime sources.
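
To make that concrete, here is a minimal sketch (assuming your program can expose a local HTTP port) that serves the pprof endpoints and periodically prints a few runtime.MemStats fields; you can then pull a heap profile with go tool pprof http://localhost:6060/debug/pprof/heap.

package main

import (
	"fmt"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
	"time"
)

func main() {
	// Expose the profiling endpoints for `go tool pprof`.
	go func() {
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			panic(err)
		}
	}()

	// Periodically print a few memstats, similar to what gops reports.
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		fmt.Printf("HeapAlloc=%d HeapInuse=%d HeapReleased=%d Sys=%d NumGC=%d\n",
			m.HeapAlloc, m.HeapInuse, m.HeapReleased, m.Sys, m.NumGC)
		time.Sleep(5 * time.Second)
	}
}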

eBPF programs’ memory impact

eBPF programs’ memory impact is mostly due to BPF maps. They serve as a way to store state and data, and to communicate between BPF programs and userspace programs. Due to their static nature, most of them have to be defined at compile time and allocated at load time, and thus empty or unused maps use as much space as used ones.

It seems that the memory cgroup v1 does not account for memory used by maps, not even in the kernel version of those stats. However, this changed with cgroup v2, most likely due to this series of patches. This can lead to a drastic change in overall memory consumption if you switch from cgroup v1 to v2 while having a lot of BPF maps.

If you spot major memory consumption from unused maps and you cannot make the existence of a map conditional, a good option is to set that map’s max_entries to one (zero is invalid) and resize it at load time in the agent when needed, as in the sketch below.
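
For example, with the cilium/ebpf library (an assumption on my side, your loader may differ), resizing such a map before loading could look like this; the object file program.o and the map name my_map are hypothetical.

package main

import (
	"log"

	"github.com/cilium/ebpf"
)

func main() {
	// Hypothetical object file whose map is defined with max_entries = 1.
	spec, err := ebpf.LoadCollectionSpec("program.o")
	if err != nil {
		log.Fatal(err)
	}

	// Resize the map before it gets created in the kernel, only when the
	// feature that needs it is enabled.
	const featureEnabled = true
	if m, ok := spec.Maps["my_map"]; ok && featureEnabled {
		m.MaxEntries = 65536
	}

	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()
}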

Note that maps can be “anonymous” when the program loading them doesn’t pin them properly while they are actually used in the BPF code (and thus allocated). They are then not properly tied to the userspace process but still account for memory usage.

Some helpful bpftool commands

Here are a few commands, using the great bpftool and jq, to gauge memory consumption of loaded maps.

Retrieve total memory usage of maps (in kB):

sudo bpftool map -j | jq '[ .[] | .bytes_memlock ] | add / 1000'

Same but filtered for the process with comm equal to tetragon:

sudo bpftool map -j | jq '[ .[] | select(.pids[0].comm == "tetragon") | .bytes_memlock ] | add / 1000'

Group bytes_memlock with name and sort by bytes_memlock:

sudo bpftool map -j | jq '[ .[] | {bytes_memlock:.bytes_memlock, name:.name}  ] | sort_by(.bytes_memlock)'

Sum memory by map name:

sudo bpftool map -j | jq ' group_by(.name) | map({name: .[0].name, total_bytes_memlock: map(.bytes_memlock | tonumber) | add, maps: length}) | sort_by(.total_bytes_memlock)'

Visualize the stats with pie charts

If you want to visualise 👀 that last command in a pie chart 🥧, you can try bpfmemapie, a little Go utility to render an interactive pie chart out of bpftool’s output.

An example of pie chart with Cilium Tetragon running

Control groups

Find out if I’m using cgroups v1 or v2

To start, it’s important to know whether you are using cgroups version 1 or version 2. Here are a few techniques to quickly detect the cgroups version, plus a small programmatic sketch after the list.

  • From this unix Stack Exchange answer:

    mount | grep '^cgroup' | awk '{print $1}' | uniq
    

    If the output contains cgroup2, then your kernel supports cgroups v2.

  • From the Kubernetes cgroups concepts documentation

    stat -fc %T /sys/fs/cgroup/
    
    • For cgroup v2, the output is cgroup2fs.
    • For cgroup v1, the output is tmpfs.
  • From runc documentation:

You are using cgroups v2 if /sys/fs/cgroup/cgroup.controllers is present.

  • A more bullet-proof method used by systemd

    • if /sys/fs/cgroup exists and is on a cgroup2 file system, the system is running with a full unified hierarchy;
    • if /sys/fs/cgroup exists and is on a tmpfs file system,
      • if either /sys/fs/cgroup/unified or /sys/fs/cgroup/systemd exist and are on cgroup2 file systems, the system is using a unified hierarchy for the systemd controller only;
      • if /sys/fs/cgroup/systemd exists and is on a cgroup file system (or, as a fallback, if it exists and isn’t on a cgroup2 file system), the system is using a legacy hierarchy.

    Note that this is a bit similar to what cAdvisor is doing when retrieving values from the stat file.
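
As a programmatic equivalent of the checks above, here is a minimal, Linux-only Go sketch that compares the filesystem magic of /sys/fs/cgroup (the magic numbers come from the kernel’s include/uapi/linux/magic.h; the path is assumed to be the standard mount point):

package main

import (
	"fmt"
	"syscall"
)

// Filesystem magic numbers from the Linux kernel (include/uapi/linux/magic.h).
const (
	cgroup2SuperMagic = 0x63677270 // cgroup2 file system ("cgrp")
	tmpfsMagic        = 0x01021994 // tmpfs, used by the cgroup v1 hierarchy
)

func main() {
	var fs syscall.Statfs_t
	if err := syscall.Statfs("/sys/fs/cgroup", &fs); err != nil {
		panic(err)
	}
	switch fs.Type {
	case cgroup2SuperMagic:
		fmt.Println("cgroup v2 (unified hierarchy)")
	case tmpfsMagic:
		fmt.Println("cgroup v1 (legacy or hybrid hierarchy)")
	default:
		fmt.Printf("unexpected filesystem magic: 0x%x\n", fs.Type)
	}
}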

The memory control group

This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Cgroups by Mihai Albert.

Note that this section is written with cgroups v1 in mind

Cgroups are a Linux mechanism for limiting and accounting for resources, and containers are built on cgroups (and namespaces); see a container terminology introduction post by Red Hat on that. However, contrary to namespaces, cgroups don’t limit what a process can “see”; see this article on why top and free inside containers don’t show the container’s memory values. Indeed, /proc/meminfo is not namespaced, contrary to the PID list, for example.

You can use the root memory cgroup accounting to retrieve the node’s memory usage statistics: that’s what Kubernetes does on cgroups v1. The value is correct if memory.use_hierarchy is enabled for the root cgroup (and it cannot be modified dynamically). You can check with:

find /sys/fs/cgroup/memory -name memory.use_hierarchy -exec cat '{}' \;

To verify that all processes appearing in /proc are also in the root memory cgroup /sys/fs/cgroup/memory you can use:

ps aux | wc -l
find /sys/fs/cgroup/memory -name cgroup.procs -exec cat '{}' \; | wc -l

For more details, read the detailed documentation on the memory cgroup v1 by Red Hat.

Read an interesting recap about using cgroups on the LinkedIn engineering blog. It emphasizes that other cgroups (or just the root cgroup) can be noisy neighbors and affect the performance of an application running in its own cgroup, highlighting again that cgroups are a resource limitation mechanism, not an isolation one.

Using cgroups to measure memory usage

Looking at RSS to measure memory consumption can be insufficient. Indeed, the Linux mechanisms to measure and restrict memory consumption (cgroups) use different accounting.

Automatically with cmemstat

I created a small utility called cmemstat to perform the manual steps explained in the next section automatically.

With a working Go install you can fetch, compile and install with:

go install github.com/mtardy/cmemstat@latest

And then use it with:

cmemstat [option]... program [programoption]...

For example:

cmemstat sleep 3
# or with options
cmemstat --debug --refresh 400ms sleep 3

See more information on the project’s repository at github.com/mtardy/cmemstat.

Manually using the cgroup sys fs

Note that the process must start in the memory cgroup otherwise the accounting is incorrect because all the memory usage will be accounted for in the previous cgroup. From the administration guide of the Linux kernel:

A memory area is charged to the cgroup which instantiates it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn’t move the memory usages that it instantiated while in the previous cgroup to the new cgroup.

In one terminal, open a new shell and echo its PID

bash
echo $$

In another terminal, create the cgroup and move the process inside of it

sudo su
cd /sys/fs/cgroup   # for cgroup v1, create under /sys/fs/cgroup/memory
mkdir benchmark     # this is an arbitrary name
cd benchmark
echo $(pidof <process>) > cgroup.procs

Then start your process using the bash you opened in the first terminal and read the stat you want to acquire, for example:

cat memory.current   # for cgroup v1, the equivalent is memory.usage_in_bytes

From the cgroup v2 documentation:

The memory.current value is the sum of the first three lines of memory.stat (anon + file + kernel):

  • anon: Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS)
  • file: Amount of memory used to cache filesystem data, including tmpfs and shared memory.
  • kernel (npn): Amount of total kernel memory, including (kernel_stack, pagetables, percpu, vmalloc, slab) in addition to other kernel memory use cases.

Kubernetes

Kubernetes containers, Pods, and cgroups

This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Cgroups and Kubernetes by Mihai Albert.

Kubernetes uses one cgroup per container but shares the hierarchy in a Pod, see more details in this Stack Overflow question.

Be aware of the pause container used in Pods: it’s a container used to reap zombie processes and hold shared namespaces; read the best article you can find on the topic, The Almighty Pause Container. Also, read this conversation I had on the Kubernetes Slack to find out why crictl does not list the pause container.

You can use crictl to inspect the memory controller hierarchy and statistics; read more in the official Kubernetes documentation. Be aware that containerd uses the k8s.io “namespace” for containers. You can also directly check cAdvisor’s output for stats.

Kubernetes supports cgroups v2. You can check which version you are using through different methods (see the section “Find out if I’m using cgroups v1 or v2” above). Again, this article is written with cgroups v1 in mind, and things were fixed and changed with cgroups v2, like group killing for containers; see this presentation by Giuseppe Scrivano from Red Hat at 14:45 and 17:55.

Kubernetes, applications, and the OOM killer

This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Cgroups and the OOM killer by Mihai Albert.

The OOM killer steps in when the whole OS is low on memory, but it has been updated to work with cgroups as well; see Teaching the OOM killer about control groups. As explained in the kernel documentation, when a cgroup goes over its limits, it first tries to reclaim memory (from the per-group LRU list) and then invokes the OOM killer. From experimenting, it’s a bit hard to predict which process will be targeted by the OOM killer inside a container, and the result can look inconsistent with the desired behavior. Reading kernel logs (using dmesg for example) will give you more context on the decision (see this resource on various system logs).

[…] the OOM killer deciding to terminate processes inside cgroups is something that “happens” to Kubernetes containers, as the OOM killer is a Linux kernel component. It’s not that Kubernetes decides to invoke the OOM killer, as it has no such control over the kernel.

As such, applications cannot be given a signal to gracefully shut down, and this is a problem for running production code on Kubernetes; see this GitHub issue about making the OOM killer not send a SIGKILL. Another aspect is that the OOM killer has no notion of containers (like the rest of the kernel), nor of Kubernetes Pods, so it can kill a process in a container without stopping anything else: this can lead to weird states for containers or Pods; unfortunately, this is “working as intended”. However, when Kubernetes sees an OOM kill event, it tries to restart the Pod to make it healthy, given the restartPolicy. Finally, the memory cgroup has soft limits, but Kubernetes does not support them as of now; see this gist about Kubernetes resource management: kcgroups.

Now, the application can use a runtime that allocates memory in complex ways, and it can be difficult to understand why memory is retained and not released to the OS. The original article addresses .NET; I’m much more interested in Go, and the best resource you can find is the Guide to the Go Garbage Collector. Amongst other things, since Go 1.19, we can use GOMEMLIMIT, which is very useful to hint the runtime about the limitation and make it smarter.

Finally, Kubernetes has resource requests and limits, see the Resource Management for Pods and Containers for full official documentation, but essentially:

When you specify the resource request for containers in a Pod, the kube-scheduler uses this information to decide which node to place the Pod on. When you specify a resource limit for a container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set.

Here are other resources on resource requests and limits:

Kubernetes uses Quality of Service (QoS) classes for Pods to dictate the order in which pods are evicted and define the root memory cgroup hierarchy on the node. Official documentation here, but a recap could be:

Guaranteed is assigned to pods that have all of their containers specify resource values that are equal to the limits ones for both CPU and memory respectively. Burstable is when at least one container in the pod has a CPU or memory request, but it doesn’t meet the “high” criteria for Guaranteed. The last of the classes – BestEffort – is the case of a pod that doesn’t have a single container specify at least one CPU or memory limit or request value.

Kubernetes memory metrics

This section is rewritten but inspired by Out-of-memory (OOM) in Kubernetes – Part 3: Memory metrics sources and tools to collect them by Mihai Albert.

In the Kubernetes world, people looking at memory metrics for workloads usually look at container_memory_working_set_bytes; let’s understand and explain the link between this metric and the stats we retrieve from the OS.

Metrics components

Metrics components diagram (Kubernetes 1.21 cluster with containerd, as of Feb 2022) by Mihai Albert

The above diagram is from the author of the original blog post and gives a good understanding of the flow of metrics on a node. Since we care about the memory use of the workloads, the most interesting part is what happens inside the kubelet, which is the server exposing the various metrics. Let’s simplify it by ignoring a few things:

  • The Kubelet’s own metrics endpoint.
  • The kube-state-metrics service, which listens to the Kubernetes API server and generates metrics. Note that it relies only on the Kubernetes API events.
  • The Kubernetes API Server’s own metrics endpoint.
  • The Prometheus node exporter can also be skipped for now, as it concerns the whole node and not the workloads, reporting various stats from the host itself (which include memory stats, though).
[Simplified diagram: the Prometheus Server scrapes the node exporter and the kubelet’s cAdvisor endpoint; the Metrics Server queries the kubelet’s Resource Metrics endpoint, backed by the Summary API (cAdvisor and the container runtime), and serves kubectl top pod|node.]

With this simplified version, we realize that:

  • We end up with two principal consumers: the Prometheus Server, which directly scrapes the cAdvisor endpoint and the Prometheus node exporter, and the Metrics Server, which queries all nodes to return the data exposed by kubectl top pod|node.
  • The workload stats exposed by the kubelet mostly come from cAdvisor.

Indeed, on that last point, you can see that the Metrics Server queries the Resource Metrics endpoint, which interrogates the Summary API endpoint, which retrieves its metrics both from cAdvisor and from the container runtime (CRI). Note that the current evolution is to rely less on cAdvisor and more on the CRI for stats. The way the kubelet “decides” which source to use (cAdvisor or CRI) is detailed in “How does the Summary API endpoint get its metrics?”. Nowadays, the reality is that, in any case, most metrics come from cAdvisor due to a bug in kubelet.

In practice, if you are using the runc low-level runtime1, which is the reference implementation used in most cases by the popular high-level container runtimes containerd and CRI-O, whether the kubelet retrieves its cgroups stats from the CRI or from cAdvisor does not make a difference, as they both use opencontainers/libcontainer:

Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container performing additional operations after the container is created.

However, it is cAdvisor or the CRI directly that computes returned heuristics like memory_working_set, so the appropriate code must be explored.

cAdvisor and libcontainer deep dive

So in the end, cAdvisor calls its GetStats method which, for every container, eventually calls the setMemoryStats function. The most interesting parts for us are the function’s first line, which retrieves memory usage, and the last lines, which compute the working set stat we are trying to understand.

func setMemoryStats(s *cgroups.Stats, ret *info.ContainerStats) {
	ret.Memory.Usage = s.MemoryStats.Usage.Usage
	ret.Memory.MaxUsage = s.MemoryStats.Usage.MaxUsage
	ret.Memory.Failcnt = s.MemoryStats.Usage.Failcnt
	ret.Memory.KernelUsage = s.MemoryStats.KernelUsage.Usage

	if cgroups.IsCgroup2UnifiedMode() {
		ret.Memory.Cache = s.MemoryStats.Stats["file"]
		ret.Memory.RSS = s.MemoryStats.Stats["anon"]
		ret.Memory.Swap = s.MemoryStats.SwapUsage.Usage - s.MemoryStats.Usage.Usage
		ret.Memory.MappedFile = s.MemoryStats.Stats["file_mapped"]
	} else if s.MemoryStats.UseHierarchy {
		ret.Memory.Cache = s.MemoryStats.Stats["total_cache"]
		ret.Memory.RSS = s.MemoryStats.Stats["total_rss"]
		ret.Memory.Swap = s.MemoryStats.Stats["total_swap"]
		ret.Memory.MappedFile = s.MemoryStats.Stats["total_mapped_file"]
	} else {
		ret.Memory.Cache = s.MemoryStats.Stats["cache"]
		ret.Memory.RSS = s.MemoryStats.Stats["rss"]
		ret.Memory.Swap = s.MemoryStats.Stats["swap"]
		ret.Memory.MappedFile = s.MemoryStats.Stats["mapped_file"]
	}

	// [...]

	inactiveFileKeyName := "total_inactive_file"
	if cgroups.IsCgroup2UnifiedMode() {
		inactiveFileKeyName = "inactive_file"
	}

	workingSet := ret.Memory.Usage
	if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
		if workingSet < v {
			workingSet = 0
		} else {
			workingSet -= v
		}
	}
	ret.Memory.WorkingSet = workingSet
}

To retrieve memory usage, libcontainer reads memory.usage_in_bytes for cgroups v1, and memory.current for cgroups v2. If interested (or looking for proofs) see the code deep dive below 🌊 🤿 🦑.

The setMemoryStats function was called from newContainerStats in the GetStats method mentioned above, which called h.cgroupManager.GetStats() to retrieve the cgroup stats using the opencontainers/libcontainer codebase. Note that the naming is very confusing here, since cAdvisor has a libcontainer package containing a GetStats method that calls the GetStats method of libcontainer’s cgroups package.

Now, as expected, libcontainer has two implementations for its GetStats method, one in the fs and the other in the fs2 package, corresponding to cgroups v1 and v2 respectively.

  1. For cgroups v1, another GetStats is called on all subsystems. For memory, we end up in the appropriate GetStats method that writes to stats.MemoryStats.Usage after calling getMemoryData(path, ""). So we can finally see that libcontainer reads memory.usage_in_bytes in getMemoryData for memory usage.

    func getMemoryData(path, name string) (cgroups.MemoryData, error) {
        memoryData := cgroups.MemoryData{}
    
        moduleName := "memory"
        if name != "" {
            moduleName = "memory." + name
        }
        var (
            usage    = moduleName + ".usage_in_bytes"
            maxUsage = moduleName + ".max_usage_in_bytes"
            failcnt  = moduleName + ".failcnt"
            limit    = moduleName + ".limit_in_bytes"
        )
    
        value, err := fscommon.GetCgroupParamUint(path, usage)
        if err != nil {
            if name != "" && os.IsNotExist(err) {
                // Ignore ENOENT as swap and kmem controllers
                // are optional in the kernel.
                return cgroups.MemoryData{}, nil
            }
            return cgroups.MemoryData{}, err
        }
        memoryData.Usage = value
        // [...]
    }
    
  2. For cgroups v2, the statsMemory function is called, which, similarly to v1, writes to stats.MemoryStats.Usage after calling getMemoryDataV2(dirpath, ""). We can finally see that libcontainer reads memory.current in getMemoryDataV2 for memory usage.

    func getMemoryDataV2(path, name string) (cgroups.MemoryData, error) {
        memoryData := cgroups.MemoryData{}
    
        moduleName := "memory"
        if name != "" {
            moduleName = "memory." + name
        }
        usage := moduleName + ".current"
        limit := moduleName + ".max"
        maxUsage := moduleName + ".peak"
    
        value, err := fscommon.GetCgroupParamUint(path, usage)
        if err != nil {
            if name != "" && os.IsNotExist(err) {
                // Ignore EEXIST as there's no swap accounting
                // if kernel CONFIG_MEMCG_SWAP is not set or
                // swapaccount=0 kernel boot parameter is given.
                return cgroups.MemoryData{}, nil
            }
            return cgroups.MemoryData{}, err
        }
        memoryData.Usage = value
        // [...]
     }
    
Stats file

Doing a similar analysis, you can find out that s.MemoryStats.Stats is a map containing the key-value pairs of the memory.stat file, for cgroups v1 and for cgroups v2.

The key names differ between cgroups v1 and v2, explaining why the setMemoryStats function needs to discriminate using cgroups.IsCgroup2UnifiedMode() and s.MemoryStats.UseHierarchy.

Memory working set

The conclusion is that Kubernetes’ container_memory_working_set_bytes is a heuristic trying to estimate what memory is used by the container’s processes by reading the usage and subtracting the memory used by the “inactive files”.

From the original article:

inactive_file is defined in the docs section 5.2 as “number of bytes of file-backed memory on inactive LRU list“. The LRU lists are described in this document with the inactive list described as containing “reclaim candidates” as opposed to the active list that “contains all the working sets in the system“.

So in effect the memory size for mapping files from disk that aren’t really required at the time is deducted from the memory usage – itself roughly equivalent as we’ve seen previously with rss+cache+swap (which we’re already reading from memory.stat).

To simplify, while the metrics always remain >= 0, we can link the container_memory_working_set_bytes with the cgroups fs metrics like this:

Cgroups     Container Memory Working Set Bytes
version 1   memory.usage_in_bytes - memory.stat[total_inactive_file]
version 2   memory.current - memory.stat[inactive_file]
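
If you want to reproduce this computation by hand, here is a minimal Go sketch of the cgroup v2 case (assuming the unified hierarchy is mounted at /sys/fs/cgroup; the benchmark cgroup path reuses the hypothetical group created in the manual measurement section above):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// workingSet mimics cAdvisor's heuristic on cgroup v2:
// memory.current - memory.stat[inactive_file], clamped at zero.
func workingSet(cgroupPath string) (uint64, error) {
	raw, err := os.ReadFile(cgroupPath + "/memory.current")
	if err != nil {
		return 0, err
	}
	usage, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0, err
	}

	stat, err := os.ReadFile(cgroupPath + "/memory.stat")
	if err != nil {
		return 0, err
	}
	var inactiveFile uint64
	for _, line := range strings.Split(string(stat), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "inactive_file" {
			if inactiveFile, err = strconv.ParseUint(fields[1], 10, 64); err != nil {
				return 0, err
			}
		}
	}

	if usage < inactiveFile {
		return 0, nil
	}
	return usage - inactiveFile, nil
}

func main() {
	ws, err := workingSet("/sys/fs/cgroup/benchmark")
	if err != nil {
		panic(err)
	}
	fmt.Printf("container_memory_working_set_bytes ≈ %d\n", ws)
}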

Conclusion

Let’s recap and try to summarize the essence of the research that was compiled in the above sections.

Let’s say you have a Go application, optionally loading eBPF programs, running on Kubernetes, and it has memory issues. By memory issue, I mean that kubectl top, or the container_memory_working_set_bytes metric monitored in Grafana, is “too high” for your taste. How can you understand and trace the link between your program’s memory usage and the final figure reported by Kubernetes?

The container_memory_working_set_bytes metric is a heuristic computed in the container world, trying to estimate what the OOM killer will look at when checking the process’s memory consumption. It does not exist as-is in the cgroups statistics.

This value can technically come out of cAdvisor, embedded in the kubelet, or out of the container runtime. However, at the moment, it mostly comes out of cAdvisor anyway because of a kubelet bug. While the computation of this high-level metric is done by cAdvisor or the container runtime, if you are using runc as your low-level container runtime1, the cgroup statistics will be retrieved using runc’s libcontainer reference implementation.

Depending on whether you are using cgroups v1 or v2, the following table shows which cgroup stats are used to compute container_memory_working_set_bytes. Also keep in mind that this metric is always >= 0, so it’s technically max(0, <value below>).

Cgroups     Container Memory Working Set Bytes
version 1   memory.usage_in_bytes - memory.stat[total_inactive_file]
version 2   memory.current - memory.stat[inactive_file]

So if the amount of inactive_file memory is low, you can approximate this value with the main memory cgroup statistic: memory.usage_in_bytes or memory.current for cgroups v1 and v2 respectively. For details about how we ended up with this approximation, see the diagram and code deep dive above.

Now to continue your investigation you might want to:

  1. Read the Linux process memory use and memory segments. Depending on the way your process handles memory, it can give you a first view on how memory is used.
  2. Reproduce the measurements locally using cgroups. It can be easier than spinning up a whole Kubernetes cluster and deploying your fixed program.
  3. Check if your Go program uses too much heap. Using a language-specific profiler will give you information about how the heap is distributed. Don’t forget to learn about GOMEMLIMIT for containerized environments and the other knobs of the Go garbage collector to optimize its behavior for your use case.
  4. Check if your eBPF maps use too much kernel memory. When using cgroups v2, kernel memory used by BPF maps is accounted for properly; you can retrieve some stats using bpftool and jq, and plot some pie charts to visualize them.

If you find errors in this article, please reach out.


  1. You might not use runc if you are using crun, a container runtime implementation written in C, or isolation-focused low-level runtimes like AWS firecracker-containerd or Google gVisor. See this article on how to understand the differences between Docker, containerd, CRI-O and runc if the notion of container runtime is unclear. ↩︎ ↩︎