tl;dr: please jump directly to the conclusion if you think you already have some knowledge about memory and just want the recap. The conclusion links back to the other sections for more details.
Disclaimer
This (long) article is a Frankenstein 🧟 compilation of personal digest notes on various topics. Some sections are direct digests or extracts of very good resources I found on the topic and could not write something better:
- Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.
- Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications by Mihai Albert.
- Out-of-memory (OOM) in Kubernetes - Part 3: Memory metrics sources and tools to collect them by Mihai Albert.
- Golang Documentation: A Guide to the Go Garbage Collector by the Go maintainers.
If what you can find here is not enough on the specific sections marked as extract, digest, or inspiration, read these articles directly instead. In that case, I also recommend following most of this blog post’s hyperlinks; they are the best resources I could find so far.
Introduction
To start this post about memory, let’s define a few terms and quickly explain how paging works on Linux. The following are extracts from the excellent Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.
Glossary
This section is an extract from Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.
- Main memory: Also referred to as physical memory, this describes the fast data storage area of a computer, commonly provided as DRAM.
- Virtual memory: An abstraction of main memory that is (almost) infinite and non-contended. Virtual memory is not real memory.
- Resident memory: Memory that currently resides in main memory.
- Anonymous memory: Memory with no file system location or path name. It includes the working data of a process address space, called the heap.
- Address space: A memory context. There are virtual address spaces for each process, and for the kernel.
- Segment: An area of virtual memory flagged for a particular purpose, such as for storing executable or writeable pages.
- Instruction text: Refers to CPU instructions in memory, usually in a segment.
- OOM: Out of memory, when the kernel detects low available memory.
- Page: A unit of memory, as used by the OS and CPUs. Historically it is either 4 or 8 Kbytes. Modern processors have multiple page size support for larger sizes.
- Page fault: An invalid memory access. These are normal occurrences when using on-demand virtual memory.
- Paging: The transfer of pages between main memory and the storage devices.
- Swapping: Linux uses the term swapping to refer to anonymous paging to the swap device (the transfer of swap pages). In Unix and other operating systems, swapping is the transfer of entire processes between main memory and the swap devices. This book uses the Linux version of the term.
- Swap: An on-disk area for paged anonymous data. It may be an area on a storage device, also called a physical swap device, or a file system file, called a swap file. Some tools use the term swap to refer to virtual memory (which is confusing and incorrect).
Memory overcommit and the OOM killer
This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Overcommit by Mihai Albert.
Linux overcommits memory (Windows, for your culture, does not; the original author wrote another post on that topic). You can tune this behavior with sysctl vm.overcommit_memory=<value>, which has different modes: for example, 1 is “always overcommit” and 2 is “prevent overcommit” (behaving similarly to Windows). Overcommitting memory allows accommodating applications that allocate a lot of memory but don’t use all of it, and it is also consistent with the nature of forking, where the kernel copy-on-writes the parent memory to the child. Note that there are heated debates around whether overcommit is a good or a bad thing.
The OOM killer steps in when memory hits a critical level and the kernel, having to eventually back virtual memory with physical memory, must choose a process to kill in order to free memory. The documentation of its principles is clear and concise. Note that even with tuning the vm.overcommit_memory sysctl, it’s hard to actually disable the OOM killer completely.
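For a quick check of the current setting, you can read the mode straight from procfs (equivalent to sysctl vm.overcommit_memory); here is a minimal Go sketch, purely illustrative and not from the original articles:

package main

import (
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    // 0: heuristic overcommit (default), 1: always overcommit, 2: prevent overcommit.
    mode, err := os.ReadFile("/proc/sys/vm/overcommit_memory")
    if err != nil {
        log.Fatal(err)
    }
    // When the mode is 2, the committable amount is bounded by overcommit_ratio
    // (or overcommit_kbytes).
    ratio, err := os.ReadFile("/proc/sys/vm/overcommit_ratio")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("vm.overcommit_memory=%s vm.overcommit_ratio=%s\n",
        strings.TrimSpace(string(mode)), strings.TrimSpace(string(ratio)))
}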
The Paging Mechanism
This section is an extract from Systems Performance: Enterprise and the Cloud, 2nd Edition (2020) by Brendan Gregg.
The result of the virtual memory model and demand allocation is that any page of virtual memory may be in one of the following states:
1. Unallocated
2. Allocated, but unmapped (unpopulated and not yet faulted)
3. Allocated, and mapped to main memory (RAM)
4. Allocated, and mapped to the physical swap device (disk)
State (4) is reached if the page is paged out due to system memory pressure. A transition from (2) to (3) is a page fault. If it requires disk I/O, it is a major page fault; otherwise, a minor page fault.
From these states, two memory usage terms can also be defined:
- Resident set size (RSS): The size of allocated main memory pages (3)
- Virtual memory size: The size of all allocated areas (2 + 3 + 4)
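To make these two terms concrete, a process can read its own virtual size and RSS from /proc/self/statm (values are in pages); the following is a minimal Go sketch, not from the book:

package main

import (
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

func main() {
    // /proc/self/statm fields: size resident shared text lib data dt (in pages).
    raw, err := os.ReadFile("/proc/self/statm")
    if err != nil {
        log.Fatal(err)
    }
    fields := strings.Fields(string(raw))

    pageSize := uint64(os.Getpagesize())
    vsize, _ := strconv.ParseUint(fields[0], 10, 64) // virtual memory size
    rss, _ := strconv.ParseUint(fields[1], 10, 64)   // resident set size

    fmt.Printf("virtual: %d kB, resident (RSS): %d kB\n",
        vsize*pageSize/1024, rss*pageSize/1024)
}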
Process Memory
Understanding a process’s use of memory can be tricky: it may be counter-intuitive, but it’s pretty hard to give a universal definition of “memory use” and get the perfect statistic. In that regard, for more details, Brendan Gregg has been trying to define and estimate a metric: the Working Set Size.
Here are a few basic actions to read memory usage, and some more elaborate methods for Golang and eBPF programs.
Reading a Linux process memory use
Fast but inaccurate: status
First, you can read the RSS stat of the process, as detailed in the proc(5) manpage; the read is fast but inaccurate. You can find the same information in:
- a parsable form at /proc/pid/stat;
- a human-readable form at /proc/pid/status;
- measured in pages at /proc/pid/statm.
Let’s take a look at the human form:
grep -i rss /proc/$(pidof <process>)/status
The output should be similar to:
VmRSS: 70244 kB
RssAnon: 30180 kB
RssFile: 40064 kB
RssShmem: 0 kB
As per the kernel /proc documentation:
- VmRSS: size of memory portions. It contains the three following parts (VmRSS = RssAnon + RssFile + RssShmem).
- RssAnon: size of resident anonymous memory.
- RssFile: size of resident file mappings.
- RssShmem: size of resident shmem memory (includes SysV shm, mapping of tmpfs and shared anonymous mappings).
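If you need these fields programmatically rather than through grep, a minimal Go sketch (purely illustrative) parsing the same lines could look like this:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    // Replace "self" with the target PID to inspect another process.
    f, err := os.Open("/proc/self/status")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Collect VmRSS and its three components (values are reported in kB).
    rss := map[string]string{}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        for _, key := range []string{"VmRSS:", "RssAnon:", "RssFile:", "RssShmem:"} {
            if strings.HasPrefix(line, key) {
                rss[strings.TrimSuffix(key, ":")] = strings.TrimSpace(strings.TrimPrefix(line, key))
            }
        }
    }
    fmt.Println(rss)
}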
Slow but accurate: smaps
For slower but more accurate results, one can use /proc/pid/smaps_rollup as per the proc(5) manpage to retrieve a better total of RSS memory use.
sudo grep -i rss /proc/$(pidof <process>)/smaps_rollup
The output should be similar to:
Rss: 70648 kB
To get more details of RSS consumption by memory segment, you can use:
sudo grep -e '^[^A-Z]' -e Rss /proc/$(pidof <process>)/smaps | less
The output should be similar to:
00010000-01c29000 r-xp 00000000 fd:01 31577 /home/[...]/tetragon
Rss: 18340 kB
01c30000-03cfe000 r--p 01c20000 fd:01 31577 /home/[...]/tetragon
Rss: 20664 kB
03d00000-03e22000 rw-p 03cf0000 fd:01 31577 /home/[...]/tetragon
Rss: 1160 kB
03e22000-03e76000 rw-p 00000000 00:00 0
Rss: 164 kB
4000000000-400c800000 rw-p 00000000 00:00 0
Rss: 22892 kB
400c800000-4010000000 ---p 00000000 00:00 0
Rss: 0 kB
ff56b9c47000-ff56b9c58000 rw-s 00000000 00:0f 1064 anon_inode:[perf_event]
Rss: 68 kB
ff56b9c58000-ff56b9c69000 rw-s 00000000 00:0f 1064 anon_inode:[perf_event]
Rss: 68 kB
ff56b9c69000-ff56b9c7a000 rw-s 00000000 00:0f 1064 anon_inode:[perf_event]
Rss: 68 kB
ff56b9c7a000-ff56b9c8b000 rw-s 00000000 00:0f 1064 anon_inode:[perf_event]
Rss: 68 kB
ff56b9c8b000-ff56b9c9c000 rw-s 00000000 00:0f 1064 anon_inode:[perf_event]
Rss: 68 kB
ff56b9c9c000-ff56b9cad000 rw-s 00000000 00:0f 1064 anon_inode:[perf_event]
Rss: 68 kB
ff56b9cad000-ff56ba400000 rw-p 00000000 00:00 0
Rss: 6280 kB
ff56ba400000-ff56bc400000 rw-p 00000000 00:00 0
Rss: 4 kB
[...]
ff5700d96000-ff5700da8000 rw-p 00000000 00:00 0
Rss: 72 kB
ff5700da8000-ff5700ea7000 ---p 00000000 00:00 0
Rss: 0 kB
ff5700ea7000-ff5700f07000 rw-p 00000000 00:00 0
Rss: 56 kB
ff5700f07000-ff5700f09000 r--p 00000000 00:00 0 [vvar]
Rss: 0 kB
ff5700f09000-ff5700f0a000 r-xp 00000000 00:00 0 [vdso]
Rss: 4 kB
ffffe3ebf000-ffffe3ee0000 rw-p 00000000 00:00 0 [stack]
Rss: 16 kB
Or better, you can use pmap(1) to get a nicely formatted version of /proc/pid/smaps:
sudo pmap $(pidof <process>) -xp
The output should be similar to:
597662: ./tetragon --bpf-lib bpf/objs/ --tracing-policy-dir /home/mtardy.linux/tetragon/examples/tracingpolicy/set
Address Kbytes RSS Dirty Mode Mapping
0000000000010000 28772 18340 0 r-x-- /home/mtardy.linux/tetragon/tetragon
0000000001c30000 33592 20664 0 r---- /home/mtardy.linux/tetragon/tetragon
0000000003d00000 1160 1160 228 rw--- /home/mtardy.linux/tetragon/tetragon
0000000003e22000 336 164 164 rw--- [ anon ]
0000004000000000 204800 22892 22892 rw--- [ anon ]
000000400c800000 57344 0 0 ----- [ anon ]
0000ff56b9c47000 68 68 4 rw-s- [ anon ]
0000ff56b9c58000 68 68 4 rw-s- [ anon ]
0000ff56b9c69000 68 68 4 rw-s- [ anon ]
0000ff56b9c7a000 68 68 4 rw-s- [ anon ]
0000ff56b9c8b000 68 68 4 rw-s- [ anon ]
0000ff56b9c9c000 68 68 4 rw-s- [ anon ]
0000ff56b9cad000 7500 6280 6280 rw--- [ anon ]
0000ff56ba400000 32768 4 4 rw--- [ anon ]
0000ff56bc400000 512 0 0 ----- [ anon ]
0000ff56bc480000 4 4 4 rw--- [ anon ]
0000ff56bc481000 524284 0 0 ----- [ anon ]
0000ff56dc480000 4 4 4 rw--- [ anon ]
0000ff56dc481000 523836 0 0 ----- [ anon ]
0000ff56fc410000 4 4 4 rw--- [ anon ]
0000ff56fc411000 65476 0 0 ----- [ anon ]
0000ff5700402000 4 4 4 rw--- [ anon ]
0000ff5700403000 8180 0 0 ----- [ anon ]
0000ff5700c06000 576 564 564 rw--- [ anon ]
0000ff5700c96000 1024 8 8 rw--- [ anon ]
0000ff5700d96000 72 72 72 rw--- [ anon ]
0000ff5700da8000 1020 0 0 ----- [ anon ]
0000ff5700ea7000 384 56 56 rw--- [ anon ]
0000ff5700f07000 8 0 0 r---- [ anon ]
0000ff5700f09000 4 4 0 r-x-- [ anon ]
0000ffffe3ebf000 132 16 16 rw--- [ stack ]
---------------- ------- ------- -------
total kB 1492204 70648 30324
Note that sometimes you can see the PSS column instead of the RSS column, the PSS is the “proportional set size”, taking into account shared memory. For example, if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500. See more about that in ELC: How much memory are applications really using?.
If looking at the segments, especially the anonymous ones, isn’t helpful, you can try to trace the memory operations of the process to see what’s happening and when memory is allocated.
strace -e trace=memory -o out.trace <cmd>
But watching a runtime allocate memory can be rather confusing, and you had better use runtime-specific tools to analyze anonymous memory consumption directly.
Golang memory use
This section was growing too much so I decided to make it a separate article that you can find here: A Deep Dive into Golang Memory.
eBPF programs’ memory impact
eBPF programs’ memory impact is mostly due to BPF maps. They serve as a way to store state and data and to communicate between BPF programs and userspace programs. Due to their static nature, most of them have to be allocated and defined at compile time, and thus empty or unused maps use just as much space as used ones.
It seems that the memory cgroup v1 does not account for memory used by maps, not even in the kernel version of those stats. However, this changed with cgroup v2, most likely related to this series of patches. This can lead to a drastic change in overall memory consumption if you switch from cgroup v1 to v2 while having a lot of BPF maps.
If you spot major memory consumption from unused maps and you cannot make the existence of a map conditional, a good option is to set that map’s max_entries to one (zero is invalid) and resize it at loading time in the agent when needed.
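As an illustration, here is a minimal Go sketch of that pattern using the cilium/ebpf library (an assumption on my side: your loader may differ), with a hypothetical object file and map name:

package main

import (
    "log"

    "github.com/cilium/ebpf"
)

func main() {
    // Load the collection spec from a (hypothetical) compiled BPF object.
    spec, err := ebpf.LoadCollectionSpec("bpf/objs/example.o")
    if err != nil {
        log.Fatalf("loading spec: %v", err)
    }

    // The map was compiled with max_entries = 1 to keep it tiny by default;
    // resize it here only if the feature that needs it is enabled.
    const featureEnabled = true
    if m, ok := spec.Maps["events_map"]; ok && featureEnabled {
        m.MaxEntries = 65536
    }

    // Loading the collection allocates the maps with the adjusted size.
    coll, err := ebpf.NewCollection(spec)
    if err != nil {
        log.Fatalf("loading collection: %v", err)
    }
    defer coll.Close()
}

The idea is that the kernel only allocates the map at load time, so the tiny compile-time max_entries costs almost nothing unless the feature actually needs the map.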
Note that maps can be “anonymous” if the program loading them doesn’t pin them properly while they are actually used in the BPF code (and thus allocated). They are then not properly tied to the userspace process but still account for memory usage.
Some helpful bpftool commands
Here are a few commands, using the great bpftool and jq, to gauge the memory consumption of loaded maps.
Retrieve total memory usage of maps (in kB):
sudo bpftool map -j | jq '[ .[] | .bytes_memlock ] | add / 1000'
Same but filtered for the process with comm equal to tetragon:
sudo bpftool map -j | jq '[ .[] | select(.pids[0].comm == "tetragon") | .bytes_memlock ] | add / 1000'
Group bytes_memlock with name and sort by bytes_memlock:
sudo bpftool map -j | jq '[ .[] | {bytes_memlock:.bytes_memlock, name:.name} ] | sort_by(.bytes_memlock)'
Sum memory by map name:
sudo bpftool map -j | jq ' group_by(.name) | map({name: .[0].name, total_bytes_memlock: map(.bytes_memlock | tonumber) | add, maps: length}) | sort_by(.total_bytes_memlock)'
Visualize the stats with pie charts
If you want to visualise 👀 the output of that last command as a pie chart 🥧, you can try bpfmemapie, a little Go utility to render an interactive pie chart out of bpftool’s output.
Control groups
Find out if I’m using cgroups v1 or v2
To start, it’s important to know if you are using cgroups version 1 or version 2. Here are a few techniques to quickly detect the cgroups version.
From this unix Stack Exchange answer:
mount | grep '^cgroup' | awk '{print $1}' | uniq
If the output contains cgroup2, then your kernel supports cgroups v2.
From the Kubernetes cgroups concepts documentation:
stat -fc %T /sys/fs/cgroup/
- For cgroup v2, the output is cgroup2fs.
- For cgroup v1, the output is tmpfs.
From the runc documentation: you are using cgroups v2 if /sys/fs/cgroup/cgroup.controllers is present.
A more bullet-proof method, used by systemd:
- if /sys/fs/cgroup exists and is on a cgroup2 file system, the system is running with a full unified hierarchy;
- if /sys/fs/cgroup exists and is on a tmpfs file system,
  - if either /sys/fs/cgroup/unified or /sys/fs/cgroup/systemd exist and are on cgroup2 file systems, the system is using a unified hierarchy for the systemd controller only;
  - if /sys/fs/cgroup/systemd exists and is on a cgroup file system (or, as a fallback, if it exists and isn’t on a cgroup2 file system), the system is using a legacy hierarchy.
Note that this is a bit similar to what cAdvisor is doing when retrieving values from the stat file.
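As an illustration of the first two checks, here is a minimal Go sketch (not from any of the tools mentioned above) that inspects the file system type of /sys/fs/cgroup:

package main

import (
    "fmt"
    "log"

    "golang.org/x/sys/unix"
)

func main() {
    var st unix.Statfs_t
    if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
        log.Fatalf("statfs /sys/fs/cgroup: %v", err)
    }

    switch st.Type {
    case unix.CGROUP2_SUPER_MAGIC:
        fmt.Println("cgroup v2 (full unified hierarchy)")
    case unix.TMPFS_MAGIC:
        fmt.Println("cgroup v1 (legacy or hybrid hierarchy)")
    default:
        fmt.Printf("unexpected file system type: 0x%x\n", st.Type)
    }
}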
The memory control group
This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Cgroups by Mihai Albert.
Note that this section is written with cgroups v1 in mind.
Cgroups are a mechanism used on Linux for limiting and accounting for resources, and containers are built on cgroups (and namespaces); see a container terminology introduction post by Red Hat on that. However, contrary to namespaces, cgroups don’t limit what a process can “see”: see this article on why top and free inside containers don’t show the container memory values.
Indeed, /proc/meminfo is not namespaced, contrary to the PID list for example.
You can use the root memory cgroup accounting to retrieve the stats of the node memory usage: that’s what Kubernetes does on cgroups v1. The value is correct if memory.use_hierarchy is enabled for the root cgroup (and it cannot be modified dynamically). You can check with:
find /sys/fs/cgroup/memory -name memory.use_hierarchy -exec cat '{}' \;
To verify that all processes appearing in /proc are also in the root memory cgroup /sys/fs/cgroup/memory, you can use:
ps aux | wc -l
find /sys/fs/cgroup/memory -name cgroup.procs -exec cat '{}' \; | wc -l
For more details read detailed documentation on memory cgroup v1 by Red Hat.
Read an interesting recap about using cgroups from LinkedIn engineering blog. It emphasizes that other cgroups (or just the root cgroup) can be noisy neighbors and affect the performance of an application running in its own cgroup, highlighting again that they are not an isolation but a resource limitation mechanism.
Using cgroups to measure memory usage
Looking at RSS to measure memory consumption can be insufficient. Indeed, the Linux mechanisms to measure and restrict memory consumption (cgroups) use different accounting.
Automatically with cmemstat
I created a small utility called cmemstat
to perform the manual steps explained in the next section automatically.
With a working Go install you can fetch, compile and install with:
go install github.com/mtardy/cmemstat@latest
And then use it with:
cmemstat [option]... program [programoption]...
For example:
cmemstat sleep 3
# or with options
cmemstat --debug --refresh 400ms sleep 3
See more information on the project’s repository at github.com/mtardy/cmemstat.
Manually using the cgroup sys fs
Note that the process must start in the memory cgroup otherwise the accounting is incorrect because all the memory usage will be accounted for in the previous cgroup. From the administration guide of the Linux kernel:
A memory area is charged to the cgroup which instantiates it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn’t move the memory usages that it instantiated while in the previous cgroup to the new cgroup.
In one terminal, open a new shell and echo its PID
bash
echo $$
In another terminal, create the cgroup and move the process inside of it
sudo su
cd /sys/fs/cgroup # for cgroup v1, create under /sys/fs/cgroup/memory
mkdir benchmark # this is an arbitrary name
cd benchmark
echo $(pidof <process>) > cgroup.procs
Then start your process using the bash you opened in the first terminal and read the stat you want to acquire, for example:
cat memory.current # for cgroup v1, the equivalent is memory.usage_in_bytes
From the cgroup v2 documentation:
The memory.current value is the sum of memory.stat first three lines (anon + file + kernel)
- anon: Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS)
- file: Amount of memory used to cache filesystem data, including tmpfs and shared memory.
- kernel (npn): Amount of total kernel memory, including (kernel_stack, pagetables, percpu, vmalloc, slab) in addition to other kernel memory use cases.
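For reference, here is a minimal Go sketch of these manual steps for cgroup v2 (run as root; the benchmark cgroup name is arbitrary and the sleep workload is just an example):

package main

import (
    "fmt"
    "log"
    "os"
    "os/exec"
    "path/filepath"
    "strconv"
)

func main() {
    // Create an arbitrary cgroup under the v2 unified hierarchy.
    cg := "/sys/fs/cgroup/benchmark"
    if err := os.MkdirAll(cg, 0o755); err != nil {
        log.Fatal(err)
    }

    // Move ourselves into the cgroup so that the child starts inside it
    // and its memory is charged to it from its very first allocation.
    pid := strconv.Itoa(os.Getpid())
    if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644); err != nil {
        log.Fatal(err)
    }

    // Start the workload to measure (it inherits the parent's cgroup).
    cmd := exec.Command("sleep", "3")
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    // Read memory.current while the workload runs; note that the parent's
    // own memory is also charged since it lives in the same cgroup.
    current, err := os.ReadFile(filepath.Join(cg, "memory.current"))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("memory.current: %s", current)

    _ = cmd.Wait()
}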
Kubernetes
Kubernetes containers, Pods, and cgroups
This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Cgroups and Kubernetes by Mihai Albert.
Kubernetes uses one cgroup per container but shares the hierarchy in a Pod, see more details in this Stack Overflow question.
Be aware of the pause container used in Pods, it’s a container used to reap zombie processes and hold shared namespaces, read the best article you can find on the topic: The Almighty Pause Container. Also, read this conversation I had on the Kubernetes Slack to find out why crictl does not list the pause container.
You can use crictl to inspect the memory controller hierarchy and statistics read more on the official Kubernetes documentation. Be aware that containerd uses the k8s.io “namespace” for containers. You can also directly check cAdvisor’s output for stats.
Kubernetes supports cgroups v2. You can check what you are using through different methods (see related note Find out which version of cgroup is running). Again this article is written with cgroups v1 in mind and things were fixed and changed with cgroups v2, like group killing for containers, see this presentation by Giuseppe Scrivano from Red Hat at 14:45 and 17:55.
Kubernetes, applications, and the OOM killer
This section is a digest of Out-of-memory (OOM) in Kubernetes – Part 2: The OOM killer and application runtime implications - Cgroups and the OOM killer by Mihai Albert.
OOM killer will step in when the whole OS is low on memory, but it has been updated to work with cgroups as well, see Teaching the OOM killer about control groups. As explained in the kernel documentation, when a cgroup goes over limits, it first tries to reclaim memory (from the per-group LRU list) and then invokes the OOM killer. From experimenting, it’s a bit hard to predict which process will be targeted by the OOM killer inside a container and the result can look incoherent from the desired behavior. Reading kernel logs (using dmesg for example) will give you more context on the decision (see this resource on various system logs).
[…] the OOM killer deciding to terminate processes inside cgroups is something that “happens” to Kubernetes containers, as the OOM killer is a Linux kernel component. It’s not that Kubernetes decides to invoke the OOM killer, as it has no such control over the kernel.
As such, applications cannot be given a signal to gracefully shut down, which is a problem for running production code on Kubernetes; see this GitHub issue about making the OOM killer not send a SIGKILL. Another aspect is that the OOM killer has no notion of containers (like the rest of the kernel) nor of Kubernetes Pods, so it can kill a process in a container without stopping everything else: this can lead to weird states for containers or Pods; unfortunately, this is “working as intended”.
However, when Kubernetes sees an OOM kill event, it tries to restart the Pod to make it healthy again, given the restartPolicy. Finally, the memory cgroup has soft limits, but Kubernetes does not support them as of now; see this gist about Kubernetes resource management: kcgroups.
Now, the application may use a runtime that allocates memory in complex ways, and it can be difficult to understand why memory is retained and not released to the OS. The original article addresses .NET; I’m much more interested in Go, and the best resource you can find is the Guide to the Go Garbage Collector. Amongst other things, since Go 1.19 we can use GOMEMLIMIT, which is very useful to hint the runtime toward the limit and make it smarter.
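As an illustration, the soft memory limit can be set through the GOMEMLIMIT environment variable or programmatically; a minimal sketch (the 64 MiB value is only an example) could be:

package main

import "runtime/debug"

func main() {
    // Equivalent to running the program with GOMEMLIMIT=64MiB:
    // set a soft memory limit of 64 MiB for the Go runtime (Go 1.19+).
    debug.SetMemoryLimit(64 << 20)

    // Optionally disable the GOGC-based trigger so that the soft limit
    // becomes the main driver of garbage collection.
    debug.SetGCPercent(-1)

    // ... application code ...
}

In a Kubernetes context, a common approach is to set GOMEMLIMIT somewhat below the container’s memory limit so the runtime has room to collect before the cgroup limit is reached; see the Go GC guide for the caveats.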
Finally, Kubernetes has resource requests and limits, see the Resource Management for Pods and Containers for full official documentation, but essentially:
When you specify the resource request for containers in a Pod, the kube-scheduler uses this information to decide which node to place the Pod on. When you specify a resource limit for a container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set.
Here are other resources on resource requests and limits:
- Google’s Kubernetes best practices: Resource requests and limits
- A Deep Dive into Kubernetes Metrics - Part 3 Container Resource Metrics
Kubernetes uses Quality of Service (QoS) classes for Pods to dictate the order in which pods are evicted and define the root memory cgroup hierarchy on the node. Official documentation here, but a recap could be:
Guaranteed is assigned to pods that have all of their containers specify resource values that are equal to the limits ones for both CPU and memory respectively. Burstable is when at least one container in the pod has a CPU or memory request, but it doesn’t meet the “high” criteria for Guaranteed. The last of the classes – BestEffort – is the case of a pod that doesn’t have a single container specify at least one CPU or memory limit or request value.
Kubernetes memory metrics
This section is rewritten but inspired by Out-of-memory (OOM) in Kubernetes – Part 3: Memory metrics sources and tools to collect them by Mihai Albert.
In the Kubernetes world, people looking at memory metrics for a workload usually look at container_memory_working_set_bytes; let’s understand and explain the link between this stat and the stats we retrieve from the OS.
Metrics components
The above diagram is from the author of the original blog post and gives a good understanding of the flow of metrics on a node. Since we care about the memory use of the workloads, the most interesting part is what is happening inside kubelet, which will be the server exposing various metrics. Let’s simplify it by ignoring a few things:
- The Kubelet’s own metrics endpoint.
- The kube-state-metrics, a service that listens to the Kubernetes API server and generates metrics. Note that it relies only on the Kubernetes API events.
- The Kubernetes API Server’s own metrics endpoint.
- The Prometheus node exporter can also be skipped for now as it concerns the whole node and not the workloads, reporting various stats from the host itself (which include memory stats though).
With this simplified version, we realize that:
- We end up with two principal consumers: first, the Prometheus server will query the cAdvisor endpoint directly as well as the Prometheus node exporter, and then the Metrics server will query all nodes to return the data exposed by kubectl top pod|node.
- The stats of the workloads exposed by the kubelet are mostly coming from cAdvisor.
Indeed, on the last point, you can see that in the Metrics server case, it queries the Resource Metrics endpoint, which interrogates the Summary API endpoint, which retrieves its metrics both from cAdvisor and the container runtime (CRI). Note that the current evolution is to rely less on cAdvisor and more on the CRI for stats. On that subject, the way the kubelet “decides” which source to use (cAdvisor or CRI) is detailed in “How does the Summary API endpoint get its metrics?”. Nowadays, the reality is that, in any case, most metrics come from cAdvisor due to a bug in the kubelet.
In practice, if you are using the runc low-level runtime1, which is the reference implementation used in most cases by the popular high-level container runtimes containerd and CRI-O, whether the kubelet retrieves its cgroup stats from the CRI or from cAdvisor does not make a difference, as they both use opencontainers/libcontainer:
Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container performing additional operations after the container is created.
However, it is cAdvisor or the CRI directly that computes derived heuristics like memory_working_set, so the appropriate code must be explored.
cAdvisor and libcontainer deep dive
So in the end, cAdvisor will call its GetStats method, which will, for every container, eventually call the setMemoryStats function. The most interesting parts for us are the function’s first line, which retrieves memory usage, and the last lines, which compute the working set stat we are trying to understand.
func setMemoryStats(s *cgroups.Stats, ret *info.ContainerStats) {
    ret.Memory.Usage = s.MemoryStats.Usage.Usage
    ret.Memory.MaxUsage = s.MemoryStats.Usage.MaxUsage
    ret.Memory.Failcnt = s.MemoryStats.Usage.Failcnt
    ret.Memory.KernelUsage = s.MemoryStats.KernelUsage.Usage

    if cgroups.IsCgroup2UnifiedMode() {
        ret.Memory.Cache = s.MemoryStats.Stats["file"]
        ret.Memory.RSS = s.MemoryStats.Stats["anon"]
        ret.Memory.Swap = s.MemoryStats.SwapUsage.Usage - s.MemoryStats.Usage.Usage
        ret.Memory.MappedFile = s.MemoryStats.Stats["file_mapped"]
    } else if s.MemoryStats.UseHierarchy {
        ret.Memory.Cache = s.MemoryStats.Stats["total_cache"]
        ret.Memory.RSS = s.MemoryStats.Stats["total_rss"]
        ret.Memory.Swap = s.MemoryStats.Stats["total_swap"]
        ret.Memory.MappedFile = s.MemoryStats.Stats["total_mapped_file"]
    } else {
        ret.Memory.Cache = s.MemoryStats.Stats["cache"]
        ret.Memory.RSS = s.MemoryStats.Stats["rss"]
        ret.Memory.Swap = s.MemoryStats.Stats["swap"]
        ret.Memory.MappedFile = s.MemoryStats.Stats["mapped_file"]
    }

    // [...]

    inactiveFileKeyName := "total_inactive_file"
    if cgroups.IsCgroup2UnifiedMode() {
        inactiveFileKeyName = "inactive_file"
    }

    workingSet := ret.Memory.Usage
    if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
        if workingSet < v {
            workingSet = 0
        } else {
            workingSet -= v
        }
    }
    ret.Memory.WorkingSet = workingSet
}
Memory usage link to cgroups fs
To retrieve memory usage, libcontainer reads memory.usage_in_bytes for cgroups v1, and memory.current for cgroups v2. If interested (or looking for proof), see the code deep dive below 🌊 🤿 🦑.
The setMemoryStats function was called from newContainerStats in the GetStats method specified above, which called h.cgroupManager.GetStats() to retrieve the cgroup stats, using the opencontainers/libcontainer codebase.
Note that the naming is very confusing here, since cAdvisor has a libcontainer package that contains the GetStats method that will call the libcontainer cgroup package’s GetStats method.
Now, as expected, libcontainer has two implementations for its GetStats method, one in the fs package and the other in the fs2 package, corresponding to cgroups v1 and v2 respectively.
For cgroups v1, another GetStats is called on all subsystems. For memory, we end up in the appropriate GetStats method that writes to stats.MemoryStats.Usage after calling getMemoryData(path, ""). So we can finally see that libcontainer reads memory.usage_in_bytes in getMemoryData for memory usage.

func getMemoryData(path, name string) (cgroups.MemoryData, error) {
    memoryData := cgroups.MemoryData{}

    moduleName := "memory"
    if name != "" {
        moduleName = "memory." + name
    }
    var (
        usage    = moduleName + ".usage_in_bytes"
        maxUsage = moduleName + ".max_usage_in_bytes"
        failcnt  = moduleName + ".failcnt"
        limit    = moduleName + ".limit_in_bytes"
    )

    value, err := fscommon.GetCgroupParamUint(path, usage)
    if err != nil {
        if name != "" && os.IsNotExist(err) {
            // Ignore ENOENT as swap and kmem controllers
            // are optional in the kernel.
            return cgroups.MemoryData{}, nil
        }
        return cgroups.MemoryData{}, err
    }
    memoryData.Usage = value

    // [...]
}
For cgroups v2, the statsMemory function is called, which, similarly to v1, writes to stats.MemoryStats.Usage after calling getMemoryDataV2(dirpath, ""). We can finally see that libcontainer reads memory.current in getMemoryDataV2 for memory usage.

func getMemoryDataV2(path, name string) (cgroups.MemoryData, error) {
    memoryData := cgroups.MemoryData{}

    moduleName := "memory"
    if name != "" {
        moduleName = "memory." + name
    }
    usage := moduleName + ".current"
    limit := moduleName + ".max"
    maxUsage := moduleName + ".peak"

    value, err := fscommon.GetCgroupParamUint(path, usage)
    if err != nil {
        if name != "" && os.IsNotExist(err) {
            // Ignore EEXIST as there's no swap accounting
            // if kernel CONFIG_MEMCG_SWAP is not set or
            // swapaccount=0 kernel boot parameter is given.
            return cgroups.MemoryData{}, nil
        }
        return cgroups.MemoryData{}, err
    }
    memoryData.Usage = value

    // [...]
}
Stats file
Doing a similar analysis, you can find out that s.MemoryStats.Stats is a map containing the key-value pairs of the memory.stat file, for cgroups v1 and for cgroups v2. The key names differ between cgroups v1 and v2, explaining why the setMemoryStats function needs to discriminate using cgroups.IsCgroup2UnifiedMode() and s.MemoryStats.UseHierarchy.
Memory working set
The conclusion is that Kubernetes’ container_memory_working_set_bytes is a heuristic trying to estimate what memory is used by the container’s processes by reading the usage and subtracting the memory used by the “inactive files”. From the original article:
inactive_file is defined in the docs section 5.2 as “number of bytes of file-backed memory on inactive LRU list”. The LRU lists are described in this document with the inactive list described as containing “reclaim candidates” as opposed to the active list that “contains all the working sets in the system”. So in effect the memory size for mapping files from disk that aren’t really required at the time is deducted from the memory usage – itself roughly equivalent as we’ve seen previously with rss+cache+swap (which we’re already reading from memory.stat).
To simplify, while the metric always remains >= 0, we can link container_memory_working_set_bytes with the cgroups fs metrics like this:
Cgroups | Container Memory Working Set Bytes |
---|---|
version 1 | memory.usage_in_bytes - memory.stat[total_inactive_file] |
version 2 | memory.current - memory.stat[inactive_file] |
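To make the heuristic concrete, here is a minimal Go sketch (an illustration, not cAdvisor’s code) computing the same working-set value directly from a cgroup v2 directory (the benchmark path is just an example):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

func main() {
    // Path of the cgroup to inspect; adjust to your container's cgroup.
    cg := "/sys/fs/cgroup/benchmark"

    // memory.current: total memory usage of the cgroup.
    raw, err := os.ReadFile(filepath.Join(cg, "memory.current"))
    if err != nil {
        log.Fatal(err)
    }
    usage, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
    if err != nil {
        log.Fatal(err)
    }

    // inactive_file from memory.stat (total_inactive_file on cgroup v1).
    f, err := os.Open(filepath.Join(cg, "memory.stat"))
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    var inactiveFile uint64
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        if len(fields) == 2 && fields[0] == "inactive_file" {
            inactiveFile, _ = strconv.ParseUint(fields[1], 10, 64)
            break
        }
    }

    // Same floor-at-zero behavior as cAdvisor's setMemoryStats.
    workingSet := usage
    if workingSet < inactiveFile {
        workingSet = 0
    } else {
        workingSet -= inactiveFile
    }
    fmt.Printf("usage=%d inactive_file=%d working_set=%d bytes\n", usage, inactiveFile, workingSet)
}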
Conclusion
Let’s recap and try to summarize the essence of the research that was compiled in the above sections.
Let’s say you have a Go application, optionally loading eBPF programs, running on Kubernetes, and it has memory issues. By memory issue, I mean that kubectl top, or the metric monitored on Grafana, container_memory_working_set_bytes, is “too high” for your taste. How do you understand and trace the link between your program’s memory usage and the final figure reported by Kubernetes?
container_memory_working_set_bytes is a heuristic memory metric computed in the container world, trying to estimate what the OOM killer will look for when checking the process memory consumption. It does not exist as-is in the cgroups statistics.
This value can technically come out of cAdvisor, embedded in the kubelet, or of the container runtime. However, at the moment, it mostly comes out of cAdvisor anyway because of a kubelet bug. While the computation of this high-level metric is done by cAdvisor or the container runtime, if you are using runc as your low-level container runtime1, the cgroup statistics will be retrieved using runc’s libcontainer reference implementation.
Depending on whether you are using cgroups v1 or v2, the following table shows which cgroup stats are used to compute container_memory_working_set_bytes. Also keep in mind that this metric is always >= 0, so it’s technically max(0, <value below>).
Cgroups | Container Memory Working Set Bytes |
---|---|
version 1 | memory.usage_in_bytes - memory.stat[total_inactive_file] |
version 2 | memory.current - memory.stat[inactive_file] |
So if the amount of inactive_file memory is low, you can approximate this value to the main memory cgroup statistic, memory.usage_in_bytes or memory.current for cgroup v1 and v2 respectively. For details about how we ended up with this approximation, see the diagram and code deep dive above.
Now, to continue your investigation, you might want to:
- Read the Linux process memory use and memory segments. Depending on the way your process handles memory, it can give you a first view on how memory is used.
- Reproduce the measurements locally using cgroups. It can be easier than spinning up a whole Kubernetes cluster and deploying your fixed program.
- Check if your Go program uses too much heap. Using a language-specific profiler will give you information about how the heap is distributed. Don’t forget to learn about GOMEMLIMIT for containerized environments and the other aspects of the Go garbage collector, to optimize its behavior for your use case.
- Check if your eBPF maps use too much kernel memory. When using cgroups v2, kernel memory used by BPF maps is accounted for properly; you can retrieve some stats using bpftool and jq, and plot some pie charts to visualize them.
If you find errors in this article, please reach out.
You might not use runc if you are using crun, a container runtime implementation written in C, or isolation-focused low-level runtimes like AWS firecracker-containerd or Google gVisor. See this article on the differences between Docker, containerd, CRI-O and runc if the notion of container runtime is unclear. ↩︎ ↩︎