Tips and Tricks for Writing Linux BPF Applications With libbpf
This post introduces some tips and tricks for writing BPF applications with libbpf.
At the beginning of 2020, while I was using the BCC tools to analyze our database performance bottlenecks, I pulled the code from GitHub and discovered by accident that the BCC project had an additional libbpf-tools directory. I had read an article on BPF portability and another on BCC to libbpf conversion, and I used what I learned to convert my previously submitted bcc-tools to libbpf-tools. I ended up converting nearly 20 tools. (See Why We Switched from bcc-tools to libbpf-tools for BPF Performance Analysis.)
During this process, I was fortunate to get a lot of help from Andrii Nakryiko (the leader of the libbpf + BPF CO-RE project). It was fun and I learned a lot. In this post, I'll share my experience of writing Berkeley Packet Filter (BPF) applications with libbpf. I hope this article is helpful to people who are interested in libbpf and inspires them to further develop and improve BPF applications with libbpf.
Before you read further, however, consider reading these posts for important background information:
- BPF Portability and CO-RE
- HOWTO: BCC to libbpf conversion
- Building BPF applications with libbpf-bootstrap
This article assumes that you've already read these posts, so there won't be any systematic descriptions. Instead, I'll offer you some tips for certain parts of the program.
Combining the Open and Load Phases
If your BPF code doesn't need any runtime adjustments (for example, adjusting the map size or setting an extra configuration), you can call <name>__open_and_load() to combine the two phases into one. This makes the code more compact.
You can see the complete code in readahead.c.
Selective Attach
By default, <name>__attach() attaches all auto-attachable BPF programs. However, sometimes you might want to attach only some of the BPF programs, depending on the command line parameters. In this case, you can call bpf_program__attach() on each program you need instead.
You can see the complete code in biolatency.c.
Custom Load and Attach
Skeleton is suitable for almost all scenarios, but there is a special case: perf events. In this case, instead of using links from struct <name>__bpf, you need to define an array of your own: struct bpf_link *links[]. The reason is that perf_event needs to be opened separately on each CPU.
After this, open and attach perf_event by yourself.
Finally, during the teardown phase, remember to destroy each link in links yourself before destroying the skeleton, because the skeleton only cleans up the links it created.
You can see the complete code in runqlen.c.
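The per-CPU open and attach steps can be sketched roughly as follows. This is a minimal illustration, not the code from runqlen.c: MAX_CPU_NR, the helper names, and the choice of a CPU-clock software event are assumptions, and prog stands for whichever perf_event program your skeleton exposes. It needs libbpf, a Linux kernel with perf events, and sufficient privileges to run.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <bpf/libbpf.h>

#define MAX_CPU_NR 128	/* assumed upper bound, for illustration only */

/* Links kept outside the skeleton, one per CPU. */
static struct bpf_link *links[MAX_CPU_NR];

/* Open a sampling perf event on each CPU and attach prog to it.
 * Returns 0 on success, -1 on the first failure. */
static int open_and_attach_perf_event(struct bpf_program *prog,
				      int freq, int nr_cpus)
{
	struct perf_event_attr attr;
	int i, fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.freq = 1;			/* interpret sample_freq as Hz */
	attr.sample_freq = freq;

	for (i = 0; i < nr_cpus && i < MAX_CPU_NR; i++) {
		/* pid = -1, cpu = i: all tasks, on this CPU only.
		 * A real tool may want to skip offline CPUs here. */
		fd = syscall(__NR_perf_event_open, &attr, -1, i, -1, 0);
		if (fd < 0)
			return -1;
		links[i] = bpf_program__attach_perf_event(prog, fd);
		if (!links[i] || libbpf_get_error(links[i])) {
			links[i] = NULL;
			close(fd);
			return -1;
		}
	}
	return 0;
}

/* Teardown counterpart: destroy every per-CPU link. */
static void destroy_perf_event_links(int nr_cpus)
{
	int i;

	for (i = 0; i < nr_cpus && i < MAX_CPU_NR; i++) {
		bpf_link__destroy(links[i]);	/* NULL entries are ignored */
		links[i] = NULL;
	}
}
```

A caller would typically pass libbpf_num_possible_cpus() as nr_cpus, call the first helper after the load phase, and call the second one before destroying the skeleton.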
Multiple Handlers for the Same Event
Starting in v0.2, libbpf supports multiple entry-point BPF programs within the same executable and linkable format (ELF) section. Therefore, you can attach multiple BPF programs to the same event (such as tracepoints or kprobes) without worrying about ELF section name clashes. For details, see Add libbpf full support for BPF-to-BPF calls. Now, you can naturally define multiple handlers for an event like this:
If your libbpf version is earlier than v0.2, you have to use multiple program types to define multiple handlers for the same event.
You can see the complete code in hardirqs.bpf.c.
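As a rough sketch of what two handlers in one ELF section can look like with libbpf v0.2 or later, assuming a kernel with BTF and a generated vmlinux.h: the handler names and the two counters below are hypothetical and are not taken from hardirqs.bpf.c.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical counters, one per handler, readable from user space. */
__u64 first_handler_hits = 0;
__u64 second_handler_hits = 0;

/* Both programs live in the same ELF section and attach to the same
 * tracepoint; only their function names differ. */
SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry1, int irq, struct irqaction *action)
{
	__sync_fetch_and_add(&first_handler_hits, 1);
	return 0;
}

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry2, int irq, struct irqaction *action)
{
	__sync_fetch_and_add(&second_handler_hits, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

In the generated skeleton, each function shows up as its own entry under progs and links, so the two handlers can be attached together or selectively.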
Reduce Pre-Allocation Overhead
Beginning in Linux 4.6, BPF hash maps pre-allocate memory by default, and the BPF_F_NO_PREALLOC flag was introduced to opt out. The motivation for doing so is to avoid kprobe + bpf deadlocks. The community had tried other solutions, but in the end, pre-allocating all the map elements was the simplest solution and didn't affect the behavior visible from user space.
When full map pre-allocation is too memory expensive, define the map with the BPF_F_NO_PREALLOC flag to keep the old behavior. For details, see bpf: map pre-alloc. When the map size is not large (such as MAX_ENTRIES = 256), this flag is not necessary, because maps created with BPF_F_NO_PREALLOC are slower.
You can see many cases in libbpf-tools.
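A map definition with the flag might look like the following sketch. The map name, the key and value types, and the MAX_ENTRIES value are assumptions chosen for illustration, not taken from a specific tool.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define MAX_ENTRIES 10240	/* assumed size, large enough that pre-allocation hurts */

/* Hash map whose elements are allocated on demand instead of up front. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, MAX_ENTRIES);
	__type(key, u32);
	__type(value, u64);
	__uint(map_flags, BPF_F_NO_PREALLOC);
} start SEC(".maps");

char LICENSE[] SEC("license") = "GPL";
```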
Determine the Map Size at Runtime
One advantage of libbpf-tools is that it is portable, so the maximum space required for a map may differ from machine to machine. In this case, you can define the map without specifying its size and resize it before load. In <name>.bpf.c, define the map without max_entries. Then, after the open phase and before the load phase, call bpf_map__resize() to set the size. (Newer libbpf releases deprecate bpf_map__resize() in favor of bpf_map__set_max_entries().)
You can see the complete code in cpudist.c.
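The value you pass when resizing has to come from the machine the tool runs on. As a minimal sketch, assuming the map is keyed by PID so that the system's pid_max is a natural upper bound, a tool could read the limit like this. The helper names are hypothetical, and the libbpf call itself appears only in a comment so that the snippet builds without libbpf.

```c
#include <stdio.h>

/* Read a single positive integer from a text file.
 * Returns the value, or -1 if the file is missing or malformed. */
static long read_positive_long(const char *path)
{
	FILE *f = fopen(path, "r");
	long value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1 || value <= 0)
		value = -1;
	fclose(f);
	return value;
}

/* Upper bound on PIDs for the running system, or -1 on error. */
static long get_pid_max(void)
{
	return read_positive_long("/proc/sys/kernel/pid_max");
}

/* Between the open and load phases, the result would be passed to
 * bpf_map__resize(map, pid_max) or, on newer libbpf,
 * bpf_map__set_max_entries(map, pid_max). */
```

Sizing the map from the running system avoids compiling in a worst-case constant that wastes memory on small machines and overflows on large ones.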
Per-CPU
When you select the map type, if multiple related events occur on the same CPU, using a per-CPU array to track the timestamp is much simpler and more efficient than using a hash map. However, you must be sure that the kernel doesn't migrate the process from one CPU to another between the two BPF program invocations, so you can't always use this trick. The soft interrupt analysis in softirqs meets both of these conditions.
You can see the complete code in softirqs.bpf.c.
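The pattern can be sketched as follows, assuming a kernel with BTF and a generated vmlinux.h. It is a simplified illustration rather than the code from softirqs.bpf.c: the total_ns accumulator is hypothetical and stands in for whatever the tool does with the measured latency.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* One timestamp slot per CPU. A softirq is entered and exited on the
 * same CPU, so a single element with key 0 is enough. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

__u64 total_ns = 0;	/* hypothetical accumulator read from user space */

SEC("tp_btf/softirq_entry")
int BPF_PROG(softirq_entry, unsigned int vec_nr)
{
	u64 ts = bpf_ktime_get_ns();
	u32 key = 0;

	bpf_map_update_elem(&start, &key, &ts, BPF_ANY);
	return 0;
}

SEC("tp_btf/softirq_exit")
int BPF_PROG(softirq_exit, unsigned int vec_nr)
{
	u32 key = 0;
	u64 *tsp = bpf_map_lookup_elem(&start, &key);

	if (!tsp || !*tsp)
		return 0;	/* missed the matching entry event */
	__sync_fetch_and_add(&total_ns, bpf_ktime_get_ns() - *tsp);
	*tsp = 0;		/* reset the slot for the next softirq */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Compared with a hash map keyed by CPU or task, there is no key to compute and no element to delete afterwards.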
Global Variables
Not only can you use global variables to customize BPF program logic, but you can also use them instead of maps to make your program simpler and more efficient. Global variables can be of any size, as long as the size is fixed (or at least has a bounded maximum, if you don't mind wasting some memory).
For example, because the number of SOFTIRQ types is fixed, you can define global arrays in the BPF program to save counts and histograms.
Then, you can traverse the arrays directly in user space.
You can see the complete code in softirqs.c.
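The user-space traversal can be sketched in plain C. In a real tool the arrays live in the skeleton's data section (reachable through the skeleton object); here a stand-in struct, softirqs_bss, plays that role so the snippet builds without libbpf. The struct, the helper names, and the output format are assumptions for illustration only.

```c
#include <stdio.h>

#define NR_SOFTIRQS 10	/* softirq vectors: HI through RCU */

/* Stand-in for the skeleton's data section holding the global array. */
struct softirqs_bss {
	unsigned long long counts[NR_SOFTIRQS];
};

static const char *vec_names[NR_SOFTIRQS] = {
	"hi", "timer", "net_tx", "net_rx", "block",
	"irq_poll", "tasklet", "sched", "hrtimer", "rcu",
};

/* Sum of all per-vector counts. */
static unsigned long long total_count(const struct softirqs_bss *bss)
{
	unsigned long long total = 0;
	int i;

	for (i = 0; i < NR_SOFTIRQS; i++)
		total += bss->counts[i];
	return total;
}

/* Print one row per vector with a non-zero count.
 * Returns the number of rows printed. */
static int print_counts(const struct softirqs_bss *bss, FILE *out)
{
	int i, printed = 0;

	fprintf(out, "%-16s %8s\n", "SOFTIRQ", "COUNT");
	for (i = 0; i < NR_SOFTIRQS; i++) {
		if (!bss->counts[i])
			continue;
		fprintf(out, "%-16s %8llu\n", vec_names[i], bss->counts[i]);
		printed++;
	}
	return printed;
}
```

No map lookups or system calls are needed for the traversal: the skeleton memory-maps the global data, so reading the array is an ordinary memory access.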
Watch Out for Directly Accessing Fields Through Pointers
As you know from the BPF Portability and CO-RE blog post, the libbpf + BPF_PROG_TYPE_TRACING approach lets you take advantage of the BPF verifier's smartness. The verifier understands and tracks BTF natively and allows you to follow pointers and read kernel memory directly (and safely). For example, you can write q->elevator directly instead of going through a helper.
This is very cool. However, when you use such expressions in conditional statements, a verifier bug in some kernel versions causes the branch to be optimized away. Until the fix, bpf: fix an incorrect branch elimination by verifier, is widely backported, please use BPF_CORE_READ for kernel compatibility. You can find an example in biolatency.bpf.c.
As you can see there, even though it's a tp_btf program and q->elevator would be faster, I have to use BPF_CORE_READ(q, elevator) instead.
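The workaround can be sketched like this. It is an illustration rather than the code from biolatency.bpf.c, and it assumes a kernel where the block_rq_insert tracepoint passes the request queue as its first argument (newer kernels pass only the request). The counter is hypothetical.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

__u64 inserts_with_elevator = 0;	/* hypothetical counter */

SEC("tp_btf/block_rq_insert")
int BPF_PROG(block_rq_insert, struct request_queue *q, struct request *rq)
{
	/* In a tp_btf program, `if (q->elevator)` also passes the verifier,
	 * but affected kernels may eliminate the branch incorrectly.
	 * Reading the field through BPF_CORE_READ avoids that. */
	if (BPF_CORE_READ(q, elevator))
		__sync_fetch_and_add(&inserts_with_elevator, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

The cost is an extra probe-read helper call per access, which is the price paid for behaving correctly on kernels without the fix.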
This article introduced some tips for writing BPF programs with libbpf. You can find many practical examples from libbpf-tools and bpf. If you have any questions, you can join the TiDB community on Slack and send us your feedback.
Published at DZone with permission of Wenbo Zhang. See the original article here.