Providing NHI to Services: Piece of Cake or Not?
We can all agree that in today’s modern infrastructure, the significance of non-human identities has grown dramatically. Automatically provisioning identities for services has become a critical responsibility for security and infrastructure teams. At Riptides, we aim to be the catalyst for this transformation by offering a simple, automated path to streamline your non-human identity journey. The Riptides platform provides a comprehensive view of your entire infrastructure, enabling you to monitor whether services are communicating securely, whether they have identities, and—most importantly—who is talking to whom. Armed with this information, administrators can automatically provision identities for every service. Of course, delivering this level of visibility requires tracing connections—but the question remains: how?
Getting Events Out of the Kernel—or Lost in the Woods?
At Riptides, we’ve committed to three guiding principles for the tracing solutions we adopt:
- Never compromise the performance of live workloads
- Design for maximum portability (it should run anywhere)
- Leverage existing kernel tracing tools for easy debugging and integration
With these principles in mind, our goal is to capture kernel-level events and route them efficiently to user space for analysis. While there are many existing tools available, we don’t aim to reinvent the wheel. Instead, we focus on sticking to our principles while maximizing the benefits of proven solutions.
Dynamic Probes: kprobe & kretprobe
If you're researching kernel tracing or event collection, sooner or later you'll come across Kernel Probes (kprobes). Kprobes allow you to dynamically hook into almost any point in the kernel code to capture valuable events. There are two main types of probes: kprobes and kretprobes. A kprobe triggers when a specific function is called, while a kretprobe fires when that function returns. Traditionally, you had to write a kernel module that subscribed to a specific address, triggering when execution reached that point. This module also registered a handler containing the business logic for what to do when the probe fired. With the rise of eBPF, this process has become much less error-prone and far more accessible and widely adopted. Let’s dig deeper and explore how kprobes actually work.
How Do They Really Work?
When a kprobe is registered, the probed address in the kernel is replaced with a breakpoint—much like how a debugger works. When the CPU hits that instruction, a trap occurs: CPU registers are saved, and control is transferred to the kprobe handler. This handler receives both the kprobe struct and the saved register state, executes the defined logic, and then resumes normal execution. A kretprobe operates similarly, but instead of breakpoints, it uses trampolines. These trampolines are small snippets of code injected by modifying the return address on the stack. They execute the handler logic and then return execution to the original function return address. To handle recursive or concurrent calls of the same function, the kretprobe struct includes a maxactive field, which defines how many simultaneous instances can be probed. Additionally, you can specify an entry_handler to pre-filter calls, deciding whether the main kretprobe handler should be executed or skipped.
Breakpoint: A special CPU instruction injected into executable code to trigger a trap.
Trampoline: A small, custom code snippet inserted into the control flow by overwriting the return address. It executes handler logic and returns to the original flow.
In certain scenarios, kprobes can be optimized to use trampolines instead of breakpoints for improved performance.
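To make this concrete, here is a minimal sketch of the traditional kernel-module approach described above, covering both a kprobe and a kretprobe with maxactive and entry_handler. The probed symbols, handler names, and the maxactive value are illustrative choices rather than code from a real Riptides component; the structure mirrors the kprobe and kretprobe samples shipped with the kernel.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>

/* kprobe: fires on entry to the probed function; keep the logic short and non-sleeping */
static int kp_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
        pr_info("kprobe hit: %s\n", p->symbol_name);
        return 0;
}

static struct kprobe kp = {
        .symbol_name = "do_sys_openat2",   /* illustrative target symbol */
        .pre_handler = kp_pre_handler,
};

/* kretprobe entry_handler: pre-filters calls; returning non-zero skips the return handler */
static int rp_entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        return 0;   /* 0 = trace this invocation */
}

/* kretprobe handler: fires when the probed function returns */
static int rp_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        pr_info("kretprobe: returned %ld\n", regs_return_value(regs));
        return 0;
}

static struct kretprobe rp = {
        .kp.symbol_name = "kernel_clone",  /* illustrative target symbol */
        .entry_handler  = rp_entry_handler,
        .handler        = rp_ret_handler,
        .maxactive      = 20,              /* up to 20 concurrent probed invocations */
};

static int __init probes_init(void)
{
        int ret = register_kprobe(&kp);

        if (ret)
                return ret;
        ret = register_kretprobe(&rp);
        if (ret)
                unregister_kprobe(&kp);
        return ret;
}

static void __exit probes_exit(void)
{
        unregister_kretprobe(&rp);
        unregister_kprobe(&kp);
}

module_init(probes_init);
module_exit(probes_exit);
MODULE_LICENSE("GPL");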
Function Tracing: fprobe
When narrowing the scope to function entry and exit points, we encounter another built-in tracer: fprobe. Fprobe leverages the ftrace infrastructure—a powerful tracing framework integrated directly into the Linux kernel. Like kprobe and kretprobe, fprobe allows you to attach handlers to the entry and return points of functions. However, instead of relying on breakpoints or trampolines, fprobe uses ftrace hooks. When the kernel (or a kernel module) is compiled with specific configuration options, lightweight entry hooks are embedded into many functions. These hooks are initially implemented as no-ops or lightweight jumps, which are later patched at runtime to redirect execution to the ftrace handler—this includes your custom handler logic. In short, while fprobe is limited to specific, pre-instrumented points, it is significantly faster and better suited to high-frequency function calls, making it ideal for performance-sensitive tracing.
Static Instrumentation: Tracepoints
The Linux kernel provides a way to add statically defined instrumentation hooks called tracepoints. These can be embedded directly into the code at meaningful locations using the TRACE_EVENT() macro. Although tracepoints use the ftrace infrastructure under the hood, this is completely transparent to developers. By following a few guidelines, developers can gain full access to the ftrace parser. Tracepoints are inserted at compile time, resulting in call sites in the kernel that check whether the tracepoint is enabled. If it’s disabled, the overhead is minimal—just a tiny time and space penalty. If enabled, it calls the registered function, executes the handler logic, and returns to the original caller. Tracepoints are safe and easy to use:
- They’re part of the official kernel API.
- The data structures they expose are stable, so you don’t need to worry about kernel version changes breaking your code.
- And they’re lightning-fast, thanks to their use of ftrace hooks, as described earlier.
Comparing the Three
As a Rule of Thumb:
- Use tracepoints when:
  - You need stable, structured data
  - You’re tracing commonly monitored behavior
  - You control or maintain the code being instrumented
- Use fprobe for:
  - High-performance, generic function tracing
- Use kprobe for:
  - Everything else, especially when other options aren’t applicable
At Riptides, we chose to rely on tracepoints—primarily because we're the developers of the module that needs tracing. By using tracepoints, we ensure that our tracing remains stable across kernel versions, since we control the data structures. Additionally, the meaningful events we want to observe happen within the functions themselves. Without tracepoints, we’d be forced to use kprobe—a powerful but more fragile and error-prone method, especially for internal logic.
Deep Dive: Kernel Tracepoint API
Over time, several iterations of the kernel tracing system have led to the development of the TRACE_EVENT macro, which is now the standard and stable way to create tracepoints. If you're interested in the historical evolution, LWN.net has several excellent articles on the topic. However, this blog will focus solely on the modern TRACE_EVENT interface.
The TRACE_EVENT macro allows you to insert a tracepoint at a specific location in your code and automatically integrates it with ftrace. This macro is defined in linux/tracepoint.h.
To define a fully functional tracepoint, you have to meet a couple of requirements:
- A tracepoint definition which can be placed within kernel or kernel module code
- A callback function to handle the event
- The callback must record incoming data into the tracer’s ring buffer
- A "stringer" function that formats the data into a human-readable string for output
Let’s take a look at how the TRACE_EVENT macro manages all of this under the hood:
TRACE_EVENT(name, proto, args, struct, assign, print)
Each TRACE_EVENT definition includes six components:
- name: The unique name of the tracepoint. Tracepoint names must be globally unique across the kernel, so choose carefully.
- proto: The prototype of the tracepoint callback function.
- args: The arguments passed to the tracepoint, matching the prototype.
- struct: The data structure used to store the tracepoint's data.
- assign: Code that assigns values to the fields of the structure.
- print: A "stringer" function that formats the stored data for display in human-readable form (ASCII).
With the exception of name, all the other components are written using specific helper macros for readability and correctness: TP_PROTO, TP_ARGS, TP_STRUCT__entry, TP_fast_assign, and TP_printk.
Let’s walk through an example TRACE_EVENT that logs a UUID and a timestamp.
TRACE_EVENT(stamp_with_ts_and_uuid)
This is the name of the tracepoint, and it will be used when invoking the tracepoint from your kernel module. Internally, the macro adds a trace_ prefix, so the actual function becomes trace_stamp_with_ts_and_uuid(). This name must be globally unique across the entire kernel.
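For example, assuming the tracepoint definition lives in a header included by the module (the header name, function, and variable below are hypothetical), emitting an event looks like this:

#include <linux/uuid.h>
#include "riptides_trace.h"   /* hypothetical header containing the TRACE_EVENT definition */

static void record_connection_event(void)
{
        uuid_t conn_id;

        uuid_gen(&conn_id);                       /* random UUID, just for illustration */
        trace_stamp_with_ts_and_uuid(&conn_id);   /* no-op unless the tracepoint is enabled */
}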
TP_PROTO(const uuid_t *trace_id)
This defines the prototype of the tracepoint callback. It sets the function signature that both the tracepoint and its handler will use. Keep in mind: the callback executes in the same context as where the tracepoint is called, so it must be fast and safe.
TP_ARGS(trace_id)
This macro lists the actual arguments to be passed when calling the tracepoint. It's needed because the macro expands into a function call, and the compiler must know what values to pass.
Think of it like this:
// Function prototype
void my_tracepoint(int a, int b);   // <-- TP_PROTO

// Function invocation
my_tracepoint(a, b);                // <-- TP_ARGS
TP_STRUCT__entry()
TP_STRUCT__entry(
    __array(u8, uuid, UUID_SIZE)
    __field(s64, trace_timestamp)
)
This macro defines the structure used to store trace event data in the tracer’s ring buffer. If you weren’t already fond of macros, get ready—they’re everywhere here. Within TP_STRUCT__entry, you'll find other specialized macros for defining fields:
- __field(type, name): Declares a standard scalar field (e.g., an int, s64, etc.)
- __array(type, name, len): Declares a fixed-length array
- __string(name, src): Declares a null-terminated string
- __dynamic_array(type, name, len): Declares a variable-length array with the size provided by the third parameter
A special variable, __entry, is introduced here. It represents a pointer to this structure and points directly into the ring buffer during assignment.
TP_fast_assign()
TP_fast_assign(
    memcpy(__entry->uuid, trace_id->b, UUID_SIZE);
    __entry->trace_timestamp = ktime_get_real_ns();
)
This macro defines how data is assigned to the fields declared in TP_STRUCT__entry. No nested macros here—just plain C code using the __entry pointer to store data directly into the ring buffer. You can use any arguments listed in TP_ARGS() here. For example:
- Simple scalar fields (like __entry->trace_timestamp) can be set with direct assignment.
- For arrays (like __entry->uuid), use memcpy() to populate the buffer.
There are two more helper macros available in this context:
- __assign_str(name, src): Used to assign values to fields declared with __string
- __get_dynamic_array(name): Returns a pointer to a dynamic array declared via __dynamic_array, which you can then populate with a memcpy
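As a fragment only (it assumes an event whose TP_STRUCT__entry declared __string(comm, current->comm) and __dynamic_array(u8, payload, len), and whose TP_ARGS provides data and len—none of which is part of the UUID example above), the two helpers would be used like this:

TP_fast_assign(
    __assign_str(comm, current->comm);                 /* copy the string into the ring buffer */
    memcpy(__get_dynamic_array(payload), data, len);   /* fill the variable-length array */
)

Note that the two-argument form of __assign_str shown here matches the interface described above; very recent kernels have been moving to a single-argument variant.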
TP_printk()
TP_printk("uuid:%d, timestamp:%lld",
__entry->uuid,
__entry->trace_timestamp));
The TP_printk() macro defines a format string for rendering the tracepoint data, similar to printf(). It controls how the contents of the __entry struct are displayed in user space (e.g., via trace-cmd or perf).
The format string follows standard printf syntax, and __entry is again used to reference the structure fields defined earlier in TP_STRUCT__entry.
Several helper macros are also available:
- __get_str(name): Retrieves a pointer to a __string field
- __get_dynamic_array(name): Retrieves a pointer to a __dynamic_array field (also used internally by __get_str)
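Putting all of the pieces together, the complete definition for our UUID-and-timestamp example lives in a dedicated trace header that the module includes. Below is a minimal sketch; the file name, the riptides TRACE_SYSTEM name, and the include-path boilerplate are illustrative assumptions rather than code from an actual Riptides module:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM riptides

#if !defined(_RIPTIDES_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _RIPTIDES_TRACE_H

#include <linux/tracepoint.h>
#include <linux/uuid.h>      /* provides uuid_t and UUID_SIZE (16) */
#include <linux/ktime.h>     /* provides ktime_get_real_ns() */

TRACE_EVENT(stamp_with_ts_and_uuid,

        TP_PROTO(const uuid_t *trace_id),

        TP_ARGS(trace_id),

        TP_STRUCT__entry(
                __array(u8, uuid, UUID_SIZE)
                __field(s64, trace_timestamp)
        ),

        TP_fast_assign(
                memcpy(__entry->uuid, trace_id->b, UUID_SIZE);
                __entry->trace_timestamp = ktime_get_real_ns();
        ),

        TP_printk("uuid:%pU, timestamp:%lld",
                __entry->uuid, __entry->trace_timestamp)
);

#endif /* _RIPTIDES_TRACE_H */

/* This part must stay outside the include guard so define_trace.h can re-read the file */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_FILE riptides_trace
#include <trace/define_trace.h>

Exactly one source file in the module must define CREATE_TRACE_POINTS before including this header so the tracepoint bodies are instantiated; every other file simply includes it. For an out-of-tree module, the Makefile also needs the source directory on the include path (e.g., ccflags-y += -I$(src)) so define_trace.h can locate the header again.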
Now that we’ve covered the key components of the tracepoint API, there’s one piece that keeps coming up but hasn’t been discussed in detail yet: the ring buffer where trace events are stored. In the next section, we’ll briefly explore how this buffer works and how events move through it.
Tracer's Ring Buffer
The ring buffer, also known as a circular buffer, is a core component of the Linux kernel tracing subsystem. It uses a fixed-size buffer where the end wraps around to the beginning—allowing for efficient, continuous data storage without dynamic memory allocations.
Key advantages:
- Constant-time reads and writes
- No memory allocation during writes
- Lockless operation for performance
The kernel tracing ring buffer can operate in two modes:
- Overwrite mode (default): If the reader cannot keep up, the oldest data is overwritten to make room for new events.
- Producer/consumer mode: If the buffer is full, new data is dropped, preserving older entries.
To avoid contention and ensure scalability, the kernel allocates a separate ring buffer per CPU. This per-CPU design reduces locking and improves performance on multicore systems. When a trace event occurs, the TP_fast_assign() macro populates the relevant data fields directly into the ring buffer. These stored events can later be retrieved with tracing tools such as trace-cmd or perf, or by reading the trace_pipe file in tracefs.
Conclusion
In this first blog post, we explored the fundamentals of Linux kernel tracing:
- Dynamic probes (kprobe/kretprobe) for general-purpose, on-the-fly instrumentation
- Narrower, high-performance hooks via fprobe and the ftrace infrastructure
- Static, rock-solid tracepoints powered by the TRACE_EVENT API
- The role of the per-CPU ring buffer in efficiently capturing and storing event data
By choosing tracepoints for our Riptides module, we gain stability across kernel versions, minimal overhead when disabled, and structured, human-readable data when enabled.
Stay tuned for Part 2, where we’ll dive into how Riptides retrieves and processes trace events in user space—and demonstrate how to leverage familiar tools like strace and perf trace to debug and validate your tracepoints.