grsecurity - Teardown of a Failed Linux LTS Spectre Fix

Today's blog will serve as a deep dive into a recent Spectre fix, one of dozens being manually applied to the upstream Linux kernel. We'll cover the full path this fix took, from its warning-inducing initial state to its correction upstream and then later brokenness when backported to all of the upstream Long Term Support (LTS) kernels. We'll look at both the nature of the flaw in the backport as well as flaws in upstream processes that resulted in this backporting failure. As contrast, we'll note how our independent review process spotted this vulnerability at commit time and how our Respectre plugin had already automatically fixed the underlying vulnerability.

The earliest version of the fix that I've been able to find is this patch from Dianzhang Chen on May 24th 2019. It addresses speculative access of an array with a user-controlled index via the ptrace syscall. The original fix was fine from a fixing-the-vuln perspective, but Thomas Gleixner replied here with some style comments that set in motion a series of cascading failures. Exactly one month and one day later, on June 25th 2019, the second version of the patch was submitted here. As the details of the fix are now important, I've reproduced the diff below:

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a166c96..cbac646 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -25,6 +25,7 @@
 #include <linux/rcupdate.h>
 #include <linux/export.h>
 #include <linux/context_tracking.h>
+#include <linux/nospec.h>
 
 #include <linux/uaccess.h>
 #include <asm/pgtable.h>
@@ -643,9 +644,11 @@ static unsigned long ptrace_get_debugreg(struct task_struct *tsk, int n)
 {
 	struct thread_struct *thread = &tsk->thread;
 	unsigned long val = 0;
+	int index = n;
 
 	if (n < HBP_NUM) {
-		struct perf_event *bp = thread->ptrace_bps[n];
+		index = array_index_nospec(index, HBP_NUM);
+		struct perf_event *bp = thread->ptrace_bps[index];
 
 		if (bp)
 			val = bp->hw.info.address;

As you'll note, the beginning of the code block initializes the index variable before declaring the bp pointer. This results in the following compiler warning:

arch/x86/kernel/ptrace.c: In function 'ptrace_get_debugreg':
arch/x86/kernel/ptrace.c:705:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
   struct perf_event *bp = thread->ptrace_bps[index];
   ^~~~~~

Despite this warning, this code was merged into Thomas Gleixner's x86/tip tree verbatim, as can be seen here. Prior to merging the fix for 5.3-rc1, Linus Torvalds noticed the warning as seen on the LKML mailing list here and fixed it correctly. However, when the actual merge of the tree was performed, no mention was made of the correction to the fix, and with no specific commit mentioning the correction and fixing it alone, everyone else's processes that depended on cherry-picking specific commits ended up grabbing the bad warning-inducing change. As a further failure, instead of looking at Linus' correct fix (observable by checking out the master tree at the time), the approach employed in the LTS kernels seems to have been to naively silence the warning by simply swapping the order of the two lines, producing this:

+               struct perf_event *bp = thread->ptrace_bps[index];
+               index = array_index_nospec(index, HBP_NUM);

To clarify why this fix is bad, let's first discuss the upstream array_index_nospec() API. This function takes an index value along with the first larger out-of-bounds value for the index. Without the use of conditional control flow, it translates that index into either the original index value (if within bounds) or zero (if out of bounds). By ensuring that all later uses of the original index instead use the value returned from this API, the damage from speculated paths with an out-of-bounds index value is prevented.

Here, however, "index" was used as an index into the ptrace_bps array before the index was passed through the array_index_nospec() macro. Furthermore, as "index" has no later usage in the function, the entire array_index_nospec() operation would normally be considered a dead store by the compiler and eliminated through an optimization pass called Dead Store Elimination (DSE). This latter effect ends up not happening fully here due to the "volatile" keyword used on some arch-specific inline assembly used to perform the masking for array_index_nospec() [1]. We can demonstrate both of these things by looking at the resulting disassembly of the function, reproduced and commented below:

.text:0000000000000648 ptrace_get_debugreg proc near           ; CODE XREF: getreg32+EF
.text:0000000000000648                                         ; arch_ptrace+6E
.text:0000000000000648                 push    rbp
.text:0000000000000649                 cmp     esi, 3 ; esi = index (or n), this is the if (n < HBP_NUM) check
.text:000000000000064C                 mov     rbp, rsp
.text:000000000000064F                 jg      short loc_673
.text:0000000000000651                 movsxd  rsi, esi ; rsi = sign-extended index
.text:0000000000000654 ; rdx is 'bp' here, using rsi derived from user-provided one, with no speculation barrier
.text:0000000000000654                 mov     rdx, [rdi+rsi*8+800h]
.text:000000000000065C ; here begins the unused array_index_nospec()
.text:000000000000065C                 cmp     rsi, 4 ; from arch/x86/include/asm/barrier.h asm volatile
.text:0000000000000660                 sbb     rsi, rsi ; from arch/x86/include/asm/barrier.h asm volatile
.text:0000000000000663                 xor     eax, eax ; fallout from arch/x86/include/asm/barrier.h asm volatile
.text:0000000000000665                 test    rdx, rdx
.text:0000000000000668                 jz      short loc_68F
.text:000000000000066A                 mov     rax, [rdx+138h]
.text:0000000000000671                 jmp     short loc_68F
.text:0000000000000673 ; ---------------------------------------------------------------------------
.text:0000000000000673
.text:0000000000000673 loc_673:                                ; CODE XREF: ptrace_get_debugreg+7
.text:0000000000000673                 cmp     esi, 6
.text:0000000000000676                 jnz     short loc_681
.text:0000000000000678                 mov     rax, [rdi+820h]
.text:000000000000067F                 jmp     short loc_68F
.text:0000000000000681 ; ---------------------------------------------------------------------------
.text:0000000000000681
.text:0000000000000681 loc_681:                                ; CODE XREF: ptrace_get_debugreg+2E
.text:0000000000000681                 xor     eax, eax
.text:0000000000000683                 cmp     esi, 7
.text:0000000000000686                 jnz     short loc_68F
.text:0000000000000688                 mov     rax, [rdi+828h]
.text:000000000000068F
.text:000000000000068F loc_68F:                                ; CODE XREF: ptrace_get_debugreg+20
.text:000000000000068F                                         ; ptrace_get_debugreg+29 ...
.text:000000000000068F                 pop     rbp
.text:0000000000000690                 retn
.text:0000000000000690 ptrace_get_debugreg endp

Of note is that nowhere in the bad, modified cherry-picked commit does it mention that any changes were made to it [2]. We see at least two opportunities prior to the bad LTS fix that a proper fix could have been created (here and here), but for whatever reason Greg ignored the existing correct fix and went it alone with his bad fix that doesn't seem to have been reviewed by anyone else. Once that bad commit was in place in one upstream LTS kernel, it was subsequently cherry-picked to all other stable kernels, making them no more secure than kernels without the fix, but providing a false sense of security. We have confirmed this bad backport exists in the upstream 4.4, 4.9, 4.14, 4.19, and 5.2 kernels since July of this year. This is all of the stable/longterm kernels listed on kernel.org, with the exception of the 3.16 tree that doesn't contain any fix at all (it appears to only receive backports for severe issues).

We independently backported the fix on July 9th 2019 and on noticing the warning, fixed it correctly. When the upstream kernels later backported their bad fix, it created a conflict in our git repo that led to us immediately spotting their flaw (and keeping our existing fixes). But what if we hadn't backported the fix, or what if the issue was one of the many other Spectre instances lurking in the kernel? Here, our Respectre plugin would come to the rescue. Respectre is a compiler plugin we announced here on October 4th 2018, which remains the world's most advanced, effective, and high-performance defense against Spectre speculation attacks. The plugin uses advanced static analysis to automatically find potential Spectre instances and eliminate them through high-performance instrumentation. Noteworthy here is that it can also be used in a mode that reports all the instances it finds and fixes. By taking our current code and reverting any fixes for the vulnerability, we get the following output:

arch/x86/kernel/ptrace.c: In function 'ptrace_get_debugreg':
arch/x86/kernel/ptrace.c:717:22: note: Spectre v1 array index bound '3'
   struct perf_event *bp = thread->ptrace_bps[n];
                      ^
arch/x86/kernel/ptrace.c:717:22: note: Spectre v1 array index mask adjust: inc constbound: yes

The takeaways here are that there is a real benefit to having an external/independent review and backporting process like the one we perform for our customers, as well as generic defenses like Respectre that can fill in gaps left by ad-hoc manual patching of individual vulnerabilities. Several flaws in upstream backporting processes (or at minimum, failure to adhere to processes that would have avoided the introduction of this vulnerability) were demonstrated. Finally, it should be noted that the "many eyes" of the upstream community failed to notice this flaw; without this blog post it would have likely persisted for many years. So many people were involved and so many opportunities existed to prevent this bad fix from being introduced. Yet the flaw was not only introduced but propagated to all supported stable kernels up to the posting of this blog, suggesting that upstream stable maintenance processes are simply not as robust as some would have the public believe. In all likelihood, this specific failure will be swept under the rug with an uncredited quick fix and a boiler-plate "all users must upgrade" release message, but the issues this blog post raises aren't as easily hand-waved away.

Update 9/4/2019: This issue has been assigned CVE-2019-15902. Fixes with credit have been submitted by Greg for inclusion in the next stable kernels, making this issue affect upstream LTS Linux kernel versions (inclusive) 4.4.186 to 4.4.190, 4.9.186 to 4.9.190, 4.14.134 to 4.14.141, 4.19.59 to 4.19.69, and 5.2.1 to 5.2.11.

[1] Removal of the "volatile" keyword from the inline asm in arch/x86/include/asm/barrier.h is confirmed to eliminate all traces of the unused/ineffective array_index_nospec().
[2] Some other kernel developers, like Ben Hutchings, do notate in the commit when they've adjusted the code for backporting. This erroneous change, however, appears to have been introduced by Greg KH himself.