grsecurity - CVE-2021-4440: A Linux CNA Case Study

The Introduction

This blog post is a case study of how the newly formed Linux CNA (CVE Numbering Authority) has affected Linux kernel vulnerability management, told through its mishandling of a vulnerability we reported a little over a month ago in the upstream 5.10 LTS kernel.

The Vulnerability

The report below provides the full details. In summary, a proposed backport of a set of patches improving the Linux kernel's mitigation against some newer MDS (Microarchitectural Data Sampling) attacks overlooked a difference in how the changes apply to kernels before 5.13. In kernels carrying the bad backport, when CONFIG_XEN_PV is enabled (even if Xen functionality is completely unused), the MDS mitigation against the newer attacks was turned into a no-op. Because the set of patches was meant to replace the old implementation of the original MDS mitigation, the mitigation against attacks as old as 2019 was turned into a no-op as well. The impact was various forms of information leakage, including KASLR defeats.
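
The effect can be sketched as a toy model of the sysret exit path. This is only an illustration of the report below, not kernel code: the function names are hypothetical, the instruction lists are simplified, and "verw" stands in for the CLEAR_CPU_BUFFERS macro.

```python
from typing import List


def sysret_exit_sequence(paravirt_xxl: bool, fixed: bool = False) -> List[str]:
    """Simplified instructions run when a 64-bit syscall returns via sysret.

    paravirt_xxl models CONFIG_PARAVIRT_XXL (selected by CONFIG_XEN_PV).
    fixed models a kernel with the 5.10.218 fix applied.
    """
    if fixed:
        # USERGS_SYSRET64 is dropped and the VERW sits directly in the exit path.
        return ["swapgs", "verw", "sysretq"]
    if paravirt_xxl:
        # The paravirt.h definition of USERGS_SYSRET64 is used. Both the
        # out-of-line jump target and the runtime-patched bytes are only
        # swapgs; sysretq, so the backported VERW is never reached.
        return ["swapgs", "sysretq"]
    # The irqflags.h definition is used, which the backport did update.
    return ["swapgs", "verw", "sysretq"]


def mds_buffers_cleared(paravirt_xxl: bool, fixed: bool = False) -> bool:
    """True if the modelled exit path executes the buffer-clearing VERW."""
    return "verw" in sysret_exit_sequence(paravirt_xxl, fixed)
```

In this model, `mds_buffers_cleared(paravirt_xxl=True)` is the only combination that comes back false, which mirrors why the bad backport looked correct in any review based on diff context alone.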

The bad backport was proposed for both the upstream 5.4 and 5.10 LTS kernels. 5.4 ultimately received no backport at all, while 5.10 merged the bad backport in April, in the 5.10.215 kernel. Due to my report, it was fixed in 5.10.218 toward the end of May.
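
For upstream 5.10, the affected window can be expressed as a small check. This is a minimal sketch: the helper names are hypothetical, it only understands plain upstream "5.10.y" strings (not distribution package versions), and a kernel inside the window is only vulnerable when built with CONFIG_PARAVIRT_XXL.

```python
from typing import Optional

FIRST_AFFECTED = 215  # 5.10.215 merged the bad backport
FIRST_FIXED = 218     # 5.10.218 carries the fix


def patchlevel_5_10(version: str) -> Optional[int]:
    """Return y for an upstream '5.10.y' version string, otherwise None."""
    parts = version.split(".")
    if len(parts) != 3 or parts[:2] != ["5", "10"] or not parts[2].isdigit():
        return None
    return int(parts[2])


def upstream_5_10_in_affected_window(version: str) -> bool:
    """True for upstream 5.10.215 through 5.10.217 inclusive."""
    y = patchlevel_5_10(version)
    return y is not None and FIRST_AFFECTED <= y < FIRST_FIXED
```

The half-open range reflects that 5.10.218 itself contains the fix.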

Downstream, Debian Bullseye shipped the affected kernel in an affected configuration because it follows the upstream 5.10 LTS releases. SUSE's 5.3.18 kernel in SLE15 SP2/SP3 - LTSS was also affected: for SP2, from 5.3.18-150200.24.183.1 on March 18, 2024 until being fixed in 5.3.18-150200.24.194.1 on June 24, 2024, and for SP3, from 5.3.18-150300.59.153.2 on March 13, 2024 until being fixed in 5.3.18-150300.59.164.1 on June 24, 2024.

The Report

The report, reproduced verbatim below, was sent to Debian on May 21st, 2024 and was confirmed within a few hours, along with a request to bring it directly to the LTS maintainers. I agreed and mentioned that it was fine to send the full writeup, and that there was no embargo or any other restriction on the information imposed from my end. That was acknowledged and the issue was then reported upstream, where it was also acknowledged that same day. Up to this point, the turnaround was incredibly quick: only 20 hours from the initial report to an affected downstream distribution to upstream acknowledgement. For the record, Debian did everything expected of a downstream here and was great to work with.

Subject: 5.10 Bullseye kernel eliminates MDS mitigations in 64-bit syscall path with bad upstream LTS backport

Hi,

I am providing this vulnerability report directly to you as a courtesy
for the work Debian does for everyone.  If you would like to involve
upstream LTS maintainers, please do so on your own without my
involvement.

Backstory:
Back in February, we performed our own backports of the revamped MDS
mitigation patch series which moved the flushing VERW instruction to
just before return to userland, from its previous location in a C
function occurring much earlier.

After having performed these backports, we noticed upstream LTS
backports were prepared (I guess during some embargo period) to some
of the same kernels we backported to (I did 5.4, 5.15, 6.1, and 6.6).
In doing the backport to 5.4 in particular, I found a subtle
difference in the code that I accommodated for, and so I wanted to
check if the upstream backports spotted the same difference.  It did
not, which I mentioned here:
https://x.com/spendergrsec/status/1762181924561232170
but at the time, they weren't actually in any upstream LTS.  Our 5.4
LTS was technically EOL at the end of last year, but we continued
providing backports as a courtesy to the end of March.

By the end of March, I didn't see any upstream attempts to apply the
bad backport to 5.4 (as I would have noticed the conflict), and since
we dropped all interest in it then, never returned back to the topic
until I remembered it this week.  Hence this mail.

Affected Debian kernel versions:
5.10.215 and newer (of the 5.10 kernel only), upstream this is April
13th and newer.

Vulnerability Details:
Before each return to userland, for MDS mitigation, there needs to be
some execution of the CLEAR_CPU_BUFFERS macro.  You can see this in
numerous paths, even in the 32-bit compat syscall path in
arch/x86/entry/entry_64_compat.S.  In this case, the vulnerable path
is in entry_SYSCALL_64 present in arch/x86/entry/entry_64.S,
specifically via the USERGS_SYSRET64 present at the end when returning
via sysret instead of iret.  This is the subtle part I mentioned
earlier.  To see why, we need to look at the definition of
USERGS_SYSRET64 in 5.10:

#define USERGS_SYSRET64                         \
        swapgs;                                 \
        CLEAR_CPU_BUFFERS;                      \
        sysretq;

The above appears in arch/x86/include/asm/irqflags.h.  At first
glance, this looks fine, the CLEAR_CPU_BUFFERS is present (and would
have looked fine to anyone performing review of the patches based on
diff context only).  But when I looked at the full context when
performing my backports, you see that that definition is only used
under the following circumstance:

#ifdef CONFIG_PARAVIRT_XXL
#include <asm/paravirt.h>
#else
.. lots of confusing ifdef nesting here, followed by the
USERGS_SYSRET64 definition

So under CONFIG_PARAVIRT_XXL being enabled (which is selected when
CONFIG_XEN_PV is enabled, as in Debian's kernel configuration for this
kernel), this definition isn't used at all!  What's used instead then?
 The answer is in arch/x86/include/asm/paravirt.h as shown:

#ifdef CONFIG_X86_64
#ifdef CONFIG_PARAVIRT_XXL
#define USERGS_SYSRET64                                                 \
        PARA_SITE(PARA_PATCH(PV_CPU_usergs_sysret64),                   \
                  ANNOTATE_RETPOLINE_SAFE;                              \
                  jmp PARA_INDIRECT(pv_ops+PV_CPU_usergs_sysret64);)

It becomes a paravirt patching call site, which means two things:
prior to paravirt patching being applied, it operates as the indirect
jmp seen above, with the target being the following in
arch/x86/entry/entry_64.S:

#ifdef CONFIG_PARAVIRT_XXL
SYM_CODE_START(native_usergs_sysret64)
        UNWIND_HINT_EMPTY
        swapgs
        sysretq
SYM_CODE_END(native_usergs_sysret64)
#endif /* CONFIG_PARAVIRT_XXL */

Note, no CLEAR_CPU_BUFFERS -- this alone wouldn't be a vulnerability
though as the paravirt patching happens early in boot.  But what code
gets executed after paravirt patching is applied?
That can be found in arch/x86/kernel/paravirt_patch.c:

        const unsigned char     cpu_usergs_sysret64[6];
...
        .cpu_usergs_sysret64    = { 0x0f, 0x01, 0xf8,
                                    0x48, 0x0f, 0x07 }, // swapgs; sysretq
...
        PATCH_CASE(cpu, usergs_sysret64, xxl, insn_buff, len);

If you're not familiar, in a similar way as alternatives (it even uses
the alternatives infrastructure in more recent kernels), paravirt
patching works by replacing the indirect call/jmp to an out-of-line
function with much faster inlined code, here the same swapgs/sysretq
sequence seen in the out of line function.  Note, no execution of any
VERW instruction at all, which you could confirm at runtime by dumping
the running image.  Without the VERW-based clearing, you have no
mitigation of MDS on the 64-bit syscall path, probably the most
important place to have it, rendering the whole MDS mitigation
exercise rather pointless, meaning: leaks of kernel pointers/data,
KASLR break, etc.

Fix:
Easiest would likely be to do what we did in grsecurity, which is to
backport the following commits (in order), bringing it up to the state
of newer kernels without this quirk and thus the resulting
vulnerability:

x86/pv: Switch SWAPGS to ALTERNATIVE
x86/xen: Drop USERGS_SYSRET64 paravirt call

and then re-do the application of:
x86/entry_64: Add VERW just before userspace transition
ensuring that the sysret path then looks like:

        swapgs
        CLEAR_CPU_BUFFERS
        sysretq
 SYM_CODE_END(entry_SYSCALL_64)

Credits:
Brad Spengler of Open Source Security, Inc.

Thanks,
-Brad

The Fix

The fix, merged into the 5.10 LTS in 5.10.218 a quick 3 days after the report, is shown below. The issue was fixed in the way I recommended in my report. However, the choice to do it in a single commit, reusing a cherry-pick of the USERGS_SYSRET64 macro removal from 2021, had an extensive impact on the information the Linux CNA automatically generated for the CVE, as we'll discuss in much more detail later.

author		Juergen Gross <jgross@suse.com>	2021-01-20 14:55:45 +0100
committer	Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:19:05 +0200

x86/xen: Drop USERGS_SYSRET64 paravirt call
commit afd30525a659ac0ae0904f0cb4a2ca75522c3123 upstream.

USERGS_SYSRET64 is used to return from a syscall via SYSRET, but
a Xen PV guest will nevertheless use the IRET hypercall, as there
is no sysret PV hypercall defined.

So instead of testing all the prerequisites for doing a sysret and
then mangling the stack for Xen PV again for doing an iret just use
the iret exit from the beginning.

This can easily be done via an ALTERNATIVE like it is done for the
sysenter compat case already.

It should be noted that this drops the optimization in Xen for not
restoring a few registers when returning to user mode, but it seems
as if the saved instructions in the kernel more than compensate for
this drop (a kernel build in a Xen PV guest was slightly faster with
this patch applied).

While at it remove the stale sysret32 remnants.

  [ pawan: Brad Spengler and Salvatore Bonaccorso <carnil@debian.org>
	   reported a problem with the 5.10 backport commit edc702b4a820
	   ("x86/entry_64: Add VERW just before userspace transition").

	   When CONFIG_PARAVIRT_XXL=y, CLEAR_CPU_BUFFERS is not executed in
	   syscall_return_via_sysret path as USERGS_SYSRET64 is runtime
	   patched to:

	.cpu_usergs_sysret64    = { 0x0f, 0x01, 0xf8,
				    0x48, 0x0f, 0x07 }, // swapgs; sysretq

	   which is missing CLEAR_CPU_BUFFERS. It turns out dropping
	   USERGS_SYSRET64 simplifies the code, allowing CLEAR_CPU_BUFFERS
	   to be explicitly added to syscall_return_via_sysret path. Below
	   is with CONFIG_PARAVIRT_XXL=y and this patch applied:

	   syscall_return_via_sysret:
	   ...
	   <+342>:   swapgs
	   <+345>:   xchg   %ax,%ax
	   <+347>:   verw   -0x1a2(%rip)  <------
	   <+354>:   sysretq
  ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lkml.kernel.org/r/20210120135555.32594-6-jgross@suse.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Note that in the commit message above for the fix, the only new information relevant to the fixing of the vulnerability appears within the brackets at the end.
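
The byte sequences involved are short enough to check by hand. The sketch below scans for the VERW encoding (opcode 0F 00 with a ModRM reg field of 5). It is a naive byte scan rather than a disassembler, so it could misfire on arbitrary code; the "fixed" bytes are an assumption, hand-assembled from the disassembly in the bracketed note above rather than dumped from a running kernel.

```python
def contains_verw(code: bytes) -> bool:
    """Naive scan for VERW: opcode 0F 00 followed by a ModRM byte with reg == 5."""
    for i in range(len(code) - 2):
        if code[i] == 0x0F and code[i + 1] == 0x00 and (code[i + 2] >> 3) & 0x7 == 5:
            return True
    return False


# Bytes quoted from arch/x86/kernel/paravirt_patch.c: swapgs; sysretq
PATCHED_BEFORE_FIX = bytes([0x0F, 0x01, 0xF8, 0x48, 0x0F, 0x07])

# Assumption: hand-assembled from the post-fix disassembly in the commit note.
SEQUENCE_AFTER_FIX = bytes([
    0x0F, 0x01, 0xF8,                          # swapgs
    0x66, 0x90,                                # xchg %ax,%ax
    0x0F, 0x00, 0x2D, 0x5E, 0xFE, 0xFF, 0xFF,  # verw -0x1a2(%rip)
    0x48, 0x0F, 0x07,                          # sysretq
])

print(contains_verw(PATCHED_BEFORE_FIX))  # False
print(contains_verw(SEQUENCE_AFTER_FIX))  # True
```

The instruction lengths in the reconstructed sequence (3, 2, 7, and 3 bytes) line up with the offsets +342, +345, +347, and +354 shown in the commit note.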

The CVE

The Linux CNA announced the CVE yesterday, on June 25, 2024, 34 days after receiving a full, detailed, unrestricted report of the vulnerability. It is available to view here. Several issues are immediately apparent. The vulnerability was introduced in 2024 (via this commit) and discovered in 2024 (here, as mentioned in my report), but was incorrectly assigned a 2021 CVE year. This makes it incorrectly appear to be one of the many backfilled CVEs the Linux CNA has been generating from its earlier GSD dataset. The CVE description is merely a copy and paste of the commit message of the 5.10 LTS fix for the vulnerability. Rather than use the correct commit ID of the change that introduced the vulnerability, the CVE announcement and metadata reference a seemingly-random, unrelated commit from the first affected 5.10 LTS version. Finally, due to the way in which the fix was created, as mentioned in the previous section, the short description of the CVE became the headline of an innocent cleanup commit: "x86/xen: Drop USERGS_SYSRET64 paravirt call".

If one were providing summary information of assigned CVEs by the new Linux CNA, or performing a mere cursory analysis, it would be easy to either overlook these numerous errors or simply ascribe them to one-off mistakes. But with some knowledge of how the Linux CNA operates, all of the issues described above become inevitable consequences of a CNA that produces CVEs with unreliable information near-indistinguishable from full automation, even when outside researchers and downstream distributions make considerable effort to provide accurate, useful information.

The CNA

The subject of the Linux CNA is one that deserves its own blog post, as the controversy surrounding it is still developing. The CNA has existed for a little over four months and in that time has assigned over 2000 CVEs, at a faster rate than any other CNA in existence. The harm it has single-handedly caused to the CVE ecosystem hasn't yet been fully appreciated by the public. Awareness is mostly limited to security teams of downstream distributions, vulnerability management companies, and end users who recently noticed that their previously informative distribution security advisories were replaced with auto-generated lists of hundreds of CVEs with minimal user-understandable or actionable information. For the purposes of this blog, however, we'll focus on the aspects relevant to the errors present in the CVE under discussion.

The first error, the assignment of a 2021 year to a CVE describing a 2024 vulnerability, came about from the pervasive use of automation by the Linux CNA. As a reminder, according to CVE rules, the year assigned to a CVE should be the year in which the vulnerability was publicly discovered or the year in which the ID was reserved by the CNA (due to either an internal or a private report). In this case, despite the report and fix commit clearly identifying the vulnerability-introducing change, the authorship year of a years-older cherry-picked cleanup commit involved in the fix was used instead. This ties into another key aspect of how the new Linux CNA operates: its CVEs do not identify vulnerabilities; they identify fixes. Without this change to the CVE definition, the mass automation essential to the existence of the Linux CNA would not be possible.
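
The mismatch is easy to state mechanically. The following is a minimal sketch with hypothetical helper and constant names; the two dates come from the fix commit's author line and from the report reproduced earlier.

```python
import re
from datetime import date


def cve_year(cve_id: str) -> int:
    """Extract the year component from an ID of the form CVE-YYYY-NNNN."""
    match = re.fullmatch(r"CVE-(\d{4})-(\d{4,})", cve_id)
    if match is None:
        raise ValueError(f"not a CVE ID: {cve_id!r}")
    return int(match.group(1))


CLEANUP_COMMIT_AUTHORED = date(2021, 1, 20)  # author date of the cherry-picked commit
REPORTED_TO_DEBIAN = date(2024, 5, 21)       # date the report was sent

assigned = cve_year("CVE-2021-4440")
print(assigned == CLEANUP_COMMIT_AUTHORED.year)  # True
print(assigned == REPORTED_TO_DEBIAN.year)       # False
```

The assigned year matches the authorship year of the reused cleanup commit, not the year the vulnerability was introduced or reported.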

The second error, claiming the vulnerability was introduced in a seemingly-random backport to the 5.10.215 LTS kernel, must have involved some manual intervention, given that the correct commit ID mentioned in the fix was not used. The "seemingly-random" commit was actually the final merged commit of the first-affected 5.10.215 LTS kernel, just before the commit that updated the version in the main Makefile. To the previous point, not only does the automation require that a CVE identify a fix rather than a vulnerability, it requires that it be a single fix only. If a single vulnerability needs multiple commits to address, one CVE would be assigned for each fix whose commit message happened to contain easily greppable phrases like "buffer overflow" for the CNA's automation to pick up. Any dependency between the fix commit and other commits would not be recorded, again due to the CNA's automation.

Here, we believe that someone at the Linux CNA manually stepped in, because of the way the fix was created and because it lacked the "Fixes:" tag that the CNA's automation relies on to provide information on affected versions via its 731-line Bash script named "bippy". Were this not an externally reported issue, that manual step likely would not have happened at all, leaving only a reference to a fix commit without any other information about affected versions (this is not uncommon, and can be seen easily in random examples like this). At this line in the script, it can be seen that a vulnerable kernel version cannot be provided without an associated commit ID. Unfortunately, as happened here, a completely unrelated commit was blamed in a way that can't later be told apart from a valid case where the last commit in a release actually did introduce a vulnerability.
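
To illustrate why a missing tag leaves tag-driven tooling with nothing to work from, here is a minimal sketch of "Fixes:" extraction. It is not the logic of the bippy script, which is not reproduced in this post; the regular expression, the function name, and the second sample message are assumptions made for the example.

```python
import re
from typing import Optional

# A "Fixes:" tag sits at the start of a line and is followed by a commit hash.
FIXES_TAG = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})\b", re.MULTILINE)


def fixes_target(commit_message: str) -> Optional[str]:
    """Return the commit ID named by a Fixes: tag, or None if there is no tag."""
    match = FIXES_TAG.search(commit_message)
    return match.group(1) if match else None


# Abridged from the real 5.10.218 fix: the bad backport is named only in prose.
actual_fix = (
    "x86/xen: Drop USERGS_SYSRET64 paravirt call\n"
    "\n"
    "  [ pawan: reported a problem with the 5.10 backport commit edc702b4a820\n"
    '    ("x86/entry_64: Add VERW just before userspace transition"). ]\n'
    "\n"
    "Signed-off-by: Juergen Gross <jgross@suse.com>\n"
)

# Hypothetical: the same fix had it carried an explicit tag.
tagged_fix = actual_fix + (
    'Fixes: edc702b4a820 ("x86/entry_64: Add VERW just before userspace transition")\n'
)

print(fixes_target(actual_fix))  # None
print(fixes_target(tagged_fix))  # edc702b4a820
```

The vulnerability-introducing commit is right there in the message for a human reader, but a tool keyed only on the tag line finds nothing.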

The CVE's short headline, taken from the unrelated cleanup commit "x86/xen: Drop USERGS_SYSRET64 paravirt call", falsely suggests that this issue required Xen to exploit. In fact it only required the CONFIG_PARAVIRT_XXL option, commonly enabled by distributions like Debian that build in paravirtualized Xen support (even if unused by the end user). Both this headline and the copy-and-pasted CVE description from the upstream LTS fix commit are artifacts of the automation employed by the Linux CNA. In the previous CVE system, a succinct summary of the vulnerability involved would have been provided, rather than information about an unrelated commit that simply made fixing the actual vulnerability easier. Keep in mind that the short headline descriptions of vulnerabilities are in most cases the only information end users will scan through to determine relevance to their systems.

It's especially egregious that the CNA's desire for all CVEs to fall in line with its automated processes overrides any consideration of providing users with additional information that is unrestricted and already known to the CNA. Here, impact information understandable by users, and not just by kernel developers with deep knowledge of x86 internals and the APIs used in specific mitigations, could easily have been provided. Yet rather than accepting that responsibility, the same responsibility that existed among the group that issued Linux kernel CVEs prior to the Linux CNA move, the CNA outsources the task of determining that information to all downstream consumers.

Though the fix was swift, the issuance of the CVE for the vulnerability reported to the Linux CNA was not: it came one month to the day after the fix was published. In the previous CVE system, a useful metric was the time between when a CVE was issued and when the vulnerability was actually resolved. In this new system, where CVEs are for fixes instead of actual vulnerabilities, that metric disappears, and any centrally trackable evidence about who knew what and when disappears with it. As noted in this post, CVEs being issued long after stable releases containing the fixes is how the Linux CNA's process works.
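
The intervals quoted in this post can be cross-checked with simple date arithmetic. In this sketch the fix date is the committer date from the fix commit and the CVE date is the announcement date. The date the report reached the CNA is not stated directly, so it is derived here from the "34 days earlier" figure and should be read as an inference.

```python
from datetime import date, timedelta

FIX_COMMITTED = date(2024, 5, 25)   # committer date of the 5.10 LTS fix
CVE_ANNOUNCED = date(2024, 6, 25)   # date the Linux CNA announced the CVE

# Inferred: the CNA had the report 34 days before the announcement.
report_received = CVE_ANNOUNCED - timedelta(days=34)

fix_to_cve_days = (CVE_ANNOUNCED - FIX_COMMITTED).days
report_to_fix_days = (FIX_COMMITTED - report_received).days

print(report_received)     # 2024-05-22
print(report_to_fix_days)  # 3
print(fix_to_cve_days)     # 31
```

The derived figures agree with the "quick 3 days" for the fix and the one-month gap before the CVE appeared.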

The Recommendation

Much of the duplicative effort could have been avoided had I initially mailed the report to the linux-distros mailing list, or simply posted the advisory publicly to the oss-sec mailing list. That way, potentially affected distributions would have been notified to check if they were shipping a vulnerable version or an instance of the vulnerable patch. In the linux-distros case, this importantly would have also automatically resulted in a later mail to the public oss-sec mailing list, which would have shared the original report verbatim. This is in contrast to reports sent to security@kernel.org, whose goal is only to create and apply an upstream fix. Any additional vulnerability information shared, even with explicit notice that there is no embargo or any other restrictions on the information, does not make it into CVE announcements for the benefit of users. We strongly encourage other security researchers to keep this in mind when reporting vulnerabilities against the Linux kernel in the future.

We share the belief of many others serious about security that the Linux CNA's too-automated approach and dubious policies are a net negative for the CVE ecosystem, as illustrated in our case study above. Unfortunately, there doesn't appear to be any path to resolving the problem any time soon as long as users and CVE consumers remain apathetic.