grsecurity - The Complicated History of a Simple Linux Kernel API

Luo Likang of NSFOCUS posted to the oss-sec mailing list about what he was told was a non-security bug in the Linux kernel's Common Field replaceable unit Access Macro (CFAM) device driver. While the bug itself is uninteresting (it would at most be a privileged DoS on some PowerPC systems under a non-standard config), this blog will discuss the nuance and history of bad sizes passed to the kernel's copy_*_user() API based on our two decades of kernel development experience. The topic seems simple at first glance, but we guarantee readers of any kernel development skill level will learn something new here.

The Linux kernel is a constantly evolving codebase with well over 30 million lines of code as of 2021. As users generally don't run the latest upstream Linux kernel, however, it's also a very fragmented codebase from a user perspective. This has important ramifications for security: it's not necessarily the case that someone can say (as apparently happened in this case) that an API like copy_from_user() has certain checks, and that those checks can prevent triggering a vulnerability. I specifically addressed this issue in my "10 Years of Linux Security - A Report Card" presentation from last year. Consistent security properties (across kernel versions, processor architectures, compiler versions) are an important goal of ours in grsecurity, but with upstream often not backporting added preventative security checks or functionality to older supported kernels, and with the state of those measures not being communicated to end-users, one would virtually have to be an experienced security-focused kernel developer to determine the status for their own systems.

The kinds of bugs affected by checks in copy_*_user have appeared several times in the past; see, for instance, this fix from Mathias Krause in 2013. When discussing these checks, it makes most sense to start with the current upstream state, after which we can discuss the history of the code and how it evolved (and even devolved) over time.

Current upstream copy_*_user() checks

In the current upstream kernel, copy_from_user (used for copying data from userland to the kernel) and copy_to_user (used for copying data from the kernel to userland) are implemented as follows:

static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
        if (likely(check_copy_size(to, n, false)))
                n = _copy_from_user(to, from, n);
        return n;
}

static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
        if (likely(check_copy_size(from, n, true)))
                n = _copy_to_user(to, from, n);
        return n;
}

Provided that check_copy_size() passes, a lower-level copy_*_user routine is called. These lower-level routines may or may not be inlined depending on processor architecture, but their implementation remains the same:

unsigned long _copy_from_user(void *to, const void __user *from, unsigned long n)
{
        unsigned long res = n;
        might_fault();
        if (!should_fail_usercopy() && likely(access_ok(from, n))) {
                instrument_copy_from_user(to, from, n);
                res = raw_copy_from_user(to, from, n);
        }
        if (unlikely(res))
                memset(to + (n - res), 0, res);
        return res;
}

unsigned long _copy_to_user(void __user *to, const void *from, unsigned long n)
{
        might_fault();
        if (should_fail_usercopy())
                return n;
        if (likely(access_ok(to, n))) {
                instrument_copy_to_user(to, from, n);
                n = raw_copy_to_user(to, from, n);
        }
        return n;
}

access_ok() is an ancient routine whose purpose is to validate whether the provided range in userland (as the source for copy_from_user, the destination for copy_to_user) actually resides in userland. This API changed recently with the removal of addr_limit from the thread_info structure and the splitting out of kernel-to-kernel copies into its own API. Previously, code running after a set_fs(KERNEL_DS) prior to addr_limit restoration or a maliciously-modified addr_limit (a common exploit technique in years past) could (ab)use copy_*_user for arbitrary kernel read/write.

should_fail_usercopy() is related to fault-injection (a fuzzing enhancement) and can be ignored. instrument_* can likewise be ignored, as it's related to KASAN (a debugging/fuzzing enhancement). For production security purposes, the only code we're interested in is access_ok() and check_copy_size(). The latter is reproduced below:

static __always_inline __must_check bool
check_copy_size(const void *addr, size_t bytes, bool is_source)
{
        int sz = __compiletime_object_size(addr);
        if (unlikely(sz >= 0 && sz < bytes)) {
                if (!__builtin_constant_p(bytes))
                        copy_overflow(sz, bytes);
                else if (is_source)
                        __bad_copy_from();
                else
                        __bad_copy_to();
                return false;
        }
        if (WARN_ON_ONCE(bytes > INT_MAX))
                return false;
        check_object_size(addr, bytes, is_source);
        return true;
}

To explain what is shown here: __compiletime_object_size() is a macro that makes use of the compiler's __builtin_object_size(). If the size of the kernel object used as the source or destination of the copy (as appropriate) can be determined at compile time, it returns the size of that object. Otherwise, including when the compiler version doesn't support __builtin_object_size(), it returns -1.
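
As a rough user-space illustration of that behavior (not kernel code), the sketch below calls __builtin_object_size() directly on GCC or Clang. The macro name compiletime_object_size and both function names are hypothetical stand-ins of ours; the volatile load is only a trick to hide a pointer's origin from the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel macro; mode 0 asks for the whole object. */
#define compiletime_object_size(p) __builtin_object_size((p), 0)

/* The pointer is the address of a local, so the compiler knows the size. */
size_t size_of_known_object(void)
{
        uint32_t data = 0;

        (void)data;
        return compiletime_object_size(&data);
}

/* The pointer comes out of a volatile load, so its origin is not visible. */
size_t size_of_opaque_pointer(void)
{
        static char buf[16];
        void *volatile hidden = buf;
        void *opaque = hidden;

        return compiletime_object_size(opaque);
}
```

The first function yields sizeof(uint32_t); the second yields (size_t)-1, which becomes -1 once stored in an int as check_copy_size() does with sz.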

__bad_copy_from() and __bad_copy_to() are both compile-time errors issued when both the size of the kernel object is statically known and the length to copy is also constant, a case which is unlikely to be a useful security issue in practice unless the code was never tested/used at all. copy_overflow() is a runtime warning for when the object size is statically known, but the copy length isn't a compile-time constant.

The "WARN_ON_ONCE(bytes > INT_MAX)" check will produce a runtime warning if the copy length is greater than INT_MAX. Since this is computed on an unsigned size_t type, this has the added effect of rejecting negative lengths when interpreted on signed int or long types (i.e. the case of the bug reported on oss-sec). More on this check later.
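
To make the signed-to-unsigned point concrete, here is a small user-space sketch. Both helper names are hypothetical; only the comparison against INT_MAX is taken from the check above.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper mirroring only the "bytes > INT_MAX" test. */
bool length_is_acceptable(size_t bytes)
{
        return bytes <= (size_t)INT_MAX;
}

/* A signed length computed by a buggy caller, then converted to size_t. */
bool signed_length_is_acceptable(long len)
{
        return length_is_acceptable((size_t)len);
}
```

A length of -1 converts to the largest size_t value and is rejected, as is any other negative int or long, while small positive lengths pass.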

check_object_size() comes from upstream's limited version of our USERCOPY feature. No check is made on its return value because rather than simply fail the copy as the other checks do, this one performs a BUG() in usercopy_abort() which in a simple case will simply terminate the process involved, but in more complex scenarios like having a mutex held around a userland copy, could result in lockups of some code paths, or in a panic_on_oops scenario, would crash the system. More on this later as well.

With regard to the practical impact of the bug discussed on the oss-sec report, as the bad CFAM change was introduced in "fsi: Add cfam char devices" in June 2018 for Linux 4.19, the cfam_write() case would enter copy_from_user() with a negative length, reach check_copy_size() where __compiletime_object_size() would return 4 for the __be32 data variable being copied into, and then as bytes was not a compile-time constant, would call copy_overflow(), triggering the WARN() contained therein and aborting the copy operation with an error.
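
The order in which those checks fire can be modeled in user space as follows. This is a simplified sketch under our own naming (mock_check_copy_size and the mock_verdict enum are hypothetical), and it leaves out check_object_size() entirely.

```c
#include <limits.h>
#include <stddef.h>

enum mock_verdict { MOCK_OK, MOCK_OBJECT_TOO_SMALL, MOCK_LENGTH_TOO_LARGE };

/* sz: compile-time object size, or -1 when unknown (as in check_copy_size()). */
enum mock_verdict mock_check_copy_size(int sz, size_t bytes)
{
        if (sz >= 0 && (size_t)sz < bytes)
                return MOCK_OBJECT_TOO_SMALL;   /* copy_overflow() territory */
        if (bytes > (size_t)INT_MAX)
                return MOCK_LENGTH_TOO_LARGE;   /* the WARN_ON_ONCE() check */
        return MOCK_OK;
}
```

With sz = 4 and a negative length converted to size_t, the object-size test fires first; the INT_MAX test is only reached when the object size is unknown.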

For purposes of brevity, we'll limit our analysis to these aspects without getting into the weeds of the implementation of raw_copy_*_user() itself, which has seen architecture-specific evolution of its own with the introduction of SMAP/PAN/etc and the discovery of Spectre.

One final item to note before we proceed is the memset() present only in _copy_from_user(). This is an ancient and documented aspect of the API, described in a kernel comment as follows:

 * NOTE: only copy_from_user() zero-pads the destination in case of short copy.
 * Neither __copy_from_user() nor __copy_from_user_inatomic() zero anything
 * at all; their callers absolutely must check the return value.

Note that __copy_*_user (two underscores) is different from _copy_*_user (one underscore). The reason for this memset() was presumably to address cases early in Linux's history prior to copy_*_user being marked with a __must_check attribute, requiring that callers check the return value for error. Consider the common case where a structure is copied from userland, some fields are changed by the kernel, and then the structure is copied back out to userland. If areas written neither by the copy from userland nor by the kernel setting fields are copied back out to userland, being uninitialized, they could leak previous memory contents to userland (an information leak). Userland can also trivially force a partial failure in the lower-level copy routine by making part of the copy range include unmapped addresses or invalid permissions for the operation.
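
The effect of the zero-padding can be sketched with a user-space model. Everything here is hypothetical: mock_raw_copy pretends to fault after a caller-chosen number of bytes, and mock_copy_zero_pad applies the same tail memset() as _copy_from_user().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of a raw copy that "faults" after fault_after bytes.
 * Returns the number of bytes NOT copied, like the kernel routines do.
 */
size_t mock_raw_copy(void *to, const void *from, size_t n, size_t fault_after)
{
        size_t done = n < fault_after ? n : fault_after;

        memcpy(to, from, done);
        return n - done;
}

/* Models the zero-padding wrapper: the uncopied tail is cleared. */
size_t mock_copy_zero_pad(void *to, const void *from, size_t n, size_t fault_after)
{
        size_t res = mock_raw_copy(to, from, n, fault_after);

        if (res)
                memset((char *)to + (n - res), 0, res);
        return res;
}
```

After a short copy through mock_raw_copy() the destination still holds stale bytes past the fault point; through mock_copy_zero_pad() that tail is zeroed, so a caller that ignores the return value has nothing stale to leak.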

History of check_object_size()

This check first appeared via commit "mm: Hardened usercopy" in June of 2016 for the 4.8 version of Linux. Its commit message noted:

    This is the start of porting PAX_USERCOPY into the mainline kernel. This
    is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
    work is based on code by PaX Team and Brad Spengler, and an earlier port
    from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.

check_object_size() works by referencing heap (and other) metadata in order to validate at runtime that the copy operation occurs within the bounds of a single object, whenever possible.

Though the commit message above mentioned being the start of porting the full functionality, the feature saw no further major changes in the 5 years since outside the disabling of the mentioned added page tests (not present in grsecurity) which broke several areas of the kernel. PAX_USERCOPY was initially published to the world in 2009, some 7 years prior to the limited upstream version. The hardened usercopy code was not backported to earlier releases, including active LTS releases at the time. Therefore, the upstream 4.4 XLTS, on its 6th and final year of claimed support this year, does not contain this check.

History of __compiletime_object_size()/__builtin_object_size()

__builtin_object_size() was first used in the kernel by an experimental FORTIFY_SOURCE patch written by Arjan van de Ven in 2005. It initially focused on the typical str* and mem* APIs that are covered by FORTIFY_SOURCE in userland as well. In 2009, when investigating the true coverage of FORTIFY_SOURCE in practice in the kernel, I extended Arjan's work to perform checks on more functions and increased its compile-time knowledge of some [k|v]malloc'd object sizes, finding that it only instrumented ~30% of instances of covered APIs, generally in non-complex cases unlikely to be interesting from a security perspective.

Coverage of these str* and mem* APIs came in upstream via commit "include/linux/string.h: add the option of fortified string.h functions" in July 2017 for the 4.13 kernel, with no mention of that earlier work, or any use of the improved knowledge of the size of some dynamically-allocated objects, reducing its effective coverage below the 30% of my initial investigations.

Use of __builtin_object_size() for copy_*_user was added upstream first for x86 via commit "x86: Use __builtin_object_size() to validate the buffer size for copy_from_user()" by Arjan van de Ven in 2009. This initial version for Linux 2.6.34 covered only copy_from_user, with copy_to_user being covered only starting from October 2013 with Jan Beulich's "x86: Unify copy_to_user() and add size checking to it" commit for Linux 3.13. It saw a number of refactorings, eventually resulting in an arch-independent variant via commit "generic ...copy_..._user primitives" by Al Viro in March 2017 for the 4.12 version of Linux.

As we mentioned before, __builtin_object_size() is provided by the compiler. In April of 2013, in response to compile-time errors produced by the kernel's existing use of the builtin on certain GCC versions, "gcc4: disable __compiletime_object_size for GCC 4.6+" was merged by Guenter Roeck, rendering the whole exercise useless for affected compiler versions (at the time, the newest-released GCC version was 4.8.0). Jan's commit from later that year referenced above pointed to problems with this change, saying:

    I'd like to point out though that with
    __compiletime_object_size() being restricted to gcc before 4.6,
    the whole construct is going to become more and more pointless
    going forward. I would question however that commit
    2fb0815c9ee6b9ac50e15dd8360ec76d9fa46a2 ("gcc4: disable
    __compiletime_object_size for GCC 4.6+") was really necessary,
    and instead this should have been dealt with as is done here
    from the beginning.

yet it wasn't until August of 2016 with Josh Poimboeuf's Linux 4.8 commit: "mm/usercopy: get rid of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS" that the restriction on __builtin_object_size() use to GCC >= 4.1 and < 4.6 was changed to GCC >= 4.1. At the time of this commit, GCC 6.2 was the most recent compiler release. Josh's commit also eliminated the need for a debug option to be enabled in order to get the functionality.

The result of all the above was that there was a period of roughly three years where checks were present that, unbeknownst to all but perhaps a handful of kernel developers, were complete no-ops for a large number of users. It should also be noted that the 4.8 commit relaxing the version restriction on __builtin_object_size() was never backported to earlier kernels, meaning that even for today's latest 4.4 XLTS release, any checks involving this are complete no-ops for virtually any modern user.

You may be wondering how later use of Clang for building Linux plays into all this, given that Google started using Clang for kernel builds in late 2018 with the 4.4 kernel. Clang historically (and even in the latest versions) fakes a GCC version of 4.2.1, and so was unaffected by these changes.
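
The version gates described in this section can be sketched as plain functions. The function names are hypothetical, and the encoding (major * 10000 + minor * 100 + patchlevel) is our assumption of how the kernel's GCC_VERSION value is formed; the real gates were preprocessor conditionals, not runtime code.

```c
#include <stdbool.h>

/* Assumed encoding: major * 10000 + minor * 100 + patchlevel. */
int mock_gcc_version(int major, int minor, int patch)
{
        return major * 10000 + minor * 100 + patch;
}

/* Gate in effect from April 2013 until the 4.8 change: >= 4.1 and < 4.6. */
bool object_size_enabled_old_gate(int version)
{
        return version >= 40100 && version < 40600;
}

/* Gate after the 4.8 change: >= 4.1 only. */
bool object_size_enabled_new_gate(int version)
{
        return version >= 40100;
}
```

Under the old gate, a compiler reporting 4.2.1 (as Clang does) keeps the checks, while GCC 4.8.0 or 6.2.0 silently loses them; under the new gate all three keep them.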

History of WARN_ON_ONCE(bytes > INT_MAX)

The earliest appearance of this kind of check was for i386 by Linus Torvalds himself via (ironically) BUG_ON() in February 2005 for the 2.6.11 kernel. These checks were removed a little over two years later for version 2.6.22 of the kernel via Andi Kleen's commit "[PATCH] i386: Remove copy_*_user BUG_ONs for (size < 0)" with the erroneous explanation of "access_ok checks this case anyways, no need to check twice." The explanation is erroneous because while access_ok() does effectively check the same case, the BUG_ON() avoided execution of the subsequent memset(), which thereafter became possible to execute.

In grsecurity, this check was implemented safely (without a BUG_ON()) across virtually all architectures supported by Linux at the time in August of 2009 for the 2.6.29 version of Linux. Importantly, this check was performed prior to access_ok() and in the case of copy_from_user() avoided the vulnerable memset() that will be discussed later.

Upstream, this check was introduced by Kees Cook via "uaccess: disallow > INT_MAX copy sizes" in December of 2019 for Linux 5.5, a decade later, with no mention of that earlier published work. Since this was a trivial 2-line change, it was backported to Linux 5.4 a month later. Yet as the copy_*_user API had undergone considerable churn in the preceding years, the simple change wasn't backported any further and thus isn't present in the upstream 4.4, 4.9, 4.14, or 4.19 XLTS kernels still under claimed support today.

History of the copy_from_user memset

On x86, going back at least before 2002, zeroing on failed copy_from_user was only performed when access_ok() was successful. See for instance the implementation of __generic_copy_from_user (called when the size for copy_from_user wasn't a compile-time constant) here. Later, this changed, starting with this commit from Linux 2.4.3.4 and this change from the subsequent 2.4.35 release which added the zeroing when access_ok() failed.

A patch for x86 from Andrew Morton in June 2003 mentioned putting back the memset() under the access_ok() failure case that had gone missing in the previous year's churn.

The most interesting history to note here is that ARM as early as 2002 with Linux 2.4.4.4 had the following change:

diff --git a/include/asm-arm/uaccess.h b/include/asm-arm/uaccess.h
index f7aea9b38e771..8b5e076a94212 100644
--- a/include/asm-arm/uaccess.h
+++ b/include/asm-arm/uaccess.h
@@ -75,6 +75,8 @@ static __inline__ unsigned long copy_from_user(void *to, const void *from, unsig
 {
 	if (access_ok(VERIFY_READ, from, n))
 		__do_copy_from_user(to, from, n);
+	else /* security hole - plug it */
+		memzero(to, n);
 	return n;
 }

It's not clear whether the comment was referring to plugging the security hole described earlier or to introducing one (zeroing a buffer with a possibly attacker-controlled length); in either case, this change introduced a weakness. Consider the case where n = -1 due to some overflow in calculation (like in the oss-sec posting this blog references): access_ok() would fail here on 32-bit ARM, as n expressed as an unsigned value and added to any userland address would cover a range including kernel space. The else case would be triggered, causing a memzero() with a length of 0xffffffff, certainly causing an unrecoverable DoS to the system.
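
The arithmetic can be checked with a small user-space model. The names are hypothetical, and the userland limit of 0xC0000000 is an assumed value chosen purely for illustration; the real limit depends on the kernel configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 32-bit layout for illustration only: userland ends at 0xC0000000. */
#define MOCK_TASK_SIZE 0xC0000000u

/* Hypothetical range check in the spirit of access_ok() on a 32-bit machine. */
bool mock_access_ok(uint32_t addr, uint32_t size)
{
        uint64_t end = (uint64_t)addr + size;   /* widened so the sum cannot wrap */

        return end <= MOCK_TASK_SIZE;
}

/* Length the else-branch above would hand to memzero(), or 0 if it is not taken. */
uint32_t mock_zeroed_on_failure(uint32_t addr, uint32_t n)
{
        return mock_access_ok(addr, n) ? 0 : n;
}
```

For n = -1 converted to an unsigned 32-bit value, the range check fails for every starting address and the zeroing length comes out as 0xffffffff.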

Our public patches starting from 2009 poked a little fun at this comment, as seen in the below diff snippet:

diff -urNp linux-2.6.29.6/arch/arm/include/asm/uaccess.h linux-2.6.29.6/arch/arm/include/asm/uaccess.h
--- linux-2.6.29.6/arch/arm/include/asm/uaccess.h	2009-07-02 19:41:20.000000000 -0400
+++ linux-2.6.29.6/arch/arm/include/asm/uaccess.h	2009-07-30 17:59:25.590775992 -0400
@@ -400,7 +400,7 @@ static inline unsigned long __must_check
 {
 	if (access_ok(VERIFY_READ, from, n))
 		n = __copy_from_user(to, from, n);
-	else /* security hole - plug it */
+	else if ((long)n > 0) /* security hole - plug it -- good idea! */
 		memset(to, 0, n);
 	return n;
 }

The problem remained upstream even after the comment was removed and the code refactored via Al Viro's September 2016 commit "arm: don't zero in __copy_from_user_inatomic()/__copy_from_user()".

Since the upstream check added in 2019 to disallow copy sizes greater than INT_MAX was implemented in check_copy_size(), its failure avoids the call to _copy_from_user() that would perform the bad memset(), the same as the decade-earlier grsecurity change did. For unknown reasons, though, fixing this vulnerability wasn't mentioned at all in the commit message or in the mailing list discussion it referenced.
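
Why the placement of the check matters can be modeled as below. This is a deliberately simplified sketch with hypothetical names: the lower-level stand-in always takes its failure path and merely records the length it would have zeroed, rather than touching memory.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of what the lower-level routine was asked to do. */
struct mock_result {
        bool   reached_lower_level;  /* was the lower-level stand-in called? */
        size_t zero_len;             /* length its failure path would memset */
};

/* Stand-in for the lower level when the range check fails: zero n bytes. */
struct mock_result mock_lower_level_failing(size_t n)
{
        struct mock_result r = { true, n };

        return r;
}

/* Wrapper without any length check (kernels lacking the 2019 change). */
struct mock_result mock_copy_unchecked(size_t n)
{
        return mock_lower_level_failing(n);
}

/* Wrapper with the length check performed first. */
struct mock_result mock_copy_checked(size_t n)
{
        struct mock_result r = { false, 0 };

        if (n > (size_t)INT_MAX)
                return r;            /* rejected before any zeroing can happen */
        return mock_lower_level_failing(n);
}
```

With a length of (size_t)-1, the unchecked wrapper asks for the entire address space to be zeroed, while the checked wrapper never reaches the lower level at all.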

As mentioned above, since the upstream 2019 change was not backported to 4.4, 4.9, 4.14, or 4.19, bugs present in those versions of the kernel where negative lengths can be passed to copy_from_user() will likely result in massive memory corruption and a guaranteed system DoS (modulo some vmalloc'd stack or other vmalloc-based destination buffer). If someone were to take mitigation advice proposed by some and enable panic_on_warn, the upstream 2019 change would also cause a DoS via panic.

TL;DR

The Linux kernel is an incredibly fast-moving project with significant churn and fragmentation. It is often the case that statements about how certain APIs operate (even something as well-known and commonplace as copy_*_user()) need to take into consideration the kernel versions involved. As many proactive security changes are never backported to earlier supported kernel versions (examples of which we have provided here), and certain features can change without user-visible notice (e.g. the three-year __builtin_object_size change), it's virtually impossible for end-users to know their real risk based on statements about the latest kernel version.

We hope this post provided some useful historical insights and ideas for further improvements.