The Complicated History of a Simple Linux Kernel API
June 29, 2021
Luo Likang of NSFOCUS posted to the oss-sec mailing list
about what he was told was a non-security bug in the Linux kernel's Common
Field replaceable unit Access Macro (CFAM) device driver. While the bug itself is
uninteresting (it would at most be a privileged DoS on some PowerPC systems under a
non-standard config), this post discusses the nuance and history of bad sizes passed
to the kernel's copy_*_user() API, based on our two decades of kernel development experience.
The topic seems simple at first glance, but we guarantee readers of any kernel development skill
level will learn something new here.
The Linux kernel is a constantly evolving codebase with well over 30 million lines of
code as of 2021. As users generally don't run the latest upstream Linux kernel,
however, it's also a very fragmented codebase from a user perspective. This has
important ramifications for security: it's not necessarily the case that someone can
say (as apparently happened in this case) that an API like copy_from_user() has certain
checks, and that those checks can prevent triggering a vulnerability. I specifically
addressed this issue in my "10 Years of Linux Security - A Report Card" presentation
from last year. Consistency of security properties (across kernel versions, processor architectures, compiler
versions) is an important goal of ours in grsecurity, but with upstream often not
backporting added preventative security checks or functionality to older supported
kernels, and with the state of those measures not being communicated to end-users, one
would virtually have to be an experienced security-focused kernel developer to determine
the status for their own systems.
The kinds of bugs affected by checks in copy_*_user have appeared several times in the past;
see for instance this fix from Mathias Krause in 2013.
When discussing these checks, it would make most sense to talk about the current
upstream state, after which we can discuss the history of the code and how it evolved
(and even devolved) over time.
Current upstream copy_*_user() checks
In the current upstream kernel, copy_from_user (used for copying data from userland to
the kernel) and copy_to_user (used for copying data from the kernel to userland) are
implemented as follows:
static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
	if (likely(check_copy_size(to, n, false)))
		n = _copy_from_user(to, from, n);
	return n;
}

static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
	if (likely(check_copy_size(from, n, true)))
		n = _copy_to_user(to, from, n);
	return n;
}
So provided that check_copy_size() passes, a lower-level _copy_*_user routine
will be called. These lower-level routines can either be inlined or not depending on
processor architecture, but their implementation remains the same:
unsigned long _copy_from_user(void *to, const void __user *from, unsigned long n)
{
	unsigned long res = n;
	might_fault();
	if (!should_fail_usercopy() && likely(access_ok(from, n))) {
		instrument_copy_from_user(to, from, n);
		res = raw_copy_from_user(to, from, n);
	}
	if (unlikely(res))
		memset(to + (n - res), 0, res);
	return res;
}

unsigned long _copy_to_user(void __user *to, const void *from, unsigned long n)
{
	might_fault();
	if (should_fail_usercopy())
		return n;
	if (likely(access_ok(to, n))) {
		instrument_copy_to_user(to, from, n);
		n = raw_copy_to_user(to, from, n);
	}
	return n;
}
access_ok() is an ancient routine whose purpose is to validate whether the provided
range in userland (the source for copy_from_user, the destination for copy_to_user)
actually resides in userland. This API changed recently with the removal of addr_limit
from the thread_info structure and the splitting out of kernel-to-kernel copies into
their own API. Previously, code running after a set_fs(KERNEL_DS) and prior to addr_limit
restoration, or with a maliciously-modified addr_limit (a common exploit technique in years
past), could (ab)use copy_*_user for arbitrary kernel read/write.
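As a rough userspace illustration of the kind of range check access_ok() performs, the sketch below models a 32-bit address space. The name model_access_ok and the 3 GiB MODEL_TASK_SIZE split are assumptions made for this example; the kernel's real implementation is architecture-specific and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit layout: userland occupies [0, MODEL_TASK_SIZE). */
#define MODEL_TASK_SIZE UINT32_C(0xC0000000)

/*
 * Model of a userland range check: [addr, addr + size) must lie entirely
 * below MODEL_TASK_SIZE. The comparison is arranged so that addr + size
 * is never computed and therefore cannot wrap around.
 */
static bool model_access_ok(uint32_t addr, uint32_t size)
{
	if (size > MODEL_TASK_SIZE)
		return false;
	return addr <= MODEL_TASK_SIZE - size;
}
```

Under these assumptions, model_access_ok(0x1000, 16) passes, while a length of 0xffffffff fails for any starting address because the range would necessarily extend past the userland boundary.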
should_fail_usercopy() is related to fault-injection (a fuzzing enhancement) and can be
ignored. instrument_* can likewise be ignored, as it's related to KASAN (a
debugging/fuzzing enhancement). For production security purposes, the only code we're
interested in is access_ok() and check_copy_size(). The latter is reproduced below:
static __always_inline __must_check bool
check_copy_size(const void *addr, size_t bytes, bool is_source)
{
	int sz = __compiletime_object_size(addr);
	if (unlikely(sz >= 0 && sz < bytes)) {
		if (!__builtin_constant_p(bytes))
			copy_overflow(sz, bytes);
		else if (is_source)
			__bad_copy_from();
		else
			__bad_copy_to();
		return false;
	}
	if (WARN_ON_ONCE(bytes > INT_MAX))
		return false;
	check_object_size(addr, bytes, is_source);
	return true;
}
To explain a bit what is being displayed here: __compiletime_object_size() is a macro
that makes use of the compiler's __builtin_object_size(). If the size of the object in
the kernel used for the source or destination of the copy (as appropriate) can
be determined at compile time, this will return the size of that object. Otherwise, including if
the compiler version doesn't support __builtin_object_size(), it returns -1.
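To make the "known versus unknown" distinction concrete, here is a minimal userspace sketch, assuming GCC or Clang. The function names are ours for illustration only; the one real interface used is the compiler's __builtin_object_size(). The volatile pointer is simply a trick to hide the object from the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* The argument is the array itself, so its size is visible to the compiler. */
static size_t known_object_size(void)
{
	char buf[4];

	return __builtin_object_size(buf, 0);
}

/*
 * The pointer is loaded from a volatile variable, so the compiler cannot
 * trace it back to a particular object and reports "unknown" as (size_t)-1.
 */
static size_t unknown_object_size(void)
{
	static char storage[4];
	static char *volatile hidden = storage;
	char *p = hidden;

	return __builtin_object_size(p, 0);
}
```

The kernel stores the result in a signed int, which is how the "unknown" value ends up being treated as -1 in check_copy_size().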
__bad_copy_from() and __bad_copy_to() are both compile-time errors issued when both the
size of the kernel object is statically known and the length to copy is also constant,
a case which is unlikely to be a useful security issue in practice unless the code was
never tested/used at all. copy_overflow() is a runtime warning for when the object
size is statically known, but the copy length isn't a compile-time constant.
The "WARN_ON_ONCE(bytes > INT_MAX)" check will produce a runtime warning if the copy
length is greater than INT_MAX. Since this is computed on an unsigned size_t type,
this has the added effect of rejecting lengths that would be negative when interpreted as signed int
or long types (i.e. the case of the bug reported on oss-sec). More on this check later.
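The effect on negative lengths can be seen in a few lines of userspace C. This is only a sketch of the comparison; the helper names model_size_allowed and as_copy_length are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the size gate: anything above INT_MAX is refused. */
static bool model_size_allowed(size_t bytes)
{
	return bytes <= (size_t)INT_MAX;
}

/* A signed length, possibly negative, passed where size_t is expected. */
static size_t as_copy_length(long len)
{
	return (size_t)len;
}
```

A length of -1 converts to the largest size_t value, so model_size_allowed(as_copy_length(-1)) is false, while an ordinary length such as 16 passes.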
check_object_size() comes from upstream's limited version of our USERCOPY
feature. No check is made on its return value because, rather than simply failing the copy
as the other checks do, this one performs a BUG() in usercopy_abort(). In a simple
case this will just terminate the process involved, but in more complex scenarios, like
having a mutex held around a userland copy, it could result in lockups of some code paths,
or, in a panic_on_oops scenario, crash the system. More on this later as well.
With regard to the practical impact of the bug discussed in the oss-sec report: as the bad CFAM change was introduced in "fsi: Add cfam char devices" in
June 2018 for Linux 4.19, the cfam_write() case would enter copy_from_user() with a
negative length and reach check_copy_size(), where __compiletime_object_size() would return
4 for the __be32 data variable being copied into. Then, as bytes was not a
compile-time constant, it would call copy_overflow(), triggering the WARN() contained
therein and aborting the copy operation with an error.
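The runtime decision just described can be modelled in userspace as follows. This is a simplified sketch: the enum and function names are invented for illustration, the compile-time-constant branch is left out because it is a build error rather than a runtime outcome, and check_object_size() is not modelled at all.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum model_verdict {
	MODEL_COPY_ALLOWED,
	MODEL_REJECT_OBJECT_TOO_SMALL,	/* stands in for the copy_overflow() path */
	MODEL_REJECT_OVER_INT_MAX,	/* stands in for the WARN_ON_ONCE() path */
};

/*
 * sz:    object size as reported by the compiler, or -1 if unknown
 * bytes: requested copy length
 */
static enum model_verdict model_check_copy_size(int sz, size_t bytes)
{
	if (sz >= 0 && (size_t)sz < bytes)
		return MODEL_REJECT_OBJECT_TOO_SMALL;
	if (bytes > (size_t)INT_MAX)
		return MODEL_REJECT_OVER_INT_MAX;
	return MODEL_COPY_ALLOWED;
}
```

With sz = 4 and a length of (size_t)-1 (standing in for the negative length), the first branch rejects the copy. If the compiler reports the size as unknown (sz = -1), only the INT_MAX comparison is left to catch it.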
For purposes of brevity, we'll limit our analysis to these aspects without getting into
the weeds of the implementation of raw_copy_*_user()
itself, which has seen
architecture-specific evolution of its own with the introduction of SMAP/PAN/etc and
the discovery of Spectre.
One final item to note before we proceed is the memset() present only in
_copy_from_user(). This is an ancient and documented aspect of the API, described in a
kernel comment as follows:
* NOTE: only copy_from_user() zero-pads the destination in case of short copy.
* Neither __copy_from_user() nor __copy_from_user_inatomic() zero anything
* at all; their callers absolutely must check the return value.
Note that __copy_*_user (two underscores) is different from _copy_*_user (one underscore).
The reason for this memset() was presumably to address cases early in Linux's history, prior to
copy_*_user being marked with a __must_check attribute
requiring that callers check the return value for error. Consider the common case
where a structure is copied from userland, some fields are changed by the kernel, and
then the structure is copied back out to userland. If areas written neither by the
copy from userland nor by the kernel setting fields are copied back out to userland, being
uninitialized, they could leak previous memory contents to userland (an information
leak). Userland can also trivially force a partial failure in the lower-level copy routine
by making part of the copy range include unmapped addresses or invalid permissions for the
operation.
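The zero-padding behaviour is easy to model in userspace. In the sketch below, the `readable` parameter is an assumption standing in for a fault partway through the source range, and both function names are hypothetical; only the "return the number of bytes not copied, then zero the tail" convention is taken from the kernel code shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the raw copy: only the first `readable` bytes of `from` can
 * be read, as if the rest of the range were unmapped. Returns the number of
 * bytes NOT copied, following the kernel convention.
 */
static size_t model_raw_copy(void *to, const void *from, size_t n, size_t readable)
{
	size_t done = n < readable ? n : readable;

	memcpy(to, from, done);
	return n - done;
}

/* Model of the zero-padding: clear whatever part of `to` was not filled. */
static size_t model_copy_from_user(void *to, const void *from, size_t n, size_t readable)
{
	size_t res = model_raw_copy(to, from, n, readable);

	if (res)
		memset((char *)to + (n - res), 0, res);
	return res;
}
```

Copying 8 bytes when only 3 are readable returns 5 and leaves the last 5 bytes of the destination zeroed, so a caller that ignores the return value at least does not hand stale kernel memory back to userland.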
History of check_object_size()
This check first appeared via commit "mm: Hardened usercopy" in June of 2016 for the 4.8 version of Linux. Its commit message noted:
This is the start of porting PAX_USERCOPY into the mainline kernel. This
is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
work is based on code by PaX Team and Brad Spengler, and an earlier port
from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
check_object_size()
works by referencing heap (and other) metadata in
order to validate at runtime that the copy operation occurs within the bounds of a
single object, whenever possible.
Though the commit message above mentioned being the start of porting the full functionality, the feature saw no further major changes in the 5 years since outside the disabling of the mentioned added page tests (not present in grsecurity) which broke several areas of the kernel. PAX_USERCOPY was initially published to the world in 2009, some 7 years prior to the limited upstream version. The hardened usercopy code was not backported to earlier releases, including active LTS releases at the time. Therefore, the upstream 4.4 XLTS, on its 6th and final year of claimed support this year, does not contain this check.
History of __compiletime_object_size()/__builtin_object_size()
__builtin_object_size() was first used in the kernel by an experimental FORTIFY_SOURCE patch
written by Arjan van de Ven in 2005. It initially focused on the
typical str* and mem* APIs that are covered by FORTIFY_SOURCE in userland as well. In
2009, when investigating the true coverage of FORTIFY_SOURCE in practice in the kernel,
I extended Arjan's work to apply checks to more functions and increased its
compile-time knowledge of some [k|v]malloc'd object sizes, finding that it only
instrumented ~30% of instances of covered APIs, generally in non-complex cases unlikely
to be interesting from a security perspective.
Coverage of these str*
and mem*
APIs came in upstream via commit
"include/linux/string.h: add the option of fortified string.h functions" in July 2017
for the 4.13 kernel, with no mention of that earlier work, or any use of the improved
knowledge of the size of some dynamically-allocated objects, reducing its effective coverage
below the 30% of my initial investigations.
Use of __builtin_object_size()
for copy_*_user
was added upstream first for x86
via commit "x86: Use __builtin_object_size() to validate the buffer size for copy_from_user()"
by Arjan van de Ven in 2009. This initial version for Linux 2.6.34 covered only copy_from_user, with
copy_to_user
being covered only starting from October 2013 with Jan
Beulich's "x86: Unify copy_to_user() and add size checking to it" commit for Linux
3.13. It saw a number of refactorings, eventually resulting in an arch-independent
variant via commit "generic ...copy_..._user primitives" by Al Viro in March 2017 for
the 4.12 version of Linux.
As we mentioned before, __builtin_object_size()
is provided by the compiler. In April
of 2013, in response to compile-time errors produced by the kernel's existing use of
the builtin on certain GCC versions, "gcc4: disable __compiletime_object_size for GCC 4.6+"
was merged by Guenter Roeck, rendering the whole exercise useless for affected
compiler versions (at the time, the newest-released GCC version was 4.8.0). Jan's
commit from later that year referenced above pointed to problems with this change,
saying:
I'd like to point out though that with
__compiletime_object_size() being restricted to gcc before 4.6,
the whole construct is going to become more and more pointless
going forward. I would question however that commit
2fb0815c9ee6b9ac50e15dd8360ec76d9fa46a2 ("gcc4: disable
__compiletime_object_size for GCC 4.6+") was really necessary,
and instead this should have been dealt with as is done here
from the beginning.
yet it wasn't until August of 2016 with Josh Poimboeuf's Linux 4.8 commit:
"mm/usercopy: get rid of CONFIG_DEBUG_STRICT_USER_COPY_CHECKS" that the restriction on
__builtin_object_size()
use to GCC >= 4.1 and < 4.6 was changed to GCC >= 4.1. At the
time of this commit, GCC 6.2 was the most recent compiler release. Josh's commit also
eliminated the need for a debug option to be enabled in order to get the functionality.
The result of all the above was that there was a period of roughly three years where checks
were present that, unknowingly except perhaps to a handful of kernel developers, were
complete no-ops for a large number of users. It should also be noted that the 4.8
commit relaxing the version restriction on __builtin_object_size()
was never backported
to earlier kernels, meaning that even for today's latest 4.4 XLTS release, any checks
involving this are complete no-ops for virtually any modern user.
You may be wondering how later use of Clang for building Linux plays into all this, given that Google started using Clang for kernel builds in late 2018 with the 4.4 kernel. Clang historically (and even in the latest versions) fakes a GCC version of 4.2.1, and so was unaffected by these changes.
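The version gating described in this section can be summarised with a small model. The function names are hypothetical; the version encoding follows the major * 10000 + minor * 100 + patch scheme used by the kernel's GCC_VERSION macro, and the two predicates simply restate the ranges given above.

```c
#include <assert.h>
#include <stdbool.h>

/* Encode a compiler version as major * 10000 + minor * 100 + patch. */
static int model_gcc_version(int major, int minor, int patch)
{
	return major * 10000 + minor * 100 + patch;
}

/* Gate in effect between the April 2013 and August 2016 commits. */
static bool object_size_enabled_2013(int version)
{
	return version >= 40100 && version < 40600;
}

/* Gate after the Linux 4.8 change. */
static bool object_size_enabled_2016(int version)
{
	return version >= 40100;
}
```

GCC 4.8.0 and 6.2 both fall outside the 2013 gate, so the size checks compiled to nothing for them, whereas a compiler reporting itself as 4.2.1 passes under either gate.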
History of WARN_ON_ONCE(bytes > INT_MAX)
The earliest appearance of this kind of check was for i386
by Linus Torvalds himself via (ironically) BUG_ON() in February 2005 for the 2.6.11 kernel.
These checks were removed a little over two years later for version 2.6.22 of the
kernel via Andi Kleen's commit "[PATCH] i386: Remove copy_*_user BUG_ONs for (size < 0)"
with the erroneous explanation of "access_ok checks this case anyways, no need to
check twice." The explanation is erroneous because while access_ok() does effectively
check the same case, the BUG_ON() avoided execution of the subsequent memset(), which
thereafter became possible to execute.
In grsecurity, this check was implemented safely (without a BUG_ON()) across virtually
all architectures supported by Linux at the time in August of 2009 for the 2.6.29
version of Linux. Importantly, this check was performed prior to access_ok() and, in the case of
copy_from_user(), avoided the vulnerable memset() that will be discussed later.
Upstream, this check was introduced by Kees Cook via "uaccess: disallow > INT_MAX copy sizes"
in December of 2019 for Linux 5.5, a decade later, with no mention of that earlier
published work. Since this was a trivial 2-line change, it was backported to Linux 5.4
a month later. Yet as the copy_*_user
API had undergone considerable churn in the
preceding years, the simple change wasn't backported any further and thus isn't present
in the upstream 4.4, 4.9, 4.14, or 4.19 XLTS kernels still under claimed support today.
History of the copy_from_user memset
On x86, going back at least before 2002, zeroing on failed copy_from_user
was only
performed when access_ok()
was successful. See for instance the implementation of
__generic_copy_from_user
(called when the size for copy_from_user
wasn't a compile-time
constant) here.
Later, this changed, starting with this commit from Linux 2.4.3.4 and this change
from the subsequent 2.4.3.5 release, which added the zeroing when access_ok() failed.
A patch for x86 from Andrew Morton in June 2003 mentioned putting back the memset()
under the access_ok()
failure case that had gone missing in the previous year's churn.
The most interesting history to note here is that ARM as early as 2002 with Linux 2.4.4.4 had the following change:
diff --git a/include/asm-arm/uaccess.h b/include/asm-arm/uaccess.h
index f7aea9b38e771..8b5e076a94212 100644
--- a/include/asm-arm/uaccess.h
+++ b/include/asm-arm/uaccess.h
@@ -75,6 +75,8 @@ static __inline__ unsigned long copy_from_user(void *to, const void *from, unsig
 {
 	if (access_ok(VERIFY_READ, from, n))
 		__do_copy_from_user(to, from, n);
+	else /* security hole - plug it */
+		memzero(to, n);
 	return n;
 }
It's not clear whether the comment was referring to it plugging the security hole
described earlier or introducing one (zeroing a buffer with a possibly
attacker-controlled length), but either way this change introduced a
weakness. Consider the case where n = -1 due to some overflow in
calculation (like in the oss-sec posting this blog references):
access_ok() would fail here on 32-bit ARM, as the n expressed
as an unsigned value added to any userland address would cover a range including kernel
space. The else case would be triggered, causing a memzero() with a length of 0xffffffff,
certainly causing an unrecoverable DoS to the system.
Our public patches starting from 2009 poked a little fun at this comment, as seen in the below diff snippet:
diff -urNp linux-2.6.29.6/arch/arm/include/asm/uaccess.h linux-2.6.29.6/arch/arm/include/asm/uaccess.h
--- linux-2.6.29.6/arch/arm/include/asm/uaccess.h 2009-07-02 19:41:20.000000000 -0400
+++ linux-2.6.29.6/arch/arm/include/asm/uaccess.h 2009-07-30 17:59:25.590775992 -0400
@@ -400,7 +400,7 @@ static inline unsigned long __must_check
 {
 	if (access_ok(VERIFY_READ, from, n))
 		n = __copy_from_user(to, from, n);
-	else /* security hole - plug it */
+	else if ((long)n > 0) /* security hole - plug it -- good idea! */
 		memset(to, 0, n);
 	return n;
 }
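To compare the two failure paths without actually zeroing 4 GiB of memory, the sketch below only computes how many bytes each variant would zero. It assumes a 32-bit system with a hypothetical 3 GiB user/kernel split, and every name in it is invented for this example. Because long is 32 bits wide on 32-bit ARM, the (long)n > 0 guard is modelled with int32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TASK_SIZE UINT32_C(0xC0000000)	/* hypothetical user/kernel split */

/* Minimal range check standing in for access_ok(VERIFY_READ, from, n). */
static bool model_range_is_userland(uint32_t addr, uint32_t n)
{
	return n <= MODEL_TASK_SIZE && addr <= MODEL_TASK_SIZE - n;
}

/* Bytes zeroed by the unguarded else-branch when the range check fails. */
static uint32_t zeroed_on_failure_unguarded(uint32_t addr, uint32_t n)
{
	return model_range_is_userland(addr, n) ? 0 : n;
}

/* Bytes zeroed when the else-branch also requires (long)n > 0. */
static uint32_t zeroed_on_failure_guarded(uint32_t addr, uint32_t n)
{
	if (model_range_is_userland(addr, n))
		return 0;
	return (int32_t)n > 0 ? n : 0;
}
```

For n = 0xffffffff the unguarded variant reports a 0xffffffff-byte zeroing, while the guarded one reports 0. For an ordinary failed copy, such as 16 bytes starting at a kernel address, both still zero the 16-byte destination.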
The problem remained upstream even after the comment was removed and the code refactored via Al Viro's September 2016 commit "arm: don't zero in __copy_from_user_inatomic()/__copy_from_user()".
Since the upstream check added in 2019 to disallow copy sizes greater than INT_MAX was implemented
in check_copy_size(), its failure avoids the call to _copy_from_user() that would perform
the bad memset(), just as the decade-earlier grsecurity change did. For unknown
reasons though, fixing this vulnerability wasn't mentioned at all in the commit message
or in the mailing list discussion it referenced.
As mentioned above, since the upstream 2019 change was not backported to 4.4, 4.9,
4.14, or 4.19, bugs present in those versions of the kernel where negative lengths can
be passed to copy_from_user()
will likely result in massive memory corruption and a guaranteed
system DoS (modulo some vmalloc'd stack or other vmalloc-based destination buffer). If someone
were to take mitigation advice proposed by some and
enable panic_on_warn, the
upstream 2019 change would also cause a DoS via panic.
TL;DR
The Linux kernel is an incredibly fast-moving project with significant churn and
fragmentation. It is often the case that statements about how certain APIs operate
(even something as well-known and commonplace as copy_*_user()) need to take into
consideration the kernel versions involved. As many proactive security changes are
never backported to earlier supported kernel versions (examples of which we have
provided here), and certain features can change without user-visible notice (e.g. the
three-year __builtin_object_size change), it's virtually impossible for end-users
to know their real risk based on statements about the latest kernel version.
We hope this post provided some useful historical insights and ideas for further improvements.