Close, but No Cigar: On the Effectiveness of Intel's CET Against Code Reuse Attacks
June 12, 2016
Intel's recent announcement ([A1], [A2]) of their hardware support for a form of Control Flow Integrity (CFI) has raised a lot of interest in the expert as well as the popular press. As an interested party we've decided to look at some of the details and analyze the strengths and weaknesses of Intel's Control-flow Enforcement Technology (CET). Note that all the discussion below is based on Intel's published technology preview documents. As no processor with the claimed technology will exist for several years, the details are not complete and may change in small ways prior to production.
Full disclosure: we have a competing production-ready solution to defend against code reuse attacks called RAP, see [R1], [R2]. RAP isn't tied to any particular CPU architecture or operating system, and it scales to real-life software from Xen to Linux to Chromium with excellent performance.
Following typical CFI schemes ([P1]), CET provides two separate mechanisms to protect indirect control flow transfers: one for forward edges (indirect calls and jumps) and another for backward edges (function returns). As we'll see, they have very different characteristics, so we'll look at them each individually.
Indirect Branch Tracking
The forward edge mechanism is called Indirect Branch Tracking (IBT) and is designed to allow only designated code locations as valid targets for indirect calls and jumps ([N1]). This is no different from other approaches in the field. What does differentiate these schemes is their precision, that is, the number of allowed targets at each indirect control transfer instruction. Intuitively, the fewer locations an attacker can target, the less likely it is that those locations will be useful for something. Without any CFI an attacker can target any executable byte in the program's address space. CFI, ideally, restricts this set to a minimum at each indirect control transfer instruction.
How does CET fare in this regard? Very badly, unfortunately: CET implements the weakest form of CFI, in which there's only a single class of valid targets. That is, any indirect control transfer can be redirected to any of the designated target locations (similar to what Microsoft's CFG allows). Such simplistic schemes have been proven to be fatally weak by both academic and industry researchers ([P2] [P3] [P4] [P5]).
In contrast, RAP's type hash based classification can create over 30,000 function pointer classes and 47,000 function classes for Chromium (this means, among other things, that thousands of otherwise valid functions cannot be called indirectly at all).
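To make the difference in precision concrete, here is a minimal sketch that counts the allowed targets per call site under a single-class policy and under a type-based one. The function names and signature strings are made up for illustration, and grouping by signature string is only a stand-in for type hashes; this is not RAP's or CET's actual implementation.

```python
from collections import defaultdict

# Hypothetical program: function name -> type signature (illustrative only).
FUNCTIONS = {
    "open_file": "int(const char*)",
    "run_cmd": "int(const char*)",
    "close_file": "int(int)",
    "log_msg": "void(const char*)",
    "free_buf": "void(void*)",
}


def coarse_targets(functions):
    """Single-class policy: every marked function is valid for every call site."""
    return set(functions)


def typed_targets(functions):
    """Type-based policy: functions are grouped into classes by signature."""
    classes = defaultdict(set)
    for name, signature in functions.items():
        classes[signature].add(name)
    return dict(classes)


def allowed(call_site_signature, functions, fine_grained):
    """Return the set of functions an indirect call of the given type may reach."""
    if fine_grained:
        return typed_targets(functions).get(call_site_signature, set())
    return coarse_targets(functions)


if __name__ == "__main__":
    for sig in sorted(set(FUNCTIONS.values())):
        print(sig,
              len(allowed(sig, FUNCTIONS, fine_grained=False)),
              len(allowed(sig, FUNCTIONS, fine_grained=True)))
```

Under the single-class policy every call site can reach all five functions, whereas under the typed policy a call through an `int(int)` pointer can reach only `close_file`.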
Beyond the design flaw identified above, there are also implementation problems with CET. One of them is related to the fact that the hardware has not one but two state machines to keep track of the IDLE/WAIT_FOR_ENDBRANCH states for user and kernel mode code, respectively. Only one state machine is active at a time depending on the privilege level of the currently executing code; the other state machine is suspended. There is however no mention in the documentation of how this hidden state is saved and restored when the privilege boundary is crossed by system calls, interrupts, etc.
This in particular seems to make it impossible for a kernel to switch contexts between threads since it may very well happen that the outgoing thread was interrupted in a different state than what the incoming thread would like to resume in, triggering an instant Control Protection fault upon returning from the kernel to userland. The same problem arises with UNIX style signal delivery and other similar asynchronous userland code invocations. Hopefully this is merely an oversight in the documentation and not the design itself.
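The context switch problem can be sketched with a toy model of the user mode state machine. The state names come from the discussion above; the instruction labels, the `Tracker` class and the per-thread save area are hypothetical, since the documentation doesn't describe how (or whether) this state is saved.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    WAIT_FOR_ENDBRANCH = "WAIT_FOR_ENDBRANCH"


class ControlProtectionFault(Exception):
    """Raised when an indirect branch is not followed by ENDBRANCH."""


class Tracker:
    """Toy model of the user mode IBT state machine."""

    def __init__(self):
        self.state = State.IDLE

    def execute(self, insn):
        # insn is one of "indirect_call", "endbranch" or "other".
        if self.state is State.WAIT_FOR_ENDBRANCH:
            if insn != "endbranch":
                raise ControlProtectionFault(insn)
            self.state = State.IDLE
        elif insn == "indirect_call":
            self.state = State.WAIT_FOR_ENDBRANCH


def faults_on_resume(per_thread_save):
    """Switch from thread A to thread B; report whether B faults on resume."""
    user = Tracker()
    saved = {"A": State.IDLE, "B": State.IDLE}  # hypothetical save area
    # Thread A is interrupted right after an indirect call, before the
    # ENDBRANCH at the target has executed.
    user.execute("indirect_call")
    if per_thread_save:
        saved["A"] = user.state
        user.state = saved["B"]
    # Thread B was interrupted in IDLE; its next instruction is ordinary code.
    try:
        user.execute("other")
    except ControlProtectionFault:
        return True
    return False


if __name__ == "__main__":
    print("without per-thread state:", faults_on_resume(False))
    print("with per-thread state:", faults_on_resume(True))
```

In this model thread B faults immediately unless the kernel swaps the tracker state along with the rest of the thread context, which is exactly the step the preview documents leave unspecified.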
Another problem is the support mechanism for compatibility with code that hasn't been recompiled for CET. The Legacy Code Bitmap (LCB) seems to be direct hardware support for Microsoft's CFG scheme and, as a result, suffers from the same problems that earlier research has identified ([P6], [P7], [P8]).
Interestingly, this same compatibility mechanism could be used to fix the fatal flaw of the coarse-grained design. Namely, to simulate fine-grained CFI one could create a separate bitmap for each indirect call type and activate it for the call. The implementation would suffer from increased memory usage (one LCB per function pointer type) and it'd also have a large performance impact due to the slow access to the MSR storing the address of the LCB (this would be even worse for userland as the MSR doesn't seem to be writable directly from user mode code). Needless to say, RAP achieves fine-grained forward-edge CFI without this performance impact.
A third problem with IBT is that, to mark valid indirect branch targets, an otherwise useless instruction must be emitted at the target location. At the very least this wastes instruction decoding bandwidth (and probably more on non-CET capable processors). In contrast, RAP's type hash based marking scheme was specifically designed to avoid this problem, so its only impact is on memory use.
Shadow Stack
Let's now look at CET's offering for protecting function returns. This mechanism is based on the well-known concept of shadow stacks that have been (re)invented and implemented many times in the past ([P9]).
Shadow stacks aim to provide secure storage for return addresses that can only be written by call instructions. This ensures that memory corruption bugs cannot be used to divert control flow at function returns, which has been a widespread exploitation technique since the beginning of time.
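The basic idea fits in a few lines. The sketch below is a software model, not Intel's design: the `Machine` class and the addresses are made up, and the "attacker" is simply allowed to overwrite the regular stack while the shadow copy is only ever touched by `call` and `ret`.

```python
class ControlProtectionFault(Exception):
    """Raised when a return address disagrees with the shadow stack."""


class Machine:
    def __init__(self):
        self.data_stack = []    # ordinary stack, writable by any store
        self.shadow_stack = []  # in this model written only by call()

    def call(self, return_address):
        # A call pushes the return address onto both stacks.
        self.data_stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # A return pops both copies and compares them.
        actual = self.data_stack.pop()
        expected = self.shadow_stack.pop()
        if actual != expected:
            raise ControlProtectionFault(
                f"return to {actual:#x}, expected {expected:#x}")
        return actual


def run(corrupt):
    """Return the address returned to, or None if the return was blocked."""
    m = Machine()
    m.call(0x401000)
    if corrupt:
        # Stands in for a memory corruption bug overwriting the return address.
        m.data_stack[-1] = 0xDEADBEEF
    try:
        return m.ret()
    except ControlProtectionFault:
        return None


if __name__ == "__main__":
    print(run(corrupt=False))
    print(run(corrupt=True))
```

The uncorrupted run returns to the original call site, while the corrupted one is stopped at the return because the two copies no longer match.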
While the shadow stack design is sound as it provides precise enforcement of call/return pairs, implementing it in real life systems faces several problems such as protecting the shadow stack region itself from memory corruption attacks, performance overhead of instructions needed to read from and write to the shadow stack, and compatibility with programming constructs that intentionally violate the strict call/return pairing assumed by the shadow stack design.
Traditional shadow stack implementations all suffer from the problem that they're writable and thus subject to memory corruption themselves. Fixing this by changing memory protection rights on each function call and return is prohibitively expensive, so most designs either assume a weaker threat model or try to hide behind ASLR (which is itself vulnerable to more powerful threats).
Intel's shadow stack design solves the problem of writable shadow stacks by giving hardware support to separate the shadow stack memory from other data and allow only designated instructions to write there. This is a sound design but the particular implementation requires implementors to be careful.
Namely, the way shadow stacks are marked seems to make RELRO and text relocated pages look like shadow stacks as well (they're all read-only but have been written to, and are thus dirty in the last level page table entries). This can become a problem if the actual shadow stack area is ever mapped directly next to such a mapping, as overflowing or underflowing the shadow stack may go unnoticed and give rise to an attack. Speaking of which, the current document doesn't say anything about how shadow stack overflows/underflows are handled.
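The aliasing can be illustrated with the x86 page table entry bits involved (present is bit 0, writable is bit 1, dirty is bit 6). The helper functions below are hypothetical, and the sketch assumes that the OS leaves the dirty bit alone when it makes a page read-only, which is our reading of the situation rather than something the preview documents spell out.

```python
# x86 page table entry flag bits.
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_DIRTY = 1 << 6


def looks_like_shadow_stack(pte):
    """Read-only but dirty: the marking described for shadow stack pages."""
    return (bool(pte & PTE_PRESENT)
            and not pte & PTE_WRITABLE
            and bool(pte & PTE_DIRTY))


def write_to(pte):
    """Model a store to a writable page: the CPU sets the dirty bit."""
    if not pte & PTE_WRITABLE:
        raise ValueError("page is not writable")
    return pte | PTE_DIRTY


def make_read_only(pte):
    """Model an mprotect-style downgrade (assumed to keep the dirty bit)."""
    return pte & ~PTE_WRITABLE


if __name__ == "__main__":
    # RELRO: mapped writable, relocations applied, then made read-only.
    relro = make_read_only(write_to(PTE_PRESENT | PTE_WRITABLE))
    print(looks_like_shadow_stack(relro))
    # A read-only page that was never written to is not affected.
    print(looks_like_shadow_stack(PTE_PRESENT))
```

Under these assumptions the RELRO page is indistinguishable from a shadow stack page by its flags alone, so only the layout of the address space keeps the two apart.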
Finally, as already discovered by past implementors ([P10] [P11] [P12] [P13]), shadow stacks cannot be used through compiler modifications only. Each OS has its own exceptional cases that need special handling. On Linux and similar OSes, these exceptional cases include the setjmp/longjmp/makecontext/setcontext set of functions, which can violate the assumption that a function will return to its call site. They also include the default glibc behavior of lazy binding (done for performance reasons) as well as C++ exceptions and asynchronous signal handling.
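The setjmp/longjmp case shows why runtime support is needed. In the toy model below, `longjmp` discards several frames from the regular stack at once; unless the shadow stack is unwound to the same depth, the next return no longer matches. The `Machine` class, the return address labels and the `unwind_shadow` switch are all hypothetical and say nothing about how a CET-aware libc would actually do this.

```python
class Machine:
    def __init__(self, unwind_shadow):
        self.data_stack = []
        self.shadow_stack = []
        self.unwind_shadow = unwind_shadow

    def call(self, return_address):
        self.data_stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        """Pop both stacks; True means the return addresses matched."""
        return self.data_stack.pop() == self.shadow_stack.pop()

    def setjmp(self):
        # Record the current depth of both stacks.
        return (len(self.data_stack), len(self.shadow_stack))

    def longjmp(self, env):
        data_depth, shadow_depth = env
        # The regular stack is always cut back to the saved depth.
        del self.data_stack[data_depth:]
        # The shadow stack follows only with explicit runtime support.
        if self.unwind_shadow:
            del self.shadow_stack[shadow_depth:]


def returns_cleanly(unwind_shadow):
    m = Machine(unwind_shadow)
    m.call("main+1")    # main calls f
    env = m.setjmp()    # f saves its context
    m.call("f+1")       # f calls g
    m.call("g+1")       # g calls h
    m.longjmp(env)      # h jumps straight back into f
    return m.ret()      # f returns to main


if __name__ == "__main__":
    print(returns_cleanly(unwind_shadow=True))
    print(returns_cleanly(unwind_shadow=False))
```

Without the unwinding step, f's return is compared against the stale entry left behind by g's call to h and is rejected, even though nothing malicious happened.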
In summary, Intel's CET is mainly a hardware implementation of Microsoft's weak CFI implementation with the addition of a shadow stack. Its use will require Intel processors that aren't expected to be released for several years. Rather than truly innovating and advancing the state of the art in performance and security guarantees as RAP has, CET merely cements into hardware existing technology that is already known to, and has been bypassed by, academia and industry, and that is too weak to protect against the larger class of code reuse attacks. One can't help but notice a striking similarity with Intel's MPX, another software-dependent technology announced with great fanfare a few years ago. MPX failed to live up to its many promises, never reached its intended adoption as the solution to end buffer overflow attacks, and exists only as yet another bounds-checking based debugging technology.
In comparison, RAP is architecture-independent, best of breed in performance and security, doesn't require the latest CPU, and gives software developers the powerful ability to easily make the protections from RAP even more fine-grained.
[N1] Note that in practice indirect calls are the interesting case as the typical use of indirect jumps is to implement high level switch/case constructs where the code addresses and the paths leading to them are already in read-only memory and thus not subject to memory corruption.