Computer Architecture

I see dead uops: thoughts on the latest Spectre paper targeting uop caches

Last night, a group of computer security researchers lead by Ashish Venkat (University of Virginia) published a paper titled “I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches” which they have submitted for ISCA ’21 (International Symposium on Computer Architecture, a prestigious academic conference). The paper concerns a structure called the micro-op (uop) cache that is commonly used within modern microprocessors. A thorough analysis of the organization and implementation of these structures allows the research team to propose novel timing side-channel attacks similar to those of “Spectre“.

Modern microprocessors are decoupled into a “front-end” that decodes the instructions as presented by the programmer (in the form of compiled machine code) and a “back-end” that actually executes instructions (quite likely in the form of a data flow model, Out-of-Order with respect to how the program was actually written, but nonetheless respecting the actual dependencies). Notice I didn’t say “these instructions” a second time. What the back-end actually executes may look similar to the machine code presented at the front, but it might also look very different. Perhaps it has simply been optimized (e.g. fusing two – or more – adjacent instructions together into a more efficient one) but more often than not the backend of the machine will be executing very different instructions known as micro-ops.

A micro-op (uop) is a very simple instruction. For example, it might be an ADD instruction. It may take a couple of inputs and produce one output. In some cases (such as a simple add), the “macro” ops of the machine code written by the tooling used by a programmer map 1:1 on to the uops used by the machine. But often, a single macro op is actually decomposed into many individual uops. In RISC-like machines (Arm, RISC-V, etc.) such cracking into multiple uops is usually limited to relatively complex instructions (such as certain loads that also perform complex address generation logic), but most modern CISC machines (e.g. x86) will actually decompose every x86 macro instruction into a stream of RISC uops.

In such a machine, the processor frontend will be coupled to the instruction cache through a “fetch” unit that will retrieve a certain number of bytes (e.g. 16 in the paper cited) per processor cycle (clock “tick”) from the “L1” (level 1) instruction cache. The L1 instruction cache contains the program machine instruction “code”, represented as a sequence of bytes, typically stored into 64 byte cache “lines”. The L1 is small but extremely fast. It is typically “coherent” with respect to other caches in the machine, meaning that if you were to write self-modifying code out via the data cache, the instruction cache automatically observes this. The L1 might do many other things, such as pre-decode instructions (e.g. tag those that are branches ahead of time, etc.), or be the target of other timing side-channel attacks (such as the original Spectre attacks), but that is not the focus of this latest analysis.

Once bytes of instructions are fetched, they are run through decoders. Particularly with x86 (but also elsewhere), decoding macro operations is expensive. Worse, if a piece of code is “hot” because it is part of a loop that is being executed hundreds or millions of times, decoding the same macro instructions over and over can add needless expense (both to performance overhead, but also energy use). For this reason, it is common in some designs to cache decoded uops and replay them out of the uop cache as needed. Think of the uop cache as a companion to the other caches we already have in our modern processors – the instruction and data caches being just two of many different examples in modern designs.

Keeping a cache of decoded uops can save a lot of overhead, and increase throughput, especially when doing a CISC->RISC translation. Individual x86 macro instructions are variable width, up to 15 bytes in length (perhaps theoretically 16?), and decoding them is extremely unpleasant. Worse, the throughput per cycle is dependent upon the complexity of the instructions. Unlike a clean RISC architecture where you might build a wide uniform decoder that can fetch and stream 4, 8, or even more instructions to the processor backend each cycle, doing this at a sustained rate on x86 would require a much wider fetch (as each instruction might be from 1 to 15 bytes in length), and a lot of bit and byte swizzling, instruction length boundary detection, and so on. It’s a huge potential bottleneck, so the uop cache allows a higher sustained rate of instruction throughput than otherwise.

Furthermore, adding the uop cache means that the backend of a modern x86 processor isn’t all that different from any other RISC machine. If you want to learn more about that, search for the original “RISC86” work done by Greg Favor at NexGen (later AMD).

The problem with having a uop cache is that it adds an observable timing side-channel into the mix. Depending upon which instructions are executing, they may be fetched from the uop cache, or they may have to be decoded directly from the x86 macro ops. Worse, the uop cache is populated during instruction fetch/decode prior to any part of the backend getting involved. This means that existing “Spectre” mitigations that rely upon controlling execution happen too late in the pipeline – the offending uops are already present by then.

The paper first determines the “organization” (cache size and structure – sets and ways, etc.) and then conclusively shows that the uop cache implementation used by modern Intel processors features a “hot replacement” policy that guarantees recently executed uops will be present. It then introduces a “tiger” / “zebra” displacement attack that can be used by an attacker in order to determine which uops were recently executed. The uop cache structure used by Intel in their popular “Skylake” derived designs is revealed to feature lines of 6 uops that are organized into an 8-way set-associative cache with 32-sets. Instructions are “streamed” out of the uop cache set-by-set sequentially, following various rules.

It is claimed that Intel processors do not competitively share uop cache space between sibling “Hyperthreads” of an SMT (Simultaneous Multi-Threading) core. This means that the uop cache is split physically (but evenly) between two SMT threads in a way that one cannot measure the other. However, the researchers claim that certain AMD processors do have competitive sharing of SMT resources, meaning that a sibling thread might be able to monitor the uop cache footprint of the other. At the same time, it should be noted that the entire industry is moving away from a model of mixing trusted and untrusted workloads across sibling hyperthreads. Today, it is more common to give siblings only to the same workload, precisely because of the number of concerns that have emerged about SMT.

The rules for which instructions can live in the uop cache include a limit on the number of uops per macro instruction (“a given 32-byte code region may consume a maximum of 3 lines in the set (i.e. up to 18 micro-ops)”), as well as requiring that an unconditional branch must be last instruction in a uop group. There are many other constraints, which when taken together with the determined replacement policy allow sequences of instructions to be manufactured that will have a known layout in the uop cache. One group of “tigers” interfere with other “tigers” because they intentionally map to the same uop cache sets, while “tigers” and “zebra” are constructed to be mutually exclusive with one another.

As a result of determining the behavior of the uop cache, it is possible to construct sequences of code that can be used for covert communication through the uop cache (fun), or (worse) leverage measurements of the uop cache to determine when a data dependent branch has caused instructions for that dependent branch to be populated into the cache.

The latter case is most dangerous. It allows for (potential) circumvention of some of the Spectre-v1 (bounds check bypass) mitigations commonly adopted in software. Let’s remind ourselves how Spectre worked. It all comes down to memory being dramatically slower than compute, meaning that processors need to frequently wait for reads (loads) from memory to complete. Out-of-Order execution allows the backend to get around this in two ways. First, if a load is outstanding but some immediately subsequent code does not depend upon the result, then the subsequent code can be executed and the ordering restored later. Second, the processor can enter a mode of “speculation” in which it guesses the result of the load. Ignoring value prediction for this article, consider this instead:

char array[1024];
int array_size = 1024;

char secret[1024];
extern uint8_t victim_function(size_t i) {
    // bounds check:
    if (i >= 0 && i < array_size) {
        // misspeculation of this branch
        // bypasses the bounds check
        return array[i];
    }
    return -1;
}

Listing: Victim Method for the Variant-1 Attack (as given in “Listing 4” in the paper)

When the processor reaches the “if” branch above, which performs the bounds check on the attacker controlled “i” variable, it does not know whether to proceed until it has “resolved” the branch by performing the necessary load of array_size and the comparison. While this load is happening, the processor may nonetheless speculate ahead that the bounds check will succeed, especially if it has done so in the past (and thus the branch predictor has been “trained” to behave a certain way). In Spectre, the attacker causes secret leakage by first exceeding a bounds check such that “secret” which immediately follows “array” in memory becomes the target, and then finding some “gadget” in the code following the bounds check that performs a second memory access based upon the value of the first. This data-dependent second access leaves breadcrumbs in the L1 data cache of the processor.

In Spectre, the second access after the first is important because it causes the L1 data cache at a predictable location (based upon its organization) to become populated based upon whether or not the first (secret) memory had a certain value. Timing analysis of the data cache access latency at that second location can subsequently be used to determine whether it was populated or not. Through careful analysis of target programs, it is possible to find sufficient Spectre “gadgets” to leak secrets one bit at a time (or better).

Mitigating Spectre-v1 typically requires restraining speculation beyond sensitive branches (either identified by manual review, or in an automated fashion using tooling or compilers). One way in which this is performed is to use a form of serializing memory barrier known as a load fence (LFENCE) on x86. Inserting an LFENCE as a Spectre mitigation works because it seeks to prevent the backend of the machine from executing any further loads beyond the potentially dangerous load. However, it apparently doesn’t prevent the frontend of the machine from potentially fetching code (e.g. from a branch) beyond the bounds check.

In the paper, the researchers replace a secret-dependent second load with a secret-dependent jump or function call. The latter causes the processor to begin preparing to execute the instructions following the branch, ensuring they are present in the uop cache. Using the “tiger” and “zebra” analysis methods previously developed in the paper, the researchers are subsequently able to determine the presence or absence of uops and to reconstruct the secret. Finally (of course) they were able to find a few example “gadgets” in the Linux kernel source that could be theoretically exploited to leak data in this manner.

The researchers primarily focus on same address space / same privilege level attacks where a different context (e.g. managed code runtime, such as Java or JavaScript) might be interpreting untrusted code within a sandbox. Practical attacks of this form are likely to be fairly minimal, however, due to the extensive work done by the industry in the wake of Spectre (and its companions) to separate untrusted code into separate address spaces. It is, for example, no longer the case that web browsers are running untrusted JavaScript code within an interpreter that exists within the same address space as trusted browser code.

More worrying, however, is the potential for cross-domain attacks (one privilege level to another), or cross-SMT thread attacks. The researchers claim that they were able to perform user-to-kernel type attacks (and presumably this could be a kernel to hypervisor in place of user to kernel). They were not able to target Intel’s SGX (Software Guard Extensions, also known as enclaves) because entry into an enclave causes the processor’s instruction TLB (Translation Lookaside Buffer, a structure that caches memory translations between the “virtual” memory view used by programs, and the physical memory underneath) to be invalidated and this has the side-effect of invalidating the uop cache as a benefit.

But the cross-domain attack is a potential concern. If it turns out to be significant, then a logical fix would be to invalidate the uop cache when crossing this privilege boundary. We know that this can be done by invalidating the iTLB (and we know this is also a side effect of the original “Meltdown” Page Table Isolation or “PTI” solution as it causes the kernel and userspace to use entirely different address spaces – what we don’t know is whether all of the optimizations for this such as PCID invalidations also cause uop cache invalidation).

Overall, the paper is interesting reading. It’s far from the world-ending sensationalism implied by the “Defenseless” language on the Virginia site, and in the press pick up thus far. Quite the contrary really. The industry had a huge problem on its hands with Spectre, and as a direct consequence a great deal of effort was invested in separating privilege, isolating workloads, and using different contexts. There may be some cleanup needed in light of this latest paper, but there are mitigations available, albeit always at some performance cost.