ACCESS_ONCE() and compiler bugs

The ACCESS_ONCE() macro is used throughout the kernel to ensure that code generated by the compiler will access the indicated variable once (and only once); see this article for details on how it works and when its use is necessary. When that article was written (2012), there were 200 invocations of ACCESS_ONCE() in the kernel; now there are over 700 of them. Like many low-level techniques for concurrency management, ACCESS_ONCE() relies on trickery that is best hidden from view. And, like such techniques, it may break if the compiler changes behavior or, as has been seen recently, contains a bug.

Back in November, Christian Borntraeger posted a message regarding the interactions between ACCESS_ONCE() and an obscure GCC bug. To understand the problem, it is worth looking at the macro, which is defined simply in current kernels (in <linux/compiler.h>):

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
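
As an illustration of how the macro is typically used, here is a minimal user-space sketch. The flag `stop_requested` and the helpers `wait_for_stop()` and `request_stop()` are hypothetical names invented for this example, and `__typeof__` is the GCC/Clang spelling of `typeof` that also works in strict ISO modes.

```c
#include <stdbool.h>

/* User-space copy of the kernel macro (GCC/Clang extension: __typeof__). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical flag that another thread or a signal handler may set. */
static int stop_requested;

/*
 * Poll the flag up to max_spins times. Without ACCESS_ONCE() the compiler
 * would be allowed to load stop_requested once, before the loop, and reuse
 * that value; the volatile cast forces one load per iteration.
 */
static bool wait_for_stop(unsigned long max_spins)
{
    while (max_spins--) {
        if (ACCESS_ONCE(stop_requested))
            return true;
    }
    return false;
}

/* The macro yields an lvalue, so it can also appear on the left-hand side. */
static void request_stop(void)
{
    ACCESS_ONCE(stop_requested) = 1;
}
```

Here the variable is a plain int, which is the scalar case that works as intended.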

In short, ACCESS_ONCE() forces the variable to be treated as being a volatile type, even though it (like almost all variables in the kernel) is not declared that way. The problem reported by Christian is that GCC 4.6 and 4.7 will drop the volatile modifier if the variable passed into the macro is not of a scalar type. It works fine if x is an int, for example, but not if x has a more complicated type. ACCESS_ONCE() is often used with page table entries, for instance, which are defined as having the pte_t type:

typedef struct {
    unsigned long pte;
} pte_t;

In this case, the volatile semantics will be lost in buggy compilers, leading to buggy kernels. Christian started by looking for ways to work around the problem, only to be informed that normal kernel practice is to avoid working around compiler bugs whenever possible; instead, the buggy versions should simply be blacklisted in the kernel build system. But 4.6 and 4.7 are installed on a lot of systems; blacklisting them would inconvenience many users. And, as Linus put it, there can be reasons for approaches other than blacklisting:

So I do agree with Heiko that we generally don't want to work around compiler bugs if we can avoid it. But sometimes the compiler bugs do end up saying “you're doing something very fragile”. Maybe we should try to be less fragile here.

One way of being less fragile would be to change the affected ACCESS_ONCE() calls to point to the scalar parts of the relevant non-scalar types. So, if code does something like:

pte_t p = ACCESS_ONCE(pte);

It could be changed to something like:

unsigned long p = ACCESS_ONCE(pte.pte);

This type of change requires auditing all ACCESS_ONCE() calls, though, to find the ones using non-scalar types; that would be a lengthy and error-prone process that would not prevent the addition of new bugs in the future.
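
A self-contained sketch of that scalar-member approach might look like the following. The helper name `pte_snapshot()` is hypothetical, and the example assumes the entry is reached through a pointer.

```c
/* User-space copy of the kernel macro (GCC/Clang extension: __typeof__). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

typedef struct {
    unsigned long pte;
} pte_t;

/*
 * Hypothetical helper: take a snapshot of a page table entry by reading
 * only its scalar member. The macro now sees an unsigned long rather than
 * a struct, so the volatile qualifier applies to a scalar type.
 */
static unsigned long pte_snapshot(pte_t *ptep)
{
    return ACCESS_ONCE(ptep->pte);
}
```

The cost is that every caller has to know the internal layout of the type, which is part of why this kind of change needs a manual audit.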

Another approach to the problem explored by Christian was to remove a number of problematic ACCESS_ONCE() calls and just put in a compiler barrier with barrier() instead. In many cases, a barrier is sufficient, but in others it is not. Once again, a detailed audit is required, and there is nothing preventing new code from adding buggy ACCESS_ONCE() calls.
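
For comparison, a compiler barrier can be sketched as below. The macro has the same form as the kernel's barrier() for GCC, an empty asm statement with a "memory" clobber; the variable `shared_value` and the function `sample_twice_differs()` are hypothetical names for this example.

```c
/*
 * Same form as the kernel's barrier(): an empty asm statement with a
 * "memory" clobber, telling the compiler that memory may have changed.
 */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical variable that some other context may modify. */
static int shared_value;

/* Returns 1 if two samples taken around the barrier differ, 0 otherwise. */
static int sample_twice_differs(void)
{
    int first = shared_value;
    barrier();                  /* any cached copy of shared_value is discarded */
    int second = shared_value;  /* so this is a fresh load from memory */
    return first != second;
}
```

Note the difference in scope: barrier() makes the compiler forget everything it knows about memory at one point in the code, while ACCESS_ONCE() constrains a single access to a single variable.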

So Christian headed down the path of changing ACCESS_ONCE() to simply disallow the use of non-scalar types altogether. In the most recent version of the patch set, ACCESS_ONCE() looks like this:

#define __ACCESS_ONCE(x) ({ \
       __maybe_unused typeof(x) __var = 0; \
       (volatile typeof(x) *)&(x); })
#define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
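
The type check comes from the otherwise unused `__var` declaration: initializing with 0 is valid for integers and pointers but not for a struct. The following user-space sketch shows both cases; the `__maybe_unused` definition and the function names `read_scalar()` and `read_struct()` are assumptions made for this example, not part of the patch.

```c
/* User-space stand-in for the kernel's __maybe_unused annotation. */
#define __maybe_unused __attribute__((unused))

#define __ACCESS_ONCE(x) ({ \
        __maybe_unused __typeof__(x) __var = 0; \
        (volatile __typeof__(x) *)&(x); })
#define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))

typedef struct {
    unsigned long pte;
} pte_t;

/* Accepted: "unsigned long __var = 0;" is a valid declaration. */
static unsigned long read_scalar(unsigned long *p)
{
    return ACCESS_ONCE(*p);
}

/*
 * Rejected if uncommented: the macro would expand to "pte_t __var = 0;",
 * and 0 is not a valid initializer for a struct, so the build fails.
 *
 * static pte_t read_struct(pte_t *p)
 * {
 *     return ACCESS_ONCE(*p);
 * }
 */
```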

This version will cause compilation failures if a non-scalar type is passed into the macro. But what about the situations where a non-scalar type needs to be used? For these cases, Christian has introduced two new macros, READ_ONCE() and ASSIGN_ONCE(). The definition of the former looks like this:

static __always_inline void __read_once_size(volatile void *p, void *res, int size)
{
    switch (size) {
    case 1: *(u8 *)res = *(volatile u8 *)p; break;
    case 2: *(u16 *)res = *(volatile u16 *)p; break;
    case 4: *(u32 *)res = *(volatile u32 *)p; break;
#ifdef CONFIG_64BIT
    case 8: *(u64 *)res = *(volatile u64 *)p; break;
#endif
    }
}

#define READ_ONCE(p) \
      ({ typeof(p) __val; __read_once_size(&p, &__val, sizeof(__val)); __val; })

Essentially, it works by forcing the use of scalar types, even if the variable passed in does not have such a type. Providing a single access macro that worked on both the left-hand and right-hand sides of an assignment turned out to not be trivial, so the separate ASSIGN_ONCE() was provided for the left-hand side case.
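
To make the pairing concrete, here is a user-space sketch of both macros applied to pte_t. READ_ONCE() follows the definition above; the ASSIGN_ONCE() body, its argument order, and `__assign_once_size()` are assumptions written to mirror READ_ONCE(), not the text of the patch. The sketch also assumes a 64-bit build, so the CONFIG_64BIT guard is dropped, and `pte_read()` and `pte_write()` are hypothetical helpers.

```c
#include <stdint.h>

typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

typedef struct {
    unsigned long pte;
} pte_t;

/* Copy size bytes out of *p with a single volatile scalar load. */
static inline void __read_once_size(volatile void *p, void *res, int size)
{
    switch (size) {
    case 1: *(u8 *)res = *(volatile u8 *)p; break;
    case 2: *(u16 *)res = *(volatile u16 *)p; break;
    case 4: *(u32 *)res = *(volatile u32 *)p; break;
    case 8: *(u64 *)res = *(volatile u64 *)p; break;
    }
}

/* Assumed mirror image: store size bytes into *p with one volatile store. */
static inline void __assign_once_size(volatile void *p, void *res, int size)
{
    switch (size) {
    case 1: *(volatile u8 *)p = *(u8 *)res; break;
    case 2: *(volatile u16 *)p = *(u16 *)res; break;
    case 4: *(volatile u32 *)p = *(u32 *)res; break;
    case 8: *(volatile u64 *)p = *(u64 *)res; break;
    }
}

#define READ_ONCE(p) \
    ({ __typeof__(p) __val; \
       __read_once_size(&(p), &__val, sizeof(__val)); __val; })

/* Assumed signature: value first, destination second. */
#define ASSIGN_ONCE(val, p) \
    ({ __typeof__(p) __val = (val); \
       __assign_once_size(&(p), &__val, sizeof(__val)); __val; })

/* Hypothetical helpers showing both macros on a non-scalar type. */
static pte_t pte_read(pte_t *ptep)
{
    return READ_ONCE(*ptep);
}

static void pte_write(pte_t *ptep, pte_t val)
{
    (void)ASSIGN_ONCE(val, *ptep);
}
```

The pointer casts rely on the kind of type punning that the kernel permits by building with -fno-strict-aliasing; user-space code that borrows this pattern should keep that in mind.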

Christian's patch set replaces ACCESS_ONCE() calls with READ_ONCE() or ASSIGN_ONCE() in cases where the latter are needed. Comments in the code suggest that those macros should be preferred to ACCESS_ONCE() in the future, but most existing ACCESS_ONCE() calls have not been changed. Developers using ACCESS_ONCE() to access non-scalar types in the future will get an unpleasant surprise from the compiler, though.

This version of the patch has received few comments and seems likely to make it into the mainline in the near future; backports to the stable series are also probably on the agenda. There are times when it is best to simply avoid versions of the compiler with known bugs altogether. But, as can be seen here, compiler bugs can also be seen as a signal that things could be done better in the kernel, leading to more robust code overall.
