This series of articles provides an overview of the procedure one can follow when porting the Linux kernel to a new processor architecture. Part 1 and part 2 focused on the non-code-related groundwork and the early code, from the assembly boot code to the creation of the first kernel thread. Following on from those, the series concludes by looking at the last portion of the procedure. As will be seen, most of the remaining work for launching the init process deals with thread and process management.
Spawning kernel threads
When start_kernel() performs its last function call (to rest_init()), the memory-management subsystem is fully operational, the boot processor is running and able to process both exceptions and interrupts, and the system has a notion of time.
While the execution flow has so far been sequential and mono-threaded, the main job handled by rest_init() before turning into the boot idle thread is to create two kernel threads: kernel_init, which will be discussed in the next section, and kthreadd. As one can imagine, creating these kernel threads (and any other kinds of threads for that matter, from user threads within the same process to actual processes) implies the existence of a complex process-management infrastructure. Most of the infrastructure to create a new thread is not architecture-specific: operations such as copying the task_struct structure or the credentials, setting up the scheduler, and so on do not usually need any architecture-specific code. However, the process-management code must define a few architecture-specific parts, mainly for setting up the stack for each new thread and for switching between threads.
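For reference, the generic code that spawns these two threads has the following shape (a simplified rendition of rest_init() from init/main.c; the exact code varies between kernel versions):

    /*
     * Simplified from rest_init() in init/main.c: spawn the two
     * kernel threads, then let the boot thread (PID 0) become the
     * idle thread of the boot processor.
     */
    static void rest_init(void)
    {
        kernel_thread(kernel_init, NULL, CLONE_FS);             /* PID 1 */
        kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);  /* PID 2 */

        /* ... scheduler bookkeeping elided ... */
        cpu_startup_entry(CPUHP_ONLINE);    /* never returns */
    }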
Linux always avoids creating new resources from scratch, especially new threads. With the exception of the initial thread (the one that has so far been booting the system and that we have implicitly been discussing), the kernel always duplicates an existing thread and modifies the copy to make it into the desired new thread. The same principle applies after thread creation, when the new thread's execution begins for the first time, as it is easier to resume the execution of a thread than to start it from scratch. This mainly means that the newly allocated stack must be initialized such that when switching to the new thread for the first time, the thread looks like it is resuming its execution—as if it had simply been stopped earlier.
To further understand this mechanism, it is necessary to delve a bit into the thread-switching mechanism, and more specifically into the switch of execution flow implemented by the architecture-specific context-switching routine switch_to(). This routine, which is always written in assembly language, is always called by the current (soon to be previous) thread while returning as the next (future current) thread. Part of this trick is achieved by saving the current context on the stack of the current thread, switching stack pointers to use the stack of the next thread, and restoring the saved context from it. As with a typical function, switch_to() finally returns to the “calling” function using the instruction address that had been saved on the stack of the newly current thread.
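The C-side wiring of this routine often looks like the following hypothetical sketch (the thread.ksp field name is a placeholder; every port organizes its struct thread_struct differently), with the conceptual steps of the assembly routine given as comments:

    /* Hypothetical sketch of an architecture's switch_to() wiring.
     * The "last" argument tells the scheduler, once we are running
     * on next's stack, which thread we actually switched from. */
    #define switch_to(prev, next, last)                 \
        do {                                            \
            (last) = __switch_to((prev), (next));       \
        } while (0)

    /*
     * __switch_to() is written in assembly and conceptually:
     *   1. saves the callee-saved registers on prev's kernel stack;
     *   2. stores the stack pointer in prev->thread.ksp;
     *   3. loads next->thread.ksp into the stack pointer register;
     *   4. restores the callee-saved registers from next's stack;
     *   5. returns to whatever return address next's stack holds.
     */
    extern struct task_struct *__switch_to(struct task_struct *prev,
                                           struct task_struct *next);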
In the case that the next thread had previously been running and was temporarily removed from the processor, returning to the calling function would be a normal event that would eventually lead the thread to resume the execution of its own code. However, for a brand-new thread, there would not have been any function to call switch_to() in order to save the thread's context. This is why the stack of a new thread must be initialized to pretend that there has been a previous function call, enabling switch_to() to return after restoring this new thread. Such a function is usually set up to be a few lines of assembly code acting as a trampoline to the thread's actual code, as sketched below.
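Concretely, this fake frame is built by the architecture-specific copy_thread() function. The following is a hypothetical sketch (the thread.ra/thread.ksp/kthread_fn fields, the a0 register name, and ret_from_fork are placeholders; the signature shown is the one used around the 3.x/4.x kernels this series targets):

    /*
     * Hypothetical sketch of copy_thread(): set up the new thread's
     * stack and saved context so that the first switch_to() to it
     * "returns" into the ret_from_fork trampoline.
     */
    int copy_thread(unsigned long clone_flags, unsigned long usp,
                    unsigned long arg, struct task_struct *p)
    {
        struct pt_regs *childregs = task_pt_regs(p);

        if (unlikely(p->flags & PF_KTHREAD)) {
            /* Kernel thread: the trampoline will call usp(arg) */
            memset(childregs, 0, sizeof(struct pt_regs));
            p->thread.kthread_fn  = usp;
            p->thread.kthread_arg = arg;
        } else {
            /* User thread: clone the parent's user-mode registers */
            *childregs = *current_pt_regs();
            childregs->sp = usp;    /* new user stack */
            childregs->a0 = 0;      /* fork() returns 0 in the child */
        }

        /* The first switch_to() will "return" to the trampoline */
        p->thread.ra  = (unsigned long)ret_from_fork;
        p->thread.ksp = (unsigned long)childregs;
        return 0;
    }

The ret_from_fork trampoline itself is then just a few assembly instructions: it calls schedule_tail() to finish the bookkeeping of the context switch, then either jumps to the stored kernel-thread function or drops back to user mode.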
Note that switching to a kernel thread does not generally involve switching to another page table, since the kernel address space, in which all kernel threads run, is defined in every page table structure. For user processes, the switch to their own page table is performed by the architecture-specific routine switch_mm().
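A minimal sketch of such a routine, assuming a hypothetical write_ptbr() helper for the page-table base register:

    /*
     * Hypothetical sketch: make the MMU walk the next process's page
     * directory. write_ptbr() is a placeholder; real ports may also
     * have to manage ASIDs instead of flushing the whole TLB.
     */
    static inline void switch_mm(struct mm_struct *prev,
                                 struct mm_struct *next,
                                 struct task_struct *tsk)
    {
        if (prev != next) {
            write_ptbr(virt_to_phys(next->pgd));
            local_flush_tlb_all();
        }
    }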
The first kernel thread
As explained in the source code, the only reason the kernel thread kernel_init is created first is that it must obtain PID 1. This is the PID that the init process (i.e. the first user space process born from kernel_init) traditionally inherits.
Interestingly, the first task of kernel_init is to wait for the second kernel thread, kthreadd, to be ready. kthreadd is the kernel thread daemon in charge of asynchronously spawning new kernel threads whenever requested. Once kthreadd is started, kernel_init proceeds with the second phase of booting, which includes a few architecture-specific initializations.
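Once kthreadd is up, the rest of the kernel can request new kernel threads through the standard kthread API; for example (my_thread_fn being, of course, purely illustrative):

    #include <linux/err.h>
    #include <linux/kthread.h>

    /* Illustrative thread function: do some work once a second
     * until someone calls kthread_stop() on the thread. */
    static int my_thread_fn(void *data)
    {
        while (!kthread_should_stop())
            schedule_timeout_interruptible(HZ);
        return 0;
    }

    /* Somewhere in initialization code: ask kthreadd to spawn
     * the thread and wake it up. */
    static int __init my_thread_setup(void)
    {
        struct task_struct *task;

        task = kthread_run(my_thread_fn, NULL, "my_thread");
        return IS_ERR(task) ? PTR_ERR(task) : 0;
    }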
In the case of a multiprocessor system, kernel_init begins by starting the other processors before initializing the various subsystems composing the driver model (e.g. devtmpfs, devices, buses, etc.) and, later, using the defined initialization calls to bring up the actual device drivers for the underlying hardware system. Before getting into the “fancy” device drivers (e.g. block device, framebuffer, etc.), it is probably a good idea to focus on having at least an operational terminal (by implementing the corresponding driver if necessary), especially since the early console set up by early_printk() is supposed to be replaced by a real, full-featured console shortly after.
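These initialization calls are the well-known initcalls: each driver registers an init function at a given level with one of the *_initcall() macros, and kernel_init() invokes them level by level during this phase. A minimal, hypothetical example:

    #include <linux/init.h>

    /* Hypothetical UART driver setup, run at the "device" initcall
     * level during the second phase of booting. */
    static int __init my_uart_init(void)
    {
        /* probe the hardware and register the driver here */
        return 0;
    }
    device_initcall(my_uart_init);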
It is also through these initialization calls that the initramfs is unpacked and the initial root filesystem (rootfs) is mounted. There are a few options for mounting an initial rootfs but I have found initramfs to be the simplest when porting Linux. Basically this means that the rootfs is statically built at compilation time and integrated into the kernel binary image. After being mounted, the rootfs can give access to the mandatory /init and /dev/console.
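With CONFIG_INITRAMFS_SOURCE pointing to a description file, the build system (usr/gen_init_cpio) generates the archive and links it into the kernel image. A minimal description providing the two mandatory entries could look like this (the source path for /init is illustrative):

    # Format consumed by usr/gen_init_cpio:
    #   dir  <name> <mode> <uid> <gid>
    #   nod  <name> <mode> <uid> <gid> <type> <major> <minor>
    #   file <name> <source> <mode> <uid> <gid>
    dir  /dev 0755 0 0
    nod  /dev/console 0600 0 0 c 5 1
    file /init /path/to/cross-compiled/init 0755 0 0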
Finally, the init memory is freed (i.e. the memory containing code and data that were used only during the initialization phase and that are no longer needed) and the init process that has been found on the rootfs is launched.
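The tail end of kernel_init() looks roughly like this (simplified from init/main.c; with an initramfs, ramdisk_execute_command defaults to "/init"):

    /* Simplified from the end of kernel_init() in init/main.c. */
    free_initmem();

    if (ramdisk_execute_command &&
        !run_init_process(ramdisk_execute_command))
        return 0;

    if (!run_init_process("/sbin/init") ||
        !run_init_process("/etc/init")  ||
        !run_init_process("/bin/init")  ||
        !run_init_process("/bin/sh"))
        return 0;

    panic("No working init found.");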
Executing init
At this point, launching init will probably result in an immediate fault when trying to fetch the first instruction. This is because, as with creating threads, being able to execute the init process (and actually any user-space application) first involves a bit of groundwork.
The function that needs to be implemented in order to solve the instruction-fetching issue is the page fault handler. Linux is lazy, particularly when it comes to user applications and, by default, does not pre-load the text and data of applications into memory. Instead, it only sets up all of the kernel structures that are strictly required and lets applications fault at their first instruction because the pages containing their text segment have usually not been loaded yet.
This is actually perfectly intentional behavior since it is expected that such a memory fault will be caught and fixed by the page fault handler. This handler can be seen as an intricate switch statement that is able to treat every fault related to memory: from vmalloc() faults that necessitate a synchronization with the reference page table to stack expansions in user applications. In this case, the handler will determine that the page fault corresponds to a valid virtual memory area (VMA) of the application and will consequently load the missing page in memory before retrying to run the application.
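A skeleton of the user-space side of such a handler, following the pattern common to most ports at the time of this series (the signature and locking details vary between architectures and kernel versions):

    /*
     * Hypothetical skeleton of an architecture's do_page_fault();
     * only the "valid user access" path is shown. vmalloc faults,
     * exception-table fixups, and signal delivery are omitted.
     */
    void do_page_fault(struct pt_regs *regs, unsigned long address)
    {
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;
        unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, address);
        if (!vma)
            goto bad_area;
        if (vma->vm_start > address) {
            /* maybe a user-stack access just below the stack VMA */
            if (!(vma->vm_flags & VM_GROWSDOWN) ||
                expand_stack(vma, address))
                goto bad_area;
        }

        /* Valid VMA: let the generic code load the missing page */
        handle_mm_fault(mm, vma, address, flags);
        up_read(&mm->mmap_sem);
        return;

    bad_area:
        up_read(&mm->mmap_sem);
        /* SIGSEGV for user mode, oops/fixup for kernel mode */
    }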
Once the page fault handler is able to catch memory faults, it is likely that an extremely simple init process can be executed. However, it will not be able to do much as it cannot yet request any service from the kernel through system calls, such as printing to the terminal. To this end, the system-call infrastructure must be completed with a few architecture-specific parts. System calls are treated as software interrupts since they are accessed by a user instruction that makes the processor automatically switch to kernel mode, like hardware interrupts do. Besides defining the list of system calls supported by the port, handling system calls involves enhancing the interrupt and exception handler with the additional ability to receive them.
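For a port relying on the generic system call list of include/uapi/asm-generic/unistd.h, the table can be built with the __SYSCALL macro trick used by several existing ports; every slot defaults to sys_ni_syscall() (“not implemented”):

    /*
     * Sketch of an architecture's system call table, generated from
     * the generic list in <asm/unistd.h>.
     */
    #undef __SYSCALL
    #define __SYSCALL(nr, call) [nr] = (call),

    void *sys_call_table[__NR_syscalls] = {
        [0 ... __NR_syscalls - 1] = sys_ni_syscall,
    #include <asm/unistd.h>
    };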
Once there is support for system calls, it should now be possible to execute a “hello world” init that is able to open the main console and write a message. But there are still missing pieces in order to have a full-featured init that is able to start other applications and communicate with them as well as exchange data with the kernel.
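Such an init, statically linked and relying on only a handful of system calls, can be as small as this:

    /* Minimal "hello world" init: needs little more than open()
     * and write() to be working. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/console", O_RDWR);

        write(fd, "Hello, world!\n", 14);
        for (;;)
            ;    /* PID 1 must never exit: that would panic the kernel */
        return 0;
    }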
The first step toward this goal concerns the management of signals and, more particularly, signal delivery (either from another process or from the kernel itself). If a process has defined a handler for a specific signal, then this handler must be called whenever the given signal is pending. Such an event occurs when the targeted process is about to get scheduled again. More specifically, this means that when resuming the process, right at the moment of the next transition back to user mode, the execution flow of the process must be altered in order to execute the handler instead. Some space must also be made on the application's stack for the execution of the handler. Once the handler has finished its execution and has returned to the kernel (via a system call that had been previously injected into the handler's context), the context of the process is restored so that it can resume its normal execution.
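The architecture-specific entry point for this is typically a routine invoked from the assembly return path when the thread has pending work; a hypothetical sketch:

    /*
     * Hypothetical sketch: called from the assembly return-to-user
     * path when TIF_SIGPENDING is set. do_signal() builds the frame
     * for the handler on the user stack and redirects the saved user
     * program counter to the handler.
     */
    asmlinkage void do_notify_resume(struct pt_regs *regs,
                                     unsigned long thread_flags)
    {
        if (thread_flags & _TIF_SIGPENDING)
            do_signal(regs);
    }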
The second and last step for fully running user-space applications deals with user-space memory access: when the kernel wants to copy data from or to user-space pages. Such an operation can be quite dangerous if, for example, the application gives a bogus pointer, which would potentially result in kernel panics (or security vulnerabilities) if it is not checked properly. To circumvent this problem, it is necessary to write architecture-specific routines that use some assembly magic to register the addresses of all of the instructions performing the actual accesses to the user-space memory in an exception table. As explained in this LWN article from 2001, “if ever a fault happens in kernel mode, the fault handler scans through the exception table trying to match the address of the faulting instruction with a table entry. If a match is found, a special error exit is taken, the copy operation fails gracefully, and the system call returns a segmentation fault error.”
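In practice, this boils down to two pieces: an entry recorded for each user-access instruction, and a check in the kernel-mode path of the page fault handler. A sketch (regs->pc is a hypothetical field name; some architectures store relative offsets rather than absolute addresses):

    /* The classic form of an exception table entry: */
    struct exception_table_entry {
        unsigned long insn;     /* address of the faulting instruction */
        unsigned long fixup;    /* where to resume after the fault */
    };

    /* In the kernel-mode path of the page fault handler: */
    static int fixup_exception(struct pt_regs *regs)
    {
        const struct exception_table_entry *fixup;

        fixup = search_exception_tables(regs->pc);
        if (fixup) {
            regs->pc = fixup->fixup;    /* take the error exit */
            return 1;
        }
        return 0;    /* genuine kernel fault: oops */
    }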
Conclusion
Once a full-featured init process is able to run and give access to a shell, it probably signals the end of the porting process. But it is most likely only the beginning of the adventure, as the port now needs to be maintained (as the internal APIs sometimes change quickly), and can also be enhanced in numerous ways: adding support for multiprocessor and NUMA systems, implementing more device drivers, etc.
By describing the long journey of porting Linux to a new processor architecture, I hope that this series of articles will contribute to remedying the lack of documentation in this area and will help the next brave programmer who one day embarks upon this challenging, but ultimately rewarding, experience.