I have been an active user of various Linux distributions since college, including Ubuntu, Arch Linux, Debian, and most recently Fedora.
There are two different approaches for communicating between the CPU and peripheral hardware on x86: memory-mapped I/O and port-mapped I/O.
Generate documentation written in reStructuredText using make htmldocs or make pdfdocs, which will be output in Documentation/output.
There are a number of LWN articles covering atomic operations in the kernel.
There are tons of really cool features in the kernel, including the following:
container_of() is a macro that provides a pointer to the struct that another struct is a member of.
BUILD_BUG_ON() will force a compilation error if a condition is true.
ioctl() is a system call for issuing device-specific commands to a driver, and Netlink sockets are another way to communicate between user space and the kernel.
Errors are often returned as pointers that can be checked with IS_ERR() and then decoded with PTR_ERR().
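As a minimal sketch of container_of() (struct my_device and its node member are made up for illustration):

#include <linux/kernel.h>
#include <linux/list.h>

struct my_device {                      /* hypothetical containing struct */
        int id;
        struct list_head node;          /* embedded member */
};

static struct my_device *node_to_my_device(struct list_head *entry)
{
        /* Given a pointer to the embedded member, recover a pointer
         * to the struct my_device that contains it. */
        return container_of(entry, struct my_device, node);
}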
The typical boot process consists of a Boot ROM in the target chip that loads firmware from a location that may be configured using jumpers. In embedded systems that firmware often consists of the U-Boot Secondary Program Loader (SPL), which in turn loads U-Boot (see Booting ARM Linux).
The Linux boot process can be measured using tools like bootchart or systemd-analyze.
Each process has a virtual address space composed of pages (e.g. 4 KB on most architectures, 8 KB or larger on some 64-bit architectures) that may be valid if backed by physical memory or secondary storage, or invalid. Secondary storage is used by the kernel to hold pages that it believes won't be needed soon. If a program accesses such a page the MMU generates a page fault, which causes the kernel to bring the page back into physical memory.
Copy-on-write occurs when a page shared between two or more processes is modified by one of them. The MMU generates a fault and the kernel creates a new private copy of the page for the writing process.
Virtual memory is divided into blocks or regions like the text segment, stack, heap, bss segment (global variables), and mapped files like the C library and other shared libraries (see /proc/self/maps).
Allocation is done using numerous families of functions combined with Get Free Pages (GFP) flags (e.g. GFP_KERNEL, GFP_NOWAIT, GFP_ATOMIC).
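For instance, a rough sketch (the helper and its arguments are made up) of picking a flag based on the calling context:

#include <linux/slab.h>
#include <linux/types.h>

/* GFP_KERNEL may sleep and is only valid in process context;
 * GFP_ATOMIC never sleeps but is more likely to fail. */
static void *example_alloc(size_t len, bool atomic_context)
{
        gfp_t flags = atomic_context ? GFP_ATOMIC : GFP_KERNEL;

        return kmalloc(len, flags);
}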
A slab allocator reduces fragmentation by retaining memory containing an object of a certain type after deinitialization so that it may be reused when an object of the same type is initialized. It is a critical enhancement since allocating and deallocating memory has significant overhead. Slabs are pre-allocated chunks of memory of various sizes.
There are three slab allocators: SLOB (1991-1999), SLAB (1999-2008), and SLUB (2008-?). SLUB improved debugging, defragmentation, and execution time.
This caching of many identical objects is available through kmem_cache_alloc() rather than kmalloc().
Alternatively, kzalloc() zeroes out memory. The kmalloc family should generally only be used for objects smaller than the page size. Considerations when allocating memory include alignment, size, and whether the memory must be physically contiguous.
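A minimal sketch of the kmem_cache API mentioned above (struct foo and the cache name are made up):

#include <linux/errno.h>
#include <linux/slab.h>

struct foo {
        int a, b;
};

static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
        /* One cache holds many identically sized objects. */
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                      0, SLAB_HWCACHE_ALIGN, NULL);
        return foo_cache ? 0 : -ENOMEM;
}

static struct foo *foo_alloc(void)
{
        return kmem_cache_alloc(foo_cache, GFP_KERNEL);
}

static void foo_free(struct foo *f)
{
        kmem_cache_free(foo_cache, f);
}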
Synchronization of shared resources may be accomplished with a spinlock or with blocking primitives (e.g. a mutex), which may suspend or sleep. A local_lock disables preemption or interrupts on the local CPU, but does not provide inter-CPU concurrency control.
Only spinlocks may be used in an interrupt context. That is because code executing in an interrupt context may not be suspended (see Interrupts). If it is suspended then a critical error has occurred.
Spinlocks and mutexes in the Linux kernel are not recursive, since re-acquiring a lock that is already held would result in a deadlock. A deadly embrace may occur with two or more locks: if processes on multiple CPUs attempt to acquire the same locks in different orders, the system may deadlock.
The best approach to preventing deadlocks is to not expose locks in headers and to minimize the complexity of the code that runs while a lock is held. Calling a callback with a lock held is asking for trouble.
Overzealous locking may result in taking a read lock, releasing it, and then taking a write lock, without realizing that another process may have modified the resource between the two locks.
The kernel may be compiled without CONFIG_SMP and CONFIG_PREEMPT, which compiles spinlocks away entirely. With only CONFIG_PREEMPT enabled, spinlocks just disable preemption. Even if both are disabled, it is worth testing with both enabled to check for certain locking bugs.
See also Documentation/kernel-hacking/locking.rst.
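For example, a sketch of protecting a list that is also touched from an interrupt handler (the struct and list are hypothetical); the _irqsave variant disables local interrupts so the lock can never be re-taken on the same CPU:

#include <linux/list.h>
#include <linux/spinlock.h>

struct foo {
        int value;
        struct list_head node;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

static void foo_add(struct foo *f)
{
        unsigned long flags;

        /* Disable local interrupts and spin until the lock is free on
         * other CPUs, so an interrupt handler cannot deadlock with us. */
        spin_lock_irqsave(&foo_lock, flags);
        list_add_tail(&f->node, &foo_list);
        spin_unlock_irqrestore(&foo_lock, flags);
}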
Compare-exchange (i.e. cmp_xchg()) or compare-and-swap does an atomic read-modify-write (RMW) by reading the value at the given address and comparing it to a provided value, leveraging processor-specific instructions rather than taking a spinlock. If they are equal it will write the other provided value. It only writes a value if the current state is as the caller expects. Either way it returns the value it read. For example, the following continuously tries to increment a value while the value read at the address by cmp_xchg() doesn't match the value read by the caller.
int old, new;

do {
        old = *addr;
        new = old + 1;
} while (cmp_xchg(addr, old, new) != old);
Note that cmp_xchg() doesn't actually appear in the kernel; the real primitive is cmpxchg().
Read-copy update (RCU) allows concurrency between a single updater and multiple readers. It does this by publishing and reading multiple versions of an object and deferring the freeing of an object until after pre-existing read-side critical sections are complete. This is normally not possible because the removal and reclamation phases are completed as one operation. Note that updating an object requires removing the old version. During removal any concurrent readers are noted and reclamation only occurs after all of those readers have completed. This allows readers to avoid atomic operations, memory barriers, and communication-induced cache misses.
RCU read-side primitives have essentially zero overhead on non-preemptible kernels.
On modern CPUs writes to single aligned pointers are atomic.
There are various compiler optimizations that may prevent statements from executing in the order they are written. The rcu_ functions prevent that out-of-order execution. For example, an updater may call rcu_assign_pointer() passing a global pointer, and a reader may then surround rcu_dereference() with rcu_read_lock() and rcu_read_unlock(). More commonly, rcu_ functions specific to list_head and hlist_* are used.
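A rough sketch of that pattern (the struct, the global pointer, and the update-side serialization are all assumed for illustration):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config {
        int threshold;
};

static struct config __rcu *cur_config;

static int read_threshold(void)
{
        struct config *c;
        int val = 0;

        rcu_read_lock();
        c = rcu_dereference(cur_config);
        if (c)
                val = c->threshold;
        rcu_read_unlock();
        return val;
}

static void update_threshold(int threshold)
{
        struct config *new, *old;

        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
                return;
        new->threshold = threshold;

        old = rcu_dereference_protected(cur_config, 1); /* assumes the caller serializes updaters */
        rcu_assign_pointer(cur_config, new);            /* publish the new version */
        synchronize_rcu();                              /* wait for pre-existing readers */
        kfree(old);
}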
Interrupt handlers need to be short because they run in an interrupt context rather than process context. In that context things like sleeping, scheduling, or user memory access are not allowed. While the handler runs, further interrupts on that line are either held pending or dropped.
Deadlocks can occur when an interrupt handler tries to acquire a lock that is owned by the process that was interrupted.
Code which handles hardware and software interrupts is one obvious example of atomic context. Atomic context is where the code being executed cannot be interrupted or preempted. That may be the case for a single instruction, while interrupts are disabled, or while a spinlock is held.
A programmable interrupt controller (PIC) multiplexes numerous peripheral interrupts to a single output to the processor. The processor may then read information about the interrupt over a bus (e.g. AXI) and then acknowledge the interrupt once it has been handled. The PIC will not raise another interrupt until the preceding one has been acknowledged.
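A sketch of registering a handler (the IRQ number, name, and device pointer are whatever the driver obtained during probe; everything here is illustrative):

#include <linux/interrupt.h>

static irqreturn_t foo_irq_handler(int irq, void *dev_id)
{
        /* Do only the minimum needed to service the hardware here and
         * defer longer work to a threaded handler or workqueue. */
        return IRQ_HANDLED;
}

static int foo_setup_irq(unsigned int irq, void *dev)
{
        /* "foo" shows up in /proc/interrupts; dev distinguishes shared lines. */
        return request_irq(irq, foo_irq_handler, IRQF_SHARED, "foo", dev);
}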
kmalloc() takes several flags including GFP_KERNEL, which is only allowed in user context and may sleep, and GFP_ATOMIC, which may be called from interrupt context, but can also fail!
Software interrupts are also supported (e.g. the timer softirq). However tasklets are often preferred since a given tasklet will only run on one CPU at a time rather than simultaneously on more than one CPU. Software interrupts take priority over normal process context, though they can still be preempted by hardware interrupts. The term software interrupt is also used for synchronous events triggered by an instruction, such as system calls and exceptions. Note that Intel added hardware support for user interrupts in 2021. Previously, all communication across privilege boundaries went through the kernel (e.g. signals, pipes, remote procedure calls, hardware interrupt notifications). User interrupts do not go through the kernel and therefore reduce latency and CPU utilization. User interrupts may currently only be sent by another user space task, though the architecture allows for them to eventually be sent by the kernel or a peripheral as well.
The scheduler runs non-blocked (runnable) processes by assigning each a timeslice, which is the period that the process may run. Linux manages timeslices using preemption, where one process is suspended in lieu of another. An alternative method is cooperative multitasking, where the process decides to suspend itself or yield to the system.
Linux assigns a timeslice to each runnable process and runs them based on their priority. When all timeslices have been exhausted, new timeslices are assigned to all runnable processes.
The Linux scheduling algorithm is O(1), which guarantees that the scheduler does the same amount of work regardless of how many processes are runnable.
Long timeslices may introduce noticeable delays, while short timeslices may incur a lot of overhead.
With more complex modern processors the distinction between processor-bound and I/O-bound processes may be less clear. I/O-bound processes block more often since they interface with peripheral systems, and they therefore tend to take priority since they run for shorter periods. A networked game may require calculations to run on the processor, but also communicate over a comparably slow network.
Threads within a process share the same address space, to which dynamic memory, files, and object code are mapped, as well as a list of open files and other kernel resources. The kernel treats them as individual processes that happen to share some resources.
The sched_yield() system call allows a process to yield as with cooperative multitasking, but is discouraged since the kernel should be able to preempt as necessary. For some time sched_yield() was used for user-space locks, but that is no longer necessary with futexes. As of 2.6, yielding forfeits the timeslice of the process.
Nice values dictate priority and timeslice size with a lower value indicating a higher priority and larger timeslice. Only root processes may make themselves less nice.
Note there is also an I/O scheduler which uses the nice value by default, but only one I/O scheduler, Completely Fair Queuing (CFQ), supports I/O priority as of 2007.
Process affinity dictates the likelihood of a process being scheduled on the same processor. Switching between processors is costly because of the independent caches per processor. It is possible to bind a cache-sensitive process to a processor with a hard affinity.
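Hard affinity can be set from user space with sched_setaffinity(); a small sketch (the CPU number is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);               /* pin the calling process to CPU 2 */

        if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");
        return 0;
}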
Latency is the period from the occurrence of a stimulus to the response. Jitter is the variation in timing between successive events. It is hard to measure latency or even know when the first occurrence of a stimulus was. Real-time systems aim for zero jitter and a latency equal to the operational delay, which may be slow! The kernel implements the POSIX standard interface using two real-time scheduling policies (SCHED_FIFO and SCHED_RR), which use the static priority of a process (independent of the nice value).
There are other mechanisms to increase determinism. Locking an application's pages into physical memory prevents swapping to disk. Since critical sections of the kernel may not be preempted, one or more processors may be dedicated to a process.
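For instance, a sketch combining page locking with a real-time policy (the priority value is arbitrary):

#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 50 };

        /* Lock current and future pages into RAM so they are never swapped. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                perror("mlockall");

        /* Switch to a real-time policy that uses the static priority. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                perror("sched_setscheduler");

        return 0;
}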
Linux limits other resources like the number of open files, pages of memory, pending signals, size of created files, etc. Those limits are stored in struct rlimit.
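From user space these limits can be read and adjusted with getrlimit() and setrlimit(); a quick sketch for the open-file limit:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
                perror("getrlimit");
                return 1;
        }
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* Raise the soft limit up to the hard limit. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
                perror("setrlimit");
        return 0;
}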
In textbooks linked lists are typically constructed from nodes implemented as a struct containing one or more pointers to objects of that same type. Rather than reimplementing those pointers and the functions that act on them for every list, the kernel provides include/linux/list.h. The key distinction is that the kernel implements list_head in include/linux/types.h as follows:
struct list_head {
        struct list_head *next, *prev;
};
list_head is then added as a member of a node. The utility functions then use container_of() to access a node from a list_head.
The two primary types of lists in the kernel are the circular list_head and the linear hlist_head and hlist_node.
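A small sketch of the usual pattern (struct item and its fields are made up):

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>

struct item {
        int value;
        struct list_head node;          /* linkage embedded in the node */
};

static LIST_HEAD(items);                /* an empty circular list */

static int item_add(int value)
{
        struct item *it = kmalloc(sizeof(*it), GFP_KERNEL);

        if (!it)
                return -ENOMEM;
        it->value = value;
        list_add_tail(&it->node, &items);
        return 0;
}

static int item_sum(void)
{
        struct item *it;
        int sum = 0;

        /* list_for_each_entry() uses container_of() internally. */
        list_for_each_entry(it, &items, node)
                sum += it->value;
        return sum;
}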
Linux power management is divided into runtime and system frameworks, where the runtime framework manages individual devices while the system manages the entire system.
Enable system power management using the CONFIG_PM Kconfig option.
If the actions taken by a driver are the same for system operations as for runtime operations, add the following:
static const struct dev_pm_ops mydrv_dev_pm_ops = {
        SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
                                pm_runtime_force_resume)
        SET_RUNTIME_PM_OPS(driver_runtime_suspend_callback,
                           driver_runtime_resume_callback,
                           NULL)
};
Generic power management domains (genpd) allow devices to be managed independent of their subsystems or drivers. The topology of devices can be described in the device tree using power-domains = <&pm_domain 0>;. Note that genpd uses the power_on() and power_off() callbacks. Wake-up latency constraints can be enforced using the genpd governor. Additionally, the set_performance_state callback can limit the power consumption of a device by varying the voltage to power rails.
Several power management attributes are created for devices in sysfs. The power/control attribute allows userspace to take control of device power management by writing on or hand back control by writing auto.
Three more attributes are automatically created and describe the power management of a device: runtime_active_time, runtime_status, and runtime_suspended_time.
The power management framework maintains a usage counter that is incremented using the pm_runtime_get_*() family of functions and decremented using pm_runtime_put(). If the counter reaches zero the device may be put to sleep.
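A sketch of the usual get/put pairing in a driver (the I/O in the middle is elided and the helper name is made up):

#include <linux/device.h>
#include <linux/pm_runtime.h>

static int foo_do_io(struct device *dev)
{
        int ret;

        /* Bump the usage counter and resume the device if it was suspended. */
        ret = pm_runtime_get_sync(dev);
        if (ret < 0) {
                pm_runtime_put_noidle(dev);
                return ret;
        }

        /* ... access the hardware ... */

        /* Drop the reference; at zero the device becomes eligible for suspend. */
        pm_runtime_put(dev);
        return 0;
}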
There is also a feature named autosuspend that allows a device to put itself to sleep after a specified period (see power/autosuspend_delay_ms).
Common mechanisms to reduce processor power usage include the following.
cpuidle
cpufreq
hotplug: Suspend or hibernate individual CPUs
sched_mc: Use the scheduler to fill one core or package at a time
When debugging a driver, mount debugfs and use dynamic debug to enable debug print statements.
mount -t debugfs none /sys/kernel/debug
Devices exposed in /dev are managed by a userspace program named udev. Originally developed by Greg K.H., it has since been maintained as part of systemd, and Gentoo maintains a fork named eudev (see #gentoo-eudev on Freenode). There are also a few folks in #udev on Freenode.
Events are generated by the kernel and processed by udev, which in turn generates its own events. It is possible to see the former with udevadm monitor -k, while udevadm monitor -u will output the latter.
Udev rules include an environment that is populated from various places (see IMPORT). I found this to be the most confusing part of udev. Rules have access to sysfs attributes through the environment, which can be printed using udevadm info -a /sys/...
Rules are typically located in /usr/lib/udev/rules.d/*.rules and /etc/udev/rules.d/*.rules.
The syntax includes operators like = for assignment and == for matching. To skip an ACTION, rules will often include a match of that action with a GOTO to LABEL="foo_end" at the end of the file.
All commands are run after processing the rule regardless of their order in the file.
A good example is 60-serial.rules, which creates symlinks for serial devices under /dev/serial.
There is also a database, known as hwdb, which I didn't investigate.
I found the following pattern helpful when debugging the addition of a particular device:
cd /sys/...
udevadm test --action="add" "${PWD}"
After updating a rule it may be necessary to run udevadm control --reload.
Some of the keys match upwards against parent devices until they find a match (e.g. KERNELS, SUBSYSTEMS, DRIVERS).
git describe 2783a7f56f998
v5.17-rc1-88-g2783a7f56f99
dnf install -y yamllint
pip3 install dtschema
dt-validate Documentation/devicetree/bindings/processed-schema.json \
arch/arm64/boot/dts/foo/bar.dtb
arch/arm64/kernel/setup.c calls setup_machine_fdt(__fdt_pointer), where __fdt_pointer is set in arch/arm64/kernel/head.S.