I have been an active user of various Linux distributions since college, including Ubuntu, Arch Linux, Debian, and most recently Fedora.
There are two different approaches for communicating between the CPU and peripheral hardware on x86: memory-mapped I/O and port-mapped I/O.
Generate documentation written in reStructuredText using make htmldocs or make pdfdocs, which will be output in Documentation/output.
There are a number of LWN articles covering atomic operations in the kernel.
There are tons of really cool features in the kernel, including the following:
container_of() is a macro that provides a pointer to the struct that another struct is a member of.
BUILD_BUG_ON() will force a compilation error if a condition is true.
ioctl() is a system call for issuing device-specific commands to a driver, and Netlink sockets are another way to communicate between user space and the kernel.
Errors are often returned as pointers that can be checked with IS_ERR() and then decoded with PTR_ERR().
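As a minimal sketch of container_of() (struct my_device and its node member are made up for illustration):

#include <linux/kernel.h>
#include <linux/list.h>

struct my_device {                      /* hypothetical containing struct */
        int id;
        struct list_head node;          /* embedded member */
};

static struct my_device *node_to_my_device(struct list_head *entry)
{
        /* Given a pointer to the embedded member, recover a pointer
         * to the struct my_device that contains it. */
        return container_of(entry, struct my_device, node);
}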
The typical boot process consists of a Boot ROM in the target chip that loads firmware from a location that may be configured using jumpers. In embedded systems that firmware often consists of the U-Boot Secondary Program Loader (SPL), which in turn loads U-Boot (see Booting ARM Linux).
The Linux boot process can be measured using tools like bootchart or systemd-analyze.
Each process has a virtual address space composed of pages (e.g. 4 KB on most architectures, 8 KB or larger on some 64-bit architectures) that may be valid if backed by physical memory or secondary storage, or invalid. Secondary storage is used by the kernel to hold pages that it believes won't be needed soon. If a program accesses such a page the MMU generates a page fault, which causes the kernel to bring the page back into physical memory.
Copy-on-write occurs when a page shared between two or more processes is modified by one of them. The MMU generates a fault and the kernel creates a new private copy of the page for the writing process.
Virtual memory is divided into blocks or regions like the text segment, stack, heap, bss segment (global variables), and mapped files like the C library and other shared libraries (see /proc/self/maps).
Allocation is done using numerous families of functions combined with Get Free Pages (GFP) flags (e.g. GFP_KERNEL, GFP_NOWAIT, GFP_ATOMIC).
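For instance, a rough sketch (the helper and its arguments are made up) of picking a flag based on the calling context:

#include <linux/slab.h>
#include <linux/types.h>

/* GFP_KERNEL may sleep and is only valid in process context;
 * GFP_ATOMIC never sleeps but is more likely to fail. */
static void *example_alloc(size_t len, bool atomic_context)
{
        gfp_t flags = atomic_context ? GFP_ATOMIC : GFP_KERNEL;

        return kmalloc(len, flags);
}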
A slab allocator reduces fragmentation by retaining memory containing an object of a certain type after deinitialization so that it may be reused when an object of the same type is initialized. It is a critical enhancement since allocating and deallocating memory has significant overhead. Slabs are pre-allocated chunks of memory of various sizes.
There are three slab allocators: SLOB (1991-1999), SLAB (1999-2008), and SLUB (2008-?). SLUB improved debugging, defragmentation, and execution time.
This caching of many identical objects is available through kmem_cache_alloc() rather than kmalloc().
Alternatively, kzalloc() zeroes out memory. The kmalloc family should generally only be used for objects smaller than the page size. Considerations when allocating memory include alignment, size, and whether the memory must be physically contiguous.
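A minimal sketch of the kmem_cache API mentioned above (struct foo and the cache name are made up):

#include <linux/errno.h>
#include <linux/slab.h>

struct foo {
        int a, b;
};

static struct kmem_cache *foo_cache;

static int foo_cache_init(void)
{
        /* One cache holds many identically sized objects. */
        foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo),
                                      0, SLAB_HWCACHE_ALIGN, NULL);
        return foo_cache ? 0 : -ENOMEM;
}

static struct foo *foo_alloc(void)
{
        return kmem_cache_alloc(foo_cache, GFP_KERNEL);
}

static void foo_free(struct foo *f)
{
        kmem_cache_free(foo_cache, f);
}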
Synchronization of shared resources may be accomplished with a spinlock or with blocking primitives (e.g. a mutex), which may suspend or sleep. A local_lock disables preemption or interrupts on the local CPU, but does not provide inter-CPU concurrency control.
Only spinlocks may be used in an interrupt context. That is because code executing in an interrupt context may not be suspended (see Interrupts). If it is suspended then a critical error has occurred.
Spinlocks and mutexes in the Linux kernel are not recursive, since re-acquiring a lock that is already held would result in a deadlock. A deadly embrace may occur with two or more locks: if processes on multiple CPUs attempt to acquire the same locks in different orders, the system may deadlock.
The best approach to preventing deadlocks is to not expose locks in headers and to minimize the complexity of the code that runs while a lock is held. Calling a callback with a lock held is asking for trouble.
Overzealous locking may result in taking a read lock, releasing it, and then taking a write lock, without realizing that another process may have modified the resource between the two locks.
The kernel may be compiled without CONFIG_SMP and CONFIG_PREEMPT, which compiles spinlocks away entirely. With only CONFIG_PREEMPT enabled, spinlocks just disable preemption. Even if both are disabled, it is worth testing with both enabled to check for certain locking bugs.
See also Documentation/kernel-hacking/locking.rst.
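For example, a sketch of protecting a list that is also touched from an interrupt handler (the struct and list are hypothetical); the _irqsave variant disables local interrupts so the lock can never be re-taken on the same CPU:

#include <linux/list.h>
#include <linux/spinlock.h>

struct foo {
        int value;
        struct list_head node;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

static void foo_add(struct foo *f)
{
        unsigned long flags;

        /* Disable local interrupts and spin until the lock is free on
         * other CPUs, so an interrupt handler cannot deadlock with us. */
        spin_lock_irqsave(&foo_lock, flags);
        list_add_tail(&f->node, &foo_list);
        spin_unlock_irqrestore(&foo_lock, flags);
}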
Compare-exchange (i.e. cmp_xchg()) or compare-and-swap does an atomic read-modify-write (RMW) by reading the value at the given address and comparing it to a provided value, leveraging processor-specific instructions rather than taking a spinlock. If they are equal it will write the other provided value. It only writes a value if the current state is as the caller expects. Either way it returns the value it read. For example, the following continuously tries to increment a value while the value read at the address by cmp_xchg() doesn't match the value read by the caller.
int old, new;

do {
        old = *addr;
        new = old + 1;
} while (cmp_xchg(addr, old, new) != old);
Note that cmp_xchg() doesn't actually appear in the kernel; the real primitive is cmpxchg().
Read-copy update (RCU) allows concurrency between a single updater and multiple readers. It does this by publishing and reading multiple versions of an object and deferring the freeing of an object until after pre-existing read-side critical sections are complete. This is normally not possible because the removal and reclamation phases are completed as one operation. Note that updating an object requires removing the old version. During removal any concurrent readers are noted and reclamation only occurs after all of those readers have completed. This allows readers to avoid atomic operations, memory barriers, and communication-induced cache misses.
RCU read-side primitives have essentially zero overhead on non-preemptible kernels.
On modern CPUs writes to single aligned pointers are atomic.
There are various compiler optimizations that may prevent statements from executing in the order they are written. The rcu_ functions prevent that out-of-order execution. For example, an updater may call rcu_assign_pointer() passing a global pointer, and a reader may then surround rcu_dereference() with rcu_read_lock() and rcu_read_unlock(). More commonly, rcu_ functions specific to list_head and hlist_* are used.
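A rough sketch of that pattern (the struct, the global pointer, and the update-side serialization are all assumed for illustration):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config {
        int threshold;
};

static struct config __rcu *cur_config;

static int read_threshold(void)
{
        struct config *c;
        int val = 0;

        rcu_read_lock();
        c = rcu_dereference(cur_config);
        if (c)
                val = c->threshold;
        rcu_read_unlock();
        return val;
}

static void update_threshold(int threshold)
{
        struct config *new, *old;

        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
                return;
        new->threshold = threshold;

        old = rcu_dereference_protected(cur_config, 1); /* assumes the caller serializes updaters */
        rcu_assign_pointer(cur_config, new);            /* publish the new version */
        synchronize_rcu();                              /* wait for pre-existing readers */
        kfree(old);
}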
Interrupt handlers need to be short because they run in an interrupt context rather than process context. In that context things like sleeping, scheduling, or user memory access are not allowed. While the handler runs, further interrupts on that line are either held pending or dropped.
Deadlocks can occur when an interrupt handler tries to acquire a lock that is owned by the process that was interrupted.
Code which handles hardware and software interrupts is one obvious example of atomic context. Atomic context is where the code being executed cannot be interrupted or preempted. That may be the case for a single instruction, while interrupts are disabled, or while a spinlock is held.
A programmable interrupt controller (PIC) multiplexes numerous peripheral interrupts to a single output to the processor. The processor may then read information about the interrupt over a bus (e.g. AXI) and then acknowledge the interrupt once it has been handled. The PIC will not raise another interrupt until the preceding one has been acknowledged.
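A sketch of registering a handler (the IRQ number, name, and device pointer are whatever the driver obtained during probe; everything here is illustrative):

#include <linux/interrupt.h>

static irqreturn_t foo_irq_handler(int irq, void *dev_id)
{
        /* Do only the minimum needed to service the hardware here and
         * defer longer work to a threaded handler or workqueue. */
        return IRQ_HANDLED;
}

static int foo_setup_irq(unsigned int irq, void *dev)
{
        /* "foo" shows up in /proc/interrupts; dev distinguishes shared lines. */
        return request_irq(irq, foo_irq_handler, IRQF_SHARED, "foo", dev);
}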
kmalloc() takes several flags including GFP_KERNEL, which is only allowed in user context and may sleep, and GFP_ATOMIC, which may be called from interrupt context, but can also fail!
Software interrupts are also supported (e.g. the timer softirq). However tasklets are often preferred since a given tasklet will only run on one CPU at a time rather than simultaneously on more than one CPU. Software interrupts take priority over normal process context, though they can still be preempted by hardware interrupts. The term software interrupt is also used for synchronous events triggered by an instruction, such as system calls and exceptions. Note that Intel added hardware support for user interrupts in 2021. Previously, all communication across privilege boundaries went through the kernel (e.g. signals, pipes, remote procedure calls, hardware interrupt notifications). User interrupts do not go through the kernel and therefore reduce latency and CPU utilization. User interrupts may currently only be sent by another user space task, though the architecture allows for them to eventually be sent by the kernel or a peripheral as well.
The scheduler runs non-blocked (runnable) processes by assigning each a timeslice, which is the period that the process may run. Linux manages timeslices using preemption, where one process is suspended in lieu of another. An alternative method is cooperative multitasking, where the process decides to suspend itself or yield to the system.
Linux assigns a timeslice to each runnable process and runs them based on their priority. When all timeslices have been exhausted, new timeslices are assigned to all runnable processes.
The Linux scheduling algorithm is O(1), which guarantees that the scheduler does the same amount of work regardless of how many processes are runnable.
Long timeslices may introduce noticeable delays, while short timeslices may incur a lot of overhead.
With more complex modern processors the distinction between processor-bound and I/O-bound processes may be less clear. I/O-bound processes block more often since they interface with peripheral systems, and they therefore tend to take priority since they run for shorter periods. A networked game may require calculations to run on the processor, but also communicate over a comparably slow network.
Threads within a process share the same address space, to which dynamic memory, files, and object code are mapped, as well as a list of open files and other kernel resources. The kernel treats them as individual processes that happen to share some resources.
The sched_yield() system call allows a process to yield as with cooperative multitasking, but is discouraged since the kernel should be able to preempt as necessary. For some time sched_yield() was used for user-space locks, but that is no longer necessary with futexes. As of 2.6, yielding forfeits the timeslice of the process.
Nice values dictate priority and timeslice size with a lower value indicating a higher priority and larger timeslice. Only root processes may make themselves less nice.
Note there is also an I/O scheduler which uses the nice value by default, but only one I/O scheduler, Completely Fair Queuing (CFQ), supports I/O priority as of 2007.
Process affinity dictates the likelihood of a process being scheduled on the same processor. Switching between processors is costly because of the independent caches per processor. It is possible to bind a cache-sensitive process to a processor with a hard affinity.
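Hard affinity can be set from user space with sched_setaffinity(); a small sketch (the CPU number is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);               /* pin the calling process to CPU 2 */

        if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");
        return 0;
}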
Latency is the period from the occurrence of a stimulus to the response. Jitter is the variation in timing between successive events. It is hard to measure latency or even know when the first occurrence of a stimulus was. Real-time systems aim for zero jitter and a latency equal to the operational delay, which may be slow! The kernel implements the POSIX standard interface using two real-time scheduling policies (SCHED_FIFO and SCHED_RR), which use the static priority of a process (independent of the nice value).
There are other mechanisms to increase determinism. Locking an application's pages into physical memory prevents swapping to disk. Since critical sections of the kernel may not be preempted, one or more processors may be dedicated to a process.
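For instance, a sketch combining page locking with a real-time policy (the priority value is arbitrary):

#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 50 };

        /* Lock current and future pages into RAM so they are never swapped. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                perror("mlockall");

        /* Switch to a real-time policy that uses the static priority. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                perror("sched_setscheduler");

        return 0;
}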
Linux limits other resources like the number of open files, pages of memory, pending signals, size of created files, etc. Those limits are stored in struct rlimit.
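From user space these limits can be read and adjusted with getrlimit() and setrlimit(); a quick sketch for the open-file limit:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
                perror("getrlimit");
                return 1;
        }
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* Raise the soft limit up to the hard limit. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
                perror("setrlimit");
        return 0;
}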
In textbooks linked lists are typically constructed from nodes implemented as a struct containing one or more pointers to objects of that same type. Rather than reimplementing those pointers and the functions that act on them for every list, the kernel provides include/linux/list.h. The key distinction is that the kernel implements list_head in include/linux/types.h as follows:
struct list_head {
        struct list_head *next, *prev;
};
list_head is then added as a member of a node. The utility functions then use container_of() to access a node from a list_head.
The two primary types of lists in the kernel are the circular list_head and the linear hlist_head and hlist_node.
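A small sketch of the usual pattern (struct item and its fields are made up):

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>

struct item {
        int value;
        struct list_head node;          /* linkage embedded in the node */
};

static LIST_HEAD(items);                /* an empty circular list */

static int item_add(int value)
{
        struct item *it = kmalloc(sizeof(*it), GFP_KERNEL);

        if (!it)
                return -ENOMEM;
        it->value = value;
        list_add_tail(&it->node, &items);
        return 0;
}

static int item_sum(void)
{
        struct item *it;
        int sum = 0;

        /* list_for_each_entry() uses container_of() internally. */
        list_for_each_entry(it, &items, node)
                sum += it->value;
        return sum;
}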
Linux power management is divided into runtime and system frameworks, where the runtime framework manages individual devices while the system manages the entire system.
Enable system power management using the CONFIG_PM Kconfig option.
If the actions taken by a driver are the same for system operations as for runtime operations, add the following:
static const struct dev_pm_ops mydrv_dev_pm_ops = {
        SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
                                pm_runtime_force_resume)
        SET_RUNTIME_PM_OPS(driver_runtime_suspend_callback,
                           driver_runtime_resume_callback,
                           NULL)
};
Generic power management domains (genpd) allow devices to be managed independent of their subsystems or drivers. The topology of devices can be described in the device tree using power-domains = <&pm_domain 0>;. Note that genpd uses the power_on() and power_off() callbacks. Wake-up latency constraints can be enforced using the genpd governor. Additionally, the set_performance_state callback can limit the power consumption of a device by varying the voltage to power rails.
Several power management attributes are created for devices in sysfs. The power/control attribute allows userspace to take control of device power management by writing on or hand back control by writing auto.
Three more attributes are automatically created and describe the power management of a device: runtime_active_time, runtime_status, and runtime_suspended_time.
The power management framework maintains a usage counter that is incremented using the pm_runtime_get_*() family of functions and decremented using pm_runtime_put(). If the counter reaches zero the device may be put to sleep.
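A sketch of the usual get/put pairing in a driver (the I/O in the middle is elided and the helper name is made up):

#include <linux/device.h>
#include <linux/pm_runtime.h>

static int foo_do_io(struct device *dev)
{
        int ret;

        /* Bump the usage counter and resume the device if it was suspended. */
        ret = pm_runtime_get_sync(dev);
        if (ret < 0) {
                pm_runtime_put_noidle(dev);
                return ret;
        }

        /* ... access the hardware ... */

        /* Drop the reference; at zero the device becomes eligible for suspend. */
        pm_runtime_put(dev);
        return 0;
}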
There is also a feature named autosuspend that allows a device to put itself to sleep after a specified period (see power/autosuspend_delay_ms).
Common mechanisms to reduce processor power usage include the following.
cpuidle
cpufreq
hotplug: Suspend or hibernate individual CPUs
sched_mc: Use the scheduler to fill one core or package at a time
When debugging a driver, mount debugfs and use dynamic debug to enable debug print statements.
mount -t debugfs none /sys/kernel/debug
Devices exposed in /dev are managed by a userspace program named udev. Originally developed by Greg K.H., it has since been maintained as part of systemd, and Gentoo maintains a fork named eudev (see #gentoo-eudev on Freenode). There are also a few folks in #udev on Freenode.
Events are generated by the kernel and processed by udev, which in turn generates its own events. It is possible to see the former with udevadm monitor -k, while udevadm monitor -u will output the latter.
Udev rules include an environment that is populated from various places (see IMPORT). I found this to be the most confusing part of udev. Rules have access to sysfs attributes through the environment, which can be printed using udevadm info -a /sys/...
Rules are typically located in /usr/lib/udev/rules.d/*.rules and /etc/udev/rules.d/*.rules.
The syntax includes operators like = for assignment and == for matching. To skip an ACTION, rules will often include a match of that action with a GOTO to LABEL="foo_end" at the end of the file.
All commands are run after processing the rule regardless of their order in the file.
A good example is 60-serial.rules, which creates symlinks for serial devices under /dev/serial.
There is also a database, known as hwdb, which I didn't investigate.
I found the following pattern helpful when debugging the addition of a particular device:
cd /sys/...
udevadm test --action="add" "${PWD}"
After updating a rule it may be necessary to run udevadm control --reload.
Some of the keys match upwards against parent devices until they find a match (e.g. KERNELS, SUBSYSTEMS, DRIVERS).
git describe 2783a7f56f998
v5.17-rc1-88-g2783a7f56f99
dnf install -y yamllint
pip3 install dtschema
dt-validate Documentation/devicetree/bindings/processed-schema.json \
arch/arm64/boot/dts/foo/bar.dtb
arch/arm64/kernel/setup.c calls setup_machine_fdt(__fdt_pointer), where __fdt_pointer is set in arch/arm64/kernel/head.S.