提交 eeee3149 编写于 作者: L Linus Torvalds

Merge tag 'docs-4.18' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "There's been a fair amount of work in the docs tree this time around,
  including:

   - Extensive RST conversions and organizational work in the
     memory-management docs thanks to Mike Rapoport.

   - An update of Documentation/features from Andrea Parri and a script
     to keep it updated.

   - Various LICENSES updates from Thomas, along with a script to check
     SPDX tags.

   - Work to fix dangling references to documentation files; this
     involved a fair number of one-liner comment changes outside of
     Documentation/

  ... and the usual list of documentation improvements, typo fixes, etc"

* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
  Documentation: document hung_task_panic kernel parameter
  docs/admin-guide/mm: add high level concepts overview
  docs/vm: move ksm and transhuge from "user" to "internals" section.
  docs: Use the kerneldoc comments for memalloc_no*()
  doc: document scope NOFS, NOIO APIs
  docs: update kernel versions and dates in tables
  docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
  docs/vm: transhuge: minor updates
  docs/vm: transhuge: change sections order
  Documentation: arm: clean up Marvell Berlin family info
  Documentation: gpio: driver: Fix a typo and some odd grammar
  docs: ranoops.rst: fix location of ramoops.txt
  scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
  docs: uio-howto.rst: use a code block to solve a warning
  mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
  w1: w1_io.c: fix a kernel-doc warning
  Documentation/process/posting: wrap text at 80 cols
  docs: admin-guide: add cgroup-v2 documentation
  Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
  Documentation: refcount-vs-atomic: Update reference to LKMM doc.
  ...
......@@ -64,8 +64,6 @@ auxdisplay/
- misc. LCD driver documentation (cfag12864b, ks0108).
backlight/
- directory with info on controlling backlights in flat panel displays
bcache.txt
- Block-layer cache on fast SSDs to improve slow (raid) I/O performance.
block/
- info on the Block I/O (BIO) layer.
blockdev/
......@@ -78,18 +76,10 @@ bus-devices/
- directory with info on TI GPMC (General Purpose Memory Controller)
bus-virt-phys-mapping.txt
- how to access I/O mapped memory from within device drivers.
cachetlb.txt
- describes the cache/TLB flushing interfaces Linux uses.
cdrom/
- directory with information on the CD-ROM drivers that Linux has.
cgroup-v1/
- cgroups v1 features, including cpusets and memory controller.
cgroup-v2.txt
- cgroups v2 features, including cpusets and memory controller.
circular-buffers.txt
- how to make use of the existing circular buffer infrastructure
clk.txt
- info on the common clock framework
cma/
- Continuous Memory Area (CMA) debugfs interface.
conf.py
......
......@@ -90,4 +90,4 @@ Date: December 2009
Contact: Lee Schermerhorn <lee.schermerhorn@hp.com>
Description:
The node's huge page size control/query attributes.
See Documentation/vm/hugetlbpage.txt
\ No newline at end of file
See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file
......@@ -12,4 +12,4 @@ Description:
free_hugepages
surplus_hugepages
resv_hugepages
See Documentation/vm/hugetlbpage.txt for details.
See Documentation/admin-guide/mm/hugetlbpage.rst for details.
......@@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
sleep_millisecs: how many milliseconds ksm should sleep between
scans.
See Documentation/vm/ksm.txt for more information.
See Documentation/vm/ksm.rst for more information.
What: /sys/kernel/mm/ksm/merge_across_nodes
Date: January 2013
......
......@@ -37,7 +37,7 @@ Description:
The alloc_calls file is read-only and lists the kernel code
locations from which allocations for this cache were performed.
The alloc_calls file only contains information if debugging is
enabled for that cache (see Documentation/vm/slub.txt).
enabled for that cache (see Documentation/vm/slub.rst).
What: /sys/kernel/slab/cache/alloc_fastpath
Date: February 2008
......@@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
Description:
The free_calls file is read-only and lists the locations of
object frees if slab debugging is enabled (see
Documentation/vm/slub.txt).
Documentation/vm/slub.rst).
What: /sys/kernel/slab/cache/free_fastpath
Date: February 2008
......
......@@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking.
:maxdepth: 1
initrd
cgroup-v2
serial-console
braille-console
parport
......@@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking.
mono
java
ras
bcache
pm/index
thunderbolt
LSM/index
mm/index
.. only:: subproject and html
......
......@@ -106,11 +106,11 @@
use by PCI
Format: <irq>,<irq>...
acpi_mask_gpe= [HW,ACPI]
acpi_mask_gpe= [HW,ACPI]
Due to the existence of _Lxx/_Exx, some GPEs triggered
by unsupported hardware/firmware features can result in
GPE floodings that cannot be automatically disabled by
the GPE dispatcher.
GPE floodings that cannot be automatically disabled by
the GPE dispatcher.
This facility can be used to prevent such uncontrolled
GPE floodings.
Format: <int>
......@@ -472,10 +472,10 @@
for platform specific values (SB1, Loongson3 and
others).
ccw_timeout_log [S390]
ccw_timeout_log [S390]
See Documentation/s390/CommonIO for details.
cgroup_disable= [KNL] Disable a particular controller
cgroup_disable= [KNL] Disable a particular controller
Format: {name of the controller(s) to disable}
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in
......@@ -518,7 +518,7 @@
those clocks in any way. This parameter is useful for
debug and development, but should not be needed on a
platform with proper driver support. For more
information, see Documentation/clk.txt.
information, see Documentation/driver-api/clk.rst.
clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
[Deprecated]
......@@ -641,8 +641,8 @@
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
console=brl,ttyS0
For now, only VisioBraille is supported.
......@@ -662,7 +662,7 @@
consoleblank= [KNL] The console blank (screen saver) timeout in
seconds. A value of 0 disables the blank timer.
Defaults to 0.
Defaults to 0.
coredump_filter=
[KNL] Change the default value for
......@@ -730,7 +730,7 @@
or memory reserved is below 4G.
cryptomgr.notests
[KNL] Disable crypto self-tests
[KNL] Disable crypto self-tests
cs89x0_dma= [HW,NET]
Format: <dma>
......@@ -746,7 +746,7 @@
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
time. See
Documentation/admin-guide/dynamic-debug-howto.rst for
details. Deprecated, see dyndbg.
......@@ -833,7 +833,7 @@
causing system reset or hang due to sending
INIT from AP to BSP.
disable_ddw [PPC/PSERIES]
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
......@@ -1188,7 +1188,7 @@
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
forcepae [X86-32]
forcepae [X86-32]
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
functionally usable PAE implementation.
......@@ -1247,7 +1247,7 @@
gamma= [HW,DRM]
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
gart_fix_e820= [X86_64] disable the fix e820 for K8 GART
Format: off | on
default: on
......@@ -1341,23 +1341,32 @@
x86-64 are 2M (when the CPU supports "pse") and 1G
(when the CPU supports the "pdpe1gb" cpuinfo flag).
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
hung_task_panic=
[KNL] Should the hung task detector generate panics.
Format: <integer>
A nonzero value instructs the kernel to panic when a
hung task is detected. The default value is controlled
by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
option. The value selected by this boot parameter can
be changed later by the kernel.hung_task_panic sysctl.
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
keep_bootcon [KNL]
Do not unregister boot console at start. This is only
useful for debugging when something happens in the window
between unregistering the boot console and initializing
the real console.
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
Format:
<bus_id>,<clkrate>
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
Format:
<bus_id>,<clkrate>
i8042.debug [HW] Toggle i8042 debug mode
i8042.unmask_kbd_data
......@@ -1386,7 +1395,7 @@
Default: only on s2r transitions on x86; most other
architectures force reset to be always executed
i8042.unlock [HW] Unlock (ignore) the keylock
i8042.kbdreset [HW] Reset device connected to KBD port
i8042.kbdreset [HW] Reset device connected to KBD port
i810= [HW,DRM]
......@@ -1548,13 +1557,13 @@
programs exec'd, files mmap'd for exec, and all files
opened for read by uid=0.
ima_template= [IMA]
ima_template= [IMA]
Select one of defined IMA measurements template formats.
Formats: { "ima" | "ima-ng" | "ima-sig" }
Default: "ima-ng"
ima_template_fmt=
[IMA] Define a custom template format.
[IMA] Define a custom template format.
Format: { "field1|...|fieldN" }
ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
......@@ -1597,7 +1606,7 @@
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
Format: <irq>
int_pln_enable [x86] Enable power limit notification interrupt
int_pln_enable [x86] Enable power limit notification interrupt
integrity_audit=[IMA]
Format: { "0" | "1" }
......@@ -1650,39 +1659,39 @@
0 disables intel_idle and fall back on acpi_idle.
1 to 9 specify maximum depth of C-state.
intel_pstate= [X86]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
enabling its internal governor). This mode cannot be
used along with the hardware-managed P-states (HWP)
feature.
force
Enable intel_pstate on systems that prohibit it by default
in favor of acpi-cpufreq. Forcing the intel_pstate driver
instead of acpi-cpufreq may disable platform features, such
as thermal controls and power capping, that rely on ACPI
P-States information being indicated to OSPM and therefore
should be used with caution. This option does not work with
processors that aren't supported by the intel_pstate driver
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
no_hwp
Do not enable hardware P state control (HWP)
if available.
hwp_only
Only load intel_pstate on systems which support
hardware P state control (HWP) if available.
support_acpi_ppc
Enforce ACPI _PPC performance limits. If the Fixed ACPI
Description Table, specifies preferred power management
profile as "Enterprise Server" or "Performance Server",
then this feature is turned on by default.
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
intel_pstate= [X86]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
enabling its internal governor). This mode cannot be
used along with the hardware-managed P-states (HWP)
feature.
force
Enable intel_pstate on systems that prohibit it by default
in favor of acpi-cpufreq. Forcing the intel_pstate driver
instead of acpi-cpufreq may disable platform features, such
as thermal controls and power capping, that rely on ACPI
P-States information being indicated to OSPM and therefore
should be used with caution. This option does not work with
processors that aren't supported by the intel_pstate driver
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
no_hwp
Do not enable hardware P state control (HWP)
if available.
hwp_only
Only load intel_pstate on systems which support
hardware P state control (HWP) if available.
support_acpi_ppc
Enforce ACPI _PPC performance limits. If the Fixed ACPI
Description Table, specifies preferred power management
profile as "Enterprise Server" or "Performance Server",
then this feature is turned on by default.
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
intremap= [X86-64, Intel-IOMMU]
on enable Interrupt Remapping (default)
......@@ -2026,7 +2035,7 @@
* [no]ncqtrim: Turn off queued DSM TRIM.
* nohrst, nosrst, norst: suppress hard, soft
and both resets.
and both resets.
* rstonce: only attempt one reset during
hot-unplug link recovery
......@@ -2214,7 +2223,7 @@
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
memhp_default_state=online/offline
memhp_default_state=online/offline
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
......@@ -2764,7 +2773,7 @@
[X86,PV_OPS] Disable paravirtualized VMware scheduler
clock and use the default one.
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
steal time is computed, but won't influence scheduler
behaviour
......@@ -2825,7 +2834,7 @@
notsc [BUGS=X86-32] Disable Time Stamp Counter
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
soft-lockup and NMI watchdog (hard-lockup).
nowb [ARM]
......@@ -2845,7 +2854,7 @@
If the dependencies are under your control, you can
turn on cpu0_hotplug.
nps_mtm_hs_ctr= [KNL,ARC]
nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in
cycles, each HW thread of the CTOP can run
without interruptions, before HW switches it.
......@@ -2986,7 +2995,7 @@
pci=option[,option...] [PCI] various PCI subsystem options:
earlydump [X86] dump PCI config space before the kernel
changes anything
changes anything
off [X86] don't probe for the PCI bus
bios [X86-32] force use of PCI BIOS, don't access
the hardware directly. Use this if your machine
......@@ -3074,7 +3083,7 @@
is enabled by default. If you need to use this,
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
If you need to use this, please report a bug.
If you need to use this, please report a bug.
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
......@@ -3917,7 +3926,7 @@
cache (risks via metadata attacks are mostly
unchanged). Debug options disable merging on their
own.
For more information see Documentation/vm/slub.txt.
For more information see Documentation/vm/slub.rst.
slab_max_order= [MM, SLAB]
Determines the maximum allowed order for slabs.
......@@ -3931,7 +3940,7 @@
slub_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
Documentation/vm/slub.txt.
Documentation/vm/slub.rst.
slub_memcg_sysfs= [MM, SLUB]
Determines whether to enable sysfs directories for
......@@ -3945,7 +3954,7 @@
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
Documentation/vm/slub.txt.
Documentation/vm/slub.rst.
slub_min_objects= [MM, SLUB]
The minimum number of objects per slab. SLUB will
......@@ -3954,12 +3963,12 @@
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
For more information see Documentation/vm/slub.txt.
For more information see Documentation/vm/slub.rst.
slub_min_order= [MM, SLUB]
Determines the minimum page order for slabs. Must be
lower than slub_max_order.
For more information see Documentation/vm/slub.txt.
For more information see Documentation/vm/slub.rst.
slub_nomerge [MM, SLUB]
Same with slab_nomerge. This is supported for legacy.
......@@ -4357,7 +4366,8 @@
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/vm/transhuge.txt for more details.
See Documentation/admin-guide/mm/transhuge.rst
for more details.
tsc= Disable clocksource stability checks for TSC.
Format: <string>
......@@ -4435,7 +4445,7 @@
usbcore.initial_descriptor_timeout=
[USB] Specifies timeout for the initial 64-byte
USB_REQ_GET_DESCRIPTOR request in milliseconds
USB_REQ_GET_DESCRIPTOR request in milliseconds
(default 5000 = 5.0 seconds).
usbcore.nousb [USB] Disable the USB subsystem
......
.. _mm_concepts:
=================
Concepts overview
=================
The memory management in Linux is complex system that evolved over the
years and included more and more functionality to support variety of
systems from MMU-less microcontrollers to supercomputers. The memory
management for systems without MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will be
eventually written. Yet, although some of the concepts are the same,
here we assume that MMU is available and CPU can translate a virtual
address to a physical address.
.. contents:: :local:
Virtual Memory Primer
=====================
The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessary contiguous, it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture have different view
how these address ranges defined.
All this makes dealing directly with physical memory quite complex and
to avoid this complexity a concept of virtual memory was developed.
The virtual memory abstracts the details of physical memory from the
application software, allows to keep only needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.
With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes the an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.
The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at the kernel build time by setting an
appropriate kernel configuration option.
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from virtual address used by programs to real address in
the physical memory. The page tables organized hierarchically.
The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy with the next bits of the virtual address as the index to
that level page table. The lowest bits in the virtual address define
the offset inside the actual page.
Huge Pages
==========
The address translation requires several memory accesses and memory
accesses are slow relatively to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called Translation Lookaside Buffer (or
TLB). Usually TLB is pretty scarce resource and applications with
large memory working set will experience performance hit because of
TLB misses.
Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on TLB,
improves TLB hit-rate and thus improves overall system performance.
There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and mapped using huge pages. The hugetlbfs is described at
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.
Zones
=====
Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into kernel's address space and ZONE_NORMAL will
contain normally addressed pages.
The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.
Nodes
=====
Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred as `node` and for each node Linux
constructs an independent memory management subsystem. A node has it's
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
Page cache
==========
The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.
Anonymous Memory
================
The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for program's stack and heap or by explicit calls to mmap(2) system
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, regular
physical page will be allocated to hold the written data. The page
will be marked dirty and if the kernel will decide to repurpose it,
the dirty page will be swapped out.
Reclaim
=======
Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA'able buffers for device drivers use, data read from a filesystem,
memory allocated by user space processes etc.
Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.
In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when system is under memory
pressure.
The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When system is not loaded, most of the memory is free
and allocation request will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
down and when it reaches a certain threshold (high watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
will trigger the `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.
Compaction
==========
As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such
need may arise, for instance, when a device driver requires large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.
Like reclaim, the compaction may happen asynchronously in ``kcompactd``
daemon or synchronously as a result of memory allocation request.
OOM killer
==========
It may happen, that on a loaded machine memory will be exhausted. When
the kernel detects that the system runs out of memory (OOM) it invokes
`OOM killer`. Its mission is simple: all it has to do is to select a
task to sacrifice for the sake of the overall system health. The
selected task is killed in a hope that after it exits enough memory
will be freed to continue normal operation.
MOTIVATION
.. _idle_page_tracking:
==================
Idle Page Tracking
==================
Motivation
==========
The idle page tracking feature allows to track which memory pages are being
accessed by a workload and which are idle. This information can be useful for
......@@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster.
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
USER API
.. _user_api:
The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently,
it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap.
User API
========
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of the only read-write file,
``/sys/kernel/mm/page_idle/bitmap``.
The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
......@@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
set, the corresponding page is idle.
A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the IMPLEMENTATION
DETAILS section). To mark a page idle one has to set the bit corresponding to
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.
......@@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
and hence such pages are never reported idle.
For huge pages the idle flag is set only on the head page, so one has to read
/proc/kpageflags in order to correctly count idle huge pages.
``/proc/kpageflags`` in order to correctly count idle huge pages.
Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.
......@@ -41,21 +53,26 @@ That said, in order to estimate the amount of pages that are not used by a
workload one should:
1. Mark all the workload's pages as idle by setting corresponding bits in
/sys/kernel/mm/page_idle/bitmap. The pages can be found by reading
/proc/pid/pagemap if the workload is represented by a process, or by
filtering out alien pages using /proc/kpagecgroup in case the workload is
placed in a memory cgroup.
``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
``/proc/pid/pagemap`` if the workload is represented by a process, or by
filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
is placed in a memory cgroup.
2. Wait until the workload accesses its working set.
3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If
one wants to ignore certain types of pages, e.g. mlocked pages since they
are not reclaimable, he or she can filter them out using /proc/kpageflags.
3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
If one wants to ignore certain types of pages, e.g. mlocked pages since they
are not reclaimable, he or she can filter them out using
``/proc/kpageflags``.
See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
``/proc/kpagecgroup``.
See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap,
/proc/kpageflags, and /proc/kpagecgroup.
.. _impl_details:
IMPLEMENTATION DETAILS
Implementation Details
======================
The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
......@@ -77,7 +94,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.
The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.
......
=================
Memory Management
=================
Linux memory management subsystem is responsible, as the name implies,
for managing the memory in the system. This includes implemnetation of
virtual memory and demand paging, memory allocation both for kernel
internal structures and user space programms, mapping of files into
processes address space and many other cool things.
Linux memory management is a complex system with many configurable
settings. Most of these settings are available via ``/proc``
filesystem and can be quired and adjusted using ``sysctl``. These APIs
are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
Linux memory management has its own jargon and if you are not yet
familiar with it, consider reading
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
Here we document in detail how to interact with various mechanisms in
the Linux memory management.
.. toctree::
:maxdepth: 1
concepts
hugetlbpage
idle_page_tracking
ksm
numa_memory_policy
pagemap
soft-dirty
transhuge
userfaultfd
.. _admin_guide_ksm:
=======================
Kernel Samepage Merging
=======================
Overview
========
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
by sharing the data common between them. But it can be useful to any
application which generates many instances of the same data.
The KSM daemon ksmd periodically scans those areas of user memory
which have been registered with it, looking for pages of identical
content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The amount of pages that KSM daemon scans in a single pass
and the time between the passes are configured using :ref:`sysfs
intraface <ksm_sysfs>`
KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
be swapped out just like other user pages (but sharing is broken when they
are swapped back in: ksmd must rediscover their identity and merge again).
Controlling KSM with madvise
============================
KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
system call::
int madvise(addr, length, MADV_MERGEABLE)
The app may call
::
int madvise(addr, length, MADV_UNMERGEABLE)
to cancel that advice and restore unshared pages: whereupon KSM
unmerges whatever it merged in that range. Note: this unmerging call
may suddenly require more memory than is available - possibly failing
with EAGAIN, but more probably arousing the Out-Of-Memory killer.
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
the KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
and might fail with EAGAIN if not enough memory for internal structures.
Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit. KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason.
.. _ksm_sysfs:
KSM daemon sysfs interface
==========================
The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
readable by all but writable only by root:
pages_to_scan
how many pages to scan before ksmd goes to sleep
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
Default: 100 (chosen for demonstration purposes)
sleep_millisecs
how many milliseconds ksmd should sleep before next scan
e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
Default: 20 (chosen for demonstration purposes)
merge_across_nodes
specifies if pages from different NUMA nodes can be merged.
When set to 0, ksm merges only pages which physically reside
in the memory area of same NUMA node. That brings lower
latency to access of shared pages. Systems with more nodes, at
significant NUMA distances, are likely to benefit from the
lower latency of setting 0. Smaller systems, which need to
minimize memory usage, are likely to benefit from the greater
sharing of setting 1 (default). You may wish to compare how
your system performs under each setting, before deciding on
which to use. ``merge_across_nodes`` setting can be changed only
when there are no ksm shared pages in the system: set run 2 to
unmerge pages first, then to 1 after changing
``merge_across_nodes``, to remerge according to the new setting.
Default: 1 (merging across nodes as in earlier releases)
run
* set to 0 to stop ksmd from running but keep merged pages,
* set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
* set to 2 to stop ksmd and unmerge all pages currently merged, but
leave mergeable areas registered for next run.
Default: 0 (must be changed to 1 to activate KSM, except if
CONFIG_SYSFS is disabled)
use_zero_pages
specifies whether empty pages (i.e. allocated pages that only
contain zeroes) should be treated specially. When set to 1,
empty pages are merged with the kernel zero page(s) instead of
with each other as it would happen normally. This can improve
the performance on architectures with coloured zero pages,
depending on the workload. Care should be taken when enabling
this setting, as it can potentially degrade the performance of
KSM for some workloads, for example if the checksums of pages
candidate for merging match the checksum of an empty
page. This setting can be changed at any time, it is only
effective for pages merged after the change.
Default: 0 (normal KSM behaviour as in earlier releases)
max_page_sharing
Maximum sharing allowed for each KSM page. This enforces a
deduplication limit to avoid high latency for virtual memory
operations that involve traversal of the virtual mappings that
share the KSM page. The minimum value is 2 as a newly created
KSM page will have at least two sharers. The higher this value
the faster KSM will merge the memory and the higher the
deduplication factor will be, but the slower the worst case
virtual mappings traversal could be for any given KSM
page. Slowing down this traversal means there will be higher
latency for certain virtual memory operations happening during
swapping, compaction, NUMA balancing and page migration, in
turn decreasing responsiveness for the caller of those virtual
memory operations. The scheduler latency of other tasks not
involved with the VM operations doing the virtual mappings
traversal is not affected by this parameter as these
traversals are always schedule friendly themselves.
stable_node_chains_prune_millisecs
specifies how frequently KSM checks the metadata of the pages
that hit the deduplication limit for stale information.
Smaller milllisecs values will free up the KSM metadata with
lower latency, but they will make ksmd use more CPU during the
scan. It's a noop if not a single KSM page hit the
``max_page_sharing`` yet.
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
pages_shared
how many shared pages are being used
pages_sharing
how many more sites are sharing them i.e. how much saved
pages_unshared
how many pages unique but repeatedly checked for merging
pages_volatile
how many pages changing too fast to be placed in a tree
full_scans
how many times all mergeable areas have been scanned
stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
indicates wasted effort. ``pages_volatile`` embraces several
different kinds of activity, but a high proportion there would also
indicate poor use of madvise MADV_MERGEABLE.
The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
pagemap, from the userspace perspective
---------------------------------------
.. _pagemap:
=============================
Examining Process Page Tables
=============================
pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.
reading files in ``/proc``.
There are four components to pagemap:
* /proc/pid/pagemap. This file lets a userspace process find out which
* ``/proc/pid/pagemap``. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
value for each virtual page, containing the following data (from
fs/proc/task_mmu.c, above pagemap_read):
``fs/proc/task_mmu.c``, above pagemap_read):
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
* Bit 55 pte is soft-dirty (see
:ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
* Bit 56 page exclusively mapped (since 4.2)
* Bits 57-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
......@@ -33,28 +37,28 @@ There are four components to pagemap:
precisely which pages are mapped (or in swap) and comparing mapped
pages between processes.
Efficient users of this interface will use /proc/pid/maps to
Efficient users of this interface will use ``/proc/pid/maps`` to
determine which areas of memory are actually mapped and llseek to
skip over unmapped regions.
* /proc/kpagecount. This file contains a 64-bit count of the number of
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
* /proc/kpageflags. This file contains a 64-bit set of flags for each
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
page, indexed by PFN.
The flags are (from fs/proc/page.c, above kpageflags_read):
0. LOCKED
1. ERROR
2. REFERENCED
3. UPTODATE
4. DIRTY
5. LRU
6. ACTIVE
7. SLAB
8. WRITEBACK
9. RECLAIM
The flags are (from ``fs/proc/page.c``, above kpageflags_read):
0. LOCKED
1. ERROR
2. REFERENCED
3. UPTODATE
4. DIRTY
5. LRU
6. ACTIVE
7. SLAB
8. WRITEBACK
9. RECLAIM
10. BUDDY
11. MMAP
12. ANON
......@@ -72,98 +76,111 @@ There are four components to pagemap:
24. ZERO_PAGE
25. IDLE
* /proc/kpagecgroup. This file contains a 64-bit inode number of the
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.
Short descriptions to the page flags:
0. LOCKED
page is being locked for exclusive access, eg. by undergoing read/write IO
7. SLAB
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
When compound page is used, SLUB/SLQB will only set this flag on the head
page; SLOB will not flag it at all.
Short descriptions to the page flags
====================================
10. BUDDY
0 - LOCKED
page is being locked for exclusive access, e.g. by undergoing read/write IO
7 - SLAB
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
When compound page is used, SLUB/SLQB will only set this flag on the head
page; SLOB will not flag it at all.
10 - BUDDY
a free memory block managed by the buddy system allocator
The buddy system organizes free memory in blocks of various orders.
An order N block has 2^N physically contiguous pages, with the BUDDY flag
set for and _only_ for the first page.
15. COMPOUND_HEAD
16. COMPOUND_TAIL
15 - COMPOUND_HEAD
A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its
head page and T donates its tail page(s). The major consumers of compound
pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
memory allocators and various device drivers. However in this interface,
only huge/giga pages are made visible to end users.
17. HUGE
pages are hugeTLB pages
(:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
the SLUB etc. memory allocators and various device drivers.
However in this interface, only huge/giga pages are made visible
to end users.
16 - COMPOUND_TAIL
A compound page tail (see description above).
17 - HUGE
this is an integral part of a HugeTLB page
19. HWPOISON
19 - HWPOISON
hardware detected memory corruption on this page: don't touch the data!
20. NOPAGE
20 - NOPAGE
no page frame exists at the requested address
21. KSM
21 - KSM
identical memory pages dynamically shared between one or more processes
22. THP
22 - THP
contiguous pages which construct transparent hugepages
23. BALLOON
23 - BALLOON
balloon compaction page
24. ZERO_PAGE
24 - ZERO_PAGE
zero page for pfn_zero or huge_zero page
25. IDLE
25 - IDLE
page has not been accessed since it was marked idle (see
Documentation/vm/idle_page_tracking.txt). Note that this flag may be
stale in case the page was accessed via a PTE. To make sure the flag
is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first.
[IO related page flags]
1. ERROR IO error occurred
3. UPTODATE page has up-to-date data
ie. for file backed page: (in-memory data revision >= on-disk one)
4. DIRTY page has been written to, hence contains new data
ie. for file backed page: (in-memory data revision > on-disk one)
8. WRITEBACK page is being synced to disk
[LRU related page flags]
5. LRU page is in one of the LRU lists
6. ACTIVE page is in the active LRU list
18. UNEVICTABLE page is in the unevictable (non-)LRU list
It is somehow pinned and not a candidate for LRU page reclaims,
eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
2. REFERENCED page has been referenced since last LRU list enqueue/requeue
9. RECLAIM page will be reclaimed soon after its pageout IO completed
11. MMAP a memory mapped page
12. ANON a memory mapped page that is not part of a file
13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry
14. SWAPBACKED page is backed by swap/RAM
:ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
Note that this flag may be stale in case the page was accessed via
a PTE. To make sure the flag is up-to-date one has to read
``/sys/kernel/mm/page_idle/bitmap`` first.
IO related page flags
---------------------
1 - ERROR
IO error occurred
3 - UPTODATE
page has up-to-date data
ie. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
page has been written to, hence contains new data
i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
page is being synced to disk
LRU related page flags
----------------------
5 - LRU
page is in one of the LRU lists
6 - ACTIVE
page is in the active LRU list
18 - UNEVICTABLE
page is in the unevictable (non-)LRU list It is somehow pinned and
not a candidate for LRU page reclaims, e.g. ramfs pages,
shmctl(SHM_LOCK) and mlock() memory segments
2 - REFERENCED
page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
page will be reclaimed soon after its pageout IO completed
11 - MMAP
a memory mapped page
12 - ANON
a memory mapped page that is not part of a file
13 - SWAPCACHE
page is mapped to swap space, i.e. has an associated swap entry
14 - SWAPBACKED
page is backed by swap/RAM
The page-types tool in the tools/vm directory can be used to query the
above flags.
Using pagemap to do something useful:
Using pagemap to do something useful
====================================
The general procedure for using pagemap to find out about a process' memory
usage goes like this:
1. Read /proc/pid/maps to determine which parts of the memory space are
1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
mapped to what.
2. Select the maps you are interested in -- all of them, or a particular
library, or the stack or the heap, etc.
3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap.
5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
read, seek to that entry in the file, and read the data you want.
5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
just read, seek to that entry in the file, and read the data you want.
For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
......@@ -171,7 +188,8 @@ you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
Other notes:
Other notes
===========
Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
......
SOFT-DIRTY PTEs
.. _soft_dirty:
The soft-dirty is a bit on a PTE which helps to track which pages a task
===============
Soft-Dirty PTEs
===============
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from the task's PTEs.
This is done by writing "4" into the /proc/PID/clear_refs file of the
This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
task in question.
2. Wait some time.
3. Read soft-dirty bits from the PTEs.
This is done by reading from the /proc/PID/pagemap. The bit 55 of the
This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
64-bit qword is the soft-dirty one. If set, the respective PTE was
written to since step 1.
Internally, to do this tracking, the writable bit is cleared from PTEs
Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after the
Note, that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus all
the kernel does is finds this fact out and puts both writable and soft-dirty
bits on the PTE.
While in most cases tracking memory changes by #PF-s is more than enough
While in most cases tracking memory changes by #PF-s is more than enough
there is still a scenario when we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
......@@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such
memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty.
This feature is actively used by the checkpoint-restore project. You
This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org
......
= Userfaultfd =
.. _userfaultfd:
== Objective ==
===========
Userfaultfd
===========
Objective
=========
Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
......@@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do.
For example userfaults allows a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.
== Design ==
Design
======
Userfaults are delivered and resolved through the userfaultfd syscall.
......@@ -41,7 +47,8 @@ different processes without them being aware about what is going on
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).
== API ==
API
===
When first opened the userfaultfd must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
......@@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
== QEMU/KVM ==
QEMU/KVM
========
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
......@@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).
== Non-cooperative userfaultfd ==
Non-cooperative userfaultfd
===========================
When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
......@@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to UFFDIO_API ioctl:
UFFD_FEATURE_EVENT_FORK - enable userfaultfd hooks for fork(). When
this feature is enabled, the userfaultfd context of the parent process
is duplicated into the newly created process. The manager receives
UFFD_EVENT_FORK with file descriptor of the new userfaultfd context in
the uffd_msg.fork.
UFFD_FEATURE_EVENT_REMAP - enable notifications about mremap()
calls. When the non-cooperative process moves a virtual memory area to
a different location, the manager will receive UFFD_EVENT_REMAP. The
uffd_msg.remap will contain the old and new addresses of the area and
its original length.
UFFD_FEATURE_EVENT_REMOVE - enable notifications about
madvise(MADV_REMOVE) and madvise(MADV_DONTNEED) calls. The event
UFFD_EVENT_REMOVE will be generated upon these calls to madvise. The
uffd_msg.remove will contain start and end addresses of the removed
area.
UFFD_FEATURE_EVENT_UNMAP - enable notifications about memory
unmapping. The manager will get UFFD_EVENT_UNMAP with uffd_msg.remove
containing start and end addresses of the unmapped area.
UFFD_FEATURE_EVENT_FORK
enable userfaultfd hooks for fork(). When this feature is
enabled, the userfaultfd context of the parent process is
duplicated into the newly created process. The manager
receives UFFD_EVENT_FORK with file descriptor of the new
userfaultfd context in the uffd_msg.fork.
UFFD_FEATURE_EVENT_REMAP
enable notifications about mremap() calls. When the
non-cooperative process moves a virtual memory area to a
different location, the manager will receive
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
new addresses of the area and its original length.
UFFD_FEATURE_EVENT_REMOVE
enable notifications about madvise(MADV_REMOVE) and
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
be generated upon these calls to madvise. The uffd_msg.remove
will contain start and end addresses of the removed area.
UFFD_FEATURE_EVENT_UNMAP
enable notifications about memory unmapping. The manager will
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
end addresses of the unmapped area.
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are pretty similar, they quite differ in the action expected from the
......
......@@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners:
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
B. Use Device Tree bindings, as described in
``Documentation/device-tree/bindings/reserved-memory/admin-guide/ramoops.rst``.
``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
For example::
reserved-memory {
......
......@@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions)
88DE3010, Armada 1000 (no Linux support)
Core: Marvell PJ1 (ARMv5TE), Dual-core
Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
88DE3005, Armada 1500-mini
88DE3005, Armada 1500 Mini
Design name: BG2CD
Core: ARM Cortex-A9, PL310 L2CC
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini/
88DE3006, Armada 1500 Mini Plus
Design name: BG2CDP
Core: Dual Core ARM Cortex-A7
Homepage: http://www.marvell.com/multimedia-solutions/armada-1500-mini-plus/
88DE3006, Armada 1500 Mini Plus
Design name: BG2CDP
Core: Dual Core ARM Cortex-A7
88DE3100, Armada 1500
Design name: BG2
Core: Marvell PJ4B-MP (ARMv7), Tauros3 L2CC
Product Brief: http://www.marvell.com/digital-entertainment/armada-1500/assets/Marvell-ARMADA-1500-Product-Brief.pdf
88DE3114, Armada 1500 Pro
Design name: BG2Q
Core: Quad Core ARM Cortex-A9, PL310 L2CC
......@@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions)
88DE3218, ARMADA 1500 Ultra
Core: ARM Cortex-A53
Homepage: http://www.marvell.com/multimedia-solutions/
Homepage: https://www.synaptics.com/products/multimedia-solutions
Directory: arch/arm/mach-berlin
Comments:
* This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs
with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...).
* The Berlin family was acquired by Synaptics from Marvell in 2017.
CPU Cores
---------
......
Embedded device command line partition parsing
=====================================================================
Support for reading the block device partition table from the command line.
The "blkdevparts" command line option adds support for reading the
block device partition table from the kernel command line.
It is typically used for fixed block (eMMC) embedded devices.
It has no MBR, so saves storage space. Bootloader can be easily accessed
by absolute address of data on the block device.
......@@ -14,22 +16,27 @@ blkdevparts=<blkdev-def>[;<blkdev-def>]
<partdef> := <size>[@<offset>](part-name)
<blkdev-id>
block device disk name, embedded device used fixed block device,
it's disk name also fixed. such as: mmcblk0, mmcblk1, mmcblk0boot0.
block device disk name. Embedded device uses fixed block device.
Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.
<size>
partition size, in bytes, such as: 512, 1m, 1G.
size may contain an optional suffix of (upper or lower case):
K, M, G, T, P, E.
"-" is used to denote all remaining space.
<offset>
partition start address, in bytes.
offset may contain an optional suffix of (upper or lower case):
K, M, G, T, P, E.
(part-name)
partition name, kernel send uevent with "PARTNAME". application can create
a link to block device partition with the name "PARTNAME".
user space application can access partition by partition name.
partition name. Kernel sends uevent with "PARTNAME". Application can
create a link to block device partition with the name "PARTNAME".
User space application can access partition by partition name.
Example:
eMMC disk name is "mmcblk0" and "mmcblk0boot0"
eMMC disk names are "mmcblk0" and "mmcblk0boot0".
bootargs:
'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
......
=================================
GFP masks used from FS/IO context
=================================
:Date: May, 2018
:Author: Michal Hocko <mhocko@kernel.org>
Introduction
============
Code paths in the filesystem and IO stacks must be careful when
allocating memory to prevent recursion deadlocks caused by direct
memory reclaim calling back into the FS or IO paths and blocking on
already held resources (e.g. locks - most commonly those used for the
transaction context).
The traditional way to avoid this deadlock problem is to clear __GFP_FS
respectively __GFP_IO (note the latter implies clearing the first as well) in
the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
used as shortcut. It turned out though that above approach has led to
abuses when the restricted gfp mask is used "just in case" without a
deeper consideration which leads to problems because an excessive use
of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
reclaim issues.
New API
========
Since 4.12 we do have a generic scope API for both NOFS and NOIO context
``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
``memalloc_noio_restore`` which allow to mark a scope to be a critical
section from a filesystem or I/O point of view. Any allocation from that
scope will inherently drop __GFP_FS respectively __GFP_IO from the given
mask so no memory allocation can recurse back in the FS/IO.
.. kernel-doc:: include/linux/sched/mm.h
:functions: memalloc_nofs_save memalloc_nofs_restore
.. kernel-doc:: include/linux/sched/mm.h
:functions: memalloc_noio_save memalloc_noio_restore
FS/IO code then simply calls the appropriate save function before
any critical section with respect to the reclaim is started - e.g.
lock shared with the reclaim context or when a transaction context
nesting would be possible via reclaim. The restore function should be
called when the critical section ends. All that ideally along with an
explanation what is the reclaim context for easier maintenance.
Please note that the proper pairing of save/restore functions
allows nesting so it is safe to call ``memalloc_noio_save`` or
``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
scope.
What about __vmalloc(GFP_NOFS)
==============================
vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
almost always a bug. The good news is that the NOFS/NOIO semantic can be
achieved by the scope API.
In the ideal world, upper layers should already mark dangerous contexts
and so no special care is required and vmalloc should be called without
any problems. Sometimes if the context is not really clear or there are
layering violations then the recommended way around that is to wrap ``vmalloc``
by the scope API with a comment explaining the problem.
......@@ -14,6 +14,7 @@ Core utilities
kernel-api
assoc_array
atomic_ops
cachetlb
refcount-vs-atomic
cpu_hotplug
idr
......@@ -25,6 +26,8 @@ Core utilities
genalloc
errseq
printk-formats
circular-buffers
gfp_mask-from-fs-io
Interfaces for kernel debugging
===============================
......
......@@ -39,17 +39,17 @@ String Manipulation
.. kernel-doc:: lib/string.c
:export:
Basic Kernel Library Functions
==============================
The Linux kernel provides more basic utility functions.
Bit Operations
--------------
.. kernel-doc:: arch/x86/include/asm/bitops.h
:internal:
Basic Kernel Library Functions
==============================
The Linux kernel provides more basic utility functions.
Bitmap Operations
-----------------
......@@ -80,6 +80,31 @@ Command-line Parsing
.. kernel-doc:: lib/cmdline.c
:export:
Sorting
-------
.. kernel-doc:: lib/sort.c
:export:
.. kernel-doc:: lib/list_sort.c
:export:
Text Searching
--------------
.. kernel-doc:: lib/textsearch.c
:doc: ts_intro
.. kernel-doc:: lib/textsearch.c
:export:
.. kernel-doc:: include/linux/textsearch.h
:functions: textsearch_find textsearch_next \
textsearch_get_pattern textsearch_get_pattern_len
CRC and Math Functions in Linux
===============================
CRC Functions
-------------
......@@ -103,9 +128,6 @@ CRC Functions
.. kernel-doc:: lib/crc-itu-t.c
:export:
Math Functions in Linux
=======================
Base 2 log and power Functions
------------------------------
......@@ -127,28 +149,6 @@ Division Functions
.. kernel-doc:: lib/gcd.c
:export:
Sorting
-------
.. kernel-doc:: lib/sort.c
:export:
.. kernel-doc:: lib/list_sort.c
:export:
Text Searching
--------------
.. kernel-doc:: lib/textsearch.c
:doc: ts_intro
.. kernel-doc:: lib/textsearch.c
:export:
.. kernel-doc:: include/linux/textsearch.h
:functions: textsearch_find textsearch_next \
textsearch_get_pattern textsearch_get_pattern_len
UUID/GUID
---------
......
......@@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in
these memory ordering guarantees.
The terms used through this document try to follow the formal LKMM defined in
github.com/aparri/memory-model/blob/master/Documentation/explanation.txt
tools/memory-model/Documentation/explanation.txt.
memory-barriers.txt and atomic_t.txt provide more background to the
memory ordering in general and for atomic operations specifically.
......
......@@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples.
architecture
devel-algos
userspace-if
crypto_engine
api
api-samples
......@@ -120,7 +120,7 @@ A typical out of bounds access report looks like this::
The header of the report discribe what kind of bug happened and what kind of
access caused it. It's followed by the description of the accessed slub object
(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and
the description of the accessed memory page.
In the last section the report shows memory state around the accessed address.
......
......@@ -151,6 +151,11 @@ Contributing new tests (details)
TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
test.
* First use the headers inside the kernel source and/or git repo, and then the
system headers. Headers for the kernel release as opposed to headers
installed by the distro on the system should be the primary focus to be able
to find regressions.
Test Harness
============
......
......@@ -40,4 +40,4 @@ API
---
.. kernel-doc:: drivers/base/devcon.c
: functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove
:functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove
......@@ -44,7 +44,7 @@ common to each controller of that type:
- methods to establish GPIO line direction
- methods used to access GPIO line values
- method to set electrical configuration to a a given GPIO line
- method to set electrical configuration for a given GPIO line
- method to return the IRQ number associated to a given GPIO line
- flag saying whether calls to its methods may sleep
- optional line names array to identify lines
......@@ -143,7 +143,7 @@ resistor will make the line tend to high level unless one of the transistors on
the rail actively pulls it down.
The level on the line will go as high as the VDD on the pull-up resistor, which
may be higher than the level supported by the transistor, achieveing a
may be higher than the level supported by the transistor, achieving a
level-shift to the higher VDD.
Integrated electronics often have an output driver stage in the form of a CMOS
......@@ -382,7 +382,7 @@ Real-Time compliance for GPIO IRQ chips
Any provider of irqchips needs to be carefully tailored to support Real Time
preemption. It is desirable that all irqchips in the GPIO subsystem keep this
in mind and does the proper testing to assure they are real time-enabled.
in mind and do the proper testing to assure they are real time-enabled.
So, pay attention on above " RT_FULL:" notes, please.
The following is a checklist to follow when preparing a driver for real
time-compliance:
......
......@@ -17,7 +17,9 @@ available subsections can be seen below.
basics
infrastructure
pm/index
clk
device-io
device_connection
dma-buf
device_link
message-based
......
......@@ -711,7 +711,8 @@ The vmbus device regions are mapped into uio device resources:
If a subchannel is created by a request to host, then the uio_hv_generic
device driver will create a sysfs binary file for the per-channel ring buffer.
For example:
For example::
/sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring
Further information
......
#
# Feature name: strncasecmp
# Kconfig: __HAVE_ARCH_STRNCASECMP
# description: arch provides an optimized strncasecmp() function
# Feature name: cBPF-JIT
# Kconfig: HAVE_CBPF_JIT
# description: arch supports cBPF JIT optimizations
#
-----------------------
| arch |status|
......@@ -16,14 +16,16 @@
| ia64: | TODO |
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
| sparc: | ok |
| um: | TODO |
| unicore32: | TODO |
| x86: | TODO |
......
#
# Feature name: BPF-JIT
# Kconfig: HAVE_BPF_JIT
# description: arch supports BPF JIT optimizations
# Feature name: eBPF-JIT
# Kconfig: HAVE_EBPF_JIT
# description: arch supports eBPF JIT optimizations
#
-----------------------
| arch |status|
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| openrisc: | ok |
| parisc: | ok |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | ok |
| nios2: | ok |
| openrisc: | ok |
| parisc: | ok |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,15 +17,17 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
| um: | TODO |
| unicore32: | TODO |
| x86: | ok | 64-bit only
| x86: | ok |
| xtensa: | ok |
-----------------------
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | ok |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | ok |
| sparc: | TODO |
......
......@@ -11,16 +11,18 @@
| arm: | ok |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| h8300: | ok |
| hexagon: | ok |
| ia64: | TODO |
| m68k: | TODO |
| microblaze: | ok |
| mips: | ok |
| nds32: | TODO |
| nios2: | ok |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | ok |
| arm: | ok |
| arm64: | TODO |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | TODO |
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | ok |
| arm: | ok |
| arm64: | TODO |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | TODO |
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | ok |
| sparc: | TODO |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | ok |
| arm64: | TODO |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | TODO |
......@@ -17,13 +17,15 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
| sparc: | ok |
| um: | TODO |
| unicore32: | TODO |
| x86: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,11 +17,13 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| s390: | TODO |
| riscv: | ok |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
| um: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | ok |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | TODO |
| arm64: | TODO |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | TODO |
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | ok |
| mips: | ok |
| nds32: | ok |
| nios2: | TODO |
| openrisc: | TODO |
| openrisc: | ok |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -9,21 +9,23 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | TODO |
| arm64: | TODO |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | TODO |
| ia64: | TODO |
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| openrisc: | ok |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
| sparc: | ok |
| um: | TODO |
| unicore32: | TODO |
| x86: | ok |
......
......@@ -16,14 +16,16 @@
| ia64: | TODO |
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| openrisc: | ok |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
| sparc: | ok |
| um: | TODO |
| unicore32: | TODO |
| x86: | ok |
......
#
# Feature name: rwsem-optimized
# Kconfig: Optimized asm/rwsem.h
# Kconfig: !RWSEM_GENERIC_SPINLOCK
# description: arch provides optimized rwsem APIs
#
-----------------------
......@@ -8,8 +8,8 @@
-----------------------
| alpha: | ok |
| arc: | TODO |
| arm: | TODO |
| arm64: | TODO |
| arm: | ok |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | TODO |
......@@ -17,14 +17,16 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
| um: | TODO |
| um: | ok |
| unicore32: | TODO |
| x86: | ok |
| xtensa: | ok |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | ok |
| arm64: | TODO |
| arm64: | ok |
| c6x: | TODO |
| h8300: | TODO |
| hexagon: | ok |
......@@ -17,13 +17,15 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | ok |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | ok |
| sparc: | TODO |
| sparc: | ok |
| um: | TODO |
| unicore32: | TODO |
| x86: | ok |
......
......@@ -17,11 +17,13 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| s390: | TODO |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
| um: | TODO |
......
......@@ -17,11 +17,13 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| s390: | TODO |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
| um: | TODO |
......
......@@ -40,10 +40,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | .. |
| arm: | .. |
| arm64: | .. |
| arm64: | ok |
| c6x: | .. |
| h8300: | .. |
| hexagon: | .. |
......@@ -17,11 +17,13 @@
| m68k: | .. |
| microblaze: | .. |
| mips: | TODO |
| nds32: | TODO |
| nios2: | .. |
| openrisc: | .. |
| parisc: | .. |
| powerpc: | ok |
| s390: | .. |
| riscv: | TODO |
| s390: | ok |
| sh: | .. |
| sparc: | TODO |
| um: | .. |
......
#
# Small script that refreshes the kernel feature support status in place.
#
for F_FILE in Documentation/features/*/*/arch-support.txt; do
F=$(grep "^# Kconfig:" "$F_FILE" | cut -c26-)
#
# Each feature F is identified by a pair (O, K), where 'O' can
# be either the empty string (for 'nop') or "not" (the logical
# negation operator '!'); other operators are not supported.
#
O=""
K=$F
if [[ "$F" == !* ]]; then
O="not"
K=$(echo $F | sed -e 's/^!//g')
fi
#
# F := (O, K) is 'valid' iff there is a Kconfig file (for some
# arch) which contains K.
#
# Notice that this definition entails an 'asymmetry' between
# the case 'O = ""' and the case 'O = "not"'. E.g., F may be
# _invalid_ if:
#
# [case 'O = ""']
# 1) no arch provides support for F,
# 2) K does not exist (e.g., it was renamed/mis-typed);
#
# [case 'O = "not"']
# 3) all archs provide support for F,
# 4) as in (2).
#
# The rationale for adopting this definition (and, thus, for
# keeping the asymmetry) is:
#
# We want to be able to 'detect' (2) (or (4)).
#
# (1) and (3) may further warn the developers about the fact
# that K can be removed.
#
F_VALID="false"
for ARCH_DIR in arch/*/; do
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
K_GREP=$(grep "$K" $K_FILES)
if [ ! -z "$K_GREP" ]; then
F_VALID="true"
break
fi
done
if [ "$F_VALID" = "false" ]; then
printf "WARNING: '%s' is not a valid Kconfig\n" "$F"
fi
T_FILE="$F_FILE.tmp"
grep "^#" $F_FILE > $T_FILE
echo " -----------------------" >> $T_FILE
echo " | arch |status|" >> $T_FILE
echo " -----------------------" >> $T_FILE
for ARCH_DIR in arch/*/; do
ARCH=$(echo $ARCH_DIR | sed -e 's/arch//g' | sed -e 's/\///g')
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
K_GREP=$(grep "$K" $K_FILES)
#
# Arch support status values for (O, K) are updated according
# to the following rules.
#
# - ("", K) is 'supported by a given arch', if there is a
# Kconfig file for that arch which contains K;
#
# - ("not", K) is 'supported by a given arch', if there is
# no Kconfig file for that arch which contains K;
#
# - otherwise: preserve the previous status value (if any),
# default to 'not yet supported'.
#
# Notice that, according these rules, invalid features may be
# updated/modified.
#
if [ "$O" = "" ] && [ ! -z "$K_GREP" ]; then
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
elif [ "$O" = "not" ] && [ -z "$K_GREP" ]; then
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
else
S=$(grep -v "^#" "$F_FILE" | grep " $ARCH:")
if [ ! -z "$S" ]; then
echo "$S" >> $T_FILE
else
printf " |%12s: | TODO |\n" "$ARCH" \
>> $T_FILE
fi
fi
done
echo " -----------------------" >> $T_FILE
mv $T_FILE $F_FILE
done
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| parisc: | ok |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,12 +17,14 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sh: | ok |
| sparc: | TODO |
| um: | TODO |
| unicore32: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | ok |
| microblaze: | ok |
| mips: | ok |
| nds32: | ok |
| nios2: | ok |
| openrisc: | ok |
| parisc: | TODO |
| parisc: | ok |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | .. |
| powerpc: | .. |
| powerpc: | ok |
| riscv: | TODO |
| s390: | .. |
| sh: | TODO |
| sparc: | .. |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | ok |
| mips: | ok |
| nds32: | ok |
| nios2: | ok |
| openrisc: | ok |
| parisc: | ok |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | ok |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | ok |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| parisc: | ok |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | .. |
| microblaze: | .. |
| mips: | ok |
| nds32: | TODO |
| nios2: | .. |
| openrisc: | .. |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | .. |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | .. |
| microblaze: | .. |
| mips: | TODO |
| nds32: | TODO |
| nios2: | .. |
| openrisc: | .. |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | TODO |
| riscv: | TODO |
| s390: | TODO |
| sh: | TODO |
| sparc: | TODO |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | TODO |
| sh: | ok |
| sparc: | TODO |
......
......@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | .. |
| arm: | .. |
| arm64: | .. |
| arm64: | ok |
| c6x: | .. |
| h8300: | .. |
| hexagon: | .. |
......@@ -17,10 +17,12 @@
| m68k: | .. |
| microblaze: | ok |
| mips: | ok |
| nds32: | TODO |
| nios2: | .. |
| openrisc: | .. |
| parisc: | .. |
| powerpc: | ok |
| riscv: | ok |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -17,10 +17,12 @@
| m68k: | TODO |
| microblaze: | TODO |
| mips: | TODO |
| nds32: | TODO |
| nios2: | TODO |
| openrisc: | TODO |
| parisc: | TODO |
| powerpc: | ok |
| riscv: | TODO |
| s390: | ok |
| sh: | ok |
| sparc: | ok |
......
......@@ -515,7 +515,8 @@ guarantees:
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
bits on both physical and virtual pages associated with a process, and the
soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
for details).
To clear the bits for all the pages associated with the process
> echo 1 > /proc/PID/clear_refs
......@@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect.
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
using /proc/kpageflags and number of times a page is mapped using
/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
/proc/kpagecount. For detailed explanation, see
Documentation/admin-guide/mm/pagemap.rst.
The /proc/pid/numa_maps is an extension based on maps, showing the memory
locality and binding policy, as well as the memory usage (in pages) of
......@@ -564,7 +566,7 @@ address policy mapping details
Where:
"address" is the starting address for the mapping;
"policy" reports the NUMA memory policy set for the mapping (see vm/numa_memory_policy.txt);
"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst);
"mapping details" summarizes mapping data such as mapping type, page usage counters,
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
......
......@@ -105,8 +105,9 @@ policy for the file will revert to "default" policy.
NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
when tmpfs is mounted by appending them to the mode before the NodeList.
See Documentation/vm/numa_memory_policy.txt for a list of all available
memory allocation policy mode flags and their effect on memory policy.
See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
all available memory allocation policy mode flags and their effect on
memory policy.
=static is equivalent to MPOL_F_STATIC_NODES
=relative is equivalent to MPOL_F_RELATIVE_NODES
......
......@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
.. toctree::
:maxdepth: 2
userspace-api/index
userspace-api/index
Introduction to kernel development
......@@ -89,6 +89,7 @@ needed).
sound/index
crypto/index
filesystems/index
vm/index
Architecture-specific documentation
-----------------------------------
......
......@@ -73,7 +73,9 @@ will have a second iteration or at least an extension for any given interface.
future extensions is going right down the gutters since someone will submit
an ioctl struct with random stack garbage in the yet unused parts. Which
then bakes in the ABI that those fields can never be used for anything else
but garbage.
but garbage. This is also the reason why you must explicitly pad all
structures, even if you never use them in an array - the padding the compiler
might insert could contain garbage.
* Have simple testcases for all of the above.
......
......@@ -2903,7 +2903,7 @@ is discarded from the CPU's cache and reloaded. To deal with this, the
appropriate part of the kernel must invalidate the overlapping bits of the
cache on each CPU.
See Documentation/cachetlb.txt for more information on cache management.
See Documentation/core-api/cachetlb.rst for more information on cache management.
CACHE COHERENCY VS MMIO
......@@ -3083,7 +3083,7 @@ CIRCULAR BUFFERS
Memory barriers can be used to implement circular buffering without the need
of a lock to serialise the producer with the consumer. See:
Documentation/circular-buffers.txt
Documentation/core-api/circular-buffers.rst
for details.
......
......@@ -18,17 +18,17 @@ major kernel release happening every two or three months. The recent
release history looks like this:
====== =================
2.6.38 March 14, 2011
2.6.37 January 4, 2011
2.6.36 October 20, 2010
2.6.35 August 1, 2010
2.6.34 May 15, 2010
2.6.33 February 24, 2010
4.11 April 30, 2017
4.12 July 2, 2017
4.13 September 3, 2017
4.14 November 12, 2017
4.15 January 28, 2018
4.16 April 1, 2018
====== =================
Every 2.6.x release is a major kernel release with new features, internal
API changes, and more. A typical 2.6 release can contain nearly 10,000
changesets with changes to several hundred thousand lines of code. 2.6 is
Every 4.x release is a major kernel release with new features, internal
API changes, and more. A typical 4.x release contain about 13,000
changesets with changes to several hundred thousand lines of code. 4.x is
thus the leading edge of Linux kernel development; the kernel uses a
rolling development model which is continually integrating major changes.
......@@ -70,20 +70,19 @@ will get up to somewhere between -rc6 and -rc9 before the kernel is
considered to be sufficiently stable and the final 2.6.x release is made.
At that point the whole process starts over again.
As an example, here is how the 2.6.38 development cycle went (all dates in
2011):
As an example, here is how the 4.16 development cycle went (all dates in
2018):
============== ===============================
January 4 2.6.37 stable release
January 18 2.6.38-rc1, merge window closes
January 21 2.6.38-rc2
February 1 2.6.38-rc3
February 7 2.6.38-rc4
February 15 2.6.38-rc5
February 21 2.6.38-rc6
March 1 2.6.38-rc7
March 7 2.6.38-rc8
March 14 2.6.38 stable release
January 28 4.15 stable release
February 11 4.16-rc1, merge window closes
February 18 4.16-rc2
February 25 4.16-rc3
March 4 4.16-rc4
March 11 4.16-rc5
March 18 4.16-rc6
March 25 4.16-rc7
April 1 4.17 stable release
============== ===============================
How do the developers decide when to close the development cycle and create
......@@ -99,37 +98,42 @@ release is made. In the real world, this kind of perfection is hard to
achieve; there are just too many variables in a project of this size.
There comes a point where delaying the final release just makes the problem
worse; the pile of changes waiting for the next merge window will grow
larger, creating even more regressions the next time around. So most 2.6.x
larger, creating even more regressions the next time around. So most 4.x
kernels go out with a handful of known regressions though, hopefully, none
of them are serious.
Once a stable release is made, its ongoing maintenance is passed off to the
"stable team," currently consisting of Greg Kroah-Hartman. The stable team
will release occasional updates to the stable release using the 2.6.x.y
will release occasional updates to the stable release using the 4.x.y
numbering scheme. To be considered for an update release, a patch must (1)
fix a significant bug, and (2) already be merged into the mainline for the
next development kernel. Kernels will typically receive stable updates for
a little more than one development cycle past their initial release. So,
for example, the 2.6.36 kernel's history looked like:
for example, the 4.13 kernel's history looked like:
============== ===============================
October 10 2.6.36 stable release
November 22 2.6.36.1
December 9 2.6.36.2
January 7 2.6.36.3
February 17 2.6.36.4
September 3 4.13 stable release
September 13 4.13.1
September 20 4.13.2
September 27 4.13.3
October 5 4.13.4
October 12 4.13.5
... ...
November 24 4.13.16
============== ===============================
2.6.36.4 was the final stable update for the 2.6.36 release.
4.13.16 was the final stable update of the 4.13 release.
Some kernels are designated "long term" kernels; they will receive support
for a longer period. As of this writing, the current long term kernels
and their maintainers are:
====== ====================== ===========================
2.6.27 Willy Tarreau (Deep-frozen stable kernel)
2.6.32 Greg Kroah-Hartman
2.6.35 Andi Kleen (Embedded flag kernel)
====== ====================== ==============================
3.16 Ben Hutchings (very long-term stable kernel)
4.1 Sasha Levin
4.4 Greg Kroah-Hartman (very long-term stable kernel)
4.9 Greg Kroah-Hartman
4.14 Greg Kroah-Hartman
====== ====================== ===========================
The selection of a kernel for long-term support is purely a matter of a
......
......@@ -10,8 +10,8 @@ of conventions and procedures which are used in the posting of patches;
following them will make life much easier for everybody involved. This
document will attempt to cover these expectations in reasonable detail;
more information can also be found in the files process/submitting-patches.rst,
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel documentation
directory.
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel
documentation directory.
When to post
......@@ -198,8 +198,8 @@ pass it to diff with the "-X" option.
The tags mentioned above are used to describe how various developers have
been associated with the development of this patch. They are described in
detail in the process/submitting-patches.rst document; what follows here is a brief
summary. Each of these lines has the format:
detail in the process/submitting-patches.rst document; what follows here is a
brief summary. Each of these lines has the format:
::
......@@ -210,8 +210,8 @@ The tags in common use are:
- Signed-off-by: this is a developer's certification that he or she has
the right to submit the patch for inclusion into the kernel. It is an
agreement to the Developer's Certificate of Origin, the full text of
which can be found in Documentation/process/submitting-patches.rst. Code without a
proper signoff cannot be merged into the mainline.
which can be found in Documentation/process/submitting-patches.rst. Code
without a proper signoff cannot be merged into the mainline.
- Co-developed-by: states that the patch was also created by another developer
along with the original author. This is useful at times when multiple
......@@ -226,8 +226,8 @@ The tags in common use are:
it to work.
- Reviewed-by: the named developer has reviewed the patch for correctness;
see the reviewer's statement in Documentation/process/submitting-patches.rst for more
detail.
see the reviewer's statement in Documentation/process/submitting-patches.rst
for more detail.
- Reported-by: names a user who reported a problem which is fixed by this
patch; this tag is used to give credit to the (often underappreciated)
......
......@@ -52,6 +52,7 @@ lack of a better place.
adding-syscalls
magic-number
volatile-considered-harmful
clang-format
.. only:: subproject and html
......
......@@ -219,7 +219,7 @@ Our goal is to protect your master key by moving it to offline media, so
if you only have a combined **[SC]** key, then you should create a separate
signing subkey::
$ gpg --quick-add-key [fpr] ed25519 sign
$ gpg --quick-addkey [fpr] ed25519 sign
Remember to tell the keyservers about this change, so others can pull down
your new subkey::
......@@ -450,11 +450,18 @@ functionality. There are several options available:
others. If you want to use ECC keys, your best bet among commercially
available devices is the Nitrokey Start.
.. note::
If you are listed in MAINTAINERS or have an account at kernel.org,
you `qualify for a free Nitrokey Start`_ courtesy of The Linux
Foundation.
.. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6
.. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3
.. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/
.. _Gnuk: http://www.fsij.org/doc-gnuk/
.. _`LWN has a good review`: https://lwn.net/Articles/736231/
.. _`qualify for a free Nitrokey Start`: https://www.kernel.org/nitrokey-digital-tokens-for-kernel-developers.html
Configure your smartcard device
-------------------------------
......@@ -482,7 +489,7 @@ there are no convenient command-line switches::
You should set the user PIN (1), Admin PIN (3), and the Reset Code (4).
Please make sure to record and store these in a safe place -- especially
the Admin PIN and the Reset Code (which allows you to completely wipe
the smartcard). You so rarely need to use the Admin PIN, that you will
the smartcard). You so rarely need to use the Admin PIN, that you will
inevitably forget what it is if you do not record it.
Getting back to the main card menu, you can also set other values (such
......@@ -494,6 +501,12 @@ additionally leak information about your smartcard should you lose it.
Despite having the name "PIN", neither the user PIN nor the admin
PIN on the card need to be numbers.
.. warning::
Some devices may require that you move the subkeys onto the device
before you can change the passphrase. Please check the documentation
provided by the device manufacturer.
Move the subkeys to your smartcard
----------------------------------
......@@ -655,6 +668,20 @@ want to import these changes back into your regular working directory::
$ gpg --export | gpg --homedir ~/.gnupg --import
$ unset GNUPGHOME
Using gpg-agent over ssh
~~~~~~~~~~~~~~~~~~~~~~~~
You can forward your gpg-agent over ssh if you need to sign tags or
commits on a remote system. Please refer to the instructions provided
on the GnuPG wiki:
- `Agent Forwarding over SSH`_
It works more smoothly if you can modify the sshd server settings on the
remote end.
.. _`Agent Forwarding over SSH`: https://wiki.gnupg.org/AgentForwarding
Using PGP with Git
==================
......@@ -692,6 +719,7 @@ should be used (``[fpr]`` is the fingerprint of your key)::
tell git to always use it instead of the legacy ``gpg`` from version 1::
$ git config --global gpg.program gpg2
$ git config --global gpgv.program gpgv2
How to work with signed tags
----------------------------
......@@ -731,6 +759,13 @@ If you are verifying someone else's git tag, then you will need to
import their PGP key. Please refer to the
":ref:`verify_identities`" section below.
.. note::
If you get "``gpg: Can't check signature: unknown pubkey
algorithm``" error, you need to tell git to use gpgv2 for
verification, so it properly processes signatures made by ECC keys.
See instructions at the start of this section.
Configure git to always sign annotated tags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......
......@@ -761,7 +761,7 @@ requests, especially from new, unknown developers. If in doubt you can use
the pull request as the cover letter for a normal posting of the patch
series, giving the maintainer the option of using either.
A pull request should have [GIT] or [PULL] in the subject line. The
A pull request should have [GIT PULL] in the subject line. The
request itself should include the repository name and the branch of
interest on a single line; it should look something like::
......
......@@ -9,5 +9,7 @@ Security Documentation
IMA-templates
keys/index
LSM
LSM-sctp
SELinux-sctp
self-protection
tpm/index
......@@ -1062,7 +1062,7 @@ output (with ``--no-upload`` option) to kernel bugzilla or alsa-devel
ML (see the section `Links and Addresses`_).
``power_save`` and ``power_save_controller`` options are for power-saving
mode. See powersave.txt for details.
mode. See powersave.rst for details.
Note 2: If you get click noises on output, try the module option
``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB
......@@ -1133,7 +1133,7 @@ line_outs_monitor
enable_monitor
Enable Analog Out on Channel 63/64 by default.
See hdspm.txt for details.
See hdspm.rst for details.
Module snd-ice1712
------------------
......
......@@ -139,7 +139,7 @@ DAPM description
----------------
The Dynamic Audio Power Management description describes the codec power
components and their relationships and registers to the ASoC core.
Please read dapm.txt for details of building the description.
Please read dapm.rst for details of building the description.
Please also see the examples in other codec drivers.
......
......@@ -66,7 +66,7 @@ Each SoC DAI driver must provide the following features:-
4. SYSCLK configuration
5. Suspend and resume (optional)
Please see codec.txt for a description of items 1 - 4.
Please see codec.rst for a description of items 1 - 4.
SoC DSP Drivers
......
......@@ -515,7 +515,7 @@ nr_hugepages
Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
See Documentation/admin-guide/mm/hugetlbpage.rst
==============================================================
......@@ -524,7 +524,7 @@ nr_overcommit_hugepages
Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
See Documentation/admin-guide/mm/hugetlbpage.rst
==============================================================
......@@ -667,7 +667,7 @@ and don't use much of it.
The default value is 0.
See Documentation/vm/overcommit-accounting and
See Documentation/vm/overcommit-accounting.rst and
mm/mmap.c::__vm_enough_memory() for more information.
==============================================================
......
......@@ -187,13 +187,19 @@ that can be performed on them (see "struct coresight_ops"). The
specific to that component only. "Implementation defined" customisations are
expected to be accessed and controlled using those entries.
Last but not least, "struct module *owner" is expected to be set to reflect
the information carried in "THIS_MODULE".
How to use the tracer modules
-----------------------------
Before trace collection can start, a coresight sink needs to be identify.
There are two ways to use the Coresight framework: 1) using the perf cmd line
tools and 2) interacting directly with the Coresight devices using the sysFS
interface. Preference is given to the former as using the sysFS interface
requires a deep understanding of the Coresight HW. The following sections
provide details on using both methods.
1) Using the sysFS interface:
Before trace collection can start, a coresight sink needs to be identified.
There is no limit on the amount of sinks (nor sources) that can be enabled at
any given moment. As a generic operation, all device pertaining to the sink
class will have an "active" entry in sysfs:
......@@ -298,42 +304,48 @@ Instruction 13570831 0x8026B584 E28DD00C false ADD
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
Timestamp Timestamp: 17107041535
How to use the STM module
-------------------------
2) Using perf framework:
Using the System Trace Macrocell module is the same as the tracers - the only
difference is that clients are driving the trace capture rather
than the program flow through the code.
Coresight tracers are represented using the Perf framework's Performance
Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of
controlling when tracing gets enabled based on when the process of interest is
scheduled. When configured in a system, Coresight PMUs will be listed when
queried by the perf command line tool:
As with any other CoreSight component, specifics about the STM tracer can be
found in sysfs with more information on each entry being found in [1]:
linaro@linaro-nano:~$ ./perf list pmu
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
enable_source hwevent_select port_enable subsystem uevent
hwevent_enable mgmt port_select traceid
root@genericarmv8:~#
List of pre-defined events (to be used in -e):
Like any other source a sink needs to be identified and the STM enabled before
being used:
cs_etm// [Kernel PMU event]
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
linaro@linaro-nano:~$
From there user space applications can request and use channels using the devfs
interface provided for that purpose by the generic STM API:
Regardless of the number of tracers available in a system (usually equal to the
amount of processor cores), the "cs_etm" PMU will be listed only once.
root@genericarmv8:~# ls -l /dev/20100000.stm
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
root@genericarmv8:~#
A Coresight PMU works the same way as any other PMU, i.e the name of the PMU is
listed along with configuration options within forward slashes '/'. Since a
Coresight system will typically have more than one sink, the name of the sink to
work with needs to be specified as an event option. Names for sink to choose
from are listed in sysFS under ($SYSFS)/bus/coresight/devices:
Details on how to use the generic STM API can be found here [2].
root@linaro-nano:~# ls /sys/bus/coresight/devices/
20010000.etf 20040000.funnel 20100000.stm 22040000.etm
22140000.etm 230c0000.funnel 23240000.etm 20030000.tpiu
20070000.etr 20120000.replicator 220c0000.funnel
23040000.etm 23140000.etm 23340000.etm
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
[2]. Documentation/trace/stm.txt
root@linaro-nano:~# perf record -e cs_etm/@20070000.etr/u --per-thread program
The syntax within the forward slashes '/' is important. The '@' character
tells the parser that a sink is about to be specified and that this is the sink
to use for the trace session.
Using perf tools
----------------
More information on the above and other example on how to use Coresight with
the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub
repository [3].
2.1) AutoFDO analysis using the perf tools:
perf can be used to record and analyze trace of programs.
......@@ -381,3 +393,38 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
$ taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
5806 ms
How to use the STM module
-------------------------
Using the System Trace Macrocell module is the same as the tracers - the only
difference is that clients are driving the trace capture rather
than the program flow through the code.
As with any other CoreSight component, specifics about the STM tracer can be
found in sysfs with more information on each entry being found in [1]:
root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
enable_source hwevent_select port_enable subsystem uevent
hwevent_enable mgmt port_select traceid
root@genericarmv8:~#
Like any other source a sink needs to be identified and the STM enabled before
being used:
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source
From there user space applications can request and use channels using the devfs
interface provided for that purpose by the generic STM API:
root@genericarmv8:~# ls -l /dev/20100000.stm
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
root@genericarmv8:~#
Details on how to use the generic STM API can be found here [2].
[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
[2]. Documentation/trace/stm.txt
[3]. https://github.com/Linaro/perf-opencsd
......@@ -12,7 +12,7 @@ Written for: 4.14
Introduction
============
The ftrace infrastructure was originially created to attach callbacks to the
The ftrace infrastructure was originally created to attach callbacks to the
beginning of functions in order to record and trace the flow of the kernel.
But callbacks to the start of a function can have other use cases. Either
for live kernel patching, or for security monitoring. This document describes
......@@ -30,7 +30,7 @@ The ftrace context
This requires extra care to what can be done inside a callback. A callback
can be called outside the protective scope of RCU.
The ftrace infrastructure has some protections agains recursions and RCU
The ftrace infrastructure has some protections against recursions and RCU
but one must still be very careful how they use the callbacks.
......
......@@ -224,6 +224,8 @@ of ftrace. Here is a list of some of the key files:
has a side effect of enabling or disabling specific functions
to be traced. Echoing names of functions into this file
will limit the trace to only those functions.
This influences the tracers "function" and "function_graph"
and thus also function profiling (see "function_profile_enabled").
The functions listed in "available_filter_functions" are what
can be written into this file.
......@@ -265,6 +267,8 @@ of ftrace. Here is a list of some of the key files:
Functions listed in this file will cause the function graph
tracer to only trace these functions and the functions that
they call. (See the section "dynamic ftrace" for more details).
Note, set_ftrace_filter and set_ftrace_notrace still affects
what functions are being traced.
set_graph_notrace:
......@@ -277,7 +281,8 @@ of ftrace. Here is a list of some of the key files:
This lists the functions that ftrace has processed and can trace.
These are the function names that you can pass to
"set_ftrace_filter" or "set_ftrace_notrace".
"set_ftrace_filter", "set_ftrace_notrace",
"set_graph_function", or "set_graph_notrace".
(See the section "dynamic ftrace" below for more details.)
dyn_ftrace_total_info:
......
......@@ -2846,7 +2846,7 @@ CPU 의 캐시에서 RAM 으로 쓰여지는 더티 캐시 라인에 의해 덮
문제를 해결하기 위해선, 커널의 적절한 부분에서 각 CPU 의 캐시 안의 문제가 되는
비트들을 무효화 시켜야 합니다.
캐시 관리에 대한 더 많은 정보를 위해선 Documentation/cachetlb.txt 를
캐시 관리에 대한 더 많은 정보를 위해선 Documentation/core-api/cachetlb.rst 를
참고하세요.
......@@ -3023,7 +3023,7 @@ smp_mb() 가 아니라 virt_mb() 를 사용해야 합니다.
동기화에 락을 사용하지 않고 구현하는데에 사용될 수 있습니다. 더 자세한 내용을
위해선 다음을 참고하세요:
Documentation/circular-buffers.txt
Documentation/core-api/circular-buffers.rst
=========
......
......@@ -252,15 +252,14 @@ into VFIO core. When devices are bound and unbound to the driver,
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
respectively::
extern int vfio_add_group_dev(struct iommu_group *iommu_group,
struct device *dev,
extern int vfio_add_group_dev(struct device *dev,
const struct vfio_device_ops *ops,
void *device_data);
extern void *vfio_del_group_dev(struct device *dev);
vfio_add_group_dev() indicates to the core to begin tracking the
specified iommu_group and register the specified dev as owned by
iommu_group of the specified dev and register the dev as owned by
a VFIO bus driver. The driver provides an ops structure for callbacks
similar to a file operations structure::
......
00-INDEX
- this file.
active_mm.txt
active_mm.rst
- An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
balance.rst
- various information on memory balancing.
cleancache.txt
cleancache.rst
- Intro to cleancache and page-granularity victim cache.
frontswap.txt
frontswap.rst
- Outline frontswap, part of the transcendent memory frontend.
highmem.txt
highmem.rst
- Outline of highmem and common issues.
hmm.txt
hmm.rst
- Documentation of heterogeneous memory management
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
hugetlbfs_reserv.txt
hugetlbfs_reserv.rst
- A brief overview of hugetlbfs reservation design/implementation.
hwpoison.txt
hwpoison.rst
- explains what hwpoison is
idle_page_tracking.txt
- description of the idle page tracking feature.
ksm.txt
ksm.rst
- how to use the Kernel Samepage Merging feature.
mmu_notifier.txt
mmu_notifier.rst
- a note about clearing pte/pmd and mmu notifications
numa
numa.rst
- information about NUMA specific code in the Linux vm.
numa_memory_policy.txt
- documentation of concepts and APIs of the 2.6 memory policy support.
overcommit-accounting
overcommit-accounting.rst
- description of the Linux kernels overcommit handling modes.
page_frags
page_frags.rst
- description of page fragments allocator
page_migration
page_migration.rst
- description of page migration in NUMA systems.
pagemap.txt
- pagemap, from the userspace perspective
page_owner.txt
page_owner.rst
- tracking about who allocated each page
remap_file_pages.txt
remap_file_pages.rst
- a note about remap_file_pages() system call
slub.txt
slub.rst
- a short users guide for SLUB.
soft-dirty.txt
- short explanation for soft-dirty PTEs
split_page_table_lock
split_page_table_lock.rst
- Separate per-table lock to improve scalability of the old page_table_lock.
swap_numa.txt
swap_numa.rst
- automatic binding of swap device to numa node
transhuge.txt
transhuge.rst
- Transparent Hugepage Support, alternative way of using hugepages.
unevictable-lru.txt
unevictable-lru.rst
- Unevictable LRU infrastructure
userfaultfd.txt
- description of userfaultfd system call
z3fold.txt
- outline of z3fold allocator for storing compressed pages
zsmalloc.txt
zsmalloc.rst
- outline of zsmalloc allocator for storing compressed pages
zswap.txt
zswap.rst
- Intro to compressed cache for swap pages
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册