Commit eeee3149 authored by Linus Torvalds

Merge tag 'docs-4.18' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "There's been a fair amount of work in the docs tree this time around,
  including:

   - Extensive RST conversions and organizational work in the
     memory-management docs thanks to Mike Rapoport.

   - An update of Documentation/features from Andrea Parri and a script
     to keep it updated.

   - Various LICENSES updates from Thomas, along with a script to check
     SPDX tags.

   - Work to fix dangling references to documentation files; this
     involved a fair number of one-liner comment changes outside of
     Documentation/

  ... and the usual list of documentation improvements, typo fixes, etc"

* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
  Documentation: document hung_task_panic kernel parameter
  docs/admin-guide/mm: add high level concepts overview
  docs/vm: move ksm and transhuge from "user" to "internals" section.
  docs: Use the kerneldoc comments for memalloc_no*()
  doc: document scope NOFS, NOIO APIs
  docs: update kernel versions and dates in tables
  docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
  docs/vm: transhuge: minor updates
  docs/vm: transhuge: change sections order
  Documentation: arm: clean up Marvell Berlin family info
  Documentation: gpio: driver: Fix a typo and some odd grammar
  docs: ramoops.rst: fix location of ramoops.txt
  scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
  docs: uio-howto.rst: use a code block to solve a warning
  mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
  w1: w1_io.c: fix a kernel-doc warning
  Documentation/process/posting: wrap text at 80 cols
  docs: admin-guide: add cgroup-v2 documentation
  Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
  Documentation: refcount-vs-atomic: Update reference to LKMM doc.
  ...
@@ -64,8 +64,6 @@ auxdisplay/
     - misc. LCD driver documentation (cfag12864b, ks0108).
 backlight/
     - directory with info on controlling backlights in flat panel displays
-bcache.txt
-    - Block-layer cache on fast SSDs to improve slow (raid) I/O performance.
 block/
     - info on the Block I/O (BIO) layer.
 blockdev/
@@ -78,18 +76,10 @@ bus-devices/
     - directory with info on TI GPMC (General Purpose Memory Controller)
 bus-virt-phys-mapping.txt
     - how to access I/O mapped memory from within device drivers.
-cachetlb.txt
-    - describes the cache/TLB flushing interfaces Linux uses.
 cdrom/
     - directory with information on the CD-ROM drivers that Linux has.
 cgroup-v1/
     - cgroups v1 features, including cpusets and memory controller.
-cgroup-v2.txt
-    - cgroups v2 features, including cpusets and memory controller.
-circular-buffers.txt
-    - how to make use of the existing circular buffer infrastructure
-clk.txt
-    - info on the common clock framework
 cma/
     - Continuous Memory Area (CMA) debugfs interface.
 conf.py
...
@@ -90,4 +90,4 @@ Date: December 2009
 Contact: Lee Schermerhorn <lee.schermerhorn@hp.com>
 Description:
         The node's huge page size control/query attributes.
-        See Documentation/vm/hugetlbpage.txt
+        See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file
@@ -12,4 +12,4 @@ Description:
         free_hugepages
         surplus_hugepages
         resv_hugepages
-        See Documentation/vm/hugetlbpage.txt for details.
+        See Documentation/admin-guide/mm/hugetlbpage.rst for details.
@@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
         sleep_millisecs: how many milliseconds ksm should sleep between
         scans.
-        See Documentation/vm/ksm.txt for more information.
+        See Documentation/vm/ksm.rst for more information.

 What:       /sys/kernel/mm/ksm/merge_across_nodes
 Date:       January 2013
...
@@ -37,7 +37,7 @@ Description:
         The alloc_calls file is read-only and lists the kernel code
         locations from which allocations for this cache were performed.
         The alloc_calls file only contains information if debugging is
-        enabled for that cache (see Documentation/vm/slub.txt).
+        enabled for that cache (see Documentation/vm/slub.rst).

 What:       /sys/kernel/slab/cache/alloc_fastpath
 Date:       February 2008
@@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
 Description:
         The free_calls file is read-only and lists the locations of
         object frees if slab debugging is enabled (see
-        Documentation/vm/slub.txt).
+        Documentation/vm/slub.rst).

 What:       /sys/kernel/slab/cache/free_fastpath
 Date:       February 2008
...
@@ -48,6 +48,7 @@ configure specific aspects of kernel behavior to your liking.
    :maxdepth: 1

    initrd
+   cgroup-v2
    serial-console
    braille-console
    parport
@@ -60,9 +61,11 @@ configure specific aspects of kernel behavior to your liking.
    mono
    java
    ras
+   bcache
    pm/index
    thunderbolt
    LSM/index
+   mm/index

 .. only:: subproject and html
...
@@ -106,11 +106,11 @@
             use by PCI
             Format: <irq>,<irq>...

     acpi_mask_gpe=  [HW,ACPI]
             Due to the existence of _Lxx/_Exx, some GPEs triggered
             by unsupported hardware/firmware features can result in
             GPE floodings that cannot be automatically disabled by
             the GPE dispatcher.
             This facility can be used to prevent such uncontrolled
             GPE floodings.
             Format: <int>
@@ -472,10 +472,10 @@
             for platform specific values (SB1, Loongson3 and
             others).

     ccw_timeout_log [S390]
             See Documentation/s390/CommonIO for details.

     cgroup_disable= [KNL] Disable a particular controller
             Format: {name of the controller(s) to disable}
             The effects of cgroup_disable=foo are:
             - foo isn't auto-mounted if you mount all cgroups in
@@ -518,7 +518,7 @@
             those clocks in any way. This parameter is useful for
             debug and development, but should not be needed on a
             platform with proper driver support. For more
-            information, see Documentation/clk.txt.
+            information, see Documentation/driver-api/clk.rst.

     clock=          [BUGS=X86-32, HW] gettimeofday clocksource override.
             [Deprecated]
@@ -641,8 +641,8 @@
         hvc<n>  Use the hypervisor console device <n>. This is for
             both Xen and PowerPC hypervisors.

         If the device connected to the port is not a TTY but a braille
         device, prepend "brl," before the device type, for instance
             console=brl,ttyS0
         For now, only VisioBraille is supported.
@@ -662,7 +662,7 @@
     consoleblank=   [KNL] The console blank (screen saver) timeout in
             seconds. A value of 0 disables the blank timer.
             Defaults to 0.

     coredump_filter=
             [KNL] Change the default value for
@@ -730,7 +730,7 @@
             or memory reserved is below 4G.

     cryptomgr.notests
             [KNL] Disable crypto self-tests

     cs89x0_dma=     [HW,NET]
             Format: <dma>
@@ -746,7 +746,7 @@
             Format: <port#>,<type>
             See also Documentation/input/devices/joystick-parport.rst

     ddebug_query=   [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
             time. See
             Documentation/admin-guide/dynamic-debug-howto.rst for
             details. Deprecated, see dyndbg.
@@ -833,7 +833,7 @@
             causing system reset or hang due to sending
             INIT from AP to BSP.

     disable_ddw     [PPC/PSERIES]
             Disable Dynamic DMA Window support. Use this if
             to workaround buggy firmware.
@@ -1188,7 +1188,7 @@
             parameter will force ia64_sal_cache_flush to call
             ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.

     forcepae        [X86-32]
             Forcefully enable Physical Address Extension (PAE).
             Many Pentium M systems disable PAE but may have a
             functionally usable PAE implementation.
@@ -1247,7 +1247,7 @@
     gamma=          [HW,DRM]

     gart_fix_e820=  [X86_64] disable the fix e820 for K8 GART
             Format: off | on
             default: on
@@ -1341,23 +1341,32 @@
             x86-64 are 2M (when the CPU supports "pse") and 1G
             (when the CPU supports the "pdpe1gb" cpuinfo flag).

+    hung_task_panic=
+            [KNL] Should the hung task detector generate panics.
+            Format: <integer>
+
+            A nonzero value instructs the kernel to panic when a
+            hung task is detected. The default value is controlled
+            by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
+            option. The value selected by this boot parameter can
+            be changed later by the kernel.hung_task_panic sysctl.
+
     hvc_iucv=       [S390] Number of z/VM IUCV hypervisor console (HVC)
             terminal devices. Valid values: 0..8
     hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
             If specified, z/VM IUCV HVC accepts connections
             from listed z/VM user IDs only.

     keep_bootcon    [KNL]
             Do not unregister boot console at start. This is only
             useful for debugging when something happens in the window
             between unregistering the boot console and initializing
             the real console.

     i2c_bus=        [HW] Override the default board specific I2C bus speed
             or register an additional I2C bus that is not
             registered from board initialization code.
             Format:
             <bus_id>,<clkrate>

     i8042.debug     [HW] Toggle i8042 debug mode
     i8042.unmask_kbd_data
@@ -1386,7 +1395,7 @@
             Default: only on s2r transitions on x86; most other
             architectures force reset to be always executed
     i8042.unlock    [HW] Unlock (ignore) the keylock
     i8042.kbdreset  [HW] Reset device connected to KBD port

     i810=           [HW,DRM]
@@ -1548,13 +1557,13 @@
             programs exec'd, files mmap'd for exec, and all files
             opened for read by uid=0.

     ima_template=   [IMA]
             Select one of defined IMA measurements template formats.
             Formats: { "ima" | "ima-ng" | "ima-sig" }
             Default: "ima-ng"

     ima_template_fmt=
             [IMA] Define a custom template format.
             Format: { "field1|...|fieldN" }

     ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
@@ -1597,7 +1606,7 @@
     inport.irq=     [HW] Inport (ATI XL and Microsoft) busmouse driver
             Format: <irq>

     int_pln_enable  [x86] Enable power limit notification interrupt

     integrity_audit=[IMA]
             Format: { "0" | "1" }
@@ -1650,39 +1659,39 @@
             0 disables intel_idle and fall back on acpi_idle.
             1 to 9 specify maximum depth of C-state.

     intel_pstate=   [X86]
             disable
               Do not enable intel_pstate as the default
               scaling driver for the supported processors
             passive
               Use intel_pstate as a scaling driver, but configure it
               to work with generic cpufreq governors (instead of
               enabling its internal governor). This mode cannot be
               used along with the hardware-managed P-states (HWP)
               feature.
             force
               Enable intel_pstate on systems that prohibit it by default
               in favor of acpi-cpufreq. Forcing the intel_pstate driver
               instead of acpi-cpufreq may disable platform features, such
               as thermal controls and power capping, that rely on ACPI
               P-States information being indicated to OSPM and therefore
               should be used with caution. This option does not work with
               processors that aren't supported by the intel_pstate driver
               or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
             no_hwp
               Do not enable hardware P state control (HWP)
               if available.
             hwp_only
               Only load intel_pstate on systems which support
               hardware P state control (HWP) if available.
             support_acpi_ppc
               Enforce ACPI _PPC performance limits. If the Fixed ACPI
               Description Table, specifies preferred power management
               profile as "Enterprise Server" or "Performance Server",
               then this feature is turned on by default.
             per_cpu_perf_limits
               Allow per-logical-CPU P-State performance control limits using
               cpufreq sysfs interface

     intremap=       [X86-64, Intel-IOMMU]
             on      enable Interrupt Remapping (default)
@@ -2026,7 +2035,7 @@
             * [no]ncqtrim: Turn off queued DSM TRIM.
             * nohrst, nosrst, norst: suppress hard, soft
               and both resets.

             * rstonce: only attempt one reset during
               hot-unplug link recovery
@@ -2214,7 +2223,7 @@
             [KNL,SH] Allow user to override the default size for
             per-device physically contiguous DMA buffers.

     memhp_default_state=online/offline
             [KNL] Set the initial state for the memory hotplug
             onlining policy. If not specified, the default value is
             set according to the
@@ -2764,7 +2773,7 @@
             [X86,PV_OPS] Disable paravirtualized VMware scheduler
             clock and use the default one.

     no-steal-acc    [X86,KVM] Disable paravirtualized steal time accounting.
             steal time is computed, but won't influence scheduler
             behaviour
@@ -2825,7 +2834,7 @@
     notsc           [BUGS=X86-32] Disable Time Stamp Counter

     nowatchdog      [KNL] Disable both lockup detectors, i.e.
             soft-lockup and NMI watchdog (hard-lockup).

     nowb            [ARM]
@@ -2845,7 +2854,7 @@
             If the dependencies are under your control, you can
             turn on cpu0_hotplug.

     nps_mtm_hs_ctr= [KNL,ARC]
             This parameter sets the maximum duration, in
             cycles, each HW thread of the CTOP can run
             without interruptions, before HW switches it.
@@ -2986,7 +2995,7 @@
     pci=option[,option...]  [PCI] various PCI subsystem options:
         earlydump   [X86] dump PCI config space before the kernel
                     changes anything

         off         [X86] don't probe for the PCI bus
         bios        [X86-32] force use of PCI BIOS, don't access
                     the hardware directly. Use this if your machine
@@ -3074,7 +3083,7 @@
                     is enabled by default. If you need to use this,
                     please report a bug.
         nocrs       [X86] Ignore PCI host bridge windows from ACPI.
                     If you need to use this, please report a bug.

         routeirq    Do IRQ routing for all PCI devices.
                     This is normally done in pci_enable_device(),
                     so this option is a temporary workaround
@@ -3917,7 +3926,7 @@
             cache (risks via metadata attacks are mostly
             unchanged). Debug options disable merging on their
             own.
-            For more information see Documentation/vm/slub.txt.
+            For more information see Documentation/vm/slub.rst.

     slab_max_order= [MM, SLAB]
             Determines the maximum allowed order for slabs.
@@ -3931,7 +3940,7 @@
             slub_debug can create guard zones around objects and
             may poison objects when not in use. Also tracks the
             last alloc / free. For more information see
-            Documentation/vm/slub.txt.
+            Documentation/vm/slub.rst.

     slub_memcg_sysfs=   [MM, SLUB]
             Determines whether to enable sysfs directories for
@@ -3945,7 +3954,7 @@
             Determines the maximum allowed order for slabs.
             A high setting may cause OOMs due to memory
             fragmentation. For more information see
-            Documentation/vm/slub.txt.
+            Documentation/vm/slub.rst.

     slub_min_objects=   [MM, SLUB]
             The minimum number of objects per slab. SLUB will
@@ -3954,12 +3963,12 @@
             the number of objects indicated. The higher the number
             of objects the smaller the overhead of tracking slabs
             and the less frequently locks need to be acquired.
-            For more information see Documentation/vm/slub.txt.
+            For more information see Documentation/vm/slub.rst.

     slub_min_order= [MM, SLUB]
             Determines the minimum page order for slabs. Must be
             lower than slub_max_order.
-            For more information see Documentation/vm/slub.txt.
+            For more information see Documentation/vm/slub.rst.

     slub_nomerge    [MM, SLUB]
             Same with slab_nomerge. This is supported for legacy.
@@ -4357,7 +4366,8 @@
             Format: [always|madvise|never]
             Can be used to control the default behavior of the system
             with respect to transparent hugepages.
-            See Documentation/vm/transhuge.txt for more details.
+            See Documentation/admin-guide/mm/transhuge.rst
+            for more details.

     tsc=            Disable clocksource stability checks for TSC.
             Format: <string>
@@ -4435,7 +4445,7 @@
     usbcore.initial_descriptor_timeout=
             [USB] Specifies timeout for the initial 64-byte
             USB_REQ_GET_DESCRIPTOR request in milliseconds
             (default 5000 = 5.0 seconds).

     usbcore.nousb   [USB] Disable the USB subsystem
...
.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over
the years, accumulating more and more functionality to support a
variety of systems, from MMU-less microcontrollers to supercomputers.
Memory management for systems without an MMU is called ``nommu`` and
definitely deserves a dedicated document, which hopefully will
eventually be written. Yet, although some of the concepts are the
same, here we assume that an MMU is available and the CPU can
translate a virtual address to a physical address.

.. contents::
   :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource, and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different
views of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex,
and to hide this complexity the concept of virtual memory was
developed. Virtual memory abstracts the details of physical memory
away from application software, allows only the needed information to
be kept in physical memory (demand paging), and provides a mechanism
for protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or writes)
from (or to) the system memory, it translates the `virtual` address
encoded in that instruction to a `physical` address that the memory
controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an appropriate
kernel configuration option.
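
As an aside (a minimal sketch, not part of the original text), user
space can query the base page size the running kernel uses via
sysconf(3)::

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* _SC_PAGESIZE reports the base page size, e.g. 4096 bytes
           * on most x86 configurations. */
          long page_size = sysconf(_SC_PAGESIZE);

          printf("base page size: %ld bytes\n", page_size);
          return 0;
  }
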
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from the virtual addresses used by programs to the real
addresses in physical memory. The page tables are organized
hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index into
that level's page table. The lowest bits in the virtual address define
the offset inside the actual page.
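
To make the walk concrete, here is a simplified sketch of the
translation for a hypothetical two-level layout with 4 KiB pages and
1024 entries per level (real architectures differ in the number of
levels, the bit widths, and the entry format, which also carries
permission bits)::

  #include <stdint.h>

  #define PAGE_SHIFT  12                        /* 4 KiB pages */
  #define LEVEL_BITS  10                        /* 1024 entries per table */
  #define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)
  #define PAGE_MASK   ((1u << PAGE_SHIFT) - 1)

  /* Hypothetical two-level walk: the high bits of the virtual address
   * index the top-level table, the middle bits index the lower-level
   * table, and the low bits are the offset inside the page. */
  uint64_t translate(uint64_t **top_table, uint64_t vaddr)
  {
          uint64_t top_idx = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK;
          uint64_t low_idx = (vaddr >> PAGE_SHIFT) & LEVEL_MASK;
          uint64_t offset  = vaddr & PAGE_MASK;

          /* A top-level entry points to a lower-level table; a
           * lowest-level entry holds the physical page address. */
          uint64_t *low_table = top_table[top_idx];

          return low_table[low_idx] + offset;
  }
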
Huge Pages
==========

The address translation requires several memory accesses, and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or TLB).
The TLB is usually a scarce resource, and applications with a large
memory working set will experience a performance hit because of TLB
misses.

Many modern CPU architectures allow mapping of memory pages directly
by the higher levels in the page table. For instance, on x86, it is
possible to map 2M and even 1G pages using entries in the second and
the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the
TLB, improves the TLB hit rate and thus improves overall system
performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with huge pages. The first one is the `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For files created in this filesystem, the data resides in
memory and is mapped using huge pages. hugetlbfs is described in
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables use of huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure which parts
of the system memory should and can be mapped by huge pages, THP
manages such mappings transparently to the user, hence the name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.
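
For illustration only (a hedged sketch, not from the original
document), user space can ask for huge page backing either explicitly,
with mmap(2) and MAP_HUGETLB, or by advising THP with madvise(2)::

  #include <stdio.h>
  #include <sys/mman.h>

  #define LEN (2UL * 1024 * 1024)       /* one 2M huge page on x86 */

  int main(void)
  {
          /* hugetlbfs-backed mapping; fails if the huge page pool is empty */
          void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
          if (p == MAP_FAILED)
                  perror("mmap(MAP_HUGETLB)");
          else
                  munmap(p, LEN);

          /* THP: map normally, then ask the kernel to use huge pages */
          p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p != MAP_FAILED) {
                  madvise(p, LEN, MADV_HUGEPAGE);
                  munmap(p, LEN);
          }
          return 0;
  }
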
Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space, and ZONE_NORMAL
will contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node
Linux constructs an independent memory management subsystem. A node
has its own set of zones, lists of free and used pages and various
statistics counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
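
As a small, hedged illustration (this assumes the optional libnuma
user-space library is installed; link with ``-lnuma``), a program can
inspect the node topology and place an allocation on a given node::

  #include <numa.h>    /* libnuma, not part of glibc */
  #include <stdio.h>

  int main(void)
  {
          if (numa_available() < 0) {
                  fprintf(stderr, "no NUMA support on this system\n");
                  return 1;
          }
          printf("nodes: %d\n", numa_max_node() + 1);

          /* allocate 1 MiB of memory physically placed on node 0 */
          void *p = numa_alloc_onnode(1 << 20, 0);
          if (p)
                  numa_free(p, 1 << 20);
          return 0;
  }
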
Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
subsequent reads. Similarly, when one writes to a file, the data is
placed in the page cache and eventually gets into the backing storage
device. The written pages are marked as `dirty` and when Linux decides
to reuse them for other purposes, it makes sure to synchronize the
file contents on the device with the updated data.
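
As a sketch (not part of the original text), the mincore(2) system
call lets a program check which pages of a mapped file are currently
resident in the page cache::

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/stat.h>

  int main(int argc, char **argv)
  {
          int fd = open(argv[1], O_RDONLY);   /* error checking omitted */
          struct stat st;
          fstat(fd, &st);

          long psz = sysconf(_SC_PAGESIZE);
          size_t pages = (st.st_size + psz - 1) / psz;
          void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
          unsigned char vec[pages];

          /* one status byte per page; bit 0 set means "resident" */
          mincore(map, st.st_size, vec);

          size_t resident = 0;
          for (size_t i = 0; i < pages; i++)
                  resident += vec[i] & 1;
          printf("%zu of %zu pages resident\n", resident, pages);

          munmap(map, st.st_size);
          close(fd);
          return 0;
  }
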
Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual
memory areas that the program is allowed to access. Read accesses
will result in the creation of a page table entry that references a
special physical page filled with zeroes. When the program performs a
write, a regular physical page will be allocated to hold the written
data. The page will be marked dirty and, if the kernel decides to
repurpose it, the dirty page will be swapped out.
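
A short demonstration (again a sketch, not from the original text) of
the zero-page behavior described above::

  #include <assert.h>
  #include <sys/mman.h>

  int main(void)
  {
          /* an anonymous private mapping: no file backs this memory */
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* the first read faults in the shared zero page ... */
          assert(p[0] == 0);

          /* ... and the first write allocates a real, private page */
          p[0] = 1;

          munmap(p, 4096);
          return 0;
  }
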
Reclaim
=======

Throughout the system lifetime, a physical page can be used for
storing different types of data. It can be kernel internal data
structures, DMA'able buffers for device drivers' use, data read from a
filesystem, memory allocated by user space processes, etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data available elsewhere, for instance on a hard
disk, or because they can be swapped out, again to the hard disk, are
called `reclaimable`. The most notable categories of reclaimable pages
are the page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and an allocation request will be satisfied immediately from the
free pages supply. As the load increases, the amount of free pages
goes down, and when it reaches a certain threshold (the high
watermark), an allocation request will awaken the ``kswapd`` daemon.
It will asynchronously scan memory pages and either just free them if
the data they contain is available elsewhere, or evict them to the
backing storage device (remember those dirty pages?). As memory usage
increases even more and reaches another threshold - the min watermark
- an allocation will trigger `direct reclaim`. In this case the
allocation stalls until enough memory pages are reclaimed to satisfy
the request.

Compaction
==========

As the system runs, tasks allocate and free memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it
is necessary to allocate large physically contiguous memory areas.
Such a need may arise, for instance, when a device driver requires a
large buffer for DMA, or when THP allocates a huge page. Memory
`compaction` addresses the fragmentation issue. This mechanism moves
occupied pages from the lower part of a memory zone to free pages in
the upper part of the zone. When a compaction scan is finished, free
pages are grouped together at the beginning of the zone and
allocations of large physically contiguous areas become possible.

Like reclaim, compaction may happen asynchronously, in the
``kcompactd`` daemon, or synchronously, as a result of a memory
allocation request.
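
Compaction can also be requested explicitly. As a hedged illustration
(the ``compact_memory`` knob requires CONFIG_COMPACTION and root
privileges), a program can trigger a full compaction run::

  #include <stdio.h>

  int main(void)
  {
          /* equivalent to: echo 1 > /proc/sys/vm/compact_memory */
          FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

          if (!f) {
                  perror("compact_memory");
                  return 1;
          }
          fputs("1\n", f);
          fclose(f);
          return 0;
  }
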
OOM killer
==========

It may happen that, on a loaded machine, memory will be exhausted.
When the kernel detects that the system is running out of memory
(OOM), it invokes the `OOM killer`. Its mission is simple: all it has
to do is to select a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it
exits enough memory will be freed to continue normal operation.
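
For illustration (a sketch under the assumption that victim selection
honors the per-process badness adjustment), a process can volunteer
itself as a preferred OOM victim through procfs::

  #include <stdio.h>

  int main(void)
  {
          /* 1000 is the maximum badness adjustment; -1000 exempts the
           * process from OOM killing entirely. */
          FILE *f = fopen("/proc/self/oom_score_adj", "w");

          if (!f) {
                  perror("oom_score_adj");
                  return 1;
          }
          fputs("1000\n", f);
          fclose(f);
          return 0;
  }
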
.. _hugetlbpage:

=============
HugeTLB Pages
=============

Overview
========

 The intent of this file is to give a brief summary of hugetlbpage support in
 the Linux kernel. This support is built on top of multiple page size support
@@ -18,53 +26,59 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
 automatically when CONFIG_HUGETLBFS is selected) configuration
 options.
-The /proc/meminfo file provides information about the total number of
+The ``/proc/meminfo`` file provides information about the total number of
 persistent hugetlb pages in the kernel's huge page pool. It also displays
 default huge page size and information about the number of free, reserved
 and surplus huge pages in the pool of huge pages of default size.
 The huge page size is needed for generating the proper alignment and
 size of the arguments to system calls that map huge page regions.

-The output of "cat /proc/meminfo" will include lines like:
-.....
-HugePages_Total: uuu
-HugePages_Free: vvv
-HugePages_Rsvd: www
-HugePages_Surp: xxx
-Hugepagesize: yyy kB
-Hugetlb: zzz kB
+The output of ``cat /proc/meminfo`` will include lines like::

+        HugePages_Total: uuu
+        HugePages_Free: vvv
+        HugePages_Rsvd: www
+        HugePages_Surp: xxx
+        Hugepagesize: yyy kB
+        Hugetlb: zzz kB
 where:

-HugePages_Total is the size of the pool of huge pages.
-HugePages_Free  is the number of huge pages in the pool that are not yet
-                allocated.
-HugePages_Rsvd  is short for "reserved," and is the number of huge pages for
-                which a commitment to allocate from the pool has been made,
-                but no allocation has yet been made. Reserved huge pages
-                guarantee that an application will be able to allocate a
-                huge page from the pool of huge pages at fault time.
-HugePages_Surp  is short for "surplus," and is the number of huge pages in
-                the pool above the value in /proc/sys/vm/nr_hugepages. The
-                maximum number of surplus huge pages is controlled by
-                /proc/sys/vm/nr_overcommit_hugepages.
-Hugepagesize    is the default hugepage size (in Kb).
-Hugetlb         is the total amount of memory (in kB), consumed by huge
-                pages of all sizes.
-                If huge pages of different sizes are in use, this number
-                will exceed HugePages_Total * Hugepagesize. To get more
-                detailed information, please, refer to
-                /sys/kernel/mm/hugepages (described below).
+HugePages_Total
+        is the size of the pool of huge pages.
+HugePages_Free
+        is the number of huge pages in the pool that are not yet
+        allocated.
+HugePages_Rsvd
+        is short for "reserved," and is the number of huge pages for
+        which a commitment to allocate from the pool has been made,
+        but no allocation has yet been made. Reserved huge pages
+        guarantee that an application will be able to allocate a
+        huge page from the pool of huge pages at fault time.
+HugePages_Surp
+        is short for "surplus," and is the number of huge pages in
+        the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
+        maximum number of surplus huge pages is controlled by
+        ``/proc/sys/vm/nr_overcommit_hugepages``.
+Hugepagesize
+        is the default hugepage size (in Kb).
+Hugetlb
+        is the total amount of memory (in kB), consumed by huge
+        pages of all sizes.
+        If huge pages of different sizes are in use, this number
+        will exceed HugePages_Total \* Hugepagesize. To get more
+        detailed information, please, refer to
+        ``/sys/kernel/mm/hugepages`` (described below).

-/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
-in the kernel.
+``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
+configured in the kernel.

-/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
+``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
 pages in the kernel's huge page pool. "Persistent" huge pages will be
 returned to the huge page pool when freed by a task. A user with root
 privileges can dynamically allocate more or free some persistent huge pages
-by increasing or decreasing the value of 'nr_hugepages'.
+by increasing or decreasing the value of ``nr_hugepages``.
 Pages that are used as huge pages are reserved inside the kernel and cannot
 be used for other purposes. Huge pages cannot be swapped out under
@@ -73,7 +87,7 @@ memory pressure.
 Once a number of huge pages have been pre-allocated to the kernel huge page
 pool, a user with appropriate privilege can use either the mmap system call
 or shared memory system calls to use the huge pages. See the discussion of
-Using Huge Pages, below.
+:ref:`Using Huge Pages <using_huge_pages>`, below.

 The administrator can allocate persistent huge pages on the kernel boot
 command line by specifying the "hugepages=N" parameter, where 'N' = the
@@ -86,10 +100,10 @@ with a huge page size selection parameter "hugepagesz=<size>". <size> must
 be specified in bytes with optional scale suffix [kKmMgG]. The default huge
 page size may be selected with the "default_hugepagesz=<size>" boot parameter.

-When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
 Thus, one can use the following command to dynamically allocate/deallocate
-default sized persistent huge pages:
+default sized persistent huge pages::

         echo 20 > /proc/sys/vm/nr_hugepages
@@ -98,11 +112,12 @@ huge page pool to 20, allocating or freeing huge pages, as required.

 On a NUMA platform, the kernel will attempt to distribute the huge page pool
 over all the set of allowed nodes specified by the NUMA memory policy of the
-task that modifies nr_hugepages. The default for the allowed nodes--when the
+task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
 task has default memory policy--is all on-line nodes with memory. Allowed
 nodes with insufficient available, contiguous memory for a huge page will be
-silently skipped when allocating persistent huge pages. See the discussion
-below of the interaction of task memory policy, cpusets and per node attributes
+silently skipped when allocating persistent huge pages. See the
+:ref:`discussion below <mem_policy_and_hp_alloc>`
+of the interaction of task memory policy, cpusets and per node attributes
 with the allocation and freeing of persistent huge pages.
 The success or failure of huge page allocation depends on the amount of
@@ -117,51 +132,52 @@ init files. This will enable the kernel to allocate huge pages early in
 the boot process when the possibility of getting physical contiguous pages
 is still very high. Administrators can verify the number of huge pages
 actually allocated by checking the sysctl or meminfo. To check the per node
-distribution of huge pages in a NUMA system, use:
+distribution of huge pages in a NUMA system, use::

         cat /sys/devices/system/node/node*/meminfo | fgrep Huge

-/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
-huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
+``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
+huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
 requested by applications. Writing any non-zero value into this file
 indicates that the hugetlb subsystem is allowed to try to obtain that
 number of "surplus" huge pages from the kernel's normal page pool, when the
 persistent huge page pool is exhausted. As these surplus huge pages become
 unused, they are freed back to the kernel's normal page pool.
-When increasing the huge page pool size via nr_hugepages, any existing surplus
-pages will first be promoted to persistent huge pages. Then, additional
+When increasing the huge page pool size via ``nr_hugepages``, any existing
+surplus pages will first be promoted to persistent huge pages. Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
 the new persistent huge page pool size.

 The administrator may shrink the pool of persistent huge pages for
-the default huge page size by setting the nr_hugepages sysctl to a
+the default huge page size by setting the ``nr_hugepages`` sysctl to a
 smaller value. The kernel will attempt to balance the freeing of huge pages
-across all nodes in the memory policy of the task modifying nr_hugepages.
+across all nodes in the memory policy of the task modifying ``nr_hugepages``.
 Any free huge pages on the selected nodes will be freed back to the kernel's
 normal page pool.

-Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
 it becomes less than the number of huge pages in use will convert the balance
 of the in-use huge pages to surplus huge pages. This will occur even if
-the number of surplus pages it would exceed the overcommit value. As long as
-this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+the number of surplus pages would exceed the overcommit value. As long as
+this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
 increased sufficiently, or the surplus huge pages go out of use and are freed--
 no more surplus huge pages will be allowed to be allocated.
 With support for multiple huge page pools at run-time available, much of
-the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
-The /proc interfaces discussed above have been retained for backwards
-compatibility. The root huge page control directory in sysfs is:
+the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
+sysfs.
+The ``/proc`` interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is::

         /sys/kernel/mm/hugepages

 For each huge page size supported by the running kernel, a subdirectory
-will exist, of the form:
+will exist, of the form::

         hugepages-${size}kB

-Inside each of these directories, the same set of files will exist:
+Inside each of these directories, the same set of files will exist::

         nr_hugepages
         nr_hugepages_mempolicy
@@ -172,37 +188,39 @@ Inside each of these directories, the same set of files will exist:

 which function as described above for the default huge page-sized case.

+.. _mem_policy_and_hp_alloc:
+
 Interaction of Task Memory Policy with Huge Page Allocation/Freeing
 ===================================================================

-Whether huge pages are allocated and freed via the /proc interface or
-the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
-nodes from which huge pages are allocated or freed are controlled by the
-NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
-sysctl or attribute. When the nr_hugepages attribute is used, mempolicy
+Whether huge pages are allocated and freed via the ``/proc`` interface or
+the ``/sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
+NUMA nodes from which huge pages are allocated or freed are controlled by the
+NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
+sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
 is ignored.

 The recommended method to allocate or free huge pages to/from the kernel
-huge page pool, using the nr_hugepages example above, is:
+huge page pool, using the ``nr_hugepages`` example above, is::

         numactl --interleave <node-list> echo 20 \
                 >/proc/sys/vm/nr_hugepages_mempolicy

-or, more succinctly:
+or, more succinctly::

         numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
-This will allocate or free abs(20 - nr_hugepages) to or from the nodes
+This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
 specified in <node-list>, depending on whether number of persistent huge pages
 is initially less than or greater than 20, respectively. No huge pages will be
 allocated nor freed on any node not included in the specified <node-list>.

-When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
+When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
 memory policy mode--bind, preferred, local or interleave--may be used. The
 resulting effect on persistent huge page allocation is as follows:

-1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+#. Regardless of mempolicy mode [see
+   :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
    persistent huge pages will be distributed across the node or nodes
    specified in the mempolicy as if "interleave" had been specified.
    However, if a node in the policy does not contain sufficient contiguous
@@ -212,7 +230,7 @@ resulting effect on persistent huge page allocation is as follows:
    possibly, allocation of persistent huge pages on nodes not allowed by
    the task's memory policy.

-2) One or more nodes may be specified with the bind or interleave policy.
+#. One or more nodes may be specified with the bind or interleave policy.
    If more than one node is specified with the preferred policy, only the
    lowest numeric id will be used. Local policy will select the node where
    the task is running at the time the nodes_allowed mask is constructed.
@@ -222,20 +240,20 @@ resulting effect on persistent huge page allocation is as follows:
    indeterminate. Thus, local policy is not very useful for this purpose.
    Any of the other mempolicy modes may be used to specify a single node.

-3) The nodes allowed mask will be derived from any non-default task mempolicy,
+#. The nodes allowed mask will be derived from any non-default task mempolicy,
    whether this policy was set explicitly by the task itself or one of its
    ancestors, such as numactl. This means that if the task is invoked from a
    shell with non-default policy, that policy will be used. One can specify a
    node list of "all" with numactl --interleave or --membind [-m] to achieve
    interleaving over all nodes in the system or cpuset.

-4) Any task mempolicy specified--e.g., using numactl--will be constrained by
+#. Any task mempolicy specified--e.g., using numactl--will be constrained by
    the resource limits of any cpuset in which the task runs. Thus, there will
    be no way for a task with non-default policy running in a cpuset with a
    subset of the system nodes to allocate huge pages outside the cpuset
    without first moving to a cpuset that contains all of the desired nodes.

-5) Boot-time huge page allocation attempts to distribute the requested number
+#. Boot-time huge page allocation attempts to distribute the requested number
    of huge pages over all on-lines nodes with memory.
Per Node Hugepages Attributes
=============================

A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under the system device of each
NUMA node with memory in::

	/sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::

	nr_hugepages
	free_hugepages
	surplus_hugepages

The ``free_hugepages`` and ``surplus_hugepages`` attribute files are
read-only. They return the number of free and surplus [overcommitted]
huge pages, respectively, on the parent node.

The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node. When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if sufficient
resources exist, regardless of the task's mempolicy or cpuset constraints.

Note that the numbers of overcommit and reserve pages remain global
quantities, as we don't know until fault time, when the faulting task's
mempolicy is applied, from which node the huge page allocation will be
attempted.
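
As an aside, a program can adjust a single node's persistent pool by
writing that node's ``nr_hugepages`` attribute directly. The sketch below
is illustrative only: the node number, the ``hugepages-2048kB`` size
directory and the count of 16 are assumptions, not values mandated by the
interface, and root privileges are required::

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Assumed path: node 0, 2 MB huge pages; adjust for your system. */
		const char *attr = "/sys/devices/system/node/node0/hugepages/"
				   "hugepages-2048kB/nr_hugepages";
		int fd = open(attr, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Request 16 persistent huge pages on this node. */
		if (write(fd, "16", 2) != 2)
			perror("write");
		close(fd);
		return 0;
	}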
.. _using_huge_pages:

Using Huge Pages
================

If user applications are going to request huge pages using the mmap system
call, then the system administrator needs to mount a file system of
type hugetlbfs::

	mount -t hugetlbfs \
		-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
		min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.

The ``uid`` and ``gid`` options set the owner and group of the root of the
file system. By default the ``uid`` and ``gid`` of the current process
are taken.

The ``mode`` option sets the mode of the root of the file system to
value & 01777. This value is given in octal. By default the value 0755
is picked.

If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool. ``pagesize``
is specified in bytes. If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.

The ``size`` option sets the maximum value of memory (huge pages) allowed
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
The size is rounded down to the HPAGE_SIZE boundary.

The ``min_size`` option sets the minimum value of memory (huge pages) allowed
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` is reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.

The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.
If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
the command line then no limits are set.
For the ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you
can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.
While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with right permissions) can be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.

Users who wish to use hugetlb memory via shared memory segment should be
members of a supplementary group and the system admin needs to configure that
gid into ``/proc/sys/vm/hugetlb_shm_group``. It is possible for the same or
different applications to use any combination of mmaps and shm* calls, though
the mount of the filesystem will be required for using mmap calls without
MAP_HUGETLB.
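
For orientation only, a minimal sketch of such a MAP_HUGETLB mapping is
shown below; the 2 MB length assumes the platform's default huge page
size, and the kernel's map_hugetlb selftest referenced below remains the
authoritative example::

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	#define LENGTH (2UL * 1024 * 1024)	/* assumes 2 MB default huge pages */

	int main(void)
	{
		/* No hugetlbfs mount is needed for MAP_HUGETLB mappings. */
		void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (addr == MAP_FAILED) {
			perror("mmap");	/* e.g. no free huge pages in the pool */
			return 1;
		}
		memset(addr, 0, LENGTH);	/* touch it to fault in huge pages */
		munmap(addr, LENGTH);
		return 0;
	}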
Syscalls that operate on memory backed by hugetlb pages only have their
lengths aligned to the native page size of the processor; they will normally
fail with errno set to EINVAL or exclude hugetlb pages that extend beyond
the length if not hugepage aligned. For example, munmap(2) will fail if
memory is backed by a hugetlb page and the length is smaller than the
hugepage size.

Examples
========

.. _map_hugetlb:

``map_hugetlb``
	see tools/testing/selftests/vm/map_hugetlb.c

``hugepage-shm``
	see tools/testing/selftests/vm/hugepage-shm.c

``hugepage-mmap``
	see tools/testing/selftests/vm/hugepage-mmap.c

The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.

.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
.. _idle_page_tracking:

==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature allows tracking which memory pages are being
accessed by a workload and which are idle. This information can be useful for
estimating the workload's working set size, which, in turn, can be taken into
account when configuring the workload parameters, setting memory cgroup
limits, or deciding where to place the workload within a compute cluster.

It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
set, the corresponding page is idle.

A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the
:ref:`Implementation Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO.

That said, in order to estimate the number of pages that are not used by a
workload one should:

1. Mark all the workload's pages as idle by setting corresponding bits in
   ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
   ``/proc/pid/pagemap`` if the workload is represented by a process, or by
   filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
   is placed in a memory cgroup.

2. Wait until the workload accesses its working set.

3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
   If one wants to ignore certain types of pages, e.g. mlocked pages, since
   they are not reclaimable, they can be filtered out using
   ``/proc/kpageflags``.

See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
``/proc/kpagecgroup``.
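
To make the bitmap layout concrete, the following sketch marks a single
page idle (step 1 above, reduced to one page). The PFN value is an
assumption for the example; a real tool would obtain it from
``/proc/pid/pagemap``, and writing the bitmap requires root::

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint64_t pfn = 0x12345;	/* assumed PFN, normally from pagemap */
		uint64_t word = 1ULL << (pfn % 64);	/* bit inside the 8-byte word */
		off_t offset = (pfn / 64) * 8;	/* file offset of that word */
		int fd = open("/sys/kernel/mm/page_idle/bitmap", O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Writes must be 8 bytes, 8-byte aligned; set bits are OR-ed in. */
		if (pwrite(fd, &word, sizeof(word), offset) != sizeof(word))
			perror("pwrite");
		close(fd);
		return 0;
	}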
.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space, in which case one or more of its PTEs will have the Accessed bit set,
or if it has been marked accessed explicitly by the kernel (see
mark_page_accessed()).

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>` section), and cleared automatically whenever a page
is referenced as defined above.
=================
Memory Management
=================
The Linux memory management subsystem is responsible, as the name implies,
for managing the memory in the system. This includes implementation of
virtual memory and demand paging, memory allocation both for kernel
internal structures and user space programs, mapping of files into
processes' address space and many other cool things.

Linux memory management is a complex system with many configurable
settings. Most of these settings are available via the ``/proc``
filesystem and can be queried and adjusted using ``sysctl``. These APIs
are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
Linux memory management has its own jargon, and if you are not yet
familiar with it, consider reading
:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
Here we document in detail how to interact with various mechanisms of
the Linux memory management subsystem.
.. toctree::
:maxdepth: 1
concepts
hugetlbpage
idle_page_tracking
ksm
numa_memory_policy
pagemap
soft-dirty
transhuge
userfaultfd
.. _admin_guide_ksm:
=======================
Kernel Samepage Merging
=======================
Overview
========
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/.
KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
by sharing the data common between them. But it can be useful to any
application which generates many instances of the same data.
The KSM daemon ksmd periodically scans those areas of user memory
which have been registered with it, looking for pages of identical
content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The number of pages that the KSM daemon scans in a single pass
and the time between the passes are configured using the :ref:`sysfs
interface <ksm_sysfs>`.
KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
be swapped out just like other user pages (but sharing is broken when they
are swapped back in: ksmd must rediscover their identity and merge again).
Controlling KSM with madvise
============================
KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
system call::
int madvise(addr, length, MADV_MERGEABLE)
The app may call
::
int madvise(addr, length, MADV_UNMERGEABLE)
to cancel that advice and restore unshared pages: whereupon KSM
unmerges whatever it merged in that range. Note: this unmerging call
may suddenly require more memory than is available - possibly failing
with EAGAIN, but more probably arousing the Out-Of-Memory killer.
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if the
KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
and might fail with EAGAIN if there is not enough memory for internal structures.
Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit. KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason.
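
As a minimal sketch of the calls above, an application might register an
anonymous buffer it expects to be duplicate-heavy; the 16 MB size is an
assumption chosen purely for illustration::

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 16UL * 1024 * 1024;	/* assumed dedup-friendly area */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Advise KSM that this anonymous range is a merge candidate. */
		if (madvise(buf, len, MADV_MERGEABLE))
			perror("madvise");	/* EINVAL if CONFIG_KSM is off */

		/* ... run the workload; ksmd merges identical pages ... */

		if (madvise(buf, len, MADV_UNMERGEABLE))	/* may need memory */
			perror("madvise");
		munmap(buf, len);
		return 0;
	}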
.. _ksm_sysfs:
KSM daemon sysfs interface
==========================
The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
readable by all but writable only by root:
pages_to_scan
how many pages to scan before ksmd goes to sleep
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
Default: 100 (chosen for demonstration purposes)
sleep_millisecs
how many milliseconds ksmd should sleep before next scan
e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs``
Default: 20 (chosen for demonstration purposes)
merge_across_nodes
specifies if pages from different NUMA nodes can be merged.
When set to 0, ksm merges only pages which physically reside
in the memory area of the same NUMA node. That brings lower
latency of access to shared pages. Systems with more nodes, at
significant NUMA distances, are likely to benefit from the
lower latency of setting 0. Smaller systems, which need to
minimize memory usage, are likely to benefit from the greater
sharing of setting 1 (default). You may wish to compare how
your system performs under each setting, before deciding on
which to use. The ``merge_across_nodes`` setting can be changed only
when there are no ksm shared pages in the system: set ``run`` to 2 to
unmerge pages first, then set it to 1 after changing
``merge_across_nodes``, to remerge according to the new setting.
Default: 1 (merging across nodes as in earlier releases)
run
* set to 0 to stop ksmd from running but keep merged pages,
* set to 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
* set to 2 to stop ksmd and unmerge all pages currently merged, but
leave mergeable areas registered for next run.
Default: 0 (must be changed to 1 to activate KSM, except if
CONFIG_SYSFS is disabled)
use_zero_pages
specifies whether empty pages (i.e. allocated pages that only
contain zeroes) should be treated specially. When set to 1,
empty pages are merged with the kernel zero page(s) instead of
with each other as it would happen normally. This can improve
the performance on architectures with coloured zero pages,
depending on the workload. Care should be taken when enabling
this setting, as it can potentially degrade the performance of
KSM for some workloads, for example if the checksums of pages
considered for merging match the checksum of an empty
page. This setting can be changed at any time; it is only
effective for pages merged after the change.
Default: 0 (normal KSM behaviour as in earlier releases)
max_page_sharing
Maximum sharing allowed for each KSM page. This enforces a
deduplication limit to avoid high latency for virtual memory
operations that involve traversal of the virtual mappings that
share the KSM page. The minimum value is 2 as a newly created
KSM page will have at least two sharers. The higher this value
the faster KSM will merge the memory and the higher the
deduplication factor will be, but the slower the worst case
virtual mappings traversal could be for any given KSM
page. Slowing down this traversal means there will be higher
latency for certain virtual memory operations happening during
swapping, compaction, NUMA balancing and page migration, in
turn decreasing responsiveness for the caller of those virtual
memory operations. The scheduler latency of other tasks not
involved with the VM operations doing the virtual mappings
traversal is not affected by this parameter as these
traversals are always schedule friendly themselves.
stable_node_chains_prune_millisecs
specifies how frequently KSM checks the metadata of the pages
that hit the deduplication limit for stale information.
Smaller millisecond values will free up the KSM metadata with
lower latency, but they will make ksmd use more CPU during the
scan. It's a noop if not a single KSM page hit the
``max_page_sharing`` limit yet.
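
Should a program rather than an administrator drive these tunables, the
sysfs files can be written directly; a minimal sketch of activating ksmd
(equivalent to the ``echo 1 > /sys/kernel/mm/ksm/run`` above, and assuming
root privileges) might look like::

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/sys/kernel/mm/ksm/run", O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Start ksmd; merged pages are kept when it is stopped with 0. */
		if (write(fd, "1", 1) != 1)
			perror("write");
		close(fd);
		return 0;
	}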
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
pages_shared
how many shared pages are being used
pages_sharing
how many more sites are sharing them i.e. how much saved
pages_unshared
how many pages unique but repeatedly checked for merging
pages_volatile
how many pages changing too fast to be placed in a tree
full_scans
how many times all mergeable areas have been scanned
stable_node_chains
the number of KSM pages that hit the ``max_page_sharing`` limit
stable_node_dups
number of duplicated KSM pages
A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
indicates wasted effort. ``pages_volatile`` embraces several
different kinds of activity, but a high proportion there would also
indicate poor use of madvise MADV_MERGEABLE.
The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
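
As a purely illustrative reading of these counters: if ``pages_shared`` is
1000 and ``pages_sharing`` is 9000, then on average ten mappings point at
each KSM page, and with 4 KiB pages roughly 9000 * 4 KiB = ~35 MiB of
memory has been saved.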
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
.. _numa_memory_policy:
==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================
In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/cgroup-v1/cpusets.txt``),
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.
Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports _scopes_ of memory policy, described here from
most general to most specific:
System Default Policy
	this policy is "hard coded" into the kernel. It is the policy
	that governs all page allocations that aren't controlled by
	one of the more specific policy scopes discussed below. When
	the system is "up and running", the system default policy will
	use "local allocation" described below. However, during boot
	up, the system default policy will be set to interleave
	allocations across all nodes with "sufficient" memory, so as
	not to overload the initial boot node with boot-time
	allocations.

Task/Process Policy
	this is an optional, per-task policy. When defined for a
	specific task, this policy controls all page allocations made
	by or on behalf of the task that aren't controlled by a more
	specific scope. If a task does not define a task policy, then
	all page allocations that would have been controlled by the
	task policy "fall back" to the System Default Policy.

	The task policy applies to the entire address space of a task. Thus,
	it is inheritable, and indeed is inherited, across both fork()
	[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
	to establish the task policy for a child task exec()'d from an
	executable image that has no awareness of memory policy. See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the system call
	that a task may use to set/change its task/process policy.
	In a multi-threaded task, task policies apply only to the thread
	[Linux kernel task] that installs the policy and any threads
	subsequently created by that task. Any sibling threads existing
	at the time a new task policy is installed retain their current
	policy.

	A task policy applies only to pages allocated after the policy is
	installed. Any pages already faulted in by the task when the task
	changes its task policy remain where they were allocated based on
	the policy at the time they were allocated.
.. _vma_policy:

VMA Policy
	A "VMA" or "Virtual Memory Area" refers to a range of a task's
	virtual address space. A task may define a specific policy for a range
	of its virtual address space. See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the mbind() system call used to set a VMA
	policy.

	A VMA policy will govern the allocation of pages that back
	this region of the address space. Any regions of the task's
	address space that don't have an explicit VMA policy will fall
	back to the task policy, which may itself fall back to the
	System Default Policy.

	VMA policies have a few complicating details:

	* VMA policy applies ONLY to anonymous pages. These include
	  pages allocated for anonymous segments, such as the task
	  stack and heap, and any regions of the address space
	  mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
	  applied to a file mapping, it will be ignored if the mapping
	  used the MAP_SHARED flag. If the file mapping used the
	  MAP_PRIVATE flag, the VMA policy will only be applied when
	  an anonymous page is allocated on an attempt to write to the
	  mapping--i.e., at Copy-On-Write.

	* VMA policies are shared between all tasks that share a
	  virtual address space--a.k.a. threads--independent of when
	  the policy is installed; and they are inherited across
	  fork(). However, because VMA policies refer to a specific
	  region of a task's address space, and because the address
	  space is discarded and recreated on exec*(), VMA policies
	  are NOT inheritable across exec(). Thus, only NUMA-aware
	  applications may use VMA policies.

	* A task may install a new VMA policy on a sub-range of a
	  previously mmap()ed region. When this happens, Linux splits
	  the existing virtual memory area into 2 or 3 VMAs, each with
	  its own policy.

	* By default, VMA policy applies only to pages allocated after
	  the policy is installed. Any pages already faulted into the
	  VMA range remain where they were allocated based on the
	  policy at the time they were allocated. However, since
	  2.6.16, Linux supports page migration via the mbind() system
	  call, so that page contents can be moved to match a newly
	  installed policy.

Shared Policy
	Conceptually, shared policies apply to "memory objects" mapped
	shared into one or more tasks' distinct address spaces. An
	application installs shared policies the same way as VMA
	policies--using the mbind() system call specifying a range of
	virtual addresses that map the shared object. However, unlike
	VMA policies, which can be considered to be an attribute of a
	range of a task's address space, shared policies apply
	directly to the shared object. Thus, all tasks that attach to
	the object share the policy, and all pages allocated for the
	shared object, by any task, will obey the shared policy.
	As of 2.6.22, only shared memory segments, created by shmget() or
	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
	policy support was added to Linux, the associated data structures
	were added to hugetlbfs shmem segments. At the time, hugetlbfs did
	not support allocation at fault time--a.k.a. lazy allocation--so
	hugetlbfs shmem segments were never "hooked up" to the shared policy
	support. Although hugetlbfs segments now support lazy allocation,
	their support for shared policy has not been completed.

	As mentioned above in the :ref:`VMA policies <vma_policy>` section,
	allocations of page cache pages for regular files mmap()ed
	with MAP_SHARED ignore any VMA policy installed on the virtual
	address range backed by the shared file mapping. Rather,
	shared page cache pages, including pages backing private
	mappings that have not yet been written by the task, follow
	task policy, if any, else System Default Policy.
	The shared policy infrastructure supports different policies on
	subset ranges of the shared object. However, Linux still splits the
	VMA of the task that installs the policy for each range of distinct
	policy. Thus, different tasks that attach to a shared memory segment
	can have different VMA configurations mapping that one shared object.
	This can be seen by examining the per-task entries in
	/proc/<pid>/numa_maps of tasks sharing a shared memory region, when
	one task has installed shared policy on
	one or more ranges of the region.
Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes. The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:
Default Mode--MPOL_DEFAULT
	This mode is only used in the memory policy APIs. Internally,
	MPOL_DEFAULT is converted to the NULL memory policy in all
	policy scopes. Any existing non-default policy will simply be
	removed when MPOL_DEFAULT is specified. As a result,
	MPOL_DEFAULT means "fall back to the next most specific policy
	scope."

	For example, a NULL or default task policy will fall back to the
	system default policy. A NULL or default vma policy will fall
	back to the task policy.

	When specified in one of the memory policy APIs, the Default mode
	does not use the optional set of nodes.

	It is an error for the set of nodes specified for this policy to
	be non-empty.

MPOL_BIND
	This mode specifies that memory must come from the set of
	nodes specified by the policy. Memory will be allocated from
	the node in the set with sufficient free memory that is
	closest to the node where the allocation takes place.

MPOL_PREFERRED
	This mode specifies that the allocation should be attempted
	from the single node specified in the policy. If that
	allocation fails, the kernel will search other nodes, in order
	of increasing distance from the preferred node based on
	information provided by the platform firmware.

	Internally, the Preferred policy uses a single node--the
	preferred_node member of struct mempolicy. When the internal
	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
	and the policy is interpreted as local allocation. "Local"
	allocation policy can be viewed as a Preferred policy that
	starts at the node containing the cpu where the allocation
	takes place.

	It is possible for the user to specify that local allocation
	is always preferred by passing an empty nodemask with this
	mode. If an empty nodemask is passed, the policy cannot use
	the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
	described below.
MPOL_INTERLEAVED
	This mode specifies that page allocations be interleaved, on a
	page granularity, across the nodes specified in the policy.
	This mode also behaves slightly differently, based on the
	context where it is used:

	For allocation of anonymous pages and shared memory pages,
	Interleave mode indexes the set of nodes specified by the
	policy using the page offset of the faulting address into the
	segment [VMA] containing the address modulo the number of
	nodes specified by the policy. It then attempts to allocate a
	page, starting at the selected node, as if the node had been
	specified by a Preferred policy or had been selected by a
	local allocation. That is, allocation will follow the per
	node zonelist.

	For allocation of page cache pages, Interleave mode indexes
	the set of nodes specified by the policy using a node counter
	maintained per task. This counter wraps around to the lowest
	specified node after it reaches the highest specified node.
	This will tend to spread the pages out over the nodes
	specified by the policy based on the order in which they are
	allocated, rather than based on any page offset into an
	address range or file. During system boot up, the temporary
	interleaved system default policy works in this mode.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
	This flag specifies that the nodemask passed by
	the user should not be remapped if the task or VMA's set of allowed
	nodes changes after the memory policy has been defined.
	Without this flag, any time a mempolicy is rebound because of a
	change in the set of allowed nodes, the node (Preferred) or
	nodemask (Bind, Interleave) is remapped to the new set of
	allowed nodes. This may result in nodes being used that were
	previously undesired.

	With this flag, if the user-specified nodes overlap with the
	nodes allowed by the task's cpuset, then the memory policy is
	applied to their intersection. If the two sets of nodes do not
	overlap, the Default policy is used.

	For example, consider a task that is attached to a cpuset with
	mems 1-3 that sets an Interleave policy over the same set. If
	the cpuset's mems change to 3-5, the Interleave will now occur
	over nodes 3, 4, and 5. With this flag, however, since only node
	3 is allowed from the user's nodemask, the "interleave" only
	occurs over that node. If no nodes from the user's nodemask are
	now allowed, the Default behavior is used.

	MPOL_F_STATIC_NODES cannot be combined with the
	MPOL_F_RELATIVE_NODES flag. It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).
MPOL_F_RELATIVE_NODES
	This flag specifies that the nodemask passed by the user will be
	mapped relative to the task or VMA's set of allowed nodes. The
	kernel stores the user-passed nodemask, and if the allowed nodes
	change, then that original nodemask will be remapped relative to
	the new set of allowed nodes.

	Without this flag (and without MPOL_F_STATIC_NODES), any time a
	mempolicy is rebound because of a change in the set of allowed
	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes. That remap may not
	preserve the relative nature of the user's passed nodemask to its
	set of allowed nodes upon successive rebinds: a nodemask of
	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
	allowed nodes is restored to its original state.

	With this flag, the remap is done so that the node numbers from
	the user's passed nodemask are relative to the set of allowed
	nodes. In other words, if nodes 0, 2, and 4 are set in the user's
	nodemask, the policy will be effected over the first (and in the
	Bind or Interleave case, the third and fifth) nodes in the set of
	allowed nodes. The nodemask passed by the user represents nodes
	relative to the task or VMA's set of allowed nodes.

	If the user's nodemask includes nodes that are outside the range
	of the new set of allowed nodes (for example, node 5 is set in
	the user's nodemask when the set of allowed nodes is only 0-3),
	then the remap wraps around to the beginning of the nodemask and,
	if not already set, sets the node in the mempolicy nodemask.

	For example, consider a task that is attached to a cpuset with
	mems 2-5 that sets an Interleave policy over the same set with
	MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
	interleave now occurs over nodes 3,5-7. If the cpuset's mems
	then change to 0,2-3,5, then the interleave occurs over nodes
	0,2-3,5.

	Thanks to the consistent remapping, applications preparing
	nodemasks to specify memory policies using this flag should
	disregard their current, actual cpuset imposed memory placement
	and prepare the nodemask as if they were always located on
	memory nodes 0 to N-1, where N is the number of memory nodes the
	policy is intended to manage. Let the kernel then remap to the
	set of memory nodes allowed by the task's cpuset, as that may
	change over time.

	MPOL_F_RELATIVE_NODES cannot be combined with the
	MPOL_F_STATIC_NODES flag. It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).
Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field. Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively.

These races can also be avoided by prefaulting the entire shared memory
region into memory and locking it down. However, this might not be
appropriate for all applications.
.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel. The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.
Set [Task] Memory Policy::

	long set_mempolicy(int mode, const unsigned long *nmask,
			   unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'. 'nmask' points to a bit mask of node ids containing at least
'maxnode' ids. Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.
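
For instance, a task could request interleaved allocations over nodes 0
and 1 as sketched below; this assumes both nodes exist and uses the
``numaif.h`` wrapper header shipped with the numactl package (link with
``-lnuma``)::

	#include <numaif.h>	/* from the numactl package, not the kernel */
	#include <stdio.h>

	int main(void)
	{
		/* Bit mask with nodes 0 and 1 set; assumes both exist. */
		unsigned long nodemask = (1UL << 0) | (1UL << 1);

		/* Interleave this task's future allocations across nodes 0-1. */
		if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
				  8 * sizeof(nodemask)) != 0) {
			perror("set_mempolicy");
			return 1;
		}
		return 0;
	}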
Get [Task] Memory Policy or Related Information::

	long get_mempolicy(int *mode,
			   const unsigned long *nmask, unsigned long maxnode,
			   void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.
Install VMA/Shared Policy for a Range of Task's Address Space::

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnodes) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments. Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.
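For a quick illustration -- a sketch rather than normative documentation --
the following program sets an interleave policy across nodes 0 and 1
through the wrappers shipped in the numactl development package (the
package and the node ids are assumptions of this example)::

    #include <numaif.h>     /* set_mempolicy(), MPOL_* (numactl package) */
    #include <stdio.h>

    int main(void)
    {
            /* Interleave future allocations across nodes 0 and 1. */
            unsigned long nodemask = (1UL << 0) | (1UL << 1);

            if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                              sizeof(nodemask) * 8) != 0) {
                    perror("set_mempolicy");
                    return 1;
            }
            /* Anonymous memory touched from now on is interleaved. */
            return 0;
    }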
Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:
...@@ -428,8 +469,10 @@ containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
......
.. _pagemap:

=============================
Examining Process Page Tables
=============================

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in ``/proc``.

There are four components to pagemap:

 * ``/proc/pid/pagemap``.  This file lets a userspace process find out which
   physical frame each virtual page is mapped to.  It contains one 64-bit
   value for each virtual page, containing the following data (from
   ``fs/proc/task_mmu.c``, above pagemap_read):

    * Bits 0-54  page frame number (PFN) if present
    * Bits 0-4   swap type if swapped
    * Bits 5-54  swap offset if swapped
    * Bit  55    pte is soft-dirty (see
      :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
    * Bit  56    page exclusively mapped (since 4.2)
    * Bits 57-60 zero
    * Bit  61    page is file-page or shared-anon (since 3.5)
...@@ -33,28 +37,28 @@ There are four components to pagemap:
   precisely which pages are mapped (or in swap) and comparing mapped
   pages between processes.

   Efficient users of this interface will use ``/proc/pid/maps`` to
   determine which areas of memory are actually mapped and llseek to
   skip over unmapped regions.

 * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
   times each page is mapped, indexed by PFN.

 * ``/proc/kpageflags``.  This file contains a 64-bit set of flags for each
   page, indexed by PFN.

   The flags are (from ``fs/proc/page.c``, above kpageflags_read):
    0. LOCKED
    1. ERROR
    2. REFERENCED
    3. UPTODATE
    4. DIRTY
    5. LRU
    6. ACTIVE
    7. SLAB
    8. WRITEBACK
    9. RECLAIM
    10. BUDDY
    11. MMAP
    12. ANON
...@@ -72,98 +76,111 @@ There are four components to pagemap:
    24. ZERO_PAGE
    25. IDLE

 * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN. Only available when
   CONFIG_MEMCG is set.
Short descriptions of the page flags
====================================

0 - LOCKED
   page is being locked for exclusive access, e.g. by undergoing read/write IO
7 - SLAB
   page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
   When compound page is used, SLUB/SLQB will only set this flag on the head
   page; SLOB will not flag it at all.
10 - BUDDY
   a free memory block managed by the buddy system allocator
   The buddy system organizes free memory in blocks of various orders.
   An order N block has 2^N physically contiguous pages, with the BUDDY flag
   set for and _only_ for the first page.
15 - COMPOUND_HEAD
   A compound page with order N consists of 2^N physically contiguous pages.
   A compound page with order 2 takes the form of "HTTT", where H denotes its
   head page and T denotes its tail page(s). The major consumers of compound
   pages are hugeTLB pages
   (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
   the SLUB etc. memory allocators and various device drivers.
   However in this interface, only huge/giga pages are made visible
   to end users.
16 - COMPOUND_TAIL
   A compound page tail (see description above).
17 - HUGE
   this is an integral part of a HugeTLB page
19 - HWPOISON
   hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
   no page frame exists at the requested address
21 - KSM
   identical memory pages dynamically shared between one or more processes
22 - THP
   contiguous pages which construct transparent hugepages
23 - BALLOON
   balloon compaction page
24 - ZERO_PAGE
   zero page for pfn_zero or huge_zero page
25 - IDLE
   page has not been accessed since it was marked idle (see
   :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
   Note that this flag may be stale in case the page was accessed via
   a PTE. To make sure the flag is up-to-date one has to read
   ``/sys/kernel/mm/page_idle/bitmap`` first.
IO related page flags
---------------------

1 - ERROR
   IO error occurred
3 - UPTODATE
   page has up-to-date data
   i.e. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
   page has been written to, hence contains new data
   i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
   page is being synced to disk

LRU related page flags
----------------------

5 - LRU
   page is in one of the LRU lists
6 - ACTIVE
   page is in the active LRU list
18 - UNEVICTABLE
   page is in the unevictable (non-)LRU list. It is somehow pinned and
   not a candidate for LRU page reclaims, e.g. ramfs pages,
   shmctl(SHM_LOCK) and mlock() memory segments
2 - REFERENCED
   page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
   page will be reclaimed soon after its pageout IO completed
11 - MMAP
   a memory mapped page
12 - ANON
   a memory mapped page that is not part of a file
13 - SWAPCACHE
   page is mapped to swap space, i.e. has an associated swap entry
14 - SWAPBACKED
   page is backed by swap/RAM
The page-types tool in the tools/vm directory can be used to query the
above flags.
Using pagemap to do something useful
====================================

The general procedure for using pagemap to find out about a process' memory
usage goes like this:

 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
    mapped to what.
 2. Select the maps you are interested in -- all of them, or a particular
    library, or the stack or the heap, etc.
 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
 4. Read a u64 for each page from pagemap.
 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``.  For each PFN you
    just read, seek to that entry in the file, and read the data you want.
For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
...@@ -171,7 +188,8 @@ you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
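A compact sketch of steps 3-5 (illustrative only; error handling and the
map selection of steps 1-2 are omitted, and reading ``/proc/kpagecount``
typically requires appropriate privileges)::

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define PM_PRESENT  (1ULL << 63)
    #define PM_PFN_MASK ((1ULL << 55) - 1)      /* bits 0-54 */

    /* How many times is the page backing 'vaddr' in process 'pid' mapped? */
    int64_t map_count(int pid, uintptr_t vaddr)
    {
            char path[64];
            uint64_t entry = 0, count = 0;
            long psize = sysconf(_SC_PAGESIZE);
            int pm, kc;

            snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
            pm = open(path, O_RDONLY);
            kc = open("/proc/kpagecount", O_RDONLY);
            if (pm < 0 || kc < 0)
                    return -1;
            /* Steps 3/4: one u64 pagemap entry per virtual page. */
            pread(pm, &entry, sizeof(entry), (vaddr / psize) * sizeof(entry));
            /* Step 5: kpagecount is indexed by PFN. */
            if (entry & PM_PRESENT)
                    pread(kc, &count, sizeof(count),
                          (entry & PM_PFN_MASK) * sizeof(count));
            close(pm);
            close(kc);
            return (entry & PM_PRESENT) ? (int64_t)count : 0;
    }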
Other notes
===========

Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
......
.. _soft_dirty:

===============
Soft-Dirty PTEs
===============

The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

  1. Clear soft-dirty bits from the task's PTEs.

     This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
     task in question.

  2. Wait some time.

  3. Read soft-dirty bits from the PTEs.

     This is done by reading ``/proc/PID/pagemap``. Bit 55 of the
     64-bit qword is the soft-dirty one. If set, the respective PTE was
     written to since step 1.
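Putting the steps together, a minimal user-space sketch (illustrative
only; error handling omitted) could be::

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define SOFT_DIRTY (1ULL << 55)

    /* Step 1: reset the soft-dirty bits of process 'pid'. */
    void clear_soft_dirty(int pid)
    {
            char path[64];
            int fd;

            snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
            fd = open(path, O_WRONLY);
            write(fd, "4", 1);
            close(fd);
    }

    /* Step 3: was the page at 'vaddr' written to since step 1? */
    int page_soft_dirty(int pid, uintptr_t vaddr)
    {
            char path[64];
            uint64_t entry = 0;
            int fd;

            snprintf(path, sizeof(path), "/proc/%d/pagemap", pid);
            fd = open(path, O_RDONLY);
            pread(fd, &entry, sizeof(entry),
                  (vaddr / sysconf(_SC_PAGESIZE)) * sizeof(entry));
            close(fd);
            return !!(entry & SOFT_DIRTY);
    }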
Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE.

Note that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus all
the kernel does is find this fact out and put both writable and soft-dirty
bits on the PTE.

While in most cases tracking memory changes by #PF-s is more than enough
there is still a scenario when we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values
...@@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such
memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty.

This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org
......
.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
using huge pages for the backing of virtual memory with huge pages
that supports the automatic promotion and demotion of page sizes and
without the shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem.
But in the future it can expand to other filesystems.

.. note::
   in the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.
The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
of significant interest because it'll also have the downside of
...@@ -23,39 +33,27 @@ only matters the first time the memory is accessed for the lifetime of
a memory mapping. The second long lasting and much more important
factor will affect all subsequent accesses to the memory for the whole
runtime of the application. The second factor consists of two
components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses. With
   virtualization and nested pagetables a TLB entry can map the
   larger size only if both KVM and the Linux guest are using
   hugepages but a significant speedup already happens if only one of
   the two is using hugepages just because of the fact the TLB miss is
   going to run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is the ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.
Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
...@@ -88,16 +86,22 @@ Applications that gets a lot of benefit from hugepages and that don't
risk to lose memory by using hugepages, should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.
.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::

    echo always >/sys/kernel/mm/transparent_hugepage/enabled
    echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
    echo never >/sys/kernel/mm/transparent_hugepage/enabled
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
...@@ -108,131 +112,145 @@ use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

::

    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer >/sys/kernel/mm/transparent_hugepage/defrag
    echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
    means that an application requesting THP will stall on
    allocation failure and directly reclaim pages and compact
    memory in an effort to allocate a THP immediately. This may be
    desirable for virtual machines that benefit heavily from THP
    use and are willing to delay the VM start to utilise them.

defer
    means that an application will wake kswapd in the background
    to reclaim pages and wake kcompactd to compact memory so that
    THP is available in the near future. It's the responsibility
    of khugepaged to then install the THP pages later.

defer+madvise
    will enter direct reclaim and compaction like ``always``, but
    only for regions that have used madvise(MADV_HUGEPAGE); all
    other regions will wake kswapd in the background to reclaim
    pages and wake kcompactd to compact memory so that THP is
    available in the near future.

madvise
    will enter direct reclaim like ``always`` but only for regions
    that have used madvise(MADV_HUGEPAGE). This is the default
    behaviour.

never
    should be self-explanatory.
By default the kernel tries to use the huge zero page on read page fault to
anonymous mapping. It's possible to disable the huge zero page by writing 0
or enable it back by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
    echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

    cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
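A hypothetical C helper for the same query (one possible approach, not
mandated by the kernel interface) could be::

    #include <stdio.h>

    /* Returns the transparent hugepage size in bytes, or 0 on error. */
    unsigned long thp_size(void)
    {
            unsigned long size = 0;
            FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
                            "r");

            if (f) {
                    if (fscanf(f, "%lu", &size) != 1)
                            size = 0;
                    fclose(f);
            }
            return size;
    }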
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".
Khugepaged controls
-------------------

khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
    echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
You can also control how many pages khugepaged should scan at each
pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

    /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's an hugepage
allocation failure to throttle the next allocation attempt::

    /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed::

    /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

for each pass::

    /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value means additional memory may be used by programs;
a lower value means less of the THP performance gain is realized.
The CPU-time cost of ``max_ptes_none`` itself is negligible and can
be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.
Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.
Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with mount option
``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
There's also sysfs knob to control hugepage allocation policy for internal
...@@ -243,110 +261,139 @@ MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to policies listed above, shmem_enabled allows two further
values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;

force
    Force the huge option on for all - very useful for testing;
Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only affect
future behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to the
regions registered in khugepaged.
Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
for each mapping.

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.
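As an illustrative sketch (not a tool shipped with the kernel; error
handling kept minimal), the AnonHugePages fields of a process can be
summed like this::

    #include <stdio.h>

    /* Sum all "AnonHugePages:" fields in /proc/PID/smaps, in kB. */
    long anon_huge_kb(int pid)
    {
            char path[64], line[256];
            long kb, total = 0;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                            total += kb;
            fclose(f);
            return total;
    }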
There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated to handle a page fault. This applies to both the
    first time a page is faulted and for COW faults.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate
    a huge page and instead falls back to using small pages.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a file huge page is successfully
    allocated.

thp_file_mapped
    is incremented every time a file huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on split queue are
    going to be split under memory pressure.

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls mprotect() or
    munmap() on part of a huge page. It doesn't split the huge page, only
    the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page is
    successfully allocated. It includes allocations which were
    dropped due to a race with another allocation. Note, it doesn't count
    every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
    is incremented every time a process stalls to run
    memory compaction so that a huge page is free for use.

compact_success
    is incremented if the system compacted memory and
    freed a huge page for use.

compact_fail
    is incremented if the system tries to compact memory
    but failed.

compact_pages_moved
    is incremented each time a page is moved. If
    this value is increasing rapidly, it implies that the system
    is copying a lot of data to satisfy the huge page allocation.
    It is possible that the cost of copying exceeds any savings
    from reduced TLB misses.

compact_pagemigrate_failed
    is incremented when the underlying mechanism
    for moving a page failed.

compact_blocks_moved
    is incremented each time memory compaction examines
    a huge page aligned range of pages.
It is possible to establish how long the stalls were using the function
...@@ -354,174 +401,18 @@ tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

get_user_pages and follow_page
==============================
get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most GUP users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice the fact the page is huge. But
if any driver is going to mangle over the page structure of the tail
page (like for checking page->mapping or other bits that are relevant
for the head page and not the tail page), it should be updated to
check the head page instead. Taking a reference on any head/tail page
would prevent the page from being split by anyone.

NOTE: these aren't new constraints to the GUP API, and they match the
same constraints that apply to hugetlbfs too, so any driver capable
of handling GUP on hugetlbfs will also work fine on transparent
hugepage backed mappings.

In case you can't handle compound pages if they're returned by
follow_page, the FOLL_SPLIT bit can be specified as a parameter to
follow_page, so that it will split the hugepages before returning
them. Migration, for example, passes FOLL_SPLIT as a parameter to
follow_page because it's not hugepage aware and in fact it can't work
at all on hugetlbfs (but it instead works fine on transparent
hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
hugepages being returned (as it's not only checking the pfn of the
page and pinning it during the copy but it pretends to migrate the
memory in regular page sizes and with regular pte/pmd mappings).
Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.
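For instance, an application could combine alignment with the madvise
hint as follows (an illustrative sketch that hard-codes the 2M size
assumed in the note earlier; portable code should read
``hpage_pmd_size`` instead)::

    #include <stdlib.h>
    #include <sys/mman.h>

    void *alloc_thp_region(size_t len)
    {
            void *p = NULL;
            const size_t hpage = 2UL * 1024 * 1024; /* assumed 2M huge page */

            /* A hugepage-aligned start allows an immediate 2M mapping. */
            if (posix_memalign(&p, hpage, len))
                    return NULL;
            madvise(p, len, MADV_HUGEPAGE);         /* hint, not a guarantee */
            return p;
    }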
Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.
Graceful fallback
=================

Code walking pagetables but unaware about huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swapout the hugepage for example. split_huge_page() can fail
if the page is pinned and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one liner
change::

    diff --git a/mm/mremap.c b/mm/mremap.c
    --- a/mm/mremap.c
    +++ b/mm/mremap.c
    @@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
                    return NULL;

            pmd = pmd_offset(pud, addr);
    +       split_huge_pmd(vma, pmd, addr);
            if (pmd_none_or_clear_bad(pmd))
                    return NULL;
Locking in hugepage aware code
==============================

We want as much code as possible hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_sem in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise you can proceed to process the huge pmd and the
hugepage natively. Once finished you can drop the page table lock.
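Schematically (a sketch of the pattern just described, not a complete
kernel function; the surrounding walk and mm variables are assumed)::

    /* Caller holds mmap_sem in read or write mode. */
    pmd = pmd_offset(pud, addr);
    if (pmd_trans_huge(*pmd)) {
            spinlock_t *ptl = pmd_lock(mm, pmd);

            /* Re-check under the page table lock. */
            if (pmd_trans_huge(*pmd)) {
                    /* ... handle the huge pmd natively ... */
                    spin_unlock(ptl);
                    return;
            }
            /* A split raced with us: fall through to the pte path. */
            spin_unlock(ptl);
    }
    /* ... old pte-based code path ... */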
Refcounts and transparent huge pages
====================================

Refcounting on THP is mostly consistent with refcounting on other compound
pages:

  - get_page()/put_page() and GUP operate on the head page's ->_refcount.

  - ->_refcount in tail pages is always zero: get_page_unless_zero() never
    succeeds on tail pages.

  - map/unmap of the pages with PTE entry increment/decrement ->_mapcount
    on the relevant sub-page of the compound page.

  - map/unmap of the whole compound page is accounted in compound_mapcount
    (stored in the first tail page). For file huge pages, we also increment
    ->_mapcount of all sub-pages in order to have race-free detection of
    the last unmap of subpages.

PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.

For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
subpages is offset up by one. This additional reference is required to
get race-free detection of unmap of subpages when we have them mapped with
both PMDs and PTEs.

This optimization is required to lower the overhead of per-subpage mapcount
tracking. The alternative is to alter ->_mapcount in all subpages on each
map/unmap of the whole compound page.

For anonymous pages, we set PG_double_map when a PMD of the page got split
for the first time, but still have the PMD mapping. The additional references
go away with the last compound_mapcount.

File pages get PG_double_map set on the first map of the page with PTE, and
it goes away when the page gets evicted from the page cache.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries. But we don't have enough information on how to distribute any
additional pins (i.e. from get_user_pages). split_huge_page() fails any
requests to split a pinned huge page: it expects the page count to be equal
to the sum of mapcount of all sub-pages plus one (split_huge_page caller
must have a reference for the head page).

split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate way
a scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents the
scanner from getting a reference to the tail page up to that point. After the
atomic_add() we don't care about the ->_refcount value. We already know how
many references should be uncharged from the head page.

For the head page get_page_unless_zero() will succeed and we don't mind. It's
clear where the reference should go after split: it will stay on the head
page.

Note that split_huge_pmd() doesn't have any limitation on refcounting:
pmd can be split at any point and never fails.
Partial unmap and deferred_split_huge_page()
============================================

Unmapping part of THP (with munmap() or other way) is not going to free
memory immediately. Instead, we detect that a subpage of THP is not in use
in page_remove_rmap() and queue the THP for splitting if memory pressure
comes. Splitting will free up unused subpages.

Splitting the page right away is not an option due to locking context in
the place where we can detect partial unmap. It might also be
counterproductive since in many cases partial unmap happens during exit(2)
if a THP crosses a VMA boundary.

The function deferred_split_huge_page() is used to queue a page for
splitting. The splitting itself will happen when we get memory pressure
via the shrinker interface.
.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========
Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
...@@ -9,7 +14,8 @@ memory page faults, something otherwise only the kernel code could do.

For example userfaults allows a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

Design
======
Userfaults are delivered and resolved through the userfaultfd syscall.
...@@ -41,7 +47,8 @@ different processes without them being aware about what is going on
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).

API
===
When first opened the userfaultfd must be enabled invoking the
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
...@@ -101,7 +108,8 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
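A minimal handshake, sketched here for illustration only (error paths
shortened, no features requested), looks like::

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            struct uffdio_api api;
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            if (uffd < 0) {
                    perror("userfaultfd");
                    return 1;
            }
            memset(&api, 0, sizeof(api));
            api.api = UFFD_API;     /* no extra features requested */
            if (ioctl(uffd, UFFDIO_API, &api)) {
                    perror("UFFDIO_API");
                    return 1;
            }
            printf("supported ioctls bitmask: 0x%llx\n",
                   (unsigned long long)api.ioctls);
            return 0;
    }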
QEMU/KVM
========
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
...@@ -163,7 +171,8 @@ sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).

Non-cooperative userfaultfd
===========================
When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
...@@ -172,27 +181,30 @@ the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to UFFDIO_API ioctl:
UFFD_FEATURE_EVENT_FORK
	enable userfaultfd hooks for fork(). When this feature is
	enabled, the userfaultfd context of the parent process is
	duplicated into the newly created process. The manager
	receives UFFD_EVENT_FORK with file descriptor of the new
	userfaultfd context in the uffd_msg.fork.

UFFD_FEATURE_EVENT_REMAP
	enable notifications about mremap() calls. When the
	non-cooperative process moves a virtual memory area to a
	different location, the manager will receive
	UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
	new addresses of the area and its original length.

UFFD_FEATURE_EVENT_REMOVE
	enable notifications about madvise(MADV_REMOVE) and
	madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
	be generated upon these calls to madvise. The uffd_msg.remove
	will contain start and end addresses of the removed area.

UFFD_FEATURE_EVENT_UNMAP
	enable notifications about memory unmapping. The manager will
	get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
	end addresses of the unmapped area.
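A manager opts into these events during the UFFDIO_API handshake. As a
minimal sketch (the feature selection here is illustrative, not exhaustive;
a real manager would also check which bits the kernel actually granted)::

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>

	/* Request non-cooperative events during the API handshake.
	 * 'uffd' is an already-opened userfaultfd descriptor. */
	static int enable_noncooperative_events(int uffd)
	{
		struct uffdio_api api = {
			.api = UFFD_API,
			.features = UFFD_FEATURE_EVENT_FORK |
				    UFFD_FEATURE_EVENT_REMAP |
				    UFFD_FEATURE_EVENT_REMOVE |
				    UFFD_FEATURE_EVENT_UNMAP,
		};

		if (ioctl(uffd, UFFDIO_API, &api))
			return -1;
		/* api.features now reports what the kernel supports. */
		return 0;
	}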
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are pretty similar, they differ quite a bit in the action expected from the
......
...@@ -61,7 +61,7 @@ Setting the ramoops parameters can be done in several different manners:
   mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1

B. Use Device Tree bindings, as described in
   ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
   For example::

	reserved-memory {
......
...@@ -302,19 +302,15 @@ Berlin family (Multimedia Solutions)
  88DE3010, Armada 1000 (no Linux support)
	Core:		Marvell PJ1 (ARMv5TE), Dual-core
	Product Brief:	http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
  88DE3005, Armada 1500 Mini
	Design name:	BG2CD
	Core:		ARM Cortex-A9, PL310 L2CC
  88DE3006, Armada 1500 Mini Plus
	Design name:	BG2CDP
	Core:		Dual Core ARM Cortex-A7
  88DE3100, Armada 1500
	Design name:	BG2
	Core:		Marvell PJ4B-MP (ARMv7), Tauros3 L2CC
  88DE3114, Armada 1500 Pro
	Design name:	BG2Q
	Core:		Quad Core ARM Cortex-A9, PL310 L2CC
...@@ -324,13 +320,16 @@ Berlin family (Multimedia Solutions)
  88DE3218, ARMADA 1500 Ultra
	Core:		ARM Cortex-A53

  Homepage: https://www.synaptics.com/products/multimedia-solutions
  Directory: arch/arm/mach-berlin

  Comments:

   * This line of SoCs is based on Marvell Sheeva or ARM Cortex CPUs
     with Synopsys DesignWare (IRQ, GPIO, Timers, ...) and PXA IP (SDHCI, USB, ETH, ...).

   * The Berlin family was acquired by Synaptics from Marvell in 2017.

CPU Cores
---------
......
Embedded device command line partition parsing
=====================================================================

The "blkdevparts" command line option adds support for reading the
block device partition table from the kernel command line.
It is typically used for fixed block (eMMC) embedded devices.
Such a device has no MBR, which saves storage space, and the bootloader
can easily access data on the block device by absolute address.

...@@ -14,22 +16,27 @@ blkdevparts=<blkdev-def>[;<blkdev-def>]
    <partdef> := <size>[@<offset>](part-name)

<blkdev-id>
    block device disk name. Embedded device uses fixed block device.
    Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.

<size>
    partition size, in bytes, such as: 512, 1m, 1G.
    size may contain an optional suffix of (upper or lower case):
    K, M, G, T, P, E.
    "-" is used to denote all remaining space.

<offset>
    partition start address, in bytes.
    offset may contain an optional suffix of (upper or lower case):
    K, M, G, T, P, E.

(part-name)
    partition name. Kernel sends uevent with "PARTNAME". Application can
    create a link to block device partition with the name "PARTNAME".
    User space application can access partition by partition name.

Example:
    eMMC disk names are "mmcblk0" and "mmcblk0boot0".

  bootargs:
    'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
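Reading that bootargs line as a worked example: mmcblk0 is split into
data0 (1 GiB at the start of the device), data1 (the next 1 GiB), and an
unnamed partition taking all remaining space; mmcblk0boot0 gets a 1 MiB
"boot" partition followed by a "kernel" partition covering the rest.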
......
=================================
GFP masks used from FS/IO context
=================================
:Date: May, 2018
:Author: Michal Hocko <mhocko@kernel.org>
Introduction
============
Code paths in the filesystem and IO stacks must be careful when
allocating memory to prevent recursion deadlocks caused by direct
memory reclaim calling back into the FS or IO paths and blocking on
already held resources (e.g. locks - most commonly those used for the
transaction context).
The traditional way to avoid this deadlock problem is to clear __GFP_FS
or __GFP_IO (note that clearing the latter implies clearing the former as
well) in the gfp mask when calling an allocator. GFP_NOFS and GFP_NOIO can
be used as shortcuts. It turned out, though, that the above approach has
led to abuses: the restricted gfp mask is used "just in case", without
deeper consideration, which leads to problems, because excessive use
of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
reclaim issues.
New API
========
Since 4.12 we do have a generic scope API for both NOFS and NOIO contexts:
``memalloc_nofs_save``/``memalloc_nofs_restore`` and ``memalloc_noio_save``/
``memalloc_noio_restore``, which allow marking a scope as a critical
section from a filesystem or I/O point of view. Any allocation from that
scope will inherently drop __GFP_FS (respectively __GFP_IO) from the given
mask, so no memory allocation can recurse back into the FS/IO layer.
.. kernel-doc:: include/linux/sched/mm.h
:functions: memalloc_nofs_save memalloc_nofs_restore
.. kernel-doc:: include/linux/sched/mm.h
:functions: memalloc_noio_save memalloc_noio_restore
FS/IO code then simply calls the appropriate save function before
any critical section with respect to reclaim is started - e.g. before
taking a lock shared with the reclaim context, or before a transaction
context nesting would be possible via reclaim. The restore function should
be called when the critical section ends. Ideally, all that comes along
with an explanation of what the reclaim context is, for easier maintenance.
Please note that the proper pairing of save/restore functions
allows nesting, so it is safe to call ``memalloc_noio_save`` or
``memalloc_noio_restore`` respectively from within an existing NOIO or
NOFS scope.
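As a minimal sketch of the pattern (the ``my_fs`` structure and its
``journal_lock`` are hypothetical, purely for illustration)::

	#include <linux/mutex.h>
	#include <linux/sched/mm.h>
	#include <linux/slab.h>

	/* Hypothetical context; the lock is also taken from reclaim. */
	struct my_fs {
		struct mutex journal_lock;
	};

	static void my_fs_transaction(struct my_fs *fs)
	{
		unsigned int flags;
		void *buf;

		mutex_lock(&fs->journal_lock);
		flags = memalloc_nofs_save();

		/*
		 * GFP_KERNEL is fine here: the scope implicitly drops
		 * __GFP_FS, so direct reclaim cannot recurse into the FS
		 * and block on journal_lock.
		 */
		buf = kmalloc(128, GFP_KERNEL);
		/* ... use buf for the transaction ... */
		kfree(buf);

		memalloc_nofs_restore(flags);
		mutex_unlock(&fs->journal_lock);
	}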
What about __vmalloc(GFP_NOFS)
==============================
vmalloc doesn't support the GFP_NOFS semantics because there are hardcoded
GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
almost always a bug. The good news is that the NOFS/NOIO semantics can be
achieved by the scope API.
In an ideal world, upper layers would already mark the dangerous contexts,
so no special care would be required and ``vmalloc`` could be called without
any problems. Sometimes, if the context is not really clear or there are
layering violations, the recommended way around that is to wrap ``vmalloc``
with the scope API, with a comment explaining the problem.
...@@ -14,6 +14,7 @@ Core utilities
   kernel-api
   assoc_array
   atomic_ops
   cachetlb
   refcount-vs-atomic
   cpu_hotplug
   idr
...@@ -25,6 +26,8 @@ Core utilities
   genalloc
   errseq
   printk-formats
   circular-buffers
   gfp_mask-from-fs-io

Interfaces for kernel debugging
===============================
......
...@@ -39,17 +39,17 @@ String Manipulation
.. kernel-doc:: lib/string.c
   :export:

Bit Operations
--------------

.. kernel-doc:: arch/x86/include/asm/bitops.h
   :internal:

Basic Kernel Library Functions
==============================

The Linux kernel provides more basic utility functions.

Bitmap Operations
-----------------
...@@ -80,6 +80,31 @@ Command-line Parsing
.. kernel-doc:: lib/cmdline.c
   :export:

Sorting
-------

.. kernel-doc:: lib/sort.c
   :export:

.. kernel-doc:: lib/list_sort.c
   :export:

Text Searching
--------------

.. kernel-doc:: lib/textsearch.c
   :doc: ts_intro

.. kernel-doc:: lib/textsearch.c
   :export:

.. kernel-doc:: include/linux/textsearch.h
   :functions: textsearch_find textsearch_next \
               textsearch_get_pattern textsearch_get_pattern_len

CRC and Math Functions in Linux
===============================

CRC Functions
-------------
...@@ -103,9 +128,6 @@ CRC Functions
.. kernel-doc:: lib/crc-itu-t.c
   :export:

Base 2 log and power Functions
------------------------------
...@@ -127,28 +149,6 @@ Division Functions
.. kernel-doc:: lib/gcd.c
   :export:

UUID/GUID
---------
......
...@@ -17,7 +17,7 @@ in order to help maintainers validate their code against the change in
these memory ordering guarantees.

The terms used throughout this document try to follow the formal LKMM defined in
tools/memory-model/Documentation/explanation.txt.

memory-barriers.txt and atomic_t.txt provide more background to the
memory ordering in general and for atomic operations specifically.
......
...@@ -20,5 +20,6 @@ for cryptographic use cases, as well as programming examples.
   architecture
   devel-algos
   userspace-if
   crypto_engine
   api
   api-samples
...@@ -120,7 +120,7 @@ A typical out of bounds access report looks like this::
The header of the report describes what kind of bug happened and what kind of
access caused it. It's followed by the description of the accessed slub object
(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and
the description of the accessed memory page.

In the last section the report shows memory state around the accessed address.
......
...@@ -151,6 +151,11 @@ Contributing new tests (details)
   TEST_FILES, TEST_GEN_FILES mean it is the file which is used by
   test.

 * First use the headers inside the kernel source and/or git repo, and then the
   system headers. Headers for the kernel release, as opposed to headers
   installed by the distro on the system, should be the primary focus, to be
   able to find regressions.

Test Harness
============
......
...@@ -40,4 +40,4 @@ API
---

.. kernel-doc:: drivers/base/devcon.c
   :functions: device_connection_find_match device_connection_find device_connection_add device_connection_remove
...@@ -44,7 +44,7 @@ common to each controller of that type:

 - methods to establish GPIO line direction
 - methods used to access GPIO line values
 - method to set electrical configuration for a given GPIO line
 - method to return the IRQ number associated to a given GPIO line
 - flag saying whether calls to its methods may sleep
 - optional line names array to identify lines
...@@ -143,7 +143,7 @@ resistor will make the line tend to high level unless one of the transistors on
the rail actively pulls it down.

The level on the line will go as high as the VDD on the pull-up resistor, which
may be higher than the level supported by the transistor, achieving a
level-shift to the higher VDD.

Integrated electronics often have an output driver stage in the form of a CMOS
...@@ -382,7 +382,7 @@ Real-Time compliance for GPIO IRQ chips

Any provider of irqchips needs to be carefully tailored to support Real Time
preemption. It is desirable that all irqchips in the GPIO subsystem keep this
in mind and do the proper testing to assure they are real time-enabled.
So please pay attention to the "RT_FULL:" notes above.

The following is a checklist to follow when preparing a driver for real
time-compliance:
......
...@@ -17,7 +17,9 @@ available subsections can be seen below.
   basics
   infrastructure
   pm/index
   clk
   device-io
   device_connection
   dma-buf
   device_link
   message-based
......
...@@ -711,7 +711,8 @@ The vmbus device regions are mapped into uio device resources:
If a subchannel is created by a request to host, then the uio_hv_generic
device driver will create a sysfs binary file for the per-channel ring buffer.
For example::

	/sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring

Further information
......
#
# Feature name: cBPF-JIT
# Kconfig: HAVE_CBPF_JIT
# description: arch supports cBPF JIT optimizations
#
    -----------------------
    | arch |status|
...@@ -16,14 +16,16 @@
    | ia64: | TODO |
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | ok |
    | um: | TODO |
    | unicore32: | TODO |
    | x86: | TODO |
......
#
# Feature name: eBPF-JIT
# Kconfig: HAVE_EBPF_JIT
# description: arch supports eBPF JIT optimizations
#
    -----------------------
    | arch |status|
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | ok |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | ok |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | ok |
    | nios2: | ok |
    | openrisc: | ok |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | ok |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,15 +17,17 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
    | um: | TODO |
    | unicore32: | TODO |
    | x86: | ok |
    | xtensa: | ok |
    -----------------------
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | ok |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | ok |
    | sparc: | TODO |
......
...@@ -11,16 +11,18 @@
    | arm: | ok |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | ok |
    | hexagon: | ok |
    | ia64: | TODO |
    | m68k: | TODO |
    | microblaze: | ok |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | ok |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | ok |
    | arm: | ok |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | TODO |
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | ok |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | ok |
    | arm: | ok |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | TODO |
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | ok |
    | sparc: | TODO |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | TODO |
    | arm: | ok |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | TODO |
...@@ -17,13 +17,15 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | ok |
    | um: | TODO |
    | unicore32: | TODO |
    | x86: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,11 +17,13 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | ok |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | TODO |
    | um: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | ok |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | TODO |
    | arm: | TODO |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | TODO |
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | ok |
    | mips: | ok |
    | nds32: | ok |
    | nios2: | TODO |
    | openrisc: | ok |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -9,21 +9,23 @@
    | alpha: | TODO |
    | arc: | TODO |
    | arm: | TODO |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | TODO |
    | ia64: | TODO |
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | ok |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | ok |
    | um: | TODO |
    | unicore32: | TODO |
    | x86: | ok |
......
...@@ -16,14 +16,16 @@
    | ia64: | TODO |
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | ok |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | ok |
    | um: | TODO |
    | unicore32: | TODO |
    | x86: | ok |
......
#
# Feature name: rwsem-optimized
# Kconfig: !RWSEM_GENERIC_SPINLOCK
# description: arch provides optimized rwsem APIs
#
    -----------------------
...@@ -8,8 +8,8 @@
    -----------------------
    | alpha: | ok |
    | arc: | TODO |
    | arm: | ok |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | TODO |
...@@ -17,14 +17,16 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
    | um: | ok |
    | unicore32: | TODO |
    | x86: | ok |
    | xtensa: | ok |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | TODO |
    | arm: | ok |
    | arm64: | ok |
    | c6x: | TODO |
    | h8300: | TODO |
    | hexagon: | ok |
...@@ -17,13 +17,15 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | ok |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
    | um: | TODO |
    | unicore32: | TODO |
    | x86: | ok |
......
...@@ -17,11 +17,13 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | TODO |
    | um: | TODO |
......
...@@ -17,11 +17,13 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | TODO |
    | um: | TODO |
......
...@@ -40,10 +40,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | .. |
    | arm: | .. |
    | arm64: | ok |
    | c6x: | .. |
    | h8300: | .. |
    | hexagon: | .. |
...@@ -17,11 +17,13 @@
    | m68k: | .. |
    | microblaze: | .. |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | .. |
    | openrisc: | .. |
    | parisc: | .. |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | .. |
    | sparc: | TODO |
    | um: | .. |
......
#
# Small script that refreshes the kernel feature support status in place.
#
for F_FILE in Documentation/features/*/*/arch-support.txt; do
F=$(grep "^# Kconfig:" "$F_FILE" | cut -c26-)
#
# Each feature F is identified by a pair (O, K), where 'O' can
# be either the empty string (for 'nop') or "not" (the logical
# negation operator '!'); other operators are not supported.
#
O=""
K=$F
if [[ "$F" == !* ]]; then
O="not"
K=$(echo $F | sed -e 's/^!//g')
fi
#
# F := (O, K) is 'valid' iff there is a Kconfig file (for some
# arch) which contains K.
#
# Notice that this definition entails an 'asymmetry' between
# the case 'O = ""' and the case 'O = "not"'. E.g., F may be
# _invalid_ if:
#
# [case 'O = ""']
# 1) no arch provides support for F,
# 2) K does not exist (e.g., it was renamed/mis-typed);
#
# [case 'O = "not"']
# 3) all archs provide support for F,
# 4) as in (2).
#
# The rationale for adopting this definition (and, thus, for
# keeping the asymmetry) is:
#
# We want to be able to 'detect' (2) (or (4)).
#
# (1) and (3) may further warn the developers about the fact
# that K can be removed.
#
F_VALID="false"
for ARCH_DIR in arch/*/; do
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
K_GREP=$(grep "$K" $K_FILES)
if [ ! -z "$K_GREP" ]; then
F_VALID="true"
break
fi
done
if [ "$F_VALID" = "false" ]; then
printf "WARNING: '%s' is not a valid Kconfig\n" "$F"
fi
T_FILE="$F_FILE.tmp"
grep "^#" $F_FILE > $T_FILE
echo " -----------------------" >> $T_FILE
echo " | arch |status|" >> $T_FILE
echo " -----------------------" >> $T_FILE
for ARCH_DIR in arch/*/; do
ARCH=$(echo $ARCH_DIR | sed -e 's/arch//g' | sed -e 's/\///g')
K_FILES=$(find $ARCH_DIR -name "Kconfig*")
K_GREP=$(grep "$K" $K_FILES)
#
# Arch support status values for (O, K) are updated according
# to the following rules.
#
# - ("", K) is 'supported by a given arch', if there is a
# Kconfig file for that arch which contains K;
#
# - ("not", K) is 'supported by a given arch', if there is
# no Kconfig file for that arch which contains K;
#
# - otherwise: preserve the previous status value (if any),
# default to 'not yet supported'.
#
# Notice that, according these rules, invalid features may be
# updated/modified.
#
if [ "$O" = "" ] && [ ! -z "$K_GREP" ]; then
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
elif [ "$O" = "not" ] && [ -z "$K_GREP" ]; then
printf " |%12s: | ok |\n" "$ARCH" >> $T_FILE
else
S=$(grep -v "^#" "$F_FILE" | grep " $ARCH:")
if [ ! -z "$S" ]; then
echo "$S" >> $T_FILE
else
printf " |%12s: | TODO |\n" "$ARCH" \
>> $T_FILE
fi
fi
done
echo " -----------------------" >> $T_FILE
mv $T_FILE $F_FILE
done
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,12 +17,14 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | ok |
    | sparc: | TODO |
    | um: | TODO |
    | unicore32: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | ok |
    | microblaze: | ok |
    | mips: | ok |
    | nds32: | ok |
    | nios2: | ok |
    | openrisc: | ok |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | ok |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | .. |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | .. |
    | sh: | TODO |
    | sparc: | .. |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | ok |
    | mips: | ok |
    | nds32: | ok |
    | nios2: | ok |
    | openrisc: | ok |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | ok |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | ok |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | .. |
    | microblaze: | .. |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | .. |
    | openrisc: | .. |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | .. |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | .. |
    | microblaze: | .. |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | .. |
    | openrisc: | .. |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | TODO |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | TODO |
    | sparc: | TODO |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | TODO |
    | sh: | ok |
    | sparc: | TODO |
......
...@@ -9,7 +9,7 @@
    | alpha: | TODO |
    | arc: | .. |
    | arm: | .. |
    | arm64: | ok |
    | c6x: | .. |
    | h8300: | .. |
    | hexagon: | .. |
...@@ -17,10 +17,12 @@
    | m68k: | .. |
    | microblaze: | ok |
    | mips: | ok |
    | nds32: | TODO |
    | nios2: | .. |
    | openrisc: | .. |
    | parisc: | .. |
    | powerpc: | ok |
    | riscv: | ok |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -17,10 +17,12 @@
    | m68k: | TODO |
    | microblaze: | TODO |
    | mips: | TODO |
    | nds32: | TODO |
    | nios2: | TODO |
    | openrisc: | TODO |
    | parisc: | TODO |
    | powerpc: | ok |
    | riscv: | TODO |
    | s390: | ok |
    | sh: | ok |
    | sparc: | ok |
......
...@@ -515,7 +515,8 @@ guarantees:

The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
bits on both physical and virtual pages associated with a process, and the
soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
for details).

To clear the bits for all the pages associated with the process
    > echo 1 > /proc/PID/clear_refs
...@@ -536,7 +537,8 @@ Any other value written to /proc/PID/clear_refs will have no effect.

The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
using /proc/kpageflags and number of times a page is mapped using
/proc/kpagecount. For detailed explanation, see
Documentation/admin-guide/mm/pagemap.rst.
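As a rough illustration of that interface, here is a user-space sketch that
looks up the pagemap entry covering one virtual address (the bit layout used
below is the documented 64-bit format: bits 0-54 hold the PFN when bit 63,
"present", is set)::

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	/* Read the 64-bit pagemap entry for 'vaddr' in this process. */
	static int pfn_of(uintptr_t vaddr, uint64_t *pfn)
	{
		long psize = sysconf(_SC_PAGESIZE);
		off_t off = (off_t)(vaddr / psize) * sizeof(uint64_t);
		uint64_t entry;
		int fd = open("/proc/self/pagemap", O_RDONLY);

		if (fd < 0)
			return -1;
		if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
			close(fd);
			return -1;
		}
		close(fd);
		if (!(entry >> 63))		/* bit 63: page present */
			return -1;
		*pfn = entry & ((1ULL << 55) - 1);	/* bits 0-54: PFN */
		return 0;
	}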
The /proc/pid/numa_maps is an extension based on maps, showing the memory
locality and binding policy, as well as the memory usage (in pages) of
...@@ -564,7 +566,7 @@ address policy mapping details

Where:

"address" is the starting address for the mapping;
"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst);
"mapping details" summarizes mapping data such as mapping type, page usage counters,
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
......
...@@ -105,8 +105,9 @@ policy for the file will revert to "default" policy.

NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
when tmpfs is mounted by appending them to the mode before the NodeList.
See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
all available memory allocation policy mode flags and their effect on
memory policy.

	=static		is equivalent to	MPOL_F_STATIC_NODES
	=relative	is equivalent to	MPOL_F_RELATIVE_NODES
......
...@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.

.. toctree::
   :maxdepth: 2

   userspace-api/index

Introduction to kernel development
...@@ -89,6 +89,7 @@ needed).

   sound/index
   crypto/index
   filesystems/index
   vm/index

Architecture-specific documentation
-----------------------------------
......
...@@ -73,7 +73,9 @@ will have a second iteration or at least an extension for any given interface.
  future extensions is going right down the gutters since someone will submit
  an ioctl struct with random stack garbage in the yet unused parts. Which
  then bakes in the ABI that those fields can never be used for anything else
  but garbage. This is also the reason why you must explicitly pad all
  structures, even if you never use them in an array - the padding the compiler
  might insert could contain garbage. A sketch of what such explicit padding
  looks like follows this list.

* Have simple testcases for all of the above.
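As a minimal sketch of the padding advice above (the struct and field names
are made up for illustration)::

	#include <linux/types.h>

	/*
	 * Hypothetical ioctl argument. Every field is sized explicitly
	 * and the struct is padded to a multiple of 64 bits, so the
	 * compiler has no room to insert hidden padding that would leak
	 * stack garbage across the ABI.
	 */
	struct foo_ioctl_args {
		__u64 handle;
		__u32 flags;	/* unused bits must be zero */
		__u32 pad;	/* explicit padding, must be zero */
	};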
......
...@@ -2903,7 +2903,7 @@ is discarded from the CPU's cache and reloaded. To deal with this, the
appropriate part of the kernel must invalidate the overlapping bits of the
cache on each CPU.

See Documentation/core-api/cachetlb.rst for more information on cache management.

CACHE COHERENCY VS MMIO
...@@ -3083,7 +3083,7 @@ CIRCULAR BUFFERS

Memory barriers can be used to implement circular buffering without the need
of a lock to serialise the producer with the consumer. See:

	Documentation/core-api/circular-buffers.rst

for details.
......
...@@ -18,17 +18,17 @@ major kernel release happening every two or three months. The recent
release history looks like this:

	======  =================
	4.11	April 30, 2017
	4.12	July 2, 2017
	4.13	September 3, 2017
	4.14	November 12, 2017
	4.15	January 28, 2018
	4.16	April 1, 2018
	======  =================

Every 4.x release is a major kernel release with new features, internal
API changes, and more. A typical 4.x release contains about 13,000
changesets with changes to several hundred thousand lines of code. 4.x is
thus the leading edge of Linux kernel development; the kernel uses a
rolling development model which is continually integrating major changes.
...@@ -70,20 +70,19 @@ will get up to somewhere between -rc6 and -rc9 before the kernel is
considered to be sufficiently stable and the final 2.6.x release is made.
At that point the whole process starts over again.

As an example, here is how the 4.16 development cycle went (all dates in
2018):

	==============  ===============================
	January 28	4.15 stable release
	February 11	4.16-rc1, merge window closes
	February 18	4.16-rc2
	February 25	4.16-rc3
	March 4		4.16-rc4
	March 11	4.16-rc5
	March 18	4.16-rc6
	March 25	4.16-rc7
	April 1		4.16 stable release
	==============  ===============================

How do the developers decide when to close the development cycle and create
...@@ -99,37 +98,42 @@ release is made. In the real world, this kind of perfection is hard to
achieve; there are just too many variables in a project of this size.
There comes a point where delaying the final release just makes the problem
worse; the pile of changes waiting for the next merge window will grow
larger, creating even more regressions the next time around. So most 4.x
kernels go out with a handful of known regressions though, hopefully, none
of them are serious.

Once a stable release is made, its ongoing maintenance is passed off to the
"stable team," currently consisting of Greg Kroah-Hartman. The stable team
will release occasional updates to the stable release using the 4.x.y
numbering scheme. To be considered for an update release, a patch must (1)
fix a significant bug, and (2) already be merged into the mainline for the
next development kernel. Kernels will typically receive stable updates for
a little more than one development cycle past their initial release. So,
for example, the 4.13 kernel's history looked like:

	==============  ===============================
	September 3	4.13 stable release
	September 13	4.13.1
	September 20	4.13.2
	September 27	4.13.3
	October 5	4.13.4
	October 12	4.13.5
	...
	November 24	4.13.16
	==============  ===============================

4.13.16 was the final stable update of the 4.13 release.

Some kernels are designated "long term" kernels; they will receive support
for a longer period. As of this writing, the current long term kernels
and their maintainers are:

	======  ======================  ==============================
	3.16	Ben Hutchings		(very long-term stable kernel)
	4.1	Sasha Levin
	4.4	Greg Kroah-Hartman	(very long-term stable kernel)
	4.9	Greg Kroah-Hartman
	4.14	Greg Kroah-Hartman
	======  ======================  ==============================

The selection of a kernel for long-term support is purely a matter of a
......
...@@ -10,8 +10,8 @@ of conventions and procedures which are used in the posting of patches;
following them will make life much easier for everybody involved. This
document will attempt to cover these expectations in reasonable detail;
more information can also be found in the files process/submitting-patches.rst,
process/submitting-drivers.rst, and process/submit-checklist.rst in the kernel
documentation directory.

When to post
...@@ -198,8 +198,8 @@ pass it to diff with the "-X" option.

The tags mentioned above are used to describe how various developers have
been associated with the development of this patch. They are described in
detail in the process/submitting-patches.rst document; what follows here is a
brief summary. Each of these lines has the format:

::

...@@ -210,8 +210,8 @@ The tags in common use are:

 - Signed-off-by: this is a developer's certification that he or she has
   the right to submit the patch for inclusion into the kernel. It is an
   agreement to the Developer's Certificate of Origin, the full text of
   which can be found in Documentation/process/submitting-patches.rst. Code
   without a proper signoff cannot be merged into the mainline.

 - Co-developed-by: states that the patch was also created by another developer
   along with the original author. This is useful at times when multiple
...@@ -226,8 +226,8 @@ The tags in common use are:
   it to work.

 - Reviewed-by: the named developer has reviewed the patch for correctness;
   see the reviewer's statement in Documentation/process/submitting-patches.rst
   for more detail.

 - Reported-by: names a user who reported a problem which is fixed by this
   patch; this tag is used to give credit to the (often underappreciated)
......
...@@ -52,6 +52,7 @@ lack of a better place.

   adding-syscalls
   magic-number
   volatile-considered-harmful
   clang-format

.. only:: subproject and html
......
...@@ -219,7 +219,7 @@ Our goal is to protect your master key by moving it to offline media, so

if you only have a combined **[SC]** key, then you should create a separate
signing subkey::

    $ gpg --quick-addkey [fpr] ed25519 sign

Remember to tell the keyservers about this change, so others can pull down
your new subkey::
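    # a hedged sketch (the guide's own command is elided here):
    # --send-keys publishes the updated key to your configured keyserver,
    # and [fpr] is your key fingerprint as above
    $ gpg --send-keys [fpr]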
...@@ -450,11 +450,18 @@ functionality. There are several options available:

others. If you want to use ECC keys, your best bet among commercially
available devices is the Nitrokey Start.

.. note::

    If you are listed in MAINTAINERS or have an account at kernel.org,
    you `qualify for a free Nitrokey Start`_ courtesy of The Linux
    Foundation.

.. _`Nitrokey Start`: https://shop.nitrokey.com/shop/product/nitrokey-start-6
.. _`Nitrokey Pro`: https://shop.nitrokey.com/shop/product/nitrokey-pro-3
.. _`Yubikey 4`: https://www.yubico.com/product/yubikey-4-series/
.. _Gnuk: http://www.fsij.org/doc-gnuk/
.. _`LWN has a good review`: https://lwn.net/Articles/736231/
.. _`qualify for a free Nitrokey Start`: https://www.kernel.org/nitrokey-digital-tokens-for-kernel-developers.html

Configure your smartcard device
-------------------------------
...@@ -482,7 +489,7 @@ there are no convenient command-line switches::

You should set the user PIN (1), Admin PIN (3), and the Reset Code (4).
Please make sure to record and store these in a safe place -- especially
the Admin PIN and the Reset Code (which allows you to completely wipe
the smartcard). You so rarely need to use the Admin PIN that you will
inevitably forget what it is if you do not record it.

Getting back to the main card menu, you can also set other values (such

...@@ -494,6 +501,12 @@ additionally leak information about your smartcard should you lose it.

    Despite having the name "PIN", neither the user PIN nor the admin
    PIN on the card needs to be a number.

.. warning::

    Some devices may require that you move the subkeys onto the device
    before you can change the passphrase. Please check the documentation
    provided by the device manufacturer.

Move the subkeys to your smartcard
----------------------------------
...@@ -655,6 +668,20 @@ want to import these changes back into your regular working directory::

    $ gpg --export | gpg --homedir ~/.gnupg --import
    $ unset GNUPGHOME
Using gpg-agent over ssh
~~~~~~~~~~~~~~~~~~~~~~~~

You can forward your gpg-agent over ssh if you need to sign tags or
commits on a remote system. Please refer to the instructions provided
on the GnuPG wiki:

- `Agent Forwarding over SSH`_

It works more smoothly if you can modify the sshd server settings on the
remote end.

.. _`Agent Forwarding over SSH`: https://wiki.gnupg.org/AgentForwarding
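A hedged sketch of such a setup (socket paths vary per system; query them
with ``gpgconf --list-dir agent-socket`` on the remote host and ``gpgconf
--list-dir agent-extra-socket`` locally)::

    # local ~/.ssh/config; host name and paths are examples only
    Host remote-build-box
        RemoteForward /run/user/1000/gnupg/S.gpg-agent /run/user/1000/gnupg/S.gpg-agent.extra

    # the remote sshd_config additionally needs:
    #   StreamLocalBindUnlink yes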
Using PGP with Git
==================

...@@ -692,6 +719,7 @@ should be used (``[fpr]`` is the fingerprint of your key)::

tell git to always use it instead of the legacy ``gpg`` from version 1::

    $ git config --global gpg.program gpg2
    $ git config --global gpgv.program gpgv2

How to work with signed tags
----------------------------
...@@ -731,6 +759,13 @@ If you are verifying someone else's git tag, then you will need to

import their PGP key. Please refer to the
":ref:`verify_identities`" section below.

.. note::

    If you get a "``gpg: Can't check signature: unknown pubkey
    algorithm``" error, you need to tell git to use gpgv2 for
    verification, so it properly processes signatures made by ECC keys.
    See the instructions at the start of this section.

Configure git to always sign annotated tags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
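One way to get that behaviour (a sketch; ``tag.forceSignAnnotated`` is
standard git configuration rather than something added by this patch)::

    $ git config --global tag.forceSignAnnotated true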
......
...@@ -761,7 +761,7 @@ requests, especially from new, unknown developers. If in doubt you can use

the pull request as the cover letter for a normal posting of the patch
series, giving the maintainer the option of using either.

A pull request should have [GIT PULL] in the subject line. The
request itself should include the repository name and the branch of
interest on a single line; it should look something like::
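    git://git.example.org/linux-foo.git for-linus

(The repository URL and branch name above are hypothetical, purely for
illustration.)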
......
...@@ -9,5 +9,7 @@ Security Documentation

   IMA-templates
   keys/index
   LSM
   LSM-sctp
   SELinux-sctp
   self-protection
   tpm/index
...@@ -1062,7 +1062,7 @@ output (with ``--no-upload`` option) to kernel bugzilla or alsa-devel

ML (see the section `Links and Addresses`_).

``power_save`` and ``power_save_controller`` options are for power-saving
mode. See powersave.rst for details.

Note 2: If you get click noises on output, try the module option
``position_fix=1`` or ``2``. ``position_fix=1`` will use the SD_LPIB

...@@ -1133,7 +1133,7 @@ line_outs_monitor

enable_monitor
    Enable Analog Out on Channel 63/64 by default.

    See hdspm.rst for details.

Module snd-ice1712
------------------
......
...@@ -139,7 +139,7 @@ DAPM description
----------------

The Dynamic Audio Power Management description describes the codec power
components and their relationships and registers to the ASoC core.
Please read dapm.rst for details of building the description.

Please also see the examples in other codec drivers.
......
...@@ -66,7 +66,7 @@ Each SoC DAI driver must provide the following features:-

4. SYSCLK configuration
5. Suspend and resume (optional)

Please see codec.rst for a description of items 1 - 4.

SoC DSP Drivers
......
...@@ -515,7 +515,7 @@ nr_hugepages

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst

==============================================================
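As a quick illustration of the nr_hugepages knob described above (the value
20 is arbitrary; this is the standard procfs interface rather than anything
introduced by this patch)::

    # echo 20 > /proc/sys/vm/nr_hugepages
    # cat /proc/sys/vm/nr_hugepages
    20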
...@@ -524,7 +524,7 @@ nr_overcommit_hugepages

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst

==============================================================

...@@ -667,7 +667,7 @@ and don't use much of it.

The default value is 0.

See Documentation/vm/overcommit-accounting.rst and
mm/mmap.c::__vm_enough_memory() for more information.

==============================================================
......
...@@ -187,13 +187,19 @@ that can be performed on them (see "struct coresight_ops"). The

specific to that component only. "Implementation defined" customisations are
expected to be accessed and controlled using those entries.

Last but not least, "struct module *owner" is expected to be set to reflect
the information carried in "THIS_MODULE".

How to use the tracer modules
-----------------------------

There are two ways to use the Coresight framework: 1) using the perf command
line tools and 2) interacting directly with the Coresight devices using the
sysFS interface. Preference is given to the former as using the sysFS
interface requires a deep understanding of the Coresight HW. The following
sections provide details on using both methods.

1) Using the sysFS interface:

Before trace collection can start, a coresight sink needs to be identified.
There is no limit on the number of sinks (nor sources) that can be enabled at
any given moment. As a generic operation, all devices pertaining to the sink
class will have an "active" entry in sysfs:
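For instance, with the hypothetical ETF sink that appears in the examples
later in this document, activating it could look like::

    root@linaro-nano:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink

(Device names are platform-specific; listing /sys/bus/coresight/devices/
shows what a given board provides.)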
...@@ -298,42 +304,48 @@ Instruction 13570831 0x8026B584 E28DD00C false ADD

Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
Timestamp Timestamp: 17107041535

2) Using perf framework:

Coresight tracers are represented using the Perf framework's Performance
Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of
controlling when tracing gets enabled based on when the process of interest is
scheduled. When configured in a system, Coresight PMUs will be listed when
queried by the perf command line tool:

    linaro@linaro-nano:~$ ./perf list pmu

    List of pre-defined events (to be used in -e):

    cs_etm//                                    [Kernel PMU event]

    linaro@linaro-nano:~$

Regardless of the number of tracers available in a system (usually equal to
the number of processor cores), the "cs_etm" PMU will be listed only once.

A Coresight PMU works the same way as any other PMU, i.e. the name of the PMU
is listed along with configuration options within forward slashes '/'. Since a
Coresight system will typically have more than one sink, the name of the sink
to work with needs to be specified as an event option. The names of the sinks
to choose from are listed in sysFS under ($SYSFS)/bus/coresight/devices:

    root@linaro-nano:~# ls /sys/bus/coresight/devices/
    20010000.etf  20040000.funnel      20100000.stm   22040000.etm
    22140000.etm  230c0000.funnel      23240000.etm   20030000.tpiu
    20070000.etr  20120000.replicator  220c0000.funnel
    23040000.etm  23140000.etm         23340000.etm

    root@linaro-nano:~# perf record -e cs_etm/@20070000.etr/u --per-thread program

The syntax within the forward slashes '/' is important. The '@' character
tells the parser that a sink is about to be specified and that this is the
sink to use for the trace session.

More information on the above and other examples of how to use Coresight
with the perf tools can be found in the "HOWTO.md" file of the openCSD
gitHub repository [3].

2.1) AutoFDO analysis using the perf tools:

perf can be used to record and analyze trace of programs.
...@@ -381,3 +393,38 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial)

    $ taskset -c 2 ./sort_autofdo
    Bubble sorting array of 30000 elements
    5806 ms
How to use the STM module
-------------------------

Using the System Trace Macrocell module is the same as the tracers - the only
difference is that clients are driving the trace capture rather
than the program flow through the code.

As with any other CoreSight component, specifics about the STM tracer can be
found in sysfs with more information on each entry being found in [1]:

    root@genericarmv8:~# ls /sys/bus/coresight/devices/20100000.stm
    enable_source   hwevent_select  port_enable  subsystem  uevent
    hwevent_enable  mgmt            port_select  traceid
    root@genericarmv8:~#

Like any other source a sink needs to be identified and the STM enabled before
being used:

    root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20010000.etf/enable_sink
    root@genericarmv8:~# echo 1 > /sys/bus/coresight/devices/20100000.stm/enable_source

From there user space applications can request and use channels using the devfs
interface provided for that purpose by the generic STM API:

    root@genericarmv8:~# ls -l /dev/20100000.stm
    crw------- 1 root root 10, 61 Jan 3 18:11 /dev/20100000.stm
    root@genericarmv8:~#

Details on how to use the generic STM API can be found here [2].

[1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
[2]. Documentation/trace/stm.txt
[3]. https://github.com/Linaro/perf-opencsd
...@@ -12,7 +12,7 @@ Written for: 4.14

Introduction
============

The ftrace infrastructure was originally created to attach callbacks to the
beginning of functions in order to record and trace the flow of the kernel.
But callbacks to the start of a function can have other use cases. Either
for live kernel patching, or for security monitoring. This document describes

...@@ -30,7 +30,7 @@ The ftrace context

This requires extra care to what can be done inside a callback. A callback
can be called outside the protective scope of RCU.

The ftrace infrastructure has some protections against recursions and RCU
but one must still be very careful how they use the callbacks.
......
...@@ -224,6 +224,8 @@ of ftrace. Here is a list of some of the key files:

        has a side effect of enabling or disabling specific functions
        to be traced. Echoing names of functions into this file
        will limit the trace to only those functions.

        This influences the tracers "function" and "function_graph"
        and thus also function profiling (see "function_profile_enabled").

        The functions listed in "available_filter_functions" are what
        can be written into this file.
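A hedged example of exercising this file (the tracefs mount point may also
be /sys/kernel/tracing on newer kernels, and the function name here is
arbitrary)::

    # cd /sys/kernel/debug/tracing
    # echo schedule_timeout > set_ftrace_filter
    # echo function > current_tracer
    # head trace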
...@@ -265,6 +267,8 @@ of ftrace. Here is a list of some of the key files:

        Functions listed in this file will cause the function graph
        tracer to only trace these functions and the functions that
        they call. (See the section "dynamic ftrace" for more details).

        Note, set_ftrace_filter and set_ftrace_notrace still affect
        what functions are being traced.

  set_graph_notrace:

...@@ -277,7 +281,8 @@ of ftrace. Here is a list of some of the key files:

        This lists the functions that ftrace has processed and can trace.
        These are the function names that you can pass to
        "set_ftrace_filter", "set_ftrace_notrace",
        "set_graph_function", or "set_graph_notrace".
        (See the section "dynamic ftrace" below for more details.)

  dyn_ftrace_total_info:
......
...@@ -2846,7 +2846,7 @@ (overwritten by dirty cache lines being written back from the CPU's cache to RAM)

To solve this problem, the appropriate part of the kernel must invalidate
the problematic bits in each CPU's cache.

For more information on cache management, please see
Documentation/core-api/cachetlb.rst.

...@@ -3023,7 +3023,7 @@ (you should use virt_mb() rather than smp_mb())

can be used to implement synchronization without using locks. For more
details, please refer to:

    Documentation/core-api/circular-buffers.rst

=========
......
...@@ -252,15 +252,14 @@ into VFIO core. When devices are bound and unbound to the driver,

the driver should call vfio_add_group_dev() and vfio_del_group_dev()
respectively::

    extern int vfio_add_group_dev(struct device *dev,
                                  const struct vfio_device_ops *ops,
                                  void *device_data);
    extern void *vfio_del_group_dev(struct device *dev);

vfio_add_group_dev() indicates to the core to begin tracking the
iommu_group of the specified dev and register the dev as owned by
a VFIO bus driver. The driver provides an ops structure for callbacks
similar to a file operations structure::
......
00-INDEX
  - this file.
active_mm.rst
  - An explanation from Linus about tsk->active_mm vs tsk->mm.
balance.rst
  - various information on memory balancing.
cleancache.rst
  - Intro to cleancache and page-granularity victim cache.
frontswap.rst
  - Outline frontswap, part of the transcendent memory frontend.
highmem.rst
  - Outline of highmem and common issues.
hmm.rst
  - Documentation of heterogeneous memory management
hugetlbfs_reserv.rst
  - A brief overview of hugetlbfs reservation design/implementation.
hwpoison.rst
  - explains what hwpoison is
ksm.rst
  - how to use the Kernel Samepage Merging feature.
mmu_notifier.rst
  - a note about clearing pte/pmd and mmu notifications
numa.rst
  - information about NUMA specific code in the Linux vm.
overcommit-accounting.rst
  - description of the Linux kernels overcommit handling modes.
page_frags.rst
  - description of page fragments allocator
page_migration.rst
  - description of page migration in NUMA systems.
page_owner.rst
  - tracking about who allocated each page
remap_file_pages.rst
  - a note about remap_file_pages() system call
slub.rst
  - a short users guide for SLUB.
split_page_table_lock.rst
  - Separate per-table lock to improve scalability of the old page_table_lock.
swap_numa.rst
  - automatic binding of swap device to numa node
transhuge.rst
  - Transparent Hugepage Support, alternative way of using hugepages.
unevictable-lru.rst
  - Unevictable LRU infrastructure
z3fold.txt
  - outline of z3fold allocator for storing compressed pages
zsmalloc.rst
  - outline of zsmalloc allocator for storing compressed pages
zswap.rst
  - Intro to compressed cache for swap pages
.. _active_mm:

=========
Active MM
=========

::
List: linux-kernel
Subject: Re: active_mm
From: Linus Torvalds <torvalds () transmeta ! com>
Date: 1999-07-30 21:36:24
Cc'd to linux-kernel, because I don't write explanations all that often,
and when I do I feel better about more people reading them.
On Fri, 30 Jul 1999, David Mosberger wrote:
>
> Is there a brief description someplace on how "mm" vs. "active_mm" in
> the task_struct are supposed to be used? (My apologies if this was
> discussed on the mailing lists---I just returned from vacation and
> wasn't able to follow linux-kernel for a while).
Basically, the new setup is:
- we have "real address spaces" and "anonymous address spaces". The
difference is that an anonymous address space doesn't care about the
user-level page tables at all, so when we do a context switch into an
anonymous address space we just leave the previous address space
active.
The obvious use for a "anonymous address space" is any thread that
doesn't need any user mappings - all kernel threads basically fall into
this category, but even "real" threads can temporarily say that for
some amount of time they are not going to be interested in user space,
and that the scheduler might as well try to avoid wasting time on
switching the VM state around. Currently only the old-style bdflush
sync does that.
- "tsk->mm" points to the "real address space". For an anonymous process,
tsk->mm will be NULL, for the logical reason that an anonymous process
really doesn't _have_ a real address space at all.
- however, we obviously need to keep track of which address space we
"stole" for such an anonymous user. For that, we have "tsk->active_mm",
which shows what the currently active address space is.
The rule is that for a process with a real address space (ie tsk->mm is
non-NULL) the active_mm obviously always has to be the same as the real
one.
For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the
"borrowed" mm while the anonymous process is running. When the
anonymous process gets scheduled away, the borrowed address space is
returned and cleared.
To support all that, the "struct mm_struct" now has two counters: a
"mm_users" counter that is how many "real address space users" there are,
and a "mm_count" counter that is the number of "lazy" users (ie anonymous
users) plus one if there are any real users.
Usually there is at least one real user, but it could be that the real
user exited on another CPU while a lazy user was still active, so you do
actually get cases where you have a address space that is _only_ used by
lazy users. That is often a short-lived state, because once that thread
gets scheduled away in favour of a real thread, the "zombie" mm gets
released because "mm_users" becomes zero.
Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
more. "init_mm" should be considered just a "lazy context when no other
context is available", and in fact it is mainly used just at bootup when
no real VM has yet been created. So code that used to check
if (current->mm == &init_mm)
should generally just do
if (!current->mm)
instead (which makes more sense anyway - the test is basically one of "do
we have a user context", and is generally done by the page fault handler
and things like that).
Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
because it slightly changes the interfaces to accommodate the alpha (who
would have thought it, but the alpha actually ends up having one of the
ugliest context switch codes - unlike the other architectures where the MM
and register state is separate, the alpha PALcode joins the two, and you
need to switch both together).
(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
.. _balance:

================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
...@@ -62,11 +68,11 @@ for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,

so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone)

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing can not be done,
probably because all allocation requests are coming from intr context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since intr context does not request

...@@ -89,7 +95,8 @@ pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.

(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages. (lkd@tantalophile.demon.co.uk)
.. _cleancache:

==========
Cleancache
==========

Motivation
==========

Cleancache is a new optional feature provided by the VFS layer that
potentially dramatically increases page cache effectiveness for

...@@ -21,9 +28,10 @@ Transcendent memory "drivers" for cleancache are currently implemented

in Xen (using hypervisor memory) and zcache (using in-kernel compressed
memory) and other implementations are in development.

:ref:`FAQs <faq>` are included below.

Implementation Overview
=======================

A cleancache "backend" that provides transcendent memory registers itself
to the kernel's cleancache "frontend" by calling cleancache_register_ops,
...@@ -80,22 +88,33 @@ different Linux threads are simultaneously putting and invalidating a page

with the same handle, the results are indeterminate. Callers must
lock the page to ensure serial behavior.

Cleancache Performance Metrics
==============================

If properly configured, monitoring of cleancache is done via debugfs in
the `/sys/kernel/debug/cleancache` directory. The effectiveness of cleancache
can be measured (across all filesystems) with:

``succ_gets``
    number of gets that were successful

``failed_gets``
    number of gets that failed

``puts``
    number of puts attempted (all "succeed")

``invalidates``
    number of invalidates attempted

A backend implementation may provide additional metrics.
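Since these are plain debugfs files, they can simply be read back, assuming
debugfs is mounted at its usual location::

    # cat /sys/kernel/debug/cleancache/succ_gets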
.. _faq:

FAQ
===

* Where's the value? (Andrew Morton)

Cleancache provides a significant performance benefit to many workloads
in many environments with negligible overhead by improving the
...@@ -137,8 +156,8 @@ device that stores pages of data in a compressed state. And

the proposed "RAMster" driver shares RAM across multiple physical
systems.

* Why does cleancache have its sticky fingers so deep inside the
  filesystems and VFS? (Andrew Morton and Christoph Hellwig)

The core hooks for cleancache in VFS are in most cases a single line
and the minimum set are placed precisely where needed to maintain

...@@ -168,9 +187,9 @@ filesystems in the future.

The total impact of the hooks to existing fs and mm files is only
about 40 lines added (not counting comments and blank lines).

* Why not make cleancache asynchronous and batched so it can more
  easily interface with real devices with DMA instead of copying each
  individual page? (Minchan Kim)

The one-page-at-a-time copy semantics simplifies the implementation
on both the frontend and backend and also allows the backend to

...@@ -182,8 +201,8 @@ are avoided. While the interface seems odd for a "real device"

or for real kernel-addressable RAM, it makes perfect sense for
transcendent memory.

* Why is non-shared cleancache "exclusive"? And where is the
  page "invalidated" after a "get"? (Minchan Kim)

The main reason is to free up space in transcendent memory and
to avoid unnecessary cleancache_invalidate calls. If you want inclusive,

...@@ -193,7 +212,7 @@ be easily extended to add a "get_no_invalidate" call.

The invalidate is done by the cleancache backend implementation.

* What's the performance impact?

Performance analysis has been presented at OLS'09 and LCA'10.
Briefly, performance gains can be significant on most workloads,

...@@ -206,7 +225,7 @@ single-core systems with slow memory-copy speeds, cleancache

has little value, but in newer multicore machines, especially
consolidated/virtualized machines, it has great value.

* How do I add cleancache support for filesystem X? (Boaz Harrash)

Filesystems that are well-behaved and conform to certain
restrictions can utilize cleancache simply by making a call to

...@@ -217,26 +236,26 @@ not enable the optional cleancache.

Some points for a filesystem to consider:

- The FS should be block-device-based (e.g. a ram-based FS such
  as tmpfs should not enable cleancache)
- To ensure coherency/correctness, the FS must ensure that all
  file removal or truncation operations either go through VFS or
  add hooks to do the equivalent cleancache "invalidate" operations
- To ensure coherency/correctness, either inode numbers must
  be unique across the lifetime of the on-disk file OR the
  FS must provide an "encode_fh" function.
- The FS must call the VFS superblock alloc and deactivate routines
  or add hooks to do the equivalent cleancache calls done there.
- To maximize performance, all pages fetched from the FS should
  go through the do_mpag_readpage routine or the FS should add
  hooks to do the equivalent (cf. btrfs)
- Currently, the FS blocksize must be the same as PAGESIZE. This
  is not an architectural restriction, but no backends currently
  support anything different.
- A clustered FS should invoke the "shared_init_fs" cleancache
  hook to get best performance for some backends.

* Why not use the KVA of the inode as the key? (Christoph Hellwig)

If cleancache would use the inode virtual address instead of
inode/filehandle, the pool id could be eliminated. But, this

...@@ -251,7 +270,7 @@ of cleancache would be lost because the cache of pages in cleanache

is potentially much larger than the kernel pagecache and is most
useful if the pages survive inode cache removal.

* Why is a global variable required?

The cleancache_enabled flag is checked in all of the frequently-used
cleancache hooks. The alternative is a function call to check a static

...@@ -262,14 +281,14 @@ global variable allows cleancache to be enabled by default at compile

time, but have insignificant performance impact when cleancache remains
disabled at runtime.

* Does cleancache work with KVM?

The memory model of KVM is sufficiently different that a cleancache
backend may have less value for KVM. This remains to be tested,
especially in an overcommitted system.

* Does cleancache work in userspace? It sounds useful for
  memory-hungry caches like web browsers. (Jamie Lokier)

No plans yet, though we agree it sounds useful, at least for
apps that bypass the page cache (e.g. O_DIRECT).
......
# -*- coding: utf-8; mode: python -*-

project = "Linux Memory Management Documentation"

tags.add("subproject")

latex_documents = [
    ('index', 'memory-management.tex', project,
     'The kernel development community', 'manual'),
]
.. _frontswap:

=========
Frontswap
=========

Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article `Transcendent memory in a nutshell`_
for a detailed overview of frontswap and related kernel parts)

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be
...@@ -50,19 +57,27 @@ or the store fails AND the page is invalidated. This ensures stale data may

never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in
the `/sys/kernel/debug/frontswap` directory. The effectiveness of
frontswap can be measured (across all swap devices) with:

``failed_stores``
    how many store attempts have failed

``loads``
    how many loads were attempted (all should succeed)

``succ_stores``
    how many store attempts have succeeded

``invalidates``
    how many invalidates were attempted

A backend implementation may provide additional metrics.
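As with cleancache, a quick way to snapshot all of these counters at once
(again assuming the default debugfs mount point)::

    # grep . /sys/kernel/debug/frontswap/*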
FAQ
===

* Where's the value?

When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by
...@@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And, ...@@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And,
using frontswap, investigation is also underway on the use of NVM as using frontswap, investigation is also underway on the use of NVM as
a memory extension technology. a memory extension technology.
2) Sure there may be performance advantages in some situations, but * Sure there may be performance advantages in some situations, but
what's the space/time overhead of frontswap? what's the space/time overhead of frontswap?
If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed nothingness and the only overhead is a few extra bytes per swapon'ed
...@@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. A ...@@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen. dynamically) manage memory limits to ensure this doesn't happen.
3) OK, how about a quick overview of what this frontswap patch does * OK, how about a quick overview of what this frontswap patch does
in terms that a kernel hacker can grok? in terms that a kernel hacker can grok?
Let's assume that a frontswap "backend" has registered during Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this kernel initialization; this registration indicates that this
...@@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend ...@@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend loads", which are presumably much store" and (possibly) a "frontswap backend loads", which are presumably much
faster. faster.
* Can't frontswap be configured as a "special" swap device that is
  just higher priority than any real swap device (e.g. like zswap,
  or maybe swap-over-nbd/NFS)?

No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,

@@ -240,9 +255,9 @@ installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used.

* Why this weird definition about "duplicate stores"? If a page
  has been previously successfully stored, can't it always be
  successfully overwritten?

Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed

@@ -254,7 +269,7 @@ the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency.
* What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk

@@ -267,7 +282,7 @@ to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure
subsides.

* Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between

...
.. _highmem:

====================
High Memory Handling
====================

By: Peter Zijlstra <a.p.zijlstra@chello.nl>

.. contents:: :local:

What Is High Memory?
====================

High memory (highmem) is used when the size of physical memory approaches or

@@ -38,7 +27,7 @@ kernel entry/exit. This means the available virtual memory space (4GiB on
i386) has to be divided between user and kernel space.

The traditional split for architectures using this approach is 3:1, 3GiB for
userspace and the top 1GiB for kernel space::

	+--------+ 0xffffffff
	| Kernel |

@@ -58,40 +47,38 @@ and user maps. Some hardware (like some ARMs), however, have limited virtual
space when they use mm context tags.

Temporary Virtual Mappings
==========================

The kernel contains several ways of creating temporary mappings:

* vmap(). This can be used to make a long duration mapping of multiple
  physical pages into a contiguous virtual space. It needs global
  synchronization to unmap.
* kmap(). This permits a short duration mapping of a single page. It needs
  global synchronization, but is amortized somewhat. It is also prone to
  deadlocks when used in a nested fashion, and so it is not recommended for
  new code.
* kmap_atomic(). This permits a very short duration mapping of a single
  page. Since the mapping is restricted to the CPU that issued it, it
  performs well, but the issuing task is therefore required to stay on that
  CPU until it has finished, lest some other task displace its mappings.
  kmap_atomic() may also be used by interrupt contexts, since it does not
  sleep and the caller may not sleep until after kunmap_atomic() is called.

  It may be assumed that k[un]map_atomic() won't fail.
Using kmap_atomic
=================

When and where to use kmap_atomic() is straightforward. It is used when code
wants to access the contents of a page that might be allocated from high memory
(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
functions, and they can be used in a manner similar to the following::

	/* Find the page of interest. */
	struct page *page = find_get_page(mapping, offset);

@@ -109,7 +96,7 @@ Note that the kunmap_atomic() call takes the result of the kmap_atomic() call
not the argument.

If you need to map two pages because you want to copy from one page to
another you need to keep the kmap_atomic calls strictly nested, like::

	vaddr1 = kmap_atomic(page1);
	vaddr2 = kmap_atomic(page2);

@@ -120,8 +107,7 @@ another you need to keep the kmap_atomic calls strictly nested, like:
	kunmap_atomic(vaddr1);
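Fleshed out, the nested pattern looks something like the sketch below.
Note that the kernel already provides copy_highpage() for this exact job;
the sketch only illustrates the nesting rule::

	static void copy_page_sketch(struct page *dst, struct page *src)
	{
		void *vdst, *vsrc;

		vdst = kmap_atomic(dst);	/* outer mapping */
		vsrc = kmap_atomic(src);	/* inner mapping */
		memcpy(vdst, vsrc, PAGE_SIZE);
		kunmap_atomic(vsrc);		/* inner unmapped first... */
		kunmap_atomic(vdst);		/* ...then the outer one */
	}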
Cost of Temporary Mappings
==========================

The cost of creating temporary mappings can be quite high. The arch has to

@@ -136,25 +122,24 @@ If CONFIG_MMU is not set, then there can be no temporary mappings and no
highmem. In such a case, the arithmetic approach will also be used.

i386 PAE
========

The i386 arch, under some circumstances, will permit you to stick up to 64GiB
of RAM into your 32-bit machine. This has a number of consequences:

* Linux needs a page-frame structure for each page in the system and the
  pageframes need to live in the permanent mapping, which means:

* you can have 896M/sizeof(struct page) page-frames at most; with struct
  page being 32-bytes that would end up being something in the order of 112G
  worth of pages; the kernel, however, needs to store more than just
  page-frames in that memory... (the arithmetic is spelled out below)

* PAE makes your page tables larger - which slows the system down as more
  data has to be accessed to traverse in TLB fills and the like. One
  advantage is that PAE has more PTE bits and can provide advanced features
  like NX and PAT.
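To spell out the arithmetic behind the page-frame numbers above (assuming
the usual 896MiB permanent kernel mapping and 4KiB pages)::

	896 MiB / 32 bytes per struct page  = 29,360,128 page-frames
	29,360,128 page-frames * 4 KiB each = 112 GiB of describable RAM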
The general recommendation is that you don't use more than 8GiB on a 32-bit
machine - although more might work for you and your workload, you're pretty

...
.. _hmm:
=====================================
Heterogeneous Memory Management (HMM)
=====================================

Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into regular kernel path, with the cornerstone

@@ -6,10 +10,10 @@ of this being specialized struct page for such memory (see sections 5 to 7 of
this document).
HMM also provides optional helpers for SVM (Share Virtual Memory), i.e.,
allowing a device to transparently access program addresses coherently with
the CPU, meaning that any valid pointer on the CPU is also a valid pointer
for the device. This is becoming mandatory to simplify the use of advanced
heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems

@@ -21,19 +25,10 @@ fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine.
.. contents:: :local:

Problems of using a device specific memory allocator
====================================================

Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs.

@@ -77,9 +72,8 @@ are only do-able with a shared address space. It is also more reasonable to use
a shared address space for all other patterns.

I/O bus, device memory characteristics
======================================

I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache

@@ -109,9 +103,8 @@ access any memory but we must also permit any memory to be migrated to device
memory while device is using it (blocking CPU access while it happens).

Shared address space and migration
==================================

HMM intends to provide two main features. First one is to share the address
space by duplicating the CPU page table in the device page table so the same

@@ -148,23 +141,23 @@ ages device memory by migrating the part of the data set that is actively being
used by the device.

Address space mirroring implementation and API
==============================================

Address space mirroring's main objective is to allow duplication of a range of
CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the
registration of an hmm_mirror struct::

	int hmm_mirror_register(struct hmm_mirror *mirror,
				struct mm_struct *mm);
	int hmm_mirror_register_locked(struct hmm_mirror *mirror,
				       struct mm_struct *mm);

The locked variant is to be used when the driver is already holding mmap_sem
of the mm in write mode. The mirror struct has a set of callbacks that are used
to propagate CPU page tables::

	struct hmm_mirror_ops {
		/* sync_cpu_device_pagetables() - synchronize page tables

@@ -193,10 +186,10 @@ The device driver must perform the update action to the range (mark range
read only, or fully unmap, ...). The device must be done with the update before
the driver callback returns.
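To make that concrete, a minimal sketch of a mirror registration follows.
The device structure, its invalidation helper, and the error handling are
hypothetical; the callback signature is the one from include/linux/hmm.h
in this kernel series::

	static void example_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
						       enum hmm_update_type update,
						       unsigned long start,
						       unsigned long end)
	{
		struct example_device *edev =
			container_of(mirror, struct example_device, mirror);

		/* Drop the device's copy of [start, end); the CPU page
		 * table change only proceeds once this returns. */
		example_device_invalidate_range(edev, start, end);
	}

	static const struct hmm_mirror_ops example_mirror_ops = {
		.sync_cpu_device_pagetables = example_sync_cpu_device_pagetables,
	};

	/* at bind time, with current->mm known to be the target mm */
	edev->mirror.ops = &example_mirror_ops;
	ret = hmm_mirror_register(&edev->mirror, current->mm);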
When the device driver wants to populate a range of virtual addresses, it can
use either::

	int hmm_vma_get_pfns(struct vm_area_struct *vma,
			     struct hmm_range *range,
			     unsigned long start,
			     unsigned long end,

@@ -221,7 +214,7 @@ provides a set of flags to help the driver identify special CPU page table
entries.

Locking with the update() callback is the most important aspect the driver must
respect in order to keep things properly synchronized. The usage pattern is::

	int driver_populate_range(...)
	{

@@ -262,9 +255,8 @@ report commands as executed is serialized (there is no point in doing this
concurrently).

Represent and manage device memory from core kernel point of view
==================================================================

Several different designs were tried to support device memory. First one used
a device specific data structure to keep information about migrated memory and

@@ -280,14 +272,14 @@ unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side.

HMM provides a set of helpers to register and hotplug device memory as a new
region needing a struct page. This is offered through a very simple API::

	struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
					  struct device *device,
					  unsigned long size);
	void hmm_devmem_remove(struct hmm_devmem *devmem);

The hmm_devmem_ops is where most of the important things are::

	struct hmm_devmem_ops {
		void (*free)(struct hmm_devmem *devmem, struct page *page);

@@ -306,13 +298,12 @@ which it cannot do. This second callback must trigger a migration back to
system memory.
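For example, a driver might hotplug its on-board memory at probe time
roughly as follows. This is a sketch: example_devmem_ops and the device
structure are made-up names, and error handling is reduced to the minimum::

	static int example_add_device_memory(struct example_device *edev)
	{
		struct hmm_devmem *devmem;

		/* carve out 1GiB of device memory as ZONE_DEVICE pages */
		devmem = hmm_devmem_add(&example_devmem_ops, edev->dev, SZ_1G);
		if (IS_ERR(devmem))
			return PTR_ERR(devmem);

		edev->devmem = devmem;
		return 0;
	}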
Migration to and from device memory
===================================

Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copy from and to device memory. For this we need a new
migration helper::

	int migrate_vma(const struct migrate_vma_ops *ops,
			struct vm_area_struct *vma,

@@ -331,7 +322,7 @@ migration might be for a range of addresses the device is actively accessing.
The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
controls destination memory allocation and copy operation. Second one is there
to allow the device driver to perform cleanup operations after migration::

	struct migrate_vma_ops {
		void (*alloc_and_copy)(struct vm_area_struct *vma,

@@ -365,9 +356,8 @@ bandwidth but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler.
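A sketch of a driver-initiated migration using this helper is shown below.
The ops structure, device type, and window size are hypothetical, and the
src/dst arrays must have one entry per page in the range::

	#define EXAMPLE_NPAGES	16	/* hypothetical migration window */

	static int example_migrate_to_device(struct example_device *edev,
					     struct vm_area_struct *vma,
					     unsigned long start)
	{
		unsigned long src[EXAMPLE_NPAGES] = {};
		unsigned long dst[EXAMPLE_NPAGES] = {};

		/* alloc_and_copy() picks device pages and DMAs the data;
		 * finalize_and_map() lets the driver update its page
		 * tables once migration entries are in place. */
		return migrate_vma(&example_migrate_ops, vma, start,
				   start + EXAMPLE_NPAGES * PAGE_SIZE,
				   src, dst, edev);
	}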
Memory cgroup (memcg) and rss accounting
========================================

For now device memory is accounted as any regular page in rss counters (either
anonymous if device page is used for anonymous, file if device page is used for

...
.. _hugetlbfs_reserve:

=====================
Hugetlbfs Reservation
=====================

Overview
========

Huge pages as described at :ref:`hugetlbpage` are typically
preallocated for application use. These huge pages are instantiated in a
task's address space at page fault time if the VMA indicates huge pages are
to be used. If no huge page exists at page fault time, the task is sent
@@ -17,47 +24,55 @@ describe how huge page reserve processing is done in the v4.10 kernel.

Audience
========

This description is primarily targeted at kernel developers who are modifying
hugetlbfs code.

The Data Structures
===================

resv_huge_pages
	This is a global (per-hstate) count of reserved huge pages. Reserved
	huge pages are only available to the task which reserved them.
	Therefore, the number of huge pages generally available is computed
	as (``free_huge_pages - resv_huge_pages``).

Reserve Map
	A reserve map is described by the structure::

		struct resv_map {
			struct kref refs;
			spinlock_t lock;
			struct list_head regions;
			long adds_in_progress;
			struct list_head region_cache;
			long region_cache_count;
		};

	There is one reserve map for each huge page mapping in the system.
	The regions list within the resv_map describes the regions within
	the mapping. A region is described as::

		struct file_region {
			struct list_head link;
			long from;
			long to;
		};
	The 'from' and 'to' fields of the file region structure are huge page
	indices into the mapping. Depending on the type of mapping, a
	region in the resv_map may indicate reservations exist for the
	range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations
	These are stored in the bottom bits of the reservation map pointer.

	``#define HPAGE_RESV_OWNER (1UL << 0)``
		Indicates this task is the owner of the reservations
		associated with the mapping.

	``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
		Indicates task originally mapping this range (and creating
		reserves) has unmapped a page from this task (the child)
		due to a failed COW.

Page Flags
	The PagePrivate page flag is used to indicate that a huge page
	reservation must be restored when the huge page is freed. More

@@ -65,12 +80,14 @@ Page Flags
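Because those flags live in the low bits of vma->vm_private_data, code
that wants the map pointer has to mask them off first. A sketch follows;
the kernel's actual helpers in mm/hugetlb.c differ in detail::

	#define HPAGE_RESV_MASK	(HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

	static struct resv_map *resv_map_sketch(struct vm_area_struct *vma)
	{
		unsigned long v = (unsigned long)vma->vm_private_data;

		/* clear the flag bits to recover the pointer itself */
		return (struct resv_map *)(v & ~HPAGE_RESV_MASK);
	}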
Reservation Map Location (Private or Shared)
============================================

A huge page mapping or segment is either private or shared. If private,
it is typically only available to a single address space (task). If shared,
it can be mapped into multiple address spaces (tasks). The location and
semantics of the reservation map are significantly different for the two
types of mappings. Location differences are:
- For private mappings, the reservation map hangs off the VMA structure.
  Specifically, vma->vm_private_data. This reserve map is created at the
  time the mapping (mmap(MAP_PRIVATE)) is created.
@@ -82,15 +99,15 @@ of mappings. Location differences are:

Creating Reservations
=====================

Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages()::

	int hugetlb_reserve_pages(struct inode *inode,
				  long from, long to,
				  struct vm_area_struct *vma,
				  vm_flags_t vm_flags)
The first thing hugetlb_reserve_pages() does is check if the NORESERVE
flag was specified in either the shmget() or mmap() call. If NORESERVE

@@ -105,6 +122,7 @@ the 'from' and 'to' arguments have been adjusted by this offset.
One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map.

- For shared mappings, an entry in the reservation map indicates a reservation
  exists or did exist for the corresponding page. As reservations are
  consumed, the reservation map is not modified.

@@ -121,12 +139,13 @@ to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment. For private mappings, this is
always the value (to - from). However, for shared mappings it is possible
that some reservations may already exist within the range (to - from). See
the section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.

The mapping may be associated with a subpool. If so, the subpool is consulted
to ensure there is sufficient space for the mapping. It is possible that the
subpool has set aside reservations that can be used for the mapping. See the
section :ref:`Subpool Reservations <sub_pool_resv>` for more details.

After consulting the reservation map and subpool, the number of needed new
reservations is known. The routine hugetlb_acct_memory() is called to check

@@ -135,9 +154,11 @@ calls into routines that potentially allocate and adjust surplus page counts.

However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation. If there are,
the global reservation count resv_huge_pages is adjusted something like the
following::
	if (resv_needed <= (free_huge_pages - resv_huge_pages))
		resv_huge_pages += resv_needed;
Note that the global lock hugetlb_lock is held when checking and adjusting
these counters.

@@ -152,14 +173,18 @@ If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'.

.. _consume_resv:

Consuming Reservations/Allocating a Huge Page
=============================================

Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping. The allocation
is performed within the routine alloc_huge_page()::

	struct page *alloc_huge_page(struct vm_area_struct *vma,
				     unsigned long addr, int avoid_reserve)

alloc_huge_page is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists. In addition,
alloc_huge_page takes the argument avoid_reserve which indicates reserves

@@ -170,8 +195,9 @@ page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping(vma). See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
information on what this routine does. The value returned from
vma_needs_reservation() is generally 0 or 1. 0 if a reservation exists
for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the
mapping, the subpool is consulted to determine if it contains reservations.
@@ -180,21 +206,25 @@ However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation. After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called. This routine takes two arguments related to reservations:

- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
- chg, even though this argument is of type long only the values 0 or 1 are
  passed to dequeue_huge_page_vma. If the value is 0, it indicates a
  reservation exists (see the section "Memory Policy and Reservations" for
  possible issues). If the value is 1, it indicates a reservation does not
  exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for
a free page. If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list. If there was a reservation
associated with the page, the following adjustments are made::

	SetPagePrivate(page);	/* Indicates allocating this page consumed
				 * a reservation, and if an error is
				 * encountered such that the page must be
				 * freed, the reservation will be restored. */

	resv_huge_pages--;	/* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator. This
brings up the issue of surplus huge pages and overcommit which is beyond

@@ -222,12 +252,14 @@ mapping. In such cases, the reservation count and subpool free page count
will be off by one. This rare condition can be identified by comparing the
return value from vma_needs_reservation and vma_commit_reservation. If such
a race is detected, the subpool and global reserve counts are adjusted to
compensate. See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines.

Instantiate Huge Pages
======================

After huge page allocation, the page is typically added to the page tables
of the allocating task. Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous

@@ -237,7 +269,8 @@ to the global reservation count (resv_huge_pages).

Freeing Huge Pages
==================

Huge page freeing is performed by the routine free_huge_page(). This routine
is the destructor for hugetlbfs compound pages. As a result, it is only
passed a pointer to the page struct. When a huge page is freed, reservation

@@ -247,7 +280,8 @@ on an error path where a global reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page. If this

@@ -259,9 +293,11 @@ Therefore, the global resv_huge_pages counter is incremented in this case.
If the PagePrivate flag was set in the page, the global resv_huge_pages counter
will always be incremented.

.. _sub_pool_resv:

Subpool Reservations
====================

There is a struct hstate associated with each huge page size. The hstate
tracks all huge pages of the specified size. A subpool represents a subset
of pages within a hstate that is associated with a mounted hugetlbfs

@@ -295,7 +331,8 @@ the global pools.

COW and Reservations
====================

Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings. In this case,
two tasks can be pointing at the same previously allocated page. One task

@@ -326,30 +363,36 @@ faults on a non-present page. But, the original owner of the
mapping/reservation will behave as expected.

.. _resv_map_modifications:

Reservation Map Modifications
=============================

The following low level routines are used to make modifications to a
reservation map. Typically, these routines are not called directly. Rather,
a reservation map helper routine is called which calls one of these low level
routines. These low level routines are fairly well documented in the source
code (mm/hugetlb.c). These routines are::

	long region_chg(struct resv_map *resv, long f, long t);
	long region_add(struct resv_map *resv, long f, long t);
	void region_abort(struct resv_map *resv, long f, long t);
	long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two operations:

1) region_chg() is called to examine the reserve map and determine how
   many pages in the specified range [f, t) are NOT currently represented.

   The calling code performs global checks and allocations to determine if
   there are enough huge pages for the operation to succeed.

2) a) If the operation can succeed, region_add() is called to actually
      modify the reservation map for the same range [f, t) previously
      passed to region_chg().

   b) If the operation can not succeed, region_abort is called for the
      same range [f, t) to abort the operation.

Note that this is a two step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same

@@ -371,6 +414,7 @@ and make the appropriate adjustments.
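Schematically, callers follow this pattern. This is only a sketch: the
global-availability check stands in for the real accounting done by
hugetlb_acct_memory() and friends::

	long chg, add;

	chg = region_chg(resv, f, t);	/* pages not yet represented */
	if (chg < 0)
		return chg;

	if (!example_enough_huge_pages(chg)) {	/* hypothetical check */
		region_abort(resv, f, t);
		return -ENOSPC;
	}

	add = region_add(resv, f, t);	/* cannot fail after region_chg() */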
The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations:

- When a file in the hugetlbfs filesystem is being removed, the inode will
  be released and the reservation map freed. Before freeing the reservation
  map, all the individual file_region structures must be freed. In this case

@@ -384,6 +428,7 @@ It is typically called in the following situations:
  removed, region_del() is called to remove the corresponding entry from the
  reservation map. In this case, region_del is passed the range
  [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the
reservation map. In VERY rare cases, region_del() can fail. This can only
happen in the hole punch case where it has to split an existing file_region

@@ -403,9 +448,11 @@ outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations.

.. _resv_map_helpers:

Reservation Map Helper Routines
===============================

Several helper routines exist to query and modify the reservation maps.
These routines are only interested in reservations for a specific huge
page, so they just pass in an address instead of a range. In addition,

@@ -414,32 +461,40 @@ or shared) and the location of the reservation map (inode or VMA) can be
determined. These routines simply call the underlying routines described
in the section "Reservation Map Modifications". However, they do take into
account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller::

	long vma_needs_reservation(struct hstate *h,
				   struct vm_area_struct *vma,
				   unsigned long addr)

This routine calls region_chg() for the specified page. If no reservation
exists, 1 is returned. If a reservation exists, 0 is returned::

	long vma_commit_reservation(struct hstate *h,
				    struct vm_area_struct *vma,
				    unsigned long addr)

This calls region_add() for the specified page. As in the case of region_chg
and region_add, this routine is to be called after a previous call to
vma_needs_reservation. It will add a reservation entry for the page. It
returns 1 if the reservation was added and 0 if not. The return value should
be compared with the return value of the previous call to
vma_needs_reservation. An unexpected difference indicates the reservation
map was modified between calls::

	void vma_end_reservation(struct hstate *h,
				 struct vm_area_struct *vma,
				 unsigned long addr)

This calls region_abort() for the specified page. As in the case of region_chg
and region_abort, this routine is to be called after a previous call to
vma_needs_reservation. It will abort/end the in progress reservation add
operation::

	long vma_add_reservation(struct hstate *h,
				 struct vm_area_struct *vma,
				 unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup
on error paths. It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation in an attempt

@@ -453,8 +508,10 @@ be done on error paths.

Reservation Cleanup in Error Paths
==================================

As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps. First vma_needs_reservation
is called before a page is allocated. If the allocation is successful,
then vma_commit_reservation is called. If not, vma_end_reservation is called.

@@ -494,13 +551,14 @@ so that a reservation will not be leaked when the huge page is freed.
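In code form, the two step pattern looks roughly like the sketch below,
simplified from the real allocation paths in mm/hugetlb.c; the allocation
helper name is made up::

	struct page *page;
	long needed;

	needed = vma_needs_reservation(h, vma, addr);	/* step one: query */

	page = example_alloc_path(h, vma, addr);	/* hypothetical */
	if (page)
		(void)vma_commit_reservation(h, vma, addr); /* step two (a) */
	else
		vma_end_reservation(h, vma, addr);	/* step two (b) */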
Reservations and Memory Policy
==============================

Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code. The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy
into account. While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory sums up the interaction between reservations
and cpusets/memory policy::

	/*
	 * When cpuset is configured, it breaks the strict hugetlb page
	 * reservation as the accounting is done on a global variable. Such

@@ -525,5 +583,13 @@ of cpusets or memory policy there is no guarantee that huge pages will be
available on the required nodes. This is true even if there are a sufficient
number of global reservations.

Hugetlbfs regression testing
============================
The most complete set of hugetlb tests is in the libhugetlbfs repository.
If you modify any hugetlb related code, use the libhugetlbfs test suite
to check for regressions. In addition, if you add any new hugetlb
functionality, please add appropriate tests to libhugetlbfs.
--
Mike Kravetz, 7 April 2017
.. _hwpoison:
========
hwpoison
========

What is hwpoison?
=================

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery``). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

@@ -46,9 +53,10 @@ address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might. won't do that, but some very specialized ones might.
--- Failure recovery modes
======================
There are two (actually three) modi memory failure recovery can be in: There are two (actually three) modes memory failure recovery can be in:
vm.memory_failure_recovery sysctl set to zero: vm.memory_failure_recovery sysctl set to zero:
All memory failures cause a panic. Do not attempt recovery. All memory failures cause a panic. Do not attempt recovery.
...@@ -67,9 +75,8 @@ late kill ...@@ -67,9 +75,8 @@ late kill
This is best for memory error unaware applications and default This is best for memory error unaware applications and default
Note some pages are always handled as late kill. Note some pages are always handled as late kill.
--- User control
============
User control:
vm.memory_failure_recovery vm.memory_failure_recovery
See sysctl.txt See sysctl.txt
...@@ -79,11 +86,19 @@ vm.memory_failure_early_kill ...@@ -79,11 +86,19 @@ vm.memory_failure_early_kill
PR_MCE_KILL PR_MCE_KILL
Set early/late kill mode/revert to system default Set early/late kill mode/revert to system default
arg1: PR_MCE_KILL_CLEAR: Revert to system default
arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode arg1: PR_MCE_KILL_CLEAR:
PR_MCE_KILL_EARLY: Early kill Revert to system default
PR_MCE_KILL_LATE: Late kill arg1: PR_MCE_KILL_SET:
PR_MCE_KILL_DEFAULT: Use system global default arg2 defines thread specific mode
PR_MCE_KILL_EARLY:
Early kill
PR_MCE_KILL_LATE:
Late kill
PR_MCE_KILL_DEFAULT
Use system global default
Note that if you want to have a dedicated thread which handles Note that if you want to have a dedicated thread which handles
the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
...@@ -92,77 +107,64 @@ PR_MCE_KILL ...@@ -92,77 +107,64 @@ PR_MCE_KILL
PR_MCE_KILL_GET PR_MCE_KILL_GET
return current mode return current mode
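As an illustration, a minimal userspace sketch of the prctl interface
described above (not part of the original text; the PR_MCE_KILL constants
come from <linux/prctl.h>, pulled in by <sys/prctl.h>)::

	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		/* Mark this thread for early kill, then read the mode back. */
		if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
			perror("PR_MCE_KILL");
		printf("current mode: %d\n", prctl(PR_MCE_KILL_GET, 0, 0, 0, 0));
		return 0;
	}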
Testing
=======
--- * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
process for testing
Testing:
madvise(MADV_HWPOISON, ....)
(as root)
Poison a page in the process for testing
hwpoison-inject module through debugfs * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
/sys/kernel/debug/hwpoison/ corrupt-pfn
Inject hwpoison fault at PFN echoed into this file. This does
some early filtering to avoid corrupting unintended pages in test suites.
corrupt-pfn unpoison-pfn
Software-unpoison page at PFN echoed into this file. This way
a page can be reused again. This only works for Linux
injected failures, not for real memory failures.
Inject hwpoison fault at PFN echoed into this file. This does Note these injection interfaces are not stable and might change between
some early filtering to avoid corrupted unintended pages in test suites. kernel versions.
unpoison-pfn corrupt-filter-dev-major, corrupt-filter-dev-minor
Only handle memory failures to pages associated with the file
system defined by block device major/minor. -1U is the
wildcard value. This should be only used for testing with
artificial injection.
Software-unpoison page at PFN echoed into this file. This corrupt-filter-memcg
way a page can be reused again. Limit injection to pages owned by memgroup. Specified by inode
This only works for Linux injected failures, not for real number of the memcg.
memory failures.
Note these injection interfaces are not stable and might change between Example::
kernel versions
corrupt-filter-dev-major mkdir /sys/fs/cgroup/mem/hwpoison
corrupt-filter-dev-minor
Only handle memory failures to pages associated with the file system defined usemem -m 100 -s 1000 &
by block device major/minor. -1U is the wildcard value. echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
This should be only used for testing with artificial injection.
corrupt-filter-memcg memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
Limit injection to pages owned by memgroup. Specified by inode number page-types -p `pidof init` --hwpoison # shall do nothing
of the memcg. page-types -p `pidof usemem` --hwpoison # poison its pages
Example: corrupt-filter-flags-mask, corrupt-filter-flags-value
mkdir /sys/fs/cgroup/mem/hwpoison When specified, only poison pages if ((page_flags & mask) ==
value). This allows stress testing of many kinds of
pages. The page_flags are the same as in /proc/kpageflags. The
flag bits are defined in include/linux/kernel-page-flags.h and
documented in Documentation/admin-guide/mm/pagemap.rst
usemem -m 100 -s 1000 & * Architecture specific MCE injector
echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') x86 has mce-inject, mce-test
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
page-types -p `pidof init` --hwpoison # shall do nothing Some portable hwpoison test programs in mce-test, see below.
page-types -p `pidof usemem` --hwpoison # poison its pages
corrupt-filter-flags-mask References
corrupt-filter-flags-value ==========
When specified, only poison pages if ((page_flags & mask) == value).
This allows stress testing of many kinds of pages. The page_flags
are the same as in /proc/kpageflags. The flag bits are defined in
include/linux/kernel-page-flags.h and documented in
Documentation/vm/pagemap.txt
Architecture specific MCE injector
x86 has mce-inject, mce-test
Some portable hwpoison test programs in mce-test, see blow.
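To make the madvise-based injection concrete, a minimal sketch (an
illustration only; it assumes root, CONFIG_MEMORY_FAILURE=y, and the
MADV_HWPOISON value from the UAPI mman headers)::

	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_HWPOISON
	#define MADV_HWPOISON 100	/* from the UAPI mman headers */
	#endif

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;	/* fault the page in so there is a page to poison */

		/* Later accesses to p should then raise SIGBUS. */
		if (madvise(p, psz, MADV_HWPOISON))
			perror("madvise(MADV_HWPOISON)");
		return 0;
	}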
---
References:
http://halobates.de/mce-lc09-2.pdf http://halobates.de/mce-lc09-2.pdf
Overview presentation from LinuxCon 09 Overview presentation from LinuxCon 09
...@@ -174,14 +176,11 @@ git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git ...@@ -174,14 +176,11 @@ git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
x86 specific injector x86 specific injector
--- Limitations
===========
Limitations:
- Not all page types are supported and never will. Most kernel internal - Not all page types are supported and never will. Most kernel internal
objects cannot be recovered, only LRU pages for now. objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing. - Right now hugepage support is missing.
--- ---
Andi Kleen, Oct 2009 Andi Kleen, Oct 2009
=====================================
Linux Memory Management Documentation
=====================================
This is a collection of documents about the Linux memory management (mm) subsystem.
User guides for MM features
===========================
The following documents provide guides for controlling and tuning
various features of the Linux memory management.
.. toctree::
:maxdepth: 1
swap_numa
zswap
Kernel developers MM documentation
==================================
The documents below describe MM internals at different levels of
detail, ranging from notes and mailing list responses to elaborate
descriptions of data structures and algorithms.
.. toctree::
:maxdepth: 1
active_mm
balance
cleancache
frontswap
highmem
hmm
hwpoison
hugetlbfs_reserv
ksm
mmu_notifier
numa
overcommit-accounting
page_migration
page_frags
page_owner
remap_file_pages
slub
split_page_table_lock
transhuge
unevictable-lru
z3fold
zsmalloc
.. _ksm:
=======================
Kernel Samepage Merging
=======================
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
The userspace interface of KSM is described in :ref:`Documentation/admin-guide/mm/ksm.rst <admin_guide_ksm>`.
Design
======
Overview
--------
.. kernel-doc:: mm/ksm.c
:DOC: Overview
Reverse mapping
---------------
KSM maintains reverse mapping information for KSM pages in the stable
tree.
If a KSM page is shared between fewer than ``max_page_sharing`` VMAs,
the node of the stable tree that represents such a KSM page points to a
list of :c:type:`struct rmap_item` and the ``page->mapping`` of the
KSM page points to the stable tree node.
When the sharing passes this threshold, KSM adds a second dimension to
the stable tree. The tree node becomes a "chain" that links one or
more "dups". Each "dup" keeps reverse mapping information for a KSM
page with ``page->mapping`` pointing to that "dup".
Every "chain" and all "dups" linked into a "chain" enforce the
invariant that they represent the same write-protected memory content,
even though each "dup" is pointed to by a different KSM page copy of
that content.
This way the stable tree lookup computational complexity is unaffected
compared to an unlimited list of reverse mappings. It is still
enforced that there cannot be KSM page content duplicates in the
stable tree itself.
The deduplication limit enforced by ``max_page_sharing`` is required
to keep the virtual memory rmap lists from growing too large. The rmap
walk has O(N) complexity where N is the number of rmap_items
(i.e. virtual mappings) that are sharing the page, which is in turn
capped by ``max_page_sharing``. So this effectively spreads the linear
O(N) computational complexity from rmap walk context over different
KSM pages. The ksmd walk over the stable_node "chains" is also O(N),
but N is the number of stable_node "dups", not the number of
rmap_items, so it does not have a significant impact on ksmd performance. In
practice the best stable_node "dup" candidate will be kept and found
at the head of the "dups" list.
High values of ``max_page_sharing`` result in faster memory merging
(because there will be fewer stable_node dups queued into the
stable_node chain->hlist to check for pruning) and a higher
deduplication factor, at the expense of a slower worst case for the rmap
walks of any KSM page, which can happen during swapping, compaction,
NUMA balancing and page migration.
The ``stable_node_dups/stable_node_chains`` ratio is also affected by the
``max_page_sharing`` tunable; a high ratio may indicate fragmentation
in the stable_node dups. This could be solved by introducing
fragmentation algorithms in ksmd which would refile rmap_items from
one stable_node dup to another stable_node dup, in order to free up
stable_node "dups" with few rmap_items in them, but that may increase
the ksmd CPU usage and possibly slow down the readonly computations on
the KSM pages of the applications.
The whole list of stable_node "dups" linked in the stable_node
"chains" is scanned periodically in order to prune stale stable_nodes.
The frequency of such scans is defined by
the ``stable_node_chains_prune_millisecs`` sysfs tunable.
Reference
---------
.. kernel-doc:: mm/ksm.c
:functions: mm_slot ksm_scan stable_node rmap_item
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
How to use the Kernel Samepage Merging feature
----------------------------------------------
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
The KSM daemon ksmd periodically scans those areas of user memory which
have been registered with it, looking for pages of identical content which
can be replaced by a single write-protected page (which is automatically
copied if a process later wants to update its content).
KSM was originally developed for use with KVM (where it was known as
Kernel Shared Memory), to fit more virtual machines into physical memory,
by sharing the data common between them. But it can be useful to any
application which generates many instances of the same data.
KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
be swapped out just like other user pages (but sharing is broken when they
are swapped back in: ksmd must rediscover their identity and merge again).
KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
system call: int madvise(addr, length, MADV_MERGEABLE).
The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel
that advice and restore unshared pages: whereupon KSM unmerges whatever
it merged in that range. Note: this unmerging call may suddenly require
more memory than is available - possibly failing with EAGAIN, but more
probably arousing the Out-Of-Memory killer.
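For illustration, a minimal userspace sketch of advising and un-advising a
region (not from the original text; error handling is kept minimal)::

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2 * 1024 * 1024;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 0x5a, len);	/* identical pages for ksmd to merge */

		if (madvise(buf, len, MADV_MERGEABLE))	/* EINVAL if CONFIG_KSM=n */
			perror("MADV_MERGEABLE");
		/* ... later, cancel the advice and unmerge: */
		if (madvise(buf, len, MADV_UNMERGEABLE))
			perror("MADV_UNMERGEABLE");
		return 0;
	}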
If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
built with CONFIG_KSM=y, those calls will normally succeed: even if
the KSM daemon is not currently running, MADV_MERGEABLE still registers
the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise call may fail with ENOMEM if the
process would exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
and might fail with EAGAIN if there is not enough memory for internal structures.
Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit. KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason.
The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/,
readable by all but writable only by root:
pages_to_scan - how many present pages to scan before ksmd goes to sleep
e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan"
Default: 100 (chosen for demonstration purposes)
sleep_millisecs - how many milliseconds ksmd should sleep before next scan
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
Default: 20 (chosen for demonstration purposes)
merge_across_nodes - specifies if pages from different numa nodes can be merged.
When set to 0, ksm merges only pages which physically
reside in the memory area of the same NUMA node. That brings
lower latency when accessing shared pages. Systems with more
nodes, at significant NUMA distances, are likely to benefit
from the lower latency of setting 0. Smaller systems, which
need to minimize memory usage, are likely to benefit from
the greater sharing of setting 1 (default). You may wish to
compare how your system performs under each setting, before
deciding on which to use. The merge_across_nodes setting can be
changed only when there are no ksm shared pages in the system:
set run to 2 to unmerge pages first, then back to 1 after changing
merge_across_nodes, to remerge according to the new setting.
Default: 1 (merging across nodes as in earlier releases)
run - set 0 to stop ksmd from running but keep merged pages,
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
set 2 to stop ksmd and unmerge all pages currently merged,
but leave mergeable areas registered for next run
Default: 0 (must be changed to 1 to activate KSM,
except if CONFIG_SYSFS is disabled)
use_zero_pages - specifies whether empty pages (i.e. allocated pages
that only contain zeroes) should be treated specially.
When set to 1, empty pages are merged with the kernel
zero page(s) instead of with each other as would
normally happen. This can improve the performance on
architectures with coloured zero pages, depending on
the workload. Care should be taken when enabling this
setting, as it can potentially degrade the performance
of KSM for some workloads, for example if the checksums
of candidate pages for merging match the checksum of
an empty page. This setting can be changed at any time,
it is only effective for pages merged after the change.
Default: 0 (normal KSM behaviour as in earlier releases)
max_page_sharing - Maximum sharing allowed for each KSM page. This
enforces a deduplication limit to avoid the virtual
memory rmap lists to grow too large. The minimum
value is 2 as a newly created KSM page will have at
least two sharers. The rmap walk has O(N)
complexity where N is the number of rmap_items
(i.e. virtual mappings) that are sharing the page,
which is in turn capped by max_page_sharing. So
this effectively spreads the linear O(N)
computational complexity from rmap walk context
over different KSM pages. The ksmd walk over the
stable_node "chains" is also O(N), but N is the
number of stable_node "dups", not the number of
rmap_items, so it does not have a significant impact on
ksmd performance. In practice the best stable_node
"dup" candidate will be kept and found at the head
of the "dups" list. The higher this value the
faster KSM will merge the memory (because there
will be fewer stable_node dups queued into the
stable_node chain->hlist to check for pruning) and
the higher the deduplication factor will be, but
the slower the worst case rmap walk could be for
any given KSM page. Slowing down the rmap_walk
means there will be higher latency for certain
virtual memory operations happening during
swapping, compaction, NUMA balancing and page
migration, in turn decreasing responsiveness for
the caller of those virtual memory operations. The
scheduler latency of other tasks not involved with
the VM operations doing the rmap walk is not
affected by this parameter as the rmap walks are
always schedule friendly themselves.
stable_node_chains_prune_millisecs - How frequently to walk the whole
list of stable_node "dups" linked in the
stable_node "chains" in order to prune stale
stable_nodes. Smaller millisecs values will free
up the KSM metadata with lower latency, but they
will make ksmd use more CPU during the scan. This
only applies to the stable_node chains so it's a
noop if not a single KSM page has hit the
max_page_sharing limit yet (there would be no stable_node
chains in such a case).
The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
pages_shared - how many shared pages are being used
pages_sharing - how many more sites are sharing them i.e. how much saved
pages_unshared - how many pages unique but repeatedly checked for merging
pages_volatile - how many pages changing too fast to be placed in a tree
full_scans - how many times all mergeable areas have been scanned
stable_node_chains - number of stable node chains allocated, this is
effectively the number of KSM pages that hit the
max_page_sharing limit
stable_node_dups - number of stable node dups queued into the
stable_node chains
A high ratio of pages_sharing to pages_shared indicates good sharing, but
a high ratio of pages_unshared to pages_sharing indicates wasted effort.
pages_volatile embraces several different kinds of activity, but a high
proportion there would also indicate poor use of madvise MADV_MERGEABLE.
The maximum possible pages_sharing/pages_shared ratio is limited by the
max_page_sharing tunable. To increase the ratio, max_page_sharing must
be increased accordingly.
The stable_node_dups/stable_node_chains ratio is also affected by the
max_page_sharing tunable; a high ratio may indicate fragmentation
in the stable_node dups. This could be solved by introducing
fragmentation algorithms in ksmd which would refile rmap_items from
one stable_node dup to another stable_node dup, in order to free up
stable_node "dups" with few rmap_items in them, but that may increase
the ksmd CPU usage and possibly slow down the readonly computations on
the KSM pages of the applications.
Izik Eidus,
Hugh Dickins, 17 Nov 2009
.. _mmu_notifier:
When do you need to notify inside page table lock ? When do you need to notify inside page table lock ?
===================================================
When clearing a pte/pmd we are given a choice to notify the event through When clearing a pte/pmd we are given a choice to notify the event through
(notify version of *_clear_flush call mmu_notifier_invalidate_range) under (notify version of \*_clear_flush call mmu_notifier_invalidate_range) under
the page table lock. But that notification is not necessary in all cases. the page table lock. But that notification is not necessary in all cases.
For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use
...@@ -18,6 +21,7 @@ a page that might now be used by some completely different task. ...@@ -18,6 +21,7 @@ a page that might now be used by some completely different task.
Case B is more subtle. For correctness it requires the following sequence to Case B is more subtle. For correctness it requires the following sequence to
happen: happen:
- take page table lock - take page table lock
- clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
- set page table entry to point to new page - set page table entry to point to new page
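The sequence above, sketched in kernel-style C (an illustration only: the
helper names follow include/linux/mmu_notifier.h, while mm, vma, pmd, ptep,
addr and new_page are assumed from the surrounding code)::

	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);					/* take page table lock */
	ptep_clear_flush_notify(vma, addr, ptep);	/* clear entry + notify */
	set_pte_at(mm, addr, ptep, mk_pte(new_page, vma->vm_page_prot));
	spin_unlock(ptl);

	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);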
...@@ -28,58 +32,60 @@ the device. ...@@ -28,58 +32,60 @@ the device.
Consider the following scenario (device uses a feature similar to ATS/PASID): Consider the following scenario (device uses a feature similar to ATS/PASID):
Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we assume Two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE; we assume
they are write protected for COW (other cases of B apply too). they are write protected for COW (other cases of B apply too).
[Time N] -------------------------------------------------------------------- ::
CPU-thread-0 {try to write to addrA}
CPU-thread-1 {try to write to addrB} [Time N] --------------------------------------------------------------------
CPU-thread-2 {} CPU-thread-0 {try to write to addrA}
CPU-thread-3 {} CPU-thread-1 {try to write to addrB}
DEV-thread-0 {read addrA and populate device TLB} CPU-thread-2 {}
DEV-thread-2 {read addrB and populate device TLB} CPU-thread-3 {}
[Time N+1] ------------------------------------------------------------------ DEV-thread-0 {read addrA and populate device TLB}
CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}} DEV-thread-2 {read addrB and populate device TLB}
CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}} [Time N+1] ------------------------------------------------------------------
CPU-thread-2 {} CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-3 {} CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
DEV-thread-0 {} CPU-thread-2 {}
DEV-thread-2 {} CPU-thread-3 {}
[Time N+2] ------------------------------------------------------------------ DEV-thread-0 {}
CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}} DEV-thread-2 {}
CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}} [Time N+2] ------------------------------------------------------------------
CPU-thread-2 {} CPU-thread-0 {COW_step1: {update page table to point to new page for addrA}}
CPU-thread-3 {} CPU-thread-1 {COW_step1: {update page table to point to new page for addrB}}
DEV-thread-0 {} CPU-thread-2 {}
DEV-thread-2 {} CPU-thread-3 {}
[Time N+3] ------------------------------------------------------------------ DEV-thread-0 {}
CPU-thread-0 {preempted} DEV-thread-2 {}
CPU-thread-1 {preempted} [Time N+3] ------------------------------------------------------------------
CPU-thread-2 {write to addrA which is a write to new page} CPU-thread-0 {preempted}
CPU-thread-3 {} CPU-thread-1 {preempted}
DEV-thread-0 {} CPU-thread-2 {write to addrA which is a write to new page}
DEV-thread-2 {} CPU-thread-3 {}
[Time N+3] ------------------------------------------------------------------ DEV-thread-0 {}
CPU-thread-0 {preempted} DEV-thread-2 {}
CPU-thread-1 {preempted} [Time N+3] ------------------------------------------------------------------
CPU-thread-2 {} CPU-thread-0 {preempted}
CPU-thread-3 {write to addrB which is a write to new page} CPU-thread-1 {preempted}
DEV-thread-0 {} CPU-thread-2 {}
DEV-thread-2 {} CPU-thread-3 {write to addrB which is a write to new page}
[Time N+4] ------------------------------------------------------------------ DEV-thread-0 {}
CPU-thread-0 {preempted} DEV-thread-2 {}
CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}} [Time N+4] ------------------------------------------------------------------
CPU-thread-2 {} CPU-thread-0 {preempted}
CPU-thread-3 {} CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
DEV-thread-0 {} CPU-thread-2 {}
DEV-thread-2 {} CPU-thread-3 {}
[Time N+5] ------------------------------------------------------------------ DEV-thread-0 {}
CPU-thread-0 {preempted} DEV-thread-2 {}
CPU-thread-1 {} [Time N+5] ------------------------------------------------------------------
CPU-thread-2 {} CPU-thread-0 {preempted}
CPU-thread-3 {} CPU-thread-1 {}
DEV-thread-0 {read addrA from old page} CPU-thread-2 {}
DEV-thread-2 {read addrB from new page} CPU-thread-3 {}
DEV-thread-0 {read addrA from old page}
DEV-thread-2 {read addrB from new page}
So here because at time N+2 the clear page table entry was not paired with a So here because at time N+2 the clear page table entry was not paired with a
notification to invalidate the secondary TLB, the device sees the new value for notification to invalidate the secondary TLB, the device sees the new value for
......
.. _numa:
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
=============
What is NUMA? What is NUMA?
=============
This question can be answered from a couple of perspectives: the This question can be answered from a couple of perspectives: the
hardware view and the Linux software view. hardware view and the Linux software view.
...@@ -106,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces, ...@@ -106,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. allocation behavior using Linux NUMA memory policy.
[see Documentation/vm/numa_memory_policy.txt.] [see Documentation/admin-guide/mm/numa_memory_policy.rst.]
System administrators can restrict the CPUs and nodes' memories that a non- System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions privileged user can specify in the scheduling or NUMA commands and functions
......
The Linux kernel supports the following overcommit handling modes
0 - Heuristic overcommit handling. Obvious overcommits of
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slightly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays
and just relying on the virtual memory consisting almost
entirely of zero pages.
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap + a
configurable amount (default is 50%) of physical RAM.
Depending on the amount you use, in most situations
this means a process will not be killed while accessing
pages but will receive errors on memory allocation as
appropriate.
Useful for applications that want to guarantee their
memory allocations will be available in the future
without having to initialize every page.
The overcommit policy is set via the sysctl `vm.overcommit_memory'.
The overcommit amount can be set via `vm.overcommit_ratio' (percentage)
or `vm.overcommit_kbytes' (absolute value).
The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.
Gotchas
-------
The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much, but it's a corner case if you really care.
In mode 2 the MAP_NORESERVE flag is ignored.
How It Works
------------
The overcommit is based on the following rules
For a file backed map
SHARED or READ-only - 0 cost (the file is the map not swap)
PRIVATE WRITABLE - size of mapping per instance
For an anonymous or /dev/zero map
SHARED - size of mapping
PRIVATE READ-only - 0 cost (but of little use)
PRIVATE WRITABLE - size of mapping per instance
Additional accounting
Pages made writable copies by mmap
shmfs memory drawn from the same pool
Status
------
o We account mmap memory mappings
o We account mprotect changes in commit
o We account mremap changes in size
o We account brk
o We account munmap
o We report the commit status in /proc
o Account and check on fork
o Review stack handling/building on exec
o SHMfs accounting
o Implement actual limit enforcement
To Do
-----
o Account ptrace pages (this is hard)
.. _overcommit_accounting:
=====================
Overcommit Accounting
=====================
The Linux kernel supports the following overcommit handling modes:
0
Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a
seriously wild allocation fails while allowing overcommit to
reduce swap usage. root is allowed to allocate slightly more
memory in this mode. This is the default.
1
Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays and
just relying on the virtual memory consisting almost entirely
of zero pages.
2
Don't overcommit. The total address space commit for the
system is not permitted to exceed swap + a configurable amount
(default is 50%) of physical RAM. Depending on the amount you
use, in most situations this means a process will not be
killed while accessing pages but will receive errors on memory
allocation as appropriate.
Useful for applications that want to guarantee their memory
allocations will be available in the future without having to
initialize every page.
The overcommit policy is set via the sysctl ``vm.overcommit_memory``.
The overcommit amount can be set via ``vm.overcommit_ratio`` (percentage)
or ``vm.overcommit_kbytes`` (absolute value).
The current overcommit limit and amount committed are viewable in
``/proc/meminfo`` as CommitLimit and Committed_AS respectively.
Gotchas
=======
The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much, but it's a corner case if you really care.
In mode 2 the MAP_NORESERVE flag is ignored.
How It Works
============
The overcommit is based on the following rules:
For a file backed map
| SHARED or READ-only - 0 cost (the file is the map not swap)
| PRIVATE WRITABLE - size of mapping per instance
For an anonymous or ``/dev/zero`` map
| SHARED - size of mapping
| PRIVATE READ-only - 0 cost (but of little use)
| PRIVATE WRITABLE - size of mapping per instance
Additional accounting
| Pages made writable copies by mmap
| shmfs memory drawn from the same pool
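These rules can be observed from userspace. A small sketch (an illustration
assuming the usual /proc/meminfo layout) maps an anonymous PRIVATE WRITABLE
region and watches Committed_AS grow by the mapping size::

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	/* Print the Committed_AS line from /proc/meminfo. */
	static void committed(const char *tag)
	{
		char line[128];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "Committed_AS", 12))
				printf("%s %s", tag, line);
		fclose(f);
	}

	int main(void)
	{
		committed("before:");
		/* Anonymous PRIVATE WRITABLE map: charged at its full size. */
		if (mmap(NULL, 128 << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
			return 1;
		committed("after: ");
		return 0;
	}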
Status
======
* We account mmap memory mappings
* We account mprotect changes in commit
* We account mremap changes in size
* We account brk
* We account munmap
* We report the commit status in /proc
* Account and check on fork
* Review stack handling/building on exec
* SHMfs accounting
* Implement actual limit enforcement
To Do
=====
* Account ptrace pages (this is hard)
.. _page_frags:
==============
Page fragments Page fragments
-------------- ==============
A page fragment is an arbitrary-length arbitrary-offset area of memory A page fragment is an arbitrary-length arbitrary-offset area of memory
which resides within a 0 or higher order compound page. Multiple which resides within a 0 or higher order compound page. Multiple
......
.. _page_migration:
==============
Page migration Page migration
-------------- ==============
Page migration allows the moving of the physical location of pages between Page migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the nodes in a numa system while the process is running. This means that the
...@@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen ...@@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required. Get it from (a version later than 0.9.3 is required. Get it from
ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
which provides an interface similar to other numa functionality for page which provides an interface similar to other numa functionality for page
migration. cat /proc/<pid>/numa_maps allows an easy review of where the migration. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
pages of a process are located. See also the numa_maps documentation in the pages of a process are located. See also the numa_maps documentation in the
proc(5) man page. proc(5) man page.
...@@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel ...@@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel
(for userspace usage see the Andi Kleen's numactl package mentioned above) (for userspace usage see the Andi Kleen's numactl package mentioned above)
and then a low level description of how the low level details work. and then a low level description of how the low level details work.
A. In kernel use of migrate_pages() In kernel use of migrate_pages()
----------------------------------- ================================
1. Remove pages from the LRU. 1. Remove pages from the LRU.
...@@ -78,8 +81,8 @@ A. In kernel use of migrate_pages() ...@@ -78,8 +81,8 @@ A. In kernel use of migrate_pages()
the new page for each page that is considered for the new page for each page that is considered for
moving. moving.
B. How migrate_pages() works How migrate_pages() works
---------------------------- =========================
migrate_pages() does several passes over its list of pages. A page is moved migrate_pages() does several passes over its list of pages. A page is moved
if all references to a page are removable at the time. The page has if all references to a page are removable at the time. The page has
...@@ -142,8 +145,8 @@ Steps: ...@@ -142,8 +145,8 @@ Steps:
20. The new page is moved to the LRU and can be scanned by the swapper 20. The new page is moved to the LRU and can be scanned by the swapper
etc again. etc again.
C. Non-LRU page migration Non-LRU page migration
------------------------- ======================
Although page migration originally aimed at reducing the latency of memory access Although page migration originally aimed at reducing the latency of memory access
for NUMA, compaction, which wants to create high-order pages, is also a main customer. for NUMA, compaction, which wants to create high-order pages, is also a main customer.
...@@ -164,89 +167,91 @@ migration path. ...@@ -164,89 +167,91 @@ migration path.
If a driver wants to make its own pages movable, it should define three functions If a driver wants to make its own pages movable, it should define three functions
which are function pointers of struct address_space_operations (see the sketch after this list). which are function pointers of struct address_space_operations (see the sketch after this list).
1. bool (*isolate_page) (struct page *page, isolate_mode_t mode); 1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);``
What VM expects of a driver's isolate_page function is to return *true* What VM expects of a driver's isolate_page function is to return *true*
if the driver isolates the page successfully. On returning true, VM marks the page if the driver isolates the page successfully. On returning true, VM marks the page
as PG_isolated so concurrent isolation on several CPUs skips the page as PG_isolated so concurrent isolation on several CPUs skips the page
for isolation. If a driver cannot isolate the page, it should return *false*. for isolation. If a driver cannot isolate the page, it should return *false*.
Once the page is successfully isolated, VM uses the page.lru fields, so the driver Once the page is successfully isolated, VM uses the page.lru fields, so the driver
shouldn't expect to preserve values in those fields. shouldn't expect to preserve values in those fields.
2. int (*migratepage) (struct address_space *mapping, 2. ``int (*migratepage) (struct address_space *mapping,``
struct page *newpage, struct page *oldpage, enum migrate_mode); | ``struct page *newpage, struct page *oldpage, enum migrate_mode);``
After isolation, VM calls the driver's migratepage with the isolated page. After isolation, VM calls the driver's migratepage with the isolated page.
The function of migratepage is to move content of the old page to new page The function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage. Keep in mind that you should and set up fields of struct page newpage. Keep in mind that you should
indicate to the VM the oldpage is no longer movable via __ClearPageMovable() indicate to the VM the oldpage is no longer movable via __ClearPageMovable()
under page_lock if you migrated the oldpage successfully and return under page_lock if you migrated the oldpage successfully and return
MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver
can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time
because VM interprets -EAGAIN as "temporary migration failure". On returning because VM interprets -EAGAIN as "temporary migration failure". On returning
any error except -EAGAIN, VM will give up the page migration without retrying any error except -EAGAIN, VM will give up the page migration without retrying
this time. this time.
The driver shouldn't touch the page.lru field that VM is using in these functions. The driver shouldn't touch the page.lru field that VM is using in these functions.
3. void (*putback_page)(struct page *); 3. ``void (*putback_page)(struct page *);``
If migration fails on an isolated page, VM should return the isolated page If migration fails on an isolated page, VM should return the isolated page
to the driver, so VM calls the driver's putback_page with the migration-failed page. to the driver, so VM calls the driver's putback_page with the migration-failed page.
In this function, the driver should put the isolated page back into its own data In this function, the driver should put the isolated page back into its own data
structure. structure.
4. non-lru movable page flags 4. non-lru movable page flags
There are two page flags for supporting non-lru movable page. There are two page flags for supporting non-lru movable page.
* PG_movable * PG_movable
Driver should use the below function to make page movable under page_lock. Driver should use the below function to make page movable under page_lock::
void __SetPageMovable(struct page *page, struct address_space *mapping) void __SetPageMovable(struct page *page, struct address_space *mapping)
It needs argument of address_space for registering migration family functions It needs an address_space argument for registering migration
which will be called by VM. Exactly speaking, PG_movable is not a real flag of family functions which will be called by VM. Exactly speaking,
struct page. Rather than, VM reuses page->mapping's lower bits to represent it. PG_movable is not a real flag of struct page. Rather, VM
reuses page->mapping's lower bits to represent it.
::
#define PAGE_MAPPING_MOVABLE 0x2 #define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
so the driver shouldn't access page->mapping directly. Instead, the driver should so the driver shouldn't access page->mapping directly. Instead, the driver should
use page_mapping, which masks off the low two bits of page->mapping under use page_mapping, which masks off the low two bits of page->mapping under
the page lock so it can get the right struct address_space. the page lock so it can get the right struct address_space.
For testing of non-lru movable pages, VM supports the __PageMovable function. For testing of non-lru movable pages, VM supports the __PageMovable function.
However, it doesn't guarantee identification of a non-lru movable page because However, it doesn't guarantee identification of a non-lru movable page because
the page->mapping field is unified with other variables in struct page. the page->mapping field is unified with other variables in struct page.
Also, if the driver releases the page after isolation by VM, page->mapping Also, if the driver releases the page after isolation by VM, page->mapping
doesn't have a stable value although it has PAGE_MAPPING_MOVABLE doesn't have a stable value although it has PAGE_MAPPING_MOVABLE
(look at __ClearPageMovable). But __PageMovable is a cheap way to check whether a (look at __ClearPageMovable). But __PageMovable is a cheap way to check whether a
page is LRU or non-lru movable once the page has been isolated, because page is LRU or non-lru movable once the page has been isolated, because
LRU pages can never have PAGE_MAPPING_MOVABLE in page->mapping. It is also LRU pages can never have PAGE_MAPPING_MOVABLE in page->mapping. It is also
good for just peeking to test for non-lru movable pages before the more expensive good for just peeking to test for non-lru movable pages before the more expensive
check with lock_page during pfn scanning to select a victim. check with lock_page during pfn scanning to select a victim.
To guarantee a non-lru movable page, VM provides the PageMovable function. To guarantee a non-lru movable page, VM provides the PageMovable function.
Unlike __PageMovable, PageMovable validates page->mapping and Unlike __PageMovable, PageMovable validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
destruction of page->mapping. destruction of page->mapping.
A driver using __SetPageMovable should clear the flag via __ClearPageMovable A driver using __SetPageMovable should clear the flag via __ClearPageMovable
under page_lock before releasing the page. under page_lock before releasing the page.
* PG_isolated * PG_isolated
To prevent concurrent isolation among several CPUs, VM marks an isolated page To prevent concurrent isolation among several CPUs, VM marks an isolated page
as PG_isolated under lock_page. So if a CPU encounters a PG_isolated non-lru as PG_isolated under lock_page. So if a CPU encounters a PG_isolated non-lru
movable page, it can skip it. The driver doesn't need to manipulate the flag movable page, it can skip it. The driver doesn't need to manipulate the flag
because VM will set/clear it automatically. Keep in mind that if the driver because VM will set/clear it automatically. Keep in mind that if the driver
sees a PG_isolated page, it means the page has been isolated by VM, so it sees a PG_isolated page, it means the page has been isolated by VM, so it
shouldn't touch the page.lru field. shouldn't touch the page.lru field.
PG_isolated is aliased with the PG_reclaim flag, so the driver shouldn't use the flag PG_isolated is aliased with the PG_reclaim flag, so the driver shouldn't use the flag
for its own purposes. for its own purposes.
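Putting the three callbacks together, a hedged sketch of how a driver might
wire them up (the mydrv_* names are hypothetical; the member names match
struct address_space_operations as described in this list)::

	#include <linux/fs.h>
	#include <linux/migrate.h>
	#include <linux/mm.h>

	static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
	{
		/* take a driver-private lock, unhook the page from the
		 * driver's internal lists, then report success */
		return true;
	}

	static int mydrv_migratepage(struct address_space *mapping,
				     struct page *newpage, struct page *oldpage,
				     enum migrate_mode mode)
	{
		/* copy contents and driver state from oldpage to newpage */
		__ClearPageMovable(oldpage);
		return MIGRATEPAGE_SUCCESS;	/* or -EAGAIN to retry later */
	}

	static void mydrv_putback_page(struct page *page)
	{
		/* migration failed: return page to the driver's own lists */
	}

	static const struct address_space_operations mydrv_aops = {
		.isolate_page	= mydrv_isolate_page,
		.migratepage	= mydrv_migratepage,
		.putback_page	= mydrv_putback_page,
	};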
Christoph Lameter, May 8, 2006. Christoph Lameter, May 8, 2006.
Minchan Kim, Mar 28, 2016. Minchan Kim, Mar 28, 2016.
.. _page_owner:
==================================================
page owner: Tracking about who allocated each page page owner: Tracking about who allocated each page
----------------------------------------------------------- ==================================================
* Introduction Introduction
============
page owner is for tracking who allocated each page. page owner is for tracking who allocated each page.
It can be used to debug memory leaks or to find a memory hog. It can be used to debug memory leaks or to find a memory hog.
...@@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump ...@@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump
label patching functionality is available. Following is the kernel's code label patching functionality is available. Following is the kernel's code
size change due to this facility. size change due to this facility.
- Without page owner - Without page owner::
text data bss dec hex filename text data bss dec hex filename
40662 1493 644 42799 a72f mm/page_alloc.o 40662 1493 644 42799 a72f mm/page_alloc.o
- With page owner::
- With page owner
text data bss dec hex filename text data bss dec hex filename
40892 1493 644 43029 a815 mm/page_alloc.o 40892 1493 644 43029 a815 mm/page_alloc.o
1427 24 8 1459 5b3 mm/page_ext.o 1427 24 8 1459 5b3 mm/page_ext.o
2722 50 0 2772 ad4 mm/page_owner.o 2722 50 0 2772 ad4 mm/page_owner.o
...@@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct ...@@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct
page extension feature. Anyway, after that, no page is left in page extension feature. Anyway, after that, no page is left in
an untracked state. an untracked state.
* Usage Usage
=====
1) Build user-space helper::
1) Build user-space helper
cd tools/vm cd tools/vm
make page_owner_sort make page_owner_sort
2) Enable page owner 2) Enable page owner: add "page_owner=on" to boot cmdline.
Add "page_owner=on" to boot cmdline.
3) Do the job you want to debug 3) Do the job you want to debug
4) Analyze information from page owner 4) Analyze information from page owner::
cat /sys/kernel/debug/page_owner > page_owner_full.txt cat /sys/kernel/debug/page_owner > page_owner_full.txt
grep -v ^PFN page_owner_full.txt > page_owner.txt grep -v ^PFN page_owner_full.txt > page_owner.txt
./page_owner_sort page_owner.txt sorted_page_owner.txt ./page_owner_sort page_owner.txt sorted_page_owner.txt
See the result about who allocated each page See the result about who allocated each page
in the sorted_page_owner.txt. in ``sorted_page_owner.txt``.
.. _remap_file_pages:
==============================
remap_file_pages() system call
==============================
The remap_file_pages() system call is used to create a nonlinear mapping, The remap_file_pages() system call is used to create a nonlinear mapping,
that is, a mapping in which the pages of the file are mapped into a that is, a mapping in which the pages of the file are mapped into a
nonsequential order in memory. The advantage of using remap_file_pages() nonsequential order in memory. The advantage of using remap_file_pages()
......
.. _slub:
==========================
Short users guide for SLUB Short users guide for SLUB
-------------------------- ==========================
The basic philosophy of SLUB is very different from SLAB. SLAB The basic philosophy of SLUB is very different from SLAB. SLAB
requires rebuilding the kernel to activate debug options for all requires rebuilding the kernel to activate debug options for all
...@@ -8,18 +11,19 @@ SLUB can enable debugging only for selected slabs in order to avoid ...@@ -8,18 +11,19 @@ SLUB can enable debugging only for selected slabs in order to avoid
an impact on overall system performance which may make a bug more an impact on overall system performance which may make a bug more
difficult to find. difficult to find.
In order to switch debugging on one can add an option "slub_debug" In order to switch debugging on one can add an option ``slub_debug``
to the kernel command line. That will enable full debugging for to the kernel command line. That will enable full debugging for
all slabs. all slabs.
Typically one would then use the "slabinfo" command to get statistical Typically one would then use the ``slabinfo`` command to get statistical
data and perform operation on the slabs. By default slabinfo only lists data and perform operation on the slabs. By default ``slabinfo`` only lists
slabs that have data in them. See "slabinfo -h" for more options when slabs that have data in them. See "slabinfo -h" for more options when
running the command. slabinfo can be compiled with running the command. ``slabinfo`` can be compiled with
::
gcc -o slabinfo tools/vm/slabinfo.c gcc -o slabinfo tools/vm/slabinfo.c
Some of the modes of operation of slabinfo require that slub debugging Some of the modes of operation of ``slabinfo`` require that slub debugging
be enabled on the command line. F.e. no tracking information will be be enabled on the command line. F.e. no tracking information will be
available without debugging on and validation can only partially available without debugging on and validation can only partially
be performed if debugging was not switched on. be performed if debugging was not switched on.
...@@ -27,14 +31,17 @@ be performed if debugging was not switched on. ...@@ -27,14 +31,17 @@ be performed if debugging was not switched on.
Some more sophisticated uses of slub_debug: Some more sophisticated uses of slub_debug:
------------------------------------------- -------------------------------------------
Parameters may be given to slub_debug. If none is specified then full Parameters may be given to ``slub_debug``. If none is specified then full
debugging is enabled. Format: debugging is enabled. Format:
slub_debug=<Debug-Options> Enable options for all slabs slub_debug=<Debug-Options>
Enable options for all slabs
slub_debug=<Debug-Options>,<slab name> slub_debug=<Debug-Options>,<slab name>
Enable options only for select slabs Enable options only for select slabs
Possible debug options are::
Possible debug options are
F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS
Sorry SLAB legacy issues) Sorry SLAB legacy issues)
Z Red zoning Z Red zoning
...@@ -47,18 +54,18 @@ Possible debug options are ...@@ -47,18 +54,18 @@ Possible debug options are
- Switch all debugging off (useful if the kernel is - Switch all debugging off (useful if the kernel is
configured with CONFIG_SLUB_DEBUG_ON) configured with CONFIG_SLUB_DEBUG_ON)
F.e. in order to boot just with sanity checks and red zoning one would specify: F.e. in order to boot just with sanity checks and red zoning one would specify::
slub_debug=FZ slub_debug=FZ
Trying to find an issue in the dentry cache? Try Trying to find an issue in the dentry cache? Try::
slub_debug=,dentry slub_debug=,dentry
to only enable debugging on the dentry cache. to only enable debugging on the dentry cache.
Red zoning and tracking may realign the slab. We can just apply sanity checks Red zoning and tracking may realign the slab. We can just apply sanity checks
to the dentry cache with to the dentry cache with::
slub_debug=F,dentry slub_debug=F,dentry
...@@ -66,15 +73,15 @@ Debugging options may require the minimum possible slab order to increase as ...@@ -66,15 +73,15 @@ Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes). This has a higher likelihood of resulting in slab allocation errors sizes). This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory. To in low memory situations or if there's high fragmentation of memory. To
switch off debugging for such caches by default, use switch off debugging for such caches by default, use::
slub_debug=O slub_debug=O
In case you forgot to enable debugging on the kernel command line: It is In case you forgot to enable debugging on the kernel command line: It is
possible to enable debugging manually when the kernel is up. Look at the possible to enable debugging manually when the kernel is up. Look at the
contents of: contents of::
/sys/kernel/slab/<slab name>/ /sys/kernel/slab/<slab name>/
Look at the writable files. Writing 1 to them will enable the Look at the writable files. Writing 1 to them will enable the
corresponding debug option. All options can be set on a slab that does corresponding debug option. All options can be set on a slab that does
...@@ -86,98 +93,103 @@ Careful with tracing: It may spew out lots of information and never stop if ...@@ -86,98 +93,103 @@ Careful with tracing: It may spew out lots of information and never stop if
used on the wrong slab. used on the wrong slab.
Slab merging Slab merging
------------ ============
If no debug options are specified then SLUB may merge similar slabs together If no debug options are specified then SLUB may merge similar slabs together
in order to reduce overhead and increase cache hotness of objects. in order to reduce overhead and increase cache hotness of objects.
slabinfo -a displays which slabs were merged together. ``slabinfo -a`` displays which slabs were merged together.
Slab validation Slab validation
--------------- ===============
SLUB can validate all objects if the kernel was booted with slub_debug. In SLUB can validate all objects if the kernel was booted with slub_debug. In
order to do so you must have the slabinfo tool. Then you can do order to do so you must have the ``slabinfo`` tool. Then you can do
::
slabinfo -v slabinfo -v
which will test all objects. Output will be generated to the syslog. which will test all objects. Output will be generated to the syslog.
This also works in a more limited way if boot was without slab debug. This also works in a more limited way if boot was without slab debug.
In that case slabinfo -v simply tests all reachable objects. Usually In that case ``slabinfo -v`` simply tests all reachable objects. Usually
these are in the cpu slabs and the partial slabs. Full slabs are not these are in the cpu slabs and the partial slabs. Full slabs are not
tracked by SLUB in a non debug situation. tracked by SLUB in a non debug situation.
Getting more performance Getting more performance
------------------------ ========================
To some degree SLUB's performance is limited by the need to take the To some degree SLUB's performance is limited by the need to take the
list_lock once in a while to deal with partial slabs. That overhead is list_lock once in a while to deal with partial slabs. That overhead is
governed by the order of the allocation for each slab. The allocations governed by the order of the allocation for each slab. The allocations
can be influenced by kernel parameters: can be influenced by kernel parameters:
slub_min_objects=x (default 4) .. slub_min_objects=x (default 4)
slub_min_order=x (default 0) .. slub_min_order=x (default 0)
slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) .. slub_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER))
slub_min_objects allows to specify how many objects must at least fit ``slub_min_objects``
into one slab in order for the allocation order to be acceptable. allows specifying how many objects must at least fit into one
In general slub will be able to perform this number of allocations slab in order for the allocation order to be acceptable. In
on a slab without consulting centralized resources (list_lock) where general slub will be able to perform this number of
contention may occur. allocations on a slab without consulting centralized resources
(list_lock) where contention may occur.
slub_min_order specifies a minim order of slabs. A similar effect like
slub_min_objects. ``slub_min_order``
specifies a minim order of slabs. A similar effect like
slub_max_order specified the order at which slub_min_objects should no ``slub_min_objects``.
longer be checked. This is useful to avoid SLUB trying to generate
super large order pages to fit slub_min_objects of a slab cache with ``slub_max_order``
large object sizes into one high order page. Setting command line specified the order at which ``slub_min_objects`` should no
parameter debug_guardpage_minorder=N (N > 0), forces setting longer be checked. This is useful to avoid SLUB trying to
slub_max_order to 0, what cause minimum possible order of slabs generate super large order pages to fit ``slub_min_objects``
allocation. of a slab cache with large object sizes into one high order
page. Setting command line parameter
``debug_guardpage_minorder=N`` (N > 0), forces setting
``slub_max_order`` to 0, what cause minimum possible order of
slabs allocation.
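For example, a hypothetical tuning that accepts larger slab pages in exchange
for fewer list_lock acquisitions could combine the parameters above on the
kernel command line (the values are purely illustrative, not recommendations)::

	slub_min_objects=20 slub_max_order=4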
SLUB Debug output
=================

Here is a sample of slub debug output::

 ====================================================================
 BUG kmalloc-8: Redzone overwritten
 --------------------------------------------------------------------

 INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc
 INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58
 INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58
 INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554

 Bytes b4 0xc90f6d10:  00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
 Object  0xc90f6d20:  31 30 31 39 2e 30 30 35                           1019.005
 Redzone 0xc90f6d28:  00 cc cc cc                                       .
 Padding 0xc90f6d50:  5a 5a 5a 5a 5a 5a 5a 5a                           ZZZZZZZZ

   [<c010523d>] dump_trace+0x63/0x1eb
   [<c01053df>] show_trace_log_lvl+0x1a/0x2f
   [<c010601d>] show_trace+0x12/0x14
   [<c0106035>] dump_stack+0x16/0x18
   [<c017e0fa>] object_err+0x143/0x14b
   [<c017e2cc>] check_object+0x66/0x234
   [<c017eb43>] __slab_free+0x239/0x384
   [<c017f446>] kfree+0xa6/0xc6
   [<c02e2335>] get_modalias+0xb9/0xf5
   [<c02e23b7>] dmi_dev_uevent+0x27/0x3c
   [<c027866a>] dev_uevent+0x1ad/0x1da
   [<c0205024>] kobject_uevent_env+0x20a/0x45b
   [<c020527f>] kobject_uevent+0xa/0xf
   [<c02779f1>] store_uevent+0x4f/0x58
   [<c027758e>] dev_attr_store+0x29/0x2f
   [<c01bec4f>] sysfs_write_file+0x16e/0x19c
   [<c0183ba7>] vfs_write+0xd1/0x15a
   [<c01841d7>] sys_write+0x3d/0x72
   [<c0104112>] sysenter_past_esp+0x5f/0x99
   [<b7f7b410>] 0xb7f7b410
 =======================

 FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc
If SLUB encounters a corrupted object (full detection requires the kernel
to be booted with slub_debug) then the following output will be dumped

@@ -185,38 +197,38 @@ into the syslog:
1. Description of the problem encountered

   This will be a message in the system log starting with::

     ===============================================
     BUG <slab cache affected>: <What went wrong>
     -----------------------------------------------

     INFO: <corruption start>-<corruption_end> <more info>
     INFO: Slab <address> <slab information>
     INFO: Object <address> <object information>
     INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by
	cpu> pid=<pid of the process>
     INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu>
	pid=<pid of the process>

   (Object allocation / free information is only available if SLAB_STORE_USER is
   set for the slab. slub_debug sets that option)
2. The object contents if an object was involved.

   Various types of lines can follow the BUG SLUB line:

   Bytes b4 <address> : <bytes>
	Shows a few bytes before the object where the problem was detected.
	Can be useful if the corruption does not stop with the start of the
	object.

   Object <address> : <bytes>
	The bytes of the object. If the object is inactive then the bytes
	typically contain poison values. Any non-poison value shows a
	corruption by a write after free.

   Redzone <address> : <bytes>
	The Redzone following the object. The Redzone is used to detect
	writes after the object. All bytes should always have the same
	value. If there is any deviation then it is due to a write after

@@ -225,7 +237,7 @@ Redzone <address> : <bytes>

   (Redzone information is only available if SLAB_RED_ZONE is set.
   slub_debug sets that option)

   Padding <address> : <bytes>
	Unused data to fill up the space in order to get the next object
	properly aligned. In the debug case we make sure that there are
	at least 4 bytes of padding. This allows the detection of writes

@@ -233,29 +245,29 @@ Padding <address> : <bytes>
3. A stackdump

   The stackdump describes the location where the error was detected. The
   cause of the corruption may more likely be found by looking at the
   function that allocated or freed the object.
4. Report on how the problem was dealt with in order to ensure the continued
   operation of the system.

   These are messages in the system log beginning with::

     FIX <slab cache affected>: <corrective action taken>

In the above sample SLUB found that the Redzone of an active object has
been overwritten. Here a string of 8 characters was written into a slab that
has the length of 8 characters. However, an 8 character string needs a
terminating 0. That zero has overwritten the first byte of the Redzone field.
After reporting the details of the issue encountered the FIX SLUB message
tells us that SLUB has restored the Redzone to its proper value and then
system operations continue.
Emergency operations
====================

Minimal debugging (sanity checks alone) can be enabled by booting with::

	slub_debug=F
@@ -270,73 +282,80 @@ No guarantees. The kernel component still needs to be fixed. Performance

may be optimized further by locating the slab that experiences corruption
and enabling debugging only for that cache

I.e.::

	slub_debug=F,dentry

If the corruption occurs by writing after the end of the object then it
may be advisable to enable a Redzone to avoid corrupting the beginning
of other objects::

	slub_debug=FZ,dentry
Extended slabinfo mode and plotting
===================================

The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes:
 - Slabcache Totals
 - Slabs sorted by size (up to -N <num> slabs, default 1)
 - Slabs sorted by loss (up to -N <num> slabs, default 1)

Additionally, in this mode ``slabinfo`` does not dynamically scale
sizes (G/M/K) and reports everything in bytes (this functionality is
also available to other slabinfo modes via '-B' option) which makes
reporting more precise and accurate. Moreover, in some sense the '-X'
mode also simplifies the analysis of slabs' behaviour, because its
output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it
pushes the analysis from looking through the numbers (tons of numbers)
to something easier -- visual analysis.
To generate plots:

a) collect slabinfo extended records, for example::

	while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done

b) pass stats file(-s) to the ``slabinfo-gnuplot.sh`` script::

	slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN]

The ``slabinfo-gnuplot.sh`` script will pre-process the collected records
and generate 3 png files (and 3 pre-processing cache files) per STATS
file:

 - Slabcache Totals: FOO_STATS-totals.png
 - Slabs sorted by size: FOO_STATS-slabs-by-size.png
 - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png

Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you
need to compare slabs' behaviour "prior to" and "after" some code
modification. To help you out there, the ``slabinfo-gnuplot.sh`` script
can 'merge' the `Slabcache Totals` sections from different
measurements. To visually compare N plots:

a) Collect as many STATS1, STATS2, .. STATSN files as you need::

	while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done

b) Pre-process those STATS files::

	slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN

c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the
   generated pre-processed \*-totals::

	slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals

This will produce a single plot (png file).

Plots, expectedly, can be large so some fluctuations or small spikes
can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two
options to 'zoom-in'/'zoom-out':

 a) ``-s %d,%d`` -- overwrites the default image width and height
 b) ``-r %d,%d`` -- specifies a range of samples to use (for example,
    in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r
    40,60`` range will plot only samples collected between 40th and
    60th seconds).
Christoph Lameter, May 30, 2007
Sergey Senozhatsky, October 23, 2015
.. _split_page_table_lock:

=====================
Split page table lock
=====================
@@ -11,6 +14,7 @@ access to the table. At the moment we use split lock for PTE and PMD

tables. Access to higher level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:

 - pte_offset_map_lock()
	maps pte and takes PTE table lock, returns pointer to the taken
	lock;
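For illustration, a minimal use of this helper could look as follows
(``mm``, ``pmd`` and ``addr`` are assumed to come from the enclosing page
table walk; this is a sketch, not code from the kernel tree)::

	spinlock_t *ptl;
	pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

	if (pte_present(*pte)) {
		/* inspect or modify the entry under the PTE table lock */
	}
	pte_unmap_unlock(pte, ptl);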
@@ -34,12 +38,13 @@ Split page table lock for PMD tables is enabled, if it's enabled for PTE

tables and the architecture supports it (see below).

Hugetlb and split page table lock
=================================

Hugetlb can support several page sizes. We use split lock only for PMD
level, but not for PUD.

Hugetlb-specific helpers:

 - huge_pte_lock()
	takes pmd split lock for PMD_SIZE page, mm->page_table_lock
	otherwise;

@@ -47,7 +52,7 @@ Hugetlb-specific helpers:

	returns pointer to table lock;
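A short sketch of pairing this helper with the lock it returns (``vma``,
``mm`` and ``ptep`` are assumed to be supplied by the surrounding code)::

	spinlock_t *ptl = huge_pte_lock(hstate_vma(vma), mm, ptep);
	pte_t entry = huge_ptep_get(ptep);

	/* operate on 'entry' under the split (or global) lock */
	spin_unlock(ptl);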
Support of split page table lock by an architecture
===================================================

There is no need to specially enable PTE split page table lock:
everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),

@@ -73,7 +78,7 @@ NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must

be handled properly.

page->ptl
=========

page->ptl is used to access split page table lock, where 'page' is struct
page of page containing the table. It shares storage with page->private

@@ -81,6 +86,7 @@ page of page containing the table. It shares storage with page->private

To avoid increasing size of struct page and have best performance, we use a
trick:

 - if spinlock_t fits into long, we use page->ptl as spinlock, so we
   can avoid indirect access and save a cache line.
 - if size of spinlock_t is bigger than size of long, we use page->ptl as
...
.. _swap_numa:

===========================================
Automatically bind swap device to numa node
===========================================
If the system has more than one swap device and the swap devices have node
information, we can make use of this information to decide which swap

@@ -7,15 +10,16 @@ device to use in get_swap_pages() to get better performance.
How to use this feature
=======================

Each swap device has a priority, which decides the order in which it is used.
To make use of automatic binding, there is no need to manipulate priority
settings for swap devices. e.g. on a 2 node machine, assume 2 swap devices
swapA and swapB, with swapA attached to node 0 and swapB attached to node 1,
are going to be swapped on. Simply swap them on by doing::

	# swapon /dev/swapA
	# swapon /dev/swapB

Then node 0 will use the two swap devices in the order of swapA then swapB and
node 1 will use the two swap devices in the order of swapB then swapA. Note
@@ -24,32 +28,39 @@ that the order of them being swapped on doesn't matter.

A more complex example on a 4 node machine. Assume 6 swap devices are going to
be swapped on: swapA and swapB are attached to node 0, swapC is attached to
node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
The way to swap them on is the same as above::

	# swapon /dev/swapA
	# swapon /dev/swapB
	# swapon /dev/swapC
	# swapon /dev/swapD
	# swapon /dev/swapE
	# swapon /dev/swapF

Then node 0 will use them in the order of::

	swapA/swapB -> swapC -> swapD -> swapE -> swapF

swapA and swapB will be used in a round robin mode before any other swap device.

node 1 will use them in the order of::

	swapC -> swapA -> swapB -> swapD -> swapE -> swapF

node 2 will use them in the order of::

	swapD/swapE -> swapA -> swapB -> swapC -> swapF

Similarly, swapD and swapE will be used in a round robin mode before any
other swap devices.

node 3 will use them in the order of::

	swapF -> swapA -> swapB -> swapC -> swapD -> swapE
Implementation details
======================

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
...
.. _transhuge:
============================
Transparent Hugepage Support
============================
This document describes design principles of Transparent Hugepage (THP)
support and its interaction with other parts of the memory management system.
Design principles
=================
- "graceful fallback": mm components which don't have transparent hugepage
knowledge fall back to breaking huge pmd mapping into table of ptes and,
if necessary, split a transparent hugepage. Therefore these components
can continue working on the regular pages or regular pte mappings.
- if a hugepage allocation fails because of memory fragmentation,
regular pages should be gracefully allocated instead and mixed in
the same vma without any failure or significant delay and without
userland noticing
- if some task quits and more hugepages become available (either
immediately in the buddy or through the VM), guest physical memory
backed by regular pages should be relocated on hugepages
automatically (with khugepaged)
- it doesn't require memory reservation and in turn it uses hugepages
whenever possible (the only possible reservation here is kernelcore=
to avoid unmovable pages to fragment all the memory but such a tweak
is not specific to transparent hugepage support and it's a generic
feature that applies to all dynamic high order allocations in the
kernel)
get_user_pages and follow_page
==============================
get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most gup users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice the fact that the page is huge. But
if any driver is going to mangle over the page structure of the tail
page (like for checking page->mapping or other bits that are relevant
for the head page and not the tail page), it should be updated to
check the head page instead. Taking a reference on any head/tail page
would prevent the page from being split by anyone.

.. note::
   these aren't new constraints to the GUP API, and they match the
   same constraints that apply to hugetlbfs too, so any driver capable
   of handling GUP on hugetlbfs will also work fine on transparent
   hugepage backed mappings.
In case you can't handle compound pages if they're returned by
follow_page, the FOLL_SPLIT bit can be specified as a parameter to
follow_page, so that it will split the hugepages before returning
them. Migration for example passes FOLL_SPLIT as a parameter to
follow_page because it's not hugepage aware and in fact it can't work
at all on hugetlbfs (but it instead works fine on transparent
hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
hugepages being returned (as it's not only checking the pfn of the
page and pinning it during the copy but it pretends to migrate the
memory in regular page sizes and with regular pte/pmd mappings).
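For illustration only (this is not the actual migration code; ``vma`` and
``addr`` are assumed to be supplied by the caller)::

	/* ask follow_page() to split any THP before handing it back */
	struct page *page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);

	if (page) {
		/* 'page' is a base page here, safe for THP-unaware code */
		put_page(page);
	}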
Graceful fallback
=================
Code walking pagetables but unaware about huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.
If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage for example. split_huge_page() can fail
if the page is pinned and you must handle this correctly.
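A hedged sketch of handling that failure might look like this (the error
handling policy and the surrounding context are assumptions, not kernel
code)::

	if (!trylock_page(page))
		return -EBUSY;		/* split_huge_page() requires the page lock */

	if (split_huge_page(page)) {
		unlock_page(page);	/* page was pinned: fall back or retry later */
		return -EBUSY;
	}
	unlock_page(page);		/* success: 'page' is now a base page */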
Example to make mremap.c transparent hugepage aware with a one liner
change::
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
return NULL;
pmd = pmd_offset(pud, addr);
+ split_huge_pmd(vma, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;
Locking in hugepage aware code
==============================
We want as much code as possible hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.
To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_sem in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise you can proceed to process the huge pmd and the
hugepage natively. Once finished you can drop the page table lock.
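The walk just described might be sketched as follows (the surrounding
function and its arguments are illustrative assumptions, not code from the
kernel tree)::

	static void walk_one_pmd(struct mm_struct *mm, pud_t *pud,
				 unsigned long addr)
	{
		pmd_t *pmd = pmd_offset(pud, addr);	/* mmap_sem held */
		spinlock_t *ptl;

		if (pmd_trans_huge(*pmd)) {
			ptl = pmd_lock(mm, pmd);
			if (pmd_trans_huge(*pmd)) {
				/* stable huge pmd: process the hugepage natively */
				spin_unlock(ptl);
				return;
			}
			/* raced with split_huge_pmd(): it is a regular pmd now */
			spin_unlock(ptl);
		}
		/* fall back to the pte-level code path */
	}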
Refcounts and transparent huge pages
====================================
Refcounting on THP is mostly consistent with refcounting on other compound
pages:
 - get_page()/put_page() and GUP operate on the head page's ->_refcount.

 - ->_refcount in tail pages is always zero: get_page_unless_zero() never
   succeeds on tail pages.

 - map/unmap of the pages with a PTE entry increments/decrements ->_mapcount
   on the relevant sub-page of the compound page.

 - map/unmap of the whole compound page is accounted in compound_mapcount
   (stored in the first tail page). For file huge pages, we also increment
   ->_mapcount of all sub-pages in order to have race-free detection of
   the last unmap of subpages.
PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
subpages is offset up by one. This additional reference is required to
get race-free detection of unmap of subpages when we have them mapped with
both PMDs and PTEs.
This optimization is required to lower the overhead of per-subpage mapcount
tracking. The alternative is to alter ->_mapcount in all subpages on each
map/unmap of the whole compound page.

For anonymous pages, we set PG_double_map when a PMD of the page is split
for the first time, but the page still has a PMD mapping. The additional
references go away with the last compound_mapcount.

File pages get PG_double_map set on the first map of the page with PTE and
it goes away when the page gets evicted from the page cache.
split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries. But we don't have enough information on how to distribute any
additional pins (i.e. from get_user_pages). split_huge_page() fails any
request to split a pinned huge page: it expects the page count to be equal
to the sum of the mapcount of all sub-pages plus one (the split_huge_page
caller must have a reference to the head page).
split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate way a
scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents the
scanner from getting a reference to the tail page up to that point. After the
atomic_add() we don't care about the ->_refcount value. We already know how
many references should be uncharged from the head page.

For the head page get_page_unless_zero() will succeed and we don't mind. It's
clear where the reference should go after the split: it will stay on the head
page.

Note that split_huge_pmd() doesn't have any limitation on refcounting:
the pmd can be split at any point and the operation never fails.
Partial unmap and deferred_split_huge_page()
============================================
Unmapping part of a THP (with munmap() or another way) is not going to free
memory immediately. Instead, we detect that a subpage of a THP is not in use
in page_remove_rmap() and queue the THP for splitting if memory pressure
comes. Splitting will free up the unused subpages.

Splitting the page right away is not an option due to locking context in
the place where we can detect partial unmap. It also might be
counterproductive since in many cases partial unmap happens during exit(2) if
a THP crosses a VMA boundary.

The function deferred_split_huge_page() is used to queue a page for
splitting. The splitting itself will happen when we get memory pressure
via the shrinker interface.
.. _unevictable_lru:

==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:

Introduction
============
This document describes the Linux memory manager's "Unevictable LRU"

@@ -46,8 +22,8 @@ details - the "what does it do?" - by reading the code. One hopes that the

descriptions below add value by providing the answer to "why does it do that?".
The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable

@@ -66,17 +42,17 @@ completely unresponsive.
The unevictable list addresses the following classes of unevictable pages:

 * Those owned by ramfs.

 * Those mapped into SHM_LOCK'd shared memory regions.

 * Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------

The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list

@@ -118,7 +94,7 @@ the unevictable list when one task has the page isolated from the LRU and other

tasks are changing the "evictability" state of the page.
Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka

@@ -144,7 +120,9 @@ effects:

the control group to thrash or to OOM-kill tasks.
.. _mark_addr_space_unevict:

Marking Address Spaces Unevictable
----------------------------------

For facilities such as ramfs none of the pages attached to the address space

@@ -152,15 +130,15 @@ may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE

address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions:
 * ``void mapping_set_unevictable(struct address_space *mapping);``

	Mark the address space as being completely unevictable.

 * ``void mapping_clear_unevictable(struct address_space *mapping);``

	Mark the address space as being evictable.

 * ``int mapping_unevictable(struct address_space *mapping);``

	Query the address space, and return true if it is completely
	unevictable.
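As a rough sketch of how a filesystem might use these wrappers (the function
name and its context are hypothetical, not taken from any particular
filesystem)::

	static void my_fs_pin_mapping(struct inode *inode)
	{
		/* every page attached to this mapping becomes unevictable */
		mapping_set_unevictable(inode->i_mapping);
	}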
@@ -177,12 +155,13 @@ These are currently used in two places in the kernel:

ensure they're in memory.

Detecting Unevictable Pages
---------------------------

The function page_evictable() in vmscan.c determines whether a page is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.
For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate

@@ -202,7 +181,7 @@ flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is

faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
Vmscan's Handling of Unevictable Pages
--------------------------------------

If unevictable pages are culled in the fault path, or moved to the unevictable

@@ -233,8 +212,7 @@ extra evictabilty checks should not occur in the majority of calls to

putback_lru_page().
MLOCKED Pages
=============

The unevictable page list is also useful for mlock(), in addition to ramfs and

@@ -242,7 +220,7 @@ SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.

History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally

@@ -263,7 +241,7 @@ replaced by walking the reverse map to determine whether any VM_LOCKED VMAs

mapped the page. More on this below.
Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable

@@ -304,10 +282,10 @@ mlocked pages become unlocked and rescued from the unevictable list when:

(4) before a page is COW'd in a VM_LOCKED VMA.
mlock()/mlockall() System Call Handling
---------------------------------------

Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory. A call to mlock()

@@ -351,7 +329,7 @@ mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.

Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

@@ -379,8 +357,9 @@ VM_LOCKED flag. Therefore, we won't have to deal with them later during

munlock(), munmap() or task exit. Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".
.. _munlock_munlockall_handling:

munlock()/munlockall() System Call Handling
-------------------------------------------

The munlock() and munlockall() system calls are handled by the same functions -

@@ -426,7 +405,7 @@ This is fine, because we'll catch it later if and if vmscan tries to reclaim

the page. This should be relatively rare.
Migrating MLOCKED Pages
-----------------------

A page that is being migrated has been isolated from the LRU lists and is held

@@ -451,7 +430,7 @@ list because of a race between munlock and migration, page migration uses the

putback_lru_page() function to add migrated pages back to the LRU.
Compacting MLOCKED Pages
------------------------

The unevictable LRU can be scanned for compactable regions and the default

@@ -461,7 +440,7 @@ unevictable LRU is enabled, the work of compaction is mostly handled by

the page migration code and the same work flow as described in MIGRATING
MLOCKED PAGES will apply.
MLOCKING Transparent Huge Pages
-------------------------------

A transparent huge page is represented by a single entry on an LRU list.

@@ -483,7 +462,7 @@ to unevictable LRU and the rest can be reclaimed.

See also comment in follow_trans_huge_pmd().
mmap(MAP_LOCKED) System Call Handling
-------------------------------------

In addition to the mlock()/mlockall() system calls, an application can request

@@ -514,7 +493,7 @@ memory range accounted as locked_vm, as the protections could be changed later

and pages allocated into that region.
munmap()/exit()/exec() System Call Handling
-------------------------------------------

When unmapping an mlocked region of memory, whether by an explicit call to

@@ -568,16 +547,18 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,

holepunching, and truncation of file pages and their anonymous COWed pages.
try_to_munlock() Reverse Map Scan
---------------------------------

.. warning::
   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
   page_referenced() reverse map walker.

When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
Handling <munlock_munlockall_handling>` above] tries to munlock a
page, it needs to determine whether or not the page is mapped by any
VM_LOCKED VMA without actually attempting to unmap all PTEs from the
page. For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock().

try_to_munlock() calls the same functions as try_to_unmap() for anonymous and

@@ -595,7 +576,7 @@ large region or tearing down a large address space that has been mlocked via

mlockall(), overall this is a fairly rare event.
Page Reclaim in shrink_*_list()
-------------------------------

shrink_active_list() culls any obviously unevictable pages - i.e.
...
.. _z3fold:

======
z3fold
======

z3fold is a special purpose allocator for storing compressed pages.
It is designed to store up to three compressed pages per physical page.

@@ -7,6 +10,7 @@ It is a zbud derivative which allows for higher compression

ratio keeping the simplicity and determinism of its predecessor.

The main differences between z3fold and zbud are:

 * unlike zbud, z3fold allows for up to PAGE_SIZE allocations
 * z3fold can hold up to 3 compressed pages in its page
 * z3fold doesn't export any API itself and is thus intended to be used
...
.. _zsmalloc:

========
zsmalloc
========

This allocator is designed for use with zram. Thus, the allocator is
supposed to work well under low memory conditions. In particular, it

@@ -31,40 +34,49 @@ be mapped using zs_map_object() to get a usable pointer and subsequently

unmapped using zs_unmap_object().
stat
====

With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via
``/sys/kernel/debug/zsmalloc/<user name>``. Here is a sample of stat output::

 # cat /sys/kernel/debug/zsmalloc/zram0/classes

 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage
    ...
    ...
     9   176           0            1           186        129          8                4
    10   192           1            0          2880       2872        135                3
    11   208           0            1           819        795         42                2
    12   224           0            1           219        159         12                4
    ...
    ...
class
	index
size
	object size zspage stores
almost_empty
	the number of ZS_ALMOST_EMPTY zspages (see below)
almost_full
	the number of ZS_ALMOST_FULL zspages (see below)
obj_allocated
	the number of objects allocated
obj_used
	the number of objects allocated to the user
pages_used
	the number of pages allocated for the class
pages_per_zspage
	the number of 0-order pages to make a zspage
We assign a zspage to ZS_ALMOST_EMPTY fullness group when n <= N / f, where

* n = number of allocated objects
* N = total number of objects zspage can store
* f = fullness_threshold_frac (ie, 4 at the moment)
Similarly, we assign zspage to:

* ZS_ALMOST_FULL  when n > N / f
* ZS_EMPTY        when n == 0
* ZS_FULL         when n == N
.. _zswap:

=====
zswap
=====

Overview
========
Zswap is a lightweight compressed cache for swap pages. It takes pages that are
in the process of being swapped out and attempts to compress them into a

@@ -7,32 +14,34 @@ for potentially reduced swap I/O. This trade-off can also result in a

significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.
.. note::
   Zswap is a new feature as of v3.11 and interacts heavily with memory
   reclaim. This interaction has not been fully explored on the large set of
   potential configurations and workloads that exist. For this reason, zswap
   is a work in progress and should be considered experimental.

Some potential benefits:

* Desktop/laptop users with limited RAM capacities can mitigate the
  performance impact of swapping.
* Overcommitted guests that share a common I/O resource can
  dramatically reduce their swap I/O pressure, avoiding heavy handed I/O
  throttling by the hypervisor. This allows more work to get done with less
  impact to the guest workload and guests sharing the I/O subsystem
* Users with SSDs as swap devices can extend the life of the device by
  drastically reducing life-shortening writes.
Zswap evicts pages from compressed cache on an LRU basis to the backing swap
device when the compressed pool reaches its size limit. This requirement had
been identified in prior community discussions.
Zswap is disabled by default but can be enabled at boot time by setting
the ``enabled`` attribute to 1, ie: ``zswap.enabled=1``. Zswap
can also be enabled and disabled at runtime using the sysfs interface.
An example command to enable zswap at runtime, assuming sysfs is mounted
at ``/sys``, is::

	echo 1 > /sys/module/zswap/parameters/enabled
When zswap is disabled at runtime it will stop storing pages that are
being swapped out. However, it will _not_ immediately write out or fault

@@ -43,7 +52,8 @@ pages out of the compressed pool, a swapoff on the swap device(s) will

fault back into memory all swapped out pages, including those in the
compressed pool.
Design
======

Zswap receives pages for compression through the Frontswap API and is able to
evict pages from its own compressed pool on an LRU basis and write them back to

@@ -53,12 +63,12 @@ Zswap makes use of zpool for the managing the compressed memory pool. Each
allocation in zpool is not directly accessible by address. Rather, a handle is allocation in zpool is not directly accessible by address. Rather, a handle is
returned by the allocation routine and that handle must be mapped before being returned by the allocation routine and that handle must be mapped before being
accessed. The compressed memory pool grows on demand and shrinks as compressed accessed. The compressed memory pool grows on demand and shrinks as compressed
pages are freed. The pool is not preallocated. By default, a zpool of type pages are freed. The pool is not preallocated. By default, a zpool
zbud is created, but it can be selected at boot time by setting the "zpool" of type zbud is created, but it can be selected at boot time by
attribute, e.g. zswap.zpool=zbud. It can also be changed at runtime using the setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can
sysfs "zpool" attribute, e.g. also be changed at runtime using the sysfs ``zpool`` attribute, e.g.::
echo zbud > /sys/module/zswap/parameters/zpool echo zbud > /sys/module/zswap/parameters/zpool
The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
means the compression ratio will always be 2:1 or worse (because of half-full
...@@ -83,14 +93,16 @@ via frontswap, to free the compressed entry.
Zswap seeks to be simple in its policies. Sysfs attributes allow for one
user-controlled policy:

* max_pool_percent - The maximum percentage of memory that the compressed
  pool can occupy.
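For example, to cap the compressed pool at 25% of total memory at runtime
(25 is a hypothetical value chosen only for illustration), assuming sysfs
is mounted at ``/sys``::

	echo 25 > /sys/module/zswap/parameters/max_pool_percent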
The default compressor is lzo, but it can be selected at boot time by
setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
It can also be changed at runtime using the sysfs ``compressor``
attribute, e.g.::

	echo lzo > /sys/module/zswap/parameters/compressor
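The boot-time attributes can also be combined on the kernel command line. A
hypothetical example; the compressor and zpool names are illustrative and
must be available in the running kernel::

	zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=zbud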
When the zpool and/or compressor parameter is changed at runtime, any existing
compressed pages are not modified; they are left in their own zpool. When a
...@@ -106,11 +118,12 @@ compressed length of the page is set to zero and the pattern or same-filled
value is stored.
The same-value filled pages identification feature is enabled by default and
can be disabled at boot time by setting the ``same_filled_pages_enabled``
attribute to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be
enabled and disabled at runtime using the sysfs ``same_filled_pages_enabled``
attribute, e.g.::

	echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled
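The effect of these settings can be observed through zswap's debugfs
counters. A sketch, assuming debugfs is mounted at ``/sys/kernel/debug``;
the exact set of counters varies by kernel version::

	grep -H . /sys/kernel/debug/zswap/*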
When zswap same-filled page identification is disabled at runtime, it will stop
checking for same-value filled pages during store operations. However, the
......
SPDX-Exception-Identifier: Linux-syscall-note
SPDX-URL: https://spdx.org/licenses/Linux-syscall-note.html
SPDX-Licenses: GPL-2.0, GPL-2.0+, GPL-1.0+, LGPL-2.0, LGPL-2.0+, LGPL-2.1, LGPL-2.1+, GPL-2.0-only, GPL-2.0-or-later
Usage-Guide:
This exception is used together with one of the above SPDX-Licenses
to mark user space API (uapi) header files so they can be included
......
Valid-License-Identifier: Apache-2.0
SPDX-URL: https://spdx.org/licenses/Apache-2.0.html
Usage-Guide:
To use the Apache License version 2.0 put the following SPDX tag/value
pair into a comment according to the placement guidelines in the
licensing rules documentation:
SPDX-License-Identifier: Apache-2.0
License-Text:
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and
distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the
copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other
entities that control, are controlled by, or are under common control with
that entity. For the purposes of this definition, "control" means (i) the
power, direct or indirect, to cause the direction or management of such
entity, whether by contract or otherwise, or (ii) ownership of fifty
percent (50%) or more of the outstanding shares, or (iii) beneficial
ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising
permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation source,
and configuration files.
"Object" form shall mean any form resulting from mechanical transformation
or translation of a Source form, including but not limited to compiled
object code, generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form,
made available under the License, as indicated by a copyright notice that
is included in or attached to the work (an example is provided in the
Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form,
that is based on (or derived from) the Work and for which the editorial
revisions, annotations, elaborations, or other modifications represent, as
a whole, an original work of authorship. For the purposes of this License,
Derivative Works shall not include works that remain separable from, or
merely link (or bind by name) to the interfaces of, the Work and Derivative
Works thereof.
"Contribution" shall mean any work of authorship, including the original
version of the Work and any modifications or additions to that Work or
Derivative Works thereof, that is intentionally submitted to Licensor for
inclusion in the Work by the copyright owner or by an individual or Legal
Entity authorized to submit on behalf of the copyright owner. For the
purposes of this definition, "submitted" means any form of electronic,
verbal, or written communication sent to the Licensor or its
representatives, including but not limited to communication on electronic
mailing lists, source code control systems, and issue tracking systems that
are managed by, or on behalf of, the Licensor for the purpose of discussing
and improving the Work, but excluding communication that is conspicuously
marked or otherwise designated in writing by the copyright owner as "Not a
Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on
behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this
License, each Contributor hereby grants to You a perpetual, worldwide,
non-exclusive, no-charge, royalty-free, irrevocable copyright license to
reproduce, prepare Derivative Works of, publicly display, publicly
perform, sublicense, and distribute the Work and such Derivative Works
in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of this
License, each Contributor hereby grants to You a perpetual, worldwide,
non-exclusive, no-charge, royalty-free, irrevocable (except as stated in
this section) patent license to make, have made, use, offer to sell,
sell, import, and otherwise transfer the Work, where such license
applies only to those patent claims licensable by such Contributor that
are necessarily infringed by their Contribution(s) alone or by
combination of their Contribution(s) with the Work to which such
Contribution(s) was submitted. If You institute patent litigation
against any entity (including a cross-claim or counterclaim in a
lawsuit) alleging that the Work or a Contribution incorporated within
the Work constitutes direct or contributory patent infringement, then
any patent licenses granted to You under this License for that Work
shall terminate as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or
Derivative Works thereof in any medium, with or without modifications,
and in Source or Object form, provided that You meet the following
conditions:
a. You must give any other recipients of the Work or Derivative Works a
copy of this License; and
b. You must cause any modified files to carry prominent notices stating
that You changed the files; and
c. You must retain, in the Source form of any Derivative Works that You
distribute, all copyright, patent, trademark, and attribution notices
from the Source form of the Work, excluding those notices that do not
pertain to any part of the Derivative Works; and
d. If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained within
such NOTICE file, excluding those notices that do not pertain to any
part of the Derivative Works, in at least one of the following
places: within a NOTICE text file distributed as part of the
Derivative Works; within the Source form or documentation, if
provided along with the Derivative Works; or, within a display
generated by the Derivative Works, if and wherever such third-party
notices normally appear. The contents of the NOTICE file are for
informational purposes only and do not modify the License. You may
add Your own attribution notices within Derivative Works that You
distribute, alongside or as an addendum to the NOTICE text from the
Work, provided that such additional attribution notices cannot be
construed as modifying the License.
You may add Your own copyright statement to Your modifications and may
provide additional or different license terms and conditions for use,
reproduction, or distribution of Your modifications, or for any such
Derivative Works as a whole, provided Your use, reproduction, and
distribution of the Work otherwise complies with the conditions stated
in this License.
5. Submission of Contributions. Unless You explicitly state otherwise, any
Contribution intentionally submitted for inclusion in the Work by You to
the Licensor shall be under the terms and conditions of this License,
without any additional terms or conditions. Notwithstanding the above,
nothing herein shall supersede or modify the terms of any separate
license agreement you may have executed with Licensor regarding such
Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to
in writing, Licensor provides the Work (and each Contributor provides
its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied, including, without limitation,
any warranties or conditions of TITLE, NON-INFRINGEMENT,
MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely
responsible for determining the appropriateness of using or
redistributing the Work and assume any risks associated with Your
exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether
in tort (including negligence), contract, or otherwise, unless required
by applicable law (such as deliberate and grossly negligent acts) or
agreed to in writing, shall any Contributor be liable to You for
damages, including any direct, indirect, special, incidental, or
consequential damages of any character arising as a result of this
License or out of the use or inability to use the Work (including but
not limited to damages for loss of goodwill, work stoppage, computer
failure or malfunction, or any and all other commercial damages or
losses), even if such Contributor has been advised of the possibility of
such damages.
9. Accepting Warranty or Additional Liability. While redistributing the
Work or Derivative Works thereof, You may choose to offer, and charge a
fee for, acceptance of support, warranty, indemnity, or other liability
obligations and/or rights consistent with this License. However, in
accepting such obligations, You may act only on Your own behalf and on
Your sole responsibility, not on behalf of any other Contributor, and
only if You agree to indemnify, defend, and hold each Contributor
harmless for any liability incurred by, or claims asserted against, such
Contributor by reason of your accepting any such warranty or additional
liability.
END OF TERMS AND CONDITIONS
Valid-License-Identifier: CC-BY-SA-4.0
SPDX-URL: https://spdx.org/licenses/CC-BY-SA-4.0
Usage-Guide:
To use the Creative Commons Attribution Share Alike 4.0 International
license put the following SPDX tag/value pair into a comment according to
the placement guidelines in the licensing rules documentation:
SPDX-License-Identifier: CC-BY-SA-4.0
License-Text:
Creative Commons Attribution-ShareAlike 4.0 International
Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of Creative
Commons public licenses does not create a lawyer-client or other
relationship. Creative Commons makes its licenses and related information
available on an "as-is" basis. Creative Commons gives no warranties
regarding its licenses, any material licensed under their terms and
conditions, or any related information. Creative Commons disclaims all
liability for damages resulting from their use to the fullest extent
possible.
Using Creative Commons Public Licenses
Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share original
works of authorship and other material subject to copyright and certain
other rights specified in the public license below. The following
considerations are for informational purposes only, are not exhaustive, and
do not form part of our licenses.
Considerations for licensors: Our public licenses are intended for use by
those authorized to give the public permission to use material in ways
otherwise restricted by copyright and certain other rights. Our licenses
are irrevocable. Licensors should read and understand the terms and
conditions of the license they choose before applying it. Licensors should
also secure all rights necessary before applying our licenses so that the
public can reuse the material as expected. Licensors should clearly mark
any material not subject to the license. This includes other CC-licensed
material, or material used under an exception or limitation to
copyright. More considerations for licensors :
wiki.creativecommons.org/Considerations_for_licensors
Considerations for the public: By using one of our public licenses, a
licensor grants the public permission to use the licensed material under
specified terms and conditions. If the licensor's permission is not
necessary for any reason - for example, because of any applicable exception
or limitation to copyright - then that use is not regulated by the
license. Our licenses grant only permissions under copyright and certain
other rights that a licensor has authority to grant. Use of the licensed
material may still be restricted for other reasons, including because
others have copyright or other rights in the material. A licensor may make
special requests, such as asking that all changes be marked or described.
Although not required by our licenses, you are encouraged to respect those
requests where reasonable. More considerations for the public :
wiki.creativecommons.org/Considerations_for_licensees
Creative Commons Attribution-ShareAlike 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree to
be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You such
rights in consideration of benefits the Licensor receives from making the
Licensed Material available under these terms and conditions.
Section 1 - Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material and
in which the Licensed Material is translated, altered, arranged,
transformed, or otherwise modified in a manner requiring permission
under the Copyright and Similar Rights held by the Licensor. For
purposes of this Public License, where the Licensed Material is a
musical work, performance, or sound recording, Adapted Material is
always produced where the Licensed Material is synched in timed
relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright and
Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. BY-SA Compatible License means a license listed at
creativecommons.org/compatiblelicenses, approved by Creative Commons
as essentially the equivalent of this Public License.
d. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
e. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright Treaty
adopted on December 20, 1996, and/or similar international
agreements.
f. Exceptions and Limitations means fair use, fair dealing, and/or any
other exception or limitation to Copyright and Similar Rights that
applies to Your use of the Licensed Material.
g. License Elements means the license attributes listed in the name of
a Creative Commons Public License. The License Elements of this
Public License are Attribution and ShareAlike.
h. Licensed Material means the artistic or literary work, database, or
other material to which the Licensor applied this Public License.
i. Licensed Rights means the rights granted to You subject to the terms
and conditions of this Public License, which are limited to all
Copyright and Similar Rights that apply to Your use of the Licensed
Material and that the Licensor has authority to license.
j. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
k. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such as
reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the public
may access the material from a place and at a time individually
chosen by them.
l. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially equivalent
rights anywhere in the world.
m. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 - Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License, the
Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
A. reproduce and Share the Licensed Material, in whole or in part; and
B. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with its
terms and conditions.
3. Term. The term of this Public License is specified in Section 6(a).
4. Media and formats; technical modifications allowed. The Licensor
authorizes You to exercise the Licensed Rights in all media and
formats whether now known or hereafter created, and to make
technical modifications necessary to do so. The Licensor waives
and/or agrees not to assert any right or authority to forbid You
from making technical modifications necessary to exercise the
Licensed Rights, including technical modifications necessary to
circumvent Effective Technological Measures. For purposes of
this Public License, simply making modifications authorized by
this Section 2(a)(4) never produces Adapted Material.
5. Downstream recipients.
A. Offer from the Licensor - Licensed Material. Every recipient
of the Licensed Material automatically receives an offer
from the Licensor to exercise the Licensed Rights under the
terms and conditions of this Public License.
B. Additional offer from the Licensor - Adapted Material. Every
recipient of Adapted Material from You automatically
receives an offer from the Licensor to exercise the Licensed
Rights in the Adapted Material under the conditions of the
Adapter's License You apply.
C. No downstream restrictions. You may not offer or impose any
additional or different terms or conditions on, or apply any
Effective Technological Measures to, the Licensed Material
if doing so restricts exercise of the Licensed Rights by any
recipient of the Licensed Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You are,
or that Your use of the Licensed Material is, connected with, or
sponsored, endorsed, or granted official status by, the Licensor
or others designated to receive attribution as provided in
Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not licensed
under this Public License, nor are publicity, privacy, and/or
other similar personality rights; however, to the extent
possible, the Licensor waives and/or agrees not to assert any
such rights held by the Licensor to the limited extent necessary
to allow You to exercise the Licensed Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this Public
License.
3. To the extent possible, the Licensor waives any right to collect
royalties from You for the exercise of the Licensed Rights,
whether directly or through a collecting society under any
voluntary or waivable statutory or compulsory licensing
scheme. In all other cases the Licensor expressly reserves any
right to collect such royalties.
Section 3 - License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified form),
You must:
A. retain the following if it is supplied by the Licensor with
the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by the
Licensor (including by pseudonym if designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of warranties;
v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
B. indicate if You modified the Licensed Material and retain an
indication of any previous modifications; and
C. indicate the Licensed Material is licensed under this Public
License, and include the text of, or the URI or hyperlink to,
this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
b. ShareAlike. In addition to the conditions in Section 3(a), if
You Share Adapted Material You produce, the following conditions
also apply.
1. The Adapter's License You apply must be a Creative Commons
license with the same License Elements, this version or
later, or a BY-SA Compatible License.
2. You must include the text of, or the URI or hyperlink to, the
Adapter's License You apply. You may satisfy this condition
in any reasonable manner based on the medium, means, and
context in which You Share Adapted Material.
3. You may not offer or impose any additional or different terms
or conditions on, or apply any Effective Technological
Measures to, Adapted Material that restrict exercise of the
rights granted under the Adapter's License You apply.
Section 4 - Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that apply to
Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right to
extract, reuse, reproduce, and Share all or a substantial portion of
the contents of the database;
b. if You include all or a substantial portion of the database contents
in a database in which You have Sui Generis Database Rights, then
the database in which You have Sui Generis Database Rights (but not
its individual contents) is Adapted Material, including for purposes
of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if You Share all
or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 - Disclaimer of Warranties and Limitation of Liability.
a. Unless otherwise separately undertaken by the Licensor, to the
extent possible, the Licensor offers the Licensed Material as-is and
as-available, and makes no representations or warranties of any kind
concerning the Licensed Material, whether express, implied,
statutory, or other. This includes, without limitation, warranties
of title, merchantability, fitness for a particular purpose,
non-infringement, absence of latent or other defects, accuracy, or
the presence or absence of errors, whether or not known or
discoverable. Where disclaimers of warranties are not allowed in
full or in part, this disclaimer may not apply to You.
b. To the extent possible, in no event will the Licensor be liable to
You on any legal theory (including, without limitation, negligence)
or otherwise for any direct, special, indirect, incidental,
consequential, punitive, exemplary, or other losses, costs,
expenses, or damages arising out of this Public License or use of
the Licensed Material, even if the Licensor has been advised of the
possibility of such losses, costs, expenses, or damages. Where a
limitation of liability is not allowed in full or in part, this
limitation may not apply to You.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent possible,
most closely approximates an absolute disclaimer and waiver of all
liability.
Section 6 - Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided it
is cured within 30 days of Your discovery of the violation; or
2. upon express reinstatement by the Licensor.
c. For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations of
this Public License.
d. For the avoidance of doubt, the Licensor may also offer the Licensed
Material under separate terms or conditions or stop distributing the
Licensed Material at any time; however, doing so will not terminate
this Public License.
e. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
Section 7 - Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different terms
or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 - Interpretation.
a. For the avoidance of doubt, this Public License does not, and shall
not be interpreted to, reduce, limit, restrict, or impose conditions
on any use of the Licensed Material that could lawfully be made
without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted as
a limitation upon, or waiver of, any privileges and immunities that
apply to the Licensor or You, including from the legal processes of
any jurisdiction or authority.
Creative Commons is not a party to its public licenses. Notwithstanding,
Creative Commons may elect to apply one of its public licenses to material
it publishes and in those instances will be considered the "Licensor." The
text of the Creative Commons public licenses is dedicated to the public
domain under the CC0 Public Domain Dedication. Except for the limited
purpose of indicating that material is shared under a Creative Commons
public license or as otherwise permitted by the Creative Commons policies
published at creativecommons.org/policies, Creative Commons does not
authorize the use of the trademark "Creative Commons" or any other
trademark or logo of Creative Commons without its prior written consent
including, without limitation, in connection with any unauthorized
modifications to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For the
avoidance of doubt, this paragraph does not form part of the public
licenses.
Creative Commons may be contacted at creativecommons.org.
Valid-License-Identifier: CDDL-1.0
SPDX-URL: https://spdx.org/licenses/CDDL-1.0.html
Usage-Guide:
To use the Common Development and Distribution License 1.0 put the
following SPDX tag/value pair into a comment according to the placement
guidelines in the licensing rules documentation:
SPDX-License-Identifier: CDDL-1.0
License-Text:
COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL)
Version 1.0
1. Definitions.
1.1. "Contributor" means each individual or entity that creates or
contributes to the creation of Modifications.
1.2. "Contributor Version" means the combination of the Original
Software, prior Modifications used by a Contributor (if any),
and the Modifications made by that particular Contributor.
1.3. "Covered Software" means (a) the Original Software, or (b)
Modifications, or (c) the combination of files containing
Original Software with files containing Modifications, in each
case including portions thereof.
1.4. "Executable" means the Covered Software in any form other than
Source Code.
1.5. "Initial Developer" means the individual or entity that first
makes Original Software available under this License.
1.6. "Larger Work" means a work which combines Covered Software or
portions thereof with code not governed by the terms of this
License.
1.7. "License" means this document.
1.8. "Licensable" means having the right to grant, to the maximum
extent possible, whether at the time of the initial grant or
subsequently acquired, any and all of the rights conveyed herein.
1.9. "Modifications" means the Source Code and Executable form of
any of the following:
A. Any file that results from an addition to, deletion from or
modification of the contents of a file containing Original
Software or previous Modifications;
B. Any new file that contains any part of the Original Software
or previous Modification; or
C. Any new file that is contributed or otherwise made available
under the terms of this License.
1.10. "Original Software" means the Source Code and Executable form
of computer software code that is originally released under
this License.
1.11. "Patent Claims" means any patent claim(s), now owned or
hereafter acquired, including without limitation, method,
process, and apparatus claims, in any patent Licensable by
grantor.
1.12. "Source Code" means (a) the common form of computer software
code in which modifications are made and (b) associated
documentation included in or with such code.
1.13. "You" (or "Your") means an individual or a legal entity
exercising rights under, and complying with all of the terms
of, this License. For legal entities, "You" includes any
entity which controls, is controlled by, or is under common
control with You. For purposes of this definition, "control"
means (a) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract
or otherwise, or (b) ownership of more than fifty percent
(50%) of the outstanding shares or beneficial ownership of
such entity.
2. License Grants.
2.1. The Initial Developer Grant.
Conditioned upon Your compliance with Section 3.1 below and subject
to third party intellectual property claims, the Initial Developer
hereby grants You a world-wide, royalty-free, non-exclusive
license:
(a) under intellectual property rights (other than patent or
trademark) Licensable by Initial Developer, to use,
reproduce, modify, display, perform, sublicense and
distribute the Original Software (or portions thereof),
with or without Modifications, and/or as part of a Larger
Work; and
(b) under Patent Claims infringed by the making, using or
selling of Original Software, to make, have made, use,
practice, sell, and offer for sale, and/or otherwise
dispose of the Original Software (or portions thereof).
(c) The licenses granted in Sections 2.1(a) and (b) are
effective on the date Initial Developer first distributes
or otherwise makes the Original Software available to a
third party under the terms of this License.
(d) Notwithstanding Section 2.1(b) above, no patent license is
granted: (1) for code that You delete from the Original
Software, or (2) for infringements caused by: (i) the
modification of the Original Software, or (ii) the
combination of the Original Software with other software or
devices.
2.2. Contributor Grant.
Conditioned upon Your compliance with Section 3.1 below and subject
to third party intellectual property claims, each Contributor
hereby grants You a world-wide, royalty-free, non-exclusive
license:
(a) under intellectual property rights (other than patent or
trademark) Licensable by Contributor to use, reproduce,
modify, display, perform, sublicense and distribute the
Modifications created by such Contributor (or portions
thereof), either on an unmodified basis, with other
Modifications, as Covered Software and/or as part of a
Larger Work; and
(b) under Patent Claims infringed by the making, using, or
selling of Modifications made by that Contributor either
alone and/or in combination with its Contributor Version
(or portions of such combination), to make, use, sell,
offer for sale, have made, and/or otherwise dispose of: (1)
Modifications made by that Contributor (or portions
thereof); and (2) the combination of Modifications made by
that Contributor with its Contributor Version (or portions
of such combination).
(c) The licenses granted in Sections 2.2(a) and 2.2(b) are
effective on the date Contributor first distributes or
otherwise makes the Modifications available to a third
party.
(d) Notwithstanding Section 2.2(b) above, no patent license is
granted: (1) for any code that Contributor has deleted from
the Contributor Version; (2) for infringements caused by:
(i) third party modifications of Contributor Version, or
(ii) the combination of Modifications made by that
Contributor with other software (except as part of the
Contributor Version) or other devices; or (3) under Patent
Claims infringed by Covered Software in the absence of
Modifications made by that Contributor.
3. Distribution Obligations.
3.1. Availability of Source Code.
Any Covered Software that You distribute or otherwise make
available in Executable form must also be made available in Source
Code form and that Source Code form must be distributed only under
the terms of this License. You must include a copy of this License
with every copy of the Source Code form of the Covered Software You
distribute or otherwise make available. You must inform recipients
of any such Covered Software in Executable form as to how they can
obtain such Covered Software in Source Code form in a reasonable
manner on or through a medium customarily used for software
exchange.
3.2. Modifications.
The Modifications that You create or to which You contribute are
governed by the terms of this License. You represent that You
believe Your Modifications are Your original creation(s) and/or You
have sufficient rights to grant the rights conveyed by this
License.
3.3. Required Notices.
You must include a notice in each of Your Modifications that
identifies You as the Contributor of the Modification. You may not
remove or alter any copyright, patent or trademark notices
contained within the Covered Software, or any notices of licensing
or any descriptive text giving attribution to any Contributor or
the Initial Developer.
3.4. Application of Additional Terms.
You may not offer or impose any terms on any Covered Software in
Source Code form that alters or restricts the applicable version of
this License or the recipients' rights hereunder. You may choose to
offer, and to charge a fee for, warranty, support, indemnity or
liability obligations to one or more recipients of Covered
Software. However, you may do so only on Your own behalf, and not
on behalf of the Initial Developer or any Contributor. You must
make it absolutely clear that any such warranty, support, indemnity
or liability obligation is offered by You alone, and You hereby
agree to indemnify the Initial Developer and every Contributor for
any liability incurred by the Initial Developer or such Contributor
as a result of warranty, support, indemnity or liability terms You
offer.
3.5. Distribution of Executable Versions.
You may distribute the Executable form of the Covered Software
under the terms of this License or under the terms of a license of
Your choice, which may contain terms different from this License,
provided that You are in compliance with the terms of this License
and that the license for the Executable form does not attempt to
limit or alter the recipient's rights in the Source Code form from
the rights set forth in this License. If You distribute the Covered
Software in Executable form under a different license, You must
make it absolutely clear that any terms which differ from this
License are offered by You alone, not by the Initial Developer or
Contributor. You hereby agree to indemnify the Initial Developer
and every Contributor for any liability incurred by the Initial
Developer or such Contributor as a result of any such terms You
offer.
3.6. Larger Works.
You may create a Larger Work by combining Covered Software with
other code not governed by the terms of this License and distribute
the Larger Work as a single product. In such a case, You must make
sure the requirements of this License are fulfilled for the Covered
Software.
4. Versions of the License.
4.1. New Versions.
Sun Microsystems, Inc. is the initial license steward and may
publish revised and/or new versions of this License from time to
time. Each version will be given a distinguishing version
number. Except as provided in Section 4.3, no one other than the
license steward has the right to modify this License.
4.2. Effect of New Versions.
You may always continue to use, distribute or otherwise make the
Covered Software available under the terms of the version of the
License under which You originally received the Covered
Software. If the Initial Developer includes a notice in the
Original Software prohibiting it from being distributed or
otherwise made available under any subsequent version of the
License, You must distribute and make the Covered Software
available under the terms of the version of the License under which
You originally received the Covered Software. Otherwise, You may
also choose to use, distribute or otherwise make the Covered
Software available under the terms of any subsequent version of the
License published by the license steward.
4.3. Modified Versions.
When You are an Initial Developer and You want to create a new
license for Your Original Software, You may create and use a
modified version of this License if You: (a) rename the license and
remove any references to the name of the license steward (except to
note that the license differs from this License); and (b) otherwise
make it clear that the license contains terms which differ from
this License.
5. DISCLAIMER OF WARRANTY.
COVERED SOFTWARE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS,
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
WITHOUT LIMITATION, WARRANTIES THAT THE COVERED SOFTWARE IS FREE OF
DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR
NON-INFRINGING. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF
THE COVERED SOFTWARE IS WITH YOU. SHOULD ANY COVERED SOFTWARE PROVE
DEFECTIVE IN ANY RESPECT, YOU (NOT THE INITIAL DEVELOPER OR ANY OTHER
CONTRIBUTOR) ASSUME THE COST OF ANY NECESSARY SERVICING, REPAIR OR
CORRECTION. THIS DISCLAIMER OF WARRANTY CONSTITUTES AN ESSENTIAL PART
OF THIS LICENSE. NO USE OF ANY COVERED SOFTWARE IS AUTHORIZED HEREUNDER
EXCEPT UNDER THIS DISCLAIMER.
6. TERMINATION.
6.1. This License and the rights granted hereunder will terminate
automatically if You fail to comply with terms herein and fail to
cure such breach within 30 days of becoming aware of the
breach. Provisions which, by their nature, must remain in effect
beyond the termination of this License shall survive.
6.2. If You assert a patent infringement claim (excluding
declaratory judgment actions) against Initial Developer or a
Contributor (the Initial Developer or Contributor against whom You
assert such claim is referred to as "Participant") alleging that
the Participant Software (meaning the Contributor Version where the
Participant is a Contributor or the Original Software where the
Participant is the Initial Developer) directly or indirectly
infringes any patent, then any and all rights granted directly or
indirectly to You by such Participant, the Initial Developer (if
the Initial Developer is not the Participant) and all Contributors
under Sections 2.1 and/or 2.2 of this License shall, upon 60 days
notice from Participant terminate prospectively and automatically
at the expiration of such 60 day notice period, unless if within
such 60 day period You withdraw Your claim with respect to the
Participant Software against such Participant either unilaterally
or pursuant to a written agreement with Participant.
6.3. In the event of termination under Sections 6.1 or 6.2 above,
all end user licenses that have been validly granted by You or any
distributor hereunder prior to termination (excluding licenses
granted to You by any distributor) shall survive termination.
7. LIMITATION OF LIABILITY.
UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, WHETHER TORT
(INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL YOU, THE INITIAL
DEVELOPER, ANY OTHER CONTRIBUTOR, OR ANY DISTRIBUTOR OF COVERED
SOFTWARE, OR ANY SUPPLIER OF ANY OF SUCH PARTIES, BE LIABLE TO ANY
PERSON FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
OF ANY CHARACTER INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOST
PROFITS, LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE OR
MALFUNCTION, OR ANY AND ALL OTHER COMMERCIAL DAMAGES OR LOSSES, EVEN IF
SUCH PARTY SHALL HAVE BEEN INFORMED OF THE POSSIBILITY OF SUCH
DAMAGES. THIS LIMITATION OF LIABILITY SHALL NOT APPLY TO LIABILITY FOR
DEATH OR PERSONAL INJURY RESULTING FROM SUCH PARTY'S NEGLIGENCE TO THE
EXTENT APPLICABLE LAW PROHIBITS SUCH LIMITATION. SOME JURISDICTIONS DO
NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL OR CONSEQUENTIAL
DAMAGES, SO THIS EXCLUSION AND LIMITATION MAY NOT APPLY TO YOU.
8. U.S. GOVERNMENT END USERS.
The Covered Software is a "commercial item," as that term is defined in
48 C.F.R. 2.101 (Oct. 1995), consisting of "commercial computer
software" (as that term is defined at 48 C.F.R. $ 252.227-7014(a)(1))
and "commercial computer software documentation" as such terms are used
in 48 C.F.R. 12.212 (Sept. 1995). Consistent with 48 C.F.R. 12.212 and
48 C.F.R. 227.7202-1 through 227.7202-4 (June 1995), all
U.S. Government End Users acquire Covered Software with only those
rights set forth herein. This U.S. Government Rights clause is in lieu
of, and supersedes, any other FAR, DFAR, or other clause or provision
that addresses Government rights in computer software under this
License.
9. MISCELLANEOUS.
This License represents the complete agreement concerning subject
matter hereof. If any provision of this License is held to be
unenforceable, such provision shall be reformed only to the extent
necessary to make it enforceable. This License shall be governed by the
law of the jurisdiction specified in a notice contained within the
Original Software (except to the extent applicable law, if any,
provides otherwise), excluding such jurisdiction's conflict-of-law
provisions. Any litigation relating to this License shall be subject to
the jurisdiction of the courts located in the jurisdiction and venue
specified in a notice contained within the Original Software, with the
losing party responsible for costs, including, without limitation,
court costs and reasonable attorneys' fees and expenses. The
application of the United Nations Convention on Contracts for the
International Sale of Goods is expressly excluded. Any law or
regulation which provides that the language of a contract shall be
construed against the drafter shall not apply to this License. You
agree that You alone are responsible for compliance with the United
States export administration regulations (and the export control laws
and regulation of any other countries) when You use, distribute or
otherwise make available any Covered Software.
10. RESPONSIBILITY FOR CLAIMS.
As between Initial Developer and the Contributors, each party is
responsible for claims and damages arising, directly or indirectly, out
of its utilization of rights under this License and You agree to work
with Initial Developer and Contributors to distribute such
responsibility on an equitable basis. Nothing herein is intended or
shall be deemed to constitute any admission of liability.
Valid-License-Identifier: Linux-OpenIB
SPDX-URL: https://spdx.org/licenses/Linux-OpenIB.html
Usage-Guide:
To use the Linux Kernel Variant of OpenIB.org license put the following
SPDX tag/value pair into a comment according to the placement guidelines
in the licensing rules documentation:
SPDX-License-Identifier: Linux-OpenIB
License-Text:
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
Valid-License-Identifier: X11
SPDX-URL: https://spdx.org/licenses/X11.html
Usage-Guide:
To use the X11 License put the following SPDX tag/value pair into a comment
according to the placement guidelines in the licensing rules
documentation:
SPDX-License-Identifier: X11
License-Text:
X11 License
Copyright (C) 1996 X Consortium
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
X CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Except as contained in this notice, the name of the X Consortium shall not
be used in advertising or otherwise to promote the sale, use or other
dealings in this Software without prior written authorization from the X
Consortium.
X Window System is a trademark of X Consortium, Inc.
Valid-License-Identifier: GPL-2.0
Valid-License-Identifier: GPL-2.0-only
Valid-License-Identifier: GPL-2.0+
Valid-License-Identifier: GPL-2.0-or-later
SPDX-URL: https://spdx.org/licenses/GPL-2.0.html
Usage-Guide:
To use this license in source code, put one of the following SPDX
...@@ -7,8 +9,12 @@ Usage-Guide:
guidelines in the licensing rules documentation.
For 'GNU General Public License (GPL) version 2 only' use:
SPDX-License-Identifier: GPL-2.0
or
SPDX-License-Identifier: GPL-2.0-only
For 'GNU General Public License (GPL) version 2 or any later version' use:
SPDX-License-Identifier: GPL-2.0+
or
SPDX-License-Identifier: GPL-2.0-or-later
License-Text:
GNU GENERAL PUBLIC LICENSE
......
...@@ -15642,7 +15642,7 @@ L: linux-mm@kvack.org
S: Maintained
F: mm/zsmalloc.c
F: include/linux/zsmalloc.h
F: Documentation/vm/zsmalloc.rst
ZSWAP COMPRESSED SWAP CACHING
M: Seth Jennings <sjenning@redhat.com>
......
...@@ -576,7 +576,7 @@ config ARCH_DISCONTIGMEM_ENABLE
Say Y to support efficient handling of discontiguous physical memory,
for architectures which are either NUMA (Non-Uniform Memory Access)
or have huge holes in the physical address space for other reasons.
See <file:Documentation/vm/numa.rst> for more.
source "mm/Kconfig"
......
...@@ -381,7 +381,7 @@ config ARCH_DISCONTIGMEM_ENABLE
Say Y to support efficient handling of discontiguous physical memory,
for architectures which are either NUMA (Non-Uniform Memory Access)
or have huge holes in the physical address space for other reasons.
See <file:Documentation/vm/numa.rst> for more.
config ARCH_FLATMEM_ENABLE
def_bool y
......
...@@ -2548,7 +2548,7 @@ config ARCH_DISCONTIGMEM_ENABLE
Say Y to support efficient handling of discontiguous physical memory,
for architectures which are either NUMA (Non-Uniform Memory Access)
or have huge holes in the physical address space for other reasons.
See <file:Documentation/vm/numa.rst> for more.
config ARCH_SPARSEMEM_ENABLE
bool
......
...@@ -865,7 +865,7 @@ config PPC_MEM_KEYS
page-based protections, but without requiring modification of the
page tables when an application changes protection domains.
For details, see Documentation/vm/protection-keys.rst
If unsure, say y.
......
...@@ -194,6 +194,7 @@ static u8 w1_read_bit(struct w1_master *dev)
* bit 0 = id_bit
* bit 1 = comp_bit
* bit 2 = dir_taken
*
* If both bits 0 & 1 are set, the search should be restarted.
*
* Return: bit fields - see above
......
...@@ -196,7 +196,7 @@ config HUGETLBFS
help
hugetlbfs is a filesystem backing for HugeTLB pages, based on
ramfs. For architectures that support it, say Y here and read
<file:Documentation/admin-guide/mm/hugetlbpage.rst> for details.
If unsure, say N.
......
...@@ -677,7 +677,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, ...@@ -677,7 +677,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
* downgrading page table protection not changing it to point * downgrading page table protection not changing it to point
* to a new page. * to a new page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
if (pmdp) { if (pmdp) {
#ifdef CONFIG_FS_DAX_PMD #ifdef CONFIG_FS_DAX_PMD
......
...@@ -937,7 +937,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, ...@@ -937,7 +937,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
/* /*
* The soft-dirty tracker uses #PF-s to catch writes * The soft-dirty tracker uses #PF-s to catch writes
* to pages, so write-protect the pte as well. See the * to pages, so write-protect the pte as well. See the
* Documentation/vm/soft-dirty.txt for full description * Documentation/admin-guide/mm/soft-dirty.rst for full description
* of how soft-dirty works. * of how soft-dirty works.
*/ */
pte_t ptent = *pte; pte_t ptent = *pte;
...@@ -1421,7 +1421,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, ...@@ -1421,7 +1421,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
* Bits 0-54 page frame number (PFN) if present * Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped * Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped * Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt) * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped * Bit 56 page exclusively mapped
* Bits 57-60 zero * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon * Bit 61 page is file-page or shared-anon
......
...@@ -16,7 +16,7 @@ ...@@ -16,7 +16,7 @@
/* /*
* Heterogeneous Memory Management (HMM) * Heterogeneous Memory Management (HMM)
* *
* See Documentation/vm/hmm.txt for reasons and overview of what HMM is and it * See Documentation/vm/hmm.rst for reasons and overview of what HMM is and it
* is for. Here we focus on the HMM API description, with some explanation of * is for. Here we focus on the HMM API description, with some explanation of
* the underlying implementation. * the underlying implementation.
* *
......
...@@ -45,7 +45,7 @@ struct vmem_altmap { ...@@ -45,7 +45,7 @@ struct vmem_altmap {
* must be treated as an opaque object, rather than a "normal" struct page. * must be treated as an opaque object, rather than a "normal" struct page.
* *
* A more complete discussion of unaddressable memory may be found in * A more complete discussion of unaddressable memory may be found in
* include/linux/hmm.h and Documentation/vm/hmm.txt. * include/linux/hmm.h and Documentation/vm/hmm.rst.
* *
* MEMORY_DEVICE_PUBLIC: * MEMORY_DEVICE_PUBLIC:
* Device memory that is cache coherent from device and CPU point of view. This * Device memory that is cache coherent from device and CPU point of view. This
...@@ -67,7 +67,7 @@ enum memory_type { ...@@ -67,7 +67,7 @@ enum memory_type {
* page_free() * page_free()
* *
* Additional notes about MEMORY_DEVICE_PRIVATE may be found in * Additional notes about MEMORY_DEVICE_PRIVATE may be found in
* include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief * include/linux/hmm.h and Documentation/vm/hmm.rst. There is also a brief
* explanation in include/linux/memory_hotplug.h. * explanation in include/linux/memory_hotplug.h.
* *
* The page_fault() callback must migrate page back, from device memory to * The page_fault() callback must migrate page back, from device memory to
......
...@@ -174,7 +174,7 @@ struct mmu_notifier_ops { ...@@ -174,7 +174,7 @@ struct mmu_notifier_ops {
* invalidate_range_start()/end() notifiers, as * invalidate_range_start()/end() notifiers, as
* invalidate_range() alread catches the points in time when an * invalidate_range() alread catches the points in time when an
* external TLB range needs to be flushed. For more in depth * external TLB range needs to be flushed. For more in depth
* discussion on this see Documentation/vm/mmu_notifier.txt * discussion on this see Documentation/vm/mmu_notifier.rst
* *
* Note that this function might be called with just a sub-range * Note that this function might be called with just a sub-range
* of what was passed to invalidate_range_start()/end(), if * of what was passed to invalidate_range_start()/end(), if
......
...@@ -28,7 +28,7 @@ extern struct mm_struct *mm_alloc(void); ...@@ -28,7 +28,7 @@ extern struct mm_struct *mm_alloc(void);
* *
* Use mmdrop() to release the reference acquired by mmgrab(). * Use mmdrop() to release the reference acquired by mmgrab().
* *
* See also <Documentation/vm/active_mm.txt> for an in-depth explanation * See also <Documentation/vm/active_mm.rst> for an in-depth explanation
* of &mm_struct.mm_count vs &mm_struct.mm_users. * of &mm_struct.mm_count vs &mm_struct.mm_users.
*/ */
static inline void mmgrab(struct mm_struct *mm) static inline void mmgrab(struct mm_struct *mm)
...@@ -62,7 +62,7 @@ static inline void mmdrop(struct mm_struct *mm) ...@@ -62,7 +62,7 @@ static inline void mmdrop(struct mm_struct *mm)
* *
* Use mmput() to release the reference acquired by mmget(). * Use mmput() to release the reference acquired by mmget().
* *
* See also <Documentation/vm/active_mm.txt> for an in-depth explanation * See also <Documentation/vm/active_mm.rst> for an in-depth explanation
* of &mm_struct.mm_count vs &mm_struct.mm_users. * of &mm_struct.mm_count vs &mm_struct.mm_users.
*/ */
static inline void mmget(struct mm_struct *mm) static inline void mmget(struct mm_struct *mm)
...@@ -170,6 +170,17 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { } ...@@ -170,6 +170,17 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { }
static inline void fs_reclaim_release(gfp_t gfp_mask) { } static inline void fs_reclaim_release(gfp_t gfp_mask) { }
#endif #endif
/**
* memalloc_noio_save - Marks implicit GFP_NOIO allocation scope.
*
* This functions marks the beginning of the GFP_NOIO allocation scope.
* All further allocations will implicitly drop __GFP_IO flag and so
* they are safe for the IO critical section from the allocation recursion
* point of view. Use memalloc_noio_restore to end the scope with flags
* returned by this function.
*
* This function is safe to be used from any context.
*/
static inline unsigned int memalloc_noio_save(void) static inline unsigned int memalloc_noio_save(void)
{ {
unsigned int flags = current->flags & PF_MEMALLOC_NOIO; unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
...@@ -177,11 +188,30 @@ static inline unsigned int memalloc_noio_save(void) ...@@ -177,11 +188,30 @@ static inline unsigned int memalloc_noio_save(void)
return flags; return flags;
} }
/**
* memalloc_noio_restore - Ends the implicit GFP_NOIO scope.
* @flags: Flags to restore.
*
* Ends the implicit GFP_NOIO scope started by memalloc_noio_save function.
* Always make sure that that the given flags is the return value from the
* pairing memalloc_noio_save call.
*/
static inline void memalloc_noio_restore(unsigned int flags) static inline void memalloc_noio_restore(unsigned int flags)
{ {
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags; current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
} }
/**
* memalloc_nofs_save - Marks implicit GFP_NOFS allocation scope.
*
* This functions marks the beginning of the GFP_NOFS allocation scope.
* All further allocations will implicitly drop __GFP_FS flag and so
* they are safe for the FS critical section from the allocation recursion
* point of view. Use memalloc_nofs_restore to end the scope with flags
* returned by this function.
*
* This function is safe to be used from any context.
*/
static inline unsigned int memalloc_nofs_save(void) static inline unsigned int memalloc_nofs_save(void)
{ {
unsigned int flags = current->flags & PF_MEMALLOC_NOFS; unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
...@@ -189,6 +219,14 @@ static inline unsigned int memalloc_nofs_save(void) ...@@ -189,6 +219,14 @@ static inline unsigned int memalloc_nofs_save(void)
return flags; return flags;
} }
/**
* memalloc_nofs_restore - Ends the implicit GFP_NOFS scope.
* @flags: Flags to restore.
*
* Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function.
* Always make sure that that the given flags is the return value from the
* pairing memalloc_nofs_save call.
*/
static inline void memalloc_nofs_restore(unsigned int flags) static inline void memalloc_nofs_restore(unsigned int flags)
{ {
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags; current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
......
...@@ -53,7 +53,7 @@ static inline int current_is_kswapd(void) ...@@ -53,7 +53,7 @@ static inline int current_is_kswapd(void)
/* /*
* Unaddressable device memory support. See include/linux/hmm.h and * Unaddressable device memory support. See include/linux/hmm.h and
* Documentation/vm/hmm.txt. Short description is we need struct pages for * Documentation/vm/hmm.rst. Short description is we need struct pages for
* device memory that is unaddressable (inaccessible) by CPU, so that we can * device memory that is unaddressable (inaccessible) by CPU, so that we can
* migrate part of a process memory to device memory. * migrate part of a process memory to device memory.
* *
......
...@@ -305,7 +305,7 @@ config KSM ...@@ -305,7 +305,7 @@ config KSM
the many instances by a single page with that content, so the many instances by a single page with that content, so
saving memory until one or another app needs to modify the content. saving memory until one or another app needs to modify the content.
Recommended for use with KVM, or with other duplicative applications. Recommended for use with KVM, or with other duplicative applications.
See Documentation/vm/ksm.txt for more information: KSM is inactive See Documentation/vm/ksm.rst for more information: KSM is inactive
until a program has madvised that an area is MADV_MERGEABLE, and until a program has madvised that an area is MADV_MERGEABLE, and
root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set). root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
...@@ -530,7 +530,7 @@ config MEM_SOFT_DIRTY ...@@ -530,7 +530,7 @@ config MEM_SOFT_DIRTY
into a page just as regular dirty bit, but unlike the latter into a page just as regular dirty bit, but unlike the latter
it can be cleared by hands. it can be cleared by hands.
See Documentation/vm/soft-dirty.txt for more details. See Documentation/admin-guide/mm/soft-dirty.rst for more details.
config ZSWAP config ZSWAP
bool "Compressed cache for swap pages (EXPERIMENTAL)" bool "Compressed cache for swap pages (EXPERIMENTAL)"
...@@ -657,7 +657,8 @@ config IDLE_PAGE_TRACKING ...@@ -657,7 +657,8 @@ config IDLE_PAGE_TRACKING
be useful to tune memory cgroup limits and/or for job placement be useful to tune memory cgroup limits and/or for job placement
within a compute cluster. within a compute cluster.
See Documentation/vm/idle_page_tracking.txt for more details. See Documentation/admin-guide/mm/idle_page_tracking.rst for
more details.
# arch_add_memory() comprehends device memory # arch_add_memory() comprehends device memory
config ARCH_HAS_ZONE_DEVICE config ARCH_HAS_ZONE_DEVICE
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
* *
* This code provides the generic "frontend" layer to call a matching * This code provides the generic "frontend" layer to call a matching
* "backend" driver implementation of cleancache. See * "backend" driver implementation of cleancache. See
* Documentation/vm/cleancache.txt for more information. * Documentation/vm/cleancache.rst for more information.
* *
* Copyright (C) 2009-2010 Oracle Corp. All rights reserved. * Copyright (C) 2009-2010 Oracle Corp. All rights reserved.
* Author: Dan Magenheimer * Author: Dan Magenheimer
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
* *
* This code provides the generic "frontend" layer to call a matching * This code provides the generic "frontend" layer to call a matching
* "backend" driver implementation of frontswap. See * "backend" driver implementation of frontswap. See
* Documentation/vm/frontswap.txt for more information. * Documentation/vm/frontswap.rst for more information.
* *
* Copyright (C) 2009-2012 Oracle Corp. All rights reserved. * Copyright (C) 2009-2012 Oracle Corp. All rights reserved.
* Author: Dan Magenheimer * Author: Dan Magenheimer
......
...@@ -37,7 +37,7 @@ ...@@ -37,7 +37,7 @@
#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC) #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
/* /*
* Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h * Device private memory see HMM (Documentation/vm/hmm.rst) or hmm.h
*/ */
DEFINE_STATIC_KEY_FALSE(device_private_key); DEFINE_STATIC_KEY_FALSE(device_private_key);
EXPORT_SYMBOL(device_private_key); EXPORT_SYMBOL(device_private_key);
......
...@@ -1185,7 +1185,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, ...@@ -1185,7 +1185,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
* mmu_notifier_invalidate_range_end() happens which can lead to a * mmu_notifier_invalidate_range_end() happens which can lead to a
* device seeing memory write in different order than CPU. * device seeing memory write in different order than CPU.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
...@@ -2037,7 +2037,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, ...@@ -2037,7 +2037,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
* replacing a zero pmd write protected page with a zero pte write * replacing a zero pmd write protected page with a zero pte write
* protected page. * protected page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
pmdp_huge_clear_flush(vma, haddr, pmd); pmdp_huge_clear_flush(vma, haddr, pmd);
......
...@@ -3291,7 +3291,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, ...@@ -3291,7 +3291,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* table protection not changing it to point * table protection not changing it to point
* to a new page. * to a new page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
huge_ptep_set_wrprotect(src, addr, src_pte); huge_ptep_set_wrprotect(src, addr, src_pte);
} }
...@@ -4357,7 +4357,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, ...@@ -4357,7 +4357,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
* No need to call mmu_notifier_invalidate_range() we are downgrading * No need to call mmu_notifier_invalidate_range() we are downgrading
* page table protection not changing it to point to a new page. * page table protection not changing it to point to a new page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
i_mmap_unlock_write(vma->vm_file->f_mapping); i_mmap_unlock_write(vma->vm_file->f_mapping);
mmu_notifier_invalidate_range_end(mm, start, end); mmu_notifier_invalidate_range_end(mm, start, end);
......
...@@ -51,7 +51,9 @@ ...@@ -51,7 +51,9 @@
#define DO_NUMA(x) do { } while (0) #define DO_NUMA(x) do { } while (0)
#endif #endif
/* /**
* DOC: Overview
*
* A few notes about the KSM scanning process, * A few notes about the KSM scanning process,
* to make it easier to understand the data structures below: * to make it easier to understand the data structures below:
* *
...@@ -67,6 +69,21 @@ ...@@ -67,6 +69,21 @@
* this tree is fully assured to be working (except when pages are unmapped), * this tree is fully assured to be working (except when pages are unmapped),
* and therefore this tree is called the stable tree. * and therefore this tree is called the stable tree.
* *
* The stable tree node includes information required for reverse
* mapping from a KSM page to virtual addresses that map this page.
*
* In order to avoid large latencies of the rmap walks on KSM pages,
* KSM maintains two types of nodes in the stable tree:
*
* * the regular nodes that keep the reverse mapping structures in a
* linked list
* * the "chains" that link nodes ("dups") that represent the same
* write protected memory content, but each "dup" corresponds to a
* different KSM page copy of that content
*
* Internally, the regular nodes, "dups" and "chains" are represented
* using the same :c:type:`struct stable_node` structure.
*
* In addition to the stable tree, KSM uses a second data structure called the * In addition to the stable tree, KSM uses a second data structure called the
* unstable tree: this tree holds pointers to pages which have been found to * unstable tree: this tree holds pointers to pages which have been found to
* be "unchanged for a period of time". The unstable tree sorts these pages * be "unchanged for a period of time". The unstable tree sorts these pages
...@@ -1049,7 +1066,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, ...@@ -1049,7 +1066,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
* No need to notify as we are downgrading page table to read * No need to notify as we are downgrading page table to read
* only not changing it to point to a new page. * only not changing it to point to a new page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
/* /*
...@@ -1145,7 +1162,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, ...@@ -1145,7 +1162,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
* No need to notify as we are replacing a read only page with another * No need to notify as we are replacing a read only page with another
* read only page with the same content. * read only page with the same content.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
ptep_clear_flush(vma, addr, ptep); ptep_clear_flush(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, newpte); set_pte_at_notify(mm, addr, ptep, newpte);
......
...@@ -2828,7 +2828,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, ...@@ -2828,7 +2828,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
unsigned long ret = -EINVAL; unsigned long ret = -EINVAL;
struct file *file; struct file *file;
pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.txt.\n", pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.rst.\n",
current->comm, current->pid); current->comm, current->pid);
if (prot) if (prot)
......
...@@ -942,7 +942,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, ...@@ -942,7 +942,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
* downgrading page table protection not changing it to point * downgrading page table protection not changing it to point
* to a new page. * to a new page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
if (ret) if (ret)
(*cleaned)++; (*cleaned)++;
...@@ -1599,7 +1599,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, ...@@ -1599,7 +1599,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* point at new page while a device still is using this * point at new page while a device still is using this
* page. * page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
dec_mm_counter(mm, mm_counter_file(page)); dec_mm_counter(mm, mm_counter_file(page));
} }
...@@ -1609,7 +1609,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, ...@@ -1609,7 +1609,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* done above for all cases requiring it to happen under page * done above for all cases requiring it to happen under page
* table lock before mmu_notifier_invalidate_range_end() * table lock before mmu_notifier_invalidate_range_end()
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
page_remove_rmap(subpage, PageHuge(page)); page_remove_rmap(subpage, PageHuge(page));
put_page(page); put_page(page);
......
...@@ -621,7 +621,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed); ...@@ -621,7 +621,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
* succeed and -ENOMEM implies there is not. * succeed and -ENOMEM implies there is not.
* *
* We currently support three overcommit policies, which are set via the * We currently support three overcommit policies, which are set via the
* vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting * vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting.rst
* *
* Strict overcommit modes added 2002 Feb 26 by Alan Cox. * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
* Additional code 2002 Jul 20 by Robert Love. * Additional code 2002 Jul 20 by Robert Love.
......
#!/bin/sh #!/usr/bin/env perl
# SPDX-License-Identifier: GPL-2.0
#
# Treewide grep for references to files under Documentation, and report # Treewide grep for references to files under Documentation, and report
# non-existing files in stderr. # non-existing files in stderr.
for f in $(git ls-files); do use warnings;
for ref in $(grep -ho "Documentation/[A-Za-z0-9_.,~/*+-]*" "$f"); do use strict;
# presume trailing . and , are not part of the name use Getopt::Long qw(:config no_auto_abbrev);
ref=${ref%%[.,]}
my $scriptname = $0;
# use ls to handle wildcards $scriptname =~ s,.*/([^/]+/),$1,;
if ! ls $ref >/dev/null 2>&1; then
echo "$f: $ref" >&2 # Parse arguments
fi my $help = 0;
done my $fix = 0;
done
GetOptions(
'fix' => \$fix,
'h|help|usage' => \$help,
);
if ($help != 0) {
print "$scriptname [--help] [--fix-rst]\n";
exit -1;
}
# Step 1: find broken references
print "Finding broken references. This may take a while... " if ($fix);
my %broken_ref;
open IN, "git grep 'Documentation/'|"
or die "Failed to run git grep";
while (<IN>) {
next if (!m/^([^:]+):(.*)/);
my $f = $1;
my $ln = $2;
# Makefiles contain nasty expressions to parse docs
next if ($f =~ m/Makefile/);
# Skip this script
next if ($f eq $scriptname);
if ($ln =~ m,\b(\S*)(Documentation/[A-Za-z0-9\_\.\,\~/\*+-]*),) {
my $prefix = $1;
my $ref = $2;
my $base = $2;
$ref =~ s/[\,\.]+$//;
my $fulref = "$prefix$ref";
$fulref =~ s/^(\<file|ref)://;
$fulref =~ s/^[\'\`]+//;
$fulref =~ s,^\$\(.*\)/,,;
$base =~ s,.*/,,;
# Remove URL false-positives
next if ($fulref =~ m/^http/);
# Check if exists, evaluating wildcards
next if (grep -e, glob("$ref $fulref"));
if ($fix) {
if (!($ref =~ m/(devicetree|scripts|Kconfig|Kbuild)/)) {
$broken_ref{$ref}++;
}
} else {
print STDERR "$f: $fulref\n";
}
}
}
exit 0 if (!$fix);
# Step 2: Seek for file name alternatives
print "Auto-fixing broken references. Please double-check the results\n";
foreach my $ref (keys %broken_ref) {
my $new =$ref;
# get just the basename
$new =~ s,.*/,,;
# Seek for the same name on another place, as it may have been moved
my $f="";
$f = qx(find . -iname $new) if ($new);
# usual reason for breakage: file renamed to .rst
if (!$f) {
$new =~ s/\.txt$/.rst/;
$f=qx(find . -iname $new) if ($new);
}
my @find = split /\s+/, $f;
if (!$f) {
print STDERR "ERROR: Didn't find a replacement for $ref\n";
} elsif (scalar(@find) > 1) {
print STDERR "WARNING: Won't auto-replace, as found multiple files close to $ref:\n";
foreach my $j (@find) {
$j =~ s,^./,,;
print STDERR " $j\n";
}
} else {
$f = $find[0];
$f =~ s,^./,,;
print "INFO: Replacing $ref to $f\n";
foreach my $j (qx(git grep -l $ref)) {
qx(sed "s\@$ref\@$f\@g" -i $j);
}
}
}
#!/usr/bin/env python
# SPDX-License-Identifier: GPL-2.0
# Copyright Thomas Gleixner <tglx@linutronix.de>
from argparse import ArgumentParser
from ply import lex, yacc
import traceback
import sys
import git
import re
import os
class ParserException(Exception):
def __init__(self, tok, txt):
self.tok = tok
self.txt = txt
class SPDXException(Exception):
def __init__(self, el, txt):
self.el = el
self.txt = txt
class SPDXdata(object):
def __init__(self):
self.license_files = 0
self.exception_files = 0
self.licenses = [ ]
self.exceptions = { }
# Read the spdx data from the LICENSES directory
def read_spdxdata(repo):
# The subdirectories of LICENSES in the kernel source
license_dirs = [ "preferred", "other", "exceptions" ]
lictree = repo.heads.master.commit.tree['LICENSES']
spdx = SPDXdata()
for d in license_dirs:
for el in lictree[d].traverse():
if not os.path.isfile(el.path):
continue
exception = None
for l in open(el.path).readlines():
if l.startswith('Valid-License-Identifier:'):
lid = l.split(':')[1].strip().upper()
if lid in spdx.licenses:
raise SPDXException(el, 'Duplicate License Identifier: %s' %lid)
else:
spdx.licenses.append(lid)
elif l.startswith('SPDX-Exception-Identifier:'):
exception = l.split(':')[1].strip().upper()
spdx.exceptions[exception] = []
elif l.startswith('SPDX-Licenses:'):
for lic in l.split(':')[1].upper().strip().replace(' ', '').replace('\t', '').split(','):
if not lic in spdx.licenses:
raise SPDXException(None, 'Exception %s missing license %s' %(ex, lic))
spdx.exceptions[exception].append(lic)
elif l.startswith("License-Text:"):
if exception:
if not len(spdx.exceptions[exception]):
raise SPDXException(el, 'Exception %s is missing SPDX-Licenses' %excid)
spdx.exception_files += 1
else:
spdx.license_files += 1
break
return spdx
class id_parser(object):
reserved = [ 'AND', 'OR', 'WITH' ]
tokens = [ 'LPAR', 'RPAR', 'ID', 'EXC' ] + reserved
precedence = ( ('nonassoc', 'AND', 'OR'), )
t_ignore = ' \t'
def __init__(self, spdx):
self.spdx = spdx
self.lasttok = None
self.lastid = None
self.lexer = lex.lex(module = self, reflags = re.UNICODE)
# Initialize the parser. No debug file and no parser rules stored on disk
# The rules are small enough to be generated on the fly
self.parser = yacc.yacc(module = self, write_tables = False, debug = False)
self.lines_checked = 0
self.checked = 0
self.spdx_valid = 0
self.spdx_errors = 0
self.curline = 0
self.deepest = 0
# Validate License and Exception IDs
def validate(self, tok):
id = tok.value.upper()
if tok.type == 'ID':
if not id in self.spdx.licenses:
raise ParserException(tok, 'Invalid License ID')
self.lastid = id
elif tok.type == 'EXC':
if not self.spdx.exceptions.has_key(id):
raise ParserException(tok, 'Invalid Exception ID')
if self.lastid not in self.spdx.exceptions[id]:
raise ParserException(tok, 'Exception not valid for license %s' %self.lastid)
self.lastid = None
elif tok.type != 'WITH':
self.lastid = None
# Lexer functions
def t_RPAR(self, tok):
r'\)'
self.lasttok = tok.type
return tok
def t_LPAR(self, tok):
r'\('
self.lasttok = tok.type
return tok
def t_ID(self, tok):
r'[A-Za-z.0-9\-+]+'
if self.lasttok == 'EXC':
print(tok)
raise ParserException(tok, 'Missing parentheses')
tok.value = tok.value.strip()
val = tok.value.upper()
if val in self.reserved:
tok.type = val
elif self.lasttok == 'WITH':
tok.type = 'EXC'
self.lasttok = tok.type
self.validate(tok)
return tok
def t_error(self, tok):
raise ParserException(tok, 'Invalid token')
def p_expr(self, p):
'''expr : ID
| ID WITH EXC
| expr AND expr
| expr OR expr
| LPAR expr RPAR'''
pass
def p_error(self, p):
if not p:
raise ParserException(None, 'Unfinished license expression')
else:
raise ParserException(p, 'Syntax error')
def parse(self, expr):
self.lasttok = None
self.lastid = None
self.parser.parse(expr, lexer = self.lexer)
def parse_lines(self, fd, maxlines, fname):
self.checked += 1
self.curline = 0
try:
for line in fd:
self.curline += 1
if self.curline > maxlines:
break
self.lines_checked += 1
if line.find("SPDX-License-Identifier:") < 0:
continue
expr = line.split(':')[1].replace('*/', '').strip()
self.parse(expr)
self.spdx_valid += 1
#
# Should we check for more SPDX ids in the same file and
# complain if there are any?
#
break
except ParserException as pe:
if pe.tok:
col = line.find(expr) + pe.tok.lexpos
tok = pe.tok.value
sys.stdout.write('%s: %d:%d %s: %s\n' %(fname, self.curline, col, pe.txt, tok))
else:
sys.stdout.write('%s: %d:0 %s\n' %(fname, self.curline, col, pe.txt))
self.spdx_errors += 1
def scan_git_tree(tree):
for el in tree.traverse():
# Exclude stuff which would make pointless noise
# FIXME: Put this somewhere more sensible
if el.path.startswith("LICENSES"):
continue
if el.path.find("license-rules.rst") >= 0:
continue
if el.path == 'scripts/checkpatch.pl':
continue
if not os.path.isfile(el.path):
continue
parser.parse_lines(open(el.path), args.maxlines, el.path)
def scan_git_subtree(tree, path):
for p in path.strip('/').split('/'):
tree = tree[p]
scan_git_tree(tree)
if __name__ == '__main__':
ap = ArgumentParser(description='SPDX expression checker')
ap.add_argument('path', nargs='*', help='Check path or file. If not given full git tree scan. For stdin use "-"')
ap.add_argument('-m', '--maxlines', type=int, default=15,
help='Maximum number of lines to scan in a file. Default 15')
ap.add_argument('-v', '--verbose', action='store_true', help='Verbose statistics output')
args = ap.parse_args()
# Sanity check path arguments
if '-' in args.path and len(args.path) > 1:
sys.stderr.write('stdin input "-" must be the only path argument\n')
sys.exit(1)
try:
# Use git to get the valid license expressions
repo = git.Repo(os.getcwd())
assert not repo.bare
# Initialize SPDX data
spdx = read_spdxdata(repo)
# Initilize the parser
parser = id_parser(spdx)
except SPDXException as se:
if se.el:
sys.stderr.write('%s: %s\n' %(se.el.path, se.txt))
else:
sys.stderr.write('%s\n' %se.txt)
sys.exit(1)
except Exception as ex:
sys.stderr.write('FAIL: %s\n' %ex)
sys.stderr.write('%s\n' %traceback.format_exc())
sys.exit(1)
try:
if len(args.path) and args.path[0] == '-':
parser.parse_lines(sys.stdin, args.maxlines, '-')
else:
if args.path:
for p in args.path:
if os.path.isfile(p):
parser.parse_lines(open(p), args.maxlines, p)
elif os.path.isdir(p):
scan_git_subtree(repo.head.reference.commit.tree, p)
else:
sys.stderr.write('path %s does not exist\n' %p)
sys.exit(1)
else:
# Full git tree scan
scan_git_tree(repo.head.commit.tree)
if args.verbose:
sys.stderr.write('\n')
sys.stderr.write('License files: %12d\n' %spdx.license_files)
sys.stderr.write('Exception files: %12d\n' %spdx.exception_files)
sys.stderr.write('License IDs %12d\n' %len(spdx.licenses))
sys.stderr.write('Exception IDs %12d\n' %len(spdx.exceptions))
sys.stderr.write('\n')
sys.stderr.write('Files checked: %12d\n' %parser.checked)
sys.stderr.write('Lines checked: %12d\n' %parser.lines_checked)
sys.stderr.write('Files with SPDX: %12d\n' %parser.spdx_valid)
sys.stderr.write('Files with errors: %12d\n' %parser.spdx_errors)
sys.exit(0)
except Exception as ex:
sys.stderr.write('FAIL: %s\n' %ex)
sys.stderr.write('%s\n' %traceback.format_exc())
sys.exit(1)
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册