Unverified commit 2c55d703, authored by Maxime Ripard

Merge drm/drm-fixes into drm-misc-fixes

Let's start the fixes cycle.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>

Too many changes to display.

To preserve performance, only 1000 of 1000+ files are displayed.
@@ -222,6 +222,7 @@ ForEachMacros:
- 'for_each_component_dais'
- 'for_each_component_dais_safe'
- 'for_each_console'
- 'for_each_console_srcu'
- 'for_each_cpu'
- 'for_each_cpu_and'
- 'for_each_cpu_not'
@@ -440,8 +441,11 @@ ForEachMacros:
- 'inet_lhash2_for_each_icsk'
- 'inet_lhash2_for_each_icsk_continue'
- 'inet_lhash2_for_each_icsk_rcu'
- 'interval_tree_for_each_double_span'
- 'interval_tree_for_each_span'
- 'intlist__for_each_entry'
- 'intlist__for_each_entry_safe'
- 'iopt_for_each_contig_area'
- 'kcore_copy__for_each_phdr'
- 'key_for_each'
- 'key_for_each_safe'
@@ -535,6 +539,7 @@ ForEachMacros:
- 'perf_hpp_list__for_each_sort_list_safe'
- 'perf_pmu__for_each_hybrid_pmu'
- 'ping_portaddr_for_each_entry'
- 'ping_portaddr_for_each_entry_rcu'
- 'plist_for_each'
- 'plist_for_each_continue'
- 'plist_for_each_entry'
......
@@ -20,6 +20,7 @@
*.dtb
*.dtbo
*.dtb.S
*.dtbo.S
*.dwo
*.elf
*.gcno
@@ -38,6 +39,7 @@
*.o.*
*.patch
*.rmeta
*.rpm
*.rsi
*.s
*.so
......
@@ -228,6 +228,7 @@ Juha Yrjola <at solidboot.com>
Juha Yrjola <juha.yrjola@nokia.com>
Juha Yrjola <juha.yrjola@solidboot.com>
Julien Thierry <julien.thierry.kdev@gmail.com> <julien.thierry@arm.com>
Iskren Chernev <me@iskren.info> <iskren.chernev@gmail.com>
Kalle Valo <kvalo@kernel.org> <kvalo@codeaurora.org>
Kalyan Thota <quic_kalyant@quicinc.com> <kalyan_t@codeaurora.org>
Kay Sievers <kay.sievers@vrfy.org>
@@ -287,6 +288,7 @@ Matthew Wilcox <willy@infradead.org> <willy@linux.intel.com>
Matthew Wilcox <willy@infradead.org> <willy@parisc-linux.org>
Matthias Fuchs <socketcan@esd.eu> <matthias.fuchs@esd.eu>
Matthieu CASTET <castet.matthieu@free.fr>
Matti Vaittinen <mazziesaccount@gmail.com> <matti.vaittinen@fi.rohmeurope.com>
Matt Ranostay <matt.ranostay@konsulko.com> <matt@ranostay.consulting>
Matt Ranostay <mranostay@gmail.com> Matthew Ranostay <mranostay@embeddedalley.com>
Matt Ranostay <mranostay@gmail.com> <matt.ranostay@intel.com>
@@ -372,6 +374,8 @@ Ricardo Ribalda <ribalda@kernel.org> <ricardo.ribalda@gmail.com>
Roman Gushchin <roman.gushchin@linux.dev> <guro@fb.com>
Roman Gushchin <roman.gushchin@linux.dev> <guroan@gmail.com>
Roman Gushchin <roman.gushchin@linux.dev> <klamm@yandex-team.ru>
Muchun Song <muchun.song@linux.dev> <songmuchun@bytedance.com>
Muchun Song <muchun.song@linux.dev> <smuchun@gmail.com>
Ross Zwisler <zwisler@kernel.org> <ross.zwisler@linux.intel.com>
Rudolf Marek <R.Marek@sh.cvut.cz>
Rui Saraiva <rmps@joel.ist.utl.pt>
......
@@ -1439,6 +1439,10 @@ N: Justin Guyett
E: jguyett@andrew.cmu.edu
D: via-rhine net driver hacking

N: Nitin Gupta
E: ngupta@vflare.org
D: zsmalloc memory allocator and zram block device driver

N: Danny ter Haar
E: dth@cistron.nl
D: /proc/cpuinfo, reboot on panic, kernel pre-patch tester ;)
......
@@ -22,6 +22,7 @@ Date: Oct 25, 2019
KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The largest number of work descriptors in a batch.
It's not visible when the device does not support batch.

What: /sys/bus/dsa/devices/dsa<m>/max_work_queues_size
Date: Oct 25, 2019
@@ -49,6 +50,8 @@ Description: The total number of read buffers supported by this device.
The read buffers represent resources within the DSA
implementation, and these resources are allocated by engines to
support operations. See DSA spec v1.2 9.2.4 Total Read Buffers.
It's not visible when the device does not support Read Buffer
allocation control.

What: /sys/bus/dsa/devices/dsa<m>/max_transfer_size
Date: Oct 25, 2019
@@ -122,6 +125,8 @@ Contact: dmaengine@vger.kernel.org
Description: The maximum number of read buffers that may be in use at
one time by operations that access low bandwidth memory in the
device. See DSA spec v1.2 9.2.8 GENCFG on Global Read Buffer Limit.
It's not visible when the device does not support Read Buffer
allocation control.

What: /sys/bus/dsa/devices/dsa<m>/cmd_status
Date: Aug 28, 2020
@@ -205,6 +210,7 @@ KernelVersion: 5.10.0
Contact: dmaengine@vger.kernel.org
Description: The max batch size for this workqueue. Cannot exceed device
max batch size. Configurable parameter.
It's not visible when the device does not support batch.

What: /sys/bus/dsa/devices/wq<m>.<n>/ats_disable
Date: Nov 13, 2020
@@ -250,6 +256,8 @@ KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
Description: Enable the use of global read buffer limit for the group. See DSA
spec v1.2 9.2.18 GRPCFG Use Global Read Buffer Limit.
It's not visible when the device does not support Read Buffer
allocation control.

What: /sys/bus/dsa/devices/group<m>.<n>/read_buffers_allowed
Date: Dec 10, 2021
@@ -258,6 +266,8 @@ Contact: dmaengine@vger.kernel.org
Description: Indicates max number of read buffers that may be in use at one time
by all engines in the group. See DSA spec v1.2 9.2.18 GRPCFG Read
Buffers Allowed.
It's not visible when the device does not support Read Buffer
allocation control.

What: /sys/bus/dsa/devices/group<m>.<n>/read_buffers_reserved
Date: Dec 10, 2021
@@ -266,6 +276,8 @@ Contact: dmaengine@vger.kernel.org
Description: Indicates the number of Read Buffers reserved for the use of
engines in the group. See DSA spec v1.2 9.2.18 GRPCFG Read Buffers
Reserved.
It's not visible when the device does not support Read Buffer
allocation control.

What: /sys/bus/dsa/devices/group<m>.<n>/desc_progress_limit
Date: Sept 14, 2022
......
@@ -35,6 +35,15 @@ Description: This controls cursor delay when using arrow keys. When a
characters. Set this to a higher value to adjust for the delay
and better synchronisation between cursor position and speech.

What: /sys/accessibility/speakup/cur_phonetic
KernelVersion: 6.2
Contact: speakup@linux-speakup.org
Description: This allows speakup to speak letters phonetically when arrowing through
a word letter by letter. This doesn't affect the spelling when typing
the characters. When cur_phonetic=1, speakup will speak characters
phonetically when arrowing over a letter. When cur_phonetic=0, speakup
will speak letters normally.

What: /sys/accessibility/speakup/delimiters
KernelVersion: 2.6
Contact: speakup@linux-speakup.org
......
@@ -197,7 +197,7 @@ Description: Specific MJPEG format descriptors
read-only
bmaControls this format's data for bmaControls in
the streaming header
bmInterlaceFlags specifies interlace information,
read-only
bAspectRatioY the X dimension of the picture aspect
ratio, read-only
@@ -253,7 +253,7 @@ Description: Specific uncompressed format descriptors
read-only
bmaControls this format's data for bmaControls in
the streaming header
bmInterlaceFlags specifies interlace information,
read-only
bAspectRatioY the X dimension of the picture aspect
ratio, read-only
......
What: /sys/kernel/debug/dell-wmi-ddv-<wmi_device_name>/fan_sensor_information
Date: September 2022
KernelVersion: 6.1
Contact: Armin Wolf <W_Armin@gmx.de>
Description:
This file contains the contents of the fan sensor information buffer,
which contains fan sensor entries and a terminating character (0xFF).
Each fan sensor entry consists of three bytes with an unknown meaning;
interested people may use this file for reverse engineering.

What: /sys/kernel/debug/dell-wmi-ddv-<wmi_device_name>/thermal_sensor_information
Date: September 2022
KernelVersion: 6.1
Contact: Armin Wolf <W_Armin@gmx.de>
Description:
This file contains the contents of the thermal sensor information buffer,
which contains thermal sensor entries and a terminating character (0xFF).
Each thermal sensor entry consists of five bytes with an unknown meaning;
interested people may use this file for reverse engineering.

@@ -91,6 +91,13 @@ Description: Enables the root user to set the device to specific state.
Valid values are "disable", "enable", "suspend", "resume".
User can read this property to see the valid values.
What: /sys/kernel/debug/habanalabs/hl<n>/device_release_watchdog_timeout
Date: Oct 2022
KernelVersion: 6.2
Contact: ttayar@habana.ai
Description: The watchdog timeout value in seconds for a device release upon
certain error cases, after which the device is reset.

What: /sys/kernel/debug/habanalabs/hl<n>/dma_size
Date: Apr 2021
KernelVersion: 5.13
......
What: /sys/kernel/debug/pktcdvd/pktcdvd[0-7]
Date: Oct. 2006
KernelVersion: 2.6.20
Contact: Thomas Maier <balagi@justmail.de>
Description:
The pktcdvd module (packet writing driver) creates
these files in debugfs:
/sys/kernel/debug/pktcdvd/pktcdvd[0-7]/
==== ====== ====================================
info 0444 Lots of driver statistics and infos.
==== ====== ====================================
Example::

    cat /sys/kernel/debug/pktcdvd/pktcdvd0/info
@@ -137,3 +137,17 @@ Description:
The writeback_limit file is read-write and specifies the maximum
amount of writeback ZRAM can do. The limit could be changed
at run time.
What: /sys/block/zram<id>/recomp_algorithm
Date: November 2022
Contact: Sergey Senozhatsky <senozhatsky@chromium.org>
Description:
The recomp_algorithm file is read-write and allows setting
or showing secondary compression algorithms.
What: /sys/block/zram<id>/recompress
Date: November 2022
Contact: Sergey Senozhatsky <senozhatsky@chromium.org>
Description:
The recompress file is write-only and triggers re-compression
with secondary compression algorithms.
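A minimal usage sketch of the two attributes above (the device number, algorithm name, and `type=`/`algo=`/`priority=` parameter syntax follow the zram admin guide; treat them as assumptions for your kernel version):

```shell
# Register zstd as a secondary compression algorithm with priority 1
echo "algo=zstd priority=1" > /sys/block/zram0/recomp_algorithm
# Show the configured secondary algorithms
cat /sys/block/zram0/recomp_algorithm
# Trigger re-compression of idle pages with the secondary algorithm
echo "type=idle" > /sys/block/zram0/recompress
```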
What: /sys/bus/coreboot
Date: August 2022
Contact: Jack Rosenthal <jrosenth@chromium.org>
Description:
The coreboot bus provides a variety of virtual devices used to
access data structures created by the Coreboot BIOS.
What: /sys/bus/coreboot/devices/cbmem-<id>
Date: August 2022
Contact: Jack Rosenthal <jrosenth@chromium.org>
Description:
CBMEM is a downwards-growing memory region created by Coreboot,
and contains tagged data structures to be shared with payloads
in the boot process and the OS. Each CBMEM entry is given a
directory in /sys/bus/coreboot/devices based on its id.
A list of ids known to Coreboot can be found in the coreboot
source tree at
``src/commonlib/bsd/include/commonlib/bsd/cbmem_id.h``.
What: /sys/bus/coreboot/devices/cbmem-<id>/address
Date: August 2022
Contact: Jack Rosenthal <jrosenth@chromium.org>
Description:
This is the physical memory address that the CBMEM entry's data
begins at, in hexadecimal (e.g., ``0x76ffe000``).
What: /sys/bus/coreboot/devices/cbmem-<id>/size
Date: August 2022
Contact: Jack Rosenthal <jrosenth@chromium.org>
Description:
This is the size of the CBMEM entry's data, in hexadecimal
(e.g., ``0x1234``).
What: /sys/bus/coreboot/devices/cbmem-<id>/mem
Date: August 2022
Contact: Jack Rosenthal <jrosenth@chromium.org>
Description:
A file exposing read/write access to the entry's data. Note
that this file does not support mmap(), as coreboot
does not guarantee that the data will be page-aligned.
The mode of this file is 0600. While there shouldn't be
anything security-sensitive contained in CBMEM, read access
requires root privileges, since this exposes a small subset
of physical memory.
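A short sketch of reading one CBMEM entry through this interface (the id `cbmem-c0ffee` is hypothetical; real ids are listed in `cbmem_id.h` as noted above):

```shell
# List the CBMEM entries coreboot exposed on this machine
ls /sys/bus/coreboot/devices/
# Inspect a single (hypothetical) entry: location, size, then its data
cat /sys/bus/coreboot/devices/cbmem-c0ffee/address   # hex physical address
cat /sys/bus/coreboot/devices/cbmem-c0ffee/size      # hex size
# mmap() is unsupported on "mem", so read it as a plain file (root required)
dd if=/sys/bus/coreboot/devices/cbmem-c0ffee/mem bs=16 count=1 status=none \
    | od -A x -t x1
```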
What: /sys/bus/iio/devices/iio:deviceX/in_voltage-voltage_filter_mode_available
KernelVersion: 6.2
Contact: linux-iio@vger.kernel.org
Description:
Reading returns a list with the possible filter modes.
* "sinc4" - Sinc 4. Excellent noise performance. Long
1st conversion time. No natural 50/60Hz rejection.
* "sinc4+sinc1" - Sinc4 + averaging by 8. Low 1st conversion
time.
* "sinc3" - Sinc3. Moderate 1st conversion time.
Good noise performance.
* "sinc3+rej60" - Sinc3 + 60Hz rejection. At a sampling
frequency of 50Hz, achieves simultaneous 50Hz and 60Hz
rejection.
* "sinc3+sinc1" - Sinc3 + averaging by 8. Low 1st conversion
time. Best used with a sampling frequency of at least
216.19Hz.
* "sinc3+pf1" - Sinc3 + Post Filter 1. 53dB rejection @
50Hz, 58dB rejection @ 60Hz.
* "sinc3+pf2" - Sinc3 + Post Filter 2. 70dB rejection @
50Hz, 70dB rejection @ 60Hz.
* "sinc3+pf3" - Sinc3 + Post Filter 3. 99dB rejection @
50Hz, 103dB rejection @ 60Hz.
* "sinc3+pf4" - Sinc3 + Post Filter 4. 103dB rejection @
50Hz, 109dB rejection @ 60Hz.
What: /sys/bus/iio/devices/iio:deviceX/in_voltageY-voltageZ_filter_mode
KernelVersion: 6.2
Contact: linux-iio@vger.kernel.org
Description:
Set the filter mode of the differential channel. When the filter
mode changes, the in_voltageY-voltageZ_sampling_frequency and
in_voltageY-voltageZ_sampling_frequency_available attributes
might also change to accommodate the new filter mode.
If the current sampling frequency is out of range for the new
filter mode, the sampling frequency will be changed to the
closest valid one.
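The interaction between the filter-mode and sampling-frequency attributes can be exercised from the shell (device and channel numbers are hypothetical; the attribute names follow the ABI entries above):

```shell
# See which filter modes the device offers
cat /sys/bus/iio/devices/iio:device0/in_voltage-voltage_filter_mode_available
# Select a mode for the differential channel 0-1
echo "sinc3+rej60" > \
    /sys/bus/iio/devices/iio:device0/in_voltage0-voltage1_filter_mode
# The valid sampling frequencies may have changed with the new mode
cat /sys/bus/iio/devices/iio:device0/in_voltage0-voltage1_sampling_frequency_available
```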
What: /sys/bus/iio/devices/iio:deviceX/in_voltage_filterY_notch_en
Date: September 2022
KernelVersion: 6.0
Contact: linux-iio@vger.kernel.org
Description:
Enable or disable a notch filter.
What: /sys/bus/iio/devices/iio:deviceX/in_voltage_filterY_notch_center
Date: September 2022
KernelVersion: 6.0
Contact: linux-iio@vger.kernel.org
Description:
Center frequency of the notch filter in Hz.
@@ -41,3 +41,17 @@ KernelVersion: 5.18
Contact: Kajol Jain <kjain@linux.ibm.com>
Description: (RO) This sysfs file exposes the cpumask which is designated
to retrieve nvdimm pmu event counter data.
What: /sys/bus/nd/devices/nmemX/cxl/id
Date: November 2022
KernelVersion: 6.2
Contact: Dave Jiang <dave.jiang@intel.com>
Description: (RO) Show the id (serial) of the device. This is CXL specific.
What: /sys/bus/nd/devices/nmemX/cxl/provider
Date: November 2022
KernelVersion: 6.2
Contact: Dave Jiang <dave.jiang@intel.com>
Description: (RO) Shows the CXL bridge device that ties a CXL memory device
to this NVDIMM device, i.e. the parent of the device returned is
a /sys/bus/cxl/devices/memX instance.
@@ -407,6 +407,16 @@ Description:
file contains a '1' if the memory has been published for
use outside the driver that owns the device.
What: /sys/bus/pci/devices/.../p2pmem/allocate
Date: August 2022
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
This file allows mapping p2pmem into userspace. For each
mmap() call on this file, the kernel will allocate a chunk
of Peer-to-Peer memory for use in Peer-to-Peer transactions.
This memory can be used in O_DIRECT calls to NVMe backed
files for Peer-to-Peer copies.
What: /sys/bus/pci/devices/.../link/clkpm
 /sys/bus/pci/devices/.../link/l0s_aspm
 /sys/bus/pci/devices/.../link/l1_aspm
......
@@ -5,6 +5,9 @@ Contact: linux-mtd@lists.infradead.org
Description: (RO) The JEDEC ID of the SPI NOR flash as reported by the
flash device.
The attribute is not present if the flash doesn't support
the "Read JEDEC ID" command (9Fh). This is the case for
non-JEDEC compliant flashes.
What: /sys/bus/spi/devices/.../spi-nor/manufacturer
Date: April 2021
@@ -12,6 +15,9 @@ KernelVersion: 5.14
Contact: linux-mtd@lists.infradead.org
Description: (RO) Manufacturer of the SPI NOR flash.
The attribute is not present if the flash device isn't
known to the kernel and is only probed by its SFDP
tables.
What: /sys/bus/spi/devices/.../spi-nor/partname
Date: April 2021
......
@@ -264,6 +264,17 @@ Description:
attached to the port will not be detected, initialized,
or enumerated.
What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/early_stop
Date: Sep 2022
Contact: Ray Chi <raychi@google.com>
Description:
Some USB hosts have watchdog mechanisms, so the device may enter
ramdump if port initialization takes too long. This attribute limits
each port to two initialization attempts so that a failing port fails
quickly. In addition, once a port marked with early_stop has failed to
initialize, it will ignore all future connections until this attribute
is cleared.
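Toggling the attribute from the shell might look like this (the hub interface and port path are assumptions; adjust them to the actual USB topology):

```shell
# Limit port 1 of root hub 1 to two initialization attempts
echo 1 > /sys/bus/usb/devices/usb1/1-0:1.0/port1/early_stop
cat /sys/bus/usb/devices/usb1/1-0:1.0/port1/early_stop
# Clear the flag so the port accepts connections again
echo 0 > /sys/bus/usb/devices/usb1/1-0:1.0/port1/early_stop
```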
What: /sys/bus/usb/devices/.../power/usb2_lpm_l1_timeout
Date: May 2013
Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
......
@@ -44,6 +44,21 @@ Description:
(read-write)
What: /sys/class/bdi/<bdi>/min_ratio_fine
Date: November 2022
Contact: Stefan Roesch <shr@devkernel.io>
Description:
Under normal circumstances each device is given a part of the
total write-back cache that relates to its current average
writeout speed in relation to the other devices.
The 'min_ratio_fine' parameter allows assigning a minimum reserve
of the write-back cache to a particular device. The value is
expressed as part of 1 million. For example, this is useful for
providing a minimum QoS.
(read-write)
What: /sys/class/bdi/<bdi>/max_ratio
Date: January 2008
Contact: Peter Zijlstra <a.p.zijlstra@chello.nl>
@@ -55,6 +70,59 @@ Description:
mount that is prone to get stuck, or a FUSE mount which cannot
be trusted to play fair.
(read-write)
What: /sys/class/bdi/<bdi>/max_ratio_fine
Date: November 2022
Contact: Stefan Roesch <shr@devkernel.io>
Description:
Allows limiting a particular device to use not more than the
given value of the write-back cache. The value is given as part
of 1 million. This is useful in situations where we want to avoid
one device taking all or most of the write-back cache. For example
in case of an NFS mount that is prone to get stuck, or a FUSE mount
which cannot be trusted to play fair.
(read-write)
What: /sys/class/bdi/<bdi>/min_bytes
Date: October 2022
Contact: Stefan Roesch <shr@devkernel.io>
Description:
Under normal circumstances each device is given a part of the
total write-back cache that relates to its current average
writeout speed in relation to the other devices.
The 'min_bytes' parameter allows assigning a minimum share
of the write-back cache to a particular device, expressed
in bytes.
For example, this is useful for providing a minimum QoS.
(read-write)
What: /sys/class/bdi/<bdi>/max_bytes
Date: October 2022
Contact: Stefan Roesch <shr@devkernel.io>
Description:
Allows limiting a particular device to use not more than the
given 'max_bytes' of the write-back cache. This is useful in
situations where we want to avoid one device taking all or
most of the write-back cache. For example in case of an NFS
mount that is prone to get stuck, a FUSE mount which cannot be
trusted to play fair, or an nbd device.
(read-write)
What: /sys/class/bdi/<bdi>/strict_limit
Date: October 2022
Contact: Stefan Roesch <shr@devkernel.io>
Description:
Forces per-BDI checks for the share of the given device in the
write-back cache even before the global background dirty limit is
reached. This is useful in situations where the global limit is much
higher than affordable for a given, relatively slow (or untrusted)
device. Turning strictlimit on has no visible effect if max_ratio
is equal to 100%.
(read-write)
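A sketch of tuning the bdi attributes described above ("8:0" is a hypothetical bdi for a first SCSI disk; the values are illustrative):

```shell
# Reserve 1% of the write-back cache: value is parts per 1,000,000
echo 10000 > /sys/class/bdi/8:0/min_ratio_fine
# Cap the device at 5% of the write-back cache
echo 50000 > /sys/class/bdi/8:0/max_ratio_fine
# Enforce the per-BDI limit even below the global background dirty limit
echo 1 > /sys/class/bdi/8:0/strict_limit
```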
What: /sys/class/bdi/<bdi>/stable_pages_required
Date: January 2008
......
sysfs interface
---------------
The pktcdvd module (packet writing driver) creates the following files in
sysfs (<devid> is in the format major:minor):
What: /sys/class/pktcdvd/add
What: /sys/class/pktcdvd/remove
What: /sys/class/pktcdvd/device_map
Date: Oct. 2006
KernelVersion: 2.6.20
Contact: Thomas Maier <balagi@justmail.de>
Description:
========== ==============================================
add (WO) Write a block device id (major:minor) to
create a new pktcdvd device and map it to the
block device.
remove (WO) Write the pktcdvd device id (major:minor)
to remove the pktcdvd device.
device_map (RO) Shows the device mapping in format:
pktcdvd[0-7] <pktdevid> <blkdevid>
========== ==============================================
What: /sys/class/pktcdvd/pktcdvd[0-7]/dev
What: /sys/class/pktcdvd/pktcdvd[0-7]/uevent
Date: Oct. 2006
KernelVersion: 2.6.20
Contact: Thomas Maier <balagi@justmail.de>
Description:
dev: (RO) Device id
uevent: (WO) To send a uevent
What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_started
What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_finished
What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_written
What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read
What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read_gather
What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/reset
Date: Oct. 2006
KernelVersion: 2.6.20
Contact: Thomas Maier <balagi@justmail.de>
Description:
packets_started: (RO) Number of started packets.
packets_finished: (RO) Number of finished packets.
kb_written: (RO) kBytes written.
kb_read: (RO) kBytes read.
kb_read_gather: (RO) kBytes read to fill write packets.
reset: (WO) Write any value to it to reset
pktcdvd device statistic values, like
bytes read/written.
What: /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/size
What: /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_off
What: /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_on
Date: Oct. 2006
KernelVersion: 2.6.20
Contact: Thomas Maier <balagi@justmail.de>
Description:
============== ================================================
size (RO) Contains the size of the bio write queue.
congestion_off (RW) If bio write queue size is below this mark,
accept new bio requests from the block layer.
congestion_on (RW) If the bio write queue size is higher than this
 mark, no longer accept bio write requests from the
 block layer, and wait until the pktcdvd device has
 processed enough bios that the bio write queue size
 is below the congestion_off mark.
 A value of <= 0 disables congestion control.
============== ================================================
Example:
--------
To use the pktcdvd sysfs interface directly, you can do::
# create a new pktcdvd device mapped to /dev/hdc
echo "22:0" >/sys/class/pktcdvd/add
cat /sys/class/pktcdvd/device_map
# assuming device pktcdvd0 was created, look at stat's
cat /sys/class/pktcdvd/pktcdvd0/stat/kb_written
# print the device id of the mapped block device
fgrep pktcdvd0 /sys/class/pktcdvd/device_map
# remove device, using pktcdvd0 device id 253:0
echo "253:0" >/sys/class/pktcdvd/remove
What: /sys/devices/uncore_iio_x/dieX
Date: February 2020
Contact: Alexander Antonov <alexander.antonov@linux.intel.com>
Description:
Each IIO stack (PCIe root port) has its own IIO PMON block, so
each dieX file (where X is die number) holds "Segment:Root Bus"
@@ -32,3 +32,31 @@ Description:
IIO PMU 0 on die 1 belongs to PCI RP on bus 0x40, domain 0x0000 IIO PMU 0 on die 1 belongs to PCI RP on bus 0x40, domain 0x0000
IIO PMU 0 on die 2 belongs to PCI RP on bus 0x80, domain 0x0000 IIO PMU 0 on die 2 belongs to PCI RP on bus 0x80, domain 0x0000
IIO PMU 0 on die 3 belongs to PCI RP on bus 0xc0, domain 0x0000 IIO PMU 0 on die 3 belongs to PCI RP on bus 0xc0, domain 0x0000
What: /sys/devices/uncore_upi_x/dieX
Date: March 2022
Contact: Alexander Antonov <alexander.antonov@linux.intel.com>
Description:
Each /sys/devices/uncore_upi_X/dieY file holds a "upi_Z,die_W"
value, meaning that UPI link X on die Y is connected to UPI
link Z on die W, and that this link between sockets can be
monitored by the UPI PMON block.
For example, 4-die Sapphire Rapids platform has the following
UPI 0 topology::
# tail /sys/devices/uncore_upi_0/die*
==> /sys/devices/uncore_upi_0/die0 <==
upi_1,die_1
==> /sys/devices/uncore_upi_0/die1 <==
upi_0,die_3
==> /sys/devices/uncore_upi_0/die2 <==
upi_1,die_3
==> /sys/devices/uncore_upi_0/die3 <==
upi_0,die_1
Which means::
UPI link 0 on die 0 is connected to UPI link 1 on die 1
UPI link 0 on die 1 is connected to UPI link 0 on die 3
UPI link 0 on die 2 is connected to UPI link 1 on die 3
UPI link 0 on die 3 is connected to UPI link 0 on die 1
What: /sys/devices/.../hwmon/hwmon<i>/in0_input
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RO. Current voltage in millivolts.
Only supported for particular Intel i915 graphics platforms.
What: /sys/devices/.../hwmon/hwmon<i>/power1_max
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RW. Card reactive sustained (PL1/Tau) power limit in microwatts.
The power controller will throttle the operating frequency
if the power averaged over a window (typically seconds)
exceeds this limit.
Only supported for particular Intel i915 graphics platforms.
What: /sys/devices/.../hwmon/hwmon<i>/power1_rated_max
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RO. Card default power limit (default TDP setting).
Only supported for particular Intel i915 graphics platforms.
What: /sys/devices/.../hwmon/hwmon<i>/power1_max_interval
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RW. Sustained power limit interval (Tau in PL1/Tau) in
milliseconds over which sustained power is averaged.
Only supported for particular Intel i915 graphics platforms.
What: /sys/devices/.../hwmon/hwmon<i>/power1_crit
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RW. Card reactive critical (I1) power limit in microwatts.
Card reactive critical (I1) power limit in microwatts is exposed
for client products. The power controller will throttle the
operating frequency if the power averaged over a window exceeds
this limit.
Only supported for particular Intel i915 graphics platforms.
What: /sys/devices/.../hwmon/hwmon<i>/curr1_crit
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RW. Card reactive critical (I1) power limit in milliamperes.
Card reactive critical (I1) power limit in milliamperes is
exposed for server products. The power controller will throttle
the operating frequency if the power averaged over a window
exceeds this limit.
Only supported for particular Intel i915 graphics platforms.
What: /sys/devices/.../hwmon/hwmon<i>/energy1_input
Date: February 2023
KernelVersion: 6.2
Contact: intel-gfx@lists.freedesktop.org
Description: RO. Energy input of device or gt in microjoules.
For i915 device level hwmon devices (name "i915") this
reflects energy input for the entire device. For gt level
hwmon devices (name "i915_gtN") this reflects energy input
for the gt.
Only supported for particular Intel i915 graphics platforms.
@@ -4,21 +4,21 @@ KernelVersion: 5.18
Contact: "David E. Box" <david.e.box@linux.intel.com>
Description:
This directory contains interface files for accessing Intel
On Demand (formerly Software Defined Silicon or SDSi) features
on a CPU. X represents the socket instance (though not the
socket ID). The socket ID is determined by reading the
registers file and decoding it per the specification.

Some files communicate with On Demand hardware through a
mailbox. Should the operation fail, one of the following error
codes may be returned:

========== =====
Error Code Cause
========== =====
EIO        General mailbox failure. Log may indicate cause.
EBUSY      Mailbox is owned by another agent.
EPERM      On Demand capability is not enabled in hardware.
EPROTO     Failure in mailbox protocol detected by driver.
           See log for details.
EOVERFLOW  For provision commands, the size of the data

@@ -54,8 +54,8 @@ KernelVersion: 5.18
Contact: "David E. Box" <david.e.box@linux.intel.com>
Description:
(WO) Used to write an Authentication Key Certificate (AKC) to
the On Demand NVRAM for the CPU. The AKC is used to authenticate
a Capability Activation Payload. Mailbox command.

What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/provision_cap
Date: Feb 2022
@@ -63,17 +63,28 @@ KernelVersion: 5.18
Contact: "David E. Box" <david.e.box@linux.intel.com>
Description:
(WO) Used to write a Capability Activation Payload (CAP) to the
On Demand NVRAM for the CPU. CAPs are used to activate a given
CPU feature. A CAP is validated by On Demand hardware using a
previously provisioned AKC file. Upon successful authentication,
the CPU configuration is updated. A cold reboot is required to
fully activate the feature. Mailbox command.

What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/meter_certificate
Date: Nov 2022
KernelVersion: 6.2
Contact: "David E. Box" <david.e.box@linux.intel.com>
Description:
(RO) Used to read back the current meter certificate for the CPU
from Intel On Demand hardware. The meter certificate contains
utilization metrics of On Demand enabled features. Mailbox
command.

What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/state_certificate
Date: Feb 2022
KernelVersion: 5.18
Contact: "David E. Box" <david.e.box@linux.intel.com>
Description:
(RO) Used to read back the current state certificate for the CPU
from On Demand hardware. The state certificate contains
information about the current licenses on the CPU. Mailbox
command.
@@ -99,6 +99,12 @@ Description: Controls the issue rate of discard commands that consist of small
checkpoint is triggered, and issued during the checkpoint.
By default, it is disabled with 0.

What: /sys/fs/f2fs/<disk>/max_ordered_discard
Date: October 2022
Contact: "Yangtao Li" <frank.li@vivo.com>
Description: Controls the maximum ordered discard, the unit size is one block(4KB).
Set it to 16 by default.

What: /sys/fs/f2fs/<disk>/max_discard_request
Date: December 2021
Contact: "Konstantin Vyshetsky" <vkon@google.com>

@@ -132,7 +138,8 @@ Contact: "Chao Yu" <yuchao0@huawei.com>
Description: Controls discard granularity of inner discard thread. Inner thread
will not issue discards with size that is smaller than granularity.
The unit size is one block(4KB), now only support configuring
in range of [1, 512]. Default value is 16.
For small devices, default value is 1.

What: /sys/fs/f2fs/<disk>/umount_discard_timeout
Date: January 2019

@@ -235,7 +242,7 @@ Description: Shows total written kbytes issued to disk.
What: /sys/fs/f2fs/<disk>/features
Date: July 2017
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
Description: <deprecated: should use /sys/fs/f2fs/<disk>/feature_list/>
Shows all enabled features in current device.
Supported features:
encryption, blkzoned, extra_attr, projquota, inode_checksum,

@@ -592,10 +599,10 @@ Description: With "mode=fragment:block" mount options, we can scatter block allo
in the length of 1..<max_fragment_hole> by turns. This value can be set
between 1..512 and the default value is 4.

What: /sys/fs/f2fs/<disk>/gc_remaining_trials
Date: October 2022
Contact: "Yangtao Li" <frank.li@vivo.com>
Description: You can set the trial count limit for GC urgent and idle mode with this value.
If GC thread gets to the limit, the mode will turn back to GC normal mode.
By default, the value is zero, which means there is no limit like before.

@@ -634,3 +641,31 @@ Date: July 2022
Contact: "Daeho Jeong" <daehojeong@google.com>
Description: Show the accumulated total revoked atomic write block count after boot.
If you write "0" here, you can initialize to "0".
What: /sys/fs/f2fs/<disk>/gc_mode
Date: October 2022
Contact: "Yangtao Li" <frank.li@vivo.com>
Description: Show the current gc_mode as a string.
This is a read-only entry.
What: /sys/fs/f2fs/<disk>/discard_urgent_util
Date: November 2022
Contact: "Yangtao Li" <frank.li@vivo.com>
Description: When space utilization exceeds this, do background DISCARD aggressively.
Does DISCARD forcibly in a period of given min_discard_issue_time when the number
of discards is not 0 and set discard granularity to 1.
Default: 80
What: /sys/fs/f2fs/<disk>/hot_data_age_threshold
Date: November 2022
Contact: "Ping Xiong" <xiongping1@xiaomi.com>
Description: When DATA SEPARATION is on, it controls the age threshold to indicate
the data blocks as hot. By default, it is initialized to 262144
blocks (equal to 1GB).
What: /sys/fs/f2fs/<disk>/warm_data_age_threshold
Date: November 2022
Contact: "Ping Xiong" <xiongping1@xiaomi.com>
Description: When DATA SEPARATION is on, it controls the age threshold to indicate
the data blocks as warm. By default, it is initialized to 2621440
blocks (equal to 10GB).
What: /sys/kernel/cpu_byteorder
Date: February 2023
KernelVersion: 6.2
Contact: Thomas Weißschuh <linux@weissschuh.net>
Description:
The endianness of the running kernel.
Access: Read
Valid values:
"little", "big"
Users: util-linux
@@ -27,6 +27,10 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
makes the kdamond reads the user inputs in the sysfs files
except 'state' again. Writing 'update_schemes_stats' to the
file updates contents of schemes stats files of the kdamond.
Writing 'update_schemes_tried_regions' to the file updates
contents of 'tried_regions' directory of every scheme directory
of this kdamond. Writing 'clear_schemes_tried_regions' to the
file removes contents of the 'tried_regions' directory.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022

@@ -283,3 +287,31 @@ Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the number of the exceed events of
the scheme's quotas.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
Date: Oct 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the start address of a memory region
that the corresponding DAMON-based Operation Scheme's action has
tried to be applied to.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/end
Date: Oct 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the end address of a memory region
that the corresponding DAMON-based Operation Scheme's action has
tried to be applied to.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/nr_accesses
Date: Oct 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the 'nr_accesses' of a memory region
that the corresponding DAMON-based Operation Scheme's action has
tried to be applied to.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/age
Date: Oct 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Reading this file returns the 'age' of a memory region that
the corresponding DAMON-based Operation Scheme's action has tried
to be applied to.
What: /sys/kernel/oops_count
Date: November 2022
KernelVersion: 6.2.0
Contact: Linux Kernel Hardening List <linux-hardening@vger.kernel.org>
Description:
Shows how many times the system has Oopsed since last boot.
What: /sys/kernel/warn_count
Date: November 2022
KernelVersion: 6.2.0
Contact: Linux Kernel Hardening List <linux-hardening@vger.kernel.org>
Description:
Shows how many times the system has Warned since last boot.
What: /sys/class/power_supply/<battery_name>/eppid
Date: September 2022
KernelVersion: 6.1
Contact: Armin Wolf <W_Armin@gmx.de>
Description:
Reports the Dell ePPID (electronic Dell Piece Part Identification)
of the ACPI battery.
What: /sys/devices/virtual/misc/intel_ifs_<N>/run_test
Date: Nov 16 2022
KernelVersion: 6.2
Contact: "Jithu Joseph" <jithu.joseph@intel.com>
Description: Write <cpu#> to trigger IFS test for one online core.
Note that the test is per core. The cpu# can be
for any thread on the core. Running on one thread
completes the test for the core containing that thread.
Example: to test the core containing cpu5: echo 5 >
/sys/devices/virtual/misc/intel_ifs_<N>/run_test

What: /sys/devices/virtual/misc/intel_ifs_<N>/status
Date: Nov 16 2022
KernelVersion: 6.2
Contact: "Jithu Joseph" <jithu.joseph@intel.com>
Description: The status of the last test. It can be one of "pass", "fail"
or "untested".

What: /sys/devices/virtual/misc/intel_ifs_<N>/details
Date: Nov 16 2022
KernelVersion: 6.2
Contact: "Jithu Joseph" <jithu.joseph@intel.com>
Description: Additional information regarding the last test. The details file reports
the hex value of the SCAN_STATUS MSR. Note that the error_code field
may contain driver defined software code not defined in the Intel SDM.

What: /sys/devices/virtual/misc/intel_ifs_<N>/image_version
Date: Nov 16 2022
KernelVersion: 6.2
Contact: "Jithu Joseph" <jithu.joseph@intel.com>
Description: Version (hexadecimal) of loaded IFS binary image. If no scan image
is loaded reports "none".

What: /sys/devices/virtual/misc/intel_ifs_<N>/current_batch
Date: Nov 16 2022
KernelVersion: 6.2
Contact: "Jithu Joseph" <jithu.joseph@intel.com>
Description: Write a number less than or equal to 0xff to load an IFS test image.
The number written is treated as the 2 digit suffix in the following file name:
/lib/firmware/intel/ifs_<N>/ff-mm-ss-02x.scan
Reading the file will provide the suffix of the currently loaded IFS test image.
@@ -95,6 +95,15 @@ htmldocs:
	@$(srctree)/scripts/sphinx-pre-install --version-check
	@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))

texinfodocs:
	@$(srctree)/scripts/sphinx-pre-install --version-check
	@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,texinfo,$(var),texinfo,$(var)))

# Note: the 'info' Make target is generated by sphinx itself when
# running the texinfodocs target defined above.
infodocs: texinfodocs
	$(MAKE) -C $(BUILDDIR)/texinfo info

linkcheckdocs:
	@$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var)))

@@ -143,6 +152,8 @@ cleandocs:
dochelp:
	@echo '  Linux kernel internal documentation in different formats from ReST:'
	@echo '  htmldocs        - HTML'
	@echo '  texinfodocs     - Texinfo'
	@echo '  infodocs        - Info'
	@echo '  latexdocs       - LaTeX'
	@echo '  pdfdocs         - PDF'
	@echo '  epubdocs        - EPUB'
@@ -285,3 +285,13 @@ to bridges between the PCI root and the device, MSIs are disabled.
It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.

List of device drivers MSI(-X) APIs
===================================

The PCI/MSI subsystem has a dedicated C file for its exported device driver
APIs: drivers/pci/msi/api.c. The following functions are exported:

.. kernel-doc:: drivers/pci/msi/api.c
   :export:
@@ -83,6 +83,7 @@ This structure has the form::

	int (*mmio_enabled)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
	void (*cor_error_detected)(struct pci_dev *dev);
	};

The possible channel states are::

@@ -422,5 +423,11 @@ That is, the recovery API only requires that:

- drivers/net/cxgb3
- drivers/net/s2io.c

The cor_error_detected() callback is invoked in handle_error_source() when
the error severity is "correctable". The callback is optional and allows
additional logging to be done if desired. See example:

- drivers/cxl/pci.c

The End
-------
@@ -1858,7 +1858,7 @@ unloaded. After a given module has been unloaded, any attempt to call
one of its functions results in a segmentation fault. The module-unload
functions must therefore cancel any delayed calls to loadable-module
functions, for example, any outstanding mod_timer() must be dealt
with via timer_shutdown_sync() or similar.

Unfortunately, there is no way to cancel an RCU callback; once you
invoke call_rcu(), the callback function is eventually going to be
.. _array_rcu_doc:
Using RCU to Protect Read-Mostly Arrays
=======================================
Although RCU is more commonly used to protect linked lists, it can
also be used to protect arrays. Three situations are as follows:
1. :ref:`Hash Tables <hash_tables>`
2. :ref:`Static Arrays <static_arrays>`
3. :ref:`Resizable Arrays <resizable_arrays>`
Each of these three situations involves an RCU-protected pointer to an
array that is separately indexed. It might be tempting to consider use
of RCU to instead protect the index into an array, however, this use
case is **not** supported. The problem with RCU-protected indexes into
arrays is that compilers can play way too many optimization games with
integers, which means that the rules governing handling of these indexes
are far more trouble than they are worth. If RCU-protected indexes into
arrays prove to be particularly valuable (which they have not thus far),
explicit cooperation from the compiler will be required to permit them
to be safely used.
That aside, each of the three RCU-protected pointer situations are
described in the following sections.
.. _hash_tables:
Situation 1: Hash Tables
------------------------
Hash tables are often implemented as an array, where each array entry
has a linked-list hash chain. Each hash chain can be protected by RCU
as described in listRCU.rst. This approach also applies to other
array-of-list situations, such as radix trees.
.. _static_arrays:
Situation 2: Static Arrays
--------------------------
Static arrays, where the data (rather than a pointer to the data) is
located in each array element, and where the array is never resized,
have not been used with RCU. Rik van Riel recommends using seqlock in
this situation, which would also have minimal read-side overhead as long
as updates are rare.
Quick Quiz:
Why is it so important that updates be rare when using seqlock?
:ref:`Answer to Quick Quiz <answer_quick_quiz_seqlock>`
.. _resizable_arrays:
Situation 3: Resizable Arrays
------------------------------
Use of RCU for resizable arrays is demonstrated by the grow_ary()
function formerly used by the System V IPC code. The array is used
to map from semaphore, message-queue, and shared-memory IDs to the data
structure that represents the corresponding IPC construct. The grow_ary()
function does not acquire any locks; instead its caller must hold the
ids->sem semaphore.
The grow_ary() function, shown below, does some limit checks, allocates a
new ipc_id_ary, copies the old to the new portion of the new, initializes
the remainder of the new, updates the ids->entries pointer to point to
the new array, and invokes ipc_rcu_putref() to free up the old array.
Note that rcu_assign_pointer() is used to update the ids->entries pointer,
which includes any memory barriers required on whatever architecture
you are running on::
    static int grow_ary(struct ipc_ids* ids, int newsize)
    {
        struct ipc_id_ary* new;
        struct ipc_id_ary* old;
        int i;
        int size = ids->entries->size;

        if(newsize > IPCMNI)
            newsize = IPCMNI;
        if(newsize <= size)
            return newsize;

        new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize +
                            sizeof(struct ipc_id_ary));
        if(new == NULL)
            return size;
        new->size = newsize;
        memcpy(new->p, ids->entries->p,
               sizeof(struct kern_ipc_perm *)*size +
               sizeof(struct ipc_id_ary));
        for(i=size;i<newsize;i++) {
            new->p[i] = NULL;
        }
        old = ids->entries;

        /*
         * Use rcu_assign_pointer() to make sure the memcpyed
         * contents of the new array are visible before the new
         * array becomes visible.
         */
        rcu_assign_pointer(ids->entries, new);

        ipc_rcu_putref(old);
        return newsize;
    }
The ipc_rcu_putref() function decrements the array's reference count
and then, if the reference count has dropped to zero, uses call_rcu()
to free the array after a grace period has elapsed.
The array is traversed by the ipc_lock() function. This function
indexes into the array under the protection of rcu_read_lock(),
using rcu_dereference() to pick up the pointer to the array so
that it may later safely be dereferenced -- memory barriers are
required on the Alpha CPU. Since the size of the array is stored
with the array itself, there can be no array-size mismatches, so
a simple check suffices. The pointer to the structure corresponding
to the desired IPC object is placed in "out", with NULL indicating
a non-existent entry. After acquiring "out->lock", the "out->deleted"
flag indicates whether the IPC object is in the process of being
deleted, and, if not, the pointer is returned::
    struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
    {
        struct kern_ipc_perm* out;
        int lid = id % SEQ_MULTIPLIER;
        struct ipc_id_ary* entries;

        rcu_read_lock();
        entries = rcu_dereference(ids->entries);
        if(lid >= entries->size) {
            rcu_read_unlock();
            return NULL;
        }
        out = entries->p[lid];
        if(out == NULL) {
            rcu_read_unlock();
            return NULL;
        }
        spin_lock(&out->lock);

        /* ipc_rmid() may have already freed the ID while ipc_lock
         * was spinning: here verify that the structure is still valid
         */
        if (out->deleted) {
            spin_unlock(&out->lock);
            rcu_read_unlock();
            return NULL;
        }
        return out;
    }
.. _answer_quick_quiz_seqlock:
Answer to Quick Quiz:
Why is it so important that updates be rare when using seqlock?
The reason that it is important that updates be rare when
using seqlock is that frequent updates can livelock readers.
One way to avoid this problem is to assign a seqlock for
each array entry rather than to the entire array.
...@@ -32,8 +32,8 @@ over a rather long period of time, but improvements are always welcome! ...@@ -32,8 +32,8 @@ over a rather long period of time, but improvements are always welcome!
for lockless updates. This does result in the mildly for lockless updates. This does result in the mildly
counter-intuitive situation where rcu_read_lock() and counter-intuitive situation where rcu_read_lock() and
rcu_read_unlock() are used to protect updates, however, this rcu_read_unlock() are used to protect updates, however, this
approach provides the same potential simplifications that garbage approach can provide the same simplifications to certain types
collectors do. of lockless algorithms that garbage collectors do.
1. Does the update code have proper mutual exclusion? 1. Does the update code have proper mutual exclusion?
...@@ -49,12 +49,12 @@ over a rather long period of time, but improvements are always welcome! ...@@ -49,12 +49,12 @@ over a rather long period of time, but improvements are always welcome!
them -- even x86 allows later loads to be reordered to precede them -- even x86 allows later loads to be reordered to precede
earlier stores), and be prepared to explain why this added earlier stores), and be prepared to explain why this added
complexity is worthwhile. If you choose #c, be prepared to complexity is worthwhile. If you choose #c, be prepared to
explain how this single task does not become a major bottleneck on explain how this single task does not become a major bottleneck
big multiprocessor machines (for example, if the task is updating on large systems (for example, if the task is updating information
information relating to itself that other tasks can read, there relating to itself that other tasks can read, there by definition
by definition can be no bottleneck). Note that the definition can be no bottleneck). Note that the definition of "large" has
of "large" has changed significantly: Eight CPUs was "large" changed significantly: Eight CPUs was "large" in the year 2000,
in the year 2000, but a hundred CPUs was unremarkable in 2017. but a hundred CPUs was unremarkable in 2017.
2. Do the RCU read-side critical sections make proper use of 2. Do the RCU read-side critical sections make proper use of
rcu_read_lock() and friends? These primitives are needed rcu_read_lock() and friends? These primitives are needed
...@@ -97,33 +97,38 @@ over a rather long period of time, but improvements are always welcome! ...@@ -97,33 +97,38 @@ over a rather long period of time, but improvements are always welcome!
b. Proceed as in (a) above, but also maintain per-element b. Proceed as in (a) above, but also maintain per-element
locks (that are acquired by both readers and writers) locks (that are acquired by both readers and writers)
that guard per-element state. Of course, fields that that guard per-element state. Fields that the readers
the readers refrain from accessing can be guarded by refrain from accessing can be guarded by some other lock
some other lock acquired only by updaters, if desired. acquired only by updaters, if desired.
This works quite well, also. This also works quite well.
c. Make updates appear atomic to readers. For example, c. Make updates appear atomic to readers. For example,
pointer updates to properly aligned fields will pointer updates to properly aligned fields will
appear atomic, as will individual atomic primitives. appear atomic, as will individual atomic primitives.
Sequences of operations performed under a lock will *not* Sequences of operations performed under a lock will *not*
appear to be atomic to RCU readers, nor will sequences appear to be atomic to RCU readers, nor will sequences
of multiple atomic primitives. of multiple atomic primitives. One alternative is to
move multiple individual fields to a separate structure,
thus solving the multiple-field problem by imposing an
additional level of indirection.
This can work, but is starting to get a bit tricky. This can work, but is starting to get a bit tricky.
	d.	Carefully order the updates and the reads so that readers
		see valid data at all phases of the update.  This is often
		more difficult than it sounds, especially given modern
		CPUs' tendency to reorder memory references.  One must
		usually liberally sprinkle memory-ordering operations
		through the code, making it difficult to understand and
		to test.  Where it works, it is better to use things
		like smp_store_release() and smp_load_acquire(), but in
		some cases the smp_mb() full memory barrier is required.

		As noted earlier, it is usually better to group the
		changing data into a separate structure, so that the
		change may be made to appear atomic by updating a pointer
		to reference a new structure containing updated values.
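As a hedged userspace sketch of this separate-structure approach (not kernel code: the `struct cfg` type and both function names are invented, and C11 release/acquire atomics stand in for rcu_assign_pointer() and rcu_dereference()):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical two-field configuration that must appear to change
 * atomically; type and function names are invented for illustration. */
struct cfg {
	int shift;
	int mask;
};

static _Atomic(struct cfg *) cur_cfg;

/* Updater: copy the current structure, modify the private copy, then
 * publish it with a release store (the userspace counterpart of
 * rcu_assign_pointer()). */
static void cfg_update(int shift, int mask)
{
	struct cfg *oldp = atomic_load_explicit(&cur_cfg, memory_order_relaxed);
	struct cfg *newp = malloc(sizeof(*newp));

	if (oldp)
		*newp = *oldp;
	newp->shift = shift;
	newp->mask = mask;
	atomic_store_explicit(&cur_cfg, newp, memory_order_release);
	/* In the kernel, oldp would be handed to kfree_rcu(); lacking
	 * grace-period machinery, this sketch deliberately leaks it. */
}

/* Reader: a single acquire load (standing in for rcu_dereference())
 * yields a pointer to a fully initialized, internally consistent pair. */
static int cfg_apply(int x)
{
	struct cfg *p = atomic_load_explicit(&cur_cfg, memory_order_acquire);

	return (x >> p->shift) & p->mask;
}
```

Because both fields live behind one pointer, a reader either sees the old pair or the new pair, never a mix.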
4.	Weakly ordered CPUs pose special challenges.  Almost all CPUs
	are weakly ordered -- even x86 CPUs allow later loads to be

@@ -188,26 +193,29 @@ over a rather long period of time, but improvements are always welcome!

	when publicizing a pointer to a structure that can
	be traversed by an RCU read-side critical section.
5.	If any of call_rcu(), call_srcu(), call_rcu_tasks(),
	call_rcu_tasks_rude(), or call_rcu_tasks_trace() is used,
	the callback function may be invoked from softirq context,
	and in any case with bottom halves disabled.  In particular,
	this callback function cannot block.  If you need the callback
	to block, run that code in a workqueue handler scheduled from
	the callback.  The queue_rcu_work() function does this for you
	in the case of call_rcu().
6.	Since synchronize_rcu() can block, it cannot be called
	from any sort of irq context.  The same rule applies
	for synchronize_srcu(), synchronize_rcu_expedited(),
	synchronize_srcu_expedited(), synchronize_rcu_tasks(),
	synchronize_rcu_tasks_rude(), and synchronize_rcu_tasks_trace().

	The expedited forms of these primitives have the same semantics
	as the non-expedited forms, but expediting is more CPU intensive.
	Use of the expedited primitives should be restricted to rare
	configuration-change operations that would not normally be
	undertaken while a real-time workload is running.  Note that
	IPI-sensitive real-time workloads can use the rcupdate.rcu_normal
	kernel boot parameter to completely disable expedited grace
	periods, though this might have performance implications.
	In particular, if you find yourself invoking one of the expedited
	primitives repeatedly in a loop, please do everyone a favor:

@@ -215,8 +223,9 @@ over a rather long period of time, but improvements are always welcome!

	a single non-expedited primitive to cover the entire batch.
	This will very likely be faster than the loop containing the
	expedited primitive, and will be much much easier on the rest
	of the system, especially to real-time workloads running on the
	rest of the system.  Alternatively, instead use asynchronous
	primitives such as call_rcu().
7.	As of v4.20, a given kernel implements only one RCU flavor, which
	is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y.

@@ -239,7 +248,8 @@ over a rather long period of time, but improvements are always welcome!

	the corresponding readers must use rcu_read_lock_trace() and
	rcu_read_unlock_trace().  If an updater uses call_rcu_tasks_rude()
	or synchronize_rcu_tasks_rude(), then the corresponding readers
	must use anything that disables preemption, for example,
	preempt_disable() and preempt_enable().
	Mixing things up will result in confusion and broken kernels, and
	has even resulted in an exploitable security issue.  Therefore,

@@ -253,15 +263,16 @@ over a rather long period of time, but improvements are always welcome!

	that this usage is safe is that readers can use anything that
	disables BH when updaters use call_rcu() or synchronize_rcu().
8.	Although synchronize_rcu() is slower than is call_rcu(),
	it usually results in simpler code.  So, unless update
	performance is critically important, the updaters cannot block,
	or the latency of synchronize_rcu() is visible from userspace,
	synchronize_rcu() should be used in preference to call_rcu().
	Furthermore, kfree_rcu() and kvfree_rcu() usually result
	in even simpler code than does synchronize_rcu() without
	synchronize_rcu()'s multi-millisecond latency.  So please take
	advantage of kfree_rcu()'s and kvfree_rcu()'s "fire and forget"
	memory-freeing capabilities where it applies.
	An especially important property of the synchronize_rcu()
	primitive is that it automatically self-limits: if grace periods

@@ -271,8 +282,8 @@ over a rather long period of time, but improvements are always welcome!

	cases where grace periods are delayed, as failing to do so can
	result in excessive realtime latencies or even OOM conditions.

	Ways of gaining this self-limiting property when using call_rcu(),
	kfree_rcu(), or kvfree_rcu() include:
	a.	Keeping a count of the number of data-structure elements
		used by the RCU-protected data structure, including

@@ -304,18 +315,21 @@ over a rather long period of time, but improvements are always welcome!

		here is that superuser already has lots of ways to crash
		the machine.
	d.	Periodically invoke rcu_barrier(), permitting a limited
		number of updates per grace period.

	The same cautions apply to call_srcu(), call_rcu_tasks(),
	call_rcu_tasks_rude(), and call_rcu_tasks_trace().  This is
	why there is an srcu_barrier(), rcu_barrier_tasks(),
	rcu_barrier_tasks_rude(), and rcu_barrier_tasks_trace(),
	respectively.

	Note that although these primitives do take action to avoid
	memory exhaustion when any given CPU has too many callbacks,
	a determined user or administrator can still exhaust memory.
	This is especially the case if a system with a large number of
	CPUs has been configured to offload all of its RCU callbacks onto
	a single CPU, or if the system has relatively little free memory.
9.	All RCU list-traversal primitives, which include
	rcu_dereference(), list_for_each_entry_rcu(), and

@@ -344,14 +358,14 @@ over a rather long period of time, but improvements are always welcome!

	and you don't hold the appropriate update-side lock, you *must*
	use the "_rcu()" variants of the list macros.  Failing to do so
	will break Alpha, cause aggressive compilers to generate bad code,
	and confuse people trying to understand your code.
11.	Any lock acquired by an RCU callback must be acquired elsewhere
	with softirq disabled, e.g., via spin_lock_bh().  Failing to
	disable softirq on a given acquisition of that lock will result
	in deadlock as soon as the RCU softirq handler happens to run
	your RCU callback while interrupting that acquisition's critical
	section.
12.	RCU callbacks can be and are executed in parallel.  In many cases,
	the callback code simply wraps kfree(), so that this

@@ -372,7 +386,17 @@ over a rather long period of time, but improvements are always welcome!

	for some real-time workloads, this is the whole point of using
	the rcu_nocbs= kernel boot parameter.

	In addition, do not assume that callbacks queued in a given order
	will be invoked in that order, even if they all are queued on the
	same CPU.  Furthermore, do not assume that same-CPU callbacks will
	be invoked serially.  For example, in recent kernels, CPUs can be
	switched between offloaded and de-offloaded callback invocation,
	and while a given CPU is undergoing such a switch, its callbacks
	might be concurrently invoked by that CPU's softirq handler and
	that CPU's rcuo kthread.  At such times, that CPU's callbacks
	might be executed both concurrently and out of order.

13.	Unlike most flavors of RCU, it *is* permissible to block in an
	SRCU read-side critical section (demarked by srcu_read_lock()
	and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
	Please note that if you don't need to sleep in read-side critical

@@ -412,6 +436,12 @@ over a rather long period of time, but improvements are always welcome!

	never sends IPIs to other CPUs, so it is easier on
	real-time workloads than is synchronize_rcu_expedited().
	It is also permissible to sleep in RCU Tasks Trace read-side
	critical sections, which are delimited by rcu_read_lock_trace() and
rcu_read_unlock_trace(). However, this is a specialized flavor
of RCU, and you should not use it without first checking with
its current users. In most cases, you should instead use SRCU.
	Note that rcu_assign_pointer() relates to SRCU just as it does to
	other forms of RCU, but instead of rcu_dereference() you should
	use srcu_dereference() in order to avoid lockdep splats.

@@ -442,50 +472,62 @@ over a rather long period of time, but improvements are always welcome!

	find problems as follows:
	CONFIG_PROVE_LOCKING:
		check that accesses to RCU-protected data structures
		are carried out under the proper RCU read-side critical
		section, while holding the right combination of locks,
		or whatever other conditions are appropriate.

	CONFIG_DEBUG_OBJECTS_RCU_HEAD:
		check that you don't pass the same object to call_rcu()
		(or friends) before an RCU grace period has elapsed
		since the last time that you passed that same object to
		call_rcu() (or friends).

	__rcu sparse checks:
		tag the pointer to the RCU-protected data structure
		with __rcu, and sparse will warn you if you access that
		pointer without the services of one of the variants
		of rcu_dereference().
	These debugging aids can help you find problems that are
	otherwise extremely difficult to spot.
17.	If you pass a callback function defined within a module to one of
	call_rcu(), call_srcu(), call_rcu_tasks(), call_rcu_tasks_rude(),
	or call_rcu_tasks_trace(), then it is necessary to wait for all
	pending callbacks to be invoked before unloading that module.
	Note that it is absolutely *not* sufficient to wait for a grace
	period!  For example, the synchronize_rcu() implementation is *not*
	guaranteed to wait for callbacks registered on other CPUs via
	call_rcu().  Or even on the current CPU if that CPU recently
	went offline and came back online.
	You instead need to use one of the barrier functions:

	-	call_rcu() -> rcu_barrier()
	-	call_srcu() -> srcu_barrier()
	-	call_rcu_tasks() -> rcu_barrier_tasks()
	-	call_rcu_tasks_rude() -> rcu_barrier_tasks_rude()
	-	call_rcu_tasks_trace() -> rcu_barrier_tasks_trace()
	However, these barrier functions are absolutely *not* guaranteed
	to wait for a grace period.  For example, if there are no
	call_rcu() callbacks queued anywhere in the system, rcu_barrier()
	can and will return immediately.
	So if you need to wait for both a grace period and for all
	pre-existing callbacks, you will need to invoke both functions,
	with the pair depending on the flavor of RCU:

	-	Either synchronize_rcu() or synchronize_rcu_expedited(),
		together with rcu_barrier()
	-	Either synchronize_srcu() or synchronize_srcu_expedited(),
		together with srcu_barrier()
	-	synchronize_rcu_tasks() and rcu_barrier_tasks()
	-	synchronize_rcu_tasks_rude() and rcu_barrier_tasks_rude()
	-	synchronize_rcu_tasks_trace() and rcu_barrier_tasks_trace()

	If necessary, you can use something like workqueues to execute
	the requisite pair of functions concurrently.
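Why the barrier half of each pair, rather than a grace period, is what gates module unload can be modeled with a toy userspace queue. Every name below is invented for illustration; nothing here is kernel API:

```c
/* Toy model: a flat queue of deferred callbacks stands in for the
 * kernel's per-CPU RCU callback lists. */
#define MAXCB 16

typedef void (*cb_t)(void *);

static struct {
	cb_t fn;
	void *arg;
} pending[MAXCB];
static int npending;

/* Analogue of call_rcu(): defer fn(arg) rather than running it now. */
static void queue_cb(cb_t fn, void *arg)
{
	pending[npending].fn = fn;
	pending[npending].arg = arg;
	npending++;
}

/* Analogue of rcu_barrier(): invoke every callback queued so far.
 * Only after this returns would it be safe to "unload" the code
 * containing the callback functions themselves. */
static void barrier_cbs(void)
{
	for (int i = 0; i < npending; i++)
		pending[i].fn(pending[i].arg);
	npending = 0;
}

static int freed;

/* Stand-in for a module-provided callback such as a deferred free. */
static void fake_free(void *arg)
{
	(void)arg;
	freed++;
}
```

The point of the model: until the barrier runs, queued callbacks still reference module code, so waiting only for "time to pass" (a grace period) is not enough.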
	See rcubarrier.rst for more information.

@@ -9,7 +9,6 @@ RCU concepts

.. toctree::
   :maxdepth: 3

   checklist
   lockdep
   lockdep-splat
...
@@ -3,11 +3,10 @@

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the most common uses of RCU is protecting read-mostly linked lists
(``struct list_head`` in list.h).  One big advantage of this approach is
that all of the required memory ordering is provided by the list macros.
This document describes several list-based RCU use cases.
Example 1: Read-mostly list: Deferred Destruction
--------------------------------------------------

@@ -35,7 +34,8 @@ The code traversing the list of all processes typically looks like::

	}
	rcu_read_unlock();

The simplified and heavily inlined code for removing a process from a
task list is::

	void release_task(struct task_struct *p)
	{
@@ -45,39 +45,48 @@ The simplified code for removing a process from a task list is::

		call_rcu(&p->rcu, delayed_put_task_struct);
	}

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
via __exit_signal() and __unhash_process() under ``tasklist_lock``
writer lock protection.  The list_del_rcu() invocation removes
the task from the list of all tasks.  The ``tasklist_lock``
prevents concurrent list additions/removals from corrupting the
list.  Readers using ``for_each_process()`` are not protected with the
``tasklist_lock``.  To prevent readers from noticing changes in the list
pointers, the ``task_struct`` object is freed only after one or more
grace periods elapse, with the help of call_rcu(), which is invoked via
put_task_struct_rcu_user().  This deferring of destruction ensures that
any readers traversing the list will see valid ``p->tasks.next`` pointers
and deletion/freeing can happen in parallel with traversal of the list.
This pattern is also called an **existence lock**, since RCU refrains
from invoking the delayed_put_task_struct() callback function until
all existing readers finish, which guarantees that the ``task_struct``
object in question will remain in existence until after the completion
of all RCU readers that might possibly have a reference to that object.
Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

Some reader-writer locking use cases compute a value while holding
the read-side lock, but continue to use that value after that lock is
released.  These use cases are often good candidates for conversion
to RCU.  One prominent example involves network packet routing.
Because the packet-routing data tracks the state of equipment outside
of the computer, it will at times contain stale data.  Therefore, once
the route has been computed, there is no need to hold the routing table
static during transmission of the packet.  After all, you can hold the
routing table static all you want, but that won't keep the external
Internet from changing, and it is the state of the external Internet
that really matters.  In addition, routing entries are typically added
or deleted, rather than being modified in place.  This is a rare example
of the finite speed of light and the non-zero size of atoms actually
helping make synchronization be lighter weight.

A straightforward example of this type of RCU use case may be found in
the system-call auditing support.  For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::
	static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
	{
		struct audit_entry *e;
		enum audit_state   state;

@@ -86,6 +95,8 @@ implementation of ``audit_filter_task()`` might be as follows::

		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				if (state == AUDIT_STATE_RECORD)
					*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
				read_unlock(&auditsc_lock);
				return state;
			}
@@ -101,7 +112,7 @@ you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
	{
		struct audit_entry *e;
		enum audit_state   state;

@@ -110,6 +121,8 @@ This means that RCU can be easily applied to the read side, as follows::

		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				if (state == AUDIT_STATE_RECORD)
					*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
				rcu_read_unlock();
				return state;
			}
@@ -118,13 +131,15 @@ This means that RCU can be easily applied to the read side, as follows::

		return AUDIT_BUILD_CONTEXT;
	}

The read_lock() and read_unlock() calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry()
has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
primitives add READ_ONCE() and diagnostic checks for incorrect use
outside of an RCU read-side critical section.
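The publish/consume discipline behind these macros can be sketched as a userspace analogue. The node type and function names below are invented, and C11 release/acquire atomics stand in for the kernel's list primitives:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical node type; in the kernel this would embed a list_head
 * and use the list macros rather than raw pointers. */
struct node {
	int val;
	_Atomic(struct node *) next;
};

static _Atomic(struct node *) head;

/* Analogue of list_add_rcu(): fully initialize the node, then publish
 * it with a release store so a concurrent reader can never observe a
 * half-initialized element. */
static void add_front(int val)
{
	struct node *n = malloc(sizeof(*n));

	n->val = val;
	atomic_init(&n->next,
		    atomic_load_explicit(&head, memory_order_relaxed));
	atomic_store_explicit(&head, n, memory_order_release);
}

/* Analogue of list_for_each_entry_rcu(): each ->next link is loaded
 * exactly once, with acquire semantics standing in for
 * rcu_dereference(). */
static int sum_all(void)
{
	struct node *n = atomic_load_explicit(&head, memory_order_acquire);
	int sum = 0;

	while (n) {
		sum += n->val;
		n = atomic_load_explicit(&n->next, memory_order_acquire);
	}
	return sum;
}
```

Loading each link exactly once is the property the plain (non-`_rcu`) macros do not guarantee, which is why they are unsafe for lockless readers.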
The changes to the update side are also straightforward.  A reader-writer lock
might be used as follows for deletion and insertion in these simplified
versions of audit_del_rule() and audit_add_rule()::
	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)

@@ -188,16 +203,16 @@ Following are the RCU equivalents for these two functions::

		return 0;
	}

Normally, the write_lock() and write_unlock() would be replaced by a
spin_lock() and a spin_unlock().  But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required.  The
auditsc_lock can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.
The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are
needed on weakly ordered CPUs.  The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.
@@ -238,7 +253,9 @@ need to be filled in)::

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry.  This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name.

The RCU version of audit_upd_rule() is as follows::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
@@ -267,6 +284,9 @@ RCU (*read-copy update*) its name. The RCU code is as follows::

Again, this assumes that the caller holds ``audit_filter_mutex``.  Normally, the
writer lock would become a spinlock in this sort of code.

The update_lsm_rule() does something very similar, for those who would
prefer to look at real Linux-kernel code.

Another use of this pattern can be found in the openswitch driver's *connection
tracking table* code in ``ct_limit_set()``.  The table holds connection tracking
entries and has a limit on the maximum entries.  There is one such table
@@ -281,9 +301,10 @@ Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state.  After all, there is a delay from the
time the external state changes before Linux becomes aware of the change,
and so, as noted earlier, a small quantity of additional RCU-induced
staleness is generally not a problem.
However, there are many examples where stale data cannot be tolerated. However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC (see the shm_lock() One example in the Linux kernel is the System V IPC (see the shm_lock()
@@ -302,7 +323,7 @@ Quick Quiz:

If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
``audit_entry`` structure, and modify audit_filter_task() as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk)
        {

@@ -319,6 +340,8 @@ audit_entry structure, and modify ``audit_filter_task()`` as follows::

                                return AUDIT_BUILD_CONTEXT;
                        }
                        rcu_read_unlock();
                        if (state == AUDIT_STATE_RECORD)
                                *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
                        return state;
                }
        }
@@ -326,12 +349,6 @@ audit_entry structure, and modify ``audit_filter_task()`` as follows::

                return AUDIT_BUILD_CONTEXT;
        }

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

@@ -357,24 +374,32 @@ spinlock as follows::

This too assumes that the caller holds ``audit_filter_mutex``.

Note that this example assumes that entries are only added and deleted.
Additional mechanism is required to deal correctly with the update-in-place
performed by audit_upd_rule(). For one thing, audit_upd_rule() would
need to hold the locks of both the old ``audit_entry`` and its replacement
while executing the list_replace_rcu().
Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping
stale objects during read-side list traversal, where stale objects
are those that will be removed and destroyed after one or more grace
periods. One such example can be found in the timerfd subsystem. When a
``CLOCK_REALTIME`` clock is reprogrammed (for example due to setting
of the system time) then all programmed ``timerfds`` that depend on
this clock get triggered and processes waiting on them are awakened in
advance of their scheduled expiry. To facilitate this, all such timers
are added to an RCU-managed ``cancel_list`` when they are setup in
``timerfd_setup_cancel()``::

        static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
        {
                spin_lock(&ctx->cancel_lock);
                if ((ctx->clockid == CLOCK_REALTIME ||
                     ctx->clockid == CLOCK_REALTIME_ALARM) &&
                    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
                        if (!ctx->might_cancel) {
                                ctx->might_cancel = true;

@@ -382,13 +407,16 @@ to an RCU-managed ``cancel_list`` when they are setup in

                                list_add_rcu(&ctx->clist, &cancel_list);
                                spin_unlock(&cancel_lock);
                        }
                } else {
                        __timerfd_remove_cancel(ctx);
                }
                spin_unlock(&ctx->cancel_lock);
        }
When a timerfd is freed (fd is closed), then the ``might_cancel``
flag of the timerfd object is cleared, the object removed from the
``cancel_list`` and destroyed, as shown in this simplified and inlined
version of timerfd_release()::

        int timerfd_release(struct inode *inode, struct file *file)
        {

@@ -403,7 +431,10 @@ destroyed::

                }
                spin_unlock(&ctx->cancel_lock);

                if (isalarm(ctx))
                        alarm_cancel(&ctx->t.alarm);
                else
                        hrtimer_cancel(&ctx->t.tmr);
                kfree_rcu(ctx, rcu);
                return 0;
        }
@@ -416,6 +447,7 @@ objects::

        void timerfd_clock_was_set(void)
        {
                ktime_t moffs = ktime_mono_to_real(0);
                struct timerfd_ctx *ctx;
                unsigned long flags;

@@ -424,7 +456,7 @@ objects::

                        if (!ctx->might_cancel)
                                continue;
                        spin_lock_irqsave(&ctx->wqh.lock, flags);
                        if (ctx->moffs != moffs) {
                                ctx->moffs = KTIME_MAX;
                                ctx->ticks++;
                                wake_up_locked_poll(&ctx->wqh, EPOLLIN);

@@ -434,10 +466,10 @@ objects::

                rcu_read_unlock();
        }

The key point is that because RCU-protected traversal of the
``cancel_list`` happens concurrently with object addition and removal,
sometimes the traversal can access an object that has been removed from
the list. In this example, a flag is used to skip such objects.
Summary

...
@@ -17,7 +17,9 @@ state::

        rcu_read_lock_held() for normal RCU.
        rcu_read_lock_bh_held() for RCU-bh.
        rcu_read_lock_sched_held() for RCU-sched.
        rcu_read_lock_any_held() for any of normal RCU, RCU-bh, and RCU-sched.
        srcu_read_lock_held() for SRCU.
        rcu_read_lock_trace_held() for RCU Tasks Trace.

These functions are conservative, and will therefore return 1 if they
aren't certain (for example, if CONFIG_DEBUG_LOCK_ALLOC is not set).

@@ -53,6 +55,8 @@ checking of rcu_dereference() primitives:

                is invoked by both SRCU readers and updaters.
        rcu_dereference_raw(p):
                Don't check. (Use sparingly, if at all.)
        rcu_dereference_raw_check(p):
                Don't do lockdep at all. (Use sparingly, if at all.)
        rcu_dereference_protected(p, c):
                Use explicit check expression "c", and omit all barriers
                and compiler constraints. This is useful when the data

...
.. SPDX-License-Identifier: GPL-2.0

====================
Compute Accelerators
====================

.. toctree::
   :maxdepth: 1

   introduction

.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`
.. SPDX-License-Identifier: GPL-2.0
============
Introduction
============
The Linux compute accelerators subsystem is designed to expose compute
accelerators in a common way to user-space and provide a common set of
functionality.
These devices can be either stand-alone ASICs or IP blocks inside an SoC/GPU.
Although these devices are typically designed to accelerate
Machine-Learning (ML) and/or Deep-Learning (DL) computations, the accel layer
is not limited to handling these types of accelerators.
Typically, a compute accelerator will belong to one of the following
categories:
- Edge AI - doing inference at an edge device. It can be an embedded ASIC/FPGA,
or an IP inside a SoC (e.g. laptop web camera). These devices
are typically configured using registers and can work with or without DMA.
- Inference data-center - single/multi user devices in a large server. This
type of device can be stand-alone or an IP inside a SoC or a GPU. It will
have on-board DRAM (to hold the DL topology), DMA engines and
command submission queues (either kernel or user-space queues).
It might also have an MMU to manage multiple users and might also enable
virtualization (SR-IOV) to support multiple VMs on the same device. In
addition, these devices will usually have some tools, such as profiler and
debugger.
- Training data-center - Similar to Inference data-center cards, but typically
have more computational power and memory b/w (e.g. HBM) and will likely have
a method of scaling-up/out, i.e. connecting to other training cards inside
the server or in other servers, respectively.
All these devices typically have different runtime user-space software stacks
that are tailor-made to their h/w. In addition, they will also probably
include a compiler to generate programs to their custom-made computational
engines. Typically, the common layer in user-space will be the DL frameworks,
such as PyTorch and TensorFlow.
Sharing code with DRM
=====================
Because this type of device can be an IP inside GPUs or have similar
characteristics as those of GPUs, the accel subsystem will use the
DRM subsystem's code and functionality; i.e., the accel core code will
be part of the DRM subsystem and an accel device will be a new type of DRM
device.
This will allow us to leverage the extensive DRM code-base and
collaborate with DRM developers who have experience with this type of
device. In addition, new features that will be added for the accelerator
drivers can be of use to GPU drivers as well.
Differentiation from GPUs
=========================
Because we want to prevent the extensive user-space graphic software stack
from trying to use an accelerator as a GPU, the compute accelerators will be
differentiated from GPUs by using a new major number and new device char files.
Furthermore, the drivers will be located in a separate place in the kernel
tree - drivers/accel/.
The accelerator devices will be exposed to the user space with the dedicated
261 major number and will have the following convention:
- device char files - /dev/accel/accel*
- sysfs - /sys/class/accel/accel*/
- debugfs - /sys/kernel/debug/accel/accel*/
Getting Started
===============
First, read the DRM documentation at Documentation/gpu/index.rst.
Not only will it explain how to write a new DRM driver but it will also
contain all the information on how to contribute, the Code Of Conduct and
the coding style/documentation. All of that is the same for the
accel subsystem.
Second, make sure the kernel is configured with CONFIG_DRM_ACCEL.
To expose your device as an accelerator, two changes need to be made
in your driver (as opposed to a standard DRM driver):
- Add the DRIVER_COMPUTE_ACCEL feature flag in your drm_driver's
driver_features field. It is important to note that this driver feature is
mutually exclusive with DRIVER_RENDER and DRIVER_MODESET. Devices that want
to expose both graphics and compute device char files should be handled by
two drivers that are connected using the auxiliary bus framework.
- Change the open callback in your driver fops structure to accel_open().
Alternatively, your driver can use the DEFINE_DRM_ACCEL_FOPS macro to easily
set the correct function operations pointers structure.
External References
===================
email threads
-------------
* `Initial discussion on the New subsystem for acceleration devices <https://lkml.org/lkml/2022/7/31/83>`_ - Oded Gabbay (2022)
* `patch-set to add the new subsystem <https://lkml.org/lkml/2022/10/22/544>`_ - Oded Gabbay (2022)
Conference talks
----------------
* `LPC 2022 Accelerators BOF outcomes summary <https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html>`_ - Dave Airlie (2022)
@@ -348,8 +348,13 @@ this can be accomplished with::

        echo huge_idle > /sys/block/zramX/writeback

If a user chooses to writeback only incompressible pages (pages that none of
the algorithms can compress) this can be accomplished with::

        echo incompressible > /sys/block/zramX/writeback

If an admin wants to write a specific page in zram device to the backing device,
they could write a page index into the interface::

        echo "page_index=1251" > /sys/block/zramX/writeback

@@ -401,6 +406,87 @@ budget in next setting is user's job.

If admin wants to measure writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.
recompression
-------------
With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
(secondary) compression algorithms. The basic idea is that an alternative
compression algorithm can provide a better compression ratio at the price
of (potentially) slower compression/decompression speeds. An alternative
compression algorithm can, for example, be more successful at compressing
huge pages (those that the default algorithm failed to compress). Another
application is idle pages recompression - pages that are cold and sit in
memory can be recompressed using a more effective algorithm and, hence,
reduce zsmalloc memory usage.
With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
one primary and up to 3 secondary ones. Primary zram compressor is explained
in "3) Select compression algorithm", secondary algorithms are configured
using the recomp_algorithm device attribute.
Example::

        #show supported recompression algorithms
        cat /sys/block/zramX/recomp_algorithm
        #1: lzo lzo-rle lz4 lz4hc [zstd]
        #2: lzo lzo-rle lz4 [lz4hc] zstd
Alternative compression algorithms are sorted by priority. In the example
above, zstd is used as the first alternative algorithm, which has priority
of 1, while lz4hc is configured as a compression algorithm with priority 2.
An alternative compression algorithm's priority is provided during
algorithm configuration::

        #select zstd recompression algorithm, priority 1
        echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm

        #select deflate recompression algorithm, priority 2
        echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
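The bracketed entry on each recomp_algorithm line marks the algorithm currently selected for that priority. As a quick illustration of that format (a sketch that inlines the sample output from above instead of reading a live device), the selected algorithm can be extracted with standard shell tools:

```shell
# Illustrative parsing of the documented recomp_algorithm output format;
# the sample text is inlined, so no zram device is required.
sample='#1: lzo lzo-rle lz4 lz4hc [zstd]
#2: lzo lzo-rle lz4 [lz4hc] zstd'

# The bracketed field on the "#1:" line is the priority-1 selection.
active1=$(printf '%s\n' "$sample" | awk '/^#1:/ {
        for (i = 2; i <= NF; i++)
                if ($i ~ /^\[.*\]$/) { gsub(/[][]/, "", $i); print $i }
}')
echo "priority 1 selects: $active1"
```

With the sample above this prints `priority 1 selects: zstd`.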
Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
which controls recompression.
Examples::

        #IDLE pages recompression is activated by `idle` mode
        echo "type=idle" > /sys/block/zramX/recompress

        #HUGE pages recompression is activated by `huge` mode
        echo "type=huge" > /sys/block/zramX/recompress

        #HUGE_IDLE pages recompression is activated by `huge_idle` mode
        echo "type=huge_idle" > /sys/block/zramX/recompress
The number of idle pages can be significant, so user-space can pass a size
threshold (in bytes) to the recompress knob: zram will recompress only pages
of equal or greater size::

        #recompress all pages larger than 3000 bytes
        echo "threshold=3000" > /sys/block/zramX/recompress

        #recompress idle pages larger than 2000 bytes
        echo "type=idle threshold=2000" > /sys/block/zramX/recompress
Recompression of idle pages requires memory tracking.
During re-compression, for every page that matches the re-compression
criteria, ZRAM iterates the list of registered alternative compression
algorithms in order of their priorities. ZRAM stops either when
re-compression was successful (the re-compressed object is smaller in
size than the original one) and matches the re-compression criteria
(e.g. size threshold), or when there are no secondary algorithms left
to try. If none of the secondary algorithms can successfully re-compress
the page, such a page is marked as incompressible, so ZRAM will not
attempt to re-compress it in the future.
This re-compression behaviour, when it iterates through the list of
registered compression algorithms, increases our chances of finding the
algorithm that successfully compresses a particular page. Sometimes, however,
it is convenient (and sometimes even necessary) to limit recompression to
one particular algorithm so that it will not try any other algorithms.
This can be achieved by providing an algo=NAME parameter::

        #use zstd algorithm only (if registered)
        echo "type=huge algo=zstd" > /sys/block/zramX/recompress
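Putting the pieces together, a minimal recompression setup might look like the sketch below. The device name `zram0` and the algorithm choices are illustrative assumptions; the writes require CONFIG_ZRAM_MULTI_COMP and an already-initialized zram device, so this is a configuration sketch rather than a portable script:

```shell
# Illustrative configuration sketch: assumes /sys/block/zram0 exists and
# the kernel was built with CONFIG_ZRAM_MULTI_COMP.
echo "algo=zstd priority=1"    > /sys/block/zram0/recomp_algorithm
echo "algo=deflate priority=2" > /sys/block/zram0/recomp_algorithm

# Mark resident pages idle, then recompress idle pages larger than 2000 bytes.
echo all > /sys/block/zram0/idle
echo "type=idle threshold=2000" > /sys/block/zram0/recompress
```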
memory tracking
===============

@@ -411,9 +497,11 @@ pages of the process with*pagemap.

If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::

          300    75.033841 .wh...
          301    63.806904 s.....
          302    63.806919 ..hi..
          303    62.801919 ....r.
          304   146.781902 ..hi.n
@@ -430,6 +518,10 @@ Third column

                huge page
        i:
                idle page
        r:
                recompressed page (secondary compression algorithm)
        n:
                none of the algorithms (including secondary ones) could
                compress it

First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
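Because each flag occupies a fixed column in the state field, per-state page counts can be pulled out of a block_state dump with ordinary text tools. A small sketch using the sample output above (inlined here rather than read from debugfs):

```shell
# Count huge (h) and idle (i) blocks in a sample block_state dump;
# the sample mirrors the documented six-flag format.
sample='300    75.033841 .wh...
301    63.806904 s.....
302    63.806919 ..hi..
303    62.801919 ....r.
304   146.781902 ..hi.n'

huge=$(printf '%s\n' "$sample" | awk '$3 ~ /h/ { n++ } END { print n + 0 }')
idle=$(printf '%s\n' "$sample" | awk '$3 ~ /i/ { n++ } END { print n + 0 }')
echo "huge=$huge idle=$idle"
```

For the sample above this prints `huge=3 idle=2`.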
...
@@ -229,7 +229,7 @@ In addition to the kernel command line, the boot config can be used for

passing the kernel parameters. All the key-value pairs under ``kernel``
key will be passed to kernel cmdline directly. Moreover, the key-value
pairs under ``init`` will be passed to init process via the cmdline.
The parameters are concatenated with user-given kernel cmdline string
in the following order, so that the command line parameter can override
bootconfig parameters (this depends on how the subsystem handles parameters
but in general, earlier parameter will be overwritten by later one.)::

...
@@ -543,7 +543,8 @@ inactive_anon # of bytes of anonymous and swap cache memory on inactive

                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory and MADV_FREE anonymous
                memory (LazyFree pages) on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================

...
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.

        This is a simple interface to trigger memory reclaim in the
        target cgroup.

        This file accepts a string which contains the number of bytes to
        reclaim.

        Example::

          echo "1G" > memory.reclaim

        Please note that the kernel can over or under reclaim from
        the target cgroup. If less bytes are reclaimed than the
        specified amount, -EAGAIN is returned.

@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.

        This means that the networking layer will not adapt based on
        reclaim induced by memory.reclaim.

        This file also allows the user to specify the nodes to reclaim from,
        via the 'nodes=' key, for example::

          echo "1G nodes=0,1" > memory.reclaim

        The above instructs the kernel to reclaim memory from nodes 0,1.
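For instance, proactive reclaim from a single NUMA node could be triggered as below; the cgroup path and name are assumptions for illustration, and kernels without `nodes=` support will reject the write:

```shell
# Illustrative configuration sketch: assumes cgroup v2 is mounted at
# /sys/fs/cgroup and a cgroup named "workload" already exists.
if ! echo "512M nodes=0" > /sys/fs/cgroup/workload/memory.reclaim; then
        echo "reclaim fell short or nodes= is unsupported" >&2
fi
```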
  memory.peak
        A read-only single value file which exists on non-root
        cgroups.

@@ -1488,12 +1491,18 @@ PAGE_SIZE multiple when read back.

          pgscan_direct (npn)
                Amount of scanned pages directly (in an inactive LRU list)

          pgscan_khugepaged (npn)
                Amount of scanned pages by khugepaged (in an inactive LRU list)

          pgsteal_kswapd (npn)
                Amount of reclaimed pages by kswapd

          pgsteal_direct (npn)
                Amount of reclaimed pages directly

          pgsteal_khugepaged (npn)
                Amount of reclaimed pages by khugepaged

          pgfault (npn)
                Total number of page faults incurred

...
@@ -858,7 +858,7 @@ CIFS kernel module parameters

These module parameters can be specified or modified either at module load
time or at runtime by using the interface::

        /sys/module/cifs/parameters/<param>

i.e.::

...
@@ -123,3 +123,11 @@ Other examples (per target):

        0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
        fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
        51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
For setups using device-mapper on top of asynchronously probed block
devices (MMC, USB, ..), it may be necessary to tell dm-init to
explicitly wait for them to become available before setting up the
device-mapper tables. This can be done with the "dm-mod.waitfor="
module parameter, which takes a list of devices to wait for::

        dm-mod.waitfor=<device1>[,..,<deviceN>]
@@ -3080,6 +3080,11 @@

                ...
                255 = /dev/osd255      256th OSD Device

 261 char       Compute Acceleration Devices

                  0 = /dev/accel/accel0        First acceleration device
                  1 = /dev/accel/accel1        Second acceleration device
                    ...

 384-511 char   RESERVED FOR DYNAMIC ASSIGNMENT
                Character devices that request a dynamic allocation of major
                number will take numbers starting from 511 and downward,

...
=================================
Hardware random number generators
=================================

Introduction
============

...
@@ -595,3 +595,32 @@ X2TLB

-----

Indicates whether the crashed kernel enabled SH extended mode.
RISCV64
=======
VA_BITS
-------
The maximum number of bits for virtual addresses. Used to compute the
virtual memory ranges.
PAGE_OFFSET
-----------
Indicates the virtual kernel start address of the direct-mapped RAM region.
phys_ram_base
-------------
Indicates the start physical RAM address.
MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END|KERNEL_LINK_ADDR
----------------------------------------------------------------------------------------------
Used to get the correct ranges:
* MODULES_VADDR ~ MODULES_END : Kernel module space.
* VMALLOC_START ~ VMALLOC_END : vmalloc() / ioremap() space.
* VMEMMAP_START ~ VMEMMAP_END : vmemmap space, used for struct page array.
* KERNEL_LINK_ADDR : start address of Kernel link and BPF
@@ -703,6 +703,17 @@

        condev=         [HW,S390] console device
        conmode=

        con3215_drop=   [S390] 3215 console drop mode.
                        Format: y|n|Y|N|1|0
                        When set to true, drop data on the 3215 console when
                        the console buffer is full. In this case the
                        operator using a 3270 terminal emulator (for example
                        x3270) does not have to enter the clear key for the
                        console output to advance and the kernel to continue.
                        This leads to a much faster boot time when a 3270
                        terminal emulator is active. If no 3270 terminal
                        emulator is used, this parameter has no effect.

        console=        [KNL] Output console device and options.

                tty<n>  Use the virtual console device <n>.

@@ -831,7 +842,7 @@

                        memory region [offset, offset + size] for that kernel
                        image. If '@offset' is omitted, then a suitable offset
                        is selected automatically.

                        [KNL, X86-64, ARM64] Select a region under 4G first, and
                        fall back to reserve region above 4G when '@offset'
                        hasn't been specified.

                        See Documentation/admin-guide/kdump/kdump.rst for further details.
@@ -851,26 +862,23 @@

                        available.
                        It will be ignored if crashkernel=X is specified.

        crashkernel=size[KMG],low
                        [KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
                        is passed, kernel could allocate physical memory region
                        above 4G, that cause second kernel crash on system
                        that require some amount of low memory, e.g. swiotlb
                        requires at least 64M+32K low memory, also enough extra
                        low memory is needed to make sure DMA buffers for 32-bit
                        devices won't run out. Kernel would try to allocate
                        default size of memory below 4G automatically. The default
                        size is platform dependent.

                          --> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
                          --> arm64: 128MiB

                        This one lets the user specify own low range under 4G
                        for second kernel instead.
                        0: to disable low allocation.
                        It will be ignored when crashkernel=X,high is not used
                        or memory reserved is below 4G.

        cryptomgr.notests
                        [KNL] Disable crypto self-tests
@@ -1042,6 +1050,11 @@

                        them frequently to increase the rate of SLB faults
                        on kernel addresses.

        stress_hpt      [PPC]
                        Limits the number of kernel HPT entries in the hash
                        page table to increase the rate of hash page table
                        faults on kernel addresses.

        disable=        [IPV6]
                        See Documentation/networking/ipv6.rst.
...@@ -2300,7 +2313,13 @@ ...@@ -2300,7 +2313,13 @@
Provide an override to the IOAPIC-ID<->DEVICE-ID Provide an override to the IOAPIC-ID<->DEVICE-ID
mapping provided in the IVRS ACPI table. mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted. By default, PCI segment is 0, and can be omitted.
For example:
For example, to map IOAPIC-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
ivrs_ioapic=10@0001:00:14.0
Deprecated formats:
* To map IOAPIC-ID decimal 10 to PCI device 00:14.0 * To map IOAPIC-ID decimal 10 to PCI device 00:14.0
write the parameter as: write the parameter as:
ivrs_ioapic[10]=00:14.0 ivrs_ioapic[10]=00:14.0
...@@ -2312,7 +2331,13 @@ ...@@ -2312,7 +2331,13 @@
Provide an override to the HPET-ID<->DEVICE-ID Provide an override to the HPET-ID<->DEVICE-ID
mapping provided in the IVRS ACPI table. mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted. By default, PCI segment is 0, and can be omitted.
For example:
For example, to map HPET-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
ivrs_hpet=10@0001:00:14.0
Deprecated formats:
* To map HPET-ID decimal 0 to PCI device 00:14.0 * To map HPET-ID decimal 0 to PCI device 00:14.0
write the parameter as: write the parameter as:
ivrs_hpet[0]=00:14.0 ivrs_hpet[0]=00:14.0
...@@ -2323,15 +2348,20 @@ ...@@ -2323,15 +2348,20 @@
ivrs_acpihid [HW,X86-64] ivrs_acpihid [HW,X86-64]
Provide an override to the ACPI-HID:UID<->DEVICE-ID Provide an override to the ACPI-HID:UID<->DEVICE-ID
mapping provided in the IVRS ACPI table. mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
For example, to map UART-HID:UID AMD0020:0 to For example, to map UART-HID:UID AMD0020:0 to
PCI segment 0x1 and PCI device ID 00:14.5, PCI segment 0x1 and PCI device ID 00:14.5,
write the parameter as: write the parameter as:
ivrs_acpihid[0001:00:14.5]=AMD0020:0 ivrs_acpihid=AMD0020:0@0001:00:14.5
By default, PCI segment is 0, and can be omitted. Deprecated formats:
For example, PCI device 00:14.5 write the parameter as: * To map UART-HID:UID AMD0020:0 to PCI segment 0,
PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[00:14.5]=AMD0020:0 ivrs_acpihid[00:14.5]=AMD0020:0
* To map UART-HID:UID AMD0020:0 to PCI segment 0x1 and
PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[0001:00:14.5]=AMD0020:0
js= [HW,JOY] Analog joystick js= [HW,JOY] Analog joystick
See Documentation/input/joydev/joystick.rst. See Documentation/input/joydev/joystick.rst.
...@@ -3777,12 +3807,15 @@ ...@@ -3777,12 +3807,15 @@
shutdown the other cpus. Instead use the REBOOT_VECTOR shutdown the other cpus. Instead use the REBOOT_VECTOR
irq. irq.
nomodeset Disable kernel modesetting. DRM drivers will not perform nomodeset Disable kernel modesetting. Most systems' firmware
display-mode changes or accelerated rendering. Only the sets up a display mode and provides framebuffer memory
system framebuffer will be available for use if this was for output. With nomodeset, DRM and fbdev drivers will
set-up by the firmware or boot loader. not load if they could possibly displace the pre-
initialized output. Only the system framebuffer will
be available for use. The respective drivers will not
perform display-mode changes or accelerated rendering.
Useful as fallback, or for testing and debugging. Useful as error fallback, or for testing and debugging.
nomodule Disable module load nomodule Disable module load
...@@ -4566,17 +4599,15 @@ ...@@ -4566,17 +4599,15 @@
ramdisk_start= [RAM] RAM disk image start address ramdisk_start= [RAM] RAM disk image start address
random.trust_cpu={on,off} random.trust_cpu=off
[KNL] Enable or disable trusting the use of the [KNL] Disable trusting the use of the CPU's
CPU's random number generator (if available) to random number generator (if available) to
fully seed the kernel's CRNG. Default is controlled initialize the kernel's RNG.
by CONFIG_RANDOM_TRUST_CPU.
random.trust_bootloader={on,off} random.trust_bootloader=off
[KNL] Enable or disable trusting the use of a [KNL] Disable trusting the use of a seed
seed passed by the bootloader (if available) to passed by the bootloader (if available) to
fully seed the kernel's CRNG. Default is controlled initialize the kernel's RNG.
by CONFIG_RANDOM_TRUST_BOOTLOADER.
randomize_kstack_offset= randomize_kstack_offset=
[KNL] Enable or disable kernel stack offset [KNL] Enable or disable kernel stack offset
...@@ -6257,6 +6288,25 @@ ...@@ -6257,6 +6288,25 @@
See also Documentation/trace/ftrace.rst "trace options" See also Documentation/trace/ftrace.rst "trace options"
section. section.
trace_trigger=[trigger-list]
[FTRACE] Add an event trigger on specific events.
Set a trigger on top of a specific event, with an optional
filter.
The format is "trace_trigger=<event>.<trigger>[ if <filter>],..."
More than one trigger may be specified, comma delimited.
For example:
trace_trigger="sched_switch.stacktrace if prev_state == 2"
The above will enable the "stacktrace" trigger on the "sched_switch"
event, but only trigger it if the "prev_state" of the "sched_switch"
event is "2" (TASK_UNINTERRUPTIBLE).
See also "Event triggers" in Documentation/trace/events.rst
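The same trigger can also be set at runtime through tracefs, which is a convenient way to check the syntax before adding it to the kernel command line (path assumes tracefs is mounted at /sys/kernel/tracing; requires root):

```shell
# Runtime equivalent of the boot parameter example above:
echo 'stacktrace if prev_state == 2' > \
    /sys/kernel/tracing/events/sched/sched_switch/trigger
# Confirm the trigger was accepted:
cat /sys/kernel/tracing/events/sched/sched_switch/trigger
```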
traceoff_on_warning traceoff_on_warning
[FTRACE] enable this option to disable tracing when a [FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can warning is hit. This turns off "tracing_on". Tracing can
......
.. SPDX-License-Identifier: GPL-2.0
=================================
CEC driver-specific documentation
=================================
.. toctree::
:maxdepth: 2
pulse8-cec
.. SPDX-License-Identifier: GPL-2.0
========
HDMI CEC
========
Supported hardware in mainline
==============================
HDMI Transmitters:
- Exynos4
- Exynos5
- STIH4xx HDMI CEC
- V4L2 adv7511 (same HW, but a different driver from the drm adv7511)
- stm32
- Allwinner A10 (sun4i)
- Raspberry Pi
- dw-hdmi (Synopsys IP)
- amlogic (meson ao-cec and ao-cec-g12a)
- drm adv7511/adv7533
- omap4
- tegra
- rk3288, rk3399
- tda998x
- DisplayPort CEC-Tunneling-over-AUX on i915, nouveau and amdgpu
- ChromeOS EC CEC
- CEC for SECO boards (UDOO x86).
- Chrontel CH7322
HDMI Receivers:
- adv7604/11/12
- adv7842
- tc358743
USB Dongles (see below for additional information on how to use these
dongles):
- Pulse-Eight: the pulse8-cec driver implements the following module option:
``persistent_config``: by default this is off, but when set to 1 the driver
will store the current settings to the device's internal eeprom and restore
it the next time the device is connected to the USB port.
- RainShadow Tech. Note: this driver does not support the persistent_config
module option of the Pulse-Eight driver. The hardware supports it, but I
have no plans to add this feature. But I accept patches :-)
Miscellaneous:
- vivid: emulates a CEC receiver and CEC transmitter.
Can be used to test CEC applications without actual CEC hardware.
- cec-gpio. If the CEC pin is hooked up to a GPIO pin then
you can control the CEC line through this driver. This supports error
injection as well.
Utilities
=========
Utilities are available here: https://git.linuxtv.org/v4l-utils.git
``utils/cec-ctl``: control a CEC device
``utils/cec-compliance``: test compliance of a remote CEC device
``utils/cec-follower``: emulate a CEC follower device
Note that ``cec-ctl`` has support for the CEC Hospitality Profile as is
used in some hotel displays. See http://www.htng.org.
Note that the libcec library (https://github.com/Pulse-Eight/libcec) supports
the Linux CEC framework.
If you want to get the CEC specification, then look at the References of
the HDMI wikipedia page: https://en.wikipedia.org/wiki/HDMI. CEC is part
of the HDMI specification. HDMI 1.3 is freely available (very similar to
HDMI 1.4 w.r.t. CEC) and should be good enough for most things.
DisplayPort to HDMI Adapters with working CEC
=============================================
Background: most adapters do not support the CEC Tunneling feature,
and of those that do many did not actually connect the CEC pin.
Unfortunately, this means that while a CEC device is created, it
is actually all alone in the world and will never be able to see other
CEC devices.
This is a list of known working adapters that have CEC Tunneling AND
that properly connected the CEC pin. If you find adapters that work
but are not in this list, then drop me a note.
To test: hook up your DP-to-HDMI adapter to a CEC capable device
(typically a TV), then run::
cec-ctl --playback # Configure the PC as a CEC Playback device
cec-ctl -S # Show the CEC topology
The ``cec-ctl -S`` command should show at least two CEC devices,
ourselves and the CEC device you are connected to (i.e. typically the TV).
General note: I have only seen this work with the Parade PS175, PS176 and
PS186 chipsets and the MegaChips 2900. While MegaChips 28x0 claims CEC support,
I have never seen it work.
USB-C to HDMI
-------------
Samsung Multiport Adapter EE-PW700: https://www.samsung.com/ie/support/model/EE-PW700BBEGWW/
Kramer ADC-U31C/HF: https://www.kramerav.com/product/ADC-U31C/HF
Club3D CAC-2504: https://www.club-3d.com/en/detail/2449/usb_3.1_type_c_to_hdmi_2.0_uhd_4k_60hz_active_adapter/
DisplayPort to HDMI
-------------------
Club3D CAC-1080: https://www.club-3d.com/en/detail/2442/displayport_1.4_to_hdmi_2.0b_hdr/
CableCreation (SKU: CD0712): https://www.cablecreation.com/products/active-displayport-to-hdmi-adapter-4k-hdr
HP DisplayPort to HDMI True 4k Adapter (P/N 2JA63AA): https://www.hp.com/us-en/shop/pdp/hp-displayport-to-hdmi-true-4k-adapter
Mini-DisplayPort to HDMI
------------------------
Club3D CAC-1180: https://www.club-3d.com/en/detail/2443/mini_displayport_1.4_to_hdmi_2.0b_hdr/
Note that passive adapters will never work, you need an active adapter.
The Club3D adapters in this list are all MegaChips 2900 based. Other Club3D adapters
are PS176 based and do NOT have the CEC pin hooked up, so only the three Club3D
adapters above are known to work.
I suspect that MegaChips 2900 based designs in general are likely to work
whereas with the PS176 it is more hit-and-miss (mostly miss). The PS186 is
likely to have the CEC pin hooked up, it looks like they changed the reference
design for that chipset.
USB CEC Dongles
===============
These dongles appear as ``/dev/ttyACMX`` devices and need the ``inputattach``
utility to create the ``/dev/cecX`` devices. Support for the Pulse-Eight
has been added to ``inputattach`` 1.6.0. Support for the Rainshadow Tech has
been added to ``inputattach`` 1.6.1.
You also need udev rules to automatically start systemd services::
SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="2548", ATTRS{idProduct}=="1002", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="pulse8-cec-inputattach@%k.service"
SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="2548", ATTRS{idProduct}=="1001", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="pulse8-cec-inputattach@%k.service"
SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="ff59", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="rainshadow-cec-inputattach@%k.service"
and these systemd services:
For Pulse-Eight make /lib/systemd/system/pulse8-cec-inputattach@.service::
[Unit]
Description=inputattach for pulse8-cec device on %I
[Service]
Type=simple
ExecStart=/usr/bin/inputattach --pulse8-cec /dev/%I
For the RainShadow Tech make /lib/systemd/system/rainshadow-cec-inputattach@.service::
[Unit]
Description=inputattach for rainshadow-cec device on %I
[Service]
Type=simple
ExecStart=/usr/bin/inputattach --rainshadow-cec /dev/%I
For proper suspend/resume support create: /lib/systemd/system/restart-cec-inputattach.service::
[Unit]
Description=restart inputattach for cec devices
After=suspend.target
[Service]
Type=forking
ExecStart=/bin/bash -c 'for d in /dev/serial/by-id/usb-Pulse-Eight*; do /usr/bin/inputattach --daemon --pulse8-cec $d; done; for d in /dev/serial/by-id/usb-RainShadow_Tech*; do /usr/bin/inputattach --daemon --rainshadow-cec $d; done'
[Install]
WantedBy=suspend.target
And run ``systemctl enable restart-cec-inputattach``.
To automatically set the physical address of the CEC device whenever the
EDID changes, you can use ``cec-ctl`` with the ``-E`` option::
cec-ctl -E /sys/class/drm/card0-DP-1/edid
This assumes the dongle is connected to the card0-DP-1 output (``xrandr`` will tell
you which output is used) and it will poll for changes to the EDID and update
the Physical Address whenever they occur.
To automatically run this command you can use cron. Edit crontab with
``crontab -e`` and add this line::
@reboot /usr/local/bin/cec-ctl -E /sys/class/drm/card0-DP-1/edid
This only works for display drivers that expose the EDID in ``/sys/class/drm``,
such as the i915 driver.
CEC Without HPD
===============
Some displays when in standby mode have no HDMI Hotplug Detect signal, but
CEC is still enabled so connected devices can send an <Image View On> CEC
message in order to wake up such displays. Unfortunately, not all CEC
adapters can support this. An example is the Odroid-U3 SBC that has a
level-shifter that is powered off when the HPD signal is low, thus
blocking the CEC pin. Even though the SoC can use CEC without a HPD,
the level-shifter will prevent this from functioning.
There is a CEC capability flag to signal this: ``CEC_CAP_NEEDS_HPD``.
If set, then the hardware cannot wake up displays with this behavior.
Note for CEC application implementers: the <Image View On> message must
be the first message you send; don't send any other messages before it.
Certain very bad but unfortunately not uncommon CEC implementations
get very confused if they receive anything else but this message and
they won't wake up.
When writing a driver it can be tricky to test this. There are two
ways to do this:
1) Get a Pulse-Eight USB CEC dongle, connect an HDMI cable from your
device to the Pulse-Eight, but do not connect the Pulse-Eight to
the display.
Now configure the Pulse-Eight dongle::
cec-ctl -p0.0.0.0 --tv
and start monitoring::
sudo cec-ctl -M
On the device you are testing run::
cec-ctl --playback
It should report a physical address of f.f.f.f. Now run this
command::
cec-ctl -t0 --image-view-on
The Pulse-Eight should see the <Image View On> message. If not,
then something (hardware and/or software) is preventing the CEC
message from going out.
To make sure you have the wiring correct just connect the
Pulse-Eight to a CEC-enabled display and run the same command
on your device: now there is a HPD, so you should see the command
arriving at the Pulse-Eight.
2) If you have another linux device supporting CEC without HPD, then
you can just connect your device to that device. Yes, you can connect
two HDMI outputs together. You won't have a HPD (which is what we
want for this test), but the second device can monitor the CEC pin.
Otherwise use the same commands as in 1.
If CEC messages do not come through when there is no HPD, then you
need to figure out why. Typically it is either a hardware restriction
or the software powers off the CEC core when the HPD goes low. The
first cannot be corrected of course, the second will likely require
driver changes.
Microcontrollers & CEC
======================
We have seen some CEC implementations in displays that use a microcontroller
to sample the bus. This does not have to be a problem, but some implementations
have timing issues. This is hard to discover unless you can hook up a low-level
CEC debugger (see the next section).
You will see cases where the CEC transmitter holds the CEC line high or low for
a longer time than is allowed. For directed messages this is not a problem since
if that happens the message will not be Acked and it will be retransmitted.
For broadcast messages no such mechanism exists.
It's not clear what to do about this. It is probably wise to transmit some
broadcast messages twice to reduce the chance of them being lost. Specifically
<Standby> and <Active Source> are candidates for that.
Making a CEC debugger
=====================
By using a Raspberry Pi 2B/3/4 and some cheap components you can make
your own low-level CEC debugger.
Here is a picture of my setup:
https://hverkuil.home.xs4all.nl/rpi3-cec.jpg
It's a Raspberry Pi 3 together with a breadboard and some breadboard wires:
http://www.dx.com/p/diy-40p-male-to-female-male-to-male-female-to-female-dupont-line-wire-3pcs-356089#.WYLOOXWGN7I
Finally, one of these HDMI female-female passthrough connectors (full soldering type 1):
https://elabbay.myshopify.com/collections/camera/products/hdmi-af-af-v1a-hdmi-type-a-female-to-hdmi-type-a-female-pass-through-adapter-breakout-board?variant=45533926147
We've tested this and it works up to 4kp30 (297 MHz). The quality is not high
enough to pass-through 4kp60 (594 MHz).
I also added an RTC and a breakout shield:
https://www.amazon.com/Makerfire%C2%AE-Raspberry-Module-DS1307-Battery/dp/B00ZOXWHK4
https://www.dx.com/p/raspberry-pi-gpio-expansion-board-breadboard-easy-multiplexing-board-one-to-three-with-screw-for-raspberry-pi-2-3-b-b-2729992.html#.YGRCG0MzZ7I
These two are not needed but they make life a bit easier.
If you want to monitor the HPD line as well, then you need one of these
level shifters:
https://www.adafruit.com/product/757
(This is just where I got these components, there are many other places you
can get similar things).
The CEC pin of the HDMI connector needs to be connected to these pins:
CE0/IO8 and CE1/IO7 (pull-up GPIOs). The (optional) HPD pin of the HDMI
connector should be connected (via a level shifter to convert the 5V
to 3.3V) to these pins: IO17 and IO27. The (optional) 5V pin of the HDMI
connector should be connected (via a level shifter) to these pins: IO22
and IO24. Monitoring the HPD and 5V lines is not necessary, but it is helpful.
This device tree change will hook up the cec-gpio driver correctly in
e.g. ``arch/arm/boot/dts/bcm2837-rpi-3-b-plus.dts``::
cec-gpio@7 {
compatible = "cec-gpio";
cec-gpios = <&gpio 7 (GPIO_ACTIVE_HIGH|GPIO_OPEN_DRAIN)>;
hpd-gpios = <&gpio 17 GPIO_ACTIVE_HIGH>;
v5-gpios = <&gpio 22 GPIO_ACTIVE_HIGH>;
};
cec-gpio@8 {
compatible = "cec-gpio";
cec-gpios = <&gpio 8 (GPIO_ACTIVE_HIGH|GPIO_OPEN_DRAIN)>;
hpd-gpios = <&gpio 27 GPIO_ACTIVE_HIGH>;
v5-gpios = <&gpio 24 GPIO_ACTIVE_HIGH>;
};
This dts change will enable two cec GPIO devices: I typically use one to
send/receive CEC commands and the other to monitor. If you monitor using
an unconfigured CEC adapter then it will use GPIO interrupts which makes
monitoring very accurate.
The documentation on how to use the error injection is here: :ref:`cec_pin_error_inj`.
``cec-ctl --monitor-pin`` will do low-level CEC bus sniffing and analysis.
You can also store the CEC traffic to file using ``--store-pin`` and analyze
it later using ``--analyze-pin``.
You can also use this as a full-fledged CEC device by configuring it
using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``.
...@@ -38,13 +38,14 @@ The media subsystem ...@@ -38,13 +38,14 @@ The media subsystem
remote-controller remote-controller
cec
dvb dvb
cardlist cardlist
v4l-drivers v4l-drivers
dvb-drivers dvb-drivers
cec-drivers
**Copyright** |copy| 1999-2020 : LinuxTV Developers **Copyright** |copy| 1999-2020 : LinuxTV Developers
......
.. SPDX-License-Identifier: GPL-2.0
Pulse-Eight CEC Adapter driver
==============================
The pulse8-cec driver implements the following module option:
``persistent_config``
---------------------
By default this is off, but when set to 1 the driver will store the current
settings to the device's internal eeprom and restore it the next time the
device is connected to the USB port.
...@@ -31,4 +31,5 @@ Video4Linux (V4L) driver-specific documentation ...@@ -31,4 +31,5 @@ Video4Linux (V4L) driver-specific documentation
si4713 si4713
si476x si476x
vimc vimc
visl
vivid vivid
...@@ -35,11 +35,11 @@ of commands fits for the default topology: ...@@ -35,11 +35,11 @@ of commands fits for the default topology:
media-ctl -d platform:vimc -V '"Sensor A":0[fmt:SBGGR8_1X8/640x480]' media-ctl -d platform:vimc -V '"Sensor A":0[fmt:SBGGR8_1X8/640x480]'
media-ctl -d platform:vimc -V '"Debayer A":0[fmt:SBGGR8_1X8/640x480]' media-ctl -d platform:vimc -V '"Debayer A":0[fmt:SBGGR8_1X8/640x480]'
media-ctl -d platform:vimc -V '"Sensor B":0[fmt:SBGGR8_1X8/640x480]' media-ctl -d platform:vimc -V '"Scaler":0[fmt:RGB888_1X24/640x480]'
media-ctl -d platform:vimc -V '"Debayer B":0[fmt:SBGGR8_1X8/640x480]' media-ctl -d platform:vimc -V '"Scaler":0[crop:(100,50)/400x150]'
v4l2-ctl -z platform:vimc -d "RGB/YUV Capture" -v width=1920,height=1440 media-ctl -d platform:vimc -V '"Scaler":1[fmt:RGB888_1X24/300x700]'
v4l2-ctl -z platform:vimc -d "RGB/YUV Capture" -v width=300,height=700
v4l2-ctl -z platform:vimc -d "Raw Capture 0" -v pixelformat=BA81 v4l2-ctl -z platform:vimc -d "Raw Capture 0" -v pixelformat=BA81
v4l2-ctl -z platform:vimc -d "Raw Capture 1" -v pixelformat=BA81
Subdevices Subdevices
---------- ----------
......
.. SPDX-License-Identifier: GPL-2.0
The Virtual Stateless Decoder Driver (visl)
===========================================
A virtual stateless decoder device for stateless uAPI development
purposes.
This tool's objective is to help the development and testing of
userspace applications that use the V4L2 stateless API to decode media.
A userspace implementation can use visl to run a decoding loop even when
no hardware is available or when the kernel uAPI for the codec has not
been upstreamed yet. This can reveal bugs at an early stage.
This driver can also trace the contents of the V4L2 controls submitted
to it. It can also dump the contents of the vb2 buffers through a
debugfs interface. This is in many ways similar to the tracing
infrastructure available for other popular encode/decode APIs out there
and can help develop a userspace application by using another (working)
one as a reference.
.. note::
No actual decoding of video frames is performed by visl. The
V4L2 test pattern generator is used to write various debug information
to the capture buffers instead.
Module parameters
-----------------
- visl_debug: Activates debug info, printing various debug messages through
dprintk. Also controls whether per-frame debug info is shown. Defaults to off.
Note that enabling this feature can result in slow performance through serial.
- visl_transtime_ms: Simulated process time in milliseconds. Slowing down the
decoding speed can be useful for debugging.
- visl_dprintk_frame_start, visl_dprintk_frame_nframes: Dictates a range of
frames where dprintk is activated. This only controls the dprintk tracing on a
per-frame basis. Note that printing a lot of data can be slow through serial.
- keep_bitstream_buffers: Controls whether bitstream (i.e. OUTPUT) buffers are
kept after a decoding session. Defaults to false so as to reduce the amount of
clutter. keep_bitstream_buffers == false works well when live debugging the
client program with GDB.
- bitstream_trace_frame_start, bitstream_trace_nframes: Similar to
visl_dprintk_frame_start, visl_dprintk_nframes, but controls the dumping of
buffer data through debugfs instead.
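As a sketch of how these parameters can be combined when loading the module (all values here are illustrative, not recommendations):

```shell
# Load visl with debug output enabled, a 25 ms simulated decode
# time, and per-frame dprintk tracing limited to frames 30..39:
modprobe visl visl_debug=1 visl_transtime_ms=25 \
    visl_dprintk_frame_start=30 visl_dprintk_frame_nframes=10
```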
What is the default use case for this driver?
---------------------------------------------
This driver can be used as a way to compare different userspace implementations.
This assumes that a working client is run against visl and that the ftrace and
OUTPUT buffer data is subsequently used to debug a work-in-progress
implementation.
Information on reference frames, their timestamps, the status of the OUTPUT and
CAPTURE queues and more can be read directly from the CAPTURE buffers.
Supported codecs
----------------
The following codecs are supported:
- FWHT
- MPEG2
- VP8
- VP9
- H.264
- HEVC
visl trace events
-----------------
The trace events are defined on a per-codec basis, e.g.:
.. code-block:: bash
$ ls /sys/kernel/debug/tracing/events/ | grep visl
visl_fwht_controls
visl_h264_controls
visl_hevc_controls
visl_mpeg2_controls
visl_vp8_controls
visl_vp9_controls
For example, in order to dump HEVC SPS data:
.. code-block:: bash
$ echo 1 > /sys/kernel/debug/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
The SPS data will be dumped to the trace buffer, i.e.:
.. code-block:: bash
$ cat /sys/kernel/debug/tracing/trace
video_parameter_set_id 0
seq_parameter_set_id 0
pic_width_in_luma_samples 1920
pic_height_in_luma_samples 1080
bit_depth_luma_minus8 0
bit_depth_chroma_minus8 0
log2_max_pic_order_cnt_lsb_minus4 4
sps_max_dec_pic_buffering_minus1 6
sps_max_num_reorder_pics 2
sps_max_latency_increase_plus1 0
log2_min_luma_coding_block_size_minus3 0
log2_diff_max_min_luma_coding_block_size 3
log2_min_luma_transform_block_size_minus2 0
log2_diff_max_min_luma_transform_block_size 3
max_transform_hierarchy_depth_inter 2
max_transform_hierarchy_depth_intra 2
pcm_sample_bit_depth_luma_minus1 0
pcm_sample_bit_depth_chroma_minus1 0
log2_min_pcm_luma_coding_block_size_minus3 0
log2_diff_max_min_pcm_luma_coding_block_size 0
num_short_term_ref_pic_sets 0
num_long_term_ref_pics_sps 0
chroma_format_idc 1
sps_max_sub_layers_minus1 0
flags AMP_ENABLED|SAMPLE_ADAPTIVE_OFFSET|TEMPORAL_MVP_ENABLED|STRONG_INTRA_SMOOTHING_ENABLED
Dumping OUTPUT buffer data through debugfs
------------------------------------------
If the **VISL_DEBUGFS** Kconfig is enabled, visl will populate
**/sys/kernel/debug/visl/bitstream** with OUTPUT buffer data according to the
values of bitstream_trace_frame_start and bitstream_trace_nframes. This can
highlight errors as broken clients may fail to fill the buffers properly.
A single file is created for each processed OUTPUT buffer. Its name contains an
integer that denotes the buffer sequence, i.e.:
.. code-block:: c
snprintf(name, 32, "bitstream%d", run->src->sequence);
Dumping the values is simply a matter of reading from the file, i.e.:
For the buffer with sequence == 0:
.. code-block:: bash
$ xxd /sys/kernel/debug/visl/bitstream/bitstream0
00000000: 2601 af04 d088 bc25 a173 0e41 a4f2 3274 &......%.s.A..2t
00000010: c668 cb28 e775 b4ac f53a ba60 f8fd 3aa1 .h.(.u...:.`..:.
00000020: 46b4 bcfc 506c e227 2372 e5f5 d7ea 579f F...Pl.'#r....W.
00000030: 6371 5eb5 0eb8 23b5 ca6a 5de5 983a 19e4 cq^...#..j]..:..
00000040: e8c3 4320 b4ba a226 cbc1 4138 3a12 32d6 ..C ...&..A8:.2.
00000050: fef3 247b 3523 4e90 9682 ac8e eb0c a389 ..${5#N.........
00000060: ddd0 6cfc 0187 0e20 7aae b15b 1812 3d33 ..l.... z..[..=3
00000070: e1c5 f425 a83a 00b7 4f18 8127 3c4c aefb ...%.:..O..'<L..
For the buffer with sequence == 1:
.. code-block:: bash
$ xxd /sys/kernel/debug/visl/bitstream/bitstream1
00000000: 0201 d021 49e1 0c40 aa11 1449 14a6 01dc ...!I..@...I....
00000010: 7023 889a c8cd 2cd0 13b4 dab0 e8ca 21fe p#....,.......!.
00000020: c4c8 ab4c 486e 4e2f b0df 96cc c74e 8dde ...LHnN/.....N..
00000030: 8ce7 ee36 d880 4095 4d64 30a0 ff4f 0c5e ...6..@.Md0..O.^
00000040: f16b a6a1 d806 ca2a 0ece a673 7bea 1f37 .k.....*...s{..7
00000050: 370f 5bb9 1dc4 ba21 6434 bc53 0173 cba0 7.[....!d4.S.s..
00000060: dfe6 bc99 01ea b6e0 346b 92b5 c8de 9f5d ........4k.....]
00000070: e7cc 3484 1769 fef2 a693 a945 2c8b 31da ..4..i.....E,.1.
And so on.
By default, the files are removed during STREAMOFF. This is to reduce the amount
of clutter.
...@@ -392,7 +392,7 @@ Which one is returned depends on the chosen channel, each next valid channel ...@@ -392,7 +392,7 @@ Which one is returned depends on the chosen channel, each next valid channel
will cycle through the possible audio subchannel combinations. This allows will cycle through the possible audio subchannel combinations. This allows
you to test the various combinations by just switching channels.. you to test the various combinations by just switching channels..
Finally, for these inputs the v4l2_timecode struct is filled in in the Finally, for these inputs the v4l2_timecode struct is filled in the
dequeued v4l2_buffer struct. dequeued v4l2_buffer struct.
......
...@@ -88,6 +88,9 @@ comma (","). :: ...@@ -88,6 +88,9 @@ comma (","). ::
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ │ tried_regions/
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ... │ │ │ │ │ │ ...
│ │ │ │ ... │ │ │ │ ...
│ │ ... │ │ ...
...@@ -125,7 +128,14 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the ...@@ -125,7 +128,14 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files ``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the for each DAMON-based operation scheme of the kdamond. For details of the
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
operation scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
file clears the DAMON-based operation scheme action tried regions directory for
each DAMON-based operation scheme of the kdamond. For details of the
DAMON-based operation scheme action tried regions directory, please refer to
:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
...@@ -166,6 +176,8 @@ You can set and get what type of monitoring operations DAMON will use for the ...@@ -166,6 +176,8 @@ You can set and get what type of monitoring operations DAMON will use for the
context by writing one of the keywords listed in ``avail_operations`` file and context by writing one of the keywords listed in ``avail_operations`` file and
reading from the ``operations`` file. reading from the ``operations`` file.
.. _sysfs_monitoring_attrs:
contexts/<N>/monitoring_attrs/ contexts/<N>/monitoring_attrs/
------------------------------ ------------------------------
...@@ -235,6 +247,9 @@ In each region directory, you will find two files (``start`` and ``end``). You ...@@ -235,6 +247,9 @@ In each region directory, you will find two files (``start`` and ``end``). You
can set and get the start and end addresses of the initial monitoring target can set and get the start and end addresses of the initial monitoring target
region by writing to and reading from the files, respectively. region by writing to and reading from the files, respectively.
Each region should not overlap with the others. ``end`` of directory ``N``
should be equal to or smaller than ``start`` of directory ``N+1``.
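For instance, assuming a kdamond with one context and one target, two adjacent but non-overlapping initial regions could be configured like this (addresses are illustrative only):

```shell
cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/targets/0/regions
echo 2 > nr_regions
# Region 0 covers [4096, 8192), region 1 covers [8192, 16384);
# 0/end equal to 1/start satisfies the non-overlap rule above.
echo 4096  > 0/start
echo 8192  > 0/end
echo 8192  > 1/start
echo 16384 > 1/end
```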
contexts/<N>/schemes/ contexts/<N>/schemes/
--------------------- ---------------------
...@@ -252,8 +267,9 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme. ...@@ -252,8 +267,9 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
schemes/<N>/
------------

In each scheme directory, five directories (``access_pattern``, ``quotas``,
``watermarks``, ``stats``, and ``tried_regions``) and one file (``action``)
exist.

The ``action`` file is for setting and getting what action you want to apply to
memory regions having the specific access pattern of interest.  The keywords
...@@ -348,6 +364,32 @@ should ask DAMON sysfs interface to update the content of the files for the
stats by writing a special keyword, ``update_schemes_stats``, to the relevant
``kdamonds/<N>/state`` file.
.. _sysfs_schemes_tried_regions:

schemes/<N>/tried_regions/
--------------------------

When a special keyword, ``update_schemes_tried_regions``, is written to the
relevant ``kdamonds/<N>/state`` file, DAMON creates directories named with
integers starting from ``0`` under this directory.  Each directory contains
files exposing detailed information about each memory region that the
corresponding scheme's ``action`` has tried to be applied to during the next
:ref:`aggregation interval <sysfs_monitoring_attrs>`.  The information
includes the address range, ``nr_accesses``, and ``age`` of the region.

The directories will be removed when another special keyword,
``clear_schemes_tried_regions``, is written to the relevant
``kdamonds/<N>/state`` file.

tried_regions/<N>/
------------------

In each region directory, you will find four files (``start``, ``end``,
``nr_accesses``, and ``age``).  Reading the files will show the start and end
addresses, ``nr_accesses``, and ``age`` of the region that the corresponding
DAMON-based operation scheme's ``action`` has tried to be applied to.
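Putting the two keywords together, a userspace reader for a scheme's tried regions can be sketched as below. The sysfs path in the comment is illustrative and depends on where DAMON sysfs is mounted; ``dump_tried_regions`` is a made-up helper:

```shell
# Sketch: print every tried region of one scheme as "start-end nr_accesses age".
# dump_tried_regions is a hypothetical helper, not part of DAMON itself.
dump_tried_regions() {
    scheme_dir=$1
    for r in "$scheme_dir"/tried_regions/[0-9]*; do
        [ -d "$r" ] || continue
        printf '%s-%s nr_accesses=%s age=%s\n' \
            "$(cat "$r/start")" "$(cat "$r/end")" \
            "$(cat "$r/nr_accesses")" "$(cat "$r/age")"
    done
}

# Real usage would first ask DAMON to populate the directories, e.g.:
#   echo update_schemes_tried_regions > /sys/kernel/mm/damon/admin/kdamonds/0/state
#   dump_tried_regions /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0
```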
Example
~~~~~~~

...@@ -465,8 +507,9 @@ regions in case of physical memory monitoring. Therefore, users should set the
monitoring target regions by themselves.

In such cases, users can explicitly set the initial monitoring target regions
as they want, by writing proper values to the ``init_regions`` file.  The input
should be a sequence of three integers separated by white space that represent
one region, in the form below::

    <target idx> <start address> <end address>

...@@ -481,9 +524,9 @@ ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one

    # cd <debugfs>/damon
    # cat target_ids
    42 4242
    # echo "0 1 100 \
            0 100 200 \
            1 20 40 \
            1 50 100" > init_regions

Note that this sets the initial monitoring target regions only.  In case of

...
...@@ -14,13 +14,7 @@ for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.

Some potential benefits:

* Desktop/laptop users with limited RAM capacities can mitigate the
  performance impact of swapping.

...
...@@ -15,10 +15,10 @@ HiSilicon PCIe PMU driver
The PCIe PMU driver registers a perf PMU with the name of its sicl-id and PCIe
Core id::

    /sys/bus/event_source/hisi_pcie<sicl>_core<core>

The PMU driver provides a description of available events and filter options in
sysfs, see /sys/bus/event_source/devices/hisi_pcie<sicl>_core<core>.

The "format" directory describes all formats of the config (events) and config1
(filter options) fields of the perf_event_attr structure. The "events" directory
...@@ -33,13 +33,13 @@ monitored by PMU.

Example usage of perf::

    $# perf list
    hisi_pcie0_core0/rx_mwr_latency/ [kernel PMU event]
    hisi_pcie0_core0/rx_mwr_cnt/     [kernel PMU event]
    ------------------------------------------
    $# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
    $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
    $# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/

The current driver does not support sampling, so "perf record" is unsupported.
Also attaching to a task is unsupported for the PCIe PMU.
...@@ -48,59 +48,83 @@ Filter options
--------------

1. Target filter

   The PMU can only monitor the performance of traffic downstream of the target
   Root Ports or a downstream target Endpoint. The PCIe PMU driver supports
   "port" and "bdf" interfaces for users, and these two interfaces aren't
   supported at the same time.

   - port

     The "port" filter can be used in all PCIe PMU events; the target Root Port
     can be selected by configuring the 16-bits-bitmap "port". Multiple ports
     can be selected for AP-layer-events, and only one port can be selected for
     TL/DL-layer-events.

     For example, if the target Root Port is 0000:00:00.0 (x8 lanes), bit0 of
     the bitmap should be set, port=0x1; if the target Root Port is
     0000:00:04.0 (x4 lanes), bit8 is set, port=0x100; if these two Root Ports
     are both monitored, port=0x101.

     Example usage of perf::

       $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0x1/ sleep 5

   - bdf

     The "bdf" filter can only be used in bandwidth events; the target Endpoint
     is selected by configuring its BDF in "bdf". The counter only counts the
     bandwidth of messages requested by the target Endpoint.

     For example, "bdf=0x3900" means the BDF of the target Endpoint is
     0000:39:00.0.

     Example usage of perf::

       $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,bdf=0x3900/ sleep 5
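The "port" bitmap values used above can be computed mechanically from the Root Port bit positions. A minimal sketch, assuming the bit-index-per-port encoding described in the text (``port_bitmap`` is a made-up helper, not part of the driver or perf):

```shell
# Build the 16-bit "port" bitmap from Root Port bit positions.
# port_bitmap is a hypothetical helper, not part of the driver or perf.
port_bitmap() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

port_bitmap 0      # Root Port 0000:00:00.0 -> 0x1
port_bitmap 8      # Root Port 0000:00:04.0 -> 0x100
port_bitmap 0 8    # both Root Ports       -> 0x101
```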
2. Trigger filter

   Event statistics start when the TLP length is for the first time
   greater/smaller than the trigger condition. You can set the trigger
   condition by writing "trig_len", and set the trigger mode by writing
   "trig_mode". This filter can only be used in bandwidth events.

   For example, "trig_len=4" means the trigger condition is 2^4 DW,
   "trig_mode=0" means statistics start when TLP length > trigger condition,
   and "trig_mode=1" means they start when TLP length < condition.

   Example usage of perf::

     $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
3. Threshold filter

   The counter counts when the TLP length is within the specified range. You
   can set the threshold by writing "thr_len", and set the threshold mode by
   writing "thr_mode". This filter can only be used in bandwidth events.

   For example, "thr_len=4" means the threshold is 2^4 DW, "thr_mode=0" means
   the counter counts when TLP length >= threshold, and "thr_mode=1" means it
   counts when TLP length < threshold.

   Example usage of perf::

     $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
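For both "trig_len" and "thr_len", a value of n encodes 2^n DW, where one DW (double word) is 4 bytes. A minimal sketch of the conversion (``tlp_len_bytes`` is a made-up helper, not part of the driver):

```shell
# Convert a trig_len/thr_len field value into bytes: 2^n DW, 1 DW = 4 bytes.
# tlp_len_bytes is a hypothetical helper for illustration only.
tlp_len_bytes() {
    echo $(( (1 << $1) * 4 ))
}

tlp_len_bytes 4    # 2^4 DW = 16 DW = 64 bytes
```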
4. TLP Length filter

   When counting bandwidth, the data can be composed of certain parts of TLP
   packets. You can specify it through "len_mode":

   - 2'b00: Reserved (do not use this since the behaviour is undefined)
   - 2'b01: Bandwidth of TLP payloads
   - 2'b10: Bandwidth of TLP headers
   - 2'b11: Bandwidth of both TLP payloads and headers

   For example, "len_mode=2" means only counting the bandwidth of TLP headers
   and "len_mode=3" means the final bandwidth data is composed of both TLP
   headers and payloads. The default value if not specified is 2'b11.

   Example usage of perf::

     $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
...@@ -19,3 +19,5 @@ Performance monitor support
   arm_dsu_pmu
   thunderx2-pmu
   alibaba_pmu
   nvidia-pmu
   meson-ddr-pmu
.. SPDX-License-Identifier: GPL-2.0

===========================================================
Amlogic SoC DDR Bandwidth Performance Monitoring Unit (PMU)
===========================================================

The Amlogic Meson G12 SoC contains a bandwidth monitor inside the DRAM
controller.  The monitor includes 4 channels.  Each channel can count requests
accessing DRAM.  A channel can count up to 3 AXI ports simultaneously.  This
can be helpful to show whether the performance bottleneck is on DDR bandwidth.

Currently, this driver supports the following 5 perf events:

+ meson_ddr_bw/total_rw_bytes/
+ meson_ddr_bw/chan_1_rw_bytes/
+ meson_ddr_bw/chan_2_rw_bytes/
+ meson_ddr_bw/chan_3_rw_bytes/
+ meson_ddr_bw/chan_4_rw_bytes/
meson_ddr_bw/chan_{1,2,3,4}_rw_bytes/ events are channel-specific events.
Each channel supports filtering, which lets the channel monitor an
individual IP module in the SoC.

Below are the DDR access request event filter keywords:
+ arm - from CPU
+ vpu_read1 - from OSD + VPP read
+ gpu - from 3D GPU
+ pcie - from PCIe controller
+ hdcp - from HDCP controller
+ hevc_front - from HEVC codec front end
+ usb3_0 - from USB3.0 controller
+ hevc_back - from HEVC codec back end
+ h265enc - from HEVC encoder
+ vpu_read2 - from DI read
+ vpu_write1 - from VDIN write
+ vpu_write2  - from DI write
+ vdec - from legacy codec video decoder
+ hcodec - from H264 encoder
+ ge2d - from ge2d
+ spicc1 - from SPI controller 1
+ usb0 - from USB2.0 controller 0
+ dma - from system DMA controller 1
+ arb0 - from arb0
+ sd_emmc_b - from SD eMMC b controller
+ usb1 - from USB2.0 controller 1
+ audio - from Audio module
+ sd_emmc_c - from SD eMMC c controller
+ spicc2 - from SPI controller 2
+ ethernet - from Ethernet controller
Examples:

  + Show the total DDR bandwidth per second:

    .. code-block:: bash

       perf stat -a -e meson_ddr_bw/total_rw_bytes/ -I 1000 sleep 10

  + Show individual DDR bandwidth from CPU and GPU respectively, as well as
    the sum of them:

    .. code-block:: bash

       perf stat -a -e meson_ddr_bw/chan_1_rw_bytes,arm=1/ -I 1000 sleep 10
       perf stat -a -e meson_ddr_bw/chan_2_rw_bytes,gpu=1/ -I 1000 sleep 10
       perf stat -a -e meson_ddr_bw/chan_3_rw_bytes,arm=1,gpu=1/ -I 1000 sleep 10
=========================================================
NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
=========================================================
The NVIDIA Tegra SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:
* Scalable Coherency Fabric (SCF)
* NVLink-C2C0
* NVLink-C2C1
* CNVLink
* PCIE
PMU Driver
----------
The PMUs in this document are based on ARM CoreSight PMU Architecture as
described in document: ARM IHI 0091. Since this is a standard architecture, the
PMUs are managed by a common driver "arm-cs-arch-pmu". This driver describes
the available events and configuration of each PMU in sysfs. Please see the
sections below to get the sysfs path of each PMU. Like other uncore PMU drivers,
the driver provides "cpumask" sysfs attribute to show the CPU id used to handle
the PMU event. There is also "associated_cpus" sysfs attribute, which contains a
list of CPUs associated with the PMU instance.
.. _SCF_PMU_Section:
SCF PMU
-------
The SCF PMU monitors system level cache events, CPU traffic, and
strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.
Example usage:
* Count event id 0x0 in socket 0::
perf stat -a -e nvidia_scf_pmu_0/event=0x0/
* Count event id 0x0 in socket 1::
perf stat -a -e nvidia_scf_pmu_1/event=0x0/
NVLink-C2C0 PMU
--------------------
The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with
NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
varies depending on the chip configuration:
* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.
In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU.
* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.
In this config, the PMU captures read and relaxed ordered (RO) writes from
PCIE device of the remote SoC.
Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
Example usage:
* Count event id 0x0 from the GPU/CPU connected with socket 0::
perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/
* Count event id 0x0 from the GPU/CPU connected with socket 1::
perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/
* Count event id 0x0 from the GPU/CPU connected with socket 2::
perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/
* Count event id 0x0 from the GPU/CPU connected with socket 3::
perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/
NVLink-C2C1 PMU
-------------------
The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with
NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
traffic, in contrast with the NVLink-C2C0 PMU that captures ATS translated traffic.
Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
Example usage:
* Count event id 0x0 from the GPU connected with socket 0::
perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/
* Count event id 0x0 from the GPU connected with socket 1::
perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/
* Count event id 0x0 from the GPU connected with socket 2::
perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/
* Count event id 0x0 from the GPU connected with socket 3::
perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/
CNVLink PMU
---------------
The CNVLink PMU monitors traffic from GPU and PCIE device on remote sockets
to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.
Each SoC socket can be connected to one or more sockets via CNVLink. The user
can use the "rem_socket" bitmap parameter to select the remote socket(s) to
monitor. Each bit represents a socket number, e.g. "rem_socket=0xE"
corresponds to sockets 1 to 3.
/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.
The PMU cannot distinguish the remote traffic initiator, therefore it does not
provide a filter to select the traffic source to monitor. It reports combined
traffic from remote GPU and PCIE devices.
Example usage:
* Count event id 0x0 for the traffic from remote socket 1, 2, and 3 to socket 0::
perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/
* Count event id 0x0 for the traffic from remote socket 0, 2, and 3 to socket 1::
perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/
* Count event id 0x0 for the traffic from remote socket 0, 1, and 3 to socket 2::
perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/
* Count event id 0x0 for the traffic from remote socket 0, 1, and 2 to socket 3::
perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/
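The "rem_socket" masks used above follow directly from the bitmap encoding: in a 4-socket system, set every socket bit except the local one. A minimal sketch of the arithmetic (``rem_socket_mask`` is a made-up helper, not part of the driver or perf):

```shell
# Bitmap of all remote sockets in a 4-socket system: clear the local bit.
# rem_socket_mask is a hypothetical helper for illustration only.
rem_socket_mask() {
    printf '0x%X\n' $(( 0xF & ~(1 << $1) ))
}

rem_socket_mask 0    # sockets 1,2,3 -> 0xE
rem_socket_mask 1    # sockets 0,2,3 -> 0xD
rem_socket_mask 2    # sockets 0,1,3 -> 0xB
rem_socket_mask 3    # sockets 0,1,2 -> 0x7
```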
PCIE PMU
------------
The PCIE PMU monitors all read/write traffic from PCIE root ports to
local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
for more info about the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.
Each SoC socket can support multiple root ports. The user can use the
"root_port" bitmap parameter to select the port(s) to monitor, i.e.
"root_port=0xF" corresponds to root ports 0 to 3.
/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.
Example usage:
* Count event id 0x0 from root port 0 and 1 of socket 0::
perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/
* Count event id 0x0 from root port 0 and 1 of socket 1::
perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/
.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:
Traffic Coverage
----------------
The PMU traffic coverage may vary depending on the chip configuration:
* **NVIDIA Grace Hopper Superchip**: Hopper GPU is connected with Grace SoC.
Example configuration with two Grace SoCs::
********************************* *********************************
* SOCKET-A * * SOCKET-B *
* * * *
* :::::::: * * :::::::: *
* : PCIE : * * : PCIE : *
* :::::::: * * :::::::: *
* | * * | *
* | * * | *
* ::::::: ::::::::: * * ::::::::: ::::::: *
* : : : : * * : : : : *
* : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
* : : C2C : SoC : * * : SoC : C2C : : *
* ::::::: ::::::::: * * ::::::::: ::::::: *
* | | * * | | *
* | | * * | | *
* &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& *
* & GMEM & & CMEM & * * & CMEM & & GMEM & *
* &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& *
* * * *
********************************* *********************************
GMEM = GPU Memory (e.g. HBM)
CMEM = CPU Memory (e.g. LPDDR5X)
|
| The following table contains the traffic coverage of the Grace SoC PMU in socket-A:
::
+--------------+-------+-----------+-----------+-----+----------+----------+
| | Source |
+ +-------+-----------+-----------+-----+----------+----------+
| Destination | |GPU ATS |GPU Not-ATS| | Socket-B | Socket-B |
| |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
| | |EGM | | | | |
+==============+=======+===========+===========+=====+==========+==========+
| Local | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU | CNVLink |
| SYSRAM/CMEM | PMU |PMU |PMU | PMU | | PMU |
+--------------+-------+-----------+-----------+-----+----------+----------+
| Local GMEM | PCIE | N/A |NVLink-C2C1| SCF | SCF PMU | CNVLink |
| | PMU | |PMU | PMU | | PMU |
+--------------+-------+-----------+-----------+-----+----------+----------+
| Remote | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | |
| SYSRAM/CMEM | PMU |PMU |PMU | PMU | N/A | N/A |
| over CNVLink | | | | | | |
+--------------+-------+-----------+-----------+-----+----------+----------+
| Remote GMEM | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | |
| over CNVLink | PMU |PMU |PMU | PMU | N/A | N/A |
+--------------+-------+-----------+-----------+-----+----------+----------+
PCIE1 traffic represents strongly ordered (SO) writes.
PCIE2 traffic represents reads and relaxed ordered (RO) writes.
* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.
Example configuration with two Grace SoCs::
******************* *******************
* SOCKET-A * * SOCKET-B *
* * * *
* :::::::: * * :::::::: *
* : PCIE : * * : PCIE : *
* :::::::: * * :::::::: *
* | * * | *
* | * * | *
* ::::::::: * * ::::::::: *
* : : * * : : *
* : Grace :<--------NVLink------->: Grace : *
* : SoC : * C2C * : SoC : *
* ::::::::: * * ::::::::: *
* | * * | *
* | * * | *
* &&&&&&&& * * &&&&&&&& *
* & CMEM & * * & CMEM & *
* &&&&&&&& * * &&&&&&&& *
* * * *
******************* *******************
GMEM = GPU Memory (e.g. HBM)
CMEM = CPU Memory (e.g. LPDDR5X)
|
| The following table contains the traffic coverage of the Grace SoC PMU in socket-A:
::
+-----------------+-----------+---------+----------+-------------+
| | Source |
+ +-----------+---------+----------+-------------+
| Destination | | | Socket-B | Socket-B |
| | PCI R/W | CPU | CPU/PCIE1| PCIE2 |
| | | | | |
+=================+===========+=========+==========+=============+
| Local | PCIE PMU | SCF PMU | SCF PMU | NVLink-C2C0 |
| SYSRAM/CMEM | | | | PMU |
+-----------------+-----------+---------+----------+-------------+
| Remote | | | | |
| SYSRAM/CMEM | PCIE PMU | SCF PMU | N/A | N/A |
| over NVLink-C2C | | | | |
+-----------------+-----------+---------+----------+-------------+
PCIE1 traffic represents strongly ordered (SO) writes.
PCIE2 traffic represents reads and relaxed ordered (RO) writes.
...@@ -2,8 +2,6 @@
Documentation for /proc/sys/fs/
===============================

Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>

Copyright (c) 2009, Shen Feng <shen@cn.fujitsu.com>

...@@ -12,58 +10,40 @@ For general info and legal blurb, please look in intro.rst.

------------------------------------------------------------------------------

This file contains documentation for the sysctl files and directories
in ``/proc/sys/fs/``.

The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
kernel. Since some of the files *can* be used to screw up your
system, it is advisable to read both documentation and source
before actually making adjustments.
1. /proc/sys/fs
===============

Currently, these files might (depending on your configuration)
show up in ``/proc/sys/fs``:

.. contents:: :local:

aio-nr & aio-max-nr
-------------------

``aio-nr`` shows the current system-wide number of asynchronous io
requests. ``aio-max-nr`` allows you to change the maximum value
``aio-nr`` can grow to. If ``aio-nr`` reaches ``aio-max-nr`` then
``io_setup`` will fail with ``EAGAIN``. Note that raising
``aio-max-nr`` does not result in the pre-allocation or re-sizing
of any kernel data structures.

dentry-state
------------

This file shows the values in ``struct dentry_stat``, as defined in
``linux/include/linux/dcache.h``::

    struct dentry_stat_t dentry_stat {
        int nr_dentry;
...@@ -76,95 +56,84 @@ From linux/include/linux/dcache.h::
Dentries are dynamically allocated and deallocated.

``nr_dentry`` shows the total number of dentries allocated (active
+ unused). ``nr_unused`` shows the number of dentries that are not
actively used, but are saved in the LRU list for future reuse.

``age_limit`` is the age in seconds after which dcache entries
can be reclaimed when memory is short, and ``want_pages`` is
nonzero when ``shrink_dcache_pages()`` has been called and the
dcache isn't pruned yet.

``nr_negative`` shows the number of unused dentries that are also
negative dentries, which do not map to any files. Instead,
they help speed up rejection of non-existing files provided
by the users.
file-max & file-nr
------------------

The value in ``file-max`` denotes the maximum number of file-
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.

Historically, the kernel was able to allocate file handles
dynamically, but not to free them again. The three values in
``file-nr`` denote the number of allocated file handles, the number
of allocated but unused file handles, and the maximum number of
file handles. Linux 2.6 and later always reports 0 as the number of free
file handles -- this is not an error, it just means that the
number of allocated file handles exactly matches the number of
used file handles.

Attempts to allocate more file descriptors than ``file-max`` are
reported with ``printk``; look for::

  VFS: file-max limit <number> reached

in the kernel logs.
inode-nr & inode-state
----------------------

As with file handles, the kernel allocates the inode structures
dynamically, but can't free them yet.

The file ``inode-nr`` contains the first two items from
``inode-state``, so we'll skip to that file...

``inode-state`` contains three actual numbers and four dummies.
The actual numbers are, in order of appearance, ``nr_inodes``,
``nr_free_inodes`` and ``preshrink``.

``nr_inodes`` stands for the number of inodes the system has
allocated.

``nr_free_inodes`` represents the number of free inodes, and
``preshrink`` is nonzero when the
system needs to prune the inode list instead of allocating
more.

mount-max
---------

This denotes the maximum number of mounts that may exist
in a mount namespace.

nr_open
-------

This denotes the maximum number of file-handles a process can
allocate. The default value is 1024*1024 (1048576), which should be
enough for most machines. The actual limit depends on the ``RLIMIT_NOFILE``
resource limit.
overflowgid & overflowuid
-------------------------
...@@ -192,7 +161,7 @@ pipe-user-pages-soft

Maximum total number of pages a non-privileged user may allocate for pipes
before the pipe size gets limited to a single page. Once this limit is reached,
new pipes will be limited to a single page in size for this user in order to
limit total memory usage, and trying to increase them using ``fcntl()`` will be
denied until usage goes below the limit again. The default value allows
allocating up to 1024 pipes at their default size. When set to 0, no limit is
applied.
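The default value above can be reproduced with a little arithmetic, assuming 4 KiB pages and the kernel's default 64 KiB (16-page) pipe buffer:

```python
# Illustrative arithmetic for the default pipe-user-pages-soft value:
# 1024 pipes at the default pipe size. Page and pipe sizes here are
# the common defaults, not values read from a live system.
PAGE_SIZE = 4096
DEFAULT_PIPE_SIZE = 65536                   # 16 pages per pipe by default

pages_per_pipe = DEFAULT_PIPE_SIZE // PAGE_SIZE
soft_limit_pages = 1024 * pages_per_pipe    # 1024 default-sized pipes
print(soft_limit_pages)                     # 16384
```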
...@@ -207,7 +176,7 @@ file.

When set to "0", writing to FIFOs is unrestricted.

When set to "1" don't allow ``O_CREAT`` open on FIFOs that we don't own
in world writable sticky directories, unless they are owned by the
owner of the directory.
...@@ -221,7 +190,7 @@ protected_hardlinks

A long-standing class of security issues is the hardlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like ``/tmp``. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given hardlink (i.e. a
root process follows a hardlink created by another user). Additionally,
on systems without separated partitions, this stops unauthorized users
...@@ -239,13 +208,13 @@ This protection is based on the restrictions in Openwall and grsecurity.

protected_regular
-----------------

This protection is similar to `protected_fifos`_, but it
avoids writes to an attacker-controlled regular file, where a program
expected to create one.

When set to "0", writing to regular files is unrestricted.

When set to "1" don't allow ``O_CREAT`` open on regular files that we
don't own in world writable sticky directories, unless they are
owned by the owner of the directory.
...@@ -257,7 +226,7 @@ protected_symlinks

A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like ``/tmp``. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
...@@ -272,23 +241,25 @@ follower match, or when the directory owner matches the symlink's owner.

This protection is based on the restrictions in Openwall and grsecurity.


suid_dumpable
-------------

This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are

=   ==========  ===============================================================
0   (default)   Traditional behaviour. Any process which has changed
                privilege levels or is execute only will not be dumped.
1   (debug)     All processes dump core when possible. The core dump is
                owned by the current user and no security is applied. This is
                intended for system debugging situations only.
                Ptrace is unchecked.
                This is insecure as it allows regular users to examine the
                memory contents of privileged processes.
2   (suidsafe)  Any binary which normally would not be dumped is dumped
                anyway, but only if the ``core_pattern`` kernel sysctl (see
                :ref:`Documentation/admin-guide/sysctl/kernel.rst <core_pattern>`)
                is set to either a pipe handler or a fully qualified path. (For
                more details on this limitation, see CVE-2006-2451.) This mode
                is appropriate when administrators are attempting to debug
...@@ -301,36 +272,11 @@ or otherwise protected/tainted binaries. The modes are

=   ==========  ===============================================================


aio-nr & aio-max-nr
-------------------

aio-nr shows the current system-wide number of asynchronous io
requests. aio-max-nr allows you to change the maximum value
aio-nr can grow to.


2. /proc/sys/fs/binfmt_misc
===========================

Documentation for the files in ``/proc/sys/fs/binfmt_misc`` is
in Documentation/admin-guide/binfmt-misc.rst.
...@@ -343,28 +289,32 @@ creation of a user space library that implements the POSIX message queues

API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
Interfaces specification.)

The "mqueue" filesystem contains values for determining/setting the
amount of resources used by the file system.

``/proc/sys/fs/mqueue/queues_max`` is a read/write file for
setting/getting the maximum number of message queues allowed on the
system.

``/proc/sys/fs/mqueue/msg_max`` is a read/write file for
setting/getting the maximum number of messages in a queue value. In
fact it is the limiting value for another (user) limit which is set in
``mq_open`` invocation. This attribute of a queue must be less than
or equal to ``msg_max``.

``/proc/sys/fs/mqueue/msgsize_max`` is a read/write file for
setting/getting the maximum message size value (it is an attribute of
every message queue, set during its creation).

``/proc/sys/fs/mqueue/msg_default`` is a read/write file for
setting/getting the default number of messages in a queue value if the
``attr`` parameter of ``mq_open(2)`` is ``NULL``. If it exceeds
``msg_max``, the default value is initialized to ``msg_max``.

``/proc/sys/fs/mqueue/msgsize_default`` is a read/write file for
setting/getting the default message size value if the ``attr``
parameter of ``mq_open(2)`` is ``NULL``. If it exceeds
``msgsize_max``, the default value is initialized to ``msgsize_max``.
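The clamping rule for ``msg_default`` and ``msgsize_default`` described above can be sketched in a few lines; the numbers are arbitrary examples, not real limits:

```python
# Hedged sketch of the described behaviour: a requested default that
# exceeds its corresponding *_max limit is initialized to that limit.
def effective_default(requested_default, hard_max):
    return min(requested_default, hard_max)

print(effective_default(64, 10))   # request above msg_max: clamped to 10
print(effective_default(8, 10))    # request within msg_max: kept at 8
```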
4. /proc/sys/fs/epoll - Configuration options for the epoll interface
=====================================================================

...@@ -378,7 +328,7 @@ Every epoll file descriptor can store a number of files to be monitored

for event readiness. Each one of these monitored files constitutes a "watch".
This configuration option sets the maximum number of "watches" that are
allowed for each user.

Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes
on a 64-bit one.

The current default value for ``max_user_watches`` is 4% of the
available low memory, divided by the "watch" cost in bytes.
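The default calculation can be illustrated numerically. The 1 GiB low-memory figure below is a made-up example and the per-watch cost is the rough 64-bit estimate from the text, so the result is only indicative:

```python
# Illustrative max_user_watches default: 4% (1/25) of low memory
# divided by the approximate per-watch cost on a 64-bit kernel.
WATCH_COST_64BIT = 160            # bytes per watch, approximate
low_memory = 1 * 1024**3          # assume 1 GiB of low memory (example)

default_watches = (low_memory // 25) // WATCH_COST_64BIT
print(default_watches)
```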
...@@ -139,6 +139,8 @@ Highest valid capability of the running kernel. Exports

``CAP_LAST_CAP`` from the kernel.

.. _core_pattern:

core_pattern
============
...@@ -174,6 +176,7 @@ core_pattern

	%f        executable filename
	%E        executable path
	%c        maximum size of core file by resource limit RLIMIT_CORE
	%C        CPU the task ran on
	%<OTHER>  both are dropped

	========  ==========================================
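The specifier table can be made concrete with a small expansion sketch. This is an illustration of the substitution idea only, not the kernel's actual ``core_pattern`` formatting code, and the crash-record values are hypothetical:

```python
# Expand a core_pattern-style template from a dictionary of specifier
# values. Unknown specifiers are dropped, as the table above states.
def expand_core_pattern(pattern, values):
    out, i = [], 0
    while i < len(pattern):
        if pattern[i] == "%" and i + 1 < len(pattern):
            out.append(values.get(pattern[i + 1], ""))
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

# Hypothetical crash record: executable name, pid, time of dump.
crash = {"e": "myapp", "p": "4242", "t": "1700000000"}
print(expand_core_pattern("/var/crash/core.%e.%p.%t", crash))
```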
...@@ -433,8 +436,8 @@ ignore-unaligned-usertrap

On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``;
currently, ``arc``, ``ia64`` and ``loongarch``), controls whether all
unaligned traps are logged.

= =============================================================
0 Log all unaligned accesses.
...@@ -667,6 +670,15 @@ This is the default behavior.

an oops event is detected.


oops_limit
==========

Number of kernel oopses after which the kernel should panic when
``panic_on_oops`` is not set. Setting this to 0 disables checking
the count. Setting this to 1 has the same effect as setting
``panic_on_oops=1``. The default value is 10000.


osrelease, ostype & version
===========================
...@@ -1314,6 +1326,29 @@ watchdog work to be queued by the watchdog timer function, otherwise the NMI

watchdog — if enabled — can detect a hard lockup condition.


split_lock_mitigate (x86 only)
==============================

On x86, each "split lock" imposes a system-wide performance penalty. On larger
systems, large numbers of split locks from unprivileged users can result in
denials of service to well-behaved and potentially more important users.

The kernel mitigates these bad users by detecting split locks and imposing
penalties: forcing them to wait and only allowing one core to execute split
locks at a time.

These mitigations can make those bad applications unbearably slow. Setting
split_lock_mitigate=0 may restore some application performance, but will also
increase system exposure to denial of service attacks from split lock users.

= ===================================================================
0 Disable the mitigation mode - just warns the split lock on kernel log
  and exposes the system to denials of service from the split lockers.
1 Enable the mitigation mode (this is the default) - penalizes the split
  lockers with intentional performance degradation.
= ===================================================================


stack_erasing
=============
...@@ -1457,8 +1492,8 @@ unaligned-trap

On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently,
``arc``, ``parisc`` and ``loongarch``), controls whether unaligned traps
are caught and emulated (instead of failing).

= ========================================================
0 Do not emulate unaligned accesses.
...@@ -1500,6 +1535,16 @@ entry will default to 2 instead of 0.

2 Unprivileged calls to ``bpf()`` are disabled
= =============================================================


warn_limit
==========

Number of kernel warnings after which the kernel should panic when
``panic_on_warn`` is not set. Setting this to 0 disables checking
the warning count. Setting this to 1 has the same effect as setting
``panic_on_warn=1``. The default value is 0.


watchdog
========
...@@ -14,18 +14,20 @@ Orion family

  Flavors:
        - 88F5082
        - 88F5181 a.k.a Orion-1
        - 88F5181L a.k.a Orion-VoIP
        - 88F5182 a.k.a Orion-NAS

          - Datasheet: https://web.archive.org/web/20210124231420/http://csclub.uwaterloo.ca/~board/ts7800/MV88F5182-datasheet.pdf
          - Programmer's User Guide: https://web.archive.org/web/20210124231536/http://csclub.uwaterloo.ca/~board/ts7800/MV88F5182-opensource-manual.pdf
          - User Manual: https://web.archive.org/web/20210124231631/http://csclub.uwaterloo.ca/~board/ts7800/MV88F5182-usermanual.pdf
          - Functional Errata: https://web.archive.org/web/20210704165540/https://www.digriz.org.uk/ts78xx/88F5182_Functional_Errata.pdf
        - 88F5281 a.k.a Orion-2

          - Datasheet: https://web.archive.org/web/20131028144728/http://www.ocmodshop.com/images/reviews/networking/qnap_ts409u/marvel_88f5281_data_sheet.pdf
        - 88F6183 a.k.a Orion-1-90
  Homepage:
        https://web.archive.org/web/20080607215437/http://www.marvell.com/products/media/index.jsp
  Core:
        Feroceon 88fr331 (88f51xx) or 88fr531-vd (88f52xx) ARMv5 compatible
  Linux kernel mach directory:
...@@ -163,7 +163,7 @@ FPDT Section 5.2.23 (signature == "FPDT")

       **Firmware Performance Data Table**

       Optional, useful for boot performance profiling.

GTDT Section 5.2.24 (signature == "GTDT")
...@@ -121,8 +121,9 @@ Header notes:

                        to the base of DRAM, since memory below it is not
                        accessible via the linear mapping
                1
                        2MB aligned base such that all image_size bytes
                        counted from the start of the image are within
                        the 48-bit addressable range of physical memory
  Bits 4-63     Reserved.
============= ===============================================================
...@@ -348,7 +349,7 @@ Before jumping into the kernel, the following conditions must be met:

    - HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.

  For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):

  - If EL3 is present:
...@@ -275,6 +275,15 @@ HWCAP2_EBF16

HWCAP2_SVE_EBF16
    Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0010.

HWCAP2_CSSC
    Functionality implied by ID_AA64ISAR2_EL1.CSSC == 0b0001.

HWCAP2_RPRFM
    Functionality implied by ID_AA64ISAR2_EL1.RPRFM == 0b0001.

HWCAP2_SVE2P1
    Functionality implied by ID_AA64ZFR0_EL1.SVEver == 0b0010.

4. Unused AT_HWCAP bits
-----------------------
...@@ -52,6 +52,7 @@ model features for SVE is included in Appendix A.

    HWCAP2_SVEBITPERM
    HWCAP2_SVESHA3
    HWCAP2_SVESM4
    HWCAP2_SVE2P1

This list may be extended over time as the SVE architecture evolves.
...@@ -142,7 +142,7 @@ Therefore, we also introduce *blk-crypto-fallback*, which is an implementation

of inline encryption using the kernel crypto API. blk-crypto-fallback is built
into the block layer, so it works on any block device without any special setup.
Essentially, when a bio with an encryption context is submitted to a
block_device that doesn't support that encryption context, the block layer will
handle en/decryption of the bio using blk-crypto-fallback.

For encryption, the data cannot be encrypted in-place, as callers usually rely

...@@ -187,7 +187,7 @@ API presented to users of the block layer

``blk_crypto_config_supported()`` allows users to check ahead of time whether
inline encryption with particular crypto settings will work on a particular
block_device -- either via hardware or via blk-crypto-fallback. This function
takes in a ``struct blk_crypto_config`` which is like blk_crypto_key, but omits
the actual bytes of the key and instead just contains the algorithm, data unit
size, etc. This function can be useful if blk-crypto-fallback is disabled.

...@@ -195,7 +195,7 @@ size, etc. This function can be useful if blk-crypto-fallback is disabled.

``blk_crypto_init_key()`` allows users to initialize a blk_crypto_key.

Users must call ``blk_crypto_start_using_key()`` before actually starting to use
a blk_crypto_key on a block_device (even if ``blk_crypto_config_supported()``
was called earlier). This is needed to initialize blk-crypto-fallback if it
will be needed. This must not be called from the data path, as this may have to
allocate resources, which may deadlock in that case.

...@@ -207,7 +207,7 @@ for en/decryption. Users don't need to worry about freeing the bio_crypt_ctx

later, as that happens automatically when the bio is freed or reset.

Finally, when done using inline encryption with a blk_crypto_key on a
block_device, users must call ``blk_crypto_evict_key()``. This ensures that
the key is evicted from all keyslots it may be programmed into and unlinked from
any kernel data structures it may be linked into.

...@@ -221,9 +221,9 @@ as follows:

5. ``blk_crypto_evict_key()`` (after all I/O has completed)
6. Zeroize the blk_crypto_key (this has no dedicated function)

If a blk_crypto_key is being used on multiple block_devices, then
``blk_crypto_config_supported()`` (if used), ``blk_crypto_start_using_key()``,
and ``blk_crypto_evict_key()`` must be called on each block_device.

API presented to device drivers
===============================
...@@ -298,3 +298,48 @@ A: NO.

The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.
Q: What is the compatibility story for special BPF types in map values?
-----------------------------------------------------------------------
Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
values (when using BTF support for BPF maps). This allows using helpers for
such objects on these fields inside map values. Users are also allowed to embed
pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the
kernel preserve backwards compatibility for these features?
A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
NO, but see below.
For struct types that have been added already, like bpf_spin_lock and bpf_timer,
the kernel will preserve backwards compatibility, as they are part of UAPI.
For kptrs, they are also part of UAPI, but only with respect to the kptr
mechanism. The types that you can use with a __kptr and __kptr_ref tagged
pointer in your struct are NOT part of the UAPI contract. The supported types can
and will change across kernel releases. However, operations like accessing kptr
fields and bpf_kptr_xchg() helper will continue to be supported across kernel
releases for the supported types.
For any other supported struct type, unless explicitly stated in this document
and added to bpf.h UAPI header, such types can and will arbitrarily change their
size, type, and alignment, or any other user visible API or ABI detail across
kernel releases. The users must adapt their BPF programs to the new changes and
update them to make sure their programs continue to work correctly.
NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
order to introduce more special fields in the future. Hence, user programs must
avoid defining types with 'bpf\_' prefix to not be broken in future releases.
In other words, no backwards compatibility is guaranteed if one is using a
type in BTF with the 'bpf\_' prefix.
Q: What is the compatibility story for special BPF types in allocated objects?
------------------------------------------------------------------------------
Q: Same as above, but for allocated objects (i.e. objects allocated using
bpf_obj_new for user defined types). Will the kernel preserve backwards
compatibility for these features?
A: NO.
Unlike map value types, there are no stability guarantees for this case. The
whole API to work with allocated objects and any support for special fields
inside them is unstable (since it is exposed through kfuncs).
...@@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**

Submitting patches
==================
Q: How do I run BPF CI on my changes before sending them out for review?
------------------------------------------------------------------------
A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
While GitHub also provides a CLI that can be used to accomplish the same
results, here we focus on the UI based workflow.
The following steps lay out how to start a CI run for your patches:
- Create a fork of the aforementioned repository in your own account (one time
action)
- Clone the fork locally, check out a new branch tracking either the bpf-next
or bpf branch, and apply your to-be-tested patches on top of it
- Push the local branch to your fork and create a pull request against
kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively
Shortly after the pull request has been created, the CI workflow will run. Note
that capacity is shared with patches submitted upstream being checked and so
depending on utilization the run can take a while to finish.
Note furthermore that both base branches (bpf-next_base and bpf_base) will be
updated as patches are pushed to the respective upstream branches they track. As
such, an attempt will automatically be made to rebase your patch set as well.
This behavior can result in a CI run being aborted and restarted with the new
base line.
Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------

A: Please submit your BPF patches to the bpf kernel mailing list:
...@@ -1062,4 +1062,9 @@ format.::

7. Testing
==========

The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
provides an extensive set of BTF-related tests.

.. Links
.. _tools/testing/selftests/bpf/prog_tests/btf.c:
   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c
...@@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture.

   maps
   bpf_prog_run
   classic_vs_extended.rst
   bpf_iterators
   bpf_licensing
   test_debug
   clang-notes
   linux-notes
   other
   redirect

.. only:: subproject and html
...@@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)

``BPF_XOR | BPF_K | BPF_ALU`` means::

  dst_reg = (u32) dst_reg ^ (u32) imm32

``BPF_XOR | BPF_K | BPF_ALU64`` means::

  dst_reg = dst_reg ^ imm32
......
...@@ -7,3 +7,6 @@ Program Types

   :glob:

   prog_*

For a list of all program types, see :ref:`program_types_and_elf` in
the :ref:`libbpf` documentation.