Commit 24844fd3, authored by Jonathan Corbet

Merge branch 'mm-rst' into docs-next

Mike Rapoport says:

  These patches convert files in Documentation/vm to ReST format, add an
  initial index and link it to the top level documentation.

  There are no content changes in the documentation, except for a few spelling
  fixes. The relatively large diffstat stems from the indentation and
  paragraph wrapping changes.

  I've tried to keep the formatting as consistent as possible, but I may
  have missed some places that needed markup and added some markup where it
  was not necessary.

[jc: significant conflicts in vm/hmm.rst]
@@ -90,4 +90,4 @@ Date: December 2009
 Contact: Lee Schermerhorn <lee.schermerhorn@hp.com>
 Description:
 The node's huge page size control/query attributes.
-See Documentation/vm/hugetlbpage.txt
+See Documentation/vm/hugetlbpage.rst
\ No newline at end of file

@@ -12,4 +12,4 @@ Description:
 free_hugepages
 surplus_hugepages
 resv_hugepages
-See Documentation/vm/hugetlbpage.txt for details.
+See Documentation/vm/hugetlbpage.rst for details.

@@ -40,7 +40,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
 sleep_millisecs: how many milliseconds ksm should sleep between
 scans.
-See Documentation/vm/ksm.txt for more information.
+See Documentation/vm/ksm.rst for more information.
 What: /sys/kernel/mm/ksm/merge_across_nodes
 Date: January 2013

@@ -37,7 +37,7 @@ Description:
 The alloc_calls file is read-only and lists the kernel code
 locations from which allocations for this cache were performed.
 The alloc_calls file only contains information if debugging is
-enabled for that cache (see Documentation/vm/slub.txt).
+enabled for that cache (see Documentation/vm/slub.rst).
 What: /sys/kernel/slab/cache/alloc_fastpath
 Date: February 2008

@@ -219,7 +219,7 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
 Description:
 The free_calls file is read-only and lists the locations of
 object frees if slab debugging is enabled (see
-Documentation/vm/slub.txt).
+Documentation/vm/slub.rst).
 What: /sys/kernel/slab/cache/free_fastpath
 Date: February 2008
@@ -3915,7 +3915,7 @@
 cache (risks via metadata attacks are mostly
 unchanged). Debug options disable merging on their
 own.
-For more information see Documentation/vm/slub.txt.
+For more information see Documentation/vm/slub.rst.
 slab_max_order= [MM, SLAB]
 Determines the maximum allowed order for slabs.

@@ -3929,7 +3929,7 @@
 slub_debug can create guard zones around objects and
 may poison objects when not in use. Also tracks the
 last alloc / free. For more information see
-Documentation/vm/slub.txt.
+Documentation/vm/slub.rst.
 slub_memcg_sysfs= [MM, SLUB]
 Determines whether to enable sysfs directories for

@@ -3943,7 +3943,7 @@
 Determines the maximum allowed order for slabs.
 A high setting may cause OOMs due to memory
 fragmentation. For more information see
-Documentation/vm/slub.txt.
+Documentation/vm/slub.rst.
 slub_min_objects= [MM, SLUB]
 The minimum number of objects per slab. SLUB will

@@ -3952,12 +3952,12 @@
 the number of objects indicated. The higher the number
 of objects the smaller the overhead of tracking slabs
 and the less frequently locks need to be acquired.
-For more information see Documentation/vm/slub.txt.
+For more information see Documentation/vm/slub.rst.
 slub_min_order= [MM, SLUB]
 Determines the minimum page order for slabs. Must be
 lower than slub_max_order.
-For more information see Documentation/vm/slub.txt.
+For more information see Documentation/vm/slub.rst.
 slub_nomerge [MM, SLUB]
 Same with slab_nomerge. This is supported for legacy.

@@ -4313,7 +4313,7 @@
 Format: [always|madvise|never]
 Can be used to control the default behavior of the system
 with respect to transparent hugepages.
-See Documentation/vm/transhuge.txt for more details.
+See Documentation/vm/transhuge.rst for more details.
 tsc= Disable clocksource stability checks for TSC.
 Format: <string>
@@ -120,7 +120,7 @@ A typical out of bounds access report looks like this::
 The header of the report discribe what kind of bug happened and what kind of
 access caused it. It's followed by the description of the accessed slub object
-(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
+(see 'SLUB Debug output' section in Documentation/vm/slub.rst for details) and
 the description of the accessed memory page.
 In the last section the report shows memory state around the accessed address.

@@ -515,7 +515,7 @@ guarantees:
 The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
 bits on both physical and virtual pages associated with a process, and the
-soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
+soft-dirty bit on pte (see Documentation/vm/soft-dirty.rst for details).
 To clear the bits for all the pages associated with the process
 > echo 1 > /proc/PID/clear_refs

@@ -536,7 +536,7 @@ Any other value written to /proc/PID/clear_refs will have no effect.
 The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
 using /proc/kpageflags and number of times a page is mapped using
-/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
+/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.rst.
 The /proc/pid/numa_maps is an extension based on maps, showing the memory
 locality and binding policy, as well as the memory usage (in pages) of
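The hunk above mentions that /proc/pid/pagemap yields the PFN for a page. As a rough illustration, the following user-space sketch decodes one 64-bit pagemap entry. The bit positions (PFN in bits 0-54, soft-dirty in bit 55, swapped in bit 62, present in bit 63) are taken from my reading of the pagemap documentation referenced in the hunk and should be checked against the kernel version in use. The `PM_*` macros, `struct pagemap_info` and the three helper functions are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed bit layout of one pagemap entry; verify for your kernel. */
#define PM_PFN_MASK   ((UINT64_C(1) << 55) - 1)  /* bits 0-54 */
#define PM_SOFT_DIRTY (UINT64_C(1) << 55)
#define PM_SWAPPED    (UINT64_C(1) << 62)
#define PM_PRESENT    (UINT64_C(1) << 63)

/* Hypothetical container for the decoded fields. */
struct pagemap_info {
    int present;
    int swapped;
    int soft_dirty;
    uint64_t pfn;   /* meaningful only when present is set */
};

/* Split one 64-bit pagemap entry into its fields. */
struct pagemap_info pagemap_decode(uint64_t entry)
{
    struct pagemap_info info;

    info.present = (entry & PM_PRESENT) != 0;
    info.swapped = (entry & PM_SWAPPED) != 0;
    info.soft_dirty = (entry & PM_SOFT_DIRTY) != 0;
    info.pfn = info.present ? (entry & PM_PFN_MASK) : 0;
    return info;
}

/* Byte offset of the entry for vaddr: one 8-byte entry per virtual page. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    assert(page_size != 0);
    return (vaddr / page_size) * sizeof(uint64_t);
}

/* Read the raw entry for vaddr from a pagemap file; returns 0 on success. */
int pagemap_read(const char *path, uint64_t vaddr, uint64_t page_size,
                 uint64_t *entry)
{
    FILE *f = fopen(path, "rb");
    int ok;

    if (f == NULL)
        return -1;
    setvbuf(f, NULL, _IONBF, 0);   /* unbuffered: read exactly one entry */
    ok = fseek(f, (long)pagemap_offset(vaddr, page_size), SEEK_SET) == 0 &&
         fread(entry, sizeof(*entry), 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would typically pass "/proc/self/pagemap" as the path and the value of `sysconf(_SC_PAGESIZE)` as the page size. On recent kernels the PFN field reads as zero for unprivileged processes, so the flags are usually the more useful part of the result.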
@@ -105,7 +105,7 @@ policy for the file will revert to "default" policy.
 NUMA memory allocation policies have optional flags that can be used in
 conjunction with their modes. These optional flags can be specified
 when tmpfs is mounted by appending them to the mode before the NodeList.
-See Documentation/vm/numa_memory_policy.txt for a list of all available
+See Documentation/vm/numa_memory_policy.rst for a list of all available
 memory allocation policy mode flags and their effect on memory policy.
 =static is equivalent to MPOL_F_STATIC_NODES

@@ -45,7 +45,7 @@ the kernel interface as seen by application developers.
 .. toctree::
 :maxdepth: 2
 userspace-api/index
 Introduction to kernel development

@@ -89,6 +89,7 @@ needed).
 sound/index
 crypto/index
 filesystems/index
+vm/index
 Architecture-specific documentation
 -----------------------------------

@@ -515,7 +515,7 @@ nr_hugepages
 Change the minimum size of the hugepage pool.
-See Documentation/vm/hugetlbpage.txt
+See Documentation/vm/hugetlbpage.rst
 ==============================================================

@@ -524,7 +524,7 @@ nr_overcommit_hugepages
 Change the maximum size of the hugepage pool. The maximum is
 nr_hugepages + nr_overcommit_hugepages.
-See Documentation/vm/hugetlbpage.txt
+See Documentation/vm/hugetlbpage.rst
 ==============================================================

@@ -667,7 +667,7 @@ and don't use much of it.
 The default value is 0.
-See Documentation/vm/overcommit-accounting and
+See Documentation/vm/overcommit-accounting.rst and
 mm/mmap.c::__vm_enough_memory() for more information.
 ==============================================================
 00-INDEX
 - this file.
-active_mm.txt
+active_mm.rst
 - An explanation from Linus about tsk->active_mm vs tsk->mm.
-balance
+balance.rst
 - various information on memory balancing.
-cleancache.txt
+cleancache.rst
 - Intro to cleancache and page-granularity victim cache.
-frontswap.txt
+frontswap.rst
 - Outline frontswap, part of the transcendent memory frontend.
-highmem.txt
+highmem.rst
 - Outline of highmem and common issues.
-hmm.txt
+hmm.rst
 - Documentation of heterogeneous memory management
-hugetlbpage.txt
+hugetlbpage.rst
 - a brief summary of hugetlbpage support in the Linux kernel.
-hugetlbfs_reserv.txt
+hugetlbfs_reserv.rst
 - A brief overview of hugetlbfs reservation design/implementation.
-hwpoison.txt
+hwpoison.rst
 - explains what hwpoison is
-idle_page_tracking.txt
+idle_page_tracking.rst
 - description of the idle page tracking feature.
-ksm.txt
+ksm.rst
 - how to use the Kernel Samepage Merging feature.
-mmu_notifier.txt
+mmu_notifier.rst
 - a note about clearing pte/pmd and mmu notifications
-numa
+numa.rst
 - information about NUMA specific code in the Linux vm.
-numa_memory_policy.txt
+numa_memory_policy.rst
 - documentation of concepts and APIs of the 2.6 memory policy support.
-overcommit-accounting
+overcommit-accounting.rst
 - description of the Linux kernels overcommit handling modes.
-page_frags
+page_frags.rst
 - description of page fragments allocator
-page_migration
+page_migration.rst
 - description of page migration in NUMA systems.
-pagemap.txt
+pagemap.rst
 - pagemap, from the userspace perspective
-page_owner.txt
+page_owner.rst
 - tracking about who allocated each page
-remap_file_pages.txt
+remap_file_pages.rst
 - a note about remap_file_pages() system call
-slub.txt
+slub.rst
 - a short users guide for SLUB.
-soft-dirty.txt
+soft-dirty.rst
 - short explanation for soft-dirty PTEs
-split_page_table_lock
+split_page_table_lock.rst
 - Separate per-table lock to improve scalability of the old page_table_lock.
-swap_numa.txt
+swap_numa.rst
 - automatic binding of swap device to numa node
-transhuge.txt
+transhuge.rst
 - Transparent Hugepage Support, alternative way of using hugepages.
-unevictable-lru.txt
+unevictable-lru.rst
 - Unevictable LRU infrastructure
-userfaultfd.txt
+userfaultfd.rst
 - description of userfaultfd system call
 z3fold.txt
 - outline of z3fold allocator for storing compressed pages
-zsmalloc.txt
+zsmalloc.rst
 - outline of zsmalloc allocator for storing compressed pages
-zswap.txt
+zswap.rst
 - Intro to compressed cache for swap pages
.. _active_mm:
=========
Active MM
=========
::
List: linux-kernel
Subject: Re: active_mm
From: Linus Torvalds <torvalds () transmeta ! com>
Date: 1999-07-30 21:36:24
Cc'd to linux-kernel, because I don't write explanations all that often,
and when I do I feel better about more people reading them.
On Fri, 30 Jul 1999, David Mosberger wrote:
>
> Is there a brief description someplace on how "mm" vs. "active_mm" in
> the task_struct are supposed to be used? (My apologies if this was
> discussed on the mailing lists---I just returned from vacation and
> wasn't able to follow linux-kernel for a while).
Basically, the new setup is:
- we have "real address spaces" and "anonymous address spaces". The
difference is that an anonymous address space doesn't care about the
user-level page tables at all, so when we do a context switch into an
anonymous address space we just leave the previous address space
active.
The obvious use for a "anonymous address space" is any thread that
doesn't need any user mappings - all kernel threads basically fall into
this category, but even "real" threads can temporarily say that for
some amount of time they are not going to be interested in user space,
and that the scheduler might as well try to avoid wasting time on
switching the VM state around. Currently only the old-style bdflush
sync does that.
- "tsk->mm" points to the "real address space". For an anonymous process,
tsk->mm will be NULL, for the logical reason that an anonymous process
really doesn't _have_ a real address space at all.
- however, we obviously need to keep track of which address space we
"stole" for such an anonymous user. For that, we have "tsk->active_mm",
which shows what the currently active address space is.
The rule is that for a process with a real address space (ie tsk->mm is
non-NULL) the active_mm obviously always has to be the same as the real
one.
For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the
"borrowed" mm while the anonymous process is running. When the
anonymous process gets scheduled away, the borrowed address space is
returned and cleared.
To support all that, the "struct mm_struct" now has two counters: a
"mm_users" counter that is how many "real address space users" there are,
and a "mm_count" counter that is the number of "lazy" users (ie anonymous
users) plus one if there are any real users.
Usually there is at least one real user, but it could be that the real
user exited on another CPU while a lazy user was still active, so you do
actually get cases where you have a address space that is _only_ used by
lazy users. That is often a short-lived state, because once that thread
gets scheduled away in favour of a real thread, the "zombie" mm gets
released because "mm_users" becomes zero.
Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
more. "init_mm" should be considered just a "lazy context when no other
context is available", and in fact it is mainly used just at bootup when
no real VM has yet been created. So code that used to check
if (current->mm == &init_mm)
should generally just do
if (!current->mm)
instead (which makes more sense anyway - the test is basically one of "do
we have a user context", and is generally done by the page fault handler
and things like that).
Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
because it slightly changes the interfaces to accommodate the alpha (who
would have thought it, but the alpha actually ends up having one of the
ugliest context switch codes - unlike the other architectures where the MM
and register state is separate, the alpha PALcode joins the two, and you
need to switch both together).
(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
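The reference-counting rules in the message above can be illustrated with a small user-space model. This is only a sketch: `struct mm`, `struct task`, `mm_drop()`, `mm_put_user()` and `context_switch()` are simplified stand-ins invented for this example (loosely named after the kernel's `mmdrop()` and `mmput()`), and the `freed` flag merely records when the last `mm_count` reference goes away. The real scheduler and mm code differ in many details.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the counters described above; not the kernel's definitions. */
struct mm {
    int mm_users;   /* number of "real address space users" */
    int mm_count;   /* lazy users, plus one while mm_users > 0 */
    int freed;      /* hypothetical flag: set when mm_count reaches zero */
};

struct task {
    struct mm *mm;          /* NULL for an anonymous process */
    struct mm *active_mm;   /* address space currently in use */
};

/* Drop one mm_count reference (stand-in for the kernel's mmdrop()). */
void mm_drop(struct mm *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0)
        mm->freed = 1;
}

/* A real user goes away (stand-in for the kernel's mmput()). */
void mm_put_user(struct mm *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0)
        mm_drop(mm);    /* give up the "plus one" held for real users */
}

/* Switch from prev to next, following the borrowing rule. */
void context_switch(struct task *prev, struct task *next)
{
    if (next->mm == NULL) {
        /* Anonymous: keep the previous address space active. */
        next->active_mm = prev->active_mm;
        next->active_mm->mm_count++;
    } else {
        next->active_mm = next->mm;
    }

    if (prev->mm == NULL) {
        /* The anonymous task returns the mm it borrowed. */
        struct mm *borrowed = prev->active_mm;
        prev->active_mm = NULL;
        mm_drop(borrowed);
    }
}

/* Walk through the "zombie mm" case; returns 1 if the mm ended up freed. */
int lazy_mm_demo(void)
{
    struct mm user_mm = { 1, 1, 0 }, other_mm = { 1, 1, 0 };
    struct task user = { &user_mm, &user_mm };
    struct task other = { &other_mm, &other_mm };
    struct task kthread = { NULL, NULL };

    context_switch(&user, &kthread);   /* kthread borrows user_mm */
    mm_put_user(&user_mm);             /* real user exits elsewhere */
    assert(!user_mm.freed);            /* still pinned by the lazy user */
    context_switch(&kthread, &other);  /* kthread is scheduled away */
    return user_mm.freed;
}
```

In `lazy_mm_demo()` the address space survives the exit of its only real user because the kernel thread still holds a lazy reference, and it is released only once that thread is switched out in favour of a task with its own mm.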
+.. _balance:
+================
+Memory Balancing
+================
 Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
 Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as

@@ -62,11 +68,11 @@ for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
 so as to give a fighting chance for replace_with_highmem() to get a
 HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
 fall back into regular zone. This also makes sure that HIGHMEM pages
 are not leaked (for example, in situations where a HIGHMEM page is in
 the swapcache but is not being used by anyone)
 kswapd also needs to know about the zones it should balance. kswapd is
 primarily needed in a situation where balancing can not be done,
 probably because all allocation requests are coming from intr context
 and all process contexts are sleeping. For 2.3, kswapd does not really
 need to balance the highmem zone, since intr context does not request

@@ -89,7 +95,8 @@ pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
 (Good) Ideas that I have heard:
 1. Dynamic experience should influence balancing: number of failed requests
 for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
 2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
 dma pages. (lkd@tantalophile.demon.co.uk)
-MOTIVATION
+.. _cleancache:
+==========
+Cleancache
+==========
+Motivation
+==========
 Cleancache is a new optional feature provided by the VFS layer that
 potentially dramatically increases page cache effectiveness for

@@ -21,9 +28,10 @@ Transcendent memory "drivers" for cleancache are currently implemented
 in Xen (using hypervisor memory) and zcache (using in-kernel compressed
 memory) and other implementations are in development.
-FAQs are included below.
+:ref:`FAQs <faq>` are included below.
-IMPLEMENTATION OVERVIEW
+Implementation Overview
+=======================
 A cleancache "backend" that provides transcendent memory registers itself
 to the kernel's cleancache "frontend" by calling cleancache_register_ops,

@@ -80,22 +88,33 @@ different Linux threads are simultaneously putting and invalidating a page
 with the same handle, the results are indeterminate. Callers must
 lock the page to ensure serial behavior.
-CLEANCACHE PERFORMANCE METRICS
+Cleancache Performance Metrics
+==============================
 If properly configured, monitoring of cleancache is done via debugfs in
-the /sys/kernel/debug/cleancache directory. The effectiveness of cleancache
+the `/sys/kernel/debug/cleancache` directory. The effectiveness of cleancache
 can be measured (across all filesystems) with:
-succ_gets - number of gets that were successful
-failed_gets - number of gets that failed
-puts - number of puts attempted (all "succeed")
-invalidates - number of invalidates attempted
+``succ_gets``
+number of gets that were successful
+``failed_gets``
+number of gets that failed
+``puts``
+number of puts attempted (all "succeed")
+``invalidates``
+number of invalidates attempted
 A backend implementation may provide additional metrics.
+.. _faq:
 FAQ
+===
-1) Where's the value? (Andrew Morton)
+* Where's the value? (Andrew Morton)
 Cleancache provides a significant performance benefit to many workloads
 in many environments with negligible overhead by improving the
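The debugfs counters listed in the hunk above (`succ_gets`, `failed_gets`) can be combined into a simple hit ratio for gets. The sketch below assumes each counter file holds a single unsigned decimal number, which is how debugfs usually exposes integer counters; the function names `read_counter()`, `get_hit_ratio()` and `cleancache_report()` are hypothetical and not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned decimal counter from a file; returns 0 on success. */
int read_counter(const char *path, unsigned long long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Fraction of gets that succeeded, or -1.0 when no gets were recorded. */
double get_hit_ratio(unsigned long long succ, unsigned long long failed)
{
    unsigned long long total = succ + failed;

    return total ? (double)succ / (double)total : -1.0;
}

/* Print the get hit ratio for the counters found under dir. */
int cleancache_report(const char *dir)
{
    char path[512];
    unsigned long long succ, failed;

    assert(dir != NULL);
    snprintf(path, sizeof(path), "%s/succ_gets", dir);
    if (read_counter(path, &succ) != 0)
        return -1;
    snprintf(path, sizeof(path), "%s/failed_gets", dir);
    if (read_counter(path, &failed) != 0)
        return -1;
    printf("succ_gets=%llu failed_gets=%llu hit ratio=%.3f\n",
           succ, failed, get_hit_ratio(succ, failed));
    return 0;
}
```

Calling `cleancache_report("/sys/kernel/debug/cleancache")` would normally require root privileges and a mounted debugfs; the function returns -1 when the counters cannot be read.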
...@@ -137,8 +156,8 @@ device that stores pages of data in a compressed state. And ...@@ -137,8 +156,8 @@ device that stores pages of data in a compressed state. And
the proposed "RAMster" driver shares RAM across multiple physical the proposed "RAMster" driver shares RAM across multiple physical
systems. systems.
2) Why does cleancache have its sticky fingers so deep inside the * Why does cleancache have its sticky fingers so deep inside the
filesystems and VFS? (Andrew Morton and Christoph Hellwig) filesystems and VFS? (Andrew Morton and Christoph Hellwig)
The core hooks for cleancache in VFS are in most cases a single line The core hooks for cleancache in VFS are in most cases a single line
and the minimum set are placed precisely where needed to maintain and the minimum set are placed precisely where needed to maintain
...@@ -168,9 +187,9 @@ filesystems in the future. ...@@ -168,9 +187,9 @@ filesystems in the future.
The total impact of the hooks to existing fs and mm files is only The total impact of the hooks to existing fs and mm files is only
about 40 lines added (not counting comments and blank lines). about 40 lines added (not counting comments and blank lines).
3) Why not make cleancache asynchronous and batched so it can * Why not make cleancache asynchronous and batched so it can more
more easily interface with real devices with DMA instead easily interface with real devices with DMA instead of copying each
of copying each individual page? (Minchan Kim) individual page? (Minchan Kim)
The one-page-at-a-time copy semantics simplifies the implementation The one-page-at-a-time copy semantics simplifies the implementation
on both the frontend and backend and also allows the backend to on both the frontend and backend and also allows the backend to
...@@ -182,8 +201,8 @@ are avoided. While the interface seems odd for a "real device" ...@@ -182,8 +201,8 @@ are avoided. While the interface seems odd for a "real device"
or for real kernel-addressable RAM, it makes perfect sense for or for real kernel-addressable RAM, it makes perfect sense for
transcendent memory. transcendent memory.
4) Why is non-shared cleancache "exclusive"? And where is the * Why is non-shared cleancache "exclusive"? And where is the
page "invalidated" after a "get"? (Minchan Kim) page "invalidated" after a "get"? (Minchan Kim)
The main reason is to free up space in transcendent memory and The main reason is to free up space in transcendent memory and
to avoid unnecessary cleancache_invalidate calls. If you want inclusive, to avoid unnecessary cleancache_invalidate calls. If you want inclusive,
...@@ -193,7 +212,7 @@ be easily extended to add a "get_no_invalidate" call. ...@@ -193,7 +212,7 @@ be easily extended to add a "get_no_invalidate" call.
The invalidate is done by the cleancache backend implementation. The invalidate is done by the cleancache backend implementation.
5) What's the performance impact? * What's the performance impact?
Performance analysis has been presented at OLS'09 and LCA'10. Performance analysis has been presented at OLS'09 and LCA'10.
Briefly, performance gains can be significant on most workloads, Briefly, performance gains can be significant on most workloads,
...@@ -206,7 +225,7 @@ single-core systems with slow memory-copy speeds, cleancache ...@@ -206,7 +225,7 @@ single-core systems with slow memory-copy speeds, cleancache
has little value, but in newer multicore machines, especially has little value, but in newer multicore machines, especially
consolidated/virtualized machines, it has great value. consolidated/virtualized machines, it has great value.
6) How do I add cleancache support for filesystem X? (Boaz Harrosh) * How do I add cleancache support for filesystem X? (Boaz Harrosh)
Filesystems that are well-behaved and conform to certain Filesystems that are well-behaved and conform to certain
restrictions can utilize cleancache simply by making a call to restrictions can utilize cleancache simply by making a call to
...@@ -217,26 +236,26 @@ not enable the optional cleancache. ...@@ -217,26 +236,26 @@ not enable the optional cleancache.
Some points for a filesystem to consider: Some points for a filesystem to consider:
- The FS should be block-device-based (e.g. a ram-based FS such - The FS should be block-device-based (e.g. a ram-based FS such
as tmpfs should not enable cleancache) as tmpfs should not enable cleancache)
- To ensure coherency/correctness, the FS must ensure that all - To ensure coherency/correctness, the FS must ensure that all
file removal or truncation operations either go through VFS or file removal or truncation operations either go through VFS or
add hooks to do the equivalent cleancache "invalidate" operations add hooks to do the equivalent cleancache "invalidate" operations
- To ensure coherency/correctness, either inode numbers must - To ensure coherency/correctness, either inode numbers must
be unique across the lifetime of the on-disk file OR the be unique across the lifetime of the on-disk file OR the
FS must provide an "encode_fh" function. FS must provide an "encode_fh" function.
- The FS must call the VFS superblock alloc and deactivate routines - The FS must call the VFS superblock alloc and deactivate routines
or add hooks to do the equivalent cleancache calls done there. or add hooks to do the equivalent cleancache calls done there.
- To maximize performance, all pages fetched from the FS should - To maximize performance, all pages fetched from the FS should
go through the do_mpage_readpage routine or the FS should add go through the do_mpage_readpage routine or the FS should add
hooks to do the equivalent (cf. btrfs) hooks to do the equivalent (cf. btrfs)
- Currently, the FS blocksize must be the same as PAGESIZE. This - Currently, the FS blocksize must be the same as PAGESIZE. This
is not an architectural restriction, but no backends currently is not an architectural restriction, but no backends currently
support anything different. support anything different.
- A clustered FS should invoke the "shared_init_fs" cleancache - A clustered FS should invoke the "shared_init_fs" cleancache
hook to get best performance for some backends. hook to get best performance for some backends.
7) Why not use the KVA of the inode as the key? (Christoph Hellwig) * Why not use the KVA of the inode as the key? (Christoph Hellwig)
If cleancache would use the inode virtual address instead of If cleancache would use the inode virtual address instead of
inode/filehandle, the pool id could be eliminated. But, this inode/filehandle, the pool id could be eliminated. But, this
...@@ -251,7 +270,7 @@ of cleancache would be lost because the cache of pages in cleancache ...@@ -251,7 +270,7 @@ of cleancache would be lost because the cache of pages in cleancache
is potentially much larger than the kernel pagecache and is most is potentially much larger than the kernel pagecache and is most
useful if the pages survive inode cache removal. useful if the pages survive inode cache removal.
8) Why is a global variable required? * Why is a global variable required?
The cleancache_enabled flag is checked in all of the frequently-used The cleancache_enabled flag is checked in all of the frequently-used
cleancache hooks. The alternative is a function call to check a static cleancache hooks. The alternative is a function call to check a static
...@@ -262,14 +281,14 @@ global variable allows cleancache to be enabled by default at compile ...@@ -262,14 +281,14 @@ global variable allows cleancache to be enabled by default at compile
time, but have insignificant performance impact when cleancache remains time, but have insignificant performance impact when cleancache remains
disabled at runtime. disabled at runtime.
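The trade-off described above can be sketched in plain C. This is a userspace illustration only, not the kernel's code: `demo_enabled`, `demo_backend_get()` and `demo_get_page()` are hypothetical names, and the sketch assumes the hook is an inline wrapper that tests a global flag before making the comparatively expensive call into a backend.

```c
#include <stdbool.h>

/* Hypothetical global flag, standing in for cleancache_enabled. */
static bool demo_enabled;

/* Counts how often the (hypothetical) backend is actually entered. */
static unsigned long demo_backend_calls;

/* Hypothetical backend entry point; it always reports a miss here. */
static int demo_backend_get(unsigned long index)
{
	(void)index;
	demo_backend_calls++;
	return -1;
}

/*
 * Hypothetical hook: while the flag is clear, the only cost is one
 * test of a global variable and the backend is never called.
 */
static inline int demo_get_page(unsigned long index)
{
	if (!demo_enabled)
		return -1;
	return demo_backend_get(index);
}
```

With the flag left clear, callers of `demo_get_page()` pay for a single comparison; only after the flag is set does any call reach the backend.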
9) Does cleancache work with KVM? * Does cleancache work with KVM?
The memory model of KVM is sufficiently different that a cleancache The memory model of KVM is sufficiently different that a cleancache
backend may have less value for KVM. This remains to be tested, backend may have less value for KVM. This remains to be tested,
especially in an overcommitted system. especially in an overcommitted system.
10) Does cleancache work in userspace? It sounds useful for * Does cleancache work in userspace? It sounds useful for
memory hungry caches like web browsers. (Jamie Lokier) memory hungry caches like web browsers. (Jamie Lokier)
No plans yet, though we agree it sounds useful, at least for No plans yet, though we agree it sounds useful, at least for
apps that bypass the page cache (e.g. O_DIRECT). apps that bypass the page cache (e.g. O_DIRECT).
......
# -*- coding: utf-8; mode: python -*-
project = "Linux Memory Management Documentation"
tags.add("subproject")
latex_documents = [
('index', 'memory-management.tex', project,
'The kernel development community', 'manual'),
]
.. _frontswap:
=========
Frontswap
=========
Frontswap provides a "transcendent memory" interface for swap pages. Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained because In some environments, dramatic performance savings may be obtained because
swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" (Note, frontswap -- and :ref:`cleancache` (merged at 3.0) -- are the "frontends"
and the only necessary changes to the core kernel for transcendent memory; and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers. all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article "Transcendent memory in a nutshell" for a detailed See the LWN.net article `Transcendent memory in a nutshell`_
overview of frontswap and related kernel parts: for a detailed overview of frontswap and related kernel parts)
https://lwn.net/Articles/454795/ )
.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
Frontswap is so named because it can be thought of as the opposite of Frontswap is so named because it can be thought of as the opposite of
a "backing" store for a swap device. The storage is assumed to be a "backing" store for a swap device. The storage is assumed to be
...@@ -50,19 +57,27 @@ or the store fails AND the page is invalidated. This ensures stale data may ...@@ -50,19 +57,27 @@ or the store fails AND the page is invalidated. This ensures stale data may
never be obtained from frontswap. never be obtained from frontswap.
If properly configured, monitoring of frontswap is done via debugfs in If properly configured, monitoring of frontswap is done via debugfs in
the /sys/kernel/debug/frontswap directory. The effectiveness of the ``/sys/kernel/debug/frontswap`` directory. The effectiveness of
frontswap can be measured (across all swap devices) with: frontswap can be measured (across all swap devices) with:
failed_stores - how many store attempts have failed ``failed_stores``
loads - how many loads were attempted (all should succeed) how many store attempts have failed
succ_stores - how many store attempts have succeeded
invalidates - how many invalidates were attempted ``loads``
how many loads were attempted (all should succeed)
``succ_stores``
how many store attempts have succeeded
``invalidates``
how many invalidates were attempted
A backend implementation may provide additional metrics. A backend implementation may provide additional metrics.
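As a rough illustration of how these counters might be combined, the sketch below reads a counter file and computes the share of store attempts that succeeded. It is a userspace sketch under assumptions: the helper names `read_counter()` and `store_success_percent()` are hypothetical, each debugfs file is assumed to hold a single unsigned decimal number, and debugfs is assumed to be mounted at /sys/kernel/debug (reading it normally requires root).

```c
#include <stdio.h>

/*
 * Read one unsigned decimal counter from a file (hypothetical helper).
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_counter(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Percentage of store attempts that succeeded; returns 0.0 when there
 * were no attempts at all, to avoid dividing by zero.
 */
static double store_success_percent(unsigned long long succ_stores,
				    unsigned long long failed_stores)
{
	unsigned long long total = succ_stores + failed_stores;

	if (total == 0)
		return 0.0;
	return 100.0 * (double)succ_stores / (double)total;
}
```

A caller would pass paths such as /sys/kernel/debug/frontswap/succ_stores and /sys/kernel/debug/frontswap/failed_stores to `read_counter()` and feed both values to `store_success_percent()`.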
FAQ FAQ
===
1) Where's the value? * Where's the value?
When a workload starts swapping, performance falls through the floor. When a workload starts swapping, performance falls through the floor.
Frontswap significantly increases performance in many such workloads by Frontswap significantly increases performance in many such workloads by
...@@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And, ...@@ -117,8 +132,8 @@ A KVM implementation is underway and has been RFC'ed to lkml. And,
using frontswap, investigation is also underway on the use of NVM as using frontswap, investigation is also underway on the use of NVM as
a memory extension technology. a memory extension technology.
2) Sure there may be performance advantages in some situations, but * Sure there may be performance advantages in some situations, but
what's the space/time overhead of frontswap? what's the space/time overhead of frontswap?
If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
nothingness and the only overhead is a few extra bytes per swapon'ed nothingness and the only overhead is a few extra bytes per swapon'ed
...@@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. A ...@@ -148,8 +163,8 @@ pressure that can potentially outweigh the other advantages. A
backend, such as zcache, must implement policies to carefully (but backend, such as zcache, must implement policies to carefully (but
dynamically) manage memory limits to ensure this doesn't happen. dynamically) manage memory limits to ensure this doesn't happen.
3) OK, how about a quick overview of what this frontswap patch does * OK, how about a quick overview of what this frontswap patch does
in terms that a kernel hacker can grok? in terms that a kernel hacker can grok?
Let's assume that a frontswap "backend" has registered during Let's assume that a frontswap "backend" has registered during
kernel initialization; this registration indicates that this kernel initialization; this registration indicates that this
...@@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend ...@@ -188,9 +203,9 @@ and (potentially) a swap device write are replaced by a "frontswap backend
store" and (possibly) a "frontswap backend load", which are presumably much store" and (possibly) a "frontswap backend load", which are presumably much
faster. faster.
4) Can't frontswap be configured as a "special" swap device that is * Can't frontswap be configured as a "special" swap device that is
just higher priority than any real swap device (e.g. like zswap, just higher priority than any real swap device (e.g. like zswap,
or maybe swap-over-nbd/NFS)? or maybe swap-over-nbd/NFS)?
No. First, the existing swap subsystem doesn't allow for any kind of No. First, the existing swap subsystem doesn't allow for any kind of
swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
...@@ -240,9 +255,9 @@ installation, frontswap is useless. Swapless portable devices ...@@ -240,9 +255,9 @@ installation, frontswap is useless. Swapless portable devices
can still use frontswap but a backend for such devices must configure can still use frontswap but a backend for such devices must configure
some kind of "ghost" swap device and ensure that it is never used. some kind of "ghost" swap device and ensure that it is never used.
5) Why this weird definition about "duplicate stores"? If a page * Why this weird definition about "duplicate stores"? If a page
has been previously successfully stored, can't it always be has been previously successfully stored, can't it always be
successfully overwritten? successfully overwritten?
Nearly always it can, but no, sometimes it cannot. Consider an example Nearly always it can, but no, sometimes it cannot. Consider an example
where data is compressed and the original 4K page has been compressed where data is compressed and the original 4K page has been compressed
...@@ -254,7 +269,7 @@ the old data and ensure that it is no longer accessible. Since the ...@@ -254,7 +269,7 @@ the old data and ensure that it is no longer accessible. Since the
swap subsystem then writes the new data to the real swap device, swap subsystem then writes the new data to the real swap device,
this is the correct course of action to ensure coherency. this is the correct course of action to ensure coherency.
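The duplicate-store rule can be illustrated with a toy backend. This is a minimal sketch, not any real frontswap backend: `struct toy_backend`, `toy_store()`, `toy_invalidate()` and `toy_load()` are hypothetical, and the backend is assumed to track only a compressed size per swap offset against a fixed byte budget. The point it shows is that a failed overwrite leaves no old copy behind.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Toy backend state (hypothetical): one compressed size per offset. */
struct toy_backend {
	size_t capacity;            /* total bytes the backend may hold */
	size_t used;                /* bytes currently held */
	size_t size[TOY_SLOTS];     /* compressed size per offset */
	bool   present[TOY_SLOTS];  /* does the offset hold valid data? */
};

/* Drop any data held for this offset. */
static void toy_invalidate(struct toy_backend *b, unsigned int offset)
{
	if (offset < TOY_SLOTS && b->present[offset]) {
		b->used -= b->size[offset];
		b->present[offset] = false;
	}
}

/*
 * Store (or overwrite) the page at 'offset' using 'compressed_size' bytes.
 * Returns 0 on success.  On failure it returns -1, and any old copy has
 * already been invalidated, so a later load cannot return stale data.
 */
static int toy_store(struct toy_backend *b, unsigned int offset,
		     size_t compressed_size)
{
	if (offset >= TOY_SLOTS)
		return -1;
	toy_invalidate(b, offset);
	if (compressed_size > b->capacity - b->used)
		return -1;
	b->size[offset] = compressed_size;
	b->present[offset] = true;
	b->used += compressed_size;
	return 0;
}

/* A load succeeds only if valid data is present for the offset. */
static bool toy_load(const struct toy_backend *b, unsigned int offset)
{
	return offset < TOY_SLOTS && b->present[offset];
}
```

For instance, with a 4096-byte budget, a page first stored in 1024 bytes and a neighbour occupying 3072 bytes, an attempt to overwrite the first page with 4096 bytes of incompressible data fails, and the first page is no longer loadable from the toy backend.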
6) What is frontswap_shrink for? * What is frontswap_shrink for?
When the (non-frontswap) swap subsystem swaps out a page to a real When the (non-frontswap) swap subsystem swaps out a page to a real
swap device, that page is only taking up low-value pre-allocated disk swap device, that page is only taking up low-value pre-allocated disk
...@@ -267,7 +282,7 @@ to "repatriate" pages sent to a remote machine back to the local machine; ...@@ -267,7 +282,7 @@ to "repatriate" pages sent to a remote machine back to the local machine;
this is driven using the frontswap_shrink mechanism when memory pressure this is driven using the frontswap_shrink mechanism when memory pressure
subsides. subsides.
7) Why does the frontswap patch create the new include file swapfile.h? * Why does the frontswap patch create the new include file swapfile.h?
The frontswap code depends on some swap-subsystem-internal data The frontswap code depends on some swap-subsystem-internal data
structures that have, over the years, moved back and forth between structures that have, over the years, moved back and forth between
......
.. _highmem:
==================== ====================
HIGH MEMORY HANDLING High Memory Handling
==================== ====================
By: Peter Zijlstra <a.p.zijlstra@chello.nl> By: Peter Zijlstra <a.p.zijlstra@chello.nl>
Contents: .. contents:: :local:
(*) What is high memory?
(*) Temporary virtual mappings.
(*) Using kmap_atomic.
(*) Cost of temporary mappings.
(*) i386 PAE.
What Is High Memory?
====================
WHAT IS HIGH MEMORY?
==================== ====================
High memory (highmem) is used when the size of physical memory approaches or High memory (highmem) is used when the size of physical memory approaches or
...@@ -38,7 +27,7 @@ kernel entry/exit. This means the available virtual memory space (4GiB on ...@@ -38,7 +27,7 @@ kernel entry/exit. This means the available virtual memory space (4GiB on
i386) has to be divided between user and kernel space. i386) has to be divided between user and kernel space.
The traditional split for architectures using this approach is 3:1, 3GiB for The traditional split for architectures using this approach is 3:1, 3GiB for
userspace and the top 1GiB for kernel space: userspace and the top 1GiB for kernel space::
+--------+ 0xffffffff +--------+ 0xffffffff
| Kernel | | Kernel |
...@@ -58,40 +47,38 @@ and user maps. Some hardware (like some ARMs), however, have limited virtual ...@@ -58,40 +47,38 @@ and user maps. Some hardware (like some ARMs), however, have limited virtual
space when they use mm context tags. space when they use mm context tags.
========================== Temporary Virtual Mappings
TEMPORARY VIRTUAL MAPPINGS
========================== ==========================
The kernel contains several ways of creating temporary mappings: The kernel contains several ways of creating temporary mappings:
(*) vmap(). This can be used to make a long duration mapping of multiple * vmap(). This can be used to make a long duration mapping of multiple
physical pages into a contiguous virtual space. It needs global physical pages into a contiguous virtual space. It needs global
synchronization to unmap. synchronization to unmap.
(*) kmap(). This permits a short duration mapping of a single page. It needs * kmap(). This permits a short duration mapping of a single page. It needs
global synchronization, but is amortized somewhat. It is also prone to global synchronization, but is amortized somewhat. It is also prone to
deadlocks when used in a nested fashion, and so it is not recommended for deadlocks when used in a nested fashion, and so it is not recommended for
new code. new code.
(*) kmap_atomic(). This permits a very short duration mapping of a single * kmap_atomic(). This permits a very short duration mapping of a single
page. Since the mapping is restricted to the CPU that issued it, it page. Since the mapping is restricted to the CPU that issued it, it
performs well, but the issuing task is therefore required to stay on that performs well, but the issuing task is therefore required to stay on that
CPU until it has finished, lest some other task displace its mappings. CPU until it has finished, lest some other task displace its mappings.
kmap_atomic() may also be used by interrupt contexts, since it does not kmap_atomic() may also be used by interrupt contexts, since it does not
sleep and the caller may not sleep until after kunmap_atomic() is called. sleep and the caller may not sleep until after kunmap_atomic() is called.
It may be assumed that k[un]map_atomic() won't fail. It may be assumed that k[un]map_atomic() won't fail.
================= Using kmap_atomic
USING KMAP_ATOMIC
================= =================
When and where to use kmap_atomic() is straightforward. It is used when code When and where to use kmap_atomic() is straightforward. It is used when code
wants to access the contents of a page that might be allocated from high memory wants to access the contents of a page that might be allocated from high memory
(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
functions, and they can be used in a manner similar to the following: functions, and they can be used in a manner similar to the following::
/* Find the page of interest. */ /* Find the page of interest. */
struct page *page = find_get_page(mapping, offset); struct page *page = find_get_page(mapping, offset);
...@@ -109,7 +96,7 @@ Note that the kunmap_atomic() call takes the result of the kmap_atomic() call ...@@ -109,7 +96,7 @@ Note that the kunmap_atomic() call takes the result of the kmap_atomic() call
not the argument. not the argument.
If you need to map two pages because you want to copy from one page to If you need to map two pages because you want to copy from one page to
another you need to keep the kmap_atomic calls strictly nested, like: another you need to keep the kmap_atomic calls strictly nested, like::
vaddr1 = kmap_atomic(page1); vaddr1 = kmap_atomic(page1);
vaddr2 = kmap_atomic(page2); vaddr2 = kmap_atomic(page2);
...@@ -120,8 +107,7 @@ another you need to keep the kmap_atomic calls strictly nested, like: ...@@ -120,8 +107,7 @@ another you need to keep the kmap_atomic calls strictly nested, like:
kunmap_atomic(vaddr1); kunmap_atomic(vaddr1);
========================== Cost of Temporary Mappings
COST OF TEMPORARY MAPPINGS
========================== ==========================
The cost of creating temporary mappings can be quite high. The arch has to The cost of creating temporary mappings can be quite high. The arch has to
...@@ -136,25 +122,24 @@ If CONFIG_MMU is not set, then there can be no temporary mappings and no ...@@ -136,25 +122,24 @@ If CONFIG_MMU is not set, then there can be no temporary mappings and no
highmem. In such a case, the arithmetic approach will also be used. highmem. In such a case, the arithmetic approach will also be used.
========
i386 PAE i386 PAE
======== ========
The i386 arch, under some circumstances, will permit you to stick up to 64GiB The i386 arch, under some circumstances, will permit you to stick up to 64GiB
of RAM into your 32-bit machine. This has a number of consequences: of RAM into your 32-bit machine. This has a number of consequences:
(*) Linux needs a page-frame structure for each page in the system and the * Linux needs a page-frame structure for each page in the system and the
pageframes need to live in the permanent mapping, which means: pageframes need to live in the permanent mapping, which means:
(*) you can have 896M/sizeof(struct page) page-frames at most; with struct * you can have 896M/sizeof(struct page) page-frames at most; with struct
page being 32 bytes that would end up being something on the order of 112G page being 32 bytes that would end up being something on the order of 112G
worth of pages; the kernel, however, needs to store more than just worth of pages; the kernel, however, needs to store more than just
page-frames in that memory... page-frames in that memory...
(*) PAE makes your page tables larger - which slows the system down as more * PAE makes your page tables larger - which slows the system down as more
data has to be accessed to traverse in TLB fills and the like. One data has to be accessed to traverse in TLB fills and the like. One
advantage is that PAE has more PTE bits and can provide advanced features advantage is that PAE has more PTE bits and can provide advanced features
like NX and PAT. like NX and PAT.
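The 112G figure in the list above follows from simple arithmetic, worked through in the sketch below. The constants are assumptions taken from the text (896M of permanently mapped memory, a 32-byte struct page) plus an assumed 4 KiB page size; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define LOWMEM_BYTES      (896ULL << 20)  /* 896 MiB permanent mapping */
#define STRUCT_PAGE_BYTES 32ULL           /* struct page size used in the text */
#define PAGE_BYTES        4096ULL         /* assumed 4 KiB page size */

/* Upper bound on page frames if lowmem held nothing but struct page. */
static uint64_t max_page_frames(void)
{
	return LOWMEM_BYTES / STRUCT_PAGE_BYTES;
}

/* Amount of RAM, in GiB, that this many page frames could describe. */
static uint64_t max_described_gib(void)
{
	return (max_page_frames() * PAGE_BYTES) >> 30;
}
```

Under these assumptions the bound is 29360128 page frames, which at 4 KiB each describe 112 GiB. This is a theoretical ceiling only, since the kernel needs the same memory for much more than page frames.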
The general recommendation is that you don't use more than 8GiB on a 32-bit The general recommendation is that you don't use more than 8GiB on a 32-bit
machine - although more might work for you and your workload, you're pretty machine - although more might work for you and your workload, you're pretty
......
.. _hmm:
=====================================
Heterogeneous Memory Management (HMM) Heterogeneous Memory Management (HMM)
=====================================
Provide infrastructure and helpers to integrate non-conventional memory (device Provide infrastructure and helpers to integrate non-conventional memory (device
memory like GPU on board memory) into regular kernel path, with the cornerstone memory like GPU on board memory) into regular kernel path, with the cornerstone
...@@ -6,10 +10,10 @@ of this being specialized struct page for such memory (see sections 5 to 7 of ...@@ -6,10 +10,10 @@ of this being specialized struct page for such memory (see sections 5 to 7 of
this document). this document).
HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e., HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e.,
allowing a device to transparently access program address coherently with the allowing a device to transparently access program address coherently with
CPU meaning that any valid pointer on the CPU is also a valid pointer for the the CPU meaning that any valid pointer on the CPU is also a valid pointer
device. This is becoming mandatory to simplify the use of advanced hetero- for the device. This is becoming mandatory to simplify the use of advanced
geneous computing where GPU, DSP, or FPGA are used to perform various heterogeneous computing where GPU, DSP, or FPGA are used to perform various
computations on behalf of a process. computations on behalf of a process.
This document is divided as follows: in the first section I expose the problems This document is divided as follows: in the first section I expose the problems
...@@ -21,19 +25,10 @@ fifth section deals with how device memory is represented inside the kernel. ...@@ -21,19 +25,10 @@ fifth section deals with how device memory is represented inside the kernel.
Finally, the last section presents a new migration helper that allows Finally, the last section presents a new migration helper that allows
leveraging the device DMA engine. leveraging the device DMA engine.
.. contents:: :local:
1) Problems of using a device specific memory allocator: Problems of using a device specific memory allocator
2) I/O bus, device memory characteristics ====================================================
3) Shared address space and migration
4) Address space mirroring implementation and API
5) Represent and manage device memory from core kernel point of view
6) Migration to and from device memory
7) Memory cgroup (memcg) and rss accounting
-------------------------------------------------------------------------------
1) Problems of using a device specific memory allocator:
Devices with a large amount of on board memory (several gigabytes) like GPUs Devices with a large amount of on board memory (several gigabytes) like GPUs
have historically managed their memory through dedicated driver specific APIs. have historically managed their memory through dedicated driver specific APIs.
...@@ -77,9 +72,8 @@ are only do-able with a shared address space. It is also more reasonable to use ...@@ -77,9 +72,8 @@ are only do-able with a shared address space. It is also more reasonable to use
a shared address space for all other patterns. a shared address space for all other patterns.
------------------------------------------------------------------------------- I/O bus, device memory characteristics
======================================
2) I/O bus, device memory characteristics
I/O buses cripple shared address spaces due to a few limitations. Most I/O I/O buses cripple shared address spaces due to a few limitations. Most I/O
buses only allow basic memory access from device to main memory; even cache buses only allow basic memory access from device to main memory; even cache
...@@ -109,9 +103,8 @@ access any memory but we must also permit any memory to be migrated to device ...@@ -109,9 +103,8 @@ access any memory but we must also permit any memory to be migrated to device
memory while device is using it (blocking CPU access while it happens). memory while device is using it (blocking CPU access while it happens).
------------------------------------------------------------------------------- Shared address space and migration
==================================
3) Shared address space and migration
HMM intends to provide two main features. First one is to share the address HMM intends to provide two main features. First one is to share the address
space by duplicating the CPU page table in the device page table so the same space by duplicating the CPU page table in the device page table so the same
...@@ -148,23 +141,23 @@ ages device memory by migrating the part of the data set that is actively being ...@@ -148,23 +141,23 @@ ages device memory by migrating the part of the data set that is actively being
used by the device. used by the device.
------------------------------------------------------------------------------- Address space mirroring implementation and API
==============================================
4) Address space mirroring implementation and API
Address space mirroring's main objective is to allow duplication of a range of Address space mirroring's main objective is to allow duplication of a range of
CPU page table into a device page table; HMM helps keep both synchronized. A CPU page table into a device page table; HMM helps keep both synchronized. A
device driver that wants to mirror a process address space must start with the device driver that wants to mirror a process address space must start with the
registration of an hmm_mirror struct: registration of an hmm_mirror struct::
int hmm_mirror_register(struct hmm_mirror *mirror, int hmm_mirror_register(struct hmm_mirror *mirror,
struct mm_struct *mm); struct mm_struct *mm);
int hmm_mirror_register_locked(struct hmm_mirror *mirror, int hmm_mirror_register_locked(struct hmm_mirror *mirror,
struct mm_struct *mm); struct mm_struct *mm);
The locked variant is to be used when the driver is already holding mmap_sem The locked variant is to be used when the driver is already holding mmap_sem
of the mm in write mode. The mirror struct has a set of callbacks that are used of the mm in write mode. The mirror struct has a set of callbacks that are used
to propagate CPU page tables: to propagate CPU page tables::
struct hmm_mirror_ops { struct hmm_mirror_ops {
/* sync_cpu_device_pagetables() - synchronize page tables /* sync_cpu_device_pagetables() - synchronize page tables
...@@ -193,10 +186,10 @@ The device driver must perform the update action to the range (mark range ...@@ -193,10 +186,10 @@ The device driver must perform the update action to the range (mark range
read only, or fully unmap, ...). The device must be done with the update before read only, or fully unmap, ...). The device must be done with the update before
the driver callback returns. the driver callback returns.
When the device driver wants to populate a range of virtual addresses, it can When the device driver wants to populate a range of virtual addresses, it can
use either: use either::
int hmm_vma_get_pfns(struct vm_area_struct *vma,
int hmm_vma_get_pfns(struct vm_area_struct *vma,
struct hmm_range *range, struct hmm_range *range,
unsigned long start, unsigned long start,
unsigned long end, unsigned long end,
...@@ -221,7 +214,7 @@ provides a set of flags to help the driver identify special CPU page table ...@@ -221,7 +214,7 @@ provides a set of flags to help the driver identify special CPU page table
entries. entries.
Locking with the update() callback is the most important aspect the driver must Locking with the update() callback is the most important aspect the driver must
respect in order to keep things properly synchronized. The usage pattern is: respect in order to keep things properly synchronized. The usage pattern is::
int driver_populate_range(...) int driver_populate_range(...)
{ {
...@@ -262,9 +255,8 @@ report commands as executed is serialized (there is no point in doing this ...@@ -262,9 +255,8 @@ report commands as executed is serialized (there is no point in doing this
concurrently). concurrently).
------------------------------------------------------------------------------- Represent and manage device memory from core kernel point of view
=================================================================
5) Represent and manage device memory from core kernel point of view
Several different designs were tried to support device memory. First one used Several different designs were tried to support device memory. First one used
a device specific data structure to keep information about migrated memory and a device specific data structure to keep information about migrated memory and
...@@ -280,14 +272,14 @@ unaware of the difference. We only need to make sure that no one ever tries to ...@@ -280,14 +272,14 @@ unaware of the difference. We only need to make sure that no one ever tries to
map those pages from the CPU side. map those pages from the CPU side.
HMM provides a set of helpers to register and hotplug device memory as a new HMM provides a set of helpers to register and hotplug device memory as a new
region needing a struct page. This is offered through a very simple API: region needing a struct page. This is offered through a very simple API::
struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
struct device *device, struct device *device,
unsigned long size); unsigned long size);
void hmm_devmem_remove(struct hmm_devmem *devmem); void hmm_devmem_remove(struct hmm_devmem *devmem);
The hmm_devmem_ops is where most of the important things are: The hmm_devmem_ops is where most of the important things are::
struct hmm_devmem_ops { struct hmm_devmem_ops {
void (*free)(struct hmm_devmem *devmem, struct page *page); void (*free)(struct hmm_devmem *devmem, struct page *page);
...@@ -306,13 +298,12 @@ which it cannot do. This second callback must trigger a migration back to ...@@ -306,13 +298,12 @@ which it cannot do. This second callback must trigger a migration back to
system memory. system memory.
------------------------------------------------------------------------------- Migration to and from device memory
===================================
6) Migration to and from device memory
Because the CPU cannot access device memory, migration must use the device DMA Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copy from and to device memory. For this we need a new engine to perform copy from and to device memory. For this we need a new
migration helper: migration helper::
int migrate_vma(const struct migrate_vma_ops *ops, int migrate_vma(const struct migrate_vma_ops *ops,
struct vm_area_struct *vma, struct vm_area_struct *vma,
...@@ -331,7 +322,7 @@ migration might be for a range of addresses the device is actively accessing. ...@@ -331,7 +322,7 @@ migration might be for a range of addresses the device is actively accessing.
The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy()) The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
controls destination memory allocation and copy operation. Second one is there controls destination memory allocation and copy operation. Second one is there
to allow the device driver to perform cleanup operations after migration. to allow the device driver to perform cleanup operations after migration::
struct migrate_vma_ops { struct migrate_vma_ops {
void (*alloc_and_copy)(struct vm_area_struct *vma, void (*alloc_and_copy)(struct vm_area_struct *vma,
...@@ -365,9 +356,8 @@ bandwidth but this is considered as a rare event and a price that we are ...@@ -365,9 +356,8 @@ bandwidth but this is considered as a rare event and a price that we are
willing to pay to keep all the code simpler. willing to pay to keep all the code simpler.
------------------------------------------------------------------------------- Memory cgroup (memcg) and rss accounting
========================================
7) Memory cgroup (memcg) and rss accounting
For now device memory is accounted as any regular page in rss counters (either For now device memory is accounted as any regular page in rss counters (either
anonymous if device page is used for anonymous, file if device page is used for anonymous if device page is used for anonymous, file if device page is used for
......
Hugetlbfs Reservation Overview .. _hugetlbfs_reserve:
------------------------------
Huge pages as described at 'Documentation/vm/hugetlbpage.txt' are typically =====================
Hugetlbfs Reservation
=====================
Overview
========
Huge pages as described at :ref:`hugetlbpage` are typically
preallocated for application use. These huge pages are instantiated in a preallocated for application use. These huge pages are instantiated in a
task's address space at page fault time if the VMA indicates huge pages are task's address space at page fault time if the VMA indicates huge pages are
to be used. If no huge page exists at page fault time, the task is sent to be used. If no huge page exists at page fault time, the task is sent
...@@ -17,47 +24,55 @@ describe how huge page reserve processing is done in the v4.10 kernel. ...@@ -17,47 +24,55 @@ describe how huge page reserve processing is done in the v4.10 kernel.
Audience Audience
-------- ========
This description is primarily targeted at kernel developers who are modifying This description is primarily targeted at kernel developers who are modifying
hugetlbfs code. hugetlbfs code.
The Data Structures The Data Structures
------------------- ===================
resv_huge_pages resv_huge_pages
This is a global (per-hstate) count of reserved huge pages. Reserved This is a global (per-hstate) count of reserved huge pages. Reserved
huge pages are only available to the task which reserved them. huge pages are only available to the task which reserved them.
Therefore, the number of huge pages generally available is computed Therefore, the number of huge pages generally available is computed
as (free_huge_pages - resv_huge_pages). as (``free_huge_pages - resv_huge_pages``).
Reserve Map Reserve Map
A reserve map is described by the structure: A reserve map is described by the structure::
struct resv_map {
struct kref refs; struct resv_map {
spinlock_t lock; struct kref refs;
struct list_head regions; spinlock_t lock;
long adds_in_progress; struct list_head regions;
struct list_head region_cache; long adds_in_progress;
long region_cache_count; struct list_head region_cache;
}; long region_cache_count;
};
There is one reserve map for each huge page mapping in the system. There is one reserve map for each huge page mapping in the system.
The regions list within the resv_map describes the regions within The regions list within the resv_map describes the regions within
the mapping. A region is described as: the mapping. A region is described as::
struct file_region {
struct list_head link; struct file_region {
long from; struct list_head link;
long to; long from;
}; long to;
};
The 'from' and 'to' fields of the file region structure are huge page The 'from' and 'to' fields of the file region structure are huge page
indices into the mapping. Depending on the type of mapping, a indices into the mapping. Depending on the type of mapping, a
region in the resv_map may indicate reservations exist for the region in the resv_map may indicate reservations exist for the
range, or reservations do not exist. range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations Flags for MAP_PRIVATE Reservations
These are stored in the bottom bits of the reservation map pointer. These are stored in the bottom bits of the reservation map pointer.
#define HPAGE_RESV_OWNER (1UL << 0) Indicates this task is the
owner of the reservations associated with the mapping. ``#define HPAGE_RESV_OWNER (1UL << 0)``
#define HPAGE_RESV_UNMAPPED (1UL << 1) Indicates task originally Indicates this task is the owner of the reservations
mapping this range (and creating reserves) has unmapped a associated with the mapping.
page from this task (the child) due to a failed COW. ``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
Indicates task originally mapping this range (and creating
reserves) has unmapped a page from this task (the child)
due to a failed COW.
Page Flags Page Flags
The PagePrivate page flag is used to indicate that a huge page The PagePrivate page flag is used to indicate that a huge page
reservation must be restored when the huge page is freed. More reservation must be restored when the huge page is freed. More
...@@ -65,12 +80,14 @@ Page Flags ...@@ -65,12 +80,14 @@ Page Flags
Reservation Map Location (Private or Shared) Reservation Map Location (Private or Shared)
-------------------------------------------- ============================================
A huge page mapping or segment is either private or shared. If private, A huge page mapping or segment is either private or shared. If private,
it is typically only available to a single address space (task). If shared, it is typically only available to a single address space (task). If shared,
it can be mapped into multiple address spaces (tasks). The location and it can be mapped into multiple address spaces (tasks). The location and
semantics of the reservation map are significantly different for the two types semantics of the reservation map are significantly different for the two types
of mappings. Location differences are: of mappings. Location differences are:
- For private mappings, the reservation map hangs off the VMA structure. - For private mappings, the reservation map hangs off the VMA structure.
Specifically, vma->vm_private_data. This reserve map is created at the Specifically, vma->vm_private_data. This reserve map is created at the
time the mapping (mmap(MAP_PRIVATE)) is created. time the mapping (mmap(MAP_PRIVATE)) is created.
...@@ -82,15 +99,15 @@ of mappings. Location differences are: ...@@ -82,15 +99,15 @@ of mappings. Location differences are:
Creating Reservations Creating Reservations
--------------------- =====================
Reservations are created when a huge page backed shared memory segment is Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB). created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages() These operations result in a call to the routine hugetlb_reserve_pages()::
int hugetlb_reserve_pages(struct inode *inode, int hugetlb_reserve_pages(struct inode *inode,
long from, long to, long from, long to,
struct vm_area_struct *vma, struct vm_area_struct *vma,
vm_flags_t vm_flags) vm_flags_t vm_flags)
The first thing hugetlb_reserve_pages() does is check whether the NORESERVE The first thing hugetlb_reserve_pages() does is check whether the NORESERVE
flag was specified in either the shmget() or mmap() call. If NORESERVE flag was specified in either the shmget() or mmap() call. If NORESERVE
...@@ -105,6 +122,7 @@ the 'from' and 'to' arguments have been adjusted by this offset. ...@@ -105,6 +122,7 @@ the 'from' and 'to' arguments have been adjusted by this offset.
One of the big differences between PRIVATE and SHARED mappings is the way One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map. in which reservations are represented in the reservation map.
- For shared mappings, an entry in the reservation map indicates a reservation - For shared mappings, an entry in the reservation map indicates a reservation
exists or did exist for the corresponding page. As reservations are exists or did exist for the corresponding page. As reservations are
consumed, the reservation map is not modified. consumed, the reservation map is not modified.
...@@ -121,12 +139,13 @@ to indicate this VMA owns the reservations. ...@@ -121,12 +139,13 @@ to indicate this VMA owns the reservations.
The reservation map is consulted to determine how many huge page reservations The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment. For private mappings, this is are needed for the current mapping/segment. For private mappings, this is
always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the
section "Reservation Map Modifications" for details on how this is accomplished. section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.
The mapping may be associated with a subpool. If so, the subpool is consulted The mapping may be associated with a subpool. If so, the subpool is consulted
to ensure there is sufficient space for the mapping. It is possible that the to ensure there is sufficient space for the mapping. It is possible that the
subpool has set aside reservations that can be used for the mapping. See the subpool has set aside reservations that can be used for the mapping. See the
section "Subpool Reservations" for more details. section :ref:`Subpool Reservations <sub_pool_resv>` for more details.
After consulting the reservation map and subpool, the number of needed new After consulting the reservation map and subpool, the number of needed new
reservations is known. The routine hugetlb_acct_memory() is called to check reservations is known. The routine hugetlb_acct_memory() is called to check
...@@ -135,9 +154,11 @@ calls into routines that potentially allocate and adjust surplus page counts. ...@@ -135,9 +154,11 @@ calls into routines that potentially allocate and adjust surplus page counts.
However, within those routines the code is simply checking to ensure there However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation. If there are, are enough free huge pages to accommodate the reservation. If there are,
the global reservation count resv_huge_pages is adjusted something like the the global reservation count resv_huge_pages is adjusted something like the
following. following::
if (resv_needed <= (free_huge_pages - resv_huge_pages)) if (resv_needed <= (free_huge_pages - resv_huge_pages))
resv_huge_pages += resv_needed; resv_huge_pages += resv_needed;
Note that the global lock hugetlb_lock is held when checking and adjusting Note that the global lock hugetlb_lock is held when checking and adjusting
these counters. these counters.
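The accounting step above can be modelled in a few lines. This is a minimal sketch, not the kernel code: it takes the "generally available" figure to be free_huge_pages - resv_huge_pages, as stated in the data structures section, ignores surplus page allocation, and runs single-threaded rather than under hugetlb_lock. The names toy_hstate and toy_reserve are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the per-hstate counters named in the text. */
struct toy_hstate {
	long free_huge_pages;	/* pages currently on the free lists */
	long resv_huge_pages;	/* pages already promised to mappings */
};

/*
 * Hypothetical helper: try to reserve resv_needed pages.
 * Only (free - resv) pages are unpromised, so that is the budget.
 * The real code holds hugetlb_lock around this check and update.
 */
static bool toy_reserve(struct toy_hstate *h, long resv_needed)
{
	long available = h->free_huge_pages - h->resv_huge_pages;

	assert(resv_needed >= 0);
	if (resv_needed > available)
		return false;	/* not enough unreserved pages */
	h->resv_huge_pages += resv_needed;
	return true;
}
```

With 10 free pages and 4 already reserved, a request for 6 succeeds and a further request for 1 is refused, because every free page is then spoken for.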
...@@ -152,14 +173,18 @@ If hugetlb_reserve_pages() was successful, the global reservation count and ...@@ -152,14 +173,18 @@ If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'. ensure reservations exist for the range 'from' - 'to'.
.. _consume_resv:
Consuming Reservations/Allocating a Huge Page Consuming Reservations/Allocating a Huge Page
--------------------------------------------- =============================================
Reservations are consumed when huge pages associated with the reservations Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping. The allocation are allocated and instantiated in the corresponding mapping. The allocation
is performed within the routine alloc_huge_page(). is performed within the routine alloc_huge_page()::
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve) struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
alloc_huge_page is passed a VMA pointer and a virtual address, so it can alloc_huge_page is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists. In addition, consult the reservation map to determine if a reservation exists. In addition,
alloc_huge_page takes the argument avoid_reserve which indicates reserves alloc_huge_page takes the argument avoid_reserve which indicates reserves
...@@ -170,8 +195,9 @@ page are being allocated. ...@@ -170,8 +195,9 @@ page are being allocated.
The helper routine vma_needs_reservation() is called to determine if a The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping(vma). See the section reservation exists for the address within the mapping(vma). See the section
"Reservation Map Helper Routines" for detailed information on what this :ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
routine does. The value returned from vma_needs_reservation() is generally information on what this routine does.
The value returned from vma_needs_reservation() is generally
0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists. 0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the If a reservation does not exist, and there is a subpool associated with the
mapping the subpool is consulted to determine if it contains reservations. mapping the subpool is consulted to determine if it contains reservations.
...@@ -180,21 +206,25 @@ However, in every case the avoid_reserve argument overrides the use of ...@@ -180,21 +206,25 @@ However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation. After determining whether a reservation a reservation for the allocation. After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma() exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called. This routine takes two arguments related to reservations: is called. This routine takes two arguments related to reservations:
- avoid_reserve, this is the same value/argument passed to alloc_huge_page() - avoid_reserve, this is the same value/argument passed to alloc_huge_page()
- chg, even though this argument is of type long only the values 0 or 1 are - chg, even though this argument is of type long only the values 0 or 1 are
passed to dequeue_huge_page_vma. If the value is 0, it indicates a passed to dequeue_huge_page_vma. If the value is 0, it indicates a
reservation exists (see the section "Memory Policy and Reservations" for reservation exists (see the section "Memory Policy and Reservations" for
possible issues). If the value is 1, it indicates a reservation does not possible issues). If the value is 1, it indicates a reservation does not
exist and the page must be taken from the global free pool if possible. exist and the page must be taken from the global free pool if possible.
The free lists associated with the memory policy of the VMA are searched for The free lists associated with the memory policy of the VMA are searched for
a free page. If a page is found, the value free_huge_pages is decremented a free page. If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list. If there was a reservation when the page is removed from the free list. If there was a reservation
associated with the page, the following adjustments are made: associated with the page, the following adjustments are made::
SetPagePrivate(page); /* Indicates allocating this page consumed SetPagePrivate(page); /* Indicates allocating this page consumed
* a reservation, and if an error is * a reservation, and if an error is
* encountered such that the page must be * encountered such that the page must be
* freed, the reservation will be restored. */ * freed, the reservation will be restored. */
resv_huge_pages--; /* Decrement the global reservation count */ resv_huge_pages--; /* Decrement the global reservation count */
Note, if no huge page can be found that satisfies the VMA's memory policy Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator. This an attempt will be made to allocate one using the buddy allocator. This
brings up the issue of surplus huge pages and overcommit which is beyond brings up the issue of surplus huge pages and overcommit which is beyond
...@@ -222,12 +252,14 @@ mapping. In such cases, the reservation count and subpool free page count ...@@ -222,12 +252,14 @@ mapping. In such cases, the reservation count and subpool free page count
will be off by one. This rare condition can be identified by comparing the will be off by one. This rare condition can be identified by comparing the
return value from vma_needs_reservation and vma_commit_reservation. If such return value from vma_needs_reservation and vma_commit_reservation. If such
a race is detected, the subpool and global reserve counts are adjusted to a race is detected, the subpool and global reserve counts are adjusted to
compensate. See the section "Reservation Map Helper Routines" for more compensate. See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines. information on these routines.
Instantiate Huge Pages Instantiate Huge Pages
---------------------- ======================
After huge page allocation, the page is typically added to the page tables After huge page allocation, the page is typically added to the page tables
of the allocating task. Before this, pages in a shared mapping are added of the allocating task. Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous to the page cache and pages in private mappings are added to an anonymous
...@@ -237,7 +269,8 @@ to the global reservation count (resv_huge_pages). ...@@ -237,7 +269,8 @@ to the global reservation count (resv_huge_pages).
Freeing Huge Pages Freeing Huge Pages
------------------ ==================
Huge page freeing is performed by the routine free_huge_page(). This routine Huge page freeing is performed by the routine free_huge_page(). This routine
is the destructor for hugetlbfs compound pages. As a result, it is only is the destructor for hugetlbfs compound pages. As a result, it is only
passed a pointer to the page struct. When a huge page is freed, reservation passed a pointer to the page struct. When a huge page is freed, reservation
...@@ -247,7 +280,8 @@ on an error path where a global reserve count must be restored. ...@@ -247,7 +280,8 @@ on an error path where a global reserve count must be restored.
The page->private field points to any subpool associated with the page. The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section "Consuming Reservations/Allocating a Huge Page" be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set). for information on how these are set).
The routine first calls hugepage_subpool_put_pages() for the page. If this The routine first calls hugepage_subpool_put_pages() for the page. If this
...@@ -259,9 +293,11 @@ Therefore, the global resv_huge_pages counter is incremented in this case. ...@@ -259,9 +293,11 @@ Therefore, the global resv_huge_pages counter is incremented in this case.
If the PagePrivate flag was set in the page, the global resv_huge_pages counter If the PagePrivate flag was set in the page, the global resv_huge_pages counter
will always be incremented. will always be incremented.
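The interaction between consuming a reservation at allocation time and restoring it at free time can be sketched with a toy model. Everything here is an assumption made for illustration: toy_pool, toy_page, toy_alloc and toy_free are hypothetical names, the restore_reserve field stands in for the PagePrivate flag, and subpool handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pool {
	long free_huge_pages;
	long resv_huge_pages;
};

struct toy_page {
	bool restore_reserve;	/* plays the role of the PagePrivate flag */
};

/* Take a page off the free list; has_resv says a reservation backs it. */
static bool toy_alloc(struct toy_pool *p, struct toy_page *page, bool has_resv)
{
	if (p->free_huge_pages == 0)
		return false;
	/* Without a reservation, only unreserved pages may be taken. */
	if (!has_resv && p->free_huge_pages - p->resv_huge_pages <= 0)
		return false;
	p->free_huge_pages--;
	page->restore_reserve = has_resv;
	if (has_resv) {
		assert(p->resv_huge_pages > 0);
		p->resv_huge_pages--;	/* the reservation is consumed */
	}
	return true;
}

/* Free a page; a still-set flag means the reservation must come back. */
static void toy_free(struct toy_pool *p, struct toy_page *page)
{
	p->free_huge_pages++;
	if (page->restore_reserve) {
		p->resv_huge_pages++;
		page->restore_reserve = false;
	}
}
```

Allocating against a reservation and then freeing the page with the flag still set returns both counters to their starting values, which is the behaviour the error path relies on.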
.. _sub_pool_resv:
Subpool Reservations Subpool Reservations
-------------------- ====================
There is a struct hstate associated with each huge page size. The hstate There is a struct hstate associated with each huge page size. The hstate
tracks all huge pages of the specified size. A subpool represents a subset tracks all huge pages of the specified size. A subpool represents a subset
of pages within a hstate that is associated with a mounted hugetlbfs of pages within a hstate that is associated with a mounted hugetlbfs
...@@ -295,7 +331,8 @@ the global pools. ...@@ -295,7 +331,8 @@ the global pools.
COW and Reservations COW and Reservations
-------------------- ====================
Since shared mappings all point to and use the same underlying pages, the Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings. In this case, biggest reservation concern for COW is private mappings. In this case,
two tasks can be pointing at the same previously allocated page. One task two tasks can be pointing at the same previously allocated page. One task
...@@ -326,30 +363,36 @@ faults on a non-present page. But, the original owner of the ...@@ -326,30 +363,36 @@ faults on a non-present page. But, the original owner of the
mapping/reservation will behave as expected. mapping/reservation will behave as expected.
.. _resv_map_modifications:
Reservation Map Modifications Reservation Map Modifications
----------------------------- =============================
The following low level routines are used to make modifications to a The following low level routines are used to make modifications to a
reservation map. Typically, these routines are not called directly. Rather, reservation map. Typically, these routines are not called directly. Rather,
a reservation map helper routine is called which calls one of these low level a reservation map helper routine is called which calls one of these low level
routines. These low level routines are fairly well documented in the source routines. These low level routines are fairly well documented in the source
code (mm/hugetlb.c). These routines are: code (mm/hugetlb.c). These routines are::
long region_chg(struct resv_map *resv, long f, long t);
long region_add(struct resv_map *resv, long f, long t); long region_chg(struct resv_map *resv, long f, long t);
void region_abort(struct resv_map *resv, long f, long t); long region_add(struct resv_map *resv, long f, long t);
long region_count(struct resv_map *resv, long f, long t); void region_abort(struct resv_map *resv, long f, long t);
long region_count(struct resv_map *resv, long f, long t);
Operations on the reservation map typically involve two operations: Operations on the reservation map typically involve two operations:
1) region_chg() is called to examine the reserve map and determine how 1) region_chg() is called to examine the reserve map and determine how
many pages in the specified range [f, t) are NOT currently represented. many pages in the specified range [f, t) are NOT currently represented.
The calling code performs global checks and allocations to determine if The calling code performs global checks and allocations to determine if
there are enough huge pages for the operation to succeed. there are enough huge pages for the operation to succeed.
2a) If the operation can succeed, region_add() is called to actually modify 2)
the reservation map for the same range [f, t) previously passed to a) If the operation can succeed, region_add() is called to actually modify
region_chg(). the reservation map for the same range [f, t) previously passed to
2b) If the operation can not succeed, region_abort is called for the same range region_chg().
[f, t) to abort the operation. b) If the operation can not succeed, region_abort is called for the same
range [f, t) to abort the operation.
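The query-then-commit-or-abort sequence can be illustrated with a toy reserve map. This sketch is not how mm/hugetlb.c is implemented: it replaces the file_region list with one flag per huge page index, and it assumes adds_in_progress simply counts operations started by the "chg" step and not yet finished. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_PAGES 64

/* Toy reserve map: one flag per huge page index instead of a region list. */
struct toy_resv_map {
	bool present[TOY_MAP_PAGES];
	long adds_in_progress;
};

/* Step 1: count pages in [f, t) that are NOT yet represented. */
static long toy_region_chg(struct toy_resv_map *m, long f, long t)
{
	long missing = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (long i = f; i < t; i++)
		if (!m->present[i])
			missing++;
	m->adds_in_progress++;
	return missing;
}

/* Step 2a: commit; returns the number of pages actually added. */
static long toy_region_add(struct toy_resv_map *m, long f, long t)
{
	long added = 0;

	assert(m->adds_in_progress > 0);
	for (long i = f; i < t; i++) {
		if (!m->present[i]) {
			m->present[i] = true;
			added++;
		}
	}
	m->adds_in_progress--;
	return added;
}

/* Step 2b: abort; the map contents are left untouched. */
static void toy_region_abort(struct toy_resv_map *m, long f, long t)
{
	(void)f;
	(void)t;
	assert(m->adds_in_progress > 0);
	m->adds_in_progress--;
}
```

A caller would invoke toy_region_chg() first, perform its global checks, and then finish with exactly one of toy_region_add() or toy_region_abort() for the same range.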
Note that this is a two step process where region_add() and region_abort() Note that this is a two step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same are guaranteed to succeed after a prior call to region_chg() for the same
...@@ -371,6 +414,7 @@ and make the appropriate adjustments. ...@@ -371,6 +414,7 @@ and make the appropriate adjustments.
The routine region_del() is called to remove regions from a reservation map. The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations: It is typically called in the following situations:
- When a file in the hugetlbfs filesystem is being removed, the inode will - When a file in the hugetlbfs filesystem is being removed, the inode will
be released and the reservation map freed. Before freeing the reservation be released and the reservation map freed. Before freeing the reservation
map, all the individual file_region structures must be freed. In this case map, all the individual file_region structures must be freed. In this case
...@@ -384,6 +428,7 @@ It is typically called in the following situations: ...@@ -384,6 +428,7 @@ It is typically called in the following situations:
removed, region_del() is called to remove the corresponding entry from the removed, region_del() is called to remove the corresponding entry from the
reservation map. In this case, region_del is passed the range reservation map. In this case, region_del is passed the range
[page_idx, page_idx + 1). [page_idx, page_idx + 1).
In every case, region_del() will return the number of pages removed from the In every case, region_del() will return the number of pages removed from the
reservation map. In VERY rare cases, region_del() can fail. This can only reservation map. In VERY rare cases, region_del() can fail. This can only
happen in the hole punch case where it has to split an existing file_region happen in the hole punch case where it has to split an existing file_region
...@@ -403,9 +448,11 @@ outstanding (outstanding = (end - start) - region_count(resv, start, end)). ...@@ -403,9 +448,11 @@ outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations. are decremented by the number of outstanding reservations.
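The outstanding computation quoted above is plain arithmetic. The helper below is a hypothetical illustration only; its third argument stands for the value that region_count(resv, start, end) would return for the range.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the formula in the text:
 *   outstanding = (end - start) - region_count(resv, start, end)
 * 'counted' is the number of pages the reserve map reports for
 * the range [start, end).
 */
static long toy_outstanding(long start, long end, long counted)
{
	assert(start <= end);
	assert(counted >= 0 && counted <= end - start);
	return (end - start) - counted;
}
```

For a mapping covering huge page indices [0, 8) where the map reports 3 pages, 5 reservations are still outstanding and would be subtracted from the subpool and global counts.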
.. _resv_map_helpers:
Reservation Map Helper Routines Reservation Map Helper Routines
------------------------------- ===============================
Several helper routines exist to query and modify the reservation maps. Several helper routines exist to query and modify the reservation maps.
These routines are only concerned with reservations for a specific huge These routines are only concerned with reservations for a specific huge
page, so they just pass in an address instead of a range. In addition, page, so they just pass in an address instead of a range. In addition,
...@@ -414,32 +461,40 @@ or shared) and the location of the reservation map (inode or VMA) can be ...@@ -414,32 +461,40 @@ or shared) and the location of the reservation map (inode or VMA) can be
determined. These routines simply call the underlying routines described determined. These routines simply call the underlying routines described
in the section "Reservation Map Modifications". However, they do take into in the section "Reservation Map Modifications". However, they do take into
account the 'opposite' meaning of reservation map entries for private and account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller. shared mappings and hide this detail from the caller::
long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
This routine calls region_chg() for the specified page. If no reservation This routine calls region_chg() for the specified page. If no reservation
exists, 1 is returned. If a reservation exists, 0 is returned. exists, 1 is returned. If a reservation exists, 0 is returned::
long vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
long vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
This calls region_add() for the specified page. As in the case of region_chg This calls region_add() for the specified page. As in the case of region_chg
and region_add, this routine is to be called after a previous call to and region_add, this routine is to be called after a previous call to
vma_needs_reservation. It will add a reservation entry for the page. It vma_needs_reservation. It will add a reservation entry for the page. It
returns 1 if the reservation was added and 0 if not. The return value should returns 1 if the reservation was added and 0 if not. The return value should
be compared with the return value of the previous call to be compared with the return value of the previous call to
vma_needs_reservation. An unexpected difference indicates the reservation vma_needs_reservation. An unexpected difference indicates the reservation
map was modified between calls. map was modified between calls::
void vma_end_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
void vma_end_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
This calls region_abort() for the specified page. As in the case of region_chg This calls region_abort() for the specified page. As in the case of region_chg
and region_abort, this routine is to be called after a previous call to and region_abort, this routine is to be called after a previous call to
vma_needs_reservation. It will abort/end the in progress reservation add vma_needs_reservation. It will abort/end the in progress reservation add
operation. operation::
long vma_add_reservation(struct hstate *h,
struct vm_area_struct *vma,
unsigned long addr)
long vma_add_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
This is a special wrapper routine to help facilitate reservation cleanup This is a special wrapper routine to help facilitate reservation cleanup
on error paths. It is only called from the routine restore_reserve_on_error(). on error paths. It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation in an attempt This routine is used in conjunction with vma_needs_reservation in an attempt
...@@ -453,8 +508,10 @@ be done on error paths. ...@@ -453,8 +508,10 @@ be done on error paths.
Reservation Cleanup in Error Paths Reservation Cleanup in Error Paths
---------------------------------- ==================================
As mentioned in the section "Reservation Map Helper Routines", reservation
As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps. First vma_needs_reservation map modifications are performed in two steps. First vma_needs_reservation
is called before a page is allocated. If the allocation is successful, is called before a page is allocated. If the allocation is successful,
then vma_commit_reservation is called. If not, vma_end_reservation is called. then vma_commit_reservation is called. If not, vma_end_reservation is called.
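
The two-step protocol described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the ``toy_*`` names and ``struct toy_resv`` are hypothetical stand-ins invented for illustration, not the kernel's data structures or the real vma_* routines, and the model tracks a single page only.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one page's entry in a reservation map. */
struct toy_resv {
	bool reserved;    /* a reservation entry exists for the page */
	bool in_progress; /* an add operation has been started but not finished */
};

/* Step 1: start the operation. Returns 1 if no reservation exists, 0 if one does. */
static long toy_needs_reservation(struct toy_resv *r)
{
	r->in_progress = true;
	return r->reserved ? 0 : 1;
}

/* Step 2a, allocation succeeded: add the entry. Returns 1 if it was added, 0 if not. */
static long toy_commit_reservation(struct toy_resv *r)
{
	long added = r->reserved ? 0 : 1;

	r->reserved = true;
	r->in_progress = false;
	return added;
}

/* Step 2b, allocation failed: abort the in-progress add operation. */
static void toy_end_reservation(struct toy_resv *r)
{
	r->in_progress = false;
}

/*
 * Run the whole sequence once. Returns true only if the page was allocated
 * and the two return values agree; a mismatch would indicate that the map
 * was modified between the two calls.
 */
static bool toy_fault(struct toy_resv *r, bool alloc_ok)
{
	long needed = toy_needs_reservation(r);

	if (!alloc_ok) {
		toy_end_reservation(r);
		return false;
	}
	return toy_commit_reservation(r) == needed;
}
```

In this model, every call to ``toy_needs_reservation`` is paired with exactly one of the two finishing calls, which is the invariant the error paths discussed in this section have to preserve.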
...@@ -494,13 +551,14 @@ so that a reservation will not be leaked when the huge page is freed. ...@@ -494,13 +551,14 @@ so that a reservation will not be leaked when the huge page is freed.
Reservations and Memory Policy Reservations and Memory Policy
------------------------------ ==============================
Per-node huge page lists existed in struct hstate when git was first used Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code. The concept of reservations was added some time later. to manage Linux code. The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy When reservations were added, no attempt was made to take memory policy
into account. While cpusets are not exactly the same as memory policy, this into account. While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory sums up the interaction between reservations comment in hugetlb_acct_memory sums up the interaction between reservations
and cpusets/memory policy. and cpusets/memory policy::
/* /*
* When cpuset is configured, it breaks the strict hugetlb page * When cpuset is configured, it breaks the strict hugetlb page
* reservation as the accounting is done on a global variable. Such * reservation as the accounting is done on a global variable. Such
......
.. _hwpoison:
========
hwpoison
========
What is hwpoison? What is hwpoison?
=================
Upcoming Intel CPUs have support for recovering from some memory errors Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned", (``MCA recovery``). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future. kill the processes associated with it and avoid using it in the future.
This patchkit implements the necessary infrastructure in the VM. This patchkit implements the necessary infrastructure in the VM.
...@@ -46,9 +53,10 @@ address. This in theory allows other applications to handle ...@@ -46,9 +53,10 @@ address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might. won't do that, but some very specialized ones might.
--- Failure recovery modes
======================
There are two (actually three) modi memory failure recovery can be in: There are two (actually three) modes memory failure recovery can be in:
vm.memory_failure_recovery sysctl set to zero: vm.memory_failure_recovery sysctl set to zero:
All memory failures cause a panic. Do not attempt recovery. All memory failures cause a panic. Do not attempt recovery.
...@@ -67,9 +75,8 @@ late kill ...@@ -67,9 +75,8 @@ late kill
This is best for memory error unaware applications and default This is best for memory error unaware applications and default
Note some pages are always handled as late kill. Note some pages are always handled as late kill.
--- User control
============
User control:
vm.memory_failure_recovery vm.memory_failure_recovery
See sysctl.txt See sysctl.txt
...@@ -79,11 +86,19 @@ vm.memory_failure_early_kill ...@@ -79,11 +86,19 @@ vm.memory_failure_early_kill
PR_MCE_KILL PR_MCE_KILL
Set early/late kill mode/revert to system default Set early/late kill mode/revert to system default
arg1: PR_MCE_KILL_CLEAR: Revert to system default
arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode arg1: PR_MCE_KILL_CLEAR:
PR_MCE_KILL_EARLY: Early kill Revert to system default
PR_MCE_KILL_LATE: Late kill arg1: PR_MCE_KILL_SET:
PR_MCE_KILL_DEFAULT: Use system global default arg2 defines thread specific mode
PR_MCE_KILL_EARLY:
Early kill
PR_MCE_KILL_LATE:
Late kill
PR_MCE_KILL_DEFAULT
Use system global default
Note that if you want to have a dedicated thread which handles Note that if you want to have a dedicated thread which handles
the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should
call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise,
...@@ -92,77 +107,64 @@ PR_MCE_KILL ...@@ -92,77 +107,64 @@ PR_MCE_KILL
PR_MCE_KILL_GET PR_MCE_KILL_GET
return current mode return current mode
Testing
=======
--- * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the
process for testing
Testing:
madvise(MADV_HWPOISON, ....)
(as root)
Poison a page in the process for testing
hwpoison-inject module through debugfs * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/``
/sys/kernel/debug/hwpoison/ corrupt-pfn
Inject hwpoison fault at PFN echoed into this file. This does
some early filtering to avoid corrupting unintended pages in test suites.
corrupt-pfn unpoison-pfn
Software-unpoison page at PFN echoed into this file. This way
a page can be reused again. This only works for Linux
injected failures, not for real memory failures.
Inject hwpoison fault at PFN echoed into this file. This does Note these injection interfaces are not stable and might change between
some early filtering to avoid corrupting unintended pages in test suites. kernel versions
unpoison-pfn corrupt-filter-dev-major, corrupt-filter-dev-minor
Only handle memory failures to pages associated with the file
system defined by block device major/minor. -1U is the
wildcard value. This should be only used for testing with
artificial injection.
Software-unpoison page at PFN echoed into this file. This corrupt-filter-memcg
way a page can be reused again. Limit injection to pages owned by memgroup. Specified by inode
This only works for Linux injected failures, not for real number of the memcg.
memory failures.
Note these injection interfaces are not stable and might change between Example::
kernel versions
corrupt-filter-dev-major mkdir /sys/fs/cgroup/mem/hwpoison
corrupt-filter-dev-minor
Only handle memory failures to pages associated with the file system defined usemem -m 100 -s 1000 &
by block device major/minor. -1U is the wildcard value. echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
This should be only used for testing with artificial injection.
corrupt-filter-memcg memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
Limit injection to pages owned by memgroup. Specified by inode number page-types -p `pidof init` --hwpoison # shall do nothing
of the memcg. page-types -p `pidof usemem` --hwpoison # poison its pages
Example: corrupt-filter-flags-mask, corrupt-filter-flags-value
mkdir /sys/fs/cgroup/mem/hwpoison When specified, only poison pages if ((page_flags & mask) ==
value). This allows stress testing of many kinds of
pages. The page_flags are the same as in /proc/kpageflags. The
flag bits are defined in include/linux/kernel-page-flags.h and
documented in Documentation/vm/pagemap.rst
usemem -m 100 -s 1000 & * Architecture specific MCE injector
echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') x86 has mce-inject, mce-test
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
page-types -p `pidof init` --hwpoison # shall do nothing Some portable hwpoison test programs in mce-test, see below.
page-types -p `pidof usemem` --hwpoison # poison its pages
corrupt-filter-flags-mask References
corrupt-filter-flags-value ==========
When specified, only poison pages if ((page_flags & mask) == value).
This allows stress testing of many kinds of pages. The page_flags
are the same as in /proc/kpageflags. The flag bits are defined in
include/linux/kernel-page-flags.h and documented in
Documentation/vm/pagemap.txt
Architecture specific MCE injector
x86 has mce-inject, mce-test
Some portable hwpoison test programs in mce-test, see blow.
---
References:
http://halobates.de/mce-lc09-2.pdf http://halobates.de/mce-lc09-2.pdf
Overview presentation from LinuxCon 09 Overview presentation from LinuxCon 09
...@@ -174,14 +176,11 @@ git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git ...@@ -174,14 +176,11 @@ git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
x86 specific injector x86 specific injector
--- Limitations
===========
Limitations:
- Not all page types are supported and never will be. Most kernel internal - Not all page types are supported and never will be. Most kernel internal
objects cannot be recovered, only LRU pages for now. objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing. - Right now hugepage support is missing.
--- ---
Andi Kleen, Oct 2009 Andi Kleen, Oct 2009
MOTIVATION .. _idle_page_tracking:
==================
Idle Page Tracking
==================
Motivation
==========
The idle page tracking feature makes it possible to track which memory pages are being The idle page tracking feature makes it possible to track which memory pages are being
accessed by a workload and which are idle. This information can be useful for accessed by a workload and which are idle. This information can be useful for
...@@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster. ...@@ -8,10 +15,14 @@ or deciding where to place the workload within a compute cluster.
It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
USER API .. _user_api:
The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, User API
it consists of the only read-write file, /sys/kernel/mm/page_idle/bitmap. ========
The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.
The file implements a bitmap where each bit corresponds to a memory page. The The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
...@@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is ...@@ -19,8 +30,9 @@ mapped to bit #i%64 of array element #i/64, byte order is native. When a bit is
set, the corresponding page is idle. set, the corresponding page is idle.
A page is considered idle if it has not been accessed since it was marked idle A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the IMPLEMENTATION (for more details on what "accessed" actually means see the :ref:`Implementation
DETAILS section). To mark a page idle one has to set the bit corresponding to Details <impl_details>` section).
To mark a page idle one has to set the bit corresponding to
the page by writing to the file. A value written to the file is OR-ed with the the page by writing to the file. A value written to the file is OR-ed with the
current bitmap value. current bitmap value.
...@@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, ...@@ -30,9 +42,9 @@ page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored,
and hence such pages are never reported idle. and hence such pages are never reported idle.
For huge pages the idle flag is set only on the head page, so one has to read For huge pages the idle flag is set only on the head page, so one has to read
/proc/kpageflags in order to correctly count idle huge pages. ``/proc/kpageflags`` in order to correctly count idle huge pages.
Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if you are not starting the read/write on an 8-byte boundary, or -EINVAL if you are not starting the read/write on an 8-byte boundary, or
if the size of the read/write is not a multiple of 8 bytes. Writing to if the size of the read/write is not a multiple of 8 bytes. Writing to
this file beyond max PFN will return -ENXIO. this file beyond max PFN will return -ENXIO.
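
The PFN-to-bit mapping and the 8-byte alignment rule described above can be sketched as a few C helpers. The function names are hypothetical and chosen for illustration only; the arithmetic follows the text (bit #i%64 of array element #i/64, 8-byte elements, native byte order).

```c
#include <stdbool.h>
#include <stdint.h>

/* Index of the 8-byte array element that holds the bit for this PFN. */
static inline uint64_t idle_word_index(uint64_t pfn)
{
	return pfn / 64;
}

/* Position of the PFN's bit inside that element. */
static inline unsigned int idle_bit_index(uint64_t pfn)
{
	return (unsigned int)(pfn % 64);
}

/*
 * Byte offset of that element in the bitmap file. It is always a multiple
 * of 8, so a read or write of 8 bytes at this offset satisfies the
 * alignment and size requirements.
 */
static inline uint64_t idle_file_offset(uint64_t pfn)
{
	return idle_word_index(pfn) * 8;
}

/* Value to write at idle_file_offset(pfn) to mark only this page idle. */
static inline uint64_t idle_mark_mask(uint64_t pfn)
{
	return UINT64_C(1) << idle_bit_index(pfn);
}

/* Check the PFN's bit in an element read from idle_file_offset(pfn). */
static inline bool idle_bit_is_set(uint64_t word, uint64_t pfn)
{
	return (word >> idle_bit_index(pfn)) & 1;
}
```

Because a written value is OR-ed with the current bitmap value, writing ``idle_mark_mask(pfn)`` leaves the bits of the other 63 pages in the same element unchanged.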
...@@ -41,21 +53,25 @@ That said, in order to estimate the amount of pages that are not used by a ...@@ -41,21 +53,25 @@ That said, in order to estimate the amount of pages that are not used by a
workload one should: workload one should:
1. Mark all the workload's pages as idle by setting corresponding bits in 1. Mark all the workload's pages as idle by setting corresponding bits in
/sys/kernel/mm/page_idle/bitmap. The pages can be found by reading ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
/proc/pid/pagemap if the workload is represented by a process, or by ``/proc/pid/pagemap`` if the workload is represented by a process, or by
filtering out alien pages using /proc/kpagecgroup in case the workload is filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
placed in a memory cgroup. is placed in a memory cgroup.
2. Wait until the workload accesses its working set. 2. Wait until the workload accesses its working set.
3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. If 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
one wants to ignore certain types of pages, e.g. mlocked pages since they If one wants to ignore certain types of pages, e.g. mlocked pages since they
are not reclaimable, he or she can filter them out using /proc/kpageflags. are not reclaimable, he or she can filter them out using
``/proc/kpageflags``.
See Documentation/vm/pagemap.rst for more information about
``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, .. _impl_details:
/proc/kpageflags, and /proc/kpagecgroup.
IMPLEMENTATION DETAILS Implementation Details
======================
The kernel internally keeps track of accesses to user memory pages in order to The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is reclaim unreferenced pages first on memory shortage conditions. A page is
...@@ -77,7 +93,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or ...@@ -77,7 +93,8 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced. exceeding the dirty memory limit, it is not marked referenced.
The idle memory tracking feature adds a new page flag, the Idle flag. This flag The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined section), and cleared automatically whenever a page is referenced as defined
above. above.
......
=====================================
Linux Memory Management Documentation
=====================================
This is a collection of documents about Linux memory management (mm) subsystem.
User guides for MM features
===========================
The following documents provide guides for controlling and tuning
various features of the Linux memory management
.. toctree::
:maxdepth: 1
hugetlbpage
idle_page_tracking
ksm
numa_memory_policy
pagemap
transhuge
soft-dirty
swap_numa
userfaultfd
zswap
Kernel developers MM documentation
==================================
The below documents describe MM internals with different level of
details ranging from notes and mailing list responses to elaborate
descriptions of data structures and algorithms.
.. toctree::
:maxdepth: 1
active_mm
balance
cleancache
frontswap
highmem
hmm
hwpoison
hugetlbfs_reserv
mmu_notifier
numa
overcommit-accounting
page_migration
page_frags
page_owner
remap_file_pages
slub
split_page_table_lock
unevictable-lru
z3fold
zsmalloc
How to use the Kernel Samepage Merging feature .. _ksm:
----------------------------------------------
=======================
Kernel Samepage Merging
=======================
KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, added to the Linux kernel in 2.6.32. See ``mm/ksm.c`` for its implementation,
and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
The KSM daemon ksmd periodically scans those areas of user memory which The KSM daemon ksmd periodically scans those areas of user memory which
...@@ -51,110 +54,112 @@ Applications should be considerate in their use of MADV_MERGEABLE, ...@@ -51,110 +54,112 @@ Applications should be considerate in their use of MADV_MERGEABLE,
restricting its use to areas likely to benefit. KSM's scans may use a lot restricting its use to areas likely to benefit. KSM's scans may use a lot
of processing power: some installations will disable KSM for that reason. of processing power: some installations will disable KSM for that reason.
The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, The KSM daemon is controlled by sysfs files in ``/sys/kernel/mm/ksm/``,
readable by all but writable only by root: readable by all but writable only by root:
pages_to_scan - how many present pages to scan before ksmd goes to sleep pages_to_scan
e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" how many present pages to scan before ksmd goes to sleep
Default: 100 (chosen for demonstration purposes) e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan`` Default: 100
(chosen for demonstration purposes)
sleep_millisecs - how many milliseconds ksmd should sleep before next scan
e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" sleep_millisecs
Default: 20 (chosen for demonstration purposes) how many milliseconds ksmd should sleep before next scan
e.g. ``echo 20 > /sys/kernel/mm/ksm/sleep_millisecs`` Default: 20
merge_across_nodes - specifies if pages from different numa nodes can be merged. (chosen for demonstration purposes)
When set to 0, ksm merges only pages which physically
reside in the memory area of same NUMA node. That brings merge_across_nodes
lower latency to access of shared pages. Systems with more specifies if pages from different numa nodes can be merged.
nodes, at significant NUMA distances, are likely to benefit When set to 0, ksm merges only pages which physically reside
from the lower latency of setting 0. Smaller systems, which in the memory area of same NUMA node. That brings lower
need to minimize memory usage, are likely to benefit from latency to access of shared pages. Systems with more nodes, at
the greater sharing of setting 1 (default). You may wish to significant NUMA distances, are likely to benefit from the
compare how your system performs under each setting, before lower latency of setting 0. Smaller systems, which need to
deciding on which to use. merge_across_nodes setting can be minimize memory usage, are likely to benefit from the greater
changed only when there are no ksm shared pages in system: sharing of setting 1 (default). You may wish to compare how
set run 2 to unmerge pages first, then to 1 after changing your system performs under each setting, before deciding on
merge_across_nodes, to remerge according to the new setting. which to use. merge_across_nodes setting can be changed only
Default: 1 (merging across nodes as in earlier releases) when there are no ksm shared pages in system: set run 2 to
unmerge pages first, then to 1 after changing
run - set 0 to stop ksmd from running but keep merged pages, merge_across_nodes, to remerge according to the new setting.
set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", Default: 1 (merging across nodes as in earlier releases)
set 2 to stop ksmd and unmerge all pages currently merged,
but leave mergeable areas registered for next run run
Default: 0 (must be changed to 1 to activate KSM, set 0 to stop ksmd from running but keep merged pages,
except if CONFIG_SYSFS is disabled) set 1 to run ksmd e.g. ``echo 1 > /sys/kernel/mm/ksm/run``,
set 2 to stop ksmd and unmerge all pages currently merged, but
use_zero_pages - specifies whether empty pages (i.e. allocated pages leave mergeable areas registered for next run Default: 0 (must
that only contain zeroes) should be treated specially. be changed to 1 to activate KSM, except if CONFIG_SYSFS is
When set to 1, empty pages are merged with the kernel disabled)
zero page(s) instead of with each other as it would
happen normally. This can improve the performance on use_zero_pages
architectures with coloured zero pages, depending on specifies whether empty pages (i.e. allocated pages that only
the workload. Care should be taken when enabling this contain zeroes) should be treated specially. When set to 1,
setting, as it can potentially degrade the performance empty pages are merged with the kernel zero page(s) instead of
of KSM for some workloads, for example if the checksums with each other as it would happen normally. This can improve
of pages candidate for merging match the checksum of the performance on architectures with coloured zero pages,
an empty page. This setting can be changed at any time, depending on the workload. Care should be taken when enabling
it is only effective for pages merged after the change. this setting, as it can potentially degrade the performance of
Default: 0 (normal KSM behaviour as in earlier releases) KSM for some workloads, for example if the checksums of pages
candidate for merging match the checksum of an empty
max_page_sharing - Maximum sharing allowed for each KSM page. This page. This setting can be changed at any time, it is only
enforces a deduplication limit to avoid the virtual effective for pages merged after the change. Default: 0
memory rmap lists to grow too large. The minimum (normal KSM behaviour as in earlier releases)
value is 2 as a newly created KSM page will have at
least two sharers. The rmap walk has O(N) max_page_sharing
complexity where N is the number of rmap_items Maximum sharing allowed for each KSM page. This enforces a
(i.e. virtual mappings) that are sharing the page, deduplication limit to avoid the virtual memory rmap lists to
which is in turn capped by max_page_sharing. So grow too large. The minimum value is 2 as a newly created KSM
this effectively spreads the linear page will have at least two sharers. The rmap walk has O(N)
computational complexity from rmap walk context complexity where N is the number of rmap_items (i.e. virtual
over different KSM pages. The ksmd walk over the mappings) that are sharing the page, which is in turn capped
stable_node "chains" is also O(N), but N is the by max_page_sharing. So this effectively spreads the linear
number of stable_node "dups", not the number of O(N) computational complexity from rmap walk context over
rmap_items, so it has not a significant impact on different KSM pages. The ksmd walk over the stable_node
ksmd performance. In practice the best stable_node "chains" is also O(N), but N is the number of stable_node
"dup" candidate will be kept and found at the head "dups", not the number of rmap_items, so it has not a
of the "dups" list. The higher this value the significant impact on ksmd performance. In practice the best
faster KSM will merge the memory (because there stable_node "dup" candidate will be kept and found at the head
will be fewer stable_node dups queued into the of the "dups" list. The higher this value the faster KSM will
stable_node chain->hlist to check for pruning) and merge the memory (because there will be fewer stable_node dups
the higher the deduplication factor will be, but queued into the stable_node chain->hlist to check for pruning)
the slowest the worst case rmap walk could be for and the higher the deduplication factor will be, but the
any given KSM page. Slowing down the rmap_walk slowest the worst case rmap walk could be for any given KSM
means there will be higher latency for certain page. Slowing down the rmap_walk means there will be higher
virtual memory operations happening during latency for certain virtual memory operations happening during
swapping, compaction, NUMA balancing and page swapping, compaction, NUMA balancing and page migration, in
migration, in turn decreasing responsiveness for turn decreasing responsiveness for the caller of those virtual
the caller of those virtual memory operations. The memory operations. The scheduler latency of other tasks not
scheduler latency of other tasks not involved with involved with the VM operations doing the rmap walk is not
the VM operations doing the rmap walk is not affected by this parameter as the rmap walks are always
affected by this parameter as the rmap walks are schedule friendly themselves.
always schedule friendly themselves.
stable_node_chains_prune_millisecs
stable_node_chains_prune_millisecs - How frequently to walk the whole How frequently to walk the whole list of stable_node "dups"
list of stable_node "dups" linked in the linked in the stable_node "chains" in order to prune stale
stable_node "chains" in order to prune stale stable_nodes. Smaller millisecs values will free up the KSM
stable_nodes. Smaller millisecs values will free metadata with lower latency, but they will make ksmd use more
up the KSM metadata with lower latency, but they CPU during the scan. This only applies to the stable_node
will make ksmd use more CPU during the scan. This chains so it's a noop if not a single KSM page hit the
only applies to the stable_node chains so it's a max_page_sharing yet (there would be no stable_node chains in
noop if not a single KSM page hit the such case).
max_page_sharing yet (there would be no stable_node
chains in such case). The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: pages_shared
how many shared pages are being used
pages_shared - how many shared pages are being used pages_sharing
pages_sharing - how many more sites are sharing them i.e. how much saved how many more sites are sharing them i.e. how much saved
pages_unshared - how many pages unique but repeatedly checked for merging pages_unshared
pages_volatile - how many pages changing too fast to be placed in a tree how many pages unique but repeatedly checked for merging
full_scans - how many times all mergeable areas have been scanned pages_volatile
how many pages changing too fast to be placed in a tree
stable_node_chains - number of stable node chains allocated, this is full_scans
effectively the number of KSM pages that hit the how many times all mergeable areas have been scanned
max_page_sharing limit stable_node_chains
stable_node_dups - number of stable node dups queued into the number of stable node chains allocated, this is effectively
stable_node chains the number of KSM pages that hit the max_page_sharing limit
stable_node_dups
number of stable node dups queued into the stable_node chains
A high ratio of pages_sharing to pages_shared indicates good sharing, but A high ratio of pages_sharing to pages_shared indicates good sharing, but
a high ratio of pages_unshared to pages_sharing indicates wasted effort. a high ratio of pages_unshared to pages_sharing indicates wasted effort.
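
The two ratios mentioned above can be computed from the counters as in the following sketch. The struct and function names are hypothetical; the values are assumed to have been read beforehand from the files of the same name under ``/sys/kernel/mm/ksm/``.

```c
/* Hypothetical container for three counters read from /sys/kernel/mm/ksm/. */
struct ksm_counters {
	unsigned long pages_shared;
	unsigned long pages_sharing;
	unsigned long pages_unshared;
};

/* pages_sharing / pages_shared: higher means better sharing. 0.0 if nothing is shared. */
static double ksm_sharing_ratio(const struct ksm_counters *c)
{
	if (c->pages_shared == 0)
		return 0.0;
	return (double)c->pages_sharing / (double)c->pages_shared;
}

/* pages_unshared / pages_sharing: higher means more wasted effort. 0.0 if nothing is sharing. */
static double ksm_wasted_ratio(const struct ksm_counters *c)
{
	if (c->pages_sharing == 0)
		return 0.0;
	return (double)c->pages_unshared / (double)c->pages_sharing;
}
```

Returning 0.0 when the denominator is zero is a choice made for this sketch to avoid a division by zero; a real tool might prefer to report that case separately.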
......
.. _mmu_notifier:
When do you need to notify inside page table lock ? When do you need to notify inside page table lock ?
===================================================
When clearing a pte/pmd we are given a choice to notify the event through When clearing a pte/pmd we are given a choice to notify the event through
(notify version of *_clear_flush call mmu_notifier_invalidate_range) under (notify version of \*_clear_flush call mmu_notifier_invalidate_range) under
the page table lock. But that notification is not necessary in all cases. the page table lock. But that notification is not necessary in all cases.
For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when device use
...@@ -18,6 +21,7 @@ a page that might now be used by some completely different task. ...@@ -18,6 +21,7 @@ a page that might now be used by some completely different task.
Case B is more subtle. For correctness it requires the following sequence to Case B is more subtle. For correctness it requires the following sequence to
happen: happen:
- take page table lock - take page table lock
- clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify()) - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
- set page table entry to point to new page - set page table entry to point to new page
...@@ -28,58 +32,60 @@ the device. ...@@ -28,58 +32,60 @@ the device.
Consider the following scenario (the device uses a feature similar to ATS/PASID): Consider the following scenario (the device uses a feature similar to ATS/PASID):
Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we assume Two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE we assume
they are write protected for COW (other cases of B apply too). they are write protected for COW (other cases of B apply too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not paired with a
notification to invalidate the secondary TLB, the device sees the new value for
......
.. _numa:
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
=============
What is NUMA? What is NUMA?
=============
This question can be answered from a couple of perspectives: the This question can be answered from a couple of perspectives: the
hardware view and the Linux software view. hardware view and the Linux software view.
...@@ -106,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces, ...@@ -106,7 +110,7 @@ to improve NUMA locality using various CPU affinity command line interfaces,
such as taskset(1) and numactl(1), and program interfaces such as such as taskset(1) and numactl(1), and program interfaces such as
sched_setaffinity(2). Further, one can modify the kernel's default local sched_setaffinity(2). Further, one can modify the kernel's default local
allocation behavior using Linux NUMA memory policy. allocation behavior using Linux NUMA memory policy.
[see Documentation/vm/numa_memory_policy.txt.] [see Documentation/vm/numa_memory_policy.rst.]
System administrators can restrict the CPUs and nodes' memories that a non- System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions privileged user can specify in the scheduling or NUMA commands and functions
......
The Linux kernel supports the following overcommit handling modes
0 - Heuristic overcommit handling. Obvious overcommits of
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slightly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays
and just relying on the virtual memory consisting almost
entirely of zero pages.
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap + a
configurable amount (default is 50%) of physical RAM.
Depending on the amount you use, in most situations
this means a process will not be killed while accessing
pages but will receive errors on memory allocation as
appropriate.
Useful for applications that want to guarantee their
memory allocations will be available in the future
without having to initialize every page.
The overcommit policy is set via the sysctl `vm.overcommit_memory'.
The overcommit amount can be set via `vm.overcommit_ratio' (percentage)
or `vm.overcommit_kbytes' (absolute value).
The current overcommit limit and amount committed are viewable in
/proc/meminfo as CommitLimit and Committed_AS respectively.
Gotchas
-------
The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much, but it's a corner case if you really, really care.
In mode 2 the MAP_NORESERVE flag is ignored.
How It Works
------------
The overcommit is based on the following rules
For a file backed map
SHARED or READ-only - 0 cost (the file is the map not swap)
PRIVATE WRITABLE - size of mapping per instance
For an anonymous or /dev/zero map
SHARED - size of mapping
PRIVATE READ-only - 0 cost (but of little use)
PRIVATE WRITABLE - size of mapping per instance
Additional accounting
Pages made writable copies by mmap
shmfs memory drawn from the same pool
Status
------
o We account mmap memory mappings
o We account mprotect changes in commit
o We account mremap changes in size
o We account brk
o We account munmap
o We report the commit status in /proc
o Account and check on fork
o Review stack handling/building on exec
o SHMfs accounting
o Implement actual limit enforcement
To Do
-----
o Account ptrace pages (this is hard)
.. _overcommit_accounting:
=====================
Overcommit Accounting
=====================
The Linux kernel supports the following overcommit handling modes
0
Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a
seriously wild allocation fails while allowing overcommit to
reduce swap usage. root is allowed to allocate slightly more
memory in this mode. This is the default.
1
Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays and
just relying on the virtual memory consisting almost entirely
of zero pages.
2
Don't overcommit. The total address space commit for the
system is not permitted to exceed swap + a configurable amount
(default is 50%) of physical RAM. Depending on the amount you
use, in most situations this means a process will not be
killed while accessing pages but will receive errors on memory
allocation as appropriate.
Useful for applications that want to guarantee their memory
allocations will be available in the future without having to
initialize every page.
The overcommit policy is set via the sysctl ``vm.overcommit_memory``.
The overcommit amount can be set via ``vm.overcommit_ratio`` (percentage)
or ``vm.overcommit_kbytes`` (absolute value).
The current overcommit limit and amount committed are viewable in
``/proc/meminfo`` as CommitLimit and Committed_AS respectively.
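
As a rough illustration, the two counters can be read from userspace and
compared. This is only a minimal Python sketch: the helper names
``parse_meminfo`` and ``commit_headroom_kb`` are made up for this example, and
it assumes the usual ``Name:   value kB`` layout of ``/proc/meminfo``.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field name: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """CommitLimit minus Committed_AS, in kB (negative when over the limit)."""
    return fields["CommitLimit"] - fields["Committed_AS"]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("CommitLimit: ", info["CommitLimit"], "kB")
    print("Committed_AS:", info["Committed_AS"], "kB")
    print("headroom:    ", commit_headroom_kb(info), "kB")
```

A negative headroom is not necessarily a problem in modes 0 and 1, since the
strict limit described above applies to mode 2.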
Gotchas
=======
The C language stack growth does an implicit mremap. If you want absolute
guarantees and run close to the edge you MUST mmap your stack for the
largest size you think you will need. For typical stack usage this does
not matter much, but it's a corner case if you really, really care.
In mode 2 the MAP_NORESERVE flag is ignored.
How It Works
============
The overcommit is based on the following rules
For a file backed map
| SHARED or READ-only - 0 cost (the file is the map not swap)
| PRIVATE WRITABLE - size of mapping per instance
For an anonymous or ``/dev/zero`` map
| SHARED - size of mapping
| PRIVATE READ-only - 0 cost (but of little use)
| PRIVATE WRITABLE - size of mapping per instance
Additional accounting
| Pages made writable copies by mmap
| shmfs memory drawn from the same pool
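
The per-mapping rules above can be written down as a small lookup. The
following Python sketch is only a model of the table, not kernel code; the
function name ``commit_cost`` and its keyword arguments are hypothetical, and
the "additional accounting" items are deliberately left out.

```python
def commit_cost(size, *, file_backed, shared, writable):
    """Bytes charged against the commit limit for one instance of a mapping."""
    if file_backed:
        # SHARED or READ-only file maps cost nothing: the file is the backing
        # store, not swap.
        if shared or not writable:
            return 0
        return size  # PRIVATE WRITABLE
    # Anonymous or /dev/zero map.
    if shared:
        return size  # SHARED
    if not writable:
        return 0     # PRIVATE READ-only
    return size      # PRIVATE WRITABLE


if __name__ == "__main__":
    one_mib = 1 << 20
    print(commit_cost(one_mib, file_backed=True, shared=False, writable=True))
    print(commit_cost(one_mib, file_backed=False, shared=False, writable=False))
```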
Status
======
* We account mmap memory mappings
* We account mprotect changes in commit
* We account mremap changes in size
* We account brk
* We account munmap
* We report the commit status in /proc
* Account and check on fork
* Review stack handling/building on exec
* SHMfs accounting
* Implement actual limit enforcement
To Do
=====
* Account ptrace pages (this is hard)
.. _page_frags:
==============
Page fragments Page fragments
-------------- ==============
A page fragment is an arbitrary-length arbitrary-offset area of memory A page fragment is an arbitrary-length arbitrary-offset area of memory
which resides within a 0 or higher order compound page. Multiple which resides within a 0 or higher order compound page. Multiple
......
.. _page_migration:
==============
Page migration Page migration
-------------- ==============
Page migration allows the moving of the physical location of pages between Page migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the nodes in a numa system while the process is running. This means that the
...@@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen ...@@ -20,7 +23,7 @@ Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required. Get it from (a version later than 0.9.3 is required. Get it from
ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
which provides an interface similar to other numa functionality for page which provides an interface similar to other numa functionality for page
migration. cat /proc/<pid>/numa_maps allows an easy review of where the migration. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
pages of a process are located. See also the numa_maps documentation in the pages of a process are located. See also the numa_maps documentation in the
proc(5) man page. proc(5) man page.
...@@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel ...@@ -56,8 +59,8 @@ description for those trying to use migrate_pages() from the kernel
(for userspace usage see the Andi Kleen's numactl package mentioned above) (for userspace usage see the Andi Kleen's numactl package mentioned above)
and then a low level description of how the low level details work. and then a low level description of how the low level details work.
A. In kernel use of migrate_pages() In kernel use of migrate_pages()
----------------------------------- ================================
1. Remove pages from the LRU. 1. Remove pages from the LRU.
...@@ -78,8 +81,8 @@ A. In kernel use of migrate_pages() ...@@ -78,8 +81,8 @@ A. In kernel use of migrate_pages()
the new page for each page that is considered for the new page for each page that is considered for
moving. moving.
B. How migrate_pages() works How migrate_pages() works
---------------------------- =========================
migrate_pages() does several passes over its list of pages. A page is moved migrate_pages() does several passes over its list of pages. A page is moved
if all references to a page are removable at the time. The page has if all references to a page are removable at the time. The page has
...@@ -142,8 +145,8 @@ Steps: ...@@ -142,8 +145,8 @@ Steps:
20. The new page is moved to the LRU and can be scanned by the swapper 20. The new page is moved to the LRU and can be scanned by the swapper
etc again. etc again.
C. Non-LRU page migration Non-LRU page migration
------------------------- ======================
Although migration was originally aimed at reducing memory access latency
on NUMA systems, compaction, which wants to create high-order pages, is also a major user.
...@@ -164,89 +167,91 @@ migration path. ...@@ -164,89 +167,91 @@ migration path.
If a driver wants to make its own pages movable, it should define three functions,
which are function pointers in struct address_space_operations.

1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);``

   The VM expects the driver's isolate_page function to return *true*
   if the driver isolates the page successfully. On returning true, the VM marks
   the page as PG_isolated so that concurrent isolation on other CPUs skips the
   page. If a driver cannot isolate the page, it should return *false*.

   Once a page is successfully isolated, the VM uses the page.lru fields, so the
   driver shouldn't expect the values in those fields to be preserved.

2. ``int (*migratepage) (struct address_space *mapping,``
|	``struct page *newpage, struct page *oldpage, enum migrate_mode);``

   After isolation, the VM calls the driver's migratepage with the isolated page.
   The job of migratepage is to move the content of the old page to the new page
   and set up the fields of struct page newpage. Keep in mind that, if you
   migrated the oldpage successfully, you should indicate to the VM that the
   oldpage is no longer movable via __ClearPageMovable() under page_lock and
   return MIGRATEPAGE_SUCCESS. If the driver cannot migrate the page at the
   moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page migration
   after a short time because it interprets -EAGAIN as a "temporary migration
   failure". On any error other than -EAGAIN, the VM gives up on the page
   migration without retrying this time.

   The driver shouldn't touch the page.lru field, which the VM is using, in
   these functions.

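
The return-value contract of migratepage can be modelled in a few lines. The
Python sketch below is a toy model, not the kernel's algorithm: the function
``migrate_with_retry``, the ``migratepage`` callback parameter and the
``max_passes`` default are all assumptions made for illustration. It only shows
that -EAGAIN leads to a later retry while any other error drops the page.

```python
import errno

MIGRATEPAGE_SUCCESS = 0  # stand-in for the kernel constant of the same name


def migrate_with_retry(pages, migratepage, max_passes=10):
    """Retry pages whose callback returns -EAGAIN; give up on other errors.

    Returns a (migrated, failed) pair of lists.
    """
    migrated, failed = [], []
    pending = list(pages)
    for _ in range(max_passes):
        if not pending:
            break
        retry = []
        for page in pending:
            rc = migratepage(page)
            if rc == MIGRATEPAGE_SUCCESS:
                migrated.append(page)
            elif rc == -errno.EAGAIN:
                retry.append(page)   # temporary failure: try again next pass
            else:
                failed.append(page)  # any other error: no retry this time
        pending = retry
    failed.extend(pending)           # still busy after the last pass
    return migrated, failed
```

In this model, the pages that end up in ``failed`` are the ones that would be
handed back to the driver through putback_page.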
3. ``void (*putback_page)(struct page *);``

   If migration fails on an isolated page, the VM should return the isolated
   page to the driver, so the VM calls the driver's putback_page with the page
   that failed to migrate. In this function, the driver should put the isolated
   page back into its own data structure.

4. non-lru movable page flags

   There are two page flags for supporting non-lru movable pages.

   * PG_movable

     The driver should use the function below to make a page movable under
     page_lock::

	void __SetPageMovable(struct page *page, struct address_space *mapping)

     It needs the address_space argument to register the migration
     family of functions which will be called by the VM. Strictly speaking,
     PG_movable is not a real flag of struct page. Rather, the VM
     reuses the lower bits of page->mapping to represent it.

     ::

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

     so the driver shouldn't access page->mapping directly. Instead, the driver
     should use page_mapping(), which masks off the low two bits of
     page->mapping under page lock so it can get the right struct address_space.

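
The bit trick described above works because a suitably aligned pointer has
its low bits free. The Python sketch below models it on plain integers; the
value of ``PAGE_MAPPING_MOVABLE`` is the one quoted above, while the mask name
``PAGE_MAPPING_FLAGS``, its value (chosen to cover "the low two bits") and the
helper names are assumptions made for this example.

```python
PAGE_MAPPING_MOVABLE = 0x2  # value quoted in the text above
PAGE_MAPPING_FLAGS = 0x3    # assumption: mask covering the low two bits


def set_movable(mapping_word):
    """Tag an aligned, pointer-sized mapping word as movable."""
    return mapping_word | PAGE_MAPPING_MOVABLE


def has_movable_tag(mapping_word):
    """True if the low two bits hold exactly the movable pattern."""
    return (mapping_word & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE


def real_mapping(mapping_word):
    """Mask off the low two bits to recover the untagged address."""
    return mapping_word & ~PAGE_MAPPING_FLAGS


if __name__ == "__main__":
    addr = 0xFFFF888012345670  # made-up, 16-byte-aligned address
    tagged = set_movable(addr)
    print(hex(tagged), has_movable_tag(tagged), hex(real_mapping(tagged)))
```

This is why reading the raw field directly gives a value that is not a valid
pointer once the tag is set.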
     To test for a non-lru movable page, the VM provides the __PageMovable
     function. However, it is not guaranteed to identify a non-lru movable page
     because the page->mapping field is unified with other variables in struct
     page. Also, if the driver releases the page after isolation by the VM,
     page->mapping doesn't have a stable value although it has
     PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But __PageMovable is
     a cheap way to tell whether a page is LRU or non-lru movable once the page
     has been isolated, because LRU pages can never have PAGE_MAPPING_MOVABLE
     set in page->mapping. It is also good for just peeking to test for non-lru
     movable pages before the more expensive check with lock_page in pfn
     scanning to select a victim.

     To reliably identify a non-lru movable page, the VM provides the
     PageMovable function. Unlike __PageMovable, PageMovable validates
     page->mapping and mapping->a_ops->isolate_page under lock_page. The
     lock_page prevents page->mapping from being destroyed suddenly.

     A driver using __SetPageMovable should clear the flag via
     __ClearPageMovable under page_lock before releasing the page.

   * PG_isolated

     To prevent concurrent isolation among several CPUs, the VM marks an
     isolated page as PG_isolated under lock_page. So if a CPU encounters a
     PG_isolated non-lru movable page, it can skip it. The driver doesn't need
     to manipulate the flag because the VM will set/clear it automatically.
     Keep in mind that if the driver sees a PG_isolated page, it means the page
     has been isolated by the VM, so it shouldn't touch the page.lru field.
     PG_isolated is an alias of the PG_reclaim flag, so the driver shouldn't use
     that flag for its own purposes.

Christoph Lameter, May 8, 2006. Christoph Lameter, May 8, 2006.
Minchan Kim, Mar 28, 2016. Minchan Kim, Mar 28, 2016.
.. _page_owner:
==================================================
page owner: Tracking about who allocated each page page owner: Tracking about who allocated each page
----------------------------------------------------------- ==================================================
* Introduction Introduction
============
page owner tracks who allocated each page.
It can be used to debug memory leaks or to find a memory hog.
...@@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump ...@@ -34,13 +38,15 @@ not affect to allocation performance, especially if the static keys jump
label patching functionality is available. Following is the kernel's code label patching functionality is available. Following is the kernel's code
size change due to this facility. size change due to this facility.
- Without page owner::

   text    data     bss     dec     hex filename
   40662   1493     644   42799    a72f mm/page_alloc.o

- With page owner::

   text    data     bss     dec     hex filename
   40892   1493     644   43029    a815 mm/page_alloc.o
   1427      24       8    1459     5b3 mm/page_ext.o
   2722      50       0    2772     ad4 mm/page_owner.o
...@@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct ...@@ -62,21 +68,23 @@ are catched and marked, although they are mostly allocated from struct
page extension feature. Anyway, after that, no page is left in page extension feature. Anyway, after that, no page is left in
un-tracking state. un-tracking state.
Usage
=====

1) Build the user-space helper::

	cd tools/vm
	make page_owner_sort

2) Enable page owner: add "page_owner=on" to the boot cmdline.

3) Run the workload you want to debug.

4) Analyze the information from page owner::

	cat /sys/kernel/debug/page_owner > page_owner_full.txt
	grep -v ^PFN page_owner_full.txt > page_owner.txt
	./page_owner_sort page_owner.txt sorted_page_owner.txt

   See who allocated each page in ``sorted_page_owner.txt``.

.. _pagemap:

======================================
pagemap from the Userspace Perspective
======================================

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by userspace programs to examine the page tables and related information by
reading files in /proc. reading files in ``/proc``.
There are four components to pagemap: There are four components to pagemap:
* /proc/pid/pagemap. This file lets a userspace process find out which * ``/proc/pid/pagemap``. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit physical frame each virtual page is mapped to. It contains one 64-bit
value for each virtual page, containing the following data (from value for each virtual page, containing the following data (from
fs/proc/task_mmu.c, above pagemap_read): fs/proc/task_mmu.c, above pagemap_read):
...@@ -15,7 +18,7 @@ There are four components to pagemap: ...@@ -15,7 +18,7 @@ There are four components to pagemap:
* Bits 0-54 page frame number (PFN) if present * Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped * Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped * Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt) * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2) * Bit 56 page exclusively mapped (since 4.2)
* Bits 57-60 zero * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5) * Bit 61 page is file-page or shared-anon (since 3.5)
...@@ -37,24 +40,24 @@ There are four components to pagemap: ...@@ -37,24 +40,24 @@ There are four components to pagemap:
determine which areas of memory are actually mapped and llseek to determine which areas of memory are actually mapped and llseek to
skip over unmapped regions. skip over unmapped regions.
* /proc/kpagecount. This file contains a 64-bit count of the number of * ``/proc/kpagecount``. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN. times each page is mapped, indexed by PFN.
* /proc/kpageflags. This file contains a 64-bit set of flags for each * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
page, indexed by PFN. page, indexed by PFN.
The flags are (from fs/proc/page.c, above kpageflags_read): The flags are (from ``fs/proc/page.c``, above kpageflags_read):
0. LOCKED 0. LOCKED
1. ERROR 1. ERROR
2. REFERENCED 2. REFERENCED
3. UPTODATE 3. UPTODATE
4. DIRTY 4. DIRTY
5. LRU 5. LRU
6. ACTIVE 6. ACTIVE
7. SLAB 7. SLAB
8. WRITEBACK 8. WRITEBACK
9. RECLAIM 9. RECLAIM
10. BUDDY 10. BUDDY
11. MMAP 11. MMAP
12. ANON 12. ANON
...@@ -72,98 +75,108 @@ There are four components to pagemap: ...@@ -72,98 +75,108 @@ There are four components to pagemap:
24. ZERO_PAGE 24. ZERO_PAGE
25. IDLE 25. IDLE
* /proc/kpagecgroup. This file contains a 64-bit inode number of the * ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set. CONFIG_MEMCG is set.
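
To make the entry layout concrete, here is a hedged Python sketch that decodes
one 64-bit pagemap value and computes where to seek for a given virtual
address. The helper names are made up for this example. Bits 62 (swapped) and
63 (present) are not shown in the bit list quoted above; they are taken from
the upstream pagemap documentation and should be treated as an assumption
here, as should the 4096-byte default page size.

```python
import struct

PAGEMAP_ENTRY_SIZE = 8  # one 64-bit value per virtual page


def pagemap_offset(vaddr, page_size=4096):
    """File offset in /proc/pid/pagemap of the entry covering vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE


def decode_pagemap_entry(raw):
    """Decode one 8-byte pagemap entry (native byte order) into a dict."""
    (value,) = struct.unpack("=Q", raw)
    low55 = value & ((1 << 55) - 1)  # bits 0-54
    entry = {
        "soft_dirty": bool((value >> 55) & 1),
        "exclusive": bool((value >> 56) & 1),
        "file_or_shared_anon": bool((value >> 61) & 1),
        "swapped": bool((value >> 62) & 1),  # assumption, see lead-in
        "present": bool((value >> 63) & 1),  # assumption, see lead-in
    }
    if entry["present"]:
        entry["pfn"] = low55
    elif entry["swapped"]:
        entry["swap_type"] = low55 & 0x1F  # bits 0-4
        entry["swap_offset"] = low55 >> 5  # bits 5-54
    return entry
```

A caller would open ``/proc/<pid>/pagemap`` in binary mode, seek to
``pagemap_offset(vaddr, os.sysconf("SC_PAGE_SIZE"))`` and pass the next eight
bytes to ``decode_pagemap_entry``.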
Short descriptions to the page flags: Short descriptions to the page flags:
=====================================
0. LOCKED
page is being locked for exclusive access, eg. by undergoing read/write IO 0 - LOCKED
page is being locked for exclusive access, eg. by undergoing read/write IO
7. SLAB 7 - SLAB
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
When compound page is used, SLUB/SLQB will only set this flag on the head When compound page is used, SLUB/SLQB will only set this flag on the head
page; SLOB will not flag it at all. page; SLOB will not flag it at all.
10 - BUDDY
10. BUDDY
a free memory block managed by the buddy system allocator a free memory block managed by the buddy system allocator
The buddy system organizes free memory in blocks of various orders. The buddy system organizes free memory in blocks of various orders.
An order N block has 2^N physically contiguous pages, with the BUDDY flag An order N block has 2^N physically contiguous pages, with the BUDDY flag
set for and _only_ for the first page. set for and _only_ for the first page.
15 - COMPOUND_HEAD
15. COMPOUND_HEAD
16. COMPOUND_TAIL
A compound page with order N consists of 2^N physically contiguous pages. A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its A compound page with order 2 takes the form of "HTTT", where H donates its
head page and T donates its tail page(s). The major consumers of compound head page and T donates its tail page(s). The major consumers of compound
pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. pages are hugeTLB pages (Documentation/vm/hugetlbpage.rst), the SLUB etc.
memory allocators and various device drivers. However in this interface, memory allocators and various device drivers. However in this interface,
only huge/giga pages are made visible to end users. only huge/giga pages are made visible to end users.
17. HUGE 16 - COMPOUND_TAIL
A compound page tail (see description above).
17 - HUGE
this is an integral part of a HugeTLB page this is an integral part of a HugeTLB page
19 - HWPOISON
19. HWPOISON
hardware detected memory corruption on this page: don't touch the data! hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
20. NOPAGE
no page frame exists at the requested address no page frame exists at the requested address
21 - KSM
21. KSM
identical memory pages dynamically shared between one or more processes identical memory pages dynamically shared between one or more processes
22 - THP
22. THP
contiguous pages which construct transparent hugepages contiguous pages which construct transparent hugepages
23 - BALLOON
23. BALLOON
balloon compaction page balloon compaction page
24 - ZERO_PAGE
24. ZERO_PAGE
zero page for pfn_zero or huge_zero page zero page for pfn_zero or huge_zero page
25 - IDLE
25. IDLE
page has not been accessed since it was marked idle (see page has not been accessed since it was marked idle (see
Documentation/vm/idle_page_tracking.txt). Note that this flag may be Documentation/vm/idle_page_tracking.rst). Note that this flag may be
stale in case the page was accessed via a PTE. To make sure the flag stale in case the page was accessed via a PTE. To make sure the flag
is up-to-date one has to read /sys/kernel/mm/page_idle/bitmap first. is up-to-date one has to read ``/sys/kernel/mm/page_idle/bitmap`` first.
[IO related page flags] IO related page flags
1. ERROR IO error occurred ---------------------
3. UPTODATE page has up-to-date data
ie. for file backed page: (in-memory data revision >= on-disk one) 1 - ERROR
4. DIRTY page has been written to, hence contains new data IO error occurred
ie. for file backed page: (in-memory data revision > on-disk one) 3 - UPTODATE
8. WRITEBACK page is being synced to disk page has up-to-date data
ie. for file backed page: (in-memory data revision >= on-disk one)
[LRU related page flags] 4 - DIRTY
5. LRU page is in one of the LRU lists page has been written to, hence contains new data
6. ACTIVE page is in the active LRU list ie. for file backed page: (in-memory data revision > on-disk one)
18. UNEVICTABLE page is in the unevictable (non-)LRU list 8 - WRITEBACK
It is somehow pinned and not a candidate for LRU page reclaims, page is being synced to disk
eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
2. REFERENCED page has been referenced since last LRU list enqueue/requeue LRU related page flags
9. RECLAIM page will be reclaimed soon after its pageout IO completed ----------------------
11. MMAP a memory mapped page
12. ANON a memory mapped page that is not part of a file 5 - LRU
13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry page is in one of the LRU lists
14. SWAPBACKED page is backed by swap/RAM 6 - ACTIVE
page is in the active LRU list
18 - UNEVICTABLE
page is in the unevictable (non-)LRU list. It is somehow pinned and
not a candidate for LRU page reclaims, eg. ramfs pages,
shmctl(SHM_LOCK) and mlock() memory segments
2 - REFERENCED
page has been referenced since last LRU list enqueue/requeue
9 - RECLAIM
page will be reclaimed soon after its pageout IO completed
11 - MMAP
a memory mapped page
12 - ANON
a memory mapped page that is not part of a file
13 - SWAPCACHE
page is mapped to swap space, ie. has an associated swap entry
14 - SWAPBACKED
page is backed by swap/RAM
The page-types tool in the tools/vm directory can be used to query the The page-types tool in the tools/vm directory can be used to query the
above flags. above flags.
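As a rough illustration of what such a query involves, the sketch below decodes a 64-bit flags word into flag names. The bit numbers and names are copied from the list above and cover only the flags shown in this excerpt; the helper name `decode_kpageflags` is hypothetical and is not part of the page-types tool.

```python
# Bit numbers and names taken from the flag list above (subset shown in this excerpt).
KPAGEFLAGS = {
    1: "ERROR", 2: "REFERENCED", 3: "UPTODATE", 4: "DIRTY",
    5: "LRU", 6: "ACTIVE", 8: "WRITEBACK", 9: "RECLAIM",
    11: "MMAP", 12: "ANON", 13: "SWAPCACHE", 14: "SWAPBACKED",
    18: "UNEVICTABLE", 22: "THP", 23: "BALLOON", 24: "ZERO_PAGE", 25: "IDLE",
}


def decode_kpageflags(value):
    """Return the names of the listed flags set in a 64-bit flags word, lowest bit first."""
    return [name for bit, name in sorted(KPAGEFLAGS.items()) if (value >> bit) & 1]
```

A word with bits 5, 6 and 12 set would decode to LRU, ACTIVE and ANON, i.e. an anonymous page on the active LRU list.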
Using pagemap to do something useful: Using pagemap to do something useful
====================================
The general procedure for using pagemap to find out about a process' memory The general procedure for using pagemap to find out about a process' memory
usage goes like this: usage goes like this:
1. Read /proc/pid/maps to determine which parts of the memory space are 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
mapped to what. mapped to what.
2. Select the maps you are interested in -- all of them, or a particular 2. Select the maps you are interested in -- all of them, or a particular
library, or the stack or the heap, etc. library, or the stack or the heap, etc.
3. Open /proc/pid/pagemap and seek to the pages you would like to examine. 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
4. Read a u64 for each page from pagemap. 4. Read a u64 for each page from pagemap.
5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
read, seek to that entry in the file, and read the data you want. just read, seek to that entry in the file, and read the data you want.
For example, to find the "unique set size" (USS), which is the amount of For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process, memory that a process is using that is not shared with any other process,
...@@ -171,7 +184,8 @@ you can go through every map in the process, find the PFNs, look those up ...@@ -171,7 +184,8 @@ you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced in kpagecount, and tally up the number of pages that are only referenced
once. once.
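The USS procedure can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes the pagemap entry layout with the PFN in bits 0-54 and the "present" flag in bit 63, a 4096-byte page size, and enough privilege to read PFNs and ``/proc/kpagecount``. All helper names are hypothetical.

```python
import struct

ENTRY_SIZE = 8                # each pagemap/kpagecount entry is one u64
PFN_MASK = (1 << 55) - 1      # assumed layout: bits 0-54 hold the PFN
PRESENT = 1 << 63             # assumed layout: bit 63 means "page present"


def parse_maps_line(line):
    """Return (start, end) virtual addresses from one /proc/pid/maps line."""
    start, end = (int(part, 16) for part in line.split()[0].split("-"))
    return start, end


def pfn_of(entry):
    """Return the PFN of a pagemap entry, or None if the page is not present."""
    return entry & PFN_MASK if entry & PRESENT else None


def read_u64_at(f, index):
    """Seek to the index-th u64 of an open file and read it (8-byte aligned)."""
    f.seek(index * ENTRY_SIZE)
    data = f.read(ENTRY_SIZE)
    return struct.unpack("=Q", data)[0] if len(data) == ENTRY_SIZE else None


def uss_pages(pid, page_size=4096):
    """Count present pages of a process whose kpagecount is exactly 1."""
    unique = 0
    with open(f"/proc/{pid}/maps") as maps, \
         open(f"/proc/{pid}/pagemap", "rb", buffering=0) as pagemap, \
         open("/proc/kpagecount", "rb", buffering=0) as kpagecount:
        for line in maps:
            start, end = parse_maps_line(line)
            for vaddr in range(start, end, page_size):
                entry = read_u64_at(pagemap, vaddr // page_size)
                pfn = pfn_of(entry) if entry is not None else None
                if pfn and read_u64_at(kpagecount, pfn) == 1:
                    unique += 1
    return unique
```

Multiplying the result by the page size gives the USS in bytes. Reading one entry per system call keeps the sketch short; a real tool would read each mapping's entries in bulk.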
Other notes: Other notes
===========
Reading from any of the files will return -EINVAL if you are not starting Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
......
.. _remap_file_pages:
==============================
remap_file_pages() system call
==============================
The remap_file_pages() system call is used to create a nonlinear mapping, The remap_file_pages() system call is used to create a nonlinear mapping,
that is, a mapping in which the pages of the file are mapped into a that is, a mapping in which the pages of the file are mapped into a
nonsequential order in memory. The advantage of using remap_file_pages() nonsequential order in memory. The advantage of using remap_file_pages()
......
SOFT-DIRTY PTEs .. _soft_dirty:
The soft-dirty is a bit on a PTE which helps to track which pages a task ===============
Soft-Dirty PTEs
===============
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should writes to. In order to do this tracking one should
1. Clear soft-dirty bits from the task's PTEs. 1. Clear soft-dirty bits from the task's PTEs.
This is done by writing "4" into the /proc/PID/clear_refs file of the This is done by writing "4" into the ``/proc/PID/clear_refs`` file of the
task in question. task in question.
2. Wait some time. 2. Wait some time.
3. Read soft-dirty bits from the PTEs. 3. Read soft-dirty bits from the PTEs.
This is done by reading from the /proc/PID/pagemap. The bit 55 of the This is done by reading from the ``/proc/PID/pagemap``. The bit 55 of the
64-bit qword is the soft-dirty one. If set, the respective PTE was 64-bit qword is the soft-dirty one. If set, the respective PTE was
written to since step 1. written to since step 1.
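The three steps might look like this from user space. This is a sketch only: it assumes a 4096-byte page size and sufficient permission on the target's ``/proc`` files, and the helper names are hypothetical.

```python
import struct

SOFT_DIRTY = 1 << 55  # bit 55 of the 64-bit pagemap entry, as described above


def is_soft_dirty(entry):
    """True if the soft-dirty bit is set in a pagemap entry."""
    return bool(entry & SOFT_DIRTY)


def dirty_addresses(entries, start, page_size=4096):
    """Map a run of pagemap entries starting at 'start' to soft-dirty page addresses."""
    return [start + i * page_size for i, e in enumerate(entries) if is_soft_dirty(e)]


def clear_soft_dirty(pid):
    """Step 1: clear the soft-dirty bits by writing "4" to clear_refs."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def soft_dirty_pages(pid, start, end, page_size=4096):
    """Step 3: return addresses in [start, end) written to since the last clear."""
    count = (end - start) // page_size
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek(start // page_size * 8)
        data = f.read(count * 8)
    data = data[: len(data) // 8 * 8]  # keep whole u64 entries only
    entries = struct.unpack(f"={len(data) // 8}Q", data)
    return dirty_addresses(entries, start, page_size)
```

A caller would invoke `clear_soft_dirty(pid)`, wait, and then call `soft_dirty_pages()` for each region of interest taken from ``/proc/PID/maps``.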
Internally, to do this tracking, the writable bit is cleared from PTEs Internally, to do this tracking, the writable bit is cleared from PTEs
when the soft-dirty bit is cleared. So, after this, when the task tries to when the soft-dirty bit is cleared. So, after this, when the task tries to
modify a page at some virtual address the #PF occurs and the kernel sets modify a page at some virtual address the #PF occurs and the kernel sets
the soft-dirty bit on the respective PTE. the soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after the Note, that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast. soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus all This is so, since the pages are still mapped to physical memory, and thus all
the kernel does is find this fact out and put both writable and soft-dirty the kernel does is find this fact out and put both writable and soft-dirty
bits on the PTE. bits on the PTE.
While in most cases tracking memory changes by #PF-s is more than enough While in most cases tracking memory changes by #PF-s is more than enough
there is still a scenario when we can lose soft dirty bits -- a task there is still a scenario when we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at exactly unmaps a previously mapped memory region and then maps a new one at exactly
the same place. When unmap is called, the kernel internally clears PTE values the same place. When unmap is called, the kernel internally clears PTE values
...@@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such ...@@ -36,7 +40,7 @@ including soft dirty bits. To notify user space application about such
memory region renewal the kernel always marks new memory regions (and memory region renewal the kernel always marks new memory regions (and
expanded regions) as soft dirty. expanded regions) as soft dirty.
This feature is actively used by the checkpoint-restore project. You This feature is actively used by the checkpoint-restore project. You
can find more details about it on http://criu.org can find more details about it on http://criu.org
......
.. _split_page_table_lock:
=====================
Split page table lock Split page table lock
===================== =====================
...@@ -11,6 +14,7 @@ access to the table. At the moment we use split lock for PTE and PMD ...@@ -11,6 +14,7 @@ access to the table. At the moment we use split lock for PTE and PMD
tables. Access to higher level tables is protected by mm->page_table_lock. tables. Access to higher level tables is protected by mm->page_table_lock.
There are helpers to lock/unlock a table and other accessor functions: There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_lock() - pte_offset_map_lock()
maps pte and takes PTE table lock, returns pointer to the taken maps pte and takes PTE table lock, returns pointer to the taken
lock; lock;
...@@ -34,12 +38,13 @@ Split page table lock for PMD tables is enabled, if it's enabled for PTE ...@@ -34,12 +38,13 @@ Split page table lock for PMD tables is enabled, if it's enabled for PTE
tables and the architecture supports it (see below). tables and the architecture supports it (see below).
Hugetlb and split page table lock Hugetlb and split page table lock
--------------------------------- =================================
Hugetlb can support several page sizes. We use split lock only for PMD Hugetlb can support several page sizes. We use split lock only for PMD
level, but not for PUD. level, but not for PUD.
Hugetlb-specific helpers: Hugetlb-specific helpers:
- huge_pte_lock() - huge_pte_lock()
takes pmd split lock for PMD_SIZE page, mm->page_table_lock takes pmd split lock for PMD_SIZE page, mm->page_table_lock
otherwise; otherwise;
...@@ -47,7 +52,7 @@ Hugetlb-specific helpers: ...@@ -47,7 +52,7 @@ Hugetlb-specific helpers:
returns pointer to table lock; returns pointer to table lock;
Support of split page table lock by an architecture Support of split page table lock by an architecture
--------------------------------------------------- ===================================================
There's no need to specially enable the PTE split page table lock: There's no need to specially enable the PTE split page table lock:
everything required is done by pgtable_page_ctor() and pgtable_page_dtor(), everything required is done by pgtable_page_ctor() and pgtable_page_dtor(),
...@@ -73,7 +78,7 @@ NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must ...@@ -73,7 +78,7 @@ NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must
be handled properly. be handled properly.
page->ptl page->ptl
--------- =========
page->ptl is used to access split page table lock, where 'page' is struct page->ptl is used to access split page table lock, where 'page' is struct
page of page containing the table. It shares storage with page->private page of page containing the table. It shares storage with page->private
...@@ -81,6 +86,7 @@ page of page containing the table. It shares storage with page->private ...@@ -81,6 +86,7 @@ page of page containing the table. It shares storage with page->private
To avoid increasing size of struct page and have best performance, we use a To avoid increasing size of struct page and have best performance, we use a
trick: trick:
- if spinlock_t fits into long, we use page->ptl as spinlock, so we - if spinlock_t fits into long, we use page->ptl as spinlock, so we
can avoid indirect access and save a cache line. can avoid indirect access and save a cache line.
- if size of spinlock_t is bigger than size of long, we use page->ptl as - if size of spinlock_t is bigger than size of long, we use page->ptl as
......
.. _swap_numa:
===========================================
Automatically bind swap device to numa node Automatically bind swap device to numa node
------------------------------------------- ===========================================
If the system has more than one swap device and swap device has the node If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap information, we can make use of this information to decide which swap
...@@ -7,15 +10,16 @@ device to use in get_swap_pages() to get better performance. ...@@ -7,15 +10,16 @@ device to use in get_swap_pages() to get better performance.
How to use this feature How to use this feature
----------------------- =======================
Each swap device has a priority, which decides the order in which devices are used. To make Each swap device has a priority, which decides the order in which devices are used. To make
use of automatic binding, there is no need to manipulate priority settings use of automatic binding, there is no need to manipulate priority settings
for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
swapB, with swapA attached to node 0 and swapB attached to node 1, are going swapB, with swapA attached to node 0 and swapB attached to node 1, are going
to be swapped on. Simply swapping them on by doing: to be swapped on. Simply swapping them on by doing::
# swapon /dev/swapA
# swapon /dev/swapB # swapon /dev/swapA
# swapon /dev/swapB
Then node 0 will use the two swap devices in the order of swapA then swapB and Then node 0 will use the two swap devices in the order of swapA then swapB and
node 1 will use the two swap devices in the order of swapB then swapA. Note node 1 will use the two swap devices in the order of swapB then swapA. Note
...@@ -24,32 +28,39 @@ that the order of them being swapped on doesn't matter. ...@@ -24,32 +28,39 @@ that the order of them being swapped on doesn't matter.
A more complex example on a 4 node machine. Assume 6 swap devices are going to A more complex example on a 4 node machine. Assume 6 swap devices are going to
be swapped on: swapA and swapB are attached to node 0, swapC is attached to be swapped on: swapA and swapB are attached to node 0, swapC is attached to
node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3. node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
The way to swap them on is the same as above: The way to swap them on is the same as above::
# swapon /dev/swapA
# swapon /dev/swapB # swapon /dev/swapA
# swapon /dev/swapC # swapon /dev/swapB
# swapon /dev/swapD # swapon /dev/swapC
# swapon /dev/swapE # swapon /dev/swapD
# swapon /dev/swapF # swapon /dev/swapE
# swapon /dev/swapF
Then node 0 will use them in the order of:
swapA/swapB -> swapC -> swapD -> swapE -> swapF Then node 0 will use them in the order of::
swapA/swapB -> swapC -> swapD -> swapE -> swapF
swapA and swapB will be used in a round robin mode before any other swap device. swapA and swapB will be used in a round robin mode before any other swap device.
node 1 will use them in the order of: node 1 will use them in the order of::
swapC -> swapA -> swapB -> swapD -> swapE -> swapF
swapC -> swapA -> swapB -> swapD -> swapE -> swapF
node 2 will use them in the order of::
swapD/swapE -> swapA -> swapB -> swapC -> swapF
node 2 will use them in the order of:
swapD/swapE -> swapA -> swapB -> swapC -> swapF
Similarly, swapD and swapE will be used in a round robin mode before any Similarly, swapD and swapE will be used in a round robin mode before any
other swap devices. other swap devices.
node 3 will use them in the order of: node 3 will use them in the order of::
swapF -> swapA -> swapB -> swapC -> swapD -> swapE
swapF -> swapA -> swapB -> swapC -> swapD -> swapE
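The per-node orderings above follow a simple pattern that can be modelled with a short sketch. This is an illustrative model of the examples only, not the kernel's code: it assumes that devices local to a node come first and share a round-robin slot, and that the remaining devices keep the order in which they are listed. The function name `swap_order` is hypothetical.

```python
def swap_order(devices, node):
    """Model the order in which 'node' uses swap devices.

    devices: list of (name, attached_node) tuples in listed order.
    Returns a string in the same notation as the examples above.
    """
    local = [name for name, dev_node in devices if dev_node == node]
    remote = [name for name, dev_node in devices if dev_node != node]
    order = []
    if local:
        # Devices on the same node are used round robin before any other device.
        order.append("/".join(local))
    order.extend(remote)  # assumption: non-local devices keep their listed order
    return " -> ".join(order)


# The 4 node example from the text above.
DEVICES = [("swapA", 0), ("swapB", 0), ("swapC", 1),
           ("swapD", 2), ("swapE", 2), ("swapF", 3)]
```

Calling `swap_order(DEVICES, 2)` reproduces the node 2 line above.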
Implementation details Implementation details
---------------------- ======================
The current code uses a priority based list, swap_avail_list, to decide The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same which swap device to use and if multiple swap devices share the same
......
============================== .. _unevictable_lru:
UNEVICTABLE LRU INFRASTRUCTURE
==============================
========
CONTENTS
========
(*) The Unevictable LRU
- The unevictable page list.
- Memory control group interaction.
- Marking address spaces unevictable.
- Detecting Unevictable Pages.
- vmscan's handling of unevictable pages.
(*) mlock()'d pages.
- History.
- Basic management.
- mlock()/mlockall() system call handling.
- Filtering special vmas.
- munlock()/munlockall() system call handling.
- Migrating mlocked pages.
- Compacting mlocked pages.
- mmap(MAP_LOCKED) system call handling.
- munmap()/exit()/exec() system call handling.
- try_to_unmap().
- try_to_munlock() reverse map scan.
- Page reclaim in shrink_*_list().
==============================
Unevictable LRU Infrastructure
==============================
============ .. contents:: :local:
INTRODUCTION
Introduction
============ ============
This document describes the Linux memory manager's "Unevictable LRU" This document describes the Linux memory manager's "Unevictable LRU"
...@@ -46,8 +22,8 @@ details - the "what does it do?" - by reading the code. One hopes that the ...@@ -46,8 +22,8 @@ details - the "what does it do?" - by reading the code. One hopes that the
descriptions below add value by providing the answer to "why does it do that?". descriptions below add value by providing the answer to "why does it do that?".
===================
THE UNEVICTABLE LRU The Unevictable LRU
=================== ===================
The Unevictable LRU facility adds an additional LRU list to track unevictable The Unevictable LRU facility adds an additional LRU list to track unevictable
...@@ -66,17 +42,17 @@ completely unresponsive. ...@@ -66,17 +42,17 @@ completely unresponsive.
The unevictable list addresses the following classes of unevictable pages: The unevictable list addresses the following classes of unevictable pages:
(*) Those owned by ramfs. * Those owned by ramfs.
(*) Those mapped into SHM_LOCK'd shared memory regions. * Those mapped into SHM_LOCK'd shared memory regions.
(*) Those mapped into VM_LOCKED [mlock()ed] VMAs. * Those mapped into VM_LOCKED [mlock()ed] VMAs.
The infrastructure may also be able to handle other conditions that make pages The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future. unevictable, either by definition or by circumstance, in the future.
THE UNEVICTABLE PAGE LIST The Unevictable Page List
------------------------- -------------------------
The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
...@@ -118,7 +94,7 @@ the unevictable list when one task has the page isolated from the LRU and other ...@@ -118,7 +94,7 @@ the unevictable list when one task has the page isolated from the LRU and other
tasks are changing the "evictability" state of the page. tasks are changing the "evictability" state of the page.
MEMORY CONTROL GROUP INTERACTION Memory Control Group Interaction
-------------------------------- --------------------------------
The unevictable LRU facility interacts with the memory control group [aka The unevictable LRU facility interacts with the memory control group [aka
...@@ -144,7 +120,9 @@ effects: ...@@ -144,7 +120,9 @@ effects:
the control group to thrash or to OOM-kill tasks. the control group to thrash or to OOM-kill tasks.
MARKING ADDRESS SPACES UNEVICTABLE .. _mark_addr_space_unevict:
Marking Address Spaces Unevictable
---------------------------------- ----------------------------------
For facilities such as ramfs none of the pages attached to the address space For facilities such as ramfs none of the pages attached to the address space
...@@ -152,15 +130,15 @@ may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE ...@@ -152,15 +130,15 @@ may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
address space flag is provided, and this can be manipulated by a filesystem address space flag is provided, and this can be manipulated by a filesystem
using a number of wrapper functions: using a number of wrapper functions:
(*) void mapping_set_unevictable(struct address_space *mapping); * ``void mapping_set_unevictable(struct address_space *mapping);``
Mark the address space as being completely unevictable. Mark the address space as being completely unevictable.
(*) void mapping_clear_unevictable(struct address_space *mapping); * ``void mapping_clear_unevictable(struct address_space *mapping);``
Mark the address space as being evictable. Mark the address space as being evictable.
(*) int mapping_unevictable(struct address_space *mapping); * ``int mapping_unevictable(struct address_space *mapping);``
Query the address space, and return true if it is completely Query the address space, and return true if it is completely
unevictable. unevictable.
...@@ -177,12 +155,13 @@ These are currently used in two places in the kernel: ...@@ -177,12 +155,13 @@ These are currently used in two places in the kernel:
ensure they're in memory. ensure they're in memory.
DETECTING UNEVICTABLE PAGES Detecting Unevictable Pages
--------------------------- ---------------------------
The function page_evictable() in vmscan.c determines whether a page is The function page_evictable() in vmscan.c determines whether a page is
evictable or not using the query function outlined above [see section "Marking evictable or not using the query function outlined above [see section
address spaces unevictable"] to check the AS_UNEVICTABLE flag. :ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.
For address spaces that are so marked after being populated (as SHM regions For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
...@@ -202,7 +181,7 @@ flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is ...@@ -202,7 +181,7 @@ flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
VMSCAN'S HANDLING OF UNEVICTABLE PAGES Vmscan's Handling of Unevictable Pages
-------------------------------------- --------------------------------------
If unevictable pages are culled in the fault path, or moved to the unevictable If unevictable pages are culled in the fault path, or moved to the unevictable
...@@ -233,8 +212,7 @@ extra evictabilty checks should not occur in the majority of calls to ...@@ -233,8 +212,7 @@ extra evictabilty checks should not occur in the majority of calls to
putback_lru_page(). putback_lru_page().
============= MLOCKED Pages
MLOCKED PAGES
============= =============
The unevictable page list is also useful for mlock(), in addition to ramfs and The unevictable page list is also useful for mlock(), in addition to ramfs and
...@@ -242,7 +220,7 @@ SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in ...@@ -242,7 +220,7 @@ SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked. NOMMU situations, all mappings are effectively mlocked.
HISTORY History
------- -------
The "Unevictable mlocked Pages" infrastructure is based on work originally The "Unevictable mlocked Pages" infrastructure is based on work originally
...@@ -263,7 +241,7 @@ replaced by walking the reverse map to determine whether any VM_LOCKED VMAs ...@@ -263,7 +241,7 @@ replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
mapped the page. More on this below. mapped the page. More on this below.
BASIC MANAGEMENT Basic Management
---------------- ----------------
mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
...@@ -304,10 +282,10 @@ mlocked pages become unlocked and rescued from the unevictable list when: ...@@ -304,10 +282,10 @@ mlocked pages become unlocked and rescued from the unevictable list when:
(4) before a page is COW'd in a VM_LOCKED VMA. (4) before a page is COW'd in a VM_LOCKED VMA.
mlock()/mlockall() SYSTEM CALL HANDLING mlock()/mlockall() System Call Handling
--------------------------------------- ---------------------------------------
Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
for each VMA in the range specified by the call. In the case of mlockall(), for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup() this is the entire active address space of the task. Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory. A call to mlock() is used for both mlocking and munlocking a range of memory. A call to mlock()
...@@ -351,7 +329,7 @@ mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle ...@@ -351,7 +329,7 @@ mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page. it later if and when it attempts to reclaim the page.
FILTERING SPECIAL VMAS Filtering Special VMAs
---------------------- ----------------------
mlock_fixup() filters several classes of "special" VMAs: mlock_fixup() filters several classes of "special" VMAs:
...@@ -379,8 +357,9 @@ VM_LOCKED flag. Therefore, we won't have to deal with them later during ...@@ -379,8 +357,9 @@ VM_LOCKED flag. Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit. Neither does mlock_fixup() account these munlock(), munmap() or task exit. Neither does mlock_fixup() account these
VMAs against the task's "locked_vm". VMAs against the task's "locked_vm".
.. _munlock_munlockall_handling:
munlock()/munlockall() SYSTEM CALL HANDLING munlock()/munlockall() System Call Handling
------------------------------------------- -------------------------------------------
The munlock() and munlockall() system calls are handled by the same functions - The munlock() and munlockall() system calls are handled by the same functions -
...@@ -426,7 +405,7 @@ This is fine, because we'll catch it later if and if vmscan tries to reclaim ...@@ -426,7 +405,7 @@ This is fine, because we'll catch it later if and if vmscan tries to reclaim
the page. This should be relatively rare. the page. This should be relatively rare.
MIGRATING MLOCKED PAGES Migrating MLOCKED Pages
----------------------- -----------------------
A page that is being migrated has been isolated from the LRU lists and is held A page that is being migrated has been isolated from the LRU lists and is held
...@@ -451,7 +430,7 @@ list because of a race between munlock and migration, page migration uses the ...@@ -451,7 +430,7 @@ list because of a race between munlock and migration, page migration uses the
putback_lru_page() function to add migrated pages back to the LRU. putback_lru_page() function to add migrated pages back to the LRU.
COMPACTING MLOCKED PAGES Compacting MLOCKED Pages
------------------------ ------------------------
The unevictable LRU can be scanned for compactable regions and the default The unevictable LRU can be scanned for compactable regions and the default
...@@ -461,7 +440,7 @@ unevictable LRU is enabled, the work of compaction is mostly handled by ...@@ -461,7 +440,7 @@ unevictable LRU is enabled, the work of compaction is mostly handled by
the page migration code and the same work flow as described in MIGRATING the page migration code and the same work flow as described in MIGRATING
MLOCKED PAGES will apply. MLOCKED PAGES will apply.
MLOCKING TRANSPARENT HUGE PAGES MLOCKING Transparent Huge Pages
------------------------------- -------------------------------
A transparent huge page is represented by a single entry on an LRU list. A transparent huge page is represented by a single entry on an LRU list.
...@@ -483,7 +462,7 @@ to unevictable LRU and the rest can be reclaimed. ...@@ -483,7 +462,7 @@ to unevictable LRU and the rest can be reclaimed.
See also comment in follow_trans_huge_pmd(). See also comment in follow_trans_huge_pmd().
mmap(MAP_LOCKED) SYSTEM CALL HANDLING mmap(MAP_LOCKED) System Call Handling
------------------------------------- -------------------------------------
In addition to the mlock()/mlockall() system calls, an application can request In addition to the mlock()/mlockall() system calls, an application can request
...@@ -514,7 +493,7 @@ memory range accounted as locked_vm, as the protections could be changed later ...@@ -514,7 +493,7 @@ memory range accounted as locked_vm, as the protections could be changed later
and pages allocated into that region. and pages allocated into that region.
munmap()/exit()/exec() SYSTEM CALL HANDLING munmap()/exit()/exec() System Call Handling
------------------------------------------- -------------------------------------------
When unmapping an mlocked region of memory, whether by an explicit call to When unmapping an mlocked region of memory, whether by an explicit call to
...@@ -568,16 +547,18 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim, ...@@ -568,16 +547,18 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
holepunching, and truncation of file pages and their anonymous COWed pages. holepunching, and truncation of file pages and their anonymous COWed pages.
try_to_munlock() REVERSE MAP SCAN try_to_munlock() Reverse Map Scan
--------------------------------- ---------------------------------
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the .. warning::
page_referenced() reverse map walker. [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.
When munlock_vma_page() [see section "munlock()/munlockall() System Call When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
Handling" above] tries to munlock a page, it needs to determine whether or not Handling <munlock_munlockall_handling>` above] tries to munlock a
the page is mapped by any VM_LOCKED VMA without actually attempting to unmap page, it needs to determine whether or not the page is mapped by any
all PTEs from the page. For this purpose, the unevictable/mlock infrastructure VM_LOCKED VMA without actually attempting to unmap all PTEs from the
page. For this purpose, the unevictable/mlock infrastructure
introduced a variant of try_to_unmap() called try_to_munlock(). introduced a variant of try_to_unmap() called try_to_munlock().
try_to_munlock() calls the same functions as try_to_unmap() for anonymous and try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
...@@ -595,7 +576,7 @@ large region or tearing down a large address space that has been mlocked via ...@@ -595,7 +576,7 @@ large region or tearing down a large address space that has been mlocked via
mlockall(), overall this is a fairly rare event. mlockall(), overall this is a fairly rare event.
PAGE RECLAIM IN shrink_*_list() Page Reclaim in shrink_*_list()
------------------------------- -------------------------------
shrink_active_list() culls any obviously unevictable pages - i.e. shrink_active_list() culls any obviously unevictable pages - i.e.
......
.. _z3fold:
======
z3fold z3fold
------ ======
z3fold is a special purpose allocator for storing compressed pages. z3fold is a special purpose allocator for storing compressed pages.
It is designed to store up to three compressed pages per physical page. It is designed to store up to three compressed pages per physical page.
...@@ -7,6 +10,7 @@ It is a zbud derivative which allows for higher compression ...@@ -7,6 +10,7 @@ It is a zbud derivative which allows for higher compression
ratio keeping the simplicity and determinism of its predecessor. ratio keeping the simplicity and determinism of its predecessor.
The main differences between z3fold and zbud are: The main differences between z3fold and zbud are:
* unlike zbud, z3fold allows for up to PAGE_SIZE allocations * unlike zbud, z3fold allows for up to PAGE_SIZE allocations
* z3fold can hold up to 3 compressed pages in its page * z3fold can hold up to 3 compressed pages in its page
* z3fold doesn't export any API itself and is thus intended to be used * z3fold doesn't export any API itself and is thus intended to be used
......
...@@ -15621,7 +15621,7 @@ L: linux-mm@kvack.org ...@@ -15621,7 +15621,7 @@ L: linux-mm@kvack.org
S: Maintained S: Maintained
F: mm/zsmalloc.c F: mm/zsmalloc.c
F: include/linux/zsmalloc.h F: include/linux/zsmalloc.h
F: Documentation/vm/zsmalloc.txt F: Documentation/vm/zsmalloc.rst
ZSWAP COMPRESSED SWAP CACHING ZSWAP COMPRESSED SWAP CACHING
M: Seth Jennings <sjenning@redhat.com> M: Seth Jennings <sjenning@redhat.com>
......
...@@ -585,7 +585,7 @@ config ARCH_DISCONTIGMEM_ENABLE ...@@ -585,7 +585,7 @@ config ARCH_DISCONTIGMEM_ENABLE
Say Y to support efficient handling of discontiguous physical memory, Say Y to support efficient handling of discontiguous physical memory,
for architectures which are either NUMA (Non-Uniform Memory Access) for architectures which are either NUMA (Non-Uniform Memory Access)
or have huge holes in the physical address space for other reasons. or have huge holes in the physical address space for other reasons.
See <file:Documentation/vm/numa> for more. See <file:Documentation/vm/numa.rst> for more.
source "mm/Kconfig" source "mm/Kconfig"
......
...@@ -397,7 +397,7 @@ config ARCH_DISCONTIGMEM_ENABLE ...@@ -397,7 +397,7 @@ config ARCH_DISCONTIGMEM_ENABLE
Say Y to support efficient handling of discontiguous physical memory, Say Y to support efficient handling of discontiguous physical memory,
for architectures which are either NUMA (Non-Uniform Memory Access) for architectures which are either NUMA (Non-Uniform Memory Access)
or have huge holes in the physical address space for other reasons. or have huge holes in the physical address space for other reasons.
See <file:Documentation/vm/numa> for more. See <file:Documentation/vm/numa.rst> for more.
config ARCH_FLATMEM_ENABLE config ARCH_FLATMEM_ENABLE
def_bool y def_bool y
......
...@@ -2556,7 +2556,7 @@ config ARCH_DISCONTIGMEM_ENABLE ...@@ -2556,7 +2556,7 @@ config ARCH_DISCONTIGMEM_ENABLE
Say Y to support efficient handling of discontiguous physical memory, Say Y to support efficient handling of discontiguous physical memory,
for architectures which are either NUMA (Non-Uniform Memory Access) for architectures which are either NUMA (Non-Uniform Memory Access)
or have huge holes in the physical address space for other reasons. or have huge holes in the physical address space for other reasons.
See <file:Documentation/vm/numa> for more. See <file:Documentation/vm/numa.rst> for more.
config ARCH_SPARSEMEM_ENABLE config ARCH_SPARSEMEM_ENABLE
bool bool
......
...@@ -883,7 +883,7 @@ config PPC_MEM_KEYS ...@@ -883,7 +883,7 @@ config PPC_MEM_KEYS
page-based protections, but without requiring modification of the page-based protections, but without requiring modification of the
page tables when an application changes protection domains. page tables when an application changes protection domains.
For details, see Documentation/vm/protection-keys.txt For details, see Documentation/vm/protection-keys.rst
If unsure, say y. If unsure, say y.
......
...@@ -196,7 +196,7 @@ config HUGETLBFS ...@@ -196,7 +196,7 @@ config HUGETLBFS
help help
hugetlbfs is a filesystem backing for HugeTLB pages, based on hugetlbfs is a filesystem backing for HugeTLB pages, based on
ramfs. For architectures that support it, say Y here and read ramfs. For architectures that support it, say Y here and read
<file:Documentation/vm/hugetlbpage.txt> for details. <file:Documentation/vm/hugetlbpage.rst> for details.
If unsure, say N. If unsure, say N.
......
...@@ -677,7 +677,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, ...@@ -677,7 +677,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
* downgrading page table protection not changing it to point * downgrading page table protection not changing it to point
* to a new page. * to a new page.
* *
* See Documentation/vm/mmu_notifier.txt * See Documentation/vm/mmu_notifier.rst
*/ */
if (pmdp) { if (pmdp) {
#ifdef CONFIG_FS_DAX_PMD #ifdef CONFIG_FS_DAX_PMD
......
This diff is collapsed.
...@@ -16,7 +16,7 @@ ...@@ -16,7 +16,7 @@
/* /*
* Heterogeneous Memory Management (HMM) * Heterogeneous Memory Management (HMM)
* *
* See Documentation/vm/hmm.txt for reasons and overview of what HMM is and it * See Documentation/vm/hmm.rst for reasons and overview of what HMM is and it
* is for. Here we focus on the HMM API description, with some explanation of * is for. Here we focus on the HMM API description, with some explanation of
* the underlying implementation. * the underlying implementation.
* *
......
This diff is collapsed.