1. 17 January 2022: 1 commit
  2. 15 January 2022: 38 commits
    • G
      mm/damon: move the implementation of damon_insert_region to damon.h · 2cd4b8e1
      Guoqing Jiang authored
      Usually, an inline function is declared static, since the keyword should
      sit between the storage class and the type, and it is implemented in a
      header file if it is used by multiple files.
      
      This change also fixes a compile issue seen when backporting DAMON to 5.10:
      
        mm/damon/vaddr.c: In function `damon_va_evenly_split_region':
        ./include/linux/damon.h:425:13: error: inlining failed in call to `always_inline' `damon_insert_region': function body not available
        425 | inline void damon_insert_region(struct damon_region *r,
            | ^~~~~~~~~~~~~~~~~~~
        mm/damon/vaddr.c:86:3: note: called from here
        86 | damon_insert_region(n, r, next, t);
           | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
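
      As a rough illustration of the pattern described above, a helper implemented
      directly in the shared header would look something like the sketch below
      (simplified; the list handling and field names follow the description here
      and should be treated as approximate, not the exact upstream code):

        /* include/linux/damon.h (sketch): defining the helper as 'static inline'
         * in the header gives every includer a function body to inline, avoiding
         * the "function body not available" error quoted above. */
        static inline void damon_insert_region(struct damon_region *r,
                        struct damon_region *prev, struct damon_region *next,
                        struct damon_target *t)
        {
                __list_add(&r->list, &prev->list, &next->list);
                t->nr_regions++;
        }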
      
      Link: https://lkml.kernel.org/r/20211223085703.6142-1-guoqing.jiang@linux.dev
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@linux.dev>
      Reviewed-by: NSeongJae Park <sj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2cd4b8e1
    • S
      mm/damon/schemes: account how many times quota limit has exceeded · 6268eac3
      SeongJae Park authored
      If the time/space quotas of a given DAMON-based operation scheme are too
      small, the scheme could show unexpectedly slow progress.  However, there
      is no good way to notice this case at runtime.  This commit extends the
      DAMOS stat to report how many times the quota limits have been exceeded,
      so that users can easily notice the case and tune the scheme.
      
      Link: https://lkml.kernel.org/r/20211210150016.35349-3-sj@kernel.org
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6268eac3
    • S
      mm/damon/schemes: account scheme actions that successfully applied · 0e92c2ee
      SeongJae Park authored
      Patch series "mm/damon/schemes: Extend stats for better online analysis and tuning".
      
      To help online access pattern analysis and tuning of DAMON-based
      Operation Schemes (DAMOS), DAMOS provides simple statistics for each
      scheme.  The introduction of DAMOS time/space quotas further eased tuning
      by making risk management easier.  However, it also made it a little more
      difficult to understand how the schemes are actually working.
      
      For example, the progress of a given scheme can now be throttled not only
      by the aggressiveness of the target access pattern, but also by the
      time/space quotas.  So, when a scheme is showing unexpectedly slow
      progress, it's difficult to tell from the currently provided statistics
      what is throttling it.
      
      This patchset extends the statistics to contain some metrics that can be
      helpful for such online scheme analysis and tuning (patches 1-2), exports
      those to users (patches 3 and 5), and adds documentation (patches 4
      and 6).
      
      This patch (of 6):
      
      DAMON-based operation scheme (DAMOS) stats provide only the number and
      total size of the regions that the scheme's action has been tried on.
      Because the action can fail for various reasons, the currently provided
      information is sometimes not useful or convenient enough for scheme
      profiling and tuning.  To improve this situation, this commit extends the
      DAMOS stats to also provide the number and total size of the regions the
      action has been successfully applied to.
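
      As a rough sketch of the bookkeeping this series builds up (the field names
      are taken from the descriptions in this series, but should be treated as
      approximate rather than the exact upstream layout):

        /* Per-scheme statistics: the first pair counts regions/bytes the action
         * was tried on, the second pair those it was successfully applied to,
         * and qt_exceeds (added by the next patch) counts how many times the
         * quota limit was exceeded. */
        struct damos_stat {
                unsigned long nr_tried;
                unsigned long sz_tried;
                unsigned long nr_applied;
                unsigned long sz_applied;
                unsigned long qt_exceeds;
        };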
      
      Link: https://lkml.kernel.org/r/20211210150016.35349-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211210150016.35349-2-sj@kernel.org
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e92c2ee
    • S
      mm/damon: remove a mistakenly added comment for a future feature · f4c6d22c
      SeongJae Park authored
      Due to a mistake in patch reordering, a comment for a future feature
      called 'arbitrary monitoring target support'[1], which is still under
      development, has been added.  Because it only introduces confusion and we
      don't have a plan to post the patches soon, this commit removes the
      mistakenly added part.
      
      [1] https://lore.kernel.org/linux-mm/20201215115448.25633-3-sjpark@amazon.com/
      
      Link: https://lkml.kernel.org/r/20211209131806.19317-7-sj@kernel.org
      Fixes: 1f366e42 ("mm/damon/core: implement DAMON-based Operation Schemes (DAMOS)")
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4c6d22c
    • S
      mm/damon: convert macro functions to static inline functions · 88f86dcf
      SeongJae Park authored
      Patch series "mm/damon: Misc cleanups".
      
      This patchset contains miscellaneous cleanups for DAMON's macro
      functions and documentation.
      
      This patch (of 6):
      
      This commit converts macro functions in DAMON to static inline functions,
      for better type checking, code documentation, etc[1].
      
      [1] https://lore.kernel.org/linux-mm/20211202151213.6ec830863342220da4141bc5@linux-foundation.org/
      
      Link: https://lkml.kernel.org/r/20211209131806.19317-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211209131806.19317-2-sj@kernel.org
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f86dcf
    • X
      mm/damon: modify damon_rand() macro to static inline function · 234d6873
      Xin Hao authored
      damon_rand() cannot safely be implemented as a macro.

      Example:
      	damon_rand(a++, b);

      The value of 'a' will be incremented twice because the macro evaluates its
      first argument twice.  This is obviously unreasonable, so fix it by
      converting the macro into a static inline function.
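
      The hazard is easy to see when a macro of this shape is placed next to its
      static inline replacement; the bodies below are a simplified sketch rather
      than the exact upstream code:

        /* A macro like this evaluates 'l' twice, so damon_rand(a++, b)
         * increments 'a' twice: */
        #define damon_rand_macro(l, r) ((l) + prandom_u32_max((r) - (l)))

        /* The static inline replacement evaluates each argument exactly once: */
        static inline unsigned long damon_rand(unsigned long l, unsigned long r)
        {
                return l + prandom_u32_max(r - l);
        }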
      
      Link: https://lkml.kernel.org/r/110ffcd4e420c86c42b41ce2bc9f0fe6a4f32cd3.1638795127.git.xhao@linux.alibaba.com
      Fixes: b9a6ac4e ("mm/damon: adaptively adjust regions")
      Signed-off-by: NXin Hao <xhao@linux.alibaba.com>
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSeongJae Park <sj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      234d6873
    • X
      mm/damon: move damon_rand() definition into damon.h · 9b2a38d6
      Xin Hao authored
      damon_rand() is called in three files: damon/core.c, damon/paddr.c and
      damon/vaddr.c.  There is no need to define it more than once, so moving
      it to damon.h is a good choice.
      
      Link: https://lkml.kernel.org/r/20211202075859.51341-1-xhao@linux.alibaba.com
      Signed-off-by: NXin Hao <xhao@linux.alibaba.com>
      Reviewed-by: NSeongJae Park <sj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b2a38d6
    • X
      mm/damon: remove some unneeded function definitions in damon.h · cdeed009
      Xin Hao authored
      In damon.h, some function declarations for the VA and PA variants are
      only used in their own files, so there is no need to declare them in the
      header file, and the header will look cleaner without them.
      
      If other files later need these functions, the prototypes can be added
      to damon.h at that time.
      
      [sj@kernel.org: remove unnecessary function prototype position changes]
       Link: https://lkml.kernel.org/r/20211118114827.20052-1-sj@kernel.org
      
      Link: https://lkml.kernel.org/r/45fd5b3ef6cce8e28dbc1c92f9dc845ccfc949d7.1636989871.git.xhao@linux.alibaba.com
      Signed-off-by: NXin Hao <xhao@linux.alibaba.com>
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Reviewed-by: NSeongJae Park <sj@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cdeed009
    • T
      mm: make some vars and functions static or __init · cab0a7c1
      Ting Liu authored
      "page_idle_ops" as a global var, but its scope of use within this
      file only, so it should be static.
      
      "page_ext_ops" is a var used in the kernel initial phase.  And other
      functions are also used only during the kernel initialization phase, so
      they should be marked __init or __initdata so their memory can be reclaimed.
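
      An illustrative sketch of the two annotations involved (the symbols below
      are stand-ins, not the exact ones the patch touches):

        /* Referenced only from this file, so give the table internal linkage: */
        static struct page_ext_operations page_idle_ops = {
                /* ... */
        };

        /* Used only while the kernel boots; __init/__initdata lets that memory
         * be reclaimed once initialization is done: */
        static int __init example_early_setup(void)
        {
                return 0;
        }
        static unsigned long example_boot_table[16] __initdata;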
      
      Link: https://lkml.kernel.org/r/20211217095023.67293-1-liuting.0x7c00@bytedance.com
      Signed-off-by: NTing Liu <liuting.0x7c00@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cab0a7c1
    • H
      mm/rmap: fix potential batched TLB flush race · 5ee2fa2f
      Huang Ying authored
      In theory, the following race is possible for batched TLB flushing.
      
        CPU0                               CPU1
        ----                               ----
        shrink_page_list()
                                           unmap
                                             zap_pte_range()
                                               flush_tlb_batched_pending()
                                                 flush_tlb_mm()
          try_to_unmap()
            set_tlb_ubc_flush_pending()
              mm->tlb_flush_batched = true
                                                 mm->tlb_flush_batched = false
      
      After the TLB is flushed on CPU1 via flush_tlb_mm() and before
      mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0 and
      its TLB flush is left pending.  That pending TLB flush will then be lost.
      Although both set_tlb_ubc_flush_pending() and
      flush_tlb_batched_pending() are called with the PTL locked, different PTL
      instances may be used.

      Because the race window is really small, and the lost TLB flush will
      cause a problem only if a TLB entry is inserted before the unmapping
      within the race window, the race is only theoretical.  But the fix is
      simple and cheap too.
      
      Syzbot has reported this too as follows:
      
          ==================================================================
          BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
      
          write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
           flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
           madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
           walk_pmd_range mm/pagewalk.c:128 [inline]
           walk_pud_range mm/pagewalk.c:205 [inline]
           walk_p4d_range mm/pagewalk.c:240 [inline]
           walk_pgd_range mm/pagewalk.c:277 [inline]
           __walk_page_range+0x981/0x1160 mm/pagewalk.c:379
           walk_page_range+0x131/0x300 mm/pagewalk.c:475
           madvise_free_single_vma mm/madvise.c:734 [inline]
           madvise_dontneed_free mm/madvise.c:822 [inline]
           madvise_vma mm/madvise.c:996 [inline]
           do_madvise+0xe4a/0x1140 mm/madvise.c:1202
           __do_sys_madvise mm/madvise.c:1228 [inline]
           __se_sys_madvise mm/madvise.c:1226 [inline]
           __x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
           do_syscall_x64 arch/x86/entry/common.c:50 [inline]
           do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
          write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
           set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
           try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
           rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
           try_to_unmap+0xec/0x110
           shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
           shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
           shrink_list mm/vmscan.c:2621 [inline]
           shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
           shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
           shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
           kswapd_shrink_node mm/vmscan.c:4022 [inline]
           balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
           kswapd+0x200/0x340 mm/vmscan.c:4473
           kthread+0x2c7/0x2e0 kernel/kthread.c:327
           ret_from_fork+0x1f/0x30
      
          value changed: 0x01 -> 0x00
      
          Reported by Kernel Concurrency Sanitizer on:
          CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          ==================================================================
      
      [akpm@linux-foundation.org: tweak comments]
      
      Link: https://lkml.kernel.org/r/20211201021104.126469-1-ying.huang@intel.com
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reported-by: syzbot+aa5bebed695edaccf0df@syzkaller.appspotmail.com
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ee2fa2f
    • N
      mm/hwpoison: fix unpoison_memory() · bf181c58
      Naoya Horiguchi authored
      After the recent soft-offline rework, error pages can be taken off the
      buddy allocator, but the existing unpoison_memory() does not properly
      undo the operation.  Moreover, due to the recent change to
      __get_hwpoison_page(), get_page_unless_zero() is hardly ever called for
      hwpoisoned pages.  So __get_hwpoison_page() most likely returns -EBUSY
      (meaning it fails to grab the page refcount) and unpoison just clears
      PG_hwpoison without releasing a refcount.  That does not lead to a
      critical issue like a kernel panic, but unpoisoned pages never get back
      to the buddy allocator (they are leaked permanently), which is not good.
      
      To (partially) fix this, we need to identify "taken off" pages from
      other types of hwpoisoned pages.  We can't use refcount or page flags
      for this purpose, so a pseudo flag is defined by hacking ->private
      field.  Someone might think that put_page() is enough to cancel
      taken-off pages, but the normal free path contains some operations not
      suitable for the current purpose, and can fire VM_BUG_ON().
      
      Note that unpoison_memory() is now supposed to cancel only hwpoison
      events injected by madvise() or
      /sys/devices/system/memory/{hard,soft}_offline_page, not by MCE
      injection, so please don't try to use unpoison when testing with MCE
      injection.
      
      [lkp@intel.com: report build failure for ARCH=i386]
      
      Link: https://lkml.kernel.org/r/20211115084006.3728254-4-naoya.horiguchi@linux.dev
      Signed-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ding Hui <dinghui@sangfor.com.cn>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf181c58
    • N
      mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE · c9fdc4d5
      Naoya Horiguchi authored
      These action_page_types are no longer used, so remove them.
      
      Link: https://lkml.kernel.org/r/20211115084006.3728254-3-naoya.horiguchi@linux.dev
      Signed-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: NYang Shi <shy828301@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ding Hui <dinghui@sangfor.com.cn>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9fdc4d5
    • A
      mm/mempolicy: wire up syscall set_mempolicy_home_node · 21b084fd
      Aneesh Kumar K.V authored
      Link: https://lkml.kernel.org/r/20211202123810.267175-4-aneesh.kumar@linux.ibm.com
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21b084fd
    • A
      mm/mempolicy: add set_mempolicy_home_node syscall · c6018b4b
      Aneesh Kumar K.V authored
      This syscall can be used to set a home node for the MPOL_BIND and
      MPOL_PREFERRED_MANY memory policy.  Users should use this syscall after
      setting up a memory policy for the specified range as shown below.
      
        mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
              new_nodes->size + 1, 0);
        sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
      				home_node, 0);
      
      The syscall allows specifying a home node/preferred node from which
      kernel will fulfill memory allocation requests first.
      
      For address range with MPOL_BIND memory policy, if nodemask specifies
      more than one node, page allocations will come from the node in the
      nodemask with sufficient free memory that is closest to the home
      node/preferred node.
      
      For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
      page allocation will come from the node in the nodemask with sufficient
      free memory that is closest to the home node/preferred node.  If there
      is not enough memory in all the nodes specified in the nodemask, the
      allocation will be attempted from the closest numa node to the home node
      in the system.
      
      This helps applications to hint at a memory allocation preference node
      and fallback to _only_ a set of nodes if the memory is not available on
      the preferred node.  Fallback allocation is attempted from the node
      which is nearest to the preferred node.
      
      This helps applications to control which NUMA nodes memory is allocated
      from and avoids the default fallback to slow-memory NUMA nodes.  For
      example, consider a system with DRAM on NUMA nodes 1, 2 and 3 and slow
      memory on nodes 10, 11 and 12:
      
       new_nodes = numa_bitmask_alloc(nr_nodes);
      
       numa_bitmask_setbit(new_nodes, 1);
       numa_bitmask_setbit(new_nodes, 2);
       numa_bitmask_setbit(new_nodes, 3);
      
       p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
       mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
      
       sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
      
      This will allocate from nodes closer to node 2 and will make sure the
      kernel only allocates from nodes 1, 2, and 3.  Memory will not be
      allocated from the slow-memory nodes 10, 11, and 12.  This differs from
      the default MPOL_BIND behavior, in which the allocation is attempted from
      the node closest to the local node.  One of the reasons to specify a home
      node is to allow allocations from a CPU-less NUMA node and its nearby
      NUMA nodes.
      
      With MPOL_PREFERRED_MANY, on the other hand, the kernel will first try to
      allocate from the node closest to node 2 among nodes 1, 2 and 3.  If
      those nodes don't have enough memory, the kernel will allocate from
      whichever of the slow-memory nodes 10, 11 and 12 is closest to node 2.
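
      For completeness, a self-contained variant of the snippet above could look
      like the sketch below (built with -lnuma; the raw syscall number 450 is an
      assumption for x86-64 and should be checked against the local kernel
      headers, the sketch assumes nodes 1-3 exist, and error handling is omitted):

        #include <numa.h>
        #include <numaif.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef __NR_set_mempolicy_home_node
        #define __NR_set_mempolicy_home_node 450    /* assumed x86-64 value */
        #endif

        int main(void)
        {
                long page_size = sysconf(_SC_PAGESIZE);
                long nr_pages = 256;
                struct bitmask *new_nodes = numa_bitmask_alloc(numa_max_node() + 1);

                numa_bitmask_setbit(new_nodes, 1);
                numa_bitmask_setbit(new_nodes, 2);
                numa_bitmask_setbit(new_nodes, 3);

                void *p = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                /* Restrict the range to the DRAM nodes 1-3 ... */
                mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
                      new_nodes->size + 1, 0);

                /* ... and ask for allocations to be served from (or near) node 2. */
                syscall(__NR_set_mempolicy_home_node, (unsigned long)p,
                        nr_pages * page_size, 2, 0);

                return 0;
        }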
      
      Link: https://lkml.kernel.org/r/20211202123810.267175-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6018b4b
    • G
      vmscan: make drop_slab_node static · e4b424b7
      Gang Li authored
      drop_slab_node is only used in drop_slab.  So remove its declaration from
      the header file and add the static keyword to its definition.
      
      Link: https://lkml.kernel.org/r/20211111062445.5236-1-ligang.bdlg@bytedance.com
      Signed-off-by: NGang Li <ligang.bdlg@bytedance.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4b424b7
    • Y
      mm/vmstat: add events for THP max_ptes_* exceeds · e9ea874a
      Yang Yang authored
      There are interfaces to adjust max_ptes_none, max_ptes_swap,
      max_ptes_shared values, see
        /sys/kernel/mm/transparent_hugepage/khugepaged/.
      
      But a system administrator may not know which value is best.  So add
      these events to help adjust max_ptes_* to suitable values.

      For example, if the default max_ptes_swap value causes too many failures,
      and the system uses zram whose I/O is fast, the administrator could
      increase max_ptes_swap until THP_SCAN_EXCEED_SWAP_PTE no longer
      increases.
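
      The new counters can then be watched from /proc/vmstat while tuning; a
      minimal reader is sketched below (the lowercase counter names are assumed
      to follow the THP_SCAN_EXCEED_* event names above):

        #include <stdio.h>
        #include <string.h>

        /* Print the THP "scan exceed" counters so the effect of changing
         * max_ptes_* can be observed over time. */
        int main(void)
        {
                char line[128];
                FILE *f = fopen("/proc/vmstat", "r");

                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f))
                        if (strstr(line, "thp_scan_exceed"))
                                fputs(line, stdout);
                fclose(f);
                return 0;
        }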
      
      Link: https://lkml.kernel.org/r/20211225094036.574157-1-yang.yang29@zte.com.cn
      Signed-off-by: NYang Yang <yang.yang29@zte.com.cn>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Saravanan D <saravanand@fb.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9ea874a
    • M
      hugetlb: add hugetlb.*.numa_stat file · f4776199
      Mina Almasry authored
      For hugetlb backed jobs/VMs it's critical to understand the numa
      information for the memory backing these jobs to deliver optimal
      performance.
      
      Currently this technically can be queried from /proc/self/numa_maps, but
      there are significant issues with that.  Namely:
      
      1. Memory can be mapped or unmapped.
      
      2. numa_maps are per process and need to be aggregated across all
         processes in the cgroup.  For shared memory this is more involved as
         the userspace needs to make sure it doesn't double count shared
         mappings.
      
      3. I believe querying numa_maps needs to hold the mmap_lock which adds
         to the contention on this lock.
      
      For these reasons I propose simply adding hugetlb.*.numa_stat file,
         which shows the numa information of the cgroup similarly to
         memory.numa_stat.
      
      On cgroup-v2:
         cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
         total=2097152 N0=2097152 N1=0
      
      On cgroup-v1:
         cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
         total=2097152 N0=2097152 N1=0
         hierarichal_total=2097152 N0=2097152 N1=0
      
      This patch was tested manually by allocating hugetlb memory and querying
      the hugetlb.*.numa_stat file of the cgroup and its parents.
      
      [colin.i.king@googlemail.com: fix spelling mistake "hierarichal" -> "hierarchical"]
        Link: https://lkml.kernel.org/r/20211125090635.23508-1-colin.i.king@gmail.com
      [keescook@chromium.org: fix copy/paste array assignment]
        Link: https://lkml.kernel.org/r/20211203065647.2819707-1-keescook@chromium.org
      
      Link: https://lkml.kernel.org/r/20211123001020.4083653-1-almasrymina@google.com
      Signed-off-by: NMina Almasry <almasrymina@google.com>
      Signed-off-by: NColin Ian King <colin.i.king@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jue Wang <juew@google.com>
      Cc: Yang Yao <ygyao@google.com>
      Cc: Joanna Li <joannali@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4776199
    • B
      mm_zone: add function to check if managed dma zone exists · 62b31070
      Baoquan He authored
      Patch series "Handle warning of allocation failure on DMA zone w/o
      managed pages", v4.
      
      **Problem observed:
      On x86_64, when a crash is triggered and the system enters the kdump
      kernel, a page allocation failure can always be seen.
      
       ---------------------------------
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        ......
        __alloc_pages+0x24d/0x2c0
        ......
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        ? rcu_read_lock_sched_held+0x3f/0x80
        kernel_init_freeable+0x290/0x2dc
        ? rest_init+0x24f/0x24f
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ------------------------------------
      
      ***Root cause:
      In the current kernel it is assumed that the DMA zone must have managed
      pages, and pages are requested from it if CONFIG_ZONE_DMA is enabled, but
      this is not always true.  E.g. in the kdump kernel on x86_64, only the low
      1M is present and locked down at a very early stage of boot, so this low 1M
      is never added to the buddy allocator and the DMA zone ends up with no
      managed pages.  This exception will always cause a page allocation failure
      if a page is requested from the DMA zone.
      
      ***Investigation:
      This failure has happened since the below commits were merged into Linus's tree.
        1a6a9044 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
        23721c8e x86/crash: Remove crash_reserve_low_1M()
        f1d4d47c x86/setup: Always reserve the first 1M of RAM
        7c321eb2 x86/kdump: Remove the backup region handling
        6f599d84 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
      
      Before them, on x86_64 the low 640K area was reused by the kdump kernel:
      the content of the low 640K area is copied into a backup region for dumping
      before jumping into kdump, and then, except for the firmware-reserved
      regions in [0, 640K], the remaining area is added into the buddy allocator
      to become available managed pages of the DMA zone.

      However, after the above commits, in the kdump kernel on x86_64 the low
      1M is reserved by memblock but never released to the buddy allocator, so
      any later page allocation requested from the DMA zone will fail.
      
      Initially, the low 1M had to be locked down when crashkernel is reserved
      because AMD SME encrypts memory, making the old backup-region mechanism
      impossible when switching into the kdump kernel.
      
      Later, it was also observed that there are BIOSes corrupting memory
      under 1M. To solve this, in commit f1d4d47c, the entire region of
      low 1M is always reserved after the real mode trampoline is allocated.
      
      Besides, an Intel engineer recently mentioned that TDX (Trust Domain
      Extensions), which is under development in the kernel, also needs to lock
      down the low 1M.  So we can't simply revert the above commits to fix the
      page allocation failure from the DMA zone, as was suggested.
      
      ***Solution:
      Currently, only DMA atomic pool and dma-kmalloc will initialize and
      request page allocation with GFP_DMA during bootup.
      
      So only initialize the DMA atomic pool when the DMA zone has available
      managed pages; otherwise just skip the initialization.

      For dma-kmalloc(), for the time being, let's mute the allocation-failure
      warning when pages are requested from the DMA zone while it has no managed
      pages.  Meanwhile, change code to use the dma_alloc_xx/dma_map_xx APIs to
      replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if
      it is not necessary.  Christoph is posting patches to fix those under
      drivers/scsi/.  Finally, we can remove the need for dma-kmalloc() as
      people suggested.
      
      This patch (of 3):
      
      In some places, the current kernel assumes that the DMA zone must have
      managed pages if CONFIG_ZONE_DMA is enabled, but this is not always true.
      E.g. in the kdump kernel on x86_64, only the low 1M is present and locked
      down at a very early stage of boot, so there are no managed pages at all
      in the DMA zone.  This exception will always cause a page allocation
      failure if a page is requested from the DMA zone.
      
      Here, add the function has_managed_dma() and the relevant helper functions
      to check whether there is a DMA zone with managed pages.  It will be used
      in later patches.
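
      A helper of that kind could look roughly like the following sketch
      (simplified; a stub returning false would cover the !CONFIG_ZONE_DMA case):

        /* mm/page_alloc.c (sketch): report whether any node's DMA zone ended
         * up with pages under buddy-allocator management. */
        bool has_managed_dma(void)
        {
                struct pglist_data *pgdat;

                for_each_online_pgdat(pgdat) {
                        struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                        if (managed_zone(zone))
                                return true;
                }
                return false;
        }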
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NJohn Donnelly  <john.p.donnelly@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62b31070
    • M
      include/linux/gfp.h: further document GFP_DMA32 · 04a536bf
      Miles Chen authored
      kmalloc(..., GFP_DMA32) does not return DMA32 memory because the DMA32
      kmalloc cache array is not implemented.  (Reason: there is no such user
      in the kernel.)
      
      Put a short comment about this so people can understand this by reading
      the comment.
      
      [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
      
      Link: https://lkml.kernel.org/r/20211207093610.6406-1-miles.chen@mediatek.com
      Signed-off-by: NMiles Chen <miles.chen@mediatek.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04a536bf
    • M
      mm: drop node from alloc_pages_vma · be1a13eb
      Michal Hocko authored
      alloc_pages_vma is meant to allocate a page with a vma specific memory
      policy.  The initial node parameter is always a local node so it is
      pointless to waste a function argument for this.  Drop the parameter.
      
      Link: https://lkml.kernel.org/r/YaSnlv4QpryEpesG@dhcp22.suse.cz
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be1a13eb
    • C
      mm: fix boolreturn.cocci warning · 1611f74a
      Changcheng Deng authored
      Return statements in functions returning bool should use true/false
      instead of 1/0.
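
      In other words, for a function declared to return bool (an illustrative
      example, not the exact function the patch touches):

        #include <stdbool.h>

        static bool refcount_is_sane(int refcount)
        {
                if (refcount < 0)
                        return false;   /* not: return 0; */
                return true;            /* not: return 1; */
        }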
      
      Link: https://lkml.kernel.org/r/20211126073327.74815-1-deng.changcheng@zte.com.cn
      Signed-off-by: NChangcheng Deng <deng.changcheng@zte.com.cn>
      Reported-by: NZeal Robot <zealci@zte.com.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1611f74a
    • N
      mm: introduce memalloc_retry_wait() · 4034247a
      NeilBrown authored
      Various places in the kernel - largely in filesystems - respond to a
      memory allocation failure by looping around and re-trying.  Some of
      these cannot conveniently use __GFP_NOFAIL, for reasons such as:
      
       - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
       - a need to check for the process being signalled between failures
       - the possibility that other recovery actions could be performed
       - the allocation is quite deep in support code, and passing down an
         extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
      
      Many of these currently use congestion_wait() which (in almost all
      cases) simply waits the given timeout - congestion isn't tracked for
      most devices.
      
      It isn't clear what the best delay is for loops, but it is clear that
      the various filesystems shouldn't be responsible for choosing a timeout.
      
      This patch introduces memalloc_retry_wait(), which takes on that
      responsibility.  Code that wants to retry a memory allocation can call
      this function, passing the GFP flags that were used.  It will wait
      however long is appropriate.

      For now, it only considers __GFP_NORETRY and whatever
      gfpflags_allow_blocking() tests.  If blocking is allowed without
      __GFP_NORETRY, then alloc_page either made some reclaim progress, or
      waited for a while, before failing.  So there is no need for much
      further waiting.  memalloc_retry_wait() will wait until the current
      jiffie ends.  If this condition is not met, then alloc_page() won't have
      waited much, if at all.  In that case memalloc_retry_wait() waits about
      200ms.  This is the delay that most current loops use.
      
      linux/sched/mm.h needs to be included in some files now,
      but linux/backing-dev.h does not.
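
      A typical retry loop would then look roughly like the sketch below (the
      surrounding helper is hypothetical; only memalloc_retry_wait() and its
      gfp_t argument come from this patch):

        #include <linux/sched/mm.h>
        #include <linux/sched/signal.h>
        #include <linux/slab.h>

        static void *alloc_thing_retrying(size_t size, gfp_t gfp)
        {
                void *p;

                while (!(p = kmalloc(size, gfp))) {
                        if (fatal_signal_pending(current))
                                return NULL;
                        /* Wait an amount appropriate to the GFP flags that just
                         * failed, instead of an arbitrary congestion_wait(). */
                        memalloc_retry_wait(gfp);
                }
                return p;
        }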
      
      Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4034247a
    • M
      mm: allow !GFP_KERNEL allocations for kvmalloc · a421ef30
      Michal Hocko authored
      Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
      previous patches so we can allow the support for kvmalloc.  This will
      allow some external users to simplify or completely remove their
      helpers.
      
      GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
      explicitly documented so let's add a note about that.
      
      ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.
      
      Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a421ef30
    • M
      mm: remove the total_mapcount argument from page_trans_huge_mapcount() · d08d2b62
      Matthew Wilcox (Oracle) authored
      All callers pass NULL, so we can stop calculating the value we would
      store in it.
      
      Link: https://lkml.kernel.org/r/20211220205943.456187-3-willy@infradead.org
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d08d2b62
    • P
      mm: page table check · df4e817b
      Pasha Tatashin authored
      Check user page table entries at the time they are added and removed.
      
      This allows memory corruption issues related to double mapping to be
      caught synchronously.

      When a pte for an anonymous page is added into a page table, we verify
      that this pte does not already point to a file-backed page; and vice
      versa, if a file-backed page is being added, we verify that this page
      does not have an anonymous mapping.
      
      We also enforce that read-only sharing for anonymous pages is allowed
      (i.e.  cow after fork).  All other sharing must be for file pages.
      
      Page table check makes it possible to protect against and debug cases
      where "struct page" metadata has become corrupted for some reason, for
      example when the refcount or mapcount becomes invalid.
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
      Signed-off-by: NPasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df4e817b
    • P
      mm: ptep_clear() page table helper · 08d5b29e
      Pasha Tatashin authored
      We have the ptep_get_and_clear() and ptep_get_and_clear_full() helpers to
      clear a PTE from user page tables, but there is no variant for a simple
      clear of a present PTE from user page tables that doesn't go through the
      low-level pte_clear(), which can be either native or para-virtualised.

      Add a new ptep_clear() that can be used in common code to clear PTEs from
      page tables.  We will need this call later in order to add a hook for the
      page table check.
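
      Before the page table check hook is wired in by a later patch, the helper
      can be pictured as a thin wrapper, roughly:

        /* include/linux/pgtable.h (sketch): common-code helper for clearing a
         * present user PTE; pte_clear() remains the native/paravirt primitive. */
        static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
                                      pte_t *ptep)
        {
                pte_clear(mm, addr, ptep);
        }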
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-3-pasha.tatashin@soleen.com
      Signed-off-by: NPasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08d5b29e
    • S
      mm: document locking restrictions for vm_operations_struct::close · cc6dcfee
      Suren Baghdasaryan authored
      Add comments for vm_operations_struct::close documenting locking
      requirements for this callback and its callers.
      
      Link: https://lkml.kernel.org/r/20211209191325.3069345-2-surenb@google.com
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jan Engelhardt <jengelh@inai.de>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc6dcfee
    • A
      mm: move tlb_flush_pending inline helpers to mm_inline.h · 36090def
      Arnd Bergmann authored
      linux/mm_types.h should only contain structure definitions, to make it
      cheap to include elsewhere.  The atomic_t helper function definitions are
      particularly large, so it's better to move the helpers using them into
      the existing linux/mm_inline.h and only include that where needed.
      
      As a follow-up, we may want to go through all the indirect includes in
      mm_types.h and reduce them as much as possible.
      
      Link: https://lkml.kernel.org/r/20211207125710.2503446-2-arnd@kernel.org
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36090def
    • A
      mm: move anon_vma declarations to linux/mm_inline.h · 17fca131
      Arnd Bergmann authored
      The patch to add anonymous vma names causes a build failure in some
      configurations:
      
        include/linux/mm_types.h: In function 'is_same_vma_anon_name':
        include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
          924 |         return name && vma_name && !strcmp(name, vma_name);
              |                                     ^~~~~~
        include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
      
      This should not really be part of linux/mm_types.h in the first place,
      as that header is meant to only contain structure definitions and needs a
      minimal set of indirect includes itself.
      
      While the header clearly includes more than it should at this point,
      let's not make it worse by including string.h as well, which would pull
      in the expensive (compile-speed wise) fortify-string logic.
      
      Move the new functions into a separate header that only needs to be
      included in a couple of locations.
      
      Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
      Fixes: "mm: add a field to store names for private anonymous memory"
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@google.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17fca131
    • S
      mm: add anonymous vma name refcounting · 78db3412
      Suren Baghdasaryan authored
      While forking a process with a high number (64K) of named anonymous vmas,
      the overhead caused by strdup() is noticeable.  Experiments with an ARM64
      Android device show up to a 40% performance regression when forking a
      process with 64k unpopulated anonymous vmas using the maximum name
      length, vs the same process with the same number of anonymous vmas having
      no name.
      
      Introduce anon_vma_name refcounted structure to avoid the overhead of
      copying vma names during fork() and when splitting named anonymous vmas.
      
      When a vma is duplicated, instead of copying the name we increment the
      refcount of this structure.  Multiple vmas can point to the same
      anon_vma_name as long as they increment the refcount.  The name member
      of anon_vma_name structure is assigned at structure allocation time and
      is never changed.  If vma name changes then the refcount of the original
      structure is dropped, a new anon_vma_name structure is allocated to hold
      the new name and the vma pointer is updated to point to the new
      structure.
      
      With this approach the fork() performance regression is reduced by 3-4
      times, and with use cases using a more reasonable number of VMAs (a few
      thousand) the regression is not measurable.
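
      Conceptually, the refcounted name can be pictured as the small structure
      sketched below (the helper name is illustrative; only the kref-plus-string
      idea comes from the description above):

        struct anon_vma_name {
                struct kref kref;
                /* The string is allocated together with the structure and is
                 * never changed after allocation. */
                char name[];
        };

        /* Duplicating or splitting a named vma takes a reference instead of
         * copying the string: */
        static void anon_vma_name_get(struct anon_vma_name *anon_name)
        {
                if (anon_name)
                        kref_get(&anon_name->kref);
        }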
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-3-surenb@google.com
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78db3412
    • C
      mm: add a field to store names for private anonymous memory · 9a10064f
      Colin Cross authored
      In many userspace applications, and especially in the VM-based
      applications that Android uses heavily, there are multiple different
      allocators in use.  At a minimum there are libc malloc and the stack, and in many cases
      there are libc malloc, the stack, direct syscalls to mmap anonymous
      memory, and multiple VM heaps (one for small objects, one for big
      objects, etc.).  Each of these layers usually has its own tools to
      inspect its usage; malloc by compiling a debug version, the VM through
      heap inspection tools, and for direct syscalls there is usually no way
      to track them.
      
      On Android we heavily use a set of tools that use an extended version of
      the logic covered in Documentation/vm/pagemap.txt to walk all pages
      mapped in userspace and slice their usage by process, shared (COW) vs.
      unique mappings, backing, etc.  This can account for real physical
      memory usage even in cases like fork without exec (which Android uses
      heavily to share as many private COW pages as possible between
      processes), Kernel SamePage Merging, and clean zero pages.  It produces
      a measurement of the pages that only exist in that process (USS, for
      unique), and a measurement of the physical memory usage of that process
      with the cost of shared pages being evenly split between processes that
      share them (PSS).
      
      If all anonymous memory is indistinguishable then figuring out the real
      physical memory usage (PSS) of each heap requires either a pagemap
      walking tool that can understand the heap debugging of every layer, or
      for every layer's heap debugging tools to implement the pagemap walking
      logic, in which case it is hard to get a consistent view of memory
      across the whole system.
      
      Tracking the information in userspace leads to all sorts of problems.
      It either needs to be stored inside the process, which means every
      process has to have an API to export its current heap information upon
      request, or it has to be stored externally in a filesystem that somebody
      needs to clean up on crashes.  It needs to be readable while the process
      is still running, so it has to have some sort of synchronization with
      every layer of userspace.  Efficiently tracking the ranges requires
      reimplementing something like the kernel vma trees, and linking to it
      from every layer of userspace.  It requires more memory, more syscalls,
      more runtime cost, and more complexity to separately track regions that
      the kernel is already tracking.
      
      This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
      userspace-provided name for anonymous vmas.  The names of named
      anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
      [anon:<name>].
      
      Userspace can set the name for a region of memory by calling
      
         prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
      
      Setting the name to NULL clears it.  The name length limit is 80 bytes
      including NUL-terminator and is checked to contain only printable ascii
      characters (including space), except '[',']','\','$' and '`'.
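
      A minimal userspace sketch of that call is shown below; the fallback
      constant values are assumptions to be verified against <linux/prctl.h>,
      and the kernel must be built with CONFIG_ANON_VMA_NAME:

        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/prctl.h>

        #ifndef PR_SET_VMA
        #define PR_SET_VMA              0x53564d41      /* assumed value */
        #define PR_SET_VMA_ANON_NAME    0
        #endif

        int main(void)
        {
                size_t len = 1 << 20;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                /* The region should now appear as "[anon:my heap]" in
                 * /proc/self/maps. */
                if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                          (unsigned long)p, len, (unsigned long)"my heap"))
                        perror("prctl(PR_SET_VMA)");

                return 0;
        }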
      
      ASCII strings are used to provide descriptive identifiers for vmas that
      can be understood by users reading /proc/pid/maps or /proc/pid/smaps.
      Names can be standardized for a given system, and they can include some
      variable parts such as the name of the allocator or a library, the tid of
      the thread using it, etc.
      
      The name is stored in a pointer in the shared union in vm_area_struct
      that points to a null-terminated string.  Anonymous vmas that have the
      same name (equivalent strings) and are otherwise mergeable will be
      merged.  The name pointers are not shared between vmas even if they
      contain the same name.  The name pointer is stored in a union with fields
      that are only used on file-backed mappings, so it does not increase
      memory usage.
      
      CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
      feature.  It keeps the feature disabled by default to prevent any
      additional memory overhead and to avoid confusing procfs parsers on
      systems which are not ready to support named anonymous vmas.
      
      The patch is based on the original patch developed by Colin Cross, more
      specifically on its latest version [1] posted upstream by Sumit Semwal.
      It used a userspace pointer to store vma names.  In that design, name
      pointers could be shared between vmas.  However during the last
      upstreaming attempt, Kees Cook raised concerns [2] about this approach
      and suggested to copy the name into kernel memory space, perform
      validity checks [3] and store as a string referenced from
      vm_area_struct.
      
      One big concern is about fork() performance which would need to strdup
      anonymous vma names.  Dave Hansen suggested experimenting with
      worst-case scenario of forking a process with 64k vmas having longest
      possible names [4].  I ran this experiment on an ARM64 Android device
      and recorded a worst-case regression of almost 40% when forking such a
      process.
      
      This regression is addressed in the followup patch which replaces the
      pointer to a name with a refcounted structure that allows sharing the
      name pointer between vmas of the same name.  Instead of duplicating the
      string during fork() or when splitting a vma it increments the refcount.
      
      [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
      [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
      [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
      [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
      
      Changes for prctl(2) manual page (in the options section):
      
      PR_SET_VMA
      	Sets an attribute specified in arg2 for virtual memory areas
      	starting from the address specified in arg3 and spanning the
      	size specified in arg4. arg5 specifies the value of the attribute
      	to be set. Note that assigning an attribute to a virtual memory
      	area might prevent it from being merged with adjacent virtual
      	memory areas due to the difference in that attribute's value.
      
      	Currently, arg2 must be one of:
      
      	PR_SET_VMA_ANON_NAME
      		Set a name for anonymous virtual memory areas. arg5 should
      		be a pointer to a null-terminated string containing the
      		name. The name length including null byte cannot exceed
      		80 bytes. If arg5 is NULL, the name of the appropriate
      		anonymous virtual memory areas will be reset. The name
      		can contain only printable ascii characters (including
                      space), except '[',']','\','$' and '`'.
      
                      This feature is available only if the kernel is built with
                      the CONFIG_ANON_VMA_NAME option enabled.
      
      [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
        Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
      [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
       added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
       work here was done by Colin Cross, therefore, with his permission, keeping
       him as the author]
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.comSigned-off-by: NColin Cross <ccross@google.com>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a10064f
    • S
      memcg: add per-memcg vmalloc stat · 4e5aa1f4
      Shakeel Butt committed
      The kvmalloc* allocation functions can fall back to vmalloc
      allocations, and they do so more often on long-running machines.  In
      addition, the kernel does have __GFP_ACCOUNT kvmalloc* calls.  So, on
      long-running machines, memory.stat often does not give a complete
      picture of which type of memory is charged to the memcg.  Add a
      per-memcg vmalloc stat to fill that gap.
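
      For illustration only (not part of this patch), a userspace reader
      could pick up the new counter from memory.stat roughly as follows,
      assuming the counter is exported under the key "vmalloc" and cgroup v2
      is mounted at /sys/fs/cgroup; the cgroup name is hypothetical:

         #include <stdio.h>
         #include <string.h>

         int main(void)
         {
                 /* Hypothetical cgroup path; adjust for the memcg of interest. */
                 FILE *f = fopen("/sys/fs/cgroup/mycontainer/memory.stat", "r");
                 char key[64];
                 unsigned long long val;

                 if (!f)
                         return 1;
                 while (fscanf(f, "%63s %llu", key, &val) == 2) {
                         if (!strcmp(key, "vmalloc"))
                                 printf("vmalloc: %llu bytes\n", val);
                 }
                 fclose(f);
                 return 0;
         }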
      
      [shakeelb@google.com: page_memcg() within rcu lock, per Muchun]
        Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com
      [akpm@linux-foundation.org: remove cast, per Muchun]
      [shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal]
        Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com
      
      Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e5aa1f4
    • D
      mm/memcg: add oom_group_kill memory event · b6bf9abb
      Dan Schatzberg committed
      Our container agent wants to know, when a container exits, whether it
      was OOM killed, so that it can report this to the user.  We use
      memory.oom.group = 1 to ensure that an OOM kill within the container's
      cgroup kills everything.  The existing memory.events are insufficient
      for knowing whether this triggered:
      
      1) Our current approach reads memory.events oom_kill and reports the
         container was killed if the value is non-zero. This is erroneous in
         some cases where containers create their child cgroups with
         memory.oom.group=1, because such OOM kills will get counted against
         the container cgroup's oom_kill counter despite not actually OOM
         killing the entire container.
      
      2) Reading memory.events.local will fail to identify OOM kills in leaf
         cgroups (that don't set memory.oom.group) within the container
         cgroup.
      
      This patch adds a new oom_group_kill event, raised when memory.oom.group
      triggers, to allow userspace to cleanly identify when an entire cgroup
      is OOM killed.
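
      As an illustrative sketch only, a container agent could consume the new
      event roughly like this; the cgroup path is hypothetical and the
      parsing is deliberately minimal:

         #include <stdio.h>
         #include <string.h>

         int main(void)
         {
                 /* Hypothetical container cgroup path (cgroup v2 assumed). */
                 FILE *f = fopen("/sys/fs/cgroup/mycontainer/memory.events", "r");
                 char key[64];
                 unsigned long long val;

                 if (!f)
                         return 1;
                 while (fscanf(f, "%63s %llu", key, &val) == 2) {
                         /* Non-zero means the whole group was OOM killed. */
                         if (!strcmp(key, "oom_group_kill") && val > 0)
                                 printf("container was OOM group killed (%llu)\n",
                                        val);
                 }
                 fclose(f);
                 return 0;
         }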
      
      [schatzberg.dan@gmail.com: changes from Johannes and Chris]
        Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com
      
      Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.comSigned-off-by: NDan Schatzberg <schatzberg.dan@gmail.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NChris Down <chris@chrisdown.name>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6bf9abb
    • M
      mm,fs: split dump_mapping() out from dump_page() · 3e9d80a8
      Matthew Wilcox (Oracle) committed
      dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
      able to call it when we don't have a struct page.  Split it out and move
      it to fs/inode.c.  Take the opportunity to simplify some of the debug
      messages a little.
      
      Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e9d80a8
    • J
      mm/memremap: add ZONE_DEVICE support for compound pages · c4386bd8
      Joao Martins committed
      Add a new @vmemmap_shift property for struct dev_pagemap which specifies
      that a devmap is composed of a set of compound pages of order
      @vmemmap_shift, instead of base pages.  When a compound page devmap is
      requested, all but the first page are initialised as tail pages instead
      of order-0 pages.
      
      For certain ZONE_DEVICE users like device-dax which have a fixed page
      size, this creates an opportunity to optimize GUP and GUP-fast walkers,
      treating it the same way as THP or hugetlb pages.
      
      Additionally, commit 7118fc29 ("hugetlb: address ref count racing in
      prep_compound_gigantic_page") removed set_page_count() because setting
      the page ref count to zero was redundant.  devmap pages don't come from
      the page allocator, though, and only the head page refcount is used for
      compound pages, hence initialize the tail page count to zero.
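
      As a hedged kernel-side sketch (not taken from this patch), a
      ZONE_DEVICE driver might request compound pages through the new field
      roughly like this; the helper name is hypothetical, the rest of the
      dev_pagemap setup is assumed to exist, and PMD order is just an example:

         #include <linux/device.h>
         #include <linux/memremap.h>
         #include <linux/mm.h>

         /*
          * Hypothetical helper: request that the devmap be built from compound
          * pages of PMD order (e.g. 2M on x86-64) instead of base pages.  All
          * other dev_pagemap fields (type, ranges, ops) are assumed to have
          * been filled in by the driver already.
          */
         static void *example_map_pages(struct device *dev,
                                        struct dev_pagemap *pgmap)
         {
                 pgmap->vmemmap_shift = PMD_SHIFT - PAGE_SHIFT;
                 return devm_memremap_pages(dev, pgmap);
         }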
      
      Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.comSigned-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4386bd8
    • K
      mm: defer kmemleak object creation of module_alloc() · 60115fa5
      Kefeng Wang committed
      Yongqiang reports a kmemleak panic during module insmod/rmmod with KASAN
      enabled (without KASAN_VMALLOC) on x86 [1].
      
      When the module area allocates memory, its kmemleak object is created
      successfully, but the KASAN shadow memory for the module allocation is
      not ready yet, so when kmemleak scans the module's pointers, the KASAN
      check panics because there is no shadow memory:
      
        module_alloc
          __vmalloc_node_range
            kmemleak_vmalloc
      				kmemleak_scan
      				  update_checksum
          kasan_module_alloc
            kmemleak_ignore
      
      Note that there is no problem if KASAN_VMALLOC is enabled, because the
      module area's entire shadow memory is preallocated.  Thus, the bug only
      exists on architectures which support dynamic allocation of the module
      area's shadow memory per module load; for now, only x86/arm64/s390 are
      involved.
      
      Add a VM_DEFER_KMEMLEAK flag and defer the kmemleak registration of the
      vmalloc'ed object in module_alloc() to fix this issue.
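
      A hedged sketch of the shape of the fix, loosely modeled on an arch
      module_alloc() call site; the exact flag handling, prot and range
      constants in the merged patch may differ:

         /*
          * Passing VM_DEFER_KMEMLEAK makes __vmalloc_node_range() skip the
          * immediate kmemleak registration; the object is registered later,
          * once the KASAN shadow for the module area has been set up.
          */
         void *p = __vmalloc_node_range(size, MODULE_ALIGN,
                                        MODULES_VADDR, MODULES_END,
                                        GFP_KERNEL, PAGE_KERNEL,
                                        VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
                                        __builtin_return_address(0));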
      
      [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/
      
      [wangkefeng.wang@huawei.com: fix build]
        Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
      [akpm@linux-foundation.org: simplify ifdefs, per Andrey]
        Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
      Fixes: 793213a8 ("s390/kasan: dynamic shadow mem allocation for modules")
      Fixes: 39d114dd ("arm64: add KASAN support")
      Fixes: bebf56a1 ("kasan: enable instrumentation of global variables")
      Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Reported-by: NYongqiang Liu <liuyongqiang13@huawei.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60115fa5
    • C
      kthread: add the helper function kthread_run_on_cpu() · 800977f6
      Cai Huoqing committed
      Add a new helper function kthread_run_on_cpu(), which combines
      kthread_create_on_cpu() and wake_up_process().
      
      In some cases, kthread_run_on_cpu() can be used directly instead of
      kthread_create_on_node()/kthread_bind()/wake_up_process(),
      kthread_create_on_cpu()/wake_up_process(), or
      kthread_create()/kthread_bind()/wake_up_process(), simplifying the code.
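
      A hedged sketch of the simplification; the worker function example_fn,
      the helper start_worker() and the "example/%u" name format are
      hypothetical (the %u is expanded to the CPU number):

         #include <linux/err.h>
         #include <linux/kthread.h>

         /* Hypothetical per-CPU worker; the body is illustrative only. */
         static int example_fn(void *data)
         {
                 /* ... per-CPU work ... */
                 return 0;
         }

         static struct task_struct *start_worker(unsigned int cpu)
         {
                 /*
                  * Before: create the CPU-bound kthread, then wake it up:
                  *
                  *   t = kthread_create_on_cpu(example_fn, NULL, cpu, "example/%u");
                  *   if (!IS_ERR(t))
                  *           wake_up_process(t);
                  *
                  * After: creation, CPU binding and wake-up in one call.
                  */
                 return kthread_run_on_cpu(example_fn, NULL, cpu, "example/%u");
         }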
      
      [akpm@linux-foundation.org: export kthread_create_on_cpu to modules]
      
      Link: https://lkml.kernel.org/r/20211022025711.3673-2-caihuoqing@baidu.comSigned-off-by: NCai Huoqing <caihuoqing@baidu.com>
      Cc: Bernard Metzler <bmt@zurich.ibm.com>
      Cc: Cai Huoqing <caihuoqing@baidu.com>
      Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Paul E . McKenney" <paulmck@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      800977f6
  3. 14 Jan, 2022 1 commit