1. 21 Oct 2022, 1 commit
  2. 04 Oct 2022, 1 commit
  3. 27 Sep 2022, 1 commit
  4. 12 Sep 2022, 2 commits
• mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process · d2226ebd
  Feng Tang authored
Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced
in commit b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple
preferred nodes"), the semantics of policy_nodemask_current() for this new
policy changed: it returns 'preferred' nodes instead of 'allowed' nodes.
      
With the changed semantics of policy_nodemask_current(), a task with
MPOL_PREFERRED_MANY policy could fail to get its reservation even though
it can fall back to other nodes (either defined by cpusets or all online
nodes) for that reservation, failing mmap calls unnecessarily early.

The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
because it, unlike MPOL_BIND, does not pose any actual hard constraint.

Michal suggested that policy_nodemask_current() is only used by hugetlb
and could be moved to hugetlb code with a more explicit name to enforce
the 'allowed' semantics, for which only the MPOL_BIND policy matters.
      
      apply_policy_zone() is made extern to be called in hugetlb code and its
      return value is changed to bool.
      
      [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/
      
      Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com
      Fixes: b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ben Widawsky <bwidawsk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2226ebd
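
As an aside, here is a minimal user-space sketch (an assumption for illustration, not part of the patch) of the situation described above: a task sets MPOL_PREFERRED_MANY and then maps hugetlb memory; before the fix the hugetlb reservation treated the preferred nodes as the only allowed ones and could fail early. Compile with -lnuma; the node number and huge page size are illustrative.

    /* Illustrative only: assumes a kernel >= 5.15, libnuma headers, and
     * at least one pre-allocated 2MB huge page on node 0. */
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MPOL_PREFERRED_MANY
    #define MPOL_PREFERRED_MANY 5   /* from linux/mempolicy.h */
    #endif

    int main(void)
    {
        unsigned long nodemask = 1UL << 0;   /* prefer node 0 */
        size_t len = 2UL << 20;              /* one 2MB huge page */

        if (set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 8 * sizeof(nodemask)))
            perror("set_mempolicy");         /* EINVAL on older kernels */

        /* The reservation made at mmap() time should only honour truly
         * binding policies (MPOL_BIND); with the fix it no longer fails
         * early for MPOL_PREFERRED_MANY tasks that can fall back. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        printf("hugetlb mapping at %p\n", p);
        return 0;
    }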
• mm/mempolicy: fix lock contention on mems_allowed · 12c1dc8e
  Abel Wu authored
      The mems_allowed field can be modified by other tasks, so it isn't safe to
      access it with alloc_lock unlocked even in the current process context.
      
      Say there are two tasks: A from cpusetA is performing set_mempolicy(2),
      and B is changing cpusetA's cpuset.mems:
      
        A (set_mempolicy)		B (echo xx > cpuset.mems)
        -------------------------------------------------------
        pol = mpol_new();
      				update_tasks_nodemask(cpusetA) {
      				  foreach t in cpusetA {
      				    cpuset_change_task_nodemask(t) {
        mpol_set_nodemask(pol) {
      				      task_lock(t); // t could be A
          new = f(A->mems_allowed);
      				      update t->mems_allowed;
          pol.create(pol, new);
      				      task_unlock(t);
        }
      				    }
      				  }
      				}
        task_lock(A);
        A->mempolicy = pol;
        task_unlock(A);
      
      In this case A's pol->nodes is computed by old mems_allowed, and could
      be inconsistent with A's new mems_allowed.
      
It is different when replacing vmas' policy: there pol->nodes only goes
wild when current_cpuset_is_being_rebound():
      
        A (mbind)			B (echo xx > cpuset.mems)
        -------------------------------------------------------
        pol = mpol_new();
        mmap_write_lock(A->mm);
      				cpuset_being_rebound = cpusetA;
      				update_tasks_nodemask(cpusetA) {
      				  foreach t in cpusetA {
      				    cpuset_change_task_nodemask(t) {
        mpol_set_nodemask(pol) {
      				      task_lock(t); // t could be A
          mask = f(A->mems_allowed);
      				      update t->mems_allowed;
          pol.create(pol, mask);
      				      task_unlock(t);
        }
      				    }
        foreach v in A->mm {
          if (cpuset_being_rebound == cpusetA)
            pol.rebind(pol, cpuset.mems);
          v->vma_policy = pol;
        }
        mmap_write_unlock(A->mm);
      				    mmap_write_lock(t->mm);
      				    mpol_rebind_mm(t->mm);
      				    mmap_write_unlock(t->mm);
      				  }
      				}
      				cpuset_being_rebound = NULL;
      
      In this case, the cpuset.mems, which has already done updating, is finally
      used for calculating pol->nodes, rather than A->mems_allowed.  So it is OK
      to call mpol_set_nodemask() with alloc_lock unlocked when doing mbind(2).
      
      Link: https://lkml.kernel.org/r/20220811124157.74888-1-wuyun.abel@bytedance.com
      Fixes: 78b132e9 ("mm/mempolicy: remove or narrow the lock on current")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      12c1dc8e
5. 30 Jul 2022, 1 commit
  6. 18 Jul 2022, 1 commit
• mm: handling Non-LRU pages returned by vm_normal_pages · 3218f871
  Alex Sierra authored
With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages.  Although they
behave like normal pages for purposes of mapping in CPU page tables and
for COW, they do not support LRU lists, NUMA migration or THP.

Callers of follow_page() currently don't expect ZONE_DEVICE pages;
however, with DEVICE_COHERENT we might now return ZONE_DEVICE pages.
Check for ZONE_DEVICE pages in applicable users of follow_page() as well.
      
Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>	[v2]
      Reviewed-by: Alistair Popple <apopple@nvidia.com>	[v6]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3218f871
7. 04 Jul 2022, 2 commits
  8. 20 May 2022, 1 commit
• mm/mempolicy: fix uninit-value in mpol_rebind_policy() · 018160ad
  Wang Cheng authored
mpol_set_nodemask() (mm/mempolicy.c) does not set up the nodemask when
pol->mode is MPOL_LOCAL.  Check pol->mode before accessing
pol->w.cpuset_mems_allowed in mpol_rebind_policy() (mm/mempolicy.c).
      
      BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline]
      BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
       mpol_rebind_policy mm/mempolicy.c:352 [inline]
       mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
       cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline]
       cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278
       cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515
       cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline]
       cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804
       __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520
       cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539
       cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852
       kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write fs/read_write.c:503 [inline]
       vfs_write+0x1318/0x2030 fs/read_write.c:590
       ksys_write+0x28b/0x510 fs/read_write.c:643
       __do_sys_write fs/read_write.c:655 [inline]
       __se_sys_write fs/read_write.c:652 [inline]
       __x64_sys_write+0xdb/0x120 fs/read_write.c:652
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Uninit was created at:
       slab_post_alloc_hook mm/slab.h:524 [inline]
       slab_alloc_node mm/slub.c:3251 [inline]
       slab_alloc mm/slub.c:3259 [inline]
       kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264
       mpol_new mm/mempolicy.c:293 [inline]
       do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853
       kernel_set_mempolicy mm/mempolicy.c:1504 [inline]
       __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline]
       __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507
       __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KMSAN: uninit-value in mpol_rebind_task (2)
      https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc
      
This patch seems to fix the bug below too.
      KMSAN: uninit-value in mpol_rebind_mm (2)
      https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b
      
      The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy().
      When syzkaller reproducer runs to the beginning of mpol_new(),
      
      	    mpol_new() mm/mempolicy.c
      	  do_mbind() mm/mempolicy.c
      	kernel_mbind() mm/mempolicy.c
      
      `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags`
      is 0. Then
      
      	mode = MPOL_LOCAL;
      	...
      	policy->mode = mode;
      	policy->flags = flags;
      
      will be executed. So in mpol_set_nodemask(),
      
      	    mpol_set_nodemask() mm/mempolicy.c
      	  do_mbind()
      	kernel_mbind()
      
pol->mode is 4 (MPOL_LOCAL), so the `nodemask` in `pol` is not initialized;
it is later accessed in mpol_rebind_policy().
      
Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain
Signed-off-by: Wang Cheng <wanngchenng@gmail.com>
Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      018160ad
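
For reference, a small user-space sketch (hypothetical, compile with -lnuma) of the syscall sequence the walkthrough above describes: an mbind() with MPOL_PREFERRED and an empty nodemask is turned into MPOL_LOCAL by mpol_new(), the policy whose w.cpuset_mems_allowed was left uninitialized until a later cpuset move triggered mpol_rebind_policy().

    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* MPOL_PREFERRED with an empty nodemask: mpol_new() rewrites this
         * to MPOL_LOCAL, the mode whose rebind path the fix now guards. */
        unsigned long empty = 0;
        if (mbind(p, len, MPOL_PREFERRED, &empty, 8 * sizeof(empty), 0))
            perror("mbind");

        /* Moving this task to another cpuset afterwards is what invokes
         * mpol_rebind_policy() on the resulting MPOL_LOCAL policy. */
        return 0;
    }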
9. 13 May 2022, 2 commits
• mm: remove alloc_pages_vma() · adf88aa8
  Matthew Wilcox (Oracle) authored
      All callers have now been converted to use vma_alloc_folio(), so convert
      the body of alloc_pages_vma() to allocate folios instead.
      
Link: https://lkml.kernel.org/r/20220504182857.4013401-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      adf88aa8
• mm/mprotect: use mmu_gather · 4a18419f
  Nadav Amit authored
      Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
      
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls.  Once this patch-set makes it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
      
      Basically, there are 3 optimizations in this patch-set:
      
      1. Use TLB batching infrastructure to batch flushes across VMAs and do
         better/fewer flushes.  This would also be handy for later userfaultfd
         enhancements.
      
      2. Avoid unnecessary TLB flushes.  This optimization is the one that
         provides most of the performance benefits.  Unlike previous versions,
         we now only avoid flushes that would not result in spurious
         page-faults.
      
      3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
         prevent the A/D bits from changing.
      
Andrew asked for some benchmark numbers.  I do not have an easy,
deterministic macrobenchmark in which it is easy to show the benefit.  I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes.  The loop goes:
      
      	mprotect(p, PAGE_SIZE, PROT_READ)
      	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
      	*p = 0; // make the page writable
      
      The test was run in KVM guest with 1 or 2 threads (the second thread was
      busy-looping).  I measured the time (cycles) of each operation:
      
      		1 thread		2 threads
      		mmots	+patch		mmots	+patch
      PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
      PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)
      
      [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]
      
      The exact numbers are really meaningless, but the benefit is clear.  There
      are 2 interesting results though.  
      
(1) PROT_READ is cheaper, while one would expect it not to be affected.
This is presumably due to the TLB miss that is saved.

(2) Without the memory access (*p = 0), the speedup of the patch is even
greater.  In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
      
      
      This patch (of 3):
      
      change_pXX_range() currently does not use mmu_gather, but instead
      implements its own deferred TLB flushes scheme.  This both complicates the
      code, as developers need to be aware of different invalidation schemes,
      and prevents opportunities to avoid TLB flushes or perform them in finer
      granularity.
      
      The use of mmu_gather for modified PTEs has benefits in various scenarios
      even if pages are not released.  For instance, if only a single page needs
      to be flushed out of a range of many pages, only that page would be
flushed.  If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which
Linux would actually use by default).  mprotect() over multiple VMAs
requires a single flush.
      
      Use mmu_gather in change_pXX_range().  As the pages are not released, only
      record the flushed range using tlb_flush_pXX_range().
      
      Handle THP similarly and get rid of flush_cache_range() which becomes
      redundant since tlb_start_vma() calls it when needed.
      
      Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4a18419f
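
A compilable variant (an illustrative sketch, not the author's harness) of the microbenchmark loop quoted above, timed with clock_gettime() instead of raw cycle counters:

    #include <sys/mman.h>
    #include <stdio.h>
    #include <time.h>

    static long long ns_now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void)
    {
        const int iters = 100000;
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        long long t0 = ns_now();
        for (int i = 0; i < iters; i++) {
            mprotect(p, 4096, PROT_READ);
            mprotect(p, 4096, PROT_READ | PROT_WRITE);
            *p = 0;   /* touch the page so it must really be writable */
        }
        printf("avg %lld ns per iteration\n", (ns_now() - t0) / iters);
        return 0;
    }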
10. 29 Apr 2022, 1 commit
  11. 09 Apr 2022, 2 commits
  12. 07 Apr 2022, 2 commits
  13. 23 Mar 2022, 2 commits
  14. 06 Mar 2022, 1 commit
  15. 15 Jan 2022, 5 commits
• mm/mempolicy: fix all kernel-doc warnings · dad5b023
  Randy Dunlap authored
      Fix kernel-doc warnings in mempolicy.c:
      
        mempolicy.c:139: warning: No description found for return value of 'numa_map_to_online_node'
        mempolicy.c:2165: warning: Excess function parameter 'node' description in 'alloc_pages_vma'
        mempolicy.c:2973: warning: No description found for return value of 'mpol_parse_str'
      
Link: https://lkml.kernel.org/r/20211213233216.5477-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dad5b023
• mm/mempolicy: add set_mempolicy_home_node syscall · c6018b4b
  Aneesh Kumar K.V authored
      This syscall can be used to set a home node for the MPOL_BIND and
      MPOL_PREFERRED_MANY memory policy.  Users should use this syscall after
      setting up a memory policy for the specified range as shown below.
      
        mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
              new_nodes->size + 1, 0);
        sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
      				home_node, 0);
      
      The syscall allows specifying a home node/preferred node from which
      kernel will fulfill memory allocation requests first.
      
      For address range with MPOL_BIND memory policy, if nodemask specifies
      more than one node, page allocations will come from the node in the
      nodemask with sufficient free memory that is closest to the home
      node/preferred node.
      
      For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
      page allocation will come from the node in the nodemask with sufficient
      free memory that is closest to the home node/preferred node.  If there
      is not enough memory in all the nodes specified in the nodemask, the
      allocation will be attempted from the closest numa node to the home node
      in the system.
      
      This helps applications to hint at a memory allocation preference node
      and fallback to _only_ a set of nodes if the memory is not available on
      the preferred node.  Fallback allocation is attempted from the node
      which is nearest to the preferred node.
      
This helps applications to have control over the NUMA nodes used for
memory allocation and avoids the default fallback to slow-memory NUMA
nodes.  For example, on a system with NUMA nodes 1, 2 and 3 backed by
DRAM and nodes 10, 11 and 12 backed by slow memory:
      
       new_nodes = numa_bitmask_alloc(nr_nodes);
      
       numa_bitmask_setbit(new_nodes, 1);
       numa_bitmask_setbit(new_nodes, 2);
       numa_bitmask_setbit(new_nodes, 3);
      
       p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
       mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
      
       sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
      
      This will allocate from nodes closer to node 2 and will make sure the
      kernel will only allocate from nodes 1, 2, and 3.  Memory will not be
      allocated from slow memory nodes 10, 11, and 12.  This differs from
      default MPOL_BIND behavior in that with default MPOL_BIND the allocation
      will be attempted from node closer to the local node.  One of the
      reasons to specify a home node is to allow allocations from cpu less
      NUMA node and its nearby NUMA nodes.
      
MPOL_PREFERRED_MANY, on the other hand, will first try to allocate from
the node closest to node 2 out of nodes 1, 2 and 3.  If those nodes
don't have enough memory, the kernel will allocate from slow memory
nodes 10, 11 and 12, whichever is closest to node 2.
      
Link: https://lkml.kernel.org/r/20211202123810.267175-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6018b4b
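
A self-contained variant of the snippet above that calls the new syscall through syscall(2), since a libc wrapper may not be available; the syscall number 450 is the x86-64 one and is an assumption for other architectures. Compile with -lnuma; node numbers are illustrative.

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    #ifndef __NR_set_mempolicy_home_node
    #define __NR_set_mempolicy_home_node 450   /* x86-64 */
    #endif

    int main(void)
    {
        size_t len = 64 * 4096;
        unsigned long nodes = (1UL << 1) | (1UL << 2) | (1UL << 3);  /* nodes 1,2,3 */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Bind the range to nodes 1-3, then ask for allocations to be
         * satisfied from the node closest to home node 2 first. */
        if (mbind(p, len, MPOL_BIND, &nodes, 8 * sizeof(nodes), 0))
            perror("mbind");
        if (syscall(__NR_set_mempolicy_home_node, (unsigned long)p, len, 2, 0))
            perror("set_mempolicy_home_node");   /* ENOSYS on older kernels */
        return 0;
    }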
• mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY · c0455116
  Aneesh Kumar K.V authored
      Patch series "mm: add new syscall set_mempolicy_home_node", v6.
      
      This patch (of 3):
      
A followup patch will enable setting a home node with the
MPOL_PREFERRED_MANY memory policy.  To facilitate that, switch to using
the policy_node() helper.  There is no functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20211202123810.267175-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20211202123810.267175-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0455116
• mm: drop node from alloc_pages_vma · be1a13eb
  Michal Hocko authored
      alloc_pages_vma is meant to allocate a page with a vma specific memory
      policy.  The initial node parameter is always a local node so it is
      pointless to waste a function argument for this.  Drop the parameter.
      
Link: https://lkml.kernel.org/r/YaSnlv4QpryEpesG@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be1a13eb
• mm: add a field to store names for private anonymous memory · 9a10064f
  Colin Cross authored
In many userspace applications, and especially in VM-based applications
like those Android uses heavily, there are multiple different allocators
in use.  At a minimum there is libc malloc and the stack, and in many cases
      there are libc malloc, the stack, direct syscalls to mmap anonymous
      memory, and multiple VM heaps (one for small objects, one for big
      objects, etc.).  Each of these layers usually has its own tools to
      inspect its usage; malloc by compiling a debug version, the VM through
      heap inspection tools, and for direct syscalls there is usually no way
      to track them.
      
      On Android we heavily use a set of tools that use an extended version of
      the logic covered in Documentation/vm/pagemap.txt to walk all pages
      mapped in userspace and slice their usage by process, shared (COW) vs.
      unique mappings, backing, etc.  This can account for real physical
      memory usage even in cases like fork without exec (which Android uses
      heavily to share as many private COW pages as possible between
      processes), Kernel SamePage Merging, and clean zero pages.  It produces
      a measurement of the pages that only exist in that process (USS, for
      unique), and a measurement of the physical memory usage of that process
      with the cost of shared pages being evenly split between processes that
      share them (PSS).
      
      If all anonymous memory is indistinguishable then figuring out the real
      physical memory usage (PSS) of each heap requires either a pagemap
      walking tool that can understand the heap debugging of every layer, or
      for every layer's heap debugging tools to implement the pagemap walking
      logic, in which case it is hard to get a consistent view of memory
      across the whole system.
      
      Tracking the information in userspace leads to all sorts of problems.
      It either needs to be stored inside the process, which means every
      process has to have an API to export its current heap information upon
      request, or it has to be stored externally in a filesystem that somebody
      needs to clean up on crashes.  It needs to be readable while the process
      is still running, so it has to have some sort of synchronization with
      every layer of userspace.  Efficiently tracking the ranges requires
      reimplementing something like the kernel vma trees, and linking to it
      from every layer of userspace.  It requires more memory, more syscalls,
      more runtime cost, and more complexity to separately track regions that
      the kernel is already tracking.
      
      This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
      userspace-provided name for anonymous vmas.  The names of named
      anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
      [anon:<name>].
      
      Userspace can set the name for a region of memory by calling
      
         prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
      
      Setting the name to NULL clears it.  The name length limit is 80 bytes
      including NUL-terminator and is checked to contain only printable ascii
      characters (including space), except '[',']','\','$' and '`'.
      
ASCII strings are used as descriptive identifiers for vmas, which can
be understood by users reading /proc/pid/maps or
      /proc/pid/smaps.  Names can be standardized for a given system and they
      can include some variable parts such as the name of the allocator or a
      library, tid of the thread using it, etc.
      
      The name is stored in a pointer in the shared union in vm_area_struct
that points to a null-terminated string.  Anonymous vmas with the same
name (equivalent strings) that are otherwise mergeable will be merged.
      The name pointers are not shared between vmas even if they contain the
      same name.  The name pointer is stored in a union with fields that are
      only used on file-backed mappings, so it does not increase memory usage.
      
      CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
      feature.  It keeps the feature disabled by default to prevent any
      additional memory overhead and to avoid confusing procfs parsers on
      systems which are not ready to support named anonymous vmas.
      
      The patch is based on the original patch developed by Colin Cross, more
      specifically on its latest version [1] posted upstream by Sumit Semwal.
      It used a userspace pointer to store vma names.  In that design, name
      pointers could be shared between vmas.  However during the last
      upstreaming attempt, Kees Cook raised concerns [2] about this approach
      and suggested to copy the name into kernel memory space, perform
      validity checks [3] and store as a string referenced from
      vm_area_struct.
      
      One big concern is about fork() performance which would need to strdup
      anonymous vma names.  Dave Hansen suggested experimenting with
      worst-case scenario of forking a process with 64k vmas having longest
      possible names [4].  I ran this experiment on an ARM64 Android device
      and recorded a worst-case regression of almost 40% when forking such a
      process.
      
      This regression is addressed in the followup patch which replaces the
      pointer to a name with a refcounted structure that allows sharing the
      name pointer between vmas of the same name.  Instead of duplicating the
      string during fork() or when splitting a vma it increments the refcount.
      
      [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
      [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
      [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
      [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
      
      Changes for prctl(2) manual page (in the options section):
      
      PR_SET_VMA
      	Sets an attribute specified in arg2 for virtual memory areas
      	starting from the address specified in arg3 and spanning the
      	size specified	in arg4. arg5 specifies the value of the attribute
      	to be set. Note that assigning an attribute to a virtual memory
      	area might prevent it from being merged with adjacent virtual
      	memory areas due to the difference in that attribute's value.
      
      	Currently, arg2 must be one of:
      
      	PR_SET_VMA_ANON_NAME
      		Set a name for anonymous virtual memory areas. arg5 should
      		be a pointer to a null-terminated string containing the
      		name. The name length including null byte cannot exceed
      		80 bytes. If arg5 is NULL, the name of the appropriate
      		anonymous virtual memory areas will be reset. The name
      		can contain only printable ascii characters (including
                      space), except '[',']','\','$' and '`'.
      
                      This feature is available only if the kernel is built with
                      the CONFIG_ANON_VMA_NAME option enabled.
      
      [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
        Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
      [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
       added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
       work here was done by Colin Cross, therefore, with his permission, keeping
       him as the author]
      
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a10064f
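
A small demo of the prctl() interface described above. The fallback constants mirror linux/prctl.h and are only there in case older userspace headers lack them; the region name is arbitrary.

    #include <sys/prctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA              0x53564d41   /* from linux/prctl.h */
    #define PR_SET_VMA_ANON_NAME    0
    #endif

    int main(void)
    {
        size_t len = 16 * 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Name the region; it shows up as [anon:my heap] in
         * /proc/pid/maps on kernels built with CONFIG_ANON_VMA_NAME. */
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, (unsigned long)"my heap"))
            perror("prctl(PR_SET_VMA_ANON_NAME)");

        char cmd[64];
        snprintf(cmd, sizeof(cmd), "grep anon: /proc/%d/maps", getpid());
        system(cmd);
        return 0;
    }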
16. 26 Dec 2021, 1 commit
• mm: mempolicy: fix THP allocations escaping mempolicy restrictions · 33863534
  Andrey Ryabinin authored
      alloc_pages_vma() may try to allocate THP page on the local NUMA node
      first:
      
      	page = __alloc_pages_node(hpage_node,
      		gfp | __GFP_THISNODE | __GFP_NORETRY, order);
      
      And if the allocation fails it retries allowing remote memory:
      
      	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
          		page = __alloc_pages_node(hpage_node,
      					gfp, order);
      
      However, this retry allocation completely ignores memory policy nodemask
      allowing allocation to escape restrictions.
      
      The first appearance of this bug seems to be the commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings").
      
      The bug disappeared later in the commit 89c83fb5 ("mm, thp:
      consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and
      reappeared again in slightly different form in the commit 76e654cc
      ("mm, page_alloc: allow hugepage fallback to remote nodes when
      madvised")
      
      Fix this by passing correct nodemask to the __alloc_pages() call.
      
      The demonstration/reproducer of the problem:
      
          $ mount -oremount,size=4G,huge=always /dev/shm/
          $ echo always > /sys/kernel/mm/transparent_hugepage/defrag
          $ cat mbind_thp.c
          #include <unistd.h>
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <assert.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <numaif.h>
      
          #define SIZE 2ULL << 30
          int main(int argc, char **argv)
          {
              int fd;
              unsigned long long i;
              char *addr;
              pid_t pid;
              char buf[100];
              unsigned long nodemask = 1;
      
              fd = open("/dev/shm/test", O_RDWR|O_CREAT);
              assert(fd > 0);
              assert(ftruncate(fd, SIZE) == 0);
      
              addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                 MAP_SHARED, fd, 0);
      
              assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2, MPOL_MF_STRICT|MPOL_MF_MOVE)==0);
              for (i = 0; i < SIZE; i+=4096) {
                addr[i] = 1;
              }
              pid = getpid();
              snprintf(buf, sizeof(buf), "grep shm /proc/%d/numa_maps", pid);
              system(buf);
              sleep(10000);
      
              return 0;
          }
          $ gcc mbind_thp.c -o mbind_thp -lnuma
          $ numactl -H
          available: 2 nodes (0-1)
          node 0 cpus: 0 2
          node 0 size: 1918 MB
          node 0 free: 1595 MB
          node 1 cpus: 1 3
          node 1 size: 2014 MB
          node 1 free: 1731 MB
          node distances:
          node   0   1
            0:  10  20
            1:  20  10
          $ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp
          7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4
      
      Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com
      Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33863534
17. 07 Nov 2021, 2 commits
  18. 19 Oct 2021, 1 commit
• mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() · 6d2aec9e
  Eric Dumazet authored
syzbot reported access to uninitialized memory in mbind() [1].

The issue came with commit bda420b9 ("numa balancing: migrate on fault
among multiple bound nodes").

That commit added a new bit in MPOL_MODE_FLAGS, but only checked for a
valid combination (MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND)
in do_set_mempolicy().

This patch moves the check into sanitize_mpol_flags() so that it is also
applied for mbind().
      
        [1]
        BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Uninit was created at:
         slab_alloc_node mm/slub.c:3221 [inline]
         slab_alloc mm/slub.c:3230 [inline]
         kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235
         mpol_new mm/mempolicy.c:293 [inline]
         do_mbind+0x912/0x15f0 mm/mempolicy.c:1289
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        =====================================================
        Kernel panic - not syncing: panic_on_kmsan set ...
        CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G    B             5.15.0-rc2-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106
         dump_stack+0x25/0x28 lib/dump_stack.c:113
         panic+0x44f/0xdeb kernel/panic.c:232
         kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186
         __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com
      Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d2aec9e
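
An illustrative user-space check (an assumption, not taken from the patch) of the behaviour this fix enforces: MPOL_F_NUMA_BALANCING is only valid together with MPOL_BIND, so combining it with MPOL_LOCAL in mbind() should now fail with EINVAL. Compile with -lnuma; the fallback defines mirror linux/mempolicy.h.

    #include <numaif.h>
    #include <sys/mman.h>
    #include <string.h>
    #include <errno.h>
    #include <stdio.h>

    #ifndef MPOL_LOCAL
    #define MPOL_LOCAL              4          /* from linux/mempolicy.h */
    #endif
    #ifndef MPOL_F_NUMA_BALANCING
    #define MPOL_F_NUMA_BALANCING   (1 << 13)  /* from linux/mempolicy.h */
    #endif

    int main(void)
    {
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Only MPOL_BIND may carry MPOL_F_NUMA_BALANCING; with the fix,
         * mbind() rejects this combination just like set_mempolicy(). */
        if (mbind(p, len, MPOL_LOCAL | MPOL_F_NUMA_BALANCING, NULL, 0, 0))
            printf("mbind rejected the combination: %s\n", strerror(errno));
        return 0;
    }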
19. 18 Oct 2021, 1 commit
  20. 09 Sep 2021, 3 commits
• mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task · 276aeee1
  yanghui authored
Servers hit the panic below:
      
        Kernel version:5.4.56
        BUG: unable to handle page fault for address: 0000000000002c48
        RIP: 0010:__next_zones_zonelist+0x1d/0x40
        Call Trace:
          __alloc_pages_nodemask+0x277/0x310
          alloc_page_interleave+0x13/0x70
          handle_mm_fault+0xf99/0x1390
          __do_page_fault+0x288/0x500
          do_page_fault+0x30/0x110
          page_fault+0x3e/0x50
      
      The reason for the panic is that MAX_NUMNODES is passed in the third
      parameter in __alloc_pages_nodemask(preferred_nid).  So access to
      zonelist->zoneref->zone_idx in __next_zones_zonelist will cause a panic.
      
In offset_il_node(), first_node() returns a nid from pol->v.nodes; after
this, other threads may change pol->v.nodes before next_node().  This race
condition can make next_node() return MAX_NUMNODES.  So put pol->nodes in
a local variable.
      
      The race condition is between offset_il_node and cpuset_change_task_nodemask:
      
        CPU0:                                     CPU1:
        alloc_pages_vma()
          interleave_nid(pol,)
            offset_il_node(pol,)
              first_node(pol->v.nodes)            cpuset_change_task_nodemask
                              //nodes==0xc          mpol_rebind_task
                                                      mpol_rebind_policy
                                                        mpol_rebind_nodemask(pol,nodes)
                              //nodes==0x3
              next_node(nid, pol->v.nodes)//return MAX_NUMNODES
      
Link: https://lkml.kernel.org/r/20210906034658.48721-1-yanghui.def@bytedance.com
Signed-off-by: yanghui <yanghui.def@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      276aeee1
• compat: remove some compat entry points · 59ab844e
  Arnd Bergmann authored
      These are all handled correctly when calling the native system call entry
      point, so remove the special cases.
      
Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59ab844e
• mm: simplify compat numa syscalls · e130242d
  Arnd Bergmann authored
      The compat implementations for mbind, get_mempolicy, set_mempolicy and
      migrate_pages are just there to handle the subtly different layout of
      bitmaps on 32-bit hosts.
      
      The compat implementation however lacks some of the checks that are
      present in the native one, in particular for checking that the extra bits
      are all zero when user space has a larger mask size than the kernel.
      Worse, those extra bits do not get cleared when copying in or out of the
      kernel, which can lead to incorrect data as well.
      
      Unify the implementation to handle the compat bitmap layout directly in
      the get_nodes() and copy_nodes_to_user() helpers.  Splitting out the
      get_bitmap() helper from get_nodes() also helps readability of the native
      case.
      
      On x86, two additional problems are addressed by this: compat tasks can
      pass a bitmap at the end of a mapping, causing a fault when reading across
      the page boundary for a 64-bit word.  x32 tasks might also run into
      problems with get_mempolicy corrupting data when an odd number of 32-bit
      words gets passed.
      
      On parisc the migrate_pages() system call apparently had the wrong calling
      convention, as big-endian architectures expect the words inside of a
      bitmap to be swapped.  This is not a problem though since parisc has no
      NUMA support.
      
      [arnd@arndb.de: fix mempolicy crash]
        Link: https://lkml.kernel.org/r/20210730143417.3700653-1-arnd@kernel.org
        Link: https://lore.kernel.org/lkml/YQPLG20V3dmOfq3a@osiris/
      
Link: https://lkml.kernel.org/r/20210727144859.4150043-5-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e130242d
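
A user-space illustration (an assumption, not kernel code) of the layout difference the unified get_bitmap() helper has to handle: a 32-bit task supplies the node mask as 32-bit words, which must be assembled into the kernel's 64-bit words rather than copied byte-for-byte.

    #include <stdint.h>
    #include <stdio.h>

    /* Assemble a nodemask given as 32-bit words (compat layout) into 64-bit
     * words (native layout).  Bit 0 of src[0] stays bit 0 of dst[0]; a raw
     * byte copy would only happen to produce this on little-endian hosts. */
    static void nodes_from_compat(uint64_t *dst, const uint32_t *src,
                                  unsigned int nwords32)
    {
        for (unsigned int i = 0; i < nwords32; i += 2) {
            uint64_t lo = src[i];
            uint64_t hi = (i + 1 < nwords32) ? src[i + 1] : 0;
            dst[i / 2] = (hi << 32) | lo;
        }
    }

    int main(void)
    {
        uint32_t compat[2] = { 0x00000005, 0x00000001 };  /* nodes 0, 2 and 32 */
        uint64_t native[1];

        nodes_from_compat(native, compat, 2);
        printf("native nodemask word: %#018llx\n",
               (unsigned long long)native[0]);   /* 0x0000000100000005 */
        return 0;
    }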
21. 04 Sep 2021, 7 commits
• mm/mempolicy.c: use in_task() in mempolicy_slab_node() · 38b031dd
  Vasily Averin authored
The obsolete in_interrupt() check also covers task context with BH
disabled; it's better to use in_task() instead.
      
Link: https://lkml.kernel.org/r/984ee771-4834-21da-801f-c15c18ddf4d1@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38b031dd
• mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies · be897d48
  Feng Tang authored
Since they all do the same thing (sanity check and save nodemask info),
create one mpol_new_nodemask() to reduce redundancy.
      
Link: https://lkml.kernel.org/r/1627970362-61305-6-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be897d48
• mm/mempolicy: advertise new MPOL_PREFERRED_MANY · a38a59fd
  Ben Widawsky authored
      Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
      
      MPOL_PREFERRED_MANY will be adequately documented in the internal
      admin-guide with this patch.  Eventually, the man pages for mbind(2),
      get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
      about this mode.  Those shall contain the canonical reference.
      
      NUMA systems continue to become more prevalent.  New technologies like
      PMEM make finer grain control over memory access patterns increasingly
      desirable.  MPOL_PREFERRED_MANY allows userspace to specify a set of nodes
      that will be tried first when performing allocations.  If those
allocations fail, all remaining nodes will be tried.  It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines.  The mode
will work either per VMA or per thread.
      
      [Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY]
      
      Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.com
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a38a59fd
• mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy · 4c54d949
  Feng Tang authored
The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED, in that
it will first try to allocate memory from the preferred node(s), and fall
back to all nodes in the system when the first try fails.

Add a dedicated function alloc_pages_preferred_many() for it, just like for
the 'interleave' policy, which will be used by the 2 general memory
allocation APIs: alloc_pages() and alloc_pages_vma().
      
      Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-3-git-send-email-feng.tang@intel.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Originally-by: Ben Widawsky <ben.widawsky@intel.com>
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c54d949
• mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes · b27abacc
  Dave Hansen authored
      Patch series "Introduce multi-preference mempolicy", v7.
      
      This patch series introduces the concept of the MPOL_PREFERRED_MANY
      mempolicy.  This mempolicy mode can be used with either the
      set_mempolicy(2) or mbind(2) interfaces.  Like the MPOL_PREFERRED
      interface, it allows an application to set a preference for nodes which
      will fulfil memory allocation requests.  Unlike the MPOL_PREFERRED mode,
      it takes a set of nodes.  Like the MPOL_BIND interface, it works over a
      set of nodes.  Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the
      OOM killer if those preferred nodes are not available.
      
      Along with these patches are patches for libnuma, numactl, numademo, and
      memhog.  They still need some polish, but can be found here:
      https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new
      usage: `numactl -P 0,3,4`
      
      The goal of the new mode is to enable some use-cases when using tiered memory
      usage models which I've lovingly named.
      
      1a. The Hare - The interconnect is fast enough to meet bandwidth and
          latency requirements allowing preference to be given to all nodes with
          "fast" memory.
      1b. The Indiscriminate Hare - An application knows it wants fast
          memory (or perhaps slow memory), but doesn't care which node it runs
          on.  The application can prefer a set of nodes and then xpu bind to
    the local node (cpu, accelerator, etc).  This reverses how nodes are
    chosen today, where the kernel attempts to use memory local to the CPU
    whenever possible.  This will instead attempt to use the accelerator
    local to the memory.
      2.  The Tortoise - The administrator (or the application itself) is
          aware it only needs slow memory, and so can prefer that.
      
      Much of this is almost achievable with the bind interface, but the bind
      interface suffers from an inability to fallback to another set of nodes if
      binding fails to all nodes in the nodemask.
      
      Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
      preference.
      
      > /* Set first two nodes as preferred in an 8 node system. */
      > const unsigned long nodes = 0x3
      > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
      
> /* Mimic interleave policy, but have fallback. */
      > const unsigned long nodes = 0xaa
      > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
      
      Some internal discussion took place around the interface. There are two
      alternatives which we have discussed, plus one I stuck in:
      
1. Ordered list of nodes.  Currently it's believed that the added
   complexity is not needed for expected use cases.
      2. A flag for bind to allow falling back to other nodes.  This
         confuses the notion of binding and is less flexible than the current
         solution.
      3. Create flags or new modes that helps with some ordering.  This
         offers both a friendlier API as well as a solution for more customized
         usage.  It's unknown if it's worth the complexity to support this.
         Here is sample code for how this might work:
      
> // Prefer specific nodes for something wacky
      > set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
      >
      > // Default
      > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
      > // which is the same as
      > set_mempolicy(MPOL_DEFAULT, NULL, 0);
      >
      > // The Hare
      > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
      >
      > // The Tortoise
      > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
      >
      > // Prefer the fast memory of the first two sockets
      > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
      >
      
      This patch (of 5):
      
      The NUMA APIs currently allow passing in a "preferred node" as a single
bit set in a nodemask.  If more than one bit is set, bits after the first
      are ignored.
      
      This single node is generally OK for location-based NUMA where memory
      being allocated will eventually be operated on by a single CPU.  However,
      in systems with multiple memory types, folks want to target a *type* of
      memory instead of a location.  For instance, someone might want some
      high-bandwidth memory but do not care about the CPU next to which it is
      allocated.  Or, they want a cheap, high capacity allocation and want to
      target all NUMA nodes which have persistent memory in volatile mode.  In
      both of these cases, the application wants to target a *set* of nodes, but
      does not want strict MPOL_BIND behavior as that could lead to OOM killer
      or SIGSEGV.
      
      So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes
      requirement.  This is not a pie-in-the-sky dream for an API.  This was a
      response to a specific ask of more than one group at Intel.  Specifically:
      
      1. There are existing libraries that target memory types such as
         https://github.com/memkind/memkind.  These are known to suffer from
         SIGSEGV's when memory is low on targeted memory "kinds" that span more
         than one node.  The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an
         example of this.
      
      2. Volatile-use persistent memory users want to have a memory policy
         which is targeted at either "cheap and slow" (PMEM) or "expensive and
         fast" (DRAM).  However, they do not want to experience allocation
         failures when the targeted type is unavailable.
      
      3. Allocate-then-run.  Generally, we let the process scheduler decide
         on which physical CPU to run a task.  That location provides a default
         allocation policy, and memory availability is not generally considered
         when placing tasks.  For situations where memory is valuable and
         constrained, some users want to allocate memory first, *then* allocate
         close compute resources to the allocation.  This is the reverse of the
         normal (CPU) model.  Accelerators such as GPUs that operate on
         core-mm-managed memory are interested in this model.
      
      A check is added in sanitize_mpol_flags() to not permit 'prefer_many'
      policy to be used for now, and will be removed in later patch after all
      implementations for 'prefer_many' are ready, as suggested by Michal Hocko.
      
      [mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling]
      
      Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com
      Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b27abacc
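
A compilable version of the set_mempolicy() snippets from the cover letter above (illustrative only; compile with -lnuma). The MPOL_PREFERRED_MANY fallback define mirrors linux/mempolicy.h on kernels that support the mode.

    #include <numaif.h>
    #include <stdio.h>

    #ifndef MPOL_PREFERRED_MANY
    #define MPOL_PREFERRED_MANY 5   /* from linux/mempolicy.h */
    #endif

    int main(void)
    {
        /* Prefer the first two nodes of an 8-node system; allocations fall
         * back to the remaining nodes instead of OOM-killing or SIGSEGV. */
        unsigned long nodes = 0x3;

        if (set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8))
            perror("set_mempolicy(MPOL_PREFERRED_MANY)");
        else
            printf("preferring nodes 0 and 1, with fallback allowed\n");
        return 0;
    }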
• mm/mempolicy: use readable NUMA_NO_NODE macro instead of magic number · 062db293
  Baolin Wang authored
The caller of mpol_misplaced() already uses NUMA_NO_NODE to check whether
the current page node is misplaced, so using NUMA_NO_NODE instead of a
magic number in mpol_misplaced() is more readable.
      
Link: https://lkml.kernel.org/r/1b77c0ce21183fa86f4db250b115cf5e27396528.1627558356.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      062db293
• mm/migrate: add sysfs interface to enable reclaim migration · 20b51af1
  Huang Ying authored
      Some method is obviously needed to enable reclaim-based migration.
      
      Just like traditional autonuma, there will be some workloads that will
      benefit like workloads with more "static" configurations where hot pages
      stay hot and cold pages stay cold.  If pages come and go from the hot and
      cold sets, the benefits of this approach will be more limited.
      
      The benefits are truly workload-based and *not* hardware-based.  We do not
      believe that there is a viable threshold where certain hardware
      configurations should have this mechanism enabled while others do not.
      
To be conservative, earlier work defaulted to disabling reclaim-based
migration and did not include a mechanism to enable it.  This patch
proposes adding a new sysfs file
      
        /sys/kernel/mm/numa/demotion_enabled
      
      as a method to enable it.
      
      We are open to any alternative that allows end users to enable this
      mechanism or disable it if workload harm is detected (just like
      traditional autonuma).
      
      Once this is enabled page demotion may move data to a NUMA node that does
      not fall into the cpuset of the allocating process.  This could be
      construed to violate the guarantees of cpusets.  However, since this is an
      opt-in mechanism, the assumption is that anyone enabling it is content to
      relax the guarantees.
      
      Link: https://lkml.kernel.org/r/20210721063926.3024591-9-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-10-ying.huang@intel.com
Signed-off-by: Huang Ying <ying.huang@intel.com>
Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20b51af1
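
A minimal sketch of flipping the new knob from a program rather than a shell (requires root; the sysfs path is the one introduced above, and the accepted values are assumed to be the usual kernel boolean strings):

    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/kernel/mm/numa/demotion_enabled";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return 1;
        }
        fputs("true\n", f);   /* enable reclaim-based demotion */
        fclose(f);
        return 0;
    }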