1. 07 Apr 2022: 2 commits
  2. 23 Mar 2022: 2 commits
  3. 06 Mar 2022: 1 commit
  4. 15 Jan 2022: 5 commits
    • mm/mempolicy: fix all kernel-doc warnings · dad5b023
      Randy Dunlap authored
      Fix kernel-doc warnings in mempolicy.c:
      
        mempolicy.c:139: warning: No description found for return value of 'numa_map_to_online_node'
        mempolicy.c:2165: warning: Excess function parameter 'node' description in 'alloc_pages_vma'
        mempolicy.c:2973: warning: No description found for return value of 'mpol_parse_str'
      
      Link: https://lkml.kernel.org/r/20211213233216.5477-1-rdunlap@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: add set_mempolicy_home_node syscall · c6018b4b
      Aneesh Kumar K.V authored
      This syscall can be used to set a home node for the MPOL_BIND and
      MPOL_PREFERRED_MANY memory policies.  Users should call it after
      setting up a memory policy for the specified range, as shown below.
      
        mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
              new_nodes->size + 1, 0);
        sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
      				home_node, 0);
      
      The syscall allows specifying a home node/preferred node from which
      the kernel will try to fulfill memory allocation requests first.
      
      For an address range with the MPOL_BIND memory policy, if the nodemask
      specifies more than one node, page allocations will come from the node
      in the nodemask with sufficient free memory that is closest to the home
      node/preferred node.

      For MPOL_PREFERRED_MANY, if the nodemask specifies more than one node,
      page allocations likewise come from the node in the nodemask with
      sufficient free memory that is closest to the home node/preferred node.
      If there is not enough memory on any of the nodes specified in the
      nodemask, the allocation is attempted from the NUMA node closest to the
      home node anywhere in the system.
      
      This lets applications hint at a preferred memory allocation node while
      restricting fallback to _only_ a given set of nodes if memory is not
      available on the preferred node.  Fallback allocation is attempted from
      the node nearest to the preferred node.
      
      This gives applications control over the NUMA nodes used for memory
      allocation and avoids the default fallback to slow-memory NUMA nodes.
      For example, consider a system with NUMA nodes 1, 2 and 3 backed by
      DRAM and nodes 10, 11 and 12 backed by slow memory:
      
       new_nodes = numa_bitmask_alloc(nr_nodes);
      
       numa_bitmask_setbit(new_nodes, 1);
       numa_bitmask_setbit(new_nodes, 2);
       numa_bitmask_setbit(new_nodes, 3);
      
       p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
       mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,  new_nodes->size + 1, 0);
      
       sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
      
      This will allocate from nodes closer to node 2 and will make sure the
      kernel only allocates from nodes 1, 2, and 3.  Memory will not be
      allocated from slow-memory nodes 10, 11, and 12.  This differs from the
      default MPOL_BIND behavior, where the allocation is attempted from the
      node closest to the local node.  One of the reasons to specify a home
      node is to allow allocations from CPU-less NUMA nodes and their nearby
      NUMA nodes.
      
      With MPOL_PREFERRED_MANY, on the other hand, the kernel will first try
      to allocate from the node closest to node 2 among nodes 1, 2 and 3.  If
      those nodes don't have enough memory, it will fall back to slow-memory
      nodes 10, 11 and 12, whichever is closest to node 2.
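
      A minimal user-space sketch of the MPOL_PREFERRED_MANY variant is
      below.  It is illustrative only: the MPOL_PREFERRED_MANY and
      __NR_set_mempolicy_home_node fallback values are assumptions guarded
      for headers that predate them, the node numbers come from the example
      system above, and error handling is omitted for brevity.

        /* Sketch: prefer nodes 1-3 with node 2 as the home node, while still
         * permitting fallback to other nodes (unlike the MPOL_BIND case). */
        #define _GNU_SOURCE
        #include <numa.h>
        #include <numaif.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef MPOL_PREFERRED_MANY
        #define MPOL_PREFERRED_MANY 5            /* assumed uapi value */
        #endif
        #ifndef __NR_set_mempolicy_home_node
        #define __NR_set_mempolicy_home_node 450 /* assumed generic syscall nr */
        #endif

        int main(void)
        {
                unsigned long sz = 64UL << 20;
                struct bitmask *new_nodes =
                        numa_bitmask_alloc(numa_num_possible_nodes());
                void *p;

                numa_bitmask_setbit(new_nodes, 1);
                numa_bitmask_setbit(new_nodes, 2);
                numa_bitmask_setbit(new_nodes, 3);

                p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                /* Prefer nodes 1-3, but do not forbid other nodes ... */
                mbind(p, sz, MPOL_PREFERRED_MANY, new_nodes->maskp,
                      new_nodes->size + 1, 0);
                /* ... and ask for allocations closest to home node 2 first. */
                syscall(__NR_set_mempolicy_home_node,
                        (unsigned long)p, sz, 2, 0);
                return 0;
        }

      Build with -lnuma.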
      
      Link: https://lkml.kernel.org/r/20211202123810.267175-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: use policy_node helper with MPOL_PREFERRED_MANY · c0455116
      Aneesh Kumar K.V authored
      Patch series "mm: add new syscall set_mempolicy_home_node", v6.
      
      This patch (of 3):
      
      A follow-up patch will enable setting a home node with the
      MPOL_PREFERRED_MANY memory policy.  To facilitate that, switch to using
      the policy_node helper.  There is no functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20211202123810.267175-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20211202123810.267175-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: drop node from alloc_pages_vma · be1a13eb
      Michal Hocko authored
      alloc_pages_vma is meant to allocate a page with a vma-specific memory
      policy.  The initial node parameter is always the local node, so it is
      pointless to waste a function argument on it.  Drop the parameter.
      
      Link: https://lkml.kernel.org/r/YaSnlv4QpryEpesG@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add a field to store names for private anonymous memory · 9a10064f
      Colin Cross authored
      In many userspace applications, and especially in VM-based applications
      such as those Android relies on heavily, multiple different allocators
      are in use.  At a minimum there are libc malloc and the stack, and in
      many cases there are libc malloc, the stack, direct syscalls to mmap
      anonymous memory, and multiple VM heaps (one for small objects, one for
      big objects, etc.).  Each of these layers usually has its own tools to
      inspect its usage: malloc via a debug build, the VM through heap
      inspection tools, and for direct syscalls there is usually no way to
      track them.
      
      On Android we heavily use a set of tools that use an extended version of
      the logic covered in Documentation/vm/pagemap.txt to walk all pages
      mapped in userspace and slice their usage by process, shared (COW) vs.
      unique mappings, backing, etc.  This can account for real physical
      memory usage even in cases like fork without exec (which Android uses
      heavily to share as many private COW pages as possible between
      processes), Kernel SamePage Merging, and clean zero pages.  It produces
      a measurement of the pages that only exist in that process (USS, for
      unique), and a measurement of the physical memory usage of that process
      with the cost of shared pages being evenly split between processes that
      share them (PSS).
      
      If all anonymous memory is indistinguishable then figuring out the real
      physical memory usage (PSS) of each heap requires either a pagemap
      walking tool that can understand the heap debugging of every layer, or
      for every layer's heap debugging tools to implement the pagemap walking
      logic, in which case it is hard to get a consistent view of memory
      across the whole system.
      
      Tracking the information in userspace leads to all sorts of problems.
      It either needs to be stored inside the process, which means every
      process has to have an API to export its current heap information upon
      request, or it has to be stored externally in a filesystem that somebody
      needs to clean up on crashes.  It needs to be readable while the process
      is still running, so it has to have some sort of synchronization with
      every layer of userspace.  Efficiently tracking the ranges requires
      reimplementing something like the kernel vma trees, and linking to it
      from every layer of userspace.  It requires more memory, more syscalls,
      more runtime cost, and more complexity to separately track regions that
      the kernel is already tracking.
      
      This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
      userspace-provided name for anonymous vmas.  The names of named
      anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
      [anon:<name>].
      
      Userspace can set the name for a region of memory by calling
      
         prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
      
      Setting the name to NULL clears it.  The name length limit is 80 bytes
      including the NUL terminator, and the name is checked to contain only
      printable ASCII characters (including space), except '[', ']', '\', '$'
      and '`'.
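
      A small user-space sketch of this interface is below.  It is a sketch
      only: the PR_SET_VMA and PR_SET_VMA_ANON_NAME fallback values are taken
      from the uapi header and are assumptions if your libc headers differ,
      it needs a kernel built with CONFIG_ANON_VMA_NAME, and most error
      handling is omitted for brevity.

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/prctl.h>

        #ifndef PR_SET_VMA
        #define PR_SET_VMA 0x53564d41           /* assumed uapi value */
        #endif
        #ifndef PR_SET_VMA_ANON_NAME
        #define PR_SET_VMA_ANON_NAME 0          /* assumed uapi value */
        #endif

        int main(void)
        {
                size_t len = 16 * 4096;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                char line[256];
                FILE *f;

                /* The region should now appear as [anon:my heap] in maps. */
                prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p, len,
                      (unsigned long)"my heap");

                f = fopen("/proc/self/maps", "r");
                if (!f)
                        return 1;
                while (fgets(line, sizeof(line), f))
                        if (strstr(line, "[anon:my heap]"))
                                fputs(line, stdout);
                fclose(f);

                /* Passing NULL (0) as the name clears it again. */
                prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p, len, 0);
                return 0;
        }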
      
      ASCII strings are used so that vmas carry descriptive identifiers that
      can be understood by users reading /proc/pid/maps or /proc/pid/smaps.
      Names can be standardized for a given system and can include variable
      parts such as the name of the allocator or a library, the tid of the
      thread using it, etc.
      
      The name is stored in a pointer in the shared union in vm_area_struct
      that points to a null-terminated string.  Anonymous vmas that have the
      same name (equivalent strings) and are otherwise mergeable will be
      merged.  The name pointers are not shared between vmas even if they
      contain the same name.  The name pointer is stored in a union with
      fields that are only used on file-backed mappings, so it does not
      increase memory usage.
      
      The CONFIG_ANON_VMA_NAME kernel configuration option is introduced to
      enable this feature.  It is disabled by default to prevent any
      additional memory overhead and to avoid confusing procfs parsers on
      systems which are not ready to support named anonymous vmas.
      
      The patch is based on the original patch developed by Colin Cross, more
      specifically on its latest version [1] posted upstream by Sumit Semwal.
      It used a userspace pointer to store vma names.  In that design, name
      pointers could be shared between vmas.  However during the last
      upstreaming attempt, Kees Cook raised concerns [2] about this approach
      and suggested copying the name into kernel memory space, performing
      validity checks [3] and storing it as a string referenced from
      vm_area_struct.
      
      One big concern is fork() performance, since fork() would need to
      strdup anonymous vma names.  Dave Hansen suggested experimenting with
      the worst-case scenario of forking a process with 64k vmas having the
      longest possible names [4].  I ran this experiment on an ARM64 Android
      device and recorded a worst-case regression of almost 40% when forking
      such a process.
      
      This regression is addressed in the followup patch which replaces the
      pointer to a name with a refcounted structure that allows sharing the
      name pointer between vmas of the same name.  Instead of duplicating the
      string during fork() or when splitting a vma it increments the refcount.
      
      [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
      [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
      [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
      [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
      
      Changes for prctl(2) manual page (in the options section):
      
      PR_SET_VMA
      	Sets an attribute specified in arg2 for virtual memory areas
      	starting from the address specified in arg3 and spanning the
      	size specified in arg4. arg5 specifies the value of the attribute
      	to be set. Note that assigning an attribute to a virtual memory
      	area might prevent it from being merged with adjacent virtual
      	memory areas due to the difference in that attribute's value.
      
      	Currently, arg2 must be one of:
      
      	PR_SET_VMA_ANON_NAME
      		Set a name for anonymous virtual memory areas. arg5 should
      		be a pointer to a null-terminated string containing the
      		name. The name length including null byte cannot exceed
      		80 bytes. If arg5 is NULL, the name of the appropriate
      		anonymous virtual memory areas will be reset. The name
      		can contain only printable ASCII characters (including
      		space), except '[', ']', '\', '$' and '`'.
      
      		This feature is available only if the kernel is built with
      		the CONFIG_ANON_VMA_NAME option enabled.
      
      [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
        Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
      [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
       added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
       work here was done by Colin Cross, therefore, with his permission, keeping
       him as the author]
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
      Signed-off-by: Colin Cross <ccross@google.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 26 Dec 2021: 1 commit
    • mm: mempolicy: fix THP allocations escaping mempolicy restrictions · 33863534
      Andrey Ryabinin authored
      alloc_pages_vma() may first try to allocate a THP page on the local
      NUMA node:
      
      	page = __alloc_pages_node(hpage_node,
      		gfp | __GFP_THISNODE | __GFP_NORETRY, order);
      
      And if the allocation fails it retries allowing remote memory:
      
      	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
          		page = __alloc_pages_node(hpage_node,
      					gfp, order);
      
      However, this retry ignores the memory policy nodemask, allowing the
      allocation to escape its restrictions.
      
      The first appearance of this bug seems to be the commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings").
      
      The bug disappeared later in the commit 89c83fb5 ("mm, thp:
      consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and
      reappeared again in slightly different form in the commit 76e654cc
      ("mm, page_alloc: allow hugepage fallback to remote nodes when
      madvised")
      
      Fix this by passing the correct nodemask to the __alloc_pages() call,
      as sketched below.
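
      A sketch of the direction of the fix is below.  It is a fragment of the
      relevant alloc_pages_vma() path, not the verbatim hunk from this
      commit: the point is that the reclaim retry keeps the policy's
      nodemask instead of dropping it, so MPOL_BIND restrictions still apply.

        nmask = policy_nodemask(gfp, pol);
        if (!nmask || node_isset(hpage_node, *nmask)) {
                mpol_cond_put(pol);
                /* The first attempt stays node-local and does not reclaim. */
                page = __alloc_pages_node(hpage_node,
                                gfp | __GFP_THISNODE | __GFP_NORETRY, order);
                /* The retry may go remote, but only within the bound nodes. */
                if (!page && (gfp & __GFP_DIRECT_RECLAIM))
                        page = __alloc_pages(gfp, order, hpage_node, nmask);
                goto out;
        }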
      
      The demonstration/reproducer of the problem:
      
          $ mount -oremount,size=4G,huge=always /dev/shm/
          $ echo always > /sys/kernel/mm/transparent_hugepage/defrag
          $ cat mbind_thp.c
          #include <unistd.h>
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <assert.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <numaif.h>
      
          #define SIZE 2ULL << 30
          int main(int argc, char **argv)
          {
              int fd;
              unsigned long long i;
              char *addr;
              pid_t pid;
              char buf[100];
              unsigned long nodemask = 1;
      
              fd = open("/dev/shm/test", O_RDWR|O_CREAT);
              assert(fd > 0);
              assert(ftruncate(fd, SIZE) == 0);
      
              addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                 MAP_SHARED, fd, 0);
      
              assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2, MPOL_MF_STRICT|MPOL_MF_MOVE)==0);
              for (i = 0; i < SIZE; i+=4096) {
                addr[i] = 1;
              }
              pid = getpid();
              snprintf(buf, sizeof(buf), "grep shm /proc/%d/numa_maps", pid);
              system(buf);
              sleep(10000);
      
              return 0;
          }
          $ gcc mbind_thp.c -o mbind_thp -lnuma
          $ numactl -H
          available: 2 nodes (0-1)
          node 0 cpus: 0 2
          node 0 size: 1918 MB
          node 0 free: 1595 MB
          node 1 cpus: 1 3
          node 1 size: 2014 MB
          node 1 free: 1731 MB
          node distances:
          node   0   1
            0:  10  20
            1:  20  10
          $ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp
          7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4
      
      Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com
      Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
      Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 07 Nov 2021: 2 commits
  7. 19 Oct 2021: 1 commit
    • mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() · 6d2aec9e
      Eric Dumazet authored
      syzbot reported access to uninitialized memory in mbind() [1].

      The issue came with commit bda420b9 ("numa balancing: migrate on fault
      among multiple bound nodes").

      That commit added a new bit to MPOL_MODE_FLAGS, but only checked for a
      valid combination (MPOL_F_NUMA_BALANCING can only be used with
      MPOL_BIND) in do_set_mempolicy().

      This patch moves the check into sanitize_mpol_flags() so that it is
      also applied by mbind().
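
      A hedged user-space sketch of the resulting contract is below: the flag
      is only valid together with MPOL_BIND, and after this fix mbind()
      rejects the illegal combinations just as set_mempolicy() already did.
      The MPOL_F_NUMA_BALANCING and MPOL_LOCAL fallback definitions are
      assumptions for older numaif.h headers; error handling is minimal.

        #include <numaif.h>
        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MPOL_F_NUMA_BALANCING
        #define MPOL_F_NUMA_BALANCING (1 << 13) /* assumed uapi value */
        #endif
        #ifndef MPOL_LOCAL
        #define MPOL_LOCAL 4                    /* assumed uapi value */
        #endif

        int main(void)
        {
                unsigned long nodemask = 1;     /* node 0 only */
                size_t len = 1 << 20;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                /* Valid: balancing-on-fault within the bound nodes. */
                if (mbind(p, len, MPOL_BIND | MPOL_F_NUMA_BALANCING,
                          &nodemask, 8 * sizeof(nodemask), 0))
                        perror("mbind(MPOL_BIND | MPOL_F_NUMA_BALANCING)");

                /* Rejected with EINVAL once the check lives in
                 * sanitize_mpol_flags(); previously this slipped through
                 * mbind() (see the KMSAN report in [1] below). */
                if (mbind(p, len, MPOL_LOCAL | MPOL_F_NUMA_BALANCING,
                          NULL, 0, 0))
                        perror("mbind(MPOL_LOCAL | MPOL_F_NUMA_BALANCING)");
                return 0;
        }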
      
        [1]
        BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Uninit was created at:
         slab_alloc_node mm/slub.c:3221 [inline]
         slab_alloc mm/slub.c:3230 [inline]
         kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235
         mpol_new mm/mempolicy.c:293 [inline]
         do_mbind+0x912/0x15f0 mm/mempolicy.c:1289
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        =====================================================
        Kernel panic - not syncing: panic_on_kmsan set ...
        CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G    B             5.15.0-rc2-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106
         dump_stack+0x25/0x28 lib/dump_stack.c:113
         panic+0x44f/0xdeb kernel/panic.c:232
         kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186
         __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com
      Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 18 Oct 2021: 1 commit
  9. 09 Sep 2021: 3 commits
    • mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task · 276aeee1
      yanghui authored
      Servers hit the panic below:
      
        Kernel version:5.4.56
        BUG: unable to handle page fault for address: 0000000000002c48
        RIP: 0010:__next_zones_zonelist+0x1d/0x40
        Call Trace:
          __alloc_pages_nodemask+0x277/0x310
          alloc_page_interleave+0x13/0x70
          handle_mm_fault+0xf99/0x1390
          __do_page_fault+0x288/0x500
          do_page_fault+0x30/0x110
          page_fault+0x3e/0x50
      
      The reason for the panic is that MAX_NUMNODES is passed as the third
      parameter (preferred_nid) to __alloc_pages_nodemask().  The access to
      zonelist->zoneref->zone_idx in __next_zones_zonelist() then causes the
      panic.

      In offset_il_node(), first_node() returns a nid from pol->v.nodes;
      after this, other threads may change pol->v.nodes before next_node()
      runs.  This race can make next_node() return MAX_NUMNODES.  Fix it by
      putting pol->nodes in a local variable, as sketched after the race
      diagram below.
      
      The race condition is between offset_il_node and cpuset_change_task_nodemask:
      
        CPU0:                                     CPU1:
        alloc_pages_vma()
          interleave_nid(pol,)
            offset_il_node(pol,)
              first_node(pol->v.nodes)            cpuset_change_task_nodemask
                              //nodes==0xc          mpol_rebind_task
                                                      mpol_rebind_policy
                                                        mpol_rebind_nodemask(pol,nodes)
                              //nodes==0x3
              next_node(nid, pol->v.nodes)//return MAX_NUMNODES
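
      A sketch of the idea behind the fix is below.  It is an illustrative
      fragment, not the verbatim hunk from this commit: offset_il_node()
      works on one local snapshot of the policy's nodemask, so a concurrent
      rebind can no longer change it between first_node() and next_node().

        static unsigned int offset_il_node(struct mempolicy *pol, unsigned long n)
        {
                nodemask_t nodemask = pol->nodes;       /* local snapshot */
                unsigned int target, nnodes = nodes_weight(nodemask);
                int i, nid;

                if (!nnodes)
                        return numa_node_id();
                target = (unsigned int)n % nnodes;
                nid = first_node(nodemask);
                for (i = 0; i < target; i++)
                        nid = next_node(nid, nodemask);
                return nid;
        }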
      
      Link: https://lkml.kernel.org/r/20210906034658.48721-1-yanghui.def@bytedance.com
      Signed-off-by: yanghui <yanghui.def@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • compat: remove some compat entry points · 59ab844e
      Arnd Bergmann authored
      These are all handled correctly when calling the native system call entry
      point, so remove the special cases.
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: simplify compat numa syscalls · e130242d
      Arnd Bergmann authored
      The compat implementations for mbind, get_mempolicy, set_mempolicy and
      migrate_pages are just there to handle the subtly different layout of
      bitmaps on 32-bit hosts.
      
      The compat implementation, however, lacks some of the checks that are
      present in the native one, in particular the check that the extra bits
      are all zero when user space has a larger mask size than the kernel.
      Worse, those extra bits do not get cleared when copying in or out of the
      kernel, which can lead to incorrect data as well.
      
      Unify the implementation to handle the compat bitmap layout directly in
      the get_nodes() and copy_nodes_to_user() helpers.  Splitting out the
      get_bitmap() helper from get_nodes() also helps readability of the native
      case.
      
      On x86, two additional problems are addressed by this: compat tasks can
      pass a bitmap at the end of a mapping, causing a fault when reading across
      the page boundary for a 64-bit word.  x32 tasks might also run into
      problems with get_mempolicy corrupting data when an odd number of 32-bit
      words gets passed.
      
      On parisc the migrate_pages() system call apparently had the wrong calling
      convention, as big-endian architectures expect the words inside of a
      bitmap to be swapped.  This is not a problem though since parisc has no
      NUMA support.
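
      An illustrative sketch of the layout difference is below.  It is not
      the kernel's get_bitmap()/compat_get_bitmap() helpers, just a
      demonstration of why each native 64-bit word has to be assembled from
      two 32-bit compat words: a raw 64-bit copy only happens to produce the
      same bit positions on little-endian.

        #include <stdint.h>

        /* Fold a compat nodemask (array of 32-bit words, bit 0 = node 0)
         * into native 64-bit words.  Any bits beyond the kernel's
         * MAX_NUMNODES would still have to be checked to be zero, which is
         * the check the old compat path missed. */
        static void fold_compat_nodemask(uint64_t *dst, const uint32_t *src,
                                         unsigned int nr_compat_words)
        {
                for (unsigned int i = 0; i < nr_compat_words; i += 2) {
                        uint64_t lo = src[i];             /* lower 32 node bits */
                        uint64_t hi = (i + 1 < nr_compat_words) ?
                                      src[i + 1] : 0;     /* upper 32 node bits */

                        dst[i / 2] = lo | (hi << 32);
                }
        }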
      
      [arnd@arndb.de: fix mempolicy crash]
        Link: https://lkml.kernel.org/r/20210730143417.3700653-1-arnd@kernel.org
        Link: https://lore.kernel.org/lkml/YQPLG20V3dmOfq3a@osiris/
      
      Link: https://lkml.kernel.org/r/20210727144859.4150043-5-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 04 Sep 2021: 8 commits
  11. 01 Jul 2021: 5 commits
  12. 30 Jun 2021: 2 commits
  13. 07 May 2021: 2 commits
  14. 06 May 2021: 3 commits
  15. 01 May 2021: 2 commits