1. 06 3月, 2022 3 次提交
    • S
      mm: fix use-after-free when anon vma name is used after vma is freed · 942341dc
      Suren Baghdasaryan 提交于
      When adjacent vmas are being merged it can result in the vma that was
      originally passed to madvise_update_vma being destroyed.  In the current
      implementation, the name parameter passed to madvise_update_vma points
      directly to vma->anon_name and it is used after the call to vma_merge.
      In the cases when vma_merge merges the original vma and destroys it,
      this might result in UAF.  For that the original vma would have to hold
      the anon_vma_name with the last reference.  The following vma would need
      to contain a different anon_vma_name object with the same string.  Such
      scenario is shown below:
      
      madvise_vma_behavior(vma)
        madvise_update_vma(vma, ..., anon_name == vma->anon_name)
          vma_merge(vma)
            __vma_adjust(vma) <-- merges vma with adjacent one
              vm_area_free(vma) <-- frees the original vma
          replace_vma_anon_name(anon_name) <-- UAF of vma->anon_name
      
      Fix this by raising the name refcount and stabilizing it.
      
      Link: https://lkml.kernel.org/r/20220224231834.1481408-3-surenb@google.com
      Link: https://lkml.kernel.org/r/20220223153613.835563-3-surenb@google.com
      Fixes: 9a10064f ("mm: add a field to store names for private anonymous memory")
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reported-by: syzbot+aa7b3d4b35f9dc46a366@syzkaller.appspotmail.com
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      942341dc
    • S
      mm: prevent vm_area_struct::anon_name refcount saturation · 96403e11
      Suren Baghdasaryan 提交于
      A deep process chain with many vmas could grow really high.  With
      default sysctl_max_map_count (64k) and default pid_max (32k) the max
      number of vmas in the system is 2147450880 and the refcounter has
      headroom of 1073774592 before it reaches REFCOUNT_SATURATED
      (3221225472).
      
      Therefore it's unlikely that an anonymous name refcounter will overflow
      with these defaults.  Currently the max for pid_max is PID_MAX_LIMIT
      (4194304) and for sysctl_max_map_count it's INT_MAX (2147483647).  In
      this configuration anon_vma_name refcount overflow becomes theoretically
      possible (that still require heavy sharing of that anon_vma_name between
      processes).
      
      kref refcounting interface used in anon_vma_name structure will detect a
      counter overflow when it reaches REFCOUNT_SATURATED value but will only
      generate a warning and freeze the ref counter.  This would lead to the
      refcounted object never being freed.  A determined attacker could leak
      memory like that but it would be rather expensive and inefficient way to
      do so.
      
      To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
      sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
      leaves INT_MAX/2 (1073741823) values before the counter reaches
      REFCOUNT_SATURATED.  This should provide enough headroom for raising the
      refcounts temporarily.
      
      Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96403e11
    • S
      mm: refactor vm_area_struct::anon_vma_name usage code · 5c26f6ac
      Suren Baghdasaryan 提交于
      Avoid mixing strings and their anon_vma_name referenced pointers by
      using struct anon_vma_name whenever possible.  This simplifies the code
      and allows easier sharing of anon_vma_name structures when they
      represent the same name.
      
      [surenb@google.com: fix comment]
      
      Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
      Suggested-by: NMatthew Wilcox <willy@infradead.org>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Alexey Gladkov <legion@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c26f6ac
  2. 15 1月, 2022 4 次提交
    • A
      mm: move anon_vma declarations to linux/mm_inline.h · 17fca131
      Arnd Bergmann 提交于
      The patch to add anonymous vma names causes a build failure in some
      configurations:
      
        include/linux/mm_types.h: In function 'is_same_vma_anon_name':
        include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
          924 |         return name && vma_name && !strcmp(name, vma_name);
              |                                     ^~~~~~
        include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
      
      This should not really be part of linux/mm_types.h in the first place,
      as that header is meant to only contain structure defintions and need a
      minimum set of indirect includes itself.
      
      While the header clearly includes more than it should at this point,
      let's not make it worse by including string.h as well, which would pull
      in the expensive (compile-speed wise) fortify-string logic.
      
      Move the new functions into a separate header that only needs to be
      included in a couple of locations.
      
      Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
      Fixes: "mm: add a field to store names for private anonymous memory"
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@google.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17fca131
    • S
      mm: add anonymous vma name refcounting · 78db3412
      Suren Baghdasaryan 提交于
      While forking a process with high number (64K) of named anonymous vmas
      the overhead caused by strdup() is noticeable.  Experiments with ARM64
      Android device show up to 40% performance regression when forking a
      process with 64k unpopulated anonymous vmas using the max name lengths
      vs the same process with the same number of anonymous vmas having no
      name.
      
      Introduce anon_vma_name refcounted structure to avoid the overhead of
      copying vma names during fork() and when splitting named anonymous vmas.
      
      When a vma is duplicated, instead of copying the name we increment the
      refcount of this structure.  Multiple vmas can point to the same
      anon_vma_name as long as they increment the refcount.  The name member
      of anon_vma_name structure is assigned at structure allocation time and
      is never changed.  If vma name changes then the refcount of the original
      structure is dropped, a new anon_vma_name structure is allocated to hold
      the new name and the vma pointer is updated to point to the new
      structure.
      
      With this approach the fork() performance regressions is reduced 3-4x
      times and with usecases using more reasonable number of VMAs (a few
      thousand) the regressions is not measurable.
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-3-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78db3412
    • C
      mm: add a field to store names for private anonymous memory · 9a10064f
      Colin Cross 提交于
      In many userspace applications, and especially in VM based applications
      like Android uses heavily, there are multiple different allocators in
      use.  At a minimum there is libc malloc and the stack, and in many cases
      there are libc malloc, the stack, direct syscalls to mmap anonymous
      memory, and multiple VM heaps (one for small objects, one for big
      objects, etc.).  Each of these layers usually has its own tools to
      inspect its usage; malloc by compiling a debug version, the VM through
      heap inspection tools, and for direct syscalls there is usually no way
      to track them.
      
      On Android we heavily use a set of tools that use an extended version of
      the logic covered in Documentation/vm/pagemap.txt to walk all pages
      mapped in userspace and slice their usage by process, shared (COW) vs.
      unique mappings, backing, etc.  This can account for real physical
      memory usage even in cases like fork without exec (which Android uses
      heavily to share as many private COW pages as possible between
      processes), Kernel SamePage Merging, and clean zero pages.  It produces
      a measurement of the pages that only exist in that process (USS, for
      unique), and a measurement of the physical memory usage of that process
      with the cost of shared pages being evenly split between processes that
      share them (PSS).
      
      If all anonymous memory is indistinguishable then figuring out the real
      physical memory usage (PSS) of each heap requires either a pagemap
      walking tool that can understand the heap debugging of every layer, or
      for every layer's heap debugging tools to implement the pagemap walking
      logic, in which case it is hard to get a consistent view of memory
      across the whole system.
      
      Tracking the information in userspace leads to all sorts of problems.
      It either needs to be stored inside the process, which means every
      process has to have an API to export its current heap information upon
      request, or it has to be stored externally in a filesystem that somebody
      needs to clean up on crashes.  It needs to be readable while the process
      is still running, so it has to have some sort of synchronization with
      every layer of userspace.  Efficiently tracking the ranges requires
      reimplementing something like the kernel vma trees, and linking to it
      from every layer of userspace.  It requires more memory, more syscalls,
      more runtime cost, and more complexity to separately track regions that
      the kernel is already tracking.
      
      This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
      userspace-provided name for anonymous vmas.  The names of named
      anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
      [anon:<name>].
      
      Userspace can set the name for a region of memory by calling
      
         prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
      
      Setting the name to NULL clears it.  The name length limit is 80 bytes
      including NUL-terminator and is checked to contain only printable ascii
      characters (including space), except '[',']','\','$' and '`'.
      
      Ascii strings are being used to have a descriptive identifiers for vmas,
      which can be understood by the users reading /proc/pid/maps or
      /proc/pid/smaps.  Names can be standardized for a given system and they
      can include some variable parts such as the name of the allocator or a
      library, tid of the thread using it, etc.
      
      The name is stored in a pointer in the shared union in vm_area_struct
      that points to a null terminated string.  Anonymous vmas with the same
      name (equivalent strings) and are otherwise mergeable will be merged.
      The name pointers are not shared between vmas even if they contain the
      same name.  The name pointer is stored in a union with fields that are
      only used on file-backed mappings, so it does not increase memory usage.
      
      CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
      feature.  It keeps the feature disabled by default to prevent any
      additional memory overhead and to avoid confusing procfs parsers on
      systems which are not ready to support named anonymous vmas.
      
      The patch is based on the original patch developed by Colin Cross, more
      specifically on its latest version [1] posted upstream by Sumit Semwal.
      It used a userspace pointer to store vma names.  In that design, name
      pointers could be shared between vmas.  However during the last
      upstreaming attempt, Kees Cook raised concerns [2] about this approach
      and suggested to copy the name into kernel memory space, perform
      validity checks [3] and store as a string referenced from
      vm_area_struct.
      
      One big concern is about fork() performance which would need to strdup
      anonymous vma names.  Dave Hansen suggested experimenting with
      worst-case scenario of forking a process with 64k vmas having longest
      possible names [4].  I ran this experiment on an ARM64 Android device
      and recorded a worst-case regression of almost 40% when forking such a
      process.
      
      This regression is addressed in the followup patch which replaces the
      pointer to a name with a refcounted structure that allows sharing the
      name pointer between vmas of the same name.  Instead of duplicating the
      string during fork() or when splitting a vma it increments the refcount.
      
      [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
      [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
      [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
      [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
      
      Changes for prctl(2) manual page (in the options section):
      
      PR_SET_VMA
      	Sets an attribute specified in arg2 for virtual memory areas
      	starting from the address specified in arg3 and spanning the
      	size specified	in arg4. arg5 specifies the value of the attribute
      	to be set. Note that assigning an attribute to a virtual memory
      	area might prevent it from being merged with adjacent virtual
      	memory areas due to the difference in that attribute's value.
      
      	Currently, arg2 must be one of:
      
      	PR_SET_VMA_ANON_NAME
      		Set a name for anonymous virtual memory areas. arg5 should
      		be a pointer to a null-terminated string containing the
      		name. The name length including null byte cannot exceed
      		80 bytes. If arg5 is NULL, the name of the appropriate
      		anonymous virtual memory areas will be reset. The name
      		can contain only printable ascii characters (including
                      space), except '[',']','\','$' and '`'.
      
                      This feature is available only if the kernel is built with
                      the CONFIG_ANON_VMA_NAME option enabled.
      
      [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
        Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
      [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
       added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
       work here was done by Colin Cross, therefore, with his permission, keeping
       him as the author]
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.comSigned-off-by: NColin Cross <ccross@google.com>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a10064f
    • C
      mm: rearrange madvise code to allow for reuse · ac1e9acc
      Colin Cross 提交于
      Patch series "mm: rearrange madvise code to allow for reuse", v11.
      
      Avoid performance regression of the new anon vma name field refcounting it.
      
      I checked the image sizes with allnoconfig builds:
      
        unpatched Linus' ToT
           text    data     bss     dec     hex filename
        1324759      32   73928 1398719 1557bf vmlinux
      
        After the first patch is applied (madvise refactoring)
           text    data     bss     dec     hex filename
        1322346      32   73928 1396306 154e52 vmlinux
        >>> 2413 bytes decrease vs ToT <<<
      
        After all patches applied with CONFIG_ANON_VMA_NAME=n
           text    data     bss     dec     hex filename
        1322337      32   73928 1396297 154e49 vmlinux
        >>> 2422 bytes decrease vs ToT <<<
      
        After all patches applied with CONFIG_ANON_VMA_NAME=y
           text    data     bss     dec     hex filename
        1325228      32   73928 1399188 155994 vmlinux
        >>> 469 bytes increase vs ToT <<<
      
      This patch (of 3):
      
      Refactor the madvise syscall to allow for parts of it to be reused by a
      prctl syscall that affects vmas.
      
      Move the code that walks vmas in a virtual address range into a function
      that takes a function pointer as a parameter.  The only caller for now
      is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
      but the next patch will add an additional caller.
      
      Move handling all vma behaviors inside madvise_behavior, and rename it
      to madvise_vma_behavior.
      
      Move the code that updates the flags on a vma, including splitting or
      merging the vma as necessary, into a new function called
      madvise_update_vma.  The next patch will add support for updating a new
      anon_name field as well.
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-1-surenb@google.comSigned-off-by: NColin Cross <ccross@google.com>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac1e9acc
  3. 14 10月, 2021 1 次提交
  4. 04 9月, 2021 1 次提交
  5. 14 8月, 2021 1 次提交
    • D
      mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE) · eb2faa51
      David Hildenbrand 提交于
      Doing some extended tests and polishing the man page update for
      MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
      SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
      madvise() user error.
      
      We want to report only problematic mappings and permission problems that
      the user could have know as -EINVAL.
      
      Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
      but instead indicate -EFAULT to user space.  While we could also convert
      it to -ENOMEM, using -EFAULT looks more helpful when user space might
      want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
      not part of an final Linux release and we can still adjust the behavior.
      
      Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
      Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb2faa51
  6. 13 7月, 2021 1 次提交
  7. 01 7月, 2021 1 次提交
    • D
      mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables · 4ca9b385
      David Hildenbrand 提交于
      I. Background: Sparse Memory Mappings
      
      When we manage sparse memory mappings dynamically in user space - also
      sometimes involving MAP_NORESERVE - we want to dynamically populate/
      discard memory inside such a sparse memory region.  Example users are
      hypervisors (especially implementing memory ballooning or similar
      technologies like virtio-mem) and memory allocators.  In addition, we want
      to fail in a nice way (instead of generating SIGBUS) if populating does
      not succeed because we are out of backend memory (which can happen easily
      with file-based mappings, especially tmpfs and hugetlbfs).
      
      While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
      reliably discarding memory for most mapping types, there is no generic
      approach to populate page tables and preallocate memory.
      
      Although mmap() supports MAP_POPULATE, it is not applicable to the concept
      of sparse memory mappings, where we want to populate/discard dynamically
      and avoid expensive/problematic remappings.  In addition, we never
      actually report errors during the final populate phase - it is best-effort
      only.
      
      fallocate() can be used to preallocate file-based memory and fail in a
      safe way.  However, it cannot really be used for any private mappings on
      anonymous files via memfd due to COW semantics.  In addition, fallocate()
      does not actually populate page tables, so we still always get pagefaults
      on first access - which is sometimes undesired (i.e., real-time workloads)
      and requires real prefaulting of page tables, not just a preallocation of
      backend storage.  There might be interesting use cases for sparse memory
      regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
      as it does not prefault page tables.
      
      II. On preallcoation/prefaulting from user space
      
      Because we don't have a proper interface, what applications (like QEMU and
      databases) end up doing is touching (i.e., reading+writing one byte to not
      overwrite existing data) all individual pages.
      
      However, that approach
      1) Can result in wear on storage backing, because we end up reading/writing
         each page; this is especially a problem for dax/pmem.
      2) Can result in mmap_sem contention when prefaulting via multiple
         threads.
      3) Requires expensive signal handling, especially to catch SIGBUS in case
         of hugetlbfs/shmem/file-backed memory. For example, this is
         problematic in hypervisors like QEMU where SIGBUS handlers might already
         be used by other subsystems concurrently to e.g, handle hardware errors.
         "Simply" doing preallocation concurrently from other thread is not that
         easy.
      
      III. On MADV_WILLNEED
      
      Extending MADV_WILLNEED is not an option because
      1. It would change the semantics: "Expect access in the near future." and
         "might be a good idea to read some pages" vs. "Definitely populate/
         preallocate all memory and definitely fail on errors.".
      2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
         don't want populate/prealloc semantics. They treat this rather as a hint
         to give a little performance boost without too much overhead - and don't
         expect that a lot of memory might get consumed or a lot of time
         might be spent.
      
      IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
      
      Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
      MAP_POPULATE, with the following semantics:
      1. MADV_POPULATE_READ can be used to prefault page tables just like
         manually reading each individual page. This will not break any COW
         mappings. The shared zero page might get mapped and no backend storage
         might get preallocated -- allocation might be deferred to
         write-fault time. Especially shared file mappings require an explicit
         fallocate() upfront to actually preallocate backend memory (blocks in
         the file system) in case the file might have holes.
      2. If MADV_POPULATE_READ succeeds, all page tables have been populated
         (prefaulted) readable once.
      3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
         prefault page tables just like manually writing (or
         reading+writing) each individual page. This will break any COW
         mappings -- e.g., the shared zeropage is never populated.
      4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
         (prefaulted) writable once.
      5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
         mappings marked with VM_PFNMAP and VM_IO. Also, proper access
         permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
         mapping is encountered, madvise() fails with -EINVAL.
      6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
         might have been populated.
      7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
         when encountering a HW poisoned page in the range.
      8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
         cannot protect from the OOM (Out Of Memory) handler killing the
         process.
      
      While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
      preallocate memory and prefault page tables for VMs), one issue is that
      whenever we prefault pages writable, the pages have to be marked dirty,
      because the CPU could dirty them any time.  while not a real problem for
      hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
      page will be marked dirty and has to be written back later when evicting.
      
      MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
      mapping from backend storage without marking it dirty, such that eviction
      won't have to write it back.  As discussed above, shared file mappings
      might require an explciit fallocate() upfront to achieve
      preallcoation+prepopulation.
      
      Although sparse memory mappings are the primary use case, this will also
      be useful for other preallocate/prefault use cases where MAP_POPULATE is
      not desired or the semantics of MAP_POPULATE are not sufficient: as one
      example, QEMU users can trigger preallocation/prefaulting of guest RAM
      after the mapping was created -- and don't want errors to be silently
      suppressed.
      
      Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
      however, the main motivation back than was performance improvements --
      which should also still be the case.
      
      V. Single-threaded performance comparison
      
      I did a short experiment, prefaulting page tables on completely *empty
      mappings/files* and repeated the experiment 10 times.  The results
      correspond to the shortest execution time.  In general, the performance
      benefit for huge pages is negligible with small mappings.
      
      V.1: Private mappings
      
      POPULATE_READ and POPULATE_WRITE is fastest.  Note that
      Reading/POPULATE_READ will populate the shared zeropage where applicable
      -- which result in short population times.
      
      The fastest way to allocate backend storage (here: swap or huge pages) and
      prefault page tables is POPULATE_WRITE.
      
      V.2: Shared mappings
      
      fallocate() is fastest, however, doesn't prefault page tables.
      POPULATE_WRITE is faster than simple writes and read/writes.
      POPULATE_READ is faster than simple reads.
      
      Without a fd, the fastest way to allocate backend storage and prefault
      page tables is POPULATE_WRITE.  With an fd, the fastest way is usually
      FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
      exception are actual files: FALLOCATE+Read is slightly faster than
      FALLOCATE+POPULATE_READ.
      
      The fastest way to allocate backend storage prefault page tables is
      FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
      FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
      dirty.
      
      v.3: Detailed results
      
      ==================================================
      2 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :     0.119 ms
      Anon 4 KiB     : Write                    :     0.222 ms
      Anon 4 KiB     : Read/Write               :     0.380 ms
      Anon 4 KiB     : POPULATE_READ            :     0.060 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
      Memfd 4 KiB    : Read                     :     0.034 ms
      Memfd 4 KiB    : Write                    :     0.310 ms
      Memfd 4 KiB    : Read/Write               :     0.362 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      tmpfs          : Read                     :     0.033 ms
      tmpfs          : Write                    :     0.313 ms
      tmpfs          : Read/Write               :     0.406 ms
      tmpfs          : POPULATE_READ            :     0.039 ms
      tmpfs          : POPULATE_WRITE           :     0.285 ms
      file           : Read                     :     0.033 ms
      file           : Write                    :     0.351 ms
      file           : Read/Write               :     0.408 ms
      file           : POPULATE_READ            :     0.039 ms
      file           : POPULATE_WRITE           :     0.290 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      **************************************************
      4096 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :   237.940 ms
      Anon 4 KiB     : Write                    :   708.409 ms
      Anon 4 KiB     : Read/Write               :  1054.041 ms
      Anon 4 KiB     : POPULATE_READ            :   124.310 ms
      Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
      Memfd 4 KiB    : Read                     :   136.928 ms
      Memfd 4 KiB    : Write                    :   963.898 ms
      Memfd 4 KiB    : Read/Write               :  1106.561 ms
      Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
      Memfd 2 MiB    : Read                     :   357.116 ms
      Memfd 2 MiB    : Write                    :   357.210 ms
      Memfd 2 MiB    : Read/Write               :   357.606 ms
      Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
      tmpfs          : Read                     :   137.536 ms
      tmpfs          : Write                    :   954.362 ms
      tmpfs          : Read/Write               :  1105.954 ms
      tmpfs          : POPULATE_READ            :    80.289 ms
      tmpfs          : POPULATE_WRITE           :   822.826 ms
      file           : Read                     :   137.874 ms
      file           : Write                    :   987.025 ms
      file           : Read/Write               :  1107.439 ms
      file           : POPULATE_READ            :    80.413 ms
      file           : POPULATE_WRITE           :   857.622 ms
      hugetlbfs      : Read                     :   355.607 ms
      hugetlbfs      : Write                    :   355.729 ms
      hugetlbfs      : Read/Write               :   356.127 ms
      hugetlbfs      : POPULATE_READ            :   354.585 ms
      hugetlbfs      : POPULATE_WRITE           :   355.138 ms
      **************************************************
      2 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :     0.394 ms
      Anon 4 KiB     : Write                    :     0.348 ms
      Anon 4 KiB     : Read/Write               :     0.400 ms
      Anon 4 KiB     : POPULATE_READ            :     0.326 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
      Anon 2 MiB     : Read                     :     0.030 ms
      Anon 2 MiB     : Write                    :     0.030 ms
      Anon 2 MiB     : Read/Write               :     0.030 ms
      Anon 2 MiB     : POPULATE_READ            :     0.030 ms
      Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
      Memfd 4 KiB    : Read                     :     0.412 ms
      Memfd 4 KiB    : Write                    :     0.372 ms
      Memfd 4 KiB    : Read/Write               :     0.419 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
      Memfd 4 KiB    : FALLOCATE                :     0.137 ms
      Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
      Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      Memfd 2 MiB    : FALLOCATE                :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
      tmpfs          : Read                     :     0.416 ms
      tmpfs          : Write                    :     0.369 ms
      tmpfs          : Read/Write               :     0.425 ms
      tmpfs          : POPULATE_READ            :     0.346 ms
      tmpfs          : POPULATE_WRITE           :     0.295 ms
      tmpfs          : FALLOCATE                :     0.139 ms
      tmpfs          : FALLOCATE+Read           :     0.447 ms
      tmpfs          : FALLOCATE+Write          :     0.333 ms
      tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
      file           : Read                     :     0.191 ms
      file           : Write                    :     0.511 ms
      file           : Read/Write               :     0.524 ms
      file           : POPULATE_READ            :     0.196 ms
      file           : POPULATE_WRITE           :     0.434 ms
      file           : FALLOCATE                :     0.004 ms
      file           : FALLOCATE+Read           :     0.197 ms
      file           : FALLOCATE+Write          :     0.554 ms
      file           : FALLOCATE+Read/Write     :     0.480 ms
      file           : FALLOCATE+POPULATE_READ  :     0.201 ms
      file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      hugetlbfs      : FALLOCATE                :     0.030 ms
      hugetlbfs      : FALLOCATE+Read           :     0.031 ms
      hugetlbfs      : FALLOCATE+Write          :     0.031 ms
      hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
      **************************************************
      4096 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :  1053.090 ms
      Anon 4 KiB     : Write                    :   913.642 ms
      Anon 4 KiB     : Read/Write               :  1060.350 ms
      Anon 4 KiB     : POPULATE_READ            :   893.691 ms
      Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
      Anon 2 MiB     : Read                     :   358.553 ms
      Anon 2 MiB     : Write                    :   358.419 ms
      Anon 2 MiB     : Read/Write               :   357.992 ms
      Anon 2 MiB     : POPULATE_READ            :   357.533 ms
      Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
      Memfd 4 KiB    : Read                     :  1078.144 ms
      Memfd 4 KiB    : Write                    :   942.036 ms
      Memfd 4 KiB    : Read/Write               :  1100.391 ms
      Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
      Memfd 4 KiB    : FALLOCATE                :   304.632 ms
      Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
      Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
      Memfd 2 MiB    : Read                     :   358.131 ms
      Memfd 2 MiB    : Write                    :   358.099 ms
      Memfd 2 MiB    : Read/Write               :   358.250 ms
      Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
      Memfd 2 MiB    : FALLOCATE                :   356.735 ms
      Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
      Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
      tmpfs          : Read                     :  1087.265 ms
      tmpfs          : Write                    :   950.840 ms
      tmpfs          : Read/Write               :  1107.567 ms
      tmpfs          : POPULATE_READ            :   922.605 ms
      tmpfs          : POPULATE_WRITE           :   810.094 ms
      tmpfs          : FALLOCATE                :   306.320 ms
      tmpfs          : FALLOCATE+Read           :  1169.796 ms
      tmpfs          : FALLOCATE+Write          :   933.730 ms
      tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
      file           : Read                     :   654.101 ms
      file           : Write                    :  1259.142 ms
      file           : Read/Write               :  1289.509 ms
      file           : POPULATE_READ            :   661.642 ms
      file           : POPULATE_WRITE           :  1106.816 ms
      file           : FALLOCATE                :     1.864 ms
      file           : FALLOCATE+Read           :   656.328 ms
      file           : FALLOCATE+Write          :  1153.300 ms
      file           : FALLOCATE+Read/Write     :  1180.613 ms
      file           : FALLOCATE+POPULATE_READ  :   668.347 ms
      file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
      hugetlbfs      : Read                     :   357.245 ms
      hugetlbfs      : Write                    :   357.413 ms
      hugetlbfs      : Read/Write               :   357.120 ms
      hugetlbfs      : POPULATE_READ            :   356.321 ms
      hugetlbfs      : POPULATE_WRITE           :   356.693 ms
      hugetlbfs      : FALLOCATE                :   355.927 ms
      hugetlbfs      : FALLOCATE+Read           :   357.074 ms
      hugetlbfs      : FALLOCATE+Write          :   357.120 ms
      hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
      **************************************************
      
      [1] https://lkml.org/lkml/2013/6/27/698
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ca9b385
  8. 07 5月, 2021 1 次提交
  9. 14 3月, 2021 1 次提交
  10. 30 1月, 2021 2 次提交
  11. 24 1月, 2021 2 次提交
  12. 16 12月, 2020 2 次提交
  13. 09 12月, 2020 1 次提交
    • M
      mm/madvise: remove racy mm ownership check · a68a0262
      Minchan Kim 提交于
      Jann spotted the security hole due to race of mm ownership check.
      
      If the task is sharing the mm_struct but goes through execve() before
      mm_access(), it could skip process_madvise_behavior_valid check.  That
      makes *any advice hint* to reach into the remote process.
      
      This patch removes the mm ownership check.  With it, it will lose the
      ability that local process could give *any* advice hint with vector
      interface for some reason (e.g., performance).  Since there is no
      concrete example in upstream yet, it would be better to remove the
      abiliity at this moment and need to review when such new advice comes
      up.
      
      Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NJann Horn <jannh@google.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a68a0262
  14. 23 11月, 2020 2 次提交
  15. 19 10月, 2020 2 次提交
    • M
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim 提交于
      There is usecase that System Management Software(SMS) want to give a
      memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
      case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses pidfd of an external process to give the
      hint.  It also supports vector address range because Android app has
      thousands of vmas due to zygote so it's totally waste of CPU and power if
      we should call the syscall one by one for each vma.(With testing 2000-vma
      syscall vs 1-vector syscall, it showed 15% performance improvement.  I
      think it would be bigger in real practice because the testing ran very
      cache friendly environment).
      
      Another potential use case for the vector range is to amortize the cost
      ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      future, we could find more usecases for other advises so let's make it
      happens as API since we introduce a new syscall at this moment.  With
      that, existing madvise(2) user could replace it with process_madvise(2)
      with their own pid if they want to have batch address ranges support
      feature.
      
      ince it could affect other process's address range, only privileged
      process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
      UID) gives it the right to ptrace the process could use it successfully.
      The flag argument is reserved for future use if we need to extend the API.
      
      I think supporting all hints madvise has/will supported/support to
      process_madvise is rather risky.  Because we are not sure all hints make
      sense from external process and implementation for the hint may rely on
      the caller being in the current context so it could be error-prone.  Thus,
      I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
      
      If someone want to add other hints, we could hear the usecase and review
      it for each hint.  It's safer for maintenance rather than introducing a
      buggy syscall but hard to fix it later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges from external process as well as
            local process. It provides the advice to address ranges of process
            described by iovec and vlen. The goal of such advice is to improve
            system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidofd_open(2) for further information)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address(iov_base)
            and with size length of bytes(iov_len).
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            The process_madvise supports every advice madvise(2) has if target
            process is in same thread group with calling process so user could
            use process_madvise(2) to extend existing madvise(2) to support
            vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How to guarantee the race(i.e., object validation) between when
      giving a hint from an external process and get the hint from the target
      process?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the space
      target process can run between the time the process_madvise process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  It
      also apply other APIs like move_pages, process_vm_write.
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization by using other API or
      design from userside.  It shouldn't be part of API itself.  If someone
      needs more fine-grained synchronization rather than process level,
      there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
      applicable via using last reserved argument of the API but I don't
      think it's necessary right now since we have already ways to prevent
      the race so don't want to add additional complexity with more
      fine-grained optimization model.
      
      To make the API extend, it reserved an unsigned long as last argument
      so we could support it in future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and hinting a wrong
      VMA.  Furthermore, we want to act the hint in caller's context, not the
      callee's, because the callee is usually limited in cpuset/cgroups or
      even freezed state so they can't act by themselves quick enough, which
      causes more thrashing/kill.  It doesn't work if the target process are
      ptraced(e.g., strace, debugger, minidump) because a process can have at
      most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.auSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
    • M
      mm/madvise: pass mm to do_madvise · 0726b01e
      Minchan Kim 提交于
      Patch series "introduce memory hinting API for external process", v9.
      
      Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
      that, application could give hints to kernel what memory range are
      preferred to be reclaimed.  However, in some platform(e.g., Android), the
      information required to make the hinting decision is not known to the app.
      Instead, it is known to a centralized userspace daemon(e.g.,
      ActivityManagerService), and that daemon must be able to initiate reclaim
      on its own without any app involvement.
      
      To solve the concern, this patch introduces new syscall -
      process_madvise(2).  Bascially, it's same with madvise(2) syscall but it
      has some differences.
      
      1. It needs pidfd of target process to provide the hint
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
         moment.  Other hints in madvise will be opened when there are explicit
         requests from community to prevent unexpected bugs we couldn't support.
      
      3. Only privileged processes can do something for other process's
         address space.
      
      For more detail of the new API, please see "mm: introduce external memory
      hinting API" description in this patchset.
      
      This patch (of 3):
      
      In upcoming patches, do_madvise will be called from external process
      context so we shouldn't asssume "current" is always hinted process's
      task_struct.
      
      Furthermore, we must not access mm_struct via task->mm, but obtain it via
      access_mm() once (in the following patch) and only use that pointer [1],
      so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
      safe, so we can use them further down the call stack.
      
      And let's pass current->mm as arguments of do_madvise so it shouldn't
      change existing behavior but prepare next patch to make review easy.
      
      [vbabka@suse.cz: changelog tweak]
      [minchan@kernel.org: use current->mm for io_uring]
        Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
      [akpm@linux-foundation.org: fix it for upstream changes]
      [akpm@linux-foundation.org: whoops]
      [rdunlap@infradead.org: add missing includes]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0726b01e
  16. 17 10月, 2020 3 次提交
  17. 14 10月, 2020 1 次提交
  18. 27 9月, 2020 1 次提交
    • M
      mm: validate pmd after splitting · ce268425
      Minchan Kim 提交于
      syzbot reported the following KASAN splat:
      
        general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
        KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
        CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
        Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 <80> 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
        RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
        Call Trace:
          lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
          __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
          _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
          spin_lock include/linux/spinlock.h:354 [inline]
          madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
          walk_pmd_range mm/pagewalk.c:89 [inline]
          walk_pud_range mm/pagewalk.c:160 [inline]
          walk_p4d_range mm/pagewalk.c:193 [inline]
          walk_pgd_range mm/pagewalk.c:229 [inline]
          __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
          walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
          madvise_pageout_page_range mm/madvise.c:521 [inline]
          madvise_pageout mm/madvise.c:557 [inline]
          madvise_vma mm/madvise.c:946 [inline]
          do_madvise+0x12d0/0x2090 mm/madvise.c:1145
          __do_sys_madvise mm/madvise.c:1171 [inline]
          __se_sys_madvise mm/madvise.c:1169 [inline]
          __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
          do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The backing vma was shmem.
      
      In case of split page of file-backed THP, madvise zaps the pmd instead
      of remapping of sub-pages.  So we need to check pmd validity after
      split.
      
      Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce268425
  19. 06 9月, 2020 1 次提交
    • Y
      mm: madvise: fix vma user-after-free · 7867fd7c
      Yang Shi 提交于
      The syzbot reported the below use-after-free:
      
        BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
        BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
        BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
        Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996
      
        CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x18f/0x20d lib/dump_stack.c:118
          print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
          __kasan_report mm/kasan/report.c:513 [inline]
          kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
          madvise_willneed mm/madvise.c:293 [inline]
          madvise_vma mm/madvise.c:942 [inline]
          do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
          do_madvise mm/madvise.c:1169 [inline]
          __do_sys_madvise mm/madvise.c:1171 [inline]
          __se_sys_madvise mm/madvise.c:1169 [inline]
          __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Allocated by task 9992:
          kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
          vm_area_alloc+0x1c/0x110 kernel/fork.c:347
          mmap_region+0x8e5/0x1780 mm/mmap.c:1743
          do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
          vm_mmap_pgoff+0x195/0x200 mm/util.c:506
          ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 9992:
          kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
          remove_vma+0x132/0x170 mm/mmap.c:184
          remove_vma_list mm/mmap.c:2613 [inline]
          __do_munmap+0x743/0x1170 mm/mmap.c:2869
          do_munmap mm/mmap.c:2877 [inline]
          mmap_region+0x257/0x1780 mm/mmap.c:1716
          do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
          vm_mmap_pgoff+0x195/0x200 mm/util.c:506
          ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It is because vma is accessed after releasing mmap_lock, but someone
      else acquired the mmap_lock and the vma is gone.
      
      Releasing mmap_lock after accessing vma should fix the problem.
      
      Fixes: 692fe624 ("mm: Handle MADV_WILLNEED through vfs_fadvise()")
      Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[5.4+]
      Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7867fd7c
  20. 10 6月, 2020 2 次提交
  21. 25 4月, 2020 1 次提交
    • L
      mm: check that mm is still valid in madvise() · bc0c4d1e
      Linus Torvalds 提交于
      IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
      another process, which means that it can race with our crazy core dump
      handling which accesses the VM state without holding the mmap_sem
      (because it incorrectly thinks that it is the final user).
      
      This is clearly a core dumping problem, but we've never fixed it the
      right way, and instead have the notion of "check that the mm is still
      ok" using mmget_still_valid() after getting the mmap_sem for writing in
      any situation where we're not the original VM thread.
      
      See commit 04f5866e ("coredump: fix race condition between
      mmget_not_zero()/get_task_mm() and core dumping") for more background on
      this whole mmget_still_valid() thing.  You might want to have a barf bag
      handy when you do.
      
      We're discussing just fixing this properly in the only remaining core
      dumping routines.  But even if we do that, let's make do_madvise() do
      the right thing, and then when we fix core dumping, we can remove all
      these mmget_still_valid() checks.
      Reported-and-tested-by: NJann Horn <jannh@google.com>
      Fixes: c1ca757b ("io_uring: add IORING_OP_MADVISE")
      Acked-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc0c4d1e
  22. 22 3月, 2020 1 次提交
  23. 21 1月, 2020 1 次提交
  24. 02 12月, 2019 3 次提交
  25. 16 11月, 2019 1 次提交
反馈
建议
客服 返回
顶部