1. 15 1月, 2022 1 次提交
    • C
      mm: rearrange madvise code to allow for reuse · ac1e9acc
      Colin Cross 提交于
      Patch series "mm: rearrange madvise code to allow for reuse", v11.
      
      Avoid performance regression of the new anon vma name field refcounting it.
      
      I checked the image sizes with allnoconfig builds:
      
        unpatched Linus' ToT
           text    data     bss     dec     hex filename
        1324759      32   73928 1398719 1557bf vmlinux
      
        After the first patch is applied (madvise refactoring)
           text    data     bss     dec     hex filename
        1322346      32   73928 1396306 154e52 vmlinux
        >>> 2413 bytes decrease vs ToT <<<
      
        After all patches applied with CONFIG_ANON_VMA_NAME=n
           text    data     bss     dec     hex filename
        1322337      32   73928 1396297 154e49 vmlinux
        >>> 2422 bytes decrease vs ToT <<<
      
        After all patches applied with CONFIG_ANON_VMA_NAME=y
           text    data     bss     dec     hex filename
        1325228      32   73928 1399188 155994 vmlinux
        >>> 469 bytes increase vs ToT <<<
      
      This patch (of 3):
      
      Refactor the madvise syscall to allow for parts of it to be reused by a
      prctl syscall that affects vmas.
      
      Move the code that walks vmas in a virtual address range into a function
      that takes a function pointer as a parameter.  The only caller for now
      is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
      but the next patch will add an additional caller.
      
      Move handling all vma behaviors inside madvise_behavior, and rename it
      to madvise_vma_behavior.
      
      Move the code that updates the flags on a vma, including splitting or
      merging the vma as necessary, into a new function called
      madvise_update_vma.  The next patch will add support for updating a new
      anon_name field as well.
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-1-surenb@google.comSigned-off-by: NColin Cross <ccross@google.com>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac1e9acc
  2. 14 10月, 2021 1 次提交
  3. 04 9月, 2021 1 次提交
  4. 14 8月, 2021 1 次提交
    • D
      mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE) · eb2faa51
      David Hildenbrand 提交于
      Doing some extended tests and polishing the man page update for
      MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
      SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
      madvise() user error.
      
      We want to report only problematic mappings and permission problems that
      the user could have know as -EINVAL.
      
      Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
      but instead indicate -EFAULT to user space.  While we could also convert
      it to -ENOMEM, using -EFAULT looks more helpful when user space might
      want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
      not part of an final Linux release and we can still adjust the behavior.
      
      Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
      Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb2faa51
  5. 13 7月, 2021 1 次提交
  6. 01 7月, 2021 1 次提交
    • D
      mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables · 4ca9b385
      David Hildenbrand 提交于
      I. Background: Sparse Memory Mappings
      
      When we manage sparse memory mappings dynamically in user space - also
      sometimes involving MAP_NORESERVE - we want to dynamically populate/
      discard memory inside such a sparse memory region.  Example users are
      hypervisors (especially implementing memory ballooning or similar
      technologies like virtio-mem) and memory allocators.  In addition, we want
      to fail in a nice way (instead of generating SIGBUS) if populating does
      not succeed because we are out of backend memory (which can happen easily
      with file-based mappings, especially tmpfs and hugetlbfs).
      
      While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
      reliably discarding memory for most mapping types, there is no generic
      approach to populate page tables and preallocate memory.
      
      Although mmap() supports MAP_POPULATE, it is not applicable to the concept
      of sparse memory mappings, where we want to populate/discard dynamically
      and avoid expensive/problematic remappings.  In addition, we never
      actually report errors during the final populate phase - it is best-effort
      only.
      
      fallocate() can be used to preallocate file-based memory and fail in a
      safe way.  However, it cannot really be used for any private mappings on
      anonymous files via memfd due to COW semantics.  In addition, fallocate()
      does not actually populate page tables, so we still always get pagefaults
      on first access - which is sometimes undesired (i.e., real-time workloads)
      and requires real prefaulting of page tables, not just a preallocation of
      backend storage.  There might be interesting use cases for sparse memory
      regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
      as it does not prefault page tables.
      
      II. On preallcoation/prefaulting from user space
      
      Because we don't have a proper interface, what applications (like QEMU and
      databases) end up doing is touching (i.e., reading+writing one byte to not
      overwrite existing data) all individual pages.
      
      However, that approach
      1) Can result in wear on storage backing, because we end up reading/writing
         each page; this is especially a problem for dax/pmem.
      2) Can result in mmap_sem contention when prefaulting via multiple
         threads.
      3) Requires expensive signal handling, especially to catch SIGBUS in case
         of hugetlbfs/shmem/file-backed memory. For example, this is
         problematic in hypervisors like QEMU where SIGBUS handlers might already
         be used by other subsystems concurrently to e.g, handle hardware errors.
         "Simply" doing preallocation concurrently from other thread is not that
         easy.
      
      III. On MADV_WILLNEED
      
      Extending MADV_WILLNEED is not an option because
      1. It would change the semantics: "Expect access in the near future." and
         "might be a good idea to read some pages" vs. "Definitely populate/
         preallocate all memory and definitely fail on errors.".
      2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
         don't want populate/prealloc semantics. They treat this rather as a hint
         to give a little performance boost without too much overhead - and don't
         expect that a lot of memory might get consumed or a lot of time
         might be spent.
      
      IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
      
      Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
      MAP_POPULATE, with the following semantics:
      1. MADV_POPULATE_READ can be used to prefault page tables just like
         manually reading each individual page. This will not break any COW
         mappings. The shared zero page might get mapped and no backend storage
         might get preallocated -- allocation might be deferred to
         write-fault time. Especially shared file mappings require an explicit
         fallocate() upfront to actually preallocate backend memory (blocks in
         the file system) in case the file might have holes.
      2. If MADV_POPULATE_READ succeeds, all page tables have been populated
         (prefaulted) readable once.
      3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
         prefault page tables just like manually writing (or
         reading+writing) each individual page. This will break any COW
         mappings -- e.g., the shared zeropage is never populated.
      4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
         (prefaulted) writable once.
      5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
         mappings marked with VM_PFNMAP and VM_IO. Also, proper access
         permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
         mapping is encountered, madvise() fails with -EINVAL.
      6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
         might have been populated.
      7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
         when encountering a HW poisoned page in the range.
      8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
         cannot protect from the OOM (Out Of Memory) handler killing the
         process.
      
      While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
      preallocate memory and prefault page tables for VMs), one issue is that
      whenever we prefault pages writable, the pages have to be marked dirty,
      because the CPU could dirty them any time.  while not a real problem for
      hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
      page will be marked dirty and has to be written back later when evicting.
      
      MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
      mapping from backend storage without marking it dirty, such that eviction
      won't have to write it back.  As discussed above, shared file mappings
      might require an explciit fallocate() upfront to achieve
      preallcoation+prepopulation.
      
      Although sparse memory mappings are the primary use case, this will also
      be useful for other preallocate/prefault use cases where MAP_POPULATE is
      not desired or the semantics of MAP_POPULATE are not sufficient: as one
      example, QEMU users can trigger preallocation/prefaulting of guest RAM
      after the mapping was created -- and don't want errors to be silently
      suppressed.
      
      Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
      however, the main motivation back than was performance improvements --
      which should also still be the case.
      
      V. Single-threaded performance comparison
      
      I did a short experiment, prefaulting page tables on completely *empty
      mappings/files* and repeated the experiment 10 times.  The results
      correspond to the shortest execution time.  In general, the performance
      benefit for huge pages is negligible with small mappings.
      
      V.1: Private mappings
      
      POPULATE_READ and POPULATE_WRITE is fastest.  Note that
      Reading/POPULATE_READ will populate the shared zeropage where applicable
      -- which result in short population times.
      
      The fastest way to allocate backend storage (here: swap or huge pages) and
      prefault page tables is POPULATE_WRITE.
      
      V.2: Shared mappings
      
      fallocate() is fastest, however, doesn't prefault page tables.
      POPULATE_WRITE is faster than simple writes and read/writes.
      POPULATE_READ is faster than simple reads.
      
      Without a fd, the fastest way to allocate backend storage and prefault
      page tables is POPULATE_WRITE.  With an fd, the fastest way is usually
      FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
      exception are actual files: FALLOCATE+Read is slightly faster than
      FALLOCATE+POPULATE_READ.
      
      The fastest way to allocate backend storage prefault page tables is
      FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
      FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
      dirty.
      
      v.3: Detailed results
      
      ==================================================
      2 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :     0.119 ms
      Anon 4 KiB     : Write                    :     0.222 ms
      Anon 4 KiB     : Read/Write               :     0.380 ms
      Anon 4 KiB     : POPULATE_READ            :     0.060 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
      Memfd 4 KiB    : Read                     :     0.034 ms
      Memfd 4 KiB    : Write                    :     0.310 ms
      Memfd 4 KiB    : Read/Write               :     0.362 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      tmpfs          : Read                     :     0.033 ms
      tmpfs          : Write                    :     0.313 ms
      tmpfs          : Read/Write               :     0.406 ms
      tmpfs          : POPULATE_READ            :     0.039 ms
      tmpfs          : POPULATE_WRITE           :     0.285 ms
      file           : Read                     :     0.033 ms
      file           : Write                    :     0.351 ms
      file           : Read/Write               :     0.408 ms
      file           : POPULATE_READ            :     0.039 ms
      file           : POPULATE_WRITE           :     0.290 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      **************************************************
      4096 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :   237.940 ms
      Anon 4 KiB     : Write                    :   708.409 ms
      Anon 4 KiB     : Read/Write               :  1054.041 ms
      Anon 4 KiB     : POPULATE_READ            :   124.310 ms
      Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
      Memfd 4 KiB    : Read                     :   136.928 ms
      Memfd 4 KiB    : Write                    :   963.898 ms
      Memfd 4 KiB    : Read/Write               :  1106.561 ms
      Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
      Memfd 2 MiB    : Read                     :   357.116 ms
      Memfd 2 MiB    : Write                    :   357.210 ms
      Memfd 2 MiB    : Read/Write               :   357.606 ms
      Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
      tmpfs          : Read                     :   137.536 ms
      tmpfs          : Write                    :   954.362 ms
      tmpfs          : Read/Write               :  1105.954 ms
      tmpfs          : POPULATE_READ            :    80.289 ms
      tmpfs          : POPULATE_WRITE           :   822.826 ms
      file           : Read                     :   137.874 ms
      file           : Write                    :   987.025 ms
      file           : Read/Write               :  1107.439 ms
      file           : POPULATE_READ            :    80.413 ms
      file           : POPULATE_WRITE           :   857.622 ms
      hugetlbfs      : Read                     :   355.607 ms
      hugetlbfs      : Write                    :   355.729 ms
      hugetlbfs      : Read/Write               :   356.127 ms
      hugetlbfs      : POPULATE_READ            :   354.585 ms
      hugetlbfs      : POPULATE_WRITE           :   355.138 ms
      **************************************************
      2 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :     0.394 ms
      Anon 4 KiB     : Write                    :     0.348 ms
      Anon 4 KiB     : Read/Write               :     0.400 ms
      Anon 4 KiB     : POPULATE_READ            :     0.326 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
      Anon 2 MiB     : Read                     :     0.030 ms
      Anon 2 MiB     : Write                    :     0.030 ms
      Anon 2 MiB     : Read/Write               :     0.030 ms
      Anon 2 MiB     : POPULATE_READ            :     0.030 ms
      Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
      Memfd 4 KiB    : Read                     :     0.412 ms
      Memfd 4 KiB    : Write                    :     0.372 ms
      Memfd 4 KiB    : Read/Write               :     0.419 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
      Memfd 4 KiB    : FALLOCATE                :     0.137 ms
      Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
      Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      Memfd 2 MiB    : FALLOCATE                :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
      tmpfs          : Read                     :     0.416 ms
      tmpfs          : Write                    :     0.369 ms
      tmpfs          : Read/Write               :     0.425 ms
      tmpfs          : POPULATE_READ            :     0.346 ms
      tmpfs          : POPULATE_WRITE           :     0.295 ms
      tmpfs          : FALLOCATE                :     0.139 ms
      tmpfs          : FALLOCATE+Read           :     0.447 ms
      tmpfs          : FALLOCATE+Write          :     0.333 ms
      tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
      file           : Read                     :     0.191 ms
      file           : Write                    :     0.511 ms
      file           : Read/Write               :     0.524 ms
      file           : POPULATE_READ            :     0.196 ms
      file           : POPULATE_WRITE           :     0.434 ms
      file           : FALLOCATE                :     0.004 ms
      file           : FALLOCATE+Read           :     0.197 ms
      file           : FALLOCATE+Write          :     0.554 ms
      file           : FALLOCATE+Read/Write     :     0.480 ms
      file           : FALLOCATE+POPULATE_READ  :     0.201 ms
      file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      hugetlbfs      : FALLOCATE                :     0.030 ms
      hugetlbfs      : FALLOCATE+Read           :     0.031 ms
      hugetlbfs      : FALLOCATE+Write          :     0.031 ms
      hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
      **************************************************
      4096 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :  1053.090 ms
      Anon 4 KiB     : Write                    :   913.642 ms
      Anon 4 KiB     : Read/Write               :  1060.350 ms
      Anon 4 KiB     : POPULATE_READ            :   893.691 ms
      Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
      Anon 2 MiB     : Read                     :   358.553 ms
      Anon 2 MiB     : Write                    :   358.419 ms
      Anon 2 MiB     : Read/Write               :   357.992 ms
      Anon 2 MiB     : POPULATE_READ            :   357.533 ms
      Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
      Memfd 4 KiB    : Read                     :  1078.144 ms
      Memfd 4 KiB    : Write                    :   942.036 ms
      Memfd 4 KiB    : Read/Write               :  1100.391 ms
      Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
      Memfd 4 KiB    : FALLOCATE                :   304.632 ms
      Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
      Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
      Memfd 2 MiB    : Read                     :   358.131 ms
      Memfd 2 MiB    : Write                    :   358.099 ms
      Memfd 2 MiB    : Read/Write               :   358.250 ms
      Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
      Memfd 2 MiB    : FALLOCATE                :   356.735 ms
      Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
      Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
      tmpfs          : Read                     :  1087.265 ms
      tmpfs          : Write                    :   950.840 ms
      tmpfs          : Read/Write               :  1107.567 ms
      tmpfs          : POPULATE_READ            :   922.605 ms
      tmpfs          : POPULATE_WRITE           :   810.094 ms
      tmpfs          : FALLOCATE                :   306.320 ms
      tmpfs          : FALLOCATE+Read           :  1169.796 ms
      tmpfs          : FALLOCATE+Write          :   933.730 ms
      tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
      file           : Read                     :   654.101 ms
      file           : Write                    :  1259.142 ms
      file           : Read/Write               :  1289.509 ms
      file           : POPULATE_READ            :   661.642 ms
      file           : POPULATE_WRITE           :  1106.816 ms
      file           : FALLOCATE                :     1.864 ms
      file           : FALLOCATE+Read           :   656.328 ms
      file           : FALLOCATE+Write          :  1153.300 ms
      file           : FALLOCATE+Read/Write     :  1180.613 ms
      file           : FALLOCATE+POPULATE_READ  :   668.347 ms
      file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
      hugetlbfs      : Read                     :   357.245 ms
      hugetlbfs      : Write                    :   357.413 ms
      hugetlbfs      : Read/Write               :   357.120 ms
      hugetlbfs      : POPULATE_READ            :   356.321 ms
      hugetlbfs      : POPULATE_WRITE           :   356.693 ms
      hugetlbfs      : FALLOCATE                :   355.927 ms
      hugetlbfs      : FALLOCATE+Read           :   357.074 ms
      hugetlbfs      : FALLOCATE+Write          :   357.120 ms
      hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
      **************************************************
      
      [1] https://lkml.org/lkml/2013/6/27/698
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ca9b385
  7. 07 5月, 2021 1 次提交
  8. 14 3月, 2021 1 次提交
    • S
      mm/madvise: replace ptrace attach requirement for process_madvise · 96cfe2c0
      Suren Baghdasaryan 提交于
      process_madvise currently requires ptrace attach capability.
      PTRACE_MODE_ATTACH gives one process complete control over another
      process.  It effectively removes the security boundary between the two
      processes (in one direction).  Granting ptrace attach capability even to a
      system process is considered dangerous since it creates an attack surface.
      This severely limits the usage of this API.
      
      The operations process_madvise can perform do not affect the correctness
      of the operation of the target process; they only affect where the data is
      physically located (and therefore, how fast it can be accessed).  What we
      want is the ability for one process to influence another process in order
      to optimize performance across the entire system while leaving the
      security boundary intact.
      
      Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
      CAP_SYS_NICE.  PTRACE_MODE_READ to prevent leaking ASLR metadata and
      CAP_SYS_NICE for influencing process performance.
      
      Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96cfe2c0
  9. 30 1月, 2021 2 次提交
  10. 24 1月, 2021 2 次提交
  11. 16 12月, 2020 2 次提交
  12. 09 12月, 2020 1 次提交
    • M
      mm/madvise: remove racy mm ownership check · a68a0262
      Minchan Kim 提交于
      Jann spotted the security hole due to race of mm ownership check.
      
      If the task is sharing the mm_struct but goes through execve() before
      mm_access(), it could skip process_madvise_behavior_valid check.  That
      makes *any advice hint* to reach into the remote process.
      
      This patch removes the mm ownership check.  With it, it will lose the
      ability that local process could give *any* advice hint with vector
      interface for some reason (e.g., performance).  Since there is no
      concrete example in upstream yet, it would be better to remove the
      abiliity at this moment and need to review when such new advice comes
      up.
      
      Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NJann Horn <jannh@google.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a68a0262
  13. 23 11月, 2020 2 次提交
  14. 19 10月, 2020 2 次提交
    • M
      mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim 提交于
      There is usecase that System Management Software(SMS) want to give a
      memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
      case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses pidfd of an external process to give the
      hint.  It also supports vector address range because Android app has
      thousands of vmas due to zygote so it's totally waste of CPU and power if
      we should call the syscall one by one for each vma.(With testing 2000-vma
      syscall vs 1-vector syscall, it showed 15% performance improvement.  I
      think it would be bigger in real practice because the testing ran very
      cache friendly environment).
      
      Another potential use case for the vector range is to amortize the cost
      ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      future, we could find more usecases for other advises so let's make it
      happens as API since we introduce a new syscall at this moment.  With
      that, existing madvise(2) user could replace it with process_madvise(2)
      with their own pid if they want to have batch address ranges support
      feature.
      
      ince it could affect other process's address range, only privileged
      process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
      UID) gives it the right to ptrace the process could use it successfully.
      The flag argument is reserved for future use if we need to extend the API.
      
      I think supporting all hints madvise has/will supported/support to
      process_madvise is rather risky.  Because we are not sure all hints make
      sense from external process and implementation for the hint may rely on
      the caller being in the current context so it could be error-prone.  Thus,
      I just limited hints as MADV_[COLD|PAGEOUT] in this patch.
      
      If someone want to add other hints, we could hear the usecase and review
      it for each hint.  It's safer for maintenance rather than introducing a
      buggy syscall but hard to fix it later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges from external process as well as
            local process. It provides the advice to address ranges of process
            described by iovec and vlen. The goal of such advice is to improve
            system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidofd_open(2) for further information)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address(iov_base)
            and with size length of bytes(iov_len).
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            The process_madvise supports every advice madvise(2) has if target
            process is in same thread group with calling process so user could
            use process_madvise(2) to extend existing madvise(2) to support
            vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes, if an error occurred. The caller should check return value
            to determine whether a partial advice occurred.
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How to guarantee the race(i.e., object validation) between when
      giving a hint from an external process and get the hint from the target
      process?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the space
      target process can run between the time the process_madvise process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  It
      also apply other APIs like move_pages, process_vm_write.
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization by using other API or
      design from userside.  It shouldn't be part of API itself.  If someone
      needs more fine-grained synchronization rather than process level,
      there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
      applicable via using last reserved argument of the API but I don't
      think it's necessary right now since we have already ways to prevent
      the race so don't want to add additional complexity with more
      fine-grained optimization model.
      
      To make the API extend, it reserved an unsigned long as last argument
      so we could support it in future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and hinting a wrong
      VMA.  Furthermore, we want to act the hint in caller's context, not the
      callee's, because the callee is usually limited in cpuset/cgroups or
      even freezed state so they can't act by themselves quick enough, which
      causes more thrashing/kill.  It doesn't work if the target process are
      ptraced(e.g., strace, debugger, minidump) because a process can have at
      most one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.auSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
    • M
      mm/madvise: pass mm to do_madvise · 0726b01e
      Minchan Kim 提交于
      Patch series "introduce memory hinting API for external process", v9.
      
      Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
      that, application could give hints to kernel what memory range are
      preferred to be reclaimed.  However, in some platform(e.g., Android), the
      information required to make the hinting decision is not known to the app.
      Instead, it is known to a centralized userspace daemon(e.g.,
      ActivityManagerService), and that daemon must be able to initiate reclaim
      on its own without any app involvement.
      
      To solve the concern, this patch introduces new syscall -
      process_madvise(2).  Bascially, it's same with madvise(2) syscall but it
      has some differences.
      
      1. It needs pidfd of target process to provide the hint
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
         moment.  Other hints in madvise will be opened when there are explicit
         requests from community to prevent unexpected bugs we couldn't support.
      
      3. Only privileged processes can do something for other process's
         address space.
      
      For more detail of the new API, please see "mm: introduce external memory
      hinting API" description in this patchset.
      
      This patch (of 3):
      
      In upcoming patches, do_madvise will be called from external process
      context so we shouldn't asssume "current" is always hinted process's
      task_struct.
      
      Furthermore, we must not access mm_struct via task->mm, but obtain it via
      access_mm() once (in the following patch) and only use that pointer [1],
      so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
      safe, so we can use them further down the call stack.
      
      And let's pass current->mm as arguments of do_madvise so it shouldn't
      change existing behavior but prepare next patch to make review easy.
      
      [vbabka@suse.cz: changelog tweak]
      [minchan@kernel.org: use current->mm for io_uring]
        Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
      [akpm@linux-foundation.org: fix it for upstream changes]
      [akpm@linux-foundation.org: whoops]
      [rdunlap@infradead.org: add missing includes]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0726b01e
  15. 17 10月, 2020 3 次提交
  16. 14 10月, 2020 1 次提交
  17. 27 9月, 2020 1 次提交
    • M
      mm: validate pmd after splitting · ce268425
      Minchan Kim 提交于
      syzbot reported the following KASAN splat:
      
        general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
        KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
        CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
        Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 <80> 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
        RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
        Call Trace:
          lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
          __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
          _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
          spin_lock include/linux/spinlock.h:354 [inline]
          madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
          walk_pmd_range mm/pagewalk.c:89 [inline]
          walk_pud_range mm/pagewalk.c:160 [inline]
          walk_p4d_range mm/pagewalk.c:193 [inline]
          walk_pgd_range mm/pagewalk.c:229 [inline]
          __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
          walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
          madvise_pageout_page_range mm/madvise.c:521 [inline]
          madvise_pageout mm/madvise.c:557 [inline]
          madvise_vma mm/madvise.c:946 [inline]
          do_madvise+0x12d0/0x2090 mm/madvise.c:1145
          __do_sys_madvise mm/madvise.c:1171 [inline]
          __se_sys_madvise mm/madvise.c:1169 [inline]
          __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
          do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The backing vma was shmem.
      
      In case of split page of file-backed THP, madvise zaps the pmd instead
      of remapping of sub-pages.  So we need to check pmd validity after
      split.
      
      Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce268425
  18. 06 9月, 2020 1 次提交
    • Y
      mm: madvise: fix vma user-after-free · 7867fd7c
      Yang Shi 提交于
      The syzbot reported the below use-after-free:
      
        BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
        BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
        BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
        Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996
      
        CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x18f/0x20d lib/dump_stack.c:118
          print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
          __kasan_report mm/kasan/report.c:513 [inline]
          kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
          madvise_willneed mm/madvise.c:293 [inline]
          madvise_vma mm/madvise.c:942 [inline]
          do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
          do_madvise mm/madvise.c:1169 [inline]
          __do_sys_madvise mm/madvise.c:1171 [inline]
          __se_sys_madvise mm/madvise.c:1169 [inline]
          __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Allocated by task 9992:
          kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
          vm_area_alloc+0x1c/0x110 kernel/fork.c:347
          mmap_region+0x8e5/0x1780 mm/mmap.c:1743
          do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
          vm_mmap_pgoff+0x195/0x200 mm/util.c:506
          ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 9992:
          kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
          remove_vma+0x132/0x170 mm/mmap.c:184
          remove_vma_list mm/mmap.c:2613 [inline]
          __do_munmap+0x743/0x1170 mm/mmap.c:2869
          do_munmap mm/mmap.c:2877 [inline]
          mmap_region+0x257/0x1780 mm/mmap.c:1716
          do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
          vm_mmap_pgoff+0x195/0x200 mm/util.c:506
          ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
          do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It is because vma is accessed after releasing mmap_lock, but someone
      else acquired the mmap_lock and the vma is gone.
      
      Releasing mmap_lock after accessing vma should fix the problem.
      
      Fixes: 692fe624 ("mm: Handle MADV_WILLNEED through vfs_fadvise()")
      Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[5.4+]
      Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7867fd7c
  19. 10 6月, 2020 2 次提交
  20. 25 4月, 2020 1 次提交
    • L
      mm: check that mm is still valid in madvise() · bc0c4d1e
      Linus Torvalds 提交于
      IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
      another process, which means that it can race with our crazy core dump
      handling which accesses the VM state without holding the mmap_sem
      (because it incorrectly thinks that it is the final user).
      
      This is clearly a core dumping problem, but we've never fixed it the
      right way, and instead have the notion of "check that the mm is still
      ok" using mmget_still_valid() after getting the mmap_sem for writing in
      any situation where we're not the original VM thread.
      
      See commit 04f5866e ("coredump: fix race condition between
      mmget_not_zero()/get_task_mm() and core dumping") for more background on
      this whole mmget_still_valid() thing.  You might want to have a barf bag
      handy when you do.
      
      We're discussing just fixing this properly in the only remaining core
      dumping routines.  But even if we do that, let's make do_madvise() do
      the right thing, and then when we fix core dumping, we can remove all
      these mmget_still_valid() checks.
      Reported-and-tested-by: NJann Horn <jannh@google.com>
      Fixes: c1ca757b ("io_uring: add IORING_OP_MADVISE")
      Acked-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc0c4d1e
  21. 22 3月, 2020 1 次提交
  22. 21 1月, 2020 1 次提交
  23. 02 12月, 2019 3 次提交
  24. 16 11月, 2019 1 次提交
  25. 26 9月, 2019 4 次提交
    • M
      mm: factor out common parts between MADV_COLD and MADV_PAGEOUT · d616d512
      Minchan Kim 提交于
      There are many common parts between MADV_COLD and MADV_PAGEOUT.
      This patch factor them out to save code duplication.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d616d512
    • M
      mm: introduce MADV_PAGEOUT · 1a4e58cc
      Minchan Kim 提交于
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a4e58cc
    • M
      mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim 提交于
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
    • A
      mm: untag user pointers passed to memory syscalls · 057d3389
      Andrey Konovalov 提交于
      This patch is a part of a series that extends kernel ABI to allow to pass
      tagged user pointers (with the top byte set to something else other than
      0x00) as syscall arguments.
      
      This patch allows tagged pointers to be passed to the following memory
      syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
      mremap, msync, munlock, move_pages.
      
      The mmap and mremap syscalls do not currently accept tagged addresses.
      Architectures may interpret the tag as a background colour for the
      corresponding vma.
      
      Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: NVincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jens Wiklander <jens.wiklander@linaro.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      057d3389
  26. 25 9月, 2019 1 次提交
  27. 07 9月, 2019 1 次提交