1. 06 Mar 2019, 1 commit
    • mm,mremap: bail out earlier in mremap_to under map pressure · ea2c3f6f
      Committed by Oscar Salvador
      When the mremap() syscall is used with the MREMAP_FIXED flag, mremap()
      calls mremap_to(), which does the following:
      
      1) unmaps the destination region where we are going to move the map
      2) If the new region is going to be smaller, we unmap the last part
         of the old region
      
      Then, we will eventually call move_vma() to do the actual move.
      
      move_vma() checks whether we are at least 4 maps below max_map_count
      before going further, otherwise it bails out with -ENOMEM.  The problem
      is that we might have already unmapped the vma's in steps 1) and 2), so
      it is not possible for userspace to figure out the state of the vmas
      after it gets -ENOMEM, and it gets tricky for userspace to clean up
      properly on error path.
      
      While it is true that we can return -ENOMEM for more reasons (e.g: see
      may_expand_vm() or move_page_tables()), I think that we can avoid this
      scenario if we check early in mremap_to() if the operation has high
      chances to succeed map-wise.
      
      Should that not be the case, we can bail out before we even try to unmap
      anything, so we make sure the vma's are left untouched in case we are
      likely to be short of maps.
      
      The rule of thumb now is to assume the worst case: both vma's (the old
      region and the new region) get split in 3, so we end up with two more
      maps than the ones we already hold (one extra per vma).  If the current
      map count plus 2 still leaves us at least 4 maps below the threshold,
      we are also going to pass the check in move_vma().
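
      As a minimal sketch of that check (illustrative only: the constant and
      the placement follow the existing map_count test in move_vma(), this is
      not the verbatim upstream diff), the early bail-out in mremap_to()
      looks roughly like this:

        /*
         * Worst case both the old and the new vma get split in three,
         * i.e. two extra maps.  Bail out before unmapping anything if
         * that could already fail move_vma()'s own map_count check.
         */
        if (mm->map_count + 2 >= sysctl_max_map_count - 3)
                return -ENOMEM;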
      
      Of course, this is not free: it can generate false positives, cases
      where we really are tight map-wise but where the unmap operation would
      have released several vma's and left us in a good state.
      
      Another approach was also investigated [1], but it may be too much
      hassle for what it brings.
      
      [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/
      
      Link: http://lkml.kernel.org/r/20190226091314.18446-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea2c3f6f
  2. 05 Jan 2019, 2 commits
    • mm: speed up mremap by 20x on large regions · 2c91bd4a
      Committed by Joel Fernandes (Google)
      Android needs to mremap large regions of memory during memory
      management related operations.  The mremap system call can be really
      slow if THP is not enabled.  The bottleneck is move_page_tables, which
      copies one pte at a time and can be really slow across a large map.
      Turning on THP may not be a viable option, and is not for us.  This
      patch speeds up mremap on non-THP systems by copying at the PMD level
      when possible.
      
      The speedup is an order of magnitude on x86 (~20x).  On a 1GB mremap,
      the mremap completion time drops from 3.4-3.6 milliseconds to 144-160
      microseconds.
      
      Before:
      Total mremap time for 1GB data: 3521942 nanoseconds.
      Total mremap time for 1GB data: 3449229 nanoseconds.
      Total mremap time for 1GB data: 3488230 nanoseconds.
      
      After:
      Total mremap time for 1GB data: 150279 nanoseconds.
      Total mremap time for 1GB data: 144665 nanoseconds.
      Total mremap time for 1GB data: 158708 nanoseconds.
      
      If THP is enabled the optimization is mostly skipped except in certain
      situations.
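
      A condensed sketch of the PMD-level move described above (simplified:
      the real move_normal_pmd() also takes care of TLB flushing and a few
      more corner cases):

        static bool move_normal_pmd(struct vm_area_struct *vma,
                                    unsigned long old_addr, unsigned long new_addr,
                                    pmd_t *old_pmd, pmd_t *new_pmd)
        {
                struct mm_struct *mm = vma->vm_mm;
                spinlock_t *old_ptl, *new_ptl;
                pmd_t pmd;

                /* Only safe if the destination PMD slot is empty. */
                if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
                        return false;

                old_ptl = pmd_lock(mm, old_pmd);
                new_ptl = pmd_lockptr(mm, new_pmd);
                if (new_ptl != old_ptl)
                        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

                /* Steal the whole PTE page from the old slot... */
                pmd = *old_pmd;
                pmd_clear(old_pmd);
                /* ...and re-plug it at the new address in one shot,
                 * instead of copying 512 PTEs one by one. */
                pmd_populate(mm, new_pmd, pmd_pgtable(pmd));

                if (new_ptl != old_ptl)
                        spin_unlock(new_ptl);
                spin_unlock(old_ptl);
                return true;
        }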
      
      [joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning]
        Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com
      Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c91bd4a
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Committed by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is a concern that the
      extra 'address' argument that mremap passes to pte_alloc may do
      something subtle and architecture-specific in the future that would
      make the scheme not work.  Also, we find there is no point in passing
      the 'address' to pte_alloc since it is unused.  This patch therefore
      removes the argument tree-wide, resulting in a nice negative diff as
      well, while ensuring along the way that the enabled architectures do
      not do anything funky with the 'address' argument that would go
      unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script
      (thanks to Julia for answering all Coccinelle questions!).
      The following fix-ups were done manually:
      * Removal of the address argument from pte_fragment_alloc
      * Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
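
      For illustration, the net effect at a typical call site (an
      illustrative snippet, not a hunk from the actual patch) is simply that
      the address argument disappears:

        /* before */
        pte = pte_alloc_one_kernel(mm, address);
        /* after  */
        pte = pte_alloc_one_kernel(mm);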
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cf58924
  3. 29 Dec 2018, 1 commit
  4. 27 Oct 2018, 1 commit
  5. 18 Oct 2018, 1 commit
  6. 15 Jun 2018, 1 commit
    • mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns · 37a4094e
      Committed by Mel Gorman
      Commit 5d190420 ("mremap: fix race between mremap() and page
      cleanning") fixed races between mremap and other operations for both
      file-backed and anonymous mappings.  The file-backed case was the most
      critical, as it allowed the possibility that data could be changed on a
      physical page after page_mkclean returned, which could trigger data
      loss or data integrity issues.
      
      A customer reported that the cost of the TLB flushes for anonymous
      mappings was excessive, resulting in a 30-50% overall performance drop
      on a microbenchmark since this commit.  Unfortunately I neither have
      access to the test case nor can I describe what it does, other than
      saying that mremap operations dominate heavily.
      
      This patch removes LATENCY_LIMIT so that TLB flushes are handled on a
      PMD boundary instead of every 64 pages, reducing the number of TLB
      shootdowns by a factor of 8 in the ideal case.  LATENCY_LIMIT was
      almost certainly introduced to limit PTL hold times, but the latency
      savings are likely offset by the cost of IPIs in many cases.  This
      patch is not reported to completely restore performance, but it gets
      within an acceptable percentage.  The metric given here is simply
      described as "higher is better".
      
      Baseline that was known good
      002:  Metric:       91.05
      004:  Metric:      109.45
      008:  Metric:       73.08
      016:  Metric:       58.14
      032:  Metric:       61.09
      064:  Metric:       57.76
      128:  Metric:       55.43
      
      Current
      001:  Metric:       54.98
      002:  Metric:       56.56
      004:  Metric:       41.22
      008:  Metric:       35.96
      016:  Metric:       36.45
      032:  Metric:       35.71
      064:  Metric:       35.73
      128:  Metric:       34.96
      
      With patch
      001:  Metric:       61.43
      002:  Metric:       81.64
      004:  Metric:       67.92
      008:  Metric:       51.67
      016:  Metric:       50.47
      032:  Metric:       52.29
      064:  Metric:       50.01
      128:  Metric:       49.04
      
      So for low thread counts performance is not restored, but for larger
      numbers of threads it is closer to the "known good" baseline.
      
      Using a different mremap-intensive workload that is not representative
      of the real workload, there is little difference observed outside of
      noise in the headline metrics.  However, TLB shootdowns are reduced by
      11% on average and, at the peak, by 21%.  Interrupts were sampled every
      second while the workload ran to obtain those figures.  It's known that
      the figures will vary, as the non-representative load is
      non-deterministic.
      
      An alternative patch was posted that should have significantly reduced
      the TLB flushes but unfortunately it does not perform as well as this
      version on the customer test case.  If revisited, the two patches can
      stack on top of each other.
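
      A sketch of what the move_page_tables() stepping looks like with
      LATENCY_LIMIT gone (simplified and approximate: the extent is now
      clamped only by PMD and range boundaries, so one flush can cover a
      whole PMD's worth of pages):

        for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
                /* Step to the next PMD boundary, not LATENCY_LIMIT. */
                next = (old_addr + PMD_SIZE) & PMD_MASK;
                extent = min(next - old_addr, old_end - old_addr);
                next = (new_addr + PMD_SIZE) & PMD_MASK;
                if (extent > next - new_addr)
                        extent = next - new_addr;
                move_ptes(vma, old_pmd, old_addr, old_addr + extent,
                          new_vma, new_pmd, new_addr, need_flush);
        }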
      
      Link: http://lkml.kernel.org/r/20180606183803.k7qaw2xnbvzshv34@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37a4094e
  7. 02 Nov 2017, 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Committed by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the
      'GPL-2.0' SPDX license identifier.  The SPDX identifier is a legally
      binding shorthand, which can be used instead of the full boilerplate
      text.
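
      For example, the identifier added to a plain C source file that had no
      licensing text is a single line at the top (a */uapi/* file would carry
      "GPL-2.0 WITH Linux-syscall-note" instead):

        // SPDX-License-Identifier: GPL-2.0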
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset
      of the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier should be
      applied to a file was done in a spreadsheet of side-by-side results
      from the output of two independent scanners (ScanCode & Windriver)
      producing SPDX tag:value files, created by Philippe Ombredanne.
      Philippe prepared the base worksheet and did an initial spot review of
      a few thousand files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      The criteria used to select files for SPDX license identifier tagging
      were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.
      The Windriver scanner is in part based on an older version of
      FOSSology, so they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with the SPDX license identifiers in
      the files he inspected.  For the non-uapi files, Thomas did random spot
      checks in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to
      have copy/paste license identifier errors and have been fixed to
      reflect the correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  8. 09 Sep 2017, 1 commit
    • mm: thp: check pmd migration entry in common path · 84c3fc4e
      Committed by Zi Yan
      When THP migration is being used, memory management code needs to handle
      pmd migration entries properly.  This patch uses !pmd_present() or
      is_swap_pmd() (depending on whether pmd_none() needs separate code or
      not) to check pmd migration entries at the places where a pmd entry is
      present.
      
      Since pmd-related code uses split_huge_page(), split_huge_pmd(),
      pmd_trans_huge(), pmd_trans_unstable(), or
      pmd_none_or_trans_huge_or_clear_bad(), this patch:
      
      1. adds pmd migration entry split code in split_huge_pmd(),
      
      2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
      
      3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
      
      Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
      is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
      them.
      
      Until this commit, a pmd entry should be:
      1. pointing to a pte page,
      2. is_swap_pmd(),
      3. pmd_trans_huge(),
      4. pmd_devmap(), or
      5. pmd_none().
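
      The helper the description leans on can be sketched as follows (this
      matches the intent of is_swap_pmd(); the in-tree version adds the
      THP-migration config guards):

        static inline int is_swap_pmd(pmd_t pmd)
        {
                /* Not empty and not present => a swap/migration entry. */
                return !pmd_none(pmd) && !pmd_present(pmd);
        }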
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84c3fc4e
  9. 07 Sep 2017, 1 commit
  10. 03 Aug 2017, 2 commits
    • userfaultfd: non-cooperative: notify about unmap of destination during mremap · b2282371
      Committed by Mike Rapoport
      When mremap is called with MREMAP_FIXED it unmaps memory at the
      destination address without notifying userfaultfd monitor.
      
      If the destination were registered with userfaultfd, the monitor has no
      way to distinguish between the old and new ranges and to properly relate
      the page faults that would occur in the destination region.
      
      Fixes: 897ab3e0 ("userfaultfd: non-cooperative: add event for memory unmaps")
      Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2282371
    • mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Committed by Mel Gorman
      Nadav Amit identified a theoretical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being
      held.
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
      theoretically possible, so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
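
      A sketch of that flush-if-batching idea (simplified; the real helper
      also documents why the unsynchronized clearing of the flag is
      acceptable):

        void flush_tlb_batched_pending(struct mm_struct *mm)
        {
                if (mm->tlb_flush_batched) {
                        /* Reclaim deferred a flush for this mm; do it now,
                         * before relying on the PTEs we are about to touch. */
                        flush_tlb_mm(mm);
                        mm->tlb_flush_batched = false;
                }
        }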
      
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup, racing with
      reclaim, look just at PTEs.  Huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea27719
  11. 10 Mar 2017, 1 commit
  12. 25 Feb 2017, 1 commit
  13. 23 Feb 2017, 2 commits
  14. 30 Nov 2016, 1 commit
  15. 18 Nov 2016, 1 commit
    • mremap: fix race between mremap() and page cleanning · 5d190420
      Committed by Aaron Lu
      Prior to 3.15, there was a race between zap_pte_range() and
      page_mkclean() where writes to a page could be lost.  Dave Hansen
      discovered by inspection that there is a similar race between
      move_ptes() and page_mkclean().
      
      We've been able to reproduce the issue by enlarging the race window with
      a msleep(), but have not been able to hit it without modifying the code.
      So, we think it's a real issue, but is difficult or impossible to hit in
      practice.
      
      The zap_pte_range() issue was fixed by commit 1cf35d47 ("mm: split
      'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  This
      patch fixes the race between page_mkclean() and mremap().
      
      Here is one possible way to hit the race: suppose a process mmapped a
      file with READ | WRITE and SHARED, it has two threads and they are bound
      to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
      1 did a write to addr X so that CPU1 now has a writable TLB for addr X
      on it.  Thread 2 starts mremaping from addr X to Y while thread 1
      cleaned the page and then did another write to the old addr X again.
      The 2nd write from thread 1 could succeed but the value will get lost.
      
              thread 1                           thread 2
           (bound to CPU1)                    (bound to CPU2)
      
        1: write 1 to addr X to get a
           writeable TLB on this CPU
      
                                              2: mremap starts
      
                                              3: move_ptes emptied PTE for addr X
                                                 and setup new PTE for addr Y and
                                                 then dropped PTL for X and Y
      
        4: page laundering for N by doing
           fadvise FADV_DONTNEED. When done,
           pageframe N is deemed clean.
      
        5: *write 2 to addr X
      
                                              6: tlb flush for addr X
      
        7: munmap (Y, pagesize) to make the
           page unmapped
      
        8: fadvise with FADV_DONTNEED again
           to kick the page off the pagecache
      
        9: pread the page from file to verify
           the value. If 1 is there, it means
           we have lost the written 2.
      
        *the write may or may not cause segmentation fault, it depends on
        if the TLB is still on the CPU.
      
      Please note that this is only one specific way in which the race could
      occur; it does not mean that the race can only occur in exactly the
      above configuration, e.g. more than 2 threads could be involved,
      fadvise() could be done in another thread, etc.
      
      Anonymous pages could also race between mremap() and page reclaim:
      with THP, a huge PMD is moved by mremap to a new huge PMD, then the new
      huge PMD gets unmapped/split/paged out before the TLB flush happened
      for the old huge PMD in move_page_tables(), and we could still write
      data to it.  Normal anonymous pages are in a similar situation.
      
      To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd()
      and, if any is found, do the flush before dropping the PTL.  If we do
      the flush for every move_ptes()/move_huge_pmd() call then we do not
      need to flush the whole range in move_page_tables().  But if we didn't,
      we still need to do the whole-range flush.
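
      In sketch form, the per-PTE handling in move_ptes() becomes roughly
      the following (simplified; the actual change also covers
      move_huge_pmd()):

        /* inside the PTE copy loop */
        pte = ptep_get_and_clear(mm, old_addr, old_pte);
        /*
         * A dirty PTE means another CPU may still hold a writable TLB
         * entry for old_addr, so remember to flush before the PTL drops.
         */
        if (pte_present(pte) && pte_dirty(pte))
                force_flush = true;
        set_pte_at(mm, new_addr, new_pte,
                   move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr));

        /* after the loop, still before the PTLs are dropped */
        if (force_flush)
                flush_tlb_range(vma, old_end - len, old_end);
        else
                *need_flush = true;     /* caller flushes the whole range */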
      
      Alternatively, we can track which part of the range is flushed in
      move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
      range in move_page_tables().  But that would require multiple tlb
      flushes for the different sub-ranges and should be less efficient than
      the single whole range flush.
      
      KBuild test on my Sandybridge desktop doesn't show any noticeable change.
      v4.9-rc4:
        real    5m14.048s
        user    32m19.800s
        sys     4m50.320s
      
      With this commit:
        real    5m13.888s
        user    32m19.330s
        sys     4m51.200s
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d190420
  16. 27 Jul 2016, 1 commit
  17. 24 May 2016, 1 commit
    • mm: make mmap_sem for write waits killable for mm syscalls · dc0ef0df
      Committed by Michal Hocko
      This is follow-up work for the oom_reaper [1].  As the async OOM
      killing depends on oom_sem for read, we would really appreciate it if a
      holder for write didn't stand in the way.  This patchset changes many
      down_write calls to be killable, to help those cases where the writer
      is blocked waiting for readers to release the lock, and so helps
      __oom_reap_task to process the oom victim.
      
      Most of the patches are really trivial because the lock is held from
      shallow syscall paths where we can return EINTR trivially and allow the
      current task to die (note that EINTR will never get to userspace as the
      task has a fatal signal pending).  Others seem easy as well, as the
      callers are already handling fatal errors and bail out and return to
      userspace, which should be sufficient to handle the failure gracefully.
      I am not familiar with all those code paths, so a deeper review is
      really appreciated.
      
      As this work is touching more areas which are not directly connected I
      have tried to keep the CC list as small as possible and people who I
      believed would be familiar are CCed only to the specific patches (all
      should have received the cover though).
      
      This patchset is based on linux-next and it depends on
      down_write_killable for rw_semaphores which got merged into tip
      locking/rwsem branch and it is merged into this next tree.  I guess it
      would be easiest to route these patches via mmotm because of the
      dependency on the tip tree but if respective maintainers prefer other
      way I have no objections.
      
      I haven't covered all the mmap_write(mm->mmap_sem) instances here
      
        $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
        98
        $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
        62
      
      I have tried to cover those which should be relatively easy to review in
      this series because this alone should be a nice improvement.  Other
      places can be changed on top.
      
      [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
      [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org
      
      This patch (of 18):
      
      This is the first step in making mmap_sem write waiters killable.  It
      focuses on the trivial ones which are taking the lock early after
      entering the syscall and they are not changing state before.
      
      Therefore it is very easy to change them to use down_write_killable and
      immediately return with -EINTR.  This will allow the waiter to pass away
      without blocking the mmap_sem which might be required to make a forward
      progress.  E.g.  the oom reaper will need the lock for reading to
      dismantle the OOM victim address space.
      
      The only tricky function in this patch is vm_mmap_pgoff which has many
      call sites via vm_mmap.  To reduce the risk keep vm_mmap with the
      original non-killable semantic for now.
      
      vm_munmap callers do not bother checking the return value so open code
      it into the munmap syscall path for now for simplicity.
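
      The typical conversion applied at syscall entry is tiny; a sketch:

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;  /* fatal signal pending, let the task die */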
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc0ef0df
  18. 20 May 2016, 2 commits
    • huge pagecache: extend mremap pmd rmap lockout to files · 1d069b7d
      Committed by Hugh Dickins
      Whatever huge pagecache implementation we go with, file rmap locking
      must be added to anon rmap locking, when mremap's move_page_tables()
      finds a pmd_trans_huge pmd entry: a simple change, let's do it now.
      
      Factor out take_rmap_locks() and drop_rmap_locks() to handle the
      locking for move_ptes() and move_page_tables(), and delete the
      VM_BUG_ON_VMA which rejected vm_file and required anon_vma.
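
      A sketch of the factored-out helpers (essentially what the description
      says: take the file rmap lock in addition to the anon one):

        static void take_rmap_locks(struct vm_area_struct *vma)
        {
                if (vma->vm_file)
                        i_mmap_lock_write(vma->vm_file->f_mapping);
                if (vma->anon_vma)
                        anon_vma_lock_write(vma->anon_vma);
        }

        static void drop_rmap_locks(struct vm_area_struct *vma)
        {
                if (vma->anon_vma)
                        anon_vma_unlock_write(vma->anon_vma);
                if (vma->vm_file)
                        i_mmap_unlock_write(vma->vm_file->f_mapping);
        }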
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d069b7d
    • huge mm: move_huge_pmd does not need new_vma · bf8616d5
      Committed by Hugh Dickins
      Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
      a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
      the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
      it was in the old vma, alignment and size permitting.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf8616d5
  19. 18 Mar 2016, 2 commits
  20. 12 Feb 2016, 1 commit
    • mm, dax: check for pmd_none() after split_huge_pmd() · 6b9116a6
      Committed by Kirill A. Shutemov
      DAX implements split_huge_pmd() by clearing pmd.  This simple approach
      reduces memory overhead, as we don't need to deposit page table on huge
      page mapping to make split_huge_pmd() never-fail.  PTE table can be
      allocated and populated later on page fault from backing store.
      
      But one side effect is that we have to check whether the pmd is
      pmd_none() after split_huge_pmd().  In most places we do this already
      to deal with parallel MADV_DONTNEED.
      
      But I found two call sites which are not affected by MADV_DONTNEED
      (due to down_write(mmap_sem)) but need the check to work with DAX
      properly.
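
      The pattern added at those call sites can be sketched as (simplified
      from the description, not the verbatim diff):

        if (pmd_trans_huge(*old_pmd)) {
                split_huge_pmd(vma, old_pmd, old_addr);
                /* DAX may have simply cleared the pmd: nothing to move. */
                if (pmd_none(*old_pmd))
                        continue;
        }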
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b9116a6
  21. 16 Jan 2016, 2 commits
  22. 15 Jan 2016, 1 commit
    • mm: rework virtual memory accounting · 84638335
      Committed by Konstantin Khlebnikov
      While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
      (which tests the RLIMIT_DATA value to figure out whether we're allowed
      to assign new @start_brk, @brk, @start_data and @end_data in
      mm_struct), it became clear that RLIMIT_DATA, in the form it's
      implemented now, doesn't do anything useful because most user-space
      libraries use the mmap() syscall for dynamic memory allocations.
      
      Linus suggested to convert RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be a strange beast like a readonly-private or
      VM_IO area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
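
      The classification above maps onto small predicates on vm_flags; a
      sketch that mirrors the table (the in-tree helpers may spell the stack
      bits via a VM_STACK alias):

        static inline bool is_exec_mapping(vm_flags_t flags)
        {
                return (flags & (VM_EXEC | VM_WRITE |
                                 VM_GROWSUP | VM_GROWSDOWN)) == VM_EXEC;
        }

        static inline bool is_stack_mapping(vm_flags_t flags)
        {
                return flags & (VM_GROWSUP | VM_GROWSDOWN);
        }

        static inline bool is_data_mapping(vm_flags_t flags)
        {
                return (flags & (VM_WRITE | VM_SHARED |
                                 VM_GROWSUP | VM_GROWSDOWN)) == VM_WRITE;
        }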
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84638335
  23. 05 Jan 2016, 1 commit
    • x86/mm/pat: Add untrack_pfn_moved for mremap · d9fe4fab
      Committed by Toshi Kani
      mremap() with MREMAP_FIXED on a VM_PFNMAP range causes the following
      WARN_ON_ONCE() message in untrack_pfn().
      
        WARNING: CPU: 1 PID: 3493 at arch/x86/mm/pat.c:985 untrack_pfn+0xbd/0xd0()
        Call Trace:
        [<ffffffff817729ea>] dump_stack+0x45/0x57
        [<ffffffff8109e4b6>] warn_slowpath_common+0x86/0xc0
        [<ffffffff8109e5ea>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8106a88d>] untrack_pfn+0xbd/0xd0
        [<ffffffff811d2d5e>] unmap_single_vma+0x80e/0x860
        [<ffffffff811d3725>] unmap_vmas+0x55/0xb0
        [<ffffffff811d916c>] unmap_region+0xac/0x120
        [<ffffffff811db86a>] do_munmap+0x28a/0x460
        [<ffffffff811dec33>] move_vma+0x1b3/0x2e0
        [<ffffffff811df113>] SyS_mremap+0x3b3/0x510
        [<ffffffff817793ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      MREMAP_FIXED moves a pfnmap from old vma to new vma.  untrack_pfn() is
      called with the old vma after its pfnmap page table has been removed,
      which causes follow_phys() to fail.  The new vma has a new pfnmap to
      the same pfn & cache type with VM_PAT set.  Therefore, we only need to
      clear VM_PAT from the old vma in this case.
      
      Add untrack_pfn_moved(), which clears VM_PAT from a given old vma.
      move_vma() is changed to call this function with the old vma when
      VM_PFNMAP is set.  move_vma() then calls do_munmap(), and untrack_pfn()
      is a no-op since VM_PAT is cleared.
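
      In sketch form (per the description; the x86 implementation lives in
      arch/x86/mm/pat.c):

        /* The pfnmap itself moved; only VM_PAT on the old vma must go. */
        void untrack_pfn_moved(struct vm_area_struct *vma)
        {
                vma->vm_flags &= ~VM_PAT;
        }

        /* ...and in move_vma(), before do_munmap() of the old range: */
        if (unlikely(vma->vm_flags & VM_PFNMAP))
                untrack_pfn_moved(vma);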
      Reported-by: Stas Sergeev <stsp@list.ru>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1450832064-10093-2-git-send-email-toshi.kani@hpe.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      d9fe4fab
  24. 06 Nov 2015, 1 commit
  25. 05 Sep 2015, 5 commits
  26. 25 Jun 2015, 1 commit
    • mm: new arch_remap() hook · 4abad2ca
      Committed by Laurent Dufour
      Some architectures would like to be notified when a memory area is
      moved through the mremap system call.
      
      This patch introduces a new arch_remap() mm hook which is placed in the
      path of mremap, and is called before the old area is unmapped (and the
      arch_unmap() hook is called).
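
      The generic fallback is expected to be a no-op; a sketch of the hook
      shape (architectures that care provide their own definition):

        #ifndef arch_remap
        static inline void arch_remap(struct mm_struct *mm,
                                      unsigned long old_start, unsigned long old_end,
                                      unsigned long new_start, unsigned long new_end)
        {
                /* default: nothing to do */
        }
        #define arch_remap arch_remap
        #endif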
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4abad2ca
  27. 16 Apr 2015, 2 commits
  28. 07 Apr 2015, 1 commit
    • fix mremap() vs. ioctx_kill() race · b2edffdd
      Committed by Al Viro
      Teach the ->mremap() method to return an error and have it fail for
      aio mappings that are in the process of being killed.
      
      Note that in case of ->mremap() failure we need to undo move_page_tables()
      we'd already done; we could call ->mremap() first, but then the failure of
      move_page_tables() would require undoing whatever _successful_ ->mremap()
      has done, which would be a lot more headache in general.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b2edffdd
  29. 11 Feb 2015, 1 commit