1. 07 Sep, 2019 1 commit
  2. 03 Jul, 2019 1 commit
  3. 15 May, 2019 2 commits
    • J
      mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Jérôme Glisse authored
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events for the rationale behind this.
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7269f999
    • J
      mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse authored
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      Users of the mmu notifier API track changes to the CPU page table and take
      specific actions for them.  However, the current API only provides the range
      of virtual addresses affected by the change, not why the change is happening.
      
      This patchset does the initial mechanical conversion of all the places that
      call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
      event as well as the vma if it is known (most invalidations happen against
      a given vma).  Passing down the vma allows the users of mmu notifier to
      inspect the new vma page protection.
      
      MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
      should assume that everything in the range is going away when that event
      happens.  A later patch converts the mm call paths to use a more
      appropriate event for each call.
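
To illustrate why the event matters to a listener, here is a self-contained userspace mock, not kernel code. Only MMU_NOTIFY_UNMAP and the argument order (range, event, flags, vma, mm, start, end) are taken from the description above. The `demo_` types, the placeholder event `DEMO_NOTIFY_OTHER`, and the listener policy are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace mock. Only MMU_NOTIFY_UNMAP is named in the commit message;
 * DEMO_NOTIFY_OTHER is a hypothetical stand-in for the more specific
 * events that the follow-up patch switches call sites to. */
enum demo_event {
    MMU_NOTIFY_UNMAP = 0,
    DEMO_NOTIFY_OTHER,
};

struct demo_vma { unsigned long vm_flags; };   /* hypothetical, minimal */

struct demo_range {
    enum demo_event event;
    unsigned flags;
    const struct demo_vma *vma;   /* NULL when no vma is known */
    unsigned long start, end;
};

/* Mirrors the converted argument order: event, flags and vma come
 * right after the range and before mm, start, end. */
static void demo_range_init(struct demo_range *r, enum demo_event event,
                            unsigned flags, const struct demo_vma *vma,
                            const void *mm, unsigned long start,
                            unsigned long end)
{
    (void)mm;   /* the mock has no use for the mm */
    r->event = event;
    r->flags = flags;
    r->vma = vma;
    r->start = start;
    r->end = end;
}

/* Hypothetical listener policy: on the safe default event, assume the
 * whole range is going away and drop every mirrored mapping. */
bool demo_must_drop_everything(const struct demo_range *r)
{
    return r->event == MMU_NOTIFY_UNMAP;
}

/* Builds a range the way a converted call site without a vma would. */
bool demo_default_call_site(void)
{
    struct demo_range r;

    demo_range_init(&r, MMU_NOTIFY_UNMAP, 0, NULL, NULL, 0x1000, 0x2000);
    assert(r.vma == NULL);
    return demo_must_drop_everything(&r);
}
```

Because every converted call site starts out with MMU_NOTIFY_UNMAP, a listener written like this behaves exactly as before the conversion and only gains room to do less work once call sites pass a narrower event.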
      
      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4f13e8
  4. 03 Apr, 2019 1 commit
  5. 29 Dec, 2018 1 commit
  6. 06 Oct, 2018 1 commit
  7. 30 Sep, 2018 1 commit
    • M
      xarray: Replace exceptional entries · 3159f943
      Matthew Wilcox authored
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
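
A minimal userspace sketch of such a value-entry encoding follows. It assumes the lowest bit serves as the tag that distinguishes a value from a pointer, which leaves BITS_PER_LONG - 1 bits for the payload. The `demo_` helpers are hypothetical and only modeled on the idea described above; they are not the kernel's implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: shift the integer left by one and set bit 0.
 * Normally aligned pointers have bit 0 clear, so the two kinds of
 * entry can share one slot without ambiguity. */
static inline void *demo_mk_value(unsigned long v)
{
    return (void *)(uintptr_t)((v << 1) | 1);
}

static inline bool demo_is_value(const void *entry)
{
    return ((uintptr_t)entry & 1) != 0;
}

static inline unsigned long demo_to_value(const void *entry)
{
    return (unsigned long)((uintptr_t)entry >> 1);
}

static long demo_object;   /* stands in for any pointer stored in the array */

/* Returns 1 if the largest storable value survives a round trip and a
 * plain pointer is not mistaken for a value. */
int demo_roundtrip(void)
{
    unsigned long max = ULONG_MAX >> 1;   /* BITS_PER_LONG - 1 bits */
    void *entry = demo_mk_value(max);

    assert(demo_is_value(entry));
    return demo_to_value(entry) == max && !demo_is_value(&demo_object);
}
```

Values wider than BITS_PER_LONG - 1 bits would lose their top bit in the shift, so a caller of this sketch has to keep stored integers within that range.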
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      3159f943
  8. 24 Jul, 2018 1 commit
    • D
      mm, madvise_inject_error: Let memory_failure() optionally take a page reference · 23e7b5c2
      Dan Williams authored
      The madvise_inject_error() routine uses get_user_pages() to look up the
      pfn and other information for the injected error, but it does not release
      that pin. The assumption is that failed pages should be taken out of
      circulation.
      
      However, for dax mappings it is not possible to take pages out of
      circulation since they are 1:1 physically mapped as filesystem blocks,
      or device-dax capacity. They also typically represent persistent memory
      which has an error clearing capability.
      
      In preparation for adding a special handler for dax mappings, shift the
      responsibility of taking the page reference to memory_failure(). I.e.
      drop the page reference and do not specify MF_COUNT_INCREASED to
      memory_failure().
      
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      23e7b5c2
  9. 24 Jan, 2018 1 commit
  10. 30 Nov, 2017 1 commit
    • C
      mm/madvise.c: fix madvise() infinite loop under special circumstances · 6ea8d958
      chenjie authored
      MADV_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
      Unfortunately madvise_willneed() doesn't communicate this information
      properly to the generic madvise syscall implementation.  The calling
      convention is quite subtle there.  madvise_vma() is supposed to either
      return an error or update &prev; otherwise the main loop will never
      advance to the next vma and it will keep looping forever without a way
      to get out of the kernel.
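
The calling convention can be shown with a simplified userspace model of the loop. This is a sketch under stated assumptions, not the code in mm/madvise.c: the `demo_` names, the sentinel list head, and the iteration cap (which stands in for "loops forever") are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_vma {
    unsigned long start, end;
    struct demo_vma *next;
};

typedef int (*demo_advise_fn)(struct demo_vma *vma, struct demo_vma **prev);

/* Follows the convention: report success and record the vma in *prev. */
static int demo_good_handler(struct demo_vma *vma, struct demo_vma **prev)
{
    *prev = vma;
    return 0;
}

/* Models the bug: returns success but leaves *prev untouched. */
static int demo_buggy_handler(struct demo_vma *vma, struct demo_vma **prev)
{
    (void)vma;
    (void)prev;
    return 0;
}

#define DEMO_STUCK (-1)

/* Walks vmas up to 'end'. The next vma is always taken from prev->next,
 * so progress depends entirely on the handler updating prev. Returns the
 * number of iterations used, or DEMO_STUCK when the cap is reached. */
static int demo_walk(struct demo_vma *head, unsigned long end,
                     demo_advise_fn fn, int max_iter)
{
    struct demo_vma *prev = head;
    struct demo_vma *vma = head->next;
    int iter;

    for (iter = 1; iter <= max_iter; iter++) {
        unsigned long tmp;

        if (!vma)
            return -2;              /* ran off the end of the list */
        tmp = vma->end < end ? vma->end : end;
        if (fn(vma, &prev) != 0)
            return -3;              /* handler reported an error */
        if (tmp >= end)
            return iter;            /* whole range covered */
        vma = prev->next;           /* same vma again if prev is stale */
    }
    return DEMO_STUCK;
}

/* Two adjacent vmas, [0,10) and [10,20), advised over [0,20). */
int demo_run(int buggy)
{
    struct demo_vma b = { 10, 20, NULL };
    struct demo_vma a = { 0, 10, &b };
    struct demo_vma head = { 0, 0, &a };

    return demo_walk(&head, 20, buggy ? demo_buggy_handler
                                      : demo_good_handler, 1000);
}
```

With the well-behaved handler the walk finishes in two iterations; with the buggy one it keeps revisiting the first vma until the cap trips, which in the real syscall has no equivalent.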
      
      It seems this has been broken since introduction.  Nobody has noticed
      because nobody seems to be using MADV_WILLNEED on these DAX mappings.
      
      [mhocko@suse.com: rewrite changelog]
      Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
      Fixes: fe77ba6f ("[PATCH] xip: madvice/fadvice: execute in place")
      Signed-off-by: chenjie <chenjie6@huawei.com>
      Signed-off-by: guoxuenan <guoxuenan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: zhangyi (F) <yi.zhang@huawei.com>
      Cc: Miao Xie <miaoxie@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ea8d958
  11. 02 Nov, 2017 1 commit
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  12. 14 Oct, 2017 1 commit
  13. 04 Oct, 2017 1 commit
    • A
      mm, hugetlb, soft_offline: save compound page order before page migration · 19bfbe22
      Alexandru Moise authored
      This fixes a bug in madvise() where, if you tried to soft offline a
      hugepage via madvise(), walking the address range would end up using
      the wrong page offset.  The cause was an attempt to get the compound
      order of a page that used to be compound but no longer is, because the
      huge page had been dissolved (since commit c3114a84: "mm: hugetlb:
      soft-offline: dissolve source hugepage after successful migration").
      
      As a result I ended up with all my free pages except one being offlined.
      
      Link: http://lkml.kernel.org/r/20170912204306.GA12053@gmail.com
      Fixes: c3114a84 ("mm: hugetlb: soft-offline: dissolve source hugepage after successful migration")
      Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19bfbe22
  14. 09 Sep, 2017 1 commit
    • J
      mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse authored
      Platforms with an advanced system bus (like CAPI or CCIX) allow device
      memory to be accessible from the CPU in a cache coherent fashion.  Add a
      new type of ZONE_DEVICE to represent such memory.  The use cases are the
      same as for the un-addressable device memory but without all the corner
      cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df6ad698
  15. 07 Sep, 2017 1 commit
    • R
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel authored
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-initialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
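
From userspace, the behaviour described above could be exercised roughly as follows. This is a sketch, assuming a Linux kernel that carries this patch. The fallback value 18 for MADV_WIPEONFORK is an assumption that holds for the generic mman header but not necessarily for every architecture, and `demo_wipeonfork` is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18   /* assumed generic value; check your arch headers */
#endif

/* Returns 1 if the child saw zeroes while the parent kept its data,
 * 0 if the child inherited the data, and -1 if the setup failed
 * (for example on a kernel that rejects MADV_WIPEONFORK). */
int demo_wipeonfork(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    pid_t pid;
    int status, wiped, parent_ok;

    if (page <= 0)
        return -1;

    /* The commit states this only works on private, anonymous VMAs. */
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, (size_t)page);
    assert(p[0] == 0xAA);

    if (madvise(p, (size_t)page, MADV_WIPEONFORK) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    if (pid == 0)
        _exit(p[0] == 0 ? 0 : 1);   /* child: range is valid but empty */

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        munmap(p, (size_t)page);
        return -1;
    }

    wiped = WEXITSTATUS(status) == 0;
    parent_ok = p[0] == 0xAA;       /* parent's copy is unaffected */
    munmap(p, (size_t)page);
    return (wiped && parent_ok) ? 1 : 0;
}
```

A library caching a seed in such a page can treat "all zeroes" as the signal to regenerate its state, which is the use case the commit message describes.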
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Reported-by: Colm MacCártaigh <colm@allcosts.net>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2cd9ede
  16. 01 Sep, 2017 1 commit
  17. 26 Aug, 2017 1 commit
    • E
      mm/madvise.c: fix freeing of locked page with MADV_FREE · 263630e8
      Eric Biggers authored
      If madvise(..., MADV_FREE) split a transparent hugepage, it called
      put_page() before unlock_page().
      
      This was wrong because put_page() can free the page, e.g. if a
      concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
      mapping. put_page() then rightfully complained about freeing a locked
      page.
      
      Fix this by moving the unlock_page() before put_page().
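
The ordering problem can be modeled in a few lines of userspace C. This is only an illustrative mock: `demo_page` and its helpers are hypothetical, and a reference count of 1 stands in for the case where a concurrent MADV_DONTNEED has already dropped the mapping's reference.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_page {
    int refcount;
    bool locked;
    bool freed_while_locked;   /* what the real kernel reports as a bad page */
};

static void demo_unlock_page(struct demo_page *p)
{
    p->locked = false;
}

/* Dropping the last reference "frees" the page; doing so while the
 * page is still locked is the error the splat complains about. */
static void demo_put_page(struct demo_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0 && p->locked)
        p->freed_while_locked = true;
}

/* Returns true if the page was freed while still locked. */
bool demo_release(bool unlock_first)
{
    struct demo_page page = { .refcount = 1, .locked = true,
                              .freed_while_locked = false };

    if (unlock_first) {
        demo_unlock_page(&page);   /* fixed order */
        demo_put_page(&page);
    } else {
        demo_put_page(&page);      /* old order: may free a locked page */
        demo_unlock_page(&page);
    }
    return page.freed_while_locked;
}
```

In the mock the old order merely sets a flag; in the kernel the unlock that follows would also touch a page that may already have been freed.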
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: Bad page state in process syzkaller412798  pfn:1bd800
          page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
          flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
          raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
          raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
          page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
          bad because of flags: 0x1(locked)
          Modules linked in:
          CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           bad_page+0x230/0x2b0 mm/page_alloc.c:565
           free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
           free_pages_check mm/page_alloc.c:952 [inline]
           free_pages_prepare mm/page_alloc.c:1043 [inline]
           free_pcp_prepare mm/page_alloc.c:1068 [inline]
           free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
           __put_single_page mm/swap.c:79 [inline]
           __put_page+0xfb/0x160 mm/swap.c:113
           put_page include/linux/mm.h:814 [inline]
           madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
           walk_pmd_range mm/pagewalk.c:50 [inline]
           walk_pud_range mm/pagewalk.c:108 [inline]
           walk_p4d_range mm/pagewalk.c:134 [inline]
           walk_pgd_range mm/pagewalk.c:160 [inline]
           __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
           walk_page_range+0x200/0x470 mm/pagewalk.c:326
           madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
           madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
           madvise_dontneed_free mm/madvise.c:555 [inline]
           madvise_vma mm/madvise.c:664 [inline]
           SYSC_madvise mm/madvise.c:832 [inline]
           SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Here is a C reproducer:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <sys/mman.h>
          #include <unistd.h>
      
          #define MADV_FREE	8
          #define PAGE_SIZE	4096
      
          static void *mapping;
          static const size_t mapping_size = 0x1000000;
      
          static void *madvise_thrproc(void *arg)
          {
              madvise(mapping, mapping_size, (long)arg);
              return NULL;
          }
      
          int main(void)
          {
              pthread_t t[2];
      
              for (;;) {
                  mapping = mmap(NULL, mapping_size, PROT_WRITE,
                                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      
                  munmap(mapping + mapping_size / 2, PAGE_SIZE);
      
                  pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
                  pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
                  pthread_join(t[0], NULL);
                  pthread_join(t[1], NULL);
                  munmap(mapping, mapping_size);
              }
          }
      
      Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
      CONFIG_DEBUG_VM=y are needed.
      
      Google Bug Id: 64696096
      
      Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
      Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[v4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      263630e8
  18. 03 Aug, 2017 1 commit
    • M
      mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries · 3ea27719
      Mel Gorman authored
      Nadav Amit identified a theoretical race between page reclaim and
      mprotect due to TLB flushes being batched outside of the PTL being held.
      
      He described the race as follows:
      
              CPU0                            CPU1
              ----                            ----
                                              user accesses memory using RW PTE
                                              [PTE now cached in TLB]
              try_to_unmap_one()
              ==> ptep_get_and_clear()
              ==> set_tlb_ubc_flush_pending()
                                              mprotect(addr, PROT_READ)
                                              ==> change_pte_range()
                                              ==> [ PTE non-present - no flush ]
      
                                              user writes using cached RW PTE
              ...
      
              try_to_unmap_flush()
      
      The same type of race exists for reads when protecting for PROT_NONE and
      also exists for operations that can leave an old TLB entry behind such
      as munmap, mremap and madvise.
      
      For some operations like mprotect, it's not necessarily a data integrity
      issue but it is a correctness issue as there is a window where an
      mprotect that limits access still allows access.  For munmap, it's
      potentially a data integrity issue although the race is massive as an
      munmap, mmap and return to userspace must all complete between the
      window when reclaim drops the PTL and flushes the TLB.  However, it's
      theoretically possible so handle this issue by flushing the mm if
      reclaim is potentially currently batching TLB flushes.
      
      Other instances where a flush is required for a present pte should be ok
      as either the page lock is held preventing parallel reclaim or a page
      reference count is elevated preventing a parallel free leading to
      corruption.  In the case of page_mkclean there isn't an obvious path
      that userspace could take advantage of without using the operations that
      are guarded by this patch.  Other users such as gup, as a race with
      reclaim, look just at PTEs.  Huge page variants should be ok as they
      don't race with reclaim.  mincore only looks at PTEs.  userfault also
      should be ok as if a parallel reclaim takes place, it will either fault
      the page back in or read some of the data before the flush occurs
      triggering a fault.
      
      Note that a variant of this patch was acked by Andy Lutomirski but this
      was for the x86 parts on top of his PCID work which didn't make the 4.13
      merge window as expected.  His ack is dropped from this version and
      there will be a follow-on patch on top of PCID that will include his
      ack.
      
      [akpm@linux-foundation.org: tweak comments]
      [akpm@linux-foundation.org: fix spello]
      Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>	[v4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea27719
  19. 11 Jul, 2017 2 commits
    • M
      userfaultfd: non-cooperative: add madvise() event for MADV_FREE request · 230ca982
      Committed by Mike Rapoport
      MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd
      monitor.  The monitor has to stop handling #PF events in the range being
      freed.  We are reusing userfaultfd_remove callback along with the logic
      required to re-get and re-validate the VMA which may change or disappear
      because userfaultfd_remove releases mmap_sem.
      
      Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      230ca982
    • S
      swap: add block io poll in swapin path · 23955622
      Committed by Shaohua Li
      For fast flash disk, async IO could introduce overhead because of
      context switch.  block-mq now supports IO poll, which improves
      performance and latency a lot.  swapin is a good place to use this
      technique, because the task is waiting for the swapin page to continue
      execution.
      
      In my virtual machine, directly reading 4k of data from an NVMe device
      with iopoll is about 60% better than without polling.  With iopoll
      support in the swapin patch, my microbenchmark (a task doing random
      memory writes) is about 10%~25% faster.  CPU utilization increases a
      lot though, to 2x and even 3x.  This will depend on disk speed.
      
      While iopoll in swapin isn't intended for all use cases, it's a win
      for latency sensitive workloads with a high speed swap disk.  The block
      layer has a knob to control polling at runtime.  If polling isn't
      enabled in the block layer, there should be no noticeable change in
      swapin.
      
      I got a chance to run the same test on an NVMe device with DRAM as the
      media.  In a simple fio IO test, blkpoll boosts performance by 50% in
      the single-thread test and by ~20% in the 8-thread test.  So this is
      the baseline.  In the above swap test, blkpoll boosts performance by
      ~27% in the single-thread test.  blkpoll uses 2x CPU time though.
      
      If we enable hybrid polling, the performance gain drops very slightly
      but CPU time is only 50% worse than without blkpoll.  We can also
      adjust the hybrid poll parameter, which reduces the CPU time penalty
      further.  In the 8-thread test, blkpoll doesn't help though.  The
      performance is similar to that without blkpoll, but CPU utilization is
      similar too.  There is lock contention in the swap path.  The CPU time
      spent on blkpoll isn't high.  So overall, blkpoll swapin isn't worse
      than swapin without it.
      
      The swapin readahead might read several pages at the same time and
      form a big IO request.  Since that IO will take longer, it doesn't
      make sense to poll, so the patch only does iopoll for single page
      swapin.
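      
      The resulting policy (poll only when polling is enabled and the swapin
      covers a single page) can be written as a small decision function.
      This is a sketch only: the enum, the function name and both parameters
      are hypothetical and do not correspond to the actual swap or block
      layer interfaces touched by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* How the faulting task waits for its swapin IO (names are illustrative). */
enum wait_mode { WAIT_SLEEP, WAIT_POLL };

/*
 * nr_pages:     pages covered by the swapin request (readahead gives > 1)
 * poll_enabled: stands in for the block layer's runtime polling knob
 */
enum wait_mode swapin_wait_mode(unsigned int nr_pages, bool poll_enabled)
{
	if (!poll_enabled)
		return WAIT_SLEEP; /* no change in behaviour when polling is off */
	if (nr_pages != 1)
		return WAIT_SLEEP; /* large IO takes longer, spinning wastes CPU */
	return WAIT_POLL;          /* single page: the task is waiting anyway */
}

/* Exercises the three cases described in the commit message. */
void demo_swapin_wait_mode(void)
{
	assert(swapin_wait_mode(1, true) == WAIT_POLL);
	assert(swapin_wait_mode(8, true) == WAIT_SLEEP);
	assert(swapin_wait_mode(1, false) == WAIT_SLEEP);
}
```

      The trade-off follows the numbers quoted above: polling buys latency at
      the price of CPU time, so it is reserved for the short single-page
      case.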
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23955622
  20. 04 May, 2017 5 commits
  21. 10 Mar, 2017 1 commit
  22. 25 Feb, 2017 4 commits
  23. 23 Feb, 2017 4 commits
  24. 13 Dec, 2016 1 commit
  25. 24 May, 2016 1 commit
    • M
      mm: make mmap_sem for write waits killable for mm syscalls · dc0ef0df
      Committed by Michal Hocko
      This is follow-up work for oom_reaper [1].  As the async OOM killing
      depends on mmap_sem for read, we would really appreciate it if a holder
      for write didn't stand in the way.  This patchset changes many of the
      down_write calls to be killable to help those cases when the writer is
      blocked waiting for readers to release the lock, and so help
      __oom_reap_task to process the oom victim.
      
      Most of the patches are really trivial because the lock is held from
      shallow syscall paths where we can return EINTR trivially and allow the
      current task to die (note that EINTR will never get to userspace as
      the task has a fatal signal pending).  Others seem to be easy as well,
      as the callers already handle fatal errors and bail out and return to
      userspace, which should be sufficient to handle the failure gracefully.
      I am not familiar with all those code paths so a deeper review is really
      appreciated.
      
      As this work is touching more areas which are not directly connected I
      have tried to keep the CC list as small as possible and people who I
      believed would be familiar are CCed only to the specific patches (all
      should have received the cover though).
      
      This patchset is based on linux-next and it depends on
      down_write_killable for rw_semaphores which got merged into tip
      locking/rwsem branch and it is merged into this next tree.  I guess it
      would be easiest to route these patches via mmotm because of the
      dependency on the tip tree but if respective maintainers prefer other
      way I have no objections.
      
      I haven't covered all the down_write(mm->mmap_sem) instances here
      
        $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
        98
        $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
        62
      
      I have tried to cover those which should be relatively easy to review in
      this series because this alone should be a nice improvement.  Other
      places can be changed on top.
      
      [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
      [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org
      
      This patch (of 18):
      
      This is the first step in making mmap_sem write waiters killable.  It
      focuses on the trivial ones which are taking the lock early after
      entering the syscall and they are not changing state before.
      
      Therefore it is very easy to change them to use down_write_killable and
      immediately return with -EINTR.  This allows the waiter to die
      without blocking the mmap_sem, which might be required to make forward
      progress.  E.g. the oom reaper will need the lock for reading to
      dismantle the OOM victim's address space.
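      
      The "take the lock early, bail out with -EINTR before changing any
      state" pattern can be modelled in userspace as follows.  This is a
      sketch under assumptions: the structs and functions are hypothetical
      models, not the kernel's rw_semaphore or task_struct, and the -EAGAIN
      return exists only because a single-threaded model cannot sleep; the
      real down_write_killable() keeps waiting in that case.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical models of the lock and of the waiting task. */
struct rwsem_model { int readers; bool writer; };
struct task_model { bool fatal_signal_pending; };

/*
 * Models a killable write acquisition: 0 when the lock is taken, -EINTR
 * when the lock is contended and the waiter has a fatal signal pending.
 * -EAGAIN is a model-only artifact meaning "would keep sleeping".
 */
int down_write_killable_model(struct rwsem_model *sem,
			      const struct task_model *task)
{
	if (sem->readers == 0 && !sem->writer) {
		sem->writer = true;
		return 0;
	}
	if (task->fatal_signal_pending)
		return -EINTR;
	return -EAGAIN;
}

/* A "trivial" syscall path: the lock is taken before any state changes. */
int syscall_model(struct rwsem_model *mmap_sem,
		  const struct task_model *task, int *state)
{
	int ret = down_write_killable_model(mmap_sem, task);

	if (ret)
		return ret;        /* nothing was modified, so bailing is safe */
	(*state)++;                /* the actual address space update */
	mmap_sem->writer = false;  /* release the write lock */
	return 0;
}

/* A killed waiter leaves without touching state; a free lock proceeds. */
void demo_killable_writer(void)
{
	struct rwsem_model held = { 1, false };   /* one reader holds the lock */
	struct rwsem_model unlocked = { 0, false };
	struct task_model victim = { true };
	int state = 0;

	assert(syscall_model(&held, &victim, &state) == -EINTR);
	assert(state == 0);
	assert(syscall_model(&unlocked, &victim, &state) == 0);
	assert(state == 1);
}
```

      Because the early return happens before any modification, there is
      nothing to undo on the error path, which is why these call sites are
      the easy ones to convert.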
      
      The only tricky function in this patch is vm_mmap_pgoff which has many
      call sites via vm_mmap.  To reduce the risk keep vm_mmap with the
      original non-killable semantic for now.
      
      vm_munmap callers do not bother checking the return value so open code
      it into the munmap syscall path for now for simplicity.
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc0ef0df
  26. 05 Apr, 2016 1 commit
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
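      
      Why these substitutions do not change behaviour can be checked with a
      short userspace program.  It is a sketch under assumptions: PAGE_SHIFT
      is fixed at 12 (4 KiB pages) purely for illustration, the
      PAGE_CACHE_* macros are modelled as plain aliases of their PAGE_*
      counterparts, and the four helper functions are hypothetical.

```c
#include <assert.h>

/* Assumption for this sketch: 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* The old names, modelled here as plain aliases of the PAGE_* macros. */
#define PAGE_CACHE_SHIFT PAGE_SHIFT
#define PAGE_CACHE_SIZE  PAGE_SIZE
#define PAGE_CACHE_MASK  PAGE_MASK

/* Before: page cache units converted to pages with a shift by zero. */
unsigned long nr_pages_old(unsigned long nr)
{
	return nr << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* After: the shift is dropped because it never did anything. */
unsigned long nr_pages_new(unsigned long nr)
{
	return nr;
}

/* Before and after: offset of a file position within its page. */
unsigned long offset_old(unsigned long pos)
{
	return pos & ~PAGE_CACHE_MASK;
}

unsigned long offset_new(unsigned long pos)
{
	return pos & ~PAGE_MASK;
}

/* Both spellings agree for every input tried here. */
void demo_page_cache_macros(void)
{
	assert(PAGE_CACHE_SIZE == PAGE_SIZE);
	assert(nr_pages_old(7UL) == nr_pages_new(7UL));
	assert(offset_old(0x12345UL) == offset_new(0x12345UL));
}
```

      With the aliases in place, every PAGE_CACHE_* expression is already
      identical to its PAGE_* form, so the conversion is purely textual.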
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  27. 16 Mar, 2016 2 commits