1. 15 May 2019: 40 commits
    • mm: implement new zone specific memblock iterator · 837566e7
      Authored by Alexander Duyck
      Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
      
      This iterator makes sure that a given memory range is in fact contained
      within a zone.  It takes care of all the bounds checking we were doing
      in deferred_grow_zone and deferred_init_memmap.
      In addition it should help to speed up the search a bit by skipping
      ranges until the end of a range is greater than the start of the zone
      pfn range, and exiting completely once a range's start is beyond the
      end of the zone.
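
      A minimal usage sketch, assuming the iterator signature introduced here
      (an iterator cookie, the target zone, and output start/end pfns); the
      helper function name is illustrative only:

          /* Sketch: count the free memblock pfns that fall inside @zone. */
          static unsigned long count_free_pfns_in_zone(struct zone *zone)
          {
                  unsigned long spfn, epfn, nr = 0;
                  u64 i;

                  for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn)
                          nr += epfn - spfn;  /* range already clamped to the zone */

                  return nr;
          }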
      
      Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      837566e7
    • mm: use mm_zero_struct_page from SPARC on all 64b architectures · 5470dea4
      Authored by Alexander Duyck
      Patch series "Deferred page init improvements", v7.
      
      This patchset is essentially a refactor of the page initialization logic
      that is meant to provide for better code reuse while providing a
      significant improvement in deferred page initialization performance.
      
      In my testing on an x86_64 system with 384GB of RAM I have seen the
      following.  In the case of regular memory initialization the deferred init
      time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
      improvement for the deferred memory initialization performance.
      
      I have called out the improvement observed with each patch.
      
      This patch (of 4):
      
      Use the same approach that was already in use on Sparc on all the
      architectures that support a 64b long.
      
      This is mostly motivated by the fact that 7 to 10 store/move instructions
      are likely always going to be faster than having to call into a function
      that is not specialized for handling page init.
      
      An added advantage to doing it this way is that the compiler can get away
      with combining writes in the __init_single_page call.  As a result the
      memset call will be reduced to only about 4 write operations, or at least
      that is what I am seeing with GCC 6.2, as the flags, LRU pointers, and
      count/mapcount seem to be cancelling out at least 4 of the 8 assignments
      on my system.
      
      One change I had to make to the function was to reduce the minimum
      struct page size to 56 bytes to support some powerpc64 configurations.
      
      This change should introduce no functional change on SPARC since it
      already had this code.  In the case of x86_64 I saw a reduction from
      3.75s to 2.80s when initializing 384GB of RAM per node.  Pavel Tatashin
      tested on a system with Broadcom's Stingray CPU and 48GB of RAM and
      found that __init_single_page() takes 19.30ns per 64-byte struct page
      before this patch and 17.33ns per 64-byte struct page with it.  Mike
      Rapoport ran a similar test on an OpenPower (S812LC 8348-21C) with a
      Power8 processor and 128GB of RAM.  His results per 64-byte struct page
      were 4.68ns before, and 4.59ns after this patch.
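
      A sketch of what the shared 64-bit helper looks like when modelled on
      the SPARC version: a short run of word stores replaces the memset()
      call (the exact case labels depend on the allowed struct page sizes):

          static inline void __mm_zero_struct_page(struct page *page)
          {
                  unsigned long *_pp = (void *)page;

                  /* struct page must be a multiple of 8 bytes, 56 to 80 bytes */
                  BUILD_BUG_ON(sizeof(struct page) & 7);
                  BUILD_BUG_ON(sizeof(struct page) < 56);
                  BUILD_BUG_ON(sizeof(struct page) > 80);

                  switch (sizeof(struct page)) {
                  case 80:
                          _pp[9] = 0;     /* fallthrough */
                  case 72:
                          _pp[8] = 0;     /* fallthrough */
                  case 64:
                          _pp[7] = 0;     /* fallthrough */
                  default:
                          _pp[6] = 0;
                          _pp[5] = 0;
                          _pp[4] = 0;
                          _pp[3] = 0;
                          _pp[2] = 0;
                          _pp[1] = 0;
                          _pp[0] = 0;
                  }
          }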
      
      Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5470dea4
    • mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper · c6d23413
      Authored by Jérôme Glisse
      Helper to test if a range is updated to read only (it is still valid to
      read from the range).  This is useful for device drivers or anyone who
      wishes to optimize out an update when they know that they already have
      the range mapped read only.
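
      A sketch of such a helper, assuming the vma and event fields added to
      struct mmu_notifier_range elsewhere in this series:

          bool
          mmu_notifier_range_update_to_read_only(const struct mmu_notifier_range *range)
          {
                  if (!range->vma || range->event != MMU_NOTIFY_PROTECTION_VMA)
                          return false;
                  /* The range stays readable if the vma still allows reads. */
                  return range->vma->vm_flags & VM_READ;
          }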
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-9-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6d23413
    • mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening · bf198b2b
      Authored by Jérôme Glisse
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      Users of the mmu notifier API track changes to the CPU page table and
      take specific action for them.  However, the current API only provides
      the range of virtual addresses affected by the change, not why the
      change is happening.
      
      This patch just passes the new information down by adding it to the
      mmu_notifier_range structure.
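
      A sketch of the extra context carried by the range structure; field
      names and ordering are assumptions based on the description above:

          struct mmu_notifier_range {
                  struct vm_area_struct *vma;     /* vma being invalidated, may be NULL */
                  struct mm_struct *mm;
                  unsigned long start;
                  unsigned long end;
                  unsigned flags;
                  enum mmu_notifier_event event;  /* why the CPU page table is changing */
          };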
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf198b2b
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Authored by Jérôme Glisse
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      Users of the mmu notifier API track changes to the CPU page table and
      take specific action for them.  However, the current API only provides
      the range of virtual addresses affected by the change, not why the
      change is happening.
      
      This patchset does the initial mechanical conversion of all the places
      that call mmu_notifier_range_init to also provide the default
      MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
      invalidations happen against a given vma).  Passing down the vma allows
      the users of mmu notifier to inspect the new vma page protection.
      
      MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
      should assume that every mapping for the range is going away when that
      event happens.  A later patch converts the mm call paths to use more
      appropriate events for each call.
      
      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4f13e8
    • mm/mmu_notifier: contextual information for event enums · d87f055b
      Authored by Jérôme Glisse
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      This patch introduces a set of enums that can be associated with each of
      the events triggering a mmu notifier.  Later patches take advantage of
      those enum values.
      
          - UNMAP: munmap() or mremap()
          - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
          - PROTECTION_VMA: change in access protections for the range
          - PROTECTION_PAGE: change in access protections for page in the range
          - SOFT_DIRTY: soft dirtiness tracking
      
      Being able to distinguish munmap() and mremap() from other reasons why
      the page table is cleared is important to allow users of mmu notifier to
      update their own internal tracking structures accordingly (on munmap or
      mremap it is no longer needed to track the range of virtual addresses as
      it becomes invalid).
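
      A sketch of the resulting enum, following the list above (the exact
      spelling of the constants as merged may differ slightly):

          enum mmu_notifier_event {
                  MMU_NOTIFY_UNMAP = 0,           /* munmap() or mremap() */
                  MMU_NOTIFY_CLEAR,               /* migration, compaction, reclaim, ... */
                  MMU_NOTIFY_PROTECTION_VMA,      /* protection change for the range */
                  MMU_NOTIFY_PROTECTION_PAGE,     /* protection change for pages in range */
                  MMU_NOTIFY_SOFT_DIRTY,          /* soft-dirty tracking */
          };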
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-5-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d87f055b
    • mm/mmu_notifier: convert mmu_notifier_range->blockable to a flags · 27560ee9
      Authored by Jérôme Glisse
      Use an unsigned field for flags other than blockable and convert the
      blockable field to be one of those flags.
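
      A small sketch of the conversion at a call site, with the flag name
      taken from the series description (treat it as illustrative):

          #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

          /* before: range.blockable = true; */
          /* after:  range.flags |= MMU_NOTIFIER_RANGE_BLOCKABLE; */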
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-4-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27560ee9
    • mm/mmu_notifier: helper to test if a range invalidation is blockable · 4a83bfe9
      Authored by Jérôme Glisse
      Patch series "mmu notifier provide context informations", v6.
      
      Here I am not posting users of this; they have already been posted to
      the appropriate mailing list [6] and will be merged through the
      appropriate tree once this patchset is upstream.
      
      Note that this series does not change any behavior for any existing
      code.  It just passes down more information to mmu notifier listeners.
      
      The rationale for this patchset:
      
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      This patchset introduces a set of enums that can be associated with each
      of the events triggering a mmu notifier:
      
          - UNMAP: munmap() or mremap()
          - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
          - PROTECTION_VMA: change in access protections for the range
          - PROTECTION_PAGE: change in access protections for page in the range
          - SOFT_DIRTY: soft dirtiness tracking
      
      Being able to distinguish munmap() and mremap() from other reasons why
      the page table is cleared is important to allow users of mmu notifier to
      update their own internal tracking structures accordingly (on munmap or
      mremap it is no longer needed to track the range of virtual addresses as
      it becomes invalid).  Without this series, drivers are forced to assume
      that every notification is an munmap, which triggers useless thrashing
      within drivers that associate structures with ranges of virtual
      addresses.  Each driver is forced to free up its tracking structure and
      then restore it on the next device page fault.  With this series we can
      also optimize device page table updates.  Patches to use this are at
      
      https://lkml.org/lkml/2019/1/23/833
      https://lkml.org/lkml/2019/1/23/834
      https://lkml.org/lkml/2019/1/23/832
      https://lkml.org/lkml/2019/1/23/831
      
      Moreover this can also be used to optimize out some page table updates
      such as for KVM where we can update the secondary MMU directly from the
      callback instead of clearing it.
      
      ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
      ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
      
      This patch (of 8):
      
      Simple helper to test if a range invalidation is blockable.  Later
      patches use coccinelle to convert all direct dereferences of
      range->blockable to use this function instead, so that we can convert
      the blockable field to an unsigned for more flags.
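
      A sketch of the helper at this stage of the series, before the field is
      converted to flags by a later patch:

          static inline bool
          mmu_notifier_range_blockable(const struct mmu_notifier_range *range)
          {
                  return range->blockable;
          }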
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a83bfe9
    • mm/hmm: convert various hmm_pfn_* to device_entry which is a better name · 391aab11
      Authored by Jérôme Glisse
      Convert hmm_pfn_* to device_entry_*, as here we are dealing with a
      device driver specific entry format and hmm provides helpers to allow
      different components (including HMM) to create/parse device entries.
      
      We keep wrappers with the old names so that drivers can be converted to
      the new API in stages in each device driver tree.  These will get
      removed once all drivers are converted.
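
      A sketch of the staged conversion: the old hmm_pfn_* name survives as a
      thin wrapper around the renamed helper (the wrapper shape is assumed):

          /* New name, operating on device driver specific entries. */
          struct page *hmm_device_entry_to_page(const struct hmm_range *range,
                                                uint64_t entry);

          /* Legacy wrapper kept so drivers can be converted in stages. */
          static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
                                                     uint64_t pfn)
          {
                  return hmm_device_entry_to_page(range, pfn);
          }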
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-13-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      391aab11
    • mm/hmm: add a helper function that fault pages and map them to a device · 55c0ece8
      Authored by Jérôme Glisse
      This is an all-in-one helper that faults pages in a range and maps them
      to a device, so that every single device driver does not have to
      re-implement this common pattern.
      
      This is taken from ODP RDMA in preparation for the ODP RDMA conversion.
      It will be used by nouveau and other drivers.
      
      [jglisse@redhat.com: Was using wrong field and wrong enum]
        Link: http://lkml.kernel.org/r/20190409175340.26614-1-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20190403193318.16478-12-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55c0ece8
    • mm/hmm: add helpers to test if mm is still alive or not · 20239417
      Authored by Jérôme Glisse
      The device driver can have a kernel thread or worker doing work against
      a process mm, and it is useful for those to test whether the mm is dead
      or alive to avoid doing useless work.  Add a helper to test that so that
      a driver can bail out early if a process is dying.
      
      Note that the helper does not perform any lock synchronization and thus
      is just a hint, i.e. a process might be dying but the helper might still
      report it as alive.  All HMM functions are safe to use in that case as
      HMM internally protects itself with locks.  If a driver uses this helper
      with non-HMM functions it should ascertain that it is safe to do so.
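
      A sketch of what such a helper can look like, assuming the mirror keeps
      a pointer to the HMM structure and its mm (the field names are
      assumptions):

          static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
          {
                  struct mm_struct *mm;

                  if (!mirror || !mirror->hmm)
                          return false;
                  mm = READ_ONCE(mirror->hmm->mm);
                  if (mirror->hmm->dead || !mm)
                          return false;
                  return true;    /* only a hint, no locking is taken */
          }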
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-11-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20239417
    • mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping) · 63d5066f
      Authored by Jérôme Glisse
      HMM mirror is a device driver helper to mirror a range of virtual
      addresses.  It means that the process jobs running on the device can
      access the same virtual addresses as the CPU threads of that process.
      This patch adds support for hugetlbfs mappings (i.e. ranges of virtual
      addresses that are mmaps of a hugetlbfs file).
      
      [rcampbell@nvidia.com: fix initial PFN for hugetlbfs pages]
        Link: http://lkml.kernel.org/r/20190419233536.8080-1-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20190403193318.16478-9-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63d5066f
    • mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays · 023a019a
      Authored by Jérôme Glisse
      The HMM mirror API can be used in two fashions.  In the first, the HMM
      user coalesces multiple page faults into one request and sets flags per
      pfn for those faults.  In the second, the HMM user wants to pre-fault a
      range with specific flags.  For the latter it is a waste to have the
      user pre-fill the pfn array with a default flags value.
      
      This patch adds a default flags value, allowing users to set it for a
      range without having to pre-fill the pfn array.
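
      A hedged usage sketch for the pre-fault case, assuming the new fields
      are named default_flags and pfn_flags_mask and that the driver supplies
      its flag encoding through range.flags[]:

          /* Pre-fault the whole range writable without filling range.pfns. */
          range.default_flags = range.flags[HMM_PFN_VALID] |
                                range.flags[HMM_PFN_WRITE];
          range.pfn_flags_mask = 0;       /* ignore per-pfn values in the array */
          ret = hmm_range_fault(&range, true);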
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      023a019a
    • mm/hmm: improve driver API to work and wait over a range · a3e0d41c
      Authored by Jérôme Glisse
      A common use case for HMM mirror is a user trying to mirror a range
      which gets invalidated by some core mm event before they can program the
      hardware.  Instead of having the user retry mirroring the range right
      away, provide a completion mechanism for them to wait for any active
      invalidation affecting the range.
      
      This also changes how hmm_range_snapshot() and hmm_range_fault() work by
      not relying on the vma, so that we can drop the mmap_sem when waiting
      and look up the vma again on retry.
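
      A condensed sketch of the intended driver pattern (lock helpers and
      some names are placeholders): register the range, wait for pending
      invalidations, snapshot under mmap_sem, and retry if the range was
      invalidated before the device tables could be programmed:

          hmm_range_register(&range, mirror);
          hmm_range_wait_until_valid(&range, TIMEOUT_MSEC);
      again:
          down_read(&mm->mmap_sem);
          ret = hmm_range_snapshot(&range);
          if (ret) {
                  up_read(&mm->mmap_sem);
                  if (ret == -EAGAIN)
                          goto again;             /* invalidated meanwhile */
                  hmm_range_unregister(&range);
                  return ret;
          }
          driver_lock_update(drv);
          if (!hmm_range_valid(&range)) {         /* raced with an invalidation */
                  driver_unlock_update(drv);
                  up_read(&mm->mmap_sem);
                  goto again;
          }
          /* ... program the device page table from range.pfns ... */
          hmm_range_unregister(&range);
          driver_unlock_update(drv);
          up_read(&mm->mmap_sem);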
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-7-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3e0d41c
    • mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault() · 73231612
      Authored by Jérôme Glisse
      Minor optimization around hmm_pte_need_fault().  Rename for consistency
      between code, comments and documentation.  Also improve the comments on
      all the possible return values.  Improve the function by returning the
      number of populated entries in the pfns array.
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-6-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73231612
    • mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot() · 25f23a0c
      Authored by Jérôme Glisse
      Rename for consistency between code, comments and documentation.  Also
      improve the comments on all the possible return values.  Improve the
      function by returning the number of populated entries in the pfns array.
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-5-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25f23a0c
    • mm/hmm: use reference counting for HMM struct · 704f3f2c
      Authored by Jérôme Glisse
      Every time I read the code to check that the HMM structure does not
      vanish before it should, thanks to the many locks protecting its
      removal, I get a headache.  Switch to reference counting instead; it is
      much easier to follow and harder to break.  This also removes some code
      that is no longer needed with refcounting.
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-3-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      704f3f2c
    • hugetlb: use same fault hash key for shared and private mappings · 1b426bac
      Authored by Mike Kravetz
      hugetlb uses a fault mutex hash table to prevent page faults of the
      same pages concurrently.  The key for shared and private mappings is
      different.  Shared mappings key off the address_space and file index.
      Private mappings key off the mm and virtual address.  Consider a private
      mapping of a populated hugetlbfs file.  A fault will map the page from
      the file and, if needed, do a COW to map a writable page.
      
      Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
      pages.  It uses the address_space file index key.  However, private
      mappings will use a different key and could race with this code to map
      the file page.  This causes problems (BUG) for the page cache remove
      code as it expects the page to be unmapped.  A sample stack is:
      
      page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
      kernel BUG at mm/filemap.c:169!
      ...
      RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
      ...
      Call Trace:
      __delete_from_page_cache+0x39/0x220
      delete_from_page_cache+0x45/0x70
      remove_inode_hugepages+0x13c/0x380
      ? __add_to_page_cache_locked+0x162/0x380
      hugetlbfs_fallocate+0x403/0x540
      ? _cond_resched+0x15/0x30
      ? __inode_security_revalidate+0x5d/0x70
      ? selinux_file_permission+0x100/0x130
      vfs_fallocate+0x13f/0x270
      ksys_fallocate+0x3c/0x80
      __x64_sys_fallocate+0x1a/0x20
      do_syscall_64+0x5b/0x180
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      There seems to be another potential COW issue/race with this approach
      of different private and shared keys as noted in commit 8382d914
      ("mm, hugetlb: improve page-fault scalability").
      
      Since every hugetlb mapping (even anon and private) is actually a file
      mapping, just use the address_space index key for all mappings.  This
      results in potentially more hash collisions.  However, this should not
      be the common case.
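
      A sketch of the resulting hash computation: only the mapping and the
      file index feed the key, for shared and private mappings alike (the
      exact parameter list is an assumption):

          u32 hugetlb_fault_mutex_hash(struct hstate *h,
                                       struct address_space *mapping, pgoff_t idx)
          {
                  unsigned long key[2];
                  u32 hash;

                  key[0] = (unsigned long)mapping;
                  key[1] = idx;
                  hash = jhash2((u32 *)&key, sizeof(key) / sizeof(u32), 0);
                  return hash & (num_fault_mutexes - 1);
          }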
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b426bac
    • mm/vmscan: drop may_writepage and classzone_idx from direct reclaim begin template · 3481c37f
      Authored by Yafang Shao
      There are three tracepoints using this template, which are
      mm_vmscan_direct_reclaim_begin,
      mm_vmscan_memcg_reclaim_begin,
      mm_vmscan_memcg_softlimit_reclaim_begin.
      
      Regarding mm_vmscan_direct_reclaim_begin,
      sc.may_writepage is !laptop_mode, which is a static setting, and
      reclaim_idx is derived from gfp_mask, which is already shown in this
      tracepoint.
      
      Regarding mm_vmscan_memcg_reclaim_begin,
      may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1);
      both are static values.
      
      mm_vmscan_memcg_softlimit_reclaim_begin is the same as
      mm_vmscan_memcg_reclaim_begin.
      
      So we can drop them all.
      
      Link: http://lkml.kernel.org/r/1553736322-32235-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3481c37f
    • mm: introduce put_user_page*(), placeholder versions · fc1d8e7c
      Authored by John Hubbard
      A discussion of the overall problem is below.
      
      As mentioned in patch 0001, the steps to fix the problem are:
      
      1) Provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      Overview
      ========
      
      Some kernel components (file systems, device drivers) need to access
      memory that is specified via process virtual address.  For a long time,
      the API to achieve that was get_user_pages ("GUP") and its variations.
      However, GUP has critical limitations that have been overlooked; in
      particular, GUP does not interact correctly with filesystems in all
      situations.  That means that file-backed memory + GUP is a recipe for
      potential problems, some of which have already occurred in the field.
      
      GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
      code to get the struct page behind a virtual address and to let storage
      hardware perform a direct copy to or from that page.  This is a
      short-lived access pattern, and as such, the window for a concurrent
      writeback of GUP'd page was small enough that there were not (we think)
      any reported problems.  Also, userspace was expected to understand and
      accept that Direct IO was not synchronized with memory-mapped access to
      that data, nor with any process address space changes such as munmap(),
      mremap(), etc.
      
      Over the years, more GUP uses have appeared (virtualization, device
      drivers, RDMA) that can keep the pages they get via GUP for a long period
      of time (seconds, minutes, hours, days, ...).  This long-term pinning
      makes an underlying design problem more obvious.
      
      In fact, there are a number of key problems inherent to GUP:
      
      Interactions with file systems
      ==============================
      
      File systems expect to be able to write back data, both to reclaim pages,
      and for data integrity.  Allowing other hardware (NICs, GPUs, etc) to gain
      write access to the file memory pages means that such hardware can dirty
      the pages, without the filesystem being aware.  This can, in some cases
      (depending on filesystem, filesystem options, block device, block device
      options, and other variables), lead to data corruption, and also to kernel
      bugs of the form:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on every
          write access to a clean file backed page, not just the first one.
          How long the GUP reference lasts is irrelevant, if the page is clean
          and you need to dirty it, you must call ->page_mkwrite before it is
          marked writeable and dirtied. Every. Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      Long term GUP
      =============
      
      Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
      writeable mapping is created), and the pages are file-backed.  That can
      lead to filesystem corruption.  What happens is that when a file-backed
      page is being written back, it is first mapped read-only in all of the CPU
      page tables; the file system then assumes that nobody can write to the
      page, and that the page content is therefore stable.  Unfortunately, the
      GUP callers generally do not monitor changes to the CPU page tables; they
      instead assume that the following pattern is safe (it's not):
      
          get_user_pages()
      
          Hardware can keep a reference to those pages for a very long time,
          and write to it at any time.  Because "hardware" here means "devices
          that are not a CPU", this activity occurs without any interaction with
          the kernel's file system code.
      
          for each page
              set_page_dirty
              put_page()
      
      In fact, the GUP documentation even recommends that pattern.
      
      Anyway, the file system assumes that the page is stable (nothing is
      writing to the page), and that is a problem: stable page content is
      necessary for many filesystem actions during writeback, such as checksum,
      encryption, RAID striping, etc.  Furthermore, filesystem features like COW
      (copy on write) or snapshots also rely on being able to use a new page
      as memory for that memory range inside the file.
      
      Corruption during write back is clearly possible here.  To solve that, one
      idea is to identify pages that have active GUP, so that we can use a
      bounce page to write stable data to the filesystem.  The filesystem would
      work on the bounce page, while any of the active GUP might write to the
      original page.  This would avoid the stable page violation problem, but
      note that it is only part of the overall solution, because other problems
      remain.
      
      Other filesystem features that need to replace the page with a new one can
      be inhibited for pages that are GUP-pinned.  This will, however, alter and
      limit some of those filesystem features.  The only fix for that would be
      to require GUP users to monitor and respond to CPU page table updates.
      Subsystems such as ODP and HMM do this, for example.  This aspect of the
      problem is still under discussion.
      
      Direct IO
      =========
      
      Direct IO can cause corruption, if userspace does Direct-IO that writes to
      a range of virtual addresses that are mmap'd to a file.  The pages written
      to are file-backed pages that can be under write back, while the Direct IO
      is taking place.  Here, Direct IO races with a write back: it calls GUP
      before page_mkclean() has replaced the CPU pte with a read-only entry.
      The race window is pretty small, which is probably why years have gone by
      before we noticed this problem: Direct IO is generally very quick, and
      tends to finish up before the filesystem gets around to do anything with
      the page contents.  However, it's still a real problem.  The solution is
      to never let GUP return pages that are under write back, but instead,
      force GUP to take a write fault on those pages.  That way, GUP will
      properly synchronize with the active write back.  This does not change the
      required GUP behavior, it just avoids that race.
      
      Details
      =======
      
      Introduces put_user_page(), which simply calls put_page().  This provides
      a way to update all get_user_pages*() callers, so that they call
      put_user_page(), instead of put_page().
      
      Also introduces put_user_pages(), and a few dirty/locked variations, as a
      replacement for release_pages(), and also as a replacement for open-coded
      loops that release multiple pages.  These may be used for subsequent
      performance improvements, via batching of pages to be released.
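
      A sketch of the placeholder and of the call-site conversion it enables
      (the wrapper is intentionally trivial at this stage):

          /* Placeholder: semantically a "release a gup-pinned page" call. */
          static inline void put_user_page(struct page *page)
          {
                  put_page(page);
          }

          void put_user_pages(struct page **pages, unsigned long npages);

          /* Call sites change from:
           *         for (i = 0; i < npages; i++)
           *                 put_page(pages[i]);
           * to:
           *         put_user_pages(pages, npages);
           */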
      
      This is the first step of fixing a problem (also described in [1] and [2])
      with interactions between get_user_pages ("gup") and filesystems.
      
      Problem description: let's start with a bug report.  Below is what
      happens sometimes, under memory pressure, when a driver pins some pages
      via gup, and then marks those pages dirty, and releases them.  Note that
      the gup documentation actually recommends that pattern.  The problem is
      that the filesystem may do a writeback while the pages were gup-pinned,
      and then the filesystem believes that the pages are clean.  So, when the
      driver later marks the pages as dirty, that conflicts with the
      filesystem's page tracking and results in a BUG(), like this one that I
      experienced:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on
          every write access to a clean file backed page, not just the first
          one.  How long the GUP reference lasts is irrelevant, if the page is
          clean and you need to dirty it, you must call ->page_mkwrite before it
          is marked writeable and dirtied.  Every.  Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      The steps to fix it are:
      
      1) (This patch): provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
      [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
      
      Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[docs]
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc1d8e7c
    • hugetlb: allow to free gigantic pages regardless of the configuration · 4eb0716e
      Authored by Alexandre Ghiti
      On systems without CONTIG_ALLOC activated but that support gigantic
      pages, boot-time reserved gigantic pages can not be freed at all.  This
      patch simply makes it possible to hand those pages back to the memory
      allocator.
      
      Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr
      Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
      Acked-by: David S. Miller <davem@davemloft.net> [sparc]
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4eb0716e
    • mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOC · 8df995f6
      Authored by Alexandre Ghiti
      This condition is what allows alloc_contig_range to be defined, so
      simplify it into a more accurately named option.
      
      Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr
      Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8df995f6
    • mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() API · 113b7dfd
      Authored by Johannes Weiner
      Only memcg_numa_stat_show() uses those wrappers and the LRU bitmasks, so
      group them together.
      
      Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      113b7dfd
    • mm: memcontrol: push down mem_cgroup_node_nr_lru_pages() · 2b487e59
      Authored by Johannes Weiner
      mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around
      lruvec_page_state() that takes bitmasks of lru indexes and aggregates the
      counts for those.
      
      Replace callsites where the bitmask is simple enough with direct
      lruvec_page_state() calls.
      
      This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so
      make that function private again, too.
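
      A sketch of the conversion at a simple call site (the lruvec lookup is
      shown with an assumed helper signature):

          /* before */
          nr = mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(LRU_ACTIVE_FILE));

          /* after */
          lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
          nr = lruvec_page_state(lruvec, NR_ACTIVE_FILE);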
      
      Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b487e59
    • mm: memcontrol: replace zone summing with lruvec_page_state() · 1a61ab80
      Authored by Johannes Weiner
      Instead of adding up the zone counters, use lruvec_page_state() to get
      the node state directly.  This is a bit cheaper and more streamlined.
      
      Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a61ab80
    • mm: memcontrol: track LRU counts in the vmstats array · e0ee0e71
      Authored by Johannes Weiner
      Patch series "mm: memcontrol: clean up the LRU counts tracking".
      
      The memcg LRU stats usage is currently a bit messy.  Memcg has private
      per-zone counters because reclaim needs zone granularity sometimes, but we
      also have plenty of users that need to awkwardly sum them up to node or
      memcg granularity.  Meanwhile the canonical per-memcg vmstats do not track
      the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
      
      This series enables LRU count tracking in the per-memcg vmstats array such
      that lruvec_page_state() and memcg_page_state() work on the enum
      node_stat_item items for the LRU counters.  Then it converts all the
      callers that don't specifically need per-zone numbers over to that.
      
      This patch (of 6):
      
      The memcg code currently maintains private per-zone breakdowns of the LRU
      counters.  This is necessary for reclaim decisions which are still
      zone-based, but there are a variety of users of these counters that only
      want the aggregate per-lruvec or per-memcg LRU counts, and they need to
      painfully sum up the zone counters on each request for that.
      
      These would be better served using the memcg vmstats arrays, which track
      VM statistics at the desired scope already.  They just don't have the LRU
      counts right now.
      
      So to kick off the conversion, begin tracking LRU counts in those.
      
      Link: http://lkml.kernel.org/r/20190228163020.24100-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0ee0e71
    • Y
      mm/vmscan: add tracepoints for node reclaim · 132bb8cf
      Yafang Shao 提交于
      The page allocation fast path may perform node reclaim, which can cause
      a latency spike.  We should add tracepoints for this event so that the
      latency it causes can be measured.
      
      So the below two tracepoints are introduced,
      	mm_vmscan_node_reclaim_begin
      	mm_vmscan_node_reclaim_end
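      
      A userspace sketch of why a begin/end pair is enough to measure that
      latency; the stub functions below only stand in for the new tracepoints
      and are not the kernel tracing API:
      
          #include <stdio.h>
          #include <time.h>
          #include <unistd.h>
      
          static struct timespec t_begin;
      
          /* stand-in for the begin tracepoint */
          static void trace_node_reclaim_begin(int nid)
          {
                  clock_gettime(CLOCK_MONOTONIC, &t_begin);
                  printf("node_reclaim_begin nid=%d\n", nid);
          }
      
          /* stand-in for the end tracepoint */
          static void trace_node_reclaim_end(unsigned long nr_reclaimed)
          {
                  struct timespec t_end;
                  long long ns;
      
                  clock_gettime(CLOCK_MONOTONIC, &t_end);
                  ns = (t_end.tv_sec - t_begin.tv_sec) * 1000000000LL +
                       (t_end.tv_nsec - t_begin.tv_nsec);
                  printf("node_reclaim_end reclaimed=%lu latency=%lld ns\n",
                         nr_reclaimed, ns);
          }
      
          int main(void)
          {
                  trace_node_reclaim_begin(0);
                  usleep(2000);                   /* pretend to reclaim pages */
                  trace_node_reclaim_end(32);
                  return 0;
          }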
      
      Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <shaoyafang@didiglobal.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      132bb8cf
    • Y
      mm, compaction: some tracepoints should be defined only when CONFIG_COMPACTION is set · b6cfab7a
      Yafang Shao 提交于
      Only mm_compaction_isolate_{free, migrate}pages may be used when
      CONFIG_COMPACTION is not set.  All others are used only when
      CONFIG_COMPACTION is set.
      
      After this change, if CONFIG_COMPACTION is not set, the tracepoints that
      only work when CONFIG_COMPACTION is set will not be exposed to userspace.
      Without this change, they will always be exposed in debugfs whether
      CONFIG_COMPACTION is set or not.  This is an improvement.
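      
      A minimal sketch of the build-time guard this applies: definitions that
      only make sense with compaction enabled are compiled in (and therefore
      exposed) only when the config symbol is defined.  Only the macro name
      mirrors the kernel option; the rest is illustrative:
      
          #include <stdio.h>
      
          /* build with -DCONFIG_COMPACTION to get the extra definitions */
          #ifdef CONFIG_COMPACTION
          static void trace_compaction_only_event(void)
          {
                  printf("compaction-only tracepoint defined\n");
          }
          #endif
      
          static void trace_isolate_event(void)
          {
                  printf("always-available tracepoint defined\n");
          }
      
          int main(void)
          {
                  trace_isolate_event();
          #ifdef CONFIG_COMPACTION
                  trace_compaction_only_event();
          #endif
                  return 0;
          }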
      
      Link: http://lkml.kernel.org/r/1552440403-11780-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6cfab7a
    • Y
      mm: compaction: show gfp flag names in try_to_compact_pages tracepoint · b1746b99
      Yafang Shao 提交于
      Showing the gfp flag names instead of the gfp_mask makes trace more
      convenient.
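      
      A small sketch of the decoding idea, with made-up flag values rather
      than the real GFP bit layout:
      
          #include <stdio.h>
      
          /* illustrative values only, not the real GFP bit layout */
          #define DEMO_GFP_HIGHMEM  0x1u
          #define DEMO_GFP_MOVABLE  0x2u
          #define DEMO_GFP_ZERO     0x4u
      
          static const struct { unsigned int bit; const char *name; } gfp_names[] = {
                  { DEMO_GFP_HIGHMEM, "__GFP_HIGHMEM" },
                  { DEMO_GFP_MOVABLE, "__GFP_MOVABLE" },
                  { DEMO_GFP_ZERO,    "__GFP_ZERO" },
          };
      
          static void print_gfp_names(unsigned int mask)
          {
                  const char *sep = "";
      
                  for (size_t i = 0; i < sizeof(gfp_names) / sizeof(gfp_names[0]); i++) {
                          if (mask & gfp_names[i].bit) {
                                  printf("%s%s", sep, gfp_names[i].name);
                                  sep = "|";
                          }
                  }
                  printf("\n");
          }
      
          int main(void)
          {
                  unsigned int mask = DEMO_GFP_HIGHMEM | DEMO_GFP_MOVABLE;
      
                  printf("raw mask %#x -> ", mask);
                  print_gfp_names(mask);
                  return 0;
          }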
      
      Link: http://lkml.kernel.org/r/1552527998-13162-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1746b99
    • I
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny 提交于
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
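      
      A sketch of the interface change, with illustrative names and flag
      values: the single boolean becomes a flags word so future options can
      be added without another parameter, while the returned pages stay the
      final argument:
      
          #include <stdio.h>
      
          #define FOLL_WRITE 0x01u        /* illustrative value */
      
          /* old shape: a single boolean leaves no room for more options */
          static long gup_fast_old(unsigned long start, int nr_pages, int write,
                                   void **pages)
          {
                  (void)start; (void)pages;
                  printf("old: write=%d\n", write);
                  return nr_pages;
          }
      
          /* new shape: a flags word; the returned pages stay last */
          static long gup_fast_new(unsigned long start, int nr_pages,
                                   unsigned int gup_flags, void **pages)
          {
                  (void)start; (void)pages;
                  printf("new: writable=%d\n", !!(gup_flags & FOLL_WRITE));
                  return nr_pages;
          }
      
          int main(void)
          {
                  void *pages[1];
      
                  gup_fast_old(0x1000, 1, 1, pages);
                  gup_fast_new(0x1000, 1, FOLL_WRITE, pages); /* converted caller */
                  gup_fast_new(0x1000, 1, 0, pages);          /* read-only caller */
                  return 0;
          }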
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      Signed-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
    • I
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny 提交于
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to do a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect, some of the interfaces are cleaned up, but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
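      
      A sketch of the shape of the change, with illustrative names and flag
      values: instead of a separate *_longterm() entry point, callers pass a
      flag to the one general function, which can then apply the extra checks
      (here, a pretend FS DAX test):
      
          #include <stdio.h>
      
          #define FOLL_WRITE    0x01u     /* illustrative values */
          #define FOLL_LONGTERM 0x02u
      
          static int page_is_fs_dax(unsigned long addr)
          {
                  return addr & 1;        /* pretend odd addresses map FS DAX */
          }
      
          /* one entry point; the flag selects the long-term pin checks */
          static int gup(unsigned long addr, unsigned int gup_flags)
          {
                  if ((gup_flags & FOLL_LONGTERM) && page_is_fs_dax(addr)) {
                          printf("refusing long-term pin of FS DAX page\n");
                          return -1;
                  }
                  printf("pinned page at %#lx\n", addr);
                  return 0;
          }
      
          int main(void)
          {
                  gup(0x2000, FOLL_WRITE);                    /* ordinary pin */
                  gup(0x2001, FOLL_WRITE | FOLL_LONGTERM);    /* rejected */
                  return 0;
          }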
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      Signed-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      932f4a63
    • K
      9851ac13
    • K
      mm: move recent_rotated pages calculation to shrink_inactive_list() · 886cf190
      Kirill Tkhai 提交于
      Patch series "mm: Generalize putback functions"]
      
      putback_inactive_pages() and move_active_pages_to_lru() are almost
      similar, so this patchset merges them ina single function.
      
      This patch (of 4):
      
      The patch moves the calculation from putback_inactive_pages() to
      shrink_inactive_list().  This makes putback_inactive_pages() look more
      similar to move_active_pages_to_lru().
      
      To do that, we account activated pages in reclaim_stat::nr_activate.
      Since a page may change its LRU type from anon to file cache inside
      shrink_page_list() (see ClearPageSwapBacked()), we have to account pages
      for both types.  So, nr_activate becomes an array.
      
      Previously we used nr_activate to account PGACTIVATE events, but now we
      account them in the pgactivate variable (since they are about the number
      of pages in general, not about the sum of hpage_nr_pages).
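      
      A sketch of the accounting change with illustrative names: nr_activate
      becomes a two-slot array so activations can be attributed to anon and
      file separately, since a page's type can change while it is being
      shrunk:
      
          #include <stdio.h>
      
          enum { ACT_ANON, ACT_FILE, NR_ACT };
      
          struct reclaim_stat {
                  unsigned long nr_activate[NR_ACT];      /* was one counter */
          };
      
          static void account_activation(struct reclaim_stat *stat, int is_file,
                                         unsigned long nr_pages)
          {
                  stat->nr_activate[is_file ? ACT_FILE : ACT_ANON] += nr_pages;
          }
      
          int main(void)
          {
                  struct reclaim_stat stat = { { 0, 0 } };
      
                  account_activation(&stat, 0, 1);        /* anon page activated */
                  account_activation(&stat, 1, 512);      /* file THP activated */
                  printf("anon=%lu file=%lu\n",
                         stat.nr_activate[ACT_ANON], stat.nr_activate[ACT_FILE]);
                  return 0;
          }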
      
      Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      886cf190
    • M
      mm: page cache: store only head pages in i_pages · 5fd4ca2d
      Matthew Wilcox 提交于
      Transparent Huge Pages are currently stored in i_pages as pointers to
      consecutive subpages.  This patch changes that to storing consecutive
      pointers to the head page in preparation for storing huge pages more
      efficiently in i_pages.
      
      Large parts of this are "inspired" by Kirill's patch
      https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/
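      
      A toy model of what storing only head pages means for lookup (not the
      XArray code): every slot covered by a huge page points at the same
      head, and the caller derives the subpage offset from the requested
      index:
      
          #include <stdio.h>
      
          #define NR_SLOTS 8
      
          struct page { unsigned long index; };
      
          static struct page head = { .index = 4 };   /* 4-subpage huge page */
          static struct page *slots[NR_SLOTS];
      
          int main(void)
          {
                  /* every slot covered by the huge page stores the head pointer */
                  for (unsigned long i = 4; i < 8; i++)
                          slots[i] = &head;
      
                  unsigned long want = 6;
                  struct page *h = slots[want];
                  unsigned long offset = want - h->index; /* subpage = head + offset */
      
                  printf("slot %lu -> head index %lu, subpage offset %lu\n",
                         want, h->index, offset);
                  return 0;
          }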
      
      [willy@infradead.org: fix swapcache pages]
        Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
      [kirill@shutemov.name: hugetlb stores pages in page cache differently]
        Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
      Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NKirill Shutemov <kirill@shutemov.name>
      Reviewed-and-tested-by: NSong Liu <songliubraving@fb.com>
      Tested-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Tested-by: NQian Cai <cai@lca.pw>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Song Liu <liu.song.a23@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fd4ca2d
    • P
      userfaultfd/sysctl: add vm.unprivileged_userfaultfd · cefdca0a
      Peter Xu 提交于
      Userfaultfd can be misused to make it easier to exploit existing
      use-after-free (and similar) bugs that might otherwise only make a
      short window or race condition available.  By using userfaultfd to
      stall a kernel thread, a malicious program can keep some state that it
      wrote, stable for an extended period, which it can then access using an
      existing exploit.  While it doesn't cause the exploit itself, and while
      it's not the only thing that can stall a kernel thread when accessing a
      memory location, it's one of the few that never needs privilege.
      
      We can add a flag, allowing userfaultfd to be restricted, so that in
      general it won't be usable by arbitrary user programs, but in
      environments that require userfaultfd it can be turned back on.
      
      Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
      whether userfaultfd is allowed by unprivileged users.  When this is
      set to zero, only privileged users (root user, or users with the
      CAP_SYS_PTRACE capability) will be able to use the userfaultfd
      syscalls.
      
      Andrea said:
      
      : The only difference between the bpf sysctl and the userfaultfd sysctl
      : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
      : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
      : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
      : already if it's doing other kind of tracking on processes runtime, in
      : addition of userfaultfd.  In other words both syscalls works only for
      : root, when the two sysctl are opt-in set to 1.
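      
      A small userspace sketch that checks the new knob; vm.* sysctls are
      exposed under /proc/sys/vm/, so the path below is expected to exist on
      kernels carrying this patch:
      
          #include <stdio.h>
      
          int main(void)
          {
                  FILE *f = fopen("/proc/sys/vm/unprivileged_userfaultfd", "r");
                  int val;
      
                  if (!f) {
                          perror("unprivileged_userfaultfd sysctl not available");
                          return 1;
                  }
                  if (fscanf(f, "%d", &val) == 1)
                          printf("unprivileged userfaultfd is %s\n",
                                 val ? "allowed" : "restricted to privileged users");
                  fclose(f);
                  return 0;
          }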
      
      [dgilbert@redhat.com: changelog additions]
      [akpm@linux-foundation.org: documentation tweak, per Mike]
      Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Suggested-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cefdca0a
    • Y
      include/trace/events/vmscan.h: drop zone id from kswapd tracepoints · 3b775998
      Yafang Shao 提交于
      It is not clear how the zone id is useful in kswapd tracepoints and the id
      itself is not really easy to process because it depends on the
      configuration (available zones).  Let's drop the id for now.  If somebody
      really needs that information then the zone name should be used instead.
      
      [mhocko@suse.com: new changelog]
      Link: http://lkml.kernel.org/r/1552451813-10833-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b775998
    • T
      mm: remove stale comment from page struct · 3e05617c
      Tobin C. Harding 提交于
      We now use the slab_list list_head instead of the lru list_head.  This
      comment has become stale.
      
      Remove stale comment from page struct slab_list list_head.
      
      Link: http://lkml.kernel.org/r/20190402230545.2929-8-tobin@kernel.org
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e05617c
    • T
      list: add function list_rotate_to_front() · a16b5384
      Tobin C. Harding 提交于
      Patch series "mm: Use slab_list list_head instead of lru", v5.
      
      Currently the slab allocators (ab)use the struct page 'lru' list_head.  We
      have a list head for slab allocators to use, 'slab_list'.
      
      During v2 it was noted by Christoph that the SLOB allocator was reaching
      into a list_head, this version adds 2 patches to the front of the set to
      fix that.
      
      Clean up all three allocators by using the 'slab_list' list_head instead
      of overloading the 'lru' list_head.
      
      This patch (of 7):
      
      Currently, if we wish to rotate a list until a specific item is at the
      front of the list, we can call list_move_tail(head, list).  Note that the
      arguments are reversed compared to the usual use of list_move_tail(list,
      head).  This is a hack: it depends on the developer knowing how the
      list_head operates internally, which violates the layer of abstraction
      offered by the list_head.  It is also not intuitive, so the next developer
      to come along must study list.h in order to fully understand what is meant
      by the call; while that may be 'good for' the developer, it makes the code
      harder to read.  We should have an appropriately named function that does
      this if there are in-tree users for it.
      
      By grep'ing the tree for list_move_tail() and list_tail() and attempting
      to guess the argument order from the names, it seems there is only one
      place currently in the tree that does this - the slob allocator.
      
      Add function list_rotate_to_front() to rotate a list until the specified
      item is at the front of the list.
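      
      A self-contained sketch of the semantics being named, using a miniature
      list rather than list.h: rotating until a given entry is first is the
      same operation as moving the list head to just before that entry, which
      is what the reversed list_move_tail() trick relied on:
      
          #include <stdio.h>
          #include <stddef.h>
      
          struct list_head { struct list_head *prev, *next; };
      
          static void INIT_LIST_HEAD(struct list_head *h) { h->prev = h->next = h; }
      
          static void __list_add(struct list_head *n, struct list_head *prev,
                                 struct list_head *next)
          {
                  next->prev = n;
                  n->next = next;
                  n->prev = prev;
                  prev->next = n;
          }
      
          static void list_add_tail(struct list_head *n, struct list_head *head)
          {
                  __list_add(n, head->prev, head);
          }
      
          static void list_del(struct list_head *e)
          {
                  e->prev->next = e->next;
                  e->next->prev = e->prev;
          }
      
          static void list_move_tail(struct list_head *e, struct list_head *head)
          {
                  list_del(e);
                  list_add_tail(e, head);
          }
      
          /* rotate the list so that @entry becomes the first element */
          static void list_rotate_to_front(struct list_head *entry,
                                           struct list_head *head)
          {
                  /* the old trick spelled this list_move_tail(head, entry) */
                  list_move_tail(head, entry);
          }
      
          struct item { int val; struct list_head node; };
      
          int main(void)
          {
                  struct list_head head;
                  struct item items[4] = { { 1 }, { 2 }, { 3 }, { 4 } };
      
                  INIT_LIST_HEAD(&head);
                  for (int i = 0; i < 4; i++)
                          list_add_tail(&items[i].node, &head);
      
                  list_rotate_to_front(&items[2].node, &head);    /* want 3 first */
      
                  for (struct list_head *p = head.next; p != &head; p = p->next) {
                          struct item *it = (struct item *)((char *)p -
                                                    offsetof(struct item, node));
                          printf("%d ", it->val);                 /* prints: 3 4 1 2 */
                  }
                  printf("\n");
                  return 0;
          }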
      
      Link: http://lkml.kernel.org/r/20190402230545.2929-2-tobin@kernel.org
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a16b5384
    • D
      mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses · fce86ff5
      Dan Williams 提交于
      Starting with c6f3c5ee ("mm/huge_memory.c: fix modifying of page
      protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
      pmdp_set_access_flags().  That helper enforces a pmd aligned @address
      argument via VM_BUG_ON() assertion.
      
      Update the implementation to take a 'struct vm_fault' argument directly
      and apply the address alignment fixup internally to fix crash signatures
      like:
      
          kernel BUG at arch/x86/mm/pgtable.c:515!
          invalid opcode: 0000 [#1] SMP NOPTI
          CPU: 51 PID: 43713 Comm: java Tainted: G           OE     4.19.35 #1
          [..]
          RIP: 0010:pmdp_set_access_flags+0x48/0x50
          [..]
          Call Trace:
           vmf_insert_pfn_pmd+0x198/0x350
           dax_iomap_fault+0xe82/0x1190
           ext4_dax_huge_fault+0x103/0x1f0
           ? __switch_to_asm+0x40/0x70
           __handle_mm_fault+0x3f6/0x1370
           ? __switch_to_asm+0x34/0x70
           ? __switch_to_asm+0x40/0x70
           handle_mm_fault+0xda/0x200
           __do_page_fault+0x249/0x4f0
           do_page_fault+0x32/0x110
           ? page_fault+0x8/0x30
           page_fault+0x1e/0x30
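      
      A sketch of the alignment fixup applied internally: the faulting
      address is rounded down to the start of the huge-page-sized region
      before the page table helpers see it (the 2MB constant is illustrative
      of an x86-64 PMD):
      
          #include <stdio.h>
      
          #define PMD_SIZE (2UL * 1024 * 1024)            /* 2MB PMD span */
          #define PMD_MASK (~(PMD_SIZE - 1))
      
          int main(void)
          {
                  unsigned long fault_addr = 0x7f12345e1000UL;    /* unaligned */
                  unsigned long pmd_addr = fault_addr & PMD_MASK; /* fixed up */
      
                  printf("fault=%#lx pmd-aligned=%#lx\n", fault_addr, pmd_addr);
                  return 0;
          }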
      
      Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: c6f3c5ee ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: NPiotr Balcer <piotr.balcer@intel.com>
      Tested-by: NYan Ma <yan.ma@intel.com>
      Tested-by: NPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Chandan Rajendra <chandan@linux.ibm.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fce86ff5