1. 20 9月, 2020 1 次提交
    • R
      mm/thp: fix __split_huge_pmd_locked() for migration PMD · ec0abae6
      Ralph Campbell 提交于
      A migrating transparent huge page has to already be unmapped.  Otherwise,
      the page could be modified while it is being copied to a new page and data
      could be lost.  The function __split_huge_pmd() checks for a PMD migration
      entry before calling __split_huge_pmd_locked() leading one to think that
      __split_huge_pmd_locked() can handle splitting a migrating PMD.
      
      However, the code always increments the page->_mapcount and adjusts the
      memory control group accounting assuming the page is mapped.
      
      Also, if the PMD entry is a migration PMD entry, the call to
      is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
      of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).  Fix these problems by
      checking for a PMD migration entry.
      
      Fixes: 84c3fc4e ("mm: thp: check pmd migration entry in common path")
      Signed-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Reviewed-by: NZi Yan <ziy@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec0abae6
  2. 05 9月, 2020 1 次提交
  3. 13 8月, 2020 2 次提交
  4. 08 8月, 2020 3 次提交
  5. 10 6月, 2020 2 次提交
  6. 05 6月, 2020 1 次提交
  7. 04 6月, 2020 8 次提交
  8. 03 6月, 2020 1 次提交
    • L
      gup: document and work around "COW can break either way" issue · 17839856
      Linus Torvalds 提交于
      Doing a "get_user_pages()" on a copy-on-write page for reading can be
      ambiguous: the page can be COW'ed at any time afterwards, and the
      direction of a COW event isn't defined.
      
      Yes, whoever writes to it will generally do the COW, but if the thread
      that did the get_user_pages() unmapped the page before the write (and
      that could happen due to memory pressure in addition to any outright
      action), the writer could also just take over the old page instead.
      
      End result: the get_user_pages() call might result in a page pointer
      that is no longer associated with the original VM, and is associated
      with - and controlled by - another VM having taken it over instead.
      
      So when doing a get_user_pages() on a COW mapping, the only really safe
      thing to do would be to break the COW when getting the page, even when
      only getting it for reading.
      
      At the same time, some users simply don't even care.
      
      For example, the perf code wants to look up the page not because it
      cares about the page, but because the code simply wants to look up the
      physical address of the access for informational purposes, and doesn't
      really care about races when a page might be unmapped and remapped
      elsewhere.
      
      This adds logic to force a COW event by setting FOLL_WRITE on any
      copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
      pointer as a result.
      
      The current semantics end up being:
      
       - __get_user_pages_fast(): no change. If you don't ask for a write,
         you won't break COW. You'd better know what you're doing.
      
       - get_user_pages_fast(): the fast-case "look it up in the page tables
         without anything getting mmap_sem" now refuses to follow a read-only
         page, since it might need COW breaking.  Which happens in the slow
         path - the fast path doesn't know if the memory might be COW or not.
      
       - get_user_pages() (including the slow-path fallback for gup_fast()):
         for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
         very similar semantics to FOLL_FORCE.
      
      If it turns out that we want finer granularity (ie "only break COW when
      it might actually matter" - things like the zero page are special and
      don't need to be broken) we might need to push these semantics deeper
      into the lookup fault path.  So if people care enough, it's possible
      that we might end up adding a new internal FOLL_BREAK_COW flag to go
      with the internal FOLL_COW flag we already have for tracking "I had a
      COW".
      
      Alternatively, if it turns out that different callers might want to
      explicitly control the forced COW break behavior, we might even want to
      make such a flag visible to the users of get_user_pages() instead of
      using the above default semantics.
      
      But for now, this is mostly commentary on the issue (this commit message
      being a lot bigger than the patch, and that patch in turn is almost all
      comments), with that minimal "enable COW breaking early" logic using the
      existing FOLL_WRITE behavior.
      
      [ It might be worth noting that we've always had this ambiguity, and it
        could arguably be seen as a user-space issue.
      
        You only get private COW mappings that could break either way in
        situations where user space is doing cooperative things (ie fork()
        before an execve() etc), but it _is_ surprising and very subtle, and
        fork() is supposed to give you independent address spaces.
      
        So let's treat this as a kernel issue and make the semantics of
        get_user_pages() easier to understand. Note that obviously a true
        shared mapping will still get a page that can change under us, so this
        does _not_ mean that get_user_pages() somehow returns any "stable"
        page ]
      Reported-by: NJann Horn <jannh@google.com>
      Tested-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKirill Shutemov <kirill@shutemov.name>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17839856
  9. 05 5月, 2020 1 次提交
  10. 08 4月, 2020 6 次提交
    • P
      userfaultfd: wp: support swap and page migration · f45ec5ff
      Peter Xu 提交于
      For either swap and page migration, we all use the bit 2 of the entry to
      identify whether this entry is uffd write-protected.  It plays a similar
      role as the existing soft dirty bit in swap entries but only for keeping
      the uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we want to recover the uffd-wp bit
      from a swap/migration entry to the PTE bit we'll also need to take care of
      the _PAGE_RW bit and make sure it's cleared, otherwise even with the
      _PAGE_UFFD_WP bit we can't trap it at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
      also applies/removes the uffd-wp bit even for the swap entries.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff
    • P
      userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork · b569a176
      Peter Xu 提交于
      UFFD_EVENT_FORK support for uffd-wp should be already there, except that
      we should clean the uffd-wp bit if uffd fork event is not enabled.  Detect
      that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
      by VM_UFFD_WP.  Do this for both small PTEs and huge PMDs.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b569a176
    • P
      userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Peter Xu 提交于
      Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
      change_protection() when used with uffd-wp and make sure the two new flags
      are exclusively used.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
      
      Do this change for both PTEs and huge PMDs.  Then we can start to identify
      which PTE/PMD is write protected by general (e.g., COW or soft dirty
      tracking), and which is for userfaultfd-wp.
      
      Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
      into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
      can be even more strict when detecting uffd-wp page faults in either
      do_wp_page() or wp_huge_pmd().
      
      After we're with _PAGE_UFFD_WP, a special case is when a page is both
      protected by the general COW logic and also userfault-wp.  Here the
      userfault-wp will have higher priority and will be handled first.  Only
      after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
      the general COW.  These are the steps on what will happen with such a
      page:
      
        1. CPU accesses write protected shared page (so both protected by
           general COW and uffd-wp), blocked by uffd-wp first because in
           do_wp_page we'll handle uffd-wp first, so it has higher priority
           than general COW.
      
        2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
           to remove the uffd-wp bit upon the PTE/PMD.  However here we
           still keep the write bit cleared.  Notify the blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry, during retry it'll notice it was not with the uffd-wp bit
           this time but it is still write protected by general COW, then
           it'll go though the COW path in the fault handler, copy the page,
           apply write bit where necessary, and retry again.
      
        4. The CPU will be able to access this page with write bit set.
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      292924b2
    • P
      mm: merge parameters for change_protection() · 58705444
      Peter Xu 提交于
      change_protection() was used by either the NUMA or mprotect() code,
      there's one parameter for each of the callers (dirty_accountable and
      prot_numa).  Further, these parameters are passed along the calls:
      
        - change_protection_range()
        - change_p4d_range()
        - change_pud_range()
        - change_pmd_range()
        - ...
      
      Now we introduce a flag for change_protect() and all these helpers to
      replace these parameters.  Then we can avoid passing multiple parameters
      multiple times along the way.
      
      More importantly, it'll greatly simplify the work if we want to introduce
      any new parameters to change_protection().  In the follow up patches, a
      new parameter for userfaultfd write protection will be introduced.
      
      No functional change at all.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58705444
    • M
      mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE · 396bcc52
      Matthew Wilcox (Oracle) 提交于
      Commit e496cf3d ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
      notes that it should be reverted when the PowerPC problem was fixed.  The
      commit fixing the PowerPC problem (953c66c2) did not revert the
      commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
      CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was an
      oversight, so remove the Kconfig symbol and undo the work of commit
      e496cf3d.
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      396bcc52
    • D
      mm, thp: track fallbacks due to failed memcg charges separately · 85b9f46e
      David Rientjes 提交于
      The thp_fault_fallback and thp_file_fallback vmstats are incremented if
      either the hugepage allocation fails through the page allocator or the
      hugepage charge fails through mem cgroup.
      
      This patch leaves this field untouched but adds two new fields,
      thp_{fault,file}_fallback_charge, which is incremented only when the mem
      cgroup charge fails.
      
      This distinguishes between attempted hugepage allocations that fail due to
      fragmentation (or low memory conditions) and those that fail due to mem
      cgroup limits.  That can be used to determine the impact of fragmentation
      on the system by excluding faults that failed due to memcg usage.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jeremy Cline <jcline@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85b9f46e
  11. 03 4月, 2020 1 次提交
  12. 25 3月, 2020 2 次提交
    • T
      mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entries · 9a9731b1
      Thomas Hellstrom (VMware) 提交于
      For graphics drivers needing to modify the page-protection, add
      huge page-table entries counterparts to vmf_insert_pfn_prot().
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NThomas Hellstrom (VMware) <thomas_os@shipmail.org>
      Acked-by: NChristian König <christian.koenig@amd.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      9a9731b1
    • T
      mm: Introduce vma_is_special_huge · 2484ca9b
      Thomas Hellstrom (VMware) 提交于
      For VM_PFNMAP and VM_MIXEDMAP vmas that want to support transhuge pages
      and -page table entries, introduce vma_is_special_huge() that takes the
      same codepaths as vma_is_dax().
      
      The use of "special" follows the definition in memory.c, vm_normal_page():
      "Special" mappings do not wish to be associated with a "struct page"
      (either it doesn't exist, or it exists but they don't want to touch it)
      
      For PAGE_SIZE pages, "special" is determined per page table entry to be
      able to deal with COW pages. But since we don't have huge COW pages,
      we can classify a vma as either "special huge" or "normal huge".
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NThomas Hellstrom (VMware) <thomas_os@shipmail.org>
      Acked-by: NChristian König <christian.koenig@amd.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      2484ca9b
  13. 06 3月, 2020 1 次提交
  14. 01 2月, 2020 4 次提交
  15. 28 1月, 2020 1 次提交
  16. 14 1月, 2020 1 次提交
    • K
      mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment · 97d3d0f9
      Kirill A. Shutemov 提交于
      Patch series "Fix two above-47bit hint address vs.  THP bugs".
      
      The two get_unmapped_area() implementations have to be fixed to provide
      THP-friendly mappings if above-47bit hint address is specified.
      
      This patch (of 2):
      
      Filesystems use thp_get_unmapped_area() to provide THP-friendly
      mappings.  For DAX in particular.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for allocation from full address space by specifying
      hint address (with or without MAP_FIXED) above 47-bits.  If the
      application doesn't need a particular address, but wants to allocate
      from whole address space it can specify -1 as a hint address.
      
      Unfortunately, this trick breaks thp_get_unmapped_area(): the function
      would not try to allocate PMD-aligned area if *any* hint address
      specified.
      
      Modify the routine to handle it correctly:
      
       - Try to allocate the space at the specified hint address with length
         padding required for PMD alignment.
       - If failed, retry without length padding (but with the same hint
         address);
       - If the returned address matches the hint address return it.
       - Otherwise, align the address as required for THP and return.
      
      The user specified hint address is passed down to get_unmapped_area() so
      above-47bit hint address will be taken into account without breaking
      alignment requirements.
      
      Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NThomas Willhalm <thomas.willhalm@intel.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97d3d0f9
  17. 02 12月, 2019 1 次提交
  18. 19 10月, 2019 1 次提交
  19. 29 9月, 2019 2 次提交
    • D
      Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      David Rientjes 提交于
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19deb769
    • D
      Revert "Revert "mm, thp: restore node-local hugepage allocations"" · ac79f78d
      David Rientjes 提交于
      This reverts commit a8282608.
      
      The commit references the original intended semantic for MADV_HUGEPAGE
      which has subsequently taken on three unique purposes:
      
       - enables or disables thp for a range of memory depending on the system's
         config (is thp "enabled" set to "always" or "madvise"),
      
       - determines the synchronous compaction behavior for thp allocations at
         fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
         and
      
       - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
         clear previous hugepage advice).
      
      These are the three purposes that currently exist in 5.2 and over the
      past several years that userspace has been written around.  Adding a
      NUMA locality preference adds a fourth dimension to an already conflated
      advice mode.
      
      Based on the semantic that MADV_HUGEPAGE has provided over the past
      several years, there exist workloads that use the tunable based on these
      principles: specifically that the allocation should attempt to
      defragment a local node before falling back.  It is agreed that remote
      hugepages typically (but not always) have a better access latency than
      remote native pages, although on Naples this is at parity for
      intersocket.
      
      The revert commit that this patch reverts allows hugepage allocation to
      immediately allocate remotely when local memory is fragmented.  This is
      contrary to the semantic of MADV_HUGEPAGE over the past several years:
      that is, memory compaction should be attempted locally before falling
      back.
      
      The performance degradation of remote hugepages over local hugepages on
      Rome, for example, is 53.5% increased access latency.  For this reason,
      the goal is to revert back to the 5.2 and previous behavior that would
      attempt local defragmentation before falling back.  With the patch that
      is reverted by this patch, we see performance degradations at the tail
      because the allocator happily allocates the remote hugepage rather than
      even attempting to make a local hugepage available.
      
      zone_reclaim_mode is not a solution to this problem since it does not
      only impact hugepage allocations but rather changes the memory
      allocation strategy for *all* page allocations.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac79f78d