1. 01 Feb 2018 (1 commit)
    • mm: show total hugetlb memory consumption in /proc/meminfo · fcb2b0c5
      Authored by Roman Gushchin
      Currently we display some hugepage statistics (total, free, etc.) in
      /proc/meminfo, but only for the default hugepage size (e.g. 2MB).

      If hugepages of different sizes are used (like 2MB and 1GB on x86-64),
      /proc/meminfo output can be confusing, as non-default-sized hugepages
      are not reflected at all, and there is no sign that they exist and are
      consuming system memory.

      To solve this problem, let's display the total amount of memory
      consumed by hugetlb pages of all sizes (both free and in use).  Let's
      call it "Hugetlb", and display the size in kB to match the generic
      /proc/meminfo style.

      For example (with 1024 2MB pages and two 1GB pages pre-allocated):
        $ cat /proc/meminfo
        MemTotal:        8168984 kB
        MemFree:         3789276 kB
        <...>
        CmaFree:               0 kB
        HugePages_Total:    1024
        HugePages_Free:     1024
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:         4194304 kB
        DirectMap4k:       32632 kB
        DirectMap2M:     4161536 kB
        DirectMap1G:     6291456 kB
      
      Also, this patch updates the corresponding docs to explain the meaning
      of the Hugetlb entry and the difference between Hugetlb and
      HugePages_Total * Hugepagesize.
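
      For illustration, a minimal sketch of the accounting this entry
      implies, summing nr_huge_pages over every hstate (the helper name is an
      assumption; the real reporting hook is hugetlb_report_meminfo()):

        /* Illustrative only: total hugetlb consumption across all sizes. */
        static unsigned long hugetlb_total_kb(void)
        {
                struct hstate *h;
                unsigned long total = 0;

                for_each_hstate(h)
                        total += h->nr_huge_pages * pages_per_huge_page(h);

                return total << (PAGE_SHIFT - 10);      /* pages -> kB */
        }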
      
      Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 Nov 2017 (2 commits)
  3. 16 Nov 2017 (1 commit)
    • mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Authored by Jérôme Glisse
      This patch only affects users of the mmu_notifier->invalidate_range
      callback, i.e. device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM,
      and the like, and it is an optimization for those users.  Everyone else
      is unaffected by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (the notify versions of the *_clear_flush helpers
      do call mmu_notifier_invalidate_range).  But that notification is not
      necessary in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in-depth analysis of why this is safe:
      
      For secondary TLBs (non-CPU TLBs) like the IOMMU TLB or a device TLB
      (when the device uses something like ATS/PASID to have the IOMMU walk
      the CPU page table to access a process virtual address space), there
      are only two cases where you need to notify those secondary TLBs while
      holding the page table lock when clearing a pte/pmd:
      
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious: you do not want to take the risk of the device
      writing to a page that might now be used for something completely
      different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before
      setting the new pte/pmd value, then you can break the memory model
      (e.g. C11 or C++11) for the device.
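
      A minimal sketch of the required sequence for case B, assuming a plain
      COW-style pte replacement (kernel helper names; locking and error
      handling reduced to the essentials):

        mmu_notifier_invalidate_range_start(mm, start, end);
        spin_lock(ptl);                         /* page table lock */
        /* Clear AND notify secondary TLBs before the new entry is visible. */
        ptep_clear_flush_notify(vma, addr, ptep);
        set_pte_at(mm, addr, ptep, mk_pte(new_page, vma->vm_page_prot));
        spin_unlock(ptl);
        mmu_notifier_invalidate_range_end(mm, start, end);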
      
      Consider the following scenario (the device uses a feature similar to
      ATS/PASID):

      Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE;
      we assume they are write protected for COW (the other cases of B apply
      too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+6] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here, because at time N+2 the cleared page table entry was not
      paired with a notification to invalidate the secondary TLB, the device
      sees the new value for addrB before seeing the new value for addrA.
      This breaks total memory ordering for the device.
      
      When changing a pte to write-protect, or to point to a new
      write-protected page with the same content (KSM), it is OK to delay the
      invalidate_range callback to mmu_notifier_invalidate_range_end()
      outside the page table lock.  This is true even if the thread doing the
      page table update is preempted right after releasing the page table
      lock, before calling mmu_notifier_invalidate_range_end().
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 03 Nov 2017 (1 commit)
  5. 07 Sep 2017 (3 commits)
  6. 15 Aug 2017 (1 commit)
    • mm/hugetlb: Allow arch to override and call the weak function · e24a1307
      Authored by Aneesh Kumar K.V
      When running in guest mode, ppc64 supports a different mechanism for
      hugetlb allocation/reservation.  The LPAR management application,
      called the HMC, can be used to reserve a set of hugepages, and we pass
      the details of the reserved pages to the guest via the device tree
      (more details in htab_dt_scan_hugepage_blocks()).  We memblock_reserve
      the range and, later in the boot sequence, add the reserved range to
      huge_boot_pages.

      But to enable 16G hugetlb pages on a bare-metal config (when we are not
      running as a guest) we want to do the memblock reservation during boot.
      Generic code already does this.
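
      A sketch of the weak-function pattern the title refers to, assuming the
      generic body has been renamed __alloc_bootmem_huge_page (the override
      shape is illustrative):

        /* Generic weak default; an arch (e.g. ppc64) may provide its own
         * definition and still call back into the generic helper. */
        int __init __weak alloc_bootmem_huge_page(struct hstate *h)
        {
                return __alloc_bootmem_huge_page(h);
        }
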
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  7. 11 Aug 2017 (1 commit)
  8. 03 Aug 2017 (1 commit)
    • mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors · 2be7cfed
      Authored by Daniel Jordan
      Commit 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when
      FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
      errors from follow_hugetlb_page.  After such an error, __get_user_pages
      subsequently calls faultin_page on the same VMA and start address that
      follow_hugetlb_page failed on, instead of returning the error
      immediately as it should.
      
      In follow_hugetlb_page, when hugetlb_fault returns a value covered by
      VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
      to 0, as __get_user_pages expects in this case.  This causes the
      following to happen in __get_user_pages: the "while (nr_pages)" check
      succeeds; we skip the "if (!vma..." check because we got a VMA the last
      time around; we find no page with follow_page_mask; and we call
      faultin_page, which calls hugetlb_fault for the second time.
      
      This issue also slightly changes how __get_user_pages works.  Before,
      it only returned an error if it had made no progress (i == 0).  But
      now, follow_hugetlb_page can clobber "i" with an error code, since its
      new return path doesn't check for progress.  So if "i" is nonzero
      before a failing call to follow_hugetlb_page, that indication of
      progress is lost and __get_user_pages can return an error even if some
      pages were successfully pinned.
      
      To fix this, change follow_hugetlb_page so that it updates nr_pages,
      allowing __get_user_pages to fail immediately and restoring the "error
      only if no progress" behavior to __get_user_pages.
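
      A sketch of that fix inside follow_hugetlb_page()'s fault path
      (vm_fault_to_errno() is the helper introduced alongside 9a291a7c; the
      surrounding variable names are assumptions):

        ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
        if (ret & VM_FAULT_ERROR) {
                err = vm_fault_to_errno(ret, flags);
                remainder = 0;          /* reported back via *nr_pages */
                break;                  /* __get_user_pages fails fast */
        }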
      
      Tested that __get_user_pages returns when expected on error from
      hugetlb_fault in follow_hugetlb_page.
      
      Fixes: 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
      Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: <stable@vger.kernel.org>	[4.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 13 Jul 2017 (1 commit)
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Authored by Michal Hocko
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics
      in the page allocator.  This has been true, but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantics for those requests, and they are
      considered too important to fail, so they might end up looping in the
      page allocator forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is no
      promise of success.  This works independent of the order and overrides
      the default allocator behavior.  Page allocator users have several
      levels of guarantee vs. cost options (take GFP_KERNEL as an example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick the background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context, but it can wake kswapd to reclaim memory if the zone is
         below the low watermark. Can be used from either atomic contexts or
         when the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had this semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL
      if there is no progress and we have already passed the OOM point.

      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the page
      allocator.
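
      As a usage illustration (a hypothetical caller, not part of the patch),
      a request that tries hard but degrades gracefully instead of triggering
      the OOM killer might look like this:

        /* Try hard for a costly allocation, but accept failure and use a
         * caller-defined fallback rather than disrupt the whole system. */
        page = alloc_pages(GFP_KERNEL | __GFP_RETRY_MAYFAIL, order);
        if (!page)
                return -ENOMEM;         /* fallback path, no OOM killer */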
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 11 Jul 2017 (9 commits)
    • hugetlb: add support for preferred node to alloc_huge_page_nodemask · 3e59fcb0
      Authored by Michal Hocko
      alloc_huge_page_nodemask tries to allocate from any NUMA node in the
      allowed node mask, starting from the lower-numbered nodes.  This might
      lead to filling up those low NUMA nodes while others are not used.  We
      can reduce this risk by introducing the concept of a preferred node,
      similar to what we have in the regular page allocator.  We start
      allocating from the preferred nid and then iterate over all allowed
      nodes in zonelist order until we have tried them all.
      
      This mimics the page allocator logic, except that it operates on
      per-node mempools.  dequeue_huge_page_vma already does this, so distill
      the zonelist logic into a more generic dequeue_huge_page_nodemask and
      use it in alloc_huge_page_nodemask.
      
      This will allow us to use a proper per-NUMA-distance fallback also for
      alloc_huge_page_node, which can use alloc_huge_page_nodemask now, and
      we can get rid of the alloc_huge_page_node helper, which no longer has
      any users.
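
      A sketch of the zonelist walk such a generic dequeue helper performs,
      modeled on the page allocator's iteration (the helper names are
      assumptions and may differ from the patch):

        struct zonelist *zonelist = node_zonelist(preferred_nid, gfp_mask);
        struct zoneref *z;
        struct zone *zone;

        /* Visit zones in distance order from preferred_nid, restricted to
         * nmask, and take the first node with a page in its pool. */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), nmask) {
                int nid = zone_to_nid(zone);

                if (!list_empty(&h->hugepage_freelists[nid]))
                        return dequeue_huge_page_node_exact(h, nid);
        }
        return NULL;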
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: unclutter hugetlb allocation layers · aaf14e40
      Authored by Michal Hocko
      Patch series "mm, hugetlb: allow proper node fallback dequeue".
      
      While working on a hugetlb migration issue addressed in a separate
      patchset [1], I noticed that hugetlb allocations from the preallocated
      pool are quite suboptimal.
      
       [1] http://lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
      
      There is no fallback mechanism implemented and no notion of a preferred
      node.  I tried to work around it, but Vlastimil was right to push back
      for a more robust solution.  It seems that such a solution is to reuse
      the zonelist approach we use in the page allocator.
      
      This series has 3 patches.  The first one makes the hugetlb allocation
      layers clearer.  The second one implements the zonelist hugetlb pool
      allocation and introduces a preferred-node semantic, which is used by
      the migration callbacks.  The last patch is a cleanup.
      
      This patch (of 3):
      
      The hugetlb allocation path for fresh huge pages is unnecessarily
      complex and mixes different interfaces between layers.
      
      __alloc_buddy_huge_page is the central place to perform a new
      allocation.  It checks for the hugetlb overcommit and then relies on
      __hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
      all good, except that __alloc_buddy_huge_page pushes vma and address
      down the call chain, so __hugetlb_alloc_buddy_huge_page has to deal
      with two different allocation modes - one for memory-policy requests
      and another for node-specific (or, to make it more obscure,
      node-non-specific) requests.
      
      This just screams for a reorganization.
      
      This patch pulls all the vma-specific handling up to
      __alloc_buddy_huge_page_with_mpol, where it belongs.
      __alloc_buddy_huge_page gets a nodemask argument and
      __hugetlb_alloc_buddy_huge_page becomes a trivial wrapper over the
      page allocator.
      
      In short:
      __alloc_buddy_huge_page_with_mpol - memory policy handling
        __alloc_buddy_huge_page - overcommit handling and accounting
          __hugetlb_alloc_buddy_huge_page - page allocator layer
      
      Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry
      loop are not really needed, because the page allocator already handles
      cpuset updates.
      
      Finally, __hugetlb_alloc_buddy_huge_page had a special case for
      node-specific allocations (when no policy is applied and a node is
      given).  This relied on __GFP_THISNODE to prevent fallback to a
      different node.  alloc_huge_page_node is the only caller which relies
      on this behavior, so move the __GFP_THISNODE there.
      
      Not only does this remove quite some code, it should also make those
      layers easier to follow and clearer with respect to responsibilities.
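
      Under that layering, the bottom layer reduces to a thin page allocator
      wrapper; a sketch (four-argument __alloc_pages_nodemask as in 4.12+;
      the exact flags are assumptions):

        /* Bottom layer after the cleanup: just a page allocator call. */
        static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
                        gfp_t gfp_mask, int nid, nodemask_t *nmask)
        {
                int order = huge_page_order(h);

                gfp_mask |= __GFP_COMP | __GFP_NOWARN;
                return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
        }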
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: replace memfmt with string_get_size · c6247f72
      Authored by Matthew Wilcox
      The hugetlb code has its own function to report human-readable sizes.
      Convert it to use the shared string_get_size() function.  This leads to
      a minor difference in user-visible output (MiB/GiB instead of MB/GB),
      but some would argue that's desirable anyway.
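
      For reference, a usage sketch of the shared helper (signature from
      <linux/string_helpers.h>; the surrounding message is illustrative):

        char buf[32];

        /* Format the huge page size in binary units (KiB/MiB/GiB). */
        string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf,
                        sizeof(buf));
        pr_warn("HugeTLB: allocation failed for page size %s\n", buf);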
      
      Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: schedule when potentially allocating many hugepages · 69ed779a
      Authored by David Rientjes
      A few hugetlb allocators loop while calling the page allocator and can
      potentially prevent rescheduling if the page allocator slowpath is not
      utilized.
      
      Conditionally schedule when large numbers of hugepages can be allocated.
      
      Anshuman:
       "Fixes a task which was getting hung while writing like 10000 hugepages
        (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
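
      A minimal sketch of the change (the loop shape is illustrative; the
      patch adds cond_resched() to the existing hugetlb allocation loops):

        /* Yield periodically while preallocating a potentially large count. */
        for (i = 0; i < count; i++) {
                if (!alloc_fresh_huge_page(h, nodes_allowed))
                        break;
                cond_resched();
        }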
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Authored by Michal Hocko
      new_node_page will try to use the origin node's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool, it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed evenly across all
      NUMA nodes.  Say we have a hotpluggable node 4 and spare hugetlb pages
      on node 0:
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node 0, which has enough
      preallocated pages to hold them.  With the current implementation,
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes the migration source:
      first try to find a node which has a page in its preallocated pool, and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool
      is consumed.
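
      A sketch of that fallback order (locking and surplus accounting
      omitted; nmask already excludes the migration source node):

        /* Prefer any allowed node's preallocated pool; only allocate a
         * fresh surplus page when every pool is empty. */
        for_each_node_mask(nid, nmask)
                if (!list_empty(&h->hugepage_freelists[nid]))
                        return dequeue_huge_page_node(h, nid);

        return __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE);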
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: warn the user when issues arise on boot due to hugepages · d715cf80
      Authored by Liam R. Howlett
      When the user specifies too many hugepages or an invalid
      default_hugepagesz, the communication to the user is implicit in the
      allocation message.  This patch adds a warning when the desired page
      count is not allocated, and prints an error when the
      default_hugepagesz is invalid on boot.
      
      During boot, hugepages are allocated until there is only a fraction of
      the hugepage size left.  That is, we allocate until either the request
      is satisfied or memory for the pages is exhausted.  When memory for the
      pages is exhausted, it will most likely lead to the system failing,
      with the OOM manager not finding enough (or anything) to kill (unless
      you're using really big hugepages on the order of 100s of MB or GBs).
      The user will most likely see the OOM messages much later in the boot
      sequence than the implicitly stated message.  Worse yet, you may even
      get an OOM for each processor, which produces many pages of OOM
      messages on modern systems.  Although these new messages will be
      printed earlier than the OOM messages, at least giving the user errors
      and warnings will highlight the configuration as an issue.  I'm trying
      to point the user in the right direction by providing a more robust
      statement of what is failing.
      
      During a sysctl or echo command, the user can check the results much
      more easily than if the system hangs during boot, and the scenario of
      having nothing to OOM-kill for kernel memory is highly unlikely.
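
      A sketch of the kind of boot-time warning described (the exact message
      wording is an assumption):

        /* Warn if boot-time preallocation fell short of the request. */
        if (i < h->max_huge_pages)
                pr_warn("HugeTLB: allocated %lu of %lu requested hugepages\n",
                        i, h->max_huge_pages);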
      
      Mike said:
       "Before sending out this patch, I asked Liam off list why he was doing
        it. Was it something he just thought would be useful? Or, was there
        some type of user situation/need. He said that he had been called in
        to assist on several occasions when a system OOMed during boot. In
        almost all of these situations, the user had grossly misconfigured
        huge pages.
      
        DB users want to pre-allocate just the right amount of huge pages, but
        sometimes they can be really off. In such situations, the huge page
        init code just allocates as many huge pages as it can and reports the
        number allocated. There is no indication that it quit allocating
        because it ran out of memory. Of course, a user could compare the
        number in the message to what they requested on the command line to
        determine if they got all the huge pages they requested. The thought
        was that it would be useful to at least flag this situation. That way,
        the user might be able to better relate the huge page allocation
        failure to the OOM.
      
        I'm not sure if the e-mail discussion made it obvious that this is
        something he has seen on several occasions.
      
        I see Michal's point that this will only flag the situation where
        someone configures huge pages very badly. And, a more extensive look
        at the situation of misconfiguring huge pages might be in order. But,
        this has happened on several occasions which led to the creation of
        this patch"
      
      [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
      Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ddd40d8a
    • mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Authored by Anshuman Khandual
      Currently a hugepage migrated by soft-offline (i.e. due to correctable
      memory errors) is kept as a hugepage, which means many non-error pages
      in it are unreusable, i.e. wasted.
      
      This patch solves this issue by dissolving source hugepages into the
      buddy allocator.  As done in the previous patch, PageHWPoison is set
      only on the head page of the error hugepage.  Then, in dissolving, we
      move the PageHWPoison flag to the raw error page so that all healthy
      subpages return to buddy.
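
      A minimal sketch of that post-migration step (flag handling simplified;
      raw_error_page is an assumed name for the single bad 4kB page):

        /* Move the poison marker to the one bad subpage, then free the
         * rest of the hugepage back to the buddy allocator. */
        if (PageHuge(page)) {
                struct page *hpage = compound_head(page);

                SetPageHWPoison(raw_error_page);
                ClearPageHWPoison(hpage);
                dissolve_free_huge_page(hpage);
        }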
      
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: prevent reuse of hwpoisoned free hugepages · 243abd5b
      Authored by Naoya Horiguchi
      Patch series "mm: hwpoison: fixlet for hugetlb migration".

      This patchset updates the hwpoison/hugetlb code to address 2 reported
      issues.

      One is a madvise(MADV_HWPOISON) failure reported by Intel's lkp robot
      (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop).
      The first half was already fixed in mainline; the other half, covering
      the hugetlb cases, is solved in this series.

      The other issue is narrowing the error-affected region down to a single
      4kB page instead of a whole hugetlb page, which was attempted by
      Anshuman
      (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com),
      and I updated it to apply more widely.
      
      This patch (of 9):
      
      We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoisoned
      hugepages as we did before, so the current dequeue_huge_page_node()
      doesn't work as intended, because it still uses
      is_migrate_isolate_page() for this check.  This patch fixes it with the
      PageHWPoison flag.
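
      The fix amounts to changing the skip condition in the free-list walk;
      a sketch:

        /* Skip hwpoisoned pages when handing out from the free pool. */
        list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
                if (!PageHWPoison(page))
                        break;          /* found a usable page */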
      
      Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 07 Jul 2017 (9 commits)
  12. 03 Jun 2017 (1 commit)
  13. 01 Apr 2017 (2 commits)
    • mm/hugetlb.c: don't call region_abort if region_chg fails · ff8c0c53
      Authored by Mike Kravetz
      Changes to hugetlbfs reservation maps are a two-step process.  The
      first step is a call to region_chg to determine what needs to be
      changed and to prepare that change.  This should be followed by a call
      to region_add to commit the change, or region_abort to abort it.
      
      The error path in hugetlb_reserve_pages called region_abort after a
      failed call to region_chg.  As a result, the adds_in_progress counter
      in the reservation map is off by 1.  This is caught by a VM_BUG_ON in
      resv_map_release when the reservation map is freed.
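
      The correct pairing, sketched (from/to are the hugepage-index range;
      commit_the_change is illustrative):

        /* Two-step reservation: prepare with region_chg(), then either
         * commit or abort - but never abort a chg that itself failed. */
        chg = region_chg(resv_map, from, to);
        if (chg < 0)
                return chg;             /* nothing in progress: no abort */

        if (commit_the_change)
                region_add(resv_map, from, to);
        else
                region_abort(resv_map, from, to);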
      
      The syzkaller fuzzer (using an injected kmalloc failure) found this
      bug, which resulted in the following:
      
       kernel BUG at mm/hugetlb.c:742!
       Call Trace:
        hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
        evict+0x481/0x920 fs/inode.c:553
        iput_final fs/inode.c:1515 [inline]
        iput+0x62b/0xa20 fs/inode.c:1542
        hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
        newseg+0x422/0xd30 ipc/shm.c:575
        ipcget_new ipc/util.c:285 [inline]
        ipcget+0x21e/0x580 ipc/util.c:639
        SYSC_shmget ipc/shm.c:673 [inline]
        SyS_shmget+0x158/0x230 ipc/shm.c:657
        entry_SYSCALL_64_fastpath+0x1f/0xc2
       RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
      
      Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() · c9d398fa
      Authored by Naoya Horiguchi
      I found a race condition which triggers the following bug when
      move_pages() and soft offline are called on a single hugetlb page
      concurrently.
      
          Soft offlining page 0x119400 at 0x700000000000
          BUG: unable to handle kernel paging request at ffffea0011943820
          IP: follow_huge_pmd+0x143/0x190
          PGD 7ffd2067
          PUD 7ffd1067
          PMD 0
              [61163.582052] Oops: 0000 [#1] SMP
          Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
          CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P           OE   4.11.0-rc2-mm1+ #2
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:follow_huge_pmd+0x143/0x190
          RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
          RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
          RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
          RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
          R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
          R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
          FS:  00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
          Call Trace:
           follow_page_mask+0x270/0x550
           SYSC_move_pages+0x4ea/0x8f0
           SyS_move_pages+0xe/0x10
           do_syscall_64+0x67/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: 0033:0x7fc976e03949
          RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
          RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
          RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
          RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
          R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
          R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
          Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
          RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
          CR2: ffffea0011943820
          ---[ end trace e4f81353a2d23232 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
      
      This bug is triggered when pmd_present() returns true for a non-present
      hugetlb entry, so fixing the present check in follow_huge_pmd()
      prevents it.  Using pmd_present() to determine present/non-present for
      hugetlb is not correct, because pmd_present() checks multiple bits (not
      only _PAGE_PRESENT) for historical reasons, and it can misjudge hugetlb
      state.
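
      A sketch of the corrected check in follow_huge_pmd() (migration-entry
      handling abbreviated; the retry label is assumed):

        pte_t pte = huge_ptep_get((pte_t *)pmd);

        if (pte_present(pte)) {
                page = pmd_page(*pmd) +
                       ((address & ~PMD_MASK) >> PAGE_SHIFT);
        } else {
                if (is_hugetlb_entry_migration(pte)) {
                        spin_unlock(ptl);
                        __migration_entry_wait(mm, (pte_t *)pmd, ptl);
                        goto retry;     /* re-evaluate after migration */
                }
                page = NULL;
        }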
      
      Fixes: e66f17ff ("mm/hugetlb: take page table lock in follow_huge_pmd()")
      Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>        [4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 10 Mar 2017 (1 commit)
  15. 02 Mar 2017 (1 commit)
  16. 25 Feb 2017 (2 commits)
  17. 23 Feb 2017 (3 commits)