1. 07 Sep 2017, 3 commits
  2. 11 Aug 2017, 1 commit
  3. 03 Aug 2017, 1 commit
    • mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors · 2be7cfed
      Daniel Jordan authored
      Commit 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when
      FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
      errors from follow_hugetlb_page.  After such error, __get_user_pages
      subsequently calls faultin_page on the same VMA and start address that
      follow_hugetlb_page failed on instead of returning the error immediately
      as it should.
      
      In follow_hugetlb_page, when hugetlb_fault returns a value covered under
      VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
      to 0 as __get_user_pages expects in this case, which causes the
      following to happen in __get_user_pages: the "while (nr_pages)" check
      succeeds, we skip the "if (!vma..." check because we got a VMA the last
      time around, we find no page with follow_page_mask, and we call
      faultin_page, which calls hugetlb_fault for the second time.
      
      This issue also slightly changes how __get_user_pages works.  Before, it
      only returned error if it had made no progress (i = 0).  But now,
      follow_hugetlb_page can clobber "i" with an error code since its new
      return path doesn't check for progress.  So if "i" is nonzero before a
      failing call to follow_hugetlb_page, that indication of progress is lost
      and __get_user_pages can return error even if some pages were
      successfully pinned.
      
      To fix this, change follow_hugetlb_page so that it updates nr_pages,
      allowing __get_user_pages to fail immediately and restoring the "error
      only if no progress" behavior to __get_user_pages.
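      As an illustration, a minimal sketch of the fixed error path (variable
      names and surrounding code are simplified assumptions, not the exact
      patch):

          /* inside follow_hugetlb_page()'s loop over the request */
          ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
          if (ret & VM_FAULT_ERROR) {
                  err = vm_fault_to_errno(ret, flags);
                  remainder = 0;  /* nothing left for the caller to fault in */
                  break;          /* fall through to the common exit below */
          }

          /* common exit */
          *nr_pages = remainder;  /* 0 makes __get_user_pages stop immediately */
          return i ? i : err;     /* keep "error only if no progress" */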
      
      Tested that __get_user_pages returns when expected on error from
      hugetlb_fault in follow_hugetlb_page.
      
      Fixes: 9a291a7c ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
      Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Acked-by: Punit Agrawal <punit.agrawal@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: <stable@vger.kernel.org>	[4.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2be7cfed
  4. 13 Jul 2017, 1 commit
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow a retry-but-eventually-fail semantic
      in the page allocator.  This has been true, but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
      ignored for smaller sizes.  This is unfortunate because there is no way
      to express the same semantic for those requests, and since they are
      considered too important to fail they might end up looping in the page
      allocator forever, similar to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or mistaken
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is no
      promise of success.  This works independently of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee-vs-cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete memory, and the next user might then hit
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of the memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because this already matches their semantics.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL
      if there is no progress and we have already passed the OOM point.

      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than continuing to retry in the
      page allocator.
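      As an illustration of the new semantic, a hypothetical caller that
      prefers a physically contiguous buffer but tolerates failure might look
      like this (the function and buffer names are made up for the example):

          #include <linux/slab.h>
          #include <linux/vmalloc.h>

          static void *alloc_big_buffer(size_t size)
          {
                  void *buf;

                  /* try hard, but accept failure instead of invoking the OOM killer */
                  buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
                  if (buf)
                          return buf;

                  /* reclaim made no progress; use a non-contiguous fallback */
                  return vmalloc(size);
          }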
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  5. 11 Jul 2017, 9 commits
    • hugetlb: add support for preferred node to alloc_huge_page_nodemask · 3e59fcb0
      Michal Hocko authored
      alloc_huge_page_nodemask tries to allocate from any NUMA node in the
      allowed node mask, starting from lower-numbered NUMA nodes.  This might
      lead to filling up those low NUMA nodes while others are not used.  We
      can reduce this risk by introducing the concept of a preferred node,
      similar to what we have in the regular page allocator.  We will start
      allocating from the preferred nid and then iterate over all allowed
      nodes in zonelist order until we have tried them all.
      
      This mimics the page allocator logic except that it operates on
      per-node mempools.  dequeue_huge_page_vma already does this, so distill
      the zonelist logic into a more generic dequeue_huge_page_nodemask and
      use it in alloc_huge_page_nodemask.

      This will also allow us to use a proper per-NUMA-distance fallback for
      alloc_huge_page_node, which can now use alloc_huge_page_nodemask, and
      we can get rid of the alloc_huge_page_node helper, which no longer has
      any users.
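      A sketch of the zonelist-based dequeue described above (helper names
      and details are simplified assumptions, not the exact patch):

          static struct page *dequeue_huge_page_nodemask(struct hstate *h,
                          gfp_t gfp_mask, int nid, nodemask_t *nmask)
          {
                  struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
                  struct zoneref *z;
                  struct zone *zone;

                  /* walk allowed nodes in zonelist (distance) order from nid */
                  for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                                  gfp_zone(gfp_mask), nmask) {
                          struct page *page;

                          page = dequeue_huge_page_node_exact(h, zone_to_nid(zone));
                          if (page)
                                  return page;
                  }
                  return NULL;
          }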
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e59fcb0
    • mm, hugetlb: unclutter hugetlb allocation layers · aaf14e40
      Michal Hocko authored
      Patch series "mm, hugetlb: allow proper node fallback dequeue".
      
      While working on a hugetlb migration issue addressed in a separate
      patchset [1], I noticed that hugetlb allocations from the preallocated
      pool are quite suboptimal.

       [1] http://lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
      
      There is no fallback mechanism implemented and no notion of a preferred
      node.  I tried to work around it, but Vlastimil was right to push back
      for a more robust solution.  It seems that such a solution is to reuse
      the zonelist approach we use for the page allocator.
      
      This series has 3 patches.  The first one tries to make hugetlb
      allocation layers more clear.  The second one implements the zonelist
      hugetlb pool allocation and introduces a preferred node semantic which
      is used by the migration callbacks.  The last patch is a clean up.
      
      This patch (of 3):
      
      Hugetlb allocation path for fresh huge pages is unnecessarily complex
      and it mixes different interfaces between layers.
      
      __alloc_buddy_huge_page is the central place to perform a new
      allocation.  It checks for the hugetlb overcommit and then relies on
      __hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
      all good except that __alloc_buddy_huge_page pushes vma and address
      down the callchain, so __hugetlb_alloc_buddy_huge_page has to deal with
      two different allocation modes - one for memory policy requests and the
      other for node-specific (or, to make it more obscure, node
      non-specific) requests.
      
      This just screams for a reorganization.
      
      This patch pulls all the vma-specific handling up to
      __alloc_buddy_huge_page_with_mpol, where it belongs.
      __alloc_buddy_huge_page will get a nodemask argument and
      __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
      page allocator.
      
      In short:
      __alloc_buddy_huge_page_with_mpol - memory policy handling
        __alloc_buddy_huge_page - overcommit handling and accounting
          __hugetlb_alloc_buddy_huge_page - page allocator layer
      
      Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry
      loop are not really needed because the page allocator already handles
      cpuset updates.
      
      Finally, __hugetlb_alloc_buddy_huge_page had a special case for
      node-specific allocations (when no policy is applied and a node is
      given).  This relied on __GFP_THISNODE not to fall back to a different
      node.  alloc_huge_page_node is the only caller which relies on this
      behavior, so move the __GFP_THISNODE there.
      
      Not only does this remove quite a bit of code, it should also make
      those layers easier to follow and clearer with respect to
      responsibilities.
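      A rough sketch of the resulting bottom layer (gfp handling simplified,
      exact arguments are assumptions):

          /* page allocator layer: a thin wrapper, nothing hugetlb-specific */
          static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
                          gfp_t gfp_mask, int nid, nodemask_t *nmask)
          {
                  int order = huge_page_order(h);

                  gfp_mask |= __GFP_COMP | __GFP_NOWARN;
                  return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
          }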
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aaf14e40
    • mm/hugetlb.c: replace memfmt with string_get_size · c6247f72
      Matthew Wilcox authored
      The hugetlb code has its own function to report human-readable sizes.
      Convert it to use the shared string_get_size() function.  This will lead
      to a minor difference in user visible output (MiB/GiB instead of MB/GB),
      but some would argue that's desirable anyway.
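      For reference, a sketch of how the shared helper is used (the message
      text is illustrative only):

          #include <linux/string_helpers.h>

          char buf[32];

          /* 2 MiB huge page size rendered in binary units, e.g. "2.00 MiB" */
          string_get_size(2UL << 20, 1, STRING_UNITS_2, buf, sizeof(buf));
          pr_info("HugeTLB registered %s page size\n", buf);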
      
      Link: http://lkml.kernel.org/r/20170606190350.GA20010@bombadil.infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6247f72
    • mm, hugetlb: schedule when potentially allocating many hugepages · 69ed779a
      David Rientjes authored
      A few hugetlb allocators loop while calling the page allocator and can
      potentially prevent rescheduling if the page allocator slowpath is not
      utilized.
      
      Conditionally schedule when large numbers of hugepages can be allocated.
      
      Anshuman:
       "Fixes a task which was getting hung while writing like 10000 hugepages
        (16MB on POWER8) into /proc/sys/vm/nr_hugepages."
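      A minimal sketch of the pattern (the loop body and helper names are
      simplified assumptions):

          /* allocation loop that can run very long for a huge nr_hugepages */
          for (i = 0; i < count; i++) {
                  if (!alloc_fresh_huge_page(h, nodes_allowed))
                          break;
                  cond_resched();  /* let other tasks run between allocations */
          }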
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706091535300.66176@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69ed779a
    • hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed evenly across all
      NUMA nodes.  Say we have a hotpluggable node 4 and spare hugetlb pages
      on node 0:
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes the migration source:
      first try to find a node which has a page in the preallocated pool, and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool
      is consumed.
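      In outline, the migration target selection becomes (the names below are
      illustrative, not the exact helpers or argument lists):

          /* prefer a page already sitting in the preallocated pool ... */
          if (h->free_huge_pages - h->resv_huge_pages > 0)
                  page = dequeue_huge_page_nodemask(h, gfp_mask,
                                                    preferred_nid, nmask);

          /* ... and only then allocate a fresh surplus page at runtime */
          if (!page)
                  page = __alloc_buddy_huge_page_no_mpol(h, preferred_nid, nmask);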
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4db9b2ef
    • mm/hugetlb.c: warn the user when issues arise on boot due to hugepages · d715cf80
      Liam R. Howlett authored
      When the user specifies too many hugepages or an invalid
      default_hugepagesz the communication to the user is implicit in the
      allocation message.  This patch adds a warning when the desired page
      count is not allocated and prints an error when the default_hugepagesz
      is invalid on boot.
      
      During boot, hugepages are allocated until there is only a fraction of
      the hugepage size left.  That is, we allocate until either the request
      is satisfied or memory for the pages is exhausted.  When memory is
      exhausted, this will most likely lead to the system failing with the
      OOM killer not finding enough (or anything) to kill (unless you're
      using really big hugepages on the order of 100s of MB or GBs).
      The user will most likely see the OOM messages much later in the boot
      sequence than the implicitly stated message.  Worse yet, you may even
      get an OOM for each processor which causes many pages of OOMs on modern
      systems.  Although these messages will be printed earlier than the OOM
      messages, at least giving the user errors and warnings will highlight
      the configuration as an issue.  I'm trying to point the user in the
      right direction by providing a more robust statement of what is failing.
      
      During a sysctl or echo command, the user can check the results much
      more easily than if the system hangs during boot, and at that point the
      scenario of the OOM killer having nothing to kill is highly unlikely.
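      A sketch of the kind of boot-time warning this patch adds (the format
      string and exact condition are simplified assumptions):

          /* after the boot-time allocation loop in hugetlb_hstate_alloc_pages() */
          if (i < h->max_huge_pages) {
                  pr_warn("HugeTLB: allocating %lu huge pages failed, only %lu allocated\n",
                          h->max_huge_pages, i);
                  h->max_huge_pages = i;  /* record what we actually got */
          }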
      
      Mike said:
       "Before sending out this patch, I asked Liam off list why he was doing
        it. Was it something he just thought would be useful? Or, was there
        some type of user situation/need. He said that he had been called in
        to assist on several occasions when a system OOMed during boot. In
        almost all of these situations, the user had grossly misconfigured
        huge pages.
      
        DB users want to pre-allocate just the right amount of huge pages, but
        sometimes they can be really off. In such situations, the huge page
        init code just allocates as many huge pages as it can and reports the
        number allocated. There is no indication that it quit allocating
        because it ran out of memory. Of course, a user could compare the
        number in the message to what they requested on the command line to
        determine if they got all the huge pages they requested. The thought
        was that it would be useful to at least flag this situation. That way,
        the user might be able to better relate the huge page allocation
        failure to the OOM.
      
        I'm not sure if the e-mail discussion made it obvious that this is
        something he has seen on several occasions.
      
        I see Michal's point that this will only flag the situation where
        someone configures huge pages very badly. And, a more extensive look
        at the situation of misconfiguring huge pages might be in order. But,
        this has happened on several occasions which led to the creation of
        this patch"
      
      [akpm@linux-foundation.org: reposition memfmt() to avoid forward declaration]
      Link: http://lkml.kernel.org/r/20170603005413.10380-1-Liam.Howlett@Oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d715cf80
    • ddd40d8a
    • mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Anshuman Khandual authored
      Currently a hugepage migrated by soft-offline (i.e. due to correctable
      memory errors) is kept intact as a hugepage, which means many non-error
      pages in it are unreusable, i.e. wasted.
      
      This patch solves this issue by dissolving the source hugepage into the
      buddy allocator.  As in the previous patch, PageHWPoison is set only on
      the head page of the error hugepage.  Then, when dissolving, we move
      the PageHWPoison flag to the raw error page so that all healthy
      subpages return to the buddy allocator.
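      The flag movement can be sketched as follows (simplified; the locking
      and accounting around dissolve_free_huge_page() are omitted):

          if (PageHWPoison(head) && page != head) {
                  SetPageHWPoison(page);   /* poison only the raw error page */
                  ClearPageHWPoison(head); /* head and healthy tails go back to buddy */
          }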
      
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3114a84
    • mm: hugetlb: prevent reuse of hwpoisoned free hugepages · 243abd5b
      Naoya Horiguchi authored
      Patch series "mm: hwpoison: fixlet for hugetlb migration".
      
      This patchset updates the hwpoison/hugetlb code to address 2 reported
      issues.
      
      One is a madvise(MADV_HWPOISON) failure reported by Intel's lkp robot
      (see http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop).
      The first half was already fixed in mainline, and the other half,
      covering the hugetlb cases, is solved in this series.
      
      The other issue is narrowing the error-affected region down to a single
      4kB page instead of a whole hugetlb page, which was attempted by
      Anshuman
      (http://lkml.kernel.org/r/20170420110627.12307-1-khandual@linux.vnet.ibm.com)
      and which I updated to apply more widely.
      
      This patch (of 9):
      
      We no longer use MIGRATE_ISOLATE to prevent reuse of hwpoisoned
      hugepages as we did before, so the current dequeue_huge_page_node()
      doesn't work as intended because it still uses
      is_migrate_isolate_page() for this check.  This patch fixes it by
      checking the PageHWPoison flag instead.
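      The fixed check can be sketched as follows (simplified from the dequeue
      path; names assumed):

          /* pick the first healthy page on the node's free list */
          list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
                  if (!PageHWPoison(page))
                          break;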
      
      Link: http://lkml.kernel.org/r/1496305019-5493-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      243abd5b
  6. 07 Jul 2017, 9 commits
  7. 03 Jun 2017, 1 commit
  8. 01 Apr 2017, 2 commits
    • mm/hugetlb.c: don't call region_abort if region_chg fails · ff8c0c53
      Mike Kravetz authored
      Changes to hugetlbfs reservation maps are a two-step process.  The
      first step is a call to region_chg to determine what needs to be
      changed and to prepare that change.  This should be followed by a call
      to region_add to commit the change, or region_abort to abort it.
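      The intended pairing, sketched (the error handling and the accounting
      call are simplified assumptions):

          chg = region_chg(resv_map, from, to);   /* step 1: prepare */
          if (chg < 0)
                  return -ENOMEM;                 /* nothing prepared, nothing to abort */

          if (hugetlb_acct_memory(h, gbl_reserve) < 0) {
                  region_abort(resv_map, from, to);  /* undo the prepared change */
                  return -ENOSPC;
          }

          region_add(resv_map, from, to);         /* step 2: commit */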
      
      The error path in hugetlb_reserve_pages called region_abort after a
      failed call to region_chg.  As a result, the adds_in_progress counter in
      the reservation map is off by 1.  This is caught by a VM_BUG_ON in
      resv_map_release when the reservation map is freed.
      
      The syzkaller fuzzer (using an injected kmalloc failure) found this
      bug, which resulted in the following:
      
       kernel BUG at mm/hugetlb.c:742!
       Call Trace:
        hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
        evict+0x481/0x920 fs/inode.c:553
        iput_final fs/inode.c:1515 [inline]
        iput+0x62b/0xa20 fs/inode.c:1542
        hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
        newseg+0x422/0xd30 ipc/shm.c:575
        ipcget_new ipc/util.c:285 [inline]
        ipcget+0x21e/0x580 ipc/util.c:639
        SYSC_shmget ipc/shm.c:673 [inline]
        SyS_shmget+0x158/0x230 ipc/shm.c:657
        entry_SYSCALL_64_fastpath+0x1f/0xc2
       RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
      
      Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff8c0c53
    • mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd() · c9d398fa
      Naoya Horiguchi authored
      I found a race condition which triggers the following bug when
      move_pages() and soft offline are called concurrently on a single
      hugetlb page.
      
          Soft offlining page 0x119400 at 0x700000000000
          BUG: unable to handle kernel paging request at ffffea0011943820
          IP: follow_huge_pmd+0x143/0x190
          PGD 7ffd2067
          PUD 7ffd1067
          PMD 0
              [61163.582052] Oops: 0000 [#1] SMP
          Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
          CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P           OE   4.11.0-rc2-mm1+ #2
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:follow_huge_pmd+0x143/0x190
          RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
          RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
          RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
          RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
          R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
          R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
          FS:  00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
          Call Trace:
           follow_page_mask+0x270/0x550
           SYSC_move_pages+0x4ea/0x8f0
           SyS_move_pages+0xe/0x10
           do_syscall_64+0x67/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: 0033:0x7fc976e03949
          RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
          RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
          RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
          RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
          R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
          R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
          Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
          RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
          CR2: ffffea0011943820
          ---[ end trace e4f81353a2d23232 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
      
      This bug is triggered when pmd_present() returns true for a non-present
      hugetlb entry, so fixing the present check in follow_huge_pmd()
      prevents it.  Using pmd_present() to determine present/non-present
      state for hugetlb is not correct, because pmd_present() checks multiple
      bits (not only _PAGE_PRESENT) for historical reasons and can misjudge
      the hugetlb state.
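      The fixed check can be sketched as follows (simplified; locking and the
      reference counting around the retry are omitted):

          pte_t pte = huge_ptep_get((pte_t *)pmd);  /* read the entry as a pte */

          if (pte_present(pte)) {
                  /* really mapped: safe to compute the target page */
                  page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
          } else if (is_hugetlb_entry_migration(pte)) {
                  /* migration entry: wait for it instead of dereferencing */
                  spin_unlock(ptl);
                  __migration_entry_wait(mm, (pte_t *)pmd, ptl);
                  goto retry;
          }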
      
      Fixes: e66f17ff ("mm/hugetlb: take page table lock in follow_huge_pmd()")
      Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>        [4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9d398fa
  9. 10 Mar 2017, 1 commit
  10. 02 Mar 2017, 1 commit
  11. 25 Feb 2017, 2 commits
  12. 23 Feb 2017, 5 commits
  13. 11 Jan 2017, 1 commit
  14. 13 Dec 2016, 3 commits