1. 29 9月, 2019 3 次提交
    • D
      mm, page_alloc: allow hugepage fallback to remote nodes when madvised · 76e654cc
      David Rientjes 提交于
      For systems configured to always try hard to allocate transparent
      hugepages (thp defrag setting of "always") or for memory that has been
      explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to
      remote memory to allocate the hugepage if the local allocation fails
      first.
      
      The point is to allow the initial call to __alloc_pages_node() to attempt
      to defragment local memory to make a hugepage available, if possible,
      rather than immediately fallback to remote memory.  Local hugepages will
      always have a better access latency than remote (huge)pages, so an attempt
      to make a hugepage available locally is always preferred.
      
      If memory compaction cannot be successful locally, however, it is likely
      better to fallback to remote memory.  This could take on two forms: either
      allow immediate fallback to remote memory or do per-zone watermark checks.
      It would be possible to fallback only when per-zone watermarks fail for
      order-0 memory, since that would require local reclaim for all subsequent
      faults so remote huge allocation is likely better than thrashing the local
      zone for large workloads.
      
      In this case, it is assumed that because the system is configured to try
      hard to allocate hugepages or the vma is advised to explicitly want to try
      hard for hugepages that remote allocation is better when local allocation
      and memory compaction have both failed.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76e654cc
    • D
      Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 19deb769
      David Rientjes 提交于
      This reverts commit 92717d42.
      
      Since commit a8282608 ("Revert "mm, thp: restore node-local hugepage
      allocations"") is reverted in this series, it is better to restore the
      previous 5.2 behavior between the thp allocation and the page allocator
      rather than to attempt any consolidation or cleanup for a policy that is
      now reverted.  It's less risky during an rc cycle and subsequent patches
      in this series further modify the same policy that the pre-5.3 behavior
      implements.
      
      Consolidation and cleanup can be done subsequent to a sane default page
      allocation strategy, so this patch reverts a cleanup done on a strategy
      that is now reverted and thus is the least risky option.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19deb769
    • D
      Revert "Revert "mm, thp: restore node-local hugepage allocations"" · ac79f78d
      David Rientjes 提交于
      This reverts commit a8282608.
      
      The commit references the original intended semantic for MADV_HUGEPAGE
      which has subsequently taken on three unique purposes:
      
       - enables or disables thp for a range of memory depending on the system's
         config (is thp "enabled" set to "always" or "madvise"),
      
       - determines the synchronous compaction behavior for thp allocations at
         fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
         and
      
       - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
         clear previous hugepage advice).
      
      These are the three purposes that currently exist in 5.2 and over the
      past several years that userspace has been written around.  Adding a
      NUMA locality preference adds a fourth dimension to an already conflated
      advice mode.
      
      Based on the semantic that MADV_HUGEPAGE has provided over the past
      several years, there exist workloads that use the tunable based on these
      principles: specifically that the allocation should attempt to
      defragment a local node before falling back.  It is agreed that remote
      hugepages typically (but not always) have a better access latency than
      remote native pages, although on Naples this is at parity for
      intersocket.
      
      The revert commit that this patch reverts allows hugepage allocation to
      immediately allocate remotely when local memory is fragmented.  This is
      contrary to the semantic of MADV_HUGEPAGE over the past several years:
      that is, memory compaction should be attempted locally before falling
      back.
      
      The performance degradation of remote hugepages over local hugepages on
      Rome, for example, is 53.5% increased access latency.  For this reason,
      the goal is to revert back to the 5.2 and previous behavior that would
      attempt local defragmentation before falling back.  With the patch that
      is reverted by this patch, we see performance degradations at the tail
      because the allocator happily allocates the remote hugepage rather than
      even attempting to make a local hugepage available.
      
      zone_reclaim_mode is not a solution to this problem since it does not
      only impact hugepage allocations but rather changes the memory
      allocation strategy for *all* page allocations.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac79f78d
  2. 14 8月, 2019 4 次提交
    • A
      Revert "mm, thp: restore node-local hugepage allocations" · a8282608
      Andrea Arcangeli 提交于
      This reverts commit 2f0799a0 ("mm, thp: restore node-local
      hugepage allocations").
      
      commit 2f0799a0 was rightfully applied to avoid the risk of a
      severe regression that was reported by the kernel test robot at the end
      of the merge window.  Now we understood the regression was a false
      positive and was caused by a significant increase in fairness during a
      swap trashing benchmark.  So it's safe to re-apply the fix and continue
      improving the code from there.  The benchmark that reported the
      regression is very useful, but it provides a meaningful result only when
      there is no significant alteration in fairness during the workload.  The
      removal of __GFP_THISNODE increased fairness.
      
      __GFP_THISNODE cannot be used in the generic page faults path for new
      memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
      behavior significantly deviates from what the MPOL_DEFAULT semantics are
      supposed to be for THP and 4k allocations alike.
      
      Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
      set to "madvise") has never meant to provide an implicit MPOL_BIND on
      the "current" node the task is running on, causing swap storms and
      providing a much more aggressive behavior than even zone_reclaim_node =
      3.
      
      Any workload who could have benefited from __GFP_THISNODE has now to
      enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
      the zone_reclaim_mode behavior, but it only did so if THP was enabled:
      if THP was disabled, there would have been no chance to get any 4k page
      from the current node if the current node was full of pagecache, which
      further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
      MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
      semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
      must work exactly the same with MADV_HUGEPAGE set or not.
      
      The performance characteristic of memory depends on the hardware
      details.  The numbers below are obtained on Naples/EPYC architecture and
      the N/A projection extends them to show what we should aim for in the
      future as a good THP NUMA locality default.  The benchmark used
      exercises random memory seeks (note: the cost of the page faults is not
      part of the measurement).
      
        D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
        0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A
      
      D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
      intra socket memory), D2 means distance two (i.e.  inter socket memory),
      etc...
      
      For the guest physical memory allocated by qemu and for guest mode
      kernel the performance characteristic of RAM is more complex and an
      ideal default could be:
      
        D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
        0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A
      
      NOTE: the N/A are projections and haven't been measured yet, the
      measurement in this case is done on a 1950x with only two NUMA nodes.
      The THP case here means THP was used both in the host and in the guest.
      
      After applying this commit the THP NUMA locality order that we'll get
      out of MADV_HUGEPAGE is this:
      
        D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Before this commit it was:
      
        D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
      
      Even if we ignore the breakage of large workloads that can't fit in a
      single node that the __GFP_THISNODE implicit "current node" mbind
      caused, the THP NUMA locality order provided by __GFP_THISNODE was still
      not the one we shall aim for in the long term (i.e.  the first one at
      the top).
      
      After this commit is applied, we can introduce a new allocator multi
      order API and to replace those two alloc_pages_vmas calls in the page
      fault path, with a single multi order call:
      
              unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
              page = alloc_pages_multi_order(..., &order);
              if (!page)
              	goto out;
              if (!(order & (1 << 0))) {
              	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
              	/* THP fault */
              } else {
              	VM_WARN_ON(order != 1 << 0);
              	/* 4k fallback */
              }
      
      The page allocator logic has to be altered so that when it fails on any
      zone with order 9, it has to try again with a order 0 before falling
      back to the next zone in the zonelist.
      
      After that we need to do more measurements and evaluate if adding an
      opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
      with "DN+1 THP | DN 4k" at every NUMA distance crossing.
      
      Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8282608
    • A
      Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"" · 92717d42
      Andrea Arcangeli 提交于
      Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
      
      The fixes for what was originally reported as "pathological THP
      behavior" we rightfully reverted to be sure not to introduced
      regressions at end of a merge window after a severe regression report
      from the kernel bot.  We can safely re-apply them now that we had time
      to analyze the problem.
      
      The mm process worked fine, because the good fixes were eventually
      committed upstream without excessive delay.
      
      The regression reported by the kernel bot however forced us to revert
      the good fixes to be sure not to introduce regressions and to give us
      the time to analyze the issue further.  The silver lining is that this
      extra time allowed to think more at this issue and also plan for a
      future direction to improve things further in terms of THP NUMA
      locality.
      
      This patch (of 2):
      
      This reverts commit 356ff8a9 ("Revert "mm, thp: consolidate THP
      gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
      89c83fb5 ("mm, thp: consolidate THP gfp handling into
      alloc_hugepage_direct_gfpmask").
      
      Consolidation of the THP allocation flags at the same place was meant to
      be a clean up to easier handle otherwise scattered code which is
      imposing a maintenance burden.  There were no real problems observed
      with the gfp mask consolidation but the reversion was rushed through
      without a larger consensus regardless.
      
      This patch brings the consolidation back because this should make the
      long term maintainability easier as well as it should allow future
      changes to be less error prone.
      
      [mhocko@kernel.org: changelog additions]
      Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92717d42
    • Y
      mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind · a53190a4
      Yang Shi 提交于
      When running syzkaller internally, we ran into the below bug on 4.9.x
      kernel:
      
        kernel BUG at mm/huge_memory.c:2124!
        invalid opcode: 0000 [#1] SMP KASAN
        CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
        task: ffff880067b34900 task.stack: ffff880068998000
        RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
        Call Trace:
          split_huge_page include/linux/huge_mm.h:100 [inline]
          queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
          walk_pmd_range mm/pagewalk.c:50 [inline]
          walk_pud_range mm/pagewalk.c:90 [inline]
          walk_pgd_range mm/pagewalk.c:116 [inline]
          __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
          walk_page_range+0x154/0x370 mm/pagewalk.c:285
          queue_pages_range+0x115/0x150 mm/mempolicy.c:694
          do_mbind mm/mempolicy.c:1241 [inline]
          SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
          SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
          do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
          entry_SYSCALL_64_after_swapgs+0x5d/0xdb
        Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
        RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
         RSP <ffff88006899f980>
      
      with the below test:
      
        uint64_t r[1] = {0xffffffffffffffff};
      
        int main(void)
        {
              syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
                                      intptr_t res = 0;
              res = syscall(__NR_socket, 0x11, 3, 0x300);
              if (res != -1)
                      r[0] = res;
              *(uint32_t*)0x20000040 = 0x10000;
              *(uint32_t*)0x20000044 = 1;
              *(uint32_t*)0x20000048 = 0xc520;
              *(uint32_t*)0x2000004c = 1;
              syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
              syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
              *(uint64_t*)0x20000340 = 2;
              syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
              return 0;
        }
      
      Actually the test does:
      
        mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
        socket(AF_PACKET, SOCK_RAW, 768)        = 3
        setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
        mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
        mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
      
      The setsockopt() would allocate compound pages (16 pages in this test)
      for packet tx ring, then the mmap() would call packet_mmap() to map the
      pages into the user address space specified by the mmap() call.
      
      When calling mbind(), it would scan the vma to queue the pages for
      migration to the new node.  It would split any huge page since 4.9
      doesn't support THP migration, however, the packet tx ring compound
      pages are not THP and even not movable.  So, the above bug is triggered.
      
      However, the later kernel is not hit by this issue due to commit
      d44d363f ("mm: don't assume anonymous pages have SwapBacked flag"),
      which just removes the PageSwapBacked check for a different reason.
      
      But, there is a deeper issue.  According to the semantic of mbind(), it
      should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
      MPOL_MF_STRICT was also specified, but the kernel was unable to move all
      existing pages in the range.  The tx ring of the packet socket is
      definitely not movable, however, mbind() returns success for this case.
      
      Although the most socket file associates with non-movable pages, but XDP
      may have movable pages from gup.  So, it sounds not fine to just check
      the underlying file type of vma in vma_migratable().
      
      Change migrate_page_add() to check if the page is movable or not, if it
      is unmovable, just return -EIO.  But do not abort pte walk immediately,
      since there may be pages off LRU temporarily.  We should migrate other
      pages if MPOL_MF_MOVE* is specified.  Set has_unmovable flag if some
      paged could not be not moved, then return -EIO for mbind() eventually.
      
      With this change the above test would return -EIO as expected.
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a53190a4
    • Y
      mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified · d8835445
      Yang Shi 提交于
      When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
      try best to migrate misplaced pages, if some of the pages could not be
      migrated, then return -EIO.
      
      There are three different sub-cases:
       1. vma is not migratable
       2. vma is migratable, but there are unmovable pages
       3. vma is migratable, pages are movable, but migrate_pages() fails
      
      If #1 happens, kernel would just abort immediately, then return -EIO,
      after a7f40cfe ("mm: mempolicy: make mbind() return -EIO when
      MPOL_MF_STRICT is specified").
      
      If #3 happens, kernel would set policy and migrate pages with
      best-effort, but won't rollback the migrated pages and reset the policy
      back.
      
      Before that commit, they behaves in the same way.  It'd better to keep
      their behavior consistent.  But, rolling back the migrated pages and
      resetting the policy back sounds not feasible, so just make #1 behave as
      same as #3.
      
      Userspace will know that not everything was successfully migrated (via
      -EIO), and can take whatever steps it deems necessary - attempt
      rollback, determine which exact page(s) are violating the policy, etc.
      
      Make queue_pages_range() return 1 to indicate there are unmovable pages
      or vma is not migratable.
      
      The #2 is not handled correctly in the current kernel, the following
      patch will fix it.
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8835445
  3. 03 7月, 2019 1 次提交
  4. 29 6月, 2019 1 次提交
    • Z
      mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask · 29b190fa
      zhong jiang 提交于
      mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
      mempoclicies when the tasks's cpuset's mems_allowed changes.  For
      policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
      it works by remapping the policy's allowed nodes (stored in v.nodes)
      using the previous value of mems_allowed (stored in
      w.cpuset_mems_allowed) as the domain of map and the new mems_allowed
      (passed as nodes) as the range of the map (see the comment of
      bitmap_remap() for details).
      
      The result of remapping is stored back as policy's nodemask in v.nodes,
      and the new value of mems_allowed should be stored in
      w.cpuset_mems_allowed to facilitate the next rebind, if it happens.
      
      However, 213980c0 ("mm, mempolicy: simplify rebinding mempolicies
      when updating cpusets") introduced a bug where the result of remapping
      is stored in w.cpuset_mems_allowed instead.  Thus, a mempolicy's
      allowed nodes can evolve in an unexpected way after a series of
      rebinding due to cpuset mems_allowed changes, possibly binding to a
      wrong node or a smaller number of nodes which may e.g.  overload them.
      This patch fixes the bug so rebinding again works as intended.
      
      [vbabka@suse.cz: new changlog]
        Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz
      Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com
      Fixes: 213980c0 ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
      Signed-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29b190fa
  5. 31 5月, 2019 1 次提交
  6. 30 3月, 2019 1 次提交
  7. 06 3月, 2019 2 次提交
    • V
      mm, mempolicy: fix uninit memory access · 2e25644e
      Vlastimil Babka 提交于
      Syzbot with KMSAN reports (excerpt):
      
      ==================================================================
      BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
      BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
      CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x173/0x1d0 lib/dump_stack.c:113
        kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
        __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
        mpol_rebind_policy mm/mempolicy.c:353 [inline]
        mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
        update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
        update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
        update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
        cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728
      
      ...
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
        kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
        kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
        kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
        mpol_new mm/mempolicy.c:276 [inline]
        do_mbind mm/mempolicy.c:1180 [inline]
        kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
        __do_sys_mbind mm/mempolicy.c:1354 [inline]
      
      As it's difficult to report where exactly the uninit value resides in
      the mempolicy object, we have to guess a bit.  mm/mempolicy.c:353
      contains this part of mpol_rebind_policy():
      
              if (!mpol_store_user_nodemask(pol) &&
                  nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
      
      "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
      ever see being uninitialized after leaving mpol_new().  So I'll guess
      it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
      but still part of statement starting on line 353.
      
      For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
      reachable for a mempolicy where mpol_set_nodemask() is called in
      do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
      with empty set of nodes, i.e.  MPOL_LOCAL equivalent, with MPOL_F_LOCAL
      flag.  Let's exclude such policies from the nodes_equal() check.  Note
      the uninit access should be benign anyway, as rebinding this kind of
      policy is always a no-op.  Therefore no actual need for stable
      inclusion.
      
      Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
      Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e25644e
    • A
      mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual 提交于
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where invalid node number is
      encoded as -1.  Even though implicitly understood it is always better to
      have macros in there.  Replace these open encodings for an invalid node
      number with the global macro NUMA_NO_NODE.  This helps remove NUMA
      related assumptions like 'invalid node' from various places redirecting
      them to a common definition.
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
  8. 22 2月, 2019 1 次提交
  9. 09 12月, 2018 1 次提交
    • D
      Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" · 356ff8a9
      David Rientjes 提交于
      This reverts commit 89c83fb5.
      
      This should have been done as part of 2f0799a0 ("mm, thp: restore
      node-local hugepage allocations").  The movement of the thp allocation
      policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
      intended to only set __GFP_THISNODE for mempolicies that are not
      MPOL_BIND whereas the revert could set this regardless of mempolicy.
      
      While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
      and alloc_pages_vma() was racy, that has since been removed since the
      revert.  What is left is the possibility to use __GFP_THISNODE in
      policy_node() when it is unexpected because the special handling for
      hugepages in alloc_pages_vma()  was removed as part of the consolidation.
      
      Secondly, prior to 89c83fb5, alloc_pages_vma() implemented a somewhat
      different policy for hugepage allocations, which were allocated through
      alloc_hugepage_vma().  For hugepage allocations, if the allocating
      process's node is in the set of allowed nodes, allocate with
      __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
      __GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
      allow fallback to other nodes in 89c83fb5 as it did for new_page() in
      mm/mempolicy.c which is functionally different behavior and removes the
      requirement to only allocate hugepages locally.
      
      So this commit does a full revert of 89c83fb5 instead of the partial
      revert that was done in 2f0799a0.  The result is the same thp
      allocation policy for 4.20 that was in 4.19.
      
      Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
      Fixes: 2f0799a0 ("mm, thp: restore node-local hugepage allocations")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      356ff8a9
  10. 06 12月, 2018 1 次提交
    • D
      mm, thp: restore node-local hugepage allocations · 2f0799a0
      David Rientjes 提交于
      This is a full revert of ac5b2c18 ("mm: thp: relax __GFP_THISNODE for
      MADV_HUGEPAGE mappings") and a partial revert of 89c83fb5 ("mm, thp:
      consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
      
      By not setting __GFP_THISNODE, applications can allocate remote hugepages
      when the local node is fragmented or low on memory when either the thp
      defrag setting is "always" or the vma has been madvised with
      MADV_HUGEPAGE.
      
      Remote access to hugepages often has much higher latency than local pages
      of the native page size.  On Haswell, ac5b2c18 was shown to have a
      13.9% access regression after this commit for binaries that remap their
      text segment to be backed by transparent hugepages.
      
      The intent of ac5b2c18 is to address an issue where a local node is
      low on memory or fragmented such that a hugepage cannot be allocated.  In
      every scenario where this was described as a fix, there is abundant and
      unfragmented remote memory available to allocate from, even with a greater
      access latency.
      
      If remote memory is also low or fragmented, not setting __GFP_THISNODE was
      also measured on Haswell to have a 40% regression in allocation latency.
      
      Restore __GFP_THISNODE for thp allocations.
      
      Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
      Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f0799a0
  11. 04 11月, 2018 2 次提交
    • M
      mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask · 89c83fb5
      Michal Hocko 提交于
      THP allocation mode is quite complex and it depends on the defrag mode.
      This complexity is hidden in alloc_hugepage_direct_gfpmask from a large
      part currently. The NUMA special casing (namely __GFP_THISNODE) is
      however independent and placed in alloc_pages_vma currently. This both
      adds an unnecessary branch to all vma based page allocation requests and
      it makes the code more complex unnecessarily as well. Not to mention
      that e.g. shmem THP used to do the node reclaiming unconditionally
      regardless of the defrag mode until recently. This was not only
      unexpected behavior but it was also hardly a good default behavior and I
      strongly suspect it was just a side effect of the code sharing more than
      a deliberate decision which suggests that such a layering is wrong.
      
      Get rid of the thp special casing from alloc_pages_vma and move the
      logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
      resulting gfp mask only when the direct reclaim is not requested and
      when there is no explicit numa binding to preserve the current logic.
      
      Please note that there's also a slight difference wrt MPOL_BIND now. The
      previous code would avoid using __GFP_THISNODE if the local node was
      outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
      for all MPOL_BIND policies. So there's a difference that if local node
      is actually allowed by the bind policy's nodemask, previously
      __GFP_THISNODE would be added, but now it won't be. From the behavior
      POV this is still correct because the policy nodemask is used.
      
      Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89c83fb5
    • A
      mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings · ac5b2c18
      Andrea Arcangeli 提交于
      THP allocation might be really disruptive when allocated on NUMA system
      with the local node full or hard to reclaim.  Stefan has posted an
      allocation stall report on 4.12 based SLES kernel which suggests the
      same issue:
      
        kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
        kvm cpuset=/ mems_allowed=0-1
        CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
        Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
        Call Trace:
         dump_stack+0x5c/0x84
         warn_alloc+0xe0/0x180
         __alloc_pages_slowpath+0x820/0xc90
         __alloc_pages_nodemask+0x1cc/0x210
         alloc_pages_vma+0x1e5/0x280
         do_huge_pmd_wp_page+0x83f/0xf00
         __handle_mm_fault+0x93d/0x1060
         handle_mm_fault+0xc6/0x1b0
         __do_page_fault+0x230/0x430
         do_page_fault+0x2a/0x70
         page_fault+0x7b/0x80
         [...]
        Mem-Info:
        active_anon:126315487 inactive_anon:1612476 isolated_anon:5
         active_file:60183 inactive_file:245285 isolated_file:0
         unevictable:15657 dirty:286 writeback:1 unstable:0
         slab_reclaimable:75543 slab_unreclaimable:2509111
         mapped:81814 shmem:31764 pagetables:370616 bounce:0
         free:32294031 free_pcp:6233 free_cma:0
        Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      
      The defrag mode is "madvise" and from the above report it is clear that
      the THP has been allocated for MADV_HUGEPAGA vma.
      
      Andrea has identified that the main source of the problem is
      __GFP_THISNODE usage:
      
      : The problem is that direct compaction combined with the NUMA
      : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
      : hard the local node, instead of failing the allocation if there's no
      : THP available in the local node.
      :
      : Such logic was ok until __GFP_THISNODE was added to the THP allocation
      : path even with MPOL_DEFAULT.
      :
      : The idea behind the __GFP_THISNODE addition, is that it is better to
      : provide local memory in PAGE_SIZE units than to use remote NUMA THP
      : backed memory. That largely depends on the remote latency though, on
      : threadrippers for example the overhead is relatively low in my
      : experience.
      :
      : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
      : extremely slow qemu startup with vfio, if the VM is larger than the
      : size of one host NUMA node. This is because it will try very hard to
      : unsuccessfully swapout get_user_pages pinned pages as result of the
      : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
      : allocations and instead of trying to allocate THP on other nodes (it
      : would be even worse without vfio type1 GUP pins of course, except it'd
      : be swapping heavily instead).
      
      Fix this by removing __GFP_THISNODE for THP requests which are
      requesting the direct reclaim.  This effectivelly reverts 5265047a
      on the grounds that the zone/node reclaim was known to be disruptive due
      to premature reclaim when there was memory free.  While it made sense at
      the time for HPC workloads without NUMA awareness on rare machines, it
      was ultimately harmful in the majority of cases.  The existing behaviour
      is similar, if not as widespare as it applies to a corner case but
      crucially, it cannot be tuned around like zone_reclaim_mode can.  The
      default behaviour should always be to cause the least harm for the
      common case.
      
      If there are specialised use cases out there that want zone_reclaim_mode
      in specific cases, then it can be built on top.  Longterm we should
      consider a memory policy which allows for the node reclaim like behavior
      for the specific memory ranges which would allow a
      
      [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
      
      Mel said:
      
      : Both patches look correct to me but I'm responding to this one because
      : it's the fix.  The change makes sense and moves further away from the
      : severe stalling behaviour we used to see with both THP and zone reclaim
      : mode.
      :
      : I put together a basic experiment with usemem configured to reference a
      : buffer multiple times that is 80% the size of main memory on a 2-socket
      : box with symmetric node sizes and defrag set to "always".  The defrag
      : setting is not the default but it would be functionally similar to
      : accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
      : reference the buffer multiple times and while it's not an interesting
      : workload, it would be expected to complete reasonably quickly as it fits
      : within memory.  The results were;
      :
      : usemem
      :                                   vanilla           noreclaim-v1
      : Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
      : Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
      : Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
      :
      : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
      : threads referencing buffers 80% the size of memory.  With the patches
      : applied, it's 37.18% faster for the single thread and 73% faster with two
      : threads.  Note that 4 threads showing little difference does not indicate
      : the problem is related to thread counts.  It's simply the case that 4
      : threads gets spread so their workload mostly fits in one node.
      :
      : The overall view from /proc/vmstats is more startling
      :
      :                          4.19.0-rc1  4.19.0-rc1
      :                             vanillanoreclaim-v1r1
      : Minor Faults               35593425      708164
      : Major Faults                 484088          36
      : Swap Ins                    3772837           0
      : Swap Outs                   3932295           0
      :
      : Massive amounts of swap in/out without the patch
      :
      : Direct pages scanned        6013214           0
      : Kswapd pages scanned              0           0
      : Kswapd pages reclaimed            0           0
      : Direct pages reclaimed      4033009           0
      :
      : Lots of reclaim activity without the patch
      :
      : Kswapd efficiency              100%        100%
      : Kswapd velocity               0.000       0.000
      : Direct efficiency               67%        100%
      : Direct velocity           11191.956       0.000
      :
      : Mostly from direct reclaim context as you'd expect without the patch.
      :
      : Page writes by reclaim  3932314.000       0.000
      : Page writes file                 19           0
      : Page writes anon            3932295           0
      : Page reclaim immediate        42336           0
      :
      : Writes from reclaim context is never good but the patch eliminates it.
      :
      : We should never have default behaviour to thrash the system for such a
      : basic workload.  If zone reclaim mode behaviour is ever desired but on a
      : single task instead of a global basis then the sensible option is to build
      : a mempolicy that enforces that behaviour.
      
      This was a severe regression compared to previous kernels that made
      important workloads unusable and it starts when __GFP_THISNODE was
      added to THP allocations under MADV_HUGEPAGE.  It is not a significant
      risk to go to the previous behavior before __GFP_THISNODE was added, it
      worked like that for years.
      
      This was simply an optimization to some lucky workloads that can fit in
      a single node, but it ended up breaking the VM for others that can't
      possibly fit in a single node, so going back is safe.
      
      [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
      Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
      Fixes: 5265047a ("mm, thp: really limit transparent hugepage allocation to local node")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Debugged-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NMel Gorman <mgorman@techsingularity.net>
      Tested-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac5b2c18
  12. 27 10月, 2018 2 次提交
    • Z
      mm/mempolicy.c: use match_string() helper to simplify the code · dedf2c73
      zhong jiang 提交于
      match_string() returns the index of an array for a matching string, which
      can be used intead of open coded implementation.
      
      Link: http://lkml.kernel.org/r/1536988365-50310-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dedf2c73
    • A
      userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 3b9aadf7
      Andrea Arcangeli 提交于
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
      not be waiting for userfaults before failing and it would hit on a SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead will allow
      get_mempolicy to allow userfaults to resolve the fault and fill the hole,
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
      
      Background:
      
      Via code review, previously the syscall would have returned -EFAULT
      (vm_fault_to_errno), now it will block and wait for an userfault (if
      it's waken before the fault is resolved it'll still -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this visible change is that becoming "userfault compliant"
      cannot regress anything: all other syscalls including read(2)/write(2)
      had to become "userfault compliant" long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).
      
      So this is just one more syscall that become "userfault compliant" like
      all other major ones already were.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are more closer to a
      kernel bug for the non-cooperative usages that Mike focuses on, than
      for for dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The part of the patch that caused me to think is only the
      implementation issue of mpol_get, but it looks like it should work safe
      no matter the kind of mempolicy structure that is (the default static
      policy also starts at 1 so it'll go to 2 and back to 1 without crashing
      everything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b9aadf7
  13. 23 8月, 2018 2 次提交
  14. 27 7月, 2018 1 次提交
  15. 12 4月, 2018 3 次提交
    • M
      mm: unclutter THP migration · 94723aaf
      Michal Hocko 提交于
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by spliting the THP into small pages while moving the head
      page to the newly allocated order-0 page.  Remaning pages are moved to
      the LRU list by split_huge_page.  The same happens if the THP allocation
      fails.  This is really ugly and error prone [1].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      This patch tries to unclutter the situation by moving the special THP
      handling up to the migrate_pages layer where it actually belongs.  We
      simply split the THP page into the existing list if unmap_and_move fails
      with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
      specific migrate_pages users do not have to deal with this case in a
      special way.
      
      [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94723aaf
    • M
      mm, migrate: remove reason argument from new_page_t · 666feb21
      Michal Hocko 提交于
      No allocation callback is using this argument anymore.  new_page_node
      used to use this parameter to convey node_id resp.  migration error up
      to move_pages code (do_move_page_to_node_array).  The error status never
      made it into the final status field and we have a better way to
      communicate node id to the status field now.  All other allocation
      callbacks simply ignored the argument so we can drop it finally.
      
      [mhocko@suse.com: fix migration callback]
        Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
      [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
      [mhocko@kernel.org: fix build]
        Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      666feb21
    • M
      mm, numa: rework do_pages_move · a49bd4d7
      Michal Hocko 提交于
      Patch series "unclutter thp migration"
      
      Motivation:
      
      THP migration is hacked into the generic migration with rather
      surprising semantic.  The migration allocation callback is supposed to
      check whether the THP can be migrated at once and if that is not the
      case then it allocates a simple page to migrate.  unmap_and_move then
      fixes that up by splitting the THP into small pages while moving the
      head page to the newly allocated order-0 page.  Remaining pages are
      moved to the LRU list by split_huge_page.  The same happens if the THP
      allocation fails.  This is really ugly and error prone [2].
      
      I also believe that split_huge_page to the LRU lists is inherently wrong
      because all tail pages are not migrated.  Some callers will just work
      around that by retrying (e.g.  memory hotplug).  There are other pfn
      walkers which are simply broken though.  e.g. madvise_inject_error will
      migrate head and then advances next pfn by the huge page size.
      do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
      will simply split the THP before migration if the THP migration is not
      supported then falls back to single page migration but it doesn't handle
      tail pages if the THP migration path is not able to allocate a fresh THP
      so we end up with ENOMEM and fail the whole migration which is a
      questionable behavior.  Page compaction doesn't try to migrate large
      pages so it should be immune.
      
      The first patch reworks do_pages_move which relies on a very ugly
      calling semantic when the return status is pushed to the migration path
      via private pointer.  It uses pre allocated fixed size batching to
      achieve that.  We simply cannot do the same if a THP is to be split
      during the migration path which is done in the patch 3.  Patch 2 is
      follow up cleanup which removes the mentioned return status calling
      convention ugliness.
      
      On a side note:
      
      There are some semantic issues I have encountered on the way when
      working on patch 1 but I am not addressing them here.  E.g. trying to
      move THP tail pages will result in either success or EBUSY (the later
      one more likely once we isolate head from the LRU list).  Hugetlb
      reports EACCESS on tail pages.  Some errors are reported via status
      parameter but migration failures are not even though the original
      `reason' argument suggests there was an intention to do so.  From a
      quick look into git history this never worked.  I have tried to keep the
      semantic unchanged.
      
      Then there is a relatively minor thing that the page isolation might
      fail because of pages not being on the LRU - e.g. because they are
      sitting on the per-cpu LRU caches.  Easily fixable.
      
      This patch (of 3):
      
      do_pages_move is supposed to move user defined memory (an array of
      addresses) to the user defined numa nodes (an array of nodes one for
      each address).  The user provided status array then contains resulting
      numa node for each address or an error.  The semantic of this function
      is little bit confusing because only some errors are reported back.
      Notably migrate_pages error is only reported via the return value.  This
      patch doesn't try to address these semantic nuances but rather change
      the underlying implementation.
      
      Currently we are processing user input (which can be really large) in
      batches which are stored to a temporarily allocated page.  Each address
      is resolved to its struct page and stored to page_to_node structure
      along with the requested target numa node.  The array of these
      structures is then conveyed down the page migration path via private
      argument.  new_page_node then finds the corresponding structure and
      allocates the proper target page.
      
      What is the problem with the current implementation and why to change
      it? Apart from being quite ugly it also doesn't cope with unexpected
      pages showing up on the migration list inside migrate_pages path.  That
      doesn't happen currently but the follow up patch would like to make the
      thp migration code more clear and that would need to split a THP into
      the list for some cases.
      
      How does the new implementation work? Well, instead of batching into a
      fixed size array we simply batch all pages that should be migrated to
      the same node and isolate all of them into a linked list which doesn't
      require any additional storage.  This should work reasonably well
      because page migration usually migrates larger ranges of memory to a
      specific node.  So the common case should work equally well as the
      current implementation.  Even if somebody constructs an input where the
      target numa nodes would be interleaved we shouldn't see a large
      performance impact because page migration alone doesn't really benefit
      from batching.  mmap_sem batching for the lookup is quite questionable
      and isolate_lru_page which would benefit from batching is not using it
      even in the current implementation.
      
      Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a49bd4d7
  16. 03 4月, 2018 3 次提交
  17. 23 3月, 2018 1 次提交
  18. 01 2月, 2018 5 次提交
  19. 16 11月, 2017 2 次提交
  20. 14 10月, 2017 1 次提交
  21. 09 9月, 2017 2 次提交