1. 07 7月, 2017 18 次提交
    • H
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying 提交于
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of the storage devices improved so fast that
      we cannot saturate the disk bandwidth with single logical CPU when do
      page swap out even on a high-end server machine.  Because the
      performance of the storage device improved faster than that of single
      logical CPU.  And it seems that the trend will not change in the near
      future.  On the other hand, the THP becomes more and more popular
      because of increased memory size.  So it becomes necessary to optimize
      THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help the memory fragmentation, especially when the THP is
         heavily used by the applications. The 2M continuous pages will be
         free up after THP swapping out.
      
       - It will improve the THP utilization on the system with the swap
         turned on. Because the speed for khugepaged to collapse the normal
         pages into the THP is quite slow. After the THP is split during the
         swapping out, it will take quite long time for the normal pages to
         collapse back into the THP after being swapped in. The high THP
         utilization helps the efficiency of the page based memory management
         too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge swap cluster size by 2
      times on x86_64.  Which may make it harder to find a free cluster when
      the swap space becomes fragmented.  So that, this may reduce the
      continuous swap space allocation and sequential write in theory.  The
      performance test in 0day shows no regressions caused by this.
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fair simple algorithm is used for swap
      cluster allocation, that is, only the first swap device in priority list
      will be tried to allocate the swap cluster.  The function will fail if
      the trying is not successful, and the caller will fallback to allocate a
      single swap slot instead.  This works good enough for normal cases.  If
      the difference of the number of the free swap clusters among multiple
      swap devices is significant, it is possible that some THPs are split
      earlier than necessary.  For example, this could be caused by big size
      difference among multiple swap devices.
      
      The swap cache functions is enhanced to support add/delete THP to/from
      the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with multi-order radix tree.  But because we will
      split the THP soon during swapping out, that optimization doesn't make
      much sense for this first step.
      
      The THP splitting functions are enhanced to support to split THP in swap
      cache during swapping out.  The page lock will be held during allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in the code path other than swapping out, if the THP need to be
      split, the PageSwapCache(THP) will be always false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6
    • A
      mm/vmstat.c: standardize file operations variable names · 9d85e15f
      Anshuman Khandual 提交于
      Standardize the file operation variable names related to all four memory
      management /proc interface files.  Also change all the symbol
      permissions (S_IRUGO) into octal permissions (0444) as it got complaints
      from checkpatch.pl.  This does not create any functional change to the
      interface.
      
      Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.comSigned-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d85e15f
    • A
      ksm: optimize refile of stable_node_dup at the head of the chain · 80b18dfa
      Andrea Arcangeli 提交于
      If a candidate stable_node_dup has been found and it can accept further
      merges it can be refiled to the head of the list to speedup next
      searches without altering which dup is found and how the dups accumulate
      in the chain.
      
      We already refiled it back to the head in the prune_stale_stable_nodes
      case, but we didn't refile it if not pruning (which is more common).
      And we also refiled it when it was already at the head which is
      unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
      more than one dup in the chain, it doesn't mean it's not already at the
      head of the chain).
      
      The stable_node_chain list is single threaded and there's no SMP locking
      contention so it should be faster to refile it to the head of the list
      also if prune_stale_stable_nodes is false.
      
      Profiling shows the refile happens 1.9% of the time when a dup is found
      with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
      the refile never happens of course as there's never space for one more
      merge) which is reasonably low.  At higher max_page_sharing values it
      should be much less frequent.
      
      This is just an optimization.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80b18dfa
    • A
      ksm: swap the two output parameters of chain/chain_prune · 8dc5ffcd
      Andrea Arcangeli 提交于
      Some static checker complains if chain/chain_prune returns a potentially
      stale pointer.
      
      There are two output parameters to chain/chain_prune, one is tree_page
      the other is stable_node_dup.  Like in get_ksm_page the caller has to
      check tree_page is NULL before touching the stable_node.  Similarly in
      chain/chain_prune the caller has to check tree_page before touching the
      stable_node_dup returned or the original stable_node passed as
      parameter.
      
      Because the tree_page is never returned as a stale pointer, it may be
      more intuitive to return tree_page and to pass stable_node_dup for
      reference instead of the reverse.
      
      This patch purely swaps the two output parameters of chain/chain_prune
      as a cleanup for the static checker and to mimic the get_ksm_page
      behavior more closely.  There's no change to the caller at all except
      the swap, it's purely a cleanup and it is a noop from the caller point
      of view.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Tested-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dc5ffcd
    • A
      ksm: cleanup stable_node chain collapse case · 0ba1d0f7
      Andrea Arcangeli 提交于
      Patch series "KSMscale cleanup/optimizations".
      
      There are no fixes here it's just minor cleanups and optimizations.
      
      1/3 removes makes the "fix" for the stale stable_node fall in the
          standard case without introducing new cases.  Setting stable_node to
          NULL was marginally safer, but stale pointer is still wiped from the
          caller, this looks cleaner.
      
      2/3 should fix the false positive from Dan's static checker.
      
      3/3 is a microoptimization to apply the the refile of future merge
          candidate dups at the head of the chain in all cases and to skip it in
          one case where we did it and but it was a noop (to avoid checking if
          it was already at the head but now we've to check it anyway so it got
          optimized away).
      
      This patch (of 3):
      
      When the stable_node chain is collapsed we can as well set the caller
      stable_node to match the returned stable_node_dup in chain_prune().
      
      This way the collapse case becomes indistinguishable from the regular
      stable_node case and we can remove two branches from the KSM page
      migration handling slow paths.
      
      While it was all correct this looks cleaner (and faster) as the caller has
      to deal with fewer special cases.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ba1d0f7
    • A
      ksm: fix use after free with merge_across_nodes = 0 · b4fecc67
      Andrea Arcangeli 提交于
      If merge_across_nodes was manually set to 0 (not the default value) by
      the admin or a tuned profile on NUMA systems triggering cross-NODE page
      migrations, a stable_node use after free could materialize.
      
      If the chain is collapsed stable_node would point to the old chain that
      was already freed.  stable_node_dup would be the stable_node dup now
      converted to a regular stable_node and indexed in the rbtree in
      replacement of the freed stable_node chain (not anymore a dup).
      
      This special case where the chain is collapsed in the NUMA replacement
      path, is now detected by setting stable_node to NULL by the chain_prune
      callee if it decides to collapse the chain.  This tells the NUMA
      replacement code that even if stable_node and stable_node_dup are
      different, this is not a chain if stable_node is NULL, as the
      stable_node_dup was converted to a regular stable_node and the chain was
      collapsed.
      
      It is generally safer for the callee to force the caller stable_node to
      NULL the moment it become stale so any other mistake like this would
      result in an instant Oops easier to debug than an use after free.
      
      Otherwise the replace logic would act like if stable_node was a valid
      chain, when in fact it was freed.  Notably
      stable_node_chain_add_dup(page_node, stable_node) would run on a stable
      stable_node.
      
      Andrey Ryabinin found the source of the use after free in chain_prune().
      
      Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: NEvgheni Dereveanchin <ederevea@redhat.com>
      Tested-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4fecc67
    • A
      ksm: introduce ksm_max_page_sharing per page deduplication limit · 2c653d0e
      Andrea Arcangeli 提交于
      Without a max deduplication limit for each KSM page, the list of the
      rmap_items associated to each stable_node can grow infinitely large.
      
      During the rmap walk each entry can take up to ~10usec to process
      because of IPIs for the TLB flushing (both for the primary MMU and the
      secondary MMUs with the MMU notifier).  With only 16GB of address space
      shared in the same KSM page, that would amount to dozens of seconds of
      kernel runtime.
      
      A ~256 max deduplication factor will reduce the latencies of the rmap
      walks on KSM pages to order of a few msec.  Just doing the
      cond_resched() during the rmap walks is not enough, the list size must
      have a limit too, otherwise the caller could get blocked in (schedule
      friendly) kernel computations for seconds, unexpectedly.
      
      There's room for optimization to significantly reduce the IPI delivery
      cost during the page_referenced(), but at least for page_migration in
      the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
      it may be inevitable to send lots of IPIs if each rmap_item->mm is
      active on a different CPU and there are lots of CPUs.  Even if we ignore
      the IPI delivery cost, we've still to walk the whole KSM rmap list, so
      we can't allow millions or billions (ulimited) number of entries in the
      KSM stable_node rmap_item lists.
      
      The limit is enforced efficiently by adding a second dimension to the
      stable rbtree.  So there are three types of stable_nodes: the regular
      ones (identical as before, living in the first flat dimension of the
      stable rbtree), the "chains" and the "dups".
      
      Every "chain" and all "dups" linked into a "chain" enforce the invariant
      that they represent the same write protected memory content, even if
      each "dup" will be pointed by a different KSM page copy of that content.
      This way the stable rbtree lookup computational complexity is unaffected
      if compared to an unlimited max_sharing_limit.  It is still enforced
      that there cannot be KSM page content duplicates in the stable rbtree
      itself.
      
      Adding the second dimension to the stable rbtree only after the
      max_page_sharing limit hits, provides for a zero memory footprint
      increase on 64bit archs.  The memory overhead of the per-KSM page
      stable_tree and per virtual mapping rmap_item is unchanged.  Only after
      the max_page_sharing limit hits, we need to allocate a stable_tree
      "chain" and rb_replace() the "regular" stable_node with the newly
      allocated stable_node "chain".  After that we simply add the "regular"
      stable_node to the chain as a stable_node "dup" by linking hlist_dup in
      the stable_node_chain->hlist.  This way the "regular" (flat) stable_node
      is converted to a stable_node "dup" living in the second dimension of
      the stable rbtree.
      
      During stable rbtree lookups the stable_node "chain" is identified as
      stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
      is_stable_node_chain()).
      
      When dropping stable_nodes, the stable_node "dup" is identified as
      stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
      
      The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
      elsewhere in any stable_node->head/node to avoid a clashes with the
      stable_node->node.rb_parent_color pointer, and different from
      &migrate_nodes.  So the second field of &migrate_nodes is picked and
      verified as always safe with a BUILD_BUG_ON in case the list_head
      implementation changes in the future.
      
      The STABLE_NODE_DUP is picked as a random negative value in
      stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
      it's a "regular" stable_node or a stable_node "dup".
      
      The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
      is aliased in a union with a time field used to rate limit the
      stable_node_chain->hlist prunes.
      
      The garbage collection of the stable_node_chain happens lazily during
      stable rbtree lookups (as for all other kind of stable_nodes), or while
      disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
      entire stable rbtree.
      
      While the "regular" stable_nodes and the stable_node "dups" must wait
      for their underlying tree_page to be freed before they can be freed
      themselves, the stable_node "chains" can be freed immediately if the
      stable_node->hlist turns empty.  This is because the "chains" are never
      pointed by any page->mapping and they're effectively stable rbtree KSM
      self contained metadata.
      
      [akpm@linux-foundation.org: fix non-NUMA build]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: NPetr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c653d0e
    • W
      mm/nobootmem.c: return 0 when start_pfn equals end_pfn · 172ffeb9
      Wei Yang 提交于
      When start_pfn equals end_pfn, __free_pages_memory() has no effect and
      __free_memory_core() will finally return (end_pfn - start_pfn) = 0.
      
      This patch returns 0 directly when start_pfn equals end_pfn.
      
      Link: http://lkml.kernel.org/r/20170502131115.6650-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      172ffeb9
    • N
      mm/vmscan.c: fix unsequenced modification and access warning · f2f43e56
      Nick Desaulniers 提交于
      Clang and its -Wunsequenced emits a warning
      
        mm/vmscan.c:2961:25: error: unsequenced modification and access to 'gfp_mask' [-Wunsequenced]
                        .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
                                              ^
      
      While it is not clear to me whether the initialization code violates the
      specification (6.7.8 par 19 (ISO/IEC 9899) looks like it disagrees) the
      code is quite confusing and worth cleaning up anyway.  Fix this by
      reusing sc.gfp_mask rather than the updated input gfp_mask parameter.
      
      Link: http://lkml.kernel.org/r/20170510154030.10720-1-nick.desaulniers@gmail.comSigned-off-by: NNick Desaulniers <nick.desaulniers@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2f43e56
    • D
      mm/mmap.c: mark protection_map as __ro_after_init · ac34ceaf
      Daniel Micay 提交于
      The protection map is only modified by per-arch init code so it can be
      protected from writes after the init code runs.
      
      This change was extracted from PaX where it's part of KERNEXEC.
      
      Link: http://lkml.kernel.org/r/20170510174441.26163-1-danielmicay@gmail.comSigned-off-by: NDaniel Micay <danielmicay@gmail.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac34ceaf
    • D
      mm, sparsemem: break out of loops early · c4e1be9e
      Dave Hansen 提交于
      There are a number of times that we loop over NR_MEM_SECTIONS, looking
      for section_present() on each section.  But, when we have very large
      physical address spaces (large MAX_PHYSMEM_BITS), NR_MEM_SECTIONS
      becomes very large, making the loops quite long.
      
      With MAX_PHYSMEM_BITS=46 and a section size of 128MB, the current loops
      are 512k iterations, which we barely notice on modern hardware.  But,
      raising MAX_PHYSMEM_BITS higher (like we will see on systems that
      support 5-level paging) makes this 64x longer and we start to notice,
      especially on slower systems like simulators.  A 10-second delay for
      512k iterations is annoying.  But, a 640- second delay is crippling.
      
      This does not help if we have extremely sparse physical address spaces,
      but those are quite rare.  We expect that most of the "slow" systems
      where this matters will also be quite small and non-sparse.
      
      To fix this, we track the highest section we've ever encountered.  This
      lets us know when we will *never* see another section_present(), and
      lets us break out of the loops earlier.
      
      Doing the whole for_each_present_section_nr() macro is probably
      overkill, but it will ensure that any future loop iterations that we
      grow are more likely to be correct.
      
      Kirrill said "It shaved almost 40 seconds from boot time in qemu with
      5-level paging enabled for me".
      
      Link: http://lkml.kernel.org/r/20170504174434.C45A4735@viggo.jf.intel.comSigned-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Tested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4e1be9e
    • K
      mm: allow slab_nomerge to be set at build time · 7660a6fd
      Kees Cook 提交于
      Some hardened environments want to build kernels with slab_nomerge
      already set (so that they do not depend on remembering to set the kernel
      command line option).  This is desired to reduce the risk of kernel heap
      overflows being able to overwrite objects from merged caches and changes
      the requirements for cache layout control, increasing the difficulty of
      these attacks.  By keeping caches unmerged, these kinds of exploits can
      usually only damage objects in the same cache (though the risk to
      metadata exploitation is unchanged).
      
      Link: http://lkml.kernel.org/r/20170620230911.GA25238@beastSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: David Windsor <dave@nullcore.net>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Daniel Mack <daniel@zonque.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7660a6fd
    • C
      mm/slab.c: replace open-coded round-up code with ALIGN · e0771950
      Canjiang Lu 提交于
      Link: http://lkml.kernel.org/r/20170616072918epcms5p4ff16c24ef8472b4c3b4371823cd87856@epcms5p4Signed-off-by: NCanjiang Lu <canjiang.lu@samsung.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0771950
    • W
      mm/slub.c: wrap kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL · e6d0e1dc
      Wei Yang 提交于
      kmem_cache->cpu_partial is just used when CONFIG_SLUB_CPU_PARTIAL is
      set, so wrap it with config CONFIG_SLUB_CPU_PARTIAL will save some space
      on 32bit arch.
      
      This patch wraps kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL
      and wraps its sysfs too.
      
      Link: http://lkml.kernel.org/r/20170502144533.10729-4-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6d0e1dc
    • W
      mm/slub.c: wrap cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL · a93cf07b
      Wei Yang 提交于
      cpu_slab's field partial is used when CONFIG_SLUB_CPU_PARTIAL is set,
      which means we can save a pointer's space on each cpu for every slub
      item.
      
      This patch wraps cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL and wraps
      its sysfs use too.
      
      [akpm@linux-foundation.org: avoid strange 80-col tricks]
      Link: http://lkml.kernel.org/r/20170502144533.10729-3-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a93cf07b
    • W
      mm/slub: reset cpu_slab's pointer in deactivate_slab() · d4ff6d35
      Wei Yang 提交于
      Each time a slab is deactivated, the page and freelist pointer should be
      reset.
      
      This patch just merges these two options into deactivate_slab().
      
      Link: http://lkml.kernel.org/r/20170507031215.3130-2-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4ff6d35
    • W
      mm/slub.c: remove a redundant assignment in ___slab_alloc() · 66fdbe52
      Wei Yang 提交于
      When the code comes to this point, there are two cases:
      1. cpu_slab is deactivated
      2. cpu_slab is empty
      
      In both cased, cpu_slab->freelist is NULL at this moment.
      
      This patch removes the redundant assignment of cpu_slab->freelist.
      
      Link: http://lkml.kernel.org/r/20170507031215.3130-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66fdbe52
    • K
      thp, mm: fix crash due race in MADV_FREE handling · bbf29ffc
      Kirill A. Shutemov 提交于
      Reinette reported the following crash:
      
        BUG: Bad page state in process log2exe  pfn:57600
        page:ffffea00015d8000 count:0 mapcount:0 mapping:          (null) index:0x20200
        flags: 0x4000000000040019(locked|uptodate|dirty|swapbacked)
        raw: 4000000000040019 0000000000000000 0000000000020200 00000000ffffffff
        raw: ffffea00015d8020 ffffea00015d8020 0000000000000000 0000000000000000
        page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
        bad because of flags: 0x1(locked)
        Modules linked in: rfcomm 8021q bnep intel_rapl x86_pkg_temp_thermal coretemp efivars btusb btrtl btbcm pwm_lpss_pci snd_hda_codec_hdmi btintel pwm_lpss snd_hda_codec_realtek snd_soc_skl snd_hda_codec_generic snd_soc_skl_ipc spi_pxa2xx_platform snd_soc_sst_ipc snd_soc_sst_dsp i2c_designware_platform i2c_designware_core snd_hda_ext_core snd_soc_sst_match snd_hda_intel snd_hda_codec mei_me snd_hda_core mei snd_soc_rt286 snd_soc_rl6347a snd_soc_core efivarfs
        CPU: 1 PID: 354 Comm: log2exe Not tainted 4.12.0-rc7-test-test #19
        Hardware name: Intel corporation NUC6CAYS/NUC6CAYB, BIOS AYAPLCEL.86A.0027.2016.1108.1529 11/08/2016
        Call Trace:
         bad_page+0x16a/0x1f0
         free_pages_check_bad+0x117/0x190
         free_hot_cold_page+0x7b1/0xad0
         __put_page+0x70/0xa0
         madvise_free_huge_pmd+0x627/0x7b0
         madvise_free_pte_range+0x6f8/0x1150
         __walk_page_range+0x6b5/0xe30
         walk_page_range+0x13b/0x310
         madvise_free_page_range.isra.16+0xad/0xd0
         madvise_free_single_vma+0x2e4/0x470
         SyS_madvise+0x8ce/0x1450
      
      If somebody frees the page under us and we hold the last reference to
      it, put_page() would attempt to free the page before unlocking it.
      
      The fix is trivial reorder of operations.
      
      Dave said:
       "I came up with the exact same patch.  For posterity, here's the test
        case, generated by syzkaller and trimmed down by Reinette:
      
        	https://www.sr71.net/~dave/intel/log2.c
      
        And the config that helps detect this:
      
        	https://www.sr71.net/~dave/intel/config-log2"
      
      Fixes: b8d3c4c3 ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
      Link: http://lkml.kernel.org/r/20170628101249.17879-1-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NReinette Chatre <reinette.chatre@intel.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbf29ffc
  2. 29 6月, 2017 1 次提交
  3. 24 6月, 2017 3 次提交
    • T
      slub: make sysfs file removal asynchronous · 3b7b3140
      Tejun Heo 提交于
      Commit bf5eb3de ("slub: separate out sysfs_slab_release() from
      sysfs_slab_remove()") made slub sysfs file removals synchronous to
      kmem_cache shutdown.
      
      Unfortunately, this created a possible ABBA deadlock between slab_mutex
      and sysfs draining mechanism triggering the following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        4.10.0-test+ #48 Not tainted
        -------------------------------------------------------
        rmmod/1211 is trying to acquire lock:
         (s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40
      
        but task is already holding lock:
         (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (slab_mutex){+.+.+.}:
      	 lock_acquire+0xf6/0x1f0
      	 __mutex_lock+0x75/0x950
      	 mutex_lock_nested+0x1b/0x20
      	 slab_attr_store+0x75/0xd0
      	 sysfs_kf_write+0x45/0x60
      	 kernfs_fop_write+0x13c/0x1c0
      	 __vfs_write+0x28/0x120
      	 vfs_write+0xc8/0x1e0
      	 SyS_write+0x49/0xa0
      	 entry_SYSCALL_64_fastpath+0x1f/0xc2
      
        -> #0 (s_active#120){++++.+}:
      	 __lock_acquire+0x10ed/0x1260
      	 lock_acquire+0xf6/0x1f0
      	 __kernfs_remove+0x254/0x320
      	 kernfs_remove+0x23/0x40
      	 sysfs_remove_dir+0x51/0x80
      	 kobject_del+0x18/0x50
      	 __kmem_cache_shutdown+0x3e6/0x460
      	 kmem_cache_destroy+0x1fb/0x2d0
      	 kvm_exit+0x2d/0x80 [kvm]
      	 vmx_exit+0x19/0xa1b [kvm_intel]
      	 SyS_delete_module+0x198/0x1f0
      	 entry_SYSCALL_64_fastpath+0x1f/0xc2
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(slab_mutex);
      				 lock(s_active#120);
      				 lock(slab_mutex);
          lock(s_active#120);
      
         *** DEADLOCK ***
      
        2 locks held by rmmod/1211:
         #0:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80
         #1:  (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0
      
        stack backtrace:
        CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48
        Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
        Call Trace:
         print_circular_bug+0x1be/0x210
         __lock_acquire+0x10ed/0x1260
         lock_acquire+0xf6/0x1f0
         __kernfs_remove+0x254/0x320
         kernfs_remove+0x23/0x40
         sysfs_remove_dir+0x51/0x80
         kobject_del+0x18/0x50
         __kmem_cache_shutdown+0x3e6/0x460
         kmem_cache_destroy+0x1fb/0x2d0
         kvm_exit+0x2d/0x80 [kvm]
         vmx_exit+0x19/0xa1b [kvm_intel]
         SyS_delete_module+0x198/0x1f0
         ? SyS_delete_module+0x5/0x1f0
         entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It'd be the cleanest to deal with the issue by removing sysfs files
      without holding slab_mutex before the rest of shutdown; however, given
      the current code structure, it is pretty difficult to do so.
      
      This patch punts sysfs file removal to a work item.  Before commit
      bf5eb3de, the removal was punted to a RCU delayed work item which is
      executed after release.  Now, we're punting to a different work item on
      shutdown which still maintains the goal removing the sysfs files earlier
      when destroying kmem_caches.
      
      Link: http://lkml.kernel.org/r/20170620204512.GI21326@htj.duckdns.org
      Fixes: bf5eb3de ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()")
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Tested-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b7b3140
    • A
      mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings · 029c54b0
      Ard Biesheuvel 提交于
      Existing code that uses vmalloc_to_page() may assume that any address
      for which is_vmalloc_addr() returns true may be passed into
      vmalloc_to_page() to retrieve the associated struct page.
      
      This is not un unreasonable assumption to make, but on architectures
      that have CONFIG_HAVE_ARCH_HUGE_VMAP=y, it no longer holds, and we need
      to ensure that vmalloc_to_page() does not go off into the weeds trying
      to dereference huge PUDs or PMDs as table entries.
      
      Given that vmalloc() and vmap() themselves never create huge mappings or
      deal with compound pages at all, there is no correct answer in this
      case, so return NULL instead, and issue a warning.
      
      When reading /proc/kcore on arm64, you will hit an oops as soon as you
      hit the huge mappings used for the various segments that make up the
      mapping of vmlinux.  With this patch applied, you will no longer hit the
      oops, but the kcore contents willl be incorrect (these regions will be
      zeroed out)
      
      We are fixing this for kcore specifically, so it avoids vread() for
      those regions.  At least one other problematic user exists, i.e.,
      /dev/kmem, but that is currently broken on arm64 for other reasons.
      
      Link: http://lkml.kernel.org/r/20170609082226.26152-1-ard.biesheuvel@linaro.orgSigned-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Reviewed-by: NLaura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      029c54b0
    • D
      mm, thp: remove cond_resched from __collapse_huge_page_copy · c891d9f6
      David Rientjes 提交于
      This is a partial revert of commit 338a16ba ("mm, thp: copying user
      pages must schedule on collapse") which added a cond_resched() to
      __collapse_huge_page_copy().
      
      On x86 with CONFIG_HIGHPTE, __collapse_huge_page_copy is called in
      atomic context and thus scheduling is not possible.  This is only a
      possible config on arm and i386.
      
      Although need_resched has been shown to be set for over 100 jiffies
      while doing the iteration in __collapse_huge_page_copy, this is better
      than doing
      
      	if (in_atomic())
      		cond_resched()
      
      to cover only non-CONFIG_HIGHPTE configs.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706191341550.97821@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NLarry Finger <Larry.Finger@lwfinger.net>
      Tested-by: NLarry Finger <Larry.Finger@lwfinger.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c891d9f6
  4. 22 6月, 2017 4 次提交
  5. 21 6月, 2017 4 次提交
  6. 20 6月, 2017 4 次提交
    • G
      fs: return if direct I/O will trigger writeback · 6be96d3a
      Goldwyn Rodrigues 提交于
      Find out if the I/O will trigger a wait due to writeback. If yes,
      return -EAGAIN.
      
      Return -EINVAL for buffered AIO: there are multiple causes of
      delay such as page locks, dirty throttling logic, page loading
      from disk etc. which cannot be taken care of.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6be96d3a
    • G
      fs: Introduce filemap_range_has_page() · 7fc9e472
      Goldwyn Rodrigues 提交于
      filemap_range_has_page() return true if the file's mapping has
      a page within the range mentioned. This function will be used
      to check if a write() call will cause a writeback of previous
      writes.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7fc9e472
    • I
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar 提交于
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2055da97
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  7. 19 6月, 2017 1 次提交
    • H
      mm: larger stack guard gap, between vmas · 1be7107f
      Hugh Dickins 提交于
      Stack guard page is a useful feature to reduce a risk of stack smashing
      into a different mapping. We have been using a single page gap which
      is sufficient to prevent having stack adjacent to a different mapping.
      But this seems to be insufficient in the light of the stack usage in
      userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
      used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
      which is 256kB or stack strings with MAX_ARG_STRLEN.
      
      This will become especially dangerous for suid binaries and the default
      no limit for the stack size limit because those applications can be
      tricked to consume a large portion of the stack and a single glibc call
      could jump over the guard page. These attacks are not theoretical,
      unfortunatelly.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and and the vma tree's subtree_gap support for that.
      Original-patch-by: NOleg Nesterov <oleg@redhat.com>
      Original-patch-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  8. 17 6月, 2017 4 次提交
  9. 13 6月, 2017 1 次提交