1. 23 8月, 2018 2 次提交
    • D
      kernel/hung_task.c: allow to set checking interval separately from timeout · a2e51445
      Dmitry Vyukov 提交于
      Currently task hung checking interval is equal to timeout, as the result
      hung is detected anywhere between timeout and 2*timeout.  This is fine for
      most interactive environments, but this hurts automated testing setups
      (syzbot).  In an automated setup we need to strictly order CPU lockup <
      RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
      is not detected as task hung and task hung is not detected as silent
      machine loss.  The large variance in task hung detection timeout requires
      setting silent machine loss timeout to a very large value (e.g.  if task
      hung is 3 mins, then silent loss need to be set to ~7 mins).  The
      additional 3 minutes significantly reduce testing efficiency because
      usually we crash kernel within a minute, and this can add hours to bug
      localization process as it needs to do dozens of tests.
      
      Allow setting checking interval separately from timeout.  This allows to
      set timeout to, say, 3 minutes, but checking interval to 10 secs.
      
      The interval is controlled via a new hung_task_check_interval_secs sysctl,
      similar to the existing hung_task_timeout_secs sysctl.  The default value
      of 0 results in the current behavior: checking interval is equal to
      timeout.
      
      [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
      Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2e51445
    • A
      mm: zero out the vma in vma_init() · a670468f
      Andrew Morton 提交于
      Rather than in vm_area_alloc().  To ensure that the various oddball
      stack-based vmas are in a good state.  Some of the callers were zeroing
      them out, others were not.
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a670468f
  2. 18 8月, 2018 1 次提交
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt 提交于
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
  3. 01 8月, 2018 1 次提交
  4. 27 7月, 2018 1 次提交
  5. 22 7月, 2018 3 次提交
    • L
      mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds 提交于
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      490fc053
    • L
      mm: make vm_area_dup() actually copy the old vma data · 95faf699
      Linus Torvalds 提交于
      .. and re-initialize th eanon_vma_chain head.
      
      This removes some boiler-plate from the users, and also makes it clear
      why it didn't need use the 'zalloc()' version.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95faf699
    • L
      mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds 提交于
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded evertwhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3928d4f5
  6. 17 7月, 2018 1 次提交
  7. 15 6月, 2018 1 次提交
    • T
      mm: check for SIGKILL inside dup_mmap() loop · 655c79bb
      Tetsuo Handa 提交于
      As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas
      can loop while potentially allocating memory, with mm->mmap_sem held for
      write by current thread.  This is bad if current thread was selected as
      an OOM victim, for current thread will continue allocations using memory
      reserves while OOM reaper is unable to reclaim memory.
      
      As an actually observable problem, it is not difficult to make OOM
      reaper unable to reclaim memory if the OOM victim is blocked at
      i_mmap_lock_write() in this loop.  Unfortunately, since nobody can
      explain whether it is safe to use killable wait there, let's check for
      SIGKILL before trying to allocate memory.  Even without an OOM event,
      there is no point with continuing the loop from the beginning if current
      thread is killed.
      
      I tested with debug printk().  This patch should be safe because we
      already fail if security_vm_enough_memory_mm() or
      kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.
      
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting dup_mmap() due to SIGKILL *****
         ***** Aborting exit_mmap() due to NULL mmap *****
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      655c79bb
  8. 14 6月, 2018 1 次提交
    • L
      Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables · 050e9baa
      Linus Torvalds 提交于
      The changes to automatically test for working stack protector compiler
      support in the Kconfig files removed the special STACKPROTECTOR_AUTO
      option that picked the strongest stack protector that the compiler
      supported.
      
      That was all a nice cleanup - it makes no sense to have the AUTO case
      now that the Kconfig phase can just determine the compiler support
      directly.
      
      HOWEVER.
      
      It also meant that doing "make oldconfig" would now _disable_ the strong
      stackprotector if you had AUTO enabled, because in a legacy config file,
      the sane stack protector configuration would look like
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_NONE is not set
        # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_STACKPROTECTOR_AUTO=y
      
      and when you ran this through "make oldconfig" with the Kbuild changes,
      it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
      been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
      CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
      used to be disabled (because it was really enabled by AUTO), and would
      disable it in the new config, resulting in:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      That's dangerously subtle - people could suddenly find themselves with
      the weaker stack protector setup without even realizing.
      
      The solution here is to just rename not just the old RECULAR stack
      protector option, but also the strong one.  This does that by just
      removing the CC_ prefix entirely for the user choices, because it really
      is not about the compiler support (the compiler support now instead
      automatially impacts _visibility_ of the options to users).
      
      This results in "make oldconfig" actually asking the user for their
      choice, so that we don't have any silent subtle security model changes.
      The end result would generally look like this:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_STACKPROTECTOR=y
        CONFIG_STACKPROTECTOR_STRONG=y
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      where the "CC_" versions really are about internal compiler
      infrastructure, not the user selections.
      Acked-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      050e9baa
  9. 08 6月, 2018 1 次提交
    • Y
      mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct · 88aa7cc6
      Yang Shi 提交于
      mmap_sem is on the hot path of kernel, and it very contended, but it is
      abused too.  It is used to protect arg_start|end and evn_start|end when
      reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
      sense since those proc files just expect to read 4 values atomically and
      not related to VM, they could be set to arbitrary values by C/R.
      
      And, the mmap_sem contention may cause unexpected issue like below:
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
       Call Trace:
         schedule+0x36/0x80
         rwsem_down_read_failed+0xf0/0x150
         call_rwsem_down_read_failed+0x18/0x30
         down_read+0x20/0x40
         proc_pid_cmdline_read+0xd9/0x4e0
         __vfs_read+0x37/0x150
         vfs_read+0x96/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
      for them to mitigate the abuse of mmap_sem.
      
      So, introduce a new spinlock in mm_struct to protect the concurrent
      access to arg_start|end, env_start|end and others, as well as replace
      write map_sem to read to protect the race condition between prctl and
      sys_brk which might break check_data_rlimit(), and makes prctl more
      friendly to other VM operations.
      
      This patch just eliminates the abuse of mmap_sem, but it can't resolve
      the above hung task warning completely since the later
      access_remote_vm() call needs acquire mmap_sem.  The mmap_sem
      scalability issue will be solved in the future.
      
      [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
        Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88aa7cc6
  10. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  11. 15 5月, 2018 1 次提交
  12. 21 4月, 2018 1 次提交
    • K
      fork: unconditionally clear stack on fork · e01e8063
      Kees Cook 提交于
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
      
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
      
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak
      
      I did some more with perf and cycle counts on running 100,000 execs of
      /bin/true.
      
      before:
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      
      after:
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beastSigned-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e01e8063
  13. 06 4月, 2018 1 次提交
  14. 03 4月, 2018 2 次提交
  15. 22 2月, 2018 1 次提交
  16. 07 2月, 2018 2 次提交
  17. 01 2月, 2018 1 次提交
  18. 16 1月, 2018 3 次提交
    • K
      fork: Provide usercopy whitelisting for task_struct · 5905429a
      Kees Cook 提交于
      While the blocked and saved_sigmask fields of task_struct are copied to
      userspace (via sigmask_to_save() and setup_rt_frame()), it is always
      copied with a static length (i.e. sizeof(sigset_t)).
      
      The only portion of task_struct that is potentially dynamically sized and
      may be copied to userspace is in the architecture-specific thread_struct
      at the end of task_struct.
      
      cache object allocation:
          kernel/fork.c:
              alloc_task_struct_node(...):
                  return kmem_cache_alloc_node(task_struct_cachep, ...);
      
              dup_task_struct(...):
                  ...
                  tsk = alloc_task_struct_node(node);
      
              copy_process(...):
                  ...
                  dup_task_struct(...)
      
              _do_fork(...):
                  ...
                  copy_process(...)
      
      example usage trace:
      
          arch/x86/kernel/fpu/signal.c:
              __fpu__restore_sig(...):
                  ...
                  struct task_struct *tsk = current;
                  struct fpu *fpu = &tsk->thread.fpu;
                  ...
                  __copy_from_user(&fpu->state.xsave, ..., state_size);
      
              fpu__restore_sig(...):
                  ...
                  return __fpu__restore_sig(...);
      
          arch/x86/kernel/signal.c:
              restore_sigcontext(...):
                  ...
                  fpu__restore_sig(...)
      
      This introduces arch_thread_struct_whitelist() to let an architecture
      declare specifically where the whitelist should be within thread_struct.
      If undefined, the entire thread_struct field is left whitelisted.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: "Mickaël Salaün" <mic@digikod.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      5905429a
    • D
      fork: Define usercopy region in thread_stack slab caches · f9d29946
      David Windsor 提交于
      In support of usercopy hardening, this patch defines a region in the
      thread_stack slab caches in which userspace copy operations are allowed.
      Since the entire thread_stack needs to be available to userspace, the
      entire slab contents are whitelisted. Note that the slab-based thread
      stack is only present on systems with THREAD_SIZE < PAGE_SIZE and
      !CONFIG_VMAP_STACK.
      
      cache object allocation:
          kernel/fork.c:
              alloc_thread_stack_node(...):
                  return kmem_cache_alloc_node(thread_stack_cache, ...)
      
              dup_task_struct(...):
                  ...
                  stack = alloc_thread_stack_node(...)
                  ...
                  tsk->stack = stack;
      
              copy_process(...):
                  ...
                  dup_task_struct(...)
      
              _do_fork(...):
                  ...
                  copy_process(...)
      
      This region is known as the slab cache's usercopy region. Slab caches
      can now check that each dynamically sized copy operation involving
      cache-managed memory falls entirely within the slab's usercopy region.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: NDavid Windsor <dave@nullcore.net>
      [kees: adjust commit log, split patch, provide usage trace]
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      f9d29946
    • D
      fork: Define usercopy region in mm_struct slab caches · 07dcd7fe
      David Windsor 提交于
      In support of usercopy hardening, this patch defines a region in the
      mm_struct slab caches in which userspace copy operations are allowed.
      Only the auxv field is copied to userspace.
      
      cache object allocation:
          kernel/fork.c:
              #define allocate_mm()     (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
      
              dup_mm():
                  ...
                  mm = allocate_mm();
      
              copy_mm(...):
                  ...
                  dup_mm();
      
              copy_process(...):
                  ...
                  copy_mm(...)
      
              _do_fork(...):
                  ...
                  copy_process(...)
      
      example usage trace:
      
          fs/binfmt_elf.c:
              create_elf_tables(...):
                  ...
                  elf_info = (elf_addr_t *)current->mm->saved_auxv;
                  ...
                  copy_to_user(..., elf_info, ei_index * sizeof(elf_addr_t))
      
              load_elf_binary(...):
                  ...
                  create_elf_tables(...);
      
      This region is known as the slab cache's usercopy region. Slab caches
      can now check that each dynamically sized copy operation involving
      cache-managed memory falls entirely within the slab's usercopy region.
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: NDavid Windsor <dave@nullcore.net>
      [kees: adjust commit log, split patch, provide usage trace]
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      07dcd7fe
  19. 23 12月, 2017 1 次提交
    • T
      arch, mm: Allow arch_dup_mmap() to fail · c10e83f5
      Thomas Gleixner 提交于
      In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
      allowed to fail. Fix up all instances.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: dan.j.williams@intel.com
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c10e83f5
  20. 18 11月, 2017 1 次提交
  21. 16 11月, 2017 4 次提交
  22. 14 10月, 2017 1 次提交
  23. 04 10月, 2017 1 次提交
  24. 09 9月, 2017 2 次提交
  25. 07 9月, 2017 2 次提交
    • R
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel 提交于
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.comSigned-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NFlorian Weimer <fweimer@redhat.com>
      Reported-by: NColm MacCártaigh <colm@allcosts.net>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2cd9ede
    • A
      mm: oom: let oom_reap_task and exit_mmap run concurrently · 21292580
      Andrea Arcangeli 提交于
      This is purely required because exit_aio() may block and exit_mmap() may
      never start, if the oom_reap_task cannot start running on a mm with
      mm_users == 0.
      
      At the same time if the OOM reaper doesn't wait at all for the memory of
      the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
      generate a spurious OOM kill.
      
      If it wasn't because of the exit_aio or similar blocking functions in
      the last mmput, it would be enough to change the oom_reap_task() in the
      case it finds mm_users == 0, to wait for a timeout or to wait for
      __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
      problem here so the concurrency of exit_mmap and oom_reap_task is
      apparently warranted.
      
      It's a non standard runtime, exit_mmap() runs without mmap_sem, and
      oom_reap_task runs with the mmap_sem for reading as usual (kind of
      MADV_DONTNEED).
      
      The race between the two is solved with a combination of
      tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
      (serialized by a dummy down_write/up_write cycle on the same lines of
      the ksm_exit method).
      
      If the oom_reap_task() may be running concurrently during exit_mmap,
      exit_mmap will wait it to finish in down_write (before taking down mm
      structures that would make the oom_reap_task fail with use after free).
      
      If exit_mmap comes first, oom_reap_task() will skip the mm if
      MMF_OOM_SKIP is already set and in turn all memory is already freed and
      furthermore the mm data structures may already have been taken down by
      free_pgtables.
      
      [aarcange@redhat.com: incremental one liner]
        Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
      [rientjes@google.com: remove unused mmput_async]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
      [aarcange@redhat.com: microoptimization]
        Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
      Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
      Fixes: 26db62f1 ("oom: keep mm of the killed task available")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Tested-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21292580
  26. 01 9月, 2017 1 次提交
    • E
      mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 355627f5
      Eric Biggers 提交于
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced an new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
      
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
      
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      
      Here is the use-after-free reported by KASAN:
      
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
      
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
           dump_stack+0xdb/0x185
           print_address_description+0x7e/0x290
           kasan_report+0x23b/0x350
           __asan_report_load8_noabort+0x19/0x20
           uprobe_clear_state+0x1c4/0x200
           mmput+0xd6/0x360
           do_exit+0x740/0x1670
           do_group_exit+0x13f/0x380
           get_signal+0x597/0x17d0
           do_signal+0x99/0x1df0
           exit_to_usermode_loop+0x166/0x1e0
           syscall_return_slowpath+0x258/0x2c0
           entry_SYSCALL_64_fastpath+0xbc/0xbe
      
          ...
      
          Allocated by task 199:
           save_stack_trace+0x1b/0x20
           kasan_kmalloc+0xfc/0x180
           kmem_cache_alloc_trace+0xf3/0x330
           __create_xol_area+0x10f/0x780
           uprobe_notify_resume+0x1674/0x2210
           exit_to_usermode_loop+0x150/0x1e0
           prepare_exit_to_usermode+0x14b/0x180
           retint_user+0x8/0x20
      
          Freed by task 199:
           save_stack_trace+0x1b/0x20
           kasan_slab_free+0xa8/0x1a0
           kfree+0xba/0x210
           uprobe_clear_state+0x151/0x200
           mmput+0xd6/0x360
           copy_process.part.8+0x605f/0x65d0
           _do_fork+0x1a5/0xbd0
           SyS_clone+0x19/0x20
           do_syscall_64+0x22f/0x660
           return_from_SYSCALL_64+0x0/0x7a
      
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>    [4.7+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      355627f5
  27. 26 8月, 2017 1 次提交
    • E
      fork: fix incorrect fput of ->exe_file causing use-after-free · 2b7e8665
      Eric Biggers 提交于
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced an new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      taken.
      
      This caused the struct file to later be freed prematurely.
      
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
      
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *mmap_thread(void *_arg)
          {
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
              }
          }
      
          static void *fork_thread(void *_arg)
          {
              usleep(rand() % 10000);
              fork();
          }
      
          int main(void)
          {
              fork();
              fork();
              fork();
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      
      Google Bug Id: 64772007
      
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[v4.7+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b7e8665
  28. 16 8月, 2017 1 次提交
    • M
      fork: allow arch-override of VMAP stack alignment · 48ac3c18
      Mark Rutland 提交于
      In some cases, an architecture might wish its stacks to be aligned to a
      boundary larger than THREAD_SIZE. For example, using an alignment of
      double THREAD_SIZE can allow for stack overflows smaller than
      THREAD_SIZE to be detected by checking a single bit of the stack
      pointer.
      
      This patch allows architectures to override the alignment of VMAP'd
      stacks, by defining THREAD_ALIGN. Where not defined, this defaults to
      THREAD_SIZE, as is the case today.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Reviewed-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NLaura Abbott <labbott@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: linux-kernel@vger.kernel.org
      48ac3c18