1. 29 August 2017 (1 commit)
  2. 26 August 2017 (1 commit)
    • futex: Remove duplicated code and fix undefined behaviour · 30d6e0a4
      Authored by Jiri Slaby
      There is code duplicated across all architectures' headers for
      futex_atomic_op_inuser, namely the op decoding, the access_ok() check for
      uaddr, and the comparison of the result.
      
      Remove this duplication and leave up to the arches only the needed
      assembly which is now in arch_futex_atomic_op_inuser.
      
      This effectively distributes Will Deacon's arm64 fix for undefined
      behaviour reported by UBSAN to all architectures. The fix was done in
      commit 5f16a046 (arm64: futex: Fix undefined behaviour with
      FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.
      
      And as suggested by Thomas, check for negative oparg too, because it was
      also reported to trigger an undefined behaviour report.
      
      Note that s390 removed access_ok check in d12a2970 ("s390/uaccess:
      remove pointless access_ok() checks") as access_ok there returns true.
      We introduce it back to the helper for the sake of simplicity (it gets
      optimized away anyway).
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
      Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
      Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
      Cc: linux-mips@linux-mips.org
      Cc: Rich Felker <dalias@libc.org>
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: peterz@infradead.org
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: sparclinux@vger.kernel.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: linux-s390@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: linux-hexagon@vger.kernel.org
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: linux-snps-arc@lists.infradead.org
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-xtensa@linux-xtensa.org
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: openrisc@lists.librecores.org
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-parisc@vger.kernel.org
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: linux-alpha@vger.kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
      30d6e0a4
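      To make the now-shared logic concrete, here is a minimal user-space sketch of the
      decoding and checking that the common futex_atomic_op_inuser() wrapper keeps out of
      the per-arch code. It is only an illustration under assumptions: the constants are
      local stand-ins chosen to mirror the uapi FUTEX_OP_* encoding, the function names are
      hypothetical, and the per-arch atomic user access is replaced by a plain variable.

        /* futex_op_sketch.c -- user-space illustration only, not the kernel code */
        #include <errno.h>
        #include <stdio.h>

        /* local constants chosen to mirror the uapi FUTEX_OP_* encoding */
        enum { OP_SET, OP_ADD, OP_OR, OP_ANDN, OP_XOR, OP_OPARG_SHIFT = 8 };
        enum { CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE };

        #define ENCODE_OP(op, oparg, cmp, cmparg)                              \
                (((unsigned int)(op) << 28) | ((unsigned int)(cmp) << 24) |    \
                 (((unsigned int)(oparg) & 0xfff) << 12) |                     \
                 ((unsigned int)(cmparg) & 0xfff))

        static int futex_op_sketch(unsigned int encoded_op, int *uaddr)
        {
                int op     = (encoded_op >> 28) & 0x7;
                int cmp    = (encoded_op >> 24) & 0xf;
                int oparg  = (encoded_op >> 12) & 0xfff;
                int cmparg =  encoded_op        & 0xfff;
                int oldval = *uaddr;

                if (encoded_op & ((unsigned int)OP_OPARG_SHIFT << 28)) {
                        /* the centralized sanity check: shifting by a negative or
                         * too-large amount is undefined behaviour (the kernel
                         * sign-extends the 12-bit field, so oparg can be negative) */
                        if (oparg < 0 || oparg > 31)
                                return -EINVAL;
                        oparg = 1 << oparg;
                }

                /* in the kernel this step is the per-arch atomic user access */
                switch (op) {
                case OP_SET:  *uaddr  = oparg; break;
                case OP_ADD:  *uaddr += oparg; break;
                case OP_OR:   *uaddr |= oparg; break;
                case OP_ANDN: *uaddr &= ~oparg; break;
                case OP_XOR:  *uaddr ^= oparg; break;
                default:      return -ENOSYS;
                }

                /* the shared comparison against cmparg */
                switch (cmp) {
                case CMP_EQ: return oldval == cmparg;
                case CMP_NE: return oldval != cmparg;
                case CMP_LT: return oldval <  cmparg;
                case CMP_LE: return oldval <= cmparg;
                case CMP_GT: return oldval >  cmparg;
                case CMP_GE: return oldval >= cmparg;
                default:     return -ENOSYS;
                }
        }

        int main(void)
        {
                int uval = 0;
                /* add (1 << 1) to the futex word, wake if the old value was 0 */
                unsigned int op = ENCODE_OP(OP_ADD | OP_OPARG_SHIFT, 1, CMP_EQ, 0);

                printf("wake: %d, new value: %d\n", futex_op_sketch(op, &uval), uval);
                return 0;
        }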
  3. 25 August 2017 (2 commits)
    • locking/lockdep: Fix workqueue crossrelease annotation · e6f3faa7
      Authored by Peter Zijlstra
      The new completion/crossrelease annotations interact unfavourably with
      the extant flush_work()/flush_workqueue() annotations.
      
      The problem is that when a single work class does:
      
        wait_for_completion(&C)
      
      and
      
        complete(&C)
      
      in different executions, we'll build dependencies like:
      
        lock_map_acquire(W)
        complete_acquire(C)
      
      and
      
        lock_map_acquire(W)
        complete_release(C)
      
      which results in the dependency chain: W->C->W, which lockdep thinks
      spells deadlock, even though there is no deadlock potential since
      works are run concurrently.
      
      One possibility would be to change the work 'lock' to recursive-read,
      but that would mean hitting a lockdep limitation on recursive locks.
      Also, unconditionally switching to recursive-read here would fail to
      detect the actual deadlock on single-threaded workqueues, which do
      have a problem with this.
      
      For now, forcefully disregard these locks for crossrelease.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: byungchul.park@lge.com
      Cc: david@fromorbit.com
      Cc: johannes@sipsolutions.net
      Cc: oleg@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e6f3faa7
    • workqueue/lockdep: 'Fix' flush_work() annotation · a1d14934
      Authored by Peter Zijlstra
      The flush_work() annotation as introduced by commit:
      
        e159489b ("workqueue: relax lockdep annotation on flush_work()")
      
      hits on the lockdep problem with recursive read locks.
      
      The situation as described is:
      
      Work W1:                Work W2:        Task:
      
      ARR(Q)                  ARR(Q)		flush_workqueue(Q)
      A(W1)                   A(W2)             A(Q)
        flush_work(W2)			  R(Q)
          A(W2)
          R(W2)
          if (special)
            A(Q)
          else
            ARR(Q)
          R(Q)
      
      where: A - acquire, ARR - acquire-read-recursive, R - release.
      
      Under 'special' conditions we want to trigger a lock recursion
      deadlock, but otherwise allow the flush_work(). The allowing is done
      by using recursive read locks (ARR), but lockdep is broken for
      recursive read locks.
      
      However, there appears to be no need to acquire the lock if we're not
      'special', so if we remove the 'else' clause things become much
      simpler and no longer need the recursion thing at all.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: byungchul.park@lge.com
      Cc: david@fromorbit.com
      Cc: johannes@sipsolutions.net
      Cc: oleg@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a1d14934
  4. 24 August 2017 (3 commits)
    • tracing: Fix freeing of filter in create_filter() when set_str is false · 8b0db1a5
      Authored by Steven Rostedt (VMware)
      Performing the following task with kmemleak enabled:
      
       # cd /sys/kernel/tracing/events/irq/irq_handler_entry/
       # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
       # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
       # echo scan > /sys/kernel/debug/kmemleak
       # cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff8800b9290308 (size 32):
        comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
          [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
          [<ffffffff812639c9>] create_filter+0xa9/0x160
          [<ffffffff81263bdc>] create_event_filter+0xc/0x10
          [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
          [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
          [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
          [<ffffffff8138cf87>] __vfs_write+0xd7/0x380
          [<ffffffff8138f421>] vfs_write+0x101/0x260
          [<ffffffff8139187b>] SyS_write+0xab/0x130
          [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The function create_filter() is passed a 'filterp' pointer that gets
      allocated, and if "set_str" is true it is up to the caller to free it, even
      on error. The problem is that the pointer is not freed by create_filter()
      on error when set_str is false. This is a bug: if the caller doesn't care
      about the string, it is not responsible for freeing the filter on error, so
      create_filter() must free it itself.
      
      Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 38b78eb8 ("tracing: Factorize filter creation")
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Tested-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      8b0db1a5
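      The ownership rule being fixed can be shown with a small, self-contained user-space
      sketch. The names create_filter_sketch() and parse_filter() are hypothetical, not the
      kernel functions; the point is only the contract: when the caller opts out of keeping
      the partial result, the callee must free what it allocated on error.

        #include <errno.h>
        #include <stdbool.h>
        #include <stdlib.h>
        #include <string.h>

        struct filter { char *prog; };

        /* hypothetical parser: fails on an incomplete expression such as "irq >" */
        static int parse_filter(struct filter *f, const char *str)
        {
                size_t len = strlen(str);

                if (!len || str[len - 1] == '>')
                        return -EINVAL;
                f->prog = malloc(len + 1);
                if (!f->prog)
                        return -ENOMEM;
                memcpy(f->prog, str, len + 1);
                return 0;
        }

        /*
         * Mirrors the fixed contract: on error the caller only keeps (and later
         * frees) the partial filter if it asked for the string (set_str); if it
         * did not, the callee must free the allocation itself or it leaks.
         */
        static int create_filter_sketch(const char *str, bool set_str,
                                        struct filter **filterp)
        {
                struct filter *f = calloc(1, sizeof(*f));
                int err;

                if (!f)
                        return -ENOMEM;

                err = parse_filter(f, str);
                if (err && !set_str) {
                        free(f->prog);          /* NULL-safe */
                        free(f);
                        f = NULL;
                }
                *filterp = f;                   /* NULL on error when set_str is false */
                return err;
        }

        int main(void)
        {
                struct filter *f;
                /* the failing trigger string from the report above */
                int err = create_filter_sketch("irq >", false, &f);

                return (err && !f) ? 0 : 1;     /* no leak: nothing left to free */
        }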
    • tracing: Fix kmemleak in tracing_map_array_free() · 475bb3c6
      Authored by Chunyu Hu
      kmemleak reported the leak below when I was clearing the hist trigger.
      With this patch, the kmemleak report is gone.
      
      unreferenced object 0xffff94322b63d760 (size 32):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00  ................
          10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff  ..........z.1...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0
          [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff9431f27aa880 (size 128):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff  ...*2......*2...
          00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff  ...*2......*2...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e425348>] __kmalloc+0xe8/0x220
          [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      475bb3c6
    • ftrace: Check for null ret_stack on profile function graph entry function · a8f0f9e4
      Authored by Steven Rostedt (VMware)
      There's a small race between function graph shutdown and the calling of the
      registered function graph entry callback. The callback must not reference
      the task's ret_stack without first checking that it is not NULL. Note that when
      a ret_stack is allocated for a task, it stays allocated until the task exits.
      The problem here is that function_graph is shut down and a new task is
      created which doesn't have its ret_stack allocated. But since some of the
      functions are still being traced, the callbacks can still be called.
      
      The normal function_graph code handles this, but starting with commit
      8861dd30 ("ftrace: Access ret_stack->subtime only in the function
      profiler") the profiler code references the ret_stack on function entry, but
      doesn't check if it is NULL first.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611
      
      Cc: stable@vger.kernel.org
      Fixes: 8861dd30 ("ftrace: Access ret_stack->subtime only in the function profiler")
      Reported-by: lilydjwg@gmail.com
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      a8f0f9e4
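      A minimal sketch of the guard described above, with user-space stand-ins for the
      task and ret_stack structures (the kernel fix is essentially the same one-line NULL
      check at the top of the profiler's entry callback; all names below are illustrative):

        #include <stddef.h>

        /* stand-ins for the kernel structures involved */
        struct ret_stack { unsigned long long subtime; };
        struct task { struct ret_stack *ret_stack; /* lazily allocated per task */ };

        /*
         * Entry-callback sketch: tasks created after the graph tracer was shut
         * down never had a ret_stack allocated, yet the callback can still run
         * while functions remain traced, so it must bail out on NULL instead of
         * dereferencing it.
         */
        static int profile_graph_entry_sketch(struct task *current_task)
        {
                if (!current_task->ret_stack)
                        return 0;

                current_task->ret_stack->subtime = 0;
                return 1;
        }

        int main(void)
        {
                struct task late_task = { .ret_stack = NULL };

                return profile_graph_entry_sketch(&late_task);  /* 0, no crash */
        }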
  5. 22 August 2017 (1 commit)
    • pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Authored by Oleg Nesterov
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(); rcu_read_lock() cannot help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
      Reported-by: Troy Kensinger <tkensinger@google.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd1c1f2f
  6. 20 August 2017 (1 commit)
  7. 19 August 2017 (2 commits)
    • signal: don't remove SIGNAL_UNKILLABLE for traced tasks. · eb61b591
      Authored by Jamie Iles
      When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive
      faults, but this is undesirable when tracing.  For example, when debugging an
      init process (whether global or namespaced), hitting a breakpoint raises
      SIGTRAP, which is forced and SIGNAL_UNKILLABLE is then removed.
      Everything continues fine, but once debugging has finished, the
      init process is left killable, which is unlikely to be what the user expects,
      resulting in either an accidentally killed init or an init that stops
      reaping zombies.
      
      Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb61b591
    • kmod: fix wait on recursive loop · 2ba293c9
      Authored by Luis R. Rodriguez
      Recursive loops with module loading were previously handled in kmod by
      restricting the number of modprobe calls to 50 and if that limit was
      breached request_module() would return an error and a user would see the
      following on their kernel dmesg:
      
        request_module: runaway loop modprobe binfmt-464c
        Starting init:/sbin/init exists but couldn't execute it (error -8)
      
      This issue could happen, for instance, when a 64-bit kernel boots a 32-bit
      userspace on some architectures and has no 32-bit binary format
      handlers.  It is visible, for instance, when a CONFIG_MODULES-enabled
      64-bit MIPS kernel boots into an o32 root filesystem and the binfmt
      handler for o32 binaries is not built-in.
      
      After commit 6d7964a7 ("kmod: throttle kmod thread limit") we now
      don't have any visible signs of an error and the kernel just waits for
      the loop to end somehow.
      
      Although this *particular* recursive loop could also be addressed by
      doing a sanity check on search_binary_handler() and disallowing a
      modular binfmt to be required for modprobe, a generic solution for any
      recursive kernel kmod issues is still needed.
      
      This should catch these loops.  We can investigate each loop and address
      each one separately as they come in; this, however, puts a stopgap in
      place for them as before.
      
      Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org
      Fixes: 6d7964a7 ("kmod: throttle kmod thread limit")
      Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: Matt Redfearn <matt.redfearn@imgtec.com>
      Tested-by: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ba293c9
  8. 18 August 2017 (2 commits)
    • kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Authored by Thomas Gleixner
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
      fired since the last invocation. If not, it assumes a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
      7edaeb68
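      The filter logic lends itself to a small stand-alone sketch (function and variable
      names below are illustrative, not the kernel's): an NMI that arrives before 4/5 of
      the watchdog period has elapsed, measured against a monotonic clock, is ignored
      rather than treated as a missing hrtimer tick.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NSEC_PER_SEC 1000000000ull

        /* illustrative parameters: a 10 s watchdog period */
        static const uint64_t watchdog_period_ns = 10 * NSEC_PER_SEC;
        static uint64_t last_accepted_ns;       /* per-CPU in the kernel */

        /*
         * Called from the perf NMI path. Returns true when the event fired too
         * early (turbo shortened the cycle-counter period) and must be ignored;
         * only events at least 4/5 of a period apart in kernel time are allowed
         * to trigger the "did the hrtimer fire?" hard-lockup check.
         */
        static bool nmi_too_early(uint64_t now_ns)
        {
                if (now_ns - last_accepted_ns < watchdog_period_ns * 4 / 5)
                        return true;
                last_accepted_ns = now_ns;
                return false;
        }

        int main(void)
        {
                /* NMIs arriving at 3x nominal speed: most of them are dropped */
                for (uint64_t t = 0; t <= 10 * watchdog_period_ns; t += watchdog_period_ns / 3)
                        printf("t=%llus %s\n", (unsigned long long)(t / NSEC_PER_SEC),
                               nmi_too_early(t) ? "ignored" : "checked");
                return 0;
        }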
    • genirq: Restore trigger settings in irq_modify_status() · e8f24189
      Authored by Marc Zyngier
      irq_modify_status() starts by clearing the trigger settings from
      irq_data before applying the new settings, but doesn't restore them,
      leaving them set to IRQ_TYPE_NONE.
      
      That's pretty confusing to the potential request_irq() that could
      follow. Instead, snapshot the settings before clearing them, and restore
      them if the irq_modify_status() invocation was not changing the trigger.
      
      Fixes: 1e2a7d78 ("irqdomain: Don't set type when mapping an IRQ")
      Reported-and-tested-by: jeffy <jeffy.chen@rock-chips.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170818095345.12378-1-marc.zyngier@arm.com
      e8f24189
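      A user-space sketch of the save-and-restore approach (the constants and names here
      are illustrative stand-ins; IRQ_TYPE_SENSE_MASK just means "the low bits that hold
      the trigger type"):

        #include <assert.h>

        #define IRQ_TYPE_SENSE_MASK     0x0000000f      /* bits holding the trigger type */
        #define IRQ_TYPE_EDGE_RISING    0x00000001
        #define IRQ_LEVEL               0x00000100      /* an unrelated status flag */

        /*
         * Sketch of the fixed flow: snapshot the trigger bits before the
         * wholesale clear, and put them back unless the caller's 'set' mask
         * explicitly installs a new trigger.
         */
        static unsigned int modify_status_sketch(unsigned int status,
                                                 unsigned int clr, unsigned int set)
        {
                unsigned int trigger = status & IRQ_TYPE_SENSE_MASK;    /* snapshot */

                status &= ~(clr | IRQ_TYPE_SENSE_MASK);
                status |= set;

                if (!(set & IRQ_TYPE_SENSE_MASK))
                        status |= trigger;      /* restore instead of IRQ_TYPE_NONE */

                return status;
        }

        int main(void)
        {
                unsigned int status = IRQ_TYPE_EDGE_RISING;

                /* modify an unrelated flag: the edge trigger must survive */
                status = modify_status_sketch(status, 0, IRQ_LEVEL);
                assert(status & IRQ_TYPE_EDGE_RISING);
                return 0;
        }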
  9. 17 August 2017 (2 commits)
    • locking/lockdep: Explicitly initialize wq_barrier::done::map · 52fa5bc5
      Authored by Boqun Feng
      With the new lockdep crossrelease feature, which checks completions usage,
      a false positive is reported in the workqueue code:
      
      > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
      > Task   B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
      > Task   C : acquired of lock#3 -> wait for completion of barr->done
      > (Task C is in lru_add_drain_all_cpuslocked())
      > Worker D : wait for wfc.work to be released -> will complete barr->done
      
      Such a deadlock cannot happen because Task C's barr->done and Worker D's
      barr->done cannot be the same instance.
      
      The reason for this false positive is that we initialize every wq_barrier::done
      at insert_wq_barrier() via init_completion(), which makes them all belong to
      the same lock class; therefore, impossible circles are reported.
      
      To fix this, explicitly initialize the lockdep map for wq_barrier::done
      in insert_wq_barrier(), so that the lock class key of wq_barrier::done
      is a subkey of the corresponding work_struct. As a result we won't build
      a dependency between a wq_barrier and an unrelated work, and we can
      distinguish wq barriers based on their related works, so the false positive
      above is avoided.
      
      Also define an empty lockdep_init_map_crosslock() for !CROSSRELEASE
      to keep the code simple and free of unnecessary #ifdefs.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      52fa5bc5
    • locking/refcounts, x86/asm: Implement fast refcount overflow protection · 7a46ec0e
      Authored by Kees Cook
      This implements refcount_t overflow protection on x86 without a noticeable
      performance impact, though without the fuller checking of REFCOUNT_FULL.
      
      This is done by duplicating the existing atomic_t refcount implementation
      but with, normally, a single added instruction that detects whether the refcount
      has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
      the handler saturates the refcount_t to INT_MIN / 2. With this overflow
      protection, the erroneous reference release that would follow a wrap back
      to zero is blocked from happening, avoiding the class of refcount-overflow
      use-after-free vulnerabilities entirely.
      
      Only the overflow case of refcounting can be perfectly protected, since
      it can be detected and stopped before the reference is freed and left to
      be abused by an attacker. There isn't a way to block early decrements,
      and while REFCOUNT_FULL stops increment-from-zero cases (which would
      be the state _after_ an early decrement and stops potential double-free
      conditions), this fast implementation does not, since it would require
      the more expensive cmpxchg loops. Since the overflow case is much more
      common (e.g. missing a "put" during an error path), this protection
      provides real-world protection. For example, the two public refcount
      overflow use-after-free exploits published in 2016 would have been
      rendered unexploitable:
      
        http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
      
        http://cyseclabs.com/page?n=02012016
      
      This implementation does, however, notice an unchecked decrement to zero
      (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
      resulted in a zero). Decrements under zero are noticed (since they will
      have resulted in a negative value), though this only indicates that a
      use-after-free may have already happened. Such notifications are likely
      avoidable by an attacker that has already exploited a use-after-free
      vulnerability, but it's better to have them reported than allow such
      conditions to remain universally silent.
      
      On first overflow detection, the refcount value is reset to INT_MIN / 2
      (which serves as a saturation value) and a report and stack trace are
      produced. When operations detect only negative value results (such as
      changing an already saturated value), saturation still happens but no
      notification is performed (since the value was already saturated).
      
      On the matter of races, since the entire range beyond INT_MAX but before
      0 is negative, every operation at INT_MIN / 2 will trap, leaving no
      overflow-only race condition.
      
      As for performance, this implementation adds a single "js" instruction
      to the regular execution flow of a copy of the standard atomic_t refcount
      operations. (The non-"and_test" refcount_dec() function, which is uncommon
      in regular refcount design patterns, has an additional "jz" instruction
      to detect reaching exactly zero.) Since this is a forward jump, it is by
      default the non-predicted path, which will be reinforced by dynamic branch
      prediction. The result is this protection having virtually no measurable
      change in performance over standard atomic_t operations. The error path,
      located in .text.unlikely, saves the refcount location and then uses UD0
      to fire a refcount exception handler, which resets the refcount, handles
      reporting, and returns to regular execution. This keeps the changes to
      .text size minimal, avoiding return jumps and open-coded calls to the
      error reporting routine.
      
      Example assembly comparison:
      
      refcount_inc() before:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
      
      refcount_inc() after:
      
        .text:
        ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
        ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
        ...
        .text.unlikely:
        ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
        ffffffff816c36d7:       0f ff                   (bad)
      
      These are the cycle counts comparing a loop of refcount_inc() from 1
      to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
      unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
      (refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
      
        2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
      		    cycles		protections
        atomic_t           82249267387	none
        refcount_t-fast    82211446892	overflow, untested dec-to-zero
        refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero
      
      This code is a modified version of the x86 PAX_REFCOUNT atomic_t
      overflow defense from the last public patch of PaX/grsecurity, based
      on my understanding of the code. Changes or omissions from the original
      code are mine and don't reflect the original grsecurity/PaX code. Thanks
      to PaX Team for various suggestions for improvement for repurposing this
      code to be a refcount-only protection.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Hans Liljestrand <ishkamiel@gmail.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arozansk@redhat.com
      Cc: axboe@kernel.dk
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-arch <linux-arch@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7a46ec0e
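      The saturation semantics can also be modelled in portable C. This is a rough
      behavioural sketch of the fast path described above, not the x86 assembly and not
      the kernel API; the names are illustrative and INT_MIN / 2 is the saturation value
      named in the text.

        #include <limits.h>
        #include <stdatomic.h>
        #include <stdio.h>

        typedef struct { atomic_int refs; } refcount_sketch_t;

        static void refcount_warn_saturate(refcount_sketch_t *r, const char *what)
        {
                /* the exception-handler step: saturate and report */
                atomic_store(&r->refs, INT_MIN / 2);
                fprintf(stderr, "refcount %s detected, saturating\n", what);
        }

        static void refcount_inc_sketch(refcount_sketch_t *r)
        {
                /* C11 atomics define signed wrap-around; a negative value after
                 * the increment means we went past INT_MAX or were already
                 * saturated -- loosely mirrors "lock incl; js <handler>" */
                atomic_fetch_add(&r->refs, 1);
                if (atomic_load(&r->refs) < 0)
                        refcount_warn_saturate(r, "overflow");
        }

        static int refcount_dec_and_test_sketch(refcount_sketch_t *r)
        {
                int old = atomic_fetch_sub(&r->refs, 1);

                if (old <= 0)           /* hit or crossed zero: likely use-after-free */
                        refcount_warn_saturate(r, "underflow");
                return old == 1;        /* true when this put dropped it to zero */
        }

        int main(void)
        {
                refcount_sketch_t r = { .refs = INT_MAX };

                refcount_inc_sketch(&r);        /* would wrap: saturated instead */
                printf("after overflow: %d\n", atomic_load(&r.refs));

                refcount_sketch_t z = { .refs = 0 };

                printf("dec_and_test on 0: %d\n", refcount_dec_and_test_sketch(&z));
                return 0;
        }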
  10. 16 August 2017 (3 commits)
    • bpf: fix bpf_trace_printk on 32 bit archs · 88a5c690
      Authored by Daniel Borkmann
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand the various combinations
      of argument types into 8 different combinations for 32-bit
      and 64-bit kernels. James tested the fix on MIPS32 and MIPS64 and
      confirmed that it resolves the issue.
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88a5c690
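      The C pitfall described above is easy to reproduce in user space; the sketch below
      is nothing BPF-specific, just the "usual arithmetic conversions" applied to a
      conditional expression whose arms are u32 and u64 (vararg_sink() is a hypothetical
      stand-in for a variadic callee such as __trace_printk()).

        #include <stdarg.h>
        #include <stdint.h>
        #include <stdio.h>

        /* stand-in for the variadic callee: expects a 32-bit value for %u */
        static void vararg_sink(const char *fmt, ...)
        {
                va_list ap;

                va_start(ap, fmt);
                printf(fmt, va_arg(ap, unsigned int));  /* reads only 32 bits */
                va_end(ap);
        }

        int main(void)
        {
                uint32_t size32 = 512;
                uint64_t size64 = 512;
                int arg_is_32bit = 1;

                /*
                 * Both arms participate in the conditional expression's type, so
                 * the result is always a 64-bit integer -- even when the u32 arm
                 * is selected. On a 32-bit ABI the callee's va_arg(ap, unsigned
                 * int) then reads only half of what was pushed, producing the
                 * garbled output seen in the report above.
                 */
                printf("sizeof(u32 arm)       = %zu\n", sizeof(size32));
                printf("sizeof(?: expression) = %zu\n",
                       sizeof(arg_is_32bit ? size32 : size64));

                /* well-defined only because the value is explicitly narrowed */
                vararg_sink("size=%u\n", (unsigned int)(arg_is_32bit ? size32 : size64));
                return 0;
        }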
    • audit: Receive unmount event · b5fed474
      Authored by Jan Kara
      Although audit_watch_handle_event() can handle the FS_UNMOUNT event, it is
      not part of the AUDIT_FS_WATCH mask and thus such an event never gets to
      audit_watch_handle_event(). As a result, fsnotify marks are deleted by the
      fsnotify subsystem on unmount without audit being notified, which leads
      to a strange state of existing audit rules with dead fsnotify marks.
      
      Add FS_UNMOUNT to the mask of events to be received so that audit can
      clean up its state accordingly.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      b5fed474
    • audit: Fix use after free in audit_remove_watch_rule() · d76036ab
      Authored by Jan Kara
      audit_remove_watch_rule() drops watch's reference to parent but then
      continues to work with it. That is not safe as parent can get freed once
      we drop our reference. The following is a trivial reproducer:
      
      mount -o loop image /mnt
      touch /mnt/file
      auditctl -w /mnt/file -p wax
      umount /mnt
      auditctl -D
      <crash in fsnotify_destroy_mark()>
      
      Grab our own reference in audit_remove_watch_rule() earlier to make sure
      the mark does not get freed under us.
      
      CC: stable@vger.kernel.org
      Reported-by: Tony Jones <tonyj@suse.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Tony Jones <tonyj@suse.de>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      d76036ab
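      The fix follows the usual "take your own reference before dropping the one you were
      borrowing" pattern. A small user-space sketch, with hypothetical names and a plain
      integer refcount instead of the kernel's:

        #include <stdio.h>
        #include <stdlib.h>

        struct parent {
                int refcount;
        };

        static void parent_get(struct parent *p) { p->refcount++; }

        static void parent_put(struct parent *p)
        {
                if (--p->refcount == 0) {
                        printf("parent freed\n");
                        free(p);
                }
        }

        /*
         * Buggy shape: drop the watch's reference first, keep using 'parent'
         * afterwards -- a use-after-free whenever that was the last reference.
         * Fixed shape (below): pin the object with our own reference up front.
         */
        static void remove_watch_rule_sketch(struct parent *parent)
        {
                parent_get(parent);     /* the fix: our own reference, taken early */

                parent_put(parent);     /* drop the watch's reference */
                printf("still safe to use parent, refcount=%d\n", parent->refcount);

                parent_put(parent);     /* drop our reference when done */
        }

        int main(void)
        {
                struct parent *p = malloc(sizeof(*p));

                if (!p)
                        return 1;
                p->refcount = 1;        /* only the watch holds a reference */
                remove_watch_rule_sketch(p);
                return 0;
        }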
  11. 14 August 2017 (2 commits)
  12. 11 August 2017 (2 commits)
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Authored by Nadav Amit
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races caused by batching during reclamation were recently
      handled by Mel, and this patch set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion that checks tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range(), and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16af97dc
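      A compact sketch of the counting approach the series moves to, with C11 atomics
      standing in for the kernel's atomic_t and illustrative names: each thread that
      batches flushes increments the counter before changing PTEs and decrements it after
      flushing, so the "pending" indication can only drop once every batching thread is done.

        #include <stdatomic.h>
        #include <stdbool.h>

        /* lives in mm_struct in the kernel; a single global stands in here */
        static atomic_int tlb_flush_pending;

        static void inc_tlb_flush_pending_sketch(void)
        {
                atomic_fetch_add(&tlb_flush_pending, 1);
        }

        static void dec_tlb_flush_pending_sketch(void)
        {
                /* called after this thread's batched flush has been performed */
                atomic_fetch_sub(&tlb_flush_pending, 1);
        }

        static bool mm_tlb_flush_pending_sketch(void)
        {
                /* readers (e.g. migration) see "pending" until every batching
                 * thread has finished, unlike the racy set/clear of a plain flag */
                return atomic_load(&tlb_flush_pending) > 0;
        }

        int main(void)
        {
                inc_tlb_flush_pending_sketch();         /* thread A starts batching */
                inc_tlb_flush_pending_sketch();         /* thread B starts batching */
                dec_tlb_flush_pending_sketch();         /* thread A flushes and clears */

                /* a plain bool flag would already read "not pending" here */
                return mm_tlb_flush_pending_sketch() ? 0 : 1;
        }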
    • mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Authored by Johannes Weiner
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation-failure info dumps, can cause early -ENOMEM from
      overcommit protection, and can miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
  13. 10 August 2017 (18 commits)