1. 09 Jun 2016, 1 commit
    • futex: Calculate the futex key based on a tail page for file-based futexes · 077fa7ae
      Authored by Mel Gorman
      Mike Galbraith reported that the LTP test case futex_wake04 was broken
      by commit 65d8fc77 ("futex: Remove requirement for lock_page()
      in get_futex_key()").
      
      This test case uses futexes backed by hugetlbfs pages and so there is an
      associated inode with a futex stored on such pages. The problem is that
      the key is being calculated based on the head page index of the hugetlbfs
      page and not the tail page.
      
      Prior to the optimisation, the page lock was used to stabilise mappings and
      to pin the inode if the page was file-backed, which is overkill. If the page was a compound
      page, the head page was automatically looked up as part of the page lock
      operation but the tail page index was used to calculate the futex key.
      
      After the optimisation, the compound head is looked up early and the page
      lock is only relied upon to identify truncated pages, special pages or a
      shmem page moving to swapcache. The head page is looked up because without
      the page lock, special care has to be taken to pin the inode correctly.
      However, the tail page is still required to calculate the futex key so
      this patch records the tail page.
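      
      As a rough illustration of why the head page index is the wrong input,
      consider this standalone sketch (not kernel code; the 2MiB/4KiB page
      sizes are just example values):
      
          #include <stdio.h>
          
          int main(void)
          {
                  /* Example geometry: 2MiB huge pages built from 4KiB base pages. */
                  unsigned long base_shift = 12, huge_shift = 21;
                  /* A futex in the 4th base page of the 6th huge page. */
                  unsigned long addr = (5UL << huge_shift) + (3UL << base_shift);
          
                  /* Head-derived index: loses the offset within the huge page,
                   * so distinct futexes in one huge page get the same key. */
                  unsigned long head_index =
                          (addr >> huge_shift) << (huge_shift - base_shift);
                  /* Tail (base page) derived index: unique per futex. */
                  unsigned long tail_index = addr >> base_shift;
          
                  printf("head-derived index: %lu\n", head_index);   /* 2560 */
                  printf("tail-derived index: %lu\n", tail_index);   /* 2563 */
                  return 0;
          }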
      
      On vanilla 4.6, the output of the test case is:
      
      futex_wake04    0  TINFO  :  Hugepagesize 2097152
      futex_wake04    1  TFAIL  :  futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.
      
      With the patch applied
      
      futex_wake04    0  TINFO  :  Hugepagesize 2097152
      futex_wake04    1  TPASS  :  Hi hydra, thread2 awake!
      
      Fixes: 65d8fc77 ("futex: Remove requirement for lock_page() in get_futex_key()")
      Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 23 May 2016, 1 commit
    • x86: remove more uaccess_32.h complexity · bd28b145
      Authored by Linus Torvalds
      I'm looking at trying to possibly merge the 32-bit and 64-bit versions
      of the x86 uaccess.h implementation, but first this needs to be cleaned
      up.
      
      For example, the 32-bit version of "__copy_from_user_inatomic()" is
      mostly the special cases for the constant size, and it's actually almost
      never relevant.  Most users aren't actually using a constant size
      anyway, and the few cases that do small constant copies are better off
      just using __get_user() instead.
      
      So get rid of the unnecessary complexity.
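      
      For instance, a hedged sketch of the kind of call site this affects
      (kernel context; 'uaddr' and 'val' are illustrative names):
      
          u32 val;
          
          /* Before: hits the constant-size special cases of the 32-bit code. */
          if (__copy_from_user_inatomic(&val, uaddr, sizeof(val)))
                  return -EFAULT;
          
          /* After: __get_user() is simpler and compiles at least as well. */
          if (__get_user(val, (u32 __user *)uaddr))
                  return -EFAULT;
      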
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 21 Apr 2016, 1 commit
  4. 20 Apr 2016, 1 commit
  5. 09 Mar 2016, 1 commit
  6. 17 Feb 2016, 2 commits
    • futex: Remove requirement for lock_page() in get_futex_key() · 65d8fc77
      Authored by Mel Gorman
      When dealing with key handling for shared futexes, we can drastically reduce
      the usage/need of the page lock. 1) For anonymous pages, the associated futex
      object is the mm_struct, which does not require the page lock. 2) For
      inode-based keys, we can check under the RCU read lock if the page mapping
      is still valid and take a reference to the inode. This leaves just one rare race that
      requires the page lock in the slow path when examining the swapcache.
      
      Additionally, realtime users currently have a problem with the page lock being
      contended for unbounded periods of time during futex operations.
      
      Task A
           get_futex_key()
           lock_page()
          ---> preempted
      
      Now any other task trying to lock that page will have to wait until
      task A gets scheduled back in, which is an unbounded amount of time.
      
      With this patch, we pretty much have a lockless get_futex_key().
      
      Experiments show that this patch can boost/speed up the hashing of shared
      futexes with the perf futex benchmarks (which is good for measuring such
      changes) by up to 45% when there are high (> 100) thread counts on a 60 core
      Westmere. Lower counts are pretty much in the noise range or less than 10%,
      but the mid range can be seen at over 30% overall throughput (hash ops/sec).
      This makes anon-mem shared futexes much closer to their private counterparts.
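      
      In rough outline, the lockless inode-based path looks like this (a heavily
      simplified sketch of the get_futex_key() changes; error handling and the
      retry/slow paths are omitted):
      
          rcu_read_lock();
          
          /* Re-check the mapping and pin the inode without lock_page(). */
          if (READ_ONCE(page->mapping) != mapping ||
              !atomic_inc_not_zero(&inode->i_count)) {
                  rcu_read_unlock();
                  goto again;             /* rare: retry / take the slow path */
          }
          rcu_read_unlock();
          
          key->both.offset |= FUT_OFF_INODE;
          key->shared.inode = inode;
          key->shared.pgoff = basepage_index(page);
      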
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Ported on top of thp refcount rework, changelog, comments, fixes. ]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • futex: Rename barrier references in ordering guarantees · 8ad7b378
      Authored by Davidlohr Bueso
      Ingo suggested we rename how we reference barriers A and B
      regarding futex ordering guarantees. This patch replaces,
      for both barriers, MB (A) with smp_mb(); (A), such that:
      
       - We explicitly state that the barriers are SMP, and
      
       - We standardize how we reference these across futex.c,
         helping readers follow which barrier does what and where.
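      
      I.e., the ordering comments now read along these lines (an illustrative
      before/after excerpt, not the full comment block):
      
          Before:   waiters++;
                    MB (A)           <-- paired with (B)
          
          After:    waiters++; (a)
                    smp_mb(); (A)    <-- paired with (B)
      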
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1455045314-8305-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 26 Jan 2016, 1 commit
    • rtmutex: Make wait_lock irq safe · b4abf910
      Authored by Thomas Gleixner
      Sasha reported a lockdep splat about a potential deadlock between RCU boosting
      rtmutex and the posix timer it_lock.
      
      CPU0					CPU1
      
      rtmutex_lock(&rcu->rt_mutex)
        spin_lock(&rcu->rt_mutex.wait_lock)
      					local_irq_disable()
      					spin_lock(&timer->it_lock)
      					spin_lock(&rcu->rt_mutex.wait_lock)
      --> Interrupt
          spin_lock(&timer->it_lock)
      
      This is caused by the following code sequence on CPU1
      
           rcu_read_lock()
           x = lookup();
           if (x)
           	spin_lock_irqsave(&x->it_lock, flags);
           rcu_read_unlock();
           return x;
      
      We could fix that in the posix timer code by keeping the RCU read lock held
      across the spinlocked, irq-disabled section, but the above sequence is
      common and there is no reason not to support it.
      
      Making rt_mutex.wait_lock irq safe prevents the deadlock.
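      
      A condensed sketch of the change (the actual patch converts all the
      wait_lock sites in kernel/locking/rtmutex.c):
      
          unsigned long flags;
          
          /* Before: plain raw_spin_lock(&lock->wait_lock), not irq safe. */
          raw_spin_lock_irqsave(&lock->wait_lock, flags);
          /* ... rt_mutex slow path work under wait_lock ... */
          raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
      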
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
  8. 21 Jan 2016, 1 commit
    • ptrace: use fsuid, fsgid, effective creds for fs access checks · caaee623
      Authored by Jann Horn
      By checking the effective credentials instead of the real UID / permitted
      capabilities, ensure that the calling process actually intended to use its
      credentials.
      
      To ensure that all ptrace checks use the correct caller credentials (e.g.
      in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
      flag), use two new flags and require one of them to be set.
      
      The problem was that when a privileged task had temporarily dropped its
      privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
      perform the following syscalls with the credentials of a user, it still passed
      ptrace access checks that the user would not be able to pass.
      
      While an attacker should not be able to convince the privileged task to
      perform a ptrace() syscall, this is a problem because the ptrace access
      check is reused for things in procfs.
      
      In particular, the following somewhat interesting procfs entries only rely
      on ptrace access checks:
      
       /proc/$pid/stat - uses the check for determining whether pointers
           should be visible, useful for bypassing ASLR
       /proc/$pid/maps - also useful for bypassing ASLR
       /proc/$pid/cwd - useful for gaining access to restricted
           directories that contain files with lax permissions, e.g. in
           this scenario:
           lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
           drwx------ root root /root
           drwxr-xr-x root root /root/foobar
           -rw-r--r-- root root /root/foobar/secret
      
      Therefore, on a system where a root-owned mode 6755 binary changes its
      effective credentials as described and then dumps a user-specified file,
      this could be used by an attacker to reveal the memory layout of root's
      processes or reveal the contents of files he is not allowed to access
      (through /proc/$pid/cwd).
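      
      With the two new flag variants, a caller must state explicitly which
      credentials the check uses; roughly (sketch):
      
          /* procfs-style fs access check: use fsuid/fsgid */
          if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
                  return -EACCES;
          
          /* an actual ptrace attach: use the real credentials */
          if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
                  return -EPERM;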
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 16 Jan 2016, 2 commits
  10. 20 Dec 2015, 6 commits
  11. 04 Oct 2015, 1 commit
  12. 22 Sep 2015, 1 commit
  13. 21 Jul 2015, 1 commit
  14. 20 Jul 2015, 2 commits
  15. 20 Jun 2015, 1 commit
  16. 19 May 2015, 1 commit
  17. 08 May 2015, 1 commit
    • futex: Implement lockless wakeups · 1d0dcb3a
      Authored by Davidlohr Bueso
      Given the overall futex architecture, any chance of reducing
      hb->lock contention is welcome. In this particular case, using
      wake-queues to enable lockless wakeups addresses very real
      world performance concerns, including cases of soft lockups with
      large numbers of blocked tasks (not hard to trigger on large boxes
      using just a handful of futexes).
      
      At the lowest level, this patch can reduce the latency of a single thread
      attempting to acquire hb->lock in highly contended scenarios by up to 2x.
      At lower nr_wake counts there are no regressions, confirming, of course,
      that the wake_q handling overhead is practically non-existent. For
      instance, while there is a fair amount of variation, the extended
      perf-bench wakeup benchmark shows the following average per-thread time
      to wake up its share of tasks on a 20-core machine:
      
      	nr_thr	ms-before	ms-after
      	16 	0.0590		0.0215
      	32 	0.0396		0.0220
      	48 	0.0417		0.0182
      	64 	0.0536		0.0236
      	80 	0.0414		0.0097
      	96 	0.0672		0.0152
      
      Naturally, this can cause spurious wakeups. However, there is no core code
      that cannot handle them as far as I can tell, and furthermore tglx does
      have a point that other events can already trigger them anyway.
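      
      The core pattern is to collect tasks under hb->lock and defer the actual
      wakeups until after the lock is dropped; a simplified sketch (the macro
      was spelled WAKE_Q() when this patch went in, DEFINE_WAKE_Q() in later
      kernels, and the real code goes through mark_wake_futex()):
      
          WAKE_Q(wake_q);
          
          spin_lock(&hb->lock);
          plist_for_each_entry_safe(this, next, &hb->chain, list) {
                  if (match_futex(&this->key, &key)) {
                          /* Queue the task; no wakeup happens yet. */
                          wake_q_add(&wake_q, this->task);
                          if (++ret >= nr_wake)
                                  break;
                  }
          }
          spin_unlock(&hb->lock);
          
          /* Perform all the wakeups without holding hb->lock. */
          wake_up_q(&wake_q);
      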
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: George Spelvin <linux@horizon.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430494072-30283-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 22 Apr 2015, 1 commit
  19. 18 Feb 2015, 1 commit
  20. 13 Feb 2015, 1 commit
    • all arches, signal: move restart_block to struct task_struct · f56141e3
      Authored by Andy Lutomirski
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
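      
      The conversion itself is mechanical; conceptually (sketch):
      
          /* Before: in thread_info, i.e. in the stack allocation. */
          current_thread_info()->restart_block.fn = do_no_restart_syscall;
          
          /* After: in task_struct, away from the kernel stack. */
          current->restart_block.fn = do_no_restart_syscall;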
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 19 Jan 2015, 1 commit
    • futex: Fix argument handling in futex_lock_pi() calls · 996636dd
      Authored by Michael Kerrisk
      This patch fixes two separate buglets in calls to futex_lock_pi():
      
        * Eliminate unused 'detect' argument
        * Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL
      
      The 'detect' argument of futex_lock_pi() seems never to have been
      used: when it was included with the initial PI mutex implementation
      in Linux 2.6.18, all checks against its value were disabled by
      ANDing against 0 (i.e., if (detect... && 0)), and with
      commit 778e9a9c any mention of
      this argument in futex_lock_pi() went away altogether. Its presence
      now serves only to confuse readers of the code, by giving the
      impression that the futex() FUTEX_LOCK_PI operation actually does
      use the 'val' argument. This patch removes the argument.
      
      The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes
      'timeout' as one of its arguments. This misleads the reader into thinking
      that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible
      purpose; but it does not.  Indeed, it cannot, because the checks at the
      start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations
      that do copy_from_user() on the timeout argument. So, in the
      FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change
      'timeout' to 'NULL'. This patch does that.
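      
      After the patch, the two do_futex() call sites read roughly as follows
      (condensed sketch of the result):
      
          case FUTEX_LOCK_PI:
                  return futex_lock_pi(uaddr, flags, timeout, 0);
          case FUTEX_TRYLOCK_PI:
                  /* No timeout is ever copied in for TRYLOCK_PI: pass NULL. */
                  return futex_lock_pi(uaddr, flags, NULL, 1);
      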
      Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Reviewed-by: Darren Hart <darren@dvhart.com>
      Link: http://lkml.kernel.org/r/54B96646.8010200@gmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  22. 26 Oct 2014, 2 commits
  23. 19 Oct 2014, 1 commit
    • futex: Ensure get_futex_key_refs() always implies a barrier · 76835b0e
      Authored by Catalin Marinas
      Commit b0c29f79 (futexes: Avoid taking the hb->lock if there's
      nothing to wake up) changes the futex code to avoid taking a lock when
      there are no waiters. This code has been subsequently fixed in commit
      11d4616b (futex: revert back to the explicit waiter counting code).
      Both the original commit and the fix-up rely on get_futex_key_refs() to
      always imply a barrier.
      
      However, for private futexes, none of the cases in the switch statement
      of get_futex_key_refs() would be hit and the function completes without
      a memory barrier as required before checking the "waiters" in
      futex_wake() -> hb_waiters_pending(). The consequence is a race with a
      thread waiting on a futex on another CPU, allowing the waker thread to
      read "waiters == 0" while the waiter thread has read "futex_val ==
      locked" (in the kernel).
      
      Without this fix, the problem (user space deadlocks) can be seen with
      Android bionic's mutex implementation on an arm64 multi-cluster system.
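      
      The fix adds an explicit barrier for the private-futex path; roughly
      (sketch of get_futex_key_refs() after the fix):
      
          switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
          case FUT_OFF_INODE:
                  ihold(key->shared.inode);       /* implies smp_mb(); (B) */
                  break;
          case FUT_OFF_MMSHARED:
                  futex_get_mm(key);              /* implies smp_mb(); (B) */
                  break;
          default:
                  /* Private futexes land here: the barrier must still happen. */
                  smp_mb();
          }
      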
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
      Fixes: b0c29f79 ("futexes: Avoid taking the hb->lock if there's nothing to wake up")
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 13 Sep 2014, 1 commit
  25. 22 Jun 2014, 6 commits
  26. 06 Jun 2014, 1 commit
    • futex: Make lookup_pi_state more robust · 54a21788
      Authored by Thomas Gleixner
      The current implementation of lookup_pi_state has ambiguous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      that kernel internal state (waiters) is found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.
      
      Handle the cases explicitly:
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non-PI futex
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
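      
      The checks the patch adds follow this table closely; a condensed sketch
      (case numbers as above, with uval being the user space futex value and
      pid its TID field):
      
          if (uval & FUTEX_OWNER_DIED) {
                  if (!pi_state->owner) {
                          if (pid)
                                  return -EINVAL;         /* [5] */
                          goto out_state;                 /* [4] */
                  }
                  if (!pid)
                          goto out_state;                 /* [6] */
          } else {
                  if (!pi_state->owner)
                          return -EINVAL;                 /* [7] */
          }
          if (pid != task_pid_vnr(pi_state->owner))
                  return -EINVAL;                         /* [10] */
          out_state:
                  atomic_inc(&pi_state->refcount);        /* [8] take the state */
      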
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>