1. 13 9月, 2014 1 次提交
  2. 22 6月, 2014 6 次提交
  3. 06 6月, 2014 4 次提交
    • T
      futex: Make lookup_pi_state more robust · 54a21788
      Thomas Gleixner 提交于
      The current implementation of lookup_pi_state has ambigous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      that kernel internal state (waiters) are found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.
      
      Handle the cases explicit:
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non PI futex
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a21788
    • T
      futex: Always cleanup owner tid in unlock_pi · 13fbca4c
      Thomas Gleixner 提交于
      If the owner died bit is set at futex_unlock_pi, we currently do not
      cleanup the user space futex.  So the owner TID of the current owner
      (the unlocker) persists.  That's observable inconsistant state,
      especially when the ownership of the pi state got transferred.
      
      Clean it up unconditionally.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13fbca4c
    • T
      futex: Validate atomic acquisition in futex_lock_pi_atomic() · b3eaa9fc
      Thomas Gleixner 提交于
      We need to protect the atomic acquisition in the kernel against rogue
      user space which sets the user space futex to 0, so the kernel side
      acquisition succeeds while there is existing state in the kernel
      associated to the real owner.
      
      Verify whether the futex has waiters associated with kernel state.  If
      it has, return -EINVAL.  The state is corrupted already, so no point in
      cleaning it up.  Subsequent calls will fail as well.  Not our problem.
      
      [ tglx: Use futex_top_waiter() and explain why we do not need to try
        	restoring the already corrupted user space state. ]
      Signed-off-by: NDarren Hart <dvhart@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3eaa9fc
    • T
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in... · e9c243a5
      Thomas Gleixner 提交于
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      
      If uaddr == uaddr2, then we have broken the rule of only requeueing from
      a non-pi futex to a pi futex with this call.  If we attempt this, then
      dangling pointers may be left for rt_waiter resulting in an exploitable
      condition.
      
      This change brings futex_requeue() in line with futex_wait_requeue_pi()
      which performs the same check as per commit 6f7b0a2a ("futex: Forbid
      uaddr == uaddr2 in futex_wait_requeue_pi()")
      
      [ tglx: Compare the resulting keys as well, as uaddrs might be
        	different depending on the mapping ]
      
      Fixes CVE-2014-3153.
      
      Reported-by: Pinkie Pie
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9c243a5
  4. 19 5月, 2014 2 次提交
    • T
      futex: Prevent attaching to kernel threads · f0d71b3d
      Thomas Gleixner 提交于
      We happily allow userspace to declare a random kernel thread to be the
      owner of a user space PI futex.
      
      Found while analysing the fallout of Dave Jones syscall fuzzer.
      
      We also should validate the thread group for private futexes and find
      some fast way to validate whether the "alleged" owner has RW access on
      the file which backs the SHM, but that's a separate issue.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Carlos ODonell <carlos@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      f0d71b3d
    • T
      futex: Add another early deadlock detection check · 866293ee
      Thomas Gleixner 提交于
      Dave Jones trinity syscall fuzzer exposed an issue in the deadlock
      detection code of rtmutex:
        http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com
      
      That underlying issue has been fixed with a patch to the rtmutex code,
      but the futex code must not call into rtmutex in that case because
          - it can detect that issue early
          - it avoids a different and more complex fixup for backing out
      
      If the user space variable got manipulated to 0x80000000 which means
      no lock holder, but the waiters bit set and an active pi_state in the
      kernel is found we can figure out the recursive locking issue by
      looking at the pi_state owner. If that is the current task, then we
      can safely return -EDEADLK.
      
      The check should have been added in commit 59fa6245 (futex: Handle
      futex_pi OWNER_DIED take over correctly) already, but I did not see
      the above issue caused by user space manipulation back then.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Carlos ODonell <carlos@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      866293ee
  5. 18 4月, 2014 1 次提交
  6. 13 4月, 2014 1 次提交
  7. 09 4月, 2014 1 次提交
    • L
      futex: avoid race between requeue and wake · 69cd9eba
      Linus Torvalds 提交于
      Jan Stancek reported:
       "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
        occasionally fails, because some threads fail to wake up.
      
        Testcase creates 5 threads, which are all waiting on same condition.
        Main thread then calls pthread_cond_broadcast() without holding mutex,
        which calls:
      
            futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)
      
        This immediately wakes up single thread A, which unlocks mutex and
        tries to wake up another thread:
      
            futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)
      
        If thread A manages to call futex_wake() before any waiters are
        requeued for uaddr2, no other thread is woken up"
      
      The ordering constraints for the hash bucket waiter counting are that
      the waiter counts have to be incremented _before_ getting the spinlock
      (because the spinlock acts as part of the memory barrier), but the
      "requeue" operation didn't honor those rules, and nobody had even
      thought about that case.
      
      This fairly simple patch just increments the waiter count for the target
      hash bucket (hb2) when requeing a futex before taking the locks.  It
      then decrements them again after releasing the lock - the code that
      actually moves the futex(es) between hash buckets will do the additional
      required waiter count housekeeping.
      Reported-and-tested-by: NJan Stancek <jstancek@redhat.com>
      Acked-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # 3.14
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69cd9eba
  8. 21 3月, 2014 1 次提交
    • L
      futex: revert back to the explicit waiter counting code · 11d4616b
      Linus Torvalds 提交于
      Srikar Dronamraju reports that commit b0c29f79 ("futexes: Avoid
      taking the hb->lock if there's nothing to wake up") causes java threads
      getting stuck on futexes when runing specjbb on a power7 numa box.
      
      The cause appears to be that the powerpc spinlocks aren't using the same
      ticket lock model that we use on x86 (and other) architectures, which in
      turn result in the "spin_is_locked()" test in hb_waiters_pending()
      occasionally reporting an unlocked spinlock even when there are pending
      waiters.
      
      So this reinstates Davidlohr Bueso's original explicit waiter counting
      code, which I had convinced Davidlohr to drop in favor of figuring out
      the pending waiters by just using the existing state of the spinlock and
      the wait queue.
      Reported-and-tested-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Original-code-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11d4616b
  9. 03 3月, 2014 1 次提交
  10. 16 1月, 2014 1 次提交
    • H
      futexes: Fix futex_hashsize initialization · 63b1a816
      Heiko Carstens 提交于
      "futexes: Increase hash table size for better performance"
      introduces a new alloc_large_system_hash() call.
      
      alloc_large_system_hash() however may allocate less memory than
      requested, e.g. limited by MAX_ORDER.
      
      Hence pass a pointer to alloc_large_system_hash() which will
      contain the hash shift when the function returns. Afterwards
      correctly set futex_hashsize.
      
      Fixes a crash on s390 where the requested allocation size was
      4MB but only 1MB was allocated.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Link: http://lkml.kernel.org/r/20140116135450.GA4345@osirisSigned-off-by: NIngo Molnar <mingo@kernel.org>
      63b1a816
  11. 13 1月, 2014 5 次提交
    • P
      rtmutex: Turn the plist into an rb-tree · fb00aca4
      Peter Zijlstra 提交于
      Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
      and provide a proper comparison function for -deadline and
      -priority tasks.
      
      This is done mainly because:
       - classical prio field of the plist is just an int, which might
         not be enough for representing a deadline;
       - manipulating such a list would become O(nr_deadline_tasks),
         which might be to much, as the number of -deadline task increases.
      
      Therefore, an rb-tree is used, and tasks are queued in it according
      to the following logic:
       - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
         one with the higher (lower, actually!) prio wins;
       - among a -priority and a -deadline task, the latter always wins;
       - among two -deadline tasks, the one with the earliest deadline
         wins.
      
      Queueing and dequeueing functions are changed accordingly, for both
      the list of a task's pi-waiters and the list of tasks blocked on
      a pi-lock.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-again-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fb00aca4
    • D
      futexes: Avoid taking the hb->lock if there's nothing to wake up · b0c29f79
      Davidlohr Bueso 提交于
      In futex_wake() there is clearly no point in taking the hb->lock
      if we know beforehand that there are no tasks to be woken. While
      the hash bucket's plist head is a cheap way of knowing this, we
      cannot rely 100% on it as there is a racy window between the
      futex_wait call and when the task is actually added to the
      plist. To this end, we couple it with the spinlock check as
      tasks trying to enter the critical region are most likely
      potential waiters that will be added to the plist, thus
      preventing tasks sleeping forever if wakers don't acknowledge
      all possible waiters.
      
      Furthermore, the futex ordering guarantees are preserved,
      ensuring that waiters either observe the changed user space
      value before blocking or is woken by a concurrent waker. For
      wakers, this is done by relying on the barriers in
      get_futex_key_refs() -- for archs that do not have implicit mb
      in atomic_inc(), we explicitly add them through a new
      futex_get_mm function. For waiters we rely on the fact that
      spin_lock calls already update the head counter, so spinners
      are visible even if the lock hasn't been acquired yet.
      
      For more details please refer to the updated comments in the
      code and related discussion:
      
        https://lkml.org/lkml/2013/11/26/556
      
      Special thanks to tglx for careful review and feedback.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b0c29f79
    • T
      futexes: Document multiprocessor ordering guarantees · 99b60ce6
      Thomas Gleixner 提交于
      That's essential, if you want to hack on futexes.
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Reviewed-by: NPeter Zijlstra <peterz@infradead.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      99b60ce6
    • D
      futexes: Increase hash table size for better performance · a52b89eb
      Davidlohr Bueso 提交于
      Currently, the futex global hash table suffers from its fixed,
      smallish (for today's standards) size of 256 entries, as well as
      its lack of NUMA awareness. Large systems, using many futexes,
      can be prone to high amounts of collisions; where these futexes
      hash to the same bucket and lead to extra contention on the same
      hb->lock. Furthermore, cacheline bouncing is a reality when we
      have multiple hb->locks residing on the same cacheline and
      different futexes hash to adjacent buckets.
      
      This patch keeps the current static size of 16 entries for small
      systems, or otherwise, 256 * ncpus (or larger as we need to
      round the number to a power of 2). Note that this number of CPUs
      accounts for all CPUs that can ever be available in the system,
      taking into consideration things like hotpluging. While we do
      impose extra overhead at bootup by making the hash table larger,
      this is a one time thing, and does not shadow the benefits of
      this patch.
      
      Furthermore, as suggested by tglx, by cache aligning the hash
      buckets we can avoid access across cacheline boundaries and also
      avoid massive cache line bouncing if multiple cpus are hammering
      away at different hash buckets which happen to reside in the
      same cache line.
      
      Also, similar to other core kernel components (pid, dcache,
      tcp), by using alloc_large_system_hash() we benefit from its
      NUMA awareness and thus the table is distributed among the nodes
      instead of in a single one.
      
      For a custom microbenchmark that pounds on the uaddr hashing --
      making the wait path fail at futex_wait_setup() returning
      -EWOULDBLOCK for large amounts of futexes, we can see the
      following benefits on a 80-core, 8-socket 1Tb server:
      
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
       | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
       |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
       |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
       |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
       |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
       |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
       |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Reviewed-by: NPeter Zijlstra <peterz@infradead.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NWaiman Long <Waiman.Long@hp.com>
      Reviewed-and-tested-by: NJason Low <jason.low2@hp.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a52b89eb
    • J
      futexes: Clean up various details · 0d00c7b2
      Jason Low 提交于
      - Remove unnecessary head variables.
      - Delete unused parameter in queue_unlock().
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Reviewed-by: NPeter Zijlstra <peterz@infradead.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0d00c7b2
  12. 13 12月, 2013 2 次提交
    • L
      futex: move user address verification up to common code · 5cdec2d8
      Linus Torvalds 提交于
      When debugging the read-only hugepage case, I was confused by the fact
      that get_futex_key() did an access_ok() only for the non-shared futex
      case, since the user address checking really isn't in any way specific
      to the private key handling.
      
      Now, it turns out that the shared key handling does effectively do the
      equivalent checks inside get_user_pages_fast() (it doesn't actually
      check the address range on x86, but does check the page protections for
      being a user page).  So it wasn't actually a bug, but the fact that we
      treat the address differently for private and shared futexes threw me
      for a loop.
      
      Just move the check up, so that it gets done for both cases.  Also, use
      the 'rw' parameter for the type, even if it doesn't actually matter any
      more (it's a historical artifact of the old racy i386 "page faults from
      kernel space don't check write protections").
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cdec2d8
    • L
      futex: fix handling of read-only-mapped hugepages · f12d5bfc
      Linus Torvalds 提交于
      The hugepage code had the exact same bug that regular pages had in
      commit 7485d0d3 ("futexes: Remove rw parameter from
      get_futex_key()").
      
      The regular page case was fixed by commit 9ea71503 ("futex: Fix
      regression with read only mappings"), but the transparent hugepage case
      (added in a5b338f2: "thp: update futex compound knowledge") case
      remained broken.
      
      Found by Dave Jones and his trinity tool.
      Reported-and-tested-by: NDave Jones <davej@fedoraproject.org>
      Cc: stable@kernel.org # v2.6.38+
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f12d5bfc
  13. 06 11月, 2013 1 次提交
  14. 26 6月, 2013 2 次提交
    • C
      futex: Use freezable blocking call · 88c8004f
      Colin Cross 提交于
      Avoid waking up every thread sleeping in a futex_wait call during
      suspend and resume by calling a freezable blocking call.  Previous
      patches modified the freezer to avoid sending wakeups to threads
      that are blocked in freezable blocking calls.
      
      This call was selected to be converted to a freezable call because
      it doesn't hold any locks or release any resources when interrupted
      that might be needed by another freezing task or a kernel driver
      during suspend, and is a common site where idle userspace tasks are
      blocked.
      Signed-off-by: NColin Cross <ccross@android.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: arve@android.com
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      88c8004f
    • Z
      futex: Take hugepages into account when generating futex_key · 13d60f4b
      Zhang Yi 提交于
      The futex_keys of process shared futexes are generated from the page
      offset, the mapping host and the mapping index of the futex user space
      address. This should result in an unique identifier for each futex.
      
      Though this is not true when futexes are located in different subpages
      of an hugepage. The reason is, that the mapping index for all those
      futexes evaluates to the index of the base page of the hugetlbfs
      mapping. So a futex at offset 0 of the hugepage mapping and another
      one at offset PAGE_SIZE of the same hugepage mapping have identical
      futex_keys. This happens because the futex code blindly uses
      page->index.
      
      Steps to reproduce the bug:
      
      1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
         and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
         mapping.
      
         The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
         PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
         their keys solely depend on the user space address.
      
      2. Lock mutex1 and mutex2
      
      3. Create thread1 and in the thread function lock mutex1, which
         results in thread1 blocking on the locked mutex1.
      
      4. Create thread2 and in the thread function lock mutex2, which
         results in thread2 blocking on the locked mutex2.
      
      5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
         still blocks on mutex2 because the futex_key points to mutex1.
      
      To solve this issue we need to take the normal page index of the page
      which contains the futex into account, if the futex is in an hugetlbfs
      mapping. In other words, we calculate the normal page mapping index of
      the subpage in the hugetlbfs mapping.
      
      Mappings which are not based on hugetlbfs are not affected and still
      use page->index.
      
      Thanks to Mel Gorman who provided a patch for adding proper evaluation
      functions to the hugetlbfs code to avoid exposing hugetlbfs specific
      details to the futex code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: NZhang Yi <zhang.yi20@zte.com.cn>
      Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Tested-by: NMa Chenggong <ma.chenggong@zte.com.cn>
      Reviewed-by: N'Mel Gorman' <mgorman@suse.de>
      Acked-by: N'Darren Hart' <dvhart@linux.intel.com>
      Cc: 'Peter Zijlstra' <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      13d60f4b
  15. 12 5月, 2013 1 次提交
  16. 13 3月, 2013 1 次提交
  17. 28 2月, 2013 1 次提交
  18. 19 2月, 2013 1 次提交
    • T
      futex: Revert "futex: Mark get_robust_list as deprecated" · fe2b05f7
      Thomas Gleixner 提交于
      This reverts commit ec0c4274.
      
      get_robust_list() is in use and a removal would break existing user
      space. With the permission checks in place it's not longer a security
      hole. Remove the deprecation warnings.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: akpm@linux-foundation.org
      Cc: paul.gortmaker@windriver.com
      Cc: davej@redhat.com
      Cc: keescook@chromium.org
      Cc: stable@vger.kernel.org
      Cc: ebiederm@xmission.com
      fe2b05f7
  19. 08 2月, 2013 1 次提交
  20. 27 11月, 2012 1 次提交
    • D
      futex: avoid wake_futex() for a PI futex_q · aa10990e
      Darren Hart 提交于
      Dave Jones reported a bug with futex_lock_pi() that his trinity test
      exposed.  Sometime between queue_me() and taking the q.lock_ptr, the
      lock_ptr became NULL, resulting in a crash.
      
      While futex_wake() is careful to not call wake_futex() on futex_q's with
      a pi_state or an rt_waiter (which are either waiting for a
      futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
      futex_requeue() do not perform the same test.
      
      Update futex_wake_op() and futex_requeue() to test for q.pi_state and
      q.rt_waiter and abort with -EINVAL if detected.  To ensure any future
      breakage is caught, add a WARN() to wake_futex() if the same condition
      is true.
      
      This fix has seen 3 hours of testing with "trinity -c futex" on an
      x86_64 VM with 4 CPUS.
      
      [akpm@linux-foundation.org: tidy up the WARN()]
      Signed-off-by: NDarren Hart <dvhart@linux.intel.com>
      Reported-by: NDave Jones <davej@redat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa10990e
  21. 01 11月, 2012 1 次提交
    • T
      futex: Handle futex_pi OWNER_DIED take over correctly · 59fa6245
      Thomas Gleixner 提交于
      Siddhesh analyzed a failure in the take over of pi futexes in case the
      owner died and provided a workaround.
      See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076
      
      The detailed problem analysis shows:
      
      Futex F is initialized with PTHREAD_PRIO_INHERIT and
      PTHREAD_MUTEX_ROBUST_NP attributes.
      
      T1 lock_futex_pi(F);
      
      T2 lock_futex_pi(F);
         --> T2 blocks on the futex and creates pi_state which is associated
             to T1.
      
      T1 exits
         --> exit_robust_list() runs
             --> Futex F userspace value TID field is set to 0 and
                 FUTEX_OWNER_DIED bit is set.
      
      T3 lock_futex_pi(F);
         --> Succeeds due to the check for F's userspace TID field == 0
         --> Claims ownership of the futex and sets its own TID into the
             userspace TID field of futex F
         --> returns to user space
      
      T1 --> exit_pi_state_list()
             --> Transfers pi_state to waiter T2 and wakes T2 via
             	   rt_mutex_unlock(&pi_state->mutex)
      
      T2 --> acquires pi_state->mutex and gains real ownership of the
             pi_state
         --> Claims ownership of the futex and sets its own TID into the
             userspace TID field of futex F
         --> returns to user space
      
      T3 --> observes inconsistent state
      
      This problem is independent of UP/SMP, preemptible/non preemptible
      kernels, or process shared vs. private. The only difference is that
      certain configurations are more likely to expose it.
      
      So as Siddhesh correctly analyzed the following check in
      futex_lock_pi_atomic() is the culprit:
      
      	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
      
      We check the userspace value for a TID value of 0 and take over the
      futex unconditionally if that's true.
      
      AFAICT this check is there as it is correct for a different corner
      case of futexes: the WAITERS bit became stale.
      
      Now the proposed change
      
      -	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
      +       if (unlikely(ownerdied ||
      +                       !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
      
      solves the problem, but it's not obvious why and it wreckages the
      "stale WAITERS bit" case.
      
      What happens is, that due to the WAITERS bit being set (T2 is blocked
      on that futex) it enforces T3 to go through lookup_pi_state(), which
      in the above case returns an existing pi_state and therefor forces T3
      to legitimately fight with T2 over the ownership of the pi_state (via
      pi_state->mutex). Probelm solved!
      
      Though that does not work for the "WAITERS bit is stale" problem
      because if lookup_pi_state() does not find existing pi_state it
      returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
      return -ESRCH to user space because the OWNER_DIED bit is not set.
      
      Now there is a different solution to that problem. Do not look at the
      user space value at all and enforce a lookup of possibly available
      pi_state. If pi_state can be found, then the new incoming locker T3
      blocks on that pi_state and legitimately races with T2 to acquire the
      rt_mutex and the pi_state and therefor the proper ownership of the
      user space futex.
      
      lookup_pi_state() has the correct order of checks. It first tries to
      find a pi_state associated with the user space futex and only if that
      fails it checks for futex TID value = 0. If no pi_state is available
      nothing can create new state at that point because this happens with
      the hash bucket lock held.
      
      So the above scenario changes to:
      
      T1 lock_futex_pi(F);
      
      T2 lock_futex_pi(F);
         --> T2 blocks on the futex and creates pi_state which is associated
             to T1.
      
      T1 exits
         --> exit_robust_list() runs
             --> Futex F userspace value TID field is set to 0 and
                 FUTEX_OWNER_DIED bit is set.
      
      T3 lock_futex_pi(F);
         --> Finds pi_state and blocks on pi_state->rt_mutex
      
      T1 --> exit_pi_state_list()
             --> Transfers pi_state to waiter T2 and wakes it via
             	   rt_mutex_unlock(&pi_state->mutex)
      
      T2 --> acquires pi_state->mutex and gains ownership of the pi_state
         --> Claims ownership of the futex and sets its own TID into the
             userspace TID field of futex F
         --> returns to user space
      
      This covers all gazillion points on which T3 might come in between
      T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
      also solves the "WAITERS bit stale" problem by forcing the take over.
      
      Another benefit of changing the code this way is that it makes it less
      dependent on untrusted user space values and therefor minimizes the
      possible wreckage which might be inflicted.
      
      As usual after staring for too long at the futex code my brain hurts
      so much that I really want to ditch that whole optimization of
      avoiding the syscall for the non contended case for PI futexes and rip
      out the maze of corner case handling code. Unfortunately we can't as
      user space relies on that existing behaviour, but at least thinking
      about it helps me to preserve my mental sanity. Maybe we should
      nevertheless :)
      Reported-and-tested-by: NSiddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionosAcked-by: NDarren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      59fa6245
  22. 24 7月, 2012 3 次提交
  23. 29 3月, 2012 1 次提交