1. 25 October 2016, 3 commits
    • locking/mutex: Restructure wait loop · 5bbd7e64
      Peter Zijlstra authored
      Doesn't really matter yet, but pull the HANDOFF and trylock out from
      under the wait_lock.
      
      The intention is to add an optimistic spin loop here, which requires
      we do not hold the wait_lock, so shuffle code around in preparation.
      
      Also clarify the purpose of taking the wait_lock in the wait loop:
      it's tempting to avoid it altogether, but the cancellation cases
      need it to avoid losing wakeups.
      Suggested-by: Waiman Long <waiman.long@hpe.com>
      Tested-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5bbd7e64
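      A user-space model may help picture the resulting loop shape. This is a
      sketch under invented names (model_mutex, try_acquire), not the kernel
      code: the trylock runs without the wait_lock held, and the wait_lock is
      retaken only to re-check the sleep condition, which is what keeps a
      racing release or cancellation from losing a wakeup.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdbool.h>

        struct model_mutex {
            atomic_long owner;          /* 0 == unlocked */
            pthread_mutex_t wait_lock;  /* stands in for the wait list lock */
            pthread_cond_t  wait;       /* stands in for the wait list */
        };

        static bool try_acquire(struct model_mutex *m, long self)
        {
            long zero = 0;
            return atomic_compare_exchange_strong(&m->owner, &zero, self);
        }

        static void lock_slowpath(struct model_mutex *m, long self)
        {
            for (;;) {
                if (try_acquire(m, self))   /* no wait_lock held here */
                    return;
                pthread_mutex_lock(&m->wait_lock);
                /* Re-check under wait_lock: a release between the failed
                 * trylock and this point must not leave us sleeping. */
                if (atomic_load(&m->owner) != 0)
                    pthread_cond_wait(&m->wait, &m->wait_lock);
                pthread_mutex_unlock(&m->wait_lock);
            }
        }

        static void unlock(struct model_mutex *m)
        {
            pthread_mutex_lock(&m->wait_lock);
            atomic_store(&m->owner, 0);
            pthread_cond_broadcast(&m->wait);
            pthread_mutex_unlock(&m->wait_lock);
        }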
    • locking/mutex: Add lock handoff to avoid starvation · 9d659ae1
      Peter Zijlstra authored
      Implement lock handoff to avoid lock starvation.
      
      Lock starvation is possible because mutex_lock() allows lock stealing,
      where a running (or optimistic spinning) task beats the woken waiter
      to the acquire.
      
      Lock stealing is an important performance optimization because waiting
      for a waiter to wake up and get runtime can take a significant time,
      during which everybody would stall on the lock.
      
      The down-side is of course that it allows for starvation.
      
      This patch has the waiter requesting a handoff if it fails to acquire
      the lock upon waking. This re-introduces some of the wait time,
      because once we do a handoff we have to wait for the waiter to wake up
      again.
      
      A future patch will add a round of optimistic spinning to attempt to
      alleviate this penalty, but if that turns out to not be enough, we can
      add a counter and only request handoff after multiple failed wakeups.
      
      There are a few tricky implementation details:
      
       - accepting a handoff must only be done in the wait-loop. Since the
         handoff condition is owner == current, it can easily cause
         recursive locking trouble.
      
       - accepting the handoff must be careful to provide the ACQUIRE
         semantics.
      
       - having the HANDOFF bit set on unlock requires care, we must not
         clear the owner.
      
       - we must be careful to not leave HANDOFF set after we've acquired
         the lock. The tricky scenario is setting the HANDOFF bit on an
         unlocked mutex.
      Tested-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9d659ae1
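      The tricky details above can be modelled compactly with C11 atomics.
      This is an illustrative sketch, not the kernel's implementation: the
      owner word carries a handoff-request bit, the unlock path installs the
      first waiter as owner in a single atomic step instead of clearing the
      word, and the woken waiter accepts only inside its wait loop.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdint.h>

        #define HANDOFF 0x1UL                 /* low bit of the owner word */

        static atomic_uintptr_t owner;        /* task pointer | flag bits */

        /* Woken waiter failed its trylock: request a handoff. */
        static void request_handoff(void)
        {
            atomic_fetch_or(&owner, HANDOFF);
        }

        /* Unlock: hand the lock to the first waiter if requested; the
         * handoff also clears HANDOFF so it cannot survive the acquire. */
        static bool unlock_or_handoff(uintptr_t first_waiter)
        {
            uintptr_t cur = atomic_load(&owner);
            for (;;) {
                uintptr_t next = (cur & HANDOFF) ? first_waiter : 0;
                if (atomic_compare_exchange_weak_explicit(&owner, &cur, next,
                        memory_order_release, memory_order_relaxed))
                    return next != 0;   /* true: ownership was handed off */
            }
        }

        /* Accepting the handoff happens in the wait loop only, with an
         * acquire load providing the ACQUIRE semantics of a lock. */
        static bool handoff_accepted(uintptr_t self)
        {
            uintptr_t cur = atomic_load_explicit(&owner, memory_order_acquire);
            return (cur & ~HANDOFF) == self;
        }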
    • locking/mutex: Rework mutex::owner · 3ca0ff57
      Peter Zijlstra authored
      The current mutex implementation has an atomic lock word and a
      non-atomic owner field.
      
      This disparity leads to a number of issues with the current mutex code
      as it means that we can have a locked mutex without an explicit owner
      (because the owner field has not been set, or already cleared).
      
      This leads to a number of weird corner cases, esp. between the
      optimistic spinning and debug code. Where the optimistic spinning
      code needs the owner field updated inside the lock region, the debug
      code is more relaxed because the whole lock is serialized by the
      wait_lock.
      
      Also, the spinning code itself has a few corner cases where we need to
      deal with a held lock without an owner field.
      
      Furthermore, it becomes even more of a problem when trying to fix
      starvation cases in the current code. We end up stacking special case
      on special case.
      
      To solve this rework the basic mutex implementation to be a single
      atomic word that contains the owner and uses the low bits for extra
      state.
      
      This matches how PI futexes and rt_mutex already work. By making the
      owner an integral part of the lock state, a lot of the problems
      disappear and we get a better option for dealing with starvation
      cases: direct owner handoff.
      
      Changing the basic mutex does, however, invalidate all the arch-specific
      mutex code; this patch leaves that unused in place, and a later patch
      will remove it.
      Tested-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ca0ff57
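      A sketch of the single-word encoding, with flag names invented as
      stand-ins for the kernel's MUTEX_FLAG_* bits: task_struct pointers are
      at least 8-byte aligned, so the low bits of the owner word are free to
      carry extra state, and owner plus state change together atomically.

        #include <stdatomic.h>
        #include <stdint.h>
        #include <stdio.h>

        #define FLAG_WAITERS 0x1UL   /* waiters queued: unlock must wake one */
        #define FLAG_HANDOFF 0x2UL   /* a waiter requested a direct handoff */
        #define FLAG_MASK    0x7UL

        static atomic_uintptr_t owner_word;   /* 0 == unlocked, no flags */

        static void *owner_task(void)
        {
            return (void *)(atomic_load(&owner_word) & ~FLAG_MASK);
        }

        static int trylock(void *self)
        {
            uintptr_t expected = 0;
            return atomic_compare_exchange_strong_explicit(&owner_word,
                    &expected, (uintptr_t)self,
                    memory_order_acquire, memory_order_relaxed);
        }

        int main(void)
        {
            static long task;   /* aligned object standing in for a task */
            if (trylock(&task))
                printf("owner=%p flags=%#lx\n", owner_task(),
                       (unsigned long)(atomic_load(&owner_word) & FLAG_MASK));
            return 0;
        }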
  2. 22 September 2016, 3 commits
  3. 18 August 2016, 3 commits
    • locking/rwsem: Scan the wait_list for readers only once · 70800c3c
      Davidlohr Bueso authored
      When wanting to wake up readers, __rwsem_mark_wakeup() currently
      iterates the wait_list twice while looking to wake up the first N
      queued reader-tasks. While this can be quite inefficient, it was
      there so that an awoken reader would be first and foremost
      acknowledged by the lock counter.
      
      Keeping the same logic, we can further benefit from the use of
      wake_qs and avoid entirely the first wait_list iteration that sets
      the counter as wake_up_process() isn't going to occur right away,
      and therefore we maintain the counter->list order of going about
      things.
      
      Other than saving cycles with O(n) "scanning", this change also
      nicely cleans up a good chunk of __rwsem_mark_wakeup(), making it
      both visually cleaner and less tedious to read.
      
      For example, the following improvements were seen on some
      will-it-scale microbenchmarks, on a 48-core Haswell:
      
                                             v4.7              v4.7-rwsem-v1
        Hmean    signal1-processes-8    5792691.42 (  0.00%)  5771971.04 ( -0.36%)
        Hmean    signal1-processes-12   6081199.96 (  0.00%)  6072174.38 ( -0.15%)
        Hmean    signal1-processes-21   3071137.71 (  0.00%)  3041336.72 ( -0.97%)
        Hmean    signal1-processes-48   3712039.98 (  0.00%)  3708113.59 ( -0.11%)
        Hmean    signal1-processes-79   4464573.45 (  0.00%)  4682798.66 (  4.89%)
        Hmean    signal1-processes-110  4486842.01 (  0.00%)  4633781.71 (  3.27%)
        Hmean    signal1-processes-141  4611816.83 (  0.00%)  4692725.38 (  1.75%)
        Hmean    signal1-processes-172  4638157.05 (  0.00%)  4714387.86 (  1.64%)
        Hmean    signal1-processes-203  4465077.80 (  0.00%)  4690348.07 (  5.05%)
        Hmean    signal1-processes-224  4410433.74 (  0.00%)  4687534.43 (  6.28%)
      
        Stddev   signal1-processes-8       6360.47 (  0.00%)     8455.31 ( 32.94%)
        Stddev   signal1-processes-12      4004.98 (  0.00%)     9156.13 (128.62%)
        Stddev   signal1-processes-21      3273.14 (  0.00%)     5016.80 ( 53.27%)
        Stddev   signal1-processes-48     28420.25 (  0.00%)    26576.22 ( -6.49%)
        Stddev   signal1-processes-79     22038.34 (  0.00%)    18992.70 (-13.82%)
        Stddev   signal1-processes-110    23226.93 (  0.00%)    17245.79 (-25.75%)
        Stddev   signal1-processes-141     6358.98 (  0.00%)     7636.14 ( 20.08%)
        Stddev   signal1-processes-172     9523.70 (  0.00%)     4824.75 (-49.34%)
        Stddev   signal1-processes-203    13915.33 (  0.00%)     9326.33 (-32.98%)
        Stddev   signal1-processes-224    15573.94 (  0.00%)    10613.82 (-31.85%)
      
      Other runs that saw improvements include context_switch and pipe; and
      as expected, this is particularly highlighted on larger thread counts
      as it becomes more expensive to walk the list twice.
      
      No change in wakeup ordering or semantics.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hp.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hpe.com
      Cc: wanpeng.li@hotmail.com
      Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      70800c3c
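      The single-pass shape can be sketched as follows, with invented types
      (struct waiter, struct wake_q) and an illustrative reader bias: one
      walk collects the front-of-queue readers on a wake queue and counts
      them, the shared counter is adjusted once, and the actual wakeups are
      issued later, after the wait_lock is dropped.

        #include <stdatomic.h>

        struct waiter { struct waiter *next; int is_reader; void *task; };
        struct wake_q { void *task[64]; int n; };

        #define READER_BIAS 1L   /* illustrative, not the kernel's value */

        static long mark_wake_readers(struct waiter *head, atomic_long *count,
                                      struct wake_q *q)
        {
            long woken = 0;

            /* one walk: queue each leading reader for a deferred wakeup */
            for (struct waiter *w = head; w && w->is_reader; w = w->next) {
                q->task[q->n++] = w->task;
                woken++;
            }

            /* one counter adjustment replaces the second list walk */
            atomic_fetch_add(count, woken * READER_BIAS);
            return woken;
        }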
    • locking/rwsem: Remove a few useless comments · c2867bba
      Davidlohr Bueso authored
      Our rwsem code (xadd, at least) is rather well documented, but
      there are a few really annoying comments in there that serve
      no purpose and we shouldn't bother with them.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hp.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hpe.com
      Cc: wanpeng.li@hotmail.com
      Link: http://lkml.kernel.org/r/1470384285-32163-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c2867bba
    • locking/rwsem: Return void in __rwsem_mark_wake() · 84b23f9b
      Davidlohr Bueso authored
      We currently return a rw_semaphore structure, which is the
      same lock we passed in as the function's argument in the first
      place. While there are several functions that choose this
      return value, and their callers use it for things like ERR_PTR,
      this is not the case for __rwsem_mark_wake(). In addition, this
      function is really about the lock waiters (which we know exist
      at this point), so it's somewhat odd to be returning the sem
      structure.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hp.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hpe.com
      Cc: wanpeng.li@hotmail.com
      Link: http://lkml.kernel.org/r/1470384285-32163-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      84b23f9b
  4. 10 August 2016, 5 commits
  5. 27 June 2016, 1 commit
  6. 24 June 2016, 1 commit
  7. 16 June 2016, 3 commits
  8. 14 June 2016, 2 commits
  9. 08 June 2016, 10 commits
    • locking/rwsem: Streamline the rwsem_optimistic_spin() code · ddd0fa73
      Waiman Long authored
      This patch moves the owner loading and checking code entirely inside
      rwsem_spin_on_owner() to simplify the logic of the
      rwsem_optimistic_spin() loop.
      Suggested-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463534783-38814-6-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ddd0fa73
    • locking/rwsem: Improve reader wakeup code · bf7b4c47
      Waiman Long authored
      In __rwsem_do_wake(), the reader wakeup code will assume a writer
      has stolen the lock if the active reader/writer count is not 0.
      However, this is not as reliable an indicator as the original
      "< RWSEM_WAITING_BIAS" check. If another reader is present, the code
      will still break out and exit even if the writer is gone. This patch
      changes it to check the same "< RWSEM_WAITING_BIAS" condition to
      reduce the chance of false positives.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463534783-38814-5-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bf7b4c47
    • locking/rwsem: Protect all writes to owner by WRITE_ONCE() · fb6a44f3
      Waiman Long authored
      Without using WRITE_ONCE(), the compiler can potentially break a
      write into multiple smaller ones (store tearing). So a read from the
      same data by another task concurrently may return a partial result.
      This can result in a kernel crash if the data is a memory address
      that is being dereferenced.
      
      This patch changes all writes to rwsem->owner to use WRITE_ONCE()
      to make sure that store tearing will not happen. READ_ONCE() may
      not be needed for rwsem->owner as long as the value is only used
      for comparison and not dereferenced.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463534783-38814-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb6a44f3
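      A minimal user-space rendering of the idea (the trailing underscore
      marks these names as a sketch, not the kernel's WRITE_ONCE()): the
      volatile-qualified access forces the compiler to emit exactly one
      full-width store, so a concurrent reader can never observe a
      half-written pointer.

        struct rw_semaphore_ { void *owner; };

        /* volatile access: one full-width store, no tearing allowed */
        #define WRITE_ONCE_(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

        static void rwsem_set_owner_(struct rw_semaphore_ *sem, void *task)
        {
            WRITE_ONCE_(sem->owner, task);
        }

        static void rwsem_clear_owner_(struct rw_semaphore_ *sem)
        {
            WRITE_ONCE_(sem->owner, (void *)0);
        }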
    • locking/rwsem: Add reader-owned state to the owner field · 19c5d690
      Waiman Long authored
      Currently, it is not possible to determine for sure if a reader
      owns a rwsem by looking at the content of the rwsem data structure.
      This patch adds a new state RWSEM_READER_OWNED to the owner field
      to indicate that readers currently own the lock. This enables us to
      address the following 2 issues in the rwsem optimistic spinning code:
      
       1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
          the owner field is NULL which can mean either the readers own
          the lock or the owning writer hasn't set the owner field yet.
          In the latter case, we miss the chance to do optimistic spinning.
      
       2) While a writer is waiting in the OSQ and a reader takes the lock,
          the writer will continue to spin when out of the OSQ in the main
          rwsem_optimistic_spin() loop as the owner field is NULL, wasting
          CPU cycles if some of the readers are sleeping.
      
      Adding the new state allows optimistic spinning to go forward as
      long as the owner field is not RWSEM_READER_OWNED (and the owner,
      if set, is running), but to stop immediately once that state has
      been reached.
      
      On a 4-socket Haswell machine running a 4.6-rc1 based kernel, the
      fio test with multithreaded randrw and randwrite tests on the same
      file on an XFS partition on top of an NVDIMM was run; the aggregated
      bandwidths before and after the patch were as follows:
      
        Test      BW before patch     BW after patch  % change
        ----      ---------------     --------------  --------
        randrw         988 MB/s          1192 MB/s      +21%
        randwrite     1513 MB/s          1623 MB/s      +7.3%
      
      The perf profile of the rwsem_down_write_failed() function in randrw
      before and after the patch were:
      
         19.95%  5.88%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
         14.20%  1.52%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
      
      The actual CPU cycles spent in rwsem_down_write_failed() dropped from
      5.88% to 1.52% after the patch.
      
      The xfstests suite was also run and no regression was observed.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Jason Low <jason.low2@hp.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463534783-38814-2-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19c5d690
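      The essence of the new state fits in a few lines; the sentinel value
      below is illustrative (the kernel reserves a special non-task pointer
      value for it):

        /* the owner field becomes tri-state */
        #define READER_OWNED ((void *)1UL)   /* illustrative sentinel */

        /* writer acquire:  owner = current task
         * reader acquire:  owner = READER_OWNED (reader identity unknown)
         * release:         owner = NULL                                  */

        static int can_spin_on(void *owner)
        {
            /* NULL now reliably means "no owner set yet", so spinning may
             * proceed; READER_OWNED means "readers hold it, stop now". */
            return owner != READER_OWNED;
        }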
    • locking/rwsem: Convert sem->count to 'atomic_long_t' · 8ee62b18
      Jason Low authored
      Convert the rwsem count variable to an atomic_long_t since we use it
      as an atomic variable. This also allows us to remove the
      rwsem_atomic_{add,update}() "abstraction", which would now be an
      unnecessary level of indirection. In follow-up patches, we also
      remove the rwsem_atomic_{add,update}() definitions across the
      various architectures.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      [ Build warning fixes on various architectures. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Waiman Long <Waiman.Long@hpe.com>
      Link: http://lkml.kernel.org/r/1465017963-4839-2-git-send-email-jason.low2@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8ee62b18
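      The shape of the conversion in portable C11 terms (the struct name and
      bias value are illustrative): with a native atomic type, a generic
      fetch-add replaces the per-architecture wrapper.

        #include <stdatomic.h>

        struct rw_semaphore_ { atomic_long count; };

        #define ACTIVE_READ_BIAS 1L   /* illustrative, not the kernel's */

        static long down_read_fastpath(struct rw_semaphore_ *sem)
        {
            /* was: rwsem_atomic_update(RWSEM_ACTIVE_READ_BIAS, sem) */
            return atomic_fetch_add(&sem->count, ACTIVE_READ_BIAS)
                   + ACTIVE_READ_BIAS;
        }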
    • locking/qspinlock: Add comments · 055ce0fd
      Peter Zijlstra authored
      I figured we need to document the spin_is_locked() and
      spin_unlock_wait() constraints somewhere.
      
      Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
      could find a place in there. But currently that document is stale to
      the point of hardly being useful.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      055ce0fd
    • locking/qspinlock: Clarify xchg_tail() ordering · 8d53fa19
      Peter Zijlstra authored
      While going over the code I noticed that xchg_tail() is a RELEASE but
      had no obvious pairing commented.
      
      It pairs with a somewhat unique address dependency through
      decode_tail().
      
      So the store-release of xchg_tail() is paired by the address
      dependency of the load of xchg_tail followed by the dereference from
      the pointer computed from that load.
      
      The @old -> @prev transformation itself is pure, and therefore does
      not depend on external state, so that is immaterial wrt. ordering.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d53fa19
    • locking/qspinlock: Fix spin_unlock_wait() some more · 2c610022
      Peter Zijlstra authored
      While this prior commit:
      
        54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      
      ... fixes spin_is_locked() and spin_unlock_wait() for the usage
      in ipc/sem and netfilter, it does not in fact work right for the
      usage in task_work and futex.
      
      So while the two-locks-crossed problem:
      
      	spin_lock(A)		spin_lock(B)
      	if (!spin_is_locked(B)) spin_unlock_wait(A)
      	  foo()			foo();
      
      ... works with the smp_mb() injected by both spin_is_locked() and
      spin_unlock_wait(), this is not sufficient for:
      
      	flag = 1;
      	smp_mb();		spin_lock()
      	spin_unlock_wait()	if (!flag)
      				  // add to lockless list
      	// iterate lockless list
      
      ... because in this scenario, the store from spin_lock() can be delayed
      past the load of flag, uncrossing the variables and losing the
      guarantee.
      
      This patch reworks spin_is_locked() and spin_unlock_wait() to work in
      both cases by exploiting the observation that while the lock byte
      store can be delayed, the contender must have registered itself
      visibly in other state contained in the word.
      
      It also allows for architectures to override both functions, as PPC
      and ARM64 have an additional issue for which we currently have no
      generic solution.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Giovanni Gherdovich <ggherdovich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org # v4.2 and later
      Fixes: 54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c610022
    • locking/rtmutex: Only warn once on a trylock from bad context · a461d587
      Sebastian Andrzej Siewior authored
      One warning should be enough to get someone motivated to fix this.
      It is possible that this happens more than once, which starts
      flooding the output; later the prints will be suppressed, so we
      only get part of them, which, depending on the console system used,
      might not be helpful.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a461d587
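      The warn-once pattern itself is simple; a user-space analogue (the
      trailing underscore marks it as a sketch, not the kernel macro):

        #include <stdio.h>

        /* The first offending call prints, later ones stay silent, so one
         * misuse cannot flood the log. Each use site gets its own flag. */
        #define WARN_ONCE_(cond, msg) ({                    \
            static int warned_;                             \
            int c_ = !!(cond);                              \
            if (c_ && !warned_) {                           \
                warned_ = 1;                                \
                fprintf(stderr, "WARNING: %s\n", (msg));    \
            }                                               \
            c_; })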
    • locking/lockdep: Use __jhash_mix() for iterate_chain_key() · dfaaf3fa
      Peter Zijlstra authored
      Use __jhash_mix() to mix the class_idx into the class_key. This
      function provides better mixing than the previously used home-grown
      mix function.
      
      Leave hashing to the professionals :-)
      Suggested-by: George Spelvin <linux@sciencehorizons.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dfaaf3fa
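      The overall shape of the change looks like this sketch; the rotate
      and mix steps here are illustrative stand-ins, not jhash's exact
      constants. The 64-bit chain key is split into two 32-bit halves,
      avalanched together with the class index, and recombined.

        #include <stdint.h>

        static inline uint32_t rol32_(uint32_t w, unsigned s)
        {
            return (w << s) | (w >> (32 - s));
        }

        static inline uint64_t iterate_chain_key_(uint64_t key, uint32_t idx)
        {
            uint32_t k0 = (uint32_t)key, k1 = (uint32_t)(key >> 32);

            /* stand-in for __jhash_mix(k0, k1, idx) */
            k0 -= idx; k0 ^= rol32_(idx, 4);  idx += k1;
            k1 -= k0;  k1 ^= rol32_(k0, 6);   k0 += idx;
            idx -= k1; idx ^= rol32_(k1, 8);  k1 += k0;

            return k0 | (uint64_t)k1 << 32;
        }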
  10. 03 June 2016, 5 commits
    • locking/mutex: Set and clear owner using WRITE_ONCE() · 6e281474
      Jason Low authored
      The mutex owner can be read and written locklessly.
      Use WRITE_ONCE() when setting and clearing the owner field
      in order to avoid optimizations such as store tearing. This
      avoids situations where the owner field gets written to with
      multiple stores and another thread could concurrently read
      and use a partially written owner value.
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Waiman Long <Waiman.Long@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463782776.2479.9.camel@j-VirtualBox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e281474
    • locking/rwsem: Optimize write lock by reducing operations in slowpath · c0fcb6c2
      Jason Low authored
      When acquiring the rwsem write lock in the slowpath, we first try
      to set count to RWSEM_WAITING_BIAS. When that is successful,
      we then atomically add the RWSEM_WAITING_BIAS in cases where
      there are other tasks on the wait list. This causes write lock
      operations to often issue multiple atomic operations.
      
      We can instead make the list_is_singular() check first, and then
      set the count accordingly, so that we issue at most 1 atomic
      operation when acquiring the write lock and reduce unnecessary
      cacheline contention.
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <Waiman.Long@hpe.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c0fcb6c2
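      Sketched with C11 atomics and illustrative bias values (the kernel's
      RWSEM_* biases differ and are architecture-dependent): the target
      count is chosen from the list_is_singular() result first, so the hot
      path needs a single compare-and-swap rather than a trylock plus a
      fix-up add.

        #include <stdatomic.h>
        #include <stdbool.h>

        #define WAITING_BIAS      (-1L)   /* illustrative */
        #define ACTIVE_WRITE_BIAS ( 1L)   /* illustrative */

        static bool try_write_lock(atomic_long *count, bool only_waiter)
        {
            long expected = WAITING_BIAS;
            /* keep the waiting bias only if other waiters remain queued */
            long target = only_waiter ? ACTIVE_WRITE_BIAS
                                      : ACTIVE_WRITE_BIAS + WAITING_BIAS;

            return atomic_compare_exchange_strong_explicit(count, &expected,
                    target, memory_order_acquire, memory_order_relaxed);
        }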
    • locking/rwsem: Rework zeroing reader waiter->task · e3851390
      Davidlohr Bueso authored
      Readers that are awoken will expect a nil ->task indicating
      that a wakeup has occurred. Because of the way readers are
      implemented, there's a small chance that the waiter will never
      block in the slowpath (rwsem_down_read_failed), which therefore
      requires some form of reference counting to avoid the following
      scenario:
      
      rwsem_down_read_failed()		rwsem_wake()
        get_task_struct();
        spin_lock_irq(&wait_lock);
        list_add_tail(&waiter.list)
        spin_unlock_irq(&wait_lock);
      					  raw_spin_lock_irqsave(&wait_lock)
      					  __rwsem_do_wake()
        while (1) {
          set_task_state(TASK_UNINTERRUPTIBLE);
      					    waiter->task = NULL
          if (!waiter.task) // true
            break;
          schedule() // never reached
      
         __set_task_state(TASK_RUNNING);
       do_exit();
      					    wake_up_process(tsk); // boom
      
      ... and therefore race with do_exit() when the caller returns.
      
      There is also a mismatch between the smp_mb() and its documentation,
      in that the serialization is done between reading the task and the
      nil store. Furthermore, in addition to having the overlapping of
      loads and stores to waiter->task guaranteed to be ordered within
      that CPU, both wake_up_process() originally and now wake_q_add()
      already imply barriers upon successful calls, which serves the
      comment.
      
      Now, as an alternative to perhaps inverting the checks in the blocker
      side (which has its own penalty in that schedule is unavoidable),
      with lockless wakeups this situation is naturally addressed and we
      can just use the refcount held by wake_q_add(), instead of doing so
      explicitly. Of course, we must guarantee that the nil store is done
      as the _last_ operation in that the task must already be marked for
      deletion to not fall into the race above. Spurious wakeups are also
      handled transparently in that the task's reference is only removed
      when wake_up_q() is actually called _after_ the nil store.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hpe.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hp.com
      Cc: peter@hurleysoftware.com
      Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e3851390
    • locking/rwsem: Enable lockless waiter wakeup(s) · 133e89ef
      Davidlohr Bueso authored
      As wake_qs gain users, we can teach rwsems about them such that
      waiters can be awoken without the wait_lock. This applies to both
      readers and writers, the former being the ideal candidate
      as we can batch the wakeups, shortening the critical region that
      much more -- e.g. a writer task blocking a bunch of tasks waiting to
      service page-faults (mmap_sem readers).
      
      In general, applying wake_qs to rwsem (xadd) is not difficult as
      the wait_lock is intended to be released soon _anyways_, with
      the exception of when a writer slowpath will proactively wake up
      any queued readers if it sees that the lock is owned by a reader,
      in which case we simply do the wakeups with the lock held (see the
      comment in __rwsem_down_write_failed_common()).
      
      Similar to other locking primitives, delaying the waiter being
      awoken does allow, at least in theory, the lock to be stolen in
      the case of writers, however no harm was seen in this (in fact
      lock stealing tends to be a _good_ thing in most workloads), and
      this is a tiny window anyways.
      
      Some page-fault (pft) and mmap_sem intensive benchmarks show a
      pretty constant reduction in systime (by up to ~8% and ~10%) on a
      2-socket, 12-core AMD box. In addition, an 8-core Westmere doing
      page allocations (page_test) shows:
      
      aim9:
      	 4.6-rc6				4.6-rc6
      						rwsemv2
      Min      page_test   378167.89 (  0.00%)   382613.33 (  1.18%)
      Min      exec_test      499.00 (  0.00%)      502.67 (  0.74%)
      Min      fork_test     3395.47 (  0.00%)     3537.64 (  4.19%)
      Hmean    page_test   395433.06 (  0.00%)   414693.68 (  4.87%)
      Hmean    exec_test      499.67 (  0.00%)      505.30 (  1.13%)
      Hmean    fork_test     3504.22 (  0.00%)     3594.95 (  2.59%)
      Stddev   page_test    17426.57 (  0.00%)    26649.92 (-52.93%)
      Stddev   exec_test        0.47 (  0.00%)        1.41 (-199.05%)
      Stddev   fork_test       63.74 (  0.00%)       32.59 ( 48.86%)
      Max      page_test   429873.33 (  0.00%)   456960.00 (  6.30%)
      Max      exec_test      500.33 (  0.00%)      507.66 (  1.47%)
      Max      fork_test     3653.33 (  0.00%)     3650.90 ( -0.07%)
      
      	     4.6-rc6     4.6-rc6
      			 rwsemv2
      User            1.12        0.04
      System          0.23        0.04
      Elapsed       727.27      721.98
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hpe.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hp.com
      Cc: peter@hurleysoftware.com
      Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      133e89ef
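      The wake_q pattern itself is small; a sketch with invented types
      (wake_q_, and a comment standing in for the real wake_up_process()
      call): wakeups are collected while the wait_lock is held, but issued
      only after it is dropped.

        #include <pthread.h>

        struct wake_q_ { void *tasks[64]; int n; };

        static void wake_q_add_(struct wake_q_ *q, void *task)
        {
            q->tasks[q->n++] = task;   /* also holds a task reference */
        }

        static void wake_up_q_(struct wake_q_ *q)
        {
            for (int i = 0; i < q->n; i++)
                (void)q->tasks[i];     /* wake_up_process() in the kernel */
            q->n = 0;
        }

        static void wake_waiters(pthread_mutex_t *wait_lock)
        {
            struct wake_q_ q = { .n = 0 };

            pthread_mutex_lock(wait_lock);
            /* ... dequeue waiters, wake_q_add_() each of them ... */
            pthread_mutex_unlock(wait_lock);

            wake_up_q_(&q);   /* wait_lock no longer held */
        }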
    • locking/ww_mutex: Report recursive ww_mutex locking early · 0422e83d
      Chris Wilson authored
      Recursive locking for ww_mutexes was originally conceived as an
      exception. However, it is heavily used by the DRM atomic modesetting
      code. Currently, the recursive deadlock is checked after we have queued
      up for a busy-spin and as we never release the lock, we spin until
      kicked, whereupon the deadlock is discovered and reported.
      
      A simple solution for the now common problem is to move the recursive
      deadlock discovery to the first action when taking the ww_mutex.
      Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0422e83d
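      A sketch of the early check with simplified types (the real code
      compares the acquire context against the context stored in the
      ww_mutex): a second lock attempt with the same context is reported
      immediately, before any spinning or queueing happens.

        #include <errno.h>

        struct ww_ctx_;
        struct ww_mutex_ { struct ww_ctx_ *ctx; /* set by current holder */ };

        static int ww_mutex_lock_(struct ww_mutex_ *lock, struct ww_ctx_ *ctx)
        {
            /* recursive locking is caught up front now */
            if (ctx && ctx == lock->ctx)
                return -EALREADY;

            /* ... normal trylock / optimistic spin / slowpath follows ... */
            return 0;
        }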
  11. 26 May 2016, 1 commit
  12. 16 May 2016, 1 commit
    • locking/rwsem: Fix down_write_killable() · 04cafed7
      Peter Zijlstra authored
      The new signal_pending exit path in __rwsem_down_write_failed_common()
      was fingered by Tetsuo Handa as breaking his kernel.
      
      Upon inspection it was found that there are two things wrong with it:
      
       - it forgets to remove WAITING_BIAS if it leaves the list empty, or
       - it forgets to wake further waiters that were blocked on the now
         removed waiter.
      
      Especially the first issue causes new lock attempts to block and stall
      indefinitely, as the code assumes that pending waiters mean there is
      an owner that will wake when it releases the lock.
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Waiman Long <Waiman.Long@hpe.com>
      Link: http://lkml.kernel.org/r/20160512115745.GP3192@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      04cafed7
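      The shape of the fixed error path, with invented helper names and an
      illustrative bias value: on a fatal signal the waiter must fully undo
      its queueing, either dropping the waiting bias when it was the last
      waiter or passing the wakeup on when it was not, so that later lockers
      do not block on a phantom owner.

        #include <stdatomic.h>
        #include <stdbool.h>

        #define WAITING_BIAS (-1L)   /* illustrative, not the kernel's */

        static void abort_write_wait(atomic_long *count, bool list_now_empty,
                                     void (*wake_next_waiter)(void))
        {
            if (list_now_empty)
                atomic_fetch_sub(count, WAITING_BIAS); /* undo the bias */
            else
                wake_next_waiter();   /* pass the pending wakeup along */
        }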
  13. 05 May 2016, 2 commits