1. 08 June 2016 (11 commits)
    • locking/rwsem: Protect all writes to owner by WRITE_ONCE() · fb6a44f3
      Committed by Waiman Long
      Without using WRITE_ONCE(), the compiler can potentially break a
      write into multiple smaller ones (store tearing). So a read from the
      same data by another task concurrently may return a partial result.
      This can result in a kernel crash if the data is a memory address
      that is being dereferenced.
      
      This patch changes all writes to rwsem->owner to use WRITE_ONCE()
      to make sure that store tearing will not happen. READ_ONCE() may
      not be needed for rwsem->owner as long as the value is only used for
      comparison and not dereferencing.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463534783-38814-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb6a44f3
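
      A minimal sketch of the resulting helpers, assuming the 4.7-era layout
      where ->owner (under CONFIG_RWSEM_SPIN_ON_OWNER) is a plain task_struct
      pointer; the names mirror kernel/locking/rwsem.h but the snippet is
      illustrative rather than the actual diff:

      	#include <linux/compiler.h>	/* WRITE_ONCE() */
      	#include <linux/rwsem.h>
      	#include <linux/sched.h>	/* current */

      	#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
      	static inline void rwsem_set_owner(struct rw_semaphore *sem)
      	{
      		/*
      		 * A single volatile store: the compiler may not split it, so
      		 * a concurrent lockless reader never observes a half-written
      		 * pointer.
      		 */
      		WRITE_ONCE(sem->owner, current);
      	}

      	static inline void rwsem_clear_owner(struct rw_semaphore *sem)
      	{
      		WRITE_ONCE(sem->owner, NULL);
      	}
      	#endif
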
    • locking/rwsem: Add reader-owned state to the owner field · 19c5d690
      Committed by Waiman Long
      Currently, it is not possible to determine for sure if a reader
      owns a rwsem by looking at the content of the rwsem data structure.
      This patch adds a new state RWSEM_READER_OWNED to the owner field
      to indicate that readers currently own the lock. This enables us to
      address the following 2 issues in the rwsem optimistic spinning code:
      
       1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
          the owner field is NULL which can mean either the readers own
          the lock or the owning writer hasn't set the owner field yet.
          In the latter case, we miss the chance to do optimistic spinning.
      
       2) While a writer is waiting in the OSQ and a reader takes the lock,
          the writer will continue to spin when out of the OSQ in the main
          rwsem_optimistic_spin() loop as the owner field is NULL, wasting
          CPU cycles if some of the readers are sleeping.
      
      Adding the new state allows optimistic spinning to go forward as long
      as the owner field is not RWSEM_READER_OWNED and the owner, if set, is
      running, but to stop immediately once the reader-owned state is
      observed.
      
      On a 4-socket Haswell machine running a 4.6-rc1 based kernel, fio
      multithreaded randrw and randwrite tests were run against the same
      file on an XFS partition on top of an NVDIMM. The aggregated
      bandwidths before and after the patch were as follows:
      
        Test      BW before patch     BW after patch  % change
        ----      ---------------     --------------  --------
        randrw         988 MB/s          1192 MB/s      +21%
        randwrite     1513 MB/s          1623 MB/s      +7.3%
      
      The perf profile of the rwsem_down_write_failed() function in randrw
      before and after the patch were:
      
         19.95%  5.88%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
         14.20%  1.52%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
      
      The actual CPU cycles spent in rwsem_down_write_failed() dropped from
      5.88% to 1.52% after the patch.
      
      The xfstests suite was also run and no regression was observed.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Jason Low <jason.low2@hp.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463534783-38814-2-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19c5d690
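
      A hedged sketch of the new state and the helpers built on it; the shape
      follows the upstream code but the exact definitions may differ from the
      patch:

      	/* Any non-NULL, non-task value works as the marker; 1 is never a
      	 * valid task_struct pointer. */
      	#define RWSEM_READER_OWNED	((struct task_struct *)1UL)

      	static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
      	{
      		/*
      		 * Check before writing so that a stream of readers taking the
      		 * lock does not keep dirtying the owner cacheline.
      		 */
      		if (READ_ONCE(sem->owner) != RWSEM_READER_OWNED)
      			WRITE_ONCE(sem->owner, RWSEM_READER_OWNED);
      	}

      	static inline bool rwsem_owner_is_writer(struct task_struct *owner)
      	{
      		/* Optimistic spinning is only worthwhile on a running writer. */
      		return owner && owner != RWSEM_READER_OWNED;
      	}
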
    • locking/rwsem: Remove rwsem_atomic_add() and rwsem_atomic_update() · d157bd86
      Committed by Jason Low
      The rwsem-xadd count has been converted to an atomic variable and the
      rwsem code now directly uses atomic_long_add() and
      atomic_long_add_return(), so we can remove the arch implementations of
      rwsem_atomic_add() and rwsem_atomic_update().
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Waiman Long <Waiman.Long@hpe.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d157bd86
    • locking/rwsem: Convert sem->count to 'atomic_long_t' · 8ee62b18
      Committed by Jason Low
      Convert the rwsem count variable to an atomic_long_t since we use it
      as an atomic variable. This also allows us to remove the
      rwsem_atomic_{add,update}() "abstraction" which would now be an unnecessary
      level of indirection. In follow up patches, we also remove the
      rwsem_atomic_{add,update}() definitions across the various architectures.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      [ Build warning fixes on various architectures. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Waiman Long <Waiman.Long@hpe.com>
      Link: http://lkml.kernel.org/r/1465017963-4839-2-git-send-email-jason.low2@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8ee62b18
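
      A sketch of what a fast path looks like after the conversion, assuming
      the asm-generic flavour; __down_read_sketch is an illustrative name,
      not the upstream function:

      	#include <linux/atomic.h>
      	#include <linux/rwsem.h>

      	/* Slow path lives in kernel/locking/rwsem-xadd.c */
      	extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);

      	static inline void __down_read_sketch(struct rw_semaphore *sem)
      	{
      		/*
      		 * sem->count is now an atomic_long_t, so the generic
      		 * atomic_long_*() helpers are used directly instead of the
      		 * per-arch rwsem_atomic_add()/rwsem_atomic_update() wrappers.
      		 */
      		if (unlikely(atomic_long_inc_return(&sem->count) <= 0))
      			rwsem_down_read_failed(sem);
      	}
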
    • locking/qspinlock: Add comments · 055ce0fd
      Committed by Peter Zijlstra
      I figured we need to document the spin_is_locked() and
      spin_unlock_wait() constraints somewhere.
      
      Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
      could find a place in there. But currently that document is stale to
      the point of hardly being useful.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      055ce0fd
    • locking/qspinlock: Clarify xchg_tail() ordering · 8d53fa19
      Committed by Peter Zijlstra
      While going over the code I noticed that xchg_tail() is a RELEASE but
      had no obvious pairing commented.
      
      It pairs with a somewhat unique address dependency through
      decode_tail().
      
      So the store-release of xchg_tail() is paired with the address
      dependency formed by the load in xchg_tail() followed by the
      dereference of the pointer computed from that load.
      
      The @old -> @prev transformation itself is pure, and therefore does
      not depend on external state, so that is immaterial wrt. ordering.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d53fa19
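
      The pairing reads roughly as follows in queued_spin_lock_slowpath()
      (simplified fragment, shown only to illustrate the ordering the new
      comments document):

      	old = xchg_tail(lock, tail);	/* RELEASE: publishes our MCS node */

      	if (old & _Q_TAIL_MASK) {
      		/*
      		 * The address of @prev is computed from the value the
      		 * xchg_tail() load returned; the dereference below is
      		 * therefore ordered after that load (address dependency),
      		 * which pairs with the store-release in the previous
      		 * contender's xchg_tail().
      		 */
      		prev = decode_tail(old);
      		WRITE_ONCE(prev->next, node);
      	}
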
    • locking/qspinlock: Fix spin_unlock_wait() some more · 2c610022
      Committed by Peter Zijlstra
      While this prior commit:
      
        54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      
      ... fixes spin_is_locked() and spin_unlock_wait() for the usage
      in ipc/sem and netfilter, it does not in fact work right for the
      usage in task_work and futex.
      
      So while the 2 locks crossed problem:
      
      	spin_lock(A)		spin_lock(B)
      	if (!spin_is_locked(B)) spin_unlock_wait(A)
      	  foo()			foo();
      
      ... works with the smp_mb() injected by both spin_is_locked() and
      spin_unlock_wait(), this is not sufficient for:
      
      	flag = 1;
      	smp_mb();		spin_lock()
      	spin_unlock_wait()	if (!flag)
      				  // add to lockless list
      	// iterate lockless list
      
      ... because in this scenario, the store from spin_lock() can be delayed
      past the load of flag, uncrossing the variables and losing the
      guarantee.
      
      This patch reworks spin_is_locked() and spin_unlock_wait() to work in
      both cases by exploiting the observation that while the lock byte
      store can be delayed, the contender must have registered itself
      visibly in other state contained in the word.
      
      It also allows for architectures to override both functions, as PPC
      and ARM64 have an additional issue for which we currently have no
      generic solution.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Giovanni Gherdovich <ggherdovich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <waiman.long@hpe.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org # v4.2 and later
      Fixes: 54cf809b ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c610022
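
      The spin_is_locked() half of the rework boils down to looking at the
      whole lock word instead of just the locked byte; a hedged sketch, close
      to but not necessarily identical to the upstream helper:

      	static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
      	{
      		/*
      		 * Any non-zero state counts as locked: a contender whose
      		 * locked-byte store is still delayed has already made itself
      		 * visible through the pending/tail bits of the same word.
      		 */
      		return atomic_read(&lock->val);
      	}
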
    • locking/barriers: Validate lockless_dereference() is used on a pointer type · 331b6d8c
      Committed by Peter Zijlstra
      Use the type to validate the argument @p is indeed a pointer type.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160522104827.GP3193@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      331b6d8c
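
      A hedged sketch of how such a check can be expressed: evaluating
      sizeof(*(p)) into a throw-away variable only compiles when @p is a
      dereferenceable pointer (identifier names are illustrative):

      	#define lockless_dereference(p) \
      	({ \
      		typeof(p) _________p1 = READ_ONCE(p); \
      		size_t __maybe_unused __size_of_ptr = sizeof(*(p)); /* rejects non-pointers */ \
      		smp_read_barrier_depends(); /* dependency ordering, as in rcu_dereference() */ \
      		(_________p1); \
      	})
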
    • locking/rtmutex: Only warn once on a trylock from bad context · a461d587
      Committed by Sebastian Andrzej Siewior
      One warning should be enough to get someone motivated to fix this. It
      is possible that this happens more than once, which starts flooding the
      output. Later prints will then be suppressed, so we only get half of
      them. Depending on the console system used, that might not be helpful.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a461d587
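
      A hedged sketch of the change in rt_mutex_trylock(); the exact
      bad-context test shown here is an assumption, the point is the switch
      from warn-every-time to warn-once:

      	int __sched rt_mutex_trylock(struct rt_mutex *lock)
      	{
      		/*
      		 * WARN_ON_ONCE() instead of WARN_ON(): one splat is enough,
      		 * and repeated warnings would only flood (and then get
      		 * suppressed on) the console.
      		 */
      		if (WARN_ON_ONCE(in_irq() || in_nmi() || in_serving_softirq()))
      			return 0;

      		return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock);
      	}
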
    • locking/lockdep: Use __jhash_mix() for iterate_chain_key() · dfaaf3fa
      Committed by Peter Zijlstra
      Use __jhash_mix() to mix the class_idx into the class_key. This
      function provides better mixing than the previously used, home-grown
      mix function.
      
      Leave hashing to the professionals :-)
      Suggested-by: George Spelvin <linux@sciencehorizons.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dfaaf3fa
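
      A hedged sketch of the new mixing step: the 64-bit chain key is split
      into two 32-bit halves and run through __jhash_mix() from
      <linux/jhash.h>; treat the exact shape as illustrative:

      	#include <linux/jhash.h>

      	static inline u64 iterate_chain_key(u64 key, u32 idx)
      	{
      		u32 k0 = key, k1 = key >> 32;

      		__jhash_mix(idx, k0, k1); /* the macro modifies its arguments */

      		return k0 | (u64)k1 << 32;
      	}
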
  2. 03 June 2016 (7 commits)
    • percpu, locking: Revert ("percpu: Replace smp_read_barrier_depends() with lockless_dereference()") · ed8ebd1d
      Committed by Tejun Heo
      lockless_dereference() is planned to grow a sanity check to ensure
      that the input parameter is a pointer.  __ref_is_percpu() passes in an
      unsigned long value which is a combination of a pointer and a flag.
      While it can be cast to a pointer lvalue, the casting looks messy
      and it's a special case anyway.  Let's revert to open-coding
      READ_ONCE() and an explicit barrier.
      
      This doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Link: http://lkml.kernel.org/g/20160522185040.GA23664@p183.telecom.by
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ed8ebd1d
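
      The open-coded form described above looks roughly like this (field and
      flag names follow include/linux/percpu-refcount.h; illustrative, not
      the exact diff):

      	static inline bool __ref_is_percpu(struct percpu_ref *ref,
      					   unsigned long __percpu **percpu_countp)
      	{
      		unsigned long percpu_ptr;

      		/*
      		 * percpu_count_ptr is an unsigned long carrying both the
      		 * pointer and the ATOMIC/DEAD flags, so lockless_dereference()
      		 * (which now insists on a pointer argument) no longer fits;
      		 * open-code the volatile load plus dependency barrier.
      		 */
      		percpu_ptr = READ_ONCE(ref->percpu_count_ptr);
      		smp_read_barrier_depends();

      		if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC_DEAD))
      			return false;

      		*percpu_countp = (unsigned long __percpu *)percpu_ptr;
      		return true;
      	}
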
    • locking/mutex: Set and clear owner using WRITE_ONCE() · 6e281474
      Committed by Jason Low
      The mutex owner can be read and written to locklessly.
      Use WRITE_ONCE() when setting and clearing the owner field
      in order to avoid optimizations such as store tearing. This
      avoids situations where the owner field gets written to with
      multiple stores and another thread could concurrently read
      and use a partially written owner value.
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Waiman Long <Waiman.Long@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1463782776.2479.9.camel@j-VirtualBox
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e281474
    • locking/rwsem: Optimize write lock by reducing operations in slowpath · c0fcb6c2
      Committed by Jason Low
      When acquiring the rwsem write lock in the slowpath, we first try
      to set count to RWSEM_WAITING_BIAS. When that is successful,
      we then atomically add the RWSEM_WAITING_BIAS in cases where
      there are other tasks on the wait list. This causes write lock
      operations to often issue multiple atomic operations.
      
      We can instead make the list_is_singular() check first, and then
      set the count accordingly, so that we issue at most 1 atomic
      operation when acquiring the write lock and reduce unnecessary
      cacheline contention.
      Signed-off-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <Waiman.Long@hpe.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Terry Rudd <terry.rudd@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c0fcb6c2
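
      A hedged sketch of the reworked write-lock attempt: the wait-list check
      now selects the target count up front so that a single cmpxchg does the
      work (constant names follow rwsem-xadd; details may differ from the
      patch):

      	static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
      	{
      		if (count != RWSEM_WAITING_BIAS)
      			return false;

      		/*
      		 * Decide the new count first: keep the waiting bias only if
      		 * other waiters remain queued behind us.
      		 */
      		count = list_is_singular(&sem->wait_list) ?
      				RWSEM_ACTIVE_WRITE_BIAS :
      				RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;

      		/* A single atomic operation acquires the lock on success. */
      		if (atomic_long_cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count)
      							== RWSEM_WAITING_BIAS) {
      			rwsem_set_owner(sem);
      			return true;
      		}

      		return false;
      	}
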
    • locking/rwsem: Rework zeroing reader waiter->task · e3851390
      Committed by Davidlohr Bueso
      Readers that are awoken will expect a nil ->task indicating
      that a wakeup has occurred. Because of the way readers are
      implemented, there's a small chance that the waiter will never
      block in the slowpath (rwsem_down_read_failed), which therefore
      requires some form of reference counting to avoid the following
      scenario:
      
      rwsem_down_read_failed()		rwsem_wake()
        get_task_struct();
        spin_lock_irq(&wait_lock);
        list_add_tail(&waiter.list)
        spin_unlock_irq(&wait_lock);
      					  raw_spin_lock_irqsave(&wait_lock)
      					  __rwsem_do_wake()
        while (1) {
          set_task_state(TASK_UNINTERRUPTIBLE);
      					    waiter->task = NULL
          if (!waiter.task) // true
            break;
          schedule() // never reached
      
         __set_task_state(TASK_RUNNING);
       do_exit();
      					    wake_up_process(tsk); // boom
      
      ... and therefore race with do_exit() when the caller returns.
      
      There is also a mismatch between the smp_mb() and its documentation,
      in that the serialization is done between reading the task and the
      nil store. Furthermore, in addition to having the overlapping of
      loads and stores to waiter->task guaranteed to be ordered within
      that CPU, both wake_up_process() originally and now wake_q_add()
      already imply barriers upon successful calls, which serves the
      comment.
      
      Now, as an alternative to perhaps inverting the checks in the blocker
      side (which has its own penalty in that schedule is unavoidable),
      with lockless wakeups this situation is naturally addressed and we
      can just use the refcount held by wake_q_add(), instead of doing so
      explicitly. Of course, we must guarantee that the nil store is done
      as the _last_ operation in that the task must already be marked for
      deletion to not fall into the race above. Spurious wakeups are also
      handled transparently in that the task's reference is only removed
      when wake_up_q() is actually called _after_ the nil store.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hpe.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hp.com
      Cc: peter@hurleysoftware.com
      Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e3851390
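
      The wakeup side then reads roughly like this (simplified fragment of
      the rwsem wake path): wake_q_add() pins the task with its own
      reference, so clearing waiter->task can safely be the last operation.

      	tsk = waiter->task;
      	wake_q_add(wake_q, tsk);	/* takes a reference on @tsk */

      	/*
      	 * Make the NULL store the very last step: once the reader observes
      	 * it, it may return and even exit, but the wake_q reference taken
      	 * above keeps the task around until wake_up_q() runs.
      	 */
      	smp_store_release(&waiter->task, NULL);
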
    • locking/rwsem: Enable lockless waiter wakeup(s) · 133e89ef
      Committed by Davidlohr Bueso
      As wake_qs gain users, we can teach rwsems about them such that
      waiters can be awoken without the wait_lock. This applies to both
      readers and writers, the former being the ideal candidate as we
      can batch the wakeups, shortening the critical region that much
      more -- i.e. a writer task blocking a bunch of tasks waiting to
      service page-faults (mmap_sem readers).
      
      In general applying wake_qs to rwsem (xadd) is not difficult as
      the wait_lock is intended to be released soon _anyways_, with
      the exception of when a writer slowpath will proactively wakeup
      any queued readers if it sees that the lock is owned by a reader, in
      which case we simply do the wakeups with the lock held (see the comment
      in __rwsem_down_write_failed_common()).
      
      Similar to other locking primitives, delaying the waiter being
      awoken does allow, at least in theory, the lock to be stolen in
      the case of writers, however no harm was seen in this (in fact
      lock stealing tends to be a _good_ thing in most workloads), and
      this is a tiny window anyways.
      
      Some page-fault (pft) and mmap_sem intensive benchmarks show some
      pretty constant reduction in systime (by up to ~8 and ~10%) on a
      2-socket, 12-core AMD box. In addition, on an 8-core Westmere doing
      page allocations (page_test):
      
      aim9:
      	 4.6-rc6				4.6-rc6
      						rwsemv2
      Min      page_test   378167.89 (  0.00%)   382613.33 (  1.18%)
      Min      exec_test      499.00 (  0.00%)      502.67 (  0.74%)
      Min      fork_test     3395.47 (  0.00%)     3537.64 (  4.19%)
      Hmean    page_test   395433.06 (  0.00%)   414693.68 (  4.87%)
      Hmean    exec_test      499.67 (  0.00%)      505.30 (  1.13%)
      Hmean    fork_test     3504.22 (  0.00%)     3594.95 (  2.59%)
      Stddev   page_test    17426.57 (  0.00%)    26649.92 (-52.93%)
      Stddev   exec_test        0.47 (  0.00%)        1.41 (-199.05%)
      Stddev   fork_test       63.74 (  0.00%)       32.59 ( 48.86%)
      Max      page_test   429873.33 (  0.00%)   456960.00 (  6.30%)
      Max      exec_test      500.33 (  0.00%)      507.66 (  1.47%)
      Max      fork_test     3653.33 (  0.00%)     3650.90 ( -0.07%)
      
      	     4.6-rc6     4.6-rc6
      			 rwsemv2
      User            1.12        0.04
      System          0.23        0.04
      Elapsed       727.27      721.98
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hpe.com
      Cc: dave@stgolabs.net
      Cc: jason.low2@hp.com
      Cc: peter@hurleysoftware.com
      Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      133e89ef
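
      A hedged sketch of the resulting shape of rwsem_wake(): wakeups are
      collected on an on-stack wake_q under the wait_lock and only issued
      after it is dropped (WAKE_Q() was the on-stack initializer of that era;
      the __rwsem_do_wake() signature is an assumption):

      	struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
      	{
      		unsigned long flags;
      		WAKE_Q(wake_q);			/* on-stack wake queue */

      		raw_spin_lock_irqsave(&sem->wait_lock, flags);
      		if (!list_empty(&sem->wait_list))
      			sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY, &wake_q);
      		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

      		/* The actual wake-ups happen here, outside the wait_lock. */
      		wake_up_q(&wake_q);

      		return sem;
      	}
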
    • locking/ww_mutex: Report recursive ww_mutex locking early · 0422e83d
      Committed by Chris Wilson
      Recursive locking for ww_mutexes was originally conceived as an
      exception. However, it is heavily used by the DRM atomic modesetting
      code. Currently, the recursive deadlock is checked after we have queued
      up for a busy-spin and as we never release the lock, we spin until
      kicked, whereupon the deadlock is discovered and reported.
      
      A simple solution for the now common problem is to move the recursive
      deadlock discovery to the first action when taking the ww_mutex.
      Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0422e83d
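
      A hedged sketch of the moved check, as an early step of the ww_mutex
      lock path (fragment; -EALREADY is the established return value for
      re-acquiring a ww_mutex with the same acquire context):

      	if (use_ww_ctx) {
      		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);

      		/*
      		 * Recursive acquisition with the same acquire context:
      		 * report it immediately instead of discovering it only
      		 * after queueing up for a futile busy-spin.
      		 */
      		if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
      			return -EALREADY;
      	}
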
    • locking/seqcount: Re-fix raw_read_seqcount_latch() · 55eed755
      Committed by Peter Zijlstra
      Commit 50755bc1 ("seqlock: fix raw_read_seqcount_latch()") broke
      raw_read_seqcount_latch().
      
      If you look at the comment that was modified, the thing that changes
      is the seq count, not the latch pointer.
      
       * void latch_modify(struct latch_struct *latch, ...)
       * {
       *	smp_wmb();	<- Ensure that the last data[1] update is visible
       *	latch->seq++;
       *	smp_wmb();	<- Ensure that the seqcount update is visible
       *
       *	modify(latch->data[0], ...);
       *
       *	smp_wmb();	<- Ensure that the data[0] update is visible
       *	latch->seq++;
       *	smp_wmb();	<- Ensure that the seqcount update is visible
       *
       *	modify(latch->data[1], ...);
       * }
       *
       * The query will have a form like:
       *
       * struct entry *latch_query(struct latch_struct *latch, ...)
       * {
       *	struct entry *entry;
       *	unsigned seq, idx;
       *
       *	do {
       *		seq = lockless_dereference(latch->seq);
      
      So here we have:
      
      		seq = READ_ONCE(latch->seq);
      		smp_read_barrier_depends();
      
      Which is exactly what we want; the new code:
      
      		seq = ({ p = READ_ONCE(latch);
      			 smp_read_barrier_depends(); p })->seq;
      
      is just wrong; because it loses the volatile read on seq, which can
      now be torn or, worse, 'optimized'. And the read_depend barrier is
      also placed wrong: we want it after the load of seq, to match the
      above data[] up-to-date wmb()s.
      
      Such that when we dereference latch->data[] below, we're guaranteed to
      observe the right data.
      
       *
       *		idx = seq & 0x01;
       *		entry = data_query(latch->data[idx], ...);
       *
       *		smp_rmb();
       *	} while (seq != latch->seq);
       *
       *	return entry;
       * }
      
      So yes, not passing a pointer is not pretty, but the code was correct,
      and isn't anymore now.
      
      Change to explicit READ_ONCE()+smp_read_barrier_depends() to avoid
      confusion and allow strict lockless_dereference() checking.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 50755bc1 ("seqlock: fix raw_read_seqcount_latch()")
      Link: http://lkml.kernel.org/r/20160527111117.GL3192@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      55eed755
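
      The fixed reader then looks roughly like this: a volatile load of the
      sequence followed by the dependency barrier (matching the kernel of
      that era):

      	static inline int raw_read_seqcount_latch(seqcount_t *s)
      	{
      		int seq = READ_ONCE(s->sequence);	/* volatile read; cannot be torn */

      		/* Pairs with the smp_wmb()s in raw_write_seqcount_latch(). */
      		smp_read_barrier_depends();
      		return seq;
      	}
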
  3. 02 June 2016 (2 commits)
  4. 01 June 2016 (20 commits)