1. 13 Apr, 2016 2 commits
    • locking/locktorture: Fix NULL pointer dereference for cleanup paths · c1c33b92
      Davidlohr Bueso committed
      It has been found that paths that invoke cleanups through
      lock_torture_cleanup() can trigger NULL pointer dereferencing
      bugs during the statistics printing phase. This is mainly
      because we should not be calling into statistics before we are
      sure things have been set up correctly.
      
      Specifically, early checks (and the need for handling this in
      the cleanup call) only include parameter checks and basic
      statistics allocation. Once we start the writer/reader kthreads,
      we consider the test as started. As such, update the function
      in question to check the cxt.lwsa writer stats: if they are not set,
      we have either a bogus parameter or an -ENOMEM situation and
      therefore only need to deal with the general torture calls.
      Reported-and-tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bobby.prani@gmail.com
      Cc: dhowells@redhat.com
      Cc: dipankar@in.ibm.com
      Cc: dvhart@linux.intel.com
      Cc: edumazet@google.com
      Cc: fweisbec@gmail.com
      Cc: jiangshanlai@gmail.com
      Cc: josh@joshtriplett.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: oleg@redhat.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/1460476038-27060-2-git-send-email-paulmck@linux.vnet.ibm.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
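The guard this changelog describes can be sketched in user-space C. This is a simplified model, not the locktorture code: `struct torture_ctx`, `print_stats()` and `torture_cleanup()` are hypothetical names, and only the idea of checking the writer-statistics pointer (cxt.lwsa in the commit) before touching statistics comes from the text above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the test context; only the writer-stats
 * pointer matters here (the commit checks cxt.lwsa). */
struct torture_ctx {
	long *writer_stats;   /* NULL until statistics were allocated */
	int nwriters;
};

/* Sums the per-writer counters; must only run once stats exist. */
static long print_stats(const struct torture_ctx *ctx)
{
	long sum = 0;

	for (int i = 0; i < ctx->nwriters; i++)
		sum += ctx->writer_stats[i];
	return sum;
}

/* Returns the stats total, or -1 when setup never got far enough
 * (bad parameter or failed allocation) and only generic cleanup applies. */
static long torture_cleanup(struct torture_ctx *ctx)
{
	long total;

	if (!ctx->writer_stats)
		return -1;            /* nothing was started: skip statistics */

	total = print_stats(ctx);
	free(ctx->writer_stats);
	ctx->writer_stats = NULL;
	return total;
}
```

Calling `torture_cleanup()` on a context whose allocation never happened returns early instead of dereferencing a NULL statistics pointer.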
    • locking/locktorture: Fix deboosting NULL pointer dereference · 1f190931
      Davidlohr Bueso committed
      For the case of rtmutex torturing we will randomly call into the
      boost() handler, including upon module exiting when the tasks are
      deboosted before stopping. In such cases the task may or may not have
      already been boosted, and therefore the NULL being explicitly passed
      can occur anywhere. Currently we simply assume that the task is
      at a higher prio and, in consequence, dereference a NULL pointer.
      
      This patch fixes the case of a rmmod locktorture exploding while
      pounding on the rtmutex lock (partial trace):
      
       task: ffff88081026cf80 ti: ffff880816120000 task.ti: ffff880816120000
       RIP: 0010:[<ffffffffa05c6185>]  [<ffffffffa05c6185>] torture_random+0x5/0x60 [torture]
       RSP: 0018:ffff880816123eb0  EFLAGS: 00010206
       RAX: ffff88081026cf80 RBX: ffff880816bfa630 RCX: 0000000000160d1b
       RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
       RBP: ffff88081026cf80 R08: 000000000000001f R09: ffff88017c20ca80
       R10: 0000000000000000 R11: 000000000048c316 R12: ffffffffa05d1840
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       FS:  0000000000000000(0000) GS:ffff88203f880000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 0000000001c0a000 CR4: 00000000000406e0
       Stack:
        ffffffffa05d141d ffff880816bfa630 ffffffffa05d1922 ffff88081e70c2c0
        ffff880816bfa630 ffffffff81095fed 0000000000000000 ffffffff8107bf60
        ffff880816bfa630 ffffffff00000000 ffff880800000000 ffff880816123f08
       Call Trace:
        [<ffffffffa05d141d>] torture_rtmutex_boost+0x1d/0x90 [locktorture]
        [<ffffffffa05d1922>] lock_torture_writer+0xe2/0x170 [locktorture]
        [<ffffffff81095fed>] kthread+0xbd/0xe0
        [<ffffffff815cf40f>] ret_from_fork+0x3f/0x70
      
      This patch ensures that if the random state pointer is NULL and current
      is not boosted, then we do nothing.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bobby.prani@gmail.com
      Cc: dhowells@redhat.com
      Cc: dipankar@in.ibm.com
      Cc: dvhart@linux.intel.com
      Cc: edumazet@google.com
      Cc: fweisbec@gmail.com
      Cc: jiangshanlai@gmail.com
      Cc: josh@joshtriplett.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: oleg@redhat.com
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/1460476038-27060-1-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
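A minimal user-space sketch of the decision this changelog describes, assuming a boost handler that receives NULL when the thread is stopping. The names `boost_decision()`, `struct rand_state`, `next_random()` and the `period` parameter are hypothetical; the point is only that the random state is checked before it is used.

```c
#include <stddef.h>

enum action { DO_NOTHING, BOOST, DEBOOST };

/* Hypothetical stand-in for the torture random state. */
struct rand_state { unsigned long seed; };

static unsigned long next_random(struct rand_state *rs)
{
	rs->seed = rs->seed * 1103515245UL + 12345UL;   /* simple LCG */
	return rs->seed >> 8;
}

/*
 * rs == NULL means "force a reset because the thread is stopping".
 * 'period' is an assumed tuning knob: act roughly once per 'period' calls.
 */
static enum action boost_decision(struct rand_state *rs, int boosted,
				  unsigned long period)
{
	if (!boosted) {
		/* Check rs before using it: NULL plus not boosted is a no-op. */
		if (rs && next_random(rs) % period == 0)
			return BOOST;
		return DO_NOTHING;
	}
	/* A boosted task is reset either on request (NULL) or at random. */
	if (!rs || next_random(rs) % period == 0)
		return DEBOOST;
	return DO_NOTHING;
}
```

With this shape, the NULL pointer passed at module exit never reaches the random number generator, whether or not the task happens to be boosted at that moment.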
  2. 04 Apr, 2016 1 commit
  3. 31 Mar, 2016 1 commit
  4. 23 Mar, 2016 1 commit
    • kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov committed
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is a function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts, and instrumentation of some inherently non-deterministic or
      non-interesting parts of the kernel is disabled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on the kernel side.  The complementary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov?  A typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen basic blocks (e.g.  an invalid
      input).  In such a context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
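The cost argument in the "Why not gcov?" paragraph can be illustrated with a small user-space model of a trace-style coverage buffer. This is a sketch under assumptions, not the kcov implementation: the buffer size and the names `trace_pc()`, `coverage_reset()` and `coverage_collect()` are made up for illustration, and the layout (a count in word 0 followed by the recorded PCs) is assumed.

```c
#include <stddef.h>

#define COVER_SIZE 64   /* assumed buffer size in words, for illustration */

/* Word 0 holds the number of recorded PCs; the PCs follow it. */
static unsigned long cover[COVER_SIZE];

/* Model of the per-basic-block hook: append one PC if there is room. */
static void trace_pc(unsigned long pc)
{
	unsigned long n = cover[0];

	if (n + 1 < COVER_SIZE) {
		cover[n + 1] = pc;
		cover[0] = n + 1;
	}
}

/* Step (1): reset is a single store, independent of program size. */
static void coverage_reset(void)
{
	cover[0] = 0;
}

/* Step (3): collect touches only the entries that were recorded. */
static size_t coverage_collect(unsigned long *out, size_t max)
{
	size_t n = cover[0] < max ? cover[0] : max;

	for (size_t i = 0; i < n; i++)
		out[i] = cover[i + 1];
	return n;
}
```

Both reset and collect scale with the number of executed blocks rather than with the roughly 2M blocks in the whole kernel, which is the property the changelog contrasts with gcov.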
  5. 16 Mar, 2016 1 commit
    • tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra committed
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ, do you have any preference on how to
      fix the wq one, or shall we just not care that it's too long?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
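One plausible way such a pattern ends up with an empty `\1` is a declaration that breaks the line right after the comma, so the name group matches nothing. The sketch below checks this with POSIX `regcomp()`/`regexec()`; it assumes a line-by-line match like ctags performs, drops the `\<` word anchor (which is not part of POSIX basic regular expressions), and `per_cpu_name_len()` is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Returns the length of the captured variable name, or -1 if the line
 * does not match at all. A length of 0 corresponds to the "null
 * expansion" case that ctags warns about.
 */
static int per_cpu_name_len(const char *line)
{
	/* POSIX basic regex: "(" is literal, "\(...\)" is the capture group. */
	static const char pattern[] =
		"DEFINE_PER_CPU([^,]*, *\\([[:alnum:]_]*\\)";
	regex_t re;
	regmatch_t m[2];
	int len = -1;

	if (regcomp(&re, pattern, 0) != 0)
		return -1;
	if (regexec(&re, line, 2, m, 0) == 0)
		len = (int)(m[1].rm_eo - m[1].rm_so);
	regfree(&re);
	return len;
}
```

A single-line declaration such as `static DEFINE_PER_CPU(int, counter);` yields a non-empty name, while a line ending in `DEFINE_PER_CPU(struct foo,` matches with an empty group.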
  6. 29 Feb, 2016 7 commits
  7. 12 Feb, 2016 1 commit
    • kernel/locking/lockdep.c: convert hash tables to hlists · 4a389810
      Andrew Morton committed
      Mike said:
      
      : CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
      : kernel with CONFIG_UBSAN_ALIGNMENT fails to load without even any error
      : message.
      :
      : The problem is that ubsan callbacks use spinlocks and might be called
      : before lockdep is initialized.  Particularly this line in the
      : reserve_ebda_region function causes problem:
      :
      : lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
      :
      : If i put lockdep_init() before reserve_ebda_region call in
      : x86_64_start_reservations kernel loads well.
      
      Fix this ordering issue permanently: change lockdep so that it uses
      hlists for the hash tables.  Unlike a list_head, an hlist_head is in its
      initialized state when it is all-zeroes, so lockdep is ready for
      operation immediately upon boot - lockdep_init() need not have run.
      
      The patch will also save some memory.
      
      lockdep_init() and lockdep_initialized can be done away with now - a 4.6
      patch has been prepared to do this.
      Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Suggested-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
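The difference between the two head types can be shown with minimal user-space copies of their layouts. This is a sketch, assuming NULL is represented by all-zero bits; the helper names `list_is_empty()`, `hlist_is_empty()` and `hlist_add()` are illustrative and are not the kernel helpers themselves.

```c
#include <stddef.h>

/* Minimal user-space copies of the two list head layouts. */
struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* A list_head is empty only when it points at itself. */
static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* An hlist_head is empty when its first pointer is NULL. */
static int hlist_is_empty(const struct hlist_head *h)
{
	return h->first == NULL;
}

/* Insert at the head of an hlist; works on a zero-filled head. */
static void hlist_add(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}
```

A zero-filled `hlist_head` already reads as a valid empty list, whereas a zero-filled `list_head` does not point at itself and needs an explicit initialization step before use, which is the ordering dependency the patch removes.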
  8. 09 Feb, 2016 3 commits
    • locking/lockdep: Eliminate lockdep_init() · 06bea3db
      Andrey Ryabinin committed
      Lockdep is initialized at compile time now.  Get rid of lockdep_init().
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Krinkin <krinkin.m.u@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: mm-commits@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Convert hash tables to hlists · a63f38cc
      Andrew Morton committed
      Mike said:
      
      : CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
      : kernel with CONFIG_UBSAN_ALIGNMENT=y fails to load without even any error
      : message.
      :
      : The problem is that ubsan callbacks use spinlocks and might be called
      : before lockdep is initialized.  Particularly this line in the
      : reserve_ebda_region function causes problem:
      :
      : lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
      :
      : If i put lockdep_init() before reserve_ebda_region call in
      : x86_64_start_reservations kernel loads well.
      
      Fix this ordering issue permanently: change lockdep so that it uses hlists
      for the hash tables.  Unlike a list_head, an hlist_head is in its
      initialized state when it is all-zeroes, so lockdep is ready for operation
      immediately upon boot - lockdep_init() need not have run.
      
      The patch will also save some memory.
      
      Probably lockdep_init() and lockdep_initialized can be done away with now.
      Suggested-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: mm-commits@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Fix stack trace caching logic · 8a5fd564
      Dmitry Vyukov committed
      check_prev_add() caches saved stack trace in static trace variable
      to avoid duplicate save_trace() calls in dependencies involving trylocks.
      But that caching logic contains a bug. We may not save trace on first
      iteration due to early return from check_prev_add(). Then on the
      second iteration when we actually need the trace we don't save it
      because we think that we've already saved it.
      
      Let check_prev_add() itself control when stack is saved.
      
      There is another bug. The trace variable is protected by the graph lock,
      but we can temporarily release the graph lock during printing.
      
      Fix this by invalidating the cached stack trace when we release the graph lock.
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: glider@google.com
      Cc: kcc@google.com
      Cc: peter@hurleysoftware.com
      Cc: sasha.levin@oracle.com
      Link: http://lkml.kernel.org/r/1454593240-121647-1-git-send-email-dvyukov@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
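Both fixes in this changelog can be modelled with a flag that the callee alone controls. The sketch below is a hypothetical reduction: `check_dep()`, `save_trace()` and the `skip` / `drops_lock` parameters are illustrative stand-ins for the early return and for the path that temporarily releases the graph lock.

```c
#include <stdbool.h>

static int save_calls;        /* counts how often the trace was captured */

static void save_trace(void)
{
	save_calls++;
}

/*
 * Hypothetical model of one dependency check. 'skip' stands for the early
 * return (nothing to record), 'drops_lock' for a path that temporarily
 * releases the lock protecting the cached trace.
 */
static void check_dep(bool skip, bool drops_lock, bool *trace_saved)
{
	if (skip)
		return;                 /* cache state is left untouched */

	if (!*trace_saved) {
		save_trace();
		*trace_saved = true;    /* set only after a real save */
	}

	if (drops_lock)
		*trace_saved = false;   /* cached trace may be stale now */
}
```

Because the flag is only set after an actual save, an early return on the first iteration can no longer trick the second iteration into skipping the save, and a lock release forces the next caller to capture a fresh trace.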
  9. 26 Jan, 2016 1 commit
    • rtmutex: Make wait_lock irq safe · b4abf910
      Thomas Gleixner committed
      Sasha reported a lockdep splat about a potential deadlock between RCU boosting
      rtmutex and the posix timer it_lock.
      
      CPU0					CPU1
      
      rtmutex_lock(&rcu->rt_mutex)
        spin_lock(&rcu->rt_mutex.wait_lock)
      					local_irq_disable()
      					spin_lock(&timer->it_lock)
      					spin_lock(&rcu->rt_mutex.wait_lock)
      --> Interrupt
          spin_lock(&timer->it_lock)
      
      This is caused by the following code sequence on CPU1
      
           rcu_read_lock()
           x = lookup();
           if (x)
           	spin_lock_irqsave(&x->it_lock);
           rcu_read_unlock();
           return x;
      
      We could fix that in the posix timer code by keeping rcu read locked across
      the spinlocked and irq disabled section, but the above sequence is common and
      there is no reason not to support it.
      
      Making rt_mutex.wait_lock irq safe prevents the deadlock.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
  10. 18 Dec, 2015 1 commit
  11. 04 Dec, 2015 4 commits
    • locking/pvqspinlock: Queue node adaptive spinning · cd0272fa
      Waiman Long committed
      In an overcommitted guest where some vCPUs have to be halted to make
      forward progress in other areas, it is highly likely that a vCPU later
      in the spinlock queue will be spinning while the ones earlier in the
      queue would have been halted. The spinning in the later vCPUs is then
      just a waste of precious CPU cycles because they are not going to
      get the lock soon as the earlier ones have to be woken up and take
      their turn to get the lock.
      
      This patch implements an adaptive spinning mechanism where the vCPU
      will call pv_wait() if the previous vCPU is not running.
      
      Linux kernel builds were run in KVM guest on an 8-socket, 4
      cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
      Haswell-EX system. Both systems are configured to have 32 physical
      CPUs. The kernel build times before and after the patch were:
      
                          Westmere                    Haswell
        Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
        -----         --------    --------    --------    --------
        Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
        After patch    3m03.0s     4m37.5s     1m43.0s     2m47.2s
      
      For 32 vCPUs, this patch doesn't cause any noticeable change in
      performance. For 48 vCPUs (over-committed), there is about 8%
      performance improvement.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/pvqspinlock: Allow limited lock stealing · 1c4941fd
      Waiman Long committed
      This patch allows one attempt for the lock waiter to steal the lock
      when entering the PV slowpath. To prevent lock starvation, the pending
      bit will be set by the queue head vCPU when it is in the active lock
      spinning loop to disable any lock stealing attempt.  This helps to
      reduce the performance penalty caused by lock waiter preemption while
      not having much of the downsides of a real unfair lock.
      
      The pv_wait_head() function was renamed as pv_wait_head_or_lock()
      as it was modified to acquire the lock before returning. This is
      necessary because of possible lock stealing attempts from other tasks.
      
      Linux kernel builds were run in KVM guest on an 8-socket, 4
      cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
      Haswell-EX system. Both systems are configured to have 32 physical
      CPUs. The kernel build times before and after the patch were:
      
                          Westmere                    Haswell
        Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
        -----         --------    --------    --------    --------
        Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
        After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s
      
      For the overcommitted case (48 vCPUs), this patch is able to reduce
      kernel build time by more than 54% for Westmere and 44% for Haswell.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
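The quoted percentages follow directly from the 48-vCPU columns of the table. A small worked check, with `to_seconds()` and `percent_reduction()` as illustrative helper names:

```c
/* Convert an "XmY.Zs" wall-clock time to seconds. */
static double to_seconds(int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Percentage by which 'after' is shorter than 'before'. */
static double percent_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

For Westmere, 10m56.1s to 5m00.2s is (656.1 - 300.2) / 656.1, about 54.2%; for Haswell, 5m29.1s to 3m03.5s is (329.1 - 183.5) / 329.1, about 44.2%.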
    • locking/pvqspinlock: Collect slowpath lock statistics · 45e898b7
      Waiman Long committed
      This patch enables the accumulation of kicking and waiting related
      PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
      option is selected. It also enables the collection of data which
      enables us to calculate the kicking and wakeup latencies, which have
      a heavy dependency on the CPUs being used.
      
      The statistical counters are per-cpu variables to minimize the
      performance overhead in their updates. These counters are exported
      via the debugfs filesystem under the qlockstat directory.  When the
      corresponding debugfs files are read, summation and computing of the
      required data are then performed.
      
      The measured latencies for different CPUs are:
      
      	CPU		Wakeup		Kicking
      	---		------		-------
      	Haswell-EX	63.6us		 7.4us
      	Westmere-EX	67.6us		 9.3us
      
      The measured latencies varied a bit from run-to-run. The wakeup
      latency is much higher than the kicking latency.
      
      A sample of statistical counters after system bootup (with vCPU
      overcommit) was:
      
      	pv_hash_hops=1.00
      	pv_kick_unlock=1148
      	pv_kick_wake=1146
      	pv_latency_kick=11040
      	pv_latency_wake=194840
      	pv_spurious_wakeup=7
      	pv_wait_again=4
      	pv_wait_head=23
      	pv_wait_node=1129
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
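The "per-cpu counters, summed on read" design can be sketched in plain C. This is a user-space model under assumptions: `NR_CPUS`, the `stat_id` names, and the `stat_inc()` / `stat_read()` helpers are hypothetical, and a real implementation would use the kernel's per-cpu infrastructure rather than a two-dimensional array.

```c
#define NR_CPUS 4            /* assumed CPU count for the sketch */

enum stat_id { STAT_KICK, STAT_WAIT, NR_STATS };

/* One row of counters per CPU, so an update never touches another
 * CPU's counters. */
static unsigned long stats[NR_CPUS][NR_STATS];

/* Hot path: bump the counter that belongs to the calling CPU. */
static void stat_inc(int cpu, enum stat_id id)
{
	stats[cpu][id]++;
}

/* Read path: the summation happens only when the value is requested. */
static unsigned long stat_read(enum stat_id id)
{
	unsigned long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += stats[cpu][id];
	return sum;
}
```

The update stays a single local increment, and the comparatively expensive walk over all CPUs is paid only by whoever reads the statistics file.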
    • locking, sched: Introduce smp_cond_acquire() and use it · b3e0b1b6
      Peter Zijlstra committed
      Introduce smp_cond_acquire() which combines a control dependency and a
      read barrier to form acquire semantics.
      
      This primitive has two benefits:
      
       - it documents control dependencies,
       - it's typically cheaper than using smp_load_acquire() in a loop.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
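A rough user-space analogue of the idea, written with C11 atomics rather than the kernel primitives: spin on relaxed loads and issue a single acquire fence once the condition holds. The names `publish()`, `wait_and_read()`, `ready` and `payload` are hypothetical, and C11 does not give control dependencies any ordering by themselves, so the fence carries the guarantee here.

```c
#include <stdatomic.h>

static atomic_int ready;     /* flag published by the producer */
static int payload;          /* plain data guarded by the flag */

/* Producer: write the data, then publish the flag with release order. */
static void publish(int value)
{
	payload = value;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/*
 * Consumer: spin on relaxed loads, then issue one acquire fence after the
 * loop exits. This avoids paying for an acquire load on every iteration.
 */
static int wait_and_read(void)
{
	while (!atomic_load_explicit(&ready, memory_order_relaxed))
		;                                   /* busy-wait */
	atomic_thread_fence(memory_order_acquire);
	return payload;
}
```

The loop body stays as cheap as a plain load, and the ordering cost is paid once, after the wait, which mirrors the second benefit listed in the changelog.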
  12. 23 Nov, 2015 5 commits
    • locking/pvqspinlock, x86: Optimize the PV unlock code path · d7804530
      Waiman Long committed
      The unlock function in queued spinlocks was optimized for better
      performance on bare metal systems at the expense of virtualized guests.
      
      For x86-64 systems, the unlock call needs to go through a
      PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
      registers before calling the real __pv_queued_spin_unlock()
      function. The thunk code may also be in a separate cacheline from
      __pv_queued_spin_unlock().
      
      This patch optimizes the PV unlock code path by:
      
       1) Moving the unlock slowpath code from the fastpath into a separate
          __pv_queued_spin_unlock_slowpath() function to make the fastpath
          as simple as possible.
      
       2) For x86-64, hand-coded an assembly function to combine the register
          saving thunk code with the fastpath code. Only registers that
          are used in the fastpath will be saved and restored. If the
          fastpath fails, the slowpath function will be called via another
          PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
          __pv_queued_spin_unlock() code as the thunk saves and restores
          only one 32-bit register.
      
      With a microbenchmark of 5M lock-unlock loop, the table below shows
      the execution times before and after the patch with different number
      of threads in a VM running on a 32-core Westmere-EX box with x86-64
      4.2-rc1 based kernels:
      
        Threads	Before patch	After patch	% Change
        -------	------------	-----------	--------
           1		   134.1 ms	  119.3 ms	  -11%
           2		   1286  ms	   953  ms	  -26%
           3		   3715  ms	  3480  ms	  -6.3%
           4		   4092  ms	  3764  ms	  -8.0%
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
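The fastpath/slowpath split in item 1 can be sketched with C11 atomics. This is an illustrative model only: the constants `LOCKED_VAL` and `SLOW_VAL`, the one-byte lock word, and the functions `unlock()` and `unlock_slowpath()` are assumptions, and the hand-coded assembly from item 2 is not modelled.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCKED_VAL 1u        /* lock held, no halted waiter */
#define SLOW_VAL   3u        /* assumed marker: a waiter must be kicked */

static int slowpath_calls;

/* Out-of-line slow path: release the lock and count the wakeup. */
static void unlock_slowpath(_Atomic uint8_t *locked)
{
	atomic_store_explicit(locked, 0, memory_order_release);
	slowpath_calls++;        /* stands in for kicking the waiter */
}

/* Fast path: one compare-and-swap; anything unexpected goes slow. */
static void unlock(_Atomic uint8_t *locked)
{
	uint8_t expected = LOCKED_VAL;

	if (atomic_compare_exchange_strong_explicit(locked, &expected, 0,
			memory_order_release, memory_order_relaxed))
		return;
	unlock_slowpath(locked);
}
```

Keeping the common case down to one compare-and-swap and a return is what makes it practical to fold the register-saving code into the fast path.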
    • locking/qspinlock: Avoid redundant read of next pointer · aa68744f
      Waiman Long committed
      With optimistic prefetch of the next node cacheline, the next pointer
      may have been properly initialized. As a result, the reading
      of node->next in the contended path may be redundant. This patch
      eliminates the redundant read if the next pointer value is not NULL.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Prefetch the next node cacheline · 81b55986
      Waiman Long committed
      A queue head CPU, after acquiring the lock, will have to notify
      the next CPU in the wait queue that it has become the new queue
      head. This involves loading a new cacheline from the MCS node of the
      next CPU. That operation can be expensive and add to the latency of
      the locking operation.
      
      This patch adds code to optimistically prefetch the next MCS node
      cacheline if the next pointer is defined and it has been spinning
      for the MCS lock for a while. This reduces the locking latency and
      improves the system throughput.
      
      The performance change will depend on whether the prefetch overhead
      can be hidden within the latency of the lock spin loop. On really
      short critical sections, there may not be any performance gain at all. With
      longer critical sections, however, it was found to have a performance
      boost of 5-10% over a range of different queue depths with a spinlock
      loop microbenchmark.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
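A single-threaded sketch of the prefetch idea, which also shows how a pointer read during the wait can be reused at hand-off time. The names `struct mcs_node`, `peek_and_prefetch()` and `hand_off()` are hypothetical; `__builtin_prefetch` is a GCC/Clang builtin and is only a hint, and a real lock would need atomic or volatile accesses that are omitted here.

```c
#include <stddef.h>

/* Minimal MCS-style queue node. */
struct mcs_node {
	struct mcs_node *next;
	int locked;
};

/*
 * While waiting, read the successor pointer once and, if it is already
 * set, hint the CPU to pull that node's cacheline in for writing.
 */
static struct mcs_node *peek_and_prefetch(struct mcs_node *node)
{
	struct mcs_node *next = node->next;

	if (next)
		__builtin_prefetch(next, 1);    /* 1 = prepare for a write */
	return next;
}

/* Hand-off: reuse the cached pointer, re-reading only if it was NULL. */
static struct mcs_node *hand_off(struct mcs_node *node, struct mcs_node *next)
{
	if (!next)
		next = node->next;      /* successor may have queued since */
	if (next)
		next->locked = 1;       /* pass queue-head status on */
	return next;
}
```

Whether the hint pays off depends, as the changelog notes, on whether the prefetch latency can be hidden inside the spin loop.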
    • locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg() · 64d816cb
      Waiman Long committed
      This patch replaces the cmpxchg() and xchg() calls in the native
      qspinlock code with the more relaxed _acquire or _release versions of
      those calls to enable other architectures to adopt queued spinlocks
      with less memory barrier performance overhead.
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
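The same relaxation can be expressed with C11 atomics, shown here on a trivial test-and-set lock rather than on a queued spinlock. `try_lock()` and `release_lock()` are hypothetical names; the sketch only illustrates asking for acquire ordering when taking a lock and release ordering when dropping it, instead of a full barrier on both sides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take the lock: acquire ordering is enough on the locking side. */
static bool try_lock(atomic_int *lock)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(lock, &expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

/* Drop the lock: release ordering is enough on the unlocking side. */
static void release_lock(atomic_int *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}
```

On strongly ordered machines the weaker orderings tend to make little difference, while weakly ordered architectures can emit cheaper barriers, which is the motivation the changelog gives.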
    • P
      treewide: Remove old email address · 90eec103
      Peter Zijlstra authored
      There were still a number of references to my old Red Hat email
      address in the kernel source. Remove these while keeping the
      Red Hat copyright notices intact.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90eec103
  13. 07 Nov, 2015 1 commit
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
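The flag semantics described in commit d0164adc can be sketched in userspace C. This is a model, not kernel code: the bit values are illustrative assumptions, and the exact composition of `GFP_ATOMIC` and `GFP_NOWAIT` shown here is assumed rather than quoted from the patch (the message only states that `GFP_ATOMIC` uses `__GFP_ATOMIC`). The helper is assumed to test `__GFP_DIRECT_RECLAIM`, which is what the message implies. The point is why testing for `__GFP_WAIT` can now give a false positive while `gfpflags_allow_blocking()` does not.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Names mirror the commit message; bit values are assumptions. */
#define __GFP_HIGH		((gfp_t)0x01u)	/* high priority reserve */
#define __GFP_ATOMIC		((gfp_t)0x02u)	/* truly atomic, no fallback */
#define __GFP_DIRECT_RECLAIM	((gfp_t)0x04u)	/* caller may sleep and reclaim */
#define __GFP_KSWAPD_RECLAIM	((gfp_t)0x08u)	/* caller wants kswapd woken */

/* Redefined by the patch: willing to direct reclaim AND wake kswapd. */
#define __GFP_WAIT	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

/* Assumed compositions, for illustration only. */
#define GFP_ATOMIC	(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

/* Atomic callers must never be allowed to enter direct reclaim. */
static_assert((GFP_ATOMIC & __GFP_DIRECT_RECLAIM) == 0,
	      "GFP_ATOMIC must not allow direct reclaim");

/* A caller may block only if it is willing to enter direct reclaim. */
static inline bool gfpflags_allow_blocking(gfp_t gfp_flags)
{
	return (gfp_flags & __GFP_DIRECT_RECLAIM) != 0;
}

/*
 * The historical check. __GFP_WAIT now spans two bits, so this is
 * true for any mask that merely wakes kswapd: a false positive.
 */
static inline bool legacy_wait_check(gfp_t gfp_flags)
{
	return (gfp_flags & __GFP_WAIT) != 0;
}

/* Mempool-style caller: keep the kswapd wakeup, drop direct reclaim. */
static inline gfp_t mempool_style_mask(gfp_t gfp_flags)
{
	return gfp_flags & ~__GFP_DIRECT_RECLAIM;
}
```

With these definitions, `legacy_wait_check(GFP_NOWAIT)` is true even though the caller cannot sleep, while `gfpflags_allow_blocking(GFP_NOWAIT)` is false.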
  14. 07 Oct, 2015 8 commits
  15. 06 Oct, 2015 3 commits