1. 14 Jan, 2017 9 commits
    • locking/ww_mutex: Remove the __ww_mutex_lock*() inline wrappers · c5470b22
      Nicolai Hähnle committed
      Keep the documentation in the header file since there is no good place
      for it in mutex.c: there are two rather different implementations with
      different EXPORT_SYMBOLs for each function.
      Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maarten Lankhorst <dev@mblankhorst.nl>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dri-devel@lists.freedesktop.org
      Link: http://lkml.kernel.org/r/1482346000-9927-6-git-send-email-nhaehnle@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/ww_mutex: Set use_ww_ctx even when locking without a context · ea9e0fb8
      Nicolai Hähnle committed
      We will add a new field to struct mutex_waiter.  This field must be
      initialized for all waiters if any waiter uses the use_ww_ctx path.
      
      So there is a trade-off: Keep ww_mutex locking without a context on
      the faster non-use_ww_ctx path, at the cost of adding the
      initialization to all mutex locks (including non-ww_mutexes), or avoid
      the additional cost for non-ww_mutex locks, at the cost of adding
      additional checks to the use_ww_ctx path.
      
      We take the latter choice.  It may be worth eliminating the users of
      ww_mutex_lock(lock, NULL), but there are a lot of them.
      Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maarten Lankhorst <dev@mblankhorst.nl>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dri-devel@lists.freedesktop.org
      Link: http://lkml.kernel.org/r/1482346000-9927-5-git-send-email-nhaehnle@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/ww_mutex: Extract stamp comparison to __ww_mutex_stamp_after() · 3822da3e
      Nicolai Hähnle committed
      The function will be re-used in subsequent patches.
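
      For illustration only, a wraparound-safe stamp comparison of the kind
      such a helper could factor out is sketched below in userspace C. The
      struct acquire_ctx type and the stamp_after() name are hypothetical
      stand-ins, not the kernel's definitions, and the sketch ignores how
      ties between equal stamps might be broken.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an acquire context; only the stamp matters here. */
struct acquire_ctx {
	unsigned long stamp;
};

/*
 * Return true if context a was stamped after context b.
 * The subtraction is done in unsigned arithmetic, so the result stays
 * correct when the stamp counter wraps around, provided the two stamps
 * are no more than LONG_MAX apart.
 */
static bool stamp_after(const struct acquire_ctx *a, const struct acquire_ctx *b)
{
	assert(a != NULL && b != NULL);

	unsigned long delta = a->stamp - b->stamp;

	return delta != 0 && delta <= LONG_MAX;
}
```

      A context stamped 1 compares as "after" a context stamped ULONG_MAX,
      because the unsigned difference is small even though the counter
      wrapped in between.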
      Signed-off-by: Nicolai Hähnle <Nicolai.Haehnle@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maarten Lankhorst <dev@mblankhorst.nl>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dri-devel@lists.freedesktop.org
      Link: http://lkml.kernel.org/r/1482346000-9927-4-git-send-email-nhaehnle@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/mutex: Fix mutex handoff · e274795e
      Peter Zijlstra committed
      While reviewing the ww_mutex patches, I noticed that it was still
      possible to (incorrectly) succeed for (incorrect) code like:
      
      	mutex_lock(&a);
      	mutex_lock(&a);
      
      This was possible if the second mutex_lock() would block (as expected)
      but then receive a spurious wakeup. At that point it would find itself
      at the front of the queue, request a handoff and instantly claim
      ownership and continue, since owner would point to itself.
      
      Avoid this scenario and simplify the code by introducing a third low
      bit to signal handoff pickup. So once we request handoff, unlock
      clears the handoff bit and sets the pickup bit along with the new
      owner.
      
      This also removes the need for the .handoff argument to
      __mutex_trylock(), since that becomes superfluous with PICKUP.
      
      In order to guarantee enough low bits, ensure task_struct alignment is
      at least L1_CACHE_BYTES (which seems a good idea regardless).
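
      The idea of a separate pickup bit can be sketched in userspace C as
      follows. This is a simplified, non-atomic model under stated
      assumptions: the flag values, the struct task type, its 64-byte
      alignment, and the helper names are all hypothetical, and the real
      code operates on an atomic owner word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the owner word; the values are illustrative. */
#define FLAG_WAITERS 0x01UL /* waiters are queued */
#define FLAG_HANDOFF 0x02UL /* the first waiter requested a handoff */
#define FLAG_PICKUP  0x04UL /* unlock handed the lock to a new owner */
#define FLAG_MASK    0x07UL

/* Hypothetical task type, aligned so its low three address bits are zero. */
struct task {
	_Alignas(64) int id;
};

static struct task *owner_task(uintptr_t owner)
{
	return (struct task *)(owner & ~FLAG_MASK);
}

/* Unlock side of a handoff: clear HANDOFF, set PICKUP, install the new owner. */
static uintptr_t handoff_to(uintptr_t owner, struct task *next)
{
	uintptr_t flags = owner & FLAG_MASK;

	assert(flags & FLAG_HANDOFF);
	flags &= ~FLAG_HANDOFF;
	flags |= FLAG_PICKUP;
	return (uintptr_t)next | flags;
}

/*
 * Trylock side: succeed if the lock is free, or if it was explicitly
 * handed to curr and is waiting to be picked up. Merely seeing curr in
 * the owner field is not enough, which is what rejects a double lock.
 */
static bool try_acquire(uintptr_t *owner, struct task *curr)
{
	uintptr_t flags = *owner & FLAG_MASK;
	struct task *task = owner_task(*owner);

	if (task) {
		if (!(flags & FLAG_PICKUP) || task != curr)
			return false;
		flags &= ~FLAG_PICKUP;
	}
	*owner = (uintptr_t)curr | flags;
	return true;
}
```

      In this model a task that already owns the lock cannot acquire it
      again after a spurious wakeup, because no unlock has set the pickup
      bit on its behalf.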
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 9d659ae1 ("locking/mutex: Add lock handoff to avoid starvation")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/percpu-rwsem: Replace waitqueue with rcuwait · 52b94129
      Davidlohr Bueso committed
      The use of any kind of wait queue is overkill for pcpu-rwsems.
      While one option would be to use the less heavy simple (swait)
      flavor, this is still too much for what pcpu-rwsems needs. For one,
      we do not care about any sort of queuing in that the only (rare) time
      writers (and readers, for that matter) are queued is when trying to
      acquire the regular contended rw_sem. There cannot be any further
      queuing as writers are serialized by the rw_sem in the first place.
      
      Given that percpu_down_write() must not be called after exit_notify(),
      we can replace the bulky waitqueue with rcuwait such that a writer
      can wait for its turn to take the lock. As such, we can avoid the
      queue handling and locking overhead.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1484148146-14210-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait, RCU: Introduce rcuwait machinery · 8f95c90c
      Davidlohr Bueso committed
      rcuwait provides support for (single) RCU-safe task wait/wake functionality,
      with the caveat that it must not be called after exit_notify(), such that
      we avoid racing with rcu delayed_put_task_struct callbacks, task_struct
      being rcu unaware in this context -- for which we similarly have
      task_rcu_dereference() magic, but with different return semantics, which
      can conflict with the wakeup side.
      
      The interfaces are quite straightforward:
      
        rcuwait_wait_event()
        rcuwait_wake_up()
      
      More details are in the comments, but it's perhaps worth mentioning at least,
      that users must provide proper serialization when waiting on a condition, and
      avoid corrupting a concurrent waiter. Care must also be taken with the
      ordering between the task and the condition when calling the wakeup,
      since wakeups must not be missed. When porting users, note that this is,
      for example, a given with waitqueues, where everything is done under the
      q->lock. As such, rcuwait can remove sources of non-preemptible unbounded
      work for realtime.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1484148146-14210-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove set_task_state() · 642fa448
      Davidlohr Bueso committed
      This is a nasty interface and setting the state of a foreign task must
      not be done. As of the following commit:
      
        be628be0 ("bcache: Make gc wakeup sane, remove set_task_state()")
      
      ... everyone in the kernel calls set_task_state() with current, allowing
      the helper to be removed.
      
      However, as the comment indicates, it is still around for those archs
      where computing current is more expensive than using a pointer, at least
      in theory. An important arch that is affected is arm64, however this has
      been addressed now [1] and performance is up to par making no difference
      with either calls.
      
      Of all the callers, if any, it's the locking bits that would care most
      about this -- ie: we end up passing a tsk pointer to a lot of the lock
      slowpath, and setting ->state on that. The following numbers are based
      on two tests. The first is a custom ad-hoc microbenchmark that just
      measures latencies (for ~65 million calls) of set_task_state() vs
      set_current_state().
      
      The second, for a higher-level overview, is an unlink microbenchmark
      that pounds on a single file with open, close, unlink combos with
      increasing thread counts (up to 4x ncpus). While the workload is quite
      unrealistic, it does contend a lot on the inode mutex (now an rwsem).
      
      [1] https://lkml.kernel.org/r/1483468021-8237-1-git-send-email-mark.rutland@arm.com
      
      == 1. x86-64 ==
      
      Avg runtime set_task_state():    601 msecs
      Avg runtime set_current_state(): 552 msecs
      
                                                  vanilla                 dirty
      Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
      Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
      Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
      Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
      Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
      Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
      Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
      Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
      Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
      Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
      Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
      Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
      Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)
      
      == 2. ppc64le ==
      
      Avg runtime set_task_state():    938 msecs
      Avg runtime set_current_state(): 940 msecs
      
                                                  vanilla                 dirty
      Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
      Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
      Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
      Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
      Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
      Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
      Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
      Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
      Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
      Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
      Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
      Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
      Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
      Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
      Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
      Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)
      
      x86-64 (known to be fast for get_current()/this_cpu_read_stable() caching)
      and ppc64 (with paca) show similar improvements in the unlink microbenches.
      The small delta for ppc64 (2ms) does not represent the gains on the unlink
      runs. In the case of x86, there was a decent amount of variation in the
      latency runs (but always within a 20 to 50ms increase); ppc was more constant.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: mark.rutland@arm.com
      Link: http://lkml.kernel.org/r/1483479794-14013-5-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • kernel/locking: Compute 'current' directly · d269a8b8
      Davidlohr Bueso committed
      This patch effectively replaces the tsk pointer dereference
      (which is obviously == current), to directly use get_current()
      macro. This is to make the removal of setting foreign task
      states smoother and painfully obvious. Performance win on some
      archs such as x86-64 and ppc64. On a microbenchmark that calls
      set_task_state() vs set_current_state() and an inode rwsem
      pounding benchmark doing unlink:
      
      == 1. x86-64 ==
      
      Avg runtime set_task_state():    601 msecs
      Avg runtime set_current_state(): 552 msecs
      
                                                  vanilla                 dirty
      Hmean    unlink1-processes-2      36089.26 (  0.00%)    38977.33 (  8.00%)
      Hmean    unlink1-processes-5      28555.01 (  0.00%)    29832.55 (  4.28%)
      Hmean    unlink1-processes-8      37323.75 (  0.00%)    44974.57 ( 20.50%)
      Hmean    unlink1-processes-12     43571.88 (  0.00%)    44283.01 (  1.63%)
      Hmean    unlink1-processes-21     34431.52 (  0.00%)    38284.45 ( 11.19%)
      Hmean    unlink1-processes-30     34813.26 (  0.00%)    37975.17 (  9.08%)
      Hmean    unlink1-processes-48     37048.90 (  0.00%)    39862.78 (  7.59%)
      Hmean    unlink1-processes-79     35630.01 (  0.00%)    36855.30 (  3.44%)
      Hmean    unlink1-processes-110    36115.85 (  0.00%)    39843.91 ( 10.32%)
      Hmean    unlink1-processes-141    32546.96 (  0.00%)    35418.52 (  8.82%)
      Hmean    unlink1-processes-172    34674.79 (  0.00%)    36899.21 (  6.42%)
      Hmean    unlink1-processes-203    37303.11 (  0.00%)    36393.04 ( -2.44%)
      Hmean    unlink1-processes-224    35712.13 (  0.00%)    36685.96 (  2.73%)
      
      == 2. ppc64le ==
      
      Avg runtime set_task_state():    938 msecs
      Avg runtime set_current_state(): 940 msecs
      
                                                  vanilla                 dirty
      Hmean    unlink1-processes-2      19269.19 (  0.00%)    30704.50 ( 59.35%)
      Hmean    unlink1-processes-5      20106.15 (  0.00%)    21804.15 (  8.45%)
      Hmean    unlink1-processes-8      17496.97 (  0.00%)    17243.28 ( -1.45%)
      Hmean    unlink1-processes-12     14224.15 (  0.00%)    17240.21 ( 21.20%)
      Hmean    unlink1-processes-21     14155.66 (  0.00%)    15681.23 ( 10.78%)
      Hmean    unlink1-processes-30     14450.70 (  0.00%)    15995.83 ( 10.69%)
      Hmean    unlink1-processes-48     16945.57 (  0.00%)    16370.42 ( -3.39%)
      Hmean    unlink1-processes-79     15788.39 (  0.00%)    14639.27 ( -7.28%)
      Hmean    unlink1-processes-110    14268.48 (  0.00%)    14377.40 (  0.76%)
      Hmean    unlink1-processes-141    14023.65 (  0.00%)    16271.69 ( 16.03%)
      Hmean    unlink1-processes-172    13417.62 (  0.00%)    16067.55 ( 19.75%)
      Hmean    unlink1-processes-203    15293.08 (  0.00%)    15440.40 (  0.96%)
      Hmean    unlink1-processes-234    13719.32 (  0.00%)    16190.74 ( 18.01%)
      Hmean    unlink1-processes-265    16400.97 (  0.00%)    16115.22 ( -1.74%)
      Hmean    unlink1-processes-296    14388.60 (  0.00%)    16216.13 ( 12.70%)
      Hmean    unlink1-processes-320    15771.85 (  0.00%)    15905.96 (  0.85%)
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: mark.rutland@arm.com
      Link: http://lkml.kernel.org/r/1483479794-14013-4-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • kernel/exit: Compute 'current' directly · 0039962a
      Davidlohr Bueso committed
      This patch effectively replaces the tsk pointer dereference (which is
      obviously == current), to directly use get_current() macro. In this
      case, do_exit() always passes current to exit_mm(), hence we can
      simply get rid of the argument. This is also a performance win on some
      archs such as x86-64 and ppc64 -- arm64 is no longer an issue.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: mark.rutland@arm.com
      Link: http://lkml.kernel.org/r/1483479794-14013-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 12 Jan, 2017 2 commits
  3. 11 Jan, 2017 3 commits
  4. 04 Jan, 2017 1 commit
    • audit: Fix sleep in atomic · be29d20f
      Jan Kara committed
      Audit tree code was happily adding new notification marks while holding
      spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
      lead to sleeping while holding a spinlock, deadlocks due to lock
      inversion, and probably other fun. Fix the problem by acquiring
      group->mark_mutex earlier.
      
      CC: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  5. 27 Dec, 2016 1 commit
  6. 26 Dec, 2016 2 commits
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner committed
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
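
      Why the zero-seconds case reduces to an assignment can be seen in a
      small userspace model. This sketch assumes ktime_t is a plain signed
      64-bit nanosecond count; the ktime_set() below is a simplified
      stand-in that does no overflow handling, and the two timeout_*()
      helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Assumption: ktime_t is a scalar nanosecond count. */
typedef int64_t ktime_t;

static_assert(sizeof(ktime_t) == 8, "ktime_t is modelled as 64 bits");

/* Simplified model of ktime_set(): combine seconds and nanoseconds. */
static ktime_t ktime_set(int64_t secs, unsigned long nsecs)
{
	return secs * NSEC_PER_SEC + (int64_t)nsecs;
}

/* With zero seconds, the helper adds nothing to its nanosecond argument... */
static ktime_t timeout_with_helper(unsigned long nsecs)
{
	return ktime_set(0, nsecs);
}

/* ...so a direct assignment yields the same value. */
static ktime_t timeout_direct(unsigned long nsecs)
{
	return (ktime_t)nsecs;
}
```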
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Get rid of the union · 2456e855
      Thomas Gleixner committed
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec
      variant for 32-bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      became completely pointless.
      
      Get rid of the union and just keep ktime_t as a simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
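
      The shape of the conversion can be sketched as follows. The names
      union ktime_old, ktime_new, old_add_ns() and new_add_ns() are
      hypothetical and exist only to put the two forms side by side in one
      compilable unit; int64_t stands in for s64.

```c
#include <assert.h>
#include <stdint.h>

/* Before: a union with a single scalar member, accessed as kt.tv64. */
union ktime_old {
	int64_t tv64;
};

/* After: a plain typedef, so values are used directly. */
typedef int64_t ktime_new;

static_assert(sizeof(union ktime_old) == sizeof(ktime_new),
	      "both forms have the same representation");

/* Old style: every access goes through the union member. */
static int64_t old_add_ns(union ktime_old kt, int64_t ns)
{
	kt.tv64 += ns;
	return kt.tv64;
}

/* New style: ordinary integer arithmetic on the scalar. */
static ktime_new new_add_ns(ktime_new kt, int64_t ns)
{
	return kt + ns;
}
```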
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  7. 25 Dec, 2016 4 commits
  8. 24 Dec, 2016 1 commit
    • fsnotify: Remove fsnotify_duplicate_mark() · e3ba7307
      Jan Kara committed
      There are only two call sites of fsnotify_duplicate_mark(). Those are
      in kernel/audit_tree.c and both are bogus. Vfsmount pointer is unused
      for audit tree, inode pointer and group gets set in
      fsnotify_add_mark_locked() later anyway, mask and free_mark are already
      set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
      actively harmful because following fsnotify_add_mark_locked() will leak
      group reference by overwriting the group pointer. So just remove the two
      calls to fsnotify_duplicate_mark() and the function.
      Signed-off-by: Jan Kara <jack@suse.cz>
      [PM: line wrapping to fit in 80 chars]
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  9. 23 Dec, 2016 1 commit
    • move aio compat to fs/aio.c · c00d2c7e
      Al Viro committed
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 21 Dec, 2016 3 commits
  11. 18 Dec, 2016 4 commits
    • uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation · 297e765e
      Marcin Nowakowski committed
      Commit:
      
        72e6ae28 ('ARM: 8043/1: uprobes need icache flush after xol write')
      
      ... has introduced an arch-specific method to ensure all caches are
      flushed appropriately after an instruction is written to an XOL page.
      
      However, when the XOL area is created and the out-of-line breakpoint
      instruction is copied, caches are not flushed at all and stale data may
      be found in icache.
      
      Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
      the arch to ensure all caches are updated accordingly.
      
      This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).
      Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Victor Kamensky <victor.kamensky@linaro.org>
      Cc: linux-mips@linux-mips.org
      Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 6760bf2d
      Daniel Borkmann committed
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason why it
      wasn't noticed before was, because as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, then this reg's id was cleared
      when calling mark_reg_unknown_value() thus that every subsequent
      register was probed against id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case for the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: fix overflow in prog accounting · 5ccb071e
      Daniel Borkmann committed
      Commit aaac3ba9 ("bpf: charge user for creation of BPF maps and
      programs") made a wrong assumption of charging against prog->pages.
      Unlike map->pages, prog->pages are still subject to change when we
      need to expand the program through bpf_prog_realloc().
      
      This can for example happen during verification stage when we need to
      expand and rewrite parts of the program. Should the required space
      cross a page boundary, then prog->pages is not the same anymore as
      its original value that we used to bpf_prog_charge_memlock() on. Thus,
      we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
      is freed eventually. I noticed this when, despite having unlimited
      memlock, programs suddenly refused to load with an EPERM error due to
      insufficient memlock.
      
      There are two ways to fix this issue. One would be to add a cached
      variable to struct bpf_prog that takes a snapshot of prog->pages at the
      time of charging. The other approach is to also account for resizes. I
      chose to go with the latter for a couple of reasons: i) We want accounting
      rather to be more accurate instead of further fooling limits, ii) adding
      yet another page counter on struct bpf_prog would also be a waste just
      for this purpose. We also do want to charge as early as possible to
      avoid going into the verifier just to find out later on that we crossed
      limits. The only place that needs to be fixed is bpf_prog_realloc(),
      since only here we expand the program, so we try to account for the
      needed delta and should we fail, call-sites check for outcome anyway.
      On cBPF to eBPF migrations, we don't grab a reference to the user as
      they are charged differently. With that in place, my test case worked
      fine.
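
      The wrap-around and the chosen fix can be modelled in a few lines of
      userspace C. This is a sketch under stated assumptions: struct user,
      struct prog, and all helper names are hypothetical stand-ins, the
      counter is a plain unsigned integer rather than an atomic, and a
      failed charge simply returns false where the description above
      mentions EPERM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-user counter and the program. */
struct user {
	uint64_t locked_pages;
	uint64_t limit;
};

struct prog {
	uint32_t pages;
	struct user *user;
};

/* Add first, then compare against the limit; undo the add on failure. */
static bool charge(struct user *u, uint32_t pages)
{
	u->locked_pages += pages;
	if (u->locked_pages > u->limit) {
		u->locked_pages -= pages;
		return false;
	}
	return true;
}

/* Wraps around if more pages are uncharged than were ever charged. */
static void uncharge(struct user *u, uint32_t pages)
{
	u->locked_pages -= pages;
}

/* Buggy resize: the program grows but the user is not charged for it. */
static void grow_unaccounted(struct prog *p, uint32_t new_pages)
{
	p->pages = new_pages;
}

/* Fixed resize: charge the delta first and fail the resize if that fails. */
static bool grow_accounted(struct prog *p, uint32_t new_pages)
{
	assert(new_pages >= p->pages);

	if (!charge(p->user, new_pages - p->pages))
		return false;
	p->pages = new_pages;
	return true;
}
```

      With the unaccounted resize, freeing the grown program uncharges more
      than was charged, the counter wraps to a huge value, and later
      charges fail even though nothing is actually locked.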
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: dynamically allocate digest scratch buffer · aafe6ae9
      Daniel Borkmann committed
      Geert rightfully complained that 7bd509e3 ("bpf: add prog_digest
      and expose it via fdinfo/netlink") added a too large allocation of
      variable 'raw' from bss section, and should instead be done dynamically:
      
        # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
        add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
        function                                     old     new   delta
        raw                                            -   32832  +32832
        [...]
      
      Since this is only relevant during the program creation path, which can
      be considered slow-path anyway, let's allocate that dynamically and not
      be implicitly dependent on the verifier mutex. Move bpf_prog_calc_digest()
      to the beginning of replace_map_fd_with_map_ptr(); error handling also
      stays straightforward.
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 17 Dec, 2016 1 commit
    • bpf: fix regression on verifier pruning wrt map lookups · a08dd0da
      Daniel Borkmann committed
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that verifier needs to perform a lot more work for the
      same program with regards to involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE, where the id becomes stale. It is even
      transferred further into other types that don't make use of it. One
      example is where UNKNOWN_VALUE is set on function call return with
      RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      already pointed to 57a09bf0, where we really exceed the complexity
      limit. What I found was that states_equal() often failed in this
      case due to id mismatches in spilled regs with registers of type
      PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
      a memcmp() on their reg state and don't have any other optimizations
      in place, so the id was also relevant in this case for making a
      pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17-fold reduction of
      complexity and let the program load fine again. The selftest suite also
      runs fine. The only other place where env->id_gen is currently used is
      direct packet access, but in those cases the id is long-lived, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking the NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we have marked that reg as UNKNOWN_VALUE,
      its id is reset and any subsequent registers that hold the original id
      and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
      anymore, since mark_map_reg() reuses the uncached regs[regno].id that
      was just overwritten. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are two independent verifier states.
      A test case for this is added here, too.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a08dd0da
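      The id handling described in this commit message can be sketched with a
      small user-space model. This is only an illustration under stated
      assumptions: struct reg_state, the three-value enum, and the nregs
      parameter are simplified stand-ins invented here, not the kernel's
      struct bpf_reg_state or the real verifier code, and spilled registers
      are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the verifier's register types (illustrative only). */
enum reg_type {
	UNKNOWN_VALUE,
	PTR_TO_MAP_VALUE,
	PTR_TO_MAP_VALUE_OR_NULL,
};

struct reg_state {
	enum reg_type type;
	unsigned int id;	/* links registers that came from the same lookup */
};

/* Promote one register if it still carries the id of the checked lookup. */
static void mark_map_reg(struct reg_state *regs, size_t regno,
			 unsigned int id, enum reg_type type)
{
	struct reg_state *reg = &regs[regno];

	if (reg->type == PTR_TO_MAP_VALUE_OR_NULL && reg->id == id) {
		reg->type = type;
		/* The id is irrelevant after promotion; reset it so that
		 * otherwise identical states compare equal for pruning. */
		reg->id = 0;
	}
}

/* Promote every register that shares an id with regs[regno]. */
static void mark_map_regs(struct reg_state *regs, size_t nregs, size_t regno,
			  enum reg_type type)
{
	unsigned int id;
	size_t i;

	assert(regno < nregs);
	/* Cache the id first: regs[regno].id is reset once that register is
	 * promoted, so re-reading it inside the loop would miss later aliases. */
	id = regs[regno].id;

	for (i = 0; i < nregs; i++)
		mark_map_reg(regs, i, id, type);
}
```

      In this model, if the loop re-read regs[regno].id instead of using the
      cached value, a second register holding the same id would be skipped as
      soon as the first one had been promoted and its id cleared.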
  13. 16 Dec, 2016 2 commits
  14. 15 Dec, 2016 6 commits
    • G
      genirq/affinity: Fix node generation from cpumask · c0af5243
      Guilherme G. Piccoli committed
      Commit 34c3d981 ("genirq/affinity: Provide smarter irq spreading
      infrastructure") introduced a better IRQ spreading mechanism, taking
      account of the available NUMA nodes in the machine.
      
      The problem is that the algorithm for retrieving the nodemask iterates
      "linearly" based on the number of online nodes, but some architectures,
      such as PowerPC, present a non-linear node distribution in the nodemask.
      In that case, the algorithm leads to a wrong node count and therefore
      to a bad/incomplete IRQ affinity distribution.
      
      For example, this problem was found on a machine with 128 CPUs and two
      nodes, namely nodes 0 and 8 (instead of 0 and 1, if they were linearly
      distributed). This led to a wrong affinity distribution, which then led to
      a bad mq allocation for the nvme driver.
      
      Finally, we take the opportunity to fix a comment regarding the affinity
      distribution when we have _more_ nodes than vectors.
      
      Fixes: 34c3d981 ("genirq/affinity: Provide smarter irq spreading infrastructure")
      Reported-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: hch@lst.de
      Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c0af5243
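      The node-counting problem from this commit message can be reproduced in
      a short user-space sketch. The names nodes_linear(), nodes_sparse(),
      count_online_nodes(), and MAX_NODES are hypothetical, and a plain array
      of per-node CPU counts stands in for the kernel's cpumask and nodemask
      types; it is a model of the idea, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 16	/* illustrative upper bound on node ids */

/* Number of online nodes; a node counts as online here if it has any CPUs. */
static unsigned int count_online_nodes(const unsigned int *cpus_per_node)
{
	unsigned int n, online = 0;

	for (n = 0; n < MAX_NODES; n++)
		if (cpus_per_node[n])
			online++;
	return online;
}

/* Old behaviour: assumes the online node ids are 0 .. online - 1. */
static unsigned int nodes_linear(const unsigned int *cpus_per_node,
				 uint32_t *nodemsk)
{
	unsigned int n, nodes = 0;
	unsigned int online;

	assert(cpus_per_node && nodemsk);
	online = count_online_nodes(cpus_per_node);

	*nodemsk = 0;
	for (n = 0; n < online; n++) {
		if (cpus_per_node[n]) {
			*nodemsk |= UINT32_C(1) << n;
			nodes++;
		}
	}
	return nodes;
}

/* Fixed behaviour: visit every possible node id and skip the empty ones. */
static unsigned int nodes_sparse(const unsigned int *cpus_per_node,
				 uint32_t *nodemsk)
{
	unsigned int n, nodes = 0;

	assert(cpus_per_node && nodemsk);

	*nodemsk = 0;
	for (n = 0; n < MAX_NODES; n++) {
		if (cpus_per_node[n]) {
			*nodemsk |= UINT32_C(1) << n;
			nodes++;
		}
	}
	return nodes;
}
```

      With 64 CPUs on each of nodes 0 and 8, the linear walk only inspects
      ids 0 and 1 and so reports a single node, while the sparse walk finds
      both.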
    • T
      tick/broadcast: Prevent NULL pointer dereference · c1a9eeb9
      Thomas Gleixner committed
      When a dysfunctional timer, e.g. a dummy timer, is installed, the tick core
      tries to set up the broadcast timer.
      
      If no broadcast device is installed, the kernel crashes with a NULL pointer
      dereference in tick_broadcast_setup_oneshot() because the function has no
      sanity check.
      Reported-by: Mason <slash.tmp@free.fr>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Richard Cochran <rcochran@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Frias <sf84@laposte.net>
      Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
      c1a9eeb9
    • L
      printk: remove console flushing special cases for partial buffered lines · 5c2992ee
      Linus Torvalds committed
      It actively hurts proper merging, and makes for a lot of special cases.
      There was a good(ish) reason for doing it originally, but it's getting
      too painful to maintain.  And most of the original reasons for it are
      long gone.
      
      So instead of having special code to flush partial lines to the console
      (as opposed to the record buffers), do _all_ the console writing from
      the record buffer, and be done with it.
      
      If an oops happens (or some other synchronous event), we will flush the
      partial lines due to the oops printing activity, so this does not affect
      that.  It does mean that if you have a completely hung machine, a
      partial preceding line may not have been printed out.
      
      That was some of the original reason for this complexity, in fact, back
      when we used to test for the historical i386 "halt" instruction problem
      by doing
      
      	pr_info("Checking 'hlt' instruction... ");
      
      	if (!boot_cpu_data.hlt_works_ok) {
      		pr_cont("disabled\n");
      		return;
      	}
      	halt();
      	halt();
      	halt();
      	halt();
      	pr_cont("OK\n");
      
      and that model no longer works (if the 'hlt' instruction kills the
      machine, the partial line won't have been flushed, so you won't even see
      it).
      
      Of course, that was also back in the days when people actually had
      textual console output rather than a graphical splash-screen at bootup.
      How times change..
      
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Tested-by: Petr Mladek <pmladek@suse.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c2992ee
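      The consequence for partial lines described in this commit message can
      be shown with a toy model. Everything here is an assumption made for
      illustration: struct fake_log, fake_log_init(), and fake_printk() are
      invented names, and the model flushes only on a trailing newline,
      whereas the real printk code has further flush conditions.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 256

struct fake_log {
	char cont[BUF_SIZE];	/* partial line still being assembled */
	char console[BUF_SIZE];	/* what the console has actually shown */
};

static void fake_log_init(struct fake_log *log)
{
	log->cont[0] = '\0';
	log->console[0] = '\0';
}

/* Append a fragment; only complete lines ever reach the console. */
static void fake_printk(struct fake_log *log, const char *text)
{
	size_t len = strlen(text);

	assert(strlen(log->cont) + len < BUF_SIZE);
	strcat(log->cont, text);

	/* A trailing newline completes the record, which is then written out. */
	if (len > 0 && text[len - 1] == '\n') {
		assert(strlen(log->console) + strlen(log->cont) < BUF_SIZE);
		strcat(log->console, log->cont);
		log->cont[0] = '\0';
	}
}
```

      In this model, printing "Checking 'hlt' instruction... " leaves the
      console empty until a later fragment ending in a newline arrives, which
      is why a machine that hangs in between would show nothing.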
    • L
      printk: remove games with previous record flags · 5aa068ea
      Linus Torvalds committed
      The record logging code looks at the previous record flags in various
      ways, and they are all wrong.
      
      You can't use the previous record flags to determine anything about the
      next record, because they may simply not be related.  In particular, the
      reason the previous record was a continuation record may well be exactly
      _because_ the new record was printed by a different process, which is
      why the previous record was flushed.
      
      So all those games are simply wrong, and make the code hard to
      understand (because the code fundamentally does not make sense).
      
      So remove it.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5aa068ea
    • L
      mm: add locked parameter to get_user_pages_remote() · 5b56d49f
      Lorenzo Stoakes committed
      Patch series "mm: unexport __get_user_pages_unlocked()".
      
      This patch series continues the cleanup of get_user_pages*() functions
      taking advantage of the fact we can now pass gup_flags as we please.
      
      It firstly adds an additional 'locked' parameter to
      get_user_pages_remote() to allow for its callers to utilise
      VM_FAULT_RETRY functionality.  This is necessary as the invocation of
      __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
      this and no other existing higher level function would allow it to do
      so.
      
      Secondly, existing callers of __get_user_pages_unlocked() are replaced
      with the appropriate higher-level replacement -
      get_user_pages_unlocked() if the current task and memory descriptor are
      referenced, or get_user_pages_remote() if other task/memory descriptors
      are referenced (having acquired mmap_sem).
      
      This patch (of 2):
      
      Add an int *locked parameter to get_user_pages_remote() to allow
      VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
      
      Taking into account the previous adjustments to get_user_pages*()
      functions allowing for the passing of gup_flags, we are now in a
      position where __get_user_pages_unlocked() need only be exported for its
      ability to allow VM_FAULT_RETRY behaviour. This adjustment allows us to
      subsequently unexport __get_user_pages_unlocked() as well as allowing
      for future flexibility in the use of get_user_pages_remote().
      
      [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
        Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b56d49f
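      The caller-side contract of an int *locked parameter, as described in
      this commit message, can be sketched in user space. All names here
      (struct fake_mm, fake_down_read(), fake_up_read(), fake_gup_remote(),
      pin_one_page()) are hypothetical stand-ins; the model only captures the
      idea that the callee may return with the lock already released and
      reports this through *locked, while the kernel's actual retry logic is
      more involved.

```c
#include <assert.h>

/* Hypothetical stand-in for an mm with a reader lock. */
struct fake_mm {
	int sem_held;		/* 1 while the "mmap_sem" is held */
	int unlock_count;	/* how many times it has been released */
};

static void fake_down_read(struct fake_mm *mm)
{
	assert(!mm->sem_held);
	mm->sem_held = 1;
}

static void fake_up_read(struct fake_mm *mm)
{
	assert(mm->sem_held);
	mm->sem_held = 0;
	mm->unlock_count++;
}

/*
 * Pretend to pin one page. If the fault must be retried and the caller
 * passed a locked pointer, drop the lock and report that via *locked.
 */
static long fake_gup_remote(struct fake_mm *mm, int must_retry, int *locked)
{
	if (must_retry && locked) {
		fake_up_read(mm);
		*locked = 0;
	}
	return 1;	/* one page "pinned" */
}

/* Caller pattern: release the lock only if the callee did not. */
static long pin_one_page(struct fake_mm *mm, int must_retry)
{
	int locked = 1;
	long pinned;

	fake_down_read(mm);
	pinned = fake_gup_remote(mm, must_retry, &locked);
	if (locked)
		fake_up_read(mm);
	return pinned;
}
```

      Either way the lock is released exactly once per call, which is the
      property the locked flag is meant to preserve for the caller.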
    • B
      kernel/watchdog.c: move hardlockup detector to separate file · 73ce0511
      Babu Moger committed
      Separate hardlockup code from watchdog.c and move it to watchdog_hld.c.
      It is mostly straightforward.  Remove everything inside
      CONFIG_HARDLOCKUP_DETECTORS.  This code will go to file watchdog_hld.c.
      Also update the makefile accordingly.
      
      Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com
      Signed-off-by: Babu Moger <babu.moger@oracle.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Josh Hunt <johunt@akamai.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73ce0511