1. 01 Nov, 2017 1 commit
    • futex: Fix more put_pi_state() vs. exit_pi_state_list() races · 153fbd12
      Authored by Peter Zijlstra
      Dmitry (through syzbot) reported being able to trigger the WARN in
      get_pi_state() and a use-after-free on:
      
      	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
      
      Both are due to this race:
      
        exit_pi_state_list()				put_pi_state()
      
        lock(&curr->pi_lock)
        while() {
      	pi_state = list_first_entry(head);
      	hb = hash_futex(&pi_state->key);
      	unlock(&curr->pi_lock);
      
      						dec_and_test(&pi_state->refcount);
      
      	lock(&hb->lock)
      	lock(&pi_state->pi_mutex.wait_lock)	// uaf if pi_state free'd
      	lock(&curr->pi_lock);
      
      	....
      
      	unlock(&curr->pi_lock);
      	get_pi_state();				// WARN; refcount==0
      
      The problem is that we take the reference count too late and don't
      allow it to be 0. Fix it by using inc_not_zero() and simply retrying
      the loop when we fail to get a refcount. In that case put_pi_state()
      should remove the entry from the list.
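
      A minimal sketch of the resulting retry loop in exit_pi_state_list(),
      assuming pi_state->refcount is still an atomic_t in this kernel; the
      names follow the surrounding futex code:

      	raw_spin_lock_irq(&curr->pi_lock);
      	while (!list_empty(head)) {
      		pi_state = list_first_entry(head, struct futex_pi_state, list);
      		key = pi_state->key;
      		hb = hash_futex(&key);

      		/*
      		 * Take the reference *before* dropping pi_lock. If it is
      		 * already zero, a concurrent put_pi_state() is about to
      		 * remove the entry; drop the lock, let it make progress
      		 * and retry the loop.
      		 */
      		if (!atomic_inc_not_zero(&pi_state->refcount)) {
      			raw_spin_unlock_irq(&curr->pi_lock);
      			cpu_relax();
      			raw_spin_lock_irq(&curr->pi_lock);
      			continue;
      		}
      		raw_spin_unlock_irq(&curr->pi_lock);

      		/* ... take hb->lock, wait_lock and pi_lock, then unlink ... */
      	}
      	raw_spin_unlock_irq(&curr->pi_lock);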
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Gratian Crisan <gratian.crisan@ni.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: dvhart@infradead.org
      Cc: syzbot <bot+2af19c9e1ffe4d4ee1d16c56ae7580feaee75765@syzkaller.appspotmail.com>
      Cc: syzkaller-bugs@googlegroups.com
      Cc: <stable@vger.kernel.org>
      Fixes: c74aef2d ("futex: Fix pi_state->owner serialization")
      Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 29 Oct, 2017 2 commits
  3. 22 Oct, 2017 3 commits
  4. 21 Oct, 2017 2 commits
  5. 20 Oct, 2017 6 commits
  6. 19 Oct, 2017 2 commits
  7. 18 Oct, 2017 1 commit
    • bpf: disallow arithmetic operations on context pointer · 28e33f9d
      Authored by Jakub Kicinski
      Commit f1174f77 ("bpf/verifier: rework value tracking")
      removed the crafty selection of which pointer types are
      allowed to be modified.  This is OK for most pointer types
      since adjust_ptr_min_max_vals() will catch operations on
      immutable pointers.  One exception is PTR_TO_CTX which is
      now allowed to be offset freely.
      
      The intent of aforementioned commit was to allow context
      access via modified registers.  The offset passed to
      ->is_valid_access() verifier callback has been adjusted
      by the value of the variable offset.
      
      What is missing, however, is taking the variable offset into account
      when the context register is used, or, in terms of the code, adding
      the offset to the value passed to the ->convert_ctx_access() callback.
      This leads to the following eBPF user code:
      
           r1 += 68
           r0 = *(u32 *)(r1 + 8)
           exit
      
      being translated to this in kernel space:
      
         0: (07) r1 += 68
         1: (61) r0 = *(u32 *)(r1 +180)
         2: (95) exit
      
      Offset 8 corresponds to 180 in the kernel, but offset 76 is valid
      too.  The verifier will "accept" access to offset 68+8=76 but then
      "convert" the access to offset 8, i.e. 180.  The effective access at
      offset 68+180=248 is beyond the kernel context.
      (This is a __sk_buff example on a debug-heavy kernel -
      packet mark is 8 -> 180, 76 would be data.)
      
      Dereferencing the modified context pointer is not as easy
      as dereferencing other types, because we have to translate
      the access to reading a field in kernel structures which is
      usually at a different offset and often of a different size.
      To allow modifying the pointer we would have to make sure
      that given eBPF instruction will always access the same
      field or the fields accessed are "compatible" in terms of
      offset and size...
      
      Disallow dereferencing modified context pointers and add the test
      case described here to selftests.
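
      A hedged sketch of the corresponding check in the verifier's memory
      access path; the field names (reg->off, reg->var_off) come from the
      value-tracking rework, and the exact messages are illustrative:

      	} else if (reg->type == PTR_TO_CTX) {
      		/*
      		 * Context accesses must be at a constant offset from the
      		 * unmodified ctx pointer, otherwise ->convert_ctx_access()
      		 * would rewrite the wrong field.
      		 */
      		if (reg->off) {
      			verbose("dereference of modified ctx ptr R%d off=%d+%d, ctx+const is allowed, ctx+const+const is not\n",
      				regno, reg->off, off - reg->off);
      			return -EACCES;
      		}
      		if (!tnum_is_const(reg->var_off) || reg->var_off.value) {
      			verbose("variable ctx access off=%d size=%d\n", off, size);
      			return -EACCES;
      		}
      		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type);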
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 14 Oct, 2017 1 commit
  9. 13 Oct, 2017 2 commits
    • genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack() · 0d08af35
      Authored by Doug Berger
      Any usage of the irq_gc_mask_disable_reg_and_ack() function has
      been replaced with the desired functionality.
      
      The incorrect and ambiguously named function is removed here to
      prevent accidental misuse.
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    • genirq: generic chip: Add irq_gc_mask_disable_and_ack_set() · 20608924
      Authored by Doug Berger
      The irq_gc_mask_disable_reg_and_ack() function name implies that it
      provides the combined functions of irq_gc_mask_disable_reg() and
      irq_gc_ack().  However, the implementation does not actually do
      that since it writes the mask instead of the disable register. It
      also does not maintain the mask cache which makes it inappropriate
      to use with other masking functions.
      
      In addition, commit 659fb32d ("genirq: replace irq_gc_ack() with
      {set,clr}_bit variants (fwd)") effectively renamed irq_gc_ack() to
      irq_gc_ack_set_bit() so this function probably should have also been
      renamed at that time.
      
      The generic chip code currently provides three functions for use
      with the irq_mask member of the irq_chip structure and two functions
      for use with the irq_ack member of the irq_chip structure. These
      functions could be combined into six functions for use with the
      irq_mask_ack member of the irq_chip structure.  However, since only
      one of the combinations is currently used, only the function
      irq_gc_mask_disable_and_ack_set() is added by this commit.
      
      The '_reg' and '_bit' portions of the base function name were left
      out of the new combined function name in an attempt to keep the
      function name length manageable with the 80 character source code
      line length while still allowing the distinct aspects of each
      combination to be captured by the name.
      
      If other combinations are desired in the future please add them to
      the irq generic chip library at that time.
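
      A sketch of what the combined helper looks like, modelled on
      irq_gc_mask_disable_reg() plus irq_gc_ack_set_bit(); treat the details
      as an approximation of the generic chip code:

      	void irq_gc_mask_disable_and_ack_set(struct irq_data *d)
      	{
      		struct irq_chip_generic *gc = irq_data_get_irq_chip_data(d);
      		struct irq_chip_type *ct = irq_data_get_chip_type(d);
      		u32 mask = d->mask;

      		irq_gc_lock(gc);
      		/* Disable via the disable register, not the mask register ... */
      		irq_reg_writel(gc, mask, ct->regs.disable);
      		/* ... keep the mask cache consistent ... */
      		*ct->mask_cache &= ~mask;
      		/* ... and ack by setting the bit in the ack register. */
      		irq_reg_writel(gc, mask, ct->regs.ack);
      		irq_gc_unlock(gc);
      	}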
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
  10. 11 Oct, 2017 3 commits
  11. 10 Oct, 2017 9 commits
    • workqueue: replace pool->manager_arb mutex with a flag · 692b4825
      Authored by Tejun Heo
      Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
      lockdep:
      
       [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
       [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
       [ 1270.473240] -----------------------------------------------------
       [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
       [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
       [ 1270.474994]
       [ 1270.474994] and this task is already holding:
       [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
       [ 1270.476046] which would create a new lock dependency:
       [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
       [ 1270.476949]
       [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
       [ 1270.477553]  (&pool->lock/1){-.-.}
       ...
       [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
       [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
       ...
       [ 1270.494735]  Possible interrupt unsafe locking scenario:
       [ 1270.494735]
       [ 1270.495250]        CPU0                    CPU1
       [ 1270.495600]        ----                    ----
       [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
       [ 1270.496295]                                local_irq_disable();
       [ 1270.496753]                                lock(&pool->lock/1);
       [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
       [ 1270.497744]   <Interrupt>
       [ 1270.497948]     lock(&pool->lock/1);
      
      , which will cause an IRQ inversion deadlock if the above lock
      scenario happens.
      
      The root cause of this safe -> unsafe lock order is the
      mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
      held.
      
      Unlocking a mutex while holding an irq spinlock was never safe and
      this problem has been around forever, but it never got noticed because
      the mutex is usually only trylocked while holding the irq lock, making
      actual failures very unlikely, and the lockdep annotation missed the
      condition until the recent b9c16a0e ("locking/mutex: Fix
      lockdep_assert_held() fail").
      
      Using a mutex for pool->manager_arb has always been a bit of a
      stretch.  It is primarily a mechanism to arbitrate managership between
      workers, which can easily be done with a pool flag.  The only reason
      it became a mutex is that the pool destruction path wants to exclude
      parallel managing operations.
      
      This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
      and makes the destruction path wait for the current manager on a wait
      queue.
      
      v2: Drop unnecessary flag clearing before pool destruction as
          suggested by Boqun.
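
      A rough sketch of the arbitration described above, assuming a global
      wq_manager_wait waitqueue; not the literal patch:

      	/* manage_workers(): only one manager per pool */
      	if (pool->flags & POOL_MANAGER_ACTIVE)
      		return false;

      	pool->flags |= POOL_MANAGER_ACTIVE;
      	pool->manager = worker;
      	maybe_create_worker(pool);
      	pool->manager = NULL;
      	pool->flags &= ~POOL_MANAGER_ACTIVE;
      	wake_up(&wq_manager_wait);

      	/* put_unbound_pool(): exclude a concurrent manager */
      	spin_lock_irq(&pool->lock);
      	wait_event_lock_irq(wq_manager_wait,
      			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
      	pool->flags |= POOL_MANAGER_ACTIVE;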
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: stable@vger.kernel.org
    • sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Authored by Peter Zijlstra
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
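
      The fix presumably boils down to a single mask operation in
      load_balance() along these lines (sketch):

      	/*
      	 * Only consider CPUs that are in both the domain span and the
      	 * active mask; this also filters an inactive destination CPU.
      	 */
      	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);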
      Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Address more wake_affine() regressions · f2cdd9cc
      Authored by Peter Zijlstra
      The trivial wake_affine_idle() implementation is very good for a
      number of workloads, but it comes apart at the moment there are no
      idle CPUs left, IOW, the overloaded case.
      
      hackbench:
      
      		NO_WA_WEIGHT		WA_WEIGHT
      
      hackbench-20  : 7.362717561 seconds	6.450509391 seconds
      
      (win)
      
      netperf:
      
      		  NO_WA_WEIGHT		WA_WEIGHT
      
      TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
      TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
      TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
      TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
      TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4
      
      TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
      TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
      TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
      TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
      TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41
      
      TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
      TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
      TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
      TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
      TCP_RR-80	: Avg: 25330.3          Avg: 26867.9
      
      UDP_RR-1	: Avg: 108266           Avg: 107844
      UDP_RR-10	: Avg: 95480            Avg: 95245.2
      UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
      UDP_RR-40	: Avg: 76231            Avg: 75419.1
      UDP_RR-80	: Avg: 34578.3          Avg: 35639.1
      
      UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
      UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
      UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
      UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
      UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97
      
      (wins and losses)
      
      sysbench:
      
      		    NO_WA_WEIGHT		WA_WEIGHT
      
      sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
      sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
      sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
      sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
      sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
      sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.
      
      sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
      sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
      sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
      sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
      sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
      sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.
      
      (increase on the top end)
      
      tbench:
      
      NO_WA_WEIGHT
      
      Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
      Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
      Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
      Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
      Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
      Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms
      
      WA_WEIGHT
      
      Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
      Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
      Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
      Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
      Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
      Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms
      
      (increase on the top end)
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix wake_affine() performance regression · d153b153
      Authored by Peter Zijlstra
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back
      to utter basics. Between the two CPUs involved in the wakeup (the CPU
      doing the wakeup and the CPU the task ran on previously), pick the CPU
      we can run on _now_.

      This recovers much of the regression against the older kernels, but
      leaves some ground in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
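
      A sketch of that basic policy; the function name is taken from the
      wake_affine_idle() mentioned in the previous entry, and the sync
      heuristic is an assumption:

      	static bool wake_affine_idle(int this_cpu, int prev_cpu, int sync)
      	{
      		/* Prefer the waking CPU if it is idle right now. */
      		if (idle_cpu(this_cpu))
      			return true;

      		/* A sync wakeup with only the waker running will go idle soon. */
      		if (sync && cpu_rq(this_cpu)->nr_running == 1)
      			return true;

      		/* Otherwise stay where the task ran previously. */
      		return false;
      	}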
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix cgroup time when scheduling descendants · e6a52033
      Authored by leilei.lin
      Update cgroup time when an event is scheduled in by descendants.
      Reviewed-and-tested-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: brendan.d.gregg@gmail.com
      Cc: yang_oliver@hotmail.com
      Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Avoid freeing static PMU contexts when PMU is unregistered · df0062b2
      Authored by Will Deacon
      Since commit:
      
        1fd7e416 ("perf/core: Remove perf_cpu_context::unique_pmu")
      
      ... when a PMU is unregistered then its associated ->pmu_cpu_context is
      unconditionally freed. Whilst this is fine for dynamically allocated
      context types (i.e. those registered using perf_invalid_context), this
      causes a problem for sharing of static contexts such as
      perf_{sw,hw}_context, which are used by multiple built-in PMUs and
      effectively have a global lifetime.
      
      Whilst testing the ARM SPE driver, which must use perf_sw_context to
      support per-task AUX tracing, unregistering the driver as a result of a
      module unload resulted in:
      
       Unable to handle kernel NULL pointer dereference at virtual address 00000038
       Internal error: Oops: 96000004 [#1] PREEMPT SMP
       Modules linked in: [last unloaded: arm_spe_pmu]
       PC is at ctx_resched+0x38/0xe8
       LR is at perf_event_exec+0x20c/0x278
       [...]
       ctx_resched+0x38/0xe8
       perf_event_exec+0x20c/0x278
       setup_new_exec+0x88/0x118
       load_elf_binary+0x26c/0x109c
       search_binary_handler+0x90/0x298
       do_execveat_common.isra.14+0x540/0x618
       SyS_execve+0x38/0x48
      
      since the software context has been freed and the ctx.pmu->pmu_disable_count
      field has been set to NULL.
      
      This patch fixes the problem by avoiding the freeing of static PMU contexts
      altogether. Whilst the sharing of dynamic contexts is questionable, this
      actually requires the caller to share their context pointer explicitly
      and so the burden is on them to manage the object lifetime.
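
      The essence of the fix, as a sketch; it assumes static context indices
      (perf_hw_context, perf_sw_context) compare greater than
      perf_invalid_context while dynamically allocated ones do not:

      	static void free_pmu_context(struct pmu *pmu)
      	{
      		/*
      		 * Static contexts are shared between built-in PMUs and have a
      		 * global lifetime; only free contexts of PMUs registered with
      		 * perf_invalid_context.
      		 */
      		if (pmu->task_ctx_nr > perf_invalid_context)
      			return;

      		free_percpu(pmu->pmu_cpu_context);
      	}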
      Reported-by: Kim Phillips <kim.phillips@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 1fd7e416 ("perf/core: Remove perf_cpu_context::unique_pmu")
      Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Fix stacktrace mess · 8b405d5c
      Authored by Peter Zijlstra
      There is some complication between check_prevs_add() and
      check_prev_add() wrt. saving stack traces. The problem is that we want
      to be frugal with saving stack traces, since it consumes static
      resources.
      
      We'll only know in check_prev_add() if we need the trace, but we can
      call into it multiple times. So we want to save it on demand and
      re-use it.
      
      A further complication is that check_prev_add() can drop graph_lock
      and mess with our static resources.
      
      In any case, the current state; after commit:
      
        ce07a941 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
      
      is that we'll assume the trace contains valid data once
      check_prev_add() returns '2'. However, as noted by Josh, this is
      false: check_prev_add() can return '2' before having saved a trace,
      which then results in the possibility of using uninitialized data.
      Testing, as reported by Wu, shows a NULL dereference.
      
      So simplify.
      
      Since the graph_lock() thing is a debug path that hasn't
      really been used in a long while, take it out back and avoid the
      head-ache.
      
      Further, initialize the stack_trace to a known 'empty' state; as long
      as nr_entries == 0, nothing should dereference entries. We can then
      use the 'entries == NULL' test for a valid trace / on-demand saving.
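
      A sketch of the resulting pattern (field names from struct
      stack_trace; save_trace() is lockdep's internal helper):

      	/* in check_prevs_add(): a known 'empty' trace */
      	struct stack_trace trace = {
      		.nr_entries	= 0,
      		.max_entries	= 0,
      		.entries	= NULL,
      		.skip		= 0,
      	};

      	/* in check_prev_add(): save on demand, only once */
      	if (!trace->entries && !save_trace(trace))
      		return 0;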
      Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ce07a941 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • net: defer call to cgroup_sk_alloc() · fbb1fb4a
      Authored by Eric Dumazet
      sk_clone_lock() might run while the TCP/DCCP listener has already
      vanished.

      In order to prevent a use-after-free, it is better to defer
      cgroup_sk_alloc() to the point where we know both parent and child
      exist, and to call it from process context.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • waitid(): Add missing access_ok() checks · 96ca579a
      Authored by Kees Cook
      Adds missing access_ok() checks.
      
      CVE-2017-5123
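
      In this era access_ok() still takes a VERIFY_* argument; a sketch of
      the check in front of the unsafe_put_user() sequence:

      	if (!infop)
      		return err;

      	if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
      		return -EFAULT;

      	user_access_begin();
      	unsafe_put_user(signo, &infop->si_signo, Efault);
      	/* ... remaining siginfo fields ... */
      	user_access_end();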
      Reported-by: Chris Salls <chrissalls5@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: 4c48abe9 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
      Cc: stable@kernel.org # 4.13
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 09 Oct, 2017 4 commits
    • netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Authored by Shmulik Ladkani
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when the 2nd
      invocation occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace, to perform
      an "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a
      new, process-local fd for every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned eBPF objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
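
      A sketch of the kernel-side lookup in the match's checkentry path; the
      helper name bpf_prog_get_type_path() and XT_BPF_PATH_MAX are
      assumptions based on the description above:

      	static int __bpf_mt_check_path(const char *path, struct bpf_prog **ret)
      	{
      		if (strnlen(path, XT_BPF_PATH_MAX) == XT_BPF_PATH_MAX)
      			return -EINVAL;

      		/* Resolve the pinned object by path instead of trusting '.fd'. */
      		*ret = bpf_prog_get_type_path(path, BPF_PROG_TYPE_SOCKET_FILTER);
      		return PTR_ERR_OR_ZERO(*ret);
      	}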
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs · e43b3b58
      Authored by Thomas Gleixner
      Managed interrupts can end up in a stale state on CPU hotplug. If the
      interrupt is not targeting a single CPU, i.e. the affinity mask spans
      multiple CPUs, then the following can happen:
      
      After boot:
      
      dstate:   0x01601200
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  0
      
      After offlining CPU 31 - 24
      
      dstate:   0x01a31000
                  IRQD_IRQ_DISABLED
                  IRQD_IRQ_MASKED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_AFFINITY_MANAGED
                  IRQD_MANAGED_SHUTDOWN
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  0
      
      Now CPU 25 gets onlined again, so it should get the effective interrupt
      affinity for this interrupt, but due to the x86 interrupt affinity
      setter restrictions this ends up, after restarting the interrupt, with:
      
      dstate:   0x01601300
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_SETAFFINITY_PENDING
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  24-31
      
      So the interrupt is still affine to CPU 24, which was the last CPU of
      that affinity set to go offline, and the move to an online CPU within
      24-31, in this case 25, is pending. This mechanism is x86/ia64
      specific, as those architectures cannot move interrupts from thread
      context and instead do this when an interrupt is actually handled. So
      the move is marked pending.
      
      What's worse is that offlining CPU 25 again results in:
      
      dstate:   0x01601300
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_SETAFFINITY_PENDING
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  24-31
      
      This means the interrupt has not been shut down, because the outgoing CPU
      is not in the effective affinity mask, but of course nothing notices that
      the effective affinity mask is pointing at an offline CPU.
      
      In the case of restarting a managed interrupt the move restriction does not
      apply, so the affinity setting can be made unconditional. This needs to be
      done _before_ the interrupt is started up, as otherwise the condition
      for moving it from thread context would no longer be fulfilled.
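
      A sketch of the ordering change in irq_startup(); treat the helper
      names as assumptions:

      	case IRQ_STARTUP_MANAGED:
      		/* Set the affinity unconditionally _before_ starting up ... */
      		irq_do_set_affinity(d, aff, false);
      		ret = __irq_startup(desc);
      		/*
      		 * ... rather than afterwards, where the x86/ia64 move logic
      		 * would only mark it pending.
      		 */
      		break;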
      
      With that change applied onlining CPU 25 after offlining 31-24 results in:
      
      dstate:   0x01600200
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 25
      pending:  
      
      And after offlining CPU 25:
      
      dstate:   0x01a30000
                  IRQD_IRQ_DISABLED
                  IRQD_IRQ_MASKED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_MANAGED
                  IRQD_MANAGED_SHUTDOWN
      node:     0
      affinity: 24-31
      effectiv: 25
      pending:  
      
      which is the correct and expected result.
      
      Fixes: 761ea388 ("genirq: Handle managed irqs gracefully in irq_startup()")
      Reported-by: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: axboe@kernel.dk
      Cc: linux-scsi@vger.kernel.org
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: mpe@ellerman.id.au
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: keith.busch@intel.com
      Cc: peterz@infradead.org
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos
    • genirq/cpuhotplug: Add sanity check for effective affinity mask · 60b09c51
      Authored by Thomas Gleixner
      The effective affinity mask handling has no safety net when the mask is not
      updated by the interrupt chip or the mask contains offline CPUs.
      
      If that happens the CPU unplug code fails to migrate interrupts.
      
      Add sanity checks and emit a warning when the mask contains only offline
      CPUs.
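
      A sketch of the check on the unplug path; the warning text is
      illustrative:

      	const struct cpumask *m = irq_data_get_effective_affinity_mask(d);

      	/* Fall back to the plain affinity mask if the chip never updated it. */
      	if (cpumask_empty(m))
      		m = irq_data_get_affinity_mask(d);

      	if (cpumask_any_and(m, cpu_online_mask) >= nr_cpu_ids)
      		pr_warn("Effective affinity %*pbl of IRQ %u contains only offline CPUs\n",
      			cpumask_pr_args(m), d->irq);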
      
      Fixes: 415fcf1a ("genirq/cpuhotplug: Use effective affinity mask")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos
    • genirq: Warn when effective affinity is not updated · 19e1d4e9
      Authored by Thomas Gleixner
      Emit a one-time warning when the effective affinity mask is enabled in
      Kconfig, but the interrupt chip does not update the mask in its
      irq_set_affinity() callback.
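
      A sketch of that one-time warning, checked after a successful
      irq_set_affinity() call:

      	static inline void irq_validate_effective_affinity(struct irq_data *data)
      	{
      		const struct cpumask *m = irq_data_get_effective_affinity_mask(data);
      		struct irq_chip *chip = irq_data_get_irq_chip(data);

      		if (!IS_ENABLED(CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK))
      			return;
      		if (cpumask_empty(m))
      			pr_warn_once("irq_chip %s did not update eff. affinity mask of irq %u\n",
      				     chip->name, data->irq);
      	}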
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos
  13. 08 Oct, 2017 1 commit
    • bpf: fix liveness marking · 8fe2d6cc
      Authored by Alexei Starovoitov
      While processing an Rx = Ry instruction the verifier does
      regs[insn->dst_reg] = regs[insn->src_reg]
      which often clears the write mark (when Ry doesn't have it)
      that was just set by check_reg_arg(Rx) prior to the assignment.
      That causes mark_reg_read() to keep marking Rx in this block as
      REG_LIVE_READ (since the logic incorrectly misses that it's
      screened by the write) and in many of its parents (until a lucky
      write into the same Rx or the beginning of the program).
      That causes the is_state_visited() logic to miss many pruning
      opportunities.
      
      Furthermore, the mark_reg_read() logic propagates the read mark
      for BPF_REG_FP as well (though it's read-only), which causes
      harmless but unnecessary work during is_state_visited().
      Note that do_propagate_liveness() skips FP correctly,
      so do the same in mark_reg_read() as well; a sketch of the fixed
      logic follows the table below. It saves 0.2 seconds for the test below:
      
      program               before  after
      bpf_lb-DLB_L3.o       2604    2304
      bpf_lb-DLB_L4.o       11159   3723
      bpf_lb-DUNKNOWN.o     1116    1110
      bpf_lxc-DDROP_ALL.o   34566   28004
      bpf_lxc-DUNKNOWN.o    53267   39026
      bpf_netdev.o          17843   16943
      bpf_overlay.o         8672    7929
      time                  ~11 sec  ~4 sec
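
      A sketch of the fixed read-mark propagation referenced above; the
      structure follows the verifier's liveness tracking, details are
      approximate:

      	static void mark_reg_read(const struct bpf_verifier_state *state, u32 regno)
      	{
      		struct bpf_verifier_state *parent = state->parent;

      		if (regno == BPF_REG_FP)
      			/* The frame pointer is read-only, no liveness to track. */
      			return;

      		while (parent) {
      			/* A write in this state screens the read from its parents. */
      			if (state->regs[regno].live & REG_LIVE_WRITTEN)
      				break;
      			parent->regs[regno].live |= REG_LIVE_READ;
      			state = parent;
      			parent = state->parent;
      		}
      	}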
      
      Fixes: dc503a8a ("bpf/verifier: track liveness for pruning")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 04 Oct, 2017 3 commits
    • watchdog/core: Put softlockup_threads_initialized under ifdef guard · 0b62bf86
      Authored by Thomas Gleixner
      The variable is unused when the softlockup detector is disabled in Kconfig.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • watchdog/core: Rename some softlockup_* functions · 5587185d
      Authored by Thomas Gleixner
      The function names made sense up to the point where the watchdog
      (re)configuration was unified to use softlockup_reconfigure_threads() for
      all configuration purposes. But that includes scenarios which solely
      configure the nmi watchdog.
      
      Rename softlockup_reconfigure_threads() and softlockup_init_threads() so
      the function names match the functionality.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linuxfoundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Don Zickus <dzickus@redhat.com>
    • powerpc/watchdog: Make use of watchdog_nmi_probe() · 34ddaa3e
      Authored by Thomas Gleixner
      The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
      on powerpc because it is called multiple times for the boot CPU.
      
      The first call is via:
      
        start_wd_on_cpu+0x80/0x2f0
        watchdog_nmi_reconfigure+0x124/0x170
        softlockup_reconfigure_threads+0x110/0x130
        lockup_detector_init+0xbc/0xe0
        kernel_init_freeable+0x18c/0x37c
        kernel_init+0x2c/0x160
        ret_from_kernel_thread+0x5c/0xbc
      
      And then again via the CPU hotplug registration:
      
        start_wd_on_cpu+0x80/0x2f0
        cpuhp_invoke_callback+0x194/0x620
        cpuhp_thread_fun+0x7c/0x1b0
        smpboot_thread_fn+0x290/0x2a0
        kthread+0x168/0x1b0
        ret_from_kernel_thread+0x5c/0xbc
      
      This can be avoided by setting up the cpu hotplug state with nocalls
      and moving the initialization to the watchdog_nmi_probe() function.
      That registers the hotplug callbacks without invoking them, and the
      following core initialization function then configures the watchdog
      for the online CPUs (in this case CPU0) via
      softlockup_reconfigure_threads().
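
      A sketch of the probe-time setup; start_wd_on_cpu appears in the
      backtraces above, while the teardown callback name, hotplug state and
      error handling are assumptions:

      	int __init watchdog_nmi_probe(void)
      	{
      		int err;

      		/* Register the callbacks without invoking them for the boot CPU. */
      		err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
      						"powerpc/watchdog:online",
      						start_wd_on_cpu, stop_wd_on_cpu);
      		if (err < 0) {
      			pr_warn("could not register watchdog hotplug state\n");
      			return err;
      		}
      		return 0;
      	}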
      Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org