1. 21 Oct, 2017 1 commit
  2. 20 Oct, 2017 2 commits
    • P
      doc: Fix various RCU docbook comment-header problems · 27fdb35f
      Paul E. McKenney committed
      Because many of RCU's files have not been included into docbook, a
      number of errors have accumulated.  This commit fixes them.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27fdb35f
    • M
      membarrier: Provide register expedited private command · a961e409
      Mathieu Desnoyers committed
      This introduces a "register private expedited" membarrier command which
      allows eventual removal of important memory barrier constraints on the
      scheduler fast-paths. It changes how the "private expedited" membarrier
      command (new to 4.14) is used from user-space.
      
      This new command allows processes to register their intent to use the
      private expedited command.  This affects how the expedited private
      command introduced in 4.14-rc is meant to be used, and should be merged
      before 4.14 final.
      
      Processes are now required to register before using
      MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
      
      This fixes a problem that arose when designing requested extensions to
      sys_membarrier() to allow JITs to efficiently flush old code from
      instruction caches.  Several potential algorithms are much less painful
      if the user registers intent to use this functionality early on, for
      example, before the process spawns the second thread.  Registering at
      this time removes the need to interrupt each and every thread in that
      process at the first expedited sys_membarrier() system call.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a961e409
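The register-then-use flow described in this commit can be sketched from user space as follows. This is a minimal sketch, assuming a kernel and UAPI headers that provide the 4.14 membarrier commands; the helper names `membarrier_call()` and `private_expedited_barrier()` are hypothetical and not part of any library.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw membarrier system call. */
static int membarrier_call(int cmd, int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Register once (ideally before spawning threads), then issue the
 * private expedited barrier. Returns 0 on success, or -1 with errno
 * set if the commands are unsupported or a call fails.
 */
static int private_expedited_barrier(void)
{
	int supported = membarrier_call(MEMBARRIER_CMD_QUERY, 0);

	if (supported < 0)
		return -1;
	if (!(supported & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) ||
	    !(supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
		errno = ENOSYS;
		return -1;
	}
	if (membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
		return -1;
	/* Per the commit message, skipping the registration yields EPERM. */
	return membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}
```

Querying first keeps the sketch usable on kernels that predate these commands, where the caller would fall back to another barrier strategy.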
  3. 14 Oct, 2017 1 commit
  4. 11 Oct, 2017 2 commits
  5. 10 Oct, 2017 7 commits
    • P
      sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Peter Zijlstra committed
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
      Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      024c9d2f
    • P
      sched/core: Address more wake_affine() regressions · f2cdd9cc
      Peter Zijlstra committed
      The trivial wake_affine_idle() implementation is very good for a
      number of workloads, but it comes apart at the moment there are no
      idle CPUs left, i.e. in the overloaded case.
      
      hackbench:
      
      		NO_WA_WEIGHT		WA_WEIGHT
      
      hackbench-20  : 7.362717561 seconds	6.450509391 seconds
      
      (win)
      
      netperf:
      
      		  NO_WA_WEIGHT		WA_WEIGHT
      
      TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
      TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
      TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
      TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
      TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4
      
      TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
      TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
      TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
      TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
      TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41
      
      TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
      TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
      TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
      TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
      TCP_RR-80	: Avg: 25330.3          Avg: 26867.9
      
      UDP_RR-1	: Avg: 108266           Avg: 107844
      UDP_RR-10	: Avg: 95480            Avg: 95245.2
      UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
      UDP_RR-40	: Avg: 76231            Avg: 75419.1
      UDP_RR-80	: Avg: 34578.3          Avg: 35639.1
      
      UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
      UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
      UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
      UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
      UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97
      
      (wins and losses)
      
      sysbench:
      
      		    NO_WA_WEIGHT		WA_WEIGHT
      
      sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
      sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
      sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
      sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
      sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
      sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.
      
      sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
      sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
      sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
      sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
      sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
      sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.
      
      (increase on the top end)
      
      tbench:
      
      NO_WA_WEIGHT
      
      Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
      Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
      Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
      Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
      Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
      Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms
      
      WA_WEIGHT
      
      Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
      Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
      Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
      Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
      Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
      Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms
      
      (increase on the top end)
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2cdd9cc
    • P
      sched/core: Fix wake_affine() performance regression · d153b153
      Peter Zijlstra committed
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back to
      utter basics. Between the two CPUs involved with the wakeup (the CPU
      doing the wakeup and the CPU we ran on previously) pick the CPU we can
      run on _now_.
      
      This recovers much of the regression against the older kernels,
      but still leaves some ground in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d153b153
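The "back to basics" choice between the two CPUs involved in a wakeup can be modelled in a few lines. This is a toy user-space sketch, not the scheduler's code: the `idle[]` array, the `NO_CPU` sentinel, and the order in which the two CPUs are tried are all assumptions made for illustration.

```c
#include <stdbool.h>

#define NO_CPU (-1) /* hypothetical sentinel: neither CPU can run the task now */

/*
 * Toy model: idle[cpu] says whether that CPU can run a task right now.
 * Try the CPU doing the wakeup, then the CPU the task ran on previously;
 * otherwise report no preference so the caller can fall back.
 */
static int pick_wake_cpu(const bool *idle, int this_cpu, int prev_cpu)
{
	if (idle[this_cpu])
		return this_cpu;
	if (idle[prev_cpu])
		return prev_cpu;
	return NO_CPU;
}
```

The `NO_CPU` case corresponds to the overloaded situation the message mentions, where no idle CPU is left and a further heuristic is needed.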
    • L
      perf/core: Fix cgroup time when scheduling descendants · e6a52033
      leilei.lin committed
      Update cgroup time when an event is scheduled in by descendants.
      Reviewed-and-tested-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: brendan.d.gregg@gmail.com
      Cc: yang_oliver@hotmail.com
      Link: http://lkml.kernel.org/r/CALPjY3mkHiekRkRECzMi9G-bjUQOvOjVBAqxmWkTzc-g+0LwMg@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e6a52033
    • W
      perf/core: Avoid freeing static PMU contexts when PMU is unregistered · df0062b2
      Will Deacon committed
      Since commit:
      
        1fd7e416 ("perf/core: Remove perf_cpu_context::unique_pmu")
      
      ... when a PMU is unregistered then its associated ->pmu_cpu_context is
      unconditionally freed. Whilst this is fine for dynamically allocated
      context types (i.e. those registered using perf_invalid_context), this
      causes a problem for sharing of static contexts such as
      perf_{sw,hw}_context, which are used by multiple built-in PMUs and
      effectively have a global lifetime.
      
      Whilst testing the ARM SPE driver, which must use perf_sw_context to
      support per-task AUX tracing, unregistering the driver as a result of a
      module unload resulted in:
      
       Unable to handle kernel NULL pointer dereference at virtual address 00000038
       Internal error: Oops: 96000004 [#1] PREEMPT SMP
       Modules linked in: [last unloaded: arm_spe_pmu]
       PC is at ctx_resched+0x38/0xe8
       LR is at perf_event_exec+0x20c/0x278
       [...]
       ctx_resched+0x38/0xe8
       perf_event_exec+0x20c/0x278
       setup_new_exec+0x88/0x118
       load_elf_binary+0x26c/0x109c
       search_binary_handler+0x90/0x298
       do_execveat_common.isra.14+0x540/0x618
       SyS_execve+0x38/0x48
      
      since the software context has been freed and the ctx.pmu->pmu_disable_count
      field has been set to NULL.
      
      This patch fixes the problem by avoiding the freeing of static PMU contexts
      altogether. Whilst the sharing of dynamic contexts is questionable, this
      actually requires the caller to share their context pointer explicitly
      and so the burden is on them to manage the object lifetime.
      Reported-by: Kim Phillips <kim.phillips@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 1fd7e416 ("perf/core: Remove perf_cpu_context::unique_pmu")
      Link: http://lkml.kernel.org/r/1507040450-7730-1-git-send-email-will.deacon@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      df0062b2
    • P
      locking/lockdep: Fix stacktrace mess · 8b405d5c
      Peter Zijlstra committed
      There is some complication between check_prevs_add() and
      check_prev_add() wrt. saving stack traces. The problem is that we want
      to be frugal with saving stack traces, since it consumes static
      resources.
      
      We'll only know in check_prev_add() if we need the trace, but we can
      call into it multiple times. So we want to do on-demand and re-use.
      
      A further complication is that check_prev_add() can drop graph_lock
      and mess with our static resources.
      
      In any case, the current state; after commit:
      
        ce07a941 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
      
      is that we'll assume the trace contains valid data once
      check_prev_add() returns '2'. However, as noted by Josh, this is
      false: check_prev_add() can return '2' before having saved a trace,
      which then results in the possibility of using uninitialized data.
      Testing, as reported by Wu, shows a NULL deref.
      
      So simplify.
      
      Since the graph_lock() thing is a debug path that hasn't
      really been used in a long while, take it out back and avoid the
      head-ache.
      
      Further initialize the stack_trace to a known 'empty' state; as long
      as nr_entries == 0, nothing should deref entries. We can then use the
      'entries == NULL' test for a valid trace / on-demand saving.
      Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ce07a941 ("locking/lockdep: Make check_prev_add() able to handle external stack_trace")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8b405d5c
    • K
      waitid(): Add missing access_ok() checks · 96ca579a
      Kees Cook committed
      Adds missing access_ok() checks.
      
      CVE-2017-5123
      Reported-by: Chris Salls <chrissalls5@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: 4c48abe9 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
      Cc: stable@kernel.org # 4.13
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96ca579a
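The kind of check that access_ok() performs before a copy-out to a user pointer can be illustrated with a user-space stand-in. This is a hypothetical model only: `user_range_ok()` and the `user_limit` parameter are invented for illustration and do not reproduce the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an access_ok()-style check: true only if
 * [addr, addr + size) lies entirely at or below user_limit. The
 * comparison is arranged so that addr + size is never computed and
 * therefore cannot wrap around.
 */
static bool user_range_ok(uintptr_t addr, size_t size, uintptr_t user_limit)
{
	return size <= user_limit && addr <= user_limit - size;
}
```

Without a check of this kind, a copy-out path that uses unchecked stores (the Fixes: line refers to unsafe_put_user()) would accept any destination address supplied by the caller.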
  6. 09 Oct, 2017 4 commits
    • S
      netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Shmulik Ladkani committed
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when the 2nd invocation
      occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace, to perform an
      "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
      process-local fd per every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      98589a09
    • T
      genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs · e43b3b58
      Thomas Gleixner committed
      Managed interrupts can end up in a stale state on CPU hotplug. If the
      interrupt is not targeting a single CPU, i.e. the affinity mask spans
      multiple CPUs, then the following can happen:
      
      After boot:
      
      dstate:   0x01601200
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  0
      
      After offlining CPUs 31-24:
      
      dstate:   0x01a31000
                  IRQD_IRQ_DISABLED
                  IRQD_IRQ_MASKED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_AFFINITY_MANAGED
                  IRQD_MANAGED_SHUTDOWN
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  0
      
      Now CPU 25 gets onlined again, so it should get the effective interrupt
      affinity for this interrupt, but due to the x86 interrupt affinity setter
      restrictions this ends up after restarting the interrupt with:
      
      dstate:   0x01601300
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_SETAFFINITY_PENDING
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  24-31
      
      So the interrupt is still affine to CPU 24, which was the last CPU to go
      offline of that affinity set and the move to an online CPU within 24-31,
      in this case 25, is pending. This mechanism is x86/ia64 specific as those
      architectures cannot move interrupts from thread context and do this when
      an interrupt is actually handled. So the move is set to pending.
      
      What's worse is that offlining CPU 25 again results in:
      
      dstate:   0x01601300
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_SET
                  IRQD_SETAFFINITY_PENDING
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 24
      pending:  24-31
      
      This means the interrupt has not been shut down, because the outgoing CPU
      is not in the effective affinity mask, but of course nothing notices that
      the effective affinity mask is pointing at an offline CPU.
      
      In the case of restarting a managed interrupt the move restriction does not
      apply, so the affinity setting can be made unconditional. This needs to be
      done _before_ the interrupt is started up as otherwise the condition for
      moving it from thread context would no longer be fulfilled.
      
      With that change applied onlining CPU 25 after offlining 31-24 results in:
      
      dstate:   0x01600200
                  IRQD_ACTIVATED
                  IRQD_IRQ_STARTED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_MANAGED
      node:     0
      affinity: 24-31
      effectiv: 25
      pending:  
      
      And after offlining CPU 25:
      
      dstate:   0x01a30000
                  IRQD_IRQ_DISABLED
                  IRQD_IRQ_MASKED
                  IRQD_SINGLE_TARGET
                  IRQD_AFFINITY_MANAGED
                  IRQD_MANAGED_SHUTDOWN
      node:     0
      affinity: 24-31
      effectiv: 25
      pending:  
      
      which is the correct and expected result.
      
      Fixes: 761ea388 ("genirq: Handle managed irqs gracefully in irq_startup()")
      Reported-by: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: axboe@kernel.dk
      Cc: linux-scsi@vger.kernel.org
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: mpe@ellerman.id.au
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: keith.busch@intel.com
      Cc: peterz@infradead.org
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos
      e43b3b58
    • T
      genirq/cpuhotplug: Add sanity check for effective affinity mask · 60b09c51
      Thomas Gleixner committed
      The effective affinity mask handling has no safety net when the mask is not
      updated by the interrupt chip or the mask contains offline CPUs.
      
      If that happens the CPU unplug code fails to migrate interrupts.
      
      Add sanity checks and emit a warning when the mask contains only offline
      CPUs.
      
      Fixes: 415fcf1a ("genirq/cpuhotplug: Use effective affinity mask")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos
      60b09c51
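The sanity check described here, warning when the effective affinity mask contains only offline CPUs, can be modelled with plain bitmasks. This is a toy sketch assuming one bit per CPU in a 64-bit word; `effective_mask_sane()` is a hypothetical name and the real code operates on kernel cpumasks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy model with one bit per CPU. Returns true if the effective mask
 * still targets at least one online CPU; otherwise emits a warning,
 * mirroring the "mask contains only offline CPUs" case.
 */
static bool effective_mask_sane(uint64_t effective, uint64_t online)
{
	if ((effective & online) == 0) {
		fprintf(stderr, "warning: effective affinity has no online CPU\n");
		return false;
	}
	return true;
}
```

In the hotplug scenario from the previous commit, an effective mask of CPU 24 checked against an online set that no longer includes CPU 24 is exactly the stale state this check would flag.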
    • T
      genirq: Warn when effective affinity is not updated · 19e1d4e9
      Thomas Gleixner committed
      Emit a one time warning when the effective affinity mask is enabled in
      Kconfig, but the interrupt chip does not update the mask in its
      irq_set_affinity() callback.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710042208400.2406@nanos
      19e1d4e9
  7. 08 Oct, 2017 1 commit
    • A
      bpf: fix liveness marking · 8fe2d6cc
      Alexei Starovoitov committed
      While processing an Rx = Ry instruction, the verifier does
      regs[insn->dst_reg] = regs[insn->src_reg]
      which often clears write mark (when Ry doesn't have it)
      that was just set by check_reg_arg(Rx) prior to the assignment.
      That causes mark_reg_read() to keep marking Rx in this block as
      REG_LIVE_READ (since the logic incorrectly misses that it's
      screened by the write) and in many of its parents (until lucky
      write into the same Rx or beginning of the program).
      That causes is_state_visited() logic to miss many pruning opportunities.
      
      Furthermore mark_reg_read() logic propagates the read mark
      for BPF_REG_FP as well (though it's readonly) which causes
      harmless but unnecessary work during is_state_visited().
      Note that do_propagate_liveness() skips FP correctly,
      so do the same in mark_reg_read() as well.
      It saves 0.2 seconds for the test below
      
      program               before  after
      bpf_lb-DLB_L3.o       2604    2304
      bpf_lb-DLB_L4.o       11159   3723
      bpf_lb-DUNKNOWN.o     1116    1110
      bpf_lxc-DDROP_ALL.o   34566   28004
      bpf_lxc-DUNKNOWN.o    53267   39026
      bpf_netdev.o          17843   16943
      bpf_overlay.o         8672    7929
      time                  ~11 sec  ~4 sec
      
      Fixes: dc503a8a ("bpf/verifier: track liveness for pruning")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fe2d6cc
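The clobbered write mark can be reproduced in a small model. This is a simplified sketch, not the verifier's code: the struct layout, the enum values, and the shape of the fix (re-asserting the write mark after the copy) are assumptions chosen to illustrate the description above.

```c
enum reg_liveness {              /* illustrative values */
	REG_LIVE_NONE    = 0,
	REG_LIVE_READ    = 1,
	REG_LIVE_WRITTEN = 2,
};

struct reg_state {
	int type;                /* stand-in for the tracked register contents */
	int live;                /* bitmask of enum reg_liveness */
};

/* Buggy variant: the struct copy also copies src's liveness over dst's. */
static void mov_reg_buggy(struct reg_state *regs, int dst, int src)
{
	regs[dst].live |= REG_LIVE_WRITTEN;  /* write mark set before the copy */
	regs[dst] = regs[src];               /* ...and overwritten here */
}

/* Fixed variant: re-assert the write mark once the copy is done. */
static void mov_reg_fixed(struct reg_state *regs, int dst, int src)
{
	regs[dst] = regs[src];
	regs[dst].live |= REG_LIVE_WRITTEN;
}
```

With the buggy variant, a source register that lacks the write mark leaves the destination looking unwritten, so later reads keep propagating upwards, which is the missed pruning the message describes.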
  8. 04 Oct, 2017 14 commits
  9. 03 Oct, 2017 2 commits
    • P
      rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}() · f39b536c
      Paul E. McKenney committed
      The read of ->dynticks_nmi_nesting in rcu_irq_enter() and rcu_irq_exit()
      is currently protected with READ_ONCE().  However, this protection is
      unnecessary because (1) ->dynticks_nmi_nesting is updated only by the
      current CPU, (2) Although NMI handlers can update this field, they reset
      it back to its old value before return, and (3) Interrupts are disabled,
      so nothing else can modify it.  The value of ->dynticks_nmi_nesting is
      thus effectively constant, and so no protection is required.
      
      This commit therefore removes the READ_ONCE() protection from these
      two accesses.
      
      Link: http://lkml.kernel.org/r/20170926031902.GA2074@linux.vnet.ibm.com
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f39b536c
    • S
      ftrace: Fix kmemleak in unregister_ftrace_graph · 2b0b8499
      Shu Wang committed
      The trampoline allocated by the function tracer was overwritten by the
      function_graph tracer, which caused a memory leak. The save_global_trampoline
      should have saved the previous trampoline in register_ftrace_graph() and
      restored it in unregister_ftrace_graph(). But as implemented,
      save_global_trampoline was only used in unregister_ftrace_graph() with its
      default value of 0, and it overwrote the previous trampoline's value, causing
      the previously allocated trampoline to be lost.
      
      kmemleak backtrace:
          kmemleak_vmalloc+0x77/0xc0
          __vmalloc_node_range+0x1b5/0x2c0
          module_alloc+0x7c/0xd0
          arch_ftrace_update_trampoline+0xb5/0x290
          ftrace_startup+0x78/0x210
          register_ftrace_function+0x8b/0xd0
          function_trace_init+0x4f/0x80
          tracing_set_tracer+0xe6/0x170
          tracing_set_trace_write+0x90/0xd0
          __vfs_write+0x37/0x170
          vfs_write+0xb2/0x1b0
          SyS_write+0x55/0xc0
          do_syscall_64+0x67/0x180
          return_from_SYSCALL_64+0x0/0x6a
      
      [
        Looking further into this, I found that this was left over from when the
        function and function graph tracers shared the same ftrace_ops. But in
        commit 5f151b24 ("ftrace: Fix function_profiler and function tracer
        together"), the two were separated, and the save_global_trampoline no
        longer was necessary (and it may have been broken back then too).
        -- Steven Rostedt
      ]
      
      Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 5f151b24 ("ftrace: Fix function_profiler and function tracer together")
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      2b0b8499
  10. 30 Sep, 2017 1 commit
    • A
      fix infoleak in waitid(2) · 6c85501f
      Al Viro committed
      kernel_waitid() can return a PID, an error or 0.  rusage is filled in the first
      case and waitid(2) rusage should've been copied out exactly in that case, *not*
      whenever kernel_waitid() has not returned an error.  Compat variant shares that
      braino; none of kernel_wait4() callers do, so the below ought to fix it.
      Reported-and-tested-by: Alexander Potapenko <glider@google.com>
      Fixes: ce72a16f ("wait4(2)/waitid(2): separate copying rusage to userland")
      Cc: stable@vger.kernel.org # v4.13
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6c85501f
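The difference between the old and the corrected copy-out condition comes down to one comparison. This is a minimal sketch of that condition only; the two predicate names are hypothetical and the surrounding system call code is not reproduced.

```c
#include <stdbool.h>

/*
 * A kernel_waitid()-style result: > 0 is a PID, 0 means nothing was
 * reaped, < 0 is an error. Per the message, only the PID case fills
 * in rusage.
 */

/* Old condition: copies rusage out whenever no error was returned. */
static bool should_copy_rusage_buggy(long ret)
{
	return ret >= 0;
}

/* Corrected condition: copies rusage out only when a PID was returned. */
static bool should_copy_rusage(long ret)
{
	return ret > 0;
}
```

The two predicates differ only for a return value of 0, which is the case where an rusage buffer that was never filled in would otherwise reach user space.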
  11. 29 Sep, 2017 5 commits
    • E
      sched/sysctl: Check user input value of sysctl_sched_time_avg · 5ccba44b
      Ethan Zhao committed
      The system will hang if the user sets sysctl_sched_time_avg to 0:
      
        [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
      
        Stack traceback for pid 0
        0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
        ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
        0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
        ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
        Call Trace:
        <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
        [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
        [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
        [<ffffffff810c5b84>] ? load_balance+0x194/0x900
        [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
        [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
        [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
        [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
        [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
        [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
        [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
        [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
         <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
        [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
        [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
        [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
        [<ffffffff81053373>] ? start_secondary+0x173/0x1e0
      
      This happens because a divide-by-zero error occurs in scale_rt_capacity():
      
      update_group_capacity()
        update_cpu_capacity()
          scale_rt_capacity()
           {
                ...
                total = sched_avg_period() + delta;
                used = div_u64(avg, total);
                ...
           }
      
      To fix this issue, validate the user-supplied value of
      sysctl_sched_time_avg, keep the current value when the input is invalid,
      and set the minimum limit of sysctl_sched_time_avg to 1 ms.
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ethan.kernel@gmail.com
      Cc: keescook@chromium.org
      Cc: mcgrof@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5ccba44b
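The validation described in the sysctl_sched_time_avg commit above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel change itself: `time_avg_ms`, `set_time_avg_ms()` and `scaled_usage()` are hypothetical names, and the division only mirrors the shape of the `div_u64(avg, total)` call quoted in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TIME_AVG_MIN_MS 1 /* lower bound named in the commit message */

/* Hypothetical stand-in for the tunable; not the kernel variable itself. */
static unsigned int time_avg_ms = 1000;

/* Reject values below the minimum and leave the current value untouched. */
static int set_time_avg_ms(int requested)
{
    if (requested < TIME_AVG_MIN_MS)
        return -EINVAL;
    time_avg_ms = (unsigned int)requested;
    return 0;
}

/*
 * Same shape as the quoted division: because the period is at least 1,
 * the divisor is non-zero even when delta is 0 (ignoring wrap-around for
 * very large delta values).
 */
static uint64_t scaled_usage(uint64_t avg, uint64_t delta)
{
    uint64_t total = (uint64_t)time_avg_ms + delta;

    return avg / total;
}
```

Rejecting the write, rather than silently clamping it, matches the commit's stated intent of keeping the value unchanged on invalid input.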
    • P
      sched/debug: Ignore TASK_IDLE for SysRq-W · 5d68cc95
      Peter Zijlstra committed
      Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
      which results in undesirable clutter.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5d68cc95
    • P
      sched/tracing: Use common task-state helpers · 5f6ad26e
      Peter Zijlstra committed
      Remove yet another task-state char instance.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f6ad26e
    • P
      locking/rwsem-xadd: Fix missed wakeup due to reordering of load · 9c29c318
      Prateek Sood committed
      If a spinner is present, there is a chance that the load of
      rwsem_has_spinner() in rwsem_wake() can be reordered with
      respect to decrement of rwsem count in __up_write() leading
      to wakeup being missed:
      
       spinning writer                  up_write caller
       ---------------                  -----------------------
       [S] osq_unlock()                 [L] osq
        spin_lock(wait_lock)
        sem->count=0xFFFFFFFF00000001
                  +0xFFFFFFFF00000000
        count=sem->count
        MB
                                         sem->count=0xFFFFFFFE00000001
                                                   -0xFFFFFFFF00000001
                                         spin_trylock(wait_lock)
                                         return
       rwsem_try_write_lock(count)
       spin_unlock(wait_lock)
       schedule()
      
      Reordering of atomic_long_sub_return_release() in __up_write()
      and rwsem_has_spinner() in rwsem_wake() can cause a missed
      wakeup in the up_write() context. In the spinning writer, sem->count
      and the local variable count are both 0xFFFFFFFE00000001. As a result,
      rwsem_try_write_lock() fails to acquire the rwsem and the spinning
      writer goes to sleep in rwsem_down_write_failed().
      
      The smp_rmb() will make sure that the spinner state is
      consulted after sem->count is updated in the up_write() context.
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: longman@redhat.com
      Cc: parri.andrea@gmail.com
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c29c318
    • P
      sched/debug: Remove unused variable · 65d5dc47
      Peter Zijlstra committed
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      65d5dc47