1. 19 October 2017, 2 commits
  2. 18 October 2017, 1 commit
    • bpf: disallow arithmetic operations on context pointer · 28e33f9d
      Committed by Jakub Kicinski
      Commit f1174f77 ("bpf/verifier: rework value tracking")
      removed the crafty selection of which pointer types are
      allowed to be modified.  This is OK for most pointer types
      since adjust_ptr_min_max_vals() will catch operations on
      immutable pointers.  One exception is PTR_TO_CTX which is
      now allowed to be offset freely.
      
      The intent of the aforementioned commit was to allow context
      access via modified registers.  The offset passed to
      ->is_valid_access() verifier callback has been adjusted
      by the value of the variable offset.
      
      What is missing, however, is taking the variable offset
      into account when the context register is actually used; in
      terms of the code, adding that offset to the value passed to
      the ->convert_ctx_access() callback.  This leads to the following
      eBPF user code:
      
           r1 += 68
           r0 = *(u32 *)(r1 + 8)
           exit
      
      being translated to this in kernel space:
      
         0: (07) r1 += 68
         1: (61) r0 = *(u32 *)(r1 +180)
         2: (95) exit
      
      Offset 8 corresponds to 180 in the kernel, but offset
      76 would be valid too.  The verifier will "accept" access to
      offset 68+8=76 but then "convert" it as access to offset 8,
      i.e. 180.  The effective access to offset 68+180=248 is beyond
      the kernel context.
      (This is a __sk_buff example on a debug-heavy kernel -
      packet mark is 8 -> 180, 76 would be data.)
      
      Dereferencing the modified context pointer is not as easy
      as dereferencing other types, because we have to translate
      the access to reading a field in kernel structures which is
      usually at a different offset and often of a different size.
      To allow modifying the pointer we would have to make sure
      that a given eBPF instruction will always access the same
      field, or that the fields accessed are "compatible" in terms
      of offset and size...
      
      Disallow dereferencing modified context pointers and add
      the test case described here to the selftests.
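
      For illustration only, here is a minimal user-space sketch of the
      rule that ends up being enforced; the struct, field and helper
      names below are invented and are not the verifier's real
      bpf_reg_state:

        #include <stdbool.h>
        #include <stdio.h>

        /* Invented, simplified register state -- not the kernel's bpf_reg_state. */
        struct reg_model {
                int  type;          /* e.g. PTR_TO_CTX_MODEL                    */
                int  fixed_off;     /* constant offset already added to the ptr */
                bool var_off_is_0;  /* variable offset known to be exactly zero */
        };

        #define PTR_TO_CTX_MODEL 1

        /* Loads/stores through a modified context pointer are rejected. */
        static bool ctx_deref_allowed(const struct reg_model *reg)
        {
                if (reg->type == PTR_TO_CTX_MODEL &&
                    (reg->fixed_off != 0 || !reg->var_off_is_0))
                        return false;   /* "dereference of modified ctx ptr" */
                return true;
        }

        int main(void)
        {
                /* r1 += 68; r0 = *(u32 *)(r1 + 8): must be rejected now. */
                struct reg_model r1 = { PTR_TO_CTX_MODEL, 68, true };

                printf("%s\n", ctx_deref_allowed(&r1) ? "accept" : "reject");
                return 0;
        }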
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28e33f9d
  3. 11 October 2017, 1 commit
  4. 10 October 2017, 1 commit
  5. 09 October 2017, 1 commit
    • netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Committed by Shmulik Ladkani
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when the 2nd
      invocation occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace to perform
      an "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a
      new, process-local fd for every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
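
      A hedged sketch of the resulting dispatch, with invented type and
      helper names rather than the real xt_bpf_info_v1/checkentry code:

        #include <stdio.h>
        #include <string.h>

        enum bpf_mode_model { MODE_FD_ELF, MODE_PATH_PINNED /* alias of FD_PINNED */ };

        /* Invented, simplified stand-in for xt_bpf_info_v1. */
        struct xt_bpf_info_model {
                enum bpf_mode_model mode;
                int fd;                 /* trusted only in MODE_FD_ELF          */
                char path[512];         /* pinned object path, e.g. /sys/fs/bpf */
        };

        static int check_match(const struct xt_bpf_info_model *info)
        {
                switch (info->mode) {
                case MODE_FD_ELF:
                        printf("attach program from fd %d\n", info->fd);
                        return 0;
                case MODE_PATH_PINNED:
                        /* in-kernel lookup by path; the stale '.fd' is ignored */
                        printf("attach pinned program at %s\n", info->path);
                        return 0;
                }
                return -1;
        }

        int main(void)
        {
                struct xt_bpf_info_model info = { MODE_PATH_PINNED, 1234, "" };

                strcpy(info.path, "/sys/fs/bpf/xxx");  /* survives GET/SET round trips */
                return check_match(&info);
        }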
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      98589a09
  6. 08 October 2017, 1 commit
    • bpf: fix liveness marking · 8fe2d6cc
      Committed by Alexei Starovoitov
      While processing an Rx = Ry instruction the verifier does
        regs[insn->dst_reg] = regs[insn->src_reg]
      which often clears the write mark (when Ry doesn't have it)
      that was just set by check_reg_arg(Rx) prior to the assignment.
      That causes mark_reg_read() to keep marking Rx in this block as
      REG_LIVE_READ (since the logic incorrectly misses that it's
      screened by the write) and in many of its parents (until a lucky
      write into the same Rx or the beginning of the program).
      That causes the is_state_visited() logic to miss many pruning
      opportunities.
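
      A hedged user-space sketch of that first issue; the liveness flags
      and helpers are invented and far simpler than the verifier's real
      state:

        #include <stdio.h>

        #define LIVE_NONE    0
        #define LIVE_WRITTEN 1  /* "screens" reads from reaching parent states */

        struct reg_model {
                long long value;
                int live;
        };

        /* Rx = Ry: copy the tracked value, but keep the write mark that was
         * just set for Rx, instead of overwriting it with Ry's (possibly
         * clear) mark. */
        static void mov_reg(struct reg_model *dst, const struct reg_model *src)
        {
                *dst = *src;
                dst->live = LIVE_WRITTEN;
        }

        int main(void)
        {
                struct reg_model rx = { 0, LIVE_WRITTEN };  /* set by check_reg_arg */
                struct reg_model ry = { 42, LIVE_NONE };

                mov_reg(&rx, &ry);
                printf("rx.value=%lld rx.live=%d\n", rx.value, rx.live);
                return 0;
        }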
      
      Furthermore, mark_reg_read() propagates the read mark
      for BPF_REG_FP as well (though it is read-only), which causes
      harmless but unnecessary work during is_state_visited().
      Note that do_propagate_liveness() skips FP correctly,
      so do the same in mark_reg_read() as well.
      It saves 0.2 seconds for the test below
      
      program               before  after
      bpf_lb-DLB_L3.o       2604    2304
      bpf_lb-DLB_L4.o       11159   3723
      bpf_lb-DUNKNOWN.o     1116    1110
      bpf_lxc-DDROP_ALL.o   34566   28004
      bpf_lxc-DUNKNOWN.o    53267   39026
      bpf_netdev.o          17843   16943
      bpf_overlay.o         8672    7929
      time                  ~11 sec  ~4 sec
      
      Fixes: dc503a8a ("bpf/verifier: track liveness for pruning")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fe2d6cc
  7. 04 October 2017, 14 commits
  8. 03 October 2017, 2 commits
    • rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}() · f39b536c
      Committed by Paul E. McKenney
      The read of ->dynticks_nmi_nesting in rcu_irq_enter() and rcu_irq_exit()
      is currently protected with READ_ONCE().  However, this protection is
      unnecessary because (1) ->dynticks_nmi_nesting is updated only by the
      current CPU, (2) although NMI handlers can update this field, they
      reset it back to its old value before returning, and (3) interrupts
      are disabled, so nothing else can modify it.  The value of
      ->dynticks_nmi_nesting is thus effectively constant, and so no
      protection is required.
      
      This commit therefore removes the READ_ONCE() protection from these
      two accesses.
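
      A hedged illustration of the reasoning, not the RCU code itself;
      the struct and helper names are stand-ins:

        #include <stdio.h>

        #define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

        /* Stand-in for the per-CPU dynticks state. */
        struct dynticks_model {
                long dynticks_nmi_nesting;  /* written only by the owning CPU */
        };

        /* Before: READ_ONCE() guarded a value that cannot change under us. */
        static long nesting_before(struct dynticks_model *rdt)
        {
                return READ_ONCE(rdt->dynticks_nmi_nesting);
        }

        /* After: with interrupts off and only local writers, a plain load is fine. */
        static long nesting_after(struct dynticks_model *rdt)
        {
                return rdt->dynticks_nmi_nesting;
        }

        int main(void)
        {
                struct dynticks_model rdt = { 1 };

                printf("%ld %ld\n", nesting_before(&rdt), nesting_after(&rdt));
                return 0;
        }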
      
      Link: http://lkml.kernel.org/r/20170926031902.GA2074@linux.vnet.ibm.com
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f39b536c
    • ftrace: Fix kmemleak in unregister_ftrace_graph · 2b0b8499
      Committed by Shu Wang
      The trampoline allocated by the function tracer was overwritten by the
      function_graph tracer, which caused a memory leak. save_global_trampoline
      should have saved the previous trampoline in register_ftrace_graph() and
      restored it in unregister_ftrace_graph(). But as implemented,
      save_global_trampoline was only used in unregister_ftrace_graph() with
      its default value of 0, and it overwrote the previous trampoline's value,
      causing the previously allocated trampoline to be lost.
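
      A hedged user-space model of the leak pattern only (not the ftrace
      code); apart from the two variable names echoed from the description
      above, everything here is invented:

        #include <stdlib.h>

        static void *current_trampoline;        /* installed trampoline          */
        static void *save_global_trampoline;    /* never written: stays NULL (0) */

        static void register_graph(void)
        {
                /* bug: the previously installed trampoline is not saved here */
                current_trampoline = malloc(4096);
        }

        static void unregister_graph(void)
        {
                free(current_trampoline);
                current_trampoline = save_global_trampoline;  /* "restores" NULL */
        }

        int main(void)
        {
                current_trampoline = malloc(4096); /* function tracer's trampoline */
                register_graph();     /* overwrites the pointer: first buffer leaks */
                unregister_graph();
                return 0;
        }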
      
      kmemleak backtrace:
          kmemleak_vmalloc+0x77/0xc0
          __vmalloc_node_range+0x1b5/0x2c0
          module_alloc+0x7c/0xd0
          arch_ftrace_update_trampoline+0xb5/0x290
          ftrace_startup+0x78/0x210
          register_ftrace_function+0x8b/0xd0
          function_trace_init+0x4f/0x80
          tracing_set_tracer+0xe6/0x170
          tracing_set_trace_write+0x90/0xd0
          __vfs_write+0x37/0x170
          vfs_write+0xb2/0x1b0
          SyS_write+0x55/0xc0
          do_syscall_64+0x67/0x180
          return_from_SYSCALL_64+0x0/0x6a
      
      [
        Looking further into this, I found that this was left over from when the
        function and function graph tracers shared the same ftrace_ops. But in
        commit 5f151b24 ("ftrace: Fix function_profiler and function tracer
        together"), the two were separated, and the save_global_trampoline no
        longer was necessary (and it may have been broken back then too).
        -- Steven Rostedt
      ]
      
      Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 5f151b24 ("ftrace: Fix function_profiler and function tracer together")
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      2b0b8499
  9. 30 September 2017, 1 commit
    • fix infoleak in waitid(2) · 6c85501f
      Committed by Al Viro
      kernel_waitid() can return a PID, an error or 0.  rusage is filled only
      in the first case, and waitid(2) should have copied rusage out exactly
      in that case, *not* whenever kernel_waitid() has not returned an error.
      The compat variant shares that braino; none of the kernel_wait4()
      callers do, so the below ought to fix it.
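
      A hedged sketch of the condition change; fake_kernel_waitid() and the
      rusage stand-in are invented for illustration:

        #include <stdio.h>
        #include <string.h>

        struct rusage_model { long utime, stime; };

        /* Stand-in: >0 means a child was reaped and *ru was filled, 0 means
         * nothing was reaped (*ru untouched), <0 means error. */
        static long fake_kernel_waitid(struct rusage_model *ru)
        {
                (void)ru;
                return 0;
        }

        int main(void)
        {
                struct rusage_model ru;
                long ret;

                memset(&ru, 0xAA, sizeof(ru));      /* stale "stack" contents */
                ret = fake_kernel_waitid(&ru);

                /* leaky condition was effectively: if (ret >= 0) copy ru out */
                if (ret > 0)                        /* copy only when filled in */
                        printf("copy rusage to userspace\n");
                else
                        printf("do not copy rusage (ret=%ld)\n", ret);
                return 0;
        }
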
      Reported-and-tested-by: Alexander Potapenko <glider@google.com>
      Fixes: ce72a16f ("wait4(2)/waitid(2): separate copying rusage to userland")
      Cc: stable@vger.kernel.org # v4.13
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6c85501f
  10. 29 September 2017, 7 commits
    • sched/sysctl: Check user input value of sysctl_sched_time_avg · 5ccba44b
      Committed by Ethan Zhao
      The system will hang if the user sets sysctl_sched_time_avg to 0:
      
        [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
      
        Stack traceback for pid 0
        0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
        ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
        0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
        ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
        Call Trace:
        <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
        [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
        [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
        [<ffffffff810c5b84>] ? load_balance+0x194/0x900
        [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
        [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
        [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
        [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
        [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
        [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
        [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
        [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
         <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
        [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
        [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
        [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
        [<ffffffff81053373>] ? start_secondary+0x173/0x1e0
      
      This is because a divide-by-zero error happens in:
      
      update_group_capacity()
        update_cpu_capacity()
          scale_rt_capacity()
           {
                ...
                total = sched_avg_period() + delta;
                used = div_u64(avg, total);
                ...
           }
      
      To fix this issue, check the user input value of sysctl_sched_time_avg,
      keep it unchanged when the input is invalid, and set the minimum limit
      of sysctl_sched_time_avg to 1 ms.
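
      A hedged user-space model of the guard; the helper below is invented
      and is not the actual sysctl handler change:

        #include <stdio.h>

        static unsigned int sched_time_avg_ms = 1000;   /* current value */

        /* Reject anything below 1 ms so sched_avg_period() can never be zero
         * and scale_rt_capacity() cannot divide by zero. */
        static int set_sched_time_avg(unsigned int val)
        {
                if (val < 1)
                        return -1;      /* keep the old value, report an error */
                sched_time_avg_ms = val;
                return 0;
        }

        int main(void)
        {
                printf("set 0   -> %d (still %u ms)\n",
                       set_sched_time_avg(0), sched_time_avg_ms);
                printf("set 500 -> %d (now %u ms)\n",
                       set_sched_time_avg(500), sched_time_avg_ms);
                return 0;
        }
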
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ethan.kernel@gmail.com
      Cc: keescook@chromium.org
      Cc: mcgrof@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5ccba44b
    • sched/debug: Ignore TASK_IDLE for SysRq-W · 5d68cc95
      Committed by Peter Zijlstra
      Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
      which results in undesirable clutter.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5d68cc95
    • sched/tracing: Use common task-state helpers · 5f6ad26e
      Committed by Peter Zijlstra
      Remove yet another task-state char instance.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f6ad26e
    • locking/rwsem-xadd: Fix missed wakeup due to reordering of load · 9c29c318
      Committed by Prateek Sood
      If a spinner is present, there is a chance that the load of
      rwsem_has_spinner() in rwsem_wake() can be reordered with
      respect to decrement of rwsem count in __up_write() leading
      to wakeup being missed:
      
       spinning writer                  up_write caller
       ---------------                  -----------------------
       [S] osq_unlock()                 [L] osq
        spin_lock(wait_lock)
        sem->count=0xFFFFFFFF00000001
                  +0xFFFFFFFF00000000
        count=sem->count
        MB
                                         sem->count=0xFFFFFFFE00000001
                                                   -0xFFFFFFFF00000001
                                         spin_trylock(wait_lock)
                                         return
       rwsem_try_write_lock(count)
       spin_unlock(wait_lock)
       schedule()
      
      Reordering of atomic_long_sub_return_release() in __up_write()
      and rwsem_has_spinner() in rwsem_wake() can cause a wakeup to be
      missed in the up_write() context. In the spinning writer, sem->count
      and the local variable count are both 0xFFFFFFFE00000001, which
      results in rwsem_try_write_lock() failing to acquire the rwsem and
      the spinning writer going to sleep in rwsem_down_write_failed().
      
      The smp_rmb() makes sure that the spinner state is consulted
      after sem->count is updated in the up_write() context.
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: longman@redhat.com
      Cc: parri.andrea@gmail.com
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c29c318
    • sched/debug: Remove unused variable · 65d5dc47
      Committed by Peter Zijlstra
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      65d5dc47
    • perf/aux: Only update ->aux_wakeup in non-overwrite mode · 441430eb
      Committed by Alexander Shishkin
      The following commit:
      
        d9a50b02 ("perf/aux: Ensure aux_wakeup represents most recent wakeup index")
      
      changed the AUX wakeup position calculation to rounddown(), which causes
      a division-by-zero in AUX overwrite mode (aka "snapshot mode").
      
      The zero denominator results from the fact that perf record doesn't set
      aux_watermark to anything, in which case the kernel will set it to half
      the AUX buffer size, but only in non-overwrite mode. In overwrite
      mode, aux_watermark stays zero.
      
      The good news is that, in AUX overwrite mode, wakeups don't happen
      and the related bookkeeping is not relevant, so we can simply forgo
      the wakeup updates altogether.
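
      A hedged model of the guard; rounddown_to() and the parameter names
      are invented for this sketch:

        #include <stdio.h>

        static unsigned long rounddown_to(unsigned long x, unsigned long m)
        {
                return x - (x % m);     /* undefined if m == 0 */
        }

        static unsigned long update_wakeup(unsigned long head,
                                           unsigned long watermark, int overwrite)
        {
                if (overwrite)
                        return 0;   /* snapshot mode: no wakeups, no bookkeeping */
                return rounddown_to(head, watermark);
        }

        int main(void)
        {
                printf("%lu\n", update_wakeup(5000, 4096, 0)); /* normal mode    */
                printf("%lu\n", update_wakeup(5000, 0, 1));    /* overwrite mode */
                return 0;
        }
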
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/20170906160811.16510-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      441430eb
    • PM / s2idle: Invoke the ->wake() platform callback earlier · 87cbde8d
      Committed by Rafael J. Wysocki
      The role of the ->wake() platform callback for suspend-to-idle is to
      deal with possible spurious wakeups, among other things.  The ACPI
      implementation of it, acpi_s2idle_wake(), additionally checks the
      conditions for entering the Low Power S0 Idle state by the platform
      and reports the ones that have not been met.
      
      However, the ->wake() platform callback is invoked after calling
      dpm_noirq_resume_devices(), which means that the power states of some
      devices may have changed since s2idle_enter() returned, so some unmet
      Low Power S0 Idle conditions may be reported incorrectly as a result
      of that.
      
      To avoid these false positives, reorder the invocations of the
      dpm_noirq_resume_devices() routine and the ->wake() platform callback
      in s2idle_loop().
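
      A hedged sketch of the reordering only; the helper names are
      stand-ins, not the real s2idle_loop() callees:

        #include <stdio.h>

        static void s2idle_enter_model(void)         { puts("enter s2idle"); }
        static void platform_wake_model(void)        { puts("check wakeup reasons"); }
        static void resume_noirq_devices_model(void) { puts("resume noirq devices"); }

        /* One loop iteration: consult the platform ->wake() state while
         * devices are still suspended, and only then resume them. */
        static void s2idle_iteration(void)
        {
                s2idle_enter_model();
                platform_wake_model();          /* moved before the device resume */
                resume_noirq_devices_model();
        }

        int main(void)
        {
                s2idle_iteration();
                return 0;
        }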
      
      Fixes: 726fb6b4 (ACPI / PM: Check low power idle constraints for debug only)
      Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      87cbde8d
  11. 28 September 2017, 3 commits
  12. 26 September 2017, 6 commits
    • smp/hotplug: Hotplug state fail injection · 1db49484
      Committed by Peter Zijlstra
      Add a sysfs file to make a specific state fail once. This can be
      used to test the state rollback code paths.
      
      Something like this (hotplug-up.sh):
      
        #!/bin/bash
      
        echo 0 > /debug/sched_debug
        echo 1 > /debug/tracing/events/cpuhp/enable
      
        ALL_STATES=`cat /sys/devices/system/cpu/hotplug/states | cut -d':' -f1`
        STATES=${1:-$ALL_STATES}
      
        for state in $STATES
        do
      	  echo 0 > /sys/devices/system/cpu/cpu1/online
      	  echo 0 > /debug/tracing/trace
      	  echo Fail state: $state
      	  echo $state > /sys/devices/system/cpu/cpu1/hotplug/fail
      	  cat /sys/devices/system/cpu/cpu1/hotplug/fail
      	  echo 1 > /sys/devices/system/cpu/cpu1/online
      
      	  cat /debug/tracing/trace > hotfail-${state}.trace
      
      	  sleep 1
        done
      
      This can be used to test all possible rollback scenarios (barring
      multi-instance) on CPU-up; CPU-down is a trivial modification of
      the above.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Cc: max.byungchul.park@gmail.com
      Link: https://lkml.kernel.org/r/20170920170546.972581715@infradead.org
      
      1db49484
    • smp/hotplug: Differentiate the AP completion between up and down · 5ebe7742
      Committed by Peter Zijlstra
      With lockdep-crossrelease we get deadlock reports that span cpu-up and
      cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
      and cpu-down are globally serialized.
      
        takedown_cpu()
          irq_lock_sparse()
          wait_for_completion(&st->done)
      
                                      cpuhp_thread_fun
                                        cpuhp_up_callback
                                          cpuhp_invoke_callback
                                            irq_affinity_online_cpu
                                              irq_lock_sparse()
                                              irq_unlock_sparse()
                                        complete(&st->done)
      
      Now that we have consistent AP state, we can trivially separate the
      AP completion between up and down using st->bringup.
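
      A hedged sketch of the idea with stand-in types; the real code uses
      struct completion and the cpuhp state machine:

        #include <stdio.h>
        #include <stdbool.h>

        struct completion_model { const char *name; };

        struct cpuhp_model {
                bool bringup;
                struct completion_model done_up;    /* waited on by cpu-up   */
                struct completion_model done_down;  /* waited on by cpu-down */
        };

        /* The AP completes the object matching the current direction, so up
         * and down waits can never be chained into one (false) dependency. */
        static struct completion_model *ap_done(struct cpuhp_model *st)
        {
                return st->bringup ? &st->done_up : &st->done_down;
        }

        int main(void)
        {
                struct cpuhp_model st = { true, { "done_up" }, { "done_down" } };

                printf("complete(%s)\n", ap_done(&st)->name);
                st.bringup = false;
                printf("complete(%s)\n", ap_done(&st)->name);
                return 0;
        }
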
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: max.byungchul.park@gmail.com
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Link: https://lkml.kernel.org/r/20170920170546.872472799@infradead.org
      
      5ebe7742
    • smp/hotplug: Differentiate the AP-work lockdep class between up and down · 5f4b55e1
      Committed by Peter Zijlstra
      With lockdep-crossrelease we get deadlock reports that span cpu-up and
      cpu-down chains. Such deadlocks cannot possibly happen because cpu-up
      and cpu-down are globally serialized.
      
        CPU0                  CPU1                    CPU2
        cpuhp_up_callbacks:   takedown_cpu:           cpuhp_thread_fun:
      
        cpuhp_state
                              irq_lock_sparse()
          irq_lock_sparse()
                              wait_for_completion()
                                                      cpuhp_state
                                                      complete()
      
      Now that we have consistent AP state, we can trivially separate the
      AP-work class between up and down using st->bringup.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: max.byungchul.park@gmail.com
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Link: https://lkml.kernel.org/r/20170920170546.922524234@infradead.org
      
      5f4b55e1
    • smp/hotplug: Callback vs state-machine consistency · 724a8688
      Committed by Peter Zijlstra
      While the generic callback functions have an 'int' return and thus
      appear to be allowed to return an error, this is not true for all
      states.

      Specifically, the states that used to be STARTING/DYING are run with
      IRQs disabled from critical parts of CPU bringup/teardown and are not
      allowed to fail. Add WARNs to enforce this rule.
      
      But since some callbacks are indeed allowed to fail, we can end up in
      a situation where a state-machine rollback encounters a failure; in
      that case we're stuck, we can't go forward and we can't go back. Also
      add a WARN for that case.
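
      A hedged model of the two warnings; can_fail(), invoke_callback() and
      the state numbers are invented:

        #include <stdio.h>
        #include <stdbool.h>

        static bool can_fail(int state)        { return state >= 100; } /* invented */
        static int  invoke_callback(int state) { return state == 5 ? -1 : 0; }

        static int run_state(int state, bool rollback)
        {
                int ret = invoke_callback(state);

                if (ret && !can_fail(state)) {
                        fprintf(stderr, "WARN: atomic state %d may not fail\n", state);
                        ret = 0;    /* nothing sensible left to do but continue */
                }
                if (ret && rollback)
                        fprintf(stderr, "WARN: failure during rollback, stuck\n");
                return ret;
        }

        int main(void)
        {
                return run_state(5, false);
        }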
      
      AFAICT this is a fundamental 'problem' with no real obvious solution.
      We want the 'prepare' callbacks to allow failure on either up or down.
      Typically on prepare-up this would be things like -ENOMEM from
      resource allocations, and the typical usage in prepare-down would be
      something like -EBUSY to avoid CPUs being taken away.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Cc: max.byungchul.park@gmail.com
      Link: https://lkml.kernel.org/r/20170920170546.819539119@infradead.org
      
      724a8688
    • smp/hotplug: Rewrite AP state machine core · 4dddfb5f
      Committed by Peter Zijlstra
      There is currently no explicit state change on rollback. That is,
      st->bringup, st->rollback and st->target are not consistent when doing
      the rollback.
      
      Rework the AP state handling to be more coherent. This does mean we
      have to do a second AP kick-and-wait for rollback, but since rollback
      is the slow path of a slowpath, this really should not matter.
      
      Take this opportunity to simplify the AP thread function to only run
      a single callback per invocation. This unifies the three single/up/down
      modes it supports. The looping it used to do for up/down is achieved
      by retaining should_run and relying on the main smpboot_thread_fn()
      loop.
      
      (I have most of a patch that does the same for the BP state handling,
      but that's not critical and gets a little complicated because
      CPUHP_BRINGUP_CPU does the AP handoff from a callback, which leads to
      recursive @st usage; I still have to de-fugly that.)
      
      [ tglx: Move cpuhp_down_callbacks() et al. into the HOTPLUG_CPU section to
        	avoid gcc complaining about unused functions. Make the HOTPLUG_CPU
        	one piece instead of having two consecutive ifdef sections of the
        	same type. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Cc: max.byungchul.park@gmail.com
      Link: https://lkml.kernel.org/r/20170920170546.769658088@infradead.org
      
      4dddfb5f
    • smp/hotplug: Allow external multi-instance rollback · 96abb968
      Committed by Peter Zijlstra
      Currently the rollback of multi-instance states is handled inside
      cpuhp_invoke_callback(). The problem is that when we want to allow an
      explicit state change for rollback, we need to return from the
      function without doing the rollback.
      
      Change cpuhp_invoke_callback() to optionally return the multi-instance
      state, such that rollback can be done from a subsequent call.
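
      A hedged sketch of the interface change; the types and the
      invoke_multi() name are invented, not the real
      cpuhp_invoke_callback() signature:

        #include <stdio.h>

        struct instance_model { int id; };

        /* Invoke the per-instance callbacks; on failure, report the node that
         * failed via @last_failed and let the caller drive the rollback. */
        static int invoke_multi(struct instance_model *nodes, int count,
                                struct instance_model **last_failed)
        {
                int i;

                for (i = 0; i < count; i++) {
                        if (nodes[i].id == 2) {         /* pretend this one fails */
                                *last_failed = &nodes[i];
                                return -1;
                        }
                }
                *last_failed = NULL;
                return 0;
        }

        int main(void)
        {
                struct instance_model nodes[] = { { 1 }, { 2 }, { 3 } };
                struct instance_model *failed;

                if (invoke_multi(nodes, 3, &failed))
                        printf("rollback from instance %d (done by the caller)\n",
                               failed->id);
                return 0;
        }
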
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Cc: max.byungchul.park@gmail.com
      Link: https://lkml.kernel.org/r/20170920170546.720361181@infradead.org
      
      96abb968