1. 29 May 2018, 2 commits
    • PM / QoS: Drop redundant declaration of pm_qos_get_value() · 74cd8171
      Authored by Rafael J. Wysocki
      The extra forward declaration of pm_qos_get_value() is redundant, so
      drop it.
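
      A minimal sketch of the change; the prototype below is reproduced from
      memory of kernel/power/qos.c and is illustrative only:

         /* Before: the same forward declaration appeared twice in the file. */
         static s32 pm_qos_get_value(struct pm_qos_constraints *c);
         /* ... other declarations ... */
         static s32 pm_qos_get_value(struct pm_qos_constraints *c); /* redundant copy, dropped */
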
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • tracing: Make the snapshot trigger work with instances · 2824f503
      Authored by Steven Rostedt (VMware)
      The snapshot trigger currently only affects the main ring buffer, even when
      it is used by the instances. This can be confusing as the snapshot trigger
      is listed in the instance.
      
       > # cd /sys/kernel/tracing
       > # mkdir instances/foo
       > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger
       > # echo top buffer > trace_marker
       > # echo foo buffer > instances/foo/trace_marker
       > # touch /tmp/bar
       > # chown rostedt /tmp/bar
       > # cat instances/foo/snapshot
       # tracer: nop
       #
       #
       # * Snapshot is freed *
       #
       # Snapshot commands:
       # echo 0 > snapshot : Clears and frees snapshot buffer
       # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
       #                      Takes a snapshot of the main buffer.
       # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
       #                      (Doesn't have to be '2' works with any number that
       #                       is not a '0' or '1')
      
       > # cat snapshot
       # tracer: nop
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
                   bash-1189  [000] ....   111.488323: tracing_mark_write: top buffer
      
      Not only did the snapshot occur in the top-level buffer, but the instance's
      snapshot buffer, which should have been allocated by the trigger, is still free.
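
      The direction of the eventual fix can be sketched as follows. The
      tracing_snapshot_instance() helper and the use of the trigger's
      trace_event_file follow the upstream patch, but the code below is a
      simplified illustration rather than the verbatim kernel source:

         static void
         snapshot_trigger(struct event_trigger_data *data, void *rec)
         {
             struct trace_event_file *file = data->private_data;

             if (file)
                 /* snapshot the buffer of the instance that owns the event */
                 tracing_snapshot_instance(file->tr);
             else
                 /* no file context: fall back to the top-level buffer */
                 tracing_snapshot();
         }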
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  2. 28 May 2018, 1 commit
    • tracing: Fix crash when freeing instances with event triggers · 86b389ff
      Authored by Steven Rostedt (VMware)
      If an instance has an event trigger enabled when it is freed, it can cause
      an access to freed memory. Here's the case that crashes:
      
       # cd /sys/kernel/tracing
       # mkdir instances/foo
       # echo snapshot > instances/foo/events/initcall/initcall_start/trigger
       # rmdir instances/foo
      
      Would produce:
      
       general protection fault: 0000 [#1] PREEMPT SMP PTI
       Modules linked in: tun bridge ...
       CPU: 5 PID: 6203 Comm: rmdir Tainted: G        W         4.17.0-rc4-test+ #933
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:clear_event_triggers+0x3b/0x70
       RSP: 0018:ffffc90003783de0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0
       RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016
       R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0
       R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c
       FS:  00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0
       Call Trace:
        event_trace_del_tracer+0x2a/0xc5
        instance_rmdir+0x15c/0x200
        tracefs_syscall_rmdir+0x52/0x90
        vfs_rmdir+0xdb/0x160
        do_rmdir+0x16d/0x1c0
        __x64_sys_rmdir+0x17/0x20
        do_syscall_64+0x55/0x1a0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This was due to the call that clears out the triggers when an instance is
      being deleted not removing the trigger from the linked list.
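
      A simplified sketch of the fixed cleanup loop (based on the upstream
      change; locking and error handling are omitted):

         void clear_event_triggers(struct trace_array *tr)
         {
             struct trace_event_file *file;
             struct event_trigger_data *data, *n;

             list_for_each_entry(file, &tr->events, list) {
                 list_for_each_entry_safe(data, n, &file->triggers, list) {
                     trace_event_trigger_enable_disable(file, 0);
                     list_del_rcu(&data->list); /* this unlink was missing */
                     if (data->ops->free)
                         data->ops->free(data->ops, data);
                 }
             }
         }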
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  3. 27 May 2018, 4 commits
  4. 26 May 2018, 1 commit
  5. 25 May 2018, 2 commits
    • kthread: Allow kthread_park() on a parked kthread · b1f5b378
      Authored by Peter Zijlstra
      The following commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      added a WARN() in the case where we call kthread_park() on an already
      parked thread, because the old code wasn't doing the right thing there
      and it wasn't at all clear that would happen.
      
      It turns out, this does in fact happen, so we have to deal with it.
      
      Instead of potentially returning early, also wait for the completion.
      This does however mean we have to use complete_all() and re-initialize
      the completion on re-use.
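
      Roughly, the resulting flow looks like the sketch below (simplified from
      kernel/kthread.c around v4.17; the real kthread_park() also returns an
      error code and handles the self-parking case):

         void kthread_park_sketch(struct task_struct *k)
         {
             struct kthread *kthread = to_kthread(k);

             set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
             if (k != current) {
                 wake_up_process(k);
                 /* Wait even if the thread is already parked, instead of
                  * warning and returning early. */
                 wait_for_completion(&kthread->parked);
             }
         }

         /* __kthread_parkme() signals with complete_all(&self->parked), and the
          * completion is re-initialized with reinit_completion() when the
          * parked state is left and later reused. */
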
      Reported-by: LKP <lkp@01.org>
      Tested-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel test robot <lkp@intel.com>
      Cc: wfg@linux.intel.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Clarify root domain(s) debug string · bf5015a5
      Authored by Juri Lelli
      When scheduler debug is enabled, building scheduling domains outputs
      information about how the domains are laid out and to which root domain
      each CPU (or sets of CPUs) belongs, e.g.:
      
       CPU0 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
       CPU1 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
      
       [...]
      
       span: 0-5 (max cpu_capacity = 1024)
      
      The fact that the last line refers to the CPUs 0-5 root domain is, however, not
      immediately obvious: one might wonder why span 0-5 is reported "again".
      
      Make it clearer by adding "root domain" to it, so as to end up with the
      following:
      
       CPU0 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
       CPU1 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
      
       [...]
      
       root domain span: 0-5 (max cpu_capacity = 1024)
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180524152936.17611-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 24 May 2018, 2 commits
    • bpf: properly enforce index mask to prevent out-of-bounds speculation · c93552c4
      Authored by Daniel Borkmann
      While reviewing the verifier code, I recently noticed that the
      following two program variants in relation to tail calls can be
      loaded.
      
      Variant 1:
      
        # bpftool p d x i 15
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:5]
          3: (05) goto pc+2
          4: (18) r2 = map[id:6]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0xa0 goto pc+2
          8: (54) (u32) r3 &= (u32) 255
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 5
          5: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
        # bpftool m s i 6
          6: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
      
      Variant 2:
      
        # bpftool p d x i 20
          0: (15) if r1 == 0x0 goto pc+3
          1: (18) r2 = map[id:8]
          3: (05) goto pc+2
          4: (18) r2 = map[id:7]
          6: (b7) r3 = 7
          7: (35) if r3 >= 0x4 goto pc+2
          8: (54) (u32) r3 &= (u32) 3
          9: (85) call bpf_tail_call#12
         10: (b7) r0 = 1
         11: (95) exit
      
        # bpftool m s i 8
          8: prog_array  flags 0x0
              key 4B  value 4B  max_entries 160  memlock 4096B
        # bpftool m s i 7
          7: prog_array  flags 0x0
              key 4B  value 4B  max_entries 4  memlock 4096B
      
      In both cases the index masking inserted by the verifier in order
      to control out of bounds speculation from a CPU via b2157399
      ("bpf: prevent out-of-bounds speculation") seems to be incorrect
      in what it is enforcing. In the 1st variant, the mask is applied
      from the map with the significantly larger number of entries, where
      we would allow a certain degree of out-of-bounds speculation for
      the smaller map; and in the 2nd variant, where the mask is applied
      from the map with the smaller number of entries, we get buggy
      behavior since we truncate the index of the larger map.
      
      The original intent from commit b2157399 is to reject such
      occasions where two or more different tail call maps are used
      in the same tail call helper invocation. However, the check on
      the BPF_MAP_PTR_POISON is never hit since we never poisoned the
      saved pointer in the first place! We do this explicitly for map
      lookups but in case of tail calls we basically used the tail
      call map in insn_aux_data that was processed in the most recent
      path which the verifier walked. Thus any prior path that stored
      a pointer in insn_aux_data at the helper location was always
      overridden.
      
      Fix it by moving the map pointer poison logic into a small helper
      that covers both BPF helpers with the same logic. After that in
      fixup_bpf_calls() the poison check is then hit for tail calls
      and the program rejected. The latter only happens in the unprivileged
      case, since that is the *only* occasion where a rewrite needs to
      happen, and where such a rewrite is specific to the map (max_entries,
      index_mask). In the privileged case the rewrite is generic for
      the insn->imm / insn->code update so multiple maps from different
      paths can be handled just fine since all the remaining logic
      happens in the instruction processing itself. This is similar
      to the case of map lookups: in case there is a collision of
      maps in fixup_bpf_calls() we must skip the inlined rewrite since
      this will turn the generic instruction sequence into a non-
      generic one. Thus the patch_call_imm will simply update the
      insn->imm location where the bpf_map_lookup_elem() will later
      take care of the dispatch. Given we need this 'poison' state
      as a check, the information of whether a map is an unpriv_array
      gets lost, so enforcing it prior to that needs an additional
      state. In general this check is needed since there are some
      complex and tail call intensive BPF programs out there where
      LLVM tends to generate such code occasionally. We therefore
      convert the map_ptr rather into map_state to store all this
      w/o extra memory overhead, and the bit whether one of the maps
      involved in the collision was from an unpriv_array thus needs
      to be retained as well there.
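
      A condensed sketch of the mechanism; the helper names and the map_state
      field mirror the upstream patch, but this is a simplified illustration,
      not the verbatim verifier source:

         static void bpf_map_ptr_store(struct bpf_insn_aux_data *aux,
                                       const struct bpf_map *map, bool unpriv)
         {
             /* remember the map seen at this helper call site together with
              * the "was it an unpriv_array" bit, packed into one word */
             unpriv |= bpf_map_ptr_unpriv(aux);
             aux->map_state = (unsigned long)map | unpriv;
         }

         /* When another verifier path reaches the same call site with a
          * different map, the stored state is replaced by BPF_MAP_PTR_POISON;
          * fixup_bpf_calls() then rejects the (unprivileged) tail call instead
          * of silently reusing the index mask of whichever map came last. */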
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • cpufreq: schedutil: Avoid missing updates for one-CPU policies · a61dec74
      Authored by Rafael J. Wysocki
      Commit 152db033 (schedutil: Allow cpufreq requests to be made
      even when kthread kicked) made changes to prevent utilization updates
      from being discarded during processing a previous request, but it
      left a small window in which that still can happen in the one-CPU
      policy case.  Namely, updates coming in after setting work_in_progress
      in sugov_update_commit() and clearing it in sugov_work() will still
      be dropped due to the work_in_progress check in sugov_update_single().
      
      To close that window, rearrange the code so as to acquire the update
      lock around the deferred update branch in sugov_update_single()
      and drop the work_in_progress check from it.
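
      The shape of the change can be sketched like this (simplified; the real
      patch also factors the deferred path into a common sugov_deferred_update()
      helper shared with the shared-policy case):

         /* slow path of sugov_update_single(), sketch */
         if (!sg_policy->policy->fast_switch_enabled) {
             raw_spin_lock(&sg_policy->update_lock);
             /* No early work_in_progress bail-out any more: even while the
              * kthread is servicing a previous request, the new frequency is
              * recorded under the lock instead of being dropped. */
             if (sg_policy->next_freq != next_f) {
                 sg_policy->next_freq = next_f;
                 if (!sg_policy->work_in_progress) {
                     sg_policy->work_in_progress = true;
                     irq_work_queue(&sg_policy->irq_work);
                 }
             }
             raw_spin_unlock(&sg_policy->update_lock);
         }
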
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
  7. 23 May 2018, 2 commits
  8. 22 May 2018, 2 commits
    • cpufreq: schedutil: Cleanup and document iowait boost · fd7d5287
      Authored by Patrick Bellasi
      The iowait boosting code has recently been updated to add a progressive
      boosting behavior which allows it to be less aggressive in boosting tasks
      doing only sporadic IO operations, thus being more energy efficient, for
      example on mobile platforms.
      
      The current code is now however a bit convoluted. Some functionalities
      (e.g. iowait boost reset) are replicated in different paths and their
      documentation is slightly misaligned.
      
      Let's clean up the code by consolidating all the IO wait boosting related
      functionality within a few dedicated functions and better define
      their roles:
      
      - sugov_iowait_boost: set/increase the IO wait boost of a CPU
      - sugov_iowait_apply: apply/reduce the IO wait boost of a CPU
      
      Both of these functions are used at every sugov update, and they make
      use of a unified IO wait boost reset policy provided by:
      
      - sugov_iowait_reset: reset/disable the IO wait boost of a CPU
           if a CPU is not updated for more than one tick
      
      This makes possible a cleaner and more self-contained design for the IO
      wait boosting code since the rest of the sugov update routines, both for
      single and shared frequency domains, follow the same template:
      
         /* Configure IO boost, if required */
         sugov_iowait_boost()
      
         /* Return here if freq change is in progress or throttled */
      
         /* Collect and aggregate utilization information */
         sugov_get_util()
         sugov_aggregate_util()
      
         /*
          * Add IO boost, if currently enabled, on top of the aggregated
          * utilization value
          */
         sugov_iowait_apply()
      
      As an extra bonus, let's also add documentation for the new
      functions and better align the in-code documentation.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: schedutil: Fix iowait boost reset · 295f1a99
      Authored by Patrick Bellasi
      A more energy efficient update of the IO wait boosting mechanism has
      been introduced in:
      
         commit a5a0809b ("cpufreq: schedutil: Make iowait boost more energy efficient")
      
      where the boost value is expected to be:
      
        - doubled at each successive wakeup from IO
          starting from the minimum frequency supported by a CPU
       
        - reset when a CPU is not updated for more than one tick
          by either disabling the IO wait boost or resetting its value to the
          minimum frequency if this new update requires an IO boost.
      
      This approach is supposed to "ignore" boosting for sporadic wakeups from
      IO, while still getting the frequency boosted to the maximum to benefit
      long sequences of wakeups from IO operations.
      
      However, these assumptions are not always satisfied.
      For example, when an IO boosted CPU enters idle for more than one tick
      and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
      first check the IOWAIT flag, we keep doubling the iowait boost instead
      of restarting from the minimum frequency value.
      
      This misbehavior could happen mainly on non-shared frequency domains,
      thus defeating the energy efficiency optimization, but it can also
      happen on shared frequency domain systems.
      
      Let's fix this issue in sugov_set_iowait_boost() by:
        - first checking the IO wait boost reset conditions,
          to possibly reset the boost value,
        - then applying the correct IO boost value
          if required by the caller (see the sketch below).
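
      A condensed sketch of the reordered checks (field names approximate the
      v4.17 schedutil code; the boost-doubling path is elided):

         /* inside sugov_set_iowait_boost(), sketch */
         bool iowait = flags & SCHED_CPUFREQ_IOWAIT;

         /* 1) Reset check first: a boost left over from before an idle period
          *    longer than a tick is stale and must not be doubled again. */
         if (sg_cpu->iowait_boost &&
             (s64)(time - sg_cpu->last_update) > TICK_NSEC) {
             sg_cpu->iowait_boost = iowait ? sg_cpu->sg_policy->policy->min : 0;
             sg_cpu->iowait_boost_pending = iowait;
             return;
         }

         /* 2) Only then handle a genuine IO wakeup by (re)starting or doubling
          *    the boost up to iowait_boost_max. */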
      
      Fixes: a5a0809b ("cpufreq: schedutil: Make iowait boost more energy efficient")
      Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  9. 20 May 2018, 1 commit
    • bpf: Prevent memory disambiguation attack · af86ca4e
      Authored by Alexei Starovoitov
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      The above code after the x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  10. 18 May 2018, 5 commits
    • sched/deadline: Make the grub_reclaim() function static · 3febfc8a
      Authored by Mathieu Malaterre
      Since the grub_reclaim() function can be made static, make it so.
      
      Silences the following GCC warning (W=1):
      
        kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]
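
      The change is a one-word addition of the storage-class specifier; the
      parameter list below is reproduced from memory and may differ slightly:

         -u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
         +static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
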
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h · f6a34630
      Authored by Mathieu Malaterre
      In the following commit:
      
        6b55c965 ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")
      
      the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
      right next to the prototypes for print_cfs_stats(), print_rt_stats()
      and print_dl_stats().
      
      Finish this previous commit and also move related prototypes for
      print_rt_rq() and print_dl_rq().
      
      Remove the existing extern declarations now that they are not needed anymore.
      
      Silences the following GCC warning, triggered by W=1:
      
        kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
        kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: fix truncated jump targets on heavy expansions · 050fad7c
      Authored by Daniel Borkmann
      Recently during testing, I ran into the following panic:
      
        [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
        [  207.901637] Modules linked in: binfmt_misc [...]
        [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
        [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  207.992603] lr : 0xffff000000bdb754
        [  207.996080] sp : ffff000013703ca0
        [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
        [  208.004688] x27: 0000000000000001 x26: 0000000000000000
        [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
        [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
        [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
        [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
        [  208.031206] x17: 0000000000000000 x16: 0000000000000000
        [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
        [  208.041813] x13: 0000000000000000 x12: 0000000000000000
        [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
        [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
        [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
        [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
        [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
        [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
        [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
        [  208.086235] Call trace:
        [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  208.093713]  0xffff000000bdb754
        [  208.096845]  bpf_test_run+0x78/0xf8
        [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
        [  208.104758]  sys_bpf+0x314/0x1198
        [  208.108064]  el0_svc_naked+0x30/0x34
        [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
        [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
      
      The program itself which caused this had a long jump over the whole
      instruction sequence where all of the inner instructions required
      heavy expansions into multiple BPF instructions. Additionally, I also
      had BPF hardening enabled, which once more requires rewrites of all
      constant values in order to blind them. Each time we rewrite insns,
      bpf_adj_branches() needs to potentially adjust branch targets
      which cross the patchlet boundary to accommodate the additional
      delta. Eventually that led to the case where the target offset could
      no longer fit into insn->off's upper 0x7fff limit, at which point the
      offset wraps around and becomes negative (in the s16 universe), or vice
      versa depending on the jump direction.
      
      Therefore it becomes necessary to detect and reject any such occasions
      in a generic way for native eBPF and cBPF to eBPF migrations. For
      the latter we can simply check bounds in the bpf_convert_filter()'s
      BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
      bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
      of subsequent hardening) is a bit more complex in that we need to
      detect such truncations before hitting the bpf_prog_realloc(). Thus
      the latter is split into an extra pass to probe problematic offsets
      on the original program in order to fail early. With that in place
      and carefully tested, I no longer hit the panic and the rewrites are
      rejected properly. The above example panic was seen on bpf-next,
      though the issue itself is generic, and a guard against it in bpf
      seems the more appropriate fix in this case.
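
      The guard for the native eBPF case can be sketched as below; it
      approximates the helper added by the upstream patch (error handling in
      the callers is omitted):

         static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta,
                                         u32 curr, bool probe_pass)
         {
             s64 off = insn->off;

             if (curr < pos && curr + insn->off + 1 > pos)
                 off += delta;
             else if (curr > pos + delta && curr + insn->off + 1 <= pos + delta)
                 off -= delta;
             /* Reject the rewrite once the target no longer fits into the
              * s16 insn->off field instead of letting it wrap around. */
             if (off < S16_MIN || off > S16_MAX)
                 return -ERANGE;
             if (!probe_pass)
                 insn->off = off;
             return 0;
         }
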
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: parse and verdict prog attach may race with bpf map update · 96174560
      Authored by John Fastabend
      In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
      SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
      map type and when a sock is added to the map the programs are used by
      the socket. However, sockmap updates from both userspace and BPF
      programs can happen concurrently with the attach and detach of these
      programs.
      
      To resolve this we use bpf_prog_inc_not_zero() and a READ_ONCE()
      primitive to ensure the program pointer is not refetched and
      possibly NULL'd before the refcnt increment. This happens inside
      an RCU critical section, so although the pointer reference in the map
      object may be NULL (by a concurrent detach operation), the reference
      from READ_ONCE() will not be free'd until after a grace period. This
      ensures the object returned by READ_ONCE() is valid through the
      RCU critical section and safe to use, as long as we "know" it may
      be free'd shortly.
      
      Daniel spotted a case in the sock update API where instead of using
      the READ_ONCE() program reference we used the pointer from the
      original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
      is the logic checks the object returned from the READ_ONCE() is not
      NULL and then tries to reference the object again but using the
      above map pointer, which may have already been NULL'd by a parallel
      detach operation. If this happened, bpf_prog_inc_not_zero could
      dereference a NULL pointer.
      
      Fix this by using the variable returned by READ_ONCE() that has been
      checked for NULL.
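
      The resulting pattern, roughly (simplified from the sockmap update path;
      only the verdict program is shown, the parse and tx_msg programs are
      handled the same way):

         struct bpf_prog *verdict;

         rcu_read_lock();
         verdict = READ_ONCE(stab->bpf_verdict);
         if (verdict) {
             /* Take the reference on the pointer that was just checked, not on
              * a fresh re-read of stab->bpf_verdict that a concurrent detach
              * may already have NULL'd. */
             verdict = bpf_prog_inc_not_zero(verdict);
             if (IS_ERR(verdict)) {
                 rcu_read_unlock();
                 return PTR_ERR(verdict);
             }
         }
         rcu_read_unlock();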
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap update rollback on error can incorrectly dec prog refcnt · a593f708
      Authored by John Fastabend
      If the user were to only attach one of the parse or verdict programs
      then it is possible a subsequent sockmap update could incorrectly
      decrement the refcnt on the program. This happens because in the
      rollback logic, after an error, we have to decrement the program
      reference count when it has been incremented. However, we only increment
      the program reference count if the user has both a verdict and a
      parse program. The reason is that, at least at the
      moment, both are required for either one to be meaningful. The problem
      fixed here is that in the rollback path we decrement the program refcnt
      even if only one of them exists. But we never incremented the refcnt in
      the first place, creating an imbalance.
      
      This patch fixes the error path to handle this case.
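
      In sketch form, the corrected rollback only drops references that were
      actually taken (variable names are illustrative):

         /* error/rollback path, sketch: refs were taken only when both a parse
          * and a verdict program are attached, so only then may they be put */
         if (verdict && parse) {
             bpf_prog_put(verdict);
             bpf_prog_put(parse);
         }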
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  11. 16 May 2018, 3 commits
    • locking/percpu-rwsem: Annotate rwsem ownership transfer by setting RWSEM_OWNER_UNKNOWN · 5a817641
      Authored by Waiman Long
      The filesystem freezing code needs to transfer ownership of a rwsem
      embedded in a percpu-rwsem from the task that does the freezing to
      another one that does the thawing by calling percpu_rwsem_release()
      after freezing and percpu_rwsem_acquire() before thawing.
      
      However, the new rwsem debug code runs afoul of this scheme by warning
      that the task that releases the rwsem isn't the one that acquired it,
      as reported by Amir Goldstein:
      
        DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
        WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79
      
        Call Trace:
         percpu_up_write+0x1f/0x28
         thaw_super_locked+0xdf/0x120
         do_vfs_ioctl+0x270/0x5f1
         ksys_ioctl+0x52/0x71
         __x64_sys_ioctl+0x16/0x19
         do_syscall_64+0x5d/0x167
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To work properly with the rwsem debug code, we need to annotate that the
      rwsem ownership is unknown during the transfer period until a brave soul
      comes forward to acquire the ownership. During that period, optimistic
      spinning will be disabled.
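
      Roughly, the release side of the hand-over looks like this (based on the
      percpu-rwsem helper after the change; RWSEM_OWNER_UNKNOWN is defined as
      ((struct task_struct *)-1L) by the companion rwsem patch):

         static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
                                                 bool read, unsigned long ip)
         {
             lock_release(&sem->rw_sem.dep_map, 1, ip);
         #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
             if (!read)
                 /* writer hands the lock over: owner unknown, no spinning */
                 sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
         #endif
         }
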
      Reported-by: Amir Goldstein <amir73il@gmail.com>
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag · d7d760ef
      Authored by Waiman Long
      There are use cases where a rwsem can be acquired by one task, but
      released by another task. In these cases, optimistic spinning may need
      to be disabled.  One example is the filesystem freeze/thaw code,
      where the task that freezes the filesystem acquires a write lock
      on a rwsem and then un-owns it before returning to userspace. Later on,
      another task comes along, acquires the ownership, thaws the filesystem
      and releases the rwsem.
      
      Bit 0 of the owner field was used to designate a reader-owned rwsem. It
      is now repurposed to mean that the owner of the rwsem is not known. If
      only bit 0 is set, the rwsem is reader owned. If bit 0 and other bits
      are set, it is writer owned with an unknown owner. One such value for
      the latter case is (-1L). So we can set the owner to 1 for reader-owned
      and to -1 for writer-owned; the owner is unknown in both cases.
      
      To handle transfer of rwsem ownership, the higher level code should
      set the owner field to -1 to indicate a write-locked rwsem with unknown
      owner.  Optimistic spinning will be disabled in this case.
      
      Once the higher level code figures out who the new owner is, it can then
      set the owner field accordingly.
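
      The owner-field encoding can be sketched as follows (simplified from the
      rwsem internals after this change):

         /* only bit 0 set            -> reader owned
          * bit 0 plus other bits set -> owned, but the owner is anonymous,
          *                              e.g. (-1L) for the freeze/thaw case */
         #define RWSEM_ANONYMOUSLY_OWNED  (1UL << 0)
         #define RWSEM_READER_OWNED       ((struct task_struct *)RWSEM_ANONYMOUSLY_OWNED)

         static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
         {
             return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
         }
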
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tick/broadcast: Use for_each_cpu() specially on UP kernels · 5596fe34
      Authored by Dexuan Cui
      for_each_cpu() unintuitively reports CPU0 as set independent of the actual
      cpumask content on UP kernels. This causes an unexpected PIT interrupt
      storm on a UP kernel running in an SMP virtual machine on Hyper-V, and as
      a result, the virtual machine can suffer from a strange random delay of 1~20
      minutes during boot-up, and sometimes it can hang forever.
      
      Protect it by checking whether the cpumask is empty before entering the
      for_each_cpu() loop.
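
      In sketch form (simplified; the upstream patch adds the check in the
      tick-broadcast delivery paths):

         static void tick_do_broadcast_sketch(struct cpumask *mask)
         {
             int cpu;

             /* On UP kernels for_each_cpu() reports CPU0 even for an empty
              * mask, so return early instead of broadcasting spuriously. */
             if (!IS_ENABLED(CONFIG_SMP) && cpumask_empty(mask))
                 return;

             for_each_cpu(cpu, mask) {
                 /* deliver/program the broadcast tick for 'cpu' */
             }
         }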
      
      [ tglx: Use !IS_ENABLED(CONFIG_SMP) instead of #ifdeffery ]
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Link: https://lkml.kernel.org/r/KL1P15301MB000678289FE55BA365B3279ABF990@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      Link: https://lkml.kernel.org/r/KL1P15301MB0006FA63BC22BEB64902EAA0BF930@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
  12. 15 May 2018, 2 commits
    • cpufreq: schedutil: Don't set next_freq to UINT_MAX · ecd28842
      Authored by Viresh Kumar
      The schedutil driver sets sg_policy->next_freq to UINT_MAX on certain
      occasions to discard the cached value of next freq:
      - In sugov_start(), when the schedutil governor is started for a group
        of CPUs.
      - And whenever we need to force a freq update before rate-limit
        duration, which happens when:
        - there is an update in cpufreq policy limits.
        - Or when the utilization of DL scheduling class increases.
      
      In return, get_next_freq() doesn't return a cached next_freq value but
      recalculates the next frequency instead.
      
      But giving a special meaning to a particular frequency value makes the
      code less readable and error-prone. We recently fixed a bug where the
      UINT_MAX value was considered as a valid frequency in
      sugov_update_single().
      
      All we need is a flag which can be used to discard the value of
      sg_policy->next_freq, and we already have need_freq_update for that. Let's
      reuse it instead of setting next_freq to UINT_MAX.
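
      The resulting logic in get_next_freq() then looks roughly like this
      (simplified from the schedutil governor after the change):

         static unsigned int get_next_freq(struct sugov_policy *sg_policy,
                                           unsigned long util, unsigned long max)
         {
             struct cpufreq_policy *policy = sg_policy->policy;
             unsigned int freq = arch_scale_freq_invariant() ?
                                 policy->cpuinfo.max_freq : policy->cur;

             freq = (freq + (freq >> 2)) * util / max;

             /* need_freq_update, not a magic UINT_MAX next_freq, forces a
              * recalculation past the cached value. */
             if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
                 return sg_policy->next_freq;

             sg_policy->need_freq_update = false;
             sg_policy->cached_raw_freq = freq;
             return cpufreq_driver_resolve_freq(policy, freq);
         }
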
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • Revert "cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily" · 1b04722c
      Authored by Dietmar Eggemann
      This reverts commit e2cabe48.
      
      Lifting the restriction that the sugov kthread is bound to the
      policy->related_cpus for a system with a slow switching cpufreq driver,
      which is able to perform DVFS from any cpu (e.g. cpufreq-dt), is not
      only not beneficial, it also harms Energy-Aware Scheduling (EAS) on
      systems with asymmetric cpu capacities (e.g. Arm big.LITTLE).
      
      The sugov kthread which does the update for the little cpus could
      potentially run on a big cpu. It could prevent the big cluster from going
      into deeper idle states although all the tasks are running on the little
      cluster.
      
      Example: hikey960 w/ 4.16.0-rc6-+
               Arm big.LITTLE with per-cluster DVFS
      
      root@h960:~# cat /proc/cpuinfo | grep "^CPU part"
      CPU part        : 0xd03 (Cortex-A53, little cpu)
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd03
      CPU part        : 0xd09 (Cortex-A73, big cpu)
      CPU part        : 0xd09
      CPU part        : 0xd09
      CPU part        : 0xd09
      
      root@h960:/sys/devices/system/cpu/cpufreq# ls
      policy0  policy4  schedutil
      
      root@h960:/sys/devices/system/cpu/cpufreq# cat policy*/related_cpus
      0 1 2 3
      4 5 6 7
      
      (1) w/o the revert:
      
      root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 ||
      /sugov/'
        PID CLS RTPRIO PRI PSR COMMAND
        1489 #6      0 140   1 sugov:0
        1490 #6      0 140   0 sugov:4
      
      The sugov kthread sugov:4 responsible for policy4 runs on cpu0. (In this
      case both sugov kthreads run on little cpus).
      
      cross policy (cluster) remote callback example:
      ...
      migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=5
      migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=5
                           sg_cpu->sg_policy->policy->related_cpus=4-7
        sugov:4-1490 [000] sugov_work: this_cpu=0
                           sg_cpu->sg_policy->policy->related_cpus=4-7
      ...
      
      The remote callback (this_cpu=1, target_cpu=5) is executed on cpu=0.
      
      (2) w/ the revert:
      
      root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 ||
      /sugov/'
        PID CLS RTPRIO PRI PSR COMMAND
        1491 #6      0 140   2 sugov:0
        1492 #6      0 140   4 sugov:4
      
      The sugov kthread sugov:4 responsible for policy4 runs on cpu4.
      
      cross policy (cluster) remote callback example:
      ...
      migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=7
      migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=7
                           sg_cpu->sg_policy->policy->related_cpus=4-7
        sugov:4-1492 [004] sugov_work: this_cpu=4
                           sg_cpu->sg_policy->policy->related_cpus=4-7
      ...
      
      The remote callback (this_cpu=1, target_cpu=7) is executed on cpu=4.
      
      Now the sugov kthread executes again on the policy (cluster) for which
      the Operating Performance Point (OPP) should be changed.
      It avoids the problem that an otherwise idle policy (cluster) is running
      schedutil (the sugov kthread) for another one.
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  13. 12 May 2018, 2 commits
  14. 11 May 2018, 2 commits
  15. 10 May 2018, 1 commit
    • PM / wakeup: Only update last time for active wakeup sources · 2ef7c01c
      Authored by Doug Berger
      When wakelock support was added, the wakeup_source_add() function
      was updated to set the last_time value of the wakeup source. This
      has the unintended side effect of producing confusing output from
      pm_print_active_wakeup_sources() when a wakeup source is added
      prior to a sleep that is blocked by a different wakeup source.
      
      The function pm_print_active_wakeup_sources() will search for the
      most recently active wakeup source when no active source is found.
      If a wakeup source is added after a different wakeup source blocks
      the system from going to sleep it may have a later last_time value
      than the blocking source and be output as the last active wakeup
      source even if it has never actually been active.
      
      It looks to me like the change to wakeup_source_add() was made to
      prevent the wakelock garbage collection from accidentally dropping
      a wakelock during the narrow window between adding the wakelock to
      the wakelock list in wakelock_lookup_add() and the activation of
      the wakeup source in pm_wake_lock().
      
      This commit changes the behavior so that only the last_time of the
      wakeup source used by a wakelock is initialized prior to adding it
      to the wakeup source list. This preserves the meaning of the
      last_time value as the last time the wakeup source was active and
      allows a wakeup source that has never been active to have a
      last_time value of 0.
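
      In sketch form (simplified; based on the upstream change to
      drivers/base/power/wakeup.c and kernel/power/wakelock.c):

         /* wakeup_source_add() no longer stamps last_time, so a source that
          * has never been active keeps a last_time of 0. */
         void wakeup_source_add(struct wakeup_source *ws)
         {
             /* ws->last_time = ktime_get();   <- removed by this change */
             /* ... lock the list and add ws as before ... */
         }

         /* The wakelock path stamps last_time itself right before registering,
          * preserving the protection against early garbage collection. */
         wl->ws.last_time = ktime_get();
         wakeup_source_add(&wl->ws);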
      
      Fixes: b86ff982 ("PM / Sleep: Add user space interface for manipulating wakeup sources, v3")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  16. 09 May 2018, 2 commits
  17. 05 May 2018, 6 commits