1. 25 5月, 2018 2 次提交
    • P
      kthread: Allow kthread_park() on a parked kthread · b1f5b378
      Peter Zijlstra 提交于
      The following commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      added a WARN() in the case where we call kthread_park() on an already
      parked thread, because the old code wasn't doing the right thing there
      and it wasn't at all clear that would happen.
      
      It turns out, this does in fact happen, so we have to deal with it.
      
      Instead of potentially returning early, also wait for the completion.
      This does however mean we have to use complete_all() and re-initialize
      the completion on re-use.
      Reported-by: NLKP <lkp@01.org>
      Tested-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kernel test robot <lkp@intel.com>
      Cc: wfg@linux.intel.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b1f5b378
    • J
      sched/topology: Clarify root domain(s) debug string · bf5015a5
      Juri Lelli 提交于
      When scheduler debug is enabled, building scheduling domains outputs
      information about how the domains are laid out and to which root domain
      each CPU (or sets of CPUs) belongs, e.g.:
      
       CPU0 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
       CPU1 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
      
       [...]
      
       span: 0-5 (max cpu_capacity = 1024)
      
      The fact that latest line refers to CPUs 0-5 root domain doesn't however look
      immediately obvious to me: one might wonder why span 0-5 is reported "again".
      
      Make it more clear by adding "root domain" to it, as to end with the
      following:
      
       CPU0 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }
       CPU1 attaching sched-domain(s):
        domain-0: span=0-5 level=MC
         groups: 1:{ span=1 }, 2:{ span=2 }, 3:{ span=3 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 }
      
       [...]
      
       root domain span: 0-5 (max cpu_capacity = 1024)
      Signed-off-by: NJuri Lelli <juri.lelli@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180524152936.17611-1-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bf5015a5
  2. 20 5月, 2018 1 次提交
    • A
      bpf: Prevent memory disambiguation attack · af86ca4e
      Alexei Starovoitov 提交于
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      af86ca4e
  3. 18 5月, 2018 5 次提交
    • M
      sched/deadline: Make the grub_reclaim() function static · 3febfc8a
      Mathieu Malaterre 提交于
      Since the grub_reclaim() function can be made static, make it so.
      
      Silences the following GCC warning (W=1):
      
        kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]
      Signed-off-by: NMathieu Malaterre <malat@debian.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3febfc8a
    • M
      sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h · f6a34630
      Mathieu Malaterre 提交于
      In the following commit:
      
        6b55c965 ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")
      
      the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
      right next to the prototypes for print_cfs_stats(), print_rt_stats()
      and print_dl_stats().
      
      Finish this previous commit and also move related prototypes for
      print_rt_rq() and print_dl_rq().
      
      Remove existing extern declarations now that they not needed anymore.
      
      Silences the following GCC warning, triggered by W=1:
      
        kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
        kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]
      Signed-off-by: NMathieu Malaterre <malat@debian.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f6a34630
    • D
      bpf: fix truncated jump targets on heavy expansions · 050fad7c
      Daniel Borkmann 提交于
      Recently during testing, I ran into the following panic:
      
        [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
        [  207.901637] Modules linked in: binfmt_misc [...]
        [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
        [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  207.992603] lr : 0xffff000000bdb754
        [  207.996080] sp : ffff000013703ca0
        [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
        [  208.004688] x27: 0000000000000001 x26: 0000000000000000
        [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
        [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
        [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
        [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
        [  208.031206] x17: 0000000000000000 x16: 0000000000000000
        [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
        [  208.041813] x13: 0000000000000000 x12: 0000000000000000
        [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
        [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
        [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
        [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
        [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
        [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
        [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
        [  208.086235] Call trace:
        [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  208.093713]  0xffff000000bdb754
        [  208.096845]  bpf_test_run+0x78/0xf8
        [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
        [  208.104758]  sys_bpf+0x314/0x1198
        [  208.108064]  el0_svc_naked+0x30/0x34
        [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
        [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
      
      The program itself which caused this had a long jump over the whole
      instruction sequence where all of the inner instructions required
      heavy expansions into multiple BPF instructions. Additionally, I also
      had BPF hardening enabled which requires once more rewrites of all
      constant values in order to blind them. Each time we rewrite insns,
      bpf_adj_branches() would need to potentially adjust branch targets
      which cross the patchlet boundary to accommodate for the additional
      delta. Eventually that lead to the case where the target offset could
      not fit into insn->off's upper 0x7fff limit anymore where then offset
      wraps around becoming negative (in s16 universe), or vice versa
      depending on the jump direction.
      
      Therefore it becomes necessary to detect and reject any such occasions
      in a generic way for native eBPF and cBPF to eBPF migrations. For
      the latter we can simply check bounds in the bpf_convert_filter()'s
      BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
      bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
      of subsequent hardening) is a bit more complex in that we need to
      detect such truncations before hitting the bpf_prog_realloc(). Thus
      the latter is split into an extra pass to probe problematic offsets
      on the original program in order to fail early. With that in place
      and carefully tested I no longer hit the panic and the rewrites are
      rejected properly. The above example panic I've seen on bpf-next,
      though the issue itself is generic in that a guard against this issue
      in bpf seems more appropriate in this case.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      050fad7c
    • J
      bpf: parse and verdict prog attach may race with bpf map update · 96174560
      John Fastabend 提交于
      In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
      SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
      map type and when a sock is added to the map the programs are used by
      the socket. However, sockmap updates from both userspace and BPF
      programs can happen concurrently with the attach and detach of these
      programs.
      
      To resolve this we use the bpf_prog_inc_not_zero and a READ_ONCE()
      primitive to ensure the program pointer is not refeched and
      possibly NULL'd before the refcnt increment. This happens inside
      a RCU critical section so although the pointer reference in the map
      object may be NULL (by a concurrent detach operation) the reference
      from READ_ONCE will not be free'd until after grace period. This
      ensures the object returned by READ_ONCE() is valid through the
      RCU criticl section and safe to use as long as we "know" it may
      be free'd shortly.
      
      Daniel spotted a case in the sock update API where instead of using
      the READ_ONCE() program reference we used the pointer from the
      original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
      is the logic checks the object returned from the READ_ONCE() is not
      NULL and then tries to reference the object again but using the
      above map pointer, which may have already been NULL'd by a parallel
      detach operation. If this happened bpf_porg_inc_not_zero could
      dereference a NULL pointer.
      
      Fix this by using variable returned by READ_ONCE() that is checked
      for NULL.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      96174560
    • J
      bpf: sockmap update rollback on error can incorrectly dec prog refcnt · a593f708
      John Fastabend 提交于
      If the user were to only attach one of the parse or verdict programs
      then it is possible a subsequent sockmap update could incorrectly
      decrement the refcnt on the program. This happens because in the
      rollback logic, after an error, we have to decrement the program
      reference count when its been incremented. However, we only increment
      the program reference count if the user has both a verdict and a
      parse program. The reason for this is because, at least at the
      moment, both are required for any one to be meaningful. The problem
      fixed here is in the rollback path we decrement the program refcnt
      even if only one existing. But we never incremented the refcnt in
      the first place creating an imbalance.
      
      This patch fixes the error path to handle this case.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a593f708
  4. 16 5月, 2018 3 次提交
    • W
      locking/percpu-rwsem: Annotate rwsem ownership transfer by setting RWSEM_OWNER_UNKNOWN · 5a817641
      Waiman Long 提交于
      The filesystem freezing code needs to transfer ownership of a rwsem
      embedded in a percpu-rwsem from the task that does the freezing to
      another one that does the thawing by calling percpu_rwsem_release()
      after freezing and percpu_rwsem_acquire() before thawing.
      
      However, the new rwsem debug code runs afoul with this scheme by warning
      that the task that releases the rwsem isn't the one that acquires it,
      as reported by Amir Goldstein:
      
        DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
        WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79
      
        Call Trace:
         percpu_up_write+0x1f/0x28
         thaw_super_locked+0xdf/0x120
         do_vfs_ioctl+0x270/0x5f1
         ksys_ioctl+0x52/0x71
         __x64_sys_ioctl+0x16/0x19
         do_syscall_64+0x5d/0x167
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To work properly with the rwsem debug code, we need to annotate that the
      rwsem ownership is unknown during the tranfer period until a brave soul
      comes forward to acquire the ownership. During that period, optimistic
      spinning will be disabled.
      Reported-by: NAmir Goldstein <amir73il@gmail.com>
      Tested-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5a817641
    • W
      locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag · d7d760ef
      Waiman Long 提交于
      There are use cases where a rwsem can be acquired by one task, but
      released by another task. In thess cases, optimistic spinning may need
      to be disabled.  One example will be the filesystem freeze/thaw code
      where the task that freezes the filesystem will acquire a write lock
      on a rwsem and then un-owns it before returning to userspace. Later on,
      another task will come along, acquire the ownership, thaw the filesystem
      and release the rwsem.
      
      Bit 0 of the owner field was used to designate that it is a reader
      owned rwsem. It is now repurposed to mean that the owner of the rwsem
      is not known. If only bit 0 is set, the rwsem is reader owned. If bit
      0 and other bits are set, it is writer owned with an unknown owner.
      One such value for the latter case is (-1L). So we can set owner to 1 for
      reader-owned, -1 for writer-owned. The owner is unknown in both cases.
      
      To handle transfer of rwsem ownership, the higher level code should
      set the owner field to -1 to indicate a write-locked rwsem with unknown
      owner.  Optimistic spinning will be disabled in this case.
      
      Once the higher level code figures who the new owner is, it can then
      set the owner field accordingly.
      Tested-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d7d760ef
    • D
      tick/broadcast: Use for_each_cpu() specially on UP kernels · 5596fe34
      Dexuan Cui 提交于
      for_each_cpu() unintuitively reports CPU0 as set independent of the actual
      cpumask content on UP kernels. This causes an unexpected PIT interrupt
      storm on a UP kernel running in an SMP virtual machine on Hyper-V, and as
      a result, the virtual machine can suffer from a strange random delay of 1~20
      minutes during boot-up, and sometimes it can hang forever.
      
      Protect if by checking whether the cpumask is empty before entering the
      for_each_cpu() loop.
      
      [ tglx: Use !IS_ENABLED(CONFIG_SMP) instead of #ifdeffery ]
      Signed-off-by: NDexuan Cui <decui@microsoft.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Link: https://lkml.kernel.org/r/KL1P15301MB000678289FE55BA365B3279ABF990@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      Link: https://lkml.kernel.org/r/KL1P15301MB0006FA63BC22BEB64902EAA0BF930@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      5596fe34
  5. 12 5月, 2018 2 次提交
  6. 11 5月, 2018 2 次提交
  7. 09 5月, 2018 2 次提交
  8. 05 5月, 2018 6 次提交
  9. 04 5月, 2018 3 次提交
  10. 03 5月, 2018 12 次提交
    • Z
      tracing: Fix the file mode of stack tracer · 0c5a9acc
      Zhengyuan Liu 提交于
      It looks weird that the stack_trace_filter file can be written by root
      but shows that it does not have write permission by ll command.
      
      Link: http://lkml.kernel.org/r/1518054113-28096-1-git-send-email-liuzhengyuan@kylinos.cnSigned-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      0c5a9acc
    • C
      ftrace: Have set_graph_* files have normal file modes · 1ce0500d
      Chen LinX 提交于
      The set_graph_function and set_graph_notrace file mode should be 0644
      instead of 0444 as they are writeable. Note, the mode appears to be ignored
      regardless, but they should at least look sane.
      
      Link: http://lkml.kernel.org/r/1409725869-4501-1-git-send-email-linx.z.chen@intel.comAcked-by: NNamhyung Kim <namhyung@kernel.org>
      Signed-off-by: NChen LinX <linx.z.chen@intel.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      1ce0500d
    • K
      seccomp: Enable speculation flaw mitigations · 5c307089
      Kees Cook 提交于
      When speculation flaw mitigations are opt-in (via prctl), using seccomp
      will automatically opt-in to these protections, since using seccomp
      indicates at least some level of sandboxing is desired.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      5c307089
    • K
      nospec: Allow getting/setting on non-current task · 7bbf1373
      Kees Cook 提交于
      Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
      current.
      
      This is needed both for /proc/$pid/status queries and for seccomp (since
      thread-syncing can trigger seccomp in non-current threads).
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      7bbf1373
    • T
      prctl: Add speculation control prctls · b617cfc8
      Thomas Gleixner 提交于
      Add two new prctls to control aspects of speculation related vulnerabilites
      and their mitigations to provide finer grained control over performance
      impacting mitigations.
      
      PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
      which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
      the following meaning:
      
      Bit  Define           Description
      0    PR_SPEC_PRCTL    Mitigation can be controlled per task by
                            PR_SET_SPECULATION_CTRL
      1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is
                            disabled
      2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is
                            enabled
      
      If all bits are 0 the CPU is not affected by the speculation misfeature.
      
      If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
      available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
      misfeature will fail.
      
      PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
      is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
      control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
      
      The common return values are:
      
      EINVAL  prctl is not implemented by the architecture or the unused prctl()
              arguments are not 0
      ENODEV  arg2 is selecting a not supported speculation misfeature
      
      PR_SET_SPECULATION_CTRL has these additional return values:
      
      ERANGE  arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
      ENXIO   prctl control of the selected speculation misfeature is disabled
      
      The first supported controlable speculation misfeature is
      PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
      architectures.
      
      Based on an initial patch from Tim Chen and mostly rewritten.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      b617cfc8
    • P
      kthread, sched/wait: Fix kthread_parkme() completion issue · 85f1abe0
      Peter Zijlstra 提交于
      Even with the wait-loop fixed, there is a further issue with
      kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
      smpboot_park_threads() can return before all those threads are in fact
      blocked, due to the placement of the complete() in __kthread_parkme().
      
      When that happens, sched_cpu_dying() -> migrate_tasks() can end up
      migrating such a still runnable task onto another CPU.
      
      Normally the task will have hit schedule() and gone to sleep by the
      time we do kthread_unpark(), which will then do __kthread_bind() to
      re-bind the task to the correct CPU.
      
      However, when we loose the initial TASK_PARKED store to the concurrent
      wakeup issue described previously, do the complete(), get migrated, it
      is possible to either:
      
       - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
         the park and set TASK_RUNNING, or
      
       - __kthread_bind()'s wait_task_inactive() to observe the competing
         TASK_RUNNING store.
      
      Either way the WARN() in __kthread_bind() will trigger and fail to
      correctly set the CPU affinity.
      
      Fix this by only issuing the complete() when the kthread has scheduled
      out. This does away with all the icky 'still running' nonsense.
      
      The alternative is to promote TASK_PARKED to a special state, this
      guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
      and we'll end up doing the right thing, but this preserves the whole
      icky business of potentially migating the still runnable thing.
      Reported-by: NGaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      85f1abe0
    • P
      kthread, sched/wait: Fix kthread_parkme() wait-loop · 741a76b3
      Peter Zijlstra 提交于
      Gaurav reported a problem with __kthread_parkme() where a concurrent
      try_to_wake_up() could result in competing stores to ->state which,
      when the TASK_PARKED store got lost bad things would happen.
      
      The comment near set_current_state() actually mentions this competing
      store, but only mentions the case against TASK_RUNNING. This same
      store, with different timing, can happen against a subsequent !RUNNING
      store.
      
      This normally is not a problem, because as per that same comment, the
      !RUNNING state store is inside a condition based wait-loop:
      
        for (;;) {
          set_current_state(TASK_UNINTERRUPTIBLE);
          if (!need_sleep)
            break;
          schedule();
        }
        __set_current_state(TASK_RUNNING);
      
      If we loose the (first) TASK_UNINTERRUPTIBLE store to a previous
      (concurrent) wakeup, the schedule() will NO-OP and we'll go around the
      loop once more.
      
      The problem here is that the TASK_PARKED store is not inside the
      KTHREAD_SHOULD_PARK condition wait-loop.
      
      There is a genuine issue with sleeps that do not have a condition;
      this is addressed in a subsequent patch.
      Reported-by: NGaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      741a76b3
    • V
      sched/fair: Fix the update of blocked load when newly idle · 457be908
      Vincent Guittot 提交于
      With commit:
      
        31e77c93 ("sched/fair: Update blocked load when newly idle")
      
      ... we release the rq->lock when updating blocked load of idle CPUs.
      
      This opens a time window during which another CPU can add a task to this
      CPU's cfs_rq.
      
      The check for newly added task of idle_balance() is not in the common path.
      Move the out label to include this check.
      Reported-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 31e77c93 ("sched/fair: Update blocked load when newly idle")
      Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      457be908
    • P
      stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock · 0b26351b
      Peter Zijlstra 提交于
      Matt reported the following deadlock:
      
      CPU0					CPU1
      
      schedule(.prev=migrate/0)		<fault>
        pick_next_task()			  ...
          idle_balance()			    migrate_swap()
            active_balance()			      stop_two_cpus()
      						spin_lock(stopper0->lock)
      						spin_lock(stopper1->lock)
      						ttwu(migrate/0)
      						  smp_cond_load_acquire() -- waits for schedule()
              stop_one_cpu(1)
      	  spin_lock(stopper1->lock) -- waits for stopper lock
      
      Fix this deadlock by taking the wakeups out from under stopper->lock.
      This allows the active_balance() to queue the stop work and finish the
      context switch, which in turn allows the wakeup from migrate_swap() to
      observe the context and complete the wakeup.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reported-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180420095005.GH4064@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0b26351b
    • J
      bpf: sockmap, fix error handling in redirect failures · abaeb096
      John Fastabend 提交于
      When a redirect failure happens we release the buffers in-flight
      without calling a sk_mem_uncharge(), the uncharge is called before
      dropping the sock lock for the redirecte, however we missed updating
      the ring start index. When no apply actions are in progress this
      is OK because we uncharge the entire buffer before the redirect.
      But, when we have apply logic running its possible that only a
      portion of the buffer is being redirected. In this case we only
      do memory accounting for the buffer slice being redirected and
      expect to be able to loop over the BPF program again and/or if
      a sock is closed uncharge the memory at sock destruct time.
      
      With an invalid start index however the program logic looks at
      the start pointer index, checks the length, and when seeing the
      length is zero (from the initial release and failure to update
      the pointer) aborts without uncharging/releasing the remaining
      memory.
      
      The fix for this is simply to update the start index. To avoid
      fixing this error in two locations we do a small refactor and
      remove one case where it is open-coded. Then fix it in the
      single function.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      abaeb096
    • J
      bpf: sockmap, zero sg_size on error when buffer is released · fec51d40
      John Fastabend 提交于
      When an error occurs during a redirect we have two cases that need
      to be handled (i) we have a cork'ed buffer (ii) we have a normal
      sendmsg buffer.
      
      In the cork'ed buffer case we don't currently support recovering from
      errors in a redirect action. So the buffer is released and the error
      should _not_ be pushed back to the caller of sendmsg/sendpage. The
      rationale here is the user will get an error that relates to old
      data that may have been sent by some arbitrary thread on that sock.
      Instead we simple consume the data and tell the user that the data
      has been consumed. We may add proper error recovery in the future.
      However, this patch fixes a bug where the bytes outstanding counter
      sg_size was not zeroed. This could result in a case where if the user
      has both a cork'ed action and apply action in progress we may
      incorrectly call into the BPF program when the user expected an
      old verdict to be applied via the apply action. I don't have a use
      case where using apply and cork at the same time is valid but we
      never explicitly reject it because it should work fine. This patch
      ensures the sg_size is zeroed so we don't have this case.
      
      In the normal sendmsg buffer case (no cork data) we also do not
      zero sg_size. Again this can confuse the apply logic when the logic
      calls into the BPF program when the BPF programmer expected the old
      verdict to remain. So ensure we set sg_size to zero here as well. And
      additionally to keep the psock state in-sync with the sk_msg_buff
      release all the memory as well. Previously we did this before
      returning to the user but this left a gap where psock and sk_msg_buff
      states were out of sync which seems fragile. No additional overhead
      is taken here except for a call to check the length and realize its
      already been freed. This is in the error path as well so in my
      opinion lets have robust code over optimized error paths.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fec51d40
    • J
      bpf: sockmap, fix scatterlist update on error path in send with apply · 3cc9a472
      John Fastabend 提交于
      When the call to do_tcp_sendpage() fails to send the complete block
      requested we either retry if only a partial send was completed or
      abort if we receive a error less than or equal to zero. Before
      returning though we must update the scatterlist length/offset to
      account for any partial send completed.
      
      Before this patch we did this at the end of the retry loop, but
      this was buggy when used while applying a verdict to fewer bytes
      than in the scatterlist. When the scatterlist length was being set
      we forgot to account for the apply logic reducing the size variable.
      So the result was we chopped off some bytes in the scatterlist without
      doing proper cleanup on them. This results in a WARNING when the
      sock is tore down because the bytes have previously been charged to
      the socket but are never uncharged.
      
      The simple fix is to simply do the accounting inside the retry loop
      subtracting from the absolute scatterlist values rather than trying
      to accumulate the totals and subtract at the end.
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3cc9a472
  11. 02 5月, 2018 2 次提交