1. 25 May 2018, 3 commits
  2. 20 May 2018, 1 commit
    • bpf: Prevent memory disambiguation attack · af86ca4e
      Committed by Alexei Starovoitov
      Detect code patterns where malicious 'speculative store bypass' can be used
      and sanitize such patterns.
      
       39: (bf) r3 = r10
       40: (07) r3 += -216
       41: (79) r8 = *(u64 *)(r7 +0)   // slow read
       42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
       43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
       44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
       45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                       // is now sanitized
      
      Above code after x86 JIT becomes:
       e5: mov    %rbp,%rdx
       e8: add    $0xffffffffffffff28,%rdx
       ef: mov    0x0(%r13),%r14
       f3: movq   $0x0,-0x48(%rbp)
       fb: mov    %rdx,0x0(%r14)
       ff: mov    0x0(%rbx),%rdi
      103: movzbq 0x0(%rdi),%rsi
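      As a rough illustration of the class of gadget being sanitized (a hedged, hypothetical C sketch, not code from the commit; the names victim, safe_ptr and probe are made up): a store whose address resolves slowly can be speculatively bypassed by a younger load from the same slot, and the stale, attacker-influenced value then feeds a dependent load that is observable through the cache. The verifier-inserted store (insn 42 above) zeroes the slot first, so a bypassed load can at worst observe zero.

        /* Hypothetical Spectre-v4 gadget shape (illustration only):
         * 'slot' still holds an older, attacker-influenced value. */
        void victim(long **slot, long *safe_ptr, const char *probe)
        {
                *slot = safe_ptr;           /* store whose address resolves slowly      */
                long *p = *slot;            /* may speculatively bypass the store and
                                             * observe the stale, attacker-chosen value */
                long secret = *p;           /* speculative arbitrary load               */
                (void)probe[secret & 0xff]; /* cache side channel leaks 'secret'        */
        }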
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      af86ca4e
  3. 18 May 2018, 5 commits
    • sched/deadline: Make the grub_reclaim() function static · 3febfc8a
      Committed by Mathieu Malaterre
      Since the grub_reclaim() function can be made static, make it so.
      
      Silences the following GCC warning (W=1):
      
        kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]
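      For illustration, a minimal generic sketch (not the kernel function) of why giving a file-local function internal linkage silences this warning; the helper names are hypothetical:

        #include <stdint.h>
        #include <stdio.h>

        /* With -Wmissing-prototypes (kernel W=1), GCC warns here: the function
         * has external linkage but no declaration in any header. */
        uint64_t scale_up(uint64_t x) { return x * 2; }

        /* Internal linkage: no prototype is expected, so the warning goes away. */
        static uint64_t scale_down(uint64_t x) { return x / 2; }

        int main(void)
        {
                printf("%llu %llu\n", (unsigned long long)scale_up(4),
                       (unsigned long long)scale_down(4));
                return 0;
        }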
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3febfc8a
    • sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h · f6a34630
      Committed by Mathieu Malaterre
      In the following commit:
      
        6b55c965 ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")
      
      the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
      right next to the prototypes for print_cfs_stats(), print_rt_stats()
      and print_dl_stats().
      
      Finish this previous commit and also move related prototypes for
      print_rt_rq() and print_dl_rq().
      
      Remove the existing extern declarations now that they are no longer needed.
      
      Silences the following GCC warning, triggered by W=1:
      
        kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
        kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]
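      A hedged sketch of the shape of the change (the prototypes below are assumed from the warning lines and the usual sched debug helpers, not copied from the tree): the declarations move next to the other print_*_stats() prototypes in kernel/sched/sched.h, and the local extern declarations in the users go away.

        /* kernel/sched/sched.h (sketch) */
        extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
        extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);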
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f6a34630
    • bpf: fix truncated jump targets on heavy expansions · 050fad7c
      Committed by Daniel Borkmann
      Recently during testing, I ran into the following panic:
      
        [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
        [  207.901637] Modules linked in: binfmt_misc [...]
        [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
        [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
        [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
        [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  207.992603] lr : 0xffff000000bdb754
        [  207.996080] sp : ffff000013703ca0
        [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
        [  208.004688] x27: 0000000000000001 x26: 0000000000000000
        [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
        [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
        [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
        [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
        [  208.031206] x17: 0000000000000000 x16: 0000000000000000
        [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
        [  208.041813] x13: 0000000000000000 x12: 0000000000000000
        [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
        [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
        [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
        [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
        [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
        [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
        [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
        [  208.086235] Call trace:
        [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
        [  208.093713]  0xffff000000bdb754
        [  208.096845]  bpf_test_run+0x78/0xf8
        [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
        [  208.104758]  sys_bpf+0x314/0x1198
        [  208.108064]  el0_svc_naked+0x30/0x34
        [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
        [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---
      
      The program itself which caused this had a long jump over the whole
      instruction sequence where all of the inner instructions required
      heavy expansions into multiple BPF instructions. Additionally, I also
      had BPF hardening enabled, which once more requires rewriting all
      constant values in order to blind them. Each time we rewrite insns,
      bpf_adj_branches() potentially needs to adjust branch targets that
      cross the patchlet boundary to accommodate the additional delta.
      Eventually that led to a case where the target offset no longer fit
      into insn->off's upper limit of 0x7fff, so the offset wrapped around
      and became negative (in the s16 universe), or vice versa depending
      on the jump direction.
      
      Therefore it becomes necessary to detect and reject any such occasions
      in a generic way for native eBPF and cBPF to eBPF migrations. For
      the latter we can simply check bounds in the bpf_convert_filter()'s
      BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
      bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
      of subsequent hardening) is a bit more complex in that we need to
      detect such truncations before hitting the bpf_prog_realloc(). Thus
      the latter is split into an extra pass to probe problematic offsets
      on the original program in order to fail early. With that in place
      and carefully tested, I no longer hit the panic and the rewrites are
      rejected properly. I saw the example panic above on bpf-next, but the
      issue itself is generic, so a guard against it in the BPF core seems
      the more appropriate fix.
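      A hedged sketch of the bounds check being described (hypothetical helper, not the kernel code): when a patch of 'delta' extra instructions is inserted, any jump crossing the patched region must grow or shrink its s16 offset, and the adjusted value has to be range-checked before it is written back.

        #include <stdbool.h>
        #include <stdint.h>

        /* Sketch only: 'delta' insns were inserted just after 'pos'.  A jump at
         * 'idx' with relative offset *off (target = idx + 1 + *off) that crosses
         * the patched region must be adjusted, and the result must still fit the
         * s16 encoding or the rewrite has to be rejected. */
        static bool adjust_jump_off(int idx, int pos, int delta, int16_t *off)
        {
                int32_t target  = idx + 1 + *off;
                int32_t new_off = *off;

                if (idx <= pos && target > pos)
                        new_off += delta;       /* forward jump over the patch  */
                else if (idx > pos && target <= pos)
                        new_off -= delta;       /* backward jump over the patch */

                if (new_off < INT16_MIN || new_off > INT16_MAX)
                        return false;           /* would truncate: reject */

                *off = (int16_t)new_off;
                return true;
        }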
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      050fad7c
    • bpf: parse and verdict prog attach may race with bpf map update · 96174560
      Committed by John Fastabend
      In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
      SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
      map type and when a sock is added to the map the programs are used by
      the socket. However, sockmap updates from both userspace and BPF
      programs can happen concurrently with the attach and detach of these
      programs.
      
      To resolve this we use bpf_prog_inc_not_zero() and a READ_ONCE()
      primitive to ensure the program pointer is not refetched and
      possibly NULL'd before the refcnt increment. This happens inside
      an RCU critical section, so although the pointer reference in the map
      object may be NULL'd (by a concurrent detach operation), the reference
      obtained via READ_ONCE() will not be freed until after the grace
      period. This ensures the object returned by READ_ONCE() is valid
      through the RCU critical section and safe to use, as long as we
      "know" it may be freed shortly after.
      
      Daniel spotted a case in the sock update API where instead of using
      the READ_ONCE() program reference we used the pointer from the
      original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
      is the logic checks the object returned from the READ_ONCE() is not
      NULL and then tries to reference the object again but using the
      above map pointer, which may have already been NULL'd by a parallel
      detach operation. If this happened, bpf_prog_inc_not_zero() could
      dereference a NULL pointer.
      
      Fix this by using the variable returned by READ_ONCE(), which has
      been checked for NULL.
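      A portable sketch of the fixed pattern (hypothetical types and names, with C11 atomics standing in for the kernel's READ_ONCE() and bpf_prog_inc_not_zero()): the pointer is read exactly once, and both the NULL check and the reference bump go through that same local.

        #include <stdatomic.h>
        #include <stddef.h>

        struct prog { atomic_int refcnt; };
        struct stab { _Atomic(struct prog *) parse; };

        /* Read once; check and take the reference via the same local.  Going back
         * to stab->parse after the check reopens the race with a concurrent
         * detach storing NULL. */
        static struct prog *grab_parse_prog(struct stab *stab)
        {
                struct prog *prog = atomic_load(&stab->parse);  /* ~ READ_ONCE() */

                if (!prog)
                        return NULL;

                /* ~ bpf_prog_inc_not_zero(): only succeed while still live */
                int old = atomic_load(&prog->refcnt);
                do {
                        if (old == 0)
                                return NULL;
                } while (!atomic_compare_exchange_weak(&prog->refcnt, &old, old + 1));

                return prog;
        }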
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      96174560
    • bpf: sockmap update rollback on error can incorrectly dec prog refcnt · a593f708
      Committed by John Fastabend
      If the user were to only attach one of the parse or verdict programs
      then it is possible a subsequent sockmap update could incorrectly
      decrement the refcnt on the program. This happens because in the
      rollback logic, after an error, we have to decrement the program
      reference count when it has been incremented. However, we only
      increment the program reference count if the user has both a verdict
      and a parse program, because, at least at the moment, both are
      required for either one to be meaningful. The problem fixed here is
      that in the rollback path we decrement the program refcnt even if
      only one program exists, although in that case we never incremented
      the refcnt in the first place, creating an imbalance.
      
      This patch fixes the error path to handle this case.
      
      Fixes: 2f857d04 ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a593f708
  4. 16 May 2018, 3 commits
    • locking/percpu-rwsem: Annotate rwsem ownership transfer by setting RWSEM_OWNER_UNKNOWN · 5a817641
      Committed by Waiman Long
      The filesystem freezing code needs to transfer ownership of a rwsem
      embedded in a percpu-rwsem from the task that does the freezing to
      another one that does the thawing by calling percpu_rwsem_release()
      after freezing and percpu_rwsem_acquire() before thawing.
      
      However, the new rwsem debug code runs afoul of this scheme by warning
      that the task that releases the rwsem isn't the one that acquired it,
      as reported by Amir Goldstein:
      
        DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
        WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79
      
        Call Trace:
         percpu_up_write+0x1f/0x28
         thaw_super_locked+0xdf/0x120
         do_vfs_ioctl+0x270/0x5f1
         ksys_ioctl+0x52/0x71
         __x64_sys_ioctl+0x16/0x19
         do_syscall_64+0x5d/0x167
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To work properly with the rwsem debug code, we need to annotate that the
      rwsem ownership is unknown during the transfer period, until a brave soul
      comes forward to acquire the ownership. During that period, optimistic
      spinning will be disabled.
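      A hedged sketch of the hand-over this annotates (simplified; the percpu_rwsem_release()/percpu_rwsem_acquire() signatures are assumed from the kernel of that era, and the surrounding freeze/thaw logic is omitted):

        #include <linux/kernel.h>
        #include <linux/percpu-rwsem.h>

        /* Freezing task: takes the write lock, then hands off ownership before
         * returning to userspace.  The owner is now "unknown". */
        static void freeze_hold(struct percpu_rw_semaphore *sem)
        {
                percpu_down_write(sem);
                percpu_rwsem_release(sem, 0, _RET_IP_);   /* 0: writer side */
        }

        /* Thawing task (possibly a different one): re-acquires ownership for the
         * debug code, then releases the lock. */
        static void thaw_release(struct percpu_rw_semaphore *sem)
        {
                percpu_rwsem_acquire(sem, 0, _RET_IP_);
                percpu_up_write(sem);
        }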
      Reported-by: Amir Goldstein <amir73il@gmail.com>
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5a817641
    • locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag · d7d760ef
      Committed by Waiman Long
      There are use cases where a rwsem can be acquired by one task, but
      released by another task. In these cases, optimistic spinning may need
      to be disabled.  One example is the filesystem freeze/thaw code
      where the task that freezes the filesystem will acquire a write lock
      on a rwsem and then un-owns it before returning to userspace. Later on,
      another task will come along, acquire the ownership, thaw the filesystem
      and release the rwsem.
      
      Bit 0 of the owner field was used to designate a reader-owned rwsem.
      It is now repurposed to mean that the owner of the rwsem is not known.
      If only bit 0 is set, the rwsem is reader-owned. If bit 0 and other
      bits are set, it is writer-owned with an unknown owner. One such value
      for the latter case is (-1L). So we can set the owner to 1 for
      reader-owned and to -1 for writer-owned; the owner is unknown in both cases.
      
      To handle transfer of rwsem ownership, the higher level code should
      set the owner field to -1 to indicate a write-locked rwsem with unknown
      owner.  Optimistic spinning will be disabled in this case.
      
      Once the higher level code figures out who the new owner is, it can
      then set the owner field accordingly.
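      A hedged sketch of the encoding described above (the flag names come from the two commit titles in this log; the exact kernel definitions are only approximated here):

        /* bit 0: the rwsem is not owned by a known task */
        #define RWSEM_ANONYMOUSLY_OWNED  (1UL << 0)

        /* owner == 1  : reader-owned, owner unknown (only bit 0 set)        */
        /* owner == -1L: writer-owned, owner unknown (bit 0 plus other bits) */
        #define RWSEM_READER_OWNED       ((struct task_struct *)RWSEM_ANONYMOUSLY_OWNED)
        #define RWSEM_OWNER_UNKNOWN      ((struct task_struct *)-1L)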
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Theodore Y. Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d7d760ef
    • tick/broadcast: Use for_each_cpu() specially on UP kernels · 5596fe34
      Committed by Dexuan Cui
      for_each_cpu() unintuitively reports CPU0 as set independent of the actual
      cpumask content on UP kernels. This causes an unexpected PIT interrupt
      storm on a UP kernel running in an SMP virtual machine on Hyper-V, and as
      a result, the virtual machine can suffer from a strange random delay of 1~20
      minutes during boot-up, and sometimes it can hang forever.
      
      Protect it by checking whether the cpumask is empty before entering the
      for_each_cpu() loop.
      
      [ tglx: Use !IS_ENABLED(CONFIG_SMP) instead of #ifdeffery ]
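      A hedged sketch of the guard (the helper and mask names are hypothetical; the shape follows the tglx note above):

        #include <linux/cpumask.h>

        /* Hypothetical helper: on UP kernels for_each_cpu() ignores the mask and
         * always reports CPU0, so bail out explicitly when the mask is empty. */
        static void kick_cpus(const struct cpumask *mask)
        {
                int cpu;

                if (!IS_ENABLED(CONFIG_SMP) && cpumask_empty(mask))
                        return;

                for_each_cpu(cpu, mask)
                        kick_one_cpu(cpu);   /* hypothetical per-CPU work */
        }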
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: Rakib Mullick <rakib.mullick@gmail.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Link: https://lkml.kernel.org/r/KL1P15301MB000678289FE55BA365B3279ABF990@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      Link: https://lkml.kernel.org/r/KL1P15301MB0006FA63BC22BEB64902EAA0BF930@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
      5596fe34
  5. 14 May 2018, 2 commits
    • sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() · 943d355d
      Committed by Rohit Jain
      
      In the following commit:
      
        247f2f6f ("sched/core: Don't schedule threads on pre-empted vCPUs")
      
      ... we made idle_cpu() also report a CPU as not idle when its vCPU is
      preempted (not actually running), for the purpose of scheduling threads.
      
      However, the idle_cpu() function is used in other places for
      actually checking whether the state of the CPU is idle or not.
      
      Hence split the use of that function based on the desired return value,
      by introducing the available_idle_cpu() function.
      
      This fixes a (slight) regression introduced by that initial vCPU commit,
      because some code paths (like the load balancer) don't care and shouldn't
      care whether the vCPU is preempted or not; they just want to know whether
      there are any tasks on the CPU.
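      A hedged sketch of the split (close to, but not claiming to be, the exact kernel helper): idle_cpu() keeps answering "is this CPU idle right now?", while the new helper additionally requires that the vCPU is not preempted, which is what the task-placement paths want.

        /* Sketch: placement paths use this instead of a bare idle_cpu() */
        static int available_idle_cpu(int cpu)
        {
                if (!idle_cpu(cpu))
                        return 0;

                if (vcpu_is_preempted(cpu))
                        return 0;

                return 1;
        }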
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: linux-kernel@vger.kernel.org
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      943d355d
    • sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
      Committed by Mel Gorman
      Threads share an address space and each can change the protections of the
      same address space to trap NUMA faults. This is redundant and potentially
      counter-productive as any thread doing the update will suffice. Potentially
      only one thread is required but that thread may be idle or it may not have
      any locality concerns and pick an unsuitable scan rate.
      
      This patch uses independent scan periods, but they are staggered based on
      the number of address space users when the thread is created.  The intent
      is that threads will avoid scanning at the same time and have a chance
      to adapt their scan rate later if necessary. This reduces the total scan
      activity early in the lifetime of the threads.
      
      The difference in headline performance across a range of machines and
      workloads is marginal, but system CPU usage is reduced, as is overall
      scan activity.  The following is the time reported by NAS Parallel Benchmark
      using unbound openmp threads and a D size class:
      
      			      4.17.0-rc1             4.17.0-rc1
      				 vanilla           stagger-v1r1
      	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
      	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
      	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
      	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
      	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
      	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
      	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
      	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
      
      Note it's not a universal win, but we have no prior knowledge of which
      thread matters, and the number of threads created often exceeds the size
      of the node when the threads are not bound. However, there is a reduction
      in overall system CPU usage:
      
      				    4.17.0-rc1             4.17.0-rc1
      				       vanilla           stagger-v1r1
      	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
      	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
      	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
      	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
      	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
      	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
      	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
      	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
      
      NUMA scan activity is also reduced:
      
      	NUMA alloc local               1042828     1342670
      	NUMA base PTE updates        140481138    93577468
      	NUMA huge PMD updates           272171      180766
      	NUMA page range updates      279832690   186129660
      	NUMA hint faults               1395972     1193897
      	NUMA hint local faults          877925      855053
      	NUMA hint local percent             62          71
      	NUMA pages migrated           12057909     9158023
      
      Similar observations are made for other thread-intensive workloads. System
      CPU usage is lower even though the headline gains in performance tend to be
      small. For example, specjbb 2005 shows almost no difference in performance
      but scan activity is reduced by a third on a 4-socket box. I didn't find
      a workload (thread intensive or otherwise) that suffered badly.
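      One way the staggering could be expressed, as a hedged sketch (the helper name, the cap and the multiplication are illustrative assumptions; only the idea of scaling the initial delay by the number of address-space users comes from the text above):

        /* Hypothetical: delay a new thread's first NUMA scan in proportion to how
         * many threads already share the mm, so they do not all trap faults on
         * the same address space at the same time. */
        static unsigned long numa_scan_stagger(struct task_struct *p,
                                               unsigned long base_delay)
        {
                unsigned int users = atomic_read(&p->mm->mm_users);

                return base_delay * min(users, 16U);    /* cap the stagger */
        }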
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      13784475
  6. 12 May 2018, 2 commits
  7. 11 May 2018, 2 commits
  8. 09 May 2018, 2 commits
  9. 05 May 2018, 6 commits
  10. 04 May 2018, 6 commits
    • sched/core: Don't schedule threads on pre-empted vCPUs · 247f2f6f
      Committed by Rohit Jain
      In paravirt configurations today, spinlocks figure out whether a vCPU is
      running to determine whether or not the spinlock should bother spinning. We
      can use the same logic to prioritize CPUs when scheduling threads. If a
      vCPU has been pre-empted, scheduling a thread onto it incurs the extra cost
      of a VMENTER plus the time until the vCPU is actually running on the host
      CPU again. If other vCPUs are actually running on the host CPU and idle,
      we should schedule threads there.
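      A hedged sketch of the placement check (not the exact kernel hunk; vcpu_is_preempted() is the existing paravirt hook the text refers to, the helper name is hypothetical):

        /* Sketch: when picking an idle CPU for a waking task, skip CPUs whose
         * vCPU is currently preempted on the host. */
        static int pick_idle_cpu(const struct cpumask *candidates)
        {
                int cpu;

                for_each_cpu(cpu, candidates) {
                        if (idle_cpu(cpu) && !vcpu_is_preempted(cpu))
                                return cpu;
                }

                return -1;
        }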
      
      Performance numbers:
      
      Note: in the following, the configuration with the patch is referred to
      as Paravirt, and the one without the patch as Base.
      
      1) When only 1 VM is running:
      
          a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):
      
      	+-------+-----------------+----------------+
      	|Number |Paravirt         |Base            |
      	|of     +---------+-------+-------+--------+
      	|Threads|Average  |Std Dev|Average| Std Dev|
      	+-------+---------+-------+-------+--------+
      	|1      |1.817    |0.076  |1.721  | 0.067  |
      	|2      |3.467    |0.120  |3.468  | 0.074  |
      	|4      |6.266    |0.035  |6.314  | 0.068  |
      	|8      |11.437   |0.105  |11.418 | 0.132  |
      	|16     |21.862   |0.167  |22.161 | 0.129  |
      	|25     |33.341   |0.326  |33.692 | 0.147  |
      	+-------+---------+-------+-------+--------+
      
      2) When two VMs are running with same CPU affinities:
      
          a) tbench test on VM 8 cpus
      
          Base:
      
      	VM1:
      
      	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
      	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
      	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
      	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms
      
      	VM2:
      
      	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
      	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
      	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
      	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms
      
          Paravirt:
      
      	VM1:
      
      	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
      	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
      	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
      	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms
      
      	VM2:
      
      	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
      	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
      	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
      	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms
      
          b) Hackbench with 4 fd 1,000,000 loops
      
      	+-------+--------------------------------------+----------------------------------------+
      	|Number |Paravirt                              |Base                                    |
      	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
      	|Threads|Average1  |Std Dev1|Average2 | Std Dev|Average1  |Std Dev1|Average2 | Std Dev 2|
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      
          Note that this test was run just to show the interference effect that
          over-subscription can have on the baseline.
      
          c) schbench results with 2 message groups on 8 vCPU VMs
      
      	+-----------+-------+---------------+--------------+------------+
      	|           |       | Paravirt      | Base         |            |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
      	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
      	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
      	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
      	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
      	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
      	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
      	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
      	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
      	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
      	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
      	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
      	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
      	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
      	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
      	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
      	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
      	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
      	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
      	+-----------+-------+-------+-------+-------+------+------------+
      
          Note: Improvement is measured for (VM1+VM2)
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      247f2f6f
    • sched/fair: Avoid calling sync_entity_load_avg() unnecessarily · c976a862
      Committed by Viresh Kumar
      Call sync_entity_load_avg() directly from find_idlest_cpu() instead of
      select_task_rq_fair(), as that's where we need to use the task's
      utilization value. Also, call sync_entity_load_avg() only after making
      sure the sched domain spans at least one of the task's allowed CPUs.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/cd019d1753824c81130eae7b43e2bbcec47cc1ad.1524738578.git.viresh.kumar@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c976a862
    • sched/fair: Rearrange select_task_rq_fair() to optimize it · f1d88b44
      Committed by Viresh Kumar
      Rearrange select_task_rq_fair() a bit to avoid executing some
      conditional statements in a few specific code paths. That gets rid
      of the goto as well.
      
      This shouldn't result in any functional changes.
      Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20831b8d237bf3a20e4e328286f678b425ff04c9.1524738578.git.viresh.kumar@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f1d88b44
    • sched/core: Introduce set_special_state() · b5bf9a90
      Committed by Peter Zijlstra
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not
      use condition-based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.
      
      Normal condition-based wait-loops are immune to this problem, but
      sleep states that are not condition-based are subject to it.
      
      There already is a fix for TASK_DEAD. Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which also lack a
      condition-based wait-loop.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5bf9a90
    • bpf: use array_index_nospec in find_prog_type · d0f1a451
      Committed by Daniel Borkmann
      Commit 9ef09e35 ("bpf: fix possible spectre-v1 in find_and_alloc_map()")
      converted find_and_alloc_map() over to use array_index_nospec() to sanitize
      the map type that user space passes on map creation. This patch does an
      analogous conversion for programs in find_prog_type(), as the program type
      is also passed from user space when loading programs, as attr->prog_type.
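      A hedged sketch of the pattern (the lookup helper and table name are hypothetical; array_index_nospec() is the existing kernel helper that clamps an index under speculation):

        #include <linux/nospec.h>

        /* Hypothetical lookup helper showing the pattern: bounds-check first,
         * then clamp the user-controlled index so it cannot be used out of
         * range even under speculation. */
        static const struct bpf_prog_ops *lookup_prog_ops(u32 type)
        {
                if (type >= ARRAY_SIZE(prog_type_table) || !prog_type_table[type])
                        return NULL;

                type = array_index_nospec(type, ARRAY_SIZE(prog_type_table));
                return prog_type_table[type];
        }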
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d0f1a451
    • bpf: fix possible spectre-v1 in find_and_alloc_map() · 9ef09e35
      Committed by Mark Rutland
      It's possible for userspace to control attr->map_type. Sanitize it when
      using it as an array index to prevent an out-of-bounds value being used
      under speculation.
      
      Found by smatch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: netdev@vger.kernel.org
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9ef09e35
  11. 03 May 2018, 8 commits
    • tracing: Fix the file mode of stack tracer · 0c5a9acc
      Committed by Zhengyuan Liu
      It looks odd that the stack_trace_filter file can be written by root,
      yet the mode shown by 'll' suggests it has no write permission.
      
      Link: http://lkml.kernel.org/r/1518054113-28096-1-git-send-email-liuzhengyuan@kylinos.cn
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      0c5a9acc
    • ftrace: Have set_graph_* files have normal file modes · 1ce0500d
      Committed by Chen LinX
      The set_graph_function and set_graph_notrace file modes should be 0644
      instead of 0444, as they are writable. Note that the mode appears to be
      ignored regardless, but it should at least look sane.
      
      Link: http://lkml.kernel.org/r/1409725869-4501-1-git-send-email-linx.z.chen@intel.com
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Chen LinX <linx.z.chen@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1ce0500d
    • seccomp: Enable speculation flaw mitigations · 5c307089
      Committed by Kees Cook
      When speculation flaw mitigations are opt-in (via prctl), using seccomp
      will automatically opt in to these protections, since using seccomp
      indicates that at least some level of sandboxing is desired.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5c307089
    • nospec: Allow getting/setting on non-current task · 7bbf1373
      Committed by Kees Cook
      Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
      current.
      
      This is needed both for /proc/$pid/status queries and for seccomp (since
      thread-syncing can trigger seccomp in non-current threads).
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7bbf1373
    • prctl: Add speculation control prctls · b617cfc8
      Committed by Thomas Gleixner
      Add two new prctls to control aspects of speculation-related vulnerabilities
      and their mitigations, to provide finer-grained control over
      performance-impacting mitigations.
      
      PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
      which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
      the following meaning:
      
      Bit  Define           Description
      0    PR_SPEC_PRCTL    Mitigation can be controlled per task by
                            PR_SET_SPECULATION_CTRL
      1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is
                            disabled
      2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is
                            enabled
      
      If all bits are 0 the CPU is not affected by the speculation misfeature.
      
      If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
      available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
      misfeature will fail.
      
      PR_SET_SPECULATION_CTRL allows controlling the speculation misfeature, which
      is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
      control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
      
      The common return values are:
      
      EINVAL  prctl is not implemented by the architecture or the unused prctl()
              arguments are not 0
      ENODEV  arg2 is selecting a not supported speculation misfeature
      
      PR_SET_SPECULATION_CTRL has these additional return values:
      
      ERANGE  arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
      ENXIO   prctl control of the selected speculation misfeature is disabled
      
      The first supported controllable speculation misfeature is
      PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
      architectures.
      
      Based on an initial patch from Tim Chen and mostly rewritten.
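      A hedged usage sketch from userspace (assuming glibc and kernel uapi headers new enough to carry these definitions; error handling follows the return values documented above):

        #include <stdio.h>
        #include <sys/prctl.h>

        int main(void)
        {
                int ctrl = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);

                if (ctrl < 0) {
                        perror("PR_GET_SPECULATION_CTRL");  /* EINVAL or ENODEV, see above */
                        return 1;
                }

                printf("per-task control: %d, enabled: %d, disabled: %d\n",
                       !!(ctrl & PR_SPEC_PRCTL), !!(ctrl & PR_SPEC_ENABLE),
                       !!(ctrl & PR_SPEC_DISABLE));

                /* Opt this task into the mitigation, if per-task control exists. */
                if ((ctrl & PR_SPEC_PRCTL) &&
                    prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                          PR_SPEC_DISABLE, 0, 0))
                        perror("PR_SET_SPECULATION_CTRL");

                return 0;
        }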
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      b617cfc8
    • kthread, sched/wait: Fix kthread_parkme() completion issue · 85f1abe0
      Committed by Peter Zijlstra
      Even with the wait-loop fixed, there is a further issue with
      kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
      smpboot_park_threads() can return before all those threads are in fact
      blocked, due to the placement of the complete() in __kthread_parkme().
      
      When that happens, sched_cpu_dying() -> migrate_tasks() can end up
      migrating such a still runnable task onto another CPU.
      
      Normally the task will have hit schedule() and gone to sleep by the
      time we do kthread_unpark(), which will then do __kthread_bind() to
      re-bind the task to the correct CPU.
      
      However, when we lose the initial TASK_PARKED store to the concurrent
      wakeup issue described previously, then do the complete() and get
      migrated, it is possible to either:
      
       - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
         the park and set TASK_RUNNING, or
      
       - __kthread_bind()'s wait_task_inactive() to observe the competing
         TASK_RUNNING store.
      
      Either way the WARN() in __kthread_bind() will trigger and fail to
      correctly set the CPU affinity.
      
      Fix this by only issuing the complete() when the kthread has scheduled
      out. This does away with all the icky 'still running' nonsense.
      
      The alternative is to promote TASK_PARKED to a special state; this
      guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
      and we'll end up doing the right thing, but it preserves the whole
      icky business of potentially migrating the still-runnable thread.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      85f1abe0
    • kthread, sched/wait: Fix kthread_parkme() wait-loop · 741a76b3
      Committed by Peter Zijlstra
      Gaurav reported a problem with __kthread_parkme() where a concurrent
      try_to_wake_up() could result in competing stores to ->state; when the
      TASK_PARKED store got lost, bad things would happen.
      
      The comment near set_current_state() actually mentions this competing
      store, but only mentions the case against TASK_RUNNING. This same
      store, with different timing, can happen against a subsequent !RUNNING
      store.
      
      This normally is not a problem, because as per that same comment, the
      !RUNNING state store is inside a condition based wait-loop:
      
        for (;;) {
          set_current_state(TASK_UNINTERRUPTIBLE);
          if (!need_sleep)
            break;
          schedule();
        }
        __set_current_state(TASK_RUNNING);
      
      If we lose the (first) TASK_UNINTERRUPTIBLE store to a previous
      (concurrent) wakeup, the schedule() will NO-OP and we'll go around the
      loop once more.
      
      The problem here is that the TASK_PARKED store is not inside the
      KTHREAD_SHOULD_PARK condition wait-loop.
      
      There is a genuine issue with sleeps that do not have a condition;
      this is addressed in a subsequent patch.
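      A hedged sketch of the corrected shape (simplified; the per-CPU and completion details of the real __kthread_parkme() are omitted), mirroring the generic wait-loop shown above with the TASK_PARKED store moved inside the KTHREAD_SHOULD_PARK condition loop:
      
        for (;;) {
          set_current_state(TASK_PARKED);
          if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
            break;
          schedule();
        }
        __set_current_state(TASK_RUNNING);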
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      741a76b3
    • sched/fair: Fix the update of blocked load when newly idle · 457be908
      Committed by Vincent Guittot
      With commit:
      
        31e77c93 ("sched/fair: Update blocked load when newly idle")
      
      ... we release the rq->lock when updating blocked load of idle CPUs.
      
      This opens a time window during which another CPU can add a task to this
      CPU's cfs_rq.
      
      The check for a newly added task in idle_balance() is not in the common
      path. Move the 'out' label so that this check is included.
      Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 31e77c93 ("sched/fair: Update blocked load when newly idle")
      Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      457be908