1. 19 Apr 2017, 1 commit
  2. 16 Apr 2017, 1 commit
  3. 15 Apr 2017, 1 commit
    • ftrace: Fix removing of second function probe · 82cc4fc2
      Committed by Steven Rostedt (VMware)
      When two function probes are added to set_ftrace_filter, and then one of
      them is removed, the update to the function locations is not performed,
      the record keeping of the function states becomes corrupted, and this
      causes an ftrace_bug() to occur.
      
      This is easily reproducible by adding two probes, removing one, and then
      adding it back again.
      
       # cd /sys/kernel/debug/tracing
       # echo schedule:traceoff > set_ftrace_filter
       # echo do_IRQ:traceoff > set_ftrace_filter
       # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter
       # echo do_IRQ:traceoff > set_ftrace_filter
      
      Causes:
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220
       Modules linked in: [...]
       CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
       Call Trace:
        dump_stack+0x68/0x9f
        __warn+0x111/0x130
        ? trace_irq_work_interrupt+0xa0/0xa0
        warn_slowpath_null+0x1d/0x20
        ftrace_get_addr_curr+0x143/0x220
        ? __fentry__+0x10/0x10
        ftrace_replace_code+0xe3/0x4f0
        ? ftrace_int3_handler+0x90/0x90
        ? printk+0x99/0xb5
        ? 0xffffffff81000000
        ftrace_modify_all_code+0x97/0x110
        arch_ftrace_update_code+0x10/0x20
        ftrace_run_update_code+0x1c/0x60
        ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0
        register_ftrace_function_probe+0x4b6/0x590
        ? ftrace_startup+0x310/0x310
        ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30
        ? update_stack_state+0x88/0x110
        ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
        ? preempt_count_sub+0x18/0xd0
        ? mutex_lock_nested+0x104/0x800
        ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320
        ? __unwind_start+0x1c0/0x1c0
        ? _mutex_lock_nest_lock+0x800/0x800
        ftrace_trace_probe_callback.isra.3+0xc0/0x130
        ? func_set_flag+0xe0/0xe0
        ? __lock_acquire+0x642/0x1790
        ? __might_fault+0x1e/0x20
        ? trace_get_user+0x398/0x470
        ? strcmp+0x35/0x60
        ftrace_trace_onoff_callback+0x48/0x70
        ftrace_regex_write.isra.43.part.44+0x251/0x320
        ? match_records+0x420/0x420
        ftrace_filter_write+0x2b/0x30
        __vfs_write+0xd7/0x330
        ? do_loop_readv_writev+0x120/0x120
        ? locks_remove_posix+0x90/0x2f0
        ? do_lock_file_wait+0x160/0x160
        ? __lock_is_held+0x93/0x100
        ? rcu_read_lock_sched_held+0x5c/0xb0
        ? preempt_count_sub+0x18/0xd0
        ? __sb_start_write+0x10a/0x230
        ? vfs_write+0x222/0x240
        vfs_write+0xef/0x240
        SyS_write+0xab/0x130
        ? SyS_read+0x130/0x130
        ? trace_hardirqs_on_caller+0x182/0x280
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL_64_fastpath+0x18/0xad
       RIP: 0033:0x7fe61c157c30
       RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
       RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30
       RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001
       RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700
       R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400
       R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c
        ? trace_hardirqs_off_caller+0xc0/0x110
       ---[ end trace 99fa09b3d9869c2c ]---
       Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150)
      
      Cc: stable@vger.kernel.org
      Fixes: 59df055f ("ftrace: trace different functions with a different tracer")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  4. 14 Apr 2017, 1 commit
  5. 11 Apr 2017, 2 commits
    • bpf: reference may_access_skb() from __bpf_prog_run() · 96a94cc5
      Committed by Johannes Berg
      It took me quite some time to figure out how this was linked,
      so in order to save the next person the effort of finding it,
      add a comment in __bpf_prog_run() that indicates what exactly
      determines whether a program can access the ctx == skb.
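
      For context, a rough sketch (from memory of that era's kernel/bpf/verifier.c,
      not a quote of this patch) of the verifier helper the new comment points back
      to; it gates the LD_ABS/LD_IND instructions that treat ctx as an skb:

        static bool may_access_skb(enum bpf_prog_type type)
        {
            switch (type) {
            case BPF_PROG_TYPE_SOCKET_FILTER:
            case BPF_PROG_TYPE_SCHED_CLS:
            case BPF_PROG_TYPE_SCHED_ACT:
                return true;
            default:
                return false;
            }
        }
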
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cgroup: avoid attaching a cgroup root to two different superblocks · bfb0b80d
      Committed by Zefan Li
      Run this:
      
          touch file0
          for ((; ;))
          {
              mount -t cpuset xxx file0
          }
      
      And this concurrently:
      
          touch file1
          for ((; ;))
          {
              mount -t cpuset xxx file1
          }
      
      We'll trigger a warning like this:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
       percpu_ref_kill_and_confirm called more than once on css_release!
       CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       Call Trace:
        dump_stack+0x63/0x84
        __warn+0xd1/0xf0
        warn_slowpath_fmt+0x5f/0x80
        percpu_ref_kill_and_confirm+0x92/0xb0
        cgroup_kill_sb+0x95/0xb0
        deactivate_locked_super+0x43/0x70
        deactivate_super+0x46/0x60
       ...
       ---[ end trace a79f61c2a2633700 ]---
      
      Here's a race:
      
        Thread A				Thread B
      
        cgroup1_mount()
          # alloc a new cgroup root
          cgroup_setup_root()
      					cgroup1_mount()
      					  # no sb yet, returns NULL
      					  kernfs_pin_sb()
      
      					  # but succeeds in getting the refcnt,
      					  # so re-use cgroup root
      					  percpu_ref_tryget_live()
          # alloc sb with cgroup root
          cgroup_do_mount()
      
        cgroup_kill_sb()
      					  # alloc another sb with same root
      					  cgroup_do_mount()
      
      					cgroup_kill_sb()
      
      We end up using the same cgroup root for two different superblocks,
      so percpu_ref_kill() will be called twice on the same root when the
      two superblocks are destroyed.
      
      Fix this by making sure that pinning the superblock really succeeded
      before reusing the cgroup root.
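
      A rough sketch of the kind of check this implies in cgroup1_mount() (names
      as in the race diagram above; the key point is that a NULL return from
      kernfs_pin_sb() must be treated as a failed pin, not as a go-ahead to
      reuse the root):

            pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
            if (IS_ERR_OR_NULL(pinned_sb) ||
                !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
                /* pin failed or the root is dying: don't reuse this
                 * cgroup root, back off and retry the mount */
                mutex_unlock(&cgroup_mutex);
                if (!IS_ERR_OR_NULL(pinned_sb))
                    deactivate_super(pinned_sb);
                msleep(10);
                ret = restart_syscall();
                /* ... unwind and return ret ... */
            }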
      
      Cc: stable@vger.kernel.org # 3.16+
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  6. 10 Apr 2017, 1 commit
    • audit: make sure we don't let the retry queue grow without bounds · 264d5096
      Committed by Paul Moore
      The retry queue is intended to provide a temporary buffer in the case
      of transient errors when communicating with auditd; it is not meant
      to be a long-lived queue, as that functionality is provided by the
      hold queue.
      
      This patch fixes a problem identified by Seth where the retry queue
      could grow uncontrollably if an auditd instance did not connect to
      the kernel to drain the queues.  It does so by making the following
      changes:
      
      * Make sure we always call auditd_reset() if we decide the connection
      with auditd is really dead.  There were some cases in
      kauditd_hold_skb() where we did not reset the connection; this patch
      relocates the reset calls to kauditd_thread() so that all the error
      conditions are caught and the connection is reset.  As a side effect,
      this means we could move auditd_reset() and get rid of the forward
      declaration at the top of kernel/audit.c.
      
      * We never checked the status of the auditd connection when
      processing the main audit queue, which meant that the retry queue
      could grow unchecked.  This patch adds a call to auditd_reset()
      after the main queue has been processed if auditd is not connected;
      the auditd_reset() call will make sure the retry and hold queues are
      correctly managed/flushed so that the retry queue remains reasonable.
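
      Conceptually, the resulting main loop looks something like the sketch
      below (heavily simplified; auditd_connected(), flush_hold_queue() and
      the other helpers are stand-in names for illustration, not the real
      functions):

            for (;;) {
                /* try to drain the hold and retry queues first */
                flush_hold_queue();
                flush_retry_queue();

                /* process the main audit queue */
                send_main_queue();

                /* if auditd is gone, reset the connection state so the
                 * retry queue gets flushed into the hold queue instead
                 * of growing without bound */
                if (!auditd_connected())
                    auditd_reset();

                wait_for_more_records();
            }
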
      
      Cc: <stable@vger.kernel.org> # 4.10.x-: 5b52330b
      Reported-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  7. 09 Apr 2017, 1 commit
  8. 08 Apr 2017, 2 commits
  9. 05 Apr 2017, 1 commit
  10. 04 Apr 2017, 1 commit
  11. 02 Apr 2017, 2 commits
    • bpf, verifier: fix rejection of unaligned access checks for map_value_adj · 79adffcd
      Committed by Daniel Borkmann
      Currently, the verifier doesn't reject unaligned access for map_value_adj
      register types. Commit 48461135 ("bpf: allow access into map value
      arrays") added logic to check_ptr_alignment(), extending it from PTR_TO_PACKET
      to also cover PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
      is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
      non-zero. This means we can cause BPF_H/_W/_DW-based unaligned access on
      architectures that do not support efficient unaligned access, and in the
      worst case such an access could raise an exception the arch cannot fix up,
      or could be turned into a different memory access than the one actually
      requested.
      
      i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
         on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x42533a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+11
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (61) r1 = *(u32 *)(r0 +0)
         8: (35) if r1 >= 0xb goto pc+9
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
         9: (07) r0 += 3
        10: (79) r7 = *(u64 *)(r0 +0)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
        11: (79) r7 = *(u64 *)(r0 +2)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
        [...]
      
      ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
          on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x4df16a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+19
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (07) r0 += 3
         8: (7a) *(u64 *)(r0 +0) = 42
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
         9: (7a) *(u64 *)(r0 +2) = 43
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        10: (7a) *(u64 *)(r0 -2) = 44
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        [...]
      
      For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
      was fetched, it later receives a reg->id from env->id_gen generator
      once another register with UNKNOWN_VALUE type was added to it via
      check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
      is used in find_good_pkt_pointers() for setting the allowed access
      range for regs with PTR_TO_PACKET of same id once verifier matched
      on data/data_end tests, and ii) for check_ptr_alignment() to determine
      that when not having efficient unaligned access and register with
      UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
      to access the content bytewise due to unknown unalignment. reg->id
      was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
      always zero; the only marking is in PTR_TO_MAP_VALUE_OR_NULL, which
      was added after 48461135 via 57a09bf0 ("bpf: Detect identical
      PTR_TO_MAP_VALUE_OR_NULL registers"). The above tests will fail in a
      non-root environment due to prohibited pointer arithmetic.
      
      The fix splits the register-type specific checks into their own helpers
      instead of keeping them combined, so that we don't run into a similar
      issue in the future once we extend check_ptr_alignment() further and
      forget to add reg->type checks for some of them.
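
      The resulting shape is roughly the following dispatcher (a sketch; the
      helper names here are illustrative, not necessarily the ones used in
      the final patch):

        static int check_ptr_alignment(const struct bpf_reg_state *reg,
                                       int off, int size)
        {
            switch (reg->type) {
            case PTR_TO_PACKET:
                /* NET_IP_ALIGN / reg->id based packet rules */
                return check_pkt_ptr_alignment(reg, off, size);
            case PTR_TO_MAP_VALUE_ADJ:
                /* byte-sized access only if unaligned access is not
                 * efficient on this architecture */
                return check_val_ptr_alignment(reg, off, size);
            default:
                return 0;
            }
        }
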
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, verifier: fix alu ops against map_value{, _adj} register types · fce366a9
      Committed by Daniel Borkmann
      While looking into map_value_adj, I noticed that alu operations
      directly on the map_value() resp. map_value_adj() register (any
      alu operation on a map_value() register will turn it into a
      map_value_adj() typed register) are not sufficiently protected
      against some of the operations. Two non-exhaustive examples are
      provided that the verifier needs to reject:
      
       i) BPF_AND on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xbf842a00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (57) r0 &= 8
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xc24eee00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (04) (u32) r0 += (u32) 0
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      Issue is, while min_value / max_value boundaries for the access
      are adjusted appropriately, we change the pointer value in a way
      that cannot be sufficiently tracked anymore from its origin.
      Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
      that is PTR_TO_MAP_VALUE{,_ADJ} were probably unintended; in fact,
      all the test cases coming with 48461135 ("bpf: allow access
      into map value arrays") perform BPF_ADD only on the destination
      register that is PTR_TO_MAP_VALUE_ADJ.
      
      Such operations only make sense for UNKNOWN_VALUE register types,
      e.g. with unknown memory content fetched initially from a constant
      offset from the map value memory into a register. That register is
      then later tested against lower / upper bounds, so that the verifier
      can then do the tracking of min_value / max_value, and properly
      check once that UNKNOWN_VALUE register is added to the destination
      register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
      original use-case is solving. Note, tracking on what is being
      added is done through adjust_reg_min_max_vals() and later access
      to the map value enforced with these boundaries and the given offset
      from the insn through check_map_access_adj().
      
      The tests will fail in a non-root environment due to prohibited pointer
      arithmetic; in particular in check_alu_op(), we bail out on the
      is_pointer_value() check on the dst_reg (which is false in root
      case as we allow for pointer arithmetic via env->allow_ptr_leaks).
      
      Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
      allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
      mode BPF_ADD. The test_verifier suite runs fine after the patch
      and it also rejects mentioned test cases.
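
      Conceptually, the restriction looks like this (an illustrative sketch
      only; the exact hook point inside the verifier differs):

            /* dst_reg already resolved in the ALU-op handling */
            if (dst_reg->type == PTR_TO_MAP_VALUE ||
                dst_reg->type == PTR_TO_MAP_VALUE_ADJ) {
                /* only 64-bit BPF_ADD may be applied to a pointer
                 * into a map value */
                if (opcode != BPF_ADD || BPF_CLASS(insn->code) != BPF_ALU64) {
                    verbose("invalid alu op on map value pointer\n");
                    return -EACCES;
                }
            }
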
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 31 Mar 2017, 1 commit
  13. 28 Mar 2017, 1 commit
  14. 27 Mar 2017, 1 commit
  15. 25 Mar 2017, 1 commit
  16. 24 Mar 2017, 1 commit
    • padata: avoid race in reordering · de5540d0
      Committed by Jason A. Donenfeld
      Under extremely heavy uses of padata, crashes occur, and with list
      debugging turned on, this happens instead:
      
      [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
      __list_add+0xae/0x130
      [87487.301868] list_add corruption. prev->next should be next
      (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
      [87487.339011]  [<ffffffff9a53d075>] dump_stack+0x68/0xa3
      [87487.342198]  [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
      [87487.345364]  [<ffffffff99d6b91f>] __warn+0xff/0x140
      [87487.348513]  [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
      [87487.351659]  [<ffffffff9a58b5de>] __list_add+0xae/0x130
      [87487.354772]  [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
      [87487.357915]  [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
      [87487.361084]  [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
      
      padata_reorder calls list_add_tail with the list to which it's adding
      locked, which seems correct:
      
      spin_lock(&squeue->serial.lock);
      list_add_tail(&padata->list, &squeue->serial.list);
      spin_unlock(&squeue->serial.lock);
      
      This therefore leaves only one place where such an inconsistency could
      occur: if padata->list is added at the same time on two different threads.
      This padata pointer comes from the function call to
      padata_get_next(pd), which has in it the following block:
      
      next_queue = per_cpu_ptr(pd->pqueue, cpu);
      padata = NULL;
      reorder = &next_queue->reorder;
      if (!list_empty(&reorder->list)) {
             padata = list_entry(reorder->list.next,
                                 struct padata_priv, list);
             spin_lock(&reorder->lock);
             list_del_init(&padata->list);
             atomic_dec(&pd->reorder_objects);
             spin_unlock(&reorder->lock);
      
             pd->processed++;
      
             goto out;
      }
      out:
      return padata;
      
      I strongly suspect that the problem here is that two threads can race
      on the reorder list. Even though the deletion is locked, the call to
      list_entry is not, which means it's feasible that two threads
      pick up the same padata object and subsequently call list_add_tail on
      it at the same time. The fix is thus to hoist that lock outside of
      that block.
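
      With the lock hoisted, the block above becomes (a sketch of the
      resulting shape):

            spin_lock(&reorder->lock);
            if (!list_empty(&reorder->list)) {
                padata = list_entry(reorder->list.next,
                                    struct padata_priv, list);

                /* list_entry() and list_del_init() now happen under
                 * the same lock, so two CPUs can no longer pick up the
                 * same object and list_add_tail() it twice later on */
                list_del_init(&padata->list);
                atomic_dec(&pd->reorder_objects);

                pd->processed++;

                spin_unlock(&reorder->lock);
                goto out;
            }
            spin_unlock(&reorder->lock);
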
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  17. 23 Mar 2017, 3 commits
  18. 21 Mar 2017, 2 commits
    • audit: fix auditd/kernel connection state tracking · 5b52330b
      Committed by Paul Moore
      What started as a rather straightforward race condition reported by
      Dmitry using the syzkaller fuzzer ended up revealing some major
      problems with how the audit subsystem managed its netlink sockets and
      its connection with the userspace audit daemon.  Fixing this properly
      had quite the cascading effect and what we are left with is this rather
      large and complicated patch.  My initial goal was to try and decompose
      this patch into multiple smaller patches, but the way these changes
      are intertwined makes it difficult to split these changes into
      meaningful pieces that don't break or somehow make things worse for
      the intermediate states.
      
      The patch makes a number of changes, but the most significant are
      highlighted below:
      
      * The auditd tracking variables, e.g. audit_sock, are now gone and
      replaced by a RCU/spin_lock protected variable auditd_conn which is
      a structure containing all of the auditd tracking information.
      
      * We no longer track the auditd sock directly, instead we track it
      via the network namespace in which it resides and we use the audit
      socket associated with that namespace.  In spirit, this is what the
      code was trying to do prior to this patch (at least I think that is
      what the original authors intended), but it was done rather poorly
      and added a layer of obfuscation that only masked the underlying
      problems.
      
      * Big backlog queue cleanup, again.  In v4.10 we made some pretty big
      changes to how the audit backlog queues work, here we haven't changed
      the queue design so much as cleaned up the implementation.  Brought
      about by the locking changes, we've simplified kauditd_thread() quite
      a bit by consolidating the queue handling into a new helper function,
      kauditd_send_queue(), which allows us to eliminate a lot of very
      similar code and makes the looping logic in kauditd_thread() clearer.
      
      * All netlink messages sent to auditd are now sent via
      auditd_send_unicast_skb().  Other than just making sense, this makes
      the lock handling easier.
      
      * Change the audit_log_start() sleep behavior so that we never sleep
      on auditd events (unchanged) or if the caller is holding the
      audit_cmd_mutex (changed).  Previously we didn't sleep if the caller
      was auditd or if the message type fell within a certain range; the
      type check was a poor attempt at doing what the cmd_mutex check now
      does.  Richard Guy Briggs originally proposed not sleeping the
      cmd_mutex owner several years ago but his patch wasn't acceptable
      at the time.  At least the idea lives on here.
      
      * A problem with the lost record counter has been resolved.  Steve
      Grubb and I both happened to notice this problem and according to
      some quick testing by Steve, this problem goes back quite some time.
      It's largely a harmless problem, although it may have left some
      careful sysadmins quite puzzled.
      
      Cc: <stable@vger.kernel.org> # 4.10.x-
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() · 4296f23e
      Committed by Rafael J. Wysocki
      sugov_start() only initializes struct sugov_cpu per-CPU structures
      for shared policies, but it should do that for single-CPU policies too.
      
      That in particular makes the IO-wait boost mechanism work in the
      cases when cpufreq policies correspond to individual CPUs.
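
      The resulting initialization loop looks roughly like this (a sketch, not
      the literal patch): every CPU of the policy gets a fully initialized
      sugov_cpu, and only the update hook differs between shared and
      single-CPU policies.

            for_each_cpu(cpu, policy->cpus) {
                struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                memset(sg_cpu, 0, sizeof(*sg_cpu));
                sg_cpu->sg_policy = sg_policy;
                sg_cpu->flags = SCHED_CPUFREQ_RT;
                sg_cpu->iowait_boost_max = policy->cpuinfo.max_freq;
                cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
                                             policy_is_shared(policy) ?
                                                 sugov_update_shared :
                                                 sugov_update_single);
            }
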
      
      Fixes: 21ca6d2c (cpufreq: schedutil: Add iowait boosting)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
  19. 17 Mar 2017, 2 commits
    • cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796
      Committed by Tejun Heo
      Creation of a kthread goes through a couple of interlocked stages between
      the kthread itself and its creator.  Once the new kthread starts
      running, it initializes itself and wakes up the creator.  The creator
      then can further configure the kthread and then let it start doing its
      job by waking it up.
      
      In this configuration-by-creator stage, the creator is the only one
      that can wake it up but the kthread is visible to userland.  When
      altering the kthread's attributes from userland is allowed, this is
      fine; however, for cases where CPU affinity is critical,
      kthread_bind() is used to first disable affinity changes from userland
      and then set the affinity.  This also prevents the kthread from being
      migrated into non-root cgroups as that can affect the CPU affinity and
      many other things.
      
      Unfortunately, the cgroup side of protection is racy.  While the
      PF_NO_SETAFFINITY flag prevents further migrations, userland can win
      the race before the creator sets the flag with kthread_bind() and put
      the kthread in a non-root cgroup, which can lead to all sorts of
      problems including incorrect CPU affinity and starvation.
      
      This bug got triggered by userland which periodically tries to migrate
      all processes in the root cpuset cgroup to a non-root one.  Per-cpu
      workqueue workers got caught while being created and ended up with
      incorrect CPU affinity, breaking concurrency management and sometimes
      stalling workqueue execution.
      
      This patch adds task->no_cgroup_migration which disallows the task to
      be migrated by userland.  kthreadd starts with the flag set making
      every child kthread start in the root cgroup with migration
      disallowed.  The flag is cleared after the kthread finishes
      initialization by which time PF_NO_SETAFFINITY is set if the kthread
      should stay in the root cgroup.
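
      In outline (a sketch of the mechanism; treat the exact helper names and
      their placement as approximate):

        /* new bit in struct task_struct */
            /* disallow userland-initiated cgroup migration */
            unsigned            no_cgroup_migration:1;

        /* include/linux/cgroup.h */
        static inline void cgroup_init_kthreadd(void)
        {
            /* kthreadd and every kthread forked off it start with
             * migration disallowed */
            current->no_cgroup_migration = 1;
        }

        static inline void cgroup_kthread_ready(void)
        {
            /* initialization is done; if the kthread must stay in the
             * root cgroup, PF_NO_SETAFFINITY has been set by now */
            current->no_cgroup_migration = 0;
        }

      The cgroup attach path then refuses to move a task that still has
      no_cgroup_migration set.
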
      
      It'd be better to wait for the initialization instead of failing but I
      couldn't think of a way of implementing that without adding either a
      new PF flag, or sleeping and retrying from the waiting side.  Even if
      userland depends on changing cgroup membership of a kthread, it either
      has to be synchronized with kthread_create() or periodically repeat,
      so it's unlikely that this would break anything.
      
      v2: Switch to a simpler implementation using a new task_struct bit
          field suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-debugged-by: Chris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • mm: add private lock to serialize memory hotplug operations · 55adc1d0
      Committed by Heiko Carstens
      Commit bfc8c901 ("mem-hotplug: implement get/put_online_mems")
      introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
      in order to allow memory hotplug semantics similar to those of cpu
      hotplug.
      
      The corresponding cpu hotplug functions are get/put_online_cpus()
      and cpu_hotplug_begin/done().
      
      The commit, however, missed introducing functions that serialize
      memory hotplug operations the way cpu_maps_update_begin/done() does
      for cpu hotplug.
      
      This basically leaves mem_hotplug.active_writer unprotected and allows
      concurrent writers to modify it, which may lead to problems as outlined
      by commit f931ab47 ("mm: fix devm_memremap_pages crash, use
      mem_hotplug_{begin, done}").
      
      That commit was extended again with commit b5d24fda ("mm,
      devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
      done}") which serializes memory hotplug operations for some call sites
      by using the device_hotplug lock.
      
      In addition with commit 3fc21924 ("mm: validate device_hotplug is held
      for memory hotplug") a sanity check was added to mem_hotplug_begin() to
      verify that the device_hotplug lock is held.
      
      This in turn triggers the following warning on s390:
      
      WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
       Call Trace:
        assert_held_device_hotplug+0x40/0x58)
        mem_hotplug_begin+0x34/0xc8
        add_memory_resource+0x7e/0x1f8
        add_memory+0xda/0x130
        add_memory_merged+0x15c/0x178
        sclp_detect_standby_memory+0x2ae/0x2f8
        do_one_initcall+0xa2/0x150
        kernel_init_freeable+0x228/0x2d8
        kernel_init+0x2a/0x140
        kernel_thread_starter+0x6/0xc
      
      One possible fix would be to add more lock_device_hotplug() and
      unlock_device_hotplug() calls around each call site of
      mem_hotplug_begin/end().  But that would give the device_hotplug lock
      additional semantics that it really should not have (serializing memory
      hotplug operations).
      
      Instead, add a new memory_add_remove_lock which has semantics similar
      to cpu_add_remove_lock for cpu hotplug.
      
      To keep things hopefully a bit easier the lock will be locked and unlocked
      within the mem_hotplug_begin/end() functions.
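
      A rough sketch of the result in mm/memory_hotplug.c (existing internals
      elided):

        static DEFINE_MUTEX(memory_add_remove_lock);

        void mem_hotplug_begin(void)
        {
            /* serialize hotplug writers among themselves, like
             * cpu_add_remove_lock does for cpu hotplug */
            mutex_lock(&memory_add_remove_lock);
            /* ... existing active_writer / refcount handshake ... */
        }

        void mem_hotplug_done(void)
        {
            /* ... existing teardown ... */
            mutex_unlock(&memory_add_remove_lock);
        }
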
      
      Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 16 Mar 2017, 11 commits
    • perf/core: Better explain the inherit magic · d8a8cfc7
      Committed by Peter Zijlstra
      While going through the event inheritance code Oleg got confused.
      
      Add some comments to better explain the silent disappearance of
      orphaned events.
      
      So what happens is that, at perf_event_release_kernel() time, when an
      event loses its connection to userspace (and ceases to exist from the
      user's perspective), we can still have an arbitrary amount of inherited
      copies of the event. We want to synchronously find and remove all of
      these child events.
      
      Since that requires a bit of lock juggling, there is the possibility
      that concurrent clone()s will create new child events. Therefore we
      first mark the parent event as DEAD, which marks all the extant child
      events as orphaned.
      
      We then avoid copying orphaned events, in order to avoid getting more
      of them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Simplify perf_event_free_task() · 15121c78
      Committed by Peter Zijlstra
      We have ctx->event_list that contains all events; no need to
      repeatedly iterate the group lists to find them all.
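
      The simplification boils down to something like this (sketch):

            /* instead of walking pinned_groups and flexible_groups
             * separately, free everything via the flat per-context list */
            list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
                perf_free_event(event, ctx);
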
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix event inheritance on fork() · e7cc4865
      Committed by Peter Zijlstra
      While hunting for clues to a use-after-free, Oleg spotted that
      perf_event_init_context() can lose an error value, with the result
      that fork() can succeed even though we did not fully inherit the perf
      event context.
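
      The problematic pattern in perf_event_init_context() was of this shape
      (a sketch): a failure in the first inheritance loop was dropped before
      the caller could see it, and the fix makes every failing path actually
      propagate ret.

            list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
                ret = inherit_task_group(event, parent, parent_ctx,
                                         child, ctxn, &inherited_all);
                if (ret)
                    goto out_unlock;    /* was: break, losing ret */
            }
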
      Spotted-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: oleg@redhat.com
      Cc: stable@vger.kernel.org
      Fixes: 889ff015 ("perf/core: Split context's event group list into pinned and non-pinned lists")
      Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix use-after-free in perf_release() · e552a838
      Committed by Peter Zijlstra
      Dmitry reported that syzkaller tripped a use-after-free in perf_release().
      
      After much puzzlement Oleg spotted the below scenario:
      
        Task1                           Task2
      
        fork()
          perf_event_init_task()
          /* ... */
          goto bad_fork_$foo;
          /* ... */
          perf_event_free_task()
            mutex_lock(ctx->lock)
            perf_free_event(B)
      
                                        perf_event_release_kernel(A)
                                          mutex_lock(A->child_mutex)
                                          list_for_each_entry(child, ...) {
                                            /* child == B */
                                            ctx = B->ctx;
                                            get_ctx(ctx);
                                            mutex_unlock(A->child_mutex);
      
              mutex_lock(A->child_mutex)
              list_del_init(B->child_list)
              mutex_unlock(A->child_mutex)
      
              /* ... */
      
            mutex_unlock(ctx->lock);
            put_ctx() /* >0 */
          free_task();
                                            mutex_lock(ctx->lock);
                                            mutex_lock(A->child_mutex);
                                            /* ... */
                                            mutex_unlock(A->child_mutex);
                                            mutex_unlock(ctx->lock)
                                            put_ctx() /* 0 */
                                              ctx->task && !TOMBSTONE
                                                put_task_struct() /* UAF */
      
      This patch closes the hole by making perf_event_free_task() destroy the
      task <-> ctx relation such that perf_event_release_kernel() will no longer
      observe the now dead task.
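
      In rough terms (a sketch only, details elided), the tail of
      perf_event_free_task() now severs that relation under the ctx locks
      before dropping the reference:

            /* with ctx->mutex and ctx->lock held */
            RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], NULL);
            WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
            put_task_struct(task);  /* drops the ctx's task reference */
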
      Spotted-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Cc: oleg@redhat.com
      Cc: stable@vger.kernel.org
      Fixes: c6e5b732 ("perf: Synchronously clean up child events")
      Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
      Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Use deadline instead of period when calculating overflow · 2317d5f1
      Committed by Steven Rostedt (VMware)
      I was testing Daniel's changes with his test case, and tweaked it a
      little. Instead of having the runtime equal to the deadline, I
      increased the deadline ten fold.
      
      Daniel's test case had:
      
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      To make it more interesting, I changed it to:
      
      	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
      	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */
      
      The results were rather surprising. The behavior that Daniel's patch
      was fixing came back. The task started using much more than .1% of the
      CPU. More like 20%.
      
      Looking into this I found that it was due to the dl_entity_overflow()
      constantly returning true. That's because it uses the relative period
      against relative runtime vs the absolute deadline against absolute
      runtime.
      
        runtime / (deadline - t) > dl_runtime / dl_period
      
      There's even a comment mentioning this, and saying that when relative
      deadline equals relative period, that the equation is the same as using
      deadline instead of period. That comment is backwards! What we really
      want is:
      
        runtime / (deadline - t) > dl_runtime / dl_deadline
      
      We care about if the runtime can make its deadline, not its period. And
      then we can say "when the deadline equals the period, the equation is
      the same as using dl_period instead of dl_deadline".
      
      After correcting this, now when the task gets enqueued, it can throttle
      correctly, and Daniel's fix to the throttling of sleeping deadline
      tasks works even when the runtime and deadline are not the same.
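
      The corrected check, in the scaled fixed-point form the kernel uses, is
      roughly (sketch):

        static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
                                       struct sched_dl_entity *pi_se, u64 t)
        {
            u64 left, right;

            /*
             * runtime / (deadline - t) > dl_runtime / dl_deadline
             * cross-multiplied and right-shifted to avoid overflow;
             * note dl_deadline, not dl_period, on the left-hand side.
             */
            left = (pi_se->dl_deadline >> DL_SCALE) *
                   (dl_se->runtime >> DL_SCALE);
            right = ((dl_se->deadline - t) >> DL_SCALE) *
                    (pi_se->dl_runtime >> DL_SCALE);

            return dl_time_before(right, left);
        }
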
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Throttle a constrained deadline task activated after the deadline · df8eac8c
      Committed by Daniel Bristot de Oliveira
      During the activation, CBS checks if it can reuse the current task's
      runtime and period. If the deadline of the task is in the past, CBS
      cannot use the runtime, and so it replenishes the task. This rule
      works fine for implicit deadline tasks (deadline == period), and the
      CBS was designed for implicit deadline tasks. However, a task with
      constrained deadline (deadline < period) might be awakened after the
      deadline, but before the next period. In this case, replenishing the
      task would allow it to run for runtime / deadline. As in this case
      deadline < period, CBS enables a task to run for more than the
      runtime / period. In a very loaded system, this can cause a domino
      effect, making other tasks miss their deadlines.
      
      To avoid this problem, in the activation of a constrained deadline
      task after the deadline but before the next period, throttle the
      task and set the replenishing timer to the begin of the next period,
      unless it is boosted.
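
      The check added at enqueue time looks roughly like this (sketch):

        static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
        {
            struct task_struct *p = dl_task_of(dl_se);
            struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));

            /* woken after the deadline but before the next period:
             * don't replenish, throttle until the period boundary */
            if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
                dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
                if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
                    return;
                dl_se->dl_throttled = 1;
            }
        }
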
      
      Reproducer:
      
       --------------- %< ---------------
        int main (int argc, char **argv)
        {
      	int ret;
      	int flags = 0;
      	unsigned long l = 0;
      	struct timespec ts;
      	struct sched_attr attr;
      
      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      
      	attr.sched_policy   = SCHED_DEADLINE;
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      	ts.tv_sec = 0;
      	ts.tv_nsec = 2000 * 1000;			/* 2 ms */
      
      	ret = sched_setattr(0, &attr, flags);
      
      	if (ret < 0) {
      		perror("sched_setattr");
      		exit(-1);
      	}
      
      	for(;;) {
      		/* XXX: you may need to adjust the loop */
      		for (l = 0; l < 150000; l++);
      		/*
      		 * The idea is to go to sleep right before the deadline
      		 * and then wake up before the next period to receive
      		 * a new replenishment.
      		 */
      		nanosleep(&ts, NULL);
      	}
      
      	exit(0);
        }
        --------------- >% ---------------
      
      On my box, this reproducer uses almost 50% of the CPU time, which is
      obviously wrong for a task with 2/2000 reservation.
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Make sure the replenishment timer fires in the next period · 5ac69d37
      Committed by Daniel Bristot de Oliveira
      Currently, the replenishment timer is set to fire at the deadline
      of a task. Although that works for implicit deadline tasks, because the
      deadline is equal to the beginning of the next period, it is not correct
      for constrained deadline tasks (deadline < period).
      
      For instance:
      
      f.c:
       --------------- %< ---------------
      int main (void)
      {
      	for(;;);
      }
       --------------- >% ---------------
      
        # gcc -o f f.c
      
        # trace-cmd record -e sched:sched_switch                              \
      				   -e syscalls:sys_exit_sched_setattr   \
         chrt -d --sched-runtime  490000000					\
                 --sched-deadline 500000000					\
      	   --sched-period  1000000000 0 ./f
      
        # trace-cmd report | grep "{pid of ./f}"
      
      After setting the parameters, the task is replenished and continues
      running until it is throttled:
      
               f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
      
      The task is throttled after running for 492.318 ms, as expected:
      
               f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]
      
      But then, the task is replenished 500.719 ms after the first
      replenishment:
      
          <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]
      
      Running for 490.277 ms:
      
               f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]
      
      Hence, in the first period, the task runs 2 * runtime, and that is a bug.
      
      During the first replenishment, the next deadline is set one period away.
      So the runtime / period starts to be respected. However, as the second
      replenishment took place at the wrong instant, the next replenishment
      will also happen at the wrong instant of time. Rather than occurring
      the nth period away from the first activation, it takes place
      at (nth period - relative deadline).
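
      The fix is to arm the replenishment timer at the start of the next
      period rather than at the deadline, roughly (sketch):

        static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
        {
            /* beginning of the period that follows the current deadline */
            return dl_se->deadline - dl_se->dl_deadline + dl_se->dl_period;
        }

        /* in start_dl_timer(): */
            act = ns_to_ktime(dl_next_period(dl_se));   /* was: dl_se->deadline */
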
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y · 17fcbd59
      Committed by Niklas Cassel
      We hang if SIGKILL has been sent, but the task is stuck in down_read()
      (after do_exit()), even though no task is doing down_write() on the
      rwsem in question:
      
        INFO: task libupnp:21868 blocked for more than 120 seconds.
        libupnp         D    0 21868      1 0x08100008
        ...
        Call Trace:
        __schedule()
        schedule()
        __down_read()
        do_exit()
        do_group_exit()
        __wake_up_parent()
      
      This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
      the following commit:
      
       04cafed7 ("locking/rwsem: Fix down_write_killable()")
      
      ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.
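
      The essence of the fix mirrors what 04cafed7 did for the xchg-add
      variant: a writer that aborts on a fatal signal must remove itself from
      the wait list and wake up any waiters it may have been blocking.
      Roughly, for kernel/locking/rwsem-spinlock.c (sketch):

            /* in __down_write_common(), when a fatal signal is pending */
            if (signal_pending_state(state, current))
                goto out_nolock;
            ...
         out_nolock:
            list_del(&waiter.list);
            if (!list_empty(&sem->wait_list) && sem->count >= 0)
                __rwsem_do_wake(sem, 0);
            raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
            return -EINTR;
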
      Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Niklas Cassel <niklass@axis.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: d4799608 ("locking/rwsem: Introduce basis for down_write_killable()")
      Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/loadavg: Use {READ,WRITE}_ONCE() for sample window · caeb5882
      Committed by Matt Fleming
      'calc_load_update' is accessed without any kind of locking and there's
      a clear assumption in the code that only a single value is read or
      written.
      
      Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
      unintentionally seeing multiple values, or having the load/stores
      split.
      
      Technically the loads in calc_global_*() don't require this since
      those are the only functions that update 'calc_load_update', but I've
      added the READ_ONCE() for consistency.
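
      The pattern is the usual one, sketched here in calc_global_load() style
      (not a quote of the patch):

            unsigned long sample_window;

            sample_window = READ_ONCE(calc_load_update);
            if (time_before(jiffies, sample_window + 10))
                return;

            /* ... fold idle load, update avenrun[] ... */

            WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);
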
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting · 6e5f32f7
      Committed by Matt Fleming
      If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
      the pending sample window time on exit, setting the next update not
      one window into the future, but two.
      
      This situation on exiting NO_HZ is described by:
      
        this_rq->calc_load_update < jiffies < calc_load_update
      
      In this scenario, what we should be doing is:
      
        this_rq->calc_load_update = calc_load_update		     [ next window ]
      
      But what we actually do is:
      
        this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]
      
      This has the effect of delaying load average updates for potentially
      up to ~9 seconds.
      
      This can result in huge spikes in the load average values due to
      per-cpu uninterruptible task counts being out of sync when accumulated
      across all CPUs.
      
      It's safe to update the per-cpu active count if we wake between sample
      windows because any load that we left in 'calc_load_idle' will have
      been zero'd when the idle load was folded in calc_global_load().
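
      In code, the idle-exit path then becomes roughly (sketch): sync to the
      global window first, and only push the per-rq update out by LOAD_FREQ if
      we actually woke up inside the sample window itself.

        void calc_load_exit_idle(void)
        {
            struct rq *this_rq = this_rq();

            /* if we're still before the pending sample window, we're done */
            this_rq->calc_load_update = READ_ONCE(calc_load_update);
            if (time_before(jiffies, this_rq->calc_load_update))
                return;

            /* we woke inside or after the window; the NO_HZ accounting
             * already covers it, so just line up for the window after it */
            if (time_before(jiffies, this_rq->calc_load_update + 10))
                this_rq->calc_load_update += LOAD_FREQ;
        }
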
      
      This issue was easy to reproduce before
      
        commit 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      
      just by forking short-lived process pipelines built from ps(1) and
      grep(1) in a loop. I'm unable to reproduce the spikes after that
      commit, but the bug still seems to be present from code review.
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again")
      Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Add missing update_rq_clock() in dl_task_timer() · dcc3b5ff
      Committed by Wanpeng Li
      The following warning can be triggered by hot-unplugging the CPU
      on which an active SCHED_DEADLINE task is running:
      
       ------------[ cut here ]------------
       WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
       rq->clock_update_flags < RQCF_ACT_SKIP
       CPU: 7 PID: 0 Comm: swapper/7 Tainted: G    B           4.11.0-rc1+ #24
       Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
       Call Trace:
        <IRQ>
        dump_stack+0x85/0xc4
        __warn+0x172/0x1b0
        warn_slowpath_fmt+0xb4/0xf0
        ? __warn+0x1b0/0x1b0
        ? debug_check_no_locks_freed+0x2c0/0x2c0
        ? cpudl_set+0x3d/0x2b0
        replenish_dl_entity+0x71e/0xc40
        enqueue_task_dl+0x2ea/0x12e0
        ? dl_task_timer+0x777/0x990
        ? __hrtimer_run_queues+0x270/0xa50
        dl_task_timer+0x316/0x990
        ? enqueue_task_dl+0x12e0/0x12e0
        ? enqueue_task_dl+0x12e0/0x12e0
        __hrtimer_run_queues+0x270/0xa50
        ? hrtimer_cancel+0x20/0x20
        ? hrtimer_interrupt+0x119/0x600
        hrtimer_interrupt+0x19c/0x600
        ? trace_hardirqs_off+0xd/0x10
        local_apic_timer_interrupt+0x74/0xe0
        smp_apic_timer_interrupt+0x76/0xa0
        apic_timer_interrupt+0x93/0xa0
      
      The DL task will be migrated to a suitable later deadline rq once the DL
      timer fires and the current rq is offline. The rq clock of the new rq should
      be updated. This patch fixes it by updating the rq clock after taking
      the new rq's lock.
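
      In dl_task_timer(), the offline-CPU migration path then does roughly
      (sketch; lock-pinning details approximate):

            if (unlikely(!rq->online)) {
                /* the rq the timer was armed on went offline; move the
                 * task to a suitable online rq first */
                lockdep_unpin_lock(&rq->lock, rf.cookie);
                rq = dl_task_offline_migration(rq, p);
                rf.cookie = lockdep_pin_lock(&rq->lock);

                /* now that we hold the new rq's lock, bring its clock
                 * up to date before the replenishment uses it */
                update_rq_clock(rq);
            }
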
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 15 Mar 2017, 3 commits