1. 14 Apr, 2017 1 commit
    • sched/fair: Fix corner case in __accumulate_sum() · 05296e75
      Committed by Peter Zijlstra
      Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
      __accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
      incorrect.
      
      This is because at this point, the decay_load() on the old state --
      the first step in accumulate_sum() -- will not have resulted in 0, and
      will therefore result in a sum larger than the maximum value of our
      series. Obviously broken.
      
      Note that:
      
              decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) = 47742 * (1/2)^(345/32) = ~27
      
      Not to mention that any further contribution from the d3 segment (our
      new period) would also push it over the maximum.
      
      Solve this by noting that we can write our c2 term:
      
              c2 = 1024 \Sum_{n=1}^{p} y^n
      
      In terms of our maximum value:
      
              max = 1024 \Sum_{n=0}^{inf} y^n
                  = 1024 ( \Sum_{n=p+1}^{inf} y^n + \Sum_{n=1}^{p} y^n + y^0 )
      
      Further note that:
      
              ( \Sum_{n=0}^{inf} y^n ) y^p = \Sum_{n=0}^{inf} y^(n+p) = \Sum_{n=p}^{inf} y^n
      
      Combined that gives us:
      
              c2 = 1024 \Sum_{n=1}^{p} y^n

                 = 1024 ( \Sum_{n=0}^{inf} y^n - \Sum_{n=p+1}^{inf} y^n - y^0 )

                 = max - (max y^(p+1)) - 1024
      
      Further simplify things by dealing with p=0 early on.
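
The closed form for c2 can be checked numerically. The following is a minimal floating-point sketch, not the kernel's fixed-point code: the constant PELT_Y and every function name are assumptions made for illustration, and pelt_max() is the ideal series limit rather than the kernel's integer LOAD_AVG_MAX (47742).

```c
/* Assumed decay factor, approximately 2^(-1/32), so that y^32 is about 0.5. */
static const double PELT_Y = 0.9785720620877;

/* y^n by repeated multiplication (avoids depending on libm). */
double pelt_pow(unsigned int n)
{
	double r = 1.0;

	while (n--)
		r *= PELT_Y;
	return r;
}

/* Ideal (untruncated) series maximum: 1024 * \Sum_{n=0}^{inf} y^n. */
double pelt_max(void)
{
	return 1024.0 / (1.0 - PELT_Y);
}

/* c2 summed term by term: 1024 * \Sum_{n=1}^{p} y^n. */
double c2_direct(unsigned int p)
{
	double sum = 0.0;
	unsigned int n;

	for (n = 1; n <= p; n++)
		sum += 1024.0 * pelt_pow(n);
	return sum;
}

/* c2 in closed form: max - max * y^(p+1) - 1024. */
double c2_closed(unsigned int p)
{
	double max = pelt_max();

	return max - max * pelt_pow(p + 1) - 1024.0;
}

/* Absolute difference between the two forms. */
double c2_error(unsigned int p)
{
	double d = c2_direct(p) - c2_closed(p);

	return d < 0 ? -d : d;
}
```

Calling c2_error(p) for a few values of p, including p = 0, should give a difference that is only floating-point noise. The identity itself holds for any 0 < y < 1, so it does not depend on the exact digits of PELT_Y.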
      
      Reported-by: Paul Turner <pjt@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Fixes: a481db34 ("sched/fair: Optimize ___update_sched_avg()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 11 Apr, 2017 1 commit
    • sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags() · 717a94b5
      Committed by NeilBrown
      It is not safe for one thread to modify the ->flags
      of another thread as there is no locking that can protect
      the update.
      
      So tsk_restore_flags(), which takes a task pointer and modifies
      the flags, is an invitation to do the wrong thing.
      
      All current users pass "current" as the task, so no developers have
      accepted that invitation.  It would be best to ensure it remains
      that way.
      
      So rename tsk_restore_flags() to current_restore_flags() and don't
      pass in a task_struct pointer.  Always operate on current->flags.
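
The renamed helper can be pictured with a small userspace model. This is only a sketch under assumptions: struct task_model, current_task_model and current_restore_flags_model() are hypothetical stand-ins for the kernel's task_struct, current and current_restore_flags(), and the flag bit values used with it are arbitrary.

```c
/* Hypothetical stand-in for the kernel's task_struct; only ->flags matters here. */
struct task_model {
	unsigned long flags;
};

/* Hypothetical stand-in for the kernel's "current" task. */
struct task_model current_task_model;

/*
 * Restore only the bits selected by 'mask' to the values they had in
 * 'orig_flags'. All other bits of the current task's flags are left
 * untouched. No task pointer is accepted, so another task's flags
 * cannot be modified through this helper.
 */
void current_restore_flags_model(unsigned long orig_flags, unsigned long mask)
{
	current_task_model.flags &= ~mask;
	current_task_model.flags |= orig_flags & mask;
}
```

A typical pattern would be to save the flags, set a bit temporarily, and then restore just that bit with the saved value, leaving any unrelated bits changed in the meantime as they are.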
      
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 09 Apr, 2017 1 commit
  4. 08 Apr, 2017 2 commits
  5. 05 Apr, 2017 1 commit
  6. 02 Apr, 2017 2 commits
    • bpf, verifier: fix rejection of unaligned access checks for map_value_adj · 79adffcd
      Committed by Daniel Borkmann
      Currently, the verifier doesn't reject unaligned access for map_value_adj
      register types. Commit 48461135 ("bpf: allow access into map value
      arrays") extended check_ptr_alignment() from PTR_TO_PACKET to also cover
      PTR_TO_MAP_VALUE_ADJ, but nothing is actually enforced for
      PTR_TO_MAP_VALUE_ADJ, because reg->id for that register type is never
      non-zero. As a result, we can cause BPF_H/_W/_DW-based unaligned access on
      architectures that do not support efficient unaligned access. In the worst
      case this could raise exceptions on archs that are unable to correct the
      unaligned access, or make them perform a different memory access from the
      one actually requested.
      
      i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
         on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x42533a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+11
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (61) r1 = *(u32 *)(r0 +0)
         8: (35) if r1 >= 0xb goto pc+9
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
         9: (07) r0 += 3
        10: (79) r7 = *(u64 *)(r0 +0)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
        11: (79) r7 = *(u64 *)(r0 +2)
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
        [...]
      
      ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
          on r0 (map_value_adj):
      
         0: (bf) r2 = r10
         1: (07) r2 += -8
         2: (7a) *(u64 *)(r2 +0) = 0
         3: (18) r1 = 0x4df16a00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+19
          R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
         7: (07) r0 += 3
         8: (7a) *(u64 *)(r0 +0) = 42
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
         9: (7a) *(u64 *)(r0 +2) = 43
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        10: (7a) *(u64 *)(r0 -2) = 44
          R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
        [...]
      
      For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
      was fetched, it later receives a reg->id from env->id_gen generator
      once another register with UNKNOWN_VALUE type was added to it via
      check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
      is used in find_good_pkt_pointers() for setting the allowed access
      range for regs with PTR_TO_PACKET of same id once verifier matched
      on data/data_end tests, and ii) for check_ptr_alignment() to determine
      that when not having efficient unaligned access and register with
      UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
      to access the content bytewise due to unknown unalignment. reg->id
      was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
      always zero, the only marking is in PTR_TO_MAP_VALUE_OR_NULL that
      was added after 48461135 via 57a09bf0 ("bpf: Detect identical
      PTR_TO_MAP_VALUE_OR_NULL registers"). Above tests will fail for
      non-root environment due to prohibited pointer arithmetic.
      
      The fix splits register-type specific checks into their own helper
      instead of keeping them combined, so we don't run into a similar
      issue in future once we extend check_ptr_alignment() further and
      forget to add reg->type checks for some of the checks.
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf, verifier: fix alu ops against map_value{, _adj} register types · fce366a9
      Committed by Daniel Borkmann
      While looking into map_value_adj, I noticed that alu operations
      directly on the map_value() resp. map_value_adj() register (any
      alu operation on a map_value() register will turn it into a
      map_value_adj() typed register) are not sufficiently protected
      against some of the operations. Two non-exhaustive examples are
      provided that the verifier needs to reject:
      
       i) BPF_AND on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xbf842a00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (57) r0 &= 8
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):
      
        0: (bf) r2 = r10
        1: (07) r2 += -8
        2: (7a) *(u64 *)(r2 +0) = 0
        3: (18) r1 = 0xc24eee00
        5: (85) call bpf_map_lookup_elem#1
        6: (15) if r0 == 0x0 goto pc+2
         R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        7: (04) (u32) r0 += (u32) 0
        8: (7a) *(u64 *)(r0 +0) = 22
         R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
        9: (95) exit
      
        from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
        9: (95) exit
        processed 10 insns
      
      Issue is, while min_value / max_value boundaries for the access
      are adjusted appropriately, we change the pointer value in a way
      that cannot be sufficiently tracked anymore from its origin.
      Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
      that is PTR_TO_MAP_VALUE{,_ADJ} were probably unintended; in fact,
      all the test cases coming with 48461135 ("bpf: allow access
      into map value arrays") perform BPF_ADD only on the destination
      register that is PTR_TO_MAP_VALUE_ADJ.
      
      Only for UNKNOWN_VALUE register types such operations make sense,
      f.e. with unknown memory content fetched initially from a constant
      offset from the map value memory into a register. That register is
      then later tested against lower / upper bounds, so that the verifier
      can then do the tracking of min_value / max_value, and properly
      check once that UNKNOWN_VALUE register is added to the destination
      register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
      original use-case is solving. Note, tracking on what is being
      added is done through adjust_reg_min_max_vals() and later access
      to the map value enforced with these boundaries and the given offset
      from the insn through check_map_access_adj().
      
      Tests will fail for non-root environment due to prohibited pointer
      arithmetic, in particular in check_alu_op(), we bail out on the
      is_pointer_value() check on the dst_reg (which is false in root
      case as we allow for pointer arithmetic via env->allow_ptr_leaks).
      
      Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
      allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
      mode BPF_ADD. The test_verifier suite runs fine after the patch
      and it also rejects mentioned test cases.
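
The restriction described above can be sketched as a predicate on the instruction opcode. This is an illustration only: map_value_alu_allowed() is a hypothetical helper, not the verifier's actual code in check_alu_op(), and it assumes the caller has already established that the destination register is a PTR_TO_MAP_VALUE{,_ADJ} pointer. The opcode field masks and values follow the BPF instruction encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Opcode fields of a BPF instruction: low 3 bits = class, high 4 bits = op. */
#define BPF_CLASS(code)	((code) & 0x07)
#define BPF_OP(code)	((code) & 0xf0)

#define BPF_ALU		0x04	/* 32 bit ALU class */
#define BPF_ALU64	0x07	/* 64 bit ALU class */
#define BPF_ADD		0x00
#define BPF_AND		0x50

/*
 * Hypothetical check: with a map value pointer as the destination
 * register, accept only 64 bit BPF_ADD and reject every other ALU
 * operation, including 32 bit BPF_ADD.
 */
bool map_value_alu_allowed(uint8_t code)
{
	return BPF_CLASS(code) == BPF_ALU64 && BPF_OP(code) == BPF_ADD;
}
```

With the opcodes from the traces above, 0x07 (r0 += 3) would pass, while 0x57 (r0 &= 8) and 0x04 ((u32) r0 += (u32) 0) would be rejected.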
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 30 Mar, 2017 2 commits
    • sched/fair: Optimize ___update_sched_avg() · a481db34
      Committed by Yuyang Du
      The main PELT function ___update_load_avg(), which implements the
      accumulation and progression of the geometric average series, is
      implemented along the following lines for the scenario where the time
      delta spans all 3 possible sections (see figure below):
      
        1. add the remainder of the last incomplete period
        2. decay old sum
        3. accumulate new sum in full periods since last_update_time
        4. accumulate the current incomplete period
        5. update averages
      
      Or:
      
                  d1          d2           d3
                  ^           ^            ^
                  |           |            |
                |<->|<----------------->|<--->|
        ... |---x---|------| ... |------|-----x (now)
      
        load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +        (1,2)
                    weight * scale * 1024 * \Sum_{n=1}^{p} y^n +        (3)
                    weight * scale * d3 * y^0                           (4)

        load_avg' = load_sum' / LOAD_AVG_MAX                            (5)
      
      Where:
      
       d1 - is the delta part completing the remainder of the last
            incomplete period,
       d2 - is the delta part spanning complete periods, and
       d3 - is the delta part starting the current incomplete period.
      
      We can simplify the code in two steps; the first step is to separate
      the first term into new and old parts like:
      
        (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
      					       weight * scale * d1 * y^(p+1)
      
      Once we've done that, it's easy to see that all new terms carry the
      common factors:
      
        weight * scale
      
      If we factor those out, we arrive at the form:
      
        load_sum' = load_sum * y^(p+1) +
                    weight * scale * (d1 * y^(p+1) +
                                      1024 * \Sum_{n=1}^{p} y^n +
                                      d3 * y^0)
      
      Which results in a simpler, smaller and faster implementation.
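
The equivalence of the original and the factored form can be sketched in floating point. This is not the kernel's fixed-point implementation: PELT_Y, the function names, and the single parameter ws (standing for weight * scale) are assumptions made for illustration.

```c
/* Assumed decay factor, approximately 2^(-1/32), so that y^32 is about 0.5. */
static const double PELT_Y = 0.9785720620877;

/* y^n by repeated multiplication. */
double decay_factor(unsigned int n)
{
	double r = 1.0;

	while (n--)
		r *= PELT_Y;
	return r;
}

/* 1024 * \Sum_{n=1}^{p} y^n, summed term by term. */
double full_periods(unsigned int p)
{
	double sum = 0.0;
	unsigned int n;

	for (n = 1; n <= p; n++)
		sum += 1024.0 * decay_factor(n);
	return sum;
}

/* Original form: terms (1,2), (3) and (4), with ws = weight * scale. */
double load_sum_original(double load_sum, double ws, double d1,
			 unsigned int p, double d3)
{
	return (load_sum + ws * d1) * decay_factor(p + 1) +
	       ws * full_periods(p) +
	       ws * d3;
}

/* Factored form: decay the old sum, then add ws times the new contribution. */
double load_sum_factored(double load_sum, double ws, double d1,
			 unsigned int p, double d3)
{
	double contrib = d1 * decay_factor(p + 1) + full_periods(p) + d3;

	return load_sum * decay_factor(p + 1) + ws * contrib;
}

/* Absolute difference between the two forms for the same inputs. */
double forms_gap(double load_sum, double ws, double d1,
		 unsigned int p, double d3)
{
	double d = load_sum_original(load_sum, ws, d1, p, d3) -
		   load_sum_factored(load_sum, ws, d1, p, d3);

	return d < 0 ? -d : d;
}
```

In the factored form the new contribution is computed once, independently of weight and scale, and multiplied by them only at the end.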
      
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: matt@codeblueprint.co.uk
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Explicitly generate __update_load_avg() instances · 0ccb977f
      Committed by Peter Zijlstra
      The __update_load_avg() function is __always_inline because it's
      used with constant propagation to generate different variants of the
      code without having to duplicate it (which would be prone to bugs).
      
      Explicitly instantiate the 3 variants.
      
      Note that most of this is called from rather hot paths, so reducing
      branches is good.
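
The technique can be sketched as follows. Everything here is hypothetical: the function names, the parameters and the placeholder arithmetic do not come from the kernel. The attribute syntax assumes GCC or Clang. The point is only that each wrapper passes compile-time constants, so the compiler can drop the branches that cannot be taken in that variant.

```c
/*
 * Hypothetical shared routine. It is forced inline so that constant
 * arguments from the wrappers below are propagated into it.
 */
static inline __attribute__((always_inline))
long update_avg_model(long sum, long delta, int running, int has_rq)
{
	sum += delta;
	if (running)	/* removed by the compiler when 'running' is constant 0 */
		sum += delta;
	if (has_rq)	/* removed by the compiler when 'has_rq' is constant 0 */
		sum += 1;
	return sum;
}

/* Variant 1: blocked entity, never running, no runqueue bookkeeping. */
long update_avg_blocked(long sum, long delta)
{
	return update_avg_model(sum, delta, 0, 0);
}

/* Variant 2: entity that may be running, no runqueue bookkeeping. */
long update_avg_entity(long sum, long delta, int running)
{
	return update_avg_model(sum, delta, running, 0);
}

/* Variant 3: runqueue update, with the extra bookkeeping step. */
long update_avg_rq(long sum, long delta, int running)
{
	return update_avg_model(sum, delta, running, 1);
}
```

The three wrappers share one source body, so a fix to the shared routine reaches every variant, while each generated function contains only the branches it can actually take.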
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 28 Mar, 2017 1 commit
  9. 27 Mar, 2017 2 commits
    • sched/clock: Fix broken stable to unstable transfer · 7b09cc5a
      Committed by Pavel Tatashin
      When it is determined that the clock is actually unstable, and
      we switch from stable to unstable, the __clear_sched_clock_stable()
      function is eventually called.
      
      In this function we set gtod_offset so the following holds true:
      
        sched_clock() + raw_offset == ktime_get_ns() + gtod_offset
      
      But instead of getting the latest timestamps, we use the last values
      from scd, so instead of sched_clock() we use scd->tick_raw, and
      instead of ktime_get_ns() we use scd->tick_gtod.
      
      However, later, when we use gtod_offset in sched_clock_local(), we do
      not add it to scd->tick_gtod to calculate the correct clock value when we
      determine the boundaries for min/max clocks.
      
      This can result in tick granularity sched_clock() values, so fix it.
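
The invariant and the corrected base value can be modelled in a few lines. This is a simplified sketch, not the kernel code: struct clock_data_model and the *_model names are hypothetical, and the per-CPU handling, wrap-safe comparisons and min/max clamping are left out.

```c
#include <stdint.h>

/* Hypothetical model of the per-CPU snapshot taken at the last tick. */
struct clock_data_model {
	uint64_t tick_raw;	/* sched_clock() value at the last tick */
	uint64_t tick_gtod;	/* ktime_get_ns() value at the last tick */
};

uint64_t raw_offset_model;
uint64_t gtod_offset_model;

/*
 * On the stable to unstable switch, pick gtod_offset so that
 *   tick_raw + raw_offset == tick_gtod + gtod_offset
 * holds for the snapshot values.
 */
void clear_stable_model(const struct clock_data_model *scd)
{
	gtod_offset_model = (scd->tick_raw + raw_offset_model) - scd->tick_gtod;
}

/*
 * Base value for the min/max clock boundaries. The offset has to be
 * added to tick_gtod here, otherwise the boundaries are computed in a
 * different time base from the clock values they are meant to bound.
 */
uint64_t gtod_base_model(const struct clock_data_model *scd)
{
	return scd->tick_gtod + gtod_offset_model;
}
```

After clear_stable_model() runs, gtod_base_model() returns the same value as tick_raw + raw_offset, which is the continuity the invariant is meant to provide.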
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Fixes: 5680d809 ("sched/clock: Provide better clock continuity")
      Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Prefer sibiling only if local group is under-utilized · 05b40e05
      Committed by Srikar Dronamraju
      If the child domain prefers tasks to go to siblings, the local group could
      end up pulling tasks to itself even if the local group is almost equally
      loaded as the source group.
      
      Let's assume a 4 core, smt==2 machine running a 5 thread ebizzy workload.
      Every time the local group has capacity and the source group has at least
      2 threads, the local group tries to pull a task. This causes the threads
      to constantly move between different cores. This is even more pronounced
      if the cores have more threads, like in Power 8, smt 8 mode.
      
      Fix this by only allowing the local group to pull a task if the source
      group has more tasks than the local group.
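
The rule stated above can be written as a small predicate. This is a sketch based only on the wording of this message: struct group_model and local_may_pull() are hypothetical, and the exact comparison used in the kernel's load balancer may differ.

```c
#include <stdbool.h>

/* Hypothetical summary of one scheduling group. */
struct group_model {
	unsigned int nr_running;	/* tasks currently running in the group */
	bool has_capacity;		/* group could take on another task */
};

/*
 * The local group may pull a task only if it has spare capacity and
 * the source group runs more tasks than the local group does.
 */
bool local_may_pull(const struct group_model *local,
		    const struct group_model *src)
{
	return local->has_capacity && src->nr_running > local->nr_running;
}
```

With equally loaded groups the predicate is false, so a task is not bounced back and forth between two cores that are each running the same number of threads.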
      
      Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.
      
      Without patch:
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
                     366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
                   3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
                   3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
                   5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
                   2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
                  10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )
      
      With patch:
      
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
                     123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
                   3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
                     189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
                   5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
                     506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
                  10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )
      
      Which show that in these workloads CPU migrations get reduced significantly.
      
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 25 Mar, 2017 1 commit
  11. 24 Mar, 2017 1 commit
    • padata: avoid race in reordering · de5540d0
      Committed by Jason A. Donenfeld
      Under extremely heavy uses of padata, crashes occur, and with list
      debugging turned on, this happens instead:
      
      [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
      __list_add+0xae/0x130
      [87487.301868] list_add corruption. prev->next should be next
      (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
      [87487.339011]  [<ffffffff9a53d075>] dump_stack+0x68/0xa3
      [87487.342198]  [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
      [87487.345364]  [<ffffffff99d6b91f>] __warn+0xff/0x140
      [87487.348513]  [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
      [87487.351659]  [<ffffffff9a58b5de>] __list_add+0xae/0x130
      [87487.354772]  [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
      [87487.357915]  [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
      [87487.361084]  [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
      
      padata_reorder calls list_add_tail with the list to which it's adding
      locked, which seems correct:
      
      spin_lock(&squeue->serial.lock);
      list_add_tail(&padata->list, &squeue->serial.list);
      spin_unlock(&squeue->serial.lock);
      
      This therefore leaves only one place where such an inconsistency could
      occur: if padata->list is added at the same time on two different threads.
      This padata pointer comes from the function call to
      padata_get_next(pd), which has in it the following block:
      
      next_queue = per_cpu_ptr(pd->pqueue, cpu);
      padata = NULL;
      reorder = &next_queue->reorder;
      if (!list_empty(&reorder->list)) {
             padata = list_entry(reorder->list.next,
                                 struct padata_priv, list);
             spin_lock(&reorder->lock);
             list_del_init(&padata->list);
             atomic_dec(&pd->reorder_objects);
             spin_unlock(&reorder->lock);
      
             pd->processed++;
      
             goto out;
      }
      out:
      return padata;
      
      I strongly suspect that the problem here is that two threads can race
      on the reorder list. Even though the deletion is locked, the call to
      list_entry is not locked, which means it's feasible that two threads
      pick up the same padata object and subsequently call list_add_tail on
      it at the same time. The fix is thus to hoist that lock outside of
      that block.
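
The shape of the fix can be sketched in userspace. This is a model under assumptions, not the padata code: it uses a pthread mutex and a singly linked list in place of the kernel's spinlock and list_head, and struct reorder_item, struct reorder_queue and reorder_get_next() are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical queued object, standing in for struct padata_priv. */
struct reorder_item {
	struct reorder_item *next;
	int seq;
};

/* Hypothetical reorder queue with its own lock. */
struct reorder_queue {
	pthread_mutex_t lock;
	struct reorder_item *head;
	int nr_objects;
};

/*
 * Take the first item off the queue. The lock is taken before the head
 * is looked at, so inspecting the first entry and unlinking it happen
 * in one critical section and two callers can never be handed the same
 * item.
 */
struct reorder_item *reorder_get_next(struct reorder_queue *q)
{
	struct reorder_item *item;

	pthread_mutex_lock(&q->lock);
	item = q->head;			/* peek under the lock */
	if (item) {
		q->head = item->next;	/* unlink under the same lock */
		item->next = NULL;
		q->nr_objects--;
	}
	pthread_mutex_unlock(&q->lock);

	return item;
}
```

In the racy version the peek happens before the lock is taken, which is the window in which a second thread could pick up the same object.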
      
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  12. 23 Mar, 2017 5 commits
  13. 21 Mar, 2017 2 commits
    • audit: fix auditd/kernel connection state tracking · 5b52330b
      Committed by Paul Moore
      What started as a rather straightforward race condition reported by
      Dmitry using the syzkaller fuzzer ended up revealing some major
      problems with how the audit subsystem managed its netlink sockets and
      its connection with the userspace audit daemon.  Fixing this properly
      had quite the cascading effect and what we are left with is this rather
      large and complicated patch.  My initial goal was to try and decompose
      this patch into multiple smaller patches, but the way these changes
      are intertwined makes it difficult to split these changes into
      meaningful pieces that don't break or somehow make things worse for
      the intermediate states.
      
      The patch makes a number of changes, but the most significant are
      highlighted below:
      
      * The auditd tracking variables, e.g. audit_sock, are now gone and
      replaced by a RCU/spin_lock protected variable auditd_conn which is
      a structure containing all of the auditd tracking information.
      
      * We no longer track the auditd sock directly, instead we track it
      via the network namespace in which it resides and we use the audit
      socket associated with that namespace.  In spirit, this is what the
      code was trying to do prior to this patch (at least I think that is
      what the original authors intended), but it was done rather poorly
      and added a layer of obfuscation that only masked the underlying
      problems.
      
      * Big backlog queue cleanup, again.  In v4.10 we made some pretty big
      changes to how the audit backlog queues work, here we haven't changed
      the queue design so much as cleaned up the implementation.  Brought
      about by the locking changes, we've simplified kauditd_thread() quite
      a bit by consolidating the queue handling into a new helper function,
      kauditd_send_queue(), which allows us to eliminate a lot of very
      similar code and makes the looping logic in kauditd_thread() clearer.
      
      * All netlink messages sent to auditd are now sent via
      auditd_send_unicast_skb().  Other than just making sense, this makes
      the lock handling easier.
      
      * Change the audit_log_start() sleep behavior so that we never sleep
      on auditd events (unchanged) or if the caller is holding the
      audit_cmd_mutex (changed).  Previously we didn't sleep if the caller
      was auditd or if the message type fell between a certain range; the
      type check was a poor effort of doing what the cmd_mutex check now
      does.  Richard Guy Briggs originally proposed not sleeping the
      cmd_mutex owner several years ago but his patch wasn't acceptable
      at the time.  At least the idea lives on here.
      
      * A problem with the lost record counter has been resolved.  Steve
      Grubb and I both happened to notice this problem and according to
      some quick testing by Steve, this problem goes back quite some time.
      It's largely a harmless problem, although it may have left some
      careful sysadmins quite puzzled.
      
      Cc: <stable@vger.kernel.org> # 4.10.x-
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() · 4296f23e
      Committed by Rafael J. Wysocki
      sugov_start() only initializes struct sugov_cpu per-CPU structures
      for shared policies, but it should do that for single-CPU policies too.
      
      That in particular makes the IO-wait boost mechanism work in the
      cases when cpufreq policies correspond to individual CPUs.
      
      Fixes: 21ca6d2c (cpufreq: schedutil: Add iowait boosting)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
  14. 17 Mar, 2017 1 commit
    • mm: add private lock to serialize memory hotplug operations · 55adc1d0
      Committed by Heiko Carstens
      Commit bfc8c901 ("mem-hotplug: implement get/put_online_mems")
      introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
      in order to allow similar semantics for memory hotplug like for cpu
      hotplug.
      
      The corresponding functions for cpu hotplug are get/put_online_cpus()
      and cpu_hotplug_begin/done().
      
      The commit however failed to introduce functions that would serialize
      memory hotplug operations like they are done for cpu hotplug with
      cpu_maps_update_begin/done().
      
      This basically leaves mem_hotplug.active_writer unprotected and allows
      concurrent writers to modify it, which may lead to problems as outlined
      by commit f931ab47 ("mm: fix devm_memremap_pages crash, use
      mem_hotplug_{begin, done}").
      
      That commit was extended again with commit b5d24fda ("mm,
      devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
      done}") which serializes memory hotplug operations for some call sites
      by using the device_hotplug lock.
      
      In addition with commit 3fc21924 ("mm: validate device_hotplug is held
      for memory hotplug") a sanity check was added to mem_hotplug_begin() to
      verify that the device_hotplug lock is held.
      
      This in turn triggers the following warning on s390:
      
      WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
       Call Trace:
        assert_held_device_hotplug+0x40/0x58)
        mem_hotplug_begin+0x34/0xc8
        add_memory_resource+0x7e/0x1f8
        add_memory+0xda/0x130
        add_memory_merged+0x15c/0x178
        sclp_detect_standby_memory+0x2ae/0x2f8
        do_one_initcall+0xa2/0x150
        kernel_init_freeable+0x228/0x2d8
        kernel_init+0x2a/0x140
        kernel_thread_starter+0x6/0xc
      
      One possible fix would be to add more lock_device_hotplug() and
      unlock_device_hotplug() calls around each call site of
      mem_hotplug_begin/end().  But that would give the device_hotplug lock
      additional semantics it better should not have (serialize memory hotplug
      operations).
      
      Instead add a new memory_add_remove_lock, which has semantics similar
      to cpu_add_remove_lock for cpu hotplug.
      
      To keep things hopefully a bit easier, the lock will be locked and unlocked
      within the mem_hotplug_begin/end() functions.
      
      Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 16 Mar, 2017 17 commits
    • perf/core: Better explain the inherit magic · d8a8cfc7
      Committed by Peter Zijlstra
      While going through the event inheritance code Oleg got confused.
      
      Add some comments to better explain the silent disappearance of
      orphaned events.
      
      So what happens is that at perf_event_release_kernel() time, when an
      event loses its connection to userspace (and ceases to exist from the
      user's perspective), we can still have an arbitrary number of inherited
      copies of the event. We want to synchronously find and remove all
      these child events.
      
      Since that requires a bit of lock juggling, there is the possibility
      that concurrent clone()s will create new child events. Therefore we
      first mark the parent event as DEAD, which marks all the extant child
      events as orphaned.
      
      We then avoid copying orphaned events; in order to avoid getting more
      of them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d8a8cfc7
    • P
      perf/core: Simplify perf_event_free_task() · 15121c78
      Peter Zijlstra committed
      We have ctx->event_list that contains all events; no need to
      repeatedly iterate the group lists to find them all.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      15121c78
    • P
      perf/core: Fix event inheritance on fork() · e7cc4865
      Peter Zijlstra committed
      While hunting for clues to a use-after-free, Oleg spotted that
      perf_event_init_context() can lose an error value with the result
      that fork() can succeed even though we did not fully inherit the perf
      event context.
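      The commit text does not show the code involved, so the following is
      only a sketch of how an error value can get lost in general, assuming
      two consecutive loops that share one return variable. All toy_* names
      and the failure rule are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for "inherit one event group";
 * returns 0 on success or a negative error code. */
typedef int (*toy_inherit_fn)(int group);

/* Lossy shape: "break" leaves only the first loop, so the second loop
 * assigns ret again and can overwrite an earlier failure with 0. */
static int toy_inherit_lossy(const int *pinned, size_t np,
			     const int *flexible, size_t nf,
			     toy_inherit_fn inherit)
{
	int ret = 0;
	size_t i;

	assert(inherit != NULL);
	for (i = 0; i < np; i++) {
		ret = inherit(pinned[i]);
		if (ret)
			break;
	}
	for (i = 0; i < nf; i++) {
		ret = inherit(flexible[i]);
		if (ret)
			break;
	}
	return ret;
}

/* Safe shape: the first failure skips all remaining work. */
static int toy_inherit_safe(const int *pinned, size_t np,
			    const int *flexible, size_t nf,
			    toy_inherit_fn inherit)
{
	int ret = 0;
	size_t i;

	assert(inherit != NULL);
	for (i = 0; i < np; i++) {
		ret = inherit(pinned[i]);
		if (ret)
			goto out;
	}
	for (i = 0; i < nf; i++) {
		ret = inherit(flexible[i]);
		if (ret)
			goto out;
	}
out:
	return ret;
}

/* Hypothetical failure rule for the demo: negative group ids fail. */
static int toy_fail_on_negative(int group)
{
	return group < 0 ? -1 : 0;
}
```

      With pinned groups { 1, -1 } and flexible groups { 2 }, the lossy
      variant reports success while the safe variant reports the failure.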
      Spotted-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: oleg@redhat.com
      Cc: stable@vger.kernel.org
      Fixes: 889ff015 ("perf/core: Split context's event group list into pinned and non-pinned lists")
      Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e7cc4865
    • P
      perf/core: Fix use-after-free in perf_release() · e552a838
      Peter Zijlstra committed
      Dmitry reported that syzkaller tripped a use-after-free in perf_release().
      
      After much puzzlement Oleg spotted the below scenario:
      
        Task1                           Task2
      
        fork()
          perf_event_init_task()
          /* ... */
          goto bad_fork_$foo;
          /* ... */
          perf_event_free_task()
            mutex_lock(ctx->lock)
            perf_free_event(B)
      
                                        perf_event_release_kernel(A)
                                          mutex_lock(A->child_mutex)
                                          list_for_each_entry(child, ...) {
                                            /* child == B */
                                            ctx = B->ctx;
                                            get_ctx(ctx);
                                            mutex_unlock(A->child_mutex);
      
              mutex_lock(A->child_mutex)
              list_del_init(B->child_list)
              mutex_unlock(A->child_mutex)
      
              /* ... */
      
            mutex_unlock(ctx->lock);
            put_ctx() /* >0 */
          free_task();
                                            mutex_lock(ctx->lock);
                                            mutex_lock(A->child_mutex);
                                            /* ... */
                                            mutex_unlock(A->child_mutex);
                                            mutex_unlock(ctx->lock)
                                            put_ctx() /* 0 */
                                              ctx->task && !TOMBSTONE
                                                put_task_struct() /* UAF */
      
      This patch closes the hole by making perf_event_free_task() destroy the
      task <-> ctx relation such that perf_event_release_kernel() will no longer
      observe the now dead task.
      Spotted-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: fweisbec@gmail.com
      Cc: oleg@redhat.com
      Cc: stable@vger.kernel.org
      Fixes: c6e5b732 ("perf: Synchronously clean up child events")
      Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
      Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e552a838
    • P
      sched/core: Avoid double update_rq_clock() in move_queued_task() · 15ff991e
      Peter Zijlstra committed
      Address this case:
      
        WARNING: CPU: 0 PID: 2070 at ../kernel/sched/core.c:109 update_rq_clock+0x74/0x80
        rq->clock_update_flags & RQCF_UPDATED
      
        Call Trace:
        update_rq_clock()
        move_queued_task()
        __set_cpus_allowed_ptr()
        ...
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      15ff991e
    • P
      sched/core: Fix double update_rq_clock() calls in attach_task()/detach_task() · 5704ac0a
      Peter Zijlstra committed
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5704ac0a
    • P
      sched/core: Avoid obvious double update_rq_clock() · 7a57f32a
      Peter Zijlstra committed
      Add DEQUEUE_NOCLOCK to all places where we just did an
      update_rq_clock() already.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7a57f32a
    • P
      sched/core: Simplify update_rq_clock() in __schedule() · bce4dc80
      Peter Zijlstra committed
      Instead of relying on deactivate_task() to call update_rq_clock() and
      handling the case where it didn't happen (task_on_rq_queued),
      unconditionally do update_rq_clock() and skip any further updates.
      
      This also avoids a double update on deactivate_task() + ttwu_local().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bce4dc80
    • P
      sched/core: Make sched_ttwu_pending() atomic in time · 77558e4d
      Peter Zijlstra committed
      Since all tasks on the wake_list are woken under a single rq->lock
      avoid calling update_rq_clock() for each task.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      77558e4d
    • P
      sched/core: Add ENQUEUE_NOCLOCK to ENQUEUE_RESTORE · 7134b3e9
      Peter Zijlstra committed
      In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because
      DEQUEUE_SAVE will have done an update_rq_clock().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7134b3e9
    • P
      sched/core: Add {EN,DE}QUEUE_NOCLOCK flags · 0a67d1ee
      Peter Zijlstra committed
      Currently {en,de}queue_task() do an unconditional update_rq_clock().
      However since we want to avoid duplicate updates, so that each
      rq->lock section appears atomic in time, we need to be able to skip
      these clock updates.
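      As a rough userspace sketch of the idea (not the kernel code), a
      NOCLOCK flag can gate the clock update inside the enqueue and dequeue
      helpers. The toy_rq structure, the toy_* functions and the flag values
      below are assumptions made for illustration only.

```c
#include <assert.h>

/* Hypothetical flag values; the real kernel values are not given here. */
#define TOY_ENQUEUE_NOCLOCK	0x01
#define TOY_DEQUEUE_NOCLOCK	0x01

/* Toy run queue: a clock, a counter of clock updates, and a task count. */
struct toy_rq {
	unsigned long long clock;
	unsigned int clock_updates;
	unsigned int nr_running;
};

static void toy_update_rq_clock(struct toy_rq *rq, unsigned long long now)
{
	rq->clock = now;
	rq->clock_updates++;
}

/* Enqueue a task; skip the clock update when the caller already did it. */
static void toy_enqueue_task(struct toy_rq *rq, unsigned long long now,
			     int flags)
{
	if (!(flags & TOY_ENQUEUE_NOCLOCK))
		toy_update_rq_clock(rq, now);
	rq->nr_running++;
}

/* Dequeue a task; same rule for the clock update. */
static void toy_dequeue_task(struct toy_rq *rq, unsigned long long now,
			     int flags)
{
	if (!(flags & TOY_DEQUEUE_NOCLOCK))
		toy_update_rq_clock(rq, now);
	assert(rq->nr_running > 0);
	rq->nr_running--;
}
```

      A caller that has just updated the clock passes the NOCLOCK flag to
      every following enqueue or dequeue in the same locked section, so the
      clock is read once and the whole section sees a single point in time.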
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0a67d1ee
    • P
      sched/core: Add rq->lock wrappers · 8a8c69c3
      Peter Zijlstra committed
      The missing update_rq_clock() check can work with partial rq->lock
      wrappery, since a missing wrapper can cause the warning to not be
      emitted when it should have, but cannot cause the warning to trigger
      when it should not have.
      
      The duplicate update_rq_clock() check however can cause false warnings
      to trigger. Therefore add more comprehensive rq->lock wrappery.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8a8c69c3
    • P
      sched/core: Add WARNING for multiple update_rq_clock() calls · 26ae58d2
      Peter Zijlstra committed
      Now that we have no missing calls, add a warning to find multiple
      calls.
      
      By having only a single update_rq_clock() call per rq-lock section,
      the section appears 'atomic' wrt time.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      26ae58d2
    • S
      sched/rt: Add comments describing the RT IPI pull method · 3e777f99
      Steven Rostedt (VMware) committed
      While looking into optimizations for the RT scheduler IPI logic, I realized
      that the existing comments do not describe it adequately. It deserves a
      lengthy description of its design.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
      [ Small typographical edits. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3e777f99
    • S
      sched/deadline: Use deadline instead of period when calculating overflow · 2317d5f1
      Steven Rostedt (VMware) committed
      I was testing Daniel's changes with his test case, and tweaked it a
      little. Instead of having the runtime equal to the deadline, I
      increased the deadline tenfold.
      
      Daniel's test case had:
      
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      To make it more interesting, I changed it to:
      
      	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
      	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */
      
      The results were rather surprising. The behavior that Daniel's patch
      was fixing came back. The task started using much more than .1% of the
      CPU. More like 20%.
      
      Looking into this I found that it was due to the dl_entity_overflow()
      constantly returning true. That's because it uses the relative period
      against relative runtime vs the absolute deadline against absolute
      runtime.
      
        runtime / (deadline - t) > dl_runtime / dl_period
      
      There's even a comment mentioning this, and saying that when relative
      deadline equals relative period, that the equation is the same as using
      deadline instead of period. That comment is backwards! What we really
      want is:
      
        runtime / (deadline - t) > dl_runtime / dl_deadline
      
      We care about if the runtime can make its deadline, not its period. And
      then we can say "when the deadline equals the period, the equation is
      the same as using dl_period instead of dl_deadline".
      
      After correcting this, now when the task gets enqueued, it can throttle
      correctly, and Daniel's fix to the throttling of sleeping deadline
      tasks works even when the runtime and deadline are not the same.
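      The two inequalities above can be compared with a small userspace
      sketch. It cross-multiplies to avoid division and assumes the products
      fit in 64 bits; any scaling the kernel may apply to avoid overflow is
      left out. The toy_dl structure and toy_* functions are hypothetical
      names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy deadline entity; all times in nanoseconds. */
struct toy_dl {
	uint64_t dl_runtime;	/* relative runtime */
	uint64_t dl_deadline;	/* relative deadline */
	uint64_t dl_period;	/* relative period */
	uint64_t runtime;	/* remaining runtime */
	uint64_t deadline;	/* absolute deadline */
};

/*
 * runtime / (deadline - t) > dl_runtime / rel, cross-multiplied.
 * "rel" is either the relative period or the relative deadline.
 */
static bool toy_overflow(const struct toy_dl *se, uint64_t t, uint64_t rel)
{
	assert(se->deadline > t);
	return se->runtime * rel > se->dl_runtime * (se->deadline - t);
}

/* Variant described as wrong: compares against dl_runtime / dl_period. */
static bool toy_overflow_period(const struct toy_dl *se, uint64_t t)
{
	return toy_overflow(se, t, se->dl_period);
}

/* Variant described as wanted: compares against dl_runtime / dl_deadline. */
static bool toy_overflow_deadline(const struct toy_dl *se, uint64_t t)
{
	return toy_overflow(se, t, se->dl_deadline);
}
```

      With the 2 ms / 20 ms / 2 s parameters, a task that has 1 ms of
      runtime left and 10 ms until its absolute deadline needs 1/10 of the
      CPU, exactly its 2/20 reservation. The deadline variant reports no
      overflow, while the period variant compares against 2/2000 and
      reports an overflow.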
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2317d5f1
    • D
      sched/deadline: Throttle a constrained deadline task activated after the deadline · df8eac8c
      Daniel Bristot de Oliveira committed
      During the activation, CBS checks if it can reuse the current task's
      runtime and period. If the deadline of the task is in the past, CBS
      cannot use the runtime, and so it replenishes the task. This rule
      works fine for implicit deadline tasks (deadline == period), and the
      CBS was designed for implicit deadline tasks. However, a task with
      constrained deadline (deadline < period) might be awakened after the
      deadline, but before the next period. In this case, replenishing the
      task would allow it to run for runtime / deadline. As in this case
      deadline < period, CBS enables a task to run for more than the
      runtime / period. In a very loaded system, this can cause a domino
      effect, making other tasks miss their deadlines.
      
      To avoid this problem, in the activation of a constrained deadline
      task after the deadline but before the next period, throttle the
      task and set the replenishing timer to the beginning of the next period,
      unless it is boosted.
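      The rule can be written down as a small predicate. This is a sketch
      under the assumption that the start of the next period is the absolute
      deadline minus the relative deadline plus the period; the toy_dl_se
      structure and function names are hypothetical, and arming the timer
      is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy deadline entity; all times share one unit (e.g. milliseconds). */
struct toy_dl_se {
	uint64_t dl_deadline;	/* relative deadline */
	uint64_t dl_period;	/* relative period */
	uint64_t deadline;	/* absolute deadline of the current instance */
	bool boosted;		/* priority-boosted tasks are not throttled */
};

/* Start of the next period, derived from the current absolute deadline. */
static uint64_t toy_next_period(const struct toy_dl_se *se)
{
	assert(se->deadline >= se->dl_deadline);
	return se->deadline - se->dl_deadline + se->dl_period;
}

/*
 * Throttle on activation only for a constrained deadline task that wakes
 * up after its deadline but before its next period, and is not boosted.
 */
static bool toy_should_throttle(const struct toy_dl_se *se, uint64_t now)
{
	if (se->dl_deadline >= se->dl_period)
		return false;	/* implicit deadline: existing rule applies */
	if (se->boosted)
		return false;
	return now > se->deadline && now < toy_next_period(se);
}
```

      For the 2 ms deadline and 2 s period used in this message, a wakeup
      at 4 ms falls between the deadline (2 ms) and the next period
      (2000 ms), so the sketch throttles the task until the next period.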
      
      Reproducer:
      
       --------------- %< ---------------
        int main (int argc, char **argv)
        {
      	int ret;
      	int flags = 0;
      	unsigned long l = 0;
      	struct timespec ts;
      	struct sched_attr attr;
      
      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      
      	attr.sched_policy   = SCHED_DEADLINE;
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      	ts.tv_sec = 0;
      	ts.tv_nsec = 2000 * 1000;			/* 2 ms */
      
      	ret = sched_setattr(0, &attr, flags);
      
      	if (ret < 0) {
      		perror("sched_setattr");
      		exit(-1);
      	}
      
      	for(;;) {
      		/* XXX: you may need to adjust the loop */
      		for (l = 0; l < 150000; l++);
      		/*
      		 * The idea is to go to sleep right before the deadline
      		 * and then wake up before the next period to receive
      		 * a new replenishment.
      		 */
      		nanosleep(&ts, NULL);
      	}
      
      	exit(0);
        }
        --------------- >% ---------------
      
      On my box, this reproducer uses almost 50% of the CPU time, which is
      obviously wrong for a task with 2/2000 reservation.
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      df8eac8c
    • D
      sched/deadline: Make sure the replenishment timer fires in the next period · 5ac69d37
      Daniel Bristot de Oliveira committed
      Currently, the replenishment timer is set to fire at the deadline
      of a task. Although that works for implicit deadline tasks because the
      deadline is equal to the beginning of the next period, that is not correct
      for constrained deadline tasks (deadline < period).
      
      For instance:
      
      f.c:
       --------------- %< ---------------
      int main (void)
      {
      	for(;;);
      }
       --------------- >% ---------------
      
        # gcc -o f f.c
      
        # trace-cmd record -e sched:sched_switch                              \
      				   -e syscalls:sys_exit_sched_setattr   \
         chrt -d --sched-runtime  490000000					\
                 --sched-deadline 500000000					\
      	   --sched-period  1000000000 0 ./f
      
        # trace-cmd report | grep "{pid of ./f}"
      
      After setting parameters, the task is replenished and continues running
      until being throttled:
      
               f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
      
      The task is throttled after running 492318 us, as expected:
      
               f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]
      
      But then, the task is replenished 500719 us after the first
      replenishment:
      
          <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]
      
      Running for 490277 us:
      
               f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]
      
      Hence, in the first period, the task runs 2 * runtime, and that is a bug.
      
      During the first replenishment, the next deadline is set one period away.
      So the runtime / period starts to be respected. However, as the second
      replenishment took place at the wrong instant, the next replenishment
      will also take place at the wrong instant. Rather than occurring
      the nth period away from the first activation, it takes place
      at (nth period - relative deadline).
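      The effect on the first period can be reproduced with simple
      arithmetic. The sketch below assumes a busy-looping task that is alone
      on its CPU and uses whole milliseconds; the toy_params structure and
      the function names are hypothetical and do not model the real timer
      code.

```c
#include <assert.h>
#include <stdint.h>

/* Relative scheduling parameters, in milliseconds. */
struct toy_params {
	uint64_t runtime;
	uint64_t deadline;
	uint64_t period;
};

/* Behaviour described as wrong: the timer fires at the absolute deadline. */
static uint64_t toy_timer_at_deadline(const struct toy_params *p,
				      uint64_t activation)
{
	return activation + p->deadline;
}

/* Intended behaviour: the timer fires at the start of the next period. */
static uint64_t toy_timer_at_next_period(const struct toy_params *p,
					 uint64_t activation)
{
	return activation + p->period;
}

/*
 * CPU time a busy loop gets inside the first period when the
 * replenishment fires at "timer": the first budget, plus whatever part
 * of the second budget fits before the period ends.
 */
static uint64_t toy_busy_time_first_period(const struct toy_params *p,
					   uint64_t activation, uint64_t timer)
{
	uint64_t end = activation + p->period;
	uint64_t used = p->runtime;

	/* The timer cannot fire before the first budget is consumed. */
	assert(timer >= activation + p->runtime);
	if (timer < end) {
		uint64_t left = end - timer;

		used += left < p->runtime ? left : p->runtime;
	}
	return used;
}
```

      With runtime 490, deadline 500 and period 1000, a timer at the
      deadline lets the task run 980 ms in its first period, twice the
      runtime, while a timer at the next period keeps it at 490 ms.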
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5ac69d37