1. 15 7月, 2016 1 次提交
  2. 13 7月, 2016 2 次提交
  3. 11 7月, 2016 3 次提交
  4. 09 7月, 2016 3 次提交
  5. 07 7月, 2016 2 次提交
    • W
      locking/static_keys: Fix non static symbol Sparse warning · 885885f6
      Wei Yongjun 提交于
      Fix the following sparse warnings:
      
        kernel/jump_label.c:473:23: warning:
         symbol 'jump_label_module_nb' was not declared. Should it be static?
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1466183980-8903-1-git-send-email-weiyj_lk@163.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      885885f6
    • M
      perf/core: Fix pmu::filter_match for SW-led groups · 2c81a647
      Mark Rutland 提交于
      The following commit:
      
        66eb579e ("perf: allow for PMU-specific event filtering")
      
      added the pmu::filter_match() callback. This was intended to
      avoid HW constraints on events from resulting in extremely
      pessimistic scheduling.
      
      However, pmu::filter_match() is only called for the leader of each event
      group. When the leader is a SW event, we do not filter the groups, and
      may fail at pmu::add() time, and when this happens we'll give up on
      scheduling any event groups later in the list until they are rotated
      ahead of the failing group.
      
      This can result in extremely sub-optimal event scheduling behaviour,
      e.g. if running the following on a big.LITTLE platform:
      
      $ taskset -c 0 ./perf stat \
       -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
       -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
       ls
      
           <not counted>      context-switches                                              (0.00%)
           <not counted>      armv8_cortex_a57/config=0x11/                                 (0.00%)
                      24      context-switches                                              (37.36%)
                57589154      armv8_cortex_a53/config=0x11/                                 (37.36%)
      
      Here the 'a53' event group was always eligible to be scheduled, but
      the 'a57' group never eligible to be scheduled, as the task was always
      affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
      group was eligible, but the HW event failed at pmu::add() time,
      resulting in ctx_flexible_sched_in giving up on scheduling further
      groups with HW events.
      
      One way of avoiding this is to check pmu::filter_match() on siblings
      as well as the group leader. If any of these fail their
      pmu::filter_match() call, we must skip the entire group before
      attempting to add any events.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: 66eb579e ("perf: allow for PMU-specific event filtering")
      Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
      [ Small readability edits. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2c81a647
  6. 01 7月, 2016 1 次提交
  7. 29 6月, 2016 3 次提交
  8. 27 6月, 2016 11 次提交
    • P
      sched/fair: Rework throttle_count sync · 55e16d30
      Peter Zijlstra 提交于
      Since we already take rq->lock when creating a cgroup, use it to also
      sync the throttle_count and avoid the extra state and enqueue path
      branch.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      [ Fixed build warning. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      55e16d30
    • Z
      sched/core: Fix sched_getaffinity() return value kerneldoc comment · 599b4840
      Zev Weiss 提交于
      Previous version was probably written referencing the man page for
      glibc's wrapper, but the wrapper's behavior differs from that of the
      syscall itself in this case.
      Signed-off-by: NZev Weiss <zev@bewilderbeest.net>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      599b4840
    • P
      sched/fair: Reorder cgroup creation code · 8663e24d
      Peter Zijlstra 提交于
      A future patch needs rq->lock held _after_ we link the task_group into
      the hierarchy. In order to avoid taking every rq->lock twice, reorder
      things a little and create online_fair_sched_group() to be called
      after we link the task_group.
      
      All this code is still ran from css_alloc() so css_online() isn't in
      fact used for this.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8663e24d
    • P
      sched/fair: Apply more PELT fixes · 3d30544f
      Peter Zijlstra 提交于
      One additional 'rule' for using update_cfs_rq_load_avg() is that one
      should call update_tg_load_avg() if it returns true.
      
      Add a bunch of comments to hopefully clarify some of the rules:
      
       o  You need to update cfs_rq _before_ any entity attach/detach,
          this is important, because while for mathmatical consisency this
          isn't strictly needed, it is required for the physical
          interpretation of the model, you attach/detach _now_.
      
       o  When you modify the cfs_rq avg, you have to then call
          update_tg_load_avg() in order to propagate changes upwards.
      
       o  (Fair) entities are always attached, switched_{to,from}_fair()
          deal with !fair. This directly follows from the definition of the
          cfs_rq averages, namely that they are a direct sum of all
          (runnable or blocked) entities on that rq.
      
      It is the second rule that this patch enforces, but it adds comments
      pertaining to all of them.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      3d30544f
    • P
      sched/fair: Fix PELT integrity for new tasks · 7dc603c9
      Peter Zijlstra 提交于
      Vincent and Yuyang found another few scenarios in which entity
      tracking goes wobbly.
      
      The scenarios are basically due to the fact that new tasks are not
      immediately attached and thereby differ from the normal situation -- a
      task is always attached to a cfs_rq load average (such that it
      includes its blocked contribution) and are explicitly
      detached/attached on migration to another cfs_rq.
      
      Scenario 1: switch to fair class
      
        p->sched_class = fair_class;
        if (queued)
          enqueue_task(p);
            ...
              enqueue_entity()
      	  enqueue_entity_load_avg()
      	    migrated = !sa->last_update_time (true)
      	    if (migrated)
      	      attach_entity_load_avg()
        check_class_changed()
          switched_from() (!fair)
          switched_to()   (fair)
            switched_to_fair()
              attach_entity_load_avg()
      
      If @p is a new task that hasn't been fair before, it will have
      !last_update_time and, per the above, end up in
      attach_entity_load_avg() _twice_.
      
      Scenario 2: change between cgroups
      
        sched_move_group(p)
          if (queued)
            dequeue_task()
          task_move_group_fair()
            detach_task_cfs_rq()
              detach_entity_load_avg()
            set_task_rq()
            attach_task_cfs_rq()
              attach_entity_load_avg()
          if (queued)
            enqueue_task();
              ...
                enqueue_entity()
      	    enqueue_entity_load_avg()
      	      migrated = !sa->last_update_time (true)
      	      if (migrated)
      	        attach_entity_load_avg()
      
      Similar as with scenario 1, if @p is a new task, it will have
      !load_update_time and we'll end up in attach_entity_load_avg()
      _twice_.
      
      Furthermore, notice how we do a detach_entity_load_avg() on something
      that wasn't attached to begin with.
      
      As stated above; the problem is that the new task isn't yet attached
      to the load tracking and thereby violates the invariant assumption.
      
      This patch remedies this by ensuring a new task is indeed properly
      attached to the load tracking on creation, through
      post_init_entity_util_avg().
      
      Of course, this isn't entirely as straightforward as one might think,
      since the task is hashed before we call wake_up_new_task() and thus
      can be poked at. We avoid this by adding TASK_NEW and teaching
      cpu_cgroup_can_attach() to refuse such tasks.
      Reported-by: NYuyang Du <yuyang.du@intel.com>
      Reported-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7dc603c9
    • V
      sched/cgroup: Fix cpu_cgroup_fork() handling · ea86cb4b
      Vincent Guittot 提交于
      A new fair task is detached and attached from/to task_group with:
      
        cgroup_post_fork()
          ss->fork(child) := cpu_cgroup_fork()
            sched_move_task()
              task_move_group_fair()
      
      Which is wrong, because at this point in fork() the task isn't fully
      initialized and it cannot 'move' to another group, because its not
      attached to any group as yet.
      
      In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we
      can just call this small part directly instead sched_move_task(). And
      the task doesn't really migrate because it is not yet attached so we
      need the following sequence:
      
        do_fork()
          sched_fork()
            __set_task_cpu()
      
          cgroup_post_fork()
            set_task_rq() # set task group and runqueue
      
          wake_up_new_task()
            select_task_rq() can select a new cpu
            __set_task_cpu
            post_init_entity_util_avg
              attach_task_cfs_rq()
            activate_task
              enqueue_task
      
      This patch makes that happen.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      [ Added TASK_SET_GROUP to set depth properly. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ea86cb4b
    • P
      sched/fair: Fix PELT integrity for new groups · 01011473
      Peter Zijlstra 提交于
      Vincent reported that when a new task is moved into a new cgroup it
      gets attached twice to the load tracking:
      
        sched_move_task()
          task_move_group_fair()
            detach_task_cfs_rq()
            set_task_rq()
            attach_task_cfs_rq()
              attach_entity_load_avg()
                se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0
      
        enqueue_entity()
          enqueue_entity_load_avg()
            update_cfs_rq_load_avg()
              now = clock()
              __update_load_avg(&cfs_rq->avg)
                cfs_rq->avg.last_load_update = now
                // ages load/util for: now - 0, load/util -> 0
            if (migrated)
              attach_entity_load_avg()
                se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0
      
      The problem is that we don't update cfs_rq load_avg before all
      entity attach/detach operations. Only enqueue_task() and migrate_task()
      do this.
      
      By fixing this, the above will not happen, because the
      sched_move_task() attach will have updated cfs_rq's last_load_update
      time before attach, and in turn the attach will have set the entity's
      last_load_update stamp.
      
      Note that there is a further problem with sched_move_task() calling
      detach on a task that hasn't yet been attached; this will be taken
      care of in a subsequent patch.
      Reported-by: NVincent Guittot <vincent.guittot@linaro.org>
      Tested-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      01011473
    • P
      sched/fair: Fix and optimize the fork() path · e210bffd
      Peter Zijlstra 提交于
      The task_fork_fair() callback already calls __set_task_cpu() and takes
      rq->lock.
      
      If we move the sched_class::task_fork callback in sched_fork() under
      the existing p->pi_lock, right after its set_task_cpu() call, we can
      avoid doing two such calls and omit the IRQ disabling on the rq->lock.
      
      Change to __set_task_cpu() to skip the migration bits, this is a new
      task, not a migration. Similarly, make wake_up_new_task() use
      __set_task_cpu() for the same reason, the task hasn't actually
      migrated as it hasn't ever ran.
      
      This cures the problem of calling migrate_task_rq_fair(), which does
      remove_entity_from_load_avg() on tasks that have never been added to
      the load avg to begin with.
      
      This bug would result in transiently messed up load_avg values, averaged
      out after a few dozen milliseconds. This is probably the reason why
      this bug was not found for such a long time.
      Reported-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      e210bffd
    • P
      locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec() · 0dceeaf5
      Pan Xinhui 提交于
      queued_spin_lock_slowpath() should not worry about another
      queued_spin_lock_slowpath() running in interrupt context and
      changing node->count by accident, because node->count keeps
      the same value every time we enter/leave queued_spin_lock_slowpath().
      
      On some architectures this_cpu_dec() will save/restore irq flags,
      which has high overhead. Use the much cheaper __this_cpu_dec() instead.
      Signed-off-by: NPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman.Long@hpe.com
      Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
      [ Rewrote changelog. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0dceeaf5
    • P
      sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion · ea1dc6fc
      Peter Zijlstra 提交于
      Commit:
      
        fde7d22e ("sched/fair: Fix overly small weight for interactive group entities")
      
      did something non-obvious but also did it buggy yet latent.
      
      The problem was exposed for real by a later commit in the v4.7 merge window:
      
        2159197d ("sched/core: Enable increased load resolution on 64-bit kernels")
      
      ... after which tg->load_avg and cfs_rq->load.weight had different
      units (10 bit fixed point and 20 bit fixed point resp.).
      
      Add a comment to explain the use of cfs_rq->load.weight over the
      'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
      for the difference in unit.
      
      Since this is (now, as per a previous commit) the only user of
      calc_tg_weight(), collapse it.
      
      The effects of this bug should be randomly inconsistent SMP-balancing
      of cgroups workloads.
      Reported-by: NJirka Hladky <jhladky@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 2159197d ("sched/core: Enable increased load resolution on 64-bit kernels")
      Fixes: fde7d22e ("sched/fair: Fix overly small weight for interactive group entities")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ea1dc6fc
    • P
      sched/fair: Fix effective_load() to consistently use smoothed load · 7dd49125
      Peter Zijlstra 提交于
      Starting with the following commit:
      
        fde7d22e ("sched/fair: Fix overly small weight for interactive group entities")
      
      calc_tg_weight() doesn't compute the right value as expected by effective_load().
      
      The difference is in the 'correction' term. In order to ensure \Sum
      rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
      lagging a correction on the current cfs_rq->avg.load_avg value.
      Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
      cfs_rq->avg.load_avg.
      
      Now, per the referenced commit, calc_tg_weight() doesn't use
      cfs_rq->avg.load_avg, as is later used in @w, but uses
      cfs_rq->load.weight instead.
      
      So stop using calc_tg_weight() and do it explicitly.
      
      The effects of this bug are wake_affine() making randomly
      poor choices in cgroup-intense workloads.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # v4.3+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: fde7d22e ("sched/fair: Fix overly small weight for interactive group entities")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7dd49125
  9. 25 6月, 2016 3 次提交
    • M
      Fix build break in fork.c when THREAD_SIZE < PAGE_SIZE · 9521d399
      Michael Ellerman 提交于
      Commit b235beea ("Clarify naming of thread info/stack allocators")
      breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:
      
        kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
        kernel/fork.c:355:8: error: assignment from incompatible pointer type
          stack = alloc_thread_stack_node(tsk, node);
          ^
      
      Fix it by renaming free_stack() to free_thread_stack(), and updating the
      return type of alloc_thread_stack_node().
      
      Fixes: b235beea ("Clarify naming of thread info/stack allocators")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9521d399
    • M
      oom, suspend: fix oom_reaper vs. oom_killer_disable race · 74070542
      Michal Hocko 提交于
      Tetsuo has reported the following potential oom_killer_disable vs.
      oom_reaper race:
      
       (1) freeze_processes() starts freezing user space threads.
       (2) Somebody (maybe a kenrel thread) calls out_of_memory().
       (3) The OOM killer calls mark_oom_victim() on a user space thread
           P1 which is already in __refrigerator().
       (4) oom_killer_disable() sets oom_killer_disabled = true.
       (5) P1 leaves __refrigerator() and enters do_exit().
       (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
           exit_oom_victim(P1).
       (7) oom_killer_disable() returns while P1 not yet finished
       (8) P1 perform IO/interfere with the freezer.
      
      This situation is unfortunate.  We cannot move oom_killer_disable after
      all the freezable kernel threads are frozen because the oom victim might
      depend on some of those kthreads to make a forward progress to exit so
      we could deadlock.  It is also far from trivial to teach the oom_reaper
      to not call exit_oom_victim() because then we would lose a guarantee of
      the OOM killer and oom_killer_disable forward progress because
      exit_mm->mmput might block and never call exit_oom_victim.
      
      It seems the easiest way forward is to workaround this race by calling
      try_to_freeze_tasks again after oom_killer_disable.  This will make sure
      that all the tasks are frozen or it bails out.
      
      Fixes: 449d777d ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
      Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74070542
    • L
      Clarify naming of thread info/stack allocators · b235beea
      Linus Torvalds 提交于
      We've had the thread info allocated together with the thread stack for
      most architectures for a long time (since the thread_info was split off
      from the task struct), but that is about to change.
      
      But the patches that move the thread info to be off-stack (and a part of
      the task struct instead) made it clear how confused the allocator and
      freeing functions are.
      
      Because the common case was that we share an allocation with the thread
      stack and the thread_info, the two pointers were identical.  That
      identity then meant that we would have things like
      
      	ti = alloc_thread_info_node(tsk, node);
      	...
      	tsk->stack = ti;
      
      which certainly _worked_ (since stack and thread_info have the same
      value), but is rather confusing: why are we assigning a thread_info to
      the stack? And if we move the thread_info away, the "confusing" code
      just gets to be entirely bogus.
      
      So remove all this confusion, and make it clear that we are doing the
      stack allocation by renaming and clarifying the function names to be
      about the stack.  The fact that the thread_info then shares the
      allocation is an implementation detail, and not really about the
      allocation itself.
      
      This is a pure renaming and type fix: we pass in the same pointer, it's
      just that we clarify what the pointer means.
      
      The ia64 code that actually only has one single allocation (for all of
      task_struct, thread_info and kernel thread stack) now looks a bit odd,
      but since "tsk->stack" is actually not even used there, that oddity
      doesn't matter.  It would be a separate thing to clean that up, I
      intentionally left the ia64 changes as a pure brute-force renaming and
      type change.
      Acked-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b235beea
  10. 24 6月, 2016 6 次提交
    • T
      sched/core: Allow kthreads to fall back to online && !active cpus · feb245e3
      Tejun Heo 提交于
      During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
      online but not active.  A CPU_ONLINE callback may create or bind a
      kthread so that its cpus_allowed mask only allows the CPU which is
      being brought online.  The kthread may start executing before the CPU
      is made active and can end up in select_fallback_rq().
      
      In such cases, the expected behavior is selecting the CPU which is
      coming online; however, because select_fallback_rq() only chooses from
      active CPUs, it determines that the task doesn't have any viable CPU
      in its allowed mask and ends up overriding it to cpu_possible_mask.
      
      CPU_ONLINE callbacks should be able to put kthreads on the CPU which
      is coming online.  Update select_fallback_rq() so that it follows
      cpu_online() rather than cpu_active() for kthreads.
      Reported-by: NGautham R Shenoy <ego@linux.vnet.ibm.com>
      Tested-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      feb245e3
    • K
      sched/fair: Do not announce throttled next buddy in dequeue_task_fair() · 754bd598
      Konstantin Khlebnikov 提交于
      Hierarchy could be already throttled at this point. Throttled next
      buddy could trigger a NULL pointer dereference in pick_next_task_fair().
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NBen Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzzSigned-off-by: NIngo Molnar <mingo@kernel.org>
      754bd598
    • K
      sched/fair: Initialize throttle_count for new task-groups lazily · 094f4691
      Konstantin Khlebnikov 提交于
      Cgroup created inside throttled group must inherit current throttle_count.
      Broken throttle_count allows to nominate throttled entries as a next buddy,
      later this leads to null pointer dereference in pick_next_task_fair().
      
      This patch initialize cfs_rq->throttle_count at first enqueue: laziness
      allows to skip locking all rq at group creation. Lazy approach also allows
      to skip full sub-tree scan at throttling hierarchy (not in this patch).
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzzSigned-off-by: NIngo Molnar <mingo@kernel.org>
      094f4691
    • P
      locking/static_key: Fix concurrent static_key_slow_inc() · 4c5ea0a9
      Paolo Bonzini 提交于
      The following scenario is possible:
      
          CPU 1                                   CPU 2
          static_key_slow_inc()
           atomic_inc_not_zero()
            -> key.enabled == 0, no increment
           jump_label_lock()
           atomic_inc_return()
            -> key.enabled == 1 now
                                                  static_key_slow_inc()
                                                   atomic_inc_not_zero()
                                                    -> key.enabled == 1, inc to 2
                                                   return
                                                  ** static key is wrong!
           jump_label_update()
           jump_label_unlock()
      
      Testing the static key at the point marked by (**) will follow the
      wrong path for jumps that have not been patched yet.  This can
      actually happen when creating many KVM virtual machines with userspace
      LAPIC emulation; just run several copies of the following program:
      
          #include <fcntl.h>
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>
      
          int main(void)
          {
              for (;;) {
                  int kvmfd = open("/dev/kvm", O_RDONLY);
                  int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
                  close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
                  close(vmfd);
                  close(kvmfd);
              }
              return 0;
          }
      
      Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
      The static key's purpose is to skip NULL pointer checks and indeed one
      of the processes eventually dereferences NULL.
      
      As explained in the commit that introduced the bug:
      
        706249c2 ("locking/static_keys: Rework update logic")
      
      jump_label_update() needs key.enabled to be true.  The solution adopted
      here is to temporarily make key.enabled == -1, and use go down the
      slow path when key.enabled <= 0.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # v4.3+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 706249c2 ("locking/static_keys: Rework update logic")
      Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
      [ Small stylistic edits to the changelog and the code. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      4c5ea0a9
    • D
      cgroup: Disable IRQs while holding css_set_lock · 82d6489d
      Daniel Bristot de Oliveira 提交于
      While testing the deadline scheduler + cgroup setup I hit this
      warning.
      
      [  132.612935] ------------[ cut here ]------------
      [  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
      [  132.612952] Modules linked in: (a ton of modules...)
      [  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
      [  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
      [  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
      [  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
      [  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
      [  132.612986] Call Trace:
      [  132.612987]  <IRQ>  [<ffffffff813d229e>] dump_stack+0x63/0x85
      [  132.612994]  [<ffffffff810a652b>] __warn+0xcb/0xf0
      [  132.612997]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.612999]  [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20
      [  132.613000]  [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80
      [  132.613008]  [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20
      [  132.613010]  [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10
      [  132.613015]  [<ffffffff811388ac>] put_css_set+0x5c/0x60
      [  132.613016]  [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0
      [  132.613017]  [<ffffffff810a3912>] __put_task_struct+0x42/0x140
      [  132.613018]  [<ffffffff810e776a>] dl_task_timer+0xca/0x250
      [  132.613027]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.613030]  [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270
      [  132.613031]  [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190
      [  132.613034]  [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60
      [  132.613035]  [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50
      [  132.613037]  [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0
      [  132.613038]  <EOI>  [<ffffffff81063466>] ? native_safe_halt+0x6/0x10
      [  132.613043]  [<ffffffff81037a4e>] default_idle+0x1e/0xd0
      [  132.613044]  [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20
      [  132.613046]  [<ffffffff810e8fda>] default_idle_call+0x2a/0x40
      [  132.613047]  [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340
      [  132.613048]  [<ffffffff81050235>] start_secondary+0x155/0x190
      [  132.613049] ---[ end trace f91934d162ce9977 ]---
      
      The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt
      context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid
      this problem - and other problems of sharing a spinlock with an
      interrupt.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: cgroups@vger.kernel.org
      Cc: stable@vger.kernel.org # 4.5+
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: N"Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      82d6489d
    • L
      locking: avoid passing around 'thread_info' in mutex debugging code · 6720a305
      Linus Torvalds 提交于
      None of the code actually wants a thread_info, it all wants a
      task_struct, and it's just converting back and forth between the two
      ("ti->task" to get the task_struct from the thread_info, and
      "task_thread_info(task)" to go the other way).
      
      No semantic change.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6720a305
  11. 20 6月, 2016 2 次提交
    • S
      tracing: Handle NULL formats in hold_module_trace_bprintk_format() · 70c8217a
      Steven Rostedt (Red Hat) 提交于
      If a task uses a non constant string for the format parameter in
      trace_printk(), then the trace_printk_fmt variable is set to NULL. This
      variable is then saved in the __trace_printk_fmt section.
      
      The function hold_module_trace_bprintk_format() checks to see if duplicate
      formats are used by modules, and reuses them if so (saves them to the list
      if it is new). But this function calls lookup_format() that does a strcmp()
      to the value (which is now NULL) and can cause a kernel oops.
      
      This wasn't an issue till 3debb0a9 ("tracing: Fix trace_printk() to print
      when not using bprintk()") which added "__used" to the trace_printk_fmt
      variable, and before that, the kernel simply optimized it out (no NULL value
      was saved).
      
      The fix is simply to handle the NULL pointer in lookup_format() and have the
      caller ignore the value if it was NULL.
      
      Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.comReported-by: Nxingzhen <zhengjun.xing@intel.com>
      Acked-by: NNamhyung Kim <namhyung@kernel.org>
      Fixes: 3debb0a9 ("tracing: Fix trace_printk() to print when not using bprintk()")
      Cc: stable@vger.kernel.org # v3.5+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      70c8217a
    • P
      sched/fair: Fix cfs_rq avg tracking underflow · 89741892
      Peter Zijlstra 提交于
      As per commit:
      
        b7fa30c9 ("sched/fair: Fix post_init_entity_util_avg() serialization")
      
      > the code generated from update_cfs_rq_load_avg():
      >
      > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      > 		removed_load = 1;
      > 	}
      >
      > turns into:
      >
      > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
      > ffffffff8108706b:       48 85 c0                test   %rax,%rax
      > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
      > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
      > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
      > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
      > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
      > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
      > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
      > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      >
      > Which you'll note ends up with sa->load_avg -= r in memory at
      > ffffffff8108707a.
      
      So I _should_ have looked at other unserialized users of ->load_avg,
      but alas. Luckily nikbor reported a similar /0 from task_h_load() which
      instantly triggered recollection of this here problem.
      
      Aside from the intermediate value hitting memory and causing problems,
      there's another problem: the underflow detection relies on the signed
      bit. This reduces the effective width of the variables, IOW its
      effectively the same as having these variables be of signed type.
      
      This patch changes to a different means of unsigned underflow
      detection to not rely on the signed bit. This allows the variables to
      use the 'full' unsigned range. And it does so with explicit LOAD -
      STORE to ensure any intermediate value will never be visible in
      memory, allowing these unserialized loads.
      
      Note: GCC generates crap code for this, might warrant a look later.
      
      Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
             maybe we should do clamping on add too.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: kernel@kyup.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      89741892
  12. 17 6月, 2016 2 次提交
    • T
      cgroup: set css->id to -1 during init · 8fa3b8d6
      Tejun Heo 提交于
      If percpu_ref initialization fails during css_create(), the free path
      can end up trying to free css->id of zero.  As ID 0 is unused, it
      doesn't cause a critical breakage but it does trigger a warning
      message.  Fix it by setting css->id to -1 from init_and_link_css().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Wenwei Tao <ww.tao0320@gmail.com>
      Fixes: 01e58659 ("cgroup: release css->id after css_free")
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      8fa3b8d6
    • P
      workqueue: Fix setting affinity of unbound worker threads · d945b5e9
      Peter Zijlstra 提交于
      With commit e9d867a6 ("sched: Allow per-cpu kernel threads to
      run on online && !active"), __set_cpus_allowed_ptr() expects that only
      strict per-cpu kernel threads can have affinity to an online CPU which
      is not yet active.
      
      This assumption is currently broken in the CPU_ONLINE notification
      handler for the workqueues where restore_unbound_workers_cpumask()
      calls set_cpus_allowed_ptr() when the first cpu in the unbound
      worker's pool->attr->cpumask comes online. Since
      set_cpus_allowed_ptr() is called with pool->attr->cpumask in which
      only one CPU is online which is not yet active, we get the following
      WARN_ON during an CPU online operation.
      
      ------------[ cut here ]------------
      WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
      __set_cpus_allowed_ptr+0x228/0x2e0
      Modules linked in:
      CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4
      <..snip..>
      Call Trace:
      [c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
      [c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
      [c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
      [c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
      [c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
      [c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
      [c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
      [c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
      [c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
      [c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
      [c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
      ---[ end trace 00f1456578b2a3b2 ]---
      
      This patch fixes this by limiting the mask to the intersection of
      the pool affinity and online CPUs.
      
      Changelog-cribbed-from: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reported-by: NAbdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      d945b5e9
  13. 16 6月, 2016 1 次提交