1. 27 Jan 2016, 1 commit
  2. 26 Jan 2016, 1 commit
    •
      tick/sched: Hide unused oneshot timer code · 7809998a
      Arnd Bergmann authored
      A couple of functions in kernel/time/tick-sched.c are only
      relevant for oneshot timer mode, i.e. when hires-timers or
      nohz mode are enabled. If both are disabled, we get gcc warnings
      about them:
      
      kernel/time/tick-sched.c:98:16: warning: 'tick_init_jiffy_update' defined but not used [-Wunused-function]
       static ktime_t tick_init_jiffy_update(void)
                      ^
      kernel/time/tick-sched.c:112:13: warning: 'tick_sched_do_timer' defined but not used [-Wunused-function]
       static void tick_sched_do_timer(ktime_t now)
                   ^
      kernel/time/tick-sched.c:134:13: warning: 'tick_sched_handle' defined but not used [-Wunused-function]
       static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
                   ^
      
      This encloses the whole set of functions in an appropriate ifdef
      to avoid the warning and to make it clearer when they are used.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Link: http://lkml.kernel.org/r/1453736525-1959191-1-git-send-email-arnd@arndb.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7809998a
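      A minimal userspace sketch of the pattern described above (not the
      actual patch): helpers that are only called in one configuration are
      enclosed in the same #if as their callers, so gcc stops warning about
      unused functions in the other configuration. ONESHOT_MODE below is a
      stand-in for the real CONFIG_NO_HZ_COMMON/CONFIG_HIGH_RES_TIMERS
      conditions.

      #include <stdio.h>

      #if defined(ONESHOT_MODE)
      /* Only meaningful when the oneshot timer path is compiled in. */
      static long init_jiffy_update_demo(void)
      {
              return 42;
      }

      static void do_timer_demo(long now)
      {
              printf("oneshot tick at %ld\n", now);
      }
      #endif

      int main(void)
      {
      #if defined(ONESHOT_MODE)
              do_timer_demo(init_jiffy_update_demo());
      #else
              puts("oneshot helpers compiled out, no unused-function warnings");
      #endif
              return 0;
      }

      Building with "gcc -Wall demo.c" and with "gcc -Wall -DONESHOT_MODE demo.c"
      shows that neither configuration emits -Wunused-function warnings.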
  3. 22 Jan 2016, 1 commit
  4. 17 Jan 2016, 3 commits
  5. 29 Dec 2015, 1 commit
  6. 19 Dec 2015, 1 commit
  7. 18 Dec 2015, 1 commit
  8. 17 Dec 2015, 4 commits
    •
      timekeeping: Cap adjustments so they don't exceed the maxadj value · ec02b076
      John Stultz authored
      It has occasionally been noted that users see
      confusing warnings like:
      
          Adjusting tsc more than 11% (5941981 vs 7759439)
      
      We try to limit the maximum total adjustment to 11% (10% tick
      adjustment + 0.5% frequency adjustment). But this is only done by
      bounding the requested adjustment values; the internal steering,
      which tracks the error between what was requested and what was
      applied, does not have any such limit.

      This is usually not problematic, but in some cases there is a risk
      that an adjustment could cause the clocksource mult value to
      overflow, so it's an indication that things are outside of what is
      expected.
      
      It turns out most of the reports of this 11% warning are on systems
      using chrony, which utilizes the adjtimex() ADJ_TICK interface
      (which allows a +-10% adjustment). The original rationale for
      ADJ_TICK is unclear to me, but my assumption is that it was added
      to allow broken systems to get a big constant correction at boot
      (see the adjtimex userspace package for an example) which would allow
      the system to work with ntpd's 0.5% adjustment limit.
      
      Chrony uses ADJ_TICK to make very aggressive short term corrections
      (usually right at startup), which pushes us close enough to the max
      bound that a few late ticks can push the internal steering past the
      max adjust value (tripping the warning).
      
      Thus this patch adds some extra logic to enforce the max adjustment
      cap in the internal steering.
      
      Note: This has the potential to slow corrections when the ADJ_TICK
      value is furthest away from the default value. So it would be good to
      get some testing from folks using chrony, to make sure we don't
      cause any troubles there.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Tested-by: Miroslav Lichvar <mlichvar@redhat.com>
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      ec02b076
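      A minimal sketch of the capping idea, with made-up names and numbers
      (not the kernel's timekeeping code): the applied multiplier is clamped
      so it never moves more than maxadj away from the clocksource's base
      multiplier, regardless of how much internal steering has accumulated.

      #include <stdint.h>
      #include <stdio.h>

      /* Clamp cur_mult + adj into [base_mult - maxadj, base_mult + maxadj]. */
      static uint32_t apply_adjustment(uint32_t base_mult, uint32_t cur_mult,
                                       int32_t adj, uint32_t maxadj)
      {
              int64_t next = (int64_t)cur_mult + adj;

              if (next > (int64_t)base_mult + maxadj)
                      next = (int64_t)base_mult + maxadj;
              if (next < (int64_t)base_mult - maxadj)
                      next = (int64_t)base_mult - maxadj;
              return (uint32_t)next;
      }

      int main(void)
      {
              uint32_t base = 1000000, maxadj = 110000;       /* ~11% cap */

              /* an oversized steering request gets capped, not applied */
              printf("%u\n", apply_adjustment(base, base, 200000, maxadj));
              printf("%u\n", apply_adjustment(base, base, -5000, maxadj));
              return 0;
      }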
    •
      ntp: Fix second_overflow's input parameter type to be 64bits · c7963487
      DengChao authored
      The function "second_overflow" uses "unsigned long"
      as its input parameter type, which will overflow after
      year 2106 on 32-bit systems.
      
      Thus this patch replaces it with time64_t type.
      
      While the 64-bit division is expensive, "next_ntp_leap_sec"
      has been calculated already, so we can just re-use it in the
      TIME_INS/DEL cases, allowing one expensive division per
      leap second instead of re-doing the division once a second after
      the leap flag has been set.
      Signed-off-by: DengChao <chao.deng@linaro.org>
      [jstultz: Tweaked commit message]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      c7963487
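      A rough illustration of the idea, not the ntp code itself: take the
      current time as a 64-bit second count so it cannot wrap in 2106, and
      compare it each second against a cached leap-second boundary instead
      of redoing a 64-bit division every second. The variable names and the
      day-boundary recomputation below are purely illustrative.

      #include <stdint.h>
      #include <stdio.h>

      typedef int64_t time64_t;

      static time64_t next_leap_sec = 1483228800;     /* hypothetical boundary */

      static void second_overflow_demo(time64_t secs)
      {
              /* cheap 64-bit comparison on every call ... */
              if (secs >= next_leap_sec) {
                      puts("leap second boundary reached");
                      /* ... expensive division only when recomputing the boundary */
                      next_leap_sec = (secs / 86400 + 1) * 86400;
              }
      }

      int main(void)
      {
              second_overflow_demo(1483228799);       /* nothing happens */
              second_overflow_demo(1483228800);       /* boundary hit once */
              return 0;
      }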
    •
      ntp: Change time_reftime to time64_t and utilize 64bit __ktime_get_real_seconds · 0af86465
      DengChao authored
      The type of the static variable "time_reftime" and the call to
      get_seconds in ntp are both not y2038 safe.
      
      So change the type of time_reftime to time64_t and replace
      get_seconds with __ktime_get_real_seconds.
      
      The local variable "secs" in ntp_update_offset represents the
      seconds between now and the last ntp adjustment; it seems impossible
      that this interval will exceed 68 years, so keep its type as
      "long".
      Reviewed-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: DengChao <chao.deng@linaro.org>
      [jstultz: Tweaked commit message]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      0af86465
    •
      timekeeping: Provide internal function __ktime_get_real_seconds · dee36654
      DengChao authored
      In order to fix Y2038 issues in the ntp code we will need to replace
      get_seconds() with ktime_get_real_seconds(), but as the ntp code uses
      the timekeeping lock which is also used by ktime_get_real_seconds(),
      we need a version without locking.
      Add a new function __ktime_get_real_seconds() in timekeeping to
      do this.
      Reviewed-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: DengChao <chao.deng@linaro.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      dee36654
  9. 14 Dec 2015, 1 commit
  10. 13 Dec 2015, 1 commit
    •
      kernel: remove stop_machine() Kconfig dependency · 86fffe4a
      Chris Wilson authored
      Currently the full stop_machine() routine is only enabled on SMP if
      module unloading is enabled, or if the CPUs are hotpluggable.  This
      leads to configurations where stop_machine() is broken as it will then
      only run the callback on the local CPU with irqs disabled, and not stop
      the other CPUs or run the callback on them.
      
      For example, this breaks MTRR setup on x86 in certain configs since
      ea8596bb ("kprobes/x86: Remove unused text_poke_smp() and
      text_poke_smp_batch() functions") as the MTRR is only established on the
      boot CPU.
      
      This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
      and HOTPLUG_CPU config options to compile the correct stop_machine() for
      the architecture, removing the false dependency on MODULE_UNLOAD in the
      process.
      
      Link: https://lkml.org/lkml/2014/10/8/124
      References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Iulia Manda <iulia.manda21@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86fffe4a
  11. 11 Dec 2015, 2 commits
    •
      time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow · 37cf4dc3
      John Stultz authored
      For adjtimex()'s ADJ_SETOFFSET, make sure the tv_usec value is
      sane. We might multiply them later which can cause an overflow
      and undefined behavior.
      
      This patch introduces new helper functions to simplify the
      checking code and adds clarifying comments.
      
      Originally this patch was by Sasha Levin, but I've basically
      rewritten it, so he should get credit for finding the issue
      and I should get the blame for any mistakes made since.
      
      Also, credit to Richard Cochran for the phrasing used in the
      comment for what is considered valid here.
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      37cf4dc3
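      A small sketch of the kind of validation described above, with an
      illustrative helper name (not necessarily the kernel's): reject an
      offset whose tv_usec lies outside [0, USEC_PER_SEC) before any later
      conversion multiplies it.

      #include <stdbool.h>
      #include <stdio.h>
      #include <sys/time.h>

      #define USEC_PER_SEC 1000000L

      static bool timeval_offset_valid(const struct timeval *tv)
      {
              return tv->tv_usec >= 0 && tv->tv_usec < USEC_PER_SEC;
      }

      int main(void)
      {
              struct timeval ok  = { .tv_sec = 5, .tv_usec = 999999 };
              struct timeval bad = { .tv_sec = 5, .tv_usec = 2000000 };

              printf("ok:  %d\n", timeval_offset_valid(&ok));   /* 1 */
              printf("bad: %d\n", timeval_offset_valid(&bad));  /* 0 */
              return 0;
      }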
    •
      ntp: Verify offset doesn't overflow in ntp_update_offset · 52d189f1
      Sasha Levin authored
      We need to make sure that the offset is valid before manipulating it,
      otherwise it might overflow on the multiplication.
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      [jstultz: Reworked one of the checks so it makes more sense]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      52d189f1
  12. 08 Dec 2015, 2 commits
    •
      clocksource: Add CPU info to clocksource watchdog reporting · 390dd67c
      Seiichi Ikarashi authored
      The clocksource watchdog reporting was improved by 0b046b21.
      Add the CPU on which the watchdog detects a deviation, because
      that is necessary to identify the trouble spot if the
      clocksource is the TSC.
      Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
      [jstultz: Tweaked commit message]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      390dd67c
    •
      time: Avoid signed overflow in timekeeping_get_ns() · 35a4933a
      David Gibson authored
      1e75fa8b "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
      clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
      of the same logic to avoid keeping a semi-redundant struct timespec
      in struct timekeeper.
      
      However, the commit also introduced a subtle semantic change - where
      clocksource_cyc2ns() uses purely unsigned math, the new version introduces
      a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
      overflow the following shift will still give a negative result.  The
      choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
      generally happen if there's a ~10 minute pause in examining the
      clocksource.
      
      This can be triggered on a powerpc KVM guest by stopping it from qemu for
      a bit over 10 minutes.  After resuming, time has jumped backwards by
      several minutes, causing numerous problems (jiffies does not advance,
      msleep()s can be extended by minutes...).  It doesn't happen on x86
      KVM guests, because
      the guest TSC is effectively frozen while the guest is stopped, which is
      not the case for the powerpc timebase.
      
      Obviously an unsigned (64 bit) overflow will only take twice as long as a
      signed, 63-bit overflow.  I don't know the time code well enough to know
      if that will still cause incorrect calculations, or if a 64-bit overflow
      is avoided elsewhere.
      
      Still, an incorrect forwards clock adjustment will cause less trouble than
      time going backwards.  So, this patch removes the potential for
      intermediate signed overflow.
      
      Cc: stable@vger.kernel.org  (3.7+)
      Suggested-by: Laurent Vivier <lvivier@redhat.com>
      Tested-by: Laurent Vivier <lvivier@redhat.com>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      35a4933a
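      A small demonstration of the overflow class described above, with
      made-up numbers (not the timekeeping code): once delta * mult passes
      2^63, a signed 64-bit temporary turns negative after the shift, while
      the same math done in u64 stays positive for twice as long.

      #include <inttypes.h>
      #include <stdio.h>

      int main(void)
      {
              uint64_t delta = 1ULL << 40;    /* a long pause worth of cycles */
              uint32_t mult  = 1U << 23, shift = 23;

              /* wraps negative on typical two's-complement targets */
              int64_t  s = (int64_t)(delta * mult) >> shift;
              /* unsigned intermediate stays well-defined and positive */
              uint64_t u = (delta * mult) >> shift;

              printf("signed:   %" PRId64 "\n", s);
              printf("unsigned: %" PRIu64 "\n", u);
              return 0;
      }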
  13. 06 Dec 2015, 1 commit
    •
      perf: Do not send exit event twice · 4e93ad60
      Jiri Olsa authored
      In case we monitor events system wide, we get EXIT event
      (when configured) twice for each task that exited.
      
      Note doubled lines with same pid/tid in following example:
      
        $ sudo ./perf record -a
        ^C[ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ]
        $ sudo ./perf report -D | grep EXIT
      
        0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
        1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
        1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
        2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
        2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
      
      The reason is that the cpu contexts are processed each time
      we call perf_event_task. I'm changing the perf_event_aux logic
      to serve task_ctx and cpu contexts separately, which ensures we
      don't get the EXIT event generated twice on the same cpu context.
      
      This does not affect other auxiliary events, as they don't
      use task_ctx at all.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4e93ad60
  14. 04 Dec 2015, 8 commits
    •
      sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() · ecf7d01c
      Peter Zijlstra authored
      Oleg noticed that it's possible to falsely observe p->on_cpu == 0 such
      that we'll prematurely continue with the wakeup and effectively run p on
      two CPUs at the same time.
      
      Even though the overlap is very limited (the task is in the middle of
      being scheduled out), it could still result in corruption of the
      scheduler data structures.
      
              CPU0                            CPU1
      
              set_current_state(...)
      
              <preempt_schedule>
                context_switch(X, Y)
                  prepare_lock_switch(Y)
                    Y->on_cpu = 1;
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
                                              try_to_wake_up(X)
                                                LOCK(p->pi_lock);
      
                                                t = X->on_cpu; // 0
      
                context_switch(Y, X)
                  prepare_lock_switch(X)
                    X->on_cpu = 1;
                  finish_lock_switch(Y)
                    store_release(Y->on_cpu, 0);
              </preempt_schedule>
      
              schedule();
                deactivate_task(X);
                X->on_rq = 0;
      
                                                if (X->on_rq) // false
      
                                                if (t) while (X->on_cpu)
                                                  cpu_relax();
      
                context_switch(X, ..)
                  finish_lock_switch(X)
                    store_release(X->on_cpu, 0);
      
      Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ecf7d01c
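      A userspace sketch of the ordering requirement, using C11 atomics in
      place of the kernel's smp_rmb()/smp_store_release(); the structure,
      field and function names are illustrative, not the scheduler's.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct task_demo {
              atomic_int on_rq;
              atomic_int on_cpu;
      };

      /*
       * The waker must not let the on_cpu load be hoisted above the on_rq
       * load, otherwise it can see a stale on_cpu == 0 and continue the
       * wakeup while the task is still being switched out on another CPU.
       */
      static bool try_to_wake_up_demo(struct task_demo *p)
      {
              if (atomic_load_explicit(&p->on_rq, memory_order_relaxed))
                      return false;   /* still on a runqueue, handled elsewhere */

              /* kernel: smp_rmb(), pairing with the release store of on_cpu */
              atomic_thread_fence(memory_order_acquire);

              while (atomic_load_explicit(&p->on_cpu, memory_order_relaxed))
                      ;               /* kernel: cpu_relax() spin */

              return true;            /* safe to proceed with the wakeup */
      }

      int main(void)
      {
              struct task_demo p;

              atomic_init(&p.on_rq, 0);
              atomic_init(&p.on_cpu, 0);
              printf("wakeup may proceed: %d\n", try_to_wake_up_demo(&p));
              return 0;
      }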
    •
      sched/core: Better document the try_to_wake_up() barriers · b75a2253
      Peter Zijlstra authored
      Explain how the control dependency and smp_rmb() end up providing
      ACQUIRE semantics and pair with smp_store_release() in
      finish_lock_switch().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b75a2253
    •
      sched/cputime: Fix invalid gtime in proc · 2541117b
      Hiroshi Shimamoto authored
      /proc/<pid>/stat shows an invalid gtime when the thread is running in a guest.
      When vtime accounting is not enabled, we cannot get a valid delta.
      The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
      is only updated when vtime accounting is runtime enabled.
      
      This patch makes task_gtime() just return gtime without computing the
      buggy non-existing tickless delta when vtime accounting is not enabled.
      
      Use context_tracking_is_enabled() to check if vtime accounting is
      enabled on some cpu; only in that case do we need to compute the
      tickless delta. This way we fix the gtime regression on machines
      not running nohz full.
      
      The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
      CONFIG_NO_HZ_FULL_ALL=n and boots without nohz_full.

      I ran and then stopped a busy loop in a VM and watched the gtime in
      the host. Dump the 43rd field, which shows the gtime, every second:
      
      	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
      	S 4348
      	R 7064566
      	R 7064766
      	R 7064967
      	R 7065168
      	S 4759
      	S 4759
      
      While the busy loop is running, it returns a huge value.

      After applying this patch, we see the correct gtime.
      
      	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
      	S 5338
      	R 5365
      	R 5465
      	R 5566
      	R 5666
      	S 5726
      	S 5726
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2541117b
    •
      sched/core: Clear the root_domain cpumasks in init_rootdomain() · 8295c699
      Xunlei Pang authored
      root_domain::rto_mask allocated through alloc_cpumask_var()
      contains garbage data; this may cause problems. For instance,
      when doing pull_rt_task(), it may do useless iterations if
      rto_mask retains some extra garbage bits. Worse still, this
      violates the isolated domain rule for clustered scheduling
      using cpuset, because tasks (with all the cpus allowed)
      belonging to one root domain can be pulled away into another
      root domain.
      
      The patch cleans the garbage by using zalloc_cpumask_var()
      instead of alloc_cpumask_var() for root_domain::rto_mask
      allocation, thereby addressing the issues.
      
      Do the same thing for root_domain's other cpumask members:
      dlo_mask, span, and online.
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8295c699
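      A userspace analogue of the fix (names invented, not the scheduler
      code): the mask backing store must start out zeroed before individual
      bits are set, otherwise stale heap contents look like extra CPUs.
      calloc() here plays the role of zalloc_cpumask_var() replacing
      alloc_cpumask_var().

      #include <stdio.h>
      #include <stdlib.h>

      #define NR_CPUS_DEMO 64

      int main(void)
      {
              /* zalloc-style allocation: every bit starts cleared */
              unsigned long *mask = calloc(1, NR_CPUS_DEMO / 8);

              if (!mask)
                      return 1;

              mask[0] |= 1UL << 3;    /* only CPU 3 is meant to be set */
              printf("mask word 0: %#lx\n", mask[0]);

              free(mask);
              return 0;
      }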
    •
      sched/core: Remove false-positive warning from wake_up_process() · 119d6f6a
      Sasha Levin authored
      Because wakeups can (fundamentally) be late, a task might not be in
      the expected state. Therefore testing against a task's state is racy,
      and can yield false positives.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: oleg@redhat.com
      Fixes: 9067ac85 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
      Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      119d6f6a
    •
      sched/wait: Fix signal handling in bit wait helpers · 68985633
      Peter Zijlstra authored
      Vladimir reported getting RCU stall warnings and bisected it back to
      commit:
      
        74316201 ("sched: Remove proliferation of wait_on_bit() action functions")
      
      That commit inadvertently reversed the calls to schedule() and signal_pending(),
      thereby not handling the case where a signal is received while we sleep.
      Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mark.rutland@arm.com
      Cc: neilb@suse.de
      Cc: oleg@redhat.com
      Fixes: 74316201 ("sched: Remove proliferation of wait_on_bit() action functions")
      Fixes: cbbce822 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
      Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      68985633
    •
      perf: Fix PERF_EVENT_IOC_PERIOD deadlock · 642c2d67
      Peter Zijlstra authored
      Dmitry reported a fairly silly recursive lock deadlock for
      PERF_EVENT_IOC_PERIOD, fix this by explicitly doing the inactive part of
      __perf_event_period() instead of calling that function.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: c7999c6f ("perf: Fix PERF_EVENT_IOC_PERIOD migration race")
      Link: http://lkml.kernel.org/r/20151130115615.GJ17308@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      642c2d67
    •
      alarmtimer: Avoid unexpected rtc interrupt when system resume from S3 · a0e3213f
      zhuo-hao authored
      Before the system goes to suspend (S3), if the user creates a timer
      with clockid CLOCK_REALTIME_ALARM/CLOCK_BOOTTIME_ALARM and sets a
      "large" timeout value for this timer, the function
      alarmtimer_suspend will be called to program a timeout into the
      RTC timer to avoid the system sleeping past the alarm. However, if
      the system wakes up earlier than the RTC timeout, the RTC timer is
      not cleared, and hpet_rtc_interrupt keeps firing unexpectedly until
      the RTC timeout is reached. To fix this problem, add
      alarmtimer_resume to cancel the RTC timer.
      
      This was noticed because the HPET RTC emulation fires an
      interrupt every 16ms(=1/2^DEFAULT_RTC_SHIFT) up to the point
      where the alarm time is reached.
      
      The program at https://lkml.org/lkml/2015/11/8/326 always hits this
      situation if the system wakes up earlier than the alarm time.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Zhuo-hao Lee <zhuo-hao.lee@intel.com>
      [jstultz: Tweak commit subject & formatting slightly]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      a0e3213f
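      A userspace analogue of the pairing described above (a POSIX timer
      standing in for the RTC; this is not the alarmtimer code): whatever
      the suspend path armed must be explicitly disarmed on resume, which
      is done here by writing a zero it_value back to the timer.

      #define _POSIX_C_SOURCE 200809L
      #include <stdio.h>
      #include <time.h>

      int main(void)
      {
              timer_t t;
              struct itimerspec arm    = { .it_value = { .tv_sec = 600 } };
              struct itimerspec disarm = { 0 };

              if (timer_create(CLOCK_REALTIME, NULL, &t))
                      return 1;

              timer_settime(t, 0, &arm, NULL);        /* "suspend" arms the alarm */
              timer_settime(t, 0, &disarm, NULL);     /* "resume" cancels it again */

              puts("alarm armed for suspend and cancelled on resume");
              timer_delete(t);
              return 0;
      }

      (Older glibc needs -lrt for the timer_* calls.)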
  15. 03 Dec 2015, 4 commits
    •
      cgroup_pids: don't account for the root cgroup · 67cde9c4
      Tejun Heo authored
      Because accounting resources for the root cgroup sometimes incurs
      measurable overhead for workloads which don't care about cgroups and
      often ends up calculating a number which is available elsewhere in a
      slightly different form, cgroup is not in the business of providing
      system-wide statistics.  The pids controller which was introduced
      recently was exposing "pids.current" at the root.  This patch disables
      accounting for the root cgroup and removes the file from the root
      directory.
      
      While this is a userland visible behavior change, pids has been
      available only in one version and was badly broken there, so I don't
      think this will be noticeable.  If it turns out to be a problem, we
      can reinstate it for v1 hierarchies.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      67cde9c4
    •
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo authored
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
      updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated.  All controllers
      are updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's a single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
    •
      cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach() · 599c963a
      Tejun Heo authored
      If one or more tasks get moved into a frozen css, the frozen state is
      cleared up from the destination css so that it can be reasserted once
      the migrated tasks are frozen.  freezer_attach() implements this in
      two separate steps - clearing CGROUP_FROZEN on the target css while
      processing each task and propagating the clearing upwards after the
      task loop is done if necessary.
      
      This patch merges the two steps.  Propagation now takes place inside
      the task loop.  This simplifies the code and prepares it for the fix
      of multi-destination migration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      599c963a
    •
      bpf: fix allocation warnings in bpf maps and integer overflow · 01b3f521
      Alexei Starovoitov authored
      For large map->value_size the user space can trigger memory allocation warnings like:
      WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
      __alloc_pages_nodemask+0x695/0x14e0()
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
       [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
       [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
       [<     inline     >] __alloc_pages_slowpath mm/page_alloc.c:2989
       [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
       [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
       [<     inline     >] alloc_pages include/linux/gfp.h:451
       [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
       [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
       [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
       [<     inline     >] kmalloc_large include/linux/slab.h:390
       [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
       [<     inline     >] kmalloc include/linux/slab.h:463
       [<     inline     >] map_update_elem kernel/bpf/syscall.c:288
       [<     inline     >] SYSC_bpf kernel/bpf/syscall.c:744
      
      To avoid a never-succeeding kmalloc with order >= MAX_ORDER, check that
      the requested value_size and the computed elem_size are within limits for
      both hash and array type maps.
      Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
      Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
      so keep those kmalloc-s as-is.
      
      Large value_size can cause integer overflows in elem_size and map.pages
      formulas, so check for that as well.
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      01b3f521
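      A userspace sketch of the sanity checks described above (not the bpf
      code; the limit below is an arbitrary stand-in for the kernel's
      MAX_ORDER bound): reject an element size no single allocation should
      serve, and make sure entries * elem_size cannot overflow before
      computing the total.

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define ELEM_SIZE_MAX (1u << 22)        /* illustrative cap */

      static void *alloc_map_values(uint32_t max_entries, uint32_t value_size)
      {
              uint64_t elem_size = ((uint64_t)value_size + 7) & ~7ULL; /* round_up(8) */

              if (elem_size == 0 || elem_size > ELEM_SIZE_MAX)
                      return NULL;                    /* too large per element */
              if (max_entries && elem_size > UINT64_MAX / max_entries)
                      return NULL;                    /* total size would overflow */

              return calloc(max_entries, (size_t)elem_size);
      }

      int main(void)
      {
              void *ok  = alloc_map_values(1024, 64);
              void *bad = alloc_map_values(1024, 0xffffffffu);

              printf("ok=%p bad=%p\n", ok, bad);
              free(ok);
              return 0;
      }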
  16. 02 Dec 2015, 2 commits
    •
      bpf, array: fix heap out-of-bounds access when updating elements · fbca9d2d
      Daniel Borkmann authored
      During own review but also reported by Dmitry's syzkaller [1] it has been
      noticed that we trigger a heap out-of-bounds access on eBPF array maps
      when updating elements. This happens with each map whose map->value_size
      (specified during map creation time) is not multiple of 8 bytes.
      
      In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
      used to align array map slots for faster access. However, in function
      array_map_update_elem(), we update the element as ...
      
      memcpy(array->value + array->elem_size * index, value, array->elem_size);
      
      ... where we access 'value' out-of-bounds, since it was allocated from
      map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
      and later on copied through copy_from_user(value, uvalue, map->value_size).
      Thus, up to 7 bytes, we can access out-of-bounds.
      
      Same could happen from within an eBPF program, where in worst case we
      access beyond an eBPF program's designated stack.
      
      Since 1be7f75d ("bpf: enable non-root eBPF programs") didn't hit an
      official release yet, it only affects privileged users.
      
      In case of array_map_lookup_elem(), the verifier prevents eBPF programs
      from accessing beyond map->value_size through check_map_access(). Also
      from syscall side map_lookup_elem() only copies map->value_size back to
      user, so nothing could leak.
      
        [1] http://github.com/google/syzkaller
      
      Fixes: 28fbcfa0 ("bpf: add array type of eBPF maps")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fbca9d2d
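      A userspace sketch of the bounds issue described above (not the bpf
      code): slots are padded to a multiple of 8 (elem_size) for alignment,
      but the buffer handed in from user space is only value_size bytes, so
      the copy into the slot must be bounded by value_size, not elem_size.

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define ROUND_UP8(x) (((x) + 7u) & ~7u)

      int main(void)
      {
              unsigned int value_size = 13;
              unsigned int elem_size = ROUND_UP8(value_size);         /* 16 */
              unsigned int index = 2, nr_entries = 4;

              unsigned char *array = calloc(nr_entries, elem_size);
              unsigned char *value = malloc(value_size);      /* user-supplied size */

              if (!array || !value)
                      return 1;
              memset(value, 0xab, value_size);

              /* bounded by value_size: never reads past the user buffer */
              memcpy(array + (size_t)elem_size * index, value, value_size);

              printf("stored %u bytes into a %u-byte slot\n", value_size, elem_size);
              free(array);
              free(value);
              return 0;
      }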
    •
      tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter · 0f72e37e
      Steven Rostedt (Red Hat) authored
      The set_event_pid filter relies on attaching to the sched_switch and
      sched_wakeup tracepoints to see if it should filter the tracing on schedule
      tracepoints. By adding the callbacks to sched_wakeup, pids in the
      set_event_pid file will trace the wakeups of those tasks with those pids.
      
      But sched_wakeup_new and sched_waking were missed. These two should also be
      traced. Luckily, these tracepoints share the same class as sched_wakeup
      which means they can use the same pre and post callbacks as sched_wakeup
      does.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      0f72e37e
  17. 30 Nov 2015, 3 commits
    •
      cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork() · afbcb364
      Oleg Nesterov authored
      Now that we know that the forking task can't migrate and the child is always
      moved to the same cgroup by cgroup_post_fork()->css_set_move_task(), we can
      change pids_can_fork() and pids_cancel_fork() to just use task_css(current).
      And since we no longer need to pin this css, we can remove pids_fork().
      
      Note: the patch uses task_css_check(true), perhaps it makes sense to add a
      helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into
      account.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      afbcb364
    •
      cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate() · c9e75f04
      Oleg Nesterov authored
      If the new child migrates to another cgroup before cgroup_post_fork() calls
      subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
      pids_uncharge(old_pids) + pids_charge(pids) sequence twice.
      
      Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
      unconditionally. percpu_down_read() is cheap and this allows other cleanups,
      see the next changes.
      
      Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c9e75f04
    •
      cgroup: make css_set pin its css's to avoid use-after-free · 53254f90
      Tejun Heo authored
      A css_set represents the relationship between a set of tasks and
      css's.  css_set never pinned the associated css's.  This was okay
      because tasks used to always disassociate immediately (in RCU sense) -
      either a task is moved to a different css_set or exits and never
      accesses css_set again.
      
      Unfortunately, afcf6c8b ("cgroup: add cgroup_subsys->free() method
      and use it to fix pids controller") and patches leading up to it made
      a zombie hold onto its css_set and deref the associated css's on its
      release.  Nothing pins the css's after exit and it might have already
      been freed leading to use-after-free.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
       RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
       ...
       Call Trace:
        <IRQ>
        [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
        [<ffffffff810f8893>] cgroup_free+0x53/0xe0
        [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
        [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
        [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
        [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
        [<ffffffff81056e54>] __do_softirq+0xd4/0x460
        [<ffffffff81057369>] irq_exit+0x89/0xa0
        [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
        <EOI>
       ...
       Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
       RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
        RSP <ffff88001fc03e20>
       ---[ end trace 89a4a4b916b90c49 ]---
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: disabled
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix it by making css_set pin the associated css's until its release.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
      Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
      Fixes: afcf6c8b ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
      53254f90
  18. 26 Nov 2015, 2 commits
    •
      nohz: Clarify magic in tick_nohz_stop_sched_tick() · 82bbe34b
      Peter Zijlstra authored
      While going through the nohz code I got stumped by some of it.
      
      This patch adds a few comments clarifying the code; based on discussion
      with Thomas.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20151119162106.GO3816@twins.programming.kicks-ass.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      82bbe34b
    •
      bpf: fix clearing on persistent program array maps · c9da161c
      Daniel Borkmann authored
      Currently, when having map file descriptors pointing to program arrays,
      there's still the issue that we unconditionally flush program array
      contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
      when such a file descriptor is released and is independent of the map's
      refcount.
      
      Having this flush independent of the refcount is for a reason: there
      can be arbitrary complex dependency chains among tail calls, also circular
      ones (direct or indirect, nesting limit determined during runtime), and
      we need to make sure that the map drops all references to eBPF programs
      it holds, so that the map's refcount can eventually drop to zero and
      initiate its freeing. Btw, a walk of the whole dependency graph would
      not be possible for various reasons, one being complexity and another
      one inconsistency, i.e. new programs can be added to parts of the graph
      at any time, so there's no guaranteed consistent state for the time of
      such a walk.
      
      Now, the program array pinning itself works, but the issue is that each
      derived file descriptor on close would nevertheless call unconditionally
      into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
      this flush until the last reference to a user is dropped. As this only
      concerns a subset of references (f.e. a prog array could hold a program
      that itself has reference on the prog array holding it, etc), we need to
      track them separately.
      
      Short analysis on the refcounting: on map creation time usercnt will be
      one, so there's no change in behaviour for bpf_map_release(), if unpinned.
      If we already fail in map_create(), we are immediately freed, and no
      file descriptor has been made public yet. In bpf_obj_pin_user(), we need
      to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
      reference, so before we drop the reference on the fd with fdput().
      Therefore, if actual pinning fails, we need to drop that reference again
      in bpf_any_put(), otherwise we keep holding it. When last reference
      drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
      care of dropping the usercnt again. In the bpf_obj_get_user() case, the
      bpf_any_get() will grab a reference on the usercnt, still at a time when
      we have the reference on the path. Should we later on fail to grab a new
      file descriptor, bpf_any_put() will drop it, otherwise we hold it until
      bpf_map_release() time.
      
      Joint work with Alexei.
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9da161c
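      A sketch of the split refcounting scheme described above, with invented
      names (not the bpf implementation): refcnt keeps the map object alive,
      while usercnt counts references that arrived via user-facing file
      descriptors or pinned paths, and the "flush program array" step runs
      only when the last user reference goes away rather than on every close.

      #include <stdatomic.h>
      #include <stdio.h>

      struct demo_map {
              atomic_int refcnt;
              atomic_int usercnt;
      };

      static void map_put_with_uref(struct demo_map *map)
      {
              if (atomic_fetch_sub(&map->usercnt, 1) == 1)
                      puts("last user reference: flush prog array contents");
              if (atomic_fetch_sub(&map->refcnt, 1) == 1)
                      puts("last reference: free the map");
      }

      int main(void)
      {
              struct demo_map m;

              atomic_init(&m.refcnt, 2);
              atomic_init(&m.usercnt, 2);

              map_put_with_uref(&m);  /* first fd closed: nothing flushed yet */
              map_put_with_uref(&m);  /* last user ref: flush, then free */
              return 0;
      }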
  19. 25 Nov 2015, 1 commit