1. 25 September 2012 (7 commits)
    • time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD · 70639421
      Committed by John Stultz
      To help migrate architectures over to the new update_vsyscall method,
      redefine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD.
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      70639421
    • time: Move update_vsyscall definitions to timekeeper_internal.h · 189374ae
      Committed by John Stultz
      Since users will need to include timekeeper_internal.h, move
      update_vsyscall definitions to timekeeper_internal.h.
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      189374ae
    • time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes · d7b4202e
      Committed by John Stultz
      We're going to need to access the timekeeper in update_vsyscall,
      so make the structure available for those who need it.
      
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      d7b4202e
    • jiffies: Remove compile time assumptions about CLOCK_TICK_RATE · b3c869d3
      Committed by John Stultz
      CLOCK_TICK_RATE is used to accurately calculate exactly how
      long a tick will be at a given HZ.
      
      This is useful because, while we'd expect each tick to be
      NSEC_PER_SEC/HZ long, the underlying hardware has some granularity
      limit, so we won't be able to have exactly HZ ticks per second.
      
      This slight error can cause timekeeping quality problems
      when using the jiffies or other jiffies-driven clocksources.
      Thus we currently use the compile time CLOCK_TICK_RATE value to
      generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
      to adjust the jiffies clocksource to correct this error.
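
      To make the error concrete, here is a small standalone sketch (the
      1193182 Hz rate is just an i8253-style example, not something taken
      from this patch) comparing the ideal NSEC_PER_SEC/HZ tick with the
      tick length the hardware can actually deliver:

      	#include <stdio.h>
      	#include <stdint.h>

      	#define NSEC_PER_SEC 1000000000ULL

      	int main(void)
      	{
      		const uint64_t tick_rate = 1193182;	/* e.g. an i8253-style timer */
      		const uint64_t hz = 1000;

      		/* the hardware can only count whole cycles per tick */
      		uint64_t cycles_per_tick = (tick_rate + hz / 2) / hz;
      		uint64_t real_tick_ns = cycles_per_tick * NSEC_PER_SEC / tick_rate;
      		uint64_t ideal_tick_ns = NSEC_PER_SEC / hz;

      		printf("ideal tick: %llu ns, real tick: %llu ns\n",
      		       (unsigned long long)ideal_tick_ns,
      		       (unsigned long long)real_tick_ns);
      		return 0;
      	}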
      
      Unfortunately though, since CLOCK_TICK_RATE is a compile
      time value, and the jiffies clocksource is registered very
      early during boot, there are a number of cases where there
      are different possible hardware timers that have different
      tick rates. This causes problems in cases like ARM where
      there are numerous different types of hardware, each having
      their own compile-time CLOCK_TICK_RATE, making it hard to
      accurately support different hardware with a single kernel.
      
      For the most part, this doesn't matter all that much, as not
      too many systems actually utilize the jiffies or jiffies-driven
      clocksources. Usually there are other highres clocksources
      whose granularity error is negligible.
      
      Even so, we have some complicated calculations that we do
      everywhere to handle these edge cases.
      
      This patch removes the compile time SHIFTED_HZ value, and
      introduces a register_refined_jiffies() function. This results
      in the default jiffies clocksource being assumed to run at a
      perfect HZ frequency, and allows architectures that care about
      jiffies accuracy to call register_refined_jiffies() with the
      tick rate, specified dynamically at boot.
      
      This allows us, where necessary, to not have a compile time
      CLOCK_TICK_RATE constant, simplifies the jiffies code, and
      still provides a way to have an accurate jiffies clock.
      
      NOTE: Since this patch does not add register_refined_jiffies()
      calls for every arch, it may cause time quality regressions
      in some cases. It's likely these will not be noticeable, but
      if they are an issue, adding the following to the end of
      setup_arch() should resolve the regression:
      	register_refined_jiffies(CLOCK_TICK_RATE)
      
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      b3c869d3
    • alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue · a65bcc12
      Committed by John Stultz
      Now that alarmtimer_remove has been simplified, change
      its name to _dequeue to better match its paired _enqueue
      function.
      
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Colin Cross <ccross@android.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      a65bcc12
    • alarmtimer: Use hrtimer per-alarm instead of per-base · dae373be
      Committed by John Stultz
      Arve Hjønnevåg reported numerous crashes from the
      "BUG_ON(timer->state != HRTIMER_STATE_CALLBACK)" check
      in __run_hrtimer after it called alarmtimer_fired.
      
      It turns out the alarmtimer code was not properly handling
      possible failures of hrtimer_try_to_cancel, and because
      these failures occur when the underlying base hrtimer is
      being run, this limits the ability to properly handle
      modifications to any alarmtimers on that base.
      
      Because much of the logic duplicates the hrtimer logic,
      it seems that we might as well have a per-alarmtimer
      hrtimer, and avoid the extra complexity of trying to
      multiplex many alarmtimers off of one hrtimer.
      
      Thus this patch moves the hrtimer to the alarm structure
      and simplifies the management logic.
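
      For context, the resulting structure looks roughly like the sketch
      below; the field list is abridged and reconstructed from the
      description, so treat it as illustrative rather than the patch
      itself:

      	struct alarm {
      		struct timerqueue_node	node;
      		struct hrtimer		timer;	/* was one hrtimer per base */
      		enum alarmtimer_restart	(*function)(struct alarm *, ktime_t now);
      		enum alarmtimer_type	type;
      		int			state;
      		void			*data;
      	};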
      
      Changelog:
      v2:
      * Includes a fix for double alarm_start calls found by
        Arve
      
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Colin Cross <ccross@android.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Arve Hjønnevåg <arve@android.com>
      Tested-by: Arve Hjønnevåg <arve@android.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      dae373be
    • alarmtimer: Implement minimum alarm interval for allowing suspend · 59a93c27
      Committed by Todd Poynor
      alarmtimer suspend now returns -EBUSY if the next alarm will fire in
      less than 2 seconds.  This allows one RTC seconds tick to occur
      subsequent to this check before the alarm wakeup time is set, ensuring
      the wakeup time is still in the future (assuming the RTC does not tick
      one more second prior to setting the alarm).
      
      If suspend is rejected due to an imminent alarm, hold a wakeup source
      for 2 seconds to process the alarm prior to reattempting suspend.
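
      A minimal sketch of the check described above; the variable names and
      the wakeup-source handle are assumptions, not quoted from the patch:

      	/* min = time until the earliest enqueued alarm, across all bases */
      	if (ktime_to_ns(min) < 2 * NSEC_PER_SEC) {
      		__pm_wakeup_event(ws, 2 * MSEC_PER_SEC);
      		return -EBUSY;	/* reject suspend, retry after the alarm runs */
      	}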
      
      If setting the alarm incurs -ETIME because the alarm time is already
      in the past, or any other problem occurs while setting the alarm,
      abort suspend and hold a wakelock for 1 second while the alarm is
      serviced or other, hopefully transient, conditions preventing the
      alarm clear up.
      Signed-off-by: Todd Poynor <toddpoynor@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      59a93c27
  2. 13 September 2012 (1 commit)
  3. 11 September 2012 (2 commits)
    • workqueue: fix possible idle worker depletion across CPU hotplug · ee378aa4
      Committed by Lai Jiangshan
      To simplify both normal and CPU hotplug paths, worker management is
      prevented while CPU hotplug is in progress.  This is achieved by CPU
      hotplug holding the same exclusion mechanism used by workers to ensure
      there's only one manager per pool.
      
      If someone else seems to be performing the manager role, workers
      proceed to execute work items.  CPU hotplug using the same mechanism
      can lead to idle worker depletion because all workers could proceed to
      execute work items while CPU hotplug is in progress and CPU hotplug
      itself wouldn't actually perform the worker management duty - it
      doesn't guarantee that there's an idle worker left when it releases
      management.
      
      This idle worker depletion, under extreme circumstances, can break
      forward-progress guarantee and thus lead to deadlock.
      
      This patch fixes the bug by using separate mechanisms for manager
      exclusion among workers and hotplug exclusion.  For manager exclusion,
      POOL_MANAGING_WORKERS which was restored by the previous patch is
      used.  pool->manager_mutex is now only used for exclusion between the
      elected manager and CPU hotplug.  The elected manager won't proceed
      without holding pool->manager_mutex.
      
      This ensures that the worker which won the manager position can't skip
      managing while CPU hotplug is in progress.  It will block on
      manager_mutex and perform management after CPU hotplug is complete.
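
      An abridged sketch of how the two exclusion mechanisms fit together
      after this change, reconstructed from the description above (locking
      details and the manager's own rebinding are elided):

      	static bool manage_workers(struct worker *worker)
      	{
      		struct worker_pool *pool = worker->pool;
      		bool ret = false;

      		/* exclusion among workers: only one elected manager per pool */
      		if (pool->flags & POOL_MANAGING_WORKERS)
      			return ret;
      		pool->flags |= POOL_MANAGING_WORKERS;

      		/* exclusion against CPU hotplug: block instead of skipping */
      		if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
      			spin_unlock_irq(&pool->gcwq->lock);
      			mutex_lock(&pool->manager_mutex);
      			/* hotplug may have run meanwhile; the manager rebinds itself */
      			ret = true;
      		}

      		/* ... create/destroy workers as needed, then release both ... */
      		pool->flags &= ~POOL_MANAGING_WORKERS;
      		mutex_unlock(&pool->manager_mutex);
      		return ret;
      	}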
      
      Note that hotplug may happen while waiting for manager_mutex.  A
      manager isn't on either the idle or busy list, and thus the hotplug
      code can't unbind/rebind it.  Make the manager handle its own
      un/rebinding.
      
      tj: Updated comment and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ee378aa4
    • workqueue: restore POOL_MANAGING_WORKERS · 552a37e9
      Committed by Lai Jiangshan
      This patch restores POOL_MANAGING_WORKERS which was replaced by
      pool->manager_mutex by 60373152 "workqueue: use mutex for global_cwq
      manager exclusion".
      
      There's a subtle idle worker depletion bug across CPU hotplug events
      and we need to distinguish an actual manager and CPU hotplug
      preventing management.  POOL_MANAGING_WORKERS will be used for the
      former and manager_mutex for the latter.
      
      This patch just lays POOL_MANAGING_WORKERS on top of the existing
      manager_mutex and doesn't introduce any synchronization changes.  The
      next patch will update it.
      
      Note that this patch fixes a non-critical anomaly where
      too_many_workers() may return %true spuriously while CPU hotplug is in
      progress.  While the issue could schedule idle timer spuriously, it
      didn't trigger any actual misbehavior.
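
      The anomaly is easiest to see from the shape of too_many_workers(),
      sketched here from the description (treat the details as illustrative):
      before this patch "managing" was derived from manager_mutex being held,
      so hotplug holding the mutex inflated nr_idle even though no worker was
      actually managing.

      	static bool too_many_workers(struct worker_pool *pool)
      	{
      		/* a real manager again, not "somebody holds manager_mutex" */
      		bool managing = pool->flags & POOL_MANAGING_WORKERS;
      		int nr_idle = pool->nr_idle + managing; /* manager counts as idle */
      		int nr_busy = pool->nr_workers - nr_idle;

      		return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
      	}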
      
      tj: Rewrote patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      552a37e9
  4. 06 September 2012 (2 commits)
    • workqueue: fix possible deadlock in idle worker rebinding · ec58815a
      Committed by Tejun Heo
      Currently, rebind_workers() and idle_worker_rebind() are two-way
      interlocked.  rebind_workers() waits for idle workers to finish
      rebinding and rebound idle workers wait for rebind_workers() to finish
      rebinding busy workers before proceeding.
      
      Unfortunately, this isn't enough.  The second wait from idle workers
      is implemented as follows.
      
      	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
      
      rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
      then returns.  If a CPU hotplug cycle happens again before one of the
      idle workers finishes the above wait_event(), rebind_workers() will
      repeat the first part of the handshake - set WORKER_REBIND again and
      wait for the idle worker to finish rebinding - and this leads to
      deadlock because the idle worker would be waiting for WORKER_REBIND to
      clear.
      
      This is fixed by adding another interlocking step at the end -
      rebind_workers() now waits for all the idle workers to finish the
      above WORKER_REBIND wait before returning.  This ensures that all
      rebinding steps are complete on all idle workers before the next
      hotplug cycle can happen.
      
      This problem was diagnosed by Lai Jiangshan who also posted a patch to
      fix the issue, upon which this patch is based.
      
      This is the minimal fix and further patches are scheduled for the next
      merge window to simplify the CPU hotplug path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
      ec58815a
    • workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function · 90beca5d
      Committed by Tejun Heo
      This doesn't make any functional difference and is purely to help the
      next patch to be simpler.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      90beca5d
  5. 05 September 2012 (1 commit)
    • workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic · 96e65306
      Committed by Lai Jiangshan
      The compiler may compile the following code into TWO write/modify
      instructions.
      
      	worker->flags &= ~WORKER_UNBOUND;
      	worker->flags |= WORKER_REBIND;
      
      so the other CPU may temporarily see worker->flags which doesn't have
      either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
      prematurely.
      
      Fix it by using a single explicit assignment via ACCESS_ONCE().
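
      A sketch of the single-store pattern (the local variable is
      illustrative):

      	unsigned int worker_flags = worker->flags;

      	/* morph UNBOUND into REBIND with one store so another CPU can
      	 * never observe a window with neither flag set */
      	worker_flags &= ~WORKER_UNBOUND;
      	worker_flags |= WORKER_REBIND;
      	ACCESS_ONCE(worker->flags) = worker_flags;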
      
      Because idle workers have another WORKER_NOT_RUNNING flag, this bug
      doesn't exist for them; however, update it to use the same pattern for
      consistency.
      
      tj: Applied the change to idle workers too and updated comments and
          patch description a bit.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      96e65306
  6. 02 September 2012 (1 commit)
    • time: Move ktime_t overflow checking into timespec_valid_strict · cee58483
      Committed by John Stultz
      Andreas Bombe reported that the ktime_t overflow checking added to
      timespec_valid in commit 4e8b1452 ("time: Improve sanity checking of
      timekeeping inputs") was causing problems with X.org because it caused
      timeouts larger than what ktime_t can represent to be invalid.
      
      Previously, these large timeouts would be clamped to KTIME_MAX and would
      never expire, which is valid.
      
      This patch splits the ktime_t overflow checking into a new
      timespec_valid_strict function, and converts the timekeeping code's
      internal checking to use this more strict function.
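
      Roughly, the split looks like the sketch below; it is reconstructed
      from the description and not a verbatim copy of the patch:

      	/* good enough for user-supplied values: non-negative and normalized */
      	static inline bool timespec_valid(const struct timespec *ts)
      	{
      		if (ts->tv_sec < 0)
      			return false;
      		if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
      			return false;
      		return true;
      	}

      	/* stricter variant for timekeeping-internal use: also reject
      	 * values that cannot be represented in a ktime_t */
      	static inline bool timespec_valid_strict(const struct timespec *ts)
      	{
      		if (!timespec_valid(ts))
      			return false;
      		if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX)
      			return false;
      		return true;
      	}
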
      Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
      Cc: Zhouping Liu <zliu@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cee58483
  7. 22 August 2012 (6 commits)
  8. 21 August 2012 (1 commit)
  9. 19 August 2012 (1 commit)
  10. 18 August 2012 (1 commit)
  11. 15 August 2012 (4 commits)
  12. 14 August 2012 (5 commits)
    • sched: Fix migration thread runtime bogosity · 8f618968
      Committed by Mike Galbraith
      Make the stop scheduler class do the same accounting as other classes.
      
      Migration threads can be caught in the act while doing exec balancing,
      leading to the below due to use of unmaintained ->se.exec_start.  The
      load that triggered this particular instance was an apparently out of
      control heavily threaded application that does system monitoring in
      what equated to an exec bomb, with one of the VERY frequently migrated
      tasks being ps.
      
      %CPU   PID USER     CMD
      99.3    45 root     [migration/10]
      97.7    53 root     [migration/12]
      97.0    57 root     [migration/13]
      90.1    49 root     [migration/11]
      89.6    65 root     [migration/15]
      88.7    17 root     [migration/3]
      80.4    37 root     [migration/8]
      78.1    41 root     [migration/9]
      44.2    13 root     [migration/2]
      Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8f618968
    • sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled · e221d028
      Committed by Mike Galbraith
      Root task group bandwidth replenishment must service all CPUs, regardless of
      where the timer was last started, and regardless of the isolation mechanism,
      lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e221d028
    • sched,cgroup: Fix up task_groups list · 35cf4e50
      Committed by Mike Galbraith
      With multiple instances of task_groups, for_each_rt_rq() is a noop,
      no task groups having been added to the rt.c list instance.  This
      renders __enable/disable_runtime() and print_rt_stats() noop, the
      user (non) visible effect being that rt task groups are missing in
      /proc/sched_debug.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: stable@kernel.org # v3.3+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      35cf4e50
    • sched: fix divide by zero at {thread_group,task}_times · bea6832c
      Committed by Stanislaw Gruszka
      On architectures where cputime_t is a 64 bit type, it is possible to
      trigger a divide by zero on the do_div(temp, (__force u32) total)
      line, if total is a non-zero number but has its lower 32 bits zeroed.
      Removing the cast is not a good solution since some do_div()
      implementations do cast to u32 internally.
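
      The trigger condition is easy to reproduce in isolation; a standalone
      sketch with a made-up value:

      	#include <stdio.h>
      	#include <stdint.h>

      	int main(void)
      	{
      		uint64_t total = 1ULL << 32;	/* non-zero, but low 32 bits are zero */
      		uint32_t divisor = (uint32_t)total;

      		printf("total=%llu divisor=%u\n",
      		       (unsigned long long)total, divisor);
      		/* do_div(temp, divisor) would now divide by zero */
      		return 0;
      	}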
      
      This problem can be triggered in practice on very long lived processes:
      
        PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
         #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
         #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
         #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
         #3 [ffff880472a51cd0] die at ffffffff8100f26b
         #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
         #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
         #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
            [exception RIP: thread_group_times+0x56]
            RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
            RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
            RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
            RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
            R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
         #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
         #8 [ffff880472a51f40] sys_times at ffffffff81088524
         #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
            RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
            RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
            RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
            RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
            R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
            ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bea6832c
    • sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466
      Committed by Peter Zijlstra
      Peter Portante reported that for large cgroup hierarchies (and/or on
      large CPU counts) we get immense lock contention on rq->lock and stuff
      stops working properly.
      
      His workload was a ton of processes, each in their own cgroup,
      everybody idling except for a sporadic wakeup once every so often.
      
      It was found that:
      
        schedule()
          idle_balance()
            load_balance()
              local_irq_save()
              double_rq_lock()
              update_h_load()
                walk_tg_tree(tg_load_down)
                  tg_load_down()
      
      Results in an entire cgroup hierarchy walk under rq->lock for every
      new-idle balance and since new-idle balance isn't throttled this
      results in a lot of work while holding the rq->lock.
      
      This patch does two things: it removes the work from under rq->lock,
      based on the good principle of race and pray which is widely employed
      in the load-balancer as a whole, and secondly it throttles the
      update_h_load() calculation to at most once per jiffy.
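
      A sketch of the once-per-jiffy throttle; the field and helper names
      are assumptions based on the description, not quoted from the patch:

      	static void update_h_load(long cpu)
      	{
      		struct rq *rq = cpu_rq(cpu);
      		unsigned long now = jiffies;

      		/* at most one hierarchy walk per jiffy per runqueue */
      		if (rq->h_load_throttle == now)
      			return;
      		rq->h_load_throttle = now;

      		rcu_read_lock();
      		walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
      		rcu_read_unlock();
      	}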
      
      I considered excluding update_h_load() for new-idle balance
      all-together, but purely relying on regular balance passes to update
      this data might not work out under some rare circumstances where the
      new-idle busiest isn't the regular busiest for a while (unlikely, but
      a nightmare to debug if someone hits it and suffers).
      
      Cc: pjt@google.com
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Peter Portante <pportant@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a35b6466
  13. 13 August 2012 (1 commit)
    • printk: Fix calculation of length used to discard records · e3756477
      Committed by Jeff Mahoney
      While tracking down a weird buffer overflow issue in a program that
      looked to be sane, I started double checking the length returned by
      syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
      the buffer.
      
      Sure enough, it was.  I saw this in strace:
      
        11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279
      
      It turns out that the loops that calculate how much space the entries
      will take when they're copied don't include the newlines and prefixes
      that will be included in the final output, since the prev flags are
      passed as zero.
      
      This patch properly accounts for it and fixes the overflow.
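
      The corrected accounting loop has roughly the following shape; the
      identifiers are assumptions drawn from that era's kernel/printk.c and
      should be read as illustrative, not as the exact diff:

      	enum log_flags prev = 0;
      	u32 idx = clear_idx;
      	u64 seq = clear_seq;
      	size_t len = 0;

      	while (seq < log_next_seq) {
      		struct log *msg = log_from_idx(idx);

      		/* pass the previous record's flags so the prefix and the
      		 * newline are counted exactly as the copy loop will emit them */
      		len += msg_print_text(msg, prev, true, NULL, 0);
      		prev = msg->flags;
      		idx = log_next(idx);
      		seq++;
      	}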
      
      CC: stable@kernel.org
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3756477
  14. 09 August 2012 (1 commit)
  15. 05 August 2012 (1 commit)
    • time: Fix adjustment cleanup bug in timekeeping_adjust() · 1d17d174
      Committed by Ingo Molnar
      Tetsuo Handa reported that sporadically the system clock starts
      counting up too quickly which is enough to confuse the hangcheck
      timer to print a bogus stall warning.
      
      Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling
      timekeeping_adjust" overlooked this exit path:
      
              } else
                      return;
      
      which should really be a proper exit sequence, fixing the bug as a
      side effect.
      
      Also make the flow more readable by properly balancing curly
      braces.
      
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: john.stultz@linaro.org
      Cc: a.p.zijlstra@chello.nl
      Cc: richardcochran@gmail.com
      Cc: prarit@redhat.com
      Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1d17d174
  16. 01 August 2012 (5 commits)
    • mm: allow PF_MEMALLOC from softirq context · 907aed48
      Committed by Mel Gorman
      This is needed to allow network softirq packet processing to make use of
      PF_MEMALLOC.
      
      Currently softirq context cannot use PF_MEMALLOC due to it not being
      associated with a task, and therefore not having task flags to fiddle with
      - thus the gfp to alloc flag mapping ignores the task flags when in
      interrupts (hard or soft) context.
      
      Allowing softirqs to make use of PF_MEMALLOC therefore requires some
      trickery.  This patch borrows the task flags from whatever process happens
      to be preempted by the softirq.  It then modifies the gfp to alloc flags
      mapping to not exclude task flags in softirq context, and modify the
      softirq code to save, clear and restore the PF_MEMALLOC flag.
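
      A sketch of the borrow-and-restore pattern described above; the
      restore helper's name is an assumption:

      	asmlinkage void __do_softirq(void)
      	{
      		/* borrow the preempted task's flags; clear PF_MEMALLOC so it
      		 * cannot leak into the softirq handlers */
      		unsigned long old_flags = current->flags;

      		current->flags &= ~PF_MEMALLOC;

      		/* ... run the pending softirq handlers ... */

      		/* restore only PF_MEMALLOC, so a handler setting it cannot
      		 * leak it back into the preempted task either */
      		tsk_restore_flags(current, old_flags, PF_MEMALLOC);
      	}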
      
      The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
      leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
      cannot leak back into the preempted process.  This should be safe due to
      the following reasons:
      
      Softirqs can run on multiple CPUs sure but the same task should not be
      	executing the same softirq code. Neither should the softirq
      	handler be preempted by any other softirq handler so the flags
      	should not leak to an unrelated softirq.
      
      Softirqs re-enable hardware interrupts in __do_softirq() so can be
      	preempted by hardware interrupts so PF_MEMALLOC is inherited
      	by the hard IRQ. However, this is similar to a process in
      	reclaim being preempted by a hardirq. While PF_MEMALLOC is
      	set, gfp_to_alloc_flags() distinguishes between hard and
      	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
      	flag.
      
      If the softirq is deferred to ksoftirq then its flags may be used
              instead of a normal task's, but as the softirq cannot be preempted,
              the PF_MEMALLOC flag does not leak to other code by accident.
      
      [davem@davemloft.net: Document why PF_MEMALLOC is safe]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      907aed48
    • mm/hotplug: correctly setup fallback zonelists when creating new pgdat · 9adb62a5
      Committed by Jiang Liu
      When hotadd_new_pgdat() is called to create a new pgdat for a new node, a
      fallback zonelist should be created for the new node.  There's code to try
      to achieve that in hotadd_new_pgdat() as below:
      
      	/*
      	 * The node we allocated has no zone fallback lists. For avoiding
      	 * to access not-initialized zonelist, build here.
      	 */
      	mutex_lock(&zonelists_mutex);
      	build_all_zonelists(pgdat, NULL);
      	mutex_unlock(&zonelists_mutex);
      
      But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
      new node is still in offline state because node_set_online(nid) hasn't
      been called yet.  And build_all_zonelists() only builds zonelists for
      online nodes as:
      
              for_each_online_node(nid) {
                      pg_data_t *pgdat = NODE_DATA(nid);
      
                      build_zonelists(pgdat);
                      build_zonelist_cache(pgdat);
              }
      
      Though we hope to create a zonelist for the new pgdat, it doesn't get
      built.  So add a new parameter "pgdat" to build_all_zonelists() so that
      a zonelist is built for the new pgdat too.
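
      A sketch of the resulting flow, reconstructed from the description
      (abridged; the hotplug guard is the interesting part):

      	/* build_all_zonelists(pgdat, ...) eventually reaches something like: */
      	static int __build_all_zonelists(void *data)
      	{
      		pg_data_t *self = data;
      		int nid;

      	#ifdef CONFIG_MEMORY_HOTPLUG
      		/* the new node isn't online yet, so build its zonelists explicitly */
      		if (self && !node_online(self->node_id))
      			build_zonelists(self);
      	#endif

      		for_each_online_node(nid) {
      			pg_data_t *pgdat = NODE_DATA(nid);

      			build_zonelists(pgdat);
      			build_zonelist_cache(pgdat);
      		}
      		return 0;
      	}
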
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9adb62a5
    • memcg: rename config variables · c255a458
      Committed by Andrew Morton
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c255a458
    • mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads · 3965c9ae
      Committed by Wanpeng Li
      Since per-BDI flusher threads were introduced in 2.6, the pdflush
      mechanism is not used any more.  But the old interface exported through
      /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
      
      For backward compatibility, print a warning and return 2 to notify
      the users that the interface is removed.
      Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3965c9ae
    • mm: account the total_vm in the vm_stat_account() · 44de9d0c
      Committed by Huang Shijie
      vm_stat_account() accounts shared_vm, stack_vm and reserved_vm now.
      But we can also account total_vm in vm_stat_account(), which makes
      the code tidier.
      
      Even for mprotect_fixup(), we can get the right result in the end.
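
      An abridged sketch of the tidier accounting, reconstructed from the
      description rather than copied from the patch:

      	void vm_stat_account(struct mm_struct *mm, unsigned long flags,
      			     struct file *file, long pages)
      	{
      		mm->total_vm += pages;		/* now accounted here as well */

      		if (file)
      			mm->shared_vm += pages;
      		else if (flags & (VM_GROWSUP | VM_GROWSDOWN))
      			mm->stack_vm += pages;
      		if (flags & (VM_RESERVED | VM_IO))
      			mm->reserved_vm += pages;
      	}
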
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44de9d0c