1. 18 5月, 2011 8 次提交
    • R
      PM: Allow drivers to allocate memory from .prepare() callbacks safely · 91e7c75b
      Rafael J. Wysocki 提交于
      If device drivers allocate substantial amounts of memory (above 1 MB)
      in their hibernate .freeze() callbacks (or in their legacy suspend
      callbcks during hibernation), the subsequent creation of hibernate
      image may fail due to the lack of memory.  This is the case, because
      the drivers' .freeze() callbacks are executed after the hibernate
      memory preallocation has been carried out and the preallocated amount
      of memory may be too small to cover the new driver allocations.
      Unfortunately, the drivers' .prepare() callbacks also are executed
      after the hibernate memory preallocation has completed, so they are
      not suitable for allocating additional memory either.  Thus the only
      way a driver can safely allocate memory during hibernation is to use
      a hibernate/suspend notifier.  However, the notifiers are called
      before the freezing of user space and the drivers wanting to use them
      for allocating additional memory may not know how much memory needs
      to be allocated at that point.
      
      To let device drivers overcome this difficulty rework the hibernation
      sequence so that the memory preallocation is carried out after the
      drivers' .prepare() callbacks have been executed, so that the
      .prepare() callbacks can be used for allocating additional memory
      to be used by the drivers' .freeze() callbacks.  Update documentation
      to match the new behavior of the code.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      91e7c75b
    • R
      PM: Remove CONFIG_PM_VERBOSE · c650da23
      Rafael J. Wysocki 提交于
      Now that we have CONFIG_DYNAMIC_DEBUG there is no need for yet
      another flag causing dev_dbg() and pr_debug() statements in the
      core PM code to produce output.  Moreover, CONFIG_PM_VERBOSE
      causes so much output to be generated that it's not really useful
      and almost no one sets it.
      
      References: https://bugzilla.kernel.org/show_bug.cgi?id=23182Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      c650da23
    • R
      Revert "PM / Hibernate: Reduce autotuned default image size" · 1c1be3a9
      Rafael J. Wysocki 提交于
      This reverts commit bea3864f
      (PM / Hibernate: Reduce autotuned default image size), because users
      are now able to resolve the issue this commit was supposed to address
      in a different way (i.e. by using the new /sys/power/reserved_size
      interface).
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      1c1be3a9
    • R
      PM / Hibernate: Add sysfs knob to control size of memory for drivers · ddeb6487
      Rafael J. Wysocki 提交于
      Martin reports that on his system hibernation occasionally fails due
      to the lack of memory, because the radeon driver apparently allocates
      too much of it during the device freeze stage.  It turns out that the
      amount of memory allocated by radeon during hibernation (and
      presumably during system suspend too) depends on the utilization of
      the GPU (e.g. hibernating while there are two KDE 4 sessions with
      compositing enabled causes radeon to allocate more memory than for
      one KDE 4 session).
      
      In principle it should be possible to use image_size to make the
      memory preallocation mechanism free enough memory for the radeon
      driver, but in practice it is not easy to guess the right value
      because of the way the preallocation code uses image_size.  For this
      reason, it seems reasonable to allow users to control the amount of
      memory reserved for driver allocations made after the hibernate
      preallocation, which currently is constant and amounts to 1 MB.
      
      Introduce a new sysfs file, /sys/power/reserved_size, whose value
      will be used as the amount of memory to reserve for the
      post-preallocation reservations made by device drivers, in bytes.
      For backwards compatibility, set its default (and initial) value to
      the currently used number (1 MB).
      
      References: https://bugzilla.kernel.org/show_bug.cgi?id=34102Reported-and-tested-by: NMartin Steigerwald <Martin@Lichtvoll.de>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      ddeb6487
    • K
      kmod: always provide usermodehelper_disable() · 13d53f87
      Kay Sievers 提交于
      We need to prevent kernel-forked processes during system poweroff.
      Such processes try to access the filesystem whose disks we are
      trying to shutdown at the same time. This causes delays and exceptions
      in the storage drivers.
      
      A follow-up patch will add these calls and need usermodehelper_disable()
      also on systems without suspend support.
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      13d53f87
    • R
      PM: Print a warning if firmware is requested when tasks are frozen · a144c6a6
      Rafael J. Wysocki 提交于
      Some drivers erroneously use request_firmware() from their ->resume()
      (or ->thaw(), or ->restore()) callbacks, which is not going to work
      unless the firmware has been built in.  This causes system resume to
      stall until the firmware-loading timeout expires, which makes users
      think that the resume has failed and reboot their machines
      unnecessarily.  For this reason, make _request_firmware() print a
      warning and return immediately with error code if it has been called
      when tasks are frozen and it's impossible to start any new usermode
      helpers.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Reviewed-by: NValdis Kletnieks <valdis.kletnieks@vt.edu>
      a144c6a6
    • M
      Freezer: Use SMP barriers · ee940d8d
      Mike Frysinger 提交于
      The freezer processes are dealing with multiple threads running
      simultaneously, and on a UP system, the memory reads/writes do
      not need barriers to keep things in sync.  These are only needed
      on SMP systems, so use SMP barriers instead.
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      ee940d8d
    • M
      PM / Suspend: Do not ignore error codes returned by suspend_enter() · 3c431936
      MyungJoo Ham 提交于
      The current implementation of suspend-to-RAM returns 0 if there is an
      error from suspend_enter(), because suspend_devices_and_enter() ignores
      the return value from suspend_enter().  This patch addresses this issue
      and properly keep the error return from suspend_enter() and let
      suspend_devices_and_enter relay the error return.
      Signed-off-by: NMyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      3c431936
  2. 14 5月, 2011 1 次提交
    • S
      Cache user_ns in struct cred · 47a150ed
      Serge E. Hallyn 提交于
      If !CONFIG_USERNS, have current_user_ns() defined to (&init_user_ns).
      
      Get rid of _current_user_ns.  This requires nsown_capable() to be
      defined in capability.c rather than as static inline in capability.h,
      so do that.
      
      Request_key needs init_user_ns defined at current_user_ns if
      !CONFIG_USERNS, so forward-declare that in cred.h if !CONFIG_USERNS
      at current_user_ns() define.
      
      Compile-tested with and without CONFIG_USERNS.
      Signed-off-by: NSerge E. Hallyn <serge.hallyn@canonical.com>
      [ This makes a huge performance difference for acl_permission_check(),
        up to 30%.  And that is one of the hottest kernel functions for loads
        that are pathname-lookup heavy.  ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47a150ed
  3. 12 5月, 2011 4 次提交
  4. 07 5月, 2011 1 次提交
  5. 03 5月, 2011 1 次提交
  6. 30 4月, 2011 2 次提交
  7. 29 4月, 2011 2 次提交
    • T
      hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically · ce31332d
      Thomas Gleixner 提交于
      Sedat and Bruno reported RCU stalls which turned out to be caused by
      the following;
      
      sched_init() calls init_rt_bandwidth() which calls hrtimer_init()
      _BEFORE_ hrtimers_init() is called. While not entirely correct this
      worked because hrtimer_init() only accessed statically initialized
      data (hrtimer_bases.clock_base[CLOCK_MONOTONIC])
      
      Commit e06383db (hrtimers: extend hrtimer base code to handle more
      then 2 clockids) added an indirection to the hrtimer_bases.clock_base
      lookup to avoid gap handling in the hot path. The table which is used
      for the translataion from CLOCK_ID to HRTIMER_BASE index is
      initialized at runtime in hrtimers_init(). So the early call of the
      scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME.
      
      Thus the rt_bandwith timer ends up on CLOCK_REALTIME. If the timer is
      armed and the wall clock time is set (e.g. ntpdate in the early boot
      process - which also gives the problem deterministic behaviour
      i.e. magic recovery after N hours), then the timer ends up with an
      expiry time far into the future. That breaks the RT throttler
      mechanism as rt runtime is accumulated and never cleared, so the rt
      throttler detects a false cpu hog condition and blocks all RT tasks
      until the timer finally expires. That in turn stalls the RCU thread of
      TINYRCU which leads to an huge amount of RCU callbacks piling up.
      
      Make the translation table statically initialized, so we are back to
      the status of <= 2.6.39.
      Reported-and-tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Reported-by: NBruno Prémont <bonbons@linux-vserver.org>
      Cc: John stultz <johnstul@us.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3EReviewed-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      ce31332d
    • H
      kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdog · 1409f141
      Hillf Danton 提交于
      In corner cases where softlockup watchdog is not setup successfully, the
      relevant nmi perf event for hardlockup watchdog could be disabled, then
      the status of the underlying hardware remains unchanged.
      
      Also, if the kthread doesn't start then the hrtimer won't run and the
      hardlockup detector will falsely fire.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Signed-off-by: NDon Zickus <dzickus@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1409f141
  8. 25 4月, 2011 1 次提交
    • F
      ptrace: Prepare to fix racy accesses on task breakpoints · bf26c018
      Frederic Weisbecker 提交于
      When a task is traced and is in a stopped state, the tracer
      may execute a ptrace request to examine the tracee state and
      get its task struct. Right after, the tracee can be killed
      and thus its breakpoints released.
      This can happen concurrently when the tracer is in the middle
      of reading or modifying these breakpoints, leading to dereferencing
      a freed pointer.
      
      Hence, to prepare the fix, create a generic breakpoint reference
      holding API. When a reference on the breakpoints of a task is
      held, the breakpoints won't be released until the last reference
      is dropped. After that, no more ptrace request on the task's
      breakpoints can be serviced for the tracer.
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: v2.6.33.. <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
      bf26c018
  9. 21 4月, 2011 1 次提交
  10. 20 4月, 2011 1 次提交
  11. 19 4月, 2011 2 次提交
    • R
      PM: Fix error code paths executed after failing syscore_suspend() · 2ca6f62f
      Rafael J. Wysocki 提交于
      If syscore_suspend() fails in suspend_enter(), create_image() or
      resume_target_kernel(), it is necessary to call sysdev_resume(),
      because sysdev_suspend() has been called already and succeeded
      and we are going to abort the transition.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      2ca6f62f
    • L
      next_pidmap: fix overflow condition · c78193e9
      Linus Torvalds 提交于
      next_pidmap() just quietly accepted whatever 'last' pid that was passed
      in, which is not all that safe when one of the users is /proc.
      
      Admittedly the proc code should do some sanity checking on the range
      (and that will be the next commit), but that doesn't mean that the
      helper functions should just do that pidmap pointer arithmetic without
      checking the range of its arguments.
      
      So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
      doesn't really matter, the for-loop does check against the end of the
      pidmap array properly (it's only the actual pointer arithmetic overflow
      case we need to worry about, and going one bit beyond isn't going to
      overflow).
      
      [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
      Reported-by: NTavis Ormandy <taviso@cmpxchg8b.com>
      Analyzed-by: NRobert Święcki <robert@swiecki.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c78193e9
  12. 18 4月, 2011 1 次提交
  13. 16 4月, 2011 2 次提交
    • J
      block: make unplug timer trace event correspond to the schedule() unplug · 49cac01e
      Jens Axboe 提交于
      It's a pretty close match to what we had before - the timer triggering
      would mean that nobody unplugged the plug in due time, in the new
      scheme this matches very closely what the schedule() unplug now is.
      It's essentially the difference between an explicit unplug (IO unplug)
      or an implicit unplug (timer unplug, we scheduled with pending IO
      queued).
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      49cac01e
    • J
      block: let io_schedule() flush the plug inline · a237c1c5
      Jens Axboe 提交于
      Linus correctly observes that the most important dispatch cases
      are now done from kblockd, this isn't ideal for latency reasons.
      The original reason for switching dispatches out-of-line was to
      avoid too deep a stack, so by _only_ letting the "accidental"
      flush directly in schedule() be guarded by offload to kblockd,
      we should be able to get the best of both worlds.
      
      So add a blk_schedule_flush_plug() that offloads to kblockd,
      and only use that from the schedule() path.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      a237c1c5
  14. 15 4月, 2011 1 次提交
  15. 13 4月, 2011 1 次提交
  16. 12 4月, 2011 4 次提交
  17. 11 4月, 2011 3 次提交
    • K
      sched: Fix erroneous all_pinned logic · b30aef17
      Ken Chen 提交于
      The scheduler load balancer has specific code to deal with cases of
      unbalanced system due to lots of unmovable tasks (for example because of
      hard CPU affinity). In those situation, it excludes the busiest CPU that
      has pinned tasks for load balance consideration such that it can perform
      second 2nd load balance pass on the rest of the system.
      
      This all works as designed if there is only one cgroup in the system.
      
      However, when we have multiple cgroups, this logic has false positives and
      triggers multiple load balance passes despite there are actually no pinned
      tasks at all.
      
      The reason it has false positives is that the all pinned logic is deep in
      the lowest function of can_migrate_task() and is too low level:
      
      load_balance_fair() iterates each task group and calls balance_tasks() to
      migrate target load. Along the way, balance_tasks() will also set a
      all_pinned variable. Given that task-groups are iterated, this all_pinned
      variable is essentially the status of last group in the scanning process.
      Task group can have number of reasons that no load being migrated, none
      due to cpu affinity. However, this status bit is being propagated back up
      to the higher level load_balance(), which incorrectly think that no tasks
      were moved.  It kick off the all pinned logic and start multiple passes
      attempt to move load onto puller CPU.
      
      To fix this, move the all_pinned aggregation up at the iterator level.
      This ensures that the status is aggregated over all task-groups, not just
      last one in the list.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b30aef17
    • K
      sched: Fix sched-domain avg_load calculation · b0432d8f
      Ken Chen 提交于
      In function find_busiest_group(), the sched-domain avg_load isn't
      calculated at all if there is a group imbalance within the domain. This
      will cause erroneous imbalance calculation.
      
      The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
      will dump entire sds->max_load into imbalance variable, which is used
      later on to migrate entire load from busiest CPU to the puller CPU.
      
      This has two really bad effect:
      
      1. stampede of task migration, and they won't be able to break out
         of the bad state because of positive feedback loop: large load
         delta -> heavier load migration -> larger imbalance and the cycle
         goes on.
      
      2. severe imbalance in CPU queue depth.  This causes really long
         scheduling latency blip which affects badly on application that
         has tight latency requirement.
      
      The fix is to have kernel calculate domain avg_load in both cases. This
      will ensure that imbalance calculation is always sensible and the target
      is usually half way between busiest and puller CPU.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      b0432d8f
    • S
      perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() · e566b76e
      Stephane Eranian 提交于
      There is a bug in perf_event_enable_on_exec() when cgroup events are
      active on a CPU: the cgroup events may be scheduled twice causing event
      state corruptions which eventually may lead to kernel panics.
      
      The reason is that the function needs to first schedule out the cgroup
      events, just like for the per-thread events. The cgroup event are
      scheduled back in automatically from the perf_event_context_sched_in()
      function.
      
      The patch also adds a WARN_ON_ONCE() is perf_cgroup_switch() to catch any
      bogus state.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110406005454.GA1062@quadSigned-off-by: NIngo Molnar <mingo@elte.hu>
      e566b76e
  18. 09 4月, 2011 1 次提交
  19. 05 4月, 2011 3 次提交