1. 07 4月, 2015 1 次提交
  2. 14 2月, 2015 1 次提交
    • R
      PM / sleep: Re-implement suspend-to-idle handling · 38106313
      Rafael J. Wysocki 提交于
      In preparation for adding support for quiescing timers in the final
      stage of suspend-to-idle transitions, rework the freeze_enter()
      function making the system wait on a wakeup event, the freeze_wake()
      function terminating the suspend-to-idle loop and the mechanism by
      which deep idle states are entered during suspend-to-idle.
      
      First of all, introduce a simple state machine for suspend-to-idle
      and make the code in question use it.
      
      Second, prevent freeze_enter() from losing wakeup events due to race
      conditions and ensure that the number of online CPUs won't change
      while it is being executed.  In addition to that, make it force
      all of the CPUs re-enter the idle loop in case they are in idle
      states already (so they can enter deeper idle states if possible).
      
      Next, drop cpuidle_use_deepest_state() and replace use_deepest_state
      checks in cpuidle_select() and cpuidle_reflect() with a single
      suspend-to-idle state check in cpuidle_idle_call().
      
      Finally, introduce cpuidle_enter_freeze() that will simply find the
      deepest idle state available to the given CPU and enter it using
      cpuidle_enter().
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      38106313
  3. 12 2月, 2015 2 次提交
    • M
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko 提交于
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") has left a race window when OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely and partial solution deemed
      sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution though and insisted on a
      full solution.  That requires the full OOM and freezer's task freezing
      exclusion, though.  This is done by this patch which introduces oom_sem
      RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disable all the further
      OOMs by setting oom_killer_disabled and checks for any oom victims.
      Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
      
      The page fault path is covered now as well although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reacheable from page fault." so it would be
      better and more robust to not rely on freezing points here.  Same applies
      to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation same as before.  The page fault path will
      retry the fault (more on that later) and Sysrq OOM trigger will simply
      complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • M
      PM: convert printk to pr_* equivalent · 35536ae1
      Michal Hocko 提交于
      While touching this area let's convert printk to pr_*.  This also makes
      the printing of continuation lines done properly.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35536ae1
  4. 04 2月, 2015 1 次提交
  5. 24 1月, 2015 2 次提交
  6. 07 1月, 2015 1 次提交
    • P
      rcu: Make SRCU optional by using CONFIG_SRCU · 83fe27ea
      Pranith Kumar 提交于
      SRCU is not necessary to be compiled by default in all cases. For tinification
      efforts not compiling SRCU unless necessary is desirable.
      
      The current patch tries to make compiling SRCU optional by introducing a new
      Kconfig option CONFIG_SRCU which is selected when any of the components making
      use of SRCU are selected.
      
      If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
      
         text    data     bss     dec     hex filename
         2007       0       0    2007     7d7 kernel/rcu/srcu.o
      
      Size of arch/powerpc/boot/zImage changes from
      
         text    data     bss     dec     hex filename
       831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
       829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after
      
      so the savings are about ~2000 bytes.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Josh Triplett <josh@joshtriplett.org>
      CC: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
      83fe27ea
  7. 20 12月, 2014 1 次提交
  8. 04 12月, 2014 1 次提交
  9. 19 11月, 2014 1 次提交
    • R
      PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected · b2b49ccb
      Rafael J. Wysocki 提交于
      The number of and dependencies between high-level power management
      Kconfig options make life much harder than necessary.  Several
      conbinations of them have to be tested and supported, even though
      some of those combinations are very rarely used in practice (if
      they are used in practice at all).  Moreover, the fact that we
      have separate independent Kconfig options for runtime PM and
      system suspend is a serious obstacle for integration between
      the two frameworks.
      
      To overcome these difficulties, always select PM_RUNTIME if PM_SLEEP
      is set.  Among other things, this will allow system suspend callbacks
      provided by bus types and device drivers to rely on the runtime PM
      framework regardless of the kernel configuration.
      Enthusiastically-acked-by: NKevin Hilman <khilman@linaro.org>
      Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      b2b49ccb
  10. 18 11月, 2014 2 次提交
  11. 15 11月, 2014 1 次提交
  12. 09 11月, 2014 1 次提交
  13. 03 11月, 2014 1 次提交
  14. 28 10月, 2014 1 次提交
  15. 23 10月, 2014 1 次提交
  16. 22 10月, 2014 2 次提交
    • M
      PM: convert do_each_thread to for_each_process_thread · a28e785a
      Michal Hocko 提交于
      as per 0c740d0a (introduce for_each_thread() to replace the buggy
      while_each_thread()) get rid of do_each_thread { } while_each_thread()
      construct and replace it by a more error prone for_each_thread.
      
      This patch doesn't introduce any user visible change.
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      a28e785a
    • M
      OOM, PM: OOM killed task shouldn't escape PM suspend · 5695be14
      Michal Hocko 提交于
      PM freezer relies on having all tasks frozen by the time devices are
      getting frozen so that no task will touch them while they are getting
      frozen. But OOM killer is allowed to kill an already frozen task in
      order to handle OOM situtation. In order to protect from late wake ups
      OOM killer is disabled after all tasks are frozen. This, however, still
      keeps a window open when a killed task didn't manage to die by the time
      freeze_processes finishes.
      
      Reduce the race window by checking all tasks after OOM killer has been
      disabled. This is still not race free completely unfortunately because
      oom_killer_disable cannot stop an already ongoing OOM killer so a task
      might still wake up from the fridge and get killed without
      freeze_processes noticing. Full synchronization of OOM and freezer is,
      however, too heavy weight for this highly unlikely case.
      
      Introduce and check oom_kills counter which gets incremented early when
      the allocator enters __alloc_pages_may_oom path and only check all the
      tasks if the counter changes during the freezing attempt. The counter
      is updated so early to reduce the race window since allocator checked
      oom_killer_disabled which is set by PM-freezing code. A false positive
      will push the PM-freezer into a slow path but that is not a big deal.
      
      Changes since v1
      - push the re-check loop out of freeze_processes into
        check_frozen_processes and invert the condition to make the code more
        readable as per Rafael
      
      Fixes: f660daac (oom: thaw threads if oom killed thread is frozen before deferring)
      Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5695be14
  17. 01 10月, 2014 4 次提交
    • J
      PM / hibernate: Iterate over set bits instead of PFNs in swsusp_free() · fdd64ed5
      Joerg Roedel 提交于
      The existing implementation of swsusp_free iterates over all
      pfns in the system and checks every bit in the two memory
      bitmaps.
      
      This doesn't scale very well with large numbers of pfns,
      especially when the bitmaps are not populated very densly.
      Change the algorithm to iterate over the set bits in the
      bitmaps instead to make it scale better in large memory
      configurations.
      
      Also add a memory_bm_clear_current() helper function that
      clears the bit for the last position returned from the
      memory bitmap.
      
      This new version adds a !NULL check for the memory bitmaps
      before they are walked. Not doing so causes a kernel crash
      when the bitmaps are NULL.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      fdd64ed5
    • R
      ACPI / sleep: Rework the handling of ACPI GPE wakeup from suspend-to-idle · a8d46b9e
      Rafael J. Wysocki 提交于
      The ACPI GPE wakeup from suspend-to-idle is currently based on using
      the IRQF_NO_SUSPEND flag for the ACPI SCI, but that is problematic
      for a couple of reasons.  First, in principle the ACPI SCI may be
      shared and IRQF_NO_SUSPEND does not really work well with shared
      interrupts.  Second, it may require the ACPI subsystem to special-case
      the handling of device notifications depending on whether or not
      they are received during suspend-to-idle in some places which would
      lead to fragile code.  Finally, it's better the handle ACPI wakeup
      interrupts consistently with wakeup interrupts from other sources.
      
      For this reason, remove the IRQF_NO_SUSPEND flag from the ACPI SCI
      and use enable_irq_wake()/disable_irq_wake() with it instead, which
      requires two additional platform hooks to be added to struct
      platform_freeze_ops.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      a8d46b9e
    • R
      PM / sleep: Rename platform suspend/resume functions in suspend.c · ebc3e41e
      Rafael J. Wysocki 提交于
      Rename several local functions related to platform handling during
      system suspend resume in suspend.c so that their names better
      reflect their roles.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      ebc3e41e
    • R
      PM / sleep: Export dpm_suspend_late/noirq() and dpm_resume_early/noirq() · 2a8a8ce6
      Rafael J. Wysocki 提交于
      Subsequent change sets will add platform-related operations between
      dpm_suspend_late() and dpm_suspend_noirq() as well as between
      dpm_resume_noirq() and dpm_resume_early() in suspend_enter(), so
      export these functions for suspend_enter() to be able to call them
      separately and split the invocations of dpm_suspend_end() and
      dpm_resume_start() in there accordingly.
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      2a8a8ce6
  18. 25 9月, 2014 2 次提交
    • T
      PM / QoS: Add PM_QOS_MEMORY_BANDWIDTH class · 7990da71
      Tomeu Vizoso 提交于
      Also adds a class type PM_QOS_SUM that aggregates the values by summing them.
      
      It can be used by memory controllers to calculate the optimum clock frequency
      based on the bandwidth needs of the different memory clients.
      Signed-off-by: NTomeu Vizoso <tomeu.vizoso@collabora.com>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      7990da71
    • R
      Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()" · 5c4dd348
      Rafael J. Wysocki 提交于
      Revert commit 6efde38f (PM / Hibernate: Iterate over set bits
      instead of PFNs in swsusp_free()) that introduced a NULL pointer
      dereference during system resume from hibernation:
      
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
      PGD b39c2067 PUD b39c1067 PMD 0
      Oops: 0000 [#1] SMP
      Modules linked in: <irrelevant list of modules>
      CPU: 1 PID: 4898 Comm: s2disk Tainted: G         C     3.17-rc5-amd64 #1 Debian 3.17~rc5-1~exp1
      Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
      task: ffff88023155ea40 ti: ffff8800b3b14000 task.ti: ffff8800b3b14000
      RIP: 0010:[<ffffffff810a8cc1>]  [<ffffffff810a8cc1>]
      swsusp_free+0x21/0x190
      RSP: 0018:ffff8800b3b17ea8  EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8800b39bab00 RCX: 0000000000000001
      RDX: ffff8800b39bab10 RSI: ffff8800b39bab00 RDI: 0000000000000000
      RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8800b39bab10 R11: 0000000000000246 R12: ffffea0000000000
      R13: ffff880232f485a0 R14: ffff88023ac27cd8 R15: ffff880232927590
      FS:  00007f406d83b700(0000) GS:ffff88023bc80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 00000000b3a62000 CR4: 00000000000007e0
      Stack:
       ffff8800b39bab00 0000000000000010 ffff880232927590 ffffffff810acb4a
       ffff8800b39bab00 ffffffff811a955a ffff8800b39bab10 0000000000000000
       ffff88023155f098 ffffffff81a6b8c0 ffff88023155ea40 0000000000000007
      Call Trace:
       [<ffffffff810acb4a>] ? snapshot_release+0x2a/0xb0
       [<ffffffff811a955a>] ? __fput+0xca/0x1d0
       [<ffffffff81080627>] ? task_work_run+0x97/0xd0
       [<ffffffff81012d89>] ? do_notify_resume+0x69/0xa0
       [<ffffffff8151452a>] ? int_signal+0x12/0x17
      Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 48 8b 05 ba 62 9c 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d a1 62 9c 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40
      RIP  [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
       RSP <ffff8800b3b17ea8>
      CR2: 0000000000000000
      ---[ end trace f02be86a1ec0cccb ]---
      
      due to forbidden_pages_map being NULL in swsusp_free().
      
      Fixes: 6efde38f "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
      Reported-by: NBjørn Mork <bjorn@mork.no>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      5c4dd348
  19. 22 9月, 2014 3 次提交
  20. 09 9月, 2014 2 次提交
  21. 03 9月, 2014 1 次提交
  22. 01 9月, 2014 1 次提交
  23. 07 8月, 2014 1 次提交
    • L
      PM / hibernate: avoid unsafe pages in e820 reserved regions · 84c91b7a
      Lee, Chun-Yi 提交于
      When the machine doesn't well handle the e820 persistent when hibernate
      resuming, then it may cause page fault when writing image to snapshot
      buffer:
      
      [   17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
      [   17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
      [   17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
      [   17.933469] Oops: 0002 [#1] SMP
      ...
      
      The ffff880069d4f000 page is in e820 reserved region of resume boot
      kernel:
      
      [    0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
      ...
      [    0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
      
      So snapshot.c mark the pfn to forbidden pages map. But, this
      page is also in the memory bitmap in snapshot image because it's an
      original page used by image kernel, so it will also mark as an
      unsafe(free) page in prepare_image().
      
      That means the page in e820 when resuming mark as "forbidden" and
      "free", it causes get_buffer() treat it as an allocated unsafe page.
      Then snapshot_write_next() return this page to load_image, load_image
      writing content to this address, but this page didn't really allocated
      . So, we got page fault.
      
      Although the root cause is from BIOS, I think aggressive check and
      significant message in kernel will better then a page fault for
      issue tracking, especially when serial console unavailable.
      
      This patch adds code in mark_unsafe_pages() for check does free pages in
      nosave region. If so, then it print message and return fault to stop whole
      S4 resume process:
      
      [    8.166004] PM: Image loading progress:   0%
      [    8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
      [    8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
      [    8.926633] PM: Error -14 resuming
      [    8.933534] PM: Failed to load hibernation image, recovering.
      Reviewed-by: NTakashi Iwai <tiwai@suse.de>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NLee, Chun-Yi <jlee@suse.com>
      [rjw: Subject]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      84c91b7a
  24. 29 7月, 2014 6 次提交