1. 22 11月, 2011 13 次提交
    • T
      freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE · a3201227
      Tejun Heo 提交于
      Using TIF_FREEZE for freezing worked when there was only single
      freezing condition (the PM one); however, now there is also the
      cgroup_freezer and single bit flag is getting clumsy.
      thaw_processes() is already testing whether cgroup freezing in in
      effect to avoid thawing tasks which were frozen by both PM and cgroup
      freezers.
      
      This is racy (nothing prevents race against cgroup freezing) and
      fragile.  A much simpler way is to test actual freeze conditions from
      freezing() - ie. directly test whether PM or cgroup freezing is in
      effect.
      
      This patch adds variables to indicate whether and what type of
      freezing conditions are in effect and reimplements freezing() such
      that it directly tests whether any of the two freezing conditions is
      active and the task should freeze.  On fast path, freezing() is still
      very cheap - it only tests system_freezing_cnt.
      
      This makes the clumsy dancing aroung TIF_FREEZE unnecessary and
      freeze/thaw operations more usual - updating state variables for the
      new state and nudging target tasks so that they notice the new state
      and comply.  As long as the nudging happens after state update, it's
      race-free.
      
      * This allows use of freezing() in freeze_task().  Replace the open
        coded tests with freezing().
      
      * p != current test is added to warning printing conditions in
        try_to_freeze_tasks() failure path.  This is necessary as freezing()
        is now true for the task which initiated freezing too.
      
      -v2: Oleg pointed out that re-freezing FROZEN cgroup could increment
           system_freezing_cnt.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: Paul Menage <paul@paulmenage.org>  (for the cgroup portions)
      a3201227
    • T
      cgroup_freezer: prepare for removal of TIF_FREEZE · 22b4e111
      Tejun Heo 提交于
      TIF_FREEZE will be removed soon and freezing() will directly test
      whether any freezing condition is in effect.  Make the following
      changes in preparation.
      
      * Rename cgroup_freezing_or_frozen() to cgroup_freezing() and make it
        return bool.
      
      * Make cgroup_freezing() access task_freezer() under rcu read lock
        instead of task_lock().  This makes the state dereferencing racy
        against task moving to another cgroup; however, it was already racy
        without this change as ->state dereference wasn't synchronized.
        This will be later dealt with using attach hooks.
      
      * freezer->state is now set before trying to push tasks into the
        target state.
      
      -v2: Oleg pointed out that freeze_change_state() was setting
           freeze->state incorrectly to CGROUP_FROZEN instead of
           CGROUP_FREEZING.  Fixed.
      
      -v3: Matt pointed out that setting CGROUP_FROZEN used to always invoke
           try_to_freeze_cgroup() regardless of the current state.  Patch
           updated such that the actual freeze/thaw operations are always
           performed on invocation.  This shouldn't make any difference
           unless something is broken.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPaul Menage <paul@paulmenage.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      22b4e111
    • T
      freezer: clean up freeze_processes() failure path · 03afed8b
      Tejun Heo 提交于
      freeze_processes() failure path is rather messy.  Freezing is canceled
      for workqueues and tasks which aren't frozen yet but frozen tasks are
      left alone and should be thawed by the caller and of course some
      callers (xen and kexec) didn't do it.
      
      This patch updates __thaw_task() to handle cancelation correctly and
      makes freeze_processes() and freeze_kernel_threads() call
      thaw_processes() on failure instead so that the system is fully thawed
      on failure.  Unnecessary [suspend_]thaw_processes() calls are removed
      from kernel/power/hibernate.c, suspend.c and user.c.
      
      While at it, restructure error checking if clause in suspend_prepare()
      to be less weird.
      
      -v2: Srivatsa spotted missing removal of suspend_thaw_processes() in
           suspend_prepare() and error in commit message.  Updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      03afed8b
    • T
      freezer: kill PF_FREEZING · 376fede8
      Tejun Heo 提交于
      With the previous changes, there's no meaningful difference between
      PF_FREEZING and PF_FROZEN.  Remove PF_FREEZING and use PF_FROZEN
      instead in task_contributes_to_load().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      376fede8
    • T
      freezer: test freezable conditions while holding freezer_lock · 85f1d476
      Tejun Heo 提交于
      try_to_freeze_tasks() and thaw_processes() use freezable() and
      frozen() as preliminary tests before initiating operations on a task.
      These are done without any synchronization and hinder with
      synchronization cleanup without any real performance benefits.
      
      In try_to_freeze_tasks(), open code self test and move PF_NOFREEZE and
      frozen() tests inside freezer_lock in freeze_task().
      
      thaw_processes() can simply drop freezable() test as frozen() test in
      __thaw_task() is enough.
      
      Note: This used to be a part of larger patch to fix set_freezable()
            race.  Separated out to satisfy ordering among dependent fixes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      85f1d476
    • T
      freezer: make freezing indicate freeze condition in effect · 6907483b
      Tejun Heo 提交于
      Currently freezing (TIF_FREEZE) and frozen (PF_FROZEN) states are
      interlocked - freezing is set to request freeze and when the task
      actually freezes, it clears freezing and sets frozen.
      
      This interlocking makes things more complex than necessary - freezing
      doesn't mean there's freezing condition in effect and frozen doesn't
      match the task actually entering and leaving frozen state (it's
      cleared by the thawing task).
      
      This patch makes freezing indicate that freeze condition is in effect.
      A task enters and stays frozen if freezing.  This makes PF_FROZEN
      manipulation done only by the task itself and prevents wakeup from
      __thaw_task() leaking outside of refrigerator.
      
      The only place which needs to tell freezing && !frozen is
      try_to_freeze_task() to whine about tasks which don't enter frozen.
      It's updated to test the condition explicitly.
      
      With the change, frozen() state my linger after __thaw_task() until
      the task wakes up and exits fridge.  This can trigger BUG_ON() in
      update_if_frozen().  Work it around by testing freezing() && frozen()
      instead of frozen().
      
      -v2: Oleg pointed out missing re-check of freezing() when trying to
           clear FROZEN and possible spurious BUG_ON() trigger in
           update_if_frozen().  Both fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Menage <paul@paulmenage.org>
      6907483b
    • T
      freezer: use dedicated lock instead of task_lock() + memory barrier · 0c9af092
      Tejun Heo 提交于
      Freezer synchronization is needlessly complicated - it's by no means a
      hot path and the priority is staying unintrusive and safe.  This patch
      makes it simply use a dedicated lock instead of piggy-backing on
      task_lock() and playing with memory barriers.
      
      On the failure path of try_to_freeze_tasks(), locking is moved from it
      to cancel_freezing().  This makes the frozen() test racy but the race
      here is a non-issue as the warning is printed for tasks which failed
      to enter frozen for 20 seconds and race on PF_FROZEN at the last
      moment doesn't change anything.
      
      This simplifies freezer implementation and eases further changes
      including some race fixes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      0c9af092
    • T
      freezer: don't distinguish nosig tasks on thaw · 6cd8dedc
      Tejun Heo 提交于
      There's no point in thawing nosig tasks before others.  There's no
      ordering requirement between the two groups on thaw, which the staged
      thawing can't guarantee anyway.  Simplify thaw_processes() by removing
      the distinction and collapsing thaw_tasks() into thaw_processes().
      This will help further updates to freezer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      6cd8dedc
    • T
      freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks · a585042f
      Tejun Heo 提交于
      clear_freeze_flag() in exit_mm() is racy.  Freezing can start
      afterwards.  Remove it.  Skipping freezer for exiting task will be
      properly implemented later.
      
      Also, freezable() was testing exit_state directly to make system
      freezer ignore dead tasks.  Let the exiting task set PF_NOFREEZE after
      entering TASK_DEAD instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      a585042f
    • T
      freezer: rename thaw_process() to __thaw_task() and simplify the implementation · a5be2d0d
      Tejun Heo 提交于
      thaw_process() now has only internal users - system and cgroup
      freezers.  Remove the unnecessary return value, rename, unexport and
      collapse __thaw_process() into it.  This will help further updates to
      the freezer code.
      
      -v3: oom_kill grew a use of thaw_process() while this patch was
           pending.  Convert it to use __thaw_task() for now.  In the longer
           term, this should be handled by allowing tasks to die if killed
           even if it's frozen.
      
      -v2: minor style update as suggested by Matt.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      a5be2d0d
    • T
      freezer: implement and use kthread_freezable_should_stop() · 8a32c441
      Tejun Heo 提交于
      Writeback and thinkpad_acpi have been using thaw_process() to prevent
      deadlock between the freezer and kthread_stop(); unfortunately, this
      is inherently racy - nothing prevents freezing from happening between
      thaw_process() and kthread_stop().
      
      This patch implements kthread_freezable_should_stop() which enters
      refrigerator if necessary but is guaranteed to return if
      kthread_stop() is invoked.  Both thaw_process() users are converted to
      use the new function.
      
      Note that this deadlock condition exists for many of freezable
      kthreads.  They need to be converted to use the new should_stop or
      freezable workqueue.
      
      Tested with synthetic test case.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NHenrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      8a32c441
    • T
      freezer: unexport refrigerator() and update try_to_freeze() slightly · a0acae0e
      Tejun Heo 提交于
      There is no reason to export two functions for entering the
      refrigerator.  Calling refrigerator() instead of try_to_freeze()
      doesn't save anything noticeable or removes any race condition.
      
      * Rename refrigerator() to __refrigerator() and make it return bool
        indicating whether it scheduled out for freezing.
      
      * Update try_to_freeze() to return bool and relay the return value of
        __refrigerator() if freezing().
      
      * Convert all refrigerator() users to try_to_freeze().
      
      * Update documentation accordingly.
      
      * While at it, add might_sleep() to try_to_freeze().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Samuel Ortiz <samuel@sortiz.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Christoph Hellwig <hch@infradead.org>
      a0acae0e
    • T
      freezer: fix current->state restoration race in refrigerator() · 50fb4f7f
      Tejun Heo 提交于
      refrigerator() saves current->state before entering frozen state and
      restores it before returning using __set_current_state(); however,
      this is racy, for example, please consider the following sequence.
      
      	set_current_state(TASK_INTERRUPTIBLE);
      	try_to_freeze();
      	if (kthread_should_stop())
      		break;
      	schedule();
      
      If kthread_stop() races with ->state restoration, the restoration can
      restore ->state to TASK_INTERRUPTIBLE after kthread_stop() sets it to
      TASK_RUNNING but kthread_should_stop() may still see zero
      ->should_stop because there's no memory barrier between restoring
      TASK_INTERRUPTIBLE and kthread_should_stop() test.
      
      This isn't restricted to kthread_should_stop().  current->state is
      often used in memory barrier based synchronization and silently
      restoring it w/o mb breaks them.
      
      Use set_current_state() instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      50fb4f7f
  2. 19 11月, 2011 2 次提交
  3. 08 11月, 2011 1 次提交
  4. 07 11月, 2011 2 次提交
  5. 05 11月, 2011 3 次提交
    • T
      PM / Freezer: Revert 27920651 "PM / Freezer: Make fake_signal_wake_up() wake... · d6cc7685
      Tejun Heo 提交于
      PM / Freezer: Revert 27920651 "PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too"
      
      Commit 27920651 "PM / Freezer: Make fake_signal_wake_up() wake
      TASK_KILLABLE tasks too" updated fake_signal_wake_up() used by freezer
      to wake up KILLABLE tasks.  Sending unsolicited wakeups to tasks in
      killable sleep is dangerous as there are code paths which depend on
      tasks not waking up spuriously from KILLABLE sleep.
      
      For example. sys_read() or page can sleep in TASK_KILLABLE assuming
      that wait/down/whatever _killable can only fail if we can not return
      to the usermode.  TASK_TRACED is another obvious example.
      
      The previous patch updated wait_event_freezekillable() such that it
      doesn't depend on the spurious wakeup.  This patch reverts the
      offending commit.
      
      Note that the spurious KILLABLE wakeup had other implicit effects in
      KILLABLE sleeps in nfs and cifs and those will need further updates to
      regain freezekillable behavior.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      d6cc7685
    • G
      PM / QoS: Remove redundant check · 6513fd69
      Guennadi Liakhovetski 提交于
      Remove an "if" check, that repeats an equivalent one 6 lines above.
      Signed-off-by: NGuennadi Liakhovetski <g.liakhovetski@gmx.de>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      6513fd69
    • S
      PM / Sleep: Fix race between CPU hotplug and freezer · 79cfbdfa
      Srivatsa S. Bhat 提交于
      The CPU hotplug notifications sent out by the _cpu_up() and _cpu_down()
      functions depend on the value of the 'tasks_frozen' argument passed to them
      (which indicates whether tasks have been frozen or not).
      (Examples for such CPU hotplug notifications: CPU_ONLINE, CPU_ONLINE_FROZEN,
      CPU_DEAD, CPU_DEAD_FROZEN).
      
      Thus, it is essential that while the callbacks for those notifications are
      running, the state of the system with respect to the tasks being frozen or
      not remains unchanged, *throughout that duration*. Hence there is a need for
      synchronizing the CPU hotplug code with the freezer subsystem.
      
      Since the freezer is involved only in the Suspend/Hibernate call paths, this
      patch hooks the CPU hotplug code to the suspend/hibernate notifiers
      PM_[SUSPEND|HIBERNATE]_PREPARE and PM_POST_[SUSPEND|HIBERNATE] to prevent
      the race between CPU hotplug and freezer, thus ensuring that CPU hotplug
      notifications will always be run with the state of the system really being
      what the notifications indicate, _throughout_ their execution time.
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      79cfbdfa
  6. 03 11月, 2011 7 次提交
  7. 01 11月, 2011 12 次提交
    • A
      kgdb: follow rename pack_hex_byte() to hex_byte_pack() · 50e1499f
      Andy Shevchenko 提交于
      There is no functional change.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50e1499f
    • W
      printk: remove bounds checking for log_prefix · ae29bc92
      William Douglas 提交于
      Currently log_prefix is testing that the first character of the log level
      and facility is less than '0' and greater than '9' (which is always
      false).
      
      Since the code being updated works because strtoul bombs out (endp isn't
      updated) and 0 is returned anyway just remove the check and don't change
      the behavior of the function.
      Signed-off-by: NWilliam Douglas <william.douglas@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae29bc92
    • W
      printk: fix bounds checking for log_prefix · 48e41899
      William Douglas 提交于
      Currently log_prefix is testing that the first character of the log level
      and facility is less than '0' and greater than '9' (which is always
      false).  It should be testing to see if the character less than '0' or
      greater than '9' instead.  This patch makes that change.
      
      The code being changed worked because strtoul bombs out (endp isn't
      updated) and 0 is returned anyway.
      Signed-off-by: NWilliam Douglas <william.douglas@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48e41899
    • Y
      printk: add console_suspend module parameter · 134620f7
      Yanmin Zhang 提交于
      We are enabling some power features on medfield.  To test suspend-2-RAM
      conveniently, we need turn on/off console_suspend_enabled frequently.
      
      Add a module parameter, so users could change it by:
      /sys/module/printk/parameters/console_suspend
      Signed-off-by: NYanmin Zhang <yanmin_zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      134620f7
    • Y
      printk: add module parameter ignore_loglevel to control ignore_loglevel · 0eca6b7c
      Yanmin Zhang 提交于
      We are enabling some power features on medfield.  To test suspend-2-RAM
      conveniently, we need turn on/off ignore_loglevel frequently without
      rebooting.
      
      Add a module parameter, so users can change it by:
      /sys/module/printk/parameters/ignore_loglevel
      Signed-off-by: NYanmin Zhang <yanmin.zhang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0eca6b7c
    • D
      kernel/sysctl.c: add cap_last_cap to /proc/sys/kernel · 73efc039
      Dan Ballard 提交于
      Userspace needs to know the highest valid capability of the running
      kernel, which right now cannot reliably be retrieved from the header files
      only.  The fact that this value cannot be determined properly right now
      creates various problems for libraries compiled on newer header files
      which are run on older kernels.  They assume capabilities are available
      which actually aren't.  libcap-ng is one example.  And we ran into the
      same problem with systemd too.
      
      Now the capability is exported in /proc/sys/kernel/cap_last_cap.
      
      [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich]
      Signed-off-by: NDan Ballard <dan@mindstab.net>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Ulrich Drepper <drepper@akkadia.org>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73efc039
    • V
      watchdog: move watchdog_*_all_cpus under CONFIG_SYSCTL · 4ff81951
      Vasily Averin 提交于
      Fix compilation warnings for CONFIG_SYSCTL=n:
      
      fixed compilation warnings in case of disabled CONFIG_SYSCTL
      kernel/watchdog.c:483:13: warning: `watchdog_enable_all_cpus' defined but not used
      kernel/watchdog.c:500:13: warning: `watchdog_disable_all_cpus' defined but not used
      
      these functions are static and are used only in sysctl handler, so move
      them inside #ifdef CONFIG_SYSCTL too
      Signed-off-by: NVasily Averin <vvs@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ff81951
    • J
      stop_machine: make stop_machine safe and efficient to call early · f445027e
      Jeremy Fitzhardinge 提交于
      Make stop_machine() safe to call early in boot, before SMP has been set
      up, by simply calling the callback function directly if there's only one
      CPU online.
      
      [ Fixes from AKPM:
         - add comment
         - local_irq_flags, not save_flags
         - also call hard_irq_disable() for systems which need it
      
        Tejun suggested using an explicit flag rather than just looking at
        the online cpu count. ]
      
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: NTejun Heo <htejun@gmail.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f445027e
    • C
      mm: distinguish between mlocked and pinned pages · bc3e53f6
      Christoph Lameter 提交于
      Some kernel components pin user space memory (infiniband and perf) (by
      increasing the page count) and account that memory as "mlocked".
      
      The difference between mlocking and pinning is:
      
      A. mlocked pages are marked with PG_mlocked and are exempt from
         swapping. Page migration may move them around though.
         They are kept on a special LRU list.
      
      B. Pinned pages cannot be moved because something needs to
         directly access physical memory. They may not be on any
         LRU list.
      
      I recently saw an mlockalled process where mm->locked_vm became
      bigger than the virtual size of the process (!) because some
      memory was accounted for twice:
      
      Once when the page was mlocked and once when the Infiniband
      layer increased the refcount because it needt to pin the RDMA
      memory.
      
      This patch introduces a separate counter for pinned pages and
      accounts them seperately.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: Mike Marciniszyn <infinipath@qlogic.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc3e53f6
    • D
      oom: remove oom_disable_count · c9f01245
      David Rientjes 提交于
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9f01245
    • C
      Cross Memory Attach · fcf63409
      Christopher Yeoh 提交于
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: NChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
    • P
      irq: don't put module.h into irq.h for tracking irqgen modules. · ec53cf23
      Paul Gortmaker 提交于
      Recent commit "irq: Track the  owner of irq descriptor" in
      commit ID b6873807 placed module.h into linux/irq.h
      but we are trying to limit module.h inclusion to just C files
      that really need it, due to its size and number of children
      includes.  This targets just reversing that include.
      
      Add in the basic "struct module" since that is all we really need
      to ensure things compile.  In theory, b6873807 should have added the
      module.h include to the irqdesc.h header as well, but the implicit
      module.h everywhere presence masked this from showing up.  So give
      it the "struct module" as well.
      
      As for the C files, irqdesc.c is only using THIS_MODULE, so it
      does not need module.h - give it export.h instead.  The C file
      irq/manage.c is now (as of b6873807) using try_module_get and
      module_put and so it needs module.h (which it already has).
      
      Also convert the irq_alloc_descs variants to macros, since all
      they really do is is call the __irq_alloc_descs primitive.
      This avoids including export.h and no debug info is lost.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      ec53cf23