1. 09 1月, 2006 14 次提交
    • P
      [PATCH] cpuset: use rcu directly optimization · 6b9c2603
      Paul Jackson 提交于
      Optimize the cpuset impact on page allocation, the most performance critical
      cpuset hook in the kernel.
      
      On each page allocation, the cpuset hook needs to check for a possible change
      in the current tasks cpuset.  It can now handle the common case, of no change,
      without taking any spinlock or semaphore, thanks to RCU.
      
      Convert a spinlock on the current task to an rcu_read_lock(), saving
      approximately a memory barrier and an atomic op, depending on architecture.
      
      This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
      write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
      freeing up a detached cpuset until after any critical sections referencing
      that pointer.
      
      Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6b9c2603
    • P
      [PATCH] cpuset: remove test for null cpuset from alloc code path · c417f024
      Paul Jackson 提交于
      Remove a couple of more lines of code from the cpuset hooks in the page
      allocation code path.
      
      There was a check for a NULL cpuset pointer in the routine
      cpuset_update_task_memory_state() that was only needed during system boot,
      after the memory subsystem was initialized, before the cpuset subsystem was
      initialized, to catch a NULL task->cpuset pointer.
      
      Add a cpuset_init_early() routine, just before the mem_init() call in
      init/main.c, that sets up just enough of the init tasks cpuset structure to
      render cpuset_update_task_memory_state() calls harmless.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c417f024
    • P
      [PATCH] cpuset: migrate all tasks in cpuset at once · 04c19fa6
      Paul Jackson 提交于
      Given the mechanism in the previous patch to handle rebinding the per-vma
      mempolicies of all tasks in a cpuset that changes its memory placement, it is
      now easier to handle the page migration requirements of such tasks at the same
      time.
      
      The previous code didn't actually attempt to migrate the pages of the tasks in
      a cpuset whose memory placement changed until the next time each such task
      tried to allocate memory.  This was undesirable, as users invoking memory page
      migration exected to happen when the placement changed, not some unspecified
      time later when the task needed more memory.
      
      It is now trivial to handle the page migration at the same time as the per-vma
      rebinding is done.
      
      The routine cpuset.c:update_nodemask(), which handles changing a cpusets
      memory placement ('mems') now checks for the special case of being asked to
      write a placement that is the same as before.  It was harmless enough before
      to just recompute everything again, even though nothing had changed.  But page
      migration is a heavy weight operation - moving pages about.  So now it is
      worth avoiding that if asked to move a cpuset to its current location.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      04c19fa6
    • P
      [PATCH] cpuset: rebind vma mempolicies fix · 4225399a
      Paul Jackson 提交于
      Fix more of longstanding bug in cpuset/mempolicy interaction.
      
      NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
      to just the Memory Nodes allowed by that cpuset.  The kernel maintains
      internal state for each mempolicy, tracking what nodes are used for the
      MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
      
      When a tasks cpuset memory placement changes, whether because the cpuset
      changed, or because the task was attached to a different cpuset, then the
      tasks mempolicies have to be rebound to the new cpuset placement, so as to
      preserve the cpuset-relative numbering of the nodes in that policy.
      
      An earlier fix handled such mempolicy rebinding for mempolicies attached to a
      task.
      
      This fix rebinds mempolicies attached to vma's (address ranges in a tasks
      address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
      updating vma's, the rebinding of vma mempolicies has to be done when the
      cpuset memory placement is changed, at which time mmap_sem can be safely
      acquired.  The tasks mempolicy is rebound later, when the task next attempts
      to allocate memory and notices that its task->cpuset_mems_generation is
      out-of-date with its cpusets mems_generation.
      
      Because walking the tasklist to find all tasks attached to a changing cpuset
      requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
      affected tasks while doing the tasklist scan.  In general, one cannot acquire
      a semaphore (which can sleep) while already holding a spinlock (such as
      tasklist_lock).  So a list of mm references has to be built up during the
      tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
      acquired, and the vma's in that mm rebound.
      
      Once the tasklist lock is dropped, affected tasks may fork new tasks, before
      their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
      point to the cpuset being rebound (there can only be one; cpuset modifications
      are done under a global 'manage_sem' semaphore), and the mpol_copy code that
      is used to copy a tasks mempolicies during fork catches such forking tasks,
      and ensures their children are also rebound.
      
      When a task is moved to a different cpuset, it is easier, as there is only one
      task involved.  It's mm->vma's are scanned, using the same
      mpol_rebind_policy() as used above.
      
      It may happen that both the mpol_copy hook and the update done via the
      tasklist scan update the same mm twice.  This is ok, as the mempolicies of
      each vma in an mm keep track of what mems_allowed they are relative to, and
      safely no-op a second request to rebind to the same nodes.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4225399a
    • P
      [PATCH] cpuset: number_of_cpusets optimization · 202f72d5
      Paul Jackson 提交于
      Easy little optimization hack to avoid actually having to call
      cpuset_zone_allowed() and check mems_allowed, in the main page allocation
      routine, __alloc_pages().  This saves several CPU cycles per page allocation
      on systems not using cpusets.
      
      A counter is updated each time a cpuset is created or removed, and whenever
      there is only one cpuset in the system, it must be the root cpuset, which
      contains all CPUs and all Memory Nodes.  In that case, when the counter is
      one, all allocations are allowed.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      202f72d5
    • P
      [PATCH] cpuset: numa_policy_rebind cleanup · 74cb2155
      Paul Jackson 提交于
      Cleanup, reorganize and make more robust the mempolicy.c code to rebind
      mempolicies relative to the containing cpuset after a tasks memory placement
      changes.
      
      The real motivator for this cleanup patch is to lay more groundwork for the
      upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
      after the containing cpuset memory placement changes.
      
      NUMA mempolicies are constrained by the cpuset their task is a member of.
      When either (1) a task is moved to a different cpuset, or (2) the 'mems'
      mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
      node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
      be recalculated, relative to their new cpuset placement.
      
      The old code used an unreliable method of determining what was the old
      mems_allowed constraining the mempolicy.  It just looked at the tasks
      mems_allowed value.  This sort of worked with the present code, that just
      rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
      referring to the old nodes.  But in an upcoming patch, the vma mempolicies
      will be rebound as well.  Then the order in which the various task and vma
      mempolicies are updated will no longer be deterministic, and one can no longer
      count on the task->mems_allowed holding the old value for as long as needed.
      It's not even clear if the current code was guaranteed to work reliably for
      task mempolicies.
      
      So I added a mems_allowed field to each mempolicy, stating exactly what
      mems_allowed the policy is relative to, and updated synchronously and reliably
      anytime that the mempolicy is rebound.
      
      Also removed a useless wrapper routine, numa_policy_rebind(), and had its
      caller, cpuset_update_task_memory_state(), call directly to the rewritten
      policy_rebind() routine, and made that rebind routine extern instead of
      static, and added a "mpol_" prefix to its name, making it
      mpol_rebind_policy().
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      74cb2155
    • P
      [PATCH] cpuset: implement cpuset_mems_allowed · 909d75a3
      Paul Jackson 提交于
      Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
      needed, to obtain the mems_allowed vector of a cpuset, and replaced the
      workaround in sys_migrate_pages() to call this new method.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      909d75a3
    • P
      [PATCH] cpuset: combine refresh_mems and update_mems · cf2a473c
      Paul Jackson 提交于
      The important code paths through alloc_pages_current() and alloc_page_vma(),
      by which most kernel page allocations go, both called
      cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
      -Both- of these latter two routines did a tasklock, got the tasks cpuset
      pointer, and checked for out of date cpuset->mems_generation.
      
      That was a silly duplication of code and waste of CPU cycles on an important
      code path.
      
      Consolidated those two routines into a single routine, called
      cpuset_update_task_memory_state(), since it updates more than just
      mems_allowed.
      
      Changed all callers of either routine to call the new consolidated routine.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cf2a473c
    • P
      [PATCH] cpuset: fork hook fix · b4b26418
      Paul Jackson 提交于
      Fix obscure, never seen in real life, cpuset fork race.  The cpuset_fork()
      call in fork.c was setting up the correct task->cpuset pointer after the
      tasklist_lock was dropped, which briefly exposed the newly forked process with
      an unsafe (copied from parent without locks or usage counter increment) cpuset
      pointer.
      
      In theory, that exposed cpuset pointer could have been pointing at a cpuset
      that was already freed and removed, and in theory another task that had been
      sitting on the tasklist_lock waiting to scan the task list could have raced
      down the entire tasklist, found our new child at the far end, and dereferenced
      that bogus cpuset pointer.
      
      To fix, setup up the correct cpuset pointer in the new child by calling
      cpuset_fork() before the new task is linked into the tasklist, and with that,
      add a fork failure case, to dereference that cpuset, if the fork fails along
      the way, after cpuset_fork() was called.
      
      Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
      the call to cpuset_exit() from a failed fork would not have PF_EXITING set.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b4b26418
    • P
      [PATCH] cpuset: update_nodemask code reformat · 59dac16f
      Paul Jackson 提交于
      Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
      removing embedded returns and nested if's in favor of goto completion labels.
      This is being done in anticipation of adding more logic to this routine, which
      will favor the goto style structure.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      59dac16f
    • P
      [PATCH] cpuset: minor spacing initializer fixes · c5b2aff8
      Paul Jackson 提交于
      Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
      mark one __read_mostly.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c5b2aff8
    • P
      [PATCH] cpuset: memory pressure meter · 3e0d98b9
      Paul Jackson 提交于
      Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
      that the tasks in a cpuset call try_to_free_pages(), the synchronous
      (direct) memory reclaim code.
      
      This enables batch managers monitoring jobs running in dedicated cpusets to
      efficiently detect what level of memory pressure that job is causing.
      
      This is useful both on tightly managed systems running a wide mix of
      submitted jobs, which may choose to terminate or reprioritize jobs that are
      trying to use more memory than allowed on the nodes assigned them, and with
      tightly coupled, long running, massively parallel scientific computing jobs
      that will dramatically fail to meet required performance goals if they
      start to use more memory than allowed to them.
      
      This patch just provides a very economical way for the batch manager to
      monitor a cpuset for signs of memory pressure.  It's up to the batch
      manager or other user code to decide what to do about it and take action.
      
      ==> Unless this feature is enabled by writing "1" to the special file
          /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
          code of __alloc_pages() for this metric reduces to simply noticing
          that the cpuset_memory_pressure_enabled flag is zero.  So only
          systems that enable this feature will compute the metric.
      
      Why a per-cpuset, running average:
      
          Because this meter is per-cpuset, rather than per-task or mm, the
          system load imposed by a batch scheduler monitoring this metric is
          sharply reduced on large systems, because a scan of the tasklist can be
          avoided on each set of queries.
      
          Because this meter is a running average, instead of an accumulating
          counter, a batch scheduler can detect memory pressure with a single
          read, instead of having to read and accumulate results for a period of
          time.
      
          Because this meter is per-cpuset rather than per-task or mm, the
          batch scheduler can obtain the key information, memory pressure in a
          cpuset, with a single read, rather than having to query and accumulate
          results over all the (dynamically changing) set of tasks in the cpuset.
      
      A per-cpuset simple digital filter (requires a spinlock and 3 words of data
      per-cpuset) is kept, and updated by any task attached to that cpuset, if it
      enters the synchronous (direct) page reclaim code.
      
      A per-cpuset file provides an integer number representing the recent
      (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
      in the cpuset, in units of reclaims attempted per second, times 1000.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e0d98b9
    • P
      [PATCH] cpuset: mempolicy one more nodemask conversion · 5966514d
      Paul Jackson 提交于
      Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
      conversion had left one routine using bitmaps, since it involved a
      corresponding change to kernel/cpuset.c
      
      Fix that interface by replacing with a simple macro that calls nodes_subset(),
      or if !CONFIG_CPUSET, returns (1).
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5966514d
    • P
      [PATCH] cpusets: swap migration interface · 45b07ef3
      Paul Jackson 提交于
      Add a boolean "memory_migrate" to each cpuset, represented by a file
      containing "0" or "1" in each directory below /dev/cpuset.
      
      It defaults to false (file contains "0").  It can be set true by writing
      "1" to the file.
      
      If true, then anytime that a task is attached to the cpuset so marked, the
      pages of that task will be moved to that cpuset, preserving, to the extent
      practical, the cpuset-relative placement of the pages.
      
      Also anytime that a cpuset so marked has its memory placement changed (by
      writing to its "mems" file), the tasks in that cpuset will have their pages
      moved to the cpusets new nodes, preserving, to the extent practical, the
      cpuset-relative placement of the moved pages.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      45b07ef3
  2. 14 11月, 2005 1 次提交
  3. 31 10月, 2005 5 次提交
    • P
      [PATCH] cpusets: automatic numa mempolicy rebinding · 68860ec1
      Paul Jackson 提交于
      This patch automatically updates a tasks NUMA mempolicy when its cpuset
      memory placement changes.  It does so within the context of the task,
      without any need to support low level external mempolicy manipulation.
      
      If a system is not using cpusets, or if running on a system with just the
      root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
      task is moved between cpusets, or a cpusets memory placement is changed
      does the following apply.  Otherwise, the main routine below,
      rebind_policy() is not even called.
      
      When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
      essential role of cpusets is to place jobs (several related tasks) on a set
      of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
      manage a jobs processor placement within its allowed cpuset, and the
      essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
      memory placement within its allowed cpuset.
      
      However, CPU affinity and NUMA memory placement are managed within the
      kernel using absolute system wide numbering, not cpuset relative numbering.
      
      This is ok until a job is migrated to a different cpuset, or what's the
      same, a jobs cpuset is moved to different CPUs and Memory Nodes.
      
      Then the CPU affinity and NUMA memory placement of the tasks in the job
      need to be updated, to preserve their cpuset-relative position.  This can
      be done for CPU affinity using sched_setaffinity() from user code, as one
      task can modify anothers CPU affinity.  This cannot be done from an
      external task for NUMA memory placement, as that can only be modified in
      the context of the task using it.
      
      However, it easy enough to remap a tasks NUMA mempolicy automatically when
      a task is migrated, using the existing cpuset mechanism to trigger a
      refresh of a tasks memory placement after its cpuset has changed.  All that
      is needed is the old and new nodemask, and notice to the task that it needs
      to rebind its mempolicy.  The tasks mems_allowed has the old mask, the
      tasks cpuset has the new mask, and the existing
      cpuset_update_current_mems_allowed() mechanism provides the notice.  The
      bitmap/cpumask/nodemask remap operators provide the cpuset relative
      calculations.
      
      This patch leaves open a couple of issues:
      
       1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
      
          These mempolicies may reference nodes outside of those allowed to
          the current task by its cpuset.  Tasks are migrated as part of jobs,
          which reside on what might be several cpusets in a subtree.  When such
          a job is migrated, all NUMA memory policy references to nodes within
          that cpuset subtree should be translated, and references to any nodes
          outside that subtree should be left untouched.  A future patch will
          provide the cpuset mechanism needed to mark such subtrees.  With that
          patch, we will be able to correctly migrate these other memory policies
          across a job migration.
      
       2) Updating cpuset, affinity and memory policies in user space:
      
          This is harder.  Any placement state stored in user space using
          system-wide numbering will be invalidated across a migration.  More
          work will be required to provide user code with a migration-safe means
          to manage its cpuset relative placement, while preserving the current
          API's that pass system wide numbers, not cpuset relative numbers across
          the kernel-user boundary.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      68860ec1
    • P
      [PATCH] cpusets: simple rename · 18a19cb3
      Paul Jackson 提交于
      Add support for renaming cpusets.  Only allow simple rename of cpuset
      directories in place.  Don't allow moving cpusets elsewhere in hierarchy or
      renaming the special cpuset files in each cpuset directory.
      
      The usefulness of this simple rename became apparent when developing task
      migration facilities.  It allows building a second cpuset hierarchy using
      new names and containing new CPUs and Memory Nodes, moving tasks from the
      old to the new cpusets, removing the old cpusets, and then renaming the new
      cpusets to be just like the old names, so that any knowledge that the tasks
      had of their cpuset names will still be valid.
      
      Leaf node cpusets can be migrated to other CPUs or Memory Nodes by just
      updating their 'cpus' and 'mems' files, but because no cpuset can contain
      CPUs or Nodes not in its parent cpuset, one cannot do this in a cpuset
      hierarchy without first expanding all the non-leaf cpusets to contain the
      union of both the old and new CPUs and Nodes, which would obfuscate the
      one-to-one migration of a task from one cpuset to another required to
      correctly migrate the physical page frames currently allocated to that
      task.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      18a19cb3
    • P
      [PATCH] cpusets: dual semaphore locking overhaul · 053199ed
      Paul Jackson 提交于
      Overhaul cpuset locking.  Replace single semaphore with two semaphores.
      
      The suggestion to use two locks was made by Roman Zippel.
      
      Both locks are global.  Code that wants to modify cpusets must first
      acquire the exclusive manage_sem, which allows them read-only access to
      cpusets, and holds off other would-be modifiers.  Before making actual
      changes, the second semaphore, callback_sem must be acquired as well.  Code
      that needs only to query cpusets must acquire callback_sem, which is also a
      global exclusive lock.
      
      The earlier problems with double tripping are avoided, because it is
      allowed for holders of manage_sem to nest the second callback_sem lock, and
      only callback_sem is needed by code called from within __alloc_pages(),
      where the double tripping had been possible.
      
      This is not quite the same as a normal read/write semaphore, because
      obtaining read-only access with intent to change must hold off other such
      attempts, while allowing read-only access w/o such intention.  Changing
      cpusets involves several related checks and changes, which must be done
      while allowing read-only queries (to avoid the double trip), but while
      ensuring nothing changes (holding off other would be modifiers.)
      
      This overhaul of cpuset locking also makes careful use of task_lock() to
      guard access to the task->cpuset pointer, closing a couple of race
      conditions noticed while reading this code (thanks, Roman).  I've never
      seen these races fail in any use or test.
      
      See further the comments in the code.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      053199ed
    • P
      [PATCH] cpusets: remove depth counted locking hack · 5aa15b5f
      Paul Jackson 提交于
      Remove a rather hackish depth counter on cpuset locking.  The depth counter
      was avoiding a possible double trip on the global cpuset_sem semaphore.  It
      worked, but now an improved version of cpuset locking is available, to come
      in the next patch, using two global semaphores.
      
      This patch reverses "cpuset semaphore depth check deadlock fix"
      
      The kernel still works, even after this patch, except for some rare and
      difficult to reproduce race conditions when agressively creating and
      destroying cpusets marked with the notify_on_release option, on very large
      systems.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5aa15b5f
    • P
      [PATCH] cpuset cleanup · f35f31d7
      Paul Jackson 提交于
      Remove one more useless line from cpuset_common_file_read().
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f35f31d7
  4. 09 10月, 2005 1 次提交
  5. 30 9月, 2005 1 次提交
  6. 28 9月, 2005 1 次提交
  7. 13 9月, 2005 1 次提交
    • P
      [PATCH] cpuset semaphore depth check optimize · b3426599
      Paul Jackson 提交于
      Optimize the deadlock avoidance check on the global cpuset
      semaphore cpuset_sem.  Instead of adding a depth counter to the
      task struct of each task, rather just two words are enough, one
      to store the depth and the other the current cpuset_sem holder.
      
      Thanks to Nikita Danilov for the idea.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      
      [ We may want to change this further, but at least it's now
        a totally internal decision to the cpusets code ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b3426599
  8. 11 9月, 2005 1 次提交
    • P
      [PATCH] cpuset semaphore depth check deadlock fix · 4247bdc6
      Paul Jackson 提交于
      The cpusets-formalize-intermediate-gfp_kernel-containment patch
      has a deadlock problem.
      
      This patch was part of a set of four patches to make more
      extensive use of the cpuset 'mem_exclusive' attribute to
      manage kernel GFP_KERNEL memory allocations and to constrain
      the out-of-memory (oom) killer.
      
      A task that is changing cpusets in particular ways on a system
      when it is very short of free memory could double trip over
      the global cpuset_sem semaphore (get the lock and then deadlock
      trying to get it again).
      
      The second attempt to get cpuset_sem would be in the routine
      cpuset_zone_allowed().  This was discovered by code inspection.
      I can not reproduce the problem except with an artifically
      hacked kernel and a specialized stress test.
      
      In real life you cannot hit this unless you are manipulating
      cpusets, and are very unlikely to hit it unless you are rapidly
      modifying cpusets on a memory tight system.  Even then it would
      be a rare occurence.
      
      If you did hit it, the task double tripping over cpuset_sem
      would deadlock in the kernel, and any other task also trying
      to manipulate cpusets would deadlock there too, on cpuset_sem.
      Your batch manager would be wedged solid (if it was cpuset
      savvy), but classic Unix shells and utilities would work well
      enough to reboot the system.
      
      The unusual condition that led to this bug is that unlike most
      semaphores, cpuset_sem _can_ be acquired while in the page
      allocation code, when __alloc_pages() calls cpuset_zone_allowed.
      So it easy to mistakenly perform the following sequence:
        1) task makes system call to alter a cpuset
        2) take cpuset_sem
        3) try to allocate memory
        4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem
        5) deadlock
      
      The reason that this is not a serious bug for most users
      is that almost all calls to allocate memory don't require
      taking cpuset_sem.  Only some code paths off the beaten
      track require taking cpuset_sem -- which is good.  Taking
      a global semaphore on the main code path for allocating
      memory would not scale well.
      
      This patch fixes this deadlock by wrapping the up() and down()
      calls on cpuset_sem in kernel/cpuset.c with code that tracks
      the nesting depth of the current task on that semaphore, and
      only does the real down() if the task doesn't hold the lock
      already, and only does the real up() if the nesting depth
      (number of unmatched downs) is exactly one.
      
      The previous required use of refresh_mems(), anytime that
      the cpuset_sem semaphore was acquired and the code executed
      while holding that semaphore might try to allocate memory, is
      no longer required.  Two refresh_mems() calls were removed
      thanks to this.  This is a good change, as failing to get
      all the necessary refresh_mems() calls placed was a primary
      source of bugs in this cpuset code.  The only remaining call
      to refresh_mems() is made while doing a memory allocation,
      if certain task memory placement data needs to be updated
      from its cpuset, due to the cpuset having been changed behind
      the tasks back.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4247bdc6
  9. 10 9月, 2005 1 次提交
  10. 08 9月, 2005 3 次提交
    • J
      [PATCH] cpusets: re-enable "dynamic sched domains" · 0811bab2
      John Hawkes 提交于
      Revert the hack introduced last week.
      Signed-off-by: NJohn Hawkes <hawkes@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0811bab2
    • P
      [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset · ef08e3b4
      Paul Jackson 提交于
      Now the real motivation for this cpuset mem_exclusive patch series seems
      trivial.
      
      This patch keeps a task in or under one mem_exclusive cpuset from provoking an
      oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
      interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
      containment, there is little to gain from oom killing a task under a
      non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
      allocation must come from disjoint memory nodes.
      
      This patch enables configuring a system so that a runaway job under one
      mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
      that might be using very high compute and memory resources for a prolonged
      time.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ef08e3b4
    • P
      [PATCH] cpusets: formalize intermediate GFP_KERNEL containment · 9bf2229f
      Paul Jackson 提交于
      This patch makes use of the previously underutilized cpuset flag
      'mem_exclusive' to provide what amounts to another layer of memory placement
      resolution.  With this patch, there are now the following four layers of
      memory placement available:
      
       1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
       2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
       3) The current tasks cpuset (GFP_USER allocations constrained to here), and
       4) Specific node placement, using mbind and set_mempolicy.
      
      These nest - each layer is a subset (same or within) of the previous.
      
      Layer (2) above is new, with this patch.  The call used to check whether a
      zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
      extended to take a gfp_mask argument, and its logic is extended, in the case
      that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
      hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
      placement is allowed.  The definition of GFP_USER, which used to be identical
      to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
      cpuset_gfp_hardwall_flag patch.
      
      GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
      cpuset, so long as any node therein is not too tight on memory, but will
      escape to the larger layer, if need be.
      
      The intended use is to allow something like a batch manager to handle several
      jobs, each job in its own cpuset, but using common kernel memory for caches
      and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
      task in or below one mem_exclusive cpuset should not cause swapping on nodes
      in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
      task in another such cpuset.  Heavy use of kernel memory for i/o caching and
      such by one job should not impact the memory available to jobs in other
      non-overlapping mem_exclusive cpusets.
      
      This patch enables providing hardwall, inescapable cpusets for memory
      allocations of each job, while sharing kernel memory allocations between
      several jobs, in an enclosing mem_exclusive cpuset.
      
      Like Dinakar's patch earlier to enable administering sched domains using the
      cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
      that had previously done nothing much useful other than restrict what cpuset
      configurations were allowed.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9bf2229f
  11. 27 8月, 2005 2 次提交
    • P
      [PATCH] completely disable cpu_exclusive sched domain · 212d6d22
      Paul Jackson 提交于
      At the suggestion of Nick Piggin and Dinakar, totally disable
      the facility to allow cpu_exclusive cpusets to define dynamic
      sched domains in Linux 2.6.13, in order to avoid problems
      first reported by John Hawkes (corrupt sched data structures
      and kernel oops).
      
      This has been built for ppc64, i386, ia64, x86_64, sparc, alpha.
      It has been built, booted and tested for cpuset functionality
      on an SN2 (ia64).
      
      Dinakar or Nick - could you verify that it for sure does avoid
      the problems Hawkes reported.  Hawkes is out of town, and I don't
      have the recipe to reproduce what he found.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      212d6d22
    • P
      [PATCH] undo partial cpu_exclusive sched domain disabling · ca2f3daf
      Paul Jackson 提交于
      The partial disabling of Dinakar's new facility to allow
      cpu_exclusive cpusets to define dynamic sched domains
      doesn't go far enough.  At the suggestion of Nick Piggin
      and Dinakar, let us instead totally disable this facility
      for 2.6.13, in order to avoid problems first reported
      by John Hawkes (corrupt sched data structures and kernel oops).
      
      This patch removes the partial disabling code in 2.6.13-rc7,
      in anticipation of the next patch, which will totally disable
      it instead.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ca2f3daf
  12. 25 8月, 2005 1 次提交
    • P
      [PATCH] cpu_exclusive sched domains build fix · 3725822f
      Paul Jackson 提交于
      As reported by Paul Mackerras <paulus@samba.org>, the previous patch
      "cpu_exclusive sched domains fix" broke the ppc64 build with
      CONFIC_CPUSET, yielding error messages:
      
      kernel/cpuset.c: In function 'update_cpu_domains':
      kernel/cpuset.c:648: error: invalid lvalue in unary '&'
      kernel/cpuset.c:648: error: invalid lvalue in unary '&'
      
      On some arch's, the node_to_cpumask() is a function, returning
      a cpumask_t.  But the for_each_cpu_mask() requires an lvalue mask.
      
      The following patch fixes this build failure by making a copy
      of the cpumask_t on the stack.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3725822f
  13. 24 8月, 2005 1 次提交
    • P
      [PATCH] cpu_exclusive sched domains on partial nodes temp fix · d10689b6
      Paul Jackson 提交于
      This keeps the kernel/cpuset.c routine update_cpu_domains() from
      invoking the sched.c routine partition_sched_domains() if the cpuset in
      question doesn't fall on node boundaries.
      
      I have boot tested this on an SN2, and with the help of a couple of ad
      hoc printk's, determined that it does indeed avoid calling the
      partition_sched_domains() routine on partial nodes.
      
      I did not directly verify that this avoids setting up bogus sched
      domains or avoids the oops that Hawkes saw.
      
      This patch imposes a silent artificial constraint on which cpusets can
      be used to define dynamic sched domains.
      
      This patch should allow proceeding with this new feature in 2.6.13 for
      the configurations in which it is useful (node alligned sched domains)
      while avoiding trying to setup sched domains in the less useful cases
      that can cause the kernel corruption and oops.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NDinakar Guniguntala <dino@in.ibm.com>
      Acked-by: NJohn Hawkes <hawkes@sgi.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d10689b6
  14. 10 8月, 2005 1 次提交
    • P
      [PATCH] cpuset release ABBA deadlock fix · 3077a260
      Paul Jackson 提交于
      Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.
      
      For a particular usage pattern, creating and destroying cpusets fairly
      frequently using notify_on_release, on a very large system, this deadlock
      can be seen every few days.  If you are not using the cpuset
      notify_on_release feature, you will never see this deadlock.
      
      The existing code, on task exit (or cpuset deletion) did:
      
        get cpuset_sem
        if cpuset marked notify_on_release and is ready to release:
          compute cpuset path relative to /dev/cpuset mount point
          call_usermodehelper() forks /sbin/cpuset_release_agent with path
        drop cpuset_sem
      
      Unfortunately, the fork in call_usermodehelper can allocate memory, and
      allocating memory can require cpuset_sem, if the mems_generation values
      changed in the interim.  This results in an ABBA deadlock, trying to obtain
      cpuset_sem when it is already held by the current task.
      
      To fix this, I put the cpuset path (which must be computed while holding
      cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
      call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.
      
      So the new logic is:
      
        get cpuset_sem
        if cpuset marked notify_on_release and is ready to release:
          compute cpuset path relative to /dev/cpuset mount point
          stash path in kmalloc'd buffer
        drop cpuset_sem
        call_usermodehelper() forks /sbin/cpuset_release_agent with path
        free path
      
      The sharp eyed reader might notice that this patch does not contain any
      calls to kmalloc.  The existing code in the check_for_release() routine was
      already kmalloc'ing a buffer to hold the cpuset path.  In the old code, it
      just held the buffer for a few lines, over the cpuset_release_agent() call
      that in turn invoked call_usermodehelper().  In the new code, with the
      application of this patch, it returns that buffer via the new char
      **ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
      which is called after cpuset_sem is dropped.  Whereas the old code has just
      one call to cpuset_release_agent(), right in the check_for_release()
      routine, the new code has three calls to cpuset_release_agent(), from the
      various places that a cpuset can be released.
      
      This patch has been build and booted on SN2, and passed a stress test that
      previously hit the deadlock within a few seconds.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3077a260
  15. 28 7月, 2005 1 次提交
  16. 26 6月, 2005 1 次提交
  17. 24 6月, 2005 1 次提交
  18. 27 5月, 2005 1 次提交
    • P
      [PATCH] cpuset exit NULL dereference fix · 2efe86b8
      Paul Jackson 提交于
      There is a race in the kernel cpuset code, between the code
      to handle notify_on_release, and the code to remove a cpuset.
      The notify_on_release code can end up trying to access a
      cpuset that has been removed.  In the most common case, this
      causes a NULL pointer dereference from the routine cpuset_path.
      However all manner of bad things are possible, in theory at least.
      
      The existing code decrements the cpuset use count, and if the
      count goes to zero, processes the notify_on_release request,
      if appropriate.  However, once the count goes to zero, unless we
      are holding the global cpuset_sem semaphore, there is nothing to
      stop another task from immediately removing the cpuset entirely,
      and recycling its memory.
      
      The obvious fix would be to always hold the cpuset_sem
      semaphore while decrementing the use count and dealing with
      notify_on_release.  However we don't want to force a global
      semaphore into the mainline task exit path, as that might create
      a scaling problem.
      
      The actual fix is almost as easy - since this is only an issue
      for cpusets using notify_on_release, which the top level big
      cpusets don't normally need to use, only take the cpuset_sem
      for cpusets using notify_on_release.
      
      This code has been run for hours without a hiccup, while running
      a cpuset create/destroy stress test that could crash the existing
      kernel in seconds.  This patch applies to the current -linus
      git kernel.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NSimon Derr <simon.derr@bull.net>
      Acked-by: NDinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2efe86b8
  19. 17 4月, 2005 2 次提交
    • B
      [PATCH] cpuset: remove function attribute const · 9a848896
      Benoit Boissinot 提交于
      gcc-4 warns with
      include/linux/cpuset.h:21: warning: type qualifiers ignored on function
      return type
      
      cpuset_cpus_allowed is declared with const
      extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
      
      First const should be __attribute__((const)), but the gcc manual
      explains that:
      
      "Note that a function that has pointer arguments and examines the data
      pointed to must not be declared const. Likewise, a function that calls a
      non-const function usually must not be const. It does not make sense for
      a const function to return void."
      
      The following patch remove const from the function declaration.
      Signed-off-by: NBenoit Boissinot <benoit.boissinot@ens-lyon.org>
      Acked-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9a848896
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4