1. 17 6月, 2007 1 次提交
    • P
      cpuset: zero malloc - fix for old cpusets · 3e903e7b
      Paul Jackson 提交于
      The cpuset code to present a list of tasks using a cpuset to user space could
      write to an array that it had kmalloc'd, after a kmalloc request of zero size.
      
      The problem was that the code didn't check for writes past the allocated end
      of the array until -after- the first write.
      
      This is a race condition that is likely rare -- it would only show up if a
      cpuset went from being empty to having a task in it, during the brief time
      between the allocation and the first write.
      
      Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
      zero kmalloc returned a few usable bytes anyway, and no harm was done with the
      bogus write.
      
      With the 2.6.22 kernel changes to make issue a warning if code tries to write
      to the location returned from a zero size allocation, this problem is no
      longer benign.  This cpuset code would occassionally trigger that warning.
      
      The fix is trivial -- check before storing into the array, not after, whether
      the array is big enough to hold the store.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e903e7b
  2. 10 5月, 2007 1 次提交
  3. 09 5月, 2007 3 次提交
  4. 08 5月, 2007 1 次提交
  5. 13 2月, 2007 2 次提交
  6. 31 12月, 2006 1 次提交
  7. 14 12月, 2006 1 次提交
    • P
      [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Paul Jackson 提交于
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occassion.
      
      The hardwall version requires that the current tasks mems_allowed allows
      the node of the specified zone (or that you're in interrupt or that
      __GFP_THISNODE is set or that you're on a one cpuset system.)
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclusing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but seems worth it, to reduce this chance of hitting a
      sleeping with irq off complaint again.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      02a0e53d
  8. 09 12月, 2006 1 次提交
  9. 08 12月, 2006 4 次提交
  10. 11 10月, 2006 1 次提交
  11. 01 10月, 2006 1 次提交
  12. 30 9月, 2006 4 次提交
    • P
      [PATCH] cpuset: fix obscure attach_task vs exiting race · 181b6480
      Paul Jackson 提交于
      Fix obscure race condition in kernel/cpuset.c attach_task() code.
      
      There is basically zero chance of anyone accidentally being harmed by this
      race.
      
      It requires a special 'micro-stress' load and a special timing loop hacks
      in the kernel to hit in less than an hour, and even then you'd have to hit
      it hundreds or thousands of times, followed by some unusual and senseless
      cpuset configuration requests, including removing the top cpuset, to cause
      any visibly harm affects.
      
      One could, with perhaps a few days or weeks of such effort, get the
      reference count on the top cpuset below zero, and manage to crash the
      kernel by asking to remove the top cpuset.
      
      I found it by code inspection.
      
      The race was introduced when 'the_top_cpuset_hack' was introduced, and one
      piece of code was not updated.  An old check for a possibly null task
      cpuset pointer needed to be changed to a check for a task marked
      PF_EXITING.  The pointer can't be null anymore, thanks to
      the_top_cpuset_hack (documented in kernel/cpuset.c).  But the task could
      have gone into PF_EXITING state after it was found in the task_list scan.
      
      If a task is PF_EXITING in this code, it is possible that its task->cpuset
      pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
      than because the top_cpuset was that tasks last valid cpuset.  In that
      case, the wrong cpuset reference counter would be decremented.
      
      The fix is trivial.  Instead of failing the system call if the tasks cpuset
      pointer is null here, fail it if the task is in PF_EXITING state.
      
      The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
      the top_cpuset is done without locking, so could happen at anytime.  But it
      is done during the exit handling, after the PF_EXITING flag is set.  So if
      we verify that a task is still not PF_EXITING after we copy out its cpuset
      pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
      hack references to the top_cpuset.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      181b6480
    • P
      [PATCH] cpuset: hotunplug cpus and mems in all cpusets · b1aac8bb
      Paul Jackson 提交于
      The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
      it could remove a CPU or Node from the top cpuset, while leaving it still
      in some child cpusets.
      
      One basic rule of cpusets is that each cpusets cpus and mems are subsets of
      its parents.  The cpuset hot unplug code violated this rule.
      
      So the cpuset hotunplug handler must walk down the tree, removing any
      removed CPU or Node from all cpusets.
      
      However, it is not allowed to make a cpusets cpus or mems become empty.
      They can only transition from empty to non-empty, not back.
      
      So if the last CPU or Node would be removed from a cpuset by the above
      walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
      that still has something online, and copy its CPU or Memory placement.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b1aac8bb
    • P
      [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map · 38837fc7
      Paul Jackson 提交于
      Change the list of memory nodes allowed to tasks in the top (root) nodeset
      to dynamically track what cpus are online, using a call to a cpuset hook
      from the memory hotplug code.  Make this top cpus file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support memory hotplug, then these tasks cannot make
      use of memory nodes that are added after system boot, because the memory
      nodes are not allowed in the top cpuset.  This is a surprising regression
      over earlier kernels that didn't have cpusets enabled.
      
      One key motivation for this change is to remain consistent with the
      behaviour for the top_cpuset's 'cpus', which is also read-only, and which
      automatically tracks the cpu_online_map.
      
      This change also has the minor benefit that it fixes a long standing,
      little noticed, minor bug in cpusets.  The cpuset performance tweak to
      short circuit the cpuset_zone_allowed() check on systems with just a single
      cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
      changing the 'mems' of the top_cpuset had no affect, even though the change
      (the write system call) appeared to succeed.  With the following change,
      that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
      refuses to be changed via user space writes.  Thus no one should be mislead
      into thinking they've changed the top_cpusets's 'mems' when in affect they
      haven't.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'mems' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of node_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged memory nodes
      allowed by their cpuset.
      
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      38837fc7
    • S
      [PATCH] pidspace: is_init() · f400e198
      Sukadev Bhattiprolu 提交于
      This is an updated version of Eric Biederman's is_init() patch.
      (http://lkml.org/lkml/2006/2/6/280).  It applies cleanly to 2.6.18-rc3 and
      replaces a few more instances of ->pid == 1 with is_init().
      
      Further, is_init() checks pid and thus removes dependency on Eric's other
      patches for now.
      
      Eric's original description:
      
      	There are a lot of places in the kernel where we test for init
      	because we give it special properties.  Most  significantly init
      	must not die.  This results in code all over the kernel test
      	->pid == 1.
      
      	Introduce is_init to capture this case.
      
      	With multiple pid spaces for all of the cases affected we are
      	looking for only the first process on the system, not some other
      	process that has pid == 1.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: <lxc-devel@lists.sourceforge.net>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f400e198
  13. 27 9月, 2006 1 次提交
  14. 26 9月, 2006 2 次提交
  15. 28 8月, 2006 2 次提交
    • N
      [PATCH] cpuset: oom panic fix · 0d673a5a
      Nick Piggin 提交于
      cpuset_excl_nodes_overlap always returns 0 if current is exiting.  This caused
      customer's systems to panic in the OOM killer when processes were having
      trouble getting memory for the final put_user in mm_release.  Even though
      there were lots of processes to kill.
      
      Change to returning 1 in this case.  This achieves parity with !CONFIG_CPUSETS
      case, and was observed to fix the problem.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0d673a5a
    • P
      [PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map · 4c4d50f7
      Paul Jackson 提交于
      Change the list of cpus allowed to tasks in the top (root) cpuset to
      dynamically track what cpus are online, using a CPU hotplug notifier.  Make
      this top cpus file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support CPU hotplug, then these tasks cannot make use
      of CPUs that are added after system boot, because the CPUs are not allowed
      in the top cpuset.  This is a surprising regression over earlier kernels
      that didn't have cpusets enabled.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'cpus' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of cpu_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
      by their cpuset.
      
      Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
      driving the fix, and earlier versions of this patch.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c4d50f7
  16. 24 7月, 2006 1 次提交
    • P
      [PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lock · abb5a5cc
      Paul Jackson 提交于
      Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
      callback_mutex lock.
      
      It only happens on cpu_exclusive cpusets, due to the dynamic
      sched domain code trying to take the cpu hotplug lock inside
      the cpuset callback_mutex lock.
      
      This bug has apparently been here for several months, but didn't
      get hit until the right customer load on a large system.
      
      This fix appears right from inspection, but it will take a few
      more days running it on that customers workload to be confident
      we nailed it.  We don't have any other reproducible test case.
      
      The cpu_hotplug_lock() tends to cover large runs of code.
      The other places that hold both that lock and the cpuset callback
      mutex lock always nest the cpuset lock inside the hotplug lock.
      This place tries to do the reverse, risking an ABBA deadlock.
      
      This is in the cpuset_rmdir() code, where we:
        * take the callback_mutex lock
        * mark the cpuset CS_REMOVED
        * call update_cpu_domains for cpu_exclusive cpusets
        * in that call, take the cpu_hotplug lock if the
          cpuset is marked for removal.
      
      Thanks to Jack Steiner for identifying this deadlock.
      
      The fix is to tear down the dynamic sched domain before we grab
      the cpuset callback_mutex lock.  This way, the two locks are
      serialized, with the hotplug lock taken and released before
      trying for the cpuset lock.
      
      I suspect that this bug was introduced when I changed the
      cpuset locking from one lock to two.  The dynamic sched domain
      dependency on cpu_exclusive cpusets and its hotplug hooks were
      added to this code earlier, when cpusets had only a single lock.
      It may well have been fine then.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      abb5a5cc
  17. 01 7月, 2006 2 次提交
  18. 27 6月, 2006 2 次提交
    • E
      [PATCH] proc: Use struct pid not struct task_ref · 13b41b09
      Eric W. Biederman 提交于
      Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
      that they work with struct pid instead of struct task_ref.
      
      Mostly this is a straight 1-1 substitution.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      13b41b09
    • E
      [PATCH] proc: don't lock task_structs indefinitely · 99f89551
      Eric W. Biederman 提交于
      Every inode in /proc holds a reference to a struct task_struct.  If a
      directory or file is opened and remains open after the the task exits this
      pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
      file descriptor is about 10K.
      
      Normally I would figure a reasonable per user process limit is about 100
      processes.  With 80 processes, with a 1000 file descriptors each I can trigger
      the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
      data.
      
      This patch replaces the struct task_struct pointer with a pointer to a struct
      task_ref which has a struct task_struct pointer.  The so the pinning of dead
      tasks does not happen.
      
      The code now has to contend with the fact that the task may now exit at any
      time.  Which is a little but not muh more complicated.
      
      With this change it takes about 1000 processes each opening up 1000 file
      descriptors before I can trigger the OOM killer.  Much better.
      
      [mlp@google.com: task_mmu small fixes]
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Signed-off-by: NPrasanna Meda <mlp@google.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      99f89551
  19. 23 6月, 2006 2 次提交
    • D
      [PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c) · 22fb52dd
      David Quigley 提交于
      Add a security hook call to enable security modules to control the ability
      to attach a task to a cpuset.  While limited control over this operation is
      possible via permission checks on the pseudo fs interface, those checks are
      not sufficient to control access to the target task, which is looked up in
      this function.  The existing task_setscheduler hook is re-used for this
      operation since this falls under the same class of operations.
      Signed-off-by: NDavid Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      Acked-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      22fb52dd
    • D
      [PATCH] VFS: Permit filesystem to override root dentry on mount · 454e2398
      David Howells 提交于
      Extend the get_sb() filesystem operation to take an extra argument that
      permits the VFS to pass in the target vfsmount that defines the mountpoint.
      
      The filesystem is then required to manually set the superblock and root dentry
      pointers.  For most filesystems, this should be done with simple_set_mnt()
      which will set the superblock pointer and then set the root dentry to the
      superblock's s_root (as per the old default behaviour).
      
      The get_sb() op now returns an integer as there's now no need to return the
      superblock pointer.
      
      This patch permits a superblock to be implicitly shared amongst several mount
      points, such as can be done with NFS to avoid potential inode aliasing.  In
      such a case, simple_set_mnt() would not be called, and instead the mnt_root
      and mnt_sb would be set directly.
      
      The patch also makes the following changes:
      
       (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
           pointer argument and return an integer, so most filesystems have to change
           very little.
      
       (*) If one of the convenience function is not used, then get_sb() should
           normally call simple_set_mnt() to instantiate the vfsmount. This will
           always return 0, and so can be tail-called from get_sb().
      
       (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
           dcache upon superblock destruction rather than shrink_dcache_anon().
      
           This is required because the superblock may now have multiple trees that
           aren't actually bound to s_root, but that still need to be cleaned up. The
           currently called functions assume that the whole tree is rooted at s_root,
           and that anonymous dentries are not the roots of trees which results in
           dentries being left unculled.
      
           However, with the way NFS superblock sharing are currently set to be
           implemented, these assumptions are violated: the root of the filesystem is
           simply a dummy dentry and inode (the real inode for '/' may well be
           inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
           with child trees.
      
           [*] Anonymous until discovered from another tree.
      
       (*) The documentation has been adjusted, including the additional bit of
           changing ext2_* into foo_* in the documentation.
      
      [akpm@osdl.org: convert ipath_fs, do other stuff]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      454e2398
  20. 22 5月, 2006 2 次提交
  21. 01 4月, 2006 3 次提交
  22. 24 3月, 2006 2 次提交