1. 13 Sep 2005, 1 commit
    • [PATCH] cpuset semaphore depth check optimize · b3426599
      Authored by Paul Jackson
      Optimize the deadlock avoidance check on the global cpuset
      semaphore cpuset_sem.  Instead of adding a depth counter to the
      task struct of each task, just two words are enough: one to store
      the depth and the other the current cpuset_sem holder.
      
      Thanks to Nikita Danilov for the idea.
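      
      A minimal sketch of the idea follows; it is illustrative only, not the
      actual kernel/cpuset.c code, and the variable and wrapper names are
      assumptions:
      
        static struct task_struct *cpuset_sem_owner;  /* current holder, or NULL */
        static int cpuset_sem_depth;                  /* unmatched downs by the holder */
        
        static void cpuset_down(struct semaphore *sem)
        {
                if (cpuset_sem_owner != current) {
                        down(sem);                    /* real acquisition */
                        cpuset_sem_owner = current;
                }
                cpuset_sem_depth++;
        }
        
        static void cpuset_up(struct semaphore *sem)
        {
                if (--cpuset_sem_depth == 0) {
                        cpuset_sem_owner = NULL;
                        up(sem);                      /* real release */
                }
        }
      
      Only the task that holds cpuset_sem can ever observe cpuset_sem_owner
      equal to current, so the unlocked comparison is safe.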
      Signed-off-by: Paul Jackson <pj@sgi.com>
      
      [ We may want to change this further, but at least it's now
        a totally internal decision to the cpusets code ]
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 12 Sep 2005, 1 commit
  3. 11 Sep 2005, 3 commits
    • [PATCH] add schedule_timeout_{,un}interruptible() interfaces · 64ed93a2
      Authored by Nishanth Aravamudan
      Add schedule_timeout_{,un}interruptible() interfaces so that
      schedule_timeout() callers don't have to worry about forgetting to add the
      set_current_state() call beforehand.
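      
      The helpers amount to something like the following sketch, which simply
      folds the state change into the call (the real definitions live in the
      kernel's timer code):
      
        signed long schedule_timeout_interruptible(signed long timeout)
        {
                __set_current_state(TASK_INTERRUPTIBLE);
                return schedule_timeout(timeout);
        }
        
        signed long schedule_timeout_uninterruptible(signed long timeout)
        {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                return schedule_timeout(timeout);
        }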
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: TASK_NONINTERACTIVE · d79fc0fc
      Authored by Ingo Molnar
      This patch implements a task state bit (TASK_NONINTERACTIVE), which can be
      used by blocking points to mark the task's wait as "non-interactive".  This
      does not mean the task will be considered a CPU-hog - the wait will simply
      not have an effect on the waiting task's priority - positive or negative
      alike.  Right now only pipe_wait() will make use of it, because it's a
      common source of not-so-interactive waits (kernel compilation jobs, etc.).
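      
      A blocking point opts in by OR-ing the new bit into the sleep state it
      passes to prepare_to_wait(); a hedged sketch follows, where the wait
      queue and wakeup condition are made up for illustration:
      
        DEFINE_WAIT(wait);
        
        /* Mark this sleep as non-interactive: it neither boosts nor hurts
         * the task's interactivity credit when it wakes up. */
        prepare_to_wait(&example_waitqueue, &wait,
                        TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
        if (!example_condition)
                schedule();
        finish_wait(&example_waitqueue, &wait);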
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset semaphore depth check deadlock fix · 4247bdc6
      Authored by Paul Jackson
      The cpusets-formalize-intermediate-gfp_kernel-containment patch
      has a deadlock problem.
      
      This patch was part of a set of four patches to make more
      extensive use of the cpuset 'mem_exclusive' attribute to
      manage kernel GFP_KERNEL memory allocations and to constrain
      the out-of-memory (oom) killer.
      
      A task that is changing cpusets in particular ways on a system
      when it is very short of free memory could double trip over
      the global cpuset_sem semaphore (get the lock and then deadlock
      trying to get it again).
      
      The second attempt to get cpuset_sem would be in the routine
      cpuset_zone_allowed().  This was discovered by code inspection.
      I cannot reproduce the problem except with an artificially
      hacked kernel and a specialized stress test.
      
      In real life you cannot hit this unless you are manipulating
      cpusets, and are very unlikely to hit it unless you are rapidly
      modifying cpusets on a memory-tight system.  Even then it would
      be a rare occurrence.
      
      If you did hit it, the task double tripping over cpuset_sem
      would deadlock in the kernel, and any other task also trying
      to manipulate cpusets would deadlock there too, on cpuset_sem.
      Your batch manager would be wedged solid (if it was cpuset
      savvy), but classic Unix shells and utilities would work well
      enough to reboot the system.
      
      The unusual condition that led to this bug is that unlike most
      semaphores, cpuset_sem _can_ be acquired while in the page
      allocation code, when __alloc_pages() calls cpuset_zone_allowed.
      So it is easy to mistakenly perform the following sequence:
        1) task makes system call to alter a cpuset
        2) take cpuset_sem
        3) try to allocate memory
        4) memory allocator, via cpuset_zone_allowed, tries to take cpuset_sem
        5) deadlock
      
      The reason that this is not a serious bug for most users
      is that almost all calls to allocate memory don't require
      taking cpuset_sem.  Only some code paths off the beaten
      track require taking cpuset_sem -- which is good.  Taking
      a global semaphore on the main code path for allocating
      memory would not scale well.
      
      This patch fixes this deadlock by wrapping the up() and down()
      calls on cpuset_sem in kernel/cpuset.c with code that tracks
      the nesting depth of the current task on that semaphore, and
      only does the real down() if the task doesn't hold the lock
      already, and only does the real up() if the nesting depth
      (number of unmatched downs) is exactly one.
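      
      Sketched below is the shape of those wrappers; the task_struct field
      name is illustrative, not necessarily the one added by the patch:
      
        static inline void cpuset_down(struct semaphore *sem)
        {
                if (current->cpuset_sem_nest_depth == 0)
                        down(sem);                      /* first, real down() */
                current->cpuset_sem_nest_depth++;
        }
        
        static inline void cpuset_up(struct semaphore *sem)
        {
                if (--current->cpuset_sem_nest_depth == 0)
                        up(sem);                        /* last unmatched down */
        }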
      
      The use of refresh_mems() that was previously required whenever
      the cpuset_sem semaphore was acquired, and the code executed
      while holding it might try to allocate memory, is no longer
      needed.  Two refresh_mems() calls were removed
      thanks to this.  This is a good change, as failing to get
      all the necessary refresh_mems() calls placed was a primary
      source of bugs in this cpuset code.  The only remaining call
      to refresh_mems() is made while doing a memory allocation,
      if certain task memory placement data needs to be updated
      from its cpuset, due to the cpuset having been changed behind
      the task's back.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  4. 10 Sep 2005, 1 commit
  5. 08 Sep 2005, 3 commits
  6. 13 Jul 2005, 1 commit
    • [PATCH] inotify · 0eeca283
      Authored by Robert Love
      inotify is intended to correct the deficiencies of dnotify, particularly
      its inability to scale and its terrible user interface:
      
              * dnotify requires the opening of one fd per each directory
                that you intend to watch. This quickly results in too many
                open files and pins removable media, preventing unmount.
              * dnotify is directory-based. You only learn about changes to
                directories. Sure, a change to a file in a directory affects
                the directory, but you are then forced to keep a cache of
                stat structures.
              * dnotify's interface to user-space is awful.  Signals?
      
      inotify provides a more usable, simple, powerful solution to file change
      notification:
      
              * inotify's interface is a system call that returns a fd, not SIGIO.
                You get a single fd, which is select()-able.
              * inotify has an event that says "the filesystem that the item
                you were watching is on was unmounted."
              * inotify can watch directories or files.
      
      Inotify is currently used by Beagle (a desktop search infrastructure),
      Gamin (a FAM replacement), and other projects.
      
      See Documentation/filesystems/inotify.txt.
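      
      A minimal userspace sketch of the interface described above, using the
      glibc <sys/inotify.h> wrappers (which postdate this patch; with older
      libcs the raw syscalls would be needed instead):
      
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/inotify.h>
        
        int main(void)
        {
                /* Buffer aligned for struct inotify_event, as in the man page. */
                char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
                int fd = inotify_init();        /* one fd for all watches */
                if (fd < 0) { perror("inotify_init"); return 1; }
        
                /* Watch a directory for creations and modifications. */
                if (inotify_add_watch(fd, "/tmp", IN_CREATE | IN_MODIFY) < 0) {
                        perror("inotify_add_watch");
                        return 1;
                }
        
                /* The fd is select()-able; a plain read() also works. */
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len > 0) {
                        struct inotify_event *ev = (struct inotify_event *)buf;
                        printf("event mask 0x%x on wd %d\n", ev->mask, ev->wd);
                }
                close(fd);
                return 0;
        }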
      Signed-off-by: Robert Love <rml@novell.com>
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  7. 28 Jun 2005, 1 commit
    • [PATCH] Update cfq io scheduler to time sliced design · 22e2c507
      Authored by Jens Axboe
      This updates the CFQ io scheduler to the new time sliced design (cfq
      v3).  It provides full process fairness, while giving excellent
      aggregate system throughput even for many competing processes.  It
      supports io priorities, either inherited from the cpu nice value or set
      directly with the ioprio_get/set syscalls.  The latter closely mimic
      set/getpriority.
      
      This import is based on my latest from -mm.
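      
      A hedged userspace sketch of the ioprio_set() side: glibc ships no
      wrapper, so the constants are copied by hand (values as in the kernel's
      ioprio header) and SYS_ioprio_set is assumed to be defined for the
      target architecture:
      
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        
        #define IOPRIO_CLASS_SHIFT  13
        #define IOPRIO_CLASS_BE     2                   /* best-effort class */
        #define IOPRIO_WHO_PROCESS  1
        #define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))
        
        int main(void)
        {
                /* Set the calling process to best-effort, priority level 4 (0-7). */
                int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4);
        
                if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
                        perror("ioprio_set");
                        return 1;
                }
                return 0;
        }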
      Signed-off-by: Jens Axboe <axboe@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 26 Jun 2005, 7 commits
    • [PATCH] Cleanup patch for process freezing · 3e1d1d28
      Authored by Christoph Lameter
      1. Establish a simple API for process freezing defined in linux/include/sched.h
         (a usage sketch appears below, after this list):
      
         frozen(process)		Check for frozen process
         freezing(process)		Check if a process is being frozen
         freeze(process)		Tell a process to freeze (go to refrigerator)
         thaw_process(process)	Restart process
         frozen_process(process)	Process is frozen now
      
      2. Remove all references to PF_FREEZE and PF_FROZEN from all
         kernel sources except sched.h
      
      3. Fix numerous locations where try_to_freeze is manually done by a driver
      
      4. Remove the argument that is no longer necessary from two function calls.
      
      5. Some whitespace cleanup
      
      6. Close a potential race in the refrigerator (there was an open window
         between clearing PF_FREEZE and setting PF_FROZEN, and recalc_sigpending
         does not check PF_FROZEN).
      
      This patch does not address the problem of freeze_processes() violating the rule
      that a task may only modify its own flags by setting PF_FREEZE. This is not clean
      in an SMP environment. freeze(process) is therefore not SMP safe!
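      
      A sketch of the usage referred to in (1): a driver kthread polls the
      freezer on each loop iteration (the thread function and sleep interval
      here are made up for illustration):
      
        static int example_thread(void *unused)
        {
                while (!kthread_should_stop()) {
                        try_to_freeze();        /* refrigerator if freezing(current) */
        
                        /* ... periodic work ... */
        
                        set_current_state(TASK_INTERRUPTIBLE);
                        schedule_timeout(HZ);
                }
                return 0;
        }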
      Signed-off-by: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Dynamic sched domains: sched changes · 1a20ff27
      Authored by Dinakar Guniguntala
      The following patches add dynamic sched domains functionality that was
      extensively discussed on lkml and lse-tech.  I would like to see this added to
      -mm
      
      o The main advantage with this feature is that it ensures that the scheduler
        load balancing code only balances against the cpus that are in the sched
        domain as defined by an exclusive cpuset and not all of the cpus in the
        system. This removes any overhead due to load balancing code trying to
        pull tasks outside of the cpu exclusive cpuset only to be prevented by
        the tasks' cpus_allowed mask.
      o cpu exclusive cpusets are useful for servers running orthogonal
        workloads such as RT applications requiring low latency and HPC
        applications that are throughput sensitive
      
      o It provides a new API, partition_sched_domains(), in sched.c
        that makes dynamic sched domains possible (see the sketch after this list).
      o cpu_exclusive cpusets are now associated with a sched domain, which
        means that users can dynamically modify the sched domains through
        the cpuset file system interface.
      o ia64 sched domain code has been updated to support this feature as well
      o Currently, this does not support hotplug. (However some of my tests
        indicate hotplug+preempt is currently broken)
      o I have tested it extensively on x86.
      o This should have very minimal impact on performance as none of
        the fast paths are affected
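      
      A hedged sketch of the new hook mentioned above; the exact signature is
      an assumption from the 2.6.13 era, and the caller below is hypothetical:
      
        /* Split the online cpus into two non-overlapping balancing partitions. */
        extern void partition_sched_domains(cpumask_t *partition1,
                                            cpumask_t *partition2);
        
        static void rebuild_domains(cpumask_t isolated)
        {
                cpumask_t rest;
        
                cpus_andnot(rest, cpu_online_map, isolated);
                partition_sched_domains(&rest, &isolated);
        }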
      Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Matthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: consolidate sbe sbf · 476d139c
      Authored by Nick Piggin
      Consolidate balance-on-exec with balance-on-fork.  This is made easy by the
      sched-domains RCU patches.
      
      As well as the general goodness of code reduction, this allows the runqueues
      to be unlocked during balance-on-fork.
      
      schedstats is a problem.  Maybe just have balance-on-event instead of
      distinguishing fork and exec?
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: cleanup context switch locking · 4866cde0
      Authored by Nick Piggin
      Instead of requiring architecture code to interact with the scheduler's
      locking implementation, provide a couple of defines that can be used by the
      architecture to request runqueue unlocked context switches, and ask for
      interrupts to be enabled over the context switch.
      
      Also replaces the "switch_lock" used by these architectures with an oncpu
      flag (note, not a potentially slow bitflag).  This eliminates one bus
      locked memory operation when context switching, and simplifies the
      task_running function.
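      
      A sketch of how an architecture opts in; the macro names below are my
      recollection of what this patch introduced and should be treated as an
      assumption. They are defined in the arch headers before the scheduler
      core is built:
      
        /* Run the context switch with the runqueue lock already dropped. */
        #define __ARCH_WANT_UNLOCKED_CTXSW
        
        /* Additionally keep interrupts enabled across the switch. */
        #define __ARCH_WANT_INTERRUPTS_ON_CTXSW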
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: schedstats update for balance on fork · 68767a0a
      Authored by Nick Piggin
      Add SCHEDSTAT statistics for sched-balance-fork.
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: balance on fork · 147cbb4b
      Authored by Nick Piggin
      Reimplement the balance on exec balancing to be sched-domains aware.  Use this
      to also do balance on fork balancing.  Make x86_64 do balance on fork over the
      NUMA domain.
      
      The problem with the non-sched-domains-aware balancing became apparent on
      dual-core, multi-socket Opterons.  What we want is for new tasks to be sent to
      a different socket, but more often than not, we would first load up our
      sibling core, or fill two cores of a single remote socket before selecting a
      new one.
      
      This gives large improvements to STREAM on such systems.
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] sched: balance timers · 7897986b
      Authored by Nick Piggin
      Do CPU load averaging over a number of different intervals.  Allow each
      interval to be chosen by sending a parameter to source_load and target_load.
      0 is instantaneous, idx > 0 returns a decaying average with the most recent
      sample weighted at 2^(idx-1), up to a maximum idx of 3 (which could easily
      be increased).
      
      So generally a higher number will result in more conservative balancing.
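      
      A simplified sketch of the idea (not the exact kernel arithmetic): each
      tick, index 0 tracks the instantaneous load and higher indices blend the
      new sample into a progressively slower-decaying average:
      
        #define NR_LOAD_IDX 4
        
        static unsigned long cpu_load[NR_LOAD_IDX];
        
        static void update_cpu_load(unsigned long this_load)
        {
                int idx;
        
                cpu_load[0] = this_load;                /* idx 0: instantaneous */
                for (idx = 1; idx < NR_LOAD_IDX; idx++) {
                        unsigned long scale = 1UL << idx;
        
                        /* keep (scale-1)/scale of the history, 1/scale of the sample */
                        cpu_load[idx] = (cpu_load[idx] * (scale - 1) + this_load) / scale;
                }
        }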
      Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  9. 24 Jun 2005, 2 commits
    • [PATCH] Keys: Make request-key create an authorisation key · 3e30148c
      Authored by David Howells
      The attached patch makes the following changes:
      
       (1) There's a new special key type called ".request_key_auth".
      
           This is an authorisation key for when one process requests a key and
           another process is started to construct it. This type of key cannot be
           created by the user; nor can it be requested by kernel services.
      
           Authorisation keys hold two references:
      
           (a) Each refers to a key being constructed. When the key being
           	 constructed is instantiated the authorisation key is revoked,
           	 rendering it of no further use.
      
           (b) The "authorising process". This is either:
      
           	 (i) the process that called request_key(), or:
      
           	 (ii) if the process that called request_key() itself had an
           	      authorisation key in its session keyring, then the authorising
           	      process referred to by that authorisation key will also be
           	      referred to by the new authorisation key.
      
      	 This means that the process that initiated a chain of key requests
      	 will authorise the lot of them, and will, by default, wind up with
      	 the keys obtained from them in its keyrings.
      
       (2) request_key() creates an authorisation key which is then passed to
           /sbin/request-key as part of a new session keyring.
      
       (3) When request_key() is searching for a key to hand back to the caller, if
           it comes across an authorisation key in the session keyring of the
           calling process, it will also search the keyrings of the process
           specified therein and it will use the specified process's credentials
           (fsuid, fsgid, groups) to do that rather than the calling process's
           credentials.
      
           This allows a process started by /sbin/request-key to find keys belonging
           to the authorising process.
      
       (4) A key can be read, even if the process executing KEYCTL_READ doesn't have
           direct read or search permission if that key is contained within the
           keyrings of a process specified by an authorisation key found within the
           calling process's session keyring, and is searchable using the
           credentials of the authorising process.
      
           This allows a process started by /sbin/request-key to read keys belonging
           to the authorising process.
      
       (5) The magic KEY_SPEC_*_KEYRING key IDs when passed to KEYCTL_INSTANTIATE or
           KEYCTL_NEGATE will specify a keyring of the authorising process, rather
           than the process doing the instantiation.
      
       (6) One of the process keyrings can be nominated as the default to which
           request_key() should attach new keys if not otherwise specified. This is
           done with KEYCTL_SET_REQKEY_KEYRING and one of the KEY_REQKEY_DEFL_*
           constants. The current setting can also be read using this call (see the
           sketch after this list).
      
       (7) request_key() is partially interruptible. If it is waiting for another
           process to finish constructing a key, it can be interrupted. This permits
           a request-key cycle to be broken without recourse to rebooting.
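      
      A userspace sketch of point (6), using libkeyutils (link with -lkeyutils;
      the library wrapper postdates this patch, so treat its use here as an
      assumption):
      
        #include <stdio.h>
        #include <keyutils.h>
        
        int main(void)
        {
                /* Ask that keys obtained on our behalf by request_key() be
                 * attached to the session keyring; the old setting is returned. */
                long old = keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING);
        
                if (old < 0) {
                        perror("keyctl_set_reqkey_keyring");
                        return 1;
                }
                printf("previous default destination: %ld\n", old);
                return 0;
        }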
      Signed-Off-By: David Howells <dhowells@redhat.com>
      Signed-Off-By: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] setuid core dump · d6e71144
      Authored by Alan Cox
      Add a new `suid_dumpable' sysctl:
      
      This value can be used to query and set the core dump mode for setuid
      or otherwise protected/tainted binaries. The modes are
      
      0 - (default) - traditional behaviour.  Any process which has changed
          privilege levels or is execute only will not be dumped
      
      1 - (debug) - all processes dump core when possible.  The core dump is
          owned by the current user and no security is applied.  This is intended
          for system debugging situations only.  Ptrace is unchecked.
      
      2 - (suidsafe) - any binary which normally would not be dumped is dumped
          readable by root only.  This allows the end user to remove such a dump but
          not access it directly.  For security reasons core dumps in this mode will
          not overwrite one another or other files.  This mode is appropriate when
          administrators are attempting to debug problems in a normal environment.
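      
      Assuming the usual procfs-backed sysctl path (fs.suid_dumpable showing up
      as /proc/sys/fs/suid_dumpable), switching to "suidsafe" mode can be
      sketched as:
      
        #include <stdio.h>
        
        int main(void)
        {
                FILE *f = fopen("/proc/sys/fs/suid_dumpable", "w");
        
                if (!f) {
                        perror("fopen");
                        return 1;
                }
                fputs("2\n", f);        /* 0 = default, 1 = debug, 2 = suidsafe */
                fclose(f);
                return 0;
        }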
      
      (akpm:
      
      > > +EXPORT_SYMBOL(suid_dumpable);
      >
      > EXPORT_SYMBOL_GPL?
      
      No problem to me.
      
      > >  	if (current->euid == current->uid && current->egid == current->gid)
      > >  		current->mm->dumpable = 1;
      >
      > Should this be SUID_DUMP_USER?
      
      Actually the feedback I had from last time was that the SUID_ defines
      should go because it's clearer to follow the numbers. They can go
      everywhere (and there are lots of places where dumpable is tested/used
      as a bool in untouched code)
      
      > Maybe this should be renamed to `dump_policy' or something.  Doing that
      > would help us catch any code which isn't using the #defines, too.
      
      Fair comment. The patch was designed to be easy to maintain for Red Hat
      rather than for merging. Changing that field would create a gigantic
      diff because it is used all over the place.
      
      )
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  10. 22 Jun 2005, 1 commit
    • [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Authored by Wolfgang Wander
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So now new small areas are cluttering holes of all sizes
           throughout the whole mmap-able region whereas before small holes
           tended to close holes near the base leaving holes far from the base
           large and available for larger requests.
      
        2) the free_area_cache also is set to the location of the last
           munmap-ed area so in scenarios where we allocate e.g.  five regions of
           1K each, then free regions 4 2 3 in this order the next request for 1K
           will be placed in the position of the old region 3, whereas before we
           appended it to the still active region 1, placing it at the location
           of the old region 2.  Before we had 1 free region of 2K, now we only
           get two free regions of 1K -> fragmentation.
      
      The patch addresses these issues by introducing yet another cache descriptor
      cached_hole_size that contains the largest known hole size below the
      current free_area_cache.  If a new request comes in the size is compared
      against the cached_hole_size and if the request can be filled with a hole
      below free_area_cache the search is started from the base instead.
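      
      The decision can be sketched as follows (simplified, not the actual mm
      code; the function name is made up):
      
        unsigned long pick_search_start(struct mm_struct *mm, unsigned long len)
        {
                if (len <= mm->cached_hole_size) {
                        /* A hole below free_area_cache already fits this request:
                         * restart bottom-up so small requests reuse small holes. */
                        mm->cached_hole_size = 0;       /* re-learned during the scan */
                        return TASK_UNMAPPED_BASE;
                }
        
                /* Otherwise keep going from where the last search ended. */
                return mm->free_area_cache;
        }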
      
      The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
      (earlier posted) leakme.c test program terminates after 50000+ iterations
      with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
      (as expected) with thread creation, Ingo's test_str02 with 20000 threads
      requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available per request) by basically
      deleting all mentions of free_area_cache from the kernel and starting the
      search for new memory always at the respective bases we observe: leakme
      terminates successfully with 11 distinctive hardly fragmented areas in
      /proc/self/maps but thread creation is grindingly slow: 30+s(!) system
      time for Ingo's test_str02 with 20000 threads.
      
      Now - drumroll ;-) the appended patch works fine with leakme: it ends with
      only 7 distinct areas in /proc/self/maps and also thread creation seems
      sufficiently fast with 0.71s for 20000 threads.
      Signed-off-by: Wolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  11. 06 May 2005, 1 commit
  12. 01 May 2005, 2 commits
  13. 17 Apr 2005, 2 commits