1. 31 10月, 2005 1 次提交
    • P
      [PATCH] cpusets: dual semaphore locking overhaul · 053199ed
      Paul Jackson 提交于
      Overhaul cpuset locking.  Replace single semaphore with two semaphores.
      
      The suggestion to use two locks was made by Roman Zippel.
      
      Both locks are global.  Code that wants to modify cpusets must first
      acquire the exclusive manage_sem, which allows them read-only access to
      cpusets, and holds off other would-be modifiers.  Before making actual
      changes, the second semaphore, callback_sem must be acquired as well.  Code
      that needs only to query cpusets must acquire callback_sem, which is also a
      global exclusive lock.
      
      The earlier problems with double tripping are avoided, because it is
      allowed for holders of manage_sem to nest the second callback_sem lock, and
      only callback_sem is needed by code called from within __alloc_pages(),
      where the double tripping had been possible.
      
      This is not quite the same as a normal read/write semaphore, because
      obtaining read-only access with intent to change must hold off other such
      attempts, while allowing read-only access w/o such intention.  Changing
      cpusets involves several related checks and changes, which must be done
      while allowing read-only queries (to avoid the double trip), but while
      ensuring nothing changes (holding off other would be modifiers.)
      
      This overhaul of cpuset locking also makes careful use of task_lock() to
      guard access to the task->cpuset pointer, closing a couple of race
      conditions noticed while reading this code (thanks, Roman).  I've never
      seen these races fail in any use or test.
      
      See further the comments in the code.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      053199ed
  2. 30 10月, 2005 4 次提交
    • H
      [PATCH] mm: fix rss and mmlist locking · f412ac08
      Hugh Dickins 提交于
      A couple of oddities were guarded by page_table_lock, no longer properly
      guarded when that is split.
      
      The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
      atomic64_t if the architecture supports it, in such a case.  Definitions by
      courtesy of Christoph Lameter: who spent considerable effort on more scalable
      ways of counting, but found insufficient benefit in practice.
      
      And adding an mm with swap to the mmlist for swapoff: the list is well-
      guarded by its own lock, but the list_empty check now has to be repeated
      inside it.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f412ac08
    • H
      [PATCH] mm: mm_struct hiwaters moved · f449952b
      Hugh Dickins 提交于
      Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm were
      tacked on the end, but it seems better to keep them near _file_rss, _anon_rss
      and total_vm, in the same cacheline on those arches verified.
      
      There are likely to be more profitable rearrangements, but less obvious (is it
      good or bad that saved_auxv[AT_VECTOR_SIZE] isolates cpu_vm_mask and context
      from many others?), needing serious instrumentation.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f449952b
    • H
      [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins 提交于
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      365e9c87
    • H
      [PATCH] mm: rss = file_rss + anon_rss · 4294621f
      Hugh Dickins 提交于
      I was lazy when we added anon_rss, and chose to change as few places as
      possible.  So currently each anonymous page has to be counted twice, in rss
      and in anon_rss.  Which won't be so good if those are atomic counts in some
      configurations.
      
      Change that around: keep file_rss and anon_rss separately, and add them
      together (with get_mm_rss macro) when the total is needed - reading two
      atomics is much cheaper than updating two atomics.  And update anon_rss
      upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4294621f
  3. 11 10月, 2005 1 次提交
    • H
      [PATCH] Fix signal sending in usbdevio on async URB completion · 46113830
      Harald Welte 提交于
      If a process issues an URB from userspace and (starts to) terminate
      before the URB comes back, we run into the issue described above.  This
      is because the urb saves a pointer to "current" when it is posted to the
      device, but there's no guarantee that this pointer is still valid
      afterwards.
      
      In fact, there are three separate issues:
      
      1) the pointer to "current" can become invalid, since the task could be
         completely gone when the URB completion comes back from the device.
      
      2) Even if the saved task pointer is still pointing to a valid task_struct,
         task_struct->sighand could have gone meanwhile.
      
      3) Even if the process is perfectly fine, permissions may have changed,
         and we can no longer send it a signal.
      
      So what we do instead, is to save the PID and uid's of the process, and
      introduce a new kill_proc_info_as_uid() function.
      Signed-off-by: NHarald Welte <laforge@gnumonks.org>
      [ Fixed up types and added symbol exports ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      46113830
  4. 30 9月, 2005 2 次提交
    • L
      Revert task flag re-ordering, add comments · 4a8342d2
      Linus Torvalds 提交于
      Roland points out that the flags end up having non-obvious dependencies
      elsewhere, so revert aa55a086 and add
      some comments about why things are as they are.
      
      We'll just have to fix up the broken comparisons. Roland has a patch.
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4a8342d2
    • O
      [PATCH] fix TASK_STOPPED vs TASK_NONINTERACTIVE interaction · aa55a086
      Oleg Nesterov 提交于
      do_signal_stop:
      
      	for_each_thread(t) {
      		if (t->state < TASK_STOPPED)
      			++sig->group_stop_count;
      	}
      
      However, TASK_NONINTERACTIVE > TASK_STOPPED, so this loop will not
      count TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE threads.
      
      See also wait_task_stopped(), which checks ->state > TASK_STOPPED.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      
      [ We really probably should always use the appropriate bitmasks to test
        task states, not do it like this. Using something like
      
      	#define TASK_RUNNABLE (TASK_RUNNING | TASK_INTERRUPTIBLE | \
      				TASK_UNINTERRUPTIBLE | TASK_NONINTERACTIVE)
      
        and then doing "if (task->state & TASK_RUNNABLE)" or similar. But the
        ordering of the task states is historical, and keeping the ordering
        does make sense regardless. ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      aa55a086
  5. 13 9月, 2005 2 次提交
    • A
      [PATCH] set_current_state() commentary · 498d0c57
      Andrew Morton 提交于
      Explain the mysteries of set_current_state().
      
      Quoth Linus:
      
       The scheduler itself never needs the memory barrier at all.
      
       The barrier is needed only if the user itself ends up testing some other
       thing afterwards, ie if you have
      
       	set_process_state(TASK_INTERRUPTIBLE);
       	if (still_need_to_sleep())
       		schedule();
      
       then the "still_need_to_sleep()" thing may test flags and wakeup events,
       and then you _may_ want to (and often do) make sure that the write of
       TASK_INTERRUPTIBLE is serialized wrt the reads of any wakeup data (since
       the wakeup may have happened on another CPU).
      
       So the comment is somewhat wrong. We don't really _care_ whether the state
       propagates out to other CPU's since all of our actions are purely local,
       and there is nothing we do that is conditional on any other CPU: we're
       going to sleep unconditionally, and the scheduler only cares about _our_
       state, not about somebody elses state.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      498d0c57
    • P
      [PATCH] cpuset semaphore depth check optimize · b3426599
      Paul Jackson 提交于
      Optimize the deadlock avoidance check on the global cpuset
      semaphore cpuset_sem.  Instead of adding a depth counter to the
      task struct of each task, rather just two words are enough, one
      to store the depth and the other the current cpuset_sem holder.
      
      Thanks to Nikita Danilov for the idea.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      
      [ We may want to change this further, but at least it's now
        a totally internal decision to the cpusets code ]
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b3426599
  6. 12 9月, 2005 1 次提交
  7. 11 9月, 2005 3 次提交
    • N
      [PATCH] add schedule_timeout_{,un}interruptible() interfaces · 64ed93a2
      Nishanth Aravamudan 提交于
      Add schedule_timeout_{,un}interruptible() interfaces so that
      schedule_timeout() callers don't have to worry about forgetting to add the
      set_current_state() call beforehand.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      64ed93a2
    • I
      [PATCH] sched: TASK_NONINTERACTIVE · d79fc0fc
      Ingo Molnar 提交于
      This patch implements a task state bit (TASK_NONINTERACTIVE), which can be
      used by blocking points to mark the task's wait as "non-interactive".  This
      does not mean the task will be considered a CPU-hog - the wait will simply
      not have an effect on the waiting task's priority - positive or negative
      alike.  Right now only pipe_wait() will make use of it, because it's a
      common source of not-so-interactive waits (kernel compilation jobs, etc.).
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d79fc0fc
    • P
      [PATCH] cpuset semaphore depth check deadlock fix · 4247bdc6
      Paul Jackson 提交于
      The cpusets-formalize-intermediate-gfp_kernel-containment patch
      has a deadlock problem.
      
      This patch was part of a set of four patches to make more
      extensive use of the cpuset 'mem_exclusive' attribute to
      manage kernel GFP_KERNEL memory allocations and to constrain
      the out-of-memory (oom) killer.
      
      A task that is changing cpusets in particular ways on a system
      when it is very short of free memory could double trip over
      the global cpuset_sem semaphore (get the lock and then deadlock
      trying to get it again).
      
      The second attempt to get cpuset_sem would be in the routine
      cpuset_zone_allowed().  This was discovered by code inspection.
      I can not reproduce the problem except with an artifically
      hacked kernel and a specialized stress test.
      
      In real life you cannot hit this unless you are manipulating
      cpusets, and are very unlikely to hit it unless you are rapidly
      modifying cpusets on a memory tight system.  Even then it would
      be a rare occurence.
      
      If you did hit it, the task double tripping over cpuset_sem
      would deadlock in the kernel, and any other task also trying
      to manipulate cpusets would deadlock there too, on cpuset_sem.
      Your batch manager would be wedged solid (if it was cpuset
      savvy), but classic Unix shells and utilities would work well
      enough to reboot the system.
      
      The unusual condition that led to this bug is that unlike most
      semaphores, cpuset_sem _can_ be acquired while in the page
      allocation code, when __alloc_pages() calls cpuset_zone_allowed.
      So it easy to mistakenly perform the following sequence:
        1) task makes system call to alter a cpuset
        2) take cpuset_sem
        3) try to allocate memory
        4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem
        5) deadlock
      
      The reason that this is not a serious bug for most users
      is that almost all calls to allocate memory don't require
      taking cpuset_sem.  Only some code paths off the beaten
      track require taking cpuset_sem -- which is good.  Taking
      a global semaphore on the main code path for allocating
      memory would not scale well.
      
      This patch fixes this deadlock by wrapping the up() and down()
      calls on cpuset_sem in kernel/cpuset.c with code that tracks
      the nesting depth of the current task on that semaphore, and
      only does the real down() if the task doesn't hold the lock
      already, and only does the real up() if the nesting depth
      (number of unmatched downs) is exactly one.
      
      The previous required use of refresh_mems(), anytime that
      the cpuset_sem semaphore was acquired and the code executed
      while holding that semaphore might try to allocate memory, is
      no longer required.  Two refresh_mems() calls were removed
      thanks to this.  This is a good change, as failing to get
      all the necessary refresh_mems() calls placed was a primary
      source of bugs in this cpuset code.  The only remaining call
      to refresh_mems() is made while doing a memory allocation,
      if certain task memory placement data needs to be updated
      from its cpuset, due to the cpuset having been changed behind
      the tasks back.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4247bdc6
  8. 10 9月, 2005 1 次提交
  9. 08 9月, 2005 3 次提交
  10. 13 7月, 2005 1 次提交
    • R
      [PATCH] inotify · 0eeca283
      Robert Love 提交于
      inotify is intended to correct the deficiencies of dnotify, particularly
      its inability to scale and its terrible user interface:
      
              * dnotify requires the opening of one fd per each directory
                that you intend to watch. This quickly results in too many
                open files and pins removable media, preventing unmount.
              * dnotify is directory-based. You only learn about changes to
                directories. Sure, a change to a file in a directory affects
                the directory, but you are then forced to keep a cache of
                stat structures.
              * dnotify's interface to user-space is awful.  Signals?
      
      inotify provides a more usable, simple, powerful solution to file change
      notification:
      
              * inotify's interface is a system call that returns a fd, not SIGIO.
      	  You get a single fd, which is select()-able.
              * inotify has an event that says "the filesystem that the item
                you were watching is on was unmounted."
              * inotify can watch directories or files.
      
      Inotify is currently used by Beagle (a desktop search infrastructure),
      Gamin (a FAM replacement), and other projects.
      
      See Documentation/filesystems/inotify.txt.
      Signed-off-by: NRobert Love <rml@novell.com>
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0eeca283
  11. 28 6月, 2005 1 次提交
    • J
      [PATCH] Update cfq io scheduler to time sliced design · 22e2c507
      Jens Axboe 提交于
      This updates the CFQ io scheduler to the new time sliced design (cfq
      v3).  It provides full process fairness, while giving excellent
      aggregate system throughput even for many competing processes.  It
      supports io priorities, either inherited from the cpu nice value or set
      directly with the ioprio_get/set syscalls.  The latter closely mimic
      set/getpriority.
      
      This import is based on my latest from -mm.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      22e2c507
  12. 26 6月, 2005 7 次提交
    • C
      [PATCH] Cleanup patch for process freezing · 3e1d1d28
      Christoph Lameter 提交于
      1. Establish a simple API for process freezing defined in linux/include/sched.h:
      
         frozen(process)		Check for frozen process
         freezing(process)		Check if a process is being frozen
         freeze(process)		Tell a process to freeze (go to refrigerator)
         thaw_process(process)	Restart process
         frozen_process(process)	Process is frozen now
      
      2. Remove all references to PF_FREEZE and PF_FROZEN from all
         kernel sources except sched.h
      
      3. Fix numerous locations where try_to_freeze is manually done by a driver
      
      4. Remove the argument that is no longer necessary from two function calls.
      
      5. Some whitespace cleanup
      
      6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
         cleared before setting PF_FROZEN, recalc_sigpending does not check
         PF_FROZEN).
      
      This patch does not address the problem of freeze_processes() violating the rule
      that a task may only modify its own flags by setting PF_FREEZE. This is not clean
      in an SMP environment. freeze(process) is therefore not SMP safe!
      Signed-off-by: NChristoph Lameter <christoph@lameter.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e1d1d28
    • D
      [PATCH] Dynamic sched domains: sched changes · 1a20ff27
      Dinakar Guniguntala 提交于
      The following patches add dynamic sched domains functionality that was
      extensively discussed on lkml and lse-tech.  I would like to see this added to
      -mm
      
      o The main advantage with this feature is that it ensures that the scheduler
        load balacing code only balances against the cpus that are in the sched
        domain as defined by an exclusive cpuset and not all of the cpus in the
        system. This removes any overhead due to load balancing code trying to
        pull tasks outside of the cpu exclusive cpuset only to be prevented by
        the tasks' cpus_allowed mask.
      o cpu exclusive cpusets are useful for servers running orthogonal
        workloads such as RT applications requiring low latency and HPC
        applications that are throughput sensitive
      
      o It provides a new API partition_sched_domains in sched.c
        that makes dynamic sched domains possible.
      o cpu_exclusive cpusets sets are now associated with a sched domain.
        Which means that the users can dynamically modify the sched domains
        through the cpuset file system interface
      o ia64 sched domain code has been updated to support this feature as well
      o Currently, this does not support hotplug. (However some of my tests
        indicate hotplug+preempt is currently broken)
      o I have tested it extensively on x86.
      o This should have very minimal impact on performance as none of
        the fast paths are affected
      Signed-off-by: NDinakar Guniguntala <dino@in.ibm.com>
      Acked-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NMatthew Dobson <colpatch@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1a20ff27
    • N
      [PATCH] sched: consolidate sbe sbf · 476d139c
      Nick Piggin 提交于
      Consolidate balance-on-exec with balance-on-fork.  This is made easy by the
      sched-domains RCU patches.
      
      As well as the general goodness of code reduction, this allows the runqueues
      to be unlocked during balance-on-fork.
      
      schedstats is a problem.  Maybe just have balance-on-event instead of
      distinguishing fork and exec?
      Signed-off-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      476d139c
    • N
      [PATCH] sched: cleanup context switch locking · 4866cde0
      Nick Piggin 提交于
      Instead of requiring architecture code to interact with the scheduler's
      locking implementation, provide a couple of defines that can be used by the
      architecture to request runqueue unlocked context switches, and ask for
      interrupts to be enabled over the context switch.
      
      Also replaces the "switch_lock" used by these architectures with an oncpu
      flag (note, not a potentially slow bitflag).  This eliminates one bus
      locked memory operation when context switching, and simplifies the
      task_running function.
      Signed-off-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4866cde0
    • N
      [PATCH] sched: schedstats update for balance on fork · 68767a0a
      Nick Piggin 提交于
      Add SCHEDSTAT statistics for sched-balance-fork.
      Signed-off-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      68767a0a
    • N
      [PATCH] sched: balance on fork · 147cbb4b
      Nick Piggin 提交于
      Reimplement the balance on exec balancing to be sched-domains aware.  Use this
      to also do balance on fork balancing.  Make x86_64 do balance on fork over the
      NUMA domain.
      
      The problem that the non sched domains aware blancing became apparent on dual
      core, multi socket opterons.  What we want is for the new tasks to be sent to
      a different socket, but more often than not, we would first load up our
      sibling core, or fill two cores of a single remote socket before selecting a
      new one.
      
      This gives large improvements to STREAM on such systems.
      Signed-off-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      147cbb4b
    • N
      [PATCH] sched: balance timers · 7897986b
      Nick Piggin 提交于
      Do CPU load averaging over a number of different intervals.  Allow each
      interval to be chosen by sending a parameter to source_load and target_load.
      0 is instantaneous, idx > 0 returns a decaying average with the most recent
      sample weighted at 2^(idx-1).  To a maximum of 3 (could be easily increased).
      
      So generally a higher number will result in more conservative balancing.
      Signed-off-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7897986b
  13. 24 6月, 2005 2 次提交
    • D
      [PATCH] Keys: Make request-key create an authorisation key · 3e30148c
      David Howells 提交于
      The attached patch makes the following changes:
      
       (1) There's a new special key type called ".request_key_auth".
      
           This is an authorisation key for when one process requests a key and
           another process is started to construct it. This type of key cannot be
           created by the user; nor can it be requested by kernel services.
      
           Authorisation keys hold two references:
      
           (a) Each refers to a key being constructed. When the key being
           	 constructed is instantiated the authorisation key is revoked,
           	 rendering it of no further use.
      
           (b) The "authorising process". This is either:
      
           	 (i) the process that called request_key(), or:
      
           	 (ii) if the process that called request_key() itself had an
           	      authorisation key in its session keyring, then the authorising
           	      process referred to by that authorisation key will also be
           	      referred to by the new authorisation key.
      
      	 This means that the process that initiated a chain of key requests
      	 will authorise the lot of them, and will, by default, wind up with
      	 the keys obtained from them in its keyrings.
      
       (2) request_key() creates an authorisation key which is then passed to
           /sbin/request-key in as part of a new session keyring.
      
       (3) When request_key() is searching for a key to hand back to the caller, if
           it comes across an authorisation key in the session keyring of the
           calling process, it will also search the keyrings of the process
           specified therein and it will use the specified process's credentials
           (fsuid, fsgid, groups) to do that rather than the calling process's
           credentials.
      
           This allows a process started by /sbin/request-key to find keys belonging
           to the authorising process.
      
       (4) A key can be read, even if the process executing KEYCTL_READ doesn't have
           direct read or search permission if that key is contained within the
           keyrings of a process specified by an authorisation key found within the
           calling process's session keyring, and is searchable using the
           credentials of the authorising process.
      
           This allows a process started by /sbin/request-key to read keys belonging
           to the authorising process.
      
       (5) The magic KEY_SPEC_*_KEYRING key IDs when passed to KEYCTL_INSTANTIATE or
           KEYCTL_NEGATE will specify a keyring of the authorising process, rather
           than the process doing the instantiation.
      
       (6) One of the process keyrings can be nominated as the default to which
           request_key() should attach new keys if not otherwise specified. This is
           done with KEYCTL_SET_REQKEY_KEYRING and one of the KEY_REQKEY_DEFL_*
           constants. The current setting can also be read using this call.
      
       (7) request_key() is partially interruptible. If it is waiting for another
           process to finish constructing a key, it can be interrupted. This permits
           a request-key cycle to be broken without recourse to rebooting.
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      Signed-Off-By: NBenoit Boissinot <benoit.boissinot@ens-lyon.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3e30148c
    • A
      [PATCH] setuid core dump · d6e71144
      Alan Cox 提交于
      Add a new `suid_dumpable' sysctl:
      
      This value can be used to query and set the core dump mode for setuid
      or otherwise protected/tainted binaries. The modes are
      
      0 - (default) - traditional behaviour.  Any process which has changed
          privilege levels or is execute only will not be dumped
      
      1 - (debug) - all processes dump core when possible.  The core dump is
          owned by the current user and no security is applied.  This is intended
          for system debugging situations only.  Ptrace is unchecked.
      
      2 - (suidsafe) - any binary which normally would not be dumped is dumped
          readable by root only.  This allows the end user to remove such a dump but
          not access it directly.  For security reasons core dumps in this mode will
          not overwrite one another or other files.  This mode is appropriate when
          adminstrators are attempting to debug problems in a normal environment.
      
      (akpm:
      
      > > +EXPORT_SYMBOL(suid_dumpable);
      >
      > EXPORT_SYMBOL_GPL?
      
      No problem to me.
      
      > >  	if (current->euid == current->uid && current->egid == current->gid)
      > >  		current->mm->dumpable = 1;
      >
      > Should this be SUID_DUMP_USER?
      
      Actually the feedback I had from last time was that the SUID_ defines
      should go because its clearer to follow the numbers. They can go
      everywhere (and there are lots of places where dumpable is tested/used
      as a bool in untouched code)
      
      > Maybe this should be renamed to `dump_policy' or something.  Doing that
      > would help us catch any code which isn't using the #defines, too.
      
      Fair comment. The patch was designed to be easy to maintain for Red Hat
      rather than for merging. Changing that field would create a gigantic
      diff because it is used all over the place.
      
      )
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d6e71144
  14. 22 6月, 2005 1 次提交
    • W
      [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Wolfgang Wander 提交于
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So now new small areas are cluttering holes of all sizes
           throughout the whole mmap-able region whereas before small holes
           tended to close holes near the base leaving holes far from the base
           large and available for larger requests.
      
        2) the free_area_cache also is set to the location of the last
           munmap-ed area so in scenarios where we allocate e.g.  five regions of
           1K each, then free regions 4 2 3 in this order the next request for 1K
           will be placed in the position of the old region 3, whereas before we
           appended it to the still active region 1, placing it at the location
           of the old region 2.  Before we had 1 free region of 2K, now we only
           get two free regions of 1K -> fragmentation.
      
      The patch addresses thes issues by introducing yet another cache descriptor
      cached_hole_size that contains the largest known hole size below the
      current free_area_cache.  If a new request comes in the size is compared
      against the cached_hole_size and if the request can be filled with a hole
      below free_area_cache the search is started from the base instead.
      
      The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
      (earlier posted) leakme.c test program terminates after 50000+ iterations
      with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
      (as expected) with thread creation, Ingo's test_str02 with 20000 threads
      requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available per request) by basically
      deleting all mentions of free_area_cache from the kernel and starting the
      search for new memory always at the respective bases we observe: leakme
      terminates successfully with 11 distinctive hardly fragmented areas in
      /proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
      time for Ingo's test_str02 with 20000 threads.
      
      Now - drumroll ;-) the appended patch works fine with leakme: it ends with
      only 7 distinct areas in /proc/self/maps and also thread creation seems
      sufficiently fast with 0.71s for 20000 threads.
      Signed-off-by: NWolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1363c3cd
  15. 06 5月, 2005 1 次提交
  16. 01 5月, 2005 2 次提交
  17. 17 4月, 2005 2 次提交