1. 13 January 2006 (3 commits)
    • [PATCH] missing helper - task_stack_page() · 9fc65876
      Committed by Al Viro
      Patchset annotates arch/* uses of ->thread_info.  Ones that really are about
      access of thread_info of given process are simply switched to
      task_thread_info(task); ones that deal with access to objects on stack are
      switched to new helper - task_stack_page().  A _lot_ of the latter are
      actually open-coded instances of "find where pt_regs are"; those are
      consolidated into task_pt_regs(task) (many architectures actually have such
      helper already).
      
      Note that these annotations are not mandatory - any code not converted to
      these helpers still works.  However, they clean up a lot of places and have
      actually caught a number of bugs, so converting out of tree ports would be a
      good idea...
      
      As an example of breakage caught by that stuff, see i386 pt_regs mess - we
      used to have it open-coded in a bunch of places and when back in April Stas
      had fixed a bug in copy_thread(), the rest had been left out of sync.  That
      required two followup patches (the latest - just before 2.6.15) _and_ still
      had left /proc/*/stat eip field broken.  Try ps -eo eip on i386 and watch the
      junk...
      
      This patch:
      
      new helper - task_stack_page(task).  Returns pointer to the memory object
      containing task stack; usually thread_info of task sits in the beginning
      of that object.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9fc65876
    • [PATCH] sched: filter affine wakeups · d7102e95
      Committed by akpm@osdl.org
      
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      Track the last waker CPU, and only consider wakeup-balancing if there's a
      match between current waker CPU and the previous waker CPU.  This ensures
      that there is some correlation between two subsequent wakeup events before
      we move the task.  Should help random-wakeup workloads on large SMP
      systems, by reducing the migration attempts by a factor of nr_cpus.
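
      A rough sketch of the filter; the per-task field name (last_waker_cpu)
      follows the description above, and the surrounding wake-up code is
      purely illustrative:

        int this_cpu = smp_processor_id();

        if (p->last_waker_cpu != this_cpu) {
                /* no correlation with the previous wakeup yet: remember
                 * this waker and skip wakeup balancing this time */
                p->last_waker_cpu = this_cpu;
                goto out_activate;
        }
        /* same waker CPU twice in a row: wakeup balancing may pull the task */
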
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d7102e95
    • [PATCH] scheduler cache-hot-autodetect · 198e2f18
      Committed by akpm@osdl.org
      
      From: Ingo Molnar <mingo@elte.hu>
      
      This is the latest version of the scheduler cache-hot-auto-tune patch.
      
      The first problem was that detection time scaled with O(N^2), which is
      unacceptable on larger SMP and NUMA systems. To solve this:
      
      - I've added a 'domain distance' function, which is used to cache
        measurement results. Each distance is only measured once. This means
        that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
        distances 0 and 1, and on SMP distance 0 is measured. The code walks
        the domain tree to determine the distance, so it automatically follows
        whatever hierarchy an architecture sets up. This cuts down on the boot
        time significantly and removes the O(N^2) limit. The only assumption
        is that migration costs can be expressed as a function of domain
        distance - this covers the overwhelming majority of existing systems,
        and is a good guess even for more asymmetric systems.
      
        [ People hacking systems that have asymmetries that break this
          assumption (e.g. different CPU speeds) should experiment a bit with
          the cpu_distance() function. Adding a ->migration_distance factor to
          the domain structure would be one possible solution - but let's first
          see the problem systems, if they exist at all. Let's not overdesign. ]
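
      A sketch of the domain-distance walk described in the item above; it
      leans on the 2.6 sched-domains structures, but the function itself is
      illustrative rather than the exact code from the patch:

        static unsigned long domain_distance(int cpu1, int cpu2)
        {
                unsigned long distance = 0;
                struct sched_domain *sd;

                /* climb cpu1's domain hierarchy until a level spans cpu2 */
                for_each_domain(cpu1, sd) {
                        if (cpu_isset(cpu2, sd->span))
                                return distance;
                        distance++;
                }
                return distance;        /* no common domain found */
        }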
      
      Another problem was that only a single cache-size was used for measuring
      the cost of migration, and most architectures didn't set that variable
      up. Furthermore, a single cache-size does not fit NUMA hierarchies with
      L3 caches and does not fit HT setups, where different CPUs will often
      have different 'effective cache sizes'. To solve this problem:
      
      - Instead of relying on a single cache-size provided by the platform and
        sticking to it, the code now auto-detects the 'effective migration
        cost' between two measured CPUs, via iterating through a wide range of
        cachesizes. The code searches for the maximum migration cost, which
        occurs when the working set of the test-workload falls just below the
        'effective cache size'. I.e. real-life optimized search is done for
        the maximum migration cost, between two real CPUs.
      
        This, amongst other things, has the positive effect that if e.g. two
        CPUs share an L2/L3 cache, a different (and accurate) migration cost
        will be found than between two CPUs on the same system that don't share
        any caches.
      
      (The reliable measurement of migration costs is tricky - see the source
      for details.)
      
      Furthermore, I've added various boot-time options to override/tune
      migration behavior.
      
      Firstly, there's a blanket override for autodetection:
      
      	migration_cost=1000,2000,3000
      
      will override the depth 0/1/2 values with 1msec/2msec/3msec values.
      
      Secondly, there's a global factor that can be used to increase (or
      decrease) the autodetected values:
      
      	migration_factor=120
      
      will increase the autodetected values by 20%. This option is useful to
      tune things in a workload-dependent way - e.g. if a workload is
      cache-insensitive then CPU utilization can be maximized by specifying
      migration_factor=0.
      
      I've tested the autodetection code quite extensively on x86, on 3
      P3/Xeon/2MB, and the autodetected values look pretty good:
      
      Dual Celeron (128K L2 cache):
      
       ---------------------
       migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
       ---------------------
                 [00]    [01]
       [00]:     -     1.7(1)
       [01]:   1.7(1)    -
       ---------------------
       cacheflush times [2]: 0.0 (0) 1.7 (1784008)
       ---------------------
      
      Here the slow memory subsystem dominates system performance, and even
      though caches are small, the migration cost is 1.7 msecs.
      
      Dual HT P4 (512K L2 cache):
      
       ---------------------
       migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
       ---------------------
                 [00]    [01]    [02]    [03]
       [00]:     -     0.4(1)  0.0(0)  0.4(1)
       [01]:   0.4(1)    -     0.4(1)  0.0(0)
       [02]:   0.0(0)  0.4(1)    -     0.4(1)
       [03]:   0.4(1)  0.0(0)  0.4(1)    -
       ---------------------
       cacheflush times [2]: 0.0 (33900) 0.4 (448514)
       ---------------------
      
      Here it can be seen that there is no migration cost between two HT
      siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
      system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.
      
      8-way P3/Xeon [2MB L2 cache]:
      
       ---------------------
       migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
       ---------------------
                 [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
       [00]:     -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [01]:  19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [02]:  19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [03]:  19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [04]:  19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1)
       [05]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1)
       [06]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1)
       [07]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -
       ---------------------
       cacheflush times [2]: 0.0 (0) 19.2 (19281756)
       ---------------------
      
      This one has huge caches and a relatively slow memory subsystem - so the
      migration cost is 19 msecs.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: <wilder@us.ibm.com>
      Signed-off-by: John Hawkes <hawkes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      198e2f18
  2. 12 January 2006 (2 commits)
  3. 11 January 2006 (2 commits)
  4. 10 January 2006 (1 commit)
  5. 09 January 2006 (4 commits)
    • [PATCH] keys: Permit running process to instantiate keys · b5f545c8
      Committed by David Howells
      Make it possible for a running process (such as gssapid) to instantiate
      a key, as requested by Trond Myklebust for NFS4.
      
      The patch makes the following changes:
      
       (1) A new, optional key type method has been added. This permits a key type
           to intercept requests at the point /sbin/request-key is about to be
           spawned and do something else with them - passing them over the
           rpc_pipefs files or netlink sockets for instance.
      
           The uninstantiated key, the authorisation key and the intended operation
           name are passed to the method.
      
       (2) The callout_info is no longer passed as an argument to /sbin/request-key
           to prevent unauthorised viewing of this data using ps or by looking in
           /proc/pid/cmdline.
      
           This means that the old /sbin/request-key program will not work with the
           patched kernel as it will expect to see an extra argument that is no
           longer there.
      
           A revised keyutils package will be made available tomorrow.
      
       (3) The callout_info is now attached to the authorisation key. Reading this
           key will retrieve the information.
      
       (4) A new field has been added to the task_struct. This holds the
           authorisation key currently active for a thread. Searches now look here
           for the caller's set of keys rather than looking for an auth key in the
           lowest level of the session keyring.
      
           This permits a thread to be servicing multiple requests at once and to
           switch between them. Note that this is per-thread, not per-process, and
           so is usable in multithreaded programs.
      
           The setting of this field is inherited across fork and exec.
      
       (5) A new keyctl function (KEYCTL_ASSUME_AUTHORITY) has been added that
           permits a thread to assume the authority to deal with an uninstantiated
           key. Assumption is only permitted if the authorisation key associated
           with the uninstantiated key is somewhere in the thread's keyrings.
      
           This function can also clear the assumption.
      
       (6) A new magic key specifier has been added to refer to the currently
           assumed authorisation key (KEY_SPEC_REQKEY_AUTH_KEY).
      
       (7) Instantiation will only proceed if the appropriate authorisation key is
           assumed first. The assumed authorisation key is discarded if
           instantiation is successful.
      
       (8) key_validate() is moved from the file of request_key functions to the
           file of permissions functions.
      
       (9) The documentation is updated.
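
      A hedged userspace sketch of the new flow for a daemon such as gssapid,
      assuming the keyutils library wrappers keyctl_assume_authority(),
      keyctl_read_alloc() and keyctl_instantiate(); the payload handling is
      illustrative only:

        #include <keyutils.h>
        #include <stdlib.h>

        int instantiate(key_serial_t key, key_serial_t auth_key,
                        const void *payload, size_t plen)
        {
                void *callout_info = NULL;
                long ret;

                /* (5)/(6): assume the authority to deal with this key */
                if (keyctl_assume_authority(auth_key) < 0)
                        return -1;

                /* (3): the callout data now lives on the authorisation key */
                keyctl_read_alloc(KEY_SPEC_REQKEY_AUTH_KEY, &callout_info);

                /* ... use callout_info to construct the payload ... */

                /* (7): instantiation only works while the authority is assumed */
                ret = keyctl_instantiate(key, payload, plen,
                                         KEY_SPEC_SESSION_KEYRING);

                free(callout_info);
                keyctl_assume_authority(0);     /* clear the assumption */
                return ret < 0 ? -1 : 0;
        }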
      
      From: <Valdis.Kletnieks@vt.edu>
      
          Build fix.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Alexander Zangerl <az@bond.edu.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b5f545c8
    • [PATCH] remove get_task_struct_rcu() · d4829cd5
      Committed by Paul E. McKenney
      The latest set of signal-RCU patches does not use get_task_struct_rcu().
      Attached is a patch that removes it.
      Signed-off-by: N"Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d4829cd5
    • I
      [PATCH] RCU signal handling · e56d0903
      Ingo Molnar 提交于
      RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
      instead of tasklist_lock read-locked.  This is a scalability improvement on
      SMP and a preemption-latency improvement under PREEMPT_RCU.
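
      A sketch of the shape of the change, using kill_proc_info() as the
      example call site (illustrative, not the full diff):

        rcu_read_lock();        /* was: read_lock(&tasklist_lock); */
        p = find_task_by_pid(pid);
        error = -ESRCH;
        if (p)
                error = group_send_sig_info(sig, info, p);
        rcu_read_unlock();      /* was: read_unlock(&tasklist_lock); */
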
      Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: William Irwin <wli@holomorphy.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e56d0903
    • [PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap · 930d9152
      Committed by Christoph Lameter
      Add PF_SWAPWRITE to control a process's permission to write to swap.
      
      - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
        and pdflush
      
      - Set PF_SWAPWRITE flag for kswapd and pdflush
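
      A sketch of the resulting check, close to the behaviour described above
      (details of the congestion tests are illustrative):

        static int may_write_to_queue(struct backing_dev_info *bdi)
        {
                if (current->flags & PF_SWAPWRITE)      /* kswapd, pdflush */
                        return 1;
                if (!bdi_write_congested(bdi))
                        return 1;
                if (bdi == current->backing_dev_info)
                        return 1;
                return 0;
        }
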
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      930d9152
  6. 07 January 2006 (1 commit)
  7. 29 November 2005 (1 commit)
    • [PATCH] clean up lock_cpu_hotplug() in cpufreq · a9d9baa1
      Committed by Ashok Raj
      There are some callers in the cpufreq hotplug notify path where the
      lowest-level function calls lock_cpu_hotplug().  The lock is already held
      during cpu_up() and cpu_down() calls when the notify calls are broadcast
      to registered clients.
      
      Ideally, if possible, we could disable_preempt() at the highest caller and
      make sure we don't sleep in the path down in cpufreq->driver_target()
      calls, but the calls are too intertwined and cumbersome to clean up.
      
      Hence we consistently use lock_cpu_hotplug() and unlock_cpu_hotplug() in
      all places.
      
       - Removed export of cpucontrol semaphore and made it static.
       - Removed explicit uses of up/down with lock_cpu_hotplug()
         so we can keep track of the callers in the same thread context and
         just keep refcounts without calling a down() that causes a deadlock.
       - Removed current_in_hotplug() uses.
       - Removed PF_HOTPLUG_CPU in sched.h, introduced for the current_in_hotplug()
         temporary workaround.
      
      Tested with insmod of cpufreq_stat.ko, and logical online/offline
      to make sure we don't have any hang situations.
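
      A simplified sketch of the refcounted locking described above; the real
      patch also protects this bookkeeping itself, so treat the names and
      structure as illustrative:

        static DECLARE_MUTEX(cpucontrol);
        static struct task_struct *hotplug_owner;
        static int hotplug_depth;

        void lock_cpu_hotplug(void)
        {
                if (hotplug_owner == current) {
                        hotplug_depth++;        /* same thread: just count */
                        return;
                }
                down(&cpucontrol);
                hotplug_owner = current;
                hotplug_depth = 1;
        }

        void unlock_cpu_hotplug(void)
        {
                if (--hotplug_depth == 0) {
                        hotplug_owner = NULL;
                        up(&cpucontrol);
                }
        }
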
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Cc: Zwane Mwaikambo <zwane@linuxpower.ca>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a9d9baa1
  8. 14 November 2005 (4 commits)
  9. 09 November 2005 (1 commit)
    • [PATCH] cpu hotplug: fix locking in cpufreq drivers · 90d45d17
      Committed by Ashok Raj
      When calling target drivers to set frequency, we take the cpucontrol lock.
      When we modified the code to accommodate CPU hotplug, there was an attempt
      to take a double lock of cpucontrol, leading to a deadlock.  Since the
      current thread context is already holding the cpucontrol lock, we don't
      need to make another attempt to acquire it.
      
      Now we leave a trace in current->flags indicating that the current thread
      already holds the cpucontrol lock, so we don't attempt to take it again.
      
      Thanks to Andrew Morton for the beating:-)
      
      From: Brice Goglin <Brice.Goglin@ens-lyon.org>
      
        Build fix
      
      (akpm: this patch is still unpleasant.  Ashok continues to look for a cleaner
      solution, doesn't he?  ;))
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      90d45d17
  10. 31 October 2005 (3 commits)
    • [PATCH] cleanup the usage of SEND_SIG_xxx constants · 621d3121
      Committed by Oleg Nesterov
      This patch simplifies some checks for magic siginfo values.  It should not
      change the behaviour in any way.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      621d3121
    • [PATCH] sched: hardcode non-smp set_cpus_allowed · 4098f991
      Committed by Paul Jackson
      Simplify the UP (1 CPU) implementation of set_cpus_allowed.
      
      The one CPU is hardcoded to be cpu 0 - so just test for that bit, and avoid
      having to pick up the cpu_online_map.
      
      Also, unexport cpu_online_map: it was only needed for set_cpus_allowed().
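
      A sketch of the resulting UP variant (close to what the description
      implies; treat it as illustrative):

        /* !SMP: the only CPU is cpu 0, so just test that bit */
        static inline int set_cpus_allowed(task_t *p, cpumask_t new_mask)
        {
                if (!cpu_isset(0, new_mask))
                        return -EINVAL;
                return 0;
        }
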
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4098f991
    • [PATCH] cpusets: dual semaphore locking overhaul · 053199ed
      Committed by Paul Jackson
      Overhaul cpuset locking.  Replace single semaphore with two semaphores.
      
      The suggestion to use two locks was made by Roman Zippel.
      
      Both locks are global.  Code that wants to modify cpusets must first
      acquire the exclusive manage_sem, which allows them read-only access to
      cpusets, and holds off other would-be modifiers.  Before making actual
      changes, the second semaphore, callback_sem must be acquired as well.  Code
      that needs only to query cpusets must acquire callback_sem, which is also a
      global exclusive lock.
      
      The earlier problems with double tripping are avoided, because it is
      allowed for holders of manage_sem to nest the second callback_sem lock, and
      only callback_sem is needed by code called from within __alloc_pages(),
      where the double tripping had been possible.
      
      This is not quite the same as a normal read/write semaphore, because
      obtaining read-only access with intent to change must hold off other such
      attempts, while allowing read-only access w/o such intention.  Changing
      cpusets involves several related checks and changes, which must be done
      while allowing read-only queries (to avoid the double trip), but while
      ensuring nothing changes (holding off other would-be modifiers).
      
      This overhaul of cpuset locking also makes careful use of task_lock() to
      guard access to the task->cpuset pointer, closing a couple of race
      conditions noticed while reading this code (thanks, Roman).  I've never
      seen these races fail in any use or test.
      
      See further the comments in the code.
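
      A sketch of the resulting protocol, with the two semaphore names taken
      from the description above and the call sites purely illustrative:

        /* modifier path: read-only view first, then the actual change */
        down(&manage_sem);      /* holds off other would-be modifiers */
        /* ... validate the requested cpuset change ... */
        down(&callback_sem);    /* nested: now safe to modify cpusets */
        /* ... apply the change ... */
        up(&callback_sem);
        up(&manage_sem);

        /* query path, e.g. called from __alloc_pages(): only callback_sem
         * is needed, which a manage_sem holder may nest - this is what
         * avoids the earlier double-trip deadlock */
        down(&callback_sem);
        /* ... read cpuset state ... */
        up(&callback_sem);
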
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      053199ed
  11. 30 October 2005 (4 commits)
    • [PATCH] mm: fix rss and mmlist locking · f412ac08
      Committed by Hugh Dickins
      A couple of oddities were guarded by page_table_lock, no longer properly
      guarded when that is split.
      
      The mm_counters of file_rss and anon_rss: make those an atomic_t (or an
      atomic64_t if the architecture supports it).  Definitions courtesy of
      Christoph Lameter, who spent considerable effort on more scalable ways of
      counting, but found insufficient benefit in practice.
      
      And adding an mm with swap to the mmlist for swapoff: the list is well-
      guarded by its own lock, but the list_empty check now has to be repeated
      inside it.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f412ac08
    • [PATCH] mm: mm_struct hiwaters moved · f449952b
      Committed by Hugh Dickins
      Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm were
      tacked on the end, but it seems better to keep them near _file_rss, _anon_rss
      and total_vm, in the same cacheline on those arches verified.
      
      There are likely to be more profitable rearrangements, but less obvious (is it
      good or bad that saved_auxv[AT_VECTOR_SIZE] isolates cpu_vm_mask and context
      from many others?), needing serious instrumentation.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f449952b
    • [PATCH] mm: update_hiwaters just in time · 365e9c87
      Committed by Hugh Dickins
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
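
      A sketch of the two helpers as described (matching the intent above;
      exact definitions may differ):

        #define update_hiwater_rss(mm) do {                     \
                if ((mm)->hiwater_rss < get_mm_rss(mm))         \
                        (mm)->hiwater_rss = get_mm_rss(mm);     \
        } while (0)

        #define update_hiwater_vm(mm) do {                      \
                if ((mm)->hiwater_vm < (mm)->total_vm)          \
                        (mm)->hiwater_vm = (mm)->total_vm;      \
        } while (0)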
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      365e9c87
    • [PATCH] mm: rss = file_rss + anon_rss · 4294621f
      Committed by Hugh Dickins
      I was lazy when we added anon_rss, and chose to change as few places as
      possible.  So currently each anonymous page has to be counted twice, in rss
      and in anon_rss.  Which won't be so good if those are atomic counts in some
      configurations.
      
      Change that around: keep file_rss and anon_rss separately, and add them
      together (with get_mm_rss macro) when the total is needed - reading two
      atomics is much cheaper than updating two atomics.  And update anon_rss
      upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
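
      The accessor implied by the description might look like this (reading
      the two counters on demand rather than maintaining a third):

        #define get_mm_rss(mm)                                          \
                (get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))
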
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4294621f
  12. 11 October 2005 (1 commit)
    • [PATCH] Fix signal sending in usbdevio on async URB completion · 46113830
      Committed by Harald Welte
      If a process issues an URB from userspace and (starts to) terminate
      before the URB comes back, we run into the issue described above.  This
      is because the urb saves a pointer to "current" when it is posted to the
      device, but there's no guarantee that this pointer is still valid
      afterwards.
      
      In fact, there are three separate issues:
      
      1) the pointer to "current" can become invalid, since the task could be
         completely gone when the URB completion comes back from the device.
      
      2) Even if the saved task pointer is still pointing to a valid task_struct,
         task_struct->sighand could have gone meanwhile.
      
      3) Even if the process is perfectly fine, permissions may have changed,
         and we can no longer send it a signal.
      
      So what we do instead is to save the PID and UIDs of the process, and
      introduce a new kill_proc_info_as_uid() function.
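
      A sketch of the idea on the driver side; the struct members and the
      helper's exact signature are assumptions based on the text above:

        struct async {
                /* ... existing fields ... */
                pid_t pid;              /* recorded when the URB is posted */
                uid_t uid, euid;
                struct siginfo info;
        };

        /* at URB completion, no task_struct pointer is dereferenced: */
        kill_proc_info_as_uid(as->info.si_signo, &as->info,
                              as->pid, as->uid, as->euid);
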
      Signed-off-by: Harald Welte <laforge@gnumonks.org>
      [ Fixed up types and added symbol exports ]
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      46113830
  13. 30 September 2005 (2 commits)
    • Revert task flag re-ordering, add comments · 4a8342d2
      Committed by Linus Torvalds
      Roland points out that the flags end up having non-obvious dependencies
      elsewhere, so revert aa55a086 and add
      some comments about why things are as they are.
      
      We'll just have to fix up the broken comparisons. Roland has a patch.
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4a8342d2
    • [PATCH] fix TASK_STOPPED vs TASK_NONINTERACTIVE interaction · aa55a086
      Committed by Oleg Nesterov
      do_signal_stop:
      
      	for_each_thread(t) {
      		if (t->state < TASK_STOPPED)
      			++sig->group_stop_count;
      	}
      
      However, TASK_NONINTERACTIVE > TASK_STOPPED, so this loop will not
      count TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE threads.
      
      See also wait_task_stopped(), which checks ->state > TASK_STOPPED.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      
      [ We really probably should always use the appropriate bitmasks to test
        task states, not do it like this. Using something like
      
      	#define TASK_RUNNABLE (TASK_RUNNING | TASK_INTERRUPTIBLE | \
      				TASK_UNINTERRUPTIBLE | TASK_NONINTERACTIVE)
      
        and then doing "if (task->state & TASK_RUNNABLE)" or similar. But the
        ordering of the task states is historical, and keeping the ordering
        does make sense regardless. ]
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      aa55a086
  14. 13 September 2005 (2 commits)
    • [PATCH] set_current_state() commentary · 498d0c57
      Committed by Andrew Morton
      Explain the mysteries of set_current_state().
      
      Quoth Linus:
      
       The scheduler itself never needs the memory barrier at all.
      
       The barrier is needed only if the user itself ends up testing some other
       thing afterwards, ie if you have
      
       	set_process_state(TASK_INTERRUPTIBLE);
       	if (still_need_to_sleep())
       		schedule();
      
       then the "still_need_to_sleep()" thing may test flags and wakeup events,
       and then you _may_ want to (and often do) make sure that the write of
       TASK_INTERRUPTIBLE is serialized wrt the reads of any wakeup data (since
       the wakeup may have happened on another CPU).
      
       So the comment is somewhat wrong. We don't really _care_ whether the state
       propagates out to other CPU's since all of our actions are purely local,
       and there is nothing we do that is conditional on any other CPU: we're
       going to sleep unconditionally, and the scheduler only cares about _our_
       state, not about somebody else's state.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      498d0c57
    • [PATCH] cpuset semaphore depth check optimize · b3426599
      Committed by Paul Jackson
      Optimize the deadlock avoidance check on the global cpuset
      semaphore cpuset_sem.  Instead of adding a depth counter to the
      task struct of each task, rather just two words are enough, one
      to store the depth and the other the current cpuset_sem holder.
      
      Thanks to Nikita Danilov for the idea.
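
      A sketch of the two-word scheme described above (names are illustrative):

        static struct task_struct *cpuset_sem_owner;
        static int cpuset_sem_depth;

        static void cpuset_down(struct semaphore *sem)
        {
                if (cpuset_sem_owner != current) {
                        down(sem);
                        cpuset_sem_owner = current;
                }
                cpuset_sem_depth++;     /* nested acquisition: just count */
        }

        static void cpuset_up(struct semaphore *sem)
        {
                if (--cpuset_sem_depth == 0) {
                        cpuset_sem_owner = NULL;
                        up(sem);
                }
        }
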
      Signed-off-by: Paul Jackson <pj@sgi.com>
      
      [ We may want to change this further, but at least it's now
        a totally internal decision to the cpusets code ]
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b3426599
  15. 12 September 2005 (1 commit)
  16. 11 September 2005 (3 commits)
    • [PATCH] add schedule_timeout_{,un}interruptible() interfaces · 64ed93a2
      Committed by Nishanth Aravamudan
      Add schedule_timeout_{,un}interruptible() interfaces so that
      schedule_timeout() callers don't have to worry about forgetting to add the
      set_current_state() call beforehand.
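
      The wrappers are essentially this thin (shown as a sketch):

        signed long __sched schedule_timeout_interruptible(signed long timeout)
        {
                __set_current_state(TASK_INTERRUPTIBLE);
                return schedule_timeout(timeout);
        }

        signed long __sched schedule_timeout_uninterruptible(signed long timeout)
        {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                return schedule_timeout(timeout);
        }
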
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      64ed93a2
    • [PATCH] sched: TASK_NONINTERACTIVE · d79fc0fc
      Committed by Ingo Molnar
      This patch implements a task state bit (TASK_NONINTERACTIVE), which can be
      used by blocking points to mark the task's wait as "non-interactive".  This
      does not mean the task will be considered a CPU-hog - the wait will simply
      not have an effect on the waiting task's priority - positive or negative
      alike.  Right now only pipe_wait() will make use of it, because it's a
      common source of not-so-interactive waits (kernel compilation jobs, etc.).
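
      A sketch of the single current user mentioned above, pipe_wait(),
      marking its sleep as non-interactive (based on the description; the
      details are illustrative):

        void pipe_wait(struct inode *inode)
        {
                DEFINE_WAIT(wait);

                prepare_to_wait(PIPE_WAIT(*inode), &wait,
                                TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
                up(PIPE_SEM(*inode));
                schedule();
                finish_wait(PIPE_WAIT(*inode), &wait);
                down(PIPE_SEM(*inode));
        }
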
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d79fc0fc
    • [PATCH] cpuset semaphore depth check deadlock fix · 4247bdc6
      Committed by Paul Jackson
      The cpusets-formalize-intermediate-gfp_kernel-containment patch
      has a deadlock problem.
      
      This patch was part of a set of four patches to make more
      extensive use of the cpuset 'mem_exclusive' attribute to
      manage kernel GFP_KERNEL memory allocations and to constrain
      the out-of-memory (oom) killer.
      
      A task that is changing cpusets in particular ways on a system
      when it is very short of free memory could double trip over
      the global cpuset_sem semaphore (get the lock and then deadlock
      trying to get it again).
      
      The second attempt to get cpuset_sem would be in the routine
      cpuset_zone_allowed().  This was discovered by code inspection.
      I cannot reproduce the problem except with an artificially
      hacked kernel and a specialized stress test.
      
      In real life you cannot hit this unless you are manipulating
      cpusets, and are very unlikely to hit it unless you are rapidly
      modifying cpusets on a memory tight system.  Even then it would
      be a rare occurrence.
      
      If you did hit it, the task double tripping over cpuset_sem
      would deadlock in the kernel, and any other task also trying
      to manipulate cpusets would deadlock there too, on cpuset_sem.
      Your batch manager would be wedged solid (if it was cpuset
      savvy), but classic Unix shells and utilities would work well
      enough to reboot the system.
      
      The unusual condition that led to this bug is that unlike most
      semaphores, cpuset_sem _can_ be acquired while in the page
      allocation code, when __alloc_pages() calls cpuset_zone_allowed.
      So it is easy to mistakenly perform the following sequence:
        1) task makes system call to alter a cpuset
        2) take cpuset_sem
        3) try to allocate memory
        4) memory allocator, via cpuset_zone_allowed, tries to take cpuset_sem
        5) deadlock
      
      The reason that this is not a serious bug for most users
      is that almost all calls to allocate memory don't require
      taking cpuset_sem.  Only some code paths off the beaten
      track require taking cpuset_sem -- which is good.  Taking
      a global semaphore on the main code path for allocating
      memory would not scale well.
      
      This patch fixes this deadlock by wrapping the up() and down()
      calls on cpuset_sem in kernel/cpuset.c with code that tracks
      the nesting depth of the current task on that semaphore, and
      only does the real down() if the task doesn't hold the lock
      already, and only does the real up() if the nesting depth
      (number of unmatched downs) is exactly one.
      
      The previously required use of refresh_mems(), any time that
      the cpuset_sem semaphore was acquired and the code executed
      while holding that semaphore might try to allocate memory, is
      no longer required.  Two refresh_mems() calls were removed
      thanks to this.  This is a good change, as failing to get
      all the necessary refresh_mems() calls placed was a primary
      source of bugs in this cpuset code.  The only remaining call
      to refresh_mems() is made while doing a memory allocation,
      if certain task memory placement data needs to be updated
      from its cpuset, due to the cpuset having been changed behind
      the tasks back.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4247bdc6
  17. 10 September 2005 (1 commit)
  18. 08 September 2005 (3 commits)
  19. 13 July 2005 (1 commit)
    • [PATCH] inotify · 0eeca283
      Committed by Robert Love
      inotify is intended to correct the deficiencies of dnotify, particularly
      its inability to scale and its terrible user interface:
      
              * dnotify requires the opening of one fd per each directory
                that you intend to watch. This quickly results in too many
                open files and pins removable media, preventing unmount.
              * dnotify is directory-based. You only learn about changes to
                directories. Sure, a change to a file in a directory affects
                the directory, but you are then forced to keep a cache of
                stat structures.
              * dnotify's interface to user-space is awful.  Signals?
      
      inotify provides a more usable, simple, powerful solution to file change
      notification:
      
              * inotify's interface is a system call that returns a fd, not SIGIO.
      	  You get a single fd, which is select()-able.
              * inotify has an event that says "the filesystem that the item
                you were watching is on was unmounted."
              * inotify can watch directories or files.
      
      Inotify is currently used by Beagle (a desktop search infrastructure),
      Gamin (a FAM replacement), and other projects.
      
      See Documentation/filesystems/inotify.txt.
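
      A minimal userspace sketch of the interface described above, assuming
      the glibc wrappers in <sys/inotify.h> (the original patch exposed raw
      syscalls; the watch path and flags shown are illustrative):

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/inotify.h>

        int main(void)
        {
                char buf[4096];
                int fd = inotify_init();        /* one fd for all watches */
                ssize_t len;

                inotify_add_watch(fd, "/tmp", IN_CREATE | IN_DELETE | IN_UNMOUNT);

                len = read(fd, buf, sizeof(buf));       /* also select()-able */
                if (len > 0) {
                        struct inotify_event *ev = (struct inotify_event *)buf;
                        printf("wd=%d mask=0x%x name=%s\n", ev->wd, ev->mask,
                               ev->len ? ev->name : "");
                }
                close(fd);
                return 0;
        }
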
      Signed-off-by: Robert Love <rml@novell.com>
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      0eeca283