1. 27 6月, 2006 2 次提交
  2. 23 6月, 2006 1 次提交
    • M
      [PATCH] remove steal_locks() · c89681ed
      Miklos Szeredi 提交于
      This patch removes the steal_locks() function.
      
      steal_locks() doesn't work correctly with any filesystem that does it's own
      lock management, including NFS, CIFS, etc.
      
      In addition it has weird semantics on local filesystems in case tasks
      sharing file-descriptor tables are doing POSIX locking operations in
      parallel to execve().
      
      The steal_locks() function has an effect on applications doing:
      
      clone(CLONE_FILES)
        /* in child */
        lock
        execve
        lock
      
      POSIX locks acquired before execve (by "child", "parent" or any further
      task sharing files_struct) will after the execve be owned exclusively by
      "child".
      
      According to Chris Wright some LSB/LTP kind of suite triggers without the
      stealing behavior, but there's no known real-world application that would
      also fail.
      
      Apps using NPTL are not affected, since all other threads are killed before
      execve.
      
      Apps using LinuxThreads are only affected if they
      
        - have multiple threads during exec (LinuxThreads doesn't kill other
          threads, the app may do it with pthread_kill_other_threads_np())
        - rely on POSIX locks being inherited across exec
      
      Both conditions are documented, but not their interaction.
      
      Apps using clone() natively are affected if they
      
        - use clone(CLONE_FILES)
        - rely on POSIX locks being inherited across exec
      
      The above scenarios are unlikely, but possible.
      
      If the patch is vetoed, there's a plan B, that involves mostly keeping the
      weird stealing semantics, but changing the way lock ownership is handled so
      that network and local filesystems work consistently.
      
      That would add more complexity though, so this solution seems to be
      preferred by most people.
      Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Steven French <sfrench@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c89681ed
  3. 20 6月, 2006 1 次提交
  4. 20 4月, 2006 1 次提交
  5. 14 4月, 2006 1 次提交
    • E
      [PATCH] de_thread: Don't change our parents and ptrace flags. · c06511d1
      Eric W. Biederman 提交于
      This is two distinct changes.
       - Not changing our real parents.
       - Not changing our ptrace parents.
      
      Not changing our real parents is trivially correct because both tasks
      have the same real parents as they are part of a thread group.  Now that
      we demote the leader to a thread there is no longer any reason to change
      it's parentage.
      
      Not changing our ptrace parents is a user visible change if someone
      looks hard enough.  I don't think user space applications will care or
      even notice.
      
      In the practical and I think common case a debugger will have attached
      to all of the threads using the same ptrace flags.  From my quick skim
      of strace and gdb that appears to be the case.  Which if true means
      debuggers will not notice a change.
      
      Before this point we have already generated a ptrace event in do_exit
      that reports the leaders pid has died so de_thread is visible to a
      debugger.  Which means attempting to hide this case by copying flags
      around appears excessive.
      
      By not doing anything it avoids all of the weird locking issues between
      de_thread and ptrace attach, and removes one case from consideration for
      fixing the ptrace locking.
      
      This only addresses Oleg's first concern with ptrace_attach, that of the
      problems caused by reparenting.  Oleg's second concern is essentially a
      race between ptrace_attach and release_task that causes an oops when we
      get to force_sig_specific.  There is nothing special about de_thread
      with respect to that race.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c06511d1
  6. 11 4月, 2006 2 次提交
    • R
      [PATCH] process accounting: take original leader's start_time in non-leader exec · f5e90281
      Roland McGrath 提交于
      The only record we have of the real-time age of a process, regardless of
      execs it's done, is start_time.  When a non-leader thread exec, the
      original start_time of the process is lost.  Things looking at the
      real-time age of the process are fooled, for example the process accounting
      record when the process finally dies.  This change makes the oldest
      start_time stick around with the process after a non-leader exec.  This way
      the association between PID and start_time is kept constant, which seems
      correct to me.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f5e90281
    • E
      [PATCH] de_thread: Don't confuse users do_each_thread. · de12a787
      Eric W. Biederman 提交于
      Oleg Nesterov spotted two interesting bugs with the current de_thread
      code.  The simplest is a long standing double decrement of
      __get_cpu_var(process_counts) in __unhash_process.  Caused by
      two processes exiting when only one was created.
      
      The other is that since we no longer detach from the thread_group list
      it is possible for do_each_thread when run under the tasklist_lock to
      see the same task_struct twice.  Once on the task list as a
      thread_group_leader, and once on the thread list of another
      thread.
      
      The double appearance in do_each_thread can cause a double increment
      of mm_core_waiters in zap_threads resulting in problems later on in
      coredump_wait.
      
      To remedy those two problems this patch takes the simple approach
      of changing the old thread group leader into a child thread.
      The only routine in release_task that cares is __unhash_process,
      and it can be trivially seen that we handle cleaning up a
      thread group leader properly.
      
      Since de_thread doesn't change the pid of the exiting leader process
      and instead shares it with the new leader process.  I change
      thread_group_leader to recognize group leadership based on the
      group_leader field and not based on pids.  This should also be
      slightly cheaper then the existing thread_group_leader macro.
      
      I performed a quick audit and I couldn't see any user of
      thread_group_leader that cared about the difference.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      de12a787
  7. 01 4月, 2006 1 次提交
  8. 29 3月, 2006 5 次提交
    • O
      [PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCU · aa1757f9
      Oleg Nesterov 提交于
      This patch borrows a clever Hugh's 'struct anon_vma' trick.
      
      Without tasklist_lock held we can't trust task->sighand until we locked it
      and re-checked that it is still the same.
      
      But this means we don't need to defer 'kmem_cache_free(sighand)'.  We can
      return the memory to slab immediately, all we need is to be sure that
      sighand->siglock can't dissapear inside rcu protected section.
      
      To do so we need to initialize ->siglock inside ctor function,
      SLAB_DESTROY_BY_RCU does the rest.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      aa1757f9
    • O
      [PATCH] remove add_parent()'s parent argument · 8fafabd8
      Oleg Nesterov 提交于
      add_parent(p, parent) is always called with parent == p->parent, and it makes
      no sense to do it differently.  This patch removes this argument.
      
      No changes in affected .o files.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8fafabd8
    • E
      [PATCH] pidhash: kill switch_exec_pids · d73d6529
      Eric W. Biederman 提交于
      switch_exec_pids is only called from de_thread by way of exec, and it is
      only called when we are exec'ing from a non thread group leader.
      
      Currently switch_exec_pids gives the leader the pid of the thread and
      unhashes and rehashes all of the process groups.  The leader is already in
      the EXIT_DEAD state so no one cares about it's pids.  The only concern for
      the leader is that __unhash_process called from release_task will function
      correctly.  If we don't touch the leader at all we know that
      __unhash_process will work fine so there is no need to touch the leader.
      
      For the task becomming the thread group leader, we just need to give it the
      pid of the old thread group leader, add it to the task list, and attach it
      to the session and the process group of the thread group.
      
      Currently de_thread is also adding the task to the task list which is just
      silly.
      
      Currently the only leader of __detach_pid besides detach_pid is
      switch_exec_pids because of the ugly extra work that was being
      performed.
      
      So this patch removes switch_exec_pids because it is doing too much, it is
      creating an unnecessary special case in pid.c, duing work duplicated in
      de_thread, and generally obscuring what it is going on.
      
      The necessary work is added to de_thread, and it seems to be a little
      clearer there what is going on.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d73d6529
    • O
      [PATCH] simplify exec from init's subthread · 1434261c
      Oleg Nesterov 提交于
      I think it is enough to take tasklist_lock for reading while changing
      child_reaper:
      
      	Reparenting needs write_lock(tasklist_lock)
      
      	Only one thread in a thread group can do exec()
      
      	sighand->siglock garantees that get_signal_to_deliver()
      	will not see a stale value of child_reaper.
      
      This means that we can change child_reaper earlier, without calling
      zap_other_threads() twice.
      
      "child_reaper = current" is a NOOP when init does exec from main thread, we
      don't care.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1434261c
    • E
      [PATCH] exec: allow init to exec from any thread. · fef23e7f
      Eric W. Biederman 提交于
      After looking at the problem of init calling exec some more I figured out
      an easy way to make the code work.
      
      The actual symptom without out this patch is that all threads will die
      except pid == 1, and the thread calling exec.  The thread calling exec will
      wait forever for pid == 1 to die.
      
      Since pid == 1 does not install a handler for SIGKILL it will never die.
      
      This modifies the tests for init from current->pid == 1 to the equivalent
      current == child_reaper.  And then it causes exec in the ugly case to
      modify child_reaper.
      
      The only weird symptom is that you wind up with an init process that
      doesn't have the oldest start time on the box.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fef23e7f
  9. 27 3月, 2006 1 次提交
  10. 26 3月, 2006 2 次提交
  11. 01 3月, 2006 1 次提交
  12. 16 2月, 2006 1 次提交
  13. 19 1月, 2006 1 次提交
    • U
      [PATCH] vfs: *at functions: core · 5590ff0d
      Ulrich Drepper 提交于
      Here is a series of patches which introduce in total 13 new system calls
      which take a file descriptor/filename pair instead of a single file
      name.  These functions, openat etc, have been discussed on numerous
      occasions.  They are needed to implement race-free filesystem traversal,
      they are necessary to implement a virtual per-thread current working
      directory (think multi-threaded backup software), etc.
      
      We have in glibc today implementations of the interfaces which use the
      /proc/self/fd magic.  But this code is rather expensive.  Here are some
      results (similar to what Jim Meyering posted before).
      
      The test creates a deep directory hierarchy on a tmpfs filesystem.  Then
      rm -fr is used to remove all directories.  Without syscall support I get
      this:
      
      real    0m31.921s
      user    0m0.688s
      sys     0m31.234s
      
      With syscall support the results are much better:
      
      real    0m20.699s
      user    0m0.536s
      sys     0m20.149s
      
      The interfaces are for obvious reasons currently not much used.  But they'll
      be used.  coreutils (and Jeff's posixutils) are already using them.
      Furthermore, code like ftw/fts in libc (maybe even glob) will also start using
      them.  I expect a patch to make follow soon.  Every program which is walking
      the filesystem tree will benefit.
      Signed-off-by: NUlrich Drepper <drepper@redhat.com>
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5590ff0d
  14. 15 1月, 2006 1 次提交
  15. 11 1月, 2006 1 次提交
  16. 09 1月, 2006 3 次提交
    • O
      [PATCH] do_coredump() should reset group_stop_count earlier · bb6f6dba
      Oleg Nesterov 提交于
      __group_complete_signal() sets ->group_stop_count in sig_kernel_coredump()
      path and marks the target thread as ->group_exit_task.  So any thread
      except group_exit_task will go to handle_group_stop()->finish_stop().
      
      However, when group_exit_task actually starts do_coredump(), it sets
      SIGNAL_GROUP_EXIT, but does not reset ->group_stop_count while killing
      other threads.  If we have not yet stopped threads in the same thread
      group, they all will spin in kernel mode until group_exit_task sends them
      SIGKILL, because ->group_stop_count > 0 means:
      
      	recalc_sigpending_tsk() never clears TIF_SIGPENDING
      
      	get_signal_to_deliver() goes to handle_group_stop()
      
      	handle_group_stop() returns when SIGNAL_GROUP_EXIT set
      
      	syscall_exit/resume_userspace notice TIF_SIGPENDING,
      	call get_signal_to_deliver() again.
      
      So we are wasting cpu cycles, and if one of these threads is rt_task() this
      may be a serious problem.
      
      NOTE: do_coredump() holds ->mmap_sem, so not stopped threads can't escape
      coredumping after clearing ->group_stop_count.
      
      See also this thread: http://marc.theaimsgroup.com/?t=112739139900002Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bb6f6dba
    • N
      [PATCH] Fix some problems with truncate and mtime semantics. · 4a30131e
      NeilBrown 提交于
      SUS requires that when truncating a file to the size that it currently
      is:
        truncate and ftruncate should NOT modify ctime or mtime
        O_TRUNC SHOULD modify ctime and mtime.
      
      Currently mtime and ctime are always modified on most local
      filesystems (side effect of ->truncate) or never modified (on NFS).
      
      With this patch:
        ATTR_CTIME|ATTR_MTIME are sent with ATTR_SIZE precisely when
          an update of these times is required whether size changes or not
          (via a new argument to do_truncate).  This allows NFS to do
          the right thing for O_TRUNC.
        inode_setattr nolonger forces ATTR_MTIME|ATTR_CTIME when the ATTR_SIZE
          sets the size to it's current value.  This allows local filesystems
          to do the right thing for f?truncate.
      
      Also, the logic in inode_setattr is changed a bit so there are two return
      points.  One returns the error from vmtruncate if it failed, the other
      returns 0 (there can be no other failure).
      
      Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change
      requested, we now fall-through and mark_inode_dirty.  If a filesystem did
      not have a ->truncate function, then vmtruncate will have changed i_size,
      without marking the inode as 'dirty', and I think this is wrong.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4a30131e
    • I
      [PATCH] RCU signal handling · e56d0903
      Ingo Molnar 提交于
      RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
      instead of tasklist_lock read-locked.  This is a scalability improvement on
      SMP and a preemption-latency improvement under PREEMPT_RCU.
      Signed-off-by: NPaul E. McKenney <paulmck@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NWilliam Irwin <wli@holomorphy.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e56d0903
  17. 07 1月, 2006 1 次提交
  18. 30 11月, 2005 1 次提交
  19. 24 11月, 2005 1 次提交
    • O
      [PATCH] fix do_wait() vs exec() race · 962b564c
      Oleg Nesterov 提交于
      When non-leader thread does exec, de_thread adds old leader to the init's
      ->children list in EXIT_ZOMBIE state and drops tasklist_lock.
      
      This means that release_task(leader) in de_thread() is racy vs do_wait()
      from init task.
      
      I think de_thread() should set old leader's state to EXIT_DEAD instead.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: george anzinger <george@mvista.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linus Torvalds <torvalds@osdl.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      962b564c
  20. 09 11月, 2005 3 次提交
  21. 07 11月, 2005 2 次提交
  22. 31 10月, 2005 4 次提交
  23. 30 10月, 2005 3 次提交
    • H
      [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins 提交于
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c74df32c
    • H
      [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins 提交于
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      365e9c87
    • H
      [PATCH] mm: rss = file_rss + anon_rss · 4294621f
      Hugh Dickins 提交于
      I was lazy when we added anon_rss, and chose to change as few places as
      possible.  So currently each anonymous page has to be counted twice, in rss
      and in anon_rss.  Which won't be so good if those are atomic counts in some
      configurations.
      
      Change that around: keep file_rss and anon_rss separately, and add them
      together (with get_mm_rss macro) when the total is needed - reading two
      atomics is much cheaper than updating two atomics.  And update anon_rss
      upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4294621f