1. 06 Feb, 2008 21 commits
    • VFS/Security: Rework inode_getsecurity and callers to return resulting buffer · 42492594
      Committed by David P. Quigley
      This patch modifies the interface to inode_getsecurity to have the function
      return a buffer containing the security blob and its length via parameters
      instead of relying on the calling function to give it an appropriately sized
      buffer.
      
      Security blobs obtained with this function should be freed using the
      release_secctx LSM hook.  This spares the caller from having to guess a
      length and preallocate a buffer for this function, which allows it to be
      used elsewhere for Labeled NFS.
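The calling convention described above (the callee sizes and allocates the buffer, the caller hands it back through a release function) can be sketched in plain userspace C. This is only an illustration of the pattern, not the kernel interface: get_label() and release_label() are hypothetical names, and malloc()/free() stand in for whatever allocator the real hook uses.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a getsecurity-style hook: the callee sizes and
 * allocates the buffer, then returns the pointer and length via parameters.
 * Returns 0 on success, -1 if the allocation fails.
 */
int get_label(const char *stored, void **buffer, size_t *len)
{
	size_t n = strlen(stored) + 1;	/* include the trailing NUL */
	void *copy = malloc(n);

	if (!copy)
		return -1;
	memcpy(copy, stored, n);
	*buffer = copy;
	*len = n;
	return 0;
}

/* Hypothetical counterpart of the release hook: callers never free() directly. */
void release_label(void *buffer)
{
	free(buffer);
}
```

With this shape the caller no longer guesses a size up front: it passes two out-parameters, uses the blob, and returns it with release_label().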
      
      The patch also removed the unused err parameter.  The conversion is similar to
      the one performed by Al Viro for the security_getprocattr hook.
      Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Acked-by: James Morris <jmorris@namei.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: speed up writeback of big dirty files · 8bc3be27
      Committed by Fengguang Wu
      After dirtying a 100M file, the normal behavior is to start the
      writeback for all data after a 30s delay.  But sometimes the following
      happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analysis shows that the internal io dispatch queues go like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      1) initial state with a 100M file and a 1K file

      2) 4M written, nr_to_write <= 0, so write more

      3) 1K written, nr_to_write > 0, no more writes (BUG)

      nr_to_write > 0 in (3) fools the upper layer into thinking that the data
      have all been written out.  The big dirty file is actually still sitting in
      s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
      becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
      may starve newly expired inodes in s_dirty.  It is also not an option to
      draw inodes from both s_more_io and s_dirty, and let the loop go on: this
      might lead to live locks, and might also starve other superblocks in sync
      time (well, kupdate may still starve some superblocks, but that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0
      does not necessarily mean that "all data are written".  This patch
      introduces a flag writeback_control.more_io to indicate that more io should
      be done.  With it the big dirty file no longer has to wait for the next
      kupdate invocation 5s later.
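The decision the caller has to make once more_io exists can be sketched as follows. This is a toy model under stated assumptions, not the patch itself: struct wbc_sketch and wbc_needs_another_pass() are hypothetical names, and only the two fields discussed in this message are modelled.

```c
/* Hypothetical, trimmed-down stand-in for struct writeback_control. */
struct wbc_sketch {
	long nr_to_write;	/* page budget left over after a writeback pass */
	int more_io;		/* set when inodes are still parked in s_more_io */
};

/* Returns 1 if the caller should run another pass soon, 0 if it may stop. */
int wbc_needs_another_pass(const struct wbc_sketch *wbc)
{
	if (wbc->nr_to_write <= 0)
		return 1;		/* budget used up: there may be more to write */
	return wbc->more_io != 0;	/* budget left, but inodes are still queued */
}
```

In step (3) of the table above the budget is not exhausted, so without the flag the caller would stop; with more_io set it keeps going instead of waiting for the next kupdate run.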
      
      In sync_sb_inodes() we only set more_io on super_blocks we actually
      visited.  This avoids the interaction between two pdflush daemons.
      
      Also in __sync_single_inode() we don't blindly keep requeuing the io if the
      filesystem cannot progress.  Failing to do so may lead to 100% iowait.
      Tested-by: Mike Snitzer <snitzer@gmail.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • skip writing data pages when inode is under I_SYNC · 2d544564
      Committed by Qi Yong
      Since I_SYNC was split out from I_LOCK, the concern in commit
      4b89eed9 ("Write back inode data pages even when the inode itself is
      locked") is no longer valid.

      We should revert to the original behavior: in __writeback_single_inode(),
      when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
      skip writing entirely.  Otherwise, we are calling do_writepages() twice.
      Signed-off-by: Qi Yong <qiyong@fc-cn.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Joern Engel <joern@wohnheim.fh-wedel.de>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix /proc dcache deadlock in do_exit · 7766755a
      Committed by Andrea Arcangeli
      This patch fixes a sles9 system hang in start_this_handle from a customer
      with some heavy workload where all tasks are waiting on kjournald to commit
      the transaction, but kjournald waits on t_updates to go down to zero (it
      never does).
      
      This was reported as a lowmem shortage deadlock, but when checking the debug
      data I noticed the VM wasn't under pressure at all (well, it was really
      under vm pressure, because lots of tasks hung in the VM prune_dcache
      methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
      mode, as the holder of the journal handle would have been if this were a
      vm issue in the first place).
      
      No task was apparently holding the leftover handle in the committing
      transaction, so I deduced t_updates was stuck at 1 because a journal_stop
      was never run by some path (this turned out to be correct).  With a debug
      patch adding proper reverse links and stack trace logging in ext3 deployed
      in production, I found journal_stop is never run because
      mark_inode_dirty_sync is called inside release_task called by do_exit
      (that was quite fun because I would never have thought of this
      subtlety; I thought a regular path in ext3 had a bug and forgot to
      call journal_stop).
      
      do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
      come back to run journal_stop)
      
      The reason is that shrink_dcache_parent is racy by design (feature not
      a bug) and it can do blocking I/O in some cases, but the point is that
      calling shrink_dcache_parent at the last stage of do_exit isn't safe
      for self-reaping tasks.

      I guess the memory pressure of the unbalanced highmem system made this
      easier to trigger.

      Now mainline doesn't have this line in iput (like sles9 has):

      	if (inode->i_state & I_DIRTY_DELAYED)
      		mark_inode_dirty_sync(inode);
      
      so it will probably not crash with ext3, but for example ext2 implements an
      I/O-blocking ext2_put_inode that will lead to similar screwups with
      ext2_free_blocks never coming back, and it's definitely wrong to call
      blocking-IO paths inside do_exit.  So this should fix a subtle bug in
      mainline too (not verified in practice though).  The equivalent fix for
      ext3 is also not yet verified to fix the problem in sles9, but I have no
      doubt it will (it usually takes days to crash, so it'll take weeks to be
      sure).

      An alternate fix would be to offload that work to a kernel thread, but I
      don't think a reschedule for this is worth it; the vm should be able to
      collect those entries for the synchronous release_task.
      Signed-off-by: Andrea Arcangeli <andrea@suse.de>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: make page monitoring /proc file optional · 1e883281
      Committed by Matt Mackall
      Make /proc/ page monitoring configurable
      
      This puts the following files under an embedded config option:
      
      /proc/pid/clear_refs
      /proc/pid/smaps
      /proc/pid/pagemap
      /proc/kpagecount
      /proc/kpageflags
      
      [akpm@linux-foundation.org: Kconfig fix]
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add /proc/kpageflags interface · 304daa81
      Committed by Matt Mackall
      This makes a subset of physical page flags available to userspace. Together
      with /proc/pid/pagemap, this allows tracking of a wide variety of VM behaviors.
      
      Exported flags are decoupled from the kernel's internal flags. This
      allows us to reorder flag bits, and synthesize any bits that get
      redefined in terms of other bits.
      
      [akpm@linux-foundation.org: remove unneeded access_ok()]
      [akpm@linux-foundation.org: s/0/NULL/]
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add /proc/kpagecount interface · 161f47bf
      Committed by Matt Mackall
      This makes physical page map counts available to userspace. Together
      with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
      monitor memory usage on a per-page basis.
      
      [akpm@linux-foundation.org: remove unneeded access_ok()]
      [bunk@stusta.de: make struct proc_kpagemap static]
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add /proc/pid/pagemap interface · 85863e47
      Committed by Matt Mackall
      This interface provides a mapping for each page in an address space to its
      physical page frame number, allowing precise determination of what pages are
      mapped and what pages are shared between processes.
      
      New in this version:
      
      - headers gone again (as recommended by Dave Hansen and Alan Cox)
      - 64-bit entries (as per discussion with Andi Kleen)
      - swap pte information exported (from Dave Hansen)
      - page walker callback for holes (from Dave Hansen)
      - direct put_user I/O (as suggested by Rusty Russell)
      
      This patch folds in cleanups and swap PTE support from Dave Hansen
      <haveblue@us.ibm.com>.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: regroup task_mmu by interface · a6198797
      Committed by Matt Mackall
      Reorder source so that all the code and data for each interface are together.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: move clear_refs code to task_mmu.c · f248dcb3
      Committed by Matt Mackall
      This puts all the clear_refs code where it belongs and probably lets things
      compile on MMU-less systems as well.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: simplify interdependence of maps and smaps · 4752c369
      Committed by Matt Mackall
      This pulls the shared map display code out of show_map and puts it in
      show_smap where it belongs.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: use pagewalker in clear_refs and smaps · b3ae5acb
      Committed by Matt Mackall
      Use the generic pagewalker for smaps and clear_refs.
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • maps4: add proportional set size accounting in smaps · ec4dd3eb
      Committed by Fengguang Wu
      The "proportional set size" (PSS) of a process is the count of pages it has
      in memory, where each page is divided by the number of processes sharing
      it.  So if a process has 1000 pages all to itself, and 1000 shared with one
      other process, its PSS will be 1500.
      
                     - lwn.net: "ELC: How much memory are applications really using?"
      
      The PSS proposed by Matt Mackall is a very nice metric for measuring a
      process's memory footprint.  So collect and export it via
      /proc/<pid>/smaps.
      
      Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
      job.  They are comprehensive tools.  But for PSS, let's do it in the simple
      way.
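The definition quoted above can be checked with a few lines of userspace C. This is a sketch of the arithmetic only: pss_pages() is a hypothetical helper, the per-page sharer counts are assumed to be given as an array, and floating point is used for clarity even though kernel code generally avoids it, so the patch itself necessarily does its accounting differently.

```c
#include <stddef.h>

/*
 * mapcounts[i] is the number of processes mapping resident page i of the
 * process being measured (at least 1 for a mapped page).
 * Returns the proportional set size in pages.
 */
double pss_pages(const unsigned int *mapcounts, size_t npages)
{
	double pss = 0.0;
	size_t i;

	for (i = 0; i < npages; i++)
		if (mapcounts[i] > 0)
			pss += 1.0 / mapcounts[i];	/* each sharer is charged an equal slice */
	return pss;
}
```

Feeding it 1000 pages with a count of 1 and 1000 pages with a count of 2 reproduces the 1500 from the quoted example.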
      
      Cc: John Berthels <jjberthels@gmail.com>
      Cc: Bernardo Innocenti <bernie@codewiz.org>
      Cc: Padraig Brady <P@draigBrady.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: allow sticky directory mount option · 75897d60
      Committed by Ken Chen
      Allow sticky directory mount option for hugetlbfs.  This allows an admin
      to create a shared hugetlbfs mount point for multiple users, while
      preventing users from accidentally deleting each other's files.
      It is similar to the default tmpfs mount option, or the typical option
      used on /tmp.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bufferhead: revert constructor removal · b98938c3
      Committed by Christoph Lameter
      The constructor for buffer_head slabs was removed recently.  We need the
      constructor back in slab defrag in order to ensure that slab objects always
      have a definite state even before we allocate them.

      I think we mistakenly merged the removal of the constructor into a cleanup
      patch.  You (ie: akpm) had a test that showed that the removal of the
      constructor led to a small regression.  The prior state makes things easier
      for slab defrag.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • is_vmalloc_addr(): Check if an address is within the vmalloc boundaries · 9e2779fa
      Committed by Christoph Lameter
      Checking if an address is a vmalloc address is done in a couple of places.
      Define a common version in mm.h and replace the other checks.
      
      Again the include structures suck.  The definition of VMALLOC_START and
      VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included
      there.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Committed by Christoph Lameter
      Simplify page cache zeroing of segments of pages through 3 functions:

      zero_user_segments(page, start1, end1, start2, end2)

              Zeros two segments of the page. It takes the positions where to
              start and end the zeroing, which avoids length calculations and
              makes code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
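The relationship between the three helpers can be sketched on an ordinary byte buffer. This is a userspace analogue under stated assumptions, not the kernel code: the zero_buf_* names and SKETCH_PAGE_SIZE are hypothetical, the functions take a plain pointer instead of a struct page, and the kmap and cache-flushing steps mentioned in this message are left out entirely.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Zero [start1, end1) and [start2, end2); an empty range is skipped. */
void zero_buf_segments(unsigned char *page,
		       unsigned int start1, unsigned int end1,
		       unsigned int start2, unsigned int end2)
{
	assert(end1 <= SKETCH_PAGE_SIZE && end2 <= SKETCH_PAGE_SIZE);
	if (end1 > start1)
		memset(page + start1, 0, end1 - start1);
	if (end2 > start2)
		memset(page + start2, 0, end2 - start2);
}

/* Single-segment variant: the second range is left empty. */
void zero_buf_segment(unsigned char *page, unsigned int start, unsigned int end)
{
	zero_buf_segments(page, start, end, 0, 0);
}

/* Length variant for callers that know the size rather than the end offset. */
void zero_buf(unsigned char *page, unsigned int start, unsigned int length)
{
	zero_buf_segment(page, start, start + length);
}
```

Taking start and end offsets means a caller that wants to clear everything around a valid region can pass the region's boundaries directly, without computing two lengths.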
      
      We remove the zero_user_page macro. Issues:

      1. It's a macro. Inline functions are preferable.

      2. The KM_USER0 macro is only defined for HIGHMEM.

         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.

      Avoiding KM_USER0 means that a lot of code no longer has to deal
      with the special casing for HIGHMEM. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
      could not be an inline function. The zero_user_* functions introduced
      here can be inline because that constant is not used when these
      functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • timerfd: new timerfd API · 4d672e7a
      Committed by Davide Libenzi
      This is the new timerfd API as it is implemented by the following patch:
      
      int timerfd_create(int clockid, int flags);
      int timerfd_settime(int ufd, int flags,
      		    const struct itimerspec *utmr,
      		    struct itimerspec *otmr);
      int timerfd_gettime(int ufd, struct itimerspec *otmr);
      
      The timerfd_create() API creates an un-programmed timerfd fd.  The "clockid"
      parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
      
      The timerfd_settime() API gives new settings to the timerfd fd, optionally
      retrieving the previous expiration time (in case the "otmr" parameter is not
      NULL).
      
      The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
      is set in the "flags" parameter.  Otherwise it's a relative time.
      
      The timerfd_gettime() API returns the next expiration time of the timer, or
      {0, 0} if the timerfd has not been set yet.
      
      Like the previous timerfd API implementation, read(2) and poll(2) are
      supported (with the same interface).  Here's a simple test program I used to
      exercise the new timerfd APIs:
      
      http://www.xmailserver.org/timerfd-test2.c
      
      [akpm@linux-foundation.org: coding-style cleanups]
      [akpm@linux-foundation.org: fix ia64 build]
      [akpm@linux-foundation.org: fix m68k build]
      [akpm@linux-foundation.org: fix mips build]
      [akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
      [heiko.carstens@de.ibm.com: fix s390]
      [akpm@linux-foundation.org: fix powerpc build]
      [akpm@linux-foundation.org: fix sparc64 more]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: rework the group exit and fix the race with kill · ed5d2cac
      Committed by Oleg Nesterov
      As Roland pointed out, we have a very old problem with exec.  de_thread()
      sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
      clears signal->flags.  All signals (even fatal ones) sent in this window
      (which is not too small) will be lost.

      With this patch exec doesn't abuse SIGNAL_GROUP_EXIT.  signal_group_exit(),
      the new helper, should be used to detect exit_group() or exec() in progress.
      It can have more users, but this patch makes only the strictly necessary changes.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • get_task_comm(): return the result · 59714d65
      Committed by Andrew Morton
      It was dumb to make get_task_comm() return void.  Change it to return a
      pointer to the resulting output for caller convenience.
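The convenience gained by returning the buffer can be shown with a userspace analogue. This is an illustration of the calling style only: copy_comm() and SKETCH_COMM_LEN are hypothetical, and the function copies from a plain string rather than from a task structure.

```c
#include <string.h>

#define SKETCH_COMM_LEN 16	/* assumed buffer size for this sketch */

/* Hypothetical stand-in: copy a task name into buf and return buf. */
char *copy_comm(char *buf, const char *comm)
{
	strncpy(buf, comm, SKETCH_COMM_LEN - 1);
	buf[SKETCH_COMM_LEN - 1] = '\0';	/* strncpy may not terminate */
	return buf;
}
```

Since the call now yields the buffer, it can be used directly as an argument, for example printf("%s\n", copy_comm(buf, name)), instead of needing a separate statement before the print.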
      
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockdep: annotate epoll · 0ccf831c
      Committed by Peter Zijlstra
      On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
      
      > I remember I talked with Arjan about this some time ago. Basically, since 1)
      > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
      > are used, you can see a wake_up() from inside another wake_up(), but they
      > will never refer to the same lock instance.
      > Think about:
      >
      > 	dfd = socket(...);
      > 	efd1 = epoll_create();
      > 	efd2 = epoll_create();
      > 	epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
      > 	epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
      >
      > When a packet arrives to the device underneath "dfd", the net code will
      > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
      > callback wakeup entry on that queue, and the wake_up() performed by the
      > "dfd" net code will end up in ep_poll_callback(). At this point epoll
      > (efd1) notices that it may have some event ready, so it needs to wake up
      > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
      > that ends up in another wake_up(), after having checked the
      > recursion constraints. Those are: no more than EP_MAX_POLLWAKE_NESTS, to
      > avoid stack blasting, and never hit the same queue, to avoid loops like:
      >
      > 	epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
      > 	epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
      > 	epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
      > 	epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
      >
      > The code "if (tncur->wq == wq || ..." prevents re-entering the same
      > queue/lock.
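The nesting in the quoted example can be reproduced from userspace roughly as follows. This is a sketch under stated assumptions: it uses a pipe instead of the socket in the quote so that it needs no network setup, epoll_create1(0) instead of epoll_create(), and nested_epoll_ready_count() is a hypothetical helper name.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Build pipe -> efd1 -> efd2, make the pipe readable, and report how many
 * events the outer epoll instance (efd2) sees. Returns -1 on setup failure.
 */
int nested_epoll_ready_count(void)
{
	struct epoll_event ev = { 0 }, got;
	int pfd[2], efd1, efd2, n = -1;

	if (pipe(pfd) < 0)
		return -1;
	efd1 = epoll_create1(0);
	efd2 = epoll_create1(0);
	if (efd1 < 0 || efd2 < 0)
		goto cleanup;

	ev.events = EPOLLIN;
	ev.data.fd = pfd[0];
	if (epoll_ctl(efd1, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
		goto cleanup;
	ev.data.fd = efd1;
	if (epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, &ev) < 0)	/* epoll fd inside an epoll fd */
		goto cleanup;

	/* Data on the pipe wakes efd1, whose wakeup in turn wakes efd2. */
	if (write(pfd[1], "x", 1) != 1)
		goto cleanup;
	n = epoll_wait(efd2, &got, 1, 1000);
cleanup:
	if (efd1 >= 0)
		close(efd1);
	if (efd2 >= 0)
		close(efd2);
	close(pfd[0]);
	close(pfd[1]);
	return n;
}
```

The write to the pipe triggers a wakeup on efd1's queue from inside the pipe's own wakeup, which is the nested wake_up() pattern on distinct lock instances that the annotation has to allow.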
      
      Since the epoll code is very careful not to nest same-instance locks,
      allow the recursion.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Feb, 2008 5 commits
  3. 03 Feb, 2008 4 commits
  4. 02 Feb, 2008 10 commits