1. 25 Jul 2008 (2 commits)
    • flag parameters: epoll_create · a0998b50
      Committed by Ulrich Drepper
      This patch adds the new epoll_create2 syscall.  It extends the old
      epoll_create syscall by one parameter which is meant to hold a flag
      value.  In this patch the only flag supported is EPOLL_CLOEXEC, which
      causes the close-on-exec flag for the returned file descriptor to be set.
      
      A new name EPOLL_CLOEXEC is introduced which in this implementation must
      have the same value as O_CLOEXEC.
      
      The following test must be adjusted for architectures other than x86 and
      x86-64 and in case the syscall numbers changed.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <fcntl.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      
      #ifndef __NR_epoll_create2
      # ifdef __x86_64__
      #  define __NR_epoll_create2 291
      # elif defined __i386__
      #  define __NR_epoll_create2 329
      # else
      #  error "need __NR_epoll_create2"
      # endif
      #endif
      
      #define EPOLL_CLOEXEC O_CLOEXEC
      
      int
      main (void)
      {
        int fd = syscall (__NR_epoll_create2, 1, 0);
        if (fd == -1)
          {
            puts ("epoll_create2(0) failed");
            return 1;
          }
        int coe = fcntl (fd, F_GETFD);
        if (coe == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if (coe & FD_CLOEXEC)
          {
            puts ("epoll_create2(0) set close-on-exec flag");
            return 1;
          }
        close (fd);
      
        fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
        if (fd == -1)
          {
            puts ("epoll_create2(EPOLL_CLOEXEC) failed");
            return 1;
          }
        coe = fcntl (fd, F_GETFD);
        if (coe == -1)
          {
            puts ("fcntl failed");
            return 1;
          }
        if ((coe & FD_CLOEXEC) == 0)
          {
            puts ("epoll_create2(EPOLL_CLOEXEC) did not set close-on-exec flag");
            return 1;
          }
        close (fd);
      
        puts ("OK");
      
        return 0;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0998b50
    • flag parameters: anon_inode_getfd extension · 7d9dbca3
      Committed by Ulrich Drepper
      This patch just extends the anon_inode_getfd interface to take an
      additional parameter with a flag value.  The flag value is passed on to
      get_unused_fd_flags in anticipation of use with the O_CLOEXEC flag.
      
      No actual semantic changes here, the changed callers all pass 0 for now.
      
      [akpm@linux-foundation.org: KVM fix]
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d9dbca3
  2. 02 May 2008 (1 commit)
    • [PATCH] sanitize anon_inode_getfd() · 2030a42c
      Committed by Al Viro
      a) none of the callers even looks at inode or file returned by anon_inode_getfd()
      b) any caller that would try to look at those would be racy, since by the time
      it returns we might have raced with close() from another thread and that
      file would be pining for fjords.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2030a42c
  3. 30 Apr 2008 (2 commits)
  4. 29 Apr 2008 (1 commit)
    • epoll: avoid kmemcheck warning · cdac75e6
      Committed by Davide Libenzi
      Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but
      rb_set_parent() accesses the node's pointer in its code.  This creates a
      warning in kmemcheck (reported by Vegard Nossum) about an uninitialized
      memory access.  The warning is harmless since the following rb-tree node
      insert is going to overwrite the node data.  In any case I think it's
      better to not have that happening at all, and fix it by simplifying the
      code to get rid of a few lines that became superfluous after the previous
      epoll changes.
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdac75e6
  5. 06 Feb 2008 (1 commit)
    • lockdep: annotate epoll · 0ccf831c
      Committed by Peter Zijlstra
      On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
      
      > I remember I talked with Arjan about this time ago. Basically, since 1)
      > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
      > are used, you can see a wake_up() from inside another wake_up(), but they
      > will never refer to the same lock instance.
      > Think about:
      >
      > 	dfd = socket(...);
      > 	efd1 = epoll_create();
      > 	efd2 = epoll_create();
      > 	epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
      > 	epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
      >
      > When a packet arrives to the device underneath "dfd", the net code will
      > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
      > callback wakeup entry on that queue, and the wake_up() performed by the
      > "dfd" net code will end up in ep_poll_callback(). At this point epoll
      > (efd1) notices that it may have some event ready, so it needs to wake up
      > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
      > that ends up in another wake_up(), after having checked about the
      > recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to
      > avoid stack blasting. Never hit the same queue, to avoid loops like:
      >
      > 	epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
      > 	epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
      > 	epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
      > 	epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
      >
      > The code "if (tncur->wq == wq || ..." prevents re-entering the same
      > queue/lock.
      
      Since the epoll code is very careful not to nest same-instance locks,
      allow the recursion.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ccf831c
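      The dfd → efd1 → efd2 nesting Davide describes is easy to reproduce from
      userspace.  The following standalone sketch (Linux-only; an illustration
      written for this note, not part of the patch, with `nested_wakeup_works`
      a hypothetical helper name) registers a pipe in efd1 and efd1 in efd2,
      then observes the cascaded wakeup:

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <sys/epoll.h>
      #include <unistd.h>

      /* Build dfd -> efd1 -> efd2 and check that readiness on the pipe
       * cascades through the nested wake_up() path described above. */
      static int nested_wakeup_works(void)
      {
          int p[2];
          if (pipe(p) == -1)
              return 0;

          int efd1 = epoll_create(1);
          int efd2 = epoll_create(1);
          if (efd1 == -1 || efd2 == -1)
              return 0;

          struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
          epoll_ctl(efd1, EPOLL_CTL_ADD, p[0], &ev);   /* dfd inside efd1  */
          ev.data.fd = efd1;
          epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, &ev);   /* efd1 inside efd2 */

          /* Data on the pipe wakes efd1, whose callback wakes efd2. */
          if (write(p[1], "x", 1) != 1)
              return 0;

          struct epoll_event out;
          int ok = epoll_wait(efd2, &out, 1, 1000) == 1 && out.data.fd == efd1;

          close(efd2); close(efd1); close(p[0]); close(p[1]);
          return ok;
      }

      int main(void)
      {
          assert(nested_wakeup_works());
          puts("nested wakeup delivered");
          return 0;
      }
      ```

      The wake_up() issued for efd1's readiness runs ep_poll_callback() for the
      efd2 registration — exactly the wake_up()-inside-wake_up() case that the
      lockdep annotation has to allow.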
  6. 07 Dec 2007 (1 commit)
  7. 20 Oct 2007 (1 commit)
  8. 19 Oct 2007 (1 commit)
  9. 20 Jul 2007 (1 commit)
    • mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Committed by Paul Mundt
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      20c2df83
  10. 15 May 2007 (4 commits)
  11. 11 May 2007 (3 commits)
  12. 09 May 2007 (3 commits)
    • Introduce a handy list_first_entry macro · b5e61818
      Committed by Pavel Emelianov
      There are many places in the kernel where a construction like

         foo = list_entry(head->next, struct foo_struct, list);

      is used.
      The code might look more descriptive and neat if using the macro
      
         list_first_entry(head, type, member) \
                   list_entry((head)->next, type, member)
      
      Here is the macro itself and examples of its usage in the generic code.
      If it turns out to be useful, I can prepare a set of patches to inject
      it into arch-specific code, drivers, networking, etc.
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Kirill Korotaev <dev@openvz.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5e61818
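      What the shorthand buys can be shown with a small userspace re-creation
      of the kernel macros (the `container_of`/`list_entry` definitions below
      mirror the kernel's but are rewritten here for illustration):

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Minimal userspace re-creation of the kernel's intrusive-list
       * macros, just to illustrate what list_first_entry adds. */
      struct list_head { struct list_head *next, *prev; };

      #define container_of(ptr, type, member) \
          ((type *)((char *)(ptr) - offsetof(type, member)))
      #define list_entry(ptr, type, member) container_of(ptr, type, member)
      #define list_first_entry(head, type, member) \
          list_entry((head)->next, type, member)

      struct foo { int value; struct list_head list; };

      int main(void)
      {
          /* A one-element circular list: head <-> a.list */
          struct foo a = { .value = 42 };
          struct list_head head = { .next = &a.list, .prev = &a.list };
          a.list.next = a.list.prev = &head;

          /* Before: the spelled-out form */
          struct foo *f1 = list_entry(head.next, struct foo, list);
          /* After: the intent is explicit -- "first entry of head" */
          struct foo *f2 = list_first_entry(&head, struct foo, list);

          assert(f1 == f2 && f2->value == 42);
          puts("OK");
          return 0;
      }
      ```

      Both forms expand to the same pointer arithmetic; the macro only makes
      the "first element" intent readable at the call site.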
    • header cleaning: don't include smp_lock.h when not used · e63340ae
      Committed by Randy Dunlap
      Remove includes of <linux/smp_lock.h> where it is not used/needed.
      Suggested by Al Viro.
      
      Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
      sparc64, and arm (all 59 defconfigs).
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e63340ae
    • epoll: optimizations and cleanups · 6192bd53
      Committed by Davide Libenzi
      Epoll is doing multiple passes over the ready set at the moment, because of
      the constraints over the f_op->poll() call.  Looking at the code again, I
      noticed that we already hold the epoll semaphore in read, and this
      (together with other locking conditions that hold while doing an
      epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
      (in a single pass).
      
      This is a stress application that can be used to test the new code.  It
      spawns multiple threads and calls epoll_wait() and epoll_ctl() from many
      of them.  Stress tested on my dual Opteron 254 without any problems.
      
      http://www.xmailserver.org/totalmess.c
      
      This is not a benchmark, just something that tries to stress and exploit
      possible problems with the new code.
      Also, I made a stupid micro-benchmark:
      
      http://www.xmailserver.org/epwbench.c
      
      [1] Considering that epoll must be thread-safe, there are five ways we can
          be hit during an epoll_wait() transfer loop (ep_send_events()):
      
          1) The epoll fd going away and calling ep_free
             This just can't happen, since we did an fget() in sys_epoll_wait
      
          2) An epoll_ctl(EPOLL_CTL_DEL)
             This can't happen because epoll_ctl() gets ep->sem in write, and
             we're holding it in read during ep_send_events()
      
          3) An fd stored inside the epoll fd going away
             This can't happen because in eventpoll_release_file() we get
             ep->sem in write, and we're holding it in read during
             ep_send_events()
      
          4) Another epoll_wait() happening on another thread
             They both can be inside ep_send_events() at the same time, we get
             (splice) the ready-list under the spinlock, so each one will get
             its own ready list. Note that an fd cannot be at the same time
             inside more than one ready list, because ep_poll_callback() will
             not re-queue it if it sees it already linked:
      
             if (ep_is_linked(&epi->rdllink))
                      goto is_linked;
      
             Another case that can happen is two concurrent epoll_wait()
             calls coming in with a userspace event buffer of size, say, ten.
             Suppose there are 50 events ready in the list. The first
             epoll_wait() will "steal" the whole list, while the second, seeing
             no events, will go to sleep. But at the end of ep_send_events() in
             the first epoll_wait(), we will re-inject surplus ready fds, and we
             will trigger the proper wake_up to the second epoll_wait().
      
          5) ep_poll_callback() hitting us asynchronously
             This is the tricky part. As I said above, the ep_is_linked() test
             done inside ep_poll_callback() guarantees that, as long as the
             item appears linked to a list, ep_poll_callback() will not try
             to re-queue it (read: write data to any of its members). When
             we do a list_del() in ep_send_events(), the item will still satisfy
             the ep_is_linked() test (whatever data is written in prev/next,
             it'll never be its own pointer), so ep_poll_callback() will still
             leave us alone. It's only after the eventual
             smp_mb()+INIT_LIST_HEAD(&epi->rdllink) that it'll become visible
             to ep_poll_callback(), but at that point we're already past it.
      
      [akpm@osdl.org: 80 cols]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6192bd53
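      The surplus re-injection in point 4 can be observed from userspace.  A
      minimal sketch (Linux-only, written for this note; it uses EPOLLET so
      that the re-reported event can only come from the re-injected ready-list
      entry, not from level-triggered re-scanning): make three fds ready,
      drain the ready list with a buffer of two, and the third fd is still
      reported by the next epoll_wait():

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <sys/epoll.h>
      #include <unistd.h>

      /* Make `n` pipes readable (edge-triggered), then drain the ready
       * list with a buffer of `max` events.  Returns how many events a
       * second epoll_wait() still sees: the surplus that ep_send_events()
       * re-injected into the ready list. */
      static int surplus_after_partial_drain(int n, int max)
      {
          int epfd = epoll_create(1);
          if (epfd == -1)
              return -1;

          for (int i = 0; i < n; i++) {
              int p[2];
              if (pipe(p) == -1)
                  return -1;
              /* EPOLLET: each fd is queued once until new data arrives */
              struct epoll_event ev = { .events = EPOLLIN | EPOLLET,
                                        .data.fd = p[0] };
              epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);
              write(p[1], "x", 1);            /* make it ready */
          }

          struct epoll_event out[8];
          int first = epoll_wait(epfd, out, max, 1000);  /* partial drain  */
          int rest  = epoll_wait(epfd, out, 8, 0);       /* surplus shows  */
          close(epfd);
          return first == max ? rest : -1;
      }

      int main(void)
      {
          /* 3 ready fds, user buffer of 2: one surplus event remains. */
          assert(surplus_after_partial_drain(3, 2) == 1);
          puts("surplus events re-reported");
          return 0;
      }
      ```

      With edge-triggered registration the two delivered events are not
      re-armed, so the single event the second wait returns is precisely the
      entry that did not fit in the first user buffer.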
  13. 09 Dec 2006 (1 commit)
  14. 08 Dec 2006 (2 commits)
  15. 12 Oct 2006 (1 commit)
    • [PATCH] epoll_pwait() · b611967d
      Committed by Davide Libenzi
      Implement the epoll_pwait system call, which extends the event wait
      mechanism with the same logic ppoll and pselect use.  The definition of
      epoll_pwait is:

      int epoll_pwait(int epfd, struct epoll_event *events, int maxevents,
                       int timeout, const sigset_t *sigmask, size_t sigsetsize);

      The difference between the vanilla epoll_wait and epoll_pwait is that the
      latter allows the caller to specify a signal mask to be set while waiting
      for events.  Hence epoll_pwait will wait until either a monitored event
      occurs or an unmasked signal arrives.  If sigmask is NULL, the
      epoll_pwait system call will act exactly like epoll_wait.  For the POSIX
      definition of pselect, information is available here:
      
      http://www.opengroup.org/onlinepubs/009695399/functions/select.html
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b611967d
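      The atomic mask-switch-and-wait this buys can be sketched from userspace
      (a minimal illustration; `wait_readable` is a hypothetical helper written
      for this note, not part of the patch):

      ```c
      #define _GNU_SOURCE             /* for epoll_pwait on older glibc */
      #include <assert.h>
      #include <signal.h>
      #include <stdio.h>
      #include <sys/epoll.h>
      #include <unistd.h>

      /* Wait for readability on fd with SIGINT blocked only for the
       * duration of the wait -- the atomic mask switch epoll_pwait()
       * provides. */
      static int wait_readable(int fd, int timeout_ms)
      {
          int epfd = epoll_create(1);
          if (epfd == -1)
              return -1;

          struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
          if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
              close(epfd);
              return -1;
          }

          /* The kernel installs `mask`, sleeps, and restores the old mask
           * atomically -- no window where SIGINT is delivered unmasked. */
          sigset_t mask;
          sigemptyset(&mask);
          sigaddset(&mask, SIGINT);

          struct epoll_event out;
          int n = epoll_pwait(epfd, &out, 1, timeout_ms, &mask);
          close(epfd);
          return n;
      }

      int main(void)
      {
          int p[2];
          assert(pipe(p) == 0);
          assert(write(p[1], "x", 1) == 1);
          assert(wait_readable(p[0], 1000) == 1);
          puts("OK");
          return 0;
      }
      ```

      Doing the same with sigprocmask() plus epoll_wait() would leave a race
      window between unblocking the signal and entering the wait, which is the
      gap pselect/ppoll/epoll_pwait were introduced to close.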
  16. 03 Oct 2006 (1 commit)
  17. 27 Sep 2006 (1 commit)
  18. 28 Aug 2006 (1 commit)
  19. 04 Jul 2006 (1 commit)
  20. 26 Jun 2006 (1 commit)
    • [PATCH] epoll: use unlocked wqueue operations · 3419b23a
      Committed by Davide Libenzi
      A few days ago Arjan signaled a lockdep red flag on epoll locks, and
      precisely between the epoll's device structure lock (->lock) and the wait
      queue head lock (->lock).
      
      Like I explained in another email, and directly to Arjan, this can't
      happen in reality because of the explicit check at eventpoll.c:592,
      which does not allow dropping an epoll fd inside the same epoll fd.
      Since lockdep works on per-structure locks, it will never be able to
      know of policies enforced in other parts of the code.
      
      It was decided some time ago to allow dropping epoll fds inside other
      epoll fds, which triggers very tricky wakeup operations (due to possibly
      reentrant callback-driven wakeups) handled by the ep_poll_safewake()
      function.  While looking again at the code though, I noticed that all the
      operations done on the epoll's main structure wait queue head (->wq) are
      already protected by the epoll lock (->lock), so that locked-style
      functions can be used to manipulate the ->wq member.  This both saves a
      lock acquire and makes lockdep happy.
      
      Running totalmess on my dual opteron for a while did not reveal any problem
      so far:
      
      http://www.xmailserver.org/totalmess.c
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3419b23a
  21. 23 Jun 2006 (1 commit)
    • [PATCH] VFS: Permit filesystem to override root dentry on mount · 454e2398
      Committed by David Howells
      Extend the get_sb() filesystem operation to take an extra argument that
      permits the VFS to pass in the target vfsmount that defines the mountpoint.
      
      The filesystem is then required to manually set the superblock and root dentry
      pointers.  For most filesystems, this should be done with simple_set_mnt()
      which will set the superblock pointer and then set the root dentry to the
      superblock's s_root (as per the old default behaviour).
      
      The get_sb() op now returns an integer as there's now no need to return the
      superblock pointer.
      
      This patch permits a superblock to be implicitly shared amongst several mount
      points, such as can be done with NFS to avoid potential inode aliasing.  In
      such a case, simple_set_mnt() would not be called, and instead the mnt_root
      and mnt_sb would be set directly.
      
      The patch also makes the following changes:
      
       (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
           pointer argument and return an integer, so most filesystems have to change
           very little.
      
       (*) If one of the convenience functions is not used, then get_sb() should
           normally call simple_set_mnt() to instantiate the vfsmount. This will
           always return 0, and so can be tail-called from get_sb().
      
       (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
           dcache upon superblock destruction rather than shrink_dcache_anon().
      
           This is required because the superblock may now have multiple trees that
           aren't actually bound to s_root, but that still need to be cleaned up. The
           currently called functions assume that the whole tree is rooted at s_root,
           and that anonymous dentries are not the roots of trees which results in
           dentries being left unculled.
      
           However, with the way NFS superblock sharing is currently set to be
           implemented, these assumptions are violated: the root of the filesystem is
           simply a dummy dentry and inode (the real inode for '/' may well be
           inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
           with child trees.
      
           [*] Anonymous until discovered from another tree.
      
       (*) The documentation has been adjusted, including the additional bit of
           changing ext2_* into foo_* in the documentation.
      
      [akpm@osdl.org: convert ipath_fs, do other stuff]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      454e2398
  22. 21 Apr 2006 (1 commit)
  23. 11 Apr 2006 (1 commit)
  24. 29 Mar 2006 (1 commit)
  25. 27 Mar 2006 (1 commit)
  26. 26 Mar 2006 (1 commit)
    • [PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications · f348d70a
      Committed by Davide Libenzi
      Implement the half-closed devices notification by adding a new POLLRDHUP
      (and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since
      changing the existing POLLHUP handling, which does not correctly report
      half-closed devices, was considered too risky, this implementation leaves
      the current POLLHUP reporting unchanged and simply adds a new bit that is
      set in the few places where it makes sense.  The same thing was discussed
      and conceptually agreed quite some time ago:
      
      http://lkml.org/lkml/2003/7/12/116
      
      Since this new event bit is added to the existing Linux poll
      infrastructure, even the existing poll/select system calls will be able
      to use it.  As for the existing POLLHUP handling, the patch leaves it as
      is.  The pollrdhup-2.6.16.rc5-0.10.diff defines POLLRDHUP for all the
      existing archs and sets the bit in the six relevant files.  The other
      attached diff is the simple change required to sys/epoll.h to add the
      EPOLLRDHUP definition.
      
      There is "a stupid program" to test POLLRDHUP delivery here:
      
       http://www.xmailserver.org/pollrdhup-test.c
      
      It tests poll(2), but since the delivery is the same, epoll(2) will work
      equally well.
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      f348d70a
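      The half-close condition the new bit reports can be demonstrated with a
      socketpair (a minimal sketch written for this note; `poll_events` is a
      hypothetical helper, and it uses epoll rather than the poll(2) test
      program linked above):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <sys/epoll.h>
      #include <sys/socket.h>
      #include <unistd.h>

      /* Returns the event mask epoll reports for fd, or 0 on timeout. */
      static uint32_t poll_events(int fd, int timeout_ms)
      {
          int epfd = epoll_create(1);
          if (epfd == -1)
              return 0;
          struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP,
                                    .data.fd = fd };
          epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
          struct epoll_event out = { 0 };
          int n = epoll_wait(epfd, &out, 1, timeout_ms);
          close(epfd);
          return n > 0 ? out.events : 0;
      }

      int main(void)
      {
          int sv[2];
          assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0);

          /* The peer half-closes its write side: POLLHUP stays clear
           * (our own write side is still open), but EPOLLRDHUP reports
           * that the read direction has seen EOF. */
          shutdown(sv[1], SHUT_WR);

          assert(poll_events(sv[0], 1000) & EPOLLRDHUP);
          puts("half-close observed via EPOLLRDHUP");
          return 0;
      }
      ```

      Without the new bit, a level-triggered reader only learns of the
      half-close by reading until read() returns 0.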
  27. 23 Mar 2006 (2 commits)
  28. 28 Sep 2005 (1 commit)
  29. 18 Sep 2005 (1 commit)