1. 01 5月, 2013 5 次提交
  2. 04 3月, 2013 1 次提交
  3. 03 1月, 2013 1 次提交
    • E
      epoll: prevent missed events on EPOLL_CTL_MOD · 128dd175
      Eric Wong 提交于
      EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
      ensure events are not missed.  Since the modifications to the interest
      mask are not protected by the same lock as ep_poll_callback, we need to
      ensure the change is visible to other CPUs calling ep_poll_callback.
      
      We also need to ensure f_op->poll() has an up-to-date view of past
      events which occured before we modified the interest mask.  So this
      barrier also pairs with the barrier in wq_has_sleeper().
      
      This should guarantee either ep_poll_callback or f_op->poll() (or both)
      will notice the readiness of a recently-ready/modified item.
      
      This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
      http://thread.gmane.org/gmane.linux.kernel/1408782/Signed-off-by: NEric Wong <normalperson@yhbt.net>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
      Tested-by: N"Junchang(Jason) Wang" <junchang.wang@yale.edu>
      Cc: netdev@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      128dd175
  4. 18 12月, 2012 1 次提交
    • C
      fs, epoll: add procfs fdinfo helper · 138d22b5
      Cyrill Gorcunov 提交于
      This allows us to print out eventpoll target file descriptor, events and
      data, the /proc/pid/fdinfo/fd consists of
      
       | pos:	0
       | flags:	02
       | tfd:        5 events:       1d data: ffffffffffffffff enabled: 1
      
      [avagin@: fix for unitialized ret variable]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      138d22b5
  5. 09 11月, 2012 1 次提交
    • A
      revert "epoll: support for disabling items, and a self-test app" · a80a6b85
      Andrew Morton 提交于
      Revert commit 03a7beb5 ("epoll: support for disabling items, and a
      self-test app") pending resolution of the issues identified by Michael
      Kerrisk, copied below.
      
      We'll revisit this for 3.8.
      
      : I've taken a look at this patch as it currently stands in 3.7-rc1, and
      : done a bit of testing. (By the way, the test program
      : tools/testing/selftests/epoll/test_epoll.c does not compile...)
      :
      : There are one or two places where the behavior seems a little strange,
      : so I have a question or two at the end of this mail. But other than
      : that, I want to check my understanding so that the interface can be
      : correctly documented.
      :
      : Just to go though my understanding, the problem is the following
      : scenario in a multithreaded application:
      :
      : 1. Multiple threads are performing epoll_wait() operations,
      :    and maintaining a user-space cache that contains information
      :    corresponding to each file descriptor being monitored by
      :    epoll_wait().
      :
      : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
      :    a file descriptor from the epoll interest list, and
      :    delete the corresponding record from the user-space cache.
      :
      : 3. The problem with (2) is that some other thread may have
      :    previously done an epoll_wait() that retrieved information
      :    about the fd in question, and may be in the middle of using
      :    information in the cache that relates to that fd. Thus,
      :    there is a potential race.
      :
      : 4. The race can't solved purely in user space, because doing
      :    so would require applying a mutex across the epoll_wait()
      :    call, which would of course blow thread concurrency.
      :
      : Right?
      :
      : Your solution is the EPOLL_CTL_DISABLE operation. I want to
      : confirm my understanding about how to use this flag, since
      : the description that has accompanied the patches so far
      : has been a bit sparse
      :
      : 0. In the scenario you're concerned about, deleting a file
      :    descriptor means (safely) doing the following:
      :    (a) Deleting the file descriptor from the epoll interest list
      :        using EPOLL_CTL_DEL
      :    (b) Deleting the corresponding record in the user-space cache
      :
      : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
      :    conjunction with EPOLLONESHOT.
      :
      : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
      :    conjunction is a logical error.
      :
      : 3. The correct way to code multithreaded applications using
      :    EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
      :
      :    a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
      :       should EPOLLONESHOT.
      :
      :    b. When a thread wants to delete a file descriptor, it
      :       should do the following:
      :
      :       [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
      :       [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
      :           was zero, then the file descriptor can be safely
      :           deleted by the thread that made this call.
      :       [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
      :           then the descriptor is in use. In this case, the calling
      :           thread should set a flag in the user-space cache to
      :           indicate that the thread that is using the descriptor
      :           should perform the deletion operation.
      :
      : Is all of the above correct?
      :
      : The implementation depends on checking on whether
      : (events & ~EP_PRIVATE_BITS) == 0
      : This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
      : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
      : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
      : cleared.
      :
      : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
      : is only useful in conjunction with EPOLLONESHOT. However, as things
      : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
      : not have EPOLLONESHOT set in 'events' This results in the following
      : (slightly surprising) behavior:
      :
      : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
      :     (the indicator that the file descriptor can be safely deleted).
      : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
      :
      : This doesn't seem particularly useful, and in fact is probably an
      : indication that the user made a logic error: they should only be using
      : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
      : EPOLLONESHOT was set in 'events'. If that is correct, then would it
      : not make sense to return an error to user space for this case?
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paton J. Lewis" <palewis@adobe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a80a6b85
  6. 06 10月, 2012 1 次提交
  7. 27 9月, 2012 2 次提交
  8. 22 8月, 2012 1 次提交
  9. 18 7月, 2012 1 次提交
  10. 02 6月, 2012 1 次提交
  11. 23 5月, 2012 1 次提交
    • R
      epoll: Fix user space breakage related to EPOLLWAKEUP · a8159414
      Rafael J. Wysocki 提交于
      Commit 4d7e30d9 (epoll: Add a flag, EPOLLWAKEUP, to prevent
      suspend while epoll events are ready) caused some applications to
      malfunction, because they set the bit corresponding to the new
      EPOLLWAKEUP flag in their eventpoll flags and they don't have the
      new CAP_EPOLLWAKEUP capability.
      
      To prevent that from happening, change epoll_ctl() to clear
      EPOLLWAKEUP in epds.events if the caller doesn't have the
      CAP_EPOLLWAKEUP capability instead of failing and returning an
      error code, which allows the affected applications to function
      normally.
      Reported-and-tested-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      a8159414
  12. 06 5月, 2012 1 次提交
  13. 26 4月, 2012 1 次提交
  14. 29 3月, 2012 1 次提交
  15. 24 3月, 2012 3 次提交
    • H
      poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Hans Verkuil 提交于
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      directly.
      
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      
      For qproc it was especially important to change its name since the
      behavior of that field changes with this patch since this function pointer
      can now be NULL when that wasn't possible in the past.
      
      Any driver accessing the qproc or key fields directly will now fail to compile.
      
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL, drivers can never rely on it being one or
      the other as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      
      There are only three things a driver can do with the wait argument:
      
      1) obtain the key field:
      
      	events = wait ? wait->key : ~0;
      
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
         unnecessarily.
      
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
      
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
      
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
      
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: NJonathan Corbet <corbet@lwn.net>
      Reviewed-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      626cf236
    • D
      epoll: remove unneeded variable in reverse_path_check() · da0503aa
      Dan Carpenter 提交于
      We never use the length variable.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: NJason Baron <jbaron@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da0503aa
    • S
      epoll: comment the funky #ifdef · 02edc6fc
      Steven Rostedt 提交于
      Looking for a bug in -rt, I stumbled across this code here from: commit
      2dfa4eea ("epoll keyed wakeups: teach epoll about hints coming with
      the wakeup key"), specifically:
      
        #ifdef CONFIG_DEBUG_LOCK_ALLOC
        static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
                                            unsigned long events, int subclass)
        {
               unsigned long flags;
      
               spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
               wake_up_locked_poll(wqueue, events);
               spin_unlock_irqrestore(&wqueue->lock, flags);
        }
        #else
        static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
                                            unsigned long events, int subclass)
        {
               wake_up_poll(wqueue, events);
        }
        #endif
      
      You change the function of ep_wake_up_nested() depending on whether
      CONFIG_DEBUG_LOCK_ALLOC is set or not.  This looks awfully suspicious,
      and there's no comment to explain why.  I initially thought that this
      was trying to fool lockdep, and hiding a real bug.
      
      Investigating it, I found the creation of wake_up_nested() (which no
      longer exists) but was created for the sole purpose of epoll and its
      strange wake ups, as explained in commit 0ccf831c ("lockdep:
      annotate epoll")
      
      Although the commit message says "annotate epoll" the change log is much
      better at explaining what is happening than what is in the actual code.
      Thus a comment is really necessary here.  And to save the time of other
      developers from having to go trudging through the git logs trying to
      figure out why this code exists.
      
      I took parts of the change log and placed it into a comment above the
      affected code.  This will make the description of what is happening more
      visible to new developers that have to look at this code for the first
      time.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02edc6fc
  16. 19 3月, 2012 1 次提交
  17. 25 2月, 2012 2 次提交
    • O
      epoll: ep_unregister_pollwait() can use the freed pwq->whead · 971316f0
      Oleg Nesterov 提交于
      signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
      this is not enough. eppoll_entry->whead still points to the memory
      we are going to free, ep_unregister_pollwait()->remove_wait_queue()
      is obviously unsafe.
      
      Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
      change ep_unregister_pollwait() to check pwq->whead != NULL under
      rcu_read_lock() before remove_wait_queue(). We add the new helper,
      ep_remove_wait_queue(), for this.
      
      This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
      ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
      ep_unregister_pollwait()->remove_wait_queue() can play with already
      freed and potentially reused ->sighand, but this is fine. This memory
      must have the valid ->signalfd_wqh until rcu_read_unlock().
      Reported-by: NMaxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      971316f0
    • O
      epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e
      Oleg Nesterov 提交于
      This patch is intentionally incomplete to simplify the review.
      It ignores ep_unregister_pollwait() which plays with the same wqh.
      See the next change.
      
      epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
      f_op->poll() needs. In particular it assumes that the wait queue
      can't go away until eventpoll_release(). This is not true in case
      of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
      which is not connected to the file.
      
      This patch adds the special event, POLLFREE, currently only for
      epoll. It expects that init_poll_funcptr()'ed hook should do the
      necessary cleanup. Perhaps it should be defined as EPOLLFREE in
      eventpoll.
      
      __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
      ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
      helper.
      
      ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
      This make this poll entry inconsistent, but we don't care. If you
      share epoll fd which contains our sigfd with another process you
      should blame yourself. signalfd is "really special". I simply do
      not know how we can define the "right" semantics if it used with
      epoll.
      
      The main problem is, epoll calls signalfd_poll() once to establish
      the connection with the wait queue, after that signalfd_poll(NULL)
      returns the different/inconsistent results depending on who does
      EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
      has nothing to do with the file, it works with the current thread.
      
      In short: this patch is the hack which tries to fix the symptoms.
      It also assumes that nobody can take tasklist_lock under epoll
      locks, this seems to be true.
      
      Note:
      
      	- we do not have wake_up_all_poll() but wake_up_poll()
      	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
      
      	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
      	  we need a couple of simple changes in eventpoll.c to
      	  make sure it can't be "lost".
      Reported-by: NMaxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d80e731e
  18. 13 1月, 2012 1 次提交
    • J
      epoll: limit paths · 28d82dc1
      Jason Baron 提交于
      The current epoll code can be tickled to run basically indefinitely in
      both loop detection path check (on ep_insert()), and in the wakeup paths.
      The programs that tickle this behavior set up deeply linked networks of
      epoll file descriptors that cause the epoll algorithms to traverse them
      indefinitely.  A couple of these sample programs have been previously
      posted in this thread: https://lkml.org/lkml/2011/2/25/297.
      
      To fix the loop detection path check algorithms, I simply keep track of
      the epoll nodes that have been already visited.  Thus, the loop detection
      becomes proportional to the number of epoll file descriptor and links.
      This dramatically decreases the run-time of the loop check algorithm.  In
      one diabolical case I tried it reduced the run-time from 15 mintues (all
      in kernel time) to .3 seconds.
      
      Fixing the wakeup paths could be done at wakeup time in a similar manner
      by keeping track of nodes that have already been visited, but the
      complexity is harder, since there can be multiple wakeups on different
      cpus...Thus, I've opted to limit the number of possible wakeup paths when
      the paths are created.
      
      This is accomplished, by noting that the end file descriptor points that
      are found during the loop detection pass (from the newly added link), are
      actually the sources for wakeup events.  I keep a list of these file
      descriptors and limit the number and length of these paths that emanate
      from these 'source file descriptors'.  In the current implemetation I
      allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
      length 4 and 10 of length 5.  Note that it is sufficient to check the
      'source file descriptors' reachable from the newly added link, since no
      other 'source file descriptors' will have newly added links.  This allows
      us to check only the wakeup paths that may have gotten too long, and not
      re-check all possible wakeup paths on the system.
      
      In terms of the path limit selection, I think its first worth noting that
      the most common case for epoll, is probably the model where you have 1
      epoll file descriptor that is monitoring n number of 'source file
      descriptors'.  In this case, each 'source file descriptor' has a 1 path of
      length 1.  Thus, I believe that the limits I'm proposing are quite
      reasonable and in fact may be too generous.  Thus, I'm hoping that the
      proposed limits will not prevent any workloads that currently work to
      fail.
      
      In terms of locking, I have extended the use of the 'epmutex' to all
      epoll_ctl add and remove operations.  Currently its only used in a subset
      of the add paths.  I need to hold the epmutex, so that we can correctly
      traverse a coherent graph, to check the number of paths.  I believe that
      this additional locking is probably ok, since its in the setup/teardown
      paths, and doesn't affect the running paths, but it certainly is going to
      add some extra overhead.  Also, worth noting is that the epmuex was
      recently added to the ep_ctl add operations in the initial path loop
      detection code using the argument that it was not on a critical path.
      
      Another thing to note here, is the length of epoll chains that is allowed.
      Currently, eventpoll.c defines:
      
      /* Maximum number of nesting allowed inside epoll sets */
      #define EP_MAX_NESTS 4
      
      This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
      + 1).  However, this limit is currently only enforced during the loop
      check detection code, and only when the epoll file descriptors are added
      in a certain order.  Thus, this limit is currently easily bypassed.  The
      newly added check for wakeup paths, stricly limits the wakeup paths to a
      length of 5, regardless of the order in which ep's are linked together.
      Thus, a side-effect of the new code is a more consistent enforcement of
      the graph depth.
      
      Thus far, I've tested this, using the sample programs previously
      mentioned, which now either return quickly or return -EINVAL.  I've also
      testing using the piptest.c epoll tester, which showed no difference in
      performance.  I've also created a number of different epoll networks and
      tested that they behave as expectded.
      
      I believe this solves the original diabolical test cases, while still
      preserving the sane epoll nesting.
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Cc: Nelson Elhage <nelhage@ksplice.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28d82dc1
  19. 01 11月, 2011 1 次提交
    • N
      epoll: fix spurious lockdep warnings · d8805e63
      Nelson Elhage 提交于
      epoll can acquire recursively acquire ep->mtx on multiple "struct
      eventpoll"s at once in the case where one epoll fd is monitoring another
      epoll fd.  This is perfectly OK, since we're careful about the lock
      ordering, but it causes spurious lockdep warnings.  Annotate the recursion
      using mutex_lock_nested, and add a comment explaining the nesting rules
      for good measure.
      
      Recent versions of systemd are triggering this, and it can also be
      demonstrated with the following trivial test program:
      
      --------------------8<--------------------
      
      int main(void) {
         int e1, e2;
         struct epoll_event evt = {
             .events = EPOLLIN
         };
      
         e1 = epoll_create1(0);
         e2 = epoll_create1(0);
         epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
         return 0;
      }
      --------------------8<--------------------
      Reported-by: NPaul Bolle <pebolle@tiscali.nl>
      Tested-by: NPaul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NNelson Elhage <nelhage@nelhage.com>
      Acked-by: NJason Baron <jbaron@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8805e63
  20. 27 7月, 2011 1 次提交
  21. 26 7月, 2011 1 次提交
  22. 31 3月, 2011 1 次提交
  23. 23 3月, 2011 2 次提交
  24. 26 2月, 2011 1 次提交
    • D
      epoll: prevent creating circular epoll structures · 22bacca4
      Davide Libenzi 提交于
      In several places, an epoll fd can call another file's ->f_op->poll()
      method with ep->mtx held.  This is in general unsafe, because that other
      file could itself be an epoll fd that contains the original epoll fd.
      
      The code defends against this possibility in its own ->poll() method using
      ep_call_nested, but there are several other unsafe calls to ->poll
      elsewhere that can be made to deadlock.  For example, the following simple
      program causes the call in ep_insert recursively call the original fd's
      ->poll, leading to deadlock:
      
       #include <unistd.h>
       #include <sys/epoll.h>
      
       int main(void) {
           int e1, e2, p[2];
           struct epoll_event evt = {
               .events = EPOLLIN
           };
      
           e1 = epoll_create(1);
           e2 = epoll_create(2);
           pipe(p);
      
           epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
           epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
           write(p[1], p, sizeof p);
           epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
      
           return 0;
       }
      
      On insertion, check whether the inserted file is itself a struct epoll,
      and if so, do a recursive walk to detect whether inserting this file would
      create a loop of epoll structures, which could lead to deadlock.
      
      [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NNelson Elhage <nelhage@ksplice.com>
      Reported-by: NNelson Elhage <nelhage@ksplice.com>
      Tested-by: NNelson Elhage <nelhage@ksplice.com>
      Cc: <stable@kernel.org>		[2.6.34+, possibly earlier]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22bacca4
  25. 18 2月, 2011 1 次提交
  26. 03 2月, 2011 1 次提交
  27. 14 1月, 2011 1 次提交
  28. 28 10月, 2010 1 次提交
  29. 15 10月, 2010 1 次提交
    • A
      llseek: automatically add .llseek fop · 6038f373
      Arnd Bergmann 提交于
      All file_operations should get a .llseek operation so we can make
      nonseekable_open the default for future file operations without a
      .llseek pointer.
      
      The three cases that we can automatically detect are no_llseek, seq_lseek
      and default_llseek. For cases where we can we can automatically prove that
      the file offset is always ignored, we use noop_llseek, which maintains
      the current behavior of not returning an error from a seek.
      
      New drivers should normally not use noop_llseek but instead use no_llseek
      and call nonseekable_open at open time.  Existing drivers can be converted
      to do the same when the maintainer knows for certain that no user code
      relies on calling seek on the device file.
      
      The generated code is often incorrectly indented and right now contains
      comments that clarify for each added line why a specific variant was
      chosen. In the version that gets submitted upstream, the comments will
      be gone and I will manually fix the indentation, because there does not
      seem to be a way to do that using coccinelle.
      
      Some amount of new code is currently sitting in linux-next that should get
      the same modifications, which I will do at the end of the merge window.
      
      Many thanks to Julia Lawall for helping me learn to write a semantic
      patch that does all this.
      
      ===== begin semantic patch =====
      // This adds an llseek= method to all file operations,
      // as a preparation for making no_llseek the default.
      //
      // The rules are
      // - use no_llseek explicitly if we do nonseekable_open
      // - use seq_lseek for sequential files
      // - use default_llseek if we know we access f_pos
      // - use noop_llseek if we know we don't access f_pos,
      //   but we still want to allow users to call lseek
      //
      @ open1 exists @
      identifier nested_open;
      @@
      nested_open(...)
      {
      <+...
      nonseekable_open(...)
      ...+>
      }
      
      @ open exists@
      identifier open_f;
      identifier i, f;
      identifier open1.nested_open;
      @@
      int open_f(struct inode *i, struct file *f)
      {
      <+...
      (
      nonseekable_open(...)
      |
      nested_open(...)
      )
      ...+>
      }
      
      @ read disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      <+...
      (
         *off = E
      |
         *off += E
      |
         func(..., off, ...)
      |
         E = *off
      )
      ...+>
      }
      
      @ read_no_fpos disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ write @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      <+...
      (
        *off = E
      |
        *off += E
      |
        func(..., off, ...)
      |
        E = *off
      )
      ...+>
      }
      
      @ write_no_fpos @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ fops0 @
      identifier fops;
      @@
      struct file_operations fops = {
       ...
      };
      
      @ has_llseek depends on fops0 @
      identifier fops0.fops;
      identifier llseek_f;
      @@
      struct file_operations fops = {
      ...
       .llseek = llseek_f,
      ...
      };
      
      @ has_read depends on fops0 @
      identifier fops0.fops;
      identifier read_f;
      @@
      struct file_operations fops = {
      ...
       .read = read_f,
      ...
      };
      
      @ has_write depends on fops0 @
      identifier fops0.fops;
      identifier write_f;
      @@
      struct file_operations fops = {
      ...
       .write = write_f,
      ...
      };
      
      @ has_open depends on fops0 @
      identifier fops0.fops;
      identifier open_f;
      @@
      struct file_operations fops = {
      ...
       .open = open_f,
      ...
      };
      
      // use no_llseek if we call nonseekable_open
      ////////////////////////////////////////////
      @ nonseekable1 depends on !has_llseek && has_open @
      identifier fops0.fops;
      identifier nso ~= "nonseekable_open";
      @@
      struct file_operations fops = {
      ...  .open = nso, ...
      +.llseek = no_llseek, /* nonseekable */
      };
      
      @ nonseekable2 depends on !has_llseek @
      identifier fops0.fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...  .open = open_f, ...
      +.llseek = no_llseek, /* open uses nonseekable */
      };
      
      // use seq_lseek for sequential files
      /////////////////////////////////////
      @ seq depends on !has_llseek @
      identifier fops0.fops;
      identifier sr ~= "seq_read";
      @@
      struct file_operations fops = {
      ...  .read = sr, ...
      +.llseek = seq_lseek, /* we have seq_read */
      };
      
      // use default_llseek if there is a readdir
      ///////////////////////////////////////////
      @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier readdir_e;
      @@
      // any other fop is used that changes pos
      struct file_operations fops = {
      ... .readdir = readdir_e, ...
      +.llseek = default_llseek, /* readdir is present */
      };
      
      // use default_llseek if at least one of read/write touches f_pos
      /////////////////////////////////////////////////////////////////
      @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read.read_f;
      @@
      // read fops use offset
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = default_llseek, /* read accesses f_pos */
      };
      
      @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ... .write = write_f, ...
      +	.llseek = default_llseek, /* write accesses f_pos */
      };
      
      // Use noop_llseek if neither read nor write accesses f_pos
      ///////////////////////////////////////////////////////////
      
      @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      identifier write_no_fpos.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ...
       .write = write_f,
       .read = read_f,
      ...
      +.llseek = noop_llseek, /* read and write both use no f_pos */
      };
      
      @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write_no_fpos.write_f;
      @@
      struct file_operations fops = {
      ... .write = write_f, ...
      +.llseek = noop_llseek, /* write uses no f_pos */
      };
      
      @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      @@
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = noop_llseek, /* read uses no f_pos */
      };
      
      @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      @@
      struct file_operations fops = {
      ...
      +.llseek = noop_llseek, /* no read or write fn */
      };
      ===== End semantic patch =====
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      6038f373
  30. 11 5月, 2010 1 次提交
    • C
      sched, wait: Use wrapper functions · a93d2f17
      Changli Gao 提交于
      epoll should not touch flags in wait_queue_t. This patch introduces a new
      function __add_wait_queue_exclusive(), for the users, who use wait queue as a
      LIFO queue.
      
      __add_wait_queue_tail_exclusive() is introduced too instead of
      add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
      it is a duplicate of __remove_wait_queue(), disliked by users, and with less
      users.
      Signed-off-by: NChangli Gao <xiaosuo@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: <containers@lists.linux-foundation.org>
      LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a93d2f17
  31. 23 12月, 2009 1 次提交