1. 13 July 2017, 2 commits
    • kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files · 0791e364
      Authored by Cyrill Gorcunov
      With the current epoll architecture, target files are addressed by
      file_struct and file descriptor number, where the latter is not
      unique.  Moreover, files can be transferred from another process via a
      unix socket, added into the queue and then closed, so we won't find
      this descriptor in the task's fdinfo list.
      
      Thus to checkpoint and restore such processes CRIU needs to find out
      where exactly the target file is present, in order to add it into the
      epoll queue.  For this one can use a kcmp call, where some particular
      target file from the queue is compared with an arbitrary file passed
      as an argument.
      
      Because epoll target files can have the same file descriptor number
      but a different file_struct, a caller should explicitly specify the
      offset within the queue.
      
      To test whether some particular file matches an entry inside an epoll
      instance, one has to (see the sketch after this list):

       - fill a kcmp_epoll_slot structure with the epoll file descriptor,
         the target file number and the target file offset (if only one
         target is present, it should be 0)

       - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
          - the kernel fetches the file pointer matching file descriptor @fd
            of pid1
          - looks up the file struct in the epoll queue of pid2 and returns
            the traditional 0, 1, 2 result for sorting purposes
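      Below is a minimal userspace sketch of that sequence, assuming a
      kernel carrying this patch (KCMP_EPOLL_TFD and struct kcmp_epoll_slot
      come from linux/kcmp.h); the fd numbers here are purely hypothetical:

      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <linux/kcmp.h>

      	int main(void)
      	{
      		struct kcmp_epoll_slot slot = {
      			.efd  = 4,	/* hypothetical epoll fd in pid2 */
      			.tfd  = 5,	/* target fd number inside that epoll */
      			.toff = 0,	/* only one target with this number */
      		};
      		pid_t pid1 = getpid(), pid2 = getpid();
      		int fd = 5;		/* hypothetical fd in pid1 to compare */

      		/* 0 => same file; 1/2 => ordering result; < 0 => error */
      		long ret = syscall(SYS_kcmp, pid1, pid2, KCMP_EPOLL_TFD,
      				   (unsigned long)fd, (unsigned long)&slot);
      		printf("kcmp(KCMP_EPOLL_TFD) = %ld\n", ret);
      		return 0;
      	}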
      
      Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: fdinfo: extend information about epoll target files · 77493f04
      Authored by Cyrill Gorcunov
      Since it is possible to have the same number in the tfd field (say a
      file is added, closed, then another file is dup'ed to the same number
      and added back) it is impossible to distinguish such target files
      solely by their numbers.
      
      Strictly speaking, regular applications don't need to recognize these
      targets at all, but for checkpoint/restore's sake we need to collect
      targets to be able to push them back on the restore stage in the
      proper order.
      
      Thus let's add the file position, inode and device number where this
      target lies.  These three fields can be used as a primary key for
      sorting, and together with kcmp's help CRIU can find out the exact
      file target (from the whole set of processes being checkpointed).
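      For illustration, a tfd line extended this way might look as follows
      (field layout as described above; the values are hypothetical):

       | tfd:        5 events:       1d data: ffffffffffffffff pos:0 ino:61af sdev:7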
      
      Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 11 July 2017, 1 commit
    • fs, epoll: short circuit fetching events if thread has been killed · c257a340
      Authored by David Rientjes
      We've encountered zombies that are waiting for a thread to exit that are
      looping in ep_poll() almost endlessly although there is a pending
      SIGKILL as a result of a group exit.
      
      This happens because we always find ep_events_available() and fetch more
      events and never are able to check for signal_pending() that would break
      from the loop and return -EINTR.
      
      Special case fatal signals and break immediately to guarantee that we
      do not keep looping to fetch more events and delay a timely exit.
      
      It would also be possible to simply move the check for signal_pending()
      higher than checking for ep_events_available(), but there have been no
      reports of delayed signal handling other than SIGKILL preventing zombies
      from exiting that would be fixed by this.
      
      It fixes an issue for us where we have witnessed zombies sticking around
      for at least O(minutes), but considering the code has been like this
      forever and nobody else has complained that I have found, I would simply
      queue it up for 4.12.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 20 June 2017, 2 commits
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Authored by Ingo Molnar
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... it was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Authored by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 25 March 2017, 1 commit
    • epoll: Add busy poll support to epoll with socket fds. · bf3b9f63
      Authored by Sridhar Samudrala
      This patch adds busy poll support to epoll. The implementation is meant to
      be opportunistic in that it will take the NAPI ID from the last socket
      that is added to the ready list that contains a valid NAPI ID and it will
      use that for busy polling until the ready list goes empty.  Once the ready
      list goes empty the NAPI ID is reset and busy polling is disabled until a
      new socket is added to the ready list.
      
      In addition when we insert a new socket into the epoll we record the NAPI
      ID and assume we are going to receive events on it.  If that doesn't occur
      it will be evicted as the active NAPI ID and we will resume normal
      behavior.
      
      An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
      options to spread the incoming connections to specific worker threads
      based on the incoming queue. This enables epoll for each worker thread
      to have only sockets that receive packets from a single queue. So when an
      application calls epoll_wait() and there are no events available to report,
      busy polling is done on the associated queue to pull the packets.
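      A hedged sketch of the userspace side of such a setup, assuming Linux
      4.4+ (where SO_INCOMING_CPU became settable) and one SO_REUSEPORT
      listener per worker; the function name and port handling are
      illustrative, not part of this patch:

      	#include <arpa/inet.h>
      	#include <netinet/in.h>
      	#include <string.h>
      	#include <sys/socket.h>

      	/* One listener per worker: SO_REUSEPORT spreads connections,
      	 * SO_INCOMING_CPU steers those arriving on the given CPU's RX
      	 * queue to this socket, so this worker's epoll sees one queue. */
      	int make_worker_listener(int cpu, unsigned short port)
      	{
      		int one = 1;
      		struct sockaddr_in addr;
      		int s = socket(AF_INET, SOCK_STREAM, 0);

      		if (s < 0)
      			return -1;
      		setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
      		setsockopt(s, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));

      		memset(&addr, 0, sizeof(addr));
      		addr.sin_family = AF_INET;
      		addr.sin_addr.s_addr = htonl(INADDR_ANY);
      		addr.sin_port = htons(port);
      		if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
      		    listen(s, 128) < 0)
      			return -1;
      		return s;	/* add to this worker's epoll instance */
      	}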
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 02 March 2017, 1 commit
  6. 28 February 2017, 1 commit
  7. 25 December 2016, 1 commit
  8. 20 May 2016, 1 commit
  9. 18 March 2016, 1 commit
    • timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      Authored by John Stultz
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently an unsigned
      long.  This means that for 32bit applications, the maximum slack is
      just over 4 seconds.  However, on 64bit machines, it's much, much
      larger (~500 years).
      
      This disparity could make application development confusing, so
      convert timer_slack_ns (as well as the default_slack) to a u64.  This
      means both 32bit and 64bit systems have the same effective internal
      slack range.
      
      Now, the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK
      specifies the interface as an unsigned long, so we preserve that
      limitation on 32bit systems: SET_TIMERSLACK can only set the slack to
      an unsigned long value, and GET_TIMERSLACK will return ULONG_MAX if
      the slack is actually larger than what can be stored in an unsigned
      long.
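      A minimal sketch of that existing prctl ABI (the slack value below is
      arbitrary, chosen only for illustration):

      	#include <stdio.h>
      	#include <sys/prctl.h>

      	int main(void)
      	{
      		/* Example value: 50 us of timer slack. */
      		prctl(PR_SET_TIMERSLACK, 50000UL, 0, 0, 0);

      		/* On 32bit this reports ULONG_MAX once the real u64
      		 * slack no longer fits in an unsigned long. */
      		long slack = prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
      		printf("timer slack: %ld ns\n", slack);
      		return 0;
      	}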
      
      This patch also modifies the hrtimer functions which specified the
      slack delta as an unsigned long.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 06 February 2016, 1 commit
    • epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT · b6a515c8
      Authored by Jason Baron
      In the current implementation of the EPOLLEXCLUSIVE flag (added for
      4.5-rc1), if epoll waiters create different POLL* sets and register them
      as exclusive against the same target fd, the current implementation will
      stop waking any further waiters once it finds the first idle waiter.
      This means that waiters could miss wakeups in certain cases.
      
      For example, when we wake up a pipe for reading we do:

      	wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM);

      So if one epoll set or epfd is added to pipe p with POLLIN and a second
      set epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
      wakeup since the current implementation will stop after it finds any
      intersection of events with a waiter that is blocked in epoll_wait().
      
      We could potentially address this by requiring all epoll waiters that
      are added to p to pass the same set of POLL* events.  I.e. the first
      EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set of POLL*
      flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
      However, I think it might be a somewhat confusing interface, as we
      would have to reference count the number of users for that set, and so
      userspace would have to keep track of that count, or we would need a
      more involved interface.  It also adds some shared state that we'd
      have to store somewhere.  I don't think anybody will want to bloat
      __wait_queue_head for this.
      
      I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
      such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
      that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
      once we hit the first idle waiter that specifies the EPOLLIN bit, since
      any remaining waiters that only have 'POLLOUT' set wouldn't need to be
      woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
      bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
      wake bit set (there is at least one example of this I saw in fs/pipe.c),
      then we just wake the entire exclusive list.  Having both 'POLLOUT' and
      'POLLIN' both set should not be on any performance critical path, so I
      think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
      to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
      the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
      so.
      
      Since epoll waiters may be interested in other events as well besides
      EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
      doing a 'dup' call on the target fd and adding that as one normally
      would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
      events are what we are interested in balancing, I think that the 'dup'
      thing could perhaps be added to only one of the waiter threads.
      However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
      sufficient for the majority of use-cases.
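      A minimal sketch of that 'dup' idea, assuming epfd already holds the
      target fd as EPOLLEXCLUSIVE and using EPOLLPRI as a stand-in for the
      extra event of interest (both the function name and the chosen event
      are illustrative):

      	#include <sys/epoll.h>
      	#include <unistd.h>

      	/* Watch an extra event via a duplicate of the target fd, added
      	 * as a plain (non-exclusive) entry in the same epoll set. */
      	int watch_extra_events(int epfd, int target_fd)
      	{
      		int dupfd = dup(target_fd);
      		struct epoll_event ev = {
      			.events  = EPOLLPRI,	/* "other" event */
      			.data.fd = dupfd,
      		};

      		if (dupfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev) < 0)
      			return -1;
      		return dupfd;
      	}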
      
      Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
      among multiple epfds, where between 1 and n of the epfds may receive
      an event, it does not satisfy the semantics of EPOLLONESHOT, where
      only one epfd would get an event.  Thus, EPOLLONESHOT is not allowed
      to be specified in conjunction with EPOLLEXCLUSIVE.
      
      EPOLL_CTL_MOD is also not allowed if the fd was previously added as
      EPOLLEXCLUSIVE.  With the limited number of allowed flags it does not
      seem as interesting to support, but this could be relaxed at some
      future point.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Madars Vitolins <m@silodev.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 21 January 2016, 1 commit
    • epoll: add EPOLLEXCLUSIVE flag · df0108c5
      Authored by Jason Baron
      Currently, epoll file descriptors or epfds (the fd returned from
      epoll_create[1]()) that are added to a shared wakeup source are always
      added in a non-exclusive manner.  This means that when we have multiple
      epfds attached to a shared fd source they are all woken up.  This creates
      thundering herd type behavior.
      
      Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
      'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation.  This new
      flag allows for exclusive wakeups when there are multiple epfds attached
      to a shared fd event source.
      
      The implementation walks the list of exclusive waiters, and queues an
      event to each epfd, until it finds the first waiter that has threads
      blocked on it via epoll_wait().  The idea is to search for threads which
      are idle and ready to process the wakeup events.  Thus, we queue an event
      to at least 1 epfd, but may still potentially queue an event to all epfds
      that are attached to the shared fd source.
      
      Performance testing was done by Madars Vitolins using a modified
      version of Enduro/X.  The use of the 'EPOLLEXCLUSIVE' flag reduced the
      run time of this particular workload from 860s down to 24s.
      
      Sample epoll_ctl text:
      
      EPOLLEXCLUSIVE
      
        Sets an exclusive wakeup mode for the epfd file descriptor that is
        being attached to the target file descriptor, fd.  Thus, when an event
        occurs and multiple epfd file descriptors are attached to the same
        target file using EPOLLEXCLUSIVE, one or more epfds will receive an
        event with epoll_wait(2).  The default in this scenario (when
        EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
        EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.
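      A hedged userspace sketch of the new flag, assuming a kernel and
      headers that define EPOLLEXCLUSIVE (4.5+); 'lsock' stands for a shared
      listening socket created elsewhere:

      	#include <stdio.h>
      	#include <sys/epoll.h>

      	/* Attach one worker's epfd to the shared listening socket
      	 * 'lsock' in exclusive wakeup mode. */
      	int attach_exclusive(int lsock)
      	{
      		int epfd = epoll_create1(0);
      		struct epoll_event ev = {
      			.events  = EPOLLIN | EPOLLEXCLUSIVE,
      			.data.fd = lsock,
      		};

      		/* EPOLLEXCLUSIVE is valid only with EPOLL_CTL_ADD. */
      		if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, lsock, &ev) < 0) {
      			perror("epoll");
      			return -1;
      		}
      		return epfd;
      	}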
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Madars Vitolins <m@silodev.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 14 February 2015, 1 commit
  13. 06 November 2014, 1 commit
  14. 11 September 2014, 1 commit
    • eventpoll: fix uninitialized variable in epoll_ctl · c680e41b
      Authored by Nicolas Iooss
      When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
      not initialized but ep_take_care_of_epollwakeup reads its event field.
      When this uninitialized field has the EPOLLWAKEUP bit set, a capability check
      is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
      produces unexpected messages in the audit log, such as (on a system
      running SELinux):
      
          type=AVC msg=audit(1408212798.866:410): avc:  denied
          { block_suspend } for  pid=7754 comm="dbus-daemon" capability=36
          scontext=unconfined_u:unconfined_r:unconfined_t
          tcontext=unconfined_u:unconfined_r:unconfined_t
          tclass=capability2 permissive=1
      
          type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
          success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
          pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
          fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
          exe="/usr/bin/dbus-daemon"
          subj=unconfined_u:unconfined_r:unconfined_t key=(null)
      
      ("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")
      
      Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
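      For reference, a minimal sketch of the userspace side: the event
      argument is ignored for EPOLL_CTL_DEL (and may be NULL since Linux
      2.6.9), so the kernel-side fix simply avoids reading its own
      uninitialized copy in this path (the helper name is illustrative):

      	#include <sys/epoll.h>

      	static inline int del_fd(int epfd, int fd)
      	{
      		/* event is ignored for EPOLL_CTL_DEL */
      		return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
      	}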
      
      Fixes: 4d7e30d9 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 17 June 2014, 1 commit
  16. 07 June 2014, 1 commit
  17. 03 January 2014, 1 commit
    • epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 4ff36ee9
      Authored by Jason Baron
      The EPOLL_CTL_DEL path of epoll contains a classic ABBA deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x) will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquisition of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state
      of the ep (specifically, the check for '!list_empty(&f.file->f_ep_links)').
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full loop check
      that, had we waited a little longer, might not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's
      that it is operating upon.  The reason we don't need to do lock
      ordering in the add path is that we are already holding the global
      'epmutex' whenever we do the double lock.  Further, the original
      posting of this patch, which was tested for the intended performance
      gains, did not perform this additional locking.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 03 December 2013, 1 commit
  19. 13 November 2013, 2 commits
    • epoll: do not take global 'epmutex' for simple topologies · 67347fe4
      Authored by Jason Baron
      When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
      directly to a wakeup source, we do not need to take the global 'epmutex',
      unless the epoll file descriptor is nested.  The purpose of taking the
      'epmutex' on add is to prevent complex topologies such as loops and deep
      wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
      operations.  However, for the simple case of an epoll file descriptor
      attached directly to a wakeup source (with no nesting), we do not need to
      hold the 'epmutex'.
      
      This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
      scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
      performance:
      
      "On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
      addition the benchmark went from scaling well on 10 sockets to scaling
      well on just over 40 sockets.
      
      ...
      
      Currently the benchmark stops scaling at around 40-44 sockets but it seems like
      I found a second unrelated bottleneck."
      
      [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: optimize EPOLL_CTL_DEL using rcu · ae10b2b4
      Authored by Jason Baron
      Nathan Zimmer found that once we get over 10+ cpus, the scalability of
      SPECjbb falls over due to the contention on the global 'epmutex',
      which is taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.
      
      Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
      by using rcu to guard against any concurrent traversals.
      
      Patch #2 removes the 'epmutex' lock from EPOLL_CTL_ADD operations for
      simple topologies, i.e. when adding a link from an epoll file
      descriptor to a wakeup source, where the epoll file descriptor is not
      nested.
      
      This patch (of 2):
      
      Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
      converting the file->f_ep_links list into an rcu one.  In this way, we can
      traverse the epoll network on the add path in parallel with deletes.
      Since deletes can't create loops or worse wakeup paths, this is safe.
      
      This patch in combination with the patch "epoll: Do not take global 'epmutex'
      for simple topologies", shows a dramatic performance improvement in
      scalability for SPECjbb.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 30 October 2013, 1 commit
  21. 25 October 2013, 1 commit
  22. 12 September 2013, 1 commit
  23. 04 September 2013, 1 commit
  24. 04 July 2013, 1 commit
  25. 12 May 2013, 1 commit
    • epoll: use freezable blocking call · 1c441e92
      Authored by Colin Cross
      Avoid waking up every thread sleeping in an epoll_wait call during
      suspend and resume by calling a freezable blocking call.  Previous
      patches modified the freezer to avoid sending wakeups to threads
      that are blocked in freezable blocking calls.
      
      This call was selected to be converted to a freezable call because
      it doesn't hold any locks or release any resources when interrupted
      that might be needed by another freezing task or a kernel driver
      during suspend, and is a common site where idle userspace tasks are
      blocked.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Colin Cross <ccross@android.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  26. 01 May 2013, 5 commits
  27. 04 March 2013, 1 commit
  28. 03 January 2013, 1 commit
    • epoll: prevent missed events on EPOLL_CTL_MOD · 128dd175
      Authored by Eric Wong
      EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
      ensure events are not missed.  Since the modifications to the interest
      mask are not protected by the same lock as ep_poll_callback, we need to
      ensure the change is visible to other CPUs calling ep_poll_callback.
      
      We also need to ensure f_op->poll() has an up-to-date view of past
      events which occurred before we modified the interest mask.  So this
      barrier also pairs with the barrier in wq_has_sleeper().
      
      This should guarantee either ep_poll_callback or f_op->poll() (or both)
      will notice the readiness of a recently-ready/modified item.
      
      This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
      http://thread.gmane.org/gmane.linux.kernel/1408782/
      Signed-off-by: Eric Wong <normalperson@yhbt.net>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Hans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
      Tested-by: N"Junchang(Jason) Wang" <junchang.wang@yale.edu>
      Cc: netdev@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 18 December 2012, 1 commit
    • fs, epoll: add procfs fdinfo helper · 138d22b5
      Authored by Cyrill Gorcunov
      This allows us to print out the eventpoll target file descriptor,
      events and data; the /proc/pid/fdinfo/fd output consists of
      
       | pos:	0
       | flags:	02
       | tfd:        5 events:       1d data: ffffffffffffffff enabled: 1
      
      [avagin@: fix for uninitialized ret variable]
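      A small sketch for inspecting this output at runtime; it reads
      /proc/self/fdinfo for a freshly created epoll fd (fields as shown
      above, though here the epoll set is empty so no tfd lines appear):

      	#include <stdio.h>
      	#include <sys/epoll.h>

      	int main(void)
      	{
      		char path[64], line[256];
      		int epfd = epoll_create1(0);
      		FILE *f;

      		snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", epfd);
      		f = fopen(path, "r");
      		if (!f)
      			return 1;
      		while (fgets(line, sizeof(line), f))
      			fputs(line, stdout);	/* pos:, flags:, tfd: lines */
      		fclose(f);
      		return 0;
      	}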
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Helsley <matt.helsley@gmail.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  30. 09 November 2012, 1 commit
    • revert "epoll: support for disabling items, and a self-test app" · a80a6b85
      Authored by Andrew Morton
      Revert commit 03a7beb5 ("epoll: support for disabling items, and a
      self-test app") pending resolution of the issues identified by Michael
      Kerrisk, copied below.
      
      We'll revisit this for 3.8.
      
      : I've taken a look at this patch as it currently stands in 3.7-rc1, and
      : done a bit of testing. (By the way, the test program
      : tools/testing/selftests/epoll/test_epoll.c does not compile...)
      :
      : There are one or two places where the behavior seems a little strange,
      : so I have a question or two at the end of this mail. But other than
      : that, I want to check my understanding so that the interface can be
      : correctly documented.
      :
      : Just to go though my understanding, the problem is the following
      : scenario in a multithreaded application:
      :
      : 1. Multiple threads are performing epoll_wait() operations,
      :    and maintaining a user-space cache that contains information
      :    corresponding to each file descriptor being monitored by
      :    epoll_wait().
      :
      : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
      :    a file descriptor from the epoll interest list, and
      :    delete the corresponding record from the user-space cache.
      :
      : 3. The problem with (2) is that some other thread may have
      :    previously done an epoll_wait() that retrieved information
      :    about the fd in question, and may be in the middle of using
      :    information in the cache that relates to that fd. Thus,
      :    there is a potential race.
      :
      : 4. The race can't solved purely in user space, because doing
      :    so would require applying a mutex across the epoll_wait()
      :    call, which would of course blow thread concurrency.
      :
      : Right?
      :
      : Your solution is the EPOLL_CTL_DISABLE operation. I want to
      : confirm my understanding about how to use this flag, since
      : the description that has accompanied the patches so far
      : has been a bit sparse
      :
      : 0. In the scenario you're concerned about, deleting a file
      :    descriptor means (safely) doing the following:
      :    (a) Deleting the file descriptor from the epoll interest list
      :        using EPOLL_CTL_DEL
      :    (b) Deleting the corresponding record in the user-space cache
      :
      : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
      :    conjunction with EPOLLONESHOT.
      :
      : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
      :    conjunction is a logical error.
      :
      : 3. The correct way to code multithreaded applications using
      :    EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
      :
      :    a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
      :       specify EPOLLONESHOT.
      :
      :    b. When a thread wants to delete a file descriptor, it
      :       should do the following:
      :
      :       [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
      :       [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
      :           was zero, then the file descriptor can be safely
      :           deleted by the thread that made this call.
      :       [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
      :           then the descriptor is in use. In this case, the calling
      :           thread should set a flag in the user-space cache to
      :           indicate that the thread that is using the descriptor
      :           should perform the deletion operation.
      :
      : Is all of the above correct?
      :
      : The implementation depends on checking whether
      : (events & ~EP_PRIVATE_BITS) == 0
      : This relies on the fact that EPOLL_CTL_ADD and EPOLL_CTL_MOD always
      : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
      : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
      : cleared.
      :
      : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
      : is only useful in conjunction with EPOLLONESHOT. However, as things
      : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
      : not have EPOLLONESHOT set in 'events'. This results in the following
      : (slightly surprising) behavior:
      :
      : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
      :     (the indicator that the file descriptor can be safely deleted).
      : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
      :
      : This doesn't seem particularly useful, and in fact is probably an
      : indication that the user made a logic error: they should only be using
      : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
      : EPOLLONESHOT was set in 'events'. If that is correct, then would it
      : not make sense to return an error to user space for this case?
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paton J. Lewis" <palewis@adobe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  31. 06 October 2012, 1 commit
  32. 27 September 2012, 2 commits