  1. 25 Sep 2020, 1 commit
  2. 11 Sep 2020, 1 commit
    • epoll: EPOLL_CTL_ADD: close the race in decision to take fast path · fe0a916c
      Authored by Al Viro
      Checking for the lack of epitems referring to the epoll we want to insert into
      is not enough; we might have an insertion of that epoll into another one that
      has already collected the set of files to recheck for excessive reverse paths,
      but hasn't gotten to creating/inserting the epitem for it.
      
      However, any such insertion in progress can be detected - it will update the
      generation count in our epoll when it's done looking through it for files
      to check.  That gets done under ->mtx of our epoll and that allows us to
      detect that safely.
      
      We are *not* holding epmutex here, so the generation count is not stable.
      However, since both the update of ep->gen by loop check and (later)
      insertion into ->f_ep_link are done with ep->mtx held, we are fine -
      the sequence is
      	grab epmutex
      	bump loop_check_gen
      	...
      	grab tep->mtx		// 1
      	tep->gen = loop_check_gen
      	...
      	drop tep->mtx		// 2
      	...
      	grab tep->mtx		// 3
      	...
      	insert into ->f_ep_link
      	...
      	drop tep->mtx		// 4
      	bump loop_check_gen
      	drop epmutex
      and if the fastpath check in another thread happens for that
      eventpoll, it can come
      	* before (1) - in that case fastpath is just fine
      	* after (4) - we'll see non-empty ->f_ep_link, slow path
      taken
      	* between (2) and (3) - loop_check_gen is stable,
      with ->mtx providing barriers and we end up taking slow path.
      
      Note that the ->f_ep_link emptiness check is slightly racy - we are protected
      against insertions into that list, but removals can happen right under us.
      Not a problem - in the worst case we'll end up taking the slow path for
      no good reason.
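
      For illustration, a minimal user-space model of the fast-path decision
      (hypothetical names and types; the real check lives in fs/eventpoll.c and
      runs under ep->mtx):

      	/* Simplified sketch, not the kernel code. */
      	#include <stdbool.h>
      	#include <pthread.h>

      	static unsigned long loop_check_gen;     /* bumped under epmutex by the loop check */

      	struct eventpoll {
      		pthread_mutex_t mtx;             /* models ep->mtx */
      		unsigned long   gen;             /* models ep->gen */
      		bool            has_ep_links;    /* models !list_empty(&file->f_ep_links) */
      	};

      	/* True when the cheap EPOLL_CTL_ADD path is safe to take. */
      	static bool fast_path_ok(struct eventpoll *ep)
      	{
      		bool ok;

      		pthread_mutex_lock(&ep->mtx);
      		/*
      		 * Slow path if somebody already links to this epoll, or if a
      		 * loop check has visited it in the current pass (an insertion
      		 * is in flight and ep->gen was set under ep->mtx).
      		 */
      		ok = !ep->has_ep_links && ep->gen != loop_check_gen;
      		pthread_mutex_unlock(&ep->mtx);
      		return ok;
      	}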
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 10 Sep 2020, 2 commits
  4. 02 Sep 2020, 1 commit
  5. 23 Aug 2020, 2 commits
  6. 15 May 2020, 1 commit
  7. 08 May 2020, 2 commits
    • epoll: atomically remove wait entry on wake up · 412895f0
      Authored by Roman Penyaev
      This patch does two things:
      
       - fixes a lost wakeup introduced by commit 339ddb53 ("fs/epoll:
         remove unnecessary wakeups of nested epoll")
      
       - improves performance for events delivery.
      
      The description of the problem is the following: if N (>1) threads are
      waiting on ep->wq for new events and M (>1) events come, it is quite
      likely that more than one wakeup hits the same wait queue entry, because
      there is quite a big window between the __add_wait_queue_exclusive() and
      the following __remove_wait_queue() calls in the ep_poll() function.

      This can lead to lost wakeups, because the thread that was woken up may
      not handle all the events in ->rdllist.  (The problem is described in
      more detail here: https://lkml.org/lkml/2019/10/7/905)
      
      The idea of the current patch is to use init_wait() instead of
      init_waitqueue_entry().
      
      Internally, init_wait() sets autoremove_wake_function() as the callback,
      which removes the wait entry from the list atomically (under the
      waitqueue lock), so the next incoming wakeup hits the next wait entry in
      the wait queue, preventing lost wakeups.
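
      A hedged sketch of the waiting pattern this moves to (simplified; not the
      verbatim ep_poll() code, and the locking and timeout bookkeeping are
      elided):

      	wait_queue_entry_t wait;

      	init_wait(&wait);       /* installs autoremove_wake_function() */

      	__add_wait_queue_exclusive(&ep->wq, &wait);

      	timed_out = !schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS);

      	/*
      	 * If we were woken, autoremove_wake_function() already unlinked the
      	 * entry under the waitqueue lock, so the next wake-up picks the next
      	 * exclusive waiter; only the timeout path must remove it explicitly.
      	 */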
      
      The problem is reproduced very well by the epoll60 test case [1].
      
      Removing the wait entry on wakeup also has performance benefits, because
      there is no need to take ep->lock and remove the wait entry from the
      queue after a successful wakeup.  Here is the timing output of the
      epoll60 test case:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior to 339ddb53):
      
          real    0m6.970s
          user    0m49.786s
          sys     0m0.113s
      
       After this patch:
      
         real    0m5.220s
         user    0m36.879s
         sys     0m0.019s
      
      The other testcase is the stress-epoll [2], where one thread consumes
      all the events and other threads produce many events:
      
        With explicit wakeup from ep_scan_ready_list() (the state of the
        code prior to 339ddb53):
      
          threads  events/ms  run-time ms
                8       5427         1474
               16       6163         2596
               32       6824         4689
               64       7060         9064
              128       6991        18309
      
       After this patch:
      
          threads  events/ms  run-time ms
                8       5598         1429
               16       7073         2262
               32       7502         4265
               64       7640         8376
              128       7634        16767
      
       (number of "events/ms" represents event bandwidth, thus higher is
        better; number of "run-time ms" represents overall time spent
        doing the benchmark, thus lower is better)
      
      [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
      [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Baron <jbaron@akamai.com>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Heiher <r@hev.cc>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • eventpoll: fix missing wakeup for ovflist in ep_poll_callback · 0c54a6a4
      Authored by Khazhismel Kumykov
      Before commit 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested
      epoll"), whenever we added to ovflist the wakeup came from
      ep_scan_ready_list(), and ep_poll_callback() did no wakeup of its own.

      With that wakeup removed, if we add to ovflist here, we may never wake
      up.  Rather than adding back the ep_scan_ready_list() wakeup (which was
      resulting in unnecessary wakeups), trigger a wake-up in ep_poll_callback().
      
      We noticed that one of our workloads was missing wakeups starting with
      339ddb53, and upon manual inspection this wakeup seemed to be missing.
      With this patch applied, we no longer see missing wakeups.  I haven't yet
      tried to make a small reproducer, but the existing kselftests in
      filesystems/epoll passed for me with this patch.
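
      As a hedged sketch (simplified; chain_epi_lockless() and
      list_add_tail_lockless() are the lockless helpers from the rwlock
      series), the reworked tail of ep_poll_callback() keeps the wake-up below
      both branches, so an ovflist insertion is no longer silent:

      	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
      		/* ep_scan_ready_list() is running: park the item on ovflist. */
      		if (chain_epi_lockless(epi))
      			ep_pm_stay_awake_rcu(epi);
      	} else if (!ep_is_linked(epi)) {
      		/* Usual case: queue the item on the ready list. */
      		if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
      			ep_pm_stay_awake_rcu(epi);
      	}

      	/* Wake waiters in either case, so the ovflist path is never lost. */
      	if (waitqueue_active(&ep->wq))
      		wake_up(&ep->wq);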
      
      [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
        Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
      Fixes: 339ddb53 ("fs/epoll: remove unnecessary wakeups of nested epoll")
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Penyaev <rpenyaev@suse.de>
      Cc: Heiher <r@hev.cc>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 08 Apr 2020, 1 commit
  9. 22 Mar 2020, 1 commit
  10. 30 Jan 2020, 2 commits
  11. 05 Dec 2019, 2 commits
  12. 21 Aug 2019, 1 commit
  13. 19 Jul 2019, 1 commit
  14. 17 Jul 2019, 1 commit
  15. 29 Jun 2019, 1 commit
    • signal: remove the wrong signal_pending() check in restore_user_sigmask() · 97abc889
      Authored by Oleg Nesterov
      This is the minimal fix for stable; I'll send cleanups later.
      
      Commit 854a6ed5 ("signal: Add restore_user_sigmask()") introduced a
      visible change which breaks user space: a signal temporarily unblocked by
      set_user_sigmask() can be delivered even if the caller returns success or
      a timeout.
      
      Change restore_user_sigmask() to accept an additional "interrupted"
      argument, which should be used instead of the signal_pending() check, and
      update the callers.
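
      A hedged sketch of a caller after the change (epoll_pwait-style; the
      surrounding variable names are illustrative).  Only an interrupted
      syscall lets a temporarily unblocked signal be delivered before the old
      mask is put back:

      	error = do_epoll_wait(epfd, events, maxevents, timeout);

      	/*
      	 * Restore the caller's sigmask.  The decision is driven by whether
      	 * the syscall was actually interrupted (-EINTR), not by a racy
      	 * signal_pending() check inside restore_user_sigmask().
      	 */
      	restore_user_sigmask(sigmask, &sigsaved, error == -EINTR);

      	return error;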
      
      Eric said:
      
      : For clarity.  I don't think this is required by posix, or fundamentally to
      : remove the races in select.  It is what linux has always done and we have
      : applications who care so I agree this fix is needed.
      :
      : Further, in any case where the semantic change that this patch rolls back
      : (aka where allowing a signal to be delivered and the select-like call to
      : complete) would be an advantage, we can do as well if not better by using
      : signalfd.
      :
      : Michael is there any chance we can get this guarantee of the linux
      : implementation of pselect and friends clearly documented.  The guarantee
      : that if the system call completes successfully we are guaranteed that no
      : signal that is unblocked by using sigmask will be delivered?
      
      Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
      Fixes: 854a6ed5 ("signal: Add restore_user_sigmask()")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Eric Wong <e@80x24.org>
      Tested-by: Eric Wong <e@80x24.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 31 May 2019, 1 commit
  17. 08 Mar 2019, 3 commits
    • epoll: use rwlock in order to reduce ep_poll_callback() contention · a218cc49
      Authored by Roman Penyaev
      The goal of this patch is to reduce contention on ep_poll_callback(),
      which can be called concurrently from different CPUs in the case of high
      event rates and many fds per epoll.  The problem can be reproduced very
      well by generating events (writes to a pipe or eventfd) from many
      threads, while a consumer thread does the polling.  In other words, this
      patch increases the bandwidth of events which can be delivered from
      sources to the poller by adding poll items to the list in a lockless way.
      
      The main change is the replacement of the spinlock with an rwlock, which
      is taken for reading in ep_poll_callback(); poll items are then added to
      the tail of the list using the xchg atomic instruction.  The write lock
      is taken everywhere else in order to stop list modifications and
      guarantee that list updates are fully completed (I assume that the write
      side of an rwlock does not starve; the qrwlock implementation seems to
      provide this guarantee).
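
      A hedged sketch of the lockless tail insertion, lightly simplified from
      the kernel's list_add_tail_lockless() helper (the caller holds
      read_lock(&ep->lock); the cmpxchg() marks the entry so a second
      concurrent add of the same item fails):

      	static inline bool list_add_tail_lockless(struct list_head *new,
      						  struct list_head *head)
      	{
      		struct list_head *prev;

      		/* Claim the entry; fail if it is already being linked. */
      		if (cmpxchg(&new->next, new, head) != new)
      			return false;

      		/* Atomically swing head->prev to us, then stitch the links. */
      		prev = xchg(&head->prev, new);
      		prev->next = new;
      		new->prev = prev;

      		return true;
      	}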
      
      The following are some microbenchmark results based on the test [1]
      which starts threads which generate N events each.  The test ends when
      all events are successfully fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events is
      significantly increased, and thus the execution time is reduced.
      
      This patch was tested with different sort of microbenchmarks and
      artificial delays (e.g.  "udelay(get_random_int() & 0xff)") introduced
      in kernel on paths where items are added to lists.
      
      [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: unify awaking of wakeup source on ep_poll_callback() path · c3e320b6
      Authored by Roman Penyaev
      The original comment "Activate ep->ws since epi->ws may get deactivated
      at any time" indeed sounds alarming, but it is incorrect: the path where
      we check epi->ws is the path where insertion into ovflist happens, i.e.
      ep_scan_ready_list() has taken ep->mtx and is waiting for this callback
      to finish, so ep_modify() (which unregisters the wakeup source) has to
      wait for ep_scan_ready_list().

      In this patch I simply call ep_pm_stay_awake_rcu(), which is a bit more
      than this path needs (it is indirectly protected by the main ep->mtx, so
      even RCU is not required), but I do not want to create another bare
      __ep_pm_stay_awake() variant only for this particular case, so the RCU
      variant is simply better for all cases.
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: make sure all elements in ready list are in FIFO order · c141175d
      Authored by Roman Penyaev
      Patch series "use rwlock in order to reduce ep_poll_callback()
      contention", v3.
      
      The last patch targets the contention problem in ep_poll_callback(),
      which can be reproduced very well by generating events (writes to a pipe
      or eventfd) from many threads while a consumer thread does the polling.
      
      The following are some microbenchmark results based on the test [1]
      which starts threads which generate N events each.  The test ends when
      all events are successfully fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events is
      significantly increased, and thus the execution time is reduced.
      
      This patch (of 4):
      
      All incoming events are stored in FIFO order, and this should also apply
      to ->ovflist, which is originally a stack, i.e. LIFO.

      Thus, to keep the correct FIFO order, ->ovflist should be reversed by
      adding its elements to the head of the ready list rather than to the
      tail.
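
      A hedged sketch of the corresponding ovflist drain in
      ep_scan_ready_list() (simplified; wakeup-source handling abbreviated).
      Because ->ovflist is filled as a LIFO stack, inserting each element at
      the head of ->rdllist restores the original arrival order:

      	for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL;
      	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
      		if (!ep_is_linked(epi)) {
      			/* Head insertion turns the LIFO ovflist back into FIFO. */
      			list_add(&epi->rdllink, &ep->rdllist);
      			ep_pm_stay_awake(epi);
      		}
      	}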
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 05 Jan 2019, 8 commits
  19. 04 Jan 2019, 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Authored by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
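
      A representative conversion (illustrative variables; the vast majority
      of call sites were exactly this trivial form):

      	/* Before: the 'type' argument was already ignored everywhere. */
      	if (!access_ok(VERIFY_READ, buf, len))
      		return -EFAULT;

      	/* After: */
      	if (!access_ok(buf, len))
      		return -EFAULT;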
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 07 Dec 2018, 2 commits
    • signal: Add restore_user_sigmask() · 854a6ed5
      Authored by Deepa Dinamani
      Refactor the logic that restores the sigmask before the syscall returns
      into an API.  This is useful for versions of syscalls that pass in a
      sigmask and expect current->sigmask to be changed during the execution of
      the syscall and restored afterwards.

      With the advent of the new y2038 syscalls in the subsequent patches, we
      add two new versions of the syscalls (for pselect, ppoll and
      io_pgetevents) in addition to the existing native and compat versions.
      Adding such an API reduces the logic that would otherwise need to be
      replicated.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • signal: Add set_user_sigmask() · ded653cc
      Authored by Deepa Dinamani
      Refactor reading the sigset from userspace and updating the sigmask
      into an API.

      This is useful for versions of syscalls that pass in a sigmask and
      expect current->sigmask to be changed during, and restored after, the
      execution of the syscall.
      
      With the advent of the new y2038 syscalls in the subsequent patches, we
      add two new versions of the syscalls (for pselect, ppoll, and
      io_pgetevents) in addition to the existing native and compat versions.
      Adding such an API reduces the logic that would otherwise need to be
      replicated.
      
      Note that the calls to sigprocmask() ignored the return value of the API,
      as the function only returns an error for an invalid first argument,
      which is hardcoded at these call sites.  The updated logic uses
      set_current_blocked() instead.
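
      A hedged sketch of the intended call pattern (epoll_pwait-style; local
      names are illustrative, and this shows the original two-argument
      restore_user_sigmask() from 854a6ed5, before the "interrupted" argument
      was added by 97abc889):

      	sigset_t ksigmask, sigsaved;
      	int error;

      	/* Copy the user's sigset and install it as the temporary mask. */
      	error = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize);
      	if (error)
      		return error;

      	error = do_epoll_wait(epfd, events, maxevents, timeout);

      	/* Put the original mask back once the syscall body is done. */
      	restore_user_sigmask(sigmask, &sigsaved);

      	return error;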
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  21. 23 Aug 2018, 5 commits