1. 21 Aug 2019 (1 commit)
  2. 19 Jul 2019 (1 commit)
  3. 17 Jul 2019 (1 commit)
  4. 29 Jun 2019 (1 commit)
    • signal: remove the wrong signal_pending() check in restore_user_sigmask() · 97abc889
      Committed by Oleg Nesterov
      This is the minimal fix for stable, I'll send cleanups later.
      
      Commit 854a6ed5 ("signal: Add restore_user_sigmask()") introduced a
      user-visible change that breaks user space: a signal temporarily
      unblocked by set_user_sigmask() can be delivered even if the caller
      returns success or a timeout.
      
      Change restore_user_sigmask() to accept an additional "interrupted"
      argument, which should be used instead of the signal_pending() check,
      and update the callers.
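
      As an illustration, a hedged sketch of the resulting caller pattern
      (paraphrased, not the literal diff; do_sys_poll() here simply stands in
      for whichever wait primitive the syscall uses):

        ret = do_sys_poll(ufds, nfds, to);
        /* pass whether the wait itself was interrupted, instead of
         * re-checking signal_pending() and racing with a late signal */
        restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR);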
      
      Eric said:
      
      : For clarity.  I don't think this is required by POSIX, or fundamentally to
      : remove the races in select.  It is what Linux has always done and we have
      : applications that care, so I agree this fix is needed.
      :
      : Further, in any case where the semantic change that this patch rolls back
      : (aka allowing a signal to be delivered and the select-like call to
      : complete) would be an advantage, we can do as well if not better by using
      : signalfd.
      :
      : Michael, is there any chance we can get this guarantee of the Linux
      : implementation of pselect and friends clearly documented?  The guarantee
      : that if the system call completes successfully we are guaranteed that no
      : signal that is unblocked by using sigmask will be delivered?
      
      Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
      Fixes: 854a6ed5 ("signal: Add restore_user_sigmask()")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Eric Wong <e@80x24.org>
      Tested-by: Eric Wong <e@80x24.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 31 May 2019 (1 commit)
  6. 08 Mar 2019 (3 commits)
    • epoll: use rwlock in order to reduce ep_poll_callback() contention · a218cc49
      Committed by Roman Penyaev
      The goal of this patch is to reduce contention on ep_poll_callback(),
      which can be called concurrently from different CPUs under high event
      rates with many fds per epoll instance.  The problem can be readily
      reproduced by generating events (writes to a pipe or eventfd) from many
      threads while a consumer thread polls.  In other words, this patch
      increases the bandwidth of events that can be delivered from sources to
      the poller by adding poll items to the list in a lockless way.
      
      The main change is the replacement of the spinlock with an rwlock,
      which is taken for reading in ep_poll_callback(); poll items are then
      added to the tail of the list using an atomic xchg.  The write lock is
      taken everywhere else in order to stop list modifications and guarantee
      that list updates are fully completed (I assume the write side of an
      rwlock does not starve; the qrwlock implementation appears to provide
      this guarantee).
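
      A minimal stand-alone sketch of the general "xchg the tail, then link"
      idea, in user-space C11 atomics (the kernel's actual lockless list code
      differs in detail, so treat this as an illustration of the technique,
      not the patch):

        #include <stdatomic.h>
        #include <stddef.h>

        struct node { _Atomic(struct node *) next; int val; };

        struct queue {
                _Atomic(struct node *) tail;
                struct node stub;               /* always present, simplifies push */
        };

        static void queue_init(struct queue *q)
        {
                atomic_init(&q->stub.next, NULL);
                atomic_init(&q->tail, &q->stub);
        }

        /* Many producers may call this concurrently: a single atomic
         * exchange claims the tail slot, then the old tail is linked in. */
        static void queue_push(struct queue *q, struct node *n)
        {
                atomic_store(&n->next, NULL);
                struct node *prev = atomic_exchange(&q->tail, n);
                atomic_store(&prev->next, n);
        }

        int main(void)
        {
                struct queue q;
                struct node a, b;

                queue_init(&q);
                queue_push(&q, &a);     /* in the kernel this is the read-locked path */
                queue_push(&q, &b);
                return 0;
        }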
      
      The following are some microbenchmark results based on the test [1],
      which starts threads that each generate N events.  The test ends when
      all events have been fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events increases
      significantly, and execution time drops accordingly.
      
      The patch was tested with various microbenchmarks and with artificial
      delays (e.g.  "udelay(get_random_int() & 0xff)") introduced in the
      kernel on the paths where items are added to the lists.
      
      [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
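
      A minimal user-space sketch of the producer/consumer pattern that the
      stress test [1] exercises (an illustrative reconstruction with
      arbitrary sizing, not the actual stress-epoll.c):

        #include <pthread.h>
        #include <stdint.h>
        #include <sys/epoll.h>
        #include <sys/eventfd.h>
        #include <unistd.h>

        #define NTHREADS 8
        #define NEVENTS  10000

        static int epfd;
        static int efds[NTHREADS];

        static void *producer(void *arg)
        {
                int fd = efds[(long)arg];
                uint64_t one = 1;
                for (int i = 0; i < NEVENTS; i++)
                        write(fd, &one, sizeof(one));   /* each write fires ep_poll_callback() */
                return NULL;
        }

        int main(void)
        {
                pthread_t thr[NTHREADS];
                struct epoll_event ev = { .events = EPOLLIN };

                epfd = epoll_create1(0);
                for (long i = 0; i < NTHREADS; i++) {
                        efds[i] = eventfd(0, EFD_NONBLOCK);
                        ev.data.u64 = i;
                        epoll_ctl(epfd, EPOLL_CTL_ADD, efds[i], &ev);
                        pthread_create(&thr[i], NULL, producer, (void *)i);
                }

                /* consumer: keep fetching until every generated event is drained */
                long fetched = 0;
                struct epoll_event out[64];
                uint64_t val;
                while (fetched < (long)NTHREADS * NEVENTS) {
                        int n = epoll_wait(epfd, out, 64, 100);
                        for (int i = 0; i < n; i++)
                                if (read(efds[out[i].data.u64], &val, sizeof(val)) > 0)
                                        fetched += (long)val;  /* eventfd counter accumulates */
                }

                for (int i = 0; i < NTHREADS; i++)
                        pthread_join(thr[i], NULL);
                return 0;
        }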
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: unify awaking of wakeup source on ep_poll_callback() path · c3e320b6
      Committed by Roman Penyaev
      The original comment, "Activate ep->ws since epi->ws may get deactivated
      at any time", sounds alarming but is incorrect, because the path where
      we check epi->ws is the path where insertion into ovflist happens, i.e.
      ep_scan_ready_list() has taken ep->mtx and is waiting for this callback
      to finish, so ep_modify() (which unregisters the wakeup source) has to
      wait for ep_scan_ready_list().
      
      In this patch I simply call ep_pm_stay_awake_rcu(), which is slightly
      more than this path needs (it is indirectly protected by the main
      ep->mtx, so even RCU is not required), but I do not want to create
      another bare __ep_pm_stay_awake() variant only for this particular
      case, so the RCU variant is simply better for all the cases.
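
      A rough before/after of that ovflist branch in ep_poll_callback()
      (paraphrased from the description above, not the literal diff):

        /* before: open-coded activation of ep->ws on the ovflist path */
        if (epi->ws)
                __pm_stay_awake(ep->ws);

        /* after: the same helper the ready-list path already uses */
        ep_pm_stay_awake_rcu(epi);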
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: make sure all elements in ready list are in FIFO order · c141175d
      Committed by Roman Penyaev
      Patch series "use rwlock in order to reduce ep_poll_callback()
      contention", v3.
      
      The last patch targets the contention problem in ep_poll_callback(),
      which can be readily reproduced by generating events (writes to a pipe
      or eventfd) from many threads while a consumer thread polls.
      
      The following are some microbenchmark results based on the test [1],
      which starts threads that each generate N events.  The test ends when
      all events have been fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events increases
      significantly, and execution time drops accordingly.
      
      This patch (of 4):
      
      All incoming events are stored in FIFO order, and this should also
      apply to ->ovflist, which is originally a stack, i.e. LIFO.

      Thus, to keep correct FIFO order, ->ovflist should be reversed by
      adding its elements to the head of the ready list rather than to the
      tail.
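
      A tiny stand-alone illustration of the idea (not the kernel code):
      popping a LIFO stack and inserting each element at the head of the
      target list restores arrival (FIFO) order.

        #include <stdio.h>

        struct item { int seq; struct item *next; };

        /* Pop each element off the LIFO stack and insert it at the head of
         * the destination list; the double reversal restores arrival order. */
        static struct item *splice_to_head(struct item *stack, struct item *dst)
        {
                while (stack) {
                        struct item *it = stack;
                        stack = it->next;
                        it->next = dst;
                        dst = it;
                }
                return dst;
        }

        int main(void)
        {
                /* events arrived 1, 2, 3; pushing them on a stack stored them 3->2->1 */
                struct item n1 = {1, NULL}, n2 = {2, &n1}, n3 = {3, &n2};

                for (struct item *it = splice_to_head(&n3, NULL); it; it = it->next)
                        printf("%d ", it->seq);         /* prints: 1 2 3 */
                printf("\n");
                return 0;
        }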
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 05 Jan 2019 (8 commits)
  8. 04 Jan 2019 (1 commit)
    • Remove 'type' argument from access_ok() function · 96d4f267
      Committed by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface over to that model.  And it's best
      done at the end of the merge window, when I've done most of my merges,
      so let's just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
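
      For illustration, the shape of the mechanical conversion (a
      representative example rather than a specific call site from the
      patch):

        /* before */
        if (!access_ok(VERIFY_WRITE, buf, count))
                return -EFAULT;

        /* after */
        if (!access_ok(buf, count))
                return -EFAULT;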
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 07 Dec 2018 (2 commits)
    • signal: Add restore_user_sigmask() · 854a6ed5
      Committed by Deepa Dinamani
      Refactor the logic that restores the sigmask before the syscall returns
      into an API.

      This is useful for versions of syscalls that pass in the sigmask and
      expect current->sigmask to be changed during, and restored after, the
      execution of the syscall.
      
      With the advent of the new y2038 syscalls in the subsequent patches, we
      add two more new versions of the syscalls (for pselect, ppoll and
      io_pgetevents) in addition to the existing native and compat versions.
      Adding such an API reduces the logic that would otherwise need to be
      replicated.
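
      A hedged sketch of the call pattern this pair of helpers enables in a
      pselect/ppoll-style syscall (names and signatures are illustrative, not
      the exact kernel code; do_the_wait() stands in for the syscall body):

        sigset_t sigmask, sigsaved;
        long ret;

        ret = set_user_sigmask(usigmask, &sigmask, &sigsaved, sigsetsize);
        if (ret)
                return ret;

        ret = do_the_wait();            /* the pselect/ppoll/io_pgetevents body */

        restore_user_sigmask(usigmask, &sigsaved);
        return ret;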
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • signal: Add set_user_sigmask() · ded653cc
      Committed by Deepa Dinamani
      Refactor the code that reads a sigset from user space and updates the
      sigmask into an API.
      
      This is useful for versions of syscalls that pass in the sigmask and
      expect current->sigmask to be changed during, and restored after, the
      execution of the syscall.
      
      With the advent of the new y2038 syscalls in the subsequent patches, we
      add two more new versions of the syscalls (for pselect, ppoll, and
      io_pgetevents) in addition to the existing native and compat versions.
      Adding such an API reduces the logic that would otherwise need to be
      replicated.
      
      Note that the calls to sigprocmask() ignored the return value of the
      API, since the function only returns an error for an invalid first
      argument, which is hardcoded at these call sites.  The updated logic
      uses set_current_blocked() instead.
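
      From user space, the semantics these helpers implement look like the
      following minimal, runnable pselect() example: a signal is kept blocked
      normally and is unblocked only for the duration of the wait, after
      which the original mask is restored (the signal number and timeout are
      arbitrary choices for the illustration):

        #include <signal.h>
        #include <stdio.h>
        #include <sys/select.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
                sigset_t blocked, during;

                /* keep SIGUSR1 blocked in normal operation */
                sigemptyset(&blocked);
                sigaddset(&blocked, SIGUSR1);
                sigprocmask(SIG_BLOCK, &blocked, NULL);

                /* let pselect() unblock it only while it sleeps */
                sigprocmask(SIG_BLOCK, NULL, &during);
                sigdelset(&during, SIGUSR1);

                fd_set rfds;
                FD_ZERO(&rfds);
                FD_SET(STDIN_FILENO, &rfds);

                struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
                int ret = pselect(STDIN_FILENO + 1, &rfds, NULL, NULL, &ts, &during);

                /* on return, the original mask (SIGUSR1 blocked) is back */
                printf("pselect returned %d\n", ret);
                return 0;
        }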
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  10. 23 Aug 2018 (7 commits)
  11. 29 Jun 2018 (1 commit)
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Committed by Linus Torvalds
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
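
      For context, a hedged sketch of the "trivial" shape of a classic
      ->poll() method that the text refers to (mydev_waitqueue and
      mydev_data_ready() are hypothetical driver names):

        static __poll_t mydev_poll(struct file *file, poll_table *wait)
        {
                __poll_t mask = 0;

                poll_wait(file, &mydev_waitqueue, wait);  /* register on the wait queue */
                if (mydev_data_ready())
                        mask |= EPOLLIN | EPOLLRDNORM;
                return mask;
        }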
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 15 Jun 2018 (1 commit)
  13. 26 May 2018 (1 commit)
  14. 03 Apr 2018 (1 commit)
  15. 12 Feb 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
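
      The effect on a typical driver line (an illustrative example, not taken
      from any specific file):

        /* before */  mask |= POLLIN | POLLRDNORM;
        /* after  */  mask |= EPOLLIN | EPOLLRDNORM;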
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 02 Feb 2018 (2 commits)
  17. 29 Nov 2017 (2 commits)
  18. 28 Nov 2017 (2 commits)
  19. 18 Nov 2017 (3 commits)
    • epoll: remove ep_call_nested() from ep_eventpoll_poll() · 37b5e521
      Committed by Jason Baron
      ep_call_nested() is used in ep_eventpoll_poll(), the .poll routine for
      an epoll fd, to prevent excessively deep epoll nesting and to prevent
      circular paths.
      
      However, we already prevent these conditions during EPOLL_CTL_ADD.  In
      terms of overly deep epoll chains, we do in fact allow deep nesting of
      the epoll fds themselves (deeper than EP_MAX_NESTS); however, we don't
      allow more than EP_MAX_NESTS when an epoll file descriptor is actually
      connected to a wakeup source.  Thus, we do not require ep_call_nested(),
      since ep_eventpoll_poll(), which is called via ep_scan_ready_list(),
      only continues nesting if there are events available.
      
      Since ep_call_nested() is implemented using a global lock, applications
      that make use of nested epoll can see large performance improvements
      with this change.
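
      For context, nested epoll means registering one epoll fd inside
      another.  A minimal sketch (error handling omitted) of the two-level
      setup that such benchmarks exercise:

        #include <stdint.h>
        #include <sys/epoll.h>
        #include <sys/eventfd.h>
        #include <unistd.h>

        int main(void)
        {
                int efd = eventfd(0, EFD_NONBLOCK);        /* event source */
                int inner = epoll_create1(0);              /* level 1 */
                int outer = epoll_create1(0);              /* level 2: watches the inner epoll fd */

                struct epoll_event ev = { .events = EPOLLIN };

                ev.data.fd = efd;
                epoll_ctl(inner, EPOLL_CTL_ADD, efd, &ev);

                ev.data.fd = inner;
                epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

                uint64_t one = 1;
                write(efd, &one, sizeof(one));             /* readiness propagates up the chain */

                struct epoll_event out;
                epoll_wait(outer, &out, 1, 1000);          /* the outer instance reports the inner fd */

                close(efd); close(inner); close(outer);
                return 0;
        }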
      
      Davidlohr said:
      
      : Improvements are quite obscene actually, such as for the following
      : epoll_wait() benchmark with 2-level nesting on an 80-core IvyBridge:
      :
      : ncpus  vanilla     dirty     delta
      : 1      2447092     3028315   +23.75%
      : 4      231265      2986954   +1191.57%
      : 8      121631      2898796   +2283.27%
      : 16     59749       2902056   +4757.07%
      : 32     26837       2326314   +8568.30%
      : 64     12926       1341281   +10276.61%
      :
      : (http://linux-scalability.org/epoll/epoll-test.c)
      
      Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: avoid calling ep_call_nested() from ep_poll_safewake() · 57a173bd
      Committed by Jason Baron
      ep_poll_safewake() is used to wake up potentially nested epoll file
      descriptors.  The function uses ep_call_nested() to prevent entering
      the same wake-up queue more than once and to prevent excessively deep
      wakeup paths (deeper than EP_MAX_NESTS).  However, this is not
      necessary, since we already prevent these conditions during
      EPOLL_CTL_ADD.  This saves extra function calls and avoids taking a
      global lock during the ep_call_nested() calls.
      
      I have, however, left ep_call_nested() in place for the
      CONFIG_DEBUG_LOCK_ALLOC case, since ep_call_nested() keeps track of the
      nesting level, and this is required by the call to
      spin_lock_irqsave_nested().  It would be nice to remove the
      ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well,
      but it's not clear how to simply pass the nesting level through
      multiple wake_up() levels without more surgery.  In any case, I don't
      think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.  This
      patch also apparently fixes a workload at Google that Salman Qazi
      reported, by completely removing the poll_safewake_ncalls->lock from
      the wakeup paths.
      
      Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: account epitem and eppoll_entry to kmemcg · 2ae928a9
      Committed by Shakeel Butt
      A userspace application can directly trigger allocations from the
      eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious
      application can consume a significant amount of system memory by
      triggering such allocations.  Indeed, we have seen cases in production
      where a buggy application was leaking epoll references and causing a
      burst of eventpoll_epi and eventpoll_pwq slab allocations.  This patch
      opts the eventpoll_epi and eventpoll_pwq slabs in to kmemcg charging.
      
      There is a per-user limit (~4% of total memory if there is no highmem)
      on these caches.  I think it is too generous, particularly in the
      scenario where jobs of multiple users are running on the system and the
      administrator is reducing cost by overcommitting memory.  This is
      unaccounted kernel memory and will not be considered by the OOM killer.
      I think that by accounting it to kmemcg, on systems with kmem
      accounting enabled, we can provide better isolation between jobs of
      different users.
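
      A hedged sketch of what such opting-in looks like, i.e. adding
      SLAB_ACCOUNT when the caches are created (the flag combinations shown
      are illustrative, not the exact lines from the patch):

        epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
                        0, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

        pwq_cache = kmem_cache_create("eventpoll_pwq", sizeof(struct eppoll_entry),
                        0, SLAB_PANIC | SLAB_ACCOUNT, NULL);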
      
      Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>