1. 05 Jan 2019 (6 commits)
  2. 07 Dec 2018 (2 commits)
    • signal: Add restore_user_sigmask() · 854a6ed5
      Deepa Dinamani authored
      Factor the logic that restores the sigmask before the syscall
      returns out into an api.

      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during,
      and restored after, the execution of the syscall.
      
      With the advent of new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll
      and io_pgetevents) in addition to the existing native and compat
      versions. Adding such an api reduces the logic that would need to
      be replicated otherwise.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      854a6ed5
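
      A minimal sketch of the restore step this commit describes, assuming
      the usual kernel signal primitives (signal_pending(), set_restore_sigmask(),
      __set_current_blocked()); the helper name and signature below are
      illustrative, not a verbatim copy of the merged code:

          #include <linux/sched/signal.h>
          #include <linux/signal.h>

          /* Sketch: restore the sigmask that was saved before the syscall ran. */
          static void restore_user_sigmask_sketch(const sigset_t __user *usigmask,
                                                  sigset_t *sigsaved)
          {
                  if (!usigmask)
                          return;
                  /*
                   * If a signal became pending while the temporary mask was in
                   * place, defer the restore to the signal-delivery path so the
                   * unblocked signal is not lost on the way back to userspace.
                   */
                  if (signal_pending(current)) {
                          current->saved_sigmask = *sigsaved;
                          set_restore_sigmask();
                          return;
                  }
                  __set_current_blocked(sigsaved);
          }
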
    • signal: Add set_user_sigmask() · ded653cc
      Deepa Dinamani authored
      Refactor reading sigset from userspace and updating sigmask
      into an api.
      
      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during,
      and restored after, the execution of the syscall.
      
      With the advent of new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll,
      and io_pgetevents) in addition to the existing native and compat
      versions. Adding such an api reduces the logic that would need to
      be replicated otherwise.
      
      Note that the existing calls to sigprocmask() ignored its return
      value, since the function only returns an error for an invalid first
      argument, and that argument is hardcoded at these call sites.
      The updated logic uses set_current_blocked() instead.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      ded653cc
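
      A hedged sketch of the "read the sigset from userspace and update the
      sigmask" step, using set_current_blocked() as the commit text says;
      the helper name and exact signature here are assumptions for
      illustration only:

          #include <linux/errno.h>
          #include <linux/sched/signal.h>
          #include <linux/signal.h>
          #include <linux/uaccess.h>

          /* Sketch: install a userspace-supplied sigmask, saving the old one. */
          static int set_user_sigmask_sketch(const sigset_t __user *umask,
                                             sigset_t *kmask, sigset_t *oldmask,
                                             size_t sigsetsize)
          {
                  if (!umask)
                          return 0;
                  if (sigsetsize != sizeof(sigset_t))
                          return -EINVAL;
                  if (copy_from_user(kmask, umask, sizeof(sigset_t)))
                          return -EFAULT;

                  *oldmask = current->blocked;

                  /* SIGKILL and SIGSTOP can never be blocked. */
                  sigdelsetmask(kmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
                  set_current_blocked(kmask);
                  return 0;
          }
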
  3. 23 Aug 2018 (7 commits)
  4. 29 Jun 2018 (1 commit)
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds authored
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
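
      For reference, the single-indirection model this revert restores is
      the classic vfs_poll()-style dispatch: one indirect call through
      file->f_op->poll and nothing else. A rough sketch from memory, so
      treat the details as approximate rather than the exact <linux/poll.h>
      text:

          #include <linux/fs.h>
          #include <linux/poll.h>

          /* One indirect call straight into the driver's ->poll() method. */
          static inline __poll_t demo_vfs_poll(struct file *file,
                                               struct poll_table_struct *pt)
          {
                  if (unlikely(!file->f_op->poll))
                          return DEFAULT_POLLMASK;
                  return file->f_op->poll(file, pt);
          }
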
  5. 15 Jun 2018 (1 commit)
  6. 26 May 2018 (1 commit)
  7. 03 Apr 2018 (1 commit)
  8. 12 Feb 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
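
      The practical effect on a driver's ->poll() method is a one-token
      change per constant; a hypothetical example of the post-replacement
      form (the device state and wait queue are made up for illustration):

          #include <linux/fs.h>
          #include <linux/poll.h>
          #include <linux/wait.h>

          static DECLARE_WAIT_QUEUE_HEAD(demo_readq);   /* hypothetical */
          static bool demo_data_ready;

          static __poll_t demo_poll(struct file *file, poll_table *wait)
          {
                  __poll_t mask = 0;

                  poll_wait(file, &demo_readq, wait);    /* register, never sleeps */
                  if (demo_data_ready)
                          mask |= EPOLLIN | EPOLLRDNORM; /* was POLLIN | POLLRDNORM */
                  return mask;
          }
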
  9. 02 Feb 2018 (2 commits)
  10. 29 Nov 2017 (2 commits)
  11. 28 Nov 2017 (2 commits)
  12. 18 Nov 2017 (3 commits)
    • epoll: remove ep_call_nested() from ep_eventpoll_poll() · 37b5e521
      Jason Baron authored
      ep_call_nested() is used in ep_eventpoll_poll(), the .poll routine
      for an epoll fd, to prevent excessively deep epoll nesting and to
      prevent circular paths.
      
      However, we are already preventing these conditions during
      EPOLL_CTL_ADD.  In terms of too deep epoll chains, we do in fact allow
      deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
      however we don't allow more than EP_MAX_NESTS when an epoll file
      descriptor is actually connected to a wakeup source.  Thus, we do not
      require the use of ep_call_nested(), since ep_eventpoll_poll(), which
      is called via ep_scan_ready_list(), only continues nesting if there
      are events available.
      
      Since ep_call_nested() is implemented using a global lock, applications
      that make use of nested epoll can see large performance improvements
      with this change.
      
      Davidlohr said:
      
      : Improvements are quite obscene actually, such as for the following
      : epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
      :
      : ncpus  vanilla     dirty     delta
      : 1      2447092     3028315   +23.75%
      : 4      231265      2986954   +1191.57%
      : 8      121631      2898796   +2283.27%
      : 16     59749       2902056   +4757.07%
      : 32     26837       2326314   +8568.30%
      : 64     12926       1341281   +10276.61%
      :
      : (http://linux-scalability.org/epoll/epoll-test.c)
      
      Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37b5e521
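
      For context, "nested epoll" here means an epoll file descriptor that
      is itself watched by another epoll file descriptor, as in this
      minimal userspace sketch (error handling omitted; the two-level shape
      matches the benchmark above):

          #include <sys/epoll.h>
          #include <unistd.h>

          int main(void)
          {
                  int inner = epoll_create1(0);   /* watches the real fds */
                  int outer = epoll_create1(0);   /* watches the inner epoll fd */
                  struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner };
                  struct epoll_event out[1];

                  epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);
                  /* Waiting on 'outer' polls 'inner'; ep_eventpoll_poll() is the
                   * path where the nested-call bookkeeping used to happen. */
                  epoll_wait(outer, out, 1, 0);

                  close(outer);
                  close(inner);
                  return 0;
          }
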
    • epoll: avoid calling ep_call_nested() from ep_poll_safewake() · 57a173bd
      Jason Baron authored
      ep_poll_safewake() is used to wakeup potentially nested epoll file
      descriptors.  The function uses ep_call_nested() to prevent entering the
      same wake up queue more than once, and to prevent excessively deep
      wakeup paths (deeper than EP_MAX_NESTS).  However, this is not necessary
      since we are already preventing these conditions during EPOLL_CTL_ADD.
      This saves extra function calls, and avoids taking a global lock during
      the ep_call_nested() calls.
      
      I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
      case, since ep_call_nested() keeps track of the nesting level, and this
      is required by the call to spin_lock_irqsave_nested().  It would be nice
      to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
      case as well; however, it's not clear how to simply pass the nesting
      level through multiple wake_up() levels without more surgery.  In any
      case, I don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for
      production.
      This patch also apparently fixes a workload at Google that Salman Qazi
      reported, by completely removing the poll_safewake_ncalls->lock from
      wakeup paths.
      
      Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Salman Qazi <sqazi@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57a173bd
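
      The CONFIG_DEBUG_LOCK_ALLOC caveat is about lockdep subclasses:
      spin_lock_irqsave_nested() takes an explicit nesting level so lockdep
      does not complain when locks that look recursive to it are taken in a
      controlled order. A generic illustration of that API (not the epoll
      code itself):

          #include <linux/spinlock.h>

          static DEFINE_SPINLOCK(parent_lock);
          static DEFINE_SPINLOCK(child_lock);

          static void nested_lock_demo(int subclass)
          {
                  unsigned long pflags, cflags;

                  spin_lock_irqsave(&parent_lock, pflags);
                  /* The subclass tells lockdep this is a distinct nesting level. */
                  spin_lock_irqsave_nested(&child_lock, cflags, subclass);

                  spin_unlock_irqrestore(&child_lock, cflags);
                  spin_unlock_irqrestore(&parent_lock, pflags);
          }
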
    • epoll: account epitem and eppoll_entry to kmemcg · 2ae928a9
      Shakeel Butt authored
      A userspace application can directly trigger the allocations from
      eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious application
      can consume a significant amount of system memory by triggering such
      allocations.  Indeed we have seen in production where a buggy
      application was leaking the epoll references and causing a burst of
      eventpoll_epi and eventpoll_pwq slab allocations.  This patch opts
      the eventpoll_epi and eventpoll_pwq slabs in to kmemcg charging.
      
      There is a per-user limit (~4% of total memory if no highmem) on these
      caches.  I think it is too generous particularly in the scenario where
      jobs of multiple users are running on the system and the administrator
      is reducing cost by overcommitting the memory.  This is unaccounted
      kernel memory and will not be considered by the oom-killer.  I think by
      accounting it to kmemcg, for systems with kmem accounting enabled, we
      can provide better isolation between jobs of different users.
      
      Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ae928a9
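
      Opting a slab cache in to kmemcg accounting is done by passing
      SLAB_ACCOUNT when the cache is created; a sketch of the pattern (the
      struct and cache names below are stand-ins, not the eventpoll code):

          #include <linux/errno.h>
          #include <linux/init.h>
          #include <linux/slab.h>

          struct demo_item {                      /* stand-in for struct epitem */
                  int payload;
          };

          static struct kmem_cache *demo_cache;

          static int __init demo_cache_init(void)
          {
                  /* SLAB_ACCOUNT charges allocations to the allocating memcg. */
                  demo_cache = kmem_cache_create("demo_cache",
                                                 sizeof(struct demo_item), 0,
                                                 SLAB_ACCOUNT, NULL);
                  return demo_cache ? 0 : -ENOMEM;
          }
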
  13. 20 Sep 2017 (1 commit)
  14. 09 Sep 2017 (1 commit)
  15. 02 Sep 2017 (1 commit)
    • epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() · 138e4ad6
      Oleg Nesterov authored
      The race was introduced by me in commit 971316f0 ("epoll:
      ep_unregister_pollwait() can use the freed pwq->whead").  I did not
      realize that nothing can protect eventpoll after ep_poll_callback() sets
      ->whead = NULL; only whead->lock can save us from the race with
      ep_free() or ep_remove().
      
      Move ->whead = NULL to the end of ep_poll_callback() and add the
      necessary barriers.
      
      TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
      before this patch.
      
      Hopefully this explains the use-after-free reported by syzkaller:
      
      	BUG: KASAN: use-after-free in debug_spin_lock_before
      	...
      	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
      	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148
      
      this is spin_lock(eventpoll->lock),
      
      	...
      	Freed by task 17774:
      	...
      	 kfree+0xe8/0x2c0 mm/slub.c:3883
      	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865
      
      Fixes: 971316f0 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
      Reported-by: 范龙飞 <long7573@126.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      138e4ad6
  16. 13 Jul 2017 (3 commits)
  17. 11 Jul 2017 (1 commit)
    • fs, epoll: short circuit fetching events if thread has been killed · c257a340
      David Rientjes authored
      We've encountered zombies that are waiting for a thread to exit while
      that thread loops in ep_poll() almost endlessly, even though there is
      a pending SIGKILL as a result of a group exit.
      
      This happens because ep_events_available() keeps finding events, so
      we fetch more of them and never reach the signal_pending() check that
      would break out of the loop and return -EINTR.
      
      Special-case fatal signals and break immediately, to guarantee that
      we do not keep looping to fetch more events and delay a timely exit.
      
      It would also be possible to simply move the check for signal_pending()
      higher than checking for ep_events_available(), but there have been no
      reports of delayed signal handling other than SIGKILL preventing zombies
      from exiting that would be fixed by this.
      
      It fixes an issue for us where we have witnessed zombies sticking around
      for at least O(minutes), but considering the code has been like this
      forever and nobody else has complained that I have found, I would simply
      queue it up for 4.12.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c257a340
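
      The shape of the fix is a fatal-signal check that wins over the
      "events are available" fast path; a hedged sketch of where such a
      check sits in a wait loop (the helpers are hypothetical stand-ins,
      not a verbatim copy of ep_poll()):

          #include <linux/errno.h>
          #include <linux/sched/signal.h>
          #include <linux/wait.h>

          /* Hypothetical stand-ins for ep_events_available()/event delivery. */
          extern bool demo_events_available(void);
          extern int demo_send_events(void);
          extern wait_queue_head_t demo_wq;

          static int demo_wait_for_events(void)
          {
                  for (;;) {
                          /* Fatal signals beat "more events are ready". */
                          if (fatal_signal_pending(current))
                                  return -EINTR;
                          if (demo_events_available())
                                  return demo_send_events();
                          wait_event_interruptible(demo_wq,
                                                   demo_events_available() ||
                                                   fatal_signal_pending(current));
                  }
          }
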
  18. 20 Jun 2017 (2 commits)
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar authored
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... was pretty unclear (to me) about what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2055da97
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar authored
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac6424b9
  19. 25 Mar 2017 (1 commit)
    • epoll: Add busy poll support to epoll with socket fds. · bf3b9f63
      Sridhar Samudrala authored
      This patch adds busy poll support to epoll. The implementation is meant to
      be opportunistic in that it will take the NAPI ID from the last socket
      that is added to the ready list that contains a valid NAPI ID and it will
      use that for busy polling until the ready list goes empty.  Once the ready
      list goes empty the NAPI ID is reset and busy polling is disabled until a
      new socket is added to the ready list.
      
      In addition when we insert a new socket into the epoll we record the NAPI
      ID and assume we are going to receive events on it.  If that doesn't occur
      it will be evicted as the active NAPI ID and we will resume normal
      behavior.
      
      An application can use the SO_INCOMING_CPU or SO_ATTACH_REUSEPORT_CBPF/EBPF socket
      options to spread the incoming connections to specific worker threads
      based on the incoming queue. This enables epoll for each worker thread
      to have only sockets that receive packets from a single queue. So when an
      application calls epoll_wait() and there are no events available to report,
      busy polling is done on the associated queue to pull the packets.
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf3b9f63
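
      A userspace sketch of the per-worker layout described above: each
      worker opens its own SO_REUSEPORT listener (optionally steered with
      SO_INCOMING_CPU) and polls it from its own epoll set, so busy polling
      stays on a single receive queue. The port and CPU values are
      arbitrary, and busy polling itself is controlled separately (for
      example via the net.core.busy_poll sysctl):

          #include <arpa/inet.h>
          #include <netinet/in.h>
          #include <string.h>
          #include <sys/socket.h>

          /* One listener per worker thread; the worker adds the returned fd
           * to its own epoll instance. */
          static int make_worker_listener(int cpu)
          {
                  int one = 1;
                  int fd = socket(AF_INET, SOCK_STREAM, 0);
                  struct sockaddr_in addr;

                  memset(&addr, 0, sizeof(addr));
                  addr.sin_family = AF_INET;
                  addr.sin_addr.s_addr = htonl(INADDR_ANY);
                  addr.sin_port = htons(8080);             /* arbitrary port */

                  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
                  setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
                  bind(fd, (struct sockaddr *)&addr, sizeof(addr));
                  listen(fd, 128);
                  return fd;
          }
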
  20. 02 Mar 2017 (1 commit)