1. 22 Mar 2020, 1 commit
  2. 30 Jan 2020, 2 commits
  3. 05 Dec 2019, 2 commits
  4. 21 Aug 2019, 1 commit
  5. 19 Jul 2019, 1 commit
  6. 17 Jul 2019, 1 commit
  7. 29 Jun 2019, 1 commit
    • signal: remove the wrong signal_pending() check in restore_user_sigmask() · 97abc889
      Committed by Oleg Nesterov
      This is the minimal fix for stable, I'll send cleanups later.
      
      Commit 854a6ed5 ("signal: Add restore_user_sigmask()") introduced a
      user-visible change that breaks user space: a signal temporarily
      unblocked by set_user_sigmask() can be delivered even if the caller
      returns success or timeout.
      
      Change restore_user_sigmask() to accept an additional "interrupted"
      argument, to be used instead of the signal_pending() check, and
      update the callers.
      
      Eric said:
      
      : For clarity.  I don't think this is required by POSIX, or fundamentally to
      : remove the races in select.  It is what Linux has always done, and we have
      : applications that care, so I agree this fix is needed.
      :
      : Further, in any case where the semantic change that this patch rolls back
      : (aka allowing a signal to be delivered while the select-like call
      : completes) would be an advantage, we can do as well if not better by using
      : signalfd.
      :
      : Michael, is there any chance we can get this guarantee of the Linux
      : implementation of pselect and friends clearly documented: the guarantee
      : that if the system call completes successfully, no signal that is
      : unblocked by using sigmask will be delivered?
      
      Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
      Fixes: 854a6ed5 ("signal: Add restore_user_sigmask()")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Eric Wong <e@80x24.org>
      Tested-by: Eric Wong <e@80x24.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 31 May 2019, 1 commit
  9. 08 Mar 2019, 3 commits
    • epoll: use rwlock in order to reduce ep_poll_callback() contention · a218cc49
      Committed by Roman Penyaev
      The goal of this patch is to reduce contention in ep_poll_callback(),
      which can be called concurrently from different CPUs in the case of
      high event rates and many fds per epoll.  The problem is easily
      reproduced by generating events (writes to a pipe or eventfd) from
      many threads while a consumer thread polls.  In other words, this
      patch increases the bandwidth of events that can be delivered from
      sources to the poller by adding poll items to the list in a lockless
      way.
      
      The main change is the replacement of the spinlock with an rwlock,
      which is taken for read in ep_poll_callback(), with poll items then
      added to the tail of the list using the xchg atomic instruction.  The
      write lock is taken everywhere else in order to stop list
      modifications and guarantee that list updates are fully completed (I
      assume the write side of an rwlock does not starve; the qrwlock
      implementation seems to provide this guarantee).
      
      The following are some microbenchmark results based on the test [1],
      which starts threads that each generate N events.  The test ends when
      all events have been fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events is
      significantly increased, and execution time is therefore reduced.
      
      This patch was tested with various microbenchmarks and with
      artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced
      in the kernel on paths where items are added to lists.
      
      [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: unify awaking of wakeup source on ep_poll_callback() path · c3e320b6
      Committed by Roman Penyaev
      The original comment, "Activate ep->ws since epi->ws may get
      deactivated at any time", indeed sounds alarming, but it is
      incorrect: the path where we check epi->ws is a path where insertion
      into ovflist happens, i.e. ep_scan_ready_list() has taken ep->mtx and
      waits for this callback to finish, so ep_modify() (which unregisters
      the wakeup source) waits for ep_scan_ready_list().
      
      In this patch I simply call ep_pm_stay_awake_rcu(), which is slightly
      more than this path needs (it is indirectly protected by the main
      ep->mtx, so even RCU is not required), but I do not want to create
      another naked __ep_pm_stay_awake() variant only for this particular
      case, so the RCU variant is simply better for all cases.
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: make sure all elements in ready list are in FIFO order · c141175d
      Committed by Roman Penyaev
      Patch series "use rwlock in order to reduce ep_poll_callback()
      contention", v3.
      
      The last patch targets the contention problem in ep_poll_callback(),
      which is easily reproduced by generating events (writes to a pipe or
      eventfd) from many threads while a consumer thread polls.
      
      The following are some microbenchmark results based on the test [1],
      which starts threads that each generate N events.  The test ends when
      all events have been fetched by the poller thread:
      
       spinlock
       ========
      
       threads  events/ms  run-time ms
             8       6402        12495
            16       7045        22709
            32       7395        43268
      
       rwlock + xchg
       =============
      
       threads  events/ms  run-time ms
             8      10038         7969
            16      12178        13138
            32      13223        24199
      
      According to the results, the bandwidth of delivered events is
      significantly increased, and execution time is therefore reduced.
      
      This patch (of 4):
      
      All incoming events are stored in FIFO order, and this should also
      apply to ->ovflist, which is originally a stack, i.e. LIFO.
      
      Thus, to keep correct FIFO order, ->ovflist should be reversed by
      adding elements to the head of the ready list rather than to the
      tail.
      
      Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 05 Jan 2019, 8 commits
  11. 04 Jan 2019, 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Committed by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going
      to move the old VERIFY_xyz interface to that model.  And it's best
      done at the end of the merge window, when I've done most of my
      merges, so let's just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
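
      The commit's actual sed script isn't shown; as a hedged illustration,
      a one-line expression like the following handles the trivial
      access_ok(VERIFY_xyz, ...) form it describes:

```shell
# Hypothetical sed expression (not the commit's real script) that
# drops the VERIFY_xyz first argument from an access_ok() call:
printf 'if (!access_ok(VERIFY_WRITE, buf, len))\n' |
    sed 's/access_ok(VERIFY_[A-Z]*, */access_ok(/'
# prints: if (!access_ok(buf, len))
```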
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 07 Dec 2018, 2 commits
    • signal: Add restore_user_sigmask() · 854a6ed5
      Committed by Deepa Dinamani
      Refactor the logic that restores the sigmask before the syscall
      returns into an API.  This is useful for versions of syscalls that
      pass in a sigmask and expect current->sigmask to be changed during
      the execution of the syscall and restored afterwards.
      
      With the advent of the new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll
      and io_pgetevents) in addition to the existing native and compat
      versions.  Adding such an API reduces the logic that would otherwise
      need to be replicated.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • signal: Add set_user_sigmask() · ded653cc
      Committed by Deepa Dinamani
      Refactor reading a sigset from userspace and updating the sigmask
      into an API.
      
      This is useful for versions of syscalls that pass in the
      sigmask and expect the current->sigmask to be changed during,
      and restored after, the execution of the syscall.
      
      With the advent of the new y2038 syscalls in the subsequent patches,
      we add two more new versions of the syscalls (for pselect, ppoll,
      and io_pgetevents) in addition to the existing native and compat
      versions.  Adding such an API reduces the logic that would otherwise
      need to be replicated.
      
      Note that the calls to sigprocmask() ignored the return value of the
      API, as the function only returns an error for an invalid first
      argument, which is hardcoded at these call sites.  The updated logic
      uses set_current_blocked() instead.
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  13. 23 Aug 2018, 7 commits
  14. 29 Jun 2018, 1 commit
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Committed by Linus Torvalds
      The poll() changes were not well thought out and were completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called
      down to the underlying file operations, but instead made at least two
      indirect calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity, which is completely
      unwarranted for the regular case.  The (undocumented) reason for the
      poll() changes was some alleged AIO poll race fixing, but we don't
      make the common case slower and more complex for an uncommon special
      case, so this all really needs far more explanation and most likely a
      fundamental redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 15 Jun 2018, 1 commit
  16. 26 May 2018, 1 commit
  17. 03 Apr 2018, 1 commit
  18. 12 Feb 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants.  But the keyword here is "almost".
      For various bad reasons they aren't the same everywhere, and because
      of this epoll() doesn't actually work quite correctly in some cases
      on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 02 Feb 2018, 2 commits
  20. 29 Nov 2017, 2 commits