1. 10 Dec, 2019 (12 commits)
  2. 08 Dec, 2019 (8 commits)
    • smb3: improve check for when we send the security descriptor context on create · 231e2a0b
      Steve French committed
      We had cases in the previous patch where we were sending the security
      descriptor context on SMB3 open (file create) even when we hadn't
      mounted with the "modefromsid" mount option.
      
      Add a check for that mount flag before calling ad_sd_context in
      open init.
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      231e2a0b
    • pipe: don't use 'pipe_wait()' for basic pipe IO · 85190d15
      Linus Torvalds committed
      pipe_wait() may be simple, but since it relies on the pipe lock, it
      means that we have to do the wakeup while holding the lock.  That's
      unfortunate, because the very first thing the woken entity will want to
      do is to get the pipe lock for itself.
      
      So get rid of the pipe_wait() usage by simply releasing the pipe lock,
      doing the wakeup (if required) and then using wait_event_interruptible()
      to wait on the right condition instead.
      
      wait_event_interruptible() handles races on its own by comparing the
      wakeup condition before and after adding itself to the wait queue, so
      you can use an optimistic unlocked condition for it.
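      
      A minimal sketch of the pattern, assuming a pipe_writable() helper for
      the unlocked fullness check and the then-single pipe->wait queue;
      illustrative only, not the actual fs/pipe.c code:
      
       static int pipe_wait_writable_sketch(struct pipe_inode_info *pipe)
       {
       	/* Drop the pipe lock before waking, so the woken reader does
       	 * not immediately block on the lock we still hold. */
       	__pipe_unlock(pipe);
       	wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
       
       	/* wait_event_interruptible() re-checks the condition around the
       	 * wait-queue insertion, so the unlocked check is race-free. */
       	if (wait_event_interruptible(pipe->wait, pipe_writable(pipe)))
       		return -ERESTARTSYS;
       
       	__pipe_lock(pipe);	/* retake for the next pass of the loop */
       	return 0;
       }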
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85190d15
    • pipe: remove 'waiting_writers' merging logic · a28c8b9d
      Linus Torvalds committed
      This code is ancient, and goes back to when we only had a single page
      for the pipe buffers.  The exact history is hidden in the mists of time
      (ie "before git", and in fact predates the BK repository too).
      
      At that long-ago point in time, it actually helped to try to merge big
      back-and-forth pipe reads and writes, and not limit pipe reads to the
      single pipe buffer in length just because that was all we had at the time.
      
      However, since then we've expanded the pipe buffers to multiple pages,
      and this logic really doesn't seem to make sense.  And a lot of it is
      somewhat questionable (ie "hmm, the user asked for a non-blocking read,
      but we see that there's a writer pending, so let's wait anyway to get
      the extra data that the writer will have").
      
      But more importantly, it makes the "go to sleep" logic much less
      obvious, and considering the wakeup issues we've had, I want fewer of
      those kinds of things.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a28c8b9d
    • pipe: fix and clarify pipe read wakeup logic · f467a6a6
      Linus Torvalds committed
      This is the read side version of the previous commit: it simplifies the
      logic to only wake up waiting writers when necessary, and makes sure to
      use a synchronous wakeup.  This time not so much for GNU make jobserver
      reasons (that pipe never fills up), but simply to get the writer going
      quickly again.
      
      A bit less verbose commentary this time, if only because I assume that
      the write side commentary isn't going to be ignored if you touch this
      code.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f467a6a6
    • pipe: fix and clarify pipe write wakeup logic · 1b6b26ae
      Linus Torvalds committed
      The pipe rework ends up having been extra painful, partly because of
      actual bugs with ordering and caching of the pipe state, but also
      because of subtle performance issues.
      
      In particular, the pipe rework caused the kernel build to inexplicably
      slow down.
      
      The reason turns out to be that the GNU make jobserver (which limits the
      parallelism of the build) uses a pipe to implement a "token" system: a
      parallel submake will read a character from the pipe to get the job
      token before starting a new job, and will write a character back to the
      pipe when it is done.  The overall job limit is thus easily controlled
      by just writing the appropriate number of initial token characters into
      the pipe.
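      
      As a hypothetical userspace sketch of that token protocol (names are
      illustrative, not GNU make's code), each submake does roughly:
      
       #include <unistd.h>
       
       /* Block until a token byte is free, run the job, hand it back.
        * The number of bytes initially written into the pipe bounds how
        * many jobs can hold a token at once. */
       static void run_one_job(int token_rd, int token_wr)
       {
       	char token;
       
       	if (read(token_rd, &token, 1) != 1)
       		return;
       	/* ... fork/exec the job here ... */
       	write(token_wr, &token, 1);
       }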
      
      But to work well, that really means that the old behavior of write
      wakeups being synchronous (WF_SYNC) is very important - when the pipe
      writer wakes up a reader, we want the reader to actually get scheduled
      immediately.  Otherwise you lose the parallelism of the build.
      
      The pipe rework lost that synchronous wakeup on write, and we had
      clearly all forgotten the reasons and rules for it.
      
      This rewrites the pipe write wakeup logic to do the required WF_SYNC
      wakeups, but also clarifies the logic and avoids extraneous wakeups.
      
      It also ends up adding a number of comments about what it does and why,
      so that we hopefully don't end up forgetting about this next time we
      change this code.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b6b26ae
    • pipe: fix poll/select race introduced by the pipe rework · ad910e36
      Linus Torvalds committed
      The kernel wait queues have a basic rule to them: you add yourself to
      the wait-queue first, and then you check the things that you're going to
      wait on.  That avoids the races with the event you're waiting for.
      
      The same goes for poll/select logic: the "poll_wait()" goes first, and
      then you check the things you're polling for.
      
      Of course, if you use locking, the ordering doesn't matter since the
      lock will serialize with anything that changes the state you're looking
      at. That's not the case here, though.
      
      So move the poll_wait() first in pipe_poll(), before you start looking
      at the pipe state.
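      
      A sketch of the required ordering in a ->poll handler, using the
      post-rework head/tail fields; illustrative, not the verbatim patch:
      
       static __poll_t pipe_poll_sketch(struct file *filp, poll_table *wait)
       {
       	struct pipe_inode_info *pipe = filp->private_data;
       	__poll_t mask = 0;
       
       	/* Register on the wait queue *before* sampling any state... */
       	poll_wait(filp, &pipe->wait, wait);
       
       	/* ...so an unlocked read here cannot miss a racing wakeup. */
       	if (!pipe_empty(READ_ONCE(pipe->head), READ_ONCE(pipe->tail)))
       		mask |= EPOLLIN | EPOLLRDNORM;
       	return mask;
       }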
      
      Fixes: 8cefc107 ("pipe: Use head and tail pointers for the ring, not cursor and length")
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad910e36
    • nfsd: depend on CRYPTO_MD5 for legacy client tracking · 38a2204f
      Patrick Steinhardt committed
      The legacy client tracking infrastructure of nfsd makes use of MD5 to
      derive a client's recovery directory name. As the nfsd module doesn't
      declare any dependency on CRYPTO_MD5, though, it may fail to allocate
      the hash if the kernel was compiled without it. As a result, generation
      of client recovery directories will fail with the following error:
      
          NFSD: unable to generate recoverydir name
      
      The explicit dependency on CRYPTO_MD5 was removed as redundant back in
      commit 6aaa67b5 ("NFSD: Remove redundant "select" clauses in fs/Kconfig",
      2008-02-11), as it was already implicitly selected via RPCSEC_GSS_KRB5.
      This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit
      df486a25 ("NFS: Fix the selection of security flavours in Kconfig") at
      a later point.
      
      Fix the issue by adding back an explicit dependency on CRYPTO_MD5.
      
      Fixes: df486a25 ("NFS: Fix the selection of security flavours in Kconfig")
      Signed-off-by: Patrick Steinhardt <ps@pks.im>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      38a2204f
    • NFSD fixing possible null pointer dereferencing in copy offload · 18f428d4
      Olga Kornievskaia committed
      A static checker revealed an error path that could lead to a NULL
      pointer dereference.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: e0639dc5 ("NFSD introduce async copy feature")
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      18f428d4
  3. 07 Dec, 2019 (3 commits)
  4. 06 Dec, 2019 (2 commits)
    • pipe: Fix missing mask update after pipe_wait() · 8f868d68
      David Howells committed
      Fix pipe_write() to not cache the ring index mask and max_usage as their
      values are invalidated by calling pipe_wait() because the latter
      function drops the pipe lock, thereby allowing F_SETPIPE_SZ to change them.
      Without this, pipe_write() may subsequently miscalculate the array
      indices and pipe fullness, leading to an oops like the following:
      
        BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
        Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
        ...
        CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
        ...
        Call Trace:
          pipe_write+0xc25/0xe10 fs/pipe.c:481
          call_write_iter include/linux/fs.h:1895 [inline]
          new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
          __vfs_write+0x94/0x110 fs/read_write.c:496
          vfs_write+0x18a/0x520 fs/read_write.c:558
          ksys_write+0x105/0x220 fs/read_write.c:611
          __do_sys_write fs/read_write.c:623 [inline]
          __se_sys_write fs/read_write.c:620 [inline]
          __x64_sys_write+0x6e/0xb0 fs/read_write.c:620
          do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is not a problem for pipe_read() as the mask is recalculated on
      each pass of the loop, after pipe_wait() has been called.
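      
      A condensed sketch of the fixed loop shape (field names follow the
      ring rework; the real pipe_write() does considerably more):
      
       for (;;) {
       	/* Reload on every pass: pipe_wait() below drops the pipe
       	 * lock, and F_SETPIPE_SZ may resize the ring meanwhile. */
       	unsigned int mask = pipe->ring_size - 1;
       	unsigned int head = pipe->head;
       
       	if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
       		struct pipe_buffer *buf = &pipe->bufs[head & mask];
       		/* ... copy user data into buf ... */
       		break;
       	}
       	pipe_wait(pipe);	/* sleeps with the pipe lock dropped */
       }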
      
      Fixes: 8cefc107 ("pipe: Use head and tail pointers for the ring, not cursor and length")
      Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      [ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f868d68
    • pipe: Remove assertion from pipe_poll() · 8c7b8c34
      David Howells committed
      An assertion check was added to pipe_poll() to make sure that the ring
      occupancy isn't seen to overflow the ring size.  However, since no locks
      are held when the three values are read, it is possible for F_SETPIPE_SZ
      to intervene and muck up the calculation, thereby causing the oops.
      
      Fix this by simply removing the assertion and accepting that the
      calculation might be approximate.
      
      Note that the previous code also had a similar issue, though there was
      no assertion check, since the occupancy counter and the ring size were
      not read with a lock held, so it's possible that the poll check might
      have malfunctioned then too.
      
      Also wake up all the waiters so that they can reissue their checks if
      there was a competing read or write.
      
      Fixes: 8cefc107 ("pipe: Use head and tail pointers for the ring, not cursor and length")
      Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c7b8c34
  5. 05 Dec, 2019 (15 commits)
    • iomap: stop using ioend after it's been freed in iomap_finish_ioend() · c275779f
      Zorro Lang committed
      This patch fixes the following KASAN report. The @ioend has been
      freed by dio_put(), but iomap_finish_ioend() still tries to access
      its data.
      
      [20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
      [20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184
      
      [20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
      [20563.653887] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
      [20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
      [20563.669547] Call trace:
      [20563.671993]  dump_backtrace+0x0/0x370
      [20563.675648]  show_stack+0x1c/0x28
      [20563.678958]  dump_stack+0x138/0x1b0
      [20563.682455]  print_address_description.isra.9+0x60/0x378
      [20563.687759]  __kasan_report+0x1a4/0x2a8
      [20563.691587]  kasan_report+0xc/0x18
      [20563.694985]  __asan_report_load8_noabort+0x18/0x20
      [20563.699769]  iomap_finish_ioend+0x58c/0x5c0
      [20563.703944]  iomap_finish_ioends+0x110/0x270
      [20563.708396]  xfs_end_ioend+0x168/0x598 [xfs]
      [20563.712823]  xfs_end_io+0x1e0/0x2d0 [xfs]
      [20563.716834]  process_one_work+0x7f0/0x1ac8
      [20563.720922]  worker_thread+0x334/0xae0
      [20563.724664]  kthread+0x2c4/0x348
      [20563.727889]  ret_from_fork+0x10/0x18
      
      [20563.732941] Allocated by task 83403:
      [20563.736512]  save_stack+0x24/0xb0
      [20563.739820]  __kasan_kmalloc.isra.9+0xc4/0xe0
      [20563.744169]  kasan_slab_alloc+0x14/0x20
      [20563.747998]  slab_post_alloc_hook+0x50/0xa8
      [20563.752173]  kmem_cache_alloc+0x154/0x330
      [20563.756185]  mempool_alloc_slab+0x20/0x28
      [20563.760186]  mempool_alloc+0xf4/0x2a8
      [20563.763845]  bio_alloc_bioset+0x2d0/0x448
      [20563.767849]  iomap_writepage_map+0x4b8/0x1740
      [20563.772198]  iomap_do_writepage+0x200/0x8d0
      [20563.776380]  write_cache_pages+0x8a4/0xed8
      [20563.780469]  iomap_writepages+0x4c/0xb0
      [20563.784463]  xfs_vm_writepages+0xf8/0x148 [xfs]
      [20563.788989]  do_writepages+0xc8/0x218
      [20563.792658]  __writeback_single_inode+0x168/0x18f8
      [20563.797441]  writeback_sb_inodes+0x370/0xd30
      [20563.801703]  wb_writeback+0x2d4/0x1270
      [20563.805446]  wb_workfn+0x344/0x1178
      [20563.808928]  process_one_work+0x7f0/0x1ac8
      [20563.813016]  worker_thread+0x334/0xae0
      [20563.816757]  kthread+0x2c4/0x348
      [20563.819979]  ret_from_fork+0x10/0x18
      
      [20563.825028] Freed by task 22184:
      [20563.828251]  save_stack+0x24/0xb0
      [20563.831559]  __kasan_slab_free+0x10c/0x180
      [20563.835648]  kasan_slab_free+0x10/0x18
      [20563.839389]  slab_free_freelist_hook+0xb4/0x1c0
      [20563.843912]  kmem_cache_free+0x8c/0x3e8
      [20563.847745]  mempool_free_slab+0x20/0x28
      [20563.851660]  mempool_free+0xd4/0x2f8
      [20563.855231]  bio_free+0x33c/0x518
      [20563.858537]  bio_put+0xb8/0x100
      [20563.861672]  iomap_finish_ioend+0x168/0x5c0
      [20563.865847]  iomap_finish_ioends+0x110/0x270
      [20563.870328]  xfs_end_ioend+0x168/0x598 [xfs]
      [20563.874751]  xfs_end_io+0x1e0/0x2d0 [xfs]
      [20563.878755]  process_one_work+0x7f0/0x1ac8
      [20563.882844]  worker_thread+0x334/0xae0
      [20563.886584]  kthread+0x2c4/0x348
      [20563.889804]  ret_from_fork+0x10/0x18
      
      [20563.894855] The buggy address belongs to the object at fffffc0c54a36900
                      which belongs to the cache bio-1 of size 248
      [20563.906844] The buggy address is located 40 bytes inside of
                      248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
      [20563.918485] The buggy address belongs to the page:
      [20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
      [20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
      [20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
      [20563.948300] page dumped because: kasan: bad access detected
      
      [20563.955345] Memory state around the buggy address:
      [20563.960129]  fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
      [20563.967342]  fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [20563.981766]                                   ^
      [20563.986288]  fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
      [20563.993501]  fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [20564.000713] ==================================================================
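      
      A sketch of the fix pattern, assuming the ioend chains its completion
      bios through bio->bi_private as the trace above suggests; not the
      verbatim patch:
      
       struct bio *bio, *next;
       
       for (bio = &ioend->io_inline_bio; bio; bio = next) {
       	/* Read the chain pointer *before* the put: bio_put() may
       	 * drop the last reference and free the bio (and, for the
       	 * first one, the ioend embedded around it). */
       	next = bio->bi_private;
       	/* ... finish the pages attached to this bio ... */
       	bio_put(bio);
       }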
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
      Signed-off-by: Zorro Lang <zlang@redhat.com>
      Fixes: 9cd0ed63 ("iomap: enhance writeback error message")
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c275779f
    • io_uring: fix a typo in a comment · 0b4295b5
      LimingWu committed
      thatn -> than.
      Signed-off-by: Liming Wu <19092205@suning.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b4295b5
    • io_uring: hook all linked requests via link_list · 4493233e
      Pavel Begunkov committed
      Links are created by chaining requests through req->list, with the
      exception that the head uses req->link_list (e.g. link_list->list->list).
      Because of that, io_req_link_next() needs complex splicing to advance.
      
      Link them all through link_list instead. It also seems simpler and
      more consistent.
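      
      A sketch of the shape of the change, with hypothetical names rather
      than the real io_uring structs:
      
       struct req_sketch {
       	struct list_head link_list;	/* one chaining scheme for all */
       	/* ... */
       };
       
       /* With a uniform list, advancing past the head is plain list work. */
       static struct req_sketch *req_link_next(struct req_sketch *head)
       {
       	return list_first_entry_or_null(&head->link_list,
       					struct req_sketch, link_list);
       }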
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4493233e
    • io_uring: fix error handling in io_queue_link_head · 2e6e1fde
      Pavel Begunkov committed
      In case of an error, io_submit_sqe() drops a request and continues
      without it, even if the request was part of a link. Not only does it
      fail to cancel links, it may also execute the wrong sequence of actions.
      
      Stop consuming sqes, and let the user handle errors.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2e6e1fde
    • fs/binfmt_elf.c: extract elf_read() function · 658c0335
      Alexey Dobriyan committed
      ELF reads done by the kernel have very complicated error detection
      code, which had better live in one place.
      
      Link: http://lkml.kernel.org/r/20191005165215.GB26927@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      658c0335
    • fs/epoll: remove unnecessary wakeups of nested epoll · 339ddb53
      Heiher committed
      Take the case where we have:
      
              t0
               | (ew)
              e0
               | (et)
              e1
               | (lt)
              s0
      
      t0: thread 0
      e0: epoll fd 0
      e1: epoll fd 1
      s0: socket fd 0
      ew: epoll_wait
      et: edge-trigger
      lt: level-trigger
      
      Remove the unnecessary wakeups to prevent a nested epoll working in
      edge-triggered mode from waking up continuously.
      
      Test code:
       #include <unistd.h>
       #include <sys/epoll.h>
       #include <sys/socket.h>
      
       int main(int argc, char *argv[])
       {
       	int sfd[2];
       	int efd[2];
       	struct epoll_event e;
      
       	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
       		goto out;
      
       	efd[0] = epoll_create(1);
       	if (efd[0] < 0)
       		goto out;
      
       	efd[1] = epoll_create(1);
       	if (efd[1] < 0)
       		goto out;
      
       	e.events = EPOLLIN;
       	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
       		goto out;
      
       	e.events = EPOLLIN | EPOLLET;
       	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
       		goto out;
      
       	if (write(sfd[1], "w", 1) != 1)
       		goto out;
      
       	if (epoll_wait(efd[0], &e, 1, 0) != 1)
       		goto out;
      
       	if (epoll_wait(efd[0], &e, 1, 0) != 0)
       		goto out;
      
       	close(efd[0]);
       	close(efd[1]);
       	close(sfd[0]);
       	close(sfd[1]);
      
       	return 0;
      
       out:
       	return -1;
       }
      
      More tests:
       https://github.com/heiher/epoll-wakeup
      
      Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc
      Signed-off-by: hev <r@hev.cc>
      Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Eric Wong <e@80x24.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      339ddb53
    • epoll: simplify ep_poll_safewake() for CONFIG_DEBUG_LOCK_ALLOC · f6520c52
      Jason Baron committed
      Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
      ep_call_nested() in order to pass the correct subclass argument to
      spin_lock_irqsave_nested().  However, ep_call_nested() adds unnecessary
      checks for epoll depth and loops that are already verified when doing
      EPOLL_CTL_ADD.  This mirrors a conversion that was done for
      !CONFIG_DEBUG_LOCK_ALLOC in commit 37b5e521 ("epoll: remove
      ep_call_nested() from ep_eventpoll_poll()").
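      
      A sketch of the simplification, with hypothetical names; the nesting
      depth is already known from EPOLL_CTL_ADD time, so it can be handed
      straight to lockdep:
      
       static void ep_poll_safewake_sketch(wait_queue_head_t *wq, int depth)
       {
       	unsigned long flags;
       
       	/* Pass the already-verified depth directly instead of
       	 * rediscovering it through ep_call_nested(). */
       	spin_lock_irqsave_nested(&wq->lock, flags, depth);
       	wake_up_locked_poll(wq, EPOLLIN);
       	spin_unlock_irqrestore(&wq->lock, flags);
       }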
      
      Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Wong <normalperson@yhbt.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6520c52
    • fs/proc/Kconfig: fix indentation · 3d82191c
      Krzysztof Kozlowski committed
      Adjust indentation from spaces to a tab (plus an optional two spaces),
      as the coding style requires, with a command like:
              $ sed -e 's/^        /	/' -i */Kconfig
      
      [adobriyan@gmail.com: add two spaces where necessary]
      Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2
      Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d82191c
    • fs/proc/internal.h: shuffle "struct pde_opener" · 70a731c0
      Alexey Dobriyan committed
      List iteration takes more code than anything else, which means the
      embedded list_head should be the first element of the structure.
      
      Space savings:
      
      	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18)
      	Function                                     old     new   delta
      	close_pdeo                                   228     227      -1
      	proc_reg_release                              86      82      -4
      	proc_entry_rundown                           143     139      -4
      	proc_reg_open                                298     289      -9
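      
      The point is only the field ordering; the member set below is an
      assumption for illustration:
      
       struct pde_opener {
       	struct list_head lh;	/* first: container_of() offset is
       				 * zero on the hottest iteration paths */
       	struct file *file;
       	bool closing;
       	struct completion *c;
       };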
      
      Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70a731c0
    • fs/proc/generic.c: delete useless "len" variable · 5f6354ea
      Alexey Dobriyan committed
      The pointer to the next '/' encodes both the length of the path element
      and the next start position, so the subtraction and increment are
      redundant.
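      
      An illustrative fragment of the idea (not the actual fs/proc code):
      
       const char *next = strchr(name, '/');
       /* The '/' pointer alone yields the element length... */
       size_t elem_len = next ? (size_t)(next - name) : strlen(name);
       /* ...and, when present, where the following element starts. */
       if (next)
       	name = next + 1;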
      
      Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f6354ea
    • proc: change ->nlink under proc_subdir_lock · e06689bf
      Alexey Dobriyan committed
      Currently, gluing a PDE into the global /proc tree is done under a lock,
      but changing ->nlink is not.  Additionally, struct proc_dir_entry::nlink
      is not atomic, so updates can be lost.
      
      Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e06689bf
    • io_uring: use hash table for poll command lookups · 78076bb6
      Jens Axboe committed
      We recently changed this from a single list to an rbtree, but for some
      real life workloads, the rbtree slows down the submission/insertion
      case enough so that it's the top cycle consumer on the io_uring side.
      In testing, using a hash table is a more well-rounded compromise. It
      is fast for insertion, and as long as it's sized appropriately, it
      works well for the cancellation case as well. Running TAO with a lot
      of network sockets, this removes io_poll_req_insert() from spending
      2% of the CPU cycles.
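      
      A minimal sketch of the approach, with assumed names and table size
      (not the real io_uring structures):
      
       #define POLL_HASH_BITS	10	/* assumed; "sized appropriately" */
       
       struct poll_req_sketch {
       	struct hlist_node hash_node;
       	u64 key;		/* e.g. the fd or address being polled */
       };
       
       static struct hlist_head poll_table[1 << POLL_HASH_BITS];
       
       static void poll_req_insert(struct poll_req_sketch *req)
       {
       	/* O(1) insertion; cancellation walks one short bucket. */
       	hlist_add_head(&req->hash_node,
       		       &poll_table[hash_64(req->key, POLL_HASH_BITS)]);
       }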
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      78076bb6
    • io-wq: clear node->next on list deletion · 08bdcc35
      Jens Axboe committed
      If someone removes a node from a list, and then later adds it back to
      a list, we can have invalid data in ->next. This can cause all sorts
      of issues. One such use case is the IORING_OP_POLL_ADD command, which
      will do just that if we race and get woken twice without any pending
      events. This is a pretty rare case, but can happen under extreme loads.
      Dan reports that he saw the following crash:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
      Oops: 0002 [#1] SMP
      CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      RIP: 0010:io_wqe_enqueue+0x3e/0xd0
      Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
      RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
      RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
      RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
      RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
      R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
      FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <IRQ>
       io_poll_wake+0x12f/0x2a0
       __wake_up_common+0x86/0x120
       __wake_up_common_lock+0x7a/0xc0
       sock_def_readable+0x3c/0x70
       tcp_rcv_established+0x557/0x630
       tcp_v6_do_rcv+0x118/0x3c0
       tcp_v6_rcv+0x97e/0x9d0
       ip6_protocol_deliver_rcu+0xe3/0x440
       ip6_input+0x3d/0xc0
       ? ip6_protocol_deliver_rcu+0x440/0x440
       ipv6_rcv+0x56/0xd0
       ? ip6_rcv_finish_core.isra.18+0x80/0x80
       __netif_receive_skb_one_core+0x50/0x70
       netif_receive_skb_internal+0x2f/0xa0
       napi_gro_receive+0x125/0x150
       mlx5e_handle_rx_cqe+0x1d9/0x5a0
       ? mlx5e_poll_tx_cq+0x305/0x560
       mlx5e_poll_rx_cq+0x49f/0x9c5
       mlx5e_napi_poll+0xee/0x640
       ? smp_reschedule_interrupt+0x16/0xd0
       ? reschedule_interrupt+0xf/0x20
       net_rx_action+0x286/0x3d0
       __do_softirq+0xca/0x297
       irq_exit+0x96/0xa0
       do_IRQ+0x54/0xe0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0033:0x7fdc627a2e3a
      Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
      
      when running a networked workload with about 5000 sockets being polled
      for. Fix this by clearing node->next when the node is being removed from
      the list.
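      
      A sketch of the fix on a hypothetical singly linked node mirroring the
      io-wq list shape:
      
       struct work_node_sketch {
       	struct work_node_sketch *next;
       };
       
       static void work_list_del_sketch(struct work_node_sketch *prev,
       				 struct work_node_sketch *node)
       {
       	prev->next = node->next;
       	node->next = NULL;	/* the fix: no stale pointer left for
       				 * a later re-add to trip over */
       }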
      
      Fixes: 6206f0e1 ("io-wq: shrink io_wq_work a bit")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08bdcc35
    • io_uring: ensure deferred timeouts copy necessary data · 2d28390a
      Jens Axboe committed
      If we defer a timeout, we should ensure that we copy the timespec
      when we have consumed the sqe. This is similar to commit f67676d1
      for read/write requests. We already did this correctly for timeouts
      deferred as links, but do it generally and use the infrastructure added
      by commit 1a6b74fc instead of having the timeout deferral use its
      own.
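      
      A sketch of the idea with assumed names (the real code goes through
      the async-context infrastructure added by commit 1a6b74fc):
      
       struct timeout_copy_sketch {
       	struct timespec64 ts;
       };
       
       static int timeout_prep_sketch(void **async_data,
       			       const struct __kernel_timespec __user *uts)
       {
       	struct timeout_copy_sketch *to = kmalloc(sizeof(*to), GFP_KERNEL);
       
       	if (!to)
       		return -ENOMEM;
       	/* Copy out of sqe-referenced user memory while the sqe is being
       	 * consumed; the deferred request then owns its own data. */
       	if (get_timespec64(&to->ts, uts)) {
       		kfree(to);
       		return -EFAULT;
       	}
       	*async_data = to;
       	return 0;
       }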
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d28390a