1. 27 May 2020 (25 commits)
2. 02 Apr 2020 (1 commit)
•
      io_uring: use current task creds instead of allocating a new one · 311b786d
Jens Axboe committed
      fix #26374723
      
      commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a upstream.
      
      syzbot reports:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
        kthread+0x361/0x430 kernel/kthread.c:255
        ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Modules linked in:
      ---[ end trace f2e1a4307fbe2245 ]---
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is caused by slab fault injection triggering a failure in
prepare_creds(). We don't actually need to create a copy of the creds
as we're not modifying them; we just need a reference on the current
task's creds. This avoids the failure case as well, and propagates the
const throughout the stack.
      
      Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
      Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
3. 26 Mar 2020 (1 commit)
4. 19 Mar 2020 (3 commits)
5. 18 Mar 2020 (10 commits)
•
      io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled · f1046eaf
Xiaoguang Wang committed
      commit 32b2244a840a90ea94ba42392de5c48d53f521f5 upstream linux-next
      
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
to poll for io completion events themselves; they can rely on io_sq_thread to
do the polling, which reduces cpu usage and uring_lock contention.
      
I modified the fio io_uring engine code a bit to evaluate the performance:
      static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                              continue;
                      }
      
      -               if (!o->sqpoll_thread) {
      +               if (o->sqpoll_thread && o->hipri) {
                              r = io_uring_enter(ld, 0, actual_min,
                                                      IORING_ENTER_GETEVENTS);
                              if (r < 0) {
      
      and use "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
      -rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
      -size=10G -numjobs=1  -time_based -runtime=120"
      
original code
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
      fio cpu usage |     100% |     100% |     100% |     100% |     100%
      --------------------------------------------------------------------
      
      with patch
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage |    63.8% |    74.4% |    81.1% |    83.7% |    82.4%
      --------------------------------------------------------------------
      bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
      --------------------------------------------------------------------
      
From the above test results, we can see that bw improves by about 5.5%~13%,
and the fio process's cpu usage also drops considerably. Note this
won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
are both enabled; in that case io_sq_thread always has 100% cpu usage.
This patch should be friendly to applications which often use
io_uring_wait_cqe() or similar helpers from liburing.
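
As an illustration (not part of this patch), an application running with
SETUP_IOPOLL|SETUP_SQPOLL can reap completions without calling
io_uring_enter(2) at all, e.g. with liburing's non-blocking peek. Here
'ring' is assumed to be an already-initialized io_uring instance with
'want' requests submitted, and handle_cqe() is an application-specific
placeholder:

    struct io_uring_cqe *cqe;
    unsigned reaped = 0;

    while (reaped < want) {
            /* io_sq_thread does the device-side polling for us, so a
             * plain CQ-ring peek is enough; no syscall needed */
            if (io_uring_peek_cqe(&ring, &cqe) == 0) {
                    handle_cqe(cqe);
                    io_uring_cqe_seen(&ring, cqe);
                    reaped++;
            }
    }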
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
•
      io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · a3a1829e
Xiaoguang Wang committed
      commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 upstream
      
After making ext4 support the iopoll method (letting ext4_file_operations'
iopoll method be iomap_dio_iopoll()), we found fio can easily hang in
fio_ioring_getevents() with the below fio job:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
      -rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
      -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
      
There are two issues that result in this hang. One is that when
IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not
use io_uring_enter to get completed events; it relies on the kernel
io_sq_thread to poll for completed events.
      
The other is a race: when io_submit_sqes() in io_sq_thread() submits a
batch of sqes, the variable 'inflight' records the number of submitted
reqs, and io_sq_thread then polls for reqs which have been added to
poll_list. But note, if some of those reqs have been punted to an io
worker, they won't show up in poll_list in time. io_sq_thread() will
then only poll for part of the previously submitted reqs, find
poll_list empty, and reset the variable 'inflight' to zero. If the app
just waits for these deferred reqs and does not wake up io_sq_thread
again, the hang happens.
      
For apps that entirely rely on io_sq_thread to poll for completed
requests, let io_iopoll_req_issued() wake up io_sq_thread properly when
adding a new element to poll_list, and when io_sq_thread prepares to
sleep, check whether poll_list is empty again; if it is not empty,
continue to poll.
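
Roughly, the fix amounts to the following two pieces (a sketch; the
exact upstream code may differ slightly):

    /* io_iopoll_req_issued(): after adding the req to ctx->poll_list,
     * kick io_sq_thread if it is sleeping, so reqs punted to io-wq
     * that show up late still get polled */
    if ((ctx->flags & IORING_SETUP_SQPOLL) &&
        wq_has_sleeper(&ctx->sqo_wait))
            wake_up(&ctx->sqo_wait);

    /* io_sq_thread(): before going to sleep, re-check poll_list and
     * keep polling instead of waiting for a wakeup if it is non-empty */
    if (!list_empty(&ctx->poll_list))
            continue;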
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
•
      io_uring: fix __io_iopoll_check deadlock in io_sq_thread · a715bf8d
Xiaoguang Wang committed
      commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d upstream.
      
Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already have events pending, we won't enter the
poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if
the app has been terminated and doesn't reap the pending events which
are already in the cq ring, and there are some reqs in poll_list,
io_sq_thread will enter __io_iopoll_check(), find pending events, and
then return; this loop never has a chance to exit.
      
I have seen this issue in fio stress tests. To fix it, let io_sq_thread
call io_iopoll_getevents() with the argument 'min' set to zero, and
remove __io_iopoll_check().
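
A sketch of the resulting polling step in io_sq_thread()
(io_iopoll_getevents() is the existing helper; 'min' == 0 means "reap
whatever has completed, but never wait"):

    mutex_lock(&ctx->uring_lock);
    if (!list_empty(&ctx->poll_list))
            io_iopoll_getevents(ctx, &nr_events, 0);
    mutex_unlock(&ctx->uring_lock);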
      
      Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending")
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
•
      io_uring: ensure registered buffer import returns the IO length · 85cad208
Jens Axboe committed
      commit 5e559561a8d7e6d4adfce6aa8fbf3daa3dec1577 upstream.
      
A test case was reported where two linked reads with registered buffers
always failed the second link. This is because we set the expected value
of a request in req->result, and if we don't get this result, then we
fail the dependent links. For some reason the registered buffer import
returned -ERROR/0, while the normal import returns -ERROR/length. This
broke linked commands with registered buffers.
      
      Fix this by making io_import_fixed() correctly return the mapped length.
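
The essence of the fix, as a sketch in the style of the diff above (the
buffer-mapping logic in io_import_fixed() is unchanged; only the success
return value differs):

            /* on success, report how many bytes were mapped so that
             * req->result matches what the non-registered path returns */
    -       return 0;
    +       return len;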
      
      Cc: stable@vger.kernel.org # v5.3
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
•
      io_uring: Fix getting file for timeout · a4a506be
Pavel Begunkov committed
      commit 5683e5406e94ae1bfb0d9516a18fdb281d0f8d1d upstream.
      
For timeout requests io_uring tries to grab a file with the specified fd,
which is usually stdin/fd=0.
Update io_op_needs_file() so that timeout requests don't require a file.
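
A sketch of what the updated helper looks like (the full opcode list is
elided; the point is that IORING_OP_TIMEOUT joins the "no file needed"
cases):

    static bool io_op_needs_file(const struct io_uring_sqe *sqe)
    {
            switch (sqe->opcode) {
            case IORING_OP_NOP:
            case IORING_OP_POLL_REMOVE:
            case IORING_OP_TIMEOUT:         /* a timeout carries no real fd */
                    return false;
            default:
                    return true;
            }
    }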
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
•
      io_uring: check for validity of ->rings in teardown · bc740f65
Jens Axboe committed
      commit 15dff286d0e0087d4dcd7049911f179e4e4cfd94 upstream.
      
      Normally the rings are always valid, the exception is if we failed to
      allocate the rings at setup time. syzbot reports this:
      
      RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
      RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
      RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
      R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
      RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
      RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
      Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
      ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
      89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
      RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
      RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
      RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
      R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
      R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
      FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
        io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
        io_uring_create fs/io_uring.c:4600 [inline]
        io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
        __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
        __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
        __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
        do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441229
      Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
      RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
      RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
      R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      Modules linked in:
      ---[ end trace b0f5b127a57f623f ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
      RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
      RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
      Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1
      ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c
      89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
      RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
      RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
      RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
      R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
      R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
      FS:  0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is exactly the case of failing to allocate the SQ/CQ rings, and
      then entering shutdown. Check if the rings are valid before trying to
      access them at shutdown time.
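
A sketch of the guard in the teardown path (the exact call sites may
differ slightly from upstream):

    /* the rings may never have been allocated if setup failed early,
     * so only flush/commit through them if they actually exist */
    if (ctx->rings)
            io_cqring_overflow_flush(ctx, true);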
      
      Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com
      Fixes: 1d7bb1d50fb4 ("io_uring: add support for backlogged CQ ring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
•
      io_uring: fix potential deadlock in io_poll_wake() · e9b10a6b
Jens Axboe committed
      commit 7c9e7f0fe0d8abf856a957c150c48778e75154c1 upstream.
      
We attempt to run the poll completion inline, but we're using trylock to
do so. This avoids a deadlock since we're grabbing the locks in reverse
order at this point: we already hold the poll wq lock and we're trying
to grab the completion lock, while the normal rules are the reverse of
that order.
      
      IO completion for a timeout link will need to grab the completion lock,
      but that's not safe from this context. Put the completion under the
      completion_lock in io_poll_wake(), and mark the request as entering
      the completion with the completion_lock already held.
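
A hedged sketch of the resulting io_poll_wake() completion path; the
helper name __io_poll_complete() is a placeholder here, and
REQ_F_COMP_LOCKED is assumed to be the marker the message refers to
("entering the completion with the completion_lock already held"):

    /* take the completion lock unconditionally (irqsave: we are in the
     * poll waitqueue callback), complete inline, and flag the request so
     * any linked-timeout completion knows the lock is already held */
    spin_lock_irqsave(&ctx->completion_lock, flags);
    req->flags |= REQ_F_COMP_LOCKED;
    __io_poll_complete(req, mask);          /* placeholder: fill the CQE */
    spin_unlock_irqrestore(&ctx->completion_lock, flags);

    io_cqring_ev_posted(ctx);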
      
      Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
•
      io_uring: use correct "is IO worker" helper · cf8c3e33
Jens Axboe committed
      commit 960e432dfa5927892a9b170d14de874597b84849 upstream.
      
      Since we switched to io-wq, the dependent link optimization for when to
      pass back work inline has been broken. Fix this by providing a suitable
      io-wq helper for io_uring to use to detect when to do this.
      
      Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
•
      io_uring: make timeout sequence == 0 mean no sequence · 2eca9a36
Jens Axboe committed
      commit 93bd25bb69f46367ba8f82c578e0c05702ceb482 upstream.
      
      Currently we make sequence == 0 be the same as sequence == 1, but that's
      not super useful if the intent is really to have a timeout that's just
      a pure timeout.
      
If the user passes in sqe->off == 0, then don't apply any sequence logic
to the request; let it be driven purely by the specified timeout.
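
A sketch of the prep-side logic (REQ_F_TIMEOUT_NOSEQ is assumed to be
the flag used to mark such "pure" timeouts; treat the details as
illustrative):

    count = READ_ONCE(sqe->off);
    if (!count) {
            /* sqe->off == 0: pure timeout, no sequence bookkeeping */
            req->flags |= REQ_F_TIMEOUT_NOSEQ;
    } else {
            req->sequence = ctx->cached_sq_head + count - 1;
    }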
Reported-by: 李通洲 <carter.li@eoitek.com>
Reviewed-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
•
      io_uring: fix -ENOENT issue with linked timer with short timeout · c564c48d
Jens Axboe committed
      commit 76a46e066e2d93bd333599d1c84c605c2c4cc909 upstream.
      
If you prep a read (for example) that needs to get punted to async
context with a timer, and the timeout is sufficiently short, the timer
request will get completed with -ENOENT as it could not find the read.
      
      The issue is that we prep and start the timer before we start the read.
      Hence the timer can trigger before the read is even started, and the end
      result is then that the timer completes with -ENOENT, while the read
      starts instead of being cancelled by the timer.
      
      Fix this by splitting the linked timer into two parts:
      
      1) Prep and validate the linked timer
      2) Start timer
      
      The read is then started between steps 1 and 2, so we know that the
      timer will always have a consistent view of the read request state.
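
A sketch of the resulting submission flow (io_prep_linked_timeout() and
io_queue_linked_timeout() mirror the upstream helper names;
issue_request() is a placeholder for actually starting the guarded
request):

    struct io_kiocb *nxt;

    nxt = io_prep_linked_timeout(req);      /* step 1: prep and validate */
    ret = issue_request(req);               /* start the read itself */
    if (nxt)
            io_queue_linked_timeout(nxt);   /* step 2: only now arm the timer */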
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>