1. 16 6月, 2020 26 次提交
  2. 12 6月, 2020 2 次提交
    • J
      io_uring: check file O_NONBLOCK state for accept · e8758c9b
      Jiufei Xue 提交于
      task #27774850
      
      commit e697deed834de15d2322d0619d51893022c90ea2 upstream.
      
      If the socket is O_NONBLOCK, we should complete the accept request
      with -EAGAIN when data is not ready.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e8758c9b
    • J
      ext4: fix partial cluster initialization when splitting extent · b4a0105f
      Jeffle Xu 提交于
      fix #27212891
      
      Fix the bug when calculating the physical block number of the first
      block in the split extent.
      
      This bug will cause xfstests shared/298 failure on ext4 with bigalloc
      enabled occasionally. Ext4 error messages indicate that previously freed
      blocks are being freed again, and the following fsck will fail due to
      the inconsistency of block bitmap and bg descriptor.
      
      The following is an example case:
      
      1. First, Initialize a ext4 filesystem with cluster size '16K', block size
      '4K', in which case, one cluster contains four blocks.
      
      2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
      tree of this file is like:
      
      ...
      36864:[0]4:220160
      36868:[0]14332:145408
      51200:[0]2:231424
      ...
      
      3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
      like:
      
      ..
      ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
      ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
      ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
      ...
      
      4. Then the extent tree of this file after punching is like
      
      ...
      49507:[0]37:158047
      49547:[0]58:158087
      ...
      
      5. Detailed procedure of punching hole [49544, 49546]
      
      5.1. The block address space:
      ```
      lblk        ~49505  49506   49507~49543     49544~49546    49547~
      	  ---------+------+-------------+----------------+--------
      	    extent | hole |   extent	|	hole	 | extent
      	  ---------+------+-------------+----------------+--------
      pblk       ~158045  158046  158047~158083  158084~158086   158087~
      ```
      
      5.2. The detailed layout of cluster 39521:
      ```
      		cluster 39521
      	<------------------------------->
      
      		hole		  extent
      	<----------------------><--------
      
      lblk      49544   49545   49546   49547
      	+-------+-------+-------+-------+
      	|	|	|	|	|
      	+-------+-------+-------+-------+
      pblk     158084  1580845  158086  158087
      ```
      
      5.3. The ftrace output when punching hole [49544, 49546]:
      - ext4_ext_remove_space (start 49544, end 49546)
        - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
          - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
            - ext4_free_blocks: (block 158084 count 4)
              - ext4_mballoc_free (extent 1/6753/1)
      
      5.4. Ext4 error message in dmesg:
      EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
      EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters
      
      In this case, the whole cluster 39521 is freed mistakenly when freeing
      pblock 158084~158086 (i.e., the first three blocks of this cluster),
      although pblock 158087 (the last remaining block of this cluster) has
      not been freed yet.
      
      The root cause of this isuue is that, the pclu of the partial cluster is
      calculated mistakenly in ext4_ext_remove_space(). The correct
      partial_cluster.pclu (i.e., the cluster number of the first block in the
      next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
      than 39522.
      
      Fixes: f4226d9e ("ext4: fix partial cluster initialization")
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: NEric Whitney <enwlinux@gmail.com>
      Cc: stable@kernel.org # v3.19+
      Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b4a0105f
  3. 09 6月, 2020 1 次提交
  4. 04 6月, 2020 11 次提交
    • X
      io_uring: reset -EBUSY error when io sq thread is waken up · 2abbba20
      Xiaoguang Wang 提交于
      to #28170604
      
      commit d4ae271dfaae2a5f41c015f2f20d62a1deeec734 upstream
      
      In io_sq_thread(), currently if we get an -EBUSY error and go to sleep,
      we will won't clear it again, which will result in io_sq_thread() will
      never have a chance to submit sqes again. Below test program test.c
      can reveal this bug:
      
      int main(int argc, char *argv[])
      {
              struct io_uring ring;
              int i, fd, ret;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;
              struct iovec *iovecs;
              void *buf;
              struct io_uring_params p;
      
              if (argc < 2) {
                      printf("%s: file\n", argv[0]);
                      return 1;
              }
      
              memset(&p, 0, sizeof(p));
              p.flags = IORING_SETUP_SQPOLL;
              ret = io_uring_queue_init_params(4, &ring, &p);
              if (ret < 0) {
                      fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                      return 1;
              }
      
              fd = open(argv[1], O_RDONLY | O_DIRECT);
              if (fd < 0) {
                      perror("open");
                      return 1;
              }
      
              iovecs = calloc(10, sizeof(struct iovec));
              for (i = 0; i < 10; i++) {
                      if (posix_memalign(&buf, 4096, 4096))
                              return 1;
                      iovecs[i].iov_base = buf;
                      iovecs[i].iov_len = 4096;
              }
      
              ret = io_uring_register_files(&ring, &fd, 1);
              if (ret < 0) {
                      fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
                      return ret;
              }
      
              for (i = 0; i < 10; i++) {
                      sqe = io_uring_get_sqe(&ring);
                      if (!sqe)
                              break;
      
                      io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
                      sqe->flags |= IOSQE_FIXED_FILE;
      
                      ret = io_uring_submit(&ring);
                      sleep(1);
                      printf("submit %d\n", i);
              }
      
              for (i = 0; i < 10; i++) {
                      io_uring_wait_cqe(&ring, &cqe);
                      printf("receive: %d\n", i);
                      if (cqe->res != 4096) {
                              fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
                              ret = 1;
                      }
                      io_uring_cqe_seen(&ring, cqe);
              }
      
              close(fd);
              io_uring_queue_exit(&ring);
              return 0;
      }
      sudo ./test testfile
      above command will hang on the tenth request, to fix this bug, when io
      sq_thread is waken up, we reset the variable 'ret' to be zero.
      Suggested-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2abbba20
    • J
      io_uring: don't add non-IO requests to iopoll pending list · 4d7ac019
      Jens Axboe 提交于
      to #28170604
      
      commit b532576ed39efe3b351ae8320b2ab67a4c4c3719 upstream
      
      We normally disable any commands that aren't specifically poll commands
      for a ring that is setup for polling, but we do allow buffer provide and
      remove commands to support buffer selection for polled IO. Once a
      request is issued, we add it to the poll list to poll for completion. But
      we should not do that for non-IO commands, as those request complete
      inline immediately and aren't pollable. If we do, we can leave requests
      on the iopoll list after they are freed.
      
      Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4d7ac019
    • B
      io_uring: don't use kiocb.private to store buf_index · d90653af
      Bijan Mottahedeh 提交于
      to #28170604
      
      commit 4f4eeba87cc731b200bff9372d14a80f5996b277 upstream
      
      kiocb.private is used in iomap_dio_rw() so store buf_index separately.
      Signed-off-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com>
      
      Move 'buf_index' to a hole in io_kiocb.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d90653af
    • J
      io_uring: cancel work if task_work_add() fails · 44e75116
      Jens Axboe 提交于
      to #28170604
      
      commit e3aabf9554fd04eb14cd44ae7583fc9d40edd250 upstream
      
      We currently move it to the io_wqe_manager for execution, but we cannot
      safely do so as we may lack some of the state to execute it out of
      context. As we cancel work anyway when the ring/task exits, just mark
      this request as canceled and io_async_task_func() will do the right
      thing.
      
      Fixes: aa96bf8a9ee3 ("io_uring: use io-wq manager as backup task if task is exiting")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      44e75116
    • J
      io_uring: remove dead check in io_splice() · f1645783
      Jens Axboe 提交于
      to #28170604
      
      commit 948a7749454b1712f1b2f2429f9493eb3e4a89b0 upstream
      
      We checked for 'force_nonblock' higher up, so it's definitely false
      at this point. Kill the check, it's a remnant of when we tried to do
      inline splice without always punting to async context.
      
      Fixes: 2fb3e82284fc ("io_uring: punt splice async because of inode mutex")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f1645783
    • P
      io_uring: fix FORCE_ASYNC req preparation · c8fc1875
      Pavel Begunkov 提交于
      to #28170604
      
      commit bd2ab18a1d6267446eae1b47dd839050452bdf7f upstream
      
      As for other not inlined requests, alloc req->io for FORCE_ASYNC reqs,
      so they can be prepared properly.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c8fc1875
    • P
      io_uring: don't prepare DRAIN reqs twice · 3c2eefb7
      Pavel Begunkov 提交于
      to #28170604
      
      commit 650b548129b60b0d23508351800108196f4aa89f upstream
      
      If req->io is not NULL, it's already prepared. Don't do it again,
      it's dangerous.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3c2eefb7
    • J
      io_uring: initialize ctx->sqo_wait earlier · c73bbba3
      Jens Axboe 提交于
      to #28170604
      
      commit 583863ed918136412ddf14de2e12534f17cfdc6f upstream
      
      Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated,
      instead of deferring it to the offload setup. This fixes a syzbot
      reported lockdep complaint, which is really due to trying to wake_up
      on an uninitialized wait queue:
      
      RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319
      RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b
      RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260
      R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000
      INFO: trying to register non-static key.
      the code is fine but needs lockdep annotation.
      turning off the locking correctness validator.
      CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x188/0x20d lib/dump_stack.c:118
       assign_lock_key kernel/locking/lockdep.c:913 [inline]
       register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
       __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234
       lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
       __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122
       io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160
       io_poll_remove_all fs/io_uring.c:4357 [inline]
       io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305
       io_uring_create fs/io_uring.c:7843 [inline]
       io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x441319
      Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      
      Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c73bbba3
    • J
      io_uring: polled fixed file must go through free iteration · b850f336
      Jens Axboe 提交于
      to #28170604
      
      commit 9d9e88a24c1f20ebfc2f28b1762ce78c0b9e1cb3 upstream
      
      When we changed the file registration handling, it became important to
      iterate the bulk request freeing list for fixed files as well, or we
      miss dropping the fixed file reference. If not, we're leaking references,
      and we'll get a kworker stuck waiting for file references to disappear.
      
      This also means we can remove the special casing of fixed vs non-fixed
      files, we need to iterate for both and we can just rely on
      __io_req_aux_free() doing io_put_file() instead of doing it manually.
      
      Fixes: 055895537302 ("io_uring: refactor file register/unregister/update handling")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b850f336
    • P
      io_uring: fix zero len do_splice() · a49e9093
      Pavel Begunkov 提交于
      to #28170604
      
      commit c96874265cd04b4bd4a8e114ac9af039a6d83cfe upstream
      
      do_splice() doesn't expect len to be 0. Just always return 0 in this
      case as splice(2) does.
      
      Fixes: 7d67af2c0134 ("io_uring: add splice(2) support")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a49e9093
    • J
      io_uring: don't use 'fd' for openat/openat2/statx · bb87d8ed
      Jens Axboe 提交于
      to #28170604
      
      commit 63ff822358b276137059520cf16e587e8073e80f upstream
      
      We currently make some guesses as when to open this fd, but in reality
      we have no business (or need) to do so at all. In fact, it makes certain
      things fail, like O_PATH.
      
      Remove the fd lookup from these opcodes, we're just passing the 'fd' to
      generic helpers anyway. With that, we can also remove the special casing
      of fd values in io_req_needs_file(), and the 'fd_non_neg' check that
      we have. And we can ensure that we only read sqe->fd once.
      
      This fixes O_PATH usage with openat/openat2, and ditto statx path side
      oddities.
      
      Cc: stable@vger.kernel.org: # v5.6
      Reported-by: NMax Kellermann <mk@cm4all.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bb87d8ed