1. 11 December 2019, 6 commits
  2. 05 December 2019, 7 commits
    • io_uring: fix a typo in a comment · 0b4295b5
      Committed by LimingWu
      thatn -> than.
      Signed-off-by: Liming Wu <19092205@suning.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b4295b5
    • io_uring: hook all linked requests via link_list · 4493233e
      Committed by Pavel Begunkov
      Links are created by chaining requests through req->list, with the
      exception that the head uses req->link_list (e.g. link_list->list->list).
      Because of that, io_req_link_next() needs complex splicing to advance.

      Link them all through link_list instead. This is also simpler and more
      consistent.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4493233e
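      A minimal stand-alone C sketch of the idea (the struct and field names
      here are illustrative, not the kernel's): once every request in a link
      hangs off the same chain, advancing past the head is a constant-time
      pop rather than a splice between two different lists.

      #include <stdio.h>

      struct req {
              int id;
              struct req *link_next;          /* next request in the link chain */
      };

      /* Detach the current head from the chain and return the next request. */
      static struct req *req_link_next(struct req *head)
      {
              struct req *nxt = head->link_next;

              head->link_next = NULL;
              return nxt;
      }

      int main(void)
      {
              struct req c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };

              for (struct req *r = &a; r != NULL; r = req_link_next(r))
                      printf("running req %d\n", r->id);
              return 0;
      }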
    • io_uring: fix error handling in io_queue_link_head · 2e6e1fde
      Committed by Pavel Begunkov
      In case of an error, io_submit_sqe() drops the request and continues
      without it, even if the request was part of a link. Not only does it
      fail to cancel the link, it may also execute the wrong sequence of
      actions.

      Stop consuming sqes and let the user handle errors.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2e6e1fde
    • io_uring: use hash table for poll command lookups · 78076bb6
      Committed by Jens Axboe
      We recently changed this from a single list to an rbtree, but for some
      real-life workloads the rbtree slows down the submission/insertion
      case enough that it becomes the top cycle consumer on the io_uring
      side. In testing, a hash table is a better-rounded compromise: it is
      fast for insertion and, as long as it is sized appropriately, it also
      works well for the cancellation case. Running TAO with a lot of
      network sockets, this stops io_poll_req_insert() from consuming 2% of
      the CPU cycles.
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      78076bb6
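      The trade-off in sketch form, using the kernel's generic hashtable.h
      helpers (the structure, key choice and sizing here are illustrative,
      not the actual fs/io_uring.c code): insertion is a constant-time
      bucket prepend, and cancellation only scans one bucket.

      #include <linux/hashtable.h>

      #define POLL_HASH_BITS 8                 /* 256 buckets; size to the workload */

      struct poll_req {
              void *target;                    /* the object being polled on */
              struct hlist_node hash_node;
      };

      static DEFINE_HASHTABLE(poll_table, POLL_HASH_BITS);

      static void poll_req_insert(struct poll_req *req)
      {
              hash_add(poll_table, &req->hash_node, (unsigned long)req->target);
      }

      static struct poll_req *poll_req_find(void *target)
      {
              struct poll_req *req;

              hash_for_each_possible(poll_table, req, hash_node,
                                     (unsigned long)target) {
                      if (req->target == target)
                              return req;
              }
              return NULL;
      }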
    • io-wq: clear node->next on list deletion · 08bdcc35
      Committed by Jens Axboe
      If someone removes a node from a list, and then later adds it back to
      a list, we can have invalid data in ->next. This can cause all sorts
      of issues. One such use case is the IORING_OP_POLL_ADD command, which
      will do just that if we race and get woken twice without any pending
      events. This is a pretty rare case, but can happen under extreme loads.
      Dan reports that he saw the following crash:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
      Oops: 0002 [#1] SMP
      CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      RIP: 0010:io_wqe_enqueue+0x3e/0xd0
      Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
      RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
      RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
      RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
      RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
      R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
      FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <IRQ>
       io_poll_wake+0x12f/0x2a0
       __wake_up_common+0x86/0x120
       __wake_up_common_lock+0x7a/0xc0
       sock_def_readable+0x3c/0x70
       tcp_rcv_established+0x557/0x630
       tcp_v6_do_rcv+0x118/0x3c0
       tcp_v6_rcv+0x97e/0x9d0
       ip6_protocol_deliver_rcu+0xe3/0x440
       ip6_input+0x3d/0xc0
       ? ip6_protocol_deliver_rcu+0x440/0x440
       ipv6_rcv+0x56/0xd0
       ? ip6_rcv_finish_core.isra.18+0x80/0x80
       __netif_receive_skb_one_core+0x50/0x70
       netif_receive_skb_internal+0x2f/0xa0
       napi_gro_receive+0x125/0x150
       mlx5e_handle_rx_cqe+0x1d9/0x5a0
       ? mlx5e_poll_tx_cq+0x305/0x560
       mlx5e_poll_rx_cq+0x49f/0x9c5
       mlx5e_napi_poll+0xee/0x640
       ? smp_reschedule_interrupt+0x16/0xd0
       ? reschedule_interrupt+0xf/0x20
       net_rx_action+0x286/0x3d0
       __do_softirq+0xca/0x297
       irq_exit+0x96/0xa0
       do_IRQ+0x54/0xe0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0033:0x7fdc627a2e3a
      Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
      
      when running a networked workload with about 5000 sockets being polled
      for. Fix this by clearing node->next when the node is being removed from
      the list.
      
      Fixes: 6206f0e1 ("io-wq: shrink io_wq_work a bit")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08bdcc35
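      The essence of the fix in a small stand-alone sketch (the list type is
      illustrative): clear ->next when a node is unlinked, so a later
      re-insertion can never see stale pointer data.

      struct wq_node {
              struct wq_node *next;
      };

      /* Remove @node, whose predecessor is @prev (NULL if @node is the head). */
      static void wq_node_del(struct wq_node **head, struct wq_node *node,
                              struct wq_node *prev)
      {
              if (prev)
                      prev->next = node->next;
              else
                      *head = node->next;
              node->next = NULL;      /* the fix: no stale ->next left behind */
      }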
    • io_uring: ensure deferred timeouts copy necessary data · 2d28390a
      Committed by Jens Axboe
      If we defer a timeout, we should ensure that we copy the timespec
      when we have consumed the sqe. This is similar to commit f67676d1
      for read/write requests. We already did this correctly for timeouts
      deferred as links, but do it generally and use the infrastructure added
      by commit 1a6b74fc instead of having the timeout deferral use its
      own.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d28390a
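      A sketch of the pattern (the struct and field names are assumptions,
      not the exact io_uring ones): everything the timeout needs is copied
      out of the SQE and out of user memory at prep time, because both may
      be reused or gone by the time a deferred request finally runs.

      #include <linux/slab.h>

      struct io_timeout_data {
              struct timespec64 ts;            /* copied when the sqe is consumed */
      };

      static int io_timeout_prep(struct io_kiocb *req,
                                 const struct io_uring_sqe *sqe)
      {
              struct io_timeout_data *data;
              int ret;

              data = kmalloc(sizeof(*data), GFP_KERNEL);
              if (!data)
                      return -ENOMEM;
              ret = get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr));
              if (ret) {
                      kfree(data);
                      return ret;
              }
              req->timeout_data = data;        /* survives deferral of the request */
              return 0;
      }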
    • io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT · 901e59bb
      Committed by Jens Axboe
      There's really no reason to forbid things like link/drain etc. on
      regular timeout commands. Enable the usual SQE flags on timeouts.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      901e59bb
  3. 04 December 2019, 1 commit
  4. 03 December 2019, 9 commits
  5. 02 December 2019, 1 commit
    • io_uring: use current task creds instead of allocating a new one · 0b8c0ec7
      Committed by Jens Axboe
      syzbot reports:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
        kthread+0x361/0x430 kernel/kthread.c:255
        ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Modules linked in:
      ---[ end trace f2e1a4307fbe2245 ]---
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is caused by slab fault injection triggering a failure in
      prepare_creds(). We don't actually need to create a copy of the creds,
      as we're not modifying them; we just need a reference to the current
      task's creds. This avoids the failure case as well, and propagates the
      const throughout the stack.
      
      Fixes: 181e448d ("io_uring: async workers should inherit the user creds")
      Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b8c0ec7
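      The core of the change in sketch form (the ring-context struct is
      assumed): take a reference to the task's existing credentials instead
      of allocating a modifiable copy, so there is no allocation left for
      fault injection to fail; the reference is dropped at teardown.

      #include <linux/cred.h>

      static void io_ring_ctx_init_creds(struct io_ring_ctx *ctx)
      {
              /* Before: ctx->creds = prepare_creds(), which allocates and can
               * return NULL.  After: just grab a reference; this cannot fail,
               * and ctx->creds becomes a const struct cred pointer. */
              ctx->creds = get_current_cred();
      }

      static void io_ring_ctx_free_creds(struct io_ring_ctx *ctx)
      {
              put_cred(ctx->creds);            /* drop the reference */
      }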
  6. 30 November 2019, 1 commit
    • io_uring: fix missing kmap() declaration on powerpc · aa4c3967
      Committed by Jens Axboe
      Christophe reports that current master fails building on powerpc with
      this error:
      
         CC      fs/io_uring.o
      fs/io_uring.c: In function ‘loop_rw_iter’:
      fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
      [-Werror=implicit-function-declaration]
           iovec.iov_base = kmap(iter->bvec->bv_page)
                            ^
      fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
      without a cast [-Wint-conversion]
           iovec.iov_base = kmap(iter->bvec->bv_page)
                          ^
      fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
      [-Werror=implicit-function-declaration]
           kunmap(iter->bvec->bv_page);
           ^
      
      which is caused by a missing highmem.h include. Fix it by including
      it.
      
      Fixes: 311ae9e1 ("io_uring: fix dead-hung for non-iter fixed rw")
      Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aa4c3967
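      For reference, kmap()/kunmap() are declared in linux/highmem.h, so the
      fix amounts to one include next to the code quoted in the diagnostics
      above (shown here only as an illustration of the call pattern):

      #include <linux/highmem.h>               /* declares kmap() and kunmap() */

              iovec.iov_base = kmap(iter->bvec->bv_page);
              /* ... do the non-iter fixed read/write on the mapped page ... */
              kunmap(iter->bvec->bv_page);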
  7. 29 November 2019, 1 commit
  8. 28 November 2019, 1 commit
  9. 27 November 2019, 8 commits
    • io_uring: make poll->wait dynamically allocated · e944475e
      Committed by Jens Axboe
      In the quest to bring io_kiocb down to 3 cachelines, this one does
      the trick. Make the wait_queue_entry for the poll command come out
      of kmalloc instead of embedding it in struct io_poll_iocb, as the
      latter is the largest member of io_kiocb. Once we trim this down a
      bit, we're back at a healthy 192 bytes for struct io_kiocb.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e944475e
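      In sketch form (the field and callback types are assumptions, loosely
      following the symbols visible in the oops quoted earlier): the
      wait_queue_entry is kmalloc'ed and initialised on demand instead of
      being embedded in struct io_poll_iocb, which shrinks every io_kiocb.

      #include <linux/slab.h>
      #include <linux/wait.h>

      struct io_poll_iocb {
              struct file *file;
              struct wait_queue_entry *wait;   /* was embedded, now allocated */
      };

      static int io_poll_alloc_wait(struct io_poll_iocb *poll,
                                    wait_queue_func_t wake_func)
      {
              poll->wait = kmalloc(sizeof(*poll->wait), GFP_KERNEL);
              if (!poll->wait)
                      return -ENOMEM;
              init_waitqueue_func_entry(poll->wait, wake_func);
              return 0;
      }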
    • io-wq: shrink io_wq_work a bit · 6206f0e1
      Committed by Jens Axboe
      Currently we're using 40 bytes for the io_wq_work structure, and 16 of
      those are the doubly linked list node. We don't need a doubly linked
      list: we always add to the tail to keep things ordered, and any other
      use case is list traversal with deletion. For the deletion case, we
      can easily support deleting any node by keeping track of the previous
      entry.

      This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
      io_uring from 216 to 208 bytes.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6206f0e1
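      A stand-alone sketch of the resulting structure (names approximate the
      io-wq ones but are illustrative): a singly linked list with a tail
      pointer covers both ordered appends and delete-during-traversal, at
      half the per-node cost of a doubly linked list.

      struct wq_work_node {
              struct wq_work_node *next;
      };

      struct wq_work_list {
              struct wq_work_node *first;
              struct wq_work_node *last;
      };

      /* Appending to the tail is all the ordering the work queue needs. */
      static void wq_list_add_tail(struct wq_work_list *list,
                                   struct wq_work_node *node)
      {
              node->next = NULL;
              if (!list->first)
                      list->first = node;
              else
                      list->last->next = node;
              list->last = node;
      }

      /* Deletion happens during traversal, where the previous node is
       * already known, so no per-node back pointer is required. */
      static void wq_list_del(struct wq_work_list *list,
                              struct wq_work_node *node,
                              struct wq_work_node *prev)
      {
              if (!prev)
                      list->first = node->next;
              else
                      prev->next = node->next;
              if (list->last == node)
                      list->last = prev;
      }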
    • io-wq: fix handling of NUMA node IDs · 3fc50ab5
      Committed by Jann Horn
      There are several things that can go wrong in the current code on NUMA
      systems, especially if not all nodes are online all the time:
      
       - If the identifiers of the online nodes do not form a single contiguous
         block starting at zero, wq->wqes will be too small, and OOB memory
         accesses will occur e.g. in the loop in io_wq_create().
       - If a node comes online between the call to num_online_nodes() and the
         for_each_node() loop in io_wq_create(), an OOB write will occur.
       - If a node comes online between io_wq_create() and io_wq_enqueue(), a
         lookup is performed for an element that doesn't exist, and an OOB read
         will probably occur.
      
      Fix it by:
      
       - using nr_node_ids instead of num_online_nodes() for the allocation size;
         nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
         highest node ID that could possibly come online at some point, even if
         those nodes' identifiers are not a contiguous block
       - creating workers for all possible CPUs, not just all online ones
      
      This is basically what the normal workqueue code also does, as far as I can
      tell.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3fc50ab5
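      A sketch of the allocation pattern (struct io_wqe stands in for the
      real per-node data, and the real setup does more): size the array by
      nr_node_ids, which bounds every node ID that can ever come online, and
      initialise an entry for every possible node rather than only the
      currently online ones.

      #include <linux/nodemask.h>
      #include <linux/slab.h>

      struct io_wqe {                          /* stand-in for the per-node state */
              int node;
      };

      static struct io_wqe **alloc_per_node_wqes(void)
      {
              struct io_wqe **wqes;
              int node;

              wqes = kcalloc(nr_node_ids, sizeof(*wqes), GFP_KERNEL);
              if (!wqes)
                      return NULL;

              for_each_node(node) {            /* all possible nodes, not just online */
                      wqes[node] = kzalloc_node(sizeof(**wqes), GFP_KERNEL, node);
                      if (!wqes[node])
                              goto err;
              }
              return wqes;
      err:
              for_each_node(node)
                      kfree(wqes[node]);       /* kfree(NULL) is a no-op */
              kfree(wqes);
              return NULL;
      }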
    • io_uring: use kzalloc instead of kcalloc for single-element allocations · ad6e005c
      Committed by Jann Horn
      These allocations are single-element allocations, so don't use the
      array allocation wrapper for them.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ad6e005c
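      The change in a nutshell, with an illustrative destination variable
      (not necessarily one the patch itself touches):

              /* before: the array helper, for exactly one element */
              table = kcalloc(1, sizeof(*table), GFP_KERNEL);

              /* after: a plain zeroed single-element allocation */
              table = kzalloc(sizeof(*table), GFP_KERNEL);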
    • io_uring: cleanup io_import_fixed() · 7d009165
      Committed by Pavel Begunkov
      Clean up the io_import_fixed() call site and make it return a proper
      type.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7d009165
    • io_uring: inline struct sqe_submit · cf6fd4bd
      Committed by Pavel Begunkov
      There is no point left in keeping struct sqe_submit. Inline it into
      struct io_kiocb, so any req->submit.field is now just req->field.

      - moves initialisation of ring_file into io_get_req()
      - removes duplicated req->sequence
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cf6fd4bd
    • io_uring: store timeout's sqe->off in proper place · cc42e0ac
      Committed by Pavel Begunkov
      A timeout's sequence offset (i.e. sqe->off) is stored in
      req->submit.sequence under a misleading name. Keep it in timeout.data
      instead. The unused space for sequence will be reclaimed in the
      following patches.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cc42e0ac
    • Revert "vfs: properly and reliably lock f_pos in fdget_pos()" · 2be7d348
      Committed by Linus Torvalds
      This reverts commit 0be0ee71.
      
      I was hoping it would be benign to switch over entirely to FMODE_STREAM,
      and we'd have just a couple of small fixups we'd need, but it looks like
      we're not quite there yet.
      
      While it worked fine on both my desktop and laptop, they are fairly
      similar in other respects, and run mostly the same loads.  Kenneth
      Crudup reports that it seems to break both his vmware installation and
      the KDE upower service, in both cases apparently leading to timeouts
      due to waiting for the f_pos lock.
      
      There are a number of character devices in particular that definitely
      want stream-like behavior, but that currently don't get marked as
      streams, and as a result get the exclusion between concurrent
      read()/write() on the same file descriptor.  Which doesn't work well for
      them.
      
      The most obvious example of this is /dev/console and /dev/tty, which
      use console_fops and tty_fops respectively (and ptmx_fops for the pty
      master side).  It may be that it's just this that causes problems, but
      we clearly weren't ready yet.
      
      Because there's a number of other likely common cases that don't have
      llseek implementations and would seem to act as stream devices:
      
        /dev/fuse		(fuse_dev_operations)
        /dev/mcelog		(mce_chrdev_ops)
        /dev/mei0		(mei_fops)
        /dev/net/tun		(tun_fops)
        /dev/nvme0		(nvme_dev_fops)
        /dev/tpm0		(tpm_fops)
        /proc/self/ns/mnt	(ns_file_operations)
        /dev/snd/pcm*		(snd_pcm_f_ops[])
      
      and while some of these could be trivially automatically detected by the
      vfs layer when the character device is opened by just noticing that they
      have no read or write operations either, it often isn't that obvious.
      
      Some character devices most definitely do use the file position, even if
      they don't allow seeking: the firmware update code, for example, uses
      simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
      back and forth.
      
      We'll revisit this when there's a better way to detect the problem and
      fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
      annotations).
      Reported-by: Kenneth R. Crudup <kenny@panix.com>
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2be7d348
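      For context, the opt-in the message alludes to looks like this in a
      driver (the device and function names are made up): calling
      stream_open() from .open sets FMODE_STREAM, marking the file
      descriptor as position-less and exempting it from the f_pos
      serialisation discussed above.

      #include <linux/fs.h>
      #include <linux/module.h>

      static ssize_t mydev_read(struct file *file, char __user *buf,
                                size_t len, loff_t *ppos);
      static ssize_t mydev_write(struct file *file, const char __user *buf,
                                 size_t len, loff_t *ppos);

      static int mydev_open(struct inode *inode, struct file *file)
      {
              return stream_open(inode, file);     /* sets FMODE_STREAM */
      }

      static const struct file_operations mydev_fops = {
              .owner  = THIS_MODULE,
              .open   = mydev_open,
              .read   = mydev_read,                /* *ppos is not meaningful */
              .write  = mydev_write,
              .llseek = no_llseek,
      };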
  10. 26 November 2019, 5 commits