1. 12 December 2019 (1 commit)
  2. 10 December 2019 (1 commit)
  3. 07 December 2019 (4 commits)
  4. 06 December 2019 (2 commits)
    • Merge branch 'io_uring-5.5' into for-linus · 85394299
      Committed by Jens Axboe
      * io_uring-5.5:
        io_uring: fix a typo in a comment
        io_uring: hook all linked requests via link_list
        io_uring: fix error handling in io_queue_link_head
        io_uring: use hash table for poll command lookups
      85394299
    • block: fix memleak of bio integrity data · ece841ab
      Committed by Justin Tee
      Commit 7c20f116 ("bio-integrity: stop abusing bi_end_io") moved
      bio_integrity_free() from bio_uninit() to bio_integrity_verify_fn()
      and bio_endio(). That is wrong because a bio may be freed without
      bio_endio() ever being called; for example, blk_rq_unprep_clone()
      is called from dm_mq_queue_rq() when the underlying queue of
      dm-mpath is busy.

      So commit 7c20f116 leaks the bio integrity data.

      Fix this by re-adding bio_integrity_free() to bio_uninit().
      
      Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Justin Tee <justin.tee@broadcom.com>

      Add commit log, and simplify/fix the original patch written by Justin.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ece841ab
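      A minimal standalone C model of the fix (hypothetical types and
      names, not the actual block-layer code): freeing the optional
      integrity payload in the bio's own uninit routine means every free
      path releases it, whether or not bio_endio() ever ran.

          #include <stdlib.h>

          /* Hypothetical stand-ins for struct bio and its integrity payload. */
          struct bio_integrity_payload { unsigned char *bip_buf; };

          struct bio {
                  struct bio_integrity_payload *bi_integrity; /* may be NULL */
          };

          static void bio_integrity_free(struct bio *bio)
          {
                  if (!bio->bi_integrity)
                          return;
                  free(bio->bi_integrity->bip_buf);
                  free(bio->bi_integrity);
                  bio->bi_integrity = NULL;
          }

          /* Freeing here covers every free path, including bios torn down
           * without completing (e.g. a clone released by blk_rq_unprep_clone()). */
          static void bio_uninit(struct bio *bio)
          {
                  bio_integrity_free(bio);
          }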
  5. 05 December 2019 (11 commits)
    • io_uring: fix a typo in a comment · 0b4295b5
      Committed by LimingWu
      thatn -> than.
      Signed-off-by: Liming Wu <19092205@suning.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b4295b5
    • bfq-iosched: Ensure bio->bi_blkg is valid before using it · 08802ed6
      Committed by Hou Tao
      bio->bi_blkg will be NULL when the request was issued while
      bypassing the block layer, as shown in the following oops:
      
       Internal error: Oops: 96000005 [#1] SMP
       CPU: 17 PID: 2996 Comm: scsi_id Not tainted 5.4.0 #4
       Call trace:
        percpu_counter_add_batch+0x38/0x4c8
        bfqg_stats_update_legacy_io+0x9c/0x280
        bfq_insert_requests+0xbac/0x2190
        blk_mq_sched_insert_request+0x288/0x670
        blk_execute_rq_nowait+0x140/0x178
        blk_execute_rq+0x8c/0x140
        sg_io+0x604/0x9c0
        scsi_cmd_ioctl+0xe38/0x10a8
        scsi_cmd_blk_ioctl+0xac/0xe8
        sd_ioctl+0xe4/0x238
        blkdev_ioctl+0x590/0x20e0
        block_ioctl+0x60/0x98
        do_vfs_ioctl+0xe0/0x1b58
        ksys_ioctl+0x80/0xd8
        __arm64_sys_ioctl+0x40/0x78
        el0_svc_handler+0xc4/0x270
      
      So ensure that it is valid before using it.
      
      Fixes: fd41e603 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08802ed6
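      The shape of the fix, as a hedged standalone C sketch (hypothetical
      types; the real patch guards the bfqg stats update in the kernel):

          #include <stddef.h>

          struct blkg { long bytes; long ios; };
          struct bio  { struct blkg *bi_blkg; long bi_size; };

          static void stats_update(struct bio *bio)
          {
                  /* Requests issued around the block layer (e.g. SG_IO) can
                   * arrive with bi_blkg == NULL, so bail out early. */
                  if (!bio->bi_blkg)
                          return;

                  bio->bi_blkg->bytes += bio->bi_size;
                  bio->bi_blkg->ios++;
          }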
    • io_uring: hook all linked requests via link_list · 4493233e
      Committed by Pavel Begunkov
      Links are created by chaining requests through req->list, with the
      exception that the head uses req->link_list (e.g.
      link_list->list->list). Because of that, io_req_link_next() needs
      complex splicing to advance.

      Hook them all through link_list instead; it is also simpler and
      more consistent.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4493233e
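      The idea, modeled in standalone C with a minimal circular
      doubly-linked list (hypothetical names, not the io_uring code):
      when the head and the members share the same kind of hook,
      advancing a link is a constant-time "take the next entry" with no
      splicing.

          #include <stddef.h>
          #include <stdio.h>

          struct list_head { struct list_head *next, *prev; };

          static void list_init(struct list_head *h) { h->next = h->prev = h; }

          static void list_add_tail(struct list_head *n, struct list_head *h)
          {
                  n->prev = h->prev;
                  n->next = h;
                  h->prev->next = n;
                  h->prev = n;
          }

          struct req {
                  int id;
                  struct list_head link_list; /* same hook for head and members */
          };

          int main(void)
          {
                  struct req head = { .id = 0 }, a = { .id = 1 }, b = { .id = 2 };

                  list_init(&head.link_list);
                  list_add_tail(&a.link_list, &head.link_list);
                  list_add_tail(&b.link_list, &head.link_list);

                  /* Walk the chain: each step is just ->next. */
                  for (struct list_head *p = head.link_list.next;
                       p != &head.link_list; p = p->next) {
                          struct req *r = (struct req *)((char *)p -
                                          offsetof(struct req, link_list));
                          printf("next in link: %d\n", r->id);
                  }
                  return 0;
          }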
    • io_uring: fix error handling in io_queue_link_head · 2e6e1fde
      Committed by Pavel Begunkov
      In case of an error, io_submit_sqe() drops the request and
      continues without it, even if the request was part of a link. Not
      only does this fail to cancel the link, it may also execute the
      wrong sequence of actions.

      Stop consuming sqes and let the user handle errors.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2e6e1fde
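      A sketch of the general principle in standalone C (hypothetical
      types; the actual patch stops the submission loop on error): if the
      head of a chain fails, complete the whole chain with an error
      instead of silently dropping one request from the middle of it.

          /* Each request knows the next link and how to complete itself. */
          struct req {
                  struct req *next;
                  void (*complete)(struct req *, int);
          };

          static void fail_link(struct req *head, int err)
          {
                  while (head) {
                          struct req *next = head->next;

                          head->complete(head, err); /* e.g. err = -ECANCELED */
                          head = next;
                  }
          }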
    • io_uring: use hash table for poll command lookups · 78076bb6
      Committed by Jens Axboe
      We recently changed this from a single list to an rbtree, but for
      some real-life workloads the rbtree slows down the
      submission/insertion case enough that it becomes the top cycle
      consumer on the io_uring side. In testing, a hash table is a
      better-rounded compromise: it is fast for insertion and, as long as
      it's sized appropriately, it works well for the cancellation case
      too. Running TAO with a lot of network sockets, this change stops
      io_poll_req_insert() from consuming 2% of the CPU cycles.
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      78076bb6
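      A standalone C sketch of the pattern (hypothetical names; the table
      size and hash function are illustrative): O(1) insertion into a
      bucket, and short chains for cancellation lookups as long as the
      table roughly matches the number of live requests.

          #include <stddef.h>
          #include <stdint.h>

          #define HASH_BITS 8
          #define HASH_SIZE (1u << HASH_BITS)

          struct poll_req {
                  uint64_t key;          /* what cancellation looks up by */
                  struct poll_req *next; /* bucket chain */
          };

          static struct poll_req *table[HASH_SIZE];

          static unsigned int hash(uint64_t key)
          {
                  return (unsigned int)((key * 0x9E3779B97F4A7C15ull)
                                        >> (64 - HASH_BITS));
          }

          /* Insertion: prepend to the bucket, O(1). */
          static void poll_insert(struct poll_req *req)
          {
                  unsigned int b = hash(req->key);

                  req->next = table[b];
                  table[b] = req;
          }

          /* Cancellation lookup: walk one (short) chain. */
          static struct poll_req *poll_lookup(uint64_t key)
          {
                  struct poll_req *r;

                  for (r = table[hash(key)]; r; r = r->next)
                          if (r->key == key)
                                  return r;
                  return NULL;
          }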
    • io-wq: clear node->next on list deletion · 08bdcc35
      Committed by Jens Axboe
      If someone removes a node from a list, and then later adds it back to
      a list, we can have invalid data in ->next. This can cause all sorts
      of issues. One such use case is the IORING_OP_POLL_ADD command, which
      will do just that if we race and get woken twice without any pending
      events. This is a pretty rare case, but can happen under extreme loads.
      Dan reports that he saw the following crash:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
      Oops: 0002 [#1] SMP
      CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      RIP: 0010:io_wqe_enqueue+0x3e/0xd0
      Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
      RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
      RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
      RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
      RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
      R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
      FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <IRQ>
       io_poll_wake+0x12f/0x2a0
       __wake_up_common+0x86/0x120
       __wake_up_common_lock+0x7a/0xc0
       sock_def_readable+0x3c/0x70
       tcp_rcv_established+0x557/0x630
       tcp_v6_do_rcv+0x118/0x3c0
       tcp_v6_rcv+0x97e/0x9d0
       ip6_protocol_deliver_rcu+0xe3/0x440
       ip6_input+0x3d/0xc0
       ? ip6_protocol_deliver_rcu+0x440/0x440
       ipv6_rcv+0x56/0xd0
       ? ip6_rcv_finish_core.isra.18+0x80/0x80
       __netif_receive_skb_one_core+0x50/0x70
       netif_receive_skb_internal+0x2f/0xa0
       napi_gro_receive+0x125/0x150
       mlx5e_handle_rx_cqe+0x1d9/0x5a0
       ? mlx5e_poll_tx_cq+0x305/0x560
       mlx5e_poll_rx_cq+0x49f/0x9c5
       mlx5e_napi_poll+0xee/0x640
       ? smp_reschedule_interrupt+0x16/0xd0
       ? reschedule_interrupt+0xf/0x20
       net_rx_action+0x286/0x3d0
       __do_softirq+0xca/0x297
       irq_exit+0x96/0xa0
       do_IRQ+0x54/0xe0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0033:0x7fdc627a2e3a
      Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
      
      when running a networked workload with about 5000 sockets being polled
      for. Fix this by clearing node->next when the node is being removed from
      the list.
      
      Fixes: 6206f0e1 ("io-wq: shrink io_wq_work a bit")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      08bdcc35
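      The fix, modeled as a standalone C singly-linked list (hypothetical
      names, mirroring the io-wq list): clear ->next at removal time so a
      node that is later re-queued cannot drag a stale pointer into the
      new list.

          #include <stddef.h>

          struct wq_node { struct wq_node *next; };
          struct wq_list { struct wq_node *first, *last; };

          static void wq_list_del(struct wq_list *list, struct wq_node *node,
                                  struct wq_node *prev)
          {
                  if (prev)
                          prev->next = node->next;
                  else
                          list->first = node->next;
                  if (list->last == node)
                          list->last = prev;
                  node->next = NULL; /* the fix: no stale pointer on re-use */
          }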
    • io_uring: ensure deferred timeouts copy necessary data · 2d28390a
      Committed by Jens Axboe
      If we defer a timeout, we should ensure that we copy the timespec
      when we have consumed the sqe. This is similar to commit f67676d1
      for read/write requests. We already did this correctly for timeouts
      deferred as links, but do it generally and use the infrastructure added
      by commit 1a6b74fc instead of having the timeout deferral use its
      own.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2d28390a
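      The rule behind the fix, in a hedged standalone C sketch
      (hypothetical types): SQE memory may be reused as soon as
      submission returns, so a deferred timeout must copy the timespec
      into request-owned storage at prep time.

          #include <stdlib.h>
          #include <time.h>

          struct sqe { struct timespec ts; };  /* ring-owned, reused */
          struct req { struct timespec *ts; }; /* request-owned copy */

          static int prep_deferred_timeout(struct req *req,
                                           const struct sqe *sqe)
          {
                  req->ts = malloc(sizeof(*req->ts));
                  if (!req->ts)
                          return -1;
                  *req->ts = sqe->ts; /* copy now; sqe->ts may change later */
                  return 0;
          }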
    • io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT · 901e59bb
      Committed by Jens Axboe
      There's really no reason why we forbid things like link/drain etc on
      regular timeout commands. Enable the usual SQE flags on timeouts.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      901e59bb
    • null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED · bca1c43c
      Committed by Jens Axboe
      If BLK_DEV_ZONED isn't set, 'ret' isn't used. This makes gcc complain,
      rightfully. Move ret where it is used.
      
      Fixes: 979d5447 ("null_blk: cleanup null_gendisk_register")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bca1c43c
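      The usual cure for this class of warning, in a standalone C sketch
      (stubs and names are hypothetical, not the null_blk code): declare
      the variable in the narrowest scope that uses it, so the build
      without the config option never sees it.

          /* Stubs so the sketch compiles on its own. */
          int setup_zones(void) { return 0; }
          void add_disk(void) { }

          #define CONFIG_BLK_DEV_ZONED /* comment out to model the other build */

          int register_disk(int zoned)
          {
          #ifdef CONFIG_BLK_DEV_ZONED
                  if (zoned) {
                          /* 'ret' lives only where it is used, so the
                           * !CONFIG_BLK_DEV_ZONED build never sees it. */
                          int ret = setup_zones();

                          if (ret)
                                  return ret;
                  }
          #else
                  (void)zoned;
          #endif
                  add_disk();
                  return 0;
          }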
    • brd: warn on un-aligned buffer · f1acbf21
      Committed by Ming Lei
      The queue dma alignment limit requires block layer users (fs,
      target, ...) to pass aligned buffers.

      So far brd doesn't support un-aligned buffers, even though
      supporting them would be easy.

      However, brd is often used for debugging, and there are other
      drivers that can't support un-aligned buffers either.

      So add a warning so that brd users know what to fix.
      Reported-by: Stephen Rust <srust@blockbridge.com>
      Cc: Stephen Rust <srust@blockbridge.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f1acbf21
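      The check itself is cheap: test the buffer address against the
      alignment mask. A hedged standalone C sketch (the driver would use
      the kernel's WARN_ON_ONCE; names here are illustrative):

          #include <stdint.h>
          #include <stdio.h>

          #define DMA_ALIGN_MASK 0x1ffu /* e.g. 512-byte alignment - 1 */

          static void check_buffer_alignment(const void *buf)
          {
                  if ((uintptr_t)buf & DMA_ALIGN_MASK)
                          fprintf(stderr,
                                  "brd: un-aligned buffer %p, fix the caller\n",
                                  buf);
          }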
    • brd: remove max_hw_sectors queue limit · 36582a5a
      Committed by Ming Lei
      We now depend on blk_queue_split() to enforce most queue limits
      (the one possible exception being dma alignment); however,
      blk_queue_split() isn't used for brd, so this limit hasn't been
      respected since v4.3.

      The max_hw_sectors limit also doesn't play a big role for brd; it
      has been there since brd was first added to the tree, for no known
      reason.

      So remove it.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      36582a5a
  6. 04 December 2019 (3 commits)
    • xen/blkback: Avoid unmapping unmapped grant pages · f9bd84a8
      Committed by SeongJae Park
      For each I/O request, blkback first maps the foreign pages for the
      request to its local pages.  If an allocation of a local page for the
      mapping fails, it should unmap every mapping already made for the
      request.
      
      However, blkback's handling mechanism for the allocation failure does
      not mark the remaining foreign pages as unmapped.  Therefore, the unmap
      function merely tries to unmap every valid grant page for the request,
      including the pages not mapped due to the allocation failure.  On a
      system that fails the allocation frequently, this problem leads to
      the following kernel crash.
      
        [  372.012538] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
        [  372.012546] IP: [<ffffffff814071ac>] gnttab_unmap_refs.part.7+0x1c/0x40
        [  372.012557] PGD 16f3e9067 PUD 16426e067 PMD 0
        [  372.012562] Oops: 0002 [#1] SMP
        [  372.012566] Modules linked in: act_police sch_ingress cls_u32
        ...
        [  372.012746] Call Trace:
        [  372.012752]  [<ffffffff81407204>] gnttab_unmap_refs+0x34/0x40
        [  372.012759]  [<ffffffffa0335ae3>] xen_blkbk_unmap+0x83/0x150 [xen_blkback]
        ...
        [  372.012802]  [<ffffffffa0336c50>] dispatch_rw_block_io+0x970/0x980 [xen_blkback]
        ...
        Decompressing Linux... Parsing ELF... done.
        Booting the kernel.
        [    0.000000] Initializing cgroup subsys cpuset
      
      This commit fixes the problem by marking the grant pages of the
      given request that weren't mapped due to the allocation failure as
      invalid.
      
      Fixes: c6cc142d ("xen-blkback: use balloon pages for all mappings")
      Reviewed-by: David Woodhouse <dwmw@amazon.de>
      Reviewed-by: Maximilian Heyne <mheyne@amazon.de>
      Reviewed-by: Paul Durrant <pdurrant@amazon.co.uk>
      Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f9bd84a8
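      The invariant the fix restores, in a standalone C model
      (hypothetical names and sentinel value): the unmap path may only
      touch slots that were actually mapped, so an allocation failure
      must mark the remaining slots invalid before bailing out.

          #include <stdlib.h>

          #define INVALID_HANDLE (-1)

          struct page_slot { int handle; };

          /* Stub for a local page allocation that can fail. */
          static void *alloc_local_page(void) { return malloc(64); }

          static int map_request(struct page_slot *slots, int nsegs)
          {
                  for (int i = 0; i < nsegs; i++) {
                          void *p = alloc_local_page();

                          if (!p) {
                                  /* The fix: mark everything not yet mapped. */
                                  for (int j = i; j < nsegs; j++)
                                          slots[j].handle = INVALID_HANDLE;
                                  return -1;
                          }
                          free(p);             /* toy model: only the outcome matters */
                          slots[i].handle = i; /* pretend the grant map succeeded */
                  }
                  return 0;
          }

          static void unmap_request(struct page_slot *slots, int nsegs)
          {
                  for (int i = 0; i < nsegs; i++) {
                          if (slots[i].handle == INVALID_HANDLE)
                                  continue; /* never mapped: skip, don't crash */
                          /* a gnttab-style unmap of slots[i].handle goes here */
                  }
          }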
    • io_uring: handle connect -EINPROGRESS like -EAGAIN · 87f80d62
      Committed by Jens Axboe
      Right now we return it to userspace, which means the application has
      to poll for the socket to be writeable. Let's just treat it like
      -EAGAIN and have io_uring handle it internally; this makes it much
      easier to use.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      87f80d62
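      The mapping is a one-liner in spirit: a non-blocking connect that
      returns -EINPROGRESS is in the same "retry when writeable" state
      that io_uring already handles for -EAGAIN. A hedged sketch
      (hypothetical wrapper, not the kernel code):

          #include <errno.h>

          static int issue_connect(int (*do_connect)(void *), void *args)
          {
                  int ret = do_connect(args);

                  if (ret == -EINPROGRESS)
                          ret = -EAGAIN; /* internal poll-and-retry takes over */
                  return ret;
          }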
    • block: set the zone size in blk_revalidate_disk_zones atomically · 6c6b3549
      Committed by Christoph Hellwig
      The current zone revalidation code has a major problem in that it
      doesn't update the zone size and q->nr_zones atomically, leading
      to a short window where an out of bounds access to the zone arrays
      is possible.
      
      To fix this, move the setting of the zone size into the critical
      section of blk_revalidate_disk_zones so that it gets updated
      together with the zone bitmaps and q->nr_zones.  This also slightly
      simplifies the caller, as it deduces the zone size from the
      report_zones results.

      This change also allows generic code to check for a power-of-two
      zone size.
      Reported-by: Hans Holmberg <hans@owltronix.com>
      Reviewed-by: Javier González <javier@javigon.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6c6b3549
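      The race, modeled in standalone C with a mutex standing in for the
      revalidation critical section (hypothetical names; the kernel uses
      its own exclusion, not a pthread lock): readers compute zone
      indexes from the zone size and bound them with nr_zones, so both
      must change together with the bitmaps, never one before the other.

          #include <pthread.h>

          struct queue {
                  pthread_mutex_t lock;    /* stands in for the critical section */
                  unsigned long zone_size; /* sectors per zone */
                  unsigned int nr_zones;
                  unsigned long *seq_zones_bitmap;
          };

          /* Updating zone_size together with nr_zones and the bitmap closes
           * the window where a reader computes an out-of-bounds zone index. */
          static void revalidate_zones(struct queue *q, unsigned long zone_size,
                                       unsigned int nr_zones,
                                       unsigned long *bitmap)
          {
                  pthread_mutex_lock(&q->lock);
                  q->zone_size = zone_size;
                  q->nr_zones = nr_zones;
                  q->seq_zones_bitmap = bitmap;
                  pthread_mutex_unlock(&q->lock);
          }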
  7. 03 December 2019 (18 commits)