1. 28 November 2019: 3 commits
    • alios: blk-throttle: fix tg NULL pointer dereference · 4667e926
      Authored by Joseph Qi
      The io throttle stats code calls blkg_get at the beginning of
      throttling and blkg_put in the newly introduced bi_tg_end_io. This
      causes the blkg to be freed prematurely if end_io is called twice, as
      with dm-thin, which saves the original end_io, calls its own
      overriding end_io, and then calls the saved one. After that, any
      access to the blkg is invalid and finally triggers a BUG:
      
      [ 4417.235048] BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0
      [ 4417.236475] IP: [<ffffffff812e7c71>] throtl_update_dispatch_stats+0x21/0xb0
      [ 4417.237865] PGD 98395067 PUD 362e1067 PMD 0
      [ 4417.239232] Oops: 0000 [#1] SMP
      ......
      [ 4417.274070] Call Trace:
      [ 4417.275407]  [<ffffffff812ea93d>] blk_throtl_bio+0xfd/0x630
      [ 4417.276760]  [<ffffffff810b3613>] ? wake_up_process+0x23/0x40
      [ 4417.278079]  [<ffffffff81094c04>] ? wake_up_worker+0x24/0x30
      [ 4417.279387]  [<ffffffff81095772>] ? insert_work+0x62/0xa0
      [ 4417.280697]  [<ffffffff8116c2c7>] ? mempool_free_slab+0x17/0x20
      [ 4417.282019]  [<ffffffff8116c6c9>] ? mempool_free+0x49/0x90
      [ 4417.283326]  [<ffffffff812c9acf>] generic_make_request_checks+0x16f/0x360
      [ 4417.284637]  [<ffffffffa0340d97>] ? thin_map+0x227/0x2c0 [dm_thin_pool]
      [ 4417.285951]  [<ffffffff812c9ce7>] generic_make_request+0x27/0x130
      [ 4417.287240]  [<ffffffffa0230b3d>] __map_bio+0xad/0x100 [dm_mod]
      [ 4417.288503]  [<ffffffffa023257e>] __clone_and_map_data_bio+0x15e/0x240 [dm_mod]
      [ 4417.289778]  [<ffffffffa02329ea>] __split_and_process_bio+0x38a/0x500 [dm_mod]
      [ 4417.291062]  [<ffffffffa0232c91>] dm_make_request+0x131/0x1a0 [dm_mod]
      [ 4417.292344]  [<ffffffff812c9da2>] generic_make_request+0xe2/0x130
      [ 4417.293626]  [<ffffffff812c9e61>] submit_bio+0x71/0x150
      [ 4417.294909]  [<ffffffff8121ab1d>] ? bio_alloc_bioset+0x20d/0x360
      [ 4417.296195]  [<ffffffff81215acb>] _submit_bh+0x14b/0x220
      [ 4417.297484]  [<ffffffff81215bb0>] submit_bh+0x10/0x20
      [ 4417.298744]  [<ffffffffa016d8d8>] jbd2_journal_commit_transaction+0x6c8/0x19a0 [jbd2]
      [ 4417.300014]  [<ffffffff810135b8>] ? __switch_to+0xf8/0x4c0
      [ 4417.301268]  [<ffffffffa01731e9>] kjournald2+0xc9/0x270 [jbd2]
      [ 4417.302524]  [<ffffffff810a0fd0>] ? wake_up_atomic_t+0x30/0x30
      [ 4417.303753]  [<ffffffffa0173120>] ? commit_timeout+0x10/0x10 [jbd2]
      [ 4417.304950]  [<ffffffff8109ffef>] kthread+0xcf/0xe0
      [ 4417.306107]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      [ 4417.307255]  [<ffffffff81647f18>] ret_from_fork+0x58/0x90
      [ 4417.308349]  [<ffffffff8109ff20>] ? kthread_create_on_node+0x140/0x140
      ......
      
      A new bio flag, BIO_THROTL_STATED, is introduced to make sure
      blkg_get/blkg_put are called only once for the same bio, as sketched
      below.
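      As a rough illustration (a minimal userspace model of the idea, not
      the kernel patch itself; the integer refcount stands in for the real
      blkg reference counting):

        #include <stdio.h>

        #define BIO_THROTL_STATED (1u << 0)  /* flag named in this patch */

        struct blkg { int ref; };            /* stand-in for blkcg_gq */
        struct bio  { unsigned int bi_flags; struct blkg *bi_blkg; };

        static void blkg_get(struct blkg *g) { g->ref++; }
        static void blkg_put(struct blkg *g) { g->ref--; }

        /* Throttle entry: take the reference only once per bio. */
        static void throtl_stats_get(struct bio *bio)
        {
            if (!(bio->bi_flags & BIO_THROTL_STATED)) {
                blkg_get(bio->bi_blkg);
                bio->bi_flags |= BIO_THROTL_STATED;
            }
        }

        /* bi_tg_end_io path: harmless even if end_io runs twice. */
        static void throtl_stats_put(struct bio *bio)
        {
            if (bio->bi_flags & BIO_THROTL_STATED) {
                blkg_put(bio->bi_blkg);
                bio->bi_flags &= ~BIO_THROTL_STATED;
            }
        }

        int main(void)
        {
            struct blkg g = { .ref = 1 };
            struct bio b = { .bi_flags = 0, .bi_blkg = &g };

            throtl_stats_get(&b);
            throtl_stats_put(&b);             /* first end_io */
            throtl_stats_put(&b);             /* second end_io: no-op */
            printf("blkg ref = %d\n", g.ref); /* 1: no over-put */
            return 0;
        }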
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alios: blk-throttle: support io delay stats · 65e6966a
      Authored by Joseph Qi
      Add blkio.throttle.io_service_time and blkio.throttle.io_wait_time to
      expose per-cgroup io delay statistics.
      io_service_time represents the time from io throttling to io
      completion, while io_wait_time represents the time spent on the
      throttle queue.
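      As a rough model of how the two statistics split a bio's lifetime
      (field and function names below are illustrative, not the patch's):

        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative per-cgroup counters behind the two new files. */
        struct tg_delay_stats {
            uint64_t io_wait_time;    /* ns spent on the throttle queue */
            uint64_t io_service_time; /* ns from leaving throttle to completion */
        };

        /* t_queued: bio enters the throttle queue; t_dispatch: throttle
         * releases it; t_complete: the device reports completion. */
        static void account_bio(struct tg_delay_stats *s, uint64_t t_queued,
                                uint64_t t_dispatch, uint64_t t_complete)
        {
            s->io_wait_time    += t_dispatch - t_queued;
            s->io_service_time += t_complete - t_dispatch;
        }

        int main(void)
        {
            struct tg_delay_stats s = {0};
            account_bio(&s, 100, 350, 900); /* waited 250ns, serviced 550ns */
            printf("wait=%llu service=%llu\n",
                   (unsigned long long)s.io_wait_time,
                   (unsigned long long)s.io_service_time);
            return 0;
        }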
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • alios: block: add counter to track io request's d2c time · 07232d74
      Authored by Xiaoguang Wang
      The await metric from the iostat tool is not good enough: it is
      somewhat coarse and cannot show a request's latency on the device
      driver's side.

      Here we add a new counter to track an io request's d2c time; with
      this patch, iostat can easily be extended to show this value.

      Note:
      I have checked how iostat is implemented: it reads only the fields it
      needs, so iostat is not affected by this change, and neither is tsar.
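      In blktrace terms, d2c is the span from driver dispatch (D) to
      completion (C), i.e. the device-side part of await. A toy model of
      the accounting (names are illustrative, not the patch's):

        #include <stdint.h>
        #include <stdio.h>

        /* Illustrative per-device counters; the real field sits alongside
         * the other statistics exported for iostat to read. */
        struct disk_stats {
            uint64_t ios;    /* completed requests */
            uint64_t d2c_ns; /* accumulated dispatch-to-complete time */
        };

        static void account_done(struct disk_stats *st,
                                 uint64_t t_issue, uint64_t t_complete)
        {
            st->ios++;
            st->d2c_ns += t_complete - t_issue;
        }

        int main(void)
        {
            struct disk_stats st = {0};
            account_done(&st, 1000, 1800);
            account_done(&st, 2000, 2400);
            /* an extended iostat could then report device-side latency: */
            printf("avg d2c = %llu ns\n",
                   (unsigned long long)(st.d2c_ns / st.ios));
            return 0;
        }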
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
  2. 19 November 2019: 2 commits
  3. 07 November 2019: 26 commits
  4. 30 October 2019: 5 commits
    • blk-cgroup: turn on psi memstall stuff · 9a8eb1ab
      Authored by Josef Bacik
      commit fd112c74652371a023f85d87b70bee7169e8f4d0 upstream.
      
      With the psi infrastructure in place, we can use the memstall flag to
      indicate pressure that arises from throttling.
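      The pattern, sketched as a compilable model: psi_memstall_enter() and
      psi_memstall_leave() are the real kernel entry points, but their
      bodies here are stubs for illustration only.

        #include <stdio.h>

        static void psi_memstall_enter(unsigned long *pflags) /* stub */
        {
            *pflags = 1;
            printf("task counted as stalled on memory\n");
        }

        static void psi_memstall_leave(unsigned long *pflags) /* stub */
        {
            *pflags = 0;
            printf("stall accounting stopped\n");
        }

        static void throttle_sleep(void) { /* stand-in for the delay */ }

        /* Time spent throttled now shows up as memory pressure in psi. */
        static void blkcg_throttle(void)
        {
            unsigned long pflags;

            psi_memstall_enter(&pflags);
            throttle_sleep();
            psi_memstall_leave(&pflags);
        }

        int main(void) { blkcg_throttle(); return 0; }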
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • block: fix single range discard merge · d6b88955
      Authored by Ming Lei
      commit 2a5cf35cd6c56b2924bce103413ad3381bdc31fa upstream.
      
      There are actually two kinds of discard merge:

      - one is the normal discard merge, handled just like a normal
      read/write request; call it single-range discard

      - the other is the multi-range discard, where
      queue_max_discard_segments(rq->q) > 1

      In the former case, queue_max_discard_segments(rq->q) is 1, and we
      should handle this kind of discard merge like a normal read/write
      request.
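      The case split can be modeled as follows (a userspace sketch of the
      decision only, not the actual elevator code):

        #include <stdbool.h>
        #include <stdio.h>

        enum merge_kind { MERGE_NONE, MERGE_NORMAL, MERGE_MULTI_RANGE };

        /* Stand-in for the request queue's discard capability. */
        struct queue { unsigned int max_discard_segments; };

        static enum merge_kind discard_merge_kind(const struct queue *q,
                                                  bool contiguous)
        {
            if (q->max_discard_segments > 1)
                return MERGE_MULTI_RANGE; /* each bio is its own range */
            /* single-range discard: treat like a normal read/write merge,
             * which includes removing the merged request from the
             * elevator queue */
            return contiguous ? MERGE_NORMAL : MERGE_NONE;
        }

        int main(void)
        {
            struct queue single = { .max_discard_segments = 1 };
            struct queue multi  = { .max_discard_segments = 8 };

            printf("%d %d %d\n",
                   discard_merge_kind(&single, true),  /* MERGE_NORMAL */
                   discard_merge_kind(&single, false), /* MERGE_NONE */
                   discard_merge_kind(&multi, false)); /* MERGE_MULTI_RANGE */
            return 0;
        }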
      
      This patch fixes the following kernel panic issue [1], which is
      caused by not removing the single-range discard request from the
      elevator queue.

      Guangwu has a raid discard test case in which this issue is somewhat
      easier to trigger, and I verified that this patch fixes the kernel
      panic in that test case.
      
      [1] kernel panic log from Jens's report
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
       PGD 0 P4D 0.
       Oops: 0000 [#1] SMP PTI
       CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14
       Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017
       Workqueue: kblockd blk_mq_run_work_fn
       RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
       Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02
       RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246
       RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
       RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
       RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
       R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
       R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
       FS:  0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        blk_mq_dispatch_rq_list+0xec/0x480
        ? elv_rb_del+0x11/0x30
        blk_mq_do_dispatch_sched+0x6e/0xf0
        blk_mq_sched_dispatch_requests+0xfa/0x170
        __blk_mq_run_hw_queue+0x5f/0xe0
        process_one_work+0x154/0x350
        worker_thread+0x46/0x3c0
        kthread+0xf5/0x130
        ? process_one_work+0x350/0x350
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x1f/0x30
       Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4
       CR2: 0000000000000148
      
       ---[ end trace 340a1fb996df1b9b ]---
       RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
       Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02
      
      Fixes: 445251d0 ("blk-mq: fix discard merge with scheduler attached")
      Reported-by: Jens Axboe <axboe@kernel.dk>
      Cc: Guangwu Zhang <guazhang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • block: fix the DISCARD request merge · 4ff7fa23
      Authored by Jianchao Wang
      commit 69840466086d2248898020a08dda52732686c4e6 upstream.
      
      There are two cases when handling a DISCARD merge.
      If max_discard_segments == 1, the bios/requests need to be contiguous
      to merge. If max_discard_segments > 1, every bio is taken as a
      separate range, and different ranges need not be contiguous.

      But attempt_merge currently gets this wrong: it always considers
      contiguity for DISCARD in the max_discard_segments > 1 case, and it
      cannot merge contiguous DISCARDs in the max_discard_segments == 1
      case, because rq_attempt_discard_merge always returns false there.
      This patch fixes both cases.
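      Upstream decides the merge path with a single predicate; a compilable
      stand-in for that helper (kernel types simplified to plain structs,
      so treat this as a sketch rather than the exact diff) might look
      like:

        #include <stdbool.h>
        #include <stdio.h>

        /* Simplified stand-ins for struct request and its queue limits. */
        enum req_op { REQ_OP_READ, REQ_OP_WRITE, REQ_OP_DISCARD };
        struct request { enum req_op op; unsigned int max_discard_segments; };

        /* Only multi-range discard takes the special discard-merge path;
         * a single-range discard falls through to the normal merge logic
         * with its contiguity check. */
        static bool blk_discard_mergable(const struct request *req)
        {
            return req->op == REQ_OP_DISCARD &&
                   req->max_discard_segments > 1;
        }

        int main(void)
        {
            struct request multi  = { REQ_OP_DISCARD, 8 };
            struct request single = { REQ_OP_DISCARD, 1 };

            printf("%d %d\n", blk_discard_mergable(&multi),   /* 1 */
                              blk_discard_mergable(&single)); /* 0 */
            return 0;
        }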
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD · 561d9c71
      Authored by Johannes Weiner
      commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 upstream.
      
      There are several definitions of those functions/macros in places that
      mess with fixed-point load averages.  Provide an official version.
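      For reference, these are the classic fixed-point helpers being
      consolidated (an 11-bit fraction; the demo values are illustrative):

        #include <stdio.h>

        #define FSHIFT   11                  /* bits of fraction */
        #define FIXED_1  (1 << FSHIFT)       /* 1.0 in fixed point */
        #define LOAD_INT(x)  ((x) >> FSHIFT)
        #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)
        #define CALC_LOAD(load, exp, n)      \
            load *= exp;                     \
            load += (n) * (FIXED_1 - (exp)); \
            load >>= FSHIFT;

        int main(void)
        {
            unsigned long avg = 3 * FIXED_1 + FIXED_1 / 4; /* 3.25 */
            printf("%lu.%02lu\n", LOAD_INT(avg), LOAD_FRAC(avg));

            /* one update tick of the 1-min average (exp ~ 1884) */
            unsigned long load = 0;
            CALC_LOAD(load, 1884, 5 * FIXED_1); /* 5 runnable tasks */
            printf("%lu.%02lu\n", LOAD_INT(load), LOAD_FRAC(load));
            return 0;
        }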
      
      [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
      Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [Joseph: use stat.mean instead of stat->rqs.mean to solve conflict]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      
      Conflicts:
          block/blk-iolatency.c
    • block-throttle: enable hierarchical throttling even on traditional hierarchy · 61518922
      Authored by Joseph Qi
      ECI has a use case where each device mapper disk's throttling policy
      is configured directly under the root blkio cgroup, while the disks
      are actually used in different containers.
      Since hierarchical throttling is currently supported only on cgroup
      v2 and ECI uses cgroup v1, we have to enable hierarchical throttling
      on cgroup v1.
      This was ported from redhat 7u, and a year ago Jiufei already ported
      it to alikernel 4.9 as well, so this change should be acceptable.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
  5. 29 October 2019: 1 commit
  6. 18 October 2019: 1 commit
  7. 08 October 2019: 1 commit
    • block: mq-deadline: Fix queue restart handling · dbb7339c
      Authored by Damien Le Moal
      [ Upstream commit cb8acabbe33b110157955a7425ee876fb81e6bbc ]
      
      Commit 7211aef86f79 ("block: mq-deadline: Fix write completion
      handling") added a call to blk_mq_sched_mark_restart_hctx() in
      dd_dispatch_request() to make sure that write request dispatching does
      not stall when all target zones are locked. This fix left a subtle race
      when a write completion happens during a dispatch execution on another
      CPU:
      
      CPU 0: dispatch                      CPU 1: write completion

      dd_dispatch_request()
          lock(&dd->lock);
          ...
          lock(&dd->zone_lock);            dd_finish_request()
          rq = find request                    lock(&dd->zone_lock);
          unlock(&dd->zone_lock);
                                               zone write unlock
                                               unlock(&dd->zone_lock);
                                               ...
                                               __blk_mq_free_request
                                                   check restart flag (not set)
                                                   -> queue not run
          ...
          if (!rq && have writes)
              blk_mq_sched_mark_restart_hctx()
          unlock(&dd->lock)
      
      Since the dispatch context finishes after the write request
      completion handling, marking the queue as needing a restart is not
      seen by __blk_mq_free_request(), and blk_mq_sched_restart() is not
      executed, leading to a dispatch stall under 100% write workloads.
      
      Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from
      dd_dispatch_request() into dd_finish_request() under the zone lock to
      ensure full mutual exclusion between write request dispatch selection
      and zone unlock on write request completion.
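      The shape of the fix can be modeled in userspace with a mutex
      standing in for dd->zone_lock: the restart mark now happens in the
      completion path, inside the zone lock, so the dispatcher can no
      longer miss it (a pthreads sketch; names follow the commit text but
      the logic is heavily simplified):

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
        static bool restart_needed;           /* hctx restart flag */
        static bool zone_write_locked = true; /* target zone state */

        /* Completion path: zone unlock and restart mark are now atomic
         * with respect to dispatch, both under zone_lock. */
        static void dd_finish_request(void)
        {
            pthread_mutex_lock(&zone_lock);
            zone_write_locked = false; /* zone write unlock */
            restart_needed = true;     /* blk_mq_sched_mark_restart_hctx() */
            pthread_mutex_unlock(&zone_lock);
        }

        /* Dispatch path: no longer marks restart itself. */
        static bool dd_dispatch_request(void)
        {
            bool found;

            pthread_mutex_lock(&zone_lock);
            found = !zone_write_locked; /* a write is dispatchable only
                                         * once its zone is unlocked */
            pthread_mutex_unlock(&zone_lock);
            return found;
        }

        int main(void)
        {
            if (!dd_dispatch_request())
                printf("nothing dispatchable, waiting for completion\n");
            dd_finish_request(); /* completion marks restart under lock */
            if (restart_needed && dd_dispatch_request())
                printf("queue restarted, write dispatched\n");
            return 0;
        }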
      
      Fixes: 7211aef86f79 ("block: mq-deadline: Fix write completion handling")
      Cc: stable@vger.kernel.org
      Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
      Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  8. 05 October 2019: 1 commit
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 82652c06
      Authored by Yufen Yu
      commit 8d6996630c03d7ceeabe2611378fea5ca1c3f1b3 upstream.
      
      We hit a null pointer dereference BUG in blk_mq_rq_timed_out(), as
      follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between the timeout handler and the
      completion of a flush request.

      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', 'req' has already completed and been reinitialized
      by the next flush request, which calls blk_rq_init() to zero 'req'.

      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"), the
      lifetime of normal requests is protected by a refcount: until
      'rq->ref' drops to zero, the request cannot really be freed. Thus,
      such requests cannot be reused before the timeout handler finishes.

      However, a flush request has .end_io defined, and rq->end_io() is
      still called even if 'rq->ref' has not dropped to zero. After that,
      the 'flush_rq' can be reused by the next flush request, resulting in
      the null pointer dereference BUG above.

      We fix this problem by covering the flush request with 'rq->ref' as
      well. If the refcount is not zero, flush_end_io() returns and waits
      for the last holder to call it again. To record the request status,
      we add a new field, 'rq_status', which is used in flush_end_io().
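      The fix can be modeled as a reference-counted end_io: the last holder
      of the request performs the real completion, and rq_status preserves
      the first recorded outcome until then (a userspace sketch with
      illustrative names and a bare int for the refcount):

        #include <stdio.h>

        enum rq_status { RQ_IDLE, RQ_OK, RQ_TIMED_OUT };

        struct flush_queue {
            int ref;                  /* stands in for rq->ref */
            enum rq_status rq_status; /* outcome read by flush_end_io */
        };

        /* Each holder of the flush request (e.g. normal completion and
         * the timeout handler) drops one reference; only the last drop
         * really finishes the request. */
        static void flush_end_io(struct flush_queue *fq, enum rq_status s)
        {
            if (fq->rq_status == RQ_IDLE)
                fq->rq_status = s; /* remember the first outcome */
            if (--fq->ref > 0)
                return;            /* wait for the last holder */
            printf("flush done, status=%d; safe to reuse the request\n",
                   fq->rq_status);
            fq->rq_status = RQ_IDLE; /* ready for blk_rq_init() reuse */
        }

        int main(void)
        {
            struct flush_queue fq = { .ref = 2, .rq_status = RQ_IDLE };

            flush_end_io(&fq, RQ_OK);        /* not the last holder yet */
            flush_end_io(&fq, RQ_TIMED_OUT); /* last holder: finishes */
            return 0;
        }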
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let a spinlock protect 'fq->rq_status'
      v5:
       - move rq_status after the flush_running_idx member of struct
         blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>