  1. 16 Sep, 2019 1 commit
    • block: make rq sector size accessible for block stats · 3d244306
      Committed by Hou Tao
      Currently rq->data_len will be decreased by partial completion or
      zeroed by completion, so when blk_stat_add() is invoked, data_len
      will be zero and there will never be samples in poll_cb because
      blk_mq_poll_stats_bkt() will return -1 if data_len is zero.
      
      We could move blk_stat_add() back to __blk_mq_complete_request(),
      but that would make the effort of trying to call ktime_get_ns()
      once in vain. Instead we can reuse throtl_size field, and use
      it for both block stats and block throttle, and adjust the
      logic in blk_mq_poll_stats_bkt() accordingly.
      
      Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()")
      Tested-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
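The failure mode described in this commit can be modeled in a few lines. The sketch below is a simplified userspace model of the bucket selection in blk_mq_poll_stats_bkt(), not the kernel code itself: the name `poll_stats_bkt`, the constant `POLL_STATS_BKTS = 16`, and the formula (data direction plus twice the log2 of the sector count) are assumptions made for illustration. It shows why a size of zero can never land in a bucket.

```python
POLL_STATS_BKTS = 16  # assumed bucket count, for illustration only


def ilog2(n: int) -> int:
    """Integer log2 that yields -1 for 0, like a find-last-set based helper."""
    return n.bit_length() - 1


def poll_stats_bkt(ddir: int, sectors: int) -> int:
    """Pick a stats bucket from data direction (0 read, 1 write) and size."""
    bucket = ddir + 2 * ilog2(sectors)
    if bucket < 0:
        # A zero size always ends up here, so no sample is ever recorded.
        return -1
    if bucket >= POLL_STATS_BKTS:
        # Clamp very large requests into the last bucket for this direction.
        return ddir + POLL_STATS_BKTS - 2
    return bucket
```

In this model a size read after completion (zero) maps every request to -1, while a size captured before completion gives reads and writes of the same length distinct, valid buckets.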
  2. 15 Sep, 2019 1 commit
  3. 14 Sep, 2019 7 commits
    • Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.4/block · 99e5381d
      Committed by Jens Axboe
      Pull MD fixes from Song.
      
      * 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        raid5: use bio_end_sector in r5_next_bio
        raid5: remove STRIPE_OPS_REQ_PENDING
        md: add feature flag MD_FEATURE_RAID0_LAYOUT
        md/raid0: avoid RAID0 data corruption due to layout confusion.
        raid5: don't set STRIPE_HANDLE to stripe which is in batch list
        raid5: don't increment read_errors on EILSEQ return
    • raid5: use bio_end_sector in r5_next_bio · 067df25c
      Committed by Guoqing Jiang
      We are actually calculating the bio's end sector here, so use the
      common helper for that purpose.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • raid5: remove STRIPE_OPS_REQ_PENDING · feb9bf98
      Committed by Guoqing Jiang
      This stripe state has not been used since commit 51acbcec
      ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsolete
      state.
      
      gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r
      drivers/md/raid5.c:					  (1 << STRIPE_OPS_REQ_PENDING) |
      drivers/md/raid5.h:	STRIPE_OPS_REQ_PENDING,
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: add feature flag MD_FEATURE_RAID0_LAYOUT · 33f2c35a
      Committed by NeilBrown
      Due to a bug introduced in Linux 3.14 we cannot determine the
      correct layout for a multi-zone RAID0 array - there are two
      possibilities.
      
      It is possible to tell the kernel which to choose using a module
      parameter, but this can be clumsy to use.  It would be best if
      the choice were recorded in the metadata.
      So add a feature flag for this purpose.
      If it is set, then the 'layout' field of the superblock is used
      to determine which layout to use.
      
      If this flag is not set, then mddev->layout gets set to -1,
      which causes the module parameter to be required.
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
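The decision described in this commit message can be sketched as a small function. This is an illustrative model only, not the md driver code: `resolve_raid0_layout` and its parameters are hypothetical names, and the choice to treat single-zone arrays as unaffected is an assumption drawn from the message's reference to multi-zone arrays.

```python
from typing import Optional


def resolve_raid0_layout(flag_set: bool, sb_layout: int,
                         module_param: Optional[int],
                         multi_zone: bool) -> Optional[int]:
    """Return the layout to use, or raise if it cannot be determined."""
    # With the feature flag set, the superblock 'layout' field decides.
    if flag_set:
        return sb_layout
    # Flag not set: the layout is unknown (the commit sets mddev->layout to -1).
    if not multi_zone:
        # Assumption: an array with a single zone is laid out the same either way.
        return None
    if module_param is None:
        # The module parameter becomes mandatory for an affected array.
        raise ValueError("multi-zone RAID0: layout must be given explicitly")
    return module_param
```

Recording the choice in the metadata means the first branch is taken and no per-boot configuration is needed.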
    • md/raid0: avoid RAID0 data corruption due to layout confusion. · c84a1372
      Committed by NeilBrown
      If the drives in a RAID0 are not all the same size, the array is
      divided into zones.
      The first zone covers all drives, to the size of the smallest.
      The second zone covers all drives larger than the smallest, up to
      the size of the second smallest - etc.
      
      A change in Linux 3.14 unintentionally changed the layout for the
      second and subsequent zones.  All the correct data is still stored, but
      each chunk may be assigned to a different device than in pre-3.14 kernels.
      This can lead to data corruption.
      
      It is not possible to determine what layout to use - it depends on which
      kernel the data was written by.
      So we add a module parameter to allow the old (0) or new (1) layout to be
      specified, and refuse to assemble an affected array if that parameter is
      not set.
      
      Fixes: 20d0189b ("block: Introduce new bio_split()")
      cc: stable@vger.kernel.org (3.14+)
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
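The zone division described at the top of this commit message can be worked through with a short example. The function name `raid0_zones` and the use of plain integers as device sizes are illustrative assumptions; the sketch only computes which devices belong to each zone and how much of each device the zone covers, and does not model chunk placement within a zone (which is the part that differs between the two layouts).

```python
from typing import List, Tuple


def raid0_zones(sizes: List[int]) -> List[Tuple[List[int], int]]:
    """Split devices of unequal size into zones.

    Returns a list of (member device indexes, per-device length) tuples.
    """
    zones = []
    covered = 0  # how much of every remaining device earlier zones cover
    for limit in sorted(set(sizes)):
        # Devices at least this large still have room for this zone.
        members = [i for i, size in enumerate(sizes) if size >= limit]
        zones.append((members, limit - covered))
        covered = limit
    return zones
```

For sizes 100, 200, 200 and 300, the first zone spans all four devices, the second spans the three larger ones, and the third uses only the largest. Only an array with more than one zone is affected by the layout change.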
    • raid5: don't set STRIPE_HANDLE to stripe which is in batch list · 6ce220dd
      Committed by Guoqing Jiang
      If a stripe in a batch list has the STRIPE_HANDLE flag set, the
      handle_stripe function can then mark it STRIPE_ACTIVE. If an error
      hits the batch_head at the same time, break_stripe_batch_list is
      called and the warning below can trigger (the same report as in [1]);
      it means a member of the batch list was set to STRIPE_ACTIVE.
      
      [7028915.431770] stripe state: 2001
      [7028915.431815] ------------[ cut here ]------------
      [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
      [...]
      [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
      [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
      [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
      [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
      [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
      [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
      [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
      [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
      [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
      [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
      [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
      [7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
      [7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
      [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7028915.431910] Call Trace:
      [7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
      [7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
      [7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
      [7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]
      
      Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
      said "If a stripe is added to batch list, then only the first stripe
      of the list should be put to handle_list and run handle_stripe."
      
      So don't set STRIPE_HANDLE on a stripe which is already in a batch list;
      otherwise the stripe could be put on handle_list and run through
      handle_stripe, which can trigger the above warning.
      
      [1]. https://www.spinics.net/lists/raid/msg62552.html
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • raid5: don't increment read_errors on EILSEQ return · b76b4715
      Committed by Nigel Croxon
      MD counts read errors returned by the lower layer.  If those errors
      are -EILSEQ instead of -EIO, it should NOT increase the read_errors
      count.
      
      When RAID6 is set up on a dm-integrity target that detects massive
      corruption, the leg will be ejected from the array, even if the
      issue is correctable with a sector re-write and the array has the
      necessary redundancy to correct it.
      
      The leg is ejected because it runs rdev->read_errors up beyond
      conf->max_nr_stripes.  The return status in dm-crypt when there is
      a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
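The accounting rule from this commit message can be sketched as follows. This is a toy model under stated assumptions, not the raid5 code: the `Leg` class and `account_read_error` are hypothetical names, and ejection is reduced to a flag set once the counter exceeds `max_nr_stripes`.

```python
import errno
from dataclasses import dataclass


@dataclass
class Leg:
    read_errors: int = 0
    ejected: bool = False


def account_read_error(leg: Leg, err: int, max_nr_stripes: int) -> None:
    """Count a read error unless it is an integrity error (-EILSEQ)."""
    if err == -errno.EILSEQ:
        # Integrity errors may be correctable by a re-write; do not count them.
        return
    leg.read_errors += 1
    if leg.read_errors > max_nr_stripes:
        leg.ejected = True
```

With this rule a burst of -EILSEQ results leaves the counter untouched, while the same number of -EIO results can still push the leg past the threshold.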
  4. 13 Sep, 2019 1 commit
    • Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-5.4/block · 21fa1004
      Committed by Jens Axboe
      Pull NVMe updates from Sagi:
      
      "Highlights includes:
       - controller reset and namespace scan races fixes
       - nvme discovery log change uevent support
       - naming improvements from Keith
       - multiple discovery controllers reject fix from James
       - some regular cleanups from various people"
      
      * 'nvme-5.4' of git://git.infradead.org/nvme:
        nvmet: fix a wrong error status returned in error log page
        nvme: send discovery log page change events to userspace
        nvme: add uevent variables for controller devices
        nvme: enable aen regardless of the presence of I/O queues
        nvme-fabrics: allow discovery subsystems accept a kato
        nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
        nvme: Remove redundant assignment of cq vector
        nvme: Assign subsys instance from first ctrl
        nvme: tcp: remove redundant assignment to variable ret
        nvme: include admin_q sync with nvme_sync_queues
        nvme: Treat discovery subsystems as unique subsystems
        nvme: fix ns removal hang when failing to revalidate due to a transient error
        nvme: make nvme_report_ns_ids propagate error back
        nvme: make nvme_identify_ns propagate errors back
        nvme: pass status to nvme_error_status
        nvme-fc: Fail transport errors with NVME_SC_HOST_PATH
        nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failed
        nvme: fail cancelled commands with NVME_SC_HOST_PATH_ERROR
  5. 12 Sep, 2019 24 commits
  6. 11 Sep, 2019 6 commits
    • iocost_monitor: Report debt · 7c1ee704
      Committed by Tejun Heo
      Report debt and rename del_ms row to delay for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iocost_monitor: Report more info with higher accuracy · b06f2d35
      Committed by Tejun Heo
      When outputting json:
      
      * Don't truncate numbers.
      
      * Report address of iocg to ease drilling down further.
      
      When outputting table:
      
      * Use math.ceil() for delay_ms so that small delays don't read as 0.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
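The effect of rounding up in the table output is easy to demonstrate. The helper names below are hypothetical and the nanosecond input unit is an assumption; the sketch only contrasts truncation with math.ceil() for a sub-millisecond delay.

```python
import math


def delay_ms_truncated(delay_ns: int) -> int:
    """Truncating conversion: any delay under 1 ms reads as 0."""
    return int(delay_ns / 1_000_000)


def delay_ms_ceil(delay_ns: int) -> int:
    """Rounding up keeps a small but nonzero delay visible."""
    return math.ceil(delay_ns / 1_000_000)
```

A true zero still prints as 0 with either helper, so a nonzero cell in the table reliably means some delay was observed.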
    • iocost_monitor: Always use strings for json values · e742bd5c
      Committed by Tejun Heo
      JSON has limited accuracy for numbers and can silently truncate 64-bit
      values, which can be extremely confusing.  Let's consistently use
      string-encapsulated values for JSON output.
      
      While at it, convert an unnecessary f-string to str().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
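The precision problem can be reproduced without the monitor script. The sketch below assumes a consumer that stores JSON numbers as IEEE 754 doubles, which many JSON parsers do; the function names and the `"v"` key are made up for the example.

```python
import json


def through_double(value: int) -> int:
    """Model a consumer that keeps JSON numbers as 64-bit floats."""
    return int(float(value))


def through_string(value: int) -> int:
    """Round-trip the value as a JSON string instead of a JSON number."""
    payload = json.dumps({"v": str(value)})
    return int(json.loads(payload)["v"])
```

A value such as 2**64 - 1 does not survive the double path unchanged, but the string path returns it exactly, at the cost of the consumer having to parse the string itself.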
    • blk-iocost: Don't let merges push vtime into the future · e1518f63
      Committed by Tejun Heo
      Merges have the same problem that forced-bios had which is fixed by
      the previous patch.  The cost of a merge is calculated at the time of
      issue and force-advances vtime into the future.  Until global vtime
      catches up, how the cgroup's hweight changes in the meantime doesn't
      matter and it often leads to situations where the cost is calculated
      at one hweight and paid at a very different one.  See the previous
      patch for more details.
      
      Fix it by never advancing vtime into the future for merges.  If budget
      is available, vtime is advanced.  Otherwise, the cost is charged as
      debt.
      
      This brings merge cost handling in line with issue cost handling in
      ioc_rqos_throttle().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
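The merge rule stated in this commit message ("if budget is available, vtime is advanced; otherwise, the cost is charged as debt") can be sketched as below. This is a simplified model, not the blk-iocost code: the `Iocg` class, its field names, and the budget test (local vtime plus cost not exceeding the global `vnow`) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Iocg:
    vtime: int = 0      # local virtual time of the cgroup
    abs_vdebt: int = 0  # debt kept in absolute (hweight-independent) units


def charge_merge(iocg: Iocg, cost: int, abs_cost: int, vnow: int) -> None:
    """Charge a merge without ever moving vtime past the global vnow."""
    if iocg.vtime + cost <= vnow:
        # Budget available: pay now by advancing local vtime.
        iocg.vtime += cost
    else:
        # No budget: remember the absolute cost as debt instead.
        iocg.abs_vdebt += abs_cost
```

Because local vtime never runs ahead of `vnow` in this model, a later change in hweight applies to the outstanding debt rather than being locked in at charge time.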
    • blk-iocost: Account force-charged overage in absolute vtime · 36a52481
      Committed by Tejun Heo
      Currently, when a bio needs to be force-charged and there isn't enough
      budget, vtime is simply pushed into the future.  This means that the
      cost of the whole bio is scaled using the current hweight and then
      charged immediately.  Until the global vtime advances beyond this
      future vtime, the cgroup won't be allowed to issue normal IOs.
      
      This is incorrect and can lead to, for example, exploding vrate or
      extended stalls if vrate range is constrained.  Consider the following
      scenario.
      
      1. A cgroup with a very low hweight runs out of budget.
      
      2. A storm of swap-out happens on it.  All of them are scaled
         according to the current low hweight and charged to vtime pushing
         it to a far future.
      
      3. All other cgroups go idle and now the above cgroup has access to
         the whole device.  However, because vtime is already wound using
         the past low hweight, what its current hweight is doesn't matter
         until global vtime catches up to the local vtime.
      
      4. As a result, either vrate gets ramped up extremely or the IOs stall
         while the underlying device is idle.
      
      This is because the hweight the overage is calculated at is different
      from the hweight that it's being paid at.
      
      Fix it by remembering the overage in absolute vtime and continuously
      paying with the actual budget according to the current hweight at each
      period.
      
      Note that non-forced bios which wait already remember the cost in
      absolute vtime.  This brings forced-bio accounting in line.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
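The scenario in this commit message can be put into numbers with a rough model. Everything here is an assumption made for illustration: hweight is expressed as an integer percentage, each period grants a fixed amount of vtime budget, and the two functions only count how many periods it takes to clear an overage under each accounting scheme.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def periods_with_pushed_vtime(abs_cost: int, hweight_at_charge_pct: int,
                              period_vtime: int) -> int:
    """Old scheme: scale by the hweight at charge time and push vtime ahead."""
    overage_vtime = ceil_div(abs_cost * 100, hweight_at_charge_pct)
    return ceil_div(overage_vtime, period_vtime)


def periods_with_abs_debt(abs_cost: int, hweight_now_pct: int,
                          period_vtime: int) -> int:
    """New scheme: keep absolute debt and pay it at the current hweight."""
    paid_per_period = max(1, period_vtime * hweight_now_pct // 100)
    return ceil_div(abs_cost, paid_per_period)
```

With an absolute cost of 1000 charged at 1% hweight and a period budget of 1000, the pushed-vtime model needs 100 periods no matter how the hweight changes afterwards, whereas the absolute-debt model clears in a single period once the cgroup holds 100% of the device.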
    • blk-iocost: Fix incorrect operation order during iocg free · e036c4ca
      Committed by Tejun Heo
      ioc_pd_free() first cancels the hrtimers and then deactivates the
      iocg.  However, the iocg timer can run in between and reschedule the
      hrtimers, which will end up running after the iocg is freed, leading to
      crashes like the following.
      
        general protection fault: 0000 [#1] SMP
        ...
        RIP: 0010:iocg_kick_delay+0xbe/0x1b0
        RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
        RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
        RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
        R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
        FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         iocg_delay_timer_fn+0x3d/0x60
         __hrtimer_run_queues+0xfe/0x270
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x5e/0x120
         apic_timer_interrupt+0xf/0x20
         </IRQ>
      
      Fix it by canceling hrtimers after deactivating the iocg.
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
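The ordering problem can be illustrated with a single-threaded toy model. This is not the kernel code: the `Iocg` class, its methods, and the idea that the periodic timer re-arms the hrtimer only while the iocg is active are simplifying assumptions. The racing timer run is simulated by calling it between the two teardown steps.

```python
class Iocg:
    def __init__(self) -> None:
        self.active = True
        self.hrtimer_armed = True

    def period_timer_runs(self) -> None:
        # Models the iocg timer: it re-arms hrtimers only for active iocgs.
        if self.active:
            self.hrtimer_armed = True

    def cancel_hrtimers(self) -> None:
        self.hrtimer_armed = False

    def deactivate(self) -> None:
        self.active = False


def free_cancel_first(iocg: Iocg) -> bool:
    """Original order; returns True if an hrtimer outlives the free."""
    iocg.cancel_hrtimers()
    iocg.period_timer_runs()  # racing timer run re-arms the hrtimer
    iocg.deactivate()
    return iocg.hrtimer_armed


def free_deactivate_first(iocg: Iocg) -> bool:
    """Fixed order; the racing timer sees an inactive iocg and does nothing."""
    iocg.deactivate()
    iocg.period_timer_runs()
    iocg.cancel_hrtimers()
    return iocg.hrtimer_armed
```

In the first ordering the model leaves an hrtimer armed after teardown, which corresponds to the use-after-free in the trace above; in the second nothing remains armed.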