1. 27 9月, 2019 1 次提交
  2. 26 9月, 2019 5 次提交
    • T
      iocost: bump up default latency targets for hard disks · 7afcccaf
      Tejun Heo 提交于
      The default hard disk param sets latency targets at 50ms.  As the
      default target percentiles are zero, these don't directly regulate
      vrate; however, they're still used to calculate the period length -
      100ms in this case.
      
      This is excessively low.  A SATA drive with QD32 saturated with random
      IOs can easily reach avg completion latency of several hundred msecs.
      A period duration which is substantially lower than avg completion
      latency can lead to wildly fluctuating vrate.
      
      Let's bump up the default latency targets to 250ms so that the period
      duration is sufficiently long.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7afcccaf
    • T
      iocost: improve nr_lagging handling · 7cd806a9
      Tejun Heo 提交于
      Some IOs may span multiple periods.  As latencies are collected on
      completion, the inbetween periods won't register them and may
      incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
      avoid those situations.  Currently, whenever there are IOs which are
      spanning from the previous period, busy_level is reset to 0 if
      negative thus suppressing vrate increase.
      
      This has the following two problems.
      
      * When latency target percentiles aren't set, vrate adjustment should
        only be governed by queue depth depletion; however, the current code
        keeps nr_lagging active which pulls in latency results and can keep
        down vrate unexpectedly.
      
      * When lagging condition is detected, it resets the entire negative
        busy_level.  This turned out to be way too aggressive on some
        devices which sometimes experience extended latencies on a small
        subset of commands.  In addition, a lagging IO will be accounted as
        latency target miss on completion anyway and resetting busy_level
        amplifies its impact unnecessarily.
      
      This patch fixes the above two problems by disabling nr_lagging
      counting when latency target percentiles aren't set and blocking vrate
      increases when there are lagging IOs while leaving busy_level as-is.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7cd806a9
    • T
      iocost: better trace vrate changes · 25d41e4a
      Tejun Heo 提交于
      vrate_adj tracepoint traces vrate changes; however, it does so only
      when busy_level is non-zero.  busy_level turning to zero can sometimes
      be as interesting an event.  This patch also enables vrate_adj
      tracepoint on other vrate related events - busy_level changes and
      non-zero nr_lagging.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      25d41e4a
    • M
      block: don't release queue's sysfs lock during switching elevator · b89f625e
      Ming Lei 提交于
      cecf5d87 ("block: split .sysfs_lock into two locks") starts to
      release & acquire sysfs_lock before registering/un-registering elevator
      queue during switching elevator for avoiding potential deadlock from
      showing & storing 'queue/iosched' attributes and removing elevator's
      kobject.
      
      Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
      required in .show & .store of queue/iosched's attributes, and just
      elevator's sysfs lock is acquired in elv_iosched_store() and
      elv_iosched_show(). So it is safe to hold queue's sysfs lock when
      registering/un-registering elevator queue.
      
      The biggest issue is that commit cecf5d87 assumes that concurrent
      write on 'queue/scheduler' can't happen. However, this assumption isn't
      true, because kernfs_fop_write() only guarantees that concurrent write
      aren't called on the same open file, but the write could be from
      different open on the file. So we can't release & re-acquire queue's
      sysfs lock during switching elevator, otherwise use-after-free on
      elevator could be triggered.
      
      Fixes the issue by not releasing queue's sysfs lock during switching
      elevator.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b89f625e
    • M
      blk-mq: move lockdep_assert_held() into elevator_exit · 284b94be
      Ming Lei 提交于
      Commit c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
      lockdep_assert_held() called in blk_mq_sched_free_requests() which is
      run in failure path of elevator_init_mq().
      
      blk_mq_sched_free_requests() is called in the following 3 functions:
      
      	elevator_init_mq()
      	elevator_exit()
      	blk_cleanup_queue()
      
      In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
      by 'mutex_lock(&q->sysfs_lock)'.
      
      So moving the lockdep_assert_held() from blk_mq_sched_free_requests()
      into elevator_exit() for fixing the report by syzbot.
      
      Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
      Fixed: c48dac13 ("block: don't hold q->sysfs_lock in elevator_init_mq")
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      284b94be
  3. 24 9月, 2019 1 次提交
  4. 23 9月, 2019 1 次提交
    • M
      block: t10-pi: fix -Wswitch warning · be21683e
      Max Gurtovoy 提交于
      Changing the switch() statement to symbolic constants made the compiler
      (at least clang-9, did not check gcc) notice that there is one enum value
      that is not handled here:
      
      block/t10-pi.c:62:11: error: enumeration value 'T10_PI_TYPE0_PROTECTION'
      not handled in switch [-Werror,-Wswitch]
      
      Add a BUG_ON statement if we ever get to t10_pi_verify function with
      TYPE0 and replace the switch() statement with if/else clause for the
      valid types.
      
      Fixes: 9b2061b1a262 ("block: use symbolic constants for t10_pi type")
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      be21683e
  5. 18 9月, 2019 6 次提交
  6. 16 9月, 2019 2 次提交
  7. 15 9月, 2019 1 次提交
  8. 12 9月, 2019 2 次提交
    • M
      block: fix race between switching elevator and removing queues · 0a67b5a9
      Ming Lei 提交于
      cecf5d87 ("block: split .sysfs_lock into two locks") starts to
      release & actuire sysfs_lock again during switching elevator. So it
      isn't enough to prevent switching elevator from happening by simply
      clearing QUEUE_FLAG_REGISTERED with holding sysfs_lock, because
      in-progress switch still can move on after re-acquiring the lock,
      meantime the flag of QUEUE_FLAG_REGISTERED won't get checked.
      
      Fixes this issue by checking 'q->elevator' directly & locklessly after
      q->kobj is removed in blk_unregister_queue(), this way is safe because
      q->elevator can't be changed at that time.
      
      Fixes: cecf5d87 ("block: split .sysfs_lock into two locks")
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0a67b5a9
    • S
      block: bypass blk_set_runtime_active for uninitialized q->dev · 8a15b4d7
      Stanley Chu 提交于
      Some devices may skip blk_pm_runtime_init() and have null pointer
      in its request_queue->dev. For example, SCSI devices of UFS Well-Known
      LUNs.
      
      Currently the null pointer is checked by the user of
      blk_set_runtime_active(), i.e., scsi_dev_type_resume(). It is better to
      check it by blk_set_runtime_active() itself instead of by its users.
      Signed-off-by: NStanley Chu <stanley.chu@mediatek.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8a15b4d7
  9. 11 9月, 2019 4 次提交
    • T
      iocost_monitor: Report debt · 7c1ee704
      Tejun Heo 提交于
      Report debt and rename del_ms row to delay for consistency.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7c1ee704
    • T
      blk-iocost: Don't let merges push vtime into the future · e1518f63
      Tejun Heo 提交于
      Merges have the same problem that forced-bios had which is fixed by
      the previous patch.  The cost of a merge is calculated at the time of
      issue and force-advances vtime into the future.  Until global vtime
      catches up, how the cgroup's hweight changes in the meantime doesn't
      matter and it often leads to situations where the cost is calculated
      at one hweight and paid at a very different one.  See the previous
      patch for more details.
      
      Fix it by never advancing vtime into the future for merges.  If budget
      is available, vtime is advanced.  Otherwise, the cost is charged as
      debt.
      
      This brings merge cost handling in line with issue cost handling in
      ioc_rqos_throttle().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e1518f63
    • T
      blk-iocost: Account force-charged overage in absolute vtime · 36a52481
      Tejun Heo 提交于
      Currently, when a bio needs to be force-charged and there isn't enough
      budget, vtime is simply pushed into the future.  This means that the
      cost of the whole bio is scaled using the current hweight and then
      charged immediately.  Until the global vtime advances beyond this
      future vtime, the cgroup won't be allowed to issue normal IOs.
      
      This is incorrect and can lead to, for example, exploding vrate or
      extended stalls if vrate range is constrained.  Consider the following
      scenario.
      
      1. A cgroup with a very low hweight runs out of budget.
      
      2. A storm of swap-out happens on it.  All of them are scaled
         according to the current low hweight and charged to vtime pushing
         it to a far future.
      
      3. All other cgroups go idle and now the above cgroup has access to
         the whole device.  However, because vtime is already wound using
         the past low hweight, what its current hweight is doesn't matter
         until global vtime catches up to the local vtime.
      
      4. As a result, either vrate gets ramped up extremely or the IOs stall
         while the underlying device is idle.
      
      This is because the hweight the overage is calculated at is different
      from the hweight that it's being paid at.
      
      Fix it by remembering the overage in absoulte vtime and continuously
      paying with the actual budget according to the current hweight at each
      period.
      
      Note that non-forced bios which wait already remembers the cost in
      absolute vtime.  This brings forced-bio accounting in line.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      36a52481
    • T
      blk-iocost: Fix incorrect operation order during iocg free · e036c4ca
      Tejun Heo 提交于
      ioc_pd_free() first cancels the hrtimers and then deactivates the
      iocg.  However, the iocg timer can run inbetween and reschedule the
      hrtimers which will end up running after the iocg is freed leading to
      crashes like the following.
      
        general protection fault: 0000 [#1] SMP
        ...
        RIP: 0010:iocg_kick_delay+0xbe/0x1b0
        RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
        RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
        RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
        R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
        FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         iocg_delay_timer_fn+0x3d/0x60
         __hrtimer_run_queues+0xfe/0x270
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x5e/0x120
         apic_timer_interrupt+0xf/0x20
         </IRQ>
      
      Fix it by canceling hrtimers after deactivating the iocg.
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e036c4ca
  10. 07 9月, 2019 3 次提交
  11. 06 9月, 2019 6 次提交
    • J
      block: fix elevator_get_by_features() · a2614255
      Jens Axboe 提交于
      The lookup logic is broken - 'e' will never be NULL, even if the
      list is empty. Maintain lookup hit in a separate variable instead.
      
      Fixes: a0958ba7 ("block: Improve default elevator selection")
      Reported-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a2614255
    • D
      block: Delay default elevator initialization · 737eb78e
      Damien Le Moal 提交于
      When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
      the only information known about the device is the number of hardware
      queues as the block device scan by the device driver is not completed
      yet for most drivers. The device type and elevator required features
      are not set yet, preventing to correctly select the default elevator
      most suitable for the device.
      
      This currently affects all multi-queue zoned block devices which default
      to the "none" elevator instead of the required "mq-deadline" elevator.
      These drives currently include host-managed SMR disks connected to a
      smartpqi HBA and null_blk block devices with zoned mode enabled.
      Upcoming NVMe Zoned Namespace devices will also be affected.
      
      Fix this by adding the boolean elevator_init argument to
      blk_mq_init_allocated_queue() to control the execution of
      elevator_init_mq(). Two cases exist:
      1) elevator_init = false is used for calls to
         blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
         case, a call to elevator_init_mq() is added to __device_add_disk(),
         resulting in the delayed initialization of the queue elevator
         after the device driver finished probing the device information. This
         effectively allows elevator_init_mq() access to more information
         about the device.
      2) elevator_init = true preserves the current behavior of initializing
         the elevator directly from blk_mq_init_allocated_queue(). This case
         is used for the special request based DM devices where the device
         gendisk is created before the queue initialization and device
         information (e.g. queue limits) is already known when the queue
         initialization is executed.
      
      Additionally, to make sure that the elevator initialization is never
      done while requests are in-flight (there should be none when the device
      driver calls device_add_disk()), freeze and quiesce the device request
      queue before calling blk_mq_init_sched() in elevator_init_mq().
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      737eb78e
    • D
      block: Improve default elevator selection · a0958ba7
      Damien Le Moal 提交于
      For block devices that do not specify required features, preserve the
      current default elevator selection (mq-deadline for single queue
      devices, none for multi-queue devices). However, for devices specifying
      required features (e.g. zoned block devices ELEVATOR_F_ZBD_SEQ_WRITE
      feature), select the first available elevator providing the required
      features.
      
      In all cases, default to "none" if no elevator is available or if the
      initialization of the default elevator fails.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a0958ba7
    • D
      block: Introduce elevator features · 68c43f13
      Damien Le Moal 提交于
      Introduce the definition of elevator features through the
      elevator_features flags in the elevator_type structure. Each flag can
      represent a feature supported by an elevator. The first feature defined
      by this patch is support for zoned block device sequential write
      constraint with the flag ELEVATOR_F_ZBD_SEQ_WRITE, which is implemented
      by the mq-deadline elevator using zone write locking.
      
      Other possible features are IO priorities, write hints, latency targets
      or single-LUN dual-actuator disks (for which the elevator could maintain
      one LBA ordered list per actuator).
      
      The required_elevator_features field is also added to the request_queue
      structure to allow a device driver to specify elevator feature flags
      that an elevator must support for the correct operation of the device
      (e.g. device drivers for zoned block devices can have the
      ELEVATOR_F_ZBD_SEQ_WRITE flag as a required feature).
      The helper function blk_queue_required_elevator_features() is
      defined for setting this new field.
      
      With these two new fields in place, the elevator functions
      elevator_match() and elevator_find() are modified to allow a user to set
      only an elevator with a set of features that satisfies the device
      required features. Elevators not matching the device requirements are
      not shown in the device sysfs queue/scheduler file to prevent their use.
      
      The "none" elevator can always be selected as before.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      68c43f13
    • D
      block: Change elevator_init_mq() to always succeed · 954b4a5c
      Damien Le Moal 提交于
      If the default elevator chosen is mq-deadline, elevator_init_mq() may
      return an error if mq-deadline initialization fails, leading to
      blk_mq_init_allocated_queue() returning an error, which in turn will
      cause the block device initialization to fail and the device not being
      exposed.
      
      Instead of taking such extreme measure, handle mq-deadline
      initialization failures in the same manner as when mq-deadline is not
      available (no module to load), that is, default to the "none" scheduler.
      With this change, elevator_init_mq() return type can be changed to void.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      954b4a5c
    • D
      block: Cleanup elevator_init_mq() use · 61db437d
      Damien Le Moal 提交于
      Instead of checking a queue tag_set BLK_MQ_F_NO_SCHED flag before
      calling elevator_init_mq() to make sure that the queue supports IO
      scheduling, use the elevator.c function elv_support_iosched() in
      elevator_init_mq(). This does not introduce any functional change but
      ensure that elevator_init_mq() does the right thing based on the queue
      settings.
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      61db437d
  12. 03 9月, 2019 3 次提交
  13. 30 8月, 2019 1 次提交
  14. 29 8月, 2019 4 次提交
    • T
      blkcg: fix missing free on error path of blk_iocost_init() · 3532e722
      Tejun Heo 提交于
      blk_iocost_init() forgot to free its percpu stat on the error path.
      Fix it.
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Reported-by: NHillf Danton <hdanton@sina.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3532e722
    • T
      blkcg: add tools/cgroup/iocost_coef_gen.py · 8504dea7
      Tejun Heo 提交于
      Add a script which can be used to generate device-specific iocost
      linear model coefficients.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8504dea7
    • T
      blkcg: add tools/cgroup/iocost_monitor.py · 6954ff18
      Tejun Heo 提交于
      Instead of mucking with debugfs and ->pd_stat(), add drgn based
      monitoring script.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6954ff18
    • T
      blkcg: implement blk-iocost · 7caa4715
      Tejun Heo 提交于
      This patchset implements IO cost model based work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such model.  The more accurate the cost
      model the better but the controller adapts based on IO completion
      latency and as long as the relative costs across differents IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7caa4715