1. 10 December 2016, 2 commits
  2. 09 December 2016, 2 commits
    • block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig authored
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranged discards
      and for removing discard_zeroes_data in favor of always using
      REQ_OP_WRITE_ZEROES, so it would be good to get it in quickly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
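      The scheme in a nutshell, as a hedged sketch: the struct below is
      heavily elided, and while RQF_SPECIAL_PAYLOAD and
      blk_rq_nr_phys_segments() match what this patch introduces, see
      include/linux/blkdev.h for the real definitions.

        struct request {
                /* ... */
                struct bio_vec special_vec;     /* driver-attached payload */
                unsigned short nr_phys_segments;
                unsigned int rq_flags;          /* RQF_SPECIAL_PAYLOAD when set */
                /* ... */
        };

        /*
         * Segment count as seen by the mapping code: a request carrying
         * a special payload always maps as exactly one segment.
         */
        static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
        {
                if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
                        return 1;
                return rq->nr_phys_segments;
        }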
    • blk-wbt: don't throttle discard or write zeroes · be07e14f
      Christoph Hellwig authored
      Both of these are metadata-only commands that are not issued by the
      writeback code and are not directly relevant to the writeback
      bandwidth.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
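      A simplified, hedged sketch of the resulting classification; the
      real check lives in blk-wbt's throttling path and also applies
      REQ_SYNC/REQ_IDLE heuristics, and the helper name here is
      illustrative.

        /* only plain writes are candidates for writeback throttling */
        static bool wbt_should_throttle_op(struct bio *bio)
        {
                switch (bio_op(bio)) {
                case REQ_OP_DISCARD:
                case REQ_OP_WRITE_ZEROES:
                        return false;   /* metadata-only, not writeback */
                case REQ_OP_WRITE:
                        return true;    /* subject to further heuristics */
                default:
                        return false;   /* reads are never throttled here */
                }
        }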
  3. 06 December 2016, 1 commit
  4. 05 December 2016, 1 commit
    • block: fix unintended fallthrough in generic_make_request_checks() · 58886785
      Nicolai Stange authored
      Since commit e73c23ff ("block: add async variant of
      blkdev_issue_zeroout") messages like the following show up:
      
        EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at
                        logical offset 0 with max blocks 1 with error 95
        EXT4-fs (dm-1): This should not happen!! Data will be lost
      
      Due to the following fallthrough introduced with
      commit 2d253440 ("block: Define zoned block device operations"),
      generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only
      if the block device supports "write same" *and* is a zoned one:
      
        switch (bio_op(bio)) {
        [...]
        case REQ_OP_WRITE_SAME:
              if (!bdev_write_same(bio->bi_bdev))
                      goto not_supported;
        case REQ_OP_ZONE_REPORT:
        case REQ_OP_ZONE_RESET:
              if (!bdev_is_zoned(bio->bi_bdev))
                      goto not_supported;
              break;
        [...]
        }
      
      Thus, although the bio setup as done by __blkdev_issue_write_same() from
      commit e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      would succeed, its actual submission would not, resulting in the
      EOPNOTSUPP == 95.
      
      Fix this by removing the fallthrough which, due to the lack of an explicit
      comment, seems to be unintended anyway.
      
      Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      Fixes: 2d253440 ("block: Define zoned block device operations")
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
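      With the missing break added, the fixed checks read:

        switch (bio_op(bio)) {
        [...]
        case REQ_OP_WRITE_SAME:
              if (!bdev_write_same(bio->bi_bdev))
                      goto not_supported;
              break;
        case REQ_OP_ZONE_REPORT:
        case REQ_OP_ZONE_RESET:
              if (!bdev_is_zoned(bio->bi_bdev))
                      goto not_supported;
              break;
        [...]
        }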
  5. 03 December 2016, 1 commit
  6. 01 December 2016, 4 commits
    • block: factor out req_set_nomerge · e0c72300
      Ritesh Harjani authored
      Factor out the common code for setting the REQ_NOMERGE flag, which
      is duplicated in a few places, into a helper, req_set_nomerge().
      Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>

      Get rid of the inline.
      Signed-off-by: Jens Axboe <axboe@fb.com>
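      The helper is tiny; a sketch of the factored-out code, matching the
      description above:

        static void req_set_nomerge(struct request_queue *q, struct request *req)
        {
                req->cmd_flags |= REQ_NOMERGE;
                if (req == q->last_merge)
                        q->last_merge = NULL;
        }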
    • block: add support for REQ_OP_WRITE_ZEROES · a6f0788e
      Chaitanya Kulkarni authored
      This adds a new block layer operation to zero out a range of LBAs.
      It makes it possible to implement zeroing for devices that support
      neither discard with a predictable zero pattern nor WRITE SAME of
      zeroes. The prominent example is NVMe with the Write Zeroes command,
      but in the future this should also help with improving the way
      zeroing discards work. A suitable entry is exported in sysfs
      indicating the maximum number of bytes the device allows in one
      write zeroes operation.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
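      A hedged usage sketch from a submitter's point of view; the helper
      name follows the queue limit added here, and a limit of zero means
      the device does not support the operation.

        /* bail out early if the device cannot do REQ_OP_WRITE_ZEROES */
        if (!bdev_write_zeroes_sectors(bdev))
                return -EOPNOTSUPP;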
    • block: add async variant of blkdev_issue_zeroout · e73c23ff
      Chaitanya Kulkarni authored
      Similar to __blkdev_issue_discard this variant allows submitting
      the final bio asynchronously and chaining multiple ranges
      into a single completion.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
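      A hedged sketch of chaining several ranges into a single completion
      with the new variant (error paths elided; the trailing bool selects
      discard-based zeroing):

        struct bio *bio = NULL;
        int ret;

        /* queue up two ranges; the bios are chained through 'bio' */
        ret = __blkdev_issue_zeroout(bdev, sect0, cnt0, GFP_KERNEL,
                                     &bio, false);
        if (!ret)
                ret = __blkdev_issue_zeroout(bdev, sect1, cnt1, GFP_KERNEL,
                                             &bio, false);

        if (bio) {
                /* one wait covers everything chained above */
                ret = submit_bio_wait(bio);
                bio_put(bio);
        }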
    • block: Check partition alignment on zoned block devices · b02d8aae
      Damien Le Moal authored
      Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
      a zoned block device. However, the first and last zones reported for a
      partition make sense only if the partition start sector and size are aligned
      on the device zone size. The same applies for zone reset. Resetting the first
      or the last zone of a partition straddling zones may impact neighboring
      partitions. Finally, if a partition start sector is not at the beginning of a
      sequential zone, it will be impossible to write to the first sectors of the
      partition on a host-managed device.
      Avoid all these problems and inconsistencies by ignoring partitions
      that are not zone aligned.
      
      Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
      correct disk zoning type (host-aware, host-managed or none) but
      bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
      size is unknown). So test this as a way to ensure that a zoned block device is
      being handled as such. As a result, for host-aware devices, unaligned zone
      partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
      disk will be treated as a regular block device (as it should). If zoned block
      device support is enabled, only aligned partitions will be accepted.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
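      The test itself is simple. A hedged sketch, assuming the zone size
      is a power of two (as zoned-device probing requires) and
      approximating the check described above:

        /* accept only partitions that start and end on a zone boundary */
        static bool part_zone_aligned(struct block_device *bdev,
                                      sector_t from, sector_t size)
        {
                unsigned int zone_size = bdev_zone_size(bdev);

                /* zone size unknown (e.g. CONFIG_BLK_DEV_ZONED disabled):
                 * treat the disk as a regular block device */
                if (!zone_size)
                        return true;

                return !(from & (zone_size - 1)) &&
                       !(size & (zone_size - 1));
        }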
  7. 29 November 2016, 4 commits
  8. 22 November 2016, 3 commits
  9. 18 November 2016, 2 commits
    • blk-mq: make the polling code adaptive · 64f1c21e
      Jens Axboe authored
      The previous commit introduced the hybrid sleep/poll mode. Take
      that one step further, and use the completion latencies to
      automatically sleep for half the mean completion time. This is
      a good approximation.
      
      This changes the 'io_poll_delay' sysfs file a bit to expose the
      various options. Depending on the value, the polling code will
      behave differently:
      
      -1	Never enter hybrid sleep mode
       0	Use half of the completion mean for the sleep delay
      >0	Use this specific value as the sleep delay
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Tested-by: Stephen Bates <sbates@raithlin.com>
      Reviewed-by: Stephen Bates <sbates@raithlin.com>
    • blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf
      Jens Axboe authored
      This patch enables a hybrid polling mode. Instead of polling after IO
      submission, we can induce an artificial delay, and then poll after that.
      For example, if the IO is presumed to complete in 8 usecs from now, we
      can sleep for 4 usecs, wake up, and then do our polling. This still puts
      a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
      after the IO has completed, it'll happen before. With this hybrid
      scheme, we can achieve big latency reductions while still using the same
      (or less) amount of CPU.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Tested-by: Stephen Bates <sbates@raithlin.com>
      Reviewed-by: Stephen Bates <sbates@raithlin.com>
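      Putting the two commits together, the sleep side of hybrid polling
      behaves roughly like this hedged sketch; the hrtimer-based sleep
      mirrors the real code, while the surrounding details are
      simplified.

        /* sleep for an estimated slice of the completion time, then let
         * the caller busy-poll for the remainder */
        static void hybrid_poll_sleep(struct request_queue *q, u64 mean_ns)
        {
                ktime_t kt;
                u64 nsecs;

                if (q->poll_nsec == -1)         /* io_poll_delay = -1 */
                        return;                 /* pure busy polling */

                /* 0 = adaptive (half the mean), >0 = fixed delay */
                nsecs = q->poll_nsec ? q->poll_nsec : mean_ns / 2;

                kt = ktime_set(0, nsecs);
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_hrtimeout(&kt, HRTIMER_MODE_REL);
        }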
  10. 16 November 2016, 2 commits
    • blk-wbt: fix old-style function declaration · 4121d385
      Arnd Bergmann authored
      The newly added code causes a harmless warning in some configurations:
      
      block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration]
       static bool inline stat_sample_valid(struct blk_rq_stat *stat)
      
      This makes it use the expected format for the declaration.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
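      The fix is purely syntactic; 'inline' belongs before the return
      type:

        /* before: triggers -Wold-style-declaration */
        static bool inline stat_sample_valid(struct blk_rq_stat *stat);

        /* after: specifier first, as expected */
        static inline bool stat_sample_valid(struct blk_rq_stat *stat);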
    • block: deal with stale req count of plug list · 0a6219a9
      Ming Lei authored
      In both the legacy and mq paths, the req count of the plug list is
      computed before allocating a request, so the number can be stale by
      the time we fall back to a sleeping allocation; the newly introduced
      wbt can sleep too.

      This patch deals with the case by checking whether the plug list has
      become empty, and fixes the 'BUG: KASAN: stack-out-of-bounds' report
      introduced by Shaohua's patches for dispatching big requests.

      Fixes: 600271d9 ("blk-mq: immediately dispatch big size request")
      Fixes: 50d24c34 ("block: immediately dispatch big size request")
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
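      A hedged, simplified sketch of the submission-path pattern being
      fixed:

        unsigned int request_count = blk_plug_queued_count(q);

        rq = get_request(q, ...);       /* may sleep, as may wbt_wait() */

        plug = current->plug;
        if (plug) {
                /* the early count can be stale: the plug may have been
                 * flushed while we slept in the allocator */
                if (list_empty(&plug->list))
                        request_count = 0;
                /* [...] plug or flush based on request_count */
        }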
  11. 15 November 2016, 1 commit
  12. 12 November 2016, 5 commits
  13. 11 November 2016, 4 commits
    • block: hook up writeback throttling · 87760e5e
      Jens Axboe authored
      Enable throttling of buffered writeback to make it a lot smoother,
      with far less impact on other system activity. Background writeback
      should be, by definition, background activity. The fact that we
      flush huge bundles of it at a time means that it potentially has a
      heavy impact on foreground workloads, which isn't ideal. We can't
      easily limit the sizes of the writes that we do, since that would
      impact file system layout in the presence of delayed allocation. So
      just throttle back buffered writeback, unless someone is waiting
      for it.
      
      The algorithm for when to throttle takes its inspiration from the
      CoDel network scheduling algorithm. Like CoDel, blk-wb monitors the
      minimum latencies of requests over a window of time. In that window
      of time, if the minimum latency of any request exceeds a given
      target, then a scale count is incremented and the queue depth is
      shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, we simply
      decrement the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to go negative. This
      happens if we primarily have writes going on. Unlike positive scale
      counts, this doesn't change the size of the monitoring window. When
      the heavy writers finish, blk-wb quickly snaps back to its stable
      state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the
      latency target to be met. It defaults to 2 msec for non-rotational
      storage, and 75 msec for rotational storage. Setting this value to
      '0' disables blk-wb. Generally, a user would not have to touch this
      setting.
      
      We don't enable WBT on devices that are managed with CFQ and have a
      non-root block cgroup attached. If we have a proportional share
      setup on this particular disk, then the wbt throttling would
      interfere with it. We don't have a strong need for wbt in that
      case, since we rely on CFQ doing the throttling for us.
      Signed-off-by: Jens Axboe <axboe@fb.com>
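      A compact, hedged sketch of the scaling policy described above; all
      names here are illustrative, not blk-wbt's actual internals.

        struct demo_wb {
                int scale_step;         /* < 0: writes dominate, open up */
                u64 win_nsec;           /* current monitoring window */
                unsigned int background, normal, max;  /* depth limits */
        };

        static void demo_wb_calc_limits(struct demo_wb *wb);

        /* called when a monitoring window closes */
        static void demo_wb_step(struct demo_wb *wb, bool lat_exceeded)
        {
                if (lat_exceeded) {
                        wb->scale_step++;       /* shrink the queue depth */
                        wb->win_nsec >>= 1;     /* and the next window */
                } else {
                        wb->scale_step--;       /* may go negative; the
                                                   window size stays put */
                }
                demo_wb_calc_limits(wb);        /* limits for this step */
        }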
    • blk-wbt: add general throttling mechanism · e34cbd30
      Jens Axboe authored
      We can hook this up to the block layer, to help throttle buffered
      writes.
      
      wbt registers a few trace points that can be used to track what is
      happening in the system:
      
      wbt_lat: 259:0: latency 2446318
      wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
                     wmean=518866, wmin=15522, wmax=5330353, wsamples=57
      wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
      
      This shows a sync issue event (wbt_lat) that exceeded its latency
      target. wbt_stat dumps the current read/write stats for that window,
      and wbt_step shows a step-down event where we now scale back writes.
      Each trace includes the device, 259:0 in this case.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: add scalable completion tracking of requests · cf43e6be
      Jens Axboe authored
      For legacy block, we simply track the completion stats in the
      request queue. For blk-mq, we track them on a per-software-queue
      basis, which we can then sum up through the hardware queues and
      finally into a per-device state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users can turn it on by setting QUEUE_FLAG_STATS in the queue flags.
      We currently don't turn it on when someone merely reads any of the
      stats files; that is something we could add as well.
      Signed-off-by: Jens Axboe <axboe@fb.com>
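      A hedged sketch of the per-window bookkeeping; the real code keeps
      separate read and write buckets, per software queue for blk-mq.

        struct demo_rq_stat {
                u64 min, max, sum, nr_samples;  /* one ~0.1s window */
                u64 window_start;
        };

        static void demo_stat_add(struct demo_rq_stat *s, u64 now, u64 lat)
        {
                if (now - s->window_start > 100 * NSEC_PER_MSEC) {
                        /* window expired: start a fresh one */
                        s->min = -1ULL;
                        s->max = s->sum = s->nr_samples = 0;
                        s->window_start = now;
                }
                s->min = min(s->min, lat);
                s->max = max(s->max, lat);
                s->sum += lat;  /* mean = sum / nr_samples on readout */
                s->nr_samples++;
        }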
    • block: cfq_cpd_alloc() should use @gfp · ebc4ff66
      Tejun Heo authored
      cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq,
      was incorrectly hard-coding GFP_KERNEL instead of using the mask
      specified through the @gfp parameter. This currently doesn't cause
      any actual issues because all current callers specify GFP_KERNEL.
      Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: e4a9bde9 ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
      Signed-off-by: Jens Axboe <axboe@fb.com>
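      The fix in essence, sketched; the allocator hands back the policy
      data embedded in cfq's per-cgroup structure:

        static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
        {
                struct cfq_group_data *cgd;

                cgd = kzalloc(sizeof(*cgd), gfp);       /* was GFP_KERNEL */
                if (!cgd)
                        return NULL;
                return &cgd->cpd;
        }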
  14. 09 November 2016, 1 commit
  15. 07 November 2016, 1 commit
    • blk-mq: Always schedule hctx->next_cpu · c02ebfdd
      Gabriel Krisman Bertazi authored
      Commit 0e87e58b ("blk-mq: improve warning for running a queue on the
      wrong CPU") attempts to avoid triggering the WARN_ON in
      __blk_mq_run_hw_queue when the expected CPU is dead.  Problem is, in the
      last batch execution before round robin, blk_mq_hctx_next_cpu can
      schedule a dead CPU and also update next_cpu to the next alive CPU in
      the mask, which will trigger the WARN_ON despite the previous
      workaround.
      
      The following patch fixes this scenario by always scheduling the value
      in hctx->next_cpu.  This changes the moment when we round-robin the CPU
      running the hctx, but it really doesn't matter, since it still executes
      BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
      
      Fixes: 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
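      A hedged sketch of the fixed helper: the batch countdown and the
      round-robin advance happen first, and whatever ends up in
      hctx->next_cpu is what gets scheduled (single-hw-queue shortcut
      elided):

        static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
        {
                if (--hctx->next_cpu_batch <= 0) {
                        int cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);

                        if (cpu >= nr_cpu_ids)
                                cpu = cpumask_first(hctx->cpumask);
                        hctx->next_cpu = cpu;
                        hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
                }

                /* always schedule next_cpu, even if it just went dead */
                return hctx->next_cpu;
        }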
  16. 06 November 2016, 1 commit
    • block: add code to track actual device queue depth · d278d4a8
      Jens Axboe authored
      For blk-mq, ->nr_requests does track queue depth, at least at init
      time. But for the older queue paths, it's simply a soft setting.
      On top of that, it's generally larger than the hardware setting
      on purpose, to allow backup of requests for merging.
      
      Fill a hole in struct request_queue with a 'queue_depth' member that
      drivers can set, via a small helper, to more closely inform the
      block layer of the real queue depth.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
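      The interface is a single setter plus the stored value; a hedged
      sketch of the helper, with an illustrative SCSI-style call:

        void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
        {
                q->queue_depth = depth;
        }

        /* e.g. from a driver that has just negotiated its real depth */
        blk_set_queue_depth(sdev->request_queue, sdev->queue_depth);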
  17. 04 November 2016, 2 commits
    • blk-mq: immediately dispatch big size request · 600271d9
      Shaohua Li authored
      This is the corresponding part for blk-mq. A disk with multiple
      hardware queues doesn't need this, as we hold at most one request
      there.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: immediately dispatch big size request · 50d24c34
      Shaohua Li authored
      Currently a block plug holds up to 16 non-mergeable requests. This
      makes sense if the request size is small, e.g. to reduce lock
      contention. But if the request size is big enough, we don't need to
      worry about lock contention. Holding such a request makes no sense,
      and it lowers disk utilization.
      
      In practice, this improves throughput by 10% for my raid5 sequential
      write workload.
      
      The size (128k) is arbitrary right now, but it makes sure lock
      contention stays small. This could probably be more intelligent,
      e.g. checking the average size of held requests. Since this is
      mainly for sequential IO, it's probably not worthwhile.
      
      V2: check the last request instead of the first request, so that as
      long as there is one big request we flush the plug.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
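      A hedged sketch of the resulting plug-flush condition in the
      submission path; the 128k threshold is the BLK_PLUG_FLUSH_SIZE
      constant this patch introduces:

        /* flush when the plug is full, or as soon as the last queued
         * request is big enough that batching stops paying off */
        if (request_count >= BLK_MAX_REQUEST_COUNT ||
            (last && blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE))
                blk_flush_plug_list(plug, false);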
  18. 03 November 2016, 3 commits