1. 08 Jan, 2015 1 commit
  2. 21 Dec, 2014 1 commit
  3. 18 Nov, 2014 1 commit
    • J
      blk-mq: add blk_mq_free_hctx_request() · 7c7f2f2b
      Jens Axboe committed
      It's silly to use blk_mq_free_request() which in turn maps the
      request to the hardware queue, for places where we already know
      what the hardware queue is. This saves us an extra mapping of a
      hardware queue on request completion, if the caller knows this
      information already.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 12 Nov, 2014 1 commit
  5. 30 Oct, 2014 2 commits
    • J
      blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag · e167dfb5
      Jens Axboe committed
      Drivers can now tell blk-mq if they take advantage of the deferred
      issue through 'last' or not. If they do, don't do queue-direct
      for sync IO. This is a preparation patch for the nvme conversion.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • J
      blk-mq: add a 'list' parameter to ->queue_rq() · 74c45052
      Jens Axboe committed
      Since we have the notion of a 'last' request in a chain, we can use
      this to have the hardware optimize the issuing of requests. Add
      a list_head parameter to queue_rq that the driver can use to
      temporarily store hw commands for issue when 'last' is true. If we
      are doing a chain of requests, pass in a NULL list for the first
      request to force issue of that immediately, then batch the remainder
      for deferred issue until the last request has been sent.
      
      Instead of adding yet another argument to the hot ->queue_rq path,
      encapsulate the passed arguments in a blk_mq_queue_data structure.
      This is passed as a constant, and has been tested as faster than
      passing 4 (or even 3) args through ->queue_rq. Update drivers for
      the new ->queue_rq() prototype. There are no functional changes
      in this patch for drivers - if they don't use the passed in list,
      then they will just queue requests individually like before.
      Signed-off-by: Jens Axboe <axboe@fb.com>
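The batching behaviour described in this commit message can be modelled in userspace. The following is a minimal Python sketch, not kernel code: `QueueData`, `ToyDriver` and `dispatch_chain` are hypothetical names, and `QueueData` only stands in for the idea of bundling the arguments into one `blk_mq_queue_data`-style structure. It assumes a driver that holds commands in the passed list until `last` is set.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class QueueData:
    """Hypothetical stand-in for the bundled ->queue_rq() arguments."""
    rq: str                 # the request being queued
    batch: Optional[list]   # None forces immediate issue; else a holding list
    last: bool              # True for the final request of a chain


@dataclass
class ToyDriver:
    """Hypothetical driver; each entry in 'issued' is one hardware submission."""
    issued: List[List[str]] = field(default_factory=list)

    def queue_rq(self, bd: QueueData) -> None:
        if bd.batch is None:
            # No list supplied: issue this request on its own, right away.
            self.issued.append([bd.rq])
            return
        bd.batch.append(bd.rq)
        if bd.last:
            # Final request of the chain: submit everything held so far.
            self.issued.append(list(bd.batch))
            bd.batch.clear()


def dispatch_chain(driver: ToyDriver, requests: List[str]) -> None:
    """Issue the first request immediately, then batch the rest until 'last'."""
    pending: list = []
    for i, rq in enumerate(requests):
        is_last = i == len(requests) - 1
        batch = None if i == 0 else pending
        driver.queue_rq(QueueData(rq=rq, batch=batch, last=is_last))
```

With a chain of four requests, this model produces two submissions: the first request alone, then the remaining three together. A driver that ignores the list would simply submit one request at a time, which matches the "no functional changes" note in the message.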
  6. 26 Sep, 2014 1 commit
    • M
      blk-mq: support per-distpatch_queue flush machinery · f70ced09
      Ming Lei committed
      This patch runs a single flush machinery for each blk-mq
      dispatch queue, so that:

      - the current init_request and exit_request callbacks can
      cover the flush request too, so the buggy copy-based way of
      initializing the flush request's pdu can be fixed

      - flush performance improves with multiple hardware queues

      In an fio sync write test over virtio-blk (4 hw queues, ioengine=sync,
      iodepth=64, numjobs=4, bs=4K), throughput increased substantially in
      my test environment:
      	- throughput: +70% in case of virtio-blk over null_blk
      	- throughput: +30% in case of virtio-blk over SSD image
      
      The multi-virtqueue feature isn't merged into QEMU yet; patches for
      the feature can be found in the tree below:
      
      	git://kernel.ubuntu.com/ming/qemu.git  	v2.1.0-mq.4
      
      Simply passing 'num_queues=4 vectors=5' should be enough to
      enable the multi-queue (quad-queue) feature for QEMU virtio-blk.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 25 Sep, 2014 1 commit
    • T
      blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode · 17497acb
      Tejun Heo committed
      blk-mq uses percpu_ref for its usage counter, which tracks the number
      of in-flight commands and is used to synchronously drain the queue on
      freeze.  percpu_ref shutdown takes measurable wallclock time as it
      involves a sched RCU grace period.  This means that draining a blk-mq
      queue takes measurable wallclock time.  One would think that this
      shouldn't matter as queue shutdown should be a rare event which takes
      place asynchronously w.r.t. userland.
      
      Unfortunately, SCSI probing involves synchronously setting up and then
      tearing down a lot of request_queues back-to-back for non-existent
      LUNs.  This means that SCSI probing may take above ten seconds when
      scsi-mq is used.
      
        [    0.949892] scsi host0: Virtio SCSI HBA
        [    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
        [    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
        [    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
      
        <stall>
      
        [   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
        [   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
        [   16.194099] osd: LOADED open-osd 0.2.1
        [   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
        [   16.208478] sd 0:0:0:0: [sda] Write Protect is off
        [   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
        [   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
        [   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
        [   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      
      This is also the reason why request_queues start in bypass mode which
      is ended on blk_register_queue() as shutting down a fully functional
      queue also involves a RCU grace period and the queues for non-existent
      SCSI devices never reach registration.
      
      blk-mq basically needs to do the same thing - start the mq in a
      degraded mode which is faster to shut down and then make it fully
      functional only after the queue reaches registration.  percpu_ref
      recently grew facilities to force atomic operation until explicitly
      switched to percpu mode, which can be used for this purpose.  This
      patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
      switch it to percpu mode only once blk_register_queue() is reached.
      
      Note that this issue was previously worked around by 0a30288d
      ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
      probe") for v3.17.  The temp fix was reverted in preparation of adding
      persistent atomic mode to percpu_ref by 9eca8046 ("Revert "blk-mq,
      percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
      This patch and the prerequisite percpu_ref changes will be merged
      during v3.18 devel cycle.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
      Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
      Reviewed-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
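To make the trade-off in this commit message concrete, here is a toy Python model, offered only as a sketch. `ToyPercpuRef` and its methods are hypothetical and do not mirror the real percpu_ref API; the model simply counts how many grace periods a shutdown would have to wait for, assuming that only percpu mode needs one.

```python
class ToyPercpuRef:
    """Toy model: counts references and grace periods needed at shutdown."""

    def __init__(self, nr_cpus: int, start_atomic: bool = True):
        self.atomic = start_atomic
        self.atomic_count = 1          # initial reference held by the owner
        self.percpu = [0] * nr_cpus    # only used while in percpu mode
        self.grace_periods_waited = 0

    def get(self, cpu: int) -> None:
        if self.atomic:
            self.atomic_count += 1     # shared counter: slower, but simple
        else:
            self.percpu[cpu] += 1      # per-CPU counter: fast path

    def put(self, cpu: int) -> None:
        if self.atomic:
            self.atomic_count -= 1
        else:
            self.percpu[cpu] -= 1

    def switch_to_percpu(self) -> None:
        """In this model, called once the queue reaches registration."""
        self.atomic = False

    def kill(self) -> int:
        """Drop the initial reference; return the references still held."""
        if not self.atomic:
            # Folding per-CPU counters back requires waiting for all CPUs.
            self.grace_periods_waited += 1
            self.atomic_count += sum(self.percpu)
            self.percpu = [0] * len(self.percpu)
            self.atomic = True
        self.atomic_count -= 1
        return self.atomic_count
```

A queue that is torn down before registration (the non-existent LUN case) never leaves atomic mode in this model, so `kill()` completes without waiting; only a registered queue pays for the grace period.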
  8. 23 Sep, 2014 5 commits
  9. 16 Aug, 2014 1 commit
  10. 18 Jun, 2014 1 commit
  11. 06 Jun, 2014 1 commit
    • J
      blk-mq: bump max tag depth to 10K tags · a4391c64
      Jens Axboe committed
      For some scsi-mq cases, the tag map can be huge. So increase the
      max number of tags we support.
      
      Additionally, don't fail with EINVAL if a user requests too many
      tags. Warn that the tag depth has been adjusted down, and store
      the new value inside the tag_set passed in.
      Signed-off-by: Jens Axboe <axboe@fb.com>
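The "adjust down and warn instead of failing" behaviour can be sketched as follows. This is an illustrative Python model, not the kernel implementation: `ToyTagSet` and `alloc_tag_set` are hypothetical names, and the cap of 10240 is an assumption, since the message only says "10K tags".

```python
import warnings
from dataclasses import dataclass

# Assumed cap for illustration; the commit message only says "10K tags".
MAX_DEPTH = 10240


@dataclass
class ToyTagSet:
    """Hypothetical stand-in for the tag set a driver passes in."""
    queue_depth: int


def alloc_tag_set(tag_set: ToyTagSet, max_depth: int = MAX_DEPTH) -> int:
    """Clamp an oversized depth instead of failing, and store the new value."""
    if tag_set.queue_depth > max_depth:
        warnings.warn(
            f"tag depth {tag_set.queue_depth} reduced to {max_depth}",
            RuntimeWarning,
        )
        # The caller sees the adjusted depth in the structure it passed in.
        tag_set.queue_depth = max_depth
    return 0  # success; this case is no longer an EINVAL-style failure
```

Writing the adjusted depth back into the caller's structure means a driver can read it afterwards and size its own resources accordingly.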
  12. 05 Jun, 2014 1 commit
  13. 30 May, 2014 2 commits
  14. 29 May, 2014 2 commits
  15. 28 May, 2014 5 commits
  16. 24 May, 2014 1 commit
  17. 22 May, 2014 1 commit
  18. 21 May, 2014 1 commit
    • J
      blk-mq: allow changing of queue depth through sysfs · e3a2b3f9
      Jens Axboe committed
      For request_fn based devices, the block layer exports a 'nr_requests'
      file through sysfs to allow adjusting of queue depth on the fly.
      Currently this returns -EINVAL for blk-mq, since it's not wired up.
      Wire this up for blk-mq, so that it now also allows dynamic
      adjustments of the allowed queue depth for any given block device
      managed by blk-mq.
      Signed-off-by: Jens Axboe <axboe@fb.com>
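From userspace, the 'nr_requests' file is read and written like any other sysfs attribute. The helpers below are a small Python sketch under the assumption that the attribute lives at `/sys/block/<dev>/queue/nr_requests`; the function names and the `sysfs_root` parameter (added so the code can be tried against a scratch directory) are hypothetical.

```python
from pathlib import Path


def nr_requests_path(dev: str, sysfs_root: str = "/sys") -> Path:
    """Build the path of the queue depth attribute for one block device."""
    return Path(sysfs_root) / "block" / dev / "queue" / "nr_requests"


def get_queue_depth(dev: str, sysfs_root: str = "/sys") -> int:
    """Read the current queue depth."""
    return int(nr_requests_path(dev, sysfs_root).read_text().strip())


def set_queue_depth(dev: str, depth: int, sysfs_root: str = "/sys") -> None:
    """Request a new queue depth.

    On a real system this needs root, and the kernel may reject or adjust
    the value, so callers should read it back afterwards.
    """
    nr_requests_path(dev, sysfs_root).write_text(f"{depth}\n")
```

For example, `set_queue_depth("vda", 64)` followed by `get_queue_depth("vda")` would show whether the kernel accepted the new depth, where `vda` is only a placeholder device name.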
  19. 20 May, 2014 1 commit
    • J
      blk-mq: switch ctx pending map to the sparser blk_align_bitmap · 1429d7c9
      Jens Axboe committed
      Each hardware queue has a bitmap of software queues with pending
      requests. When new IO is queued on a software queue, the bit is
      set, and when IO is pruned on a hardware queue run, the bit is
      cleared. This causes a lot of traffic. Switch this from the regular
      BITS_PER_LONG bitmap to a sparser layout, similarly to what was
      done for blk-mq tagging.
      
      A 20% performance increase was observed for single threaded IO, and
      about a 15% performance increase with multiple threads driving the
      same device.
      Signed-off-by: Jens Axboe <axboe@fb.com>
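The sparser layout can be pictured as a bitmap that deliberately uses only a few bits of each word, so that neighbouring software queues land in different words. The class below is a rough Python sketch of that indexing only; `SparseBitmap` is a hypothetical name, the default of 8 bits per word is an assumption, and the cache-line alignment of each word, which is the point of the kernel change, cannot be expressed in Python.

```python
class SparseBitmap:
    """Toy pending map that uses only a few bits of each word."""

    def __init__(self, nr_bits: int, bits_per_word: int = 8):
        self.bits_per_word = bits_per_word
        nr_words = (nr_bits + bits_per_word - 1) // bits_per_word
        # In the kernel each word would sit in its own cache line;
        # a plain list of ints stands in for that here.
        self.words = [0] * nr_words

    def _locate(self, bit: int):
        """Map a global bit number to (word index, offset inside the word)."""
        return divmod(bit, self.bits_per_word)

    def set(self, bit: int) -> None:
        word, off = self._locate(bit)
        self.words[word] |= 1 << off

    def clear(self, bit: int) -> None:
        word, off = self._locate(bit)
        self.words[word] &= ~(1 << off)

    def test(self, bit: int) -> bool:
        word, off = self._locate(bit)
        return bool(self.words[word] & (1 << off))

    def pending(self):
        """Return all set bits in ascending order, as a queue run would scan."""
        return [
            w * self.bits_per_word + off
            for w, value in enumerate(self.words)
            for off in range(self.bits_per_word)
            if value & (1 << off)
        ]
```

With 8 bits per word, software queues 0 and 9 update different words, so in the real layout their set and clear operations would touch different cache lines.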
  20. 14 May, 2014 1 commit
    • J
      blk-mq: improve support for shared tags maps · 0d2602ca
      Jens Axboe committed
      This adds support for active queue tracking, meaning that the
      blk-mq tagging maintains a count of active users of a tag set.
      This allows us to maintain a notion of fairness between users,
      so that we can distribute the tag depth evenly without starving
      some users while allowing others to try unfair deep queues.
      
      If sharing of a tag set is detected, each hardware queue will
      track the depth of its own queue. And if this exceeds the total
      depth divided by the number of active queues, the user is actively
      throttled down.
      
      The active queue count is done lazily to avoid bouncing that data
      between submitter and completer. Each hardware queue gets marked
      active when it allocates its first tag, and gets marked inactive
      when 1) the last tag is cleared, and 2) the queue timeout grace
      period has passed.
      Signed-off-by: Jens Axboe <axboe@fb.com>
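The throttling rule in this message reduces to a short calculation: each active queue may hold at most its share of the total depth. The Python sketch below illustrates it under assumptions; `fair_share` and `may_queue` are hypothetical names, and rounding the share up is a choice made here, not something the message states.

```python
def fair_share(total_depth: int, active_queues: int) -> int:
    """Per-queue share of a tag set: total depth over active users."""
    if active_queues <= 1:
        return total_depth
    # Ceiling division, so the shares together still cover the whole depth.
    return -(-total_depth // active_queues)


def may_queue(in_flight: int, total_depth: int,
              active_queues: int, shared: bool) -> bool:
    """Decide whether a hardware queue may allocate another tag."""
    if not shared:
        # Fairness tracking only applies once tag set sharing is detected.
        return True
    return in_flight < fair_share(total_depth, active_queues)
```

For a shared tag set of depth 64 with 4 active queues, each queue is throttled once it has 16 tags in flight. When queues become inactive, the share of the remaining ones grows again.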
  21. 09 May, 2014 1 commit
    • J
      blk-mq: implement new and more efficient tagging scheme · 4bb659b1
      Jens Axboe committed
      blk-mq currently uses percpu_ida for tag allocation. But that only
      works well if the ratio between tag space and number of CPUs is
      sufficiently high. For most devices and systems, that is not the
      case. The end result is that we either only utilize the tag space
      partially, or we end up attempting to fully exhaust it and run
      into lots of lock contention with stealing between CPUs. This is
      not optimal.
      
      This new tagging scheme is a hybrid bitmap allocator. It uses
      two tricks to both be SMP friendly and allow full exhaustion
      of the space:
      
      1) We cache the last allocated (or freed) tag on a per blk-mq
         software context basis. This allows us to limit the space
         we have to search. The key element here is not caching it
         in the shared tag structure, otherwise we end up dirtying
         more shared cache lines on each allocate/free operation.
      
      2) The tag space is split into cache line sized groups, and
         each context will start off randomly in that space. Even up
         to full utilization of the space, this divides the tag users
         efficiently into cache line groups, avoiding dirtying the same
         one both between allocators and between allocator and freer.

      This scheme shows drastically better behaviour, on small tag
      spaces and on large ones alike. It has been tested extensively
      to show better performance for all the cases blk-mq cares about.
      Signed-off-by: Jens Axboe <axboe@fb.com>
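The two tricks described in this message can be illustrated with a small single-threaded Python model. It is only a sketch: `ToyTags` and `ToyCtx` are hypothetical names, a flat list replaces the real bitmap words, and no locking or atomics are modelled. It shows the per-context cached hint, the random starting group, and the fact that the space can still be fully exhausted.

```python
import random
from typing import Optional


class ToyTags:
    """Shared tag space; holds no per-context hints."""

    def __init__(self, depth: int, bits_per_word: int = 8):
        self.depth = depth
        self.bits_per_word = bits_per_word   # size of one cache line group
        self.used = [False] * depth

    def get(self, ctx: "ToyCtx") -> Optional[int]:
        """Search from the context's cached position, wrapping around."""
        for i in range(self.depth):
            tag = (ctx.last_tag + i) % self.depth
            if not self.used[tag]:
                self.used[tag] = True
                ctx.last_tag = (tag + 1) % self.depth
                return tag
        return None                          # space fully exhausted

    def put(self, ctx: "ToyCtx", tag: int) -> None:
        self.used[tag] = False
        ctx.last_tag = tag                   # a freed tag is a good next guess


class ToyCtx:
    """Software context; the hint lives here, not in the shared structure."""

    def __init__(self, tags: ToyTags, rng: random.Random):
        nr_groups = (tags.depth + tags.bits_per_word - 1) // tags.bits_per_word
        # Each context starts at the beginning of a randomly chosen group.
        self.last_tag = rng.randrange(nr_groups) * tags.bits_per_word
```

Two contexts that start in different groups mostly allocate from different words, yet either one can still reach every free tag by wrapping around, which is what allows full use of a small tag space.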
  22. 08 May, 2014 1 commit
  23. 25 Apr, 2014 1 commit
    • C
      blk-mq: respect rq_affinity · 38535201
      Christoph Hellwig committed
      The blk-mq code is using its own version of the I/O completion affinity
      tunables, which causes a few issues:
      
       - the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
         still is present, thus breaking existing tuning setups.
       - the rq_affinity = 1 mode, which is the default for legacy request based
         drivers, isn't implemented at all.
       - blk-mq drivers don't implement any completion affinity with the default
         flag settings.
      
      This patch removes the blk-mq ipi_redirect flag and sysfs file, as well
      as the internal BLK_MQ_F_SHOULD_IPI flag, and replaces them with code that
      respects the queue-wide rq_affinity flags and also implements the
      rq_affinity = 1 mode.
      
      This means I/O completion affinity can now only be tuned block-queue wide
      instead of per context, which seems more sensible to me anyway.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
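The queue-wide completion policy can be summarised as a small decision function. The Python sketch below is illustrative only: `completion_cpu` and the `cache_group` mapping are hypothetical, and the meanings assumed for modes 0 and 2 (no steering, and forcing the submitting CPU) come from the general description of the rq_affinity tunable rather than from this commit message, which only discusses mode 1.

```python
from typing import Mapping


def completion_cpu(rq_affinity: int, submit_cpu: int, current_cpu: int,
                   cache_group: Mapping[int, int]) -> int:
    """Pick the CPU on which a request's completion should run.

    cache_group maps a CPU number to an id shared by all CPUs in the same
    group (a hypothetical stand-in for the kernel's topology check).
    """
    if rq_affinity == 0:
        # No steering: complete wherever the interrupt arrived.
        return current_cpu
    if rq_affinity == 1 and cache_group[current_cpu] == cache_group[submit_cpu]:
        # Already in the submitter's group: close enough, stay here.
        return current_cpu
    # Mode 1 from a different group, or mode 2: redirect to the submitter
    # (the kernel would do this with an IPI).
    return submit_cpu
```

With CPUs 0 and 1 in one group and CPUs 2 and 3 in another, a request submitted on CPU 0 and completed on CPU 3 stays on CPU 3 in mode 0 but is redirected to CPU 0 in modes 1 and 2; completed on CPU 1, it is redirected only in mode 2.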
  24. 17 Apr, 2014 5 commits
  25. 16 Apr, 2014 1 commit