1. 11 Aug 2017 (3 commits)
    • cfq: Give a chance for arming slice idle timer in case of group_idle · b3193bc0
      Committed by Ritesh Harjani
      In the scenario below, blkio cgroups do not get disk time in proportion
      to their assigned weights:
      1. The underlying device is non-rotational, with a single HW queue of
      depth >= CFQ_HW_QUEUE_MIN.
      2. Two blkio cgroups, cg1 (weight 1000) and cg2 (weight 100), each run
      one process (file1 and file2) doing sync IO (a setup sketch follows
      below).
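      A minimal sketch of such a setup (an illustration, assuming a cgroup-v1
      blkio hierarchy mounted at /sys/fs/cgroup/blkio; PID_FILE1/PID_FILE2
      are placeholders for the two fio job PIDs):
      
      mkdir /sys/fs/cgroup/blkio/cg1 /sys/fs/cgroup/blkio/cg2
      echo 1000 > /sys/fs/cgroup/blkio/cg1/blkio.weight
      echo 100  > /sys/fs/cgroup/blkio/cg2/blkio.weight
      echo $PID_FILE1 > /sys/fs/cgroup/blkio/cg1/cgroup.procs   # fio job writing file1
      echo $PID_FILE2 > /sys/fs/cgroup/blkio/cg2/cgroup.procs   # fio job writing file2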
      
      fio results for the above use case, without this patch:
      file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan  1 19:41:49 1970
        write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
      <...>
      file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan  1 19:41:49 1970
        write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
      <...>
      // Both processes get equal BW even though they belong to different
      cgroups with weights of 1000 (cg1) and 100 (cg2).
      
      In the above case (non-rotational NCQ devices), as soon as a request
      from cg1 completes, CFQ expires the group without granting any idle
      time or weight-based priority, even though cg1 is provided with the
      higher set_slice=10, and when the driver fetches the next request CFQ
      schedules another cfq group (here cg2). The two cfq groups (cg1 & cg2)
      thus keep alternating for disk time, and cgroup weight-based
      scheduling is lost.
      
      This patch gives the cfq algorithm (cfq_arm_slice_timer) a chance to
      arm the slice idle timer when group_idle is enabled. If group_idle is
      not wanted either (including on non-rotational NCQ drives), it must be
      disabled explicitly by setting group_idle = 0 via sysfs.
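      A minimal sketch of that tuning step (assuming the device is sdX and
      that cfq is the active scheduler exposing the group_idle attribute):
      
      echo cfq > /sys/block/sdX/queue/scheduler
      echo 0 > /sys/block/sdX/queue/iosched/group_idle   # disable group idling
      cat /sys/block/sdX/queue/iosched/group_idle        # should read back 0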
      
      With this patch, fio results for the same use case:
      file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan  1 00:06:08 1970
        write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
      <..>
      file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan  1 00:06:08 1970
        write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
      <..>
      // Here each process's BW is in proportion to its cgroup's weight.
      Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: boost throughput with flash-based non-queueing devices · edaf9428
      Committed by Paolo Valente
      When a queue associated with a process remains empty, there are cases
      where throughput gets boosted if the device is idled to await the
      arrival of a new I/O request for that queue. Currently, BFQ assumes
      that one of these cases is when the device has no internal queueing
      (regardless of the properties of the I/O being served). Unfortunately,
      this condition has proved to be too general. So, this commit refines it
      as "the device has no internal queueing and is rotational".
      
      This refinement provides a significant throughput boost with random
      I/O, on flash-based storage without internal queueing. For example, on
      a HiKey board, throughput increases by up to 125%, growing, e.g., from
      6.9MB/s to 15.6MB/s with two or three random readers in parallel.
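      A quick way to check which case a device falls into from userspace
      (a sketch using standard sysfs attributes; sdX is a placeholder, and
      queue_depth is only present for SCSI-managed devices):
      
      cat /sys/block/sdX/queue/rotational    # 0 = flash/SSD, 1 = rotational
      cat /sys/block/sdX/device/queue_depth  # 1 suggests no internal queueing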
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: refactor device-idling logic · d5be3fef
      Committed by Paolo Valente
      The logic that decides whether to idle the device is scattered across
      three functions. Almost all of the logic is in the function
      bfq_bfqq_may_idle, but (1) part of the decision is made in
      bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
      switch off idling regardless of the output of bfq_bfqq_may_idle. In
      addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
      their decisions as a function of parameters that are used, for similar
      purposes, also in bfq_bfqq_may_idle. This commit addresses these
      issues by moving all the logic into bfq_bfqq_may_idle.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 10 Aug 2017 (7 commits)
  3. 02 Aug 2017 (1 commit)
  4. 01 Aug 2017 (1 commit)
    • blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabled · b7a71e66
      Committed by Jens Axboe
      We recently had a bug in the IPR SCSI driver, where it would end up
      making the SCSI mid layer run the mq hardware queue with interrupts
      disabled. This isn't legal, since the software queue locking relies
      on never being grabbed from interrupt context. Additionally, drivers
      that set BLK_MQ_F_BLOCKING may schedule from this context.
      
      Add a WARN_ON_ONCE() to catch bad users up front.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 29 Jul 2017 (3 commits)
  6. 25 Jul 2017 (1 commit)
  7. 24 Jul 2017 (1 commit)
  8. 12 Jul 2017 (2 commits)
    • bfq: dispatch request to prevent queue stalling after the request completion · 3f7cb4f4
      Committed by Hou Tao
      There are mq devices (e.g., virtio-blk, nbd and loopback) which don't
      invoke blk_mq_run_hw_queues() after the completion of a request.
      If bfq is enabled on such a device and the slice_idle attribute or
      strict_guarantees attribute is set to zero, then after a request
      completes, the remaining requests of a busy bfq queue may stall in
      the bfq scheduler until a new request arrives.
      
      To fix this scheduler latency problem, we need to check whether all
      issued requests have completed, and dispatch more requests to the
      driver if none are in flight.
      
      The problem can be reproduced by running the following script
      on a virtio-blk device with nr_hw_queues set to 1:
      
      #!/bin/sh
      
      dev=vdb
      # mount point for dev
      mp=/tmp/mnt
      cd $mp
      
      job=strict.job
      cat <<EOF > $job
      [global]
      direct=1
      bs=4k
      size=256M
      rw=write
      ioengine=libaio
      iodepth=128
      runtime=5
      time_based
      
      [1]
      filename=1.data
      
      [2]
      new_group
      filename=2.data
      EOF
      
      echo bfq > /sys/block/$dev/queue/scheduler
      echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
      fio $job
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bfq: fix typos in comments about B-WF2Q+ algorithm · 38c91407
      Committed by Hou Tao
      The start time of an eligible entity should be less than or equal to
      the current virtual time, and an entity in the idle tree has a finish
      time greater than the current virtual time.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 11 Jul 2017 (1 commit)
    • block: call bio_uninit in bio_endio · b222dd2f
      Committed by Shaohua Li
      bio_free isn't a good place to free cgroup info. There are a lot of
      cases where a bio is allocated in a special way (for example, on the
      stack) and bio_put, and hence bio_free, is never called on it, so we
      leak memory. This patch moves the freeing to bio_endio, which should
      be called in any case. The bio_uninit call in bio_free is kept, in
      case bio_endio is never called on the bio.
      
      This assumes ->bi_end_io() doesn't access cgroup info, which seems true
      in my audit.
      
      This, along with Christoph's integrity patch, should fix the memory
      leak issue.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 06 Jul 2017 (1 commit)
    • block: Fix __blkdev_issue_zeroout loop · 615d22a5
      Committed by Damien Le Moal
      The BIO issuing loop in __blkdev_issue_zeroout() allocates BIOs
      with a maximum number of bvecs (pages) equal to
      min(nr_sects, (sector_t)BIO_MAX_PAGES)
      
      This works since the requested number of bvecs will always be limited
      to the absolute maximum supported (BIO_MAX_PAGES), but it is
      inefficient: too many bvec entries may be requested because the min()
      operates on different units (number of sectors vs. number of pages).
      To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
      correctly calculate the number of bvecs for zeroout BIOs as the
      issuing loop progresses. The calculation uses consistent units and
      ensures that the number of pages returned is at least 1 (for cases
      where the number of sectors is less than the number of sectors in
      a page).
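      A back-of-the-envelope illustration of the unit mismatch (plain sh,
      not kernel code; assumes 512-byte sectors, 4 KiB pages, and the
      historical BIO_MAX_PAGES value of 256):
      
      nr_sects=1024                                # 512 KiB in 512-byte sectors
      pages_needed=$(( (nr_sects + 7) / 8 ))       # 8 sectors per 4 KiB page -> 128
      old_request=$nr_sects
      [ $old_request -gt 256 ] && old_request=256  # min(nr_sects, BIO_MAX_PAGES) -> 256
      echo "bvecs needed: $pages_needed, requested before the fix: $old_request"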
      
      Also remove a trailing space after the bit shift in the internal loop
      min() call.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 05 Jul 2017 (1 commit)
  12. 04 Jul 2017 (9 commits)
  13. 30 Jun 2017 (2 commits)
  14. 29 Jun 2017 (4 commits)
    • blk-mq: map all HWQ also in hyperthreaded system · fe631457
      Committed by Max Gurtovoy
      This patch performs sequential mapping between CPUs and queues.
      If the system has more CPUs than HWQs, some CPUs remain unmapped
      after the first pass; on a hyperthreaded system, map each such CPU
      and its siblings to the same HWQ.
      This actually fixes a bug that left HWQs unmapped on a system with
      2 sockets, 18 cores per socket and 2 threads per core (72 CPUs total)
      running NVMEoF (which opens up to a maximum of 64 HWQs).
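      The resulting CPU-to-HWQ assignment can be inspected through sysfs on
      kernels of this era (a sketch; the per-hctx mq directory layout is
      assumed, and nullb0 is a placeholder device):
      
      for q in /sys/block/nullb0/mq/*; do
          echo "hctx $(basename $q): CPUs $(cat $q/cpu_list)"
      done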
      
      Performance results running fio (72 jobs, iodepth 128)
      on null_blk (values are with/without the patch):
      
      bs      IOPS(read submit_queues=72)   IOPS(write submit_queues=72)   IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
      -----  ----------------------------  ------------------------------ ---------------------------- -----------------------------
      512    4890.4K/4723.5K                 4524.7K/4324.2K                   4280.2K/4264.3K               3902.4K/3909.5K
      1k     4910.1K/4715.2K                 4535.8K/4309.6K                   4296.7K/4269.1K               3906.8K/3914.9K
      2k     4906.3K/4739.7K                 4526.7K/4330.6K                   4301.1K/4262.4K               3890.8K/3900.1K
      4k     4918.6K/4730.7K                 4556.1K/4343.6K                   4297.6K/4264.5K               3886.9K/3893.9K
      8k     4906.4K/4748.9K                 4550.9K/4346.7K                   4283.2K/4268.8K               3863.4K/3858.2K
      16k    4903.8K/4782.6K                 4501.5K/4233.9K                   4292.3K/4282.3K               3773.1K/3773.5K
      32k    4885.8K/4782.4K                 4365.9K/4184.2K                   4307.5K/4289.4K               3780.3K/3687.3K
      64k    4822.5K/4762.7K                 2752.8K/2675.1K                   4308.8K/4312.3K               2651.5K/2655.7K
      128k   2388.5K/2313.8K                 1391.9K/1375.7K                   2142.8K/2152.2K               1395.5K/1374.2K
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: provide bio_uninit() for freeing integrity/task associations · 9ae3b3f5
      Committed by Jens Axboe
      Wen reports significant memory leaks with DIF and O_DIRECT:
      
      "With an nvme device + T10 enabled on a system with 256GB, we started
      logging /proc/meminfo & /proc/slabinfo every minute, and in an hour it
      increased by 15968128 kB, or ~15+ GB, approximately 256 MB/minute
      leaking.
      
      /proc/meminfo | grep SUnreclaim...
      
      SUnreclaim:      6752128 kB
      SUnreclaim:      6874880 kB
      SUnreclaim:      7238080 kB
      ....
      SUnreclaim:     22307264 kB
      SUnreclaim:     22485888 kB
      SUnreclaim:     22720256 kB
      
      When testcases with T10 enabled call into __blkdev_direct_IO_simple,
      code doesn't free memory allocated by bio_integrity_alloc. The patch
      fixes the issue. HTX has been run with +60 hours without failure."
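      The minute-by-minute logging described in the report amounts to a
      trivial loop like the following (a sketch; standard procfs paths):
      
      while true; do
          date
          grep SUnreclaim /proc/meminfo
          sleep 60
      done >> meminfo.log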
      
      Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
      doesn't go through the regular bio free. This means that any ancillary
      data allocated with the bio through the stack is not freed. Hence, we
      can leak the integrity data associated with the bio, if the device is
      using DIF/DIX.
      
      Fix this by providing a bio_uninit() helper and exporting it, so that
      we can use it to free this data. Note that this is a minimal fix for
      the issue. Any current user of bios allocated outside of
      bio_alloc_bioset() suffers from the same problem, most notably some
      drivers. We will fix those in a more comprehensive patch for 4.13.
      This also means that the commit marked as being fixed by this one
      isn't the real culprit; it's just the most obvious one out there.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: Create hctx for each present CPU · 4b855ad3
      Committed by Christoph Hellwig
      Currently we only create hctx for online CPUs, which can lead to a lot
      of churn due to frequent soft offline / online operations.  Instead
      allocate one for each present CPU to avoid this and dramatically simplify
      the code.
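      For reference, the present-vs-online distinction the commit relies on
      is visible in standard sysfs (the example outputs are illustrative):
      
      cat /sys/devices/system/cpu/present   # e.g. 0-71: CPUs the system may ever online
      cat /sys/devices/system/cpu/online    # e.g. 0-35: CPUs currently online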
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-nvme@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • blk-mq: Include all present CPUs in the default queue mapping · 5f042e7c
      Committed by Christoph Hellwig
      This way we get a nice distribution independent of the current CPU
      online/offline state.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-nvme@lists.infradead.org
      Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  15. 28 Jun 2017 (3 commits)
    • block, bfq: update wr_busy_queues if needed on a queue split · 13c931bd
      Committed by Paolo Valente
      This commit fixes a bug triggered by a non-trivial sequence of
      events, briefly described in the next two paragraphs. The impatient,
      or those familiar with queue merging and splitting, can jump directly
      to the last paragraph.
      
      On each I/O-request arrival for a shared bfq_queue, i.e., for a
      bfq_queue that is the result of the merge of two or more bfq_queues,
      BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
      many random I/O requests have arrived for the bfq_queue; if the device
      is non-rotational, then random requests must also be small for the
      bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
      detected as seeky, then a split occurs: the bfq I/O context of the
      process that has issued the request is redirected from the shared
      bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
      shared bfq_queue actually happens to be shared only by one process
      (because of previous splits), then no new bfq_queue is created: the
      state of the shared bfq_queue is just changed from shared to
      non-shared.
      
      Regardless of whether a brand new non-shared bfq_queue is created, or
      the pre-existing shared bfq_queue is just turned into a non-shared
      bfq_queue, several parameters of the non-shared bfq_queue are set
      (restored) to the original values they had when the bfq_queue
      associated with the bfq I/O context of the process (that has just
      issued an I/O request) was merged with the shared bfq_queue. One of
      these parameters is the weight-raising state.
      
      If, on the split of a shared bfq_queue,
      1) a pre-existing shared bfq_queue is turned into a non-shared
      bfq_queue;
      2) the previously shared bfq_queue happens to be busy;
      3) the weight-raising state of the previously shared bfq_queue happens
      to change;
      the number of weight-raised busy queues changes. The field
      wr_busy_queues must then be updated accordingly, but such an update
      was missing. This commit adds the missing update.
      Reported-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't set bounce limit in blk_init_queue · 8fc45044
      Committed by Christoph Hellwig
      Instead move it to the callers. Those that either don't use
      bio_data() or page_address(), or are specific to architectures that
      do not support highmem, are skipped.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't set bounce limit in blk_init_allocated_queue · 0bf6595e
      Committed by Christoph Hellwig
      And just move it into scsi_transport_sas, which needs it because
      low-level drivers directly dereference bio_data, and into
      blk_init_queue_node, which will need a further push into the callers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>