1. 17 Aug 2018, 4 commits
    • block, bfq: improve code of bfq_bfqq_charge_time · f8121648
      Authored by Paolo Valente
      bfq_bfqq_charge_time contains some lengthy and redundant code. This
      commit trims and condenses that code.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
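
      For context, here is a minimal standalone sketch of what "charging
      time" amounts to: converting the time a queue actually used into an
      equivalent amount of service, bounded by the scheduler timeout. The
      names and the exact formula are illustrative, not the verbatim
      kernel function.
      -----
      unsigned long charge_for_time(unsigned long time_ms,
                                    unsigned long timeout_ms,
                                    unsigned long max_budget,
                                    unsigned long service_received)
      {
              /* bound the used time by the scheduler timeout */
              unsigned long bounded_ms =
                      time_ms < timeout_ms ? time_ms : timeout_ms;
              /* scale the maximum budget by the fraction of timeout used */
              unsigned long serv_for_time = max_budget * bounded_ms / timeout_ms;

              /* charge at least the service the queue has already received */
              return serv_for_time > service_received ? serv_for_time
                                                      : service_received;
      }
      -----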
    • block, bfq: reduce write overcharge · d5801088
      Authored by Paolo Valente
      When a sync request is dispatched, the queue that contains that
      request, and all the ancestor entities of that queue, are charged with
      the number of sectors of the request. In contrast, if the request is
      async, then the queue and its ancestor entities are charged with the
      number of sectors of the request, multiplied by an overcharge
      factor. This throttles the bandwidth of async I/O w.r.t. sync I/O,
      and it is done to counter the tendency of async writes to steal
      I/O throughput from reads.
      
      On the other hand, the lower this factor is, the more stable I/O
      control becomes, in the following respect: the less the bandwidth
      enjoyed by a group decreases
      - when the group does writes, w.r.t. when it does reads;
      - when other groups do reads, w.r.t. when they do writes.
      
      The fixes "block, bfq: always update the budget of an entity when
      needed" and "block, bfq: readd missing reset of parent-entity service"
      improved I/O control in bfq to such an extent that it has been
      possible to revise this overcharge factor downwards.  This commit
      introduces the resulting new value.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
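
      Schematically, the charging rule looks like the sketch below; the
      helper name and the factor value are illustrative assumptions (the
      commit lowers the factor, it does not introduce the mechanism):
      -----
      #define ASYNC_CHARGE_FACTOR 3UL   /* assumed revised value */

      /*
       * Sync requests are charged their actual size in sectors; async
       * (write) requests are overcharged to protect read bandwidth.
       */
      unsigned long serv_to_charge(unsigned long sectors, int is_sync)
      {
              return is_sync ? sectors : sectors * ASYNC_CHARGE_FACTOR;
      }
      -----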
    • block, bfq: always update the budget of an entity when needed · e02a0aa2
      Authored by Paolo Valente
      When the next child entity to serve changes for a given parent entity,
      the budget of that parent entity must be updated accordingly.
      Unfortunately, by mistake, this update was not performed for entities
      that switch from having no child entity to serve to having one child
      entity to serve.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
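
      The invariant being restored is roughly the following; this is a
      simplified sketch with illustrative types, not the actual bfq
      hierarchy code:
      -----
      struct entity_like {
              unsigned long budget;
              struct entity_like *next_in_service;
      };

      /*
       * Update the parent budget whenever the next child entity to serve
       * changes, including when it changes from NULL (no child to serve)
       * to a first child, which is the case the fix covers.
       */
      void set_next_in_service(struct entity_like *parent,
                               struct entity_like *child)
      {
              if (parent->budget < child->budget)
                      parent->budget = child->budget;
              parent->next_in_service = child;
      }
      -----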
    • block, bfq: readd missing reset of parent-entity service · 8a511ba5
      Authored by Paolo Valente
      The received-service counter needs to be equal to 0 when an entity is
      set in service. Unfortunately, commit "block, bfq: fix service being
      wrongly set to zero in case of preemption" mistakenly removed the
      resetting of this counter for the parent entities of the bfq_queue
      being set in service. This commit fixes this issue by resetting
      service for parent entities, directly on the expiration of the
      in-service bfq_queue.
      
      Fixes: 9fae8dd5 ("block, bfq: fix service being wrongly set to zero in case of preemption")
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
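
      The shape of the fix is a walk up the scheduling hierarchy on
      expiration; a minimal sketch, assuming an entity with a parent
      pointer and a received-service counter:
      -----
      struct entity_like {
              int service;                  /* service received so far */
              struct entity_like *parent;
      };

      /* On expiration of the in-service bfq_queue, zero the counter for
       * its entity and for every ancestor entity. */
      void reset_service_on_expire(struct entity_like *entity)
      {
              for (; entity; entity = entity->parent)
                      entity->service = 0;
      }
      -----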
  2. 15 Aug 2018, 2 commits
  3. 12 Aug 2018, 1 commit
  4. 10 Aug 2018, 1 commit
    • Blk-throttle: reduce tail io latency when iops limit is enforced · 991f61fe
      Authored by Liu Bo
      When an application's iops exceeds its cgroup's iops limit, it is
      throttled and the kernel sets a timer for dispatching, so the IO
      latency includes that delay.
      
      However, the dispatch delay calculated from the limit and the
      elapsed jiffies is suboptimal.  Since the dispatch delay is only
      calculated once the application's iops reaches (iops limit + 1), the
      request does not need to wait any longer than the remaining time of
      the current slice.
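
      In other words, the wait is bounded by the distance to the next
      slice boundary; a hedged sketch of the arithmetic (jiffies-style
      integers, illustrative names):
      -----
      unsigned long iops_throttle_wait(unsigned long jiffy_elapsed,
                                       unsigned long throtl_slice)
      {
              /* round the elapsed time up to the next slice boundary */
              unsigned long jiffy_elapsed_rnd =
                      ((jiffy_elapsed + throtl_slice) / throtl_slice) *
                      throtl_slice;

              /* wait only for the remaining time of the current slice */
              return jiffy_elapsed_rnd - jiffy_elapsed;
      }
      -----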
      
      The difference can be demonstrated with the following fio job and
      cgroup iops setting:
      -----
      $ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
      $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
      $ cat r2.job
      [global]
      name=fio-rand-read
      filename=/dev/nullb1
      rw=randread
      bs=4k
      direct=1
      numjobs=1
      time_based=1
      runtime=60
      group_reporting=1
      
      [file1]
      size=4G
      ioengine=libaio
      iodepth=1
      rate_iops=50000
      norandommap=1
      thinktime=4ms
      -----
      
      w/o patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
          slat (usec): min=10, max=336, avg=27.71, stdev=17.82
          clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
           lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
          clat percentiles (usec):
           |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
           | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
           | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
           | 99.99th=[28967]
      
      w/ patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
          slat (usec): min=10, max=155, avg=23.24, stdev=16.79
          clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
           lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
          clat percentiles (usec):
           |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
           | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
           | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
           | 99.99th=[12387]
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 09 Aug 2018, 4 commits
  6. 08 Aug 2018, 3 commits
    • cfq: Suppress compiler warnings about comparisons · f7ecb1b1
      Authored by Bart Van Assche
      This patch does not change any functionality, but prevents gcc from
      reporting the following warnings when building with W=1:
      
      block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_slice_idle_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_group_idle_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_low_latency_store':
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
       STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
       USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
       USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
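
      One standalone way to make such warnings disappear (an assumed
      approach for illustration, not necessarily what the patch itself
      does) is to clamp through a helper whose bounds are ordinary
      parameters, so the compiler never sees a constant "unsigned < 0"
      comparison at the macro expansion site:
      -----
      #include <limits.h>
      #include <stdio.h>

      static unsigned int clamp_uint(unsigned int val, unsigned int lo,
                                     unsigned int hi)
      {
              if (val < lo)   /* lo is a runtime value: no -Wtype-limits */
                      return lo;
              if (val > hi)
                      return hi;
              return val;
      }

      int main(void)
      {
              printf("%u\n", clamp_uint(42u, 0u, UINT_MAX));  /* 42 */
              return 0;
      }
      -----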
    • cfq: Annotate fall-through in a switch statement · 9b4f4346
      Authored by Bart Van Assche
      This patch keeps gcc from complaining about an implicit fall-through
      when building with W=1.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
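
      The annotation gcc's -Wimplicit-fallthrough recognizes is a comment
      placed where the fall-through happens; a self-contained example of
      the shape (the actual switch in cfq differs):
      -----
      #include <stdio.h>

      static int classify(int c)
      {
              int v = 0;

              switch (c) {
              case 0:
                      v += 1;
                      /* fall through */
              case 1:
                      v += 2;
                      break;
              default:
                      v = -1;
              }
              return v;
      }

      int main(void)
      {
              printf("%d %d\n", classify(0), classify(1));  /* 3 2 */
              return 0;
      }
      -----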
    • blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait · 2887e41b
      Authored by Anchal Agarwal
      I am currently running a large bare metal instance (i3.metal)
      on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
      4.18 kernel. I have a workload that simulates a database
      workload, and I am running into lockup issues when writeback
      throttling is enabled, with the hung task detector also
      kicking in.
      
      Crash dumps show that most CPUs (up to 50 of them) are
      all trying to get the wbt wait queue lock while trying to add
      themselves to it in __wbt_wait (see stack traces below).
      
      [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
      [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
      [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
      [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
      [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
      [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
      [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
      [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
      [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
      [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.948138] Call Trace:
      [    0.948139]  <IRQ>
      [    0.948142]  do_raw_spin_lock+0xad/0xc0
      [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.948149]  ? __wake_up_common_lock+0x53/0x90
      [    0.948150]  __wake_up_common_lock+0x53/0x90
      [    0.948155]  wbt_done+0x7b/0xa0
      [    0.948158]  blk_mq_free_request+0xb7/0x110
      [    0.948161]  __blk_mq_complete_request+0xcb/0x140
      [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
      [    0.948169]  nvme_irq+0x23/0x50 [nvme]
      [    0.948173]  __handle_irq_event_percpu+0x46/0x300
      [    0.948176]  handle_irq_event_percpu+0x20/0x50
      [    0.948179]  handle_irq_event+0x34/0x60
      [    0.948181]  handle_edge_irq+0x77/0x190
      [    0.948185]  handle_irq+0xaf/0x120
      [    0.948188]  do_IRQ+0x53/0x110
      [    0.948191]  common_interrupt+0x87/0x87
      [    0.948192]  </IRQ>
      ....
      [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
      [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
      [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
      [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
      [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
      [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
      [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
      [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
      [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
      [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
      [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.311154] Call Trace:
      [    0.311157]  do_raw_spin_lock+0xad/0xc0
      [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
      [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
      [    0.311167]  wbt_wait+0x127/0x330
      [    0.311169]  ? finish_wait+0x80/0x80
      [    0.311172]  ? generic_make_request+0xda/0x3b0
      [    0.311174]  blk_mq_make_request+0xd6/0x7b0
      [    0.311176]  ? blk_queue_enter+0x24/0x260
      [    0.311178]  ? generic_make_request+0xda/0x3b0
      [    0.311181]  generic_make_request+0x10c/0x3b0
      [    0.311183]  ? submit_bio+0x5c/0x110
      [    0.311185]  submit_bio+0x5c/0x110
      [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
      [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
      [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
      [    0.311229]  ? do_writepages+0x3c/0xd0
      [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
      [    0.311240]  do_writepages+0x3c/0xd0
      [    0.311243]  ? _raw_spin_unlock+0x24/0x30
      [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
      [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311253]  file_write_and_wait_range+0x34/0x90
      [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
      [    0.311267]  do_fsync+0x38/0x60
      [    0.311270]  SyS_fsync+0xc/0x10
      [    0.311272]  do_syscall_64+0x6f/0x170
      [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      In the original patch, wbt_done wakes up all the exclusive
      processes in the wait queue, which can cause a thundering herd
      if there is a large number of writer threads in the queue. The
      original intention of the code seems to have been to wake up only
      one thread; however, it uses wake_up_all() in __wbt_done(), and
      then relies on the following check in __wbt_wait() to let only one
      thread actually get out of the wait loop:
      
      if (waitqueue_active(&rqw->wait) &&
                  rqw->wait.head.next != &wait->entry)
                      return false;
      
      The problem with this is that the wait entry in wbt_wait is
      defined with DEFINE_WAIT, which uses the autoremove wakeup function.
      That means the above check is invalid: the wait entry will
      have been removed from the queue already by the time we hit the
      check in the loop.
      
      Secondly, auto-removing the wait entries also means that the wait
      queue essentially gets reordered "randomly" (e.g. threads re-add
      themselves in the order they got to run after being woken up).
      Additionally, new requests entering wbt_wait might overtake requests
      that were queued earlier, because the wait queue will be
      (temporarily) empty after the wake_up_all, so the waitqueue_active
      check will not stop them. This can cause certain threads to starve
      under high load.
      
      The fix is to leave the woken up requests in the queue and remove
      them in finish_wait() once the current thread breaks out of the
      wait loop in __wbt_wait. This will ensure new requests always
      end up at the back of the queue, and they won't overtake requests
      that are already in the wait queue. With that change, the loop
      in wbt_wait is also in line with many other wait loops in the kernel.
      Waking up just one thread drastically reduces lock contention, as
      does moving the wait queue add/remove out of the loop.
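
      The resulting wait loop has roughly the following shape (a
      simplified sketch: rq_wait_inc_below() and 'limit' stand in for the
      real condition, and this is not the verbatim patch). The entry uses
      the default wake function, so a wakeup does not auto-remove it; the
      task stays on the queue, in FIFO order, until the single removal
      point at the end:
      -----
      DECLARE_WAITQUEUE(wait, current);

      add_wait_queue_exclusive(&rqw->wait, &wait);
      for (;;) {
              set_current_state(TASK_UNINTERRUPTIBLE);
              if (rq_wait_inc_below(rqw, limit))   /* got a slot? */
                      break;
              io_schedule();                       /* sleep, stay queued */
      }
      __set_current_state(TASK_RUNNING);
      remove_wait_queue(&rqw->wait, &wait);        /* single removal point */
      -----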
      
      A significant drop in lockdep's lock contention numbers is seen when
      running the test application on the patched kernel.
      Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
      Signed-off-by: Frank van der Linden <fllinden@amazon.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 03 Aug 2018, 2 commits
    • blk-mq: fix updating tags depth · 75d6e175
      Authored by Ming Lei
      The 'nr' value passed from userspace represents the total depth,
      while inside 'struct blk_mq_tags', 'nr_tags' stores the total tag
      depth and 'nr_reserved_tags' stores the reserved part.
      
      There are two issues in blk_mq_tag_update_depth() now:
      
      1) when growing tags, we should use the passed 'nr' and keep the
      number of reserved tags unchanged.
      
      2) the passed 'nr' should be checked against 'tags->nr_tags',
      instead of against the normal (non-reserved) part.
      
      This patch fixes the above two cases, and avoids a kernel crash
      caused by wrongly resizing the sbitmap queue.
      
      Cc: "Ewan D. Milne" <emilne@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Tested by: Marco Patalano <mpatalan@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
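
      The corrected check reduces to the following standalone sketch
      (struct and helper names are illustrative):
      -----
      #include <stdbool.h>

      struct tags_like {
              unsigned int nr_tags;            /* total tag depth */
              unsigned int nr_reserved_tags;   /* reserved part of it */
      };

      /*
       * 'nr' from userspace is the TOTAL depth, so growth is detected by
       * comparing against nr_tags, and the reserved count is left
       * unchanged when growing.
       */
      static bool must_grow(const struct tags_like *tags, unsigned int nr)
      {
              return nr > tags->nr_tags;
              /* not: nr > tags->nr_tags - tags->nr_reserved_tags */
      }
      -----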
    • block: really disable runtime-pm for blk-mq · b233f127
      Authored by Ming Lei
      Runtime PM isn't ready for blk-mq yet, and commit 765e40b6 ("block:
      disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
      that approach didn't take effect, since user space can still switch
      runtime PM on via 'echo auto > /sys/block/sdN/device/power/control'.
      
      This patch really disables runtime PM for blk-mq, via
      pm_runtime_disable(), and fixes various PM-related kernel crashes.
      
      Cc: Tomas Janousek <tomi@nomi.cz>
      Cc: Przemek Socha <soprwa@gmail.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Patrick Steinhardt <ps@pks.im>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
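
      The assumed shape of the fix: hard-disable runtime PM on the device
      whenever the queue is blk-mq, so that a later 'echo auto >
      .../power/control' from user space has no effect (the function name
      here is illustrative):
      -----
      void rq_pm_init_like(struct request_queue *q, struct device *dev)
      {
              if (q->mq_ops) {
                      pm_runtime_disable(dev);  /* blocks later enabling */
                      return;
              }
              /* legacy request path: set up runtime PM as before */
      }
      -----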
  8. 02 Aug 2018, 1 commit
  9. 01 Aug 2018, 3 commits
  10. 30 Jul 2018, 2 commits
  11. 27 Jul 2018, 3 commits
  12. 25 Jul 2018, 4 commits
  13. 23 Jul 2018, 2 commits
  14. 18 Jul 2018, 5 commits
    • blkcg: Track DISCARD statistics and output them in cgroup io.stat · 636620b6
      Authored by Tejun Heo
      Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat.  Two
      fields, dbytes and dios, are added to count the total bytes and the
      number of discards, respectively.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
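
      With this change, an io.stat line carries the two new fields after
      the read/write counters (the values below are illustrative):
      -----
      $ cat /sys/fs/cgroup/unified/cg1/io.stat
      8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=786432 dios=12
      -----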
    • block: Track DISCARD statistics and output them in stat and diskstat · bdca3c87
      Authored by Michael Callahan
      Add tracking of REQ_OP_DISCARD ios to the partition statistics and
      append them to the various stat files in /sys as well as
      /proc/diskstats.  These are tracked with the same four stats as reads
      and writes:
      
      - number of discard ios completed
      - number of discard ios merged
      - number of discard sectors completed
      - milliseconds spent on discard requests
      
      This is done by adding a new STAT_DISCARD define to genhd.h and then
      using it to index the stat fields for discard requests.
      
      tj: Refreshed on top of v4.17 and other previous updates.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
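
      After this change a /proc/diskstats line gains four trailing fields
      mirroring the read/write quadruples, in the order listed above (the
      values below are illustrative):
      -----
      $ cat /proc/diskstats
      8  0 sda 12735 1441 626686 7985 4502 8192 541248 10125 0 9660 18110 326 0 2611520 41
      -----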
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Authored by Michael Callahan
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function; they are
      now indexed by op_is_write().
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
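
      Per the description above, the helper reduces to something like the
      following sketch (assuming op_is_discard()/op_is_write() style
      helpers; the exact kernel definition may differ):
      -----
      static inline int op_stat_group(unsigned int op)
      {
              if (op_is_discard(op))
                      return STAT_DISCARD;
              return op_is_write(op) ? STAT_WRITE : STAT_READ;
      }
      -----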
    • block: Define and use STAT_READ and STAT_WRITE · dbae2c55
      Authored by Michael Callahan
      Add defines for STAT_READ and STAT_WRITE for indexing the partition
      stat entries. This clarifies some fs/ code that had hardcoded 1 for
      STAT_WRITE, and will make it easier to extend the stats with
      additional fields.
      
      tj: Refreshed on top of v4.17.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
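
      A sketch of the resulting indexing constants (the STAT_DISCARD entry
      comes from the related patch in this series):
      -----
      enum stat_group {
              STAT_READ,
              STAT_WRITE,
              STAT_DISCARD,

              NR_STAT_GROUPS
      };
      -----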
    • blk-mq: issue directly if hw queue isn't busy in case of 'none' · 6ce3dd6e
      Authored by Ming Lei
      With the 'none' io scheduler, when the hw queue isn't busy, there is
      no need to enqueue a request to the sw queue and dequeue it again:
      the request can be submitted to the hw queue immediately at no extra
      cost. Meanwhile, there shouldn't be many requests in the sw queue,
      so we don't need to worry about the effect on IO merging.
      
      There are still some single-hw-queue SCSI HBAs (HPSA, megaraid_sas,
      ...) that may drive high-performance devices, so 'none' is often
      required for obtaining good performance.
      
      This patch improves IOPS and decreases CPU utilization on
      megaraid_sas, per Kashyap's test.
      
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
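
      The dispatch decision reduces to roughly the following (a sketch:
      hctx_busy() stands in for the real busy check, and this is not the
      verbatim patch):
      -----
      if (!q->elevator && !hctx_busy(hctx)) {
              /* 'none' scheduler and idle hw queue: bypass the sw queue */
              blk_mq_try_issue_directly(hctx, rq, &cookie);
      } else {
              blk_mq_sched_insert_request(rq, false, true, false);
      }
      -----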
  15. 17 Jul 2018, 2 commits
    • blk-iolatency: truncate our current time · 71e9690b
      Authored by Josef Bacik
      In our longer tests we noticed that some boxes would degrade to the
      point of uselessness.  This is because we truncate the current time
      when saving it in our bio, but I was using the raw current time to
      subtract from.  So once the box had been up a certain amount of time
      it would appear as if our IOs were taking several years to complete.
      Fix this by truncating the current time so it matches the issue
      time.  Verified that this worked by running with this patch for a
      week on our test tier.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
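
      The idea of the fix, as a sketch (assuming the bio issue-time
      helpers; treat the names as illustrative): truncate 'now' with the
      same helper used when the issue time was stored, so both timestamps
      share a scale:
      -----
      u64 now = ktime_get_ns();

      now = __bio_issue_time(now);   /* same truncation as at issue time */
      if (now > bio_issue_time(&bio->bi_issue))
              latency = now - bio_issue_time(&bio->bi_issue);
      -----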
    • blk-iolatency: don't change the latency window · d607eefa
      Authored by Josef Bacik
      Early versions of these patches had us waiting for seconds at a time
      during submission, so we had to adjust the timing window we monitored
      for latency.  Now we don't do things like that, so this code is
      unnecessary.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 12 Jul 2018, 1 commit
    • bsg: remove read/write support · 28519c89
      Authored by Christoph Hellwig
      The code poses a security risk due to user memory access in ->release
      and has an API that can't be used reliably.  As far as we know it was
      never used for real, but if that turns out to be wrong we'll have to
      revert this commit and come up with a band aid.
      
      Jann Horn searched software archives for users of this interface;
      the only users found were example code in sg3_utils, and optional
      support in an optional module of the tgt user-space iscsi target,
      which looks like a proof-of-concept extension of the /dev/sg
      read/write support.
      
      Tony Battersby chimes in that the code is basically unsafe to use in
      general:
      
        The read/write interface on /dev/bsg is impossible to use safely
        because the list of completed commands is per-device (bd->done_list)
        rather than per-fd like it is with /dev/sg.  So if program A and
        program B are both using the write/read interface on the same bsg
        device, then their command responses will get mixed up, and program
        A will read() some command results from program B and vice versa.
        So no, I don't use read/write on /dev/bsg.  From a security standpoint,
        it should definitely be fixed or removed.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>