1. 15 6月, 2017 1 次提交
    • B
      block: Fix a blk_exit_rl() regression · dc9edc44
      Bart Van Assche 提交于
      Avoid that the following complaint is reported:
      
       BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
       in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
       1 lock held by rcuop/3/41:
        #0:  (rcu_callback){......}, at: [<ffffffff8111f9a2>] rcu_nocb_kthread+0x282/0x500
       Call Trace:
        dump_stack+0x86/0xcf
        ___might_sleep+0x174/0x260
        __might_sleep+0x4a/0x80
        flush_work+0x7e/0x2e0
        __cancel_work_timer+0x143/0x1c0
        cancel_work_sync+0x10/0x20
        blk_throtl_exit+0x25/0x60
        blkcg_exit_queue+0x35/0x40
        blk_release_queue+0x42/0x130
        kobject_put+0xa9/0x190
      
      This happens since we invoke callbacks that need to block from the
      queue release handler. Fix this by pushing the final release to
      a workqueue.
      Reported-by: NRoss Zwisler <zwisler@gmail.com>
      Fixes: commit b425e504 ("block: Avoid that blk_exit_rl() triggers a use-after-free")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      
      Updated changelog
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dc9edc44
  2. 02 6月, 2017 1 次提交
    • B
      block: Avoid that blk_exit_rl() triggers a use-after-free · b425e504
      Bart Van Assche 提交于
      Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
      essential that the memory allocated for struct request_queue
      stays around until all blk_exit_rl() calls have finished. Hence
      make blk_init_rl() take a reference on struct request_queue.
      
      This patch fixes the following crash:
      
      general protection fault: 0000 [#2] SMP
      CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      task: ffff88013a108040 task.stack: ffffc9000071c000
      RIP: 0010:free_request_size+0x1a/0x30
      RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
      RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
      RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
      RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
      R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
      R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
      FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
      Call Trace:
       mempool_destroy.part.10+0x21/0x40
       mempool_destroy+0xe/0x10
       blk_exit_rl+0x12/0x20
       blkg_free+0x4d/0xa0
       __blkg_release_rcu+0x59/0x170
       rcu_process_callbacks+0x260/0x4e0
       __do_softirq+0x116/0x250
       smpboot_thread_fn+0x123/0x1e0
       kthread+0x109/0x140
       ret_from_fork+0x31/0x40
      
      Fixes: commit e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org> # v4.11+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b425e504
  3. 26 5月, 2017 1 次提交
  4. 04 5月, 2017 2 次提交
  5. 27 4月, 2017 1 次提交
  6. 19 4月, 2017 1 次提交
  7. 09 4月, 2017 1 次提交
  8. 07 4月, 2017 1 次提交
  9. 29 3月, 2017 2 次提交
  10. 28 3月, 2017 2 次提交
    • S
      blk-throttle: choose a small throtl_slice for SSD · d61fcfa4
      Shaohua Li 提交于
      The throtl_slice is 100ms by default. This is a long time for SSD, a lot
      of IO can run. To make cgroups have smoother throughput, we choose a
      small value (20ms) for SSD.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d61fcfa4
    • S
      blk-throttle: make throtl_slice tunable · 297e3d85
      Shaohua Li 提交于
      throtl_slice is important for blk-throttling. It's called slice
      internally but it really is a time window blk-throttling samples data.
      blk-throttling will make decision based on the samplings. An example is
      bandwidth measurement. A cgroup's bandwidth is measured in the time
      interval of throtl_slice.
      
      A small throtl_slice meanse cgroups have smoother throughput but burn
      more CPUs. It has 100ms default value, which is not appropriate for all
      disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
      
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' reflects its character better.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      297e3d85
  11. 22 3月, 2017 2 次提交
    • O
      blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Omar Sandoval 提交于
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for both in-tree
      users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      34dbad5d
    • O
      blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb
      Omar Sandoval 提交于
      The stats buckets will become generic soon, so make the existing users
      use the common READ and WRITE definitions instead of one internal to
      blk-stat.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fa2e39cb
  12. 03 3月, 2017 1 次提交
  13. 15 2月, 2017 1 次提交
    • T
      block: do not allow updates through sysfs until registration completes · b410aff2
      Tahsin Erdogan 提交于
      When a new disk shows up, sysfs queue directory is created before elevator
      is registered. This allows a user to attempt a scheduler switch even though
      the initial registration hasn't completed yet.
      
      In one scenario, blk_register_queue() calls elv_register_queue() and
      right before cfq_registered_queue() is called, another process executes
      elevator_switch() and replaces q->elevator with deadline scheduler. When
      cfq_registered_queue() executes it interprets e->elevator_data as struct
      cfq_data even though it is actually struct deadline_data.
      
      Grab q->sysfs_lock in blk_register_queue() to synchronize with sysfs
      callers.
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b410aff2
  14. 09 2月, 2017 1 次提交
  15. 07 2月, 2017 1 次提交
  16. 03 2月, 2017 1 次提交
  17. 02 2月, 2017 2 次提交
  18. 28 1月, 2017 1 次提交
  19. 13 12月, 2016 1 次提交
  20. 01 12月, 2016 1 次提交
  21. 29 11月, 2016 2 次提交
  22. 18 11月, 2016 2 次提交
    • J
      blk-mq: make the polling code adaptive · 64f1c21e
      Jens Axboe 提交于
      The previous commit introduced the hybrid sleep/poll mode. Take
      that one step further, and use the completion latencies to
      automatically sleep for half the mean completion time. This is
      a good approximation.
      
      This changes the 'io_poll_delay' sysfs file a bit to expose the
      various options. Depending on the value, the polling code will
      behave differently:
      
      -1	Never enter hybrid sleep mode
       0	Use half of the completion mean for the sleep delay
      >0	Use this specific value as the sleep delay
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Tested-By: NStephen Bates <sbates@raithlin.com>
      Reviewed-By: NStephen Bates <sbates@raithlin.com>
      64f1c21e
    • J
      blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf
      Jens Axboe 提交于
      This patch enables a hybrid polling mode. Instead of polling after IO
      submission, we can induce an artificial delay, and then poll after that.
      For example, if the IO is presumed to complete in 8 usecs from now, we
      can sleep for 4 usecs, wake up, and then do our polling. This still puts
      a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
      after the IO has completed, it'll happen before. With this hybrid
      scheme, we can achieve big latency reductions while still using the same
      (or less) amount of CPU.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Tested-By: NStephen Bates <sbates@raithlin.com>
      Reviewed-By: NStephen Bates <sbates@raithlin.com>
      06426adf
  23. 12 11月, 2016 1 次提交
  24. 11 11月, 2016 2 次提交
    • J
      block: hook up writeback throttling · 87760e5e
      Jens Axboe 提交于
      Enable throttling of buffered writeback to make it a lot
      more smooth, and has way less impact on other system activity.
      Background writeback should be, by definition, background
      activity. The fact that we flush huge bundles of it at the time
      means that it potentially has heavy impacts on foreground workloads,
      which isn't ideal. We can't easily limit the sizes of writes that
      we do, since that would impact file system layout in the presence
      of delayed allocation. So just throttle back buffered writeback,
      unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration in the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to to negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-bw quickly snaps back to it's
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to me met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      87760e5e
    • J
      block: add scalable completion tracking of requests · cf43e6be
      Jens Axboe 提交于
      For legacy block, we simply track them in the request queue. For
      blk-mq, we track them on a per-sw queue basis, which we can then
      sum up through the hardware queues and finally to a per device
      state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
      flags. We currently don't turn it on if someone just reads any of
      the stats files, that is something we could add as well.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      cf43e6be
  25. 19 10月, 2016 2 次提交
  26. 21 9月, 2016 1 次提交
  27. 21 7月, 2016 1 次提交
  28. 13 4月, 2016 1 次提交
  29. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  30. 18 2月, 2016 1 次提交
  31. 26 11月, 2015 1 次提交
    • M
      block/sd: Fix device-imposed transfer length limits · ca369d51
      Martin K. Petersen 提交于
      Commit 4f258a46 ("sd: Fix maximum I/O size for BLOCK_PC requests")
      had the unfortunate side-effect of removing an implicit clamp to
      BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
      code. This caused problems for some SMR drives.
      
      Debugging this issue revealed a few problems with the existing
      infrastructure since the block layer didn't know how to deal with
      device-imposed limits, only limits set by the I/O controller.
      
       - Introduce a new queue limit, max_dev_sectors, which is used by the
         ULD to signal the maximum sectors for a REQ_TYPE_FS request.
      
       - Ensure that max_dev_sectors is correctly stacked and taken into
         account when overriding max_sectors through sysfs.
      
       - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
         values for later processing.
      
       - In sd_revalidate() set the queue's max_dev_sectors based on the
         MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
         is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
         field size.
      
       - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
         VPD--if reported and sane--to signal the preferred device transfer
         size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.
      
       - blk_limits_max_hw_sectors() is no longer used and can be removed.
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: sweeneygj@gmx.com
      Tested-by: NArzeets <anatol.pomozov@gmail.com>
      Tested-by: NDavid Eisner <david.eisner@oriel.oxon.org>
      Tested-by: NMario Kicherer <dev@kicherer.org>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      ca369d51