1. 27 Jan 2017, 4 commits
  2. 25 Jan 2017, 1 commit
  3. 21 Jan 2017, 1 commit
    • blk-mq: allow resize of scheduler requests · 70f36b60
      Jens Axboe authored
      Add support for growing the tags associated with a hardware queue, for
      the scheduler tags. Currently we only support resizing within the
      limits of the original depth; change that so we can grow the depth as
      well, by allocating and replacing the existing scheduler tag set.
      
      This is similar to how we could increase the software queue depth with
      the legacy IO stack and schedulers.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      70f36b60
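
      For illustration, a minimal C sketch of the grow-and-swap idea this
      commit describes, using a simplified tag-set structure and plain libc
      allocation rather than the real blk-mq types:

          /*
           * Simplified sketch: grow a per-hctx scheduler tag set beyond its
           * original depth by allocating a larger request array and swapping
           * it in.  The struct and helper are illustrative only.
           */
          #include <stdlib.h>
          #include <string.h>

          struct sched_tags {
                  unsigned int depth;     /* number of schedulable requests */
                  void **static_rqs;      /* pre-allocated request slots    */
          };

          /* Returns 0 on success, -1 on failure (old set left untouched). */
          static int sched_tags_grow(struct sched_tags *tags, unsigned int new_depth)
          {
                  void **rqs;

                  if (new_depth <= tags->depth)
                          return 0;       /* shrinking is handled elsewhere */

                  rqs = calloc(new_depth, sizeof(*rqs));
                  if (!rqs)
                          return -1;

                  /* keep the existing requests, then swap the arrays */
                  memcpy(rqs, tags->static_rqs, tags->depth * sizeof(*rqs));
                  free(tags->static_rqs);
                  tags->static_rqs = rqs;
                  tags->depth = new_depth;
                  return 0;
          }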
  4. 19 Jan 2017, 1 commit
  5. 18 Jan 2017, 8 commits
  6. 12 Jan 2017, 1 commit
  7. 26 Dec 2016, 1 commit
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner authored
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a seconds and a nanoseconds part of a time
      value need to be combined. For anything where the seconds argument is 0,
      this is pointless and can be replaced with a simple assignment.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8b0e1953
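
      The pattern being cleaned up, sketched as a kernel-style fragment
      (assuming the then-current tree where ktime_t had already become a
      plain s64 nanosecond scalar):

          #include <linux/ktime.h>

          static void example(void)
          {
                  ktime_t kt;

                  /* Before: ktime_set() with a zero seconds part */
                  kt = ktime_set(0, 10 * NSEC_PER_MSEC);

                  /* After: ktime_t is just nanoseconds, assign directly */
                  kt = 10 * NSEC_PER_MSEC;
                  (void)kt;
          }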
  8. 15 Dec 2016, 1 commit
    • blk-mq: Fix failed allocation path when mapping queues · d1b1cea1
      Gabriel Krisman Bertazi authored
      In blk_mq_map_swqueue, there is a memory optimization that frees the
      tags of a queue that has gone unmapped.  Later, if that hctx is remapped
      after another topology change, the tags need to be reallocated.
      
      If this allocation fails, a simple WARN_ON triggers, but the block layer
      ends up with an active hctx without any corresponding set of tags.
      Then, any incoming IO to that hctx can trigger an Oops.
      
      I can reproduce it consistently by running IO, flipping CPUs on and off
      and eventually injecting a memory allocation failure in that path.
      
      In the fix below, if the system experiences a failed allocation of any
      hctx's tags, we remap all the ctxs of that queue to hctx_0, which
      should always keep its tags.  There is a minor performance hit, since
      our mapping just got worse after the error path, but this is the
      simplest way to handle this error case.  The performance hit
      will disappear after another successful remap.
      
      I considered dropping the memory optimization altogether, but it
      seemed a bad trade-off to handle this very specific error case.
      
      This should apply cleanly on top of Jens' for-next branch.
      
      The Oops is the one below:
      
      SP (3fff935ce4d0) is in userspace
      1:mon> e
      cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
          pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
          lr: c000000000575328: __bt_get+0x48/0xd0
          sp: c000000fe99eb390
         msr: 900000010280b033
         dar: 28
       dsisr: 40000000
        current = 0xc000000fe9966800
        paca    = 0xc000000007e80300   softe: 0        irq_happened: 0x01
          pid   = 11035, comm = aio-stress
      Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
      (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
      1:mon> s
      [c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
      [c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
      [c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
      [c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
      [c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
      [c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
      [c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
      [c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
      [c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
      [c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
      [c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
      [c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
      [c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
      [c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
      [c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
      [c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
      [c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
      [c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
      [c000000fe99ebe30] c000000000009404 system_call+0x38/0xec
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d1b1cea1
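
      A loose C model of the fallback described above (illustrative only,
      with made-up simplified structures, not the actual blk_mq_map_swqueue
      code): if a hardware context ends up without tags after a failed
      allocation, point every software context that mapped to it back at
      hctx 0, which keeps its tags.

          /* Illustrative model of the error-path remap, not real blk-mq code. */
          struct hctx { int has_tags; };

          struct queue {
                  unsigned int nr_cpus;
                  unsigned int *cpu_to_hctx;   /* ctx (per-cpu) -> hctx index */
                  struct hctx *hctxs;
          };

          static void remap_tagless_to_hctx0(struct queue *q)
          {
                  unsigned int cpu;

                  for (cpu = 0; cpu < q->nr_cpus; cpu++) {
                          unsigned int i = q->cpu_to_hctx[cpu];

                          /* hctx 0 is assumed to always keep its tags */
                          if (!q->hctxs[i].has_tags)
                                  q->cpu_to_hctx[cpu] = 0;
                  }
          }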
  9. 14 Dec 2016, 1 commit
    • blk-mq: Avoid memory reclaim when remapping queues · 36e1f3d1
      Gabriel Krisman Bertazi authored
      While stressing memory and IO and changing SMT settings at the same
      time, we were able to consistently trigger deadlocks in the mm system,
      which froze the entire machine.
      
      I think that under memory stress conditions, the large allocations
      performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
      waiting on the block layer remapping completion, thus deadlocking the
      system.  The trace below was collected after the machine stalled,
      waiting for the hotplug event completion.
      
      The simplest fix for this is to make allocations in this path
      non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
      issue anymore.
      
      This should apply on top of Jens's for-next branch cleanly.
      
      Changes since v1:
        - Use GFP_NOIO instead of GFP_NOWAIT.
      
       Call Trace:
      [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
      [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
      [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
      [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
      [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
      [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
      [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
      [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
      [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
      [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
      [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
      [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
      [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
      [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
      [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
      [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
      [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
      [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
      [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
      [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
      [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
      [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
      [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
      [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
      [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
      [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
      [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
      [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
      [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
      [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
      [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
      [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
      [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
      [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
      [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
      [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
      [c000000f0160be30] [c000000000009204] system_call+0x38/0xec
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      36e1f3d1
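
      The kind of change involved, sketched as a kernel-style fragment
      (illustrative, with a hypothetical helper name; not the exact hunk from
      blk_mq_init_rq_map): allocations made while remapping queues use
      GFP_NOIO so they cannot recurse into reclaim that issues I/O.

          #include <linux/slab.h>

          /*
           * Illustrative: a large per-queue allocation done during a
           * CPU-hotplug remap.  GFP_KERNEL could enter direct reclaim and
           * wait on the very block layer that is being remapped; GFP_NOIO
           * forbids that.
           */
          static void *alloc_remap_buffer(size_t size, int node)
          {
                  return kzalloc_node(size, GFP_NOIO, node);
          }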
  10. 10 Dec 2016, 2 commits
  11. 06 Dec 2016, 1 commit
  12. 29 Nov 2016, 1 commit
  13. 18 Nov 2016, 2 commits
    • blk-mq: make the polling code adaptive · 64f1c21e
      Jens Axboe authored
      The previous commit introduced the hybrid sleep/poll mode. Take
      that one step further, and use the completion latencies to
      automatically sleep for half the mean completion time. This is
      a good approximation.
      
      This changes the 'io_poll_delay' sysfs file a bit to expose the
      various options. Depending on the value, the polling code will
      behave differently:
      
      -1	Never enter hybrid sleep mode
       0	Use half of the completion mean for the sleep delay
      >0	Use this specific value as the sleep delay
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Tested-By: Stephen Bates <sbates@raithlin.com>
      Reviewed-By: Stephen Bates <sbates@raithlin.com>
      64f1c21e
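
      A small C sketch of the delay selection the three settings imply
      (simplified; the parameter names and the assumption that the sysfs
      value is in microseconds are illustrative, not the actual blk-mq code):

          /* Illustrative mapping of the io_poll_delay setting to a sleep time. */
          static unsigned long long hybrid_sleep_ns(long long poll_delay_usec,
                                                    unsigned long long mean_ns)
          {
                  if (poll_delay_usec < 0)
                          return 0;                  /* never hybrid sleep    */
                  if (poll_delay_usec == 0)
                          return mean_ns / 2;        /* half the mean latency */
                  /* fixed, user-chosen delay */
                  return (unsigned long long)poll_delay_usec * 1000ULL;
          }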
    • blk-mq: implement hybrid poll mode for sync O_DIRECT · 06426adf
      Jens Axboe authored
      This patch enables a hybrid polling mode. Instead of polling after IO
      submission, we can induce an artificial delay, and then poll after that.
      For example, if the IO is presumed to complete in 8 usecs from now, we
      can sleep for 4 usecs, wake up, and then do our polling. This still puts
      a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
      after the IO has completed, it'll happen before. With this hybrid
      scheme, we can achieve big latency reductions while still using the same
      (or less) amount of CPU.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Tested-By: Stephen Bates <sbates@raithlin.com>
      Reviewed-By: Stephen Bates <sbates@raithlin.com>
      06426adf
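
      A userspace-flavoured C model of the hybrid idea (illustrative; the
      io_done callback stands in for the real completion check, and the
      expected time is assumed to be well under one second): sleep for
      roughly half the expected completion time, then busy-poll the rest.

          #include <stdbool.h>
          #include <time.h>

          static void hybrid_wait(unsigned long expected_ns, bool (*io_done)(void))
          {
                  struct timespec ts = {
                          .tv_sec  = 0,
                          .tv_nsec = expected_ns / 2, /* sleep ~half the expected time */
                  };

                  nanosleep(&ts, NULL);   /* one sleep/wakeup, before completion */

                  while (!io_done())
                          ;               /* then spin-poll the short remainder  */
          }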
  14. 16 Nov 2016, 1 commit
    • block: deal with stale req count of plug list · 0a6219a9
      Ming Lei authored
      In both the legacy and mq paths, the request count of the plug list is
      computed before allocating the request, so the number can be stale when
      we fall back to a sleeping allocation; the newly introduced wbt can
      sleep too.
      
      This patch deals with the case by checking whether the plug list has
      become empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds'
      introduced by Shaohua's patches for dispatching big requests immediately.
      
      Fixes: 600271d9(blk-mq: immediately dispatch big size request)
      Fixes: 50d24c34(block: immediately dispatch big size request)
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0a6219a9
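
      A rough C model of the idea (simplified structures, not the actual
      blk-mq code): the request count is sampled before a potentially
      sleeping allocation, so re-check the plug list afterwards and reset
      the count if the list was flushed in the meantime.

          /* Illustrative model: detect a plug list flushed while we slept. */
          struct plug {
                  unsigned int rq_count;  /* count sampled before allocation */
                  int list_empty;         /* nonzero if the list is now empty */
          };

          static void fixup_stale_plug_count(struct plug *plug)
          {
                  /*
                   * If the allocation slept (memory pressure, wbt throttling),
                   * the plug may have been flushed behind our back; a cached
                   * non-zero count would then be stale.
                   */
                  if (plug->list_empty)
                          plug->rq_count = 0;
          }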
  15. 12 Nov 2016, 2 commits
  16. 11 Nov 2016, 2 commits
    • block: hook up writeback throttling · 87760e5e
      Jens Axboe authored
      Enable throttling of buffered writeback to make it a lot smoother,
      with far less impact on other system activity. Background writeback
      should be, by definition, background activity. The fact that we flush
      huge bundles of it at a time means that it can have a heavy impact on
      foreground workloads, which isn't ideal. We can't easily limit the
      sizes of the writes we do, since that would affect file system layout
      in the presence of delayed allocation. So just throttle back buffered
      writeback, unless someone is waiting for it.
      
      The algorithm for when to throttle takes its inspiration from the
      CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
      CoDel, if we hit a window that exhibits good behavior, then we
      simply increment the scale count and re-calculate the limits for that
      scale value. This prevents us from oscillating between a
      close-to-ideal value and max all the time, instead remaining in the
      windows where we get good behavior.
      
      Unlike CoDel, blk-wb allows the scale count to go negative. This
      happens if we primarily have writes going on. Unlike positive
      scale counts, this doesn't change the size of the monitoring window.
      When the heavy writers finish, blk-wb quickly snaps back to its
      stable state of a zero scale count.
      
      The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
      target to be met. It defaults to 2 msec for non-rotational storage, and
      75 msec for rotational storage. Setting this value to '0' disables
      blk-wb. Generally, a user would not have to touch this setting.
      
      We don't enable WBT on devices that are managed with CFQ, and have
      a non-root block cgroup attached. If we have a proportional share setup
      on this particular disk, then the wbt throttling will interfere with
      that. We don't have a strong need for wbt for that case, since we will
      rely on CFQ doing that for us.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      87760e5e
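
      A loose C model of the windowed min-latency idea described above
      (illustrative only; the real blk-wbt scaling rules and window sizing
      differ in detail): each monitoring window records the minimum request
      latency, the allowed background-write depth is stepped down while that
      minimum exceeds the target, and stepped back up after a good window.

          /* Illustrative model of CoDel-style scaling on windowed min latency. */
          struct wb_state {
                  unsigned long long target_ns;  /* latency target (e.g. 2 ms)   */
                  unsigned long long win_min_ns; /* min latency seen this window */
                  unsigned int max_depth;        /* full background write depth  */
                  unsigned int depth;            /* currently allowed depth      */
          };

          static void window_expired(struct wb_state *wb)
          {
                  if (wb->win_min_ns > wb->target_ns) {
                          /* latency target violated: throttle harder */
                          if (wb->depth > 1)
                                  wb->depth /= 2;
                  } else if (wb->depth < wb->max_depth) {
                          /* good window: give writeback some depth back */
                          wb->depth *= 2;
                  }
                  wb->win_min_ns = ~0ULL;        /* reset for the next window */
          }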
    • block: add scalable completion tracking of requests · cf43e6be
      Jens Axboe authored
      For legacy block, we simply track the completion stats in the request
      queue. For blk-mq, we track them on a per-sw queue basis, which we can then
      sum up through the hardware queues and finally to a per device
      state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
      flags. We currently don't turn it on if someone just reads any of
      the stats files; that is something we could add as well.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      cf43e6be
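
      A small C sketch of the aggregation idea (a simplified stat record,
      not the real blk-mq structures): each software queue keeps a windowed
      record, and the per-device view is a merge across queues.

          /* Illustrative windowed latency stats, merged bottom-up per device. */
          struct rq_stat {
                  unsigned long long nr;   /* samples in the current window */
                  unsigned long long sum;  /* total latency, ns             */
                  unsigned long long min;  /* fastest completion, ns        */
                  unsigned long long max;  /* slowest completion, ns        */
          };

          static void stat_add(struct rq_stat *s, unsigned long long lat_ns)
          {
                  if (!s->nr || lat_ns < s->min)
                          s->min = lat_ns;
                  if (lat_ns > s->max)
                          s->max = lat_ns;
                  s->sum += lat_ns;
                  s->nr++;
          }

          static void stat_merge(struct rq_stat *dst, const struct rq_stat *src)
          {
                  if (!src->nr)
                          return;
                  if (!dst->nr || src->min < dst->min)
                          dst->min = src->min;
                  if (src->max > dst->max)
                          dst->max = src->max;
                  dst->sum += src->sum;
                  dst->nr  += src->nr;
          }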
  17. 07 Nov 2016, 1 commit
    • blk-mq: Always schedule hctx->next_cpu · c02ebfdd
      Gabriel Krisman Bertazi authored
      Commit 0e87e58b ("blk-mq: improve warning for running a queue on the
      wrong CPU") attempts to avoid triggering the WARN_ON in
      __blk_mq_run_hw_queue when the expected CPU is dead.  The problem is that, in the
      last batch execution before round robin, blk_mq_hctx_next_cpu can
      schedule a dead CPU and also update next_cpu to the next alive CPU in
      the mask, which will trigger the WARN_ON despite the previous
      workaround.
      
      The following patch fixes this scenario by always scheduling the value
      in hctx->next_cpu.  This changes the moment when we round-robin the CPU
      running the hctx, but it really doesn't matter, since it still executes
      BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
      
      Fixes: 0e87e58b ("blk-mq: improve warning for running a queue on the wrong CPU")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c02ebfdd
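
      A simplified C model of the scheduling rule after the fix
      (illustrative; the CPU mask is reduced to an array of online CPU ids
      and the batch constant is a stand-in for BLK_MQ_CPU_WORK_BATCH):
      always hand out the stored next_cpu, and only advance it round-robin
      once the work batch is used up.

          /* Illustrative model of the per-hctx CPU selection after the fix. */
          #define WORK_BATCH 8            /* runs on one CPU before moving on */

          struct hw_ctx {
                  const int *cpus;        /* online CPUs mapped to this hctx  */
                  int nr_cpus;
                  int next_idx;           /* index of hctx->next_cpu          */
                  int batch_left;         /* executions left in this batch    */
          };

          static int pick_next_cpu(struct hw_ctx *h)
          {
                  int cpu = h->cpus[h->next_idx]; /* always schedule next_cpu */

                  if (--h->batch_left <= 0) {
                          /* round-robin to the following CPU for the next batch */
                          h->next_idx = (h->next_idx + 1) % h->nr_cpus;
                          h->batch_left = WORK_BATCH;
                  }
                  return cpu;
          }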
  18. 04 Nov 2016, 1 commit
  19. 03 Nov 2016, 8 commits