1. 22 Nov 2016 (1 commit)
2. 11 Nov 2016 (2 commits)
• block: hook up writeback throttling · 87760e5e
  Authored by Jens Axboe
Enable throttling of buffered writeback to make it a lot
smoother and have far less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has a heavy impact on foreground workloads,
which isn't ideal. We can't easily limit the sizes of the writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
      
The algorithm for when to throttle takes its inspiration from the
CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
      the minimum latencies of requests over a window of time. In that
      window of time, if the minimum latency of any request exceeds a
      given target, then a scale count is incremented and the queue depth
      is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, we simply
decrement the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, and instead keeps us in the
windows where we get good behavior.
      
Unlike CoDel, blk-wb allows the scale count to go negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-wb quickly snaps back to its
stable state of a zero scale count.
      
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to be met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
      
We don't enable WBT on devices that are managed by CFQ and have
a non-root block cgroup attached. If we have a proportional share setup
on that particular disk, the wbt throttling would interfere with
it. We don't have a strong need for wbt in that case, since we
rely on CFQ to do the throttling for us.
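
For illustration, here is a minimal userspace sketch of the scaling scheme described above. All names (wb_state, scale_step, window_done) are made up for the example; this is not the kernel's blk-wb code.

#include <stdio.h>

struct wb_state {
        int scale_step;         /* > 0 means throttled, < 0 means boosted */
        unsigned int depth;     /* currently allowed queue depth */
        unsigned int max_depth;
};

static void scale_down(struct wb_state *wb)
{
        wb->scale_step++;                  /* missed the target: throttle harder */
        if (wb->depth > 1)
                wb->depth /= 2;
}

static void scale_up(struct wb_state *wb)
{
        wb->scale_step--;                  /* good window: relax gradually */
        if (wb->depth < wb->max_depth)
                wb->depth *= 2;
}

/* Called once per monitoring window with the minimum latency observed. */
static void window_done(struct wb_state *wb, unsigned long min_lat_usec,
                        unsigned long target_usec)
{
        if (min_lat_usec > target_usec)
                scale_down(wb);
        else
                scale_up(wb);
}

int main(void)
{
        struct wb_state wb = { .scale_step = 0, .depth = 16, .max_depth = 64 };

        window_done(&wb, 5000, 2000);      /* a window that misses a 2 msec target */
        window_done(&wb, 300, 2000);       /* a window that meets it */
        printf("scale_step=%d depth=%u\n", wb.scale_step, wb.depth);
        return 0;
}
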
Signed-off-by: Jens Axboe <axboe@fb.com>
      87760e5e
• block: cfq_cpd_alloc() should use @gfp · ebc4ff66
  Authored by Tejun Heo
cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq, was
incorrectly hard-coding GFP_KERNEL instead of using the mask specified
through the @gfp parameter.  This currently doesn't cause any actual
issues because all current callers specify GFP_KERNEL.  Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e4a9bde9 ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
Signed-off-by: Jens Axboe <axboe@fb.com>
      ebc4ff66
3. 01 Nov 2016 (2 commits)
4. 28 Oct 2016 (1 commit)
• block: better op and flags encoding · ef295ecf
  Authored by Christoph Hellwig
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
In addition, this allows passing around only one value in the block layer
instead of two (and eventually also in the file systems, but we can do
that later), and thus cleans up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
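
A small self-contained sketch of the encoding idea: the operation lives in the low bits of a single 32-bit value and the flags sit above it, so the op no longer needs to be shifted around. The bit split and flag names below are assumptions for the example, not the kernel's actual REQ_OP_*/REQ_* layout.

#include <stdint.h>
#include <stdio.h>

#define OP_BITS  8                         /* room for up to 256 ops */
#define OP_MASK  ((1u << OP_BITS) - 1)

enum example_op { OP_READ = 0, OP_WRITE = 1, OP_DISCARD = 3 };

#define FLAG_SYNC  (1u << (OP_BITS + 0))
#define FLAG_META  (1u << (OP_BITS + 1))

static inline uint32_t encode(enum example_op op, uint32_t flags)
{
        return (uint32_t)op | flags;       /* op sits in the low bits, unshifted */
}

static inline enum example_op get_op(uint32_t opf)
{
        return (enum example_op)(opf & OP_MASK);
}

int main(void)
{
        uint32_t opf = encode(OP_WRITE, FLAG_SYNC | FLAG_META);

        printf("op=%u sync=%d meta=%d\n", (unsigned)get_op(opf),
               !!(opf & FLAG_SYNC), !!(opf & FLAG_META));
        return 0;
}
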
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
      ef295ecf
5. 24 Sep 2016 (1 commit)
• cfq: fix starvation of asynchronous writes · 3932a86b
  Authored by Glauber Costa
      While debugging timeouts happening in my application workload (ScyllaDB), I have
      observed calls to open() taking a long time, ranging everywhere from 2 seconds -
      the first ones that are enough to time out my application - to more than 30
      seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
under certain circumstances, and that's confirmed by the following backtrace
      taken by the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
the open() times I saw in that run.
      
Still from the blktrace output, we can, after searching a bit, identify the
request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
When the queue itself tests cfq_may_dispatch(), it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such a situation before we have the chance to execute anything (as seen in the
trace above).
      
      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. This flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired).
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
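
For illustration, a toy model of the mechanism: mark the queue must_dispatch when the oldest queued request has already passed its fifo expiry, and let that flag veto preemption by non-rt queues. The names and types are hypothetical, not the cfq-iosched code.

#include <stdbool.h>
#include <stdio.h>

struct toy_queue {
        unsigned long head_insert_time;    /* when the oldest request was queued (ms) */
        bool must_dispatch;
};

static void check_fifo(struct toy_queue *q, unsigned long now_ms,
                       unsigned long fifo_expire_ms)
{
        if (now_ms - q->head_insert_time >= fifo_expire_ms)
                q->must_dispatch = true;   /* the head of the FIFO has expired */
}

static bool may_preempt(const struct toy_queue *active, bool new_is_rt)
{
        if (new_is_rt)
                return true;               /* rt requests may still preempt */
        return !active->must_dispatch;
}

int main(void)
{
        struct toy_queue q = { .head_insert_time = 0, .must_dispatch = false };

        check_fifo(&q, 300, 250);          /* fifo_expire-style deadline has passed */
        printf("preempt allowed: %d\n", may_preempt(&q, false));   /* 0 */
        q.must_dispatch = false;           /* cleared after the dispatch round */
        return 0;
}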
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
Signed-off-by: Glauber Costa <glauber@scylladb.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      3932a86b
6. 08 Aug 2016 (1 commit)
• block: rename bio bi_rw to bi_opf · 1eff9d32
  Authored by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokenness linger,
rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
      1eff9d32
7. 21 Jul 2016 (1 commit)
• block: do not merge requests without consulting with io scheduler · 72ef799b
  Authored by Tahsin Erdogan
Before merging a bio into an existing request, the io scheduler is called to
get its approval first. However, requests that come from a plug
flush may get merged by the block layer without consulting the io
scheduler.
      
In case of CFQ, this can cause fairness problems. For instance, if a
request gets merged into a low-weight cgroup's request, the high-weight cgroup
now depends on the low-weight cgroup to get scheduled. If the high-weight cgroup
needs that io request to complete before submitting more requests, then it
will also lose its timeslice.
      
The following script demonstrates the problem. Group g1 has a low weight, g2
and g3 have equal high weights, but g2's requests are adjacent to g1's
requests, so they are subject to merging. Due to these merges, g2 gets
a poor disk time allocation.
      
      cat > cfq-merge-repro.sh << "EOF"
      #!/bin/bash
      set -e
      
      IO_ROOT=/mnt-cgroup/io
      
      mkdir -p $IO_ROOT
      
      if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
      fi
      
      cd $IO_ROOT
      
      for i in g1 g2 g3; do
        if [ -d $i ]; then
          rmdir $i
        fi
      done
      
      mkdir g1 && echo 10 > g1/blkio.weight
      mkdir g2 && echo 495 > g2/blkio.weight
      mkdir g3 && echo 495 > g3/blkio.weight
      
      RUNTIME=10
      
      (echo $BASHPID > g1/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=0k &> /dev/null)&
      
      (echo $BASHPID > g2/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=64k &> /dev/null)&
      
      (echo $BASHPID > g3/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=256k &> /dev/null)&
      
      sleep $((RUNTIME+1))
      
      for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
      done
      
      EOF
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 162
      ---- g2 ----
      8:16 165
      ---- g3 ----
      8:16 686
      
      After applying the patch:
      
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 90
      ---- g2 ----
      8:16 445
      ---- g3 ----
      8:16 471
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      72ef799b
8. 28 Jun 2016 (3 commits)
• cfq-iosched: Charge at least 1 jiffie instead of 1 ns · 0b31c10c
  Authored by Jan Kara
Commit 9a7f38c4 (cfq-iosched: Convert from jiffies to nanoseconds)
could result in charging just 1 ns to a cgroup submitting IO instead of
the 1 jiffy we always charged before. It is arguable what the right amount
to charge is, but for now let's retain the old behavior of always charging at
least one jiffy.
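
A tiny sketch of the clamp being described, in plain C and assuming HZ=250 purely for the example.

#include <stdint.h>
#include <stdio.h>

#define HZ              250
#define NSEC_PER_SEC    1000000000ULL
#define NSEC_PER_JIFFY  (NSEC_PER_SEC / HZ)

/* Never account less than one jiffy worth of time to the group. */
static uint64_t charge_ns(uint64_t used_ns)
{
        return used_ns > NSEC_PER_JIFFY ? used_ns : NSEC_PER_JIFFY;
}

int main(void)
{
        printf("%llu\n", (unsigned long long)charge_ns(1));          /* 4000000 */
        printf("%llu\n", (unsigned long long)charge_ns(9000000ULL)); /* 9000000 */
        return 0;
}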
      
Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
      0b31c10c
• cfq-iosched: Fix regression in bonnie++ rewrite performance · 149321a6
  Authored by Jan Kara
      Commit 9a7f38c4 (cfq-iosched: Convert from jiffies to nanoseconds)
      broke the condition for detecting starved sync IO in
      cfq_completed_request() because rq->start_time remained in jiffies but
      we compared it with nanosecond values. This manifested as a regression
      in bonnie++ rewrite performance because we always ended up considering
      sync IO starved and thus never increased async IO queue depth.
      
Since rq->start_time is used in a lot of places, converting it to ns
values would be non-trivial. So just revert the condition in CFQ to use
a comparison with jiffies. This will lead to suboptimal results if
cfq_fifo_expire[1] ever comes close to 1 jiffy, but so far we are
relatively far from that with the storage used with CFQ (the default
value is 128 ms).
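
The kind of unit mismatch described here is easy to reproduce in a few lines of userspace C (HZ=250 and all numbers are made up for the example; this is not the cfq condition itself).

#include <stdint.h>
#include <stdio.h>

#define HZ 250

int main(void)
{
        uint64_t start_jiffies = 100000;             /* a jiffies timestamp */
        uint64_t now_ns        = 400ULL * 1000000000ULL;
        uint64_t threshold_ns  = 8ULL * 1000000;     /* 8 ms */

        /* wrong: mixes jiffies with nanoseconds, so it is almost always true */
        printf("mixed units: %d\n", now_ns - start_jiffies > threshold_ns);

        /* consistent: convert both sides to jiffies before comparing */
        uint64_t now_jiffies = now_ns / (1000000000ULL / HZ);
        uint64_t threshold_j = threshold_ns * HZ / 1000000000ULL;
        printf("same units:  %d\n", now_jiffies - start_jiffies > threshold_j);
        return 0;
}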
      
Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
      149321a6
• cfq-iosched: Convert slice_resid from u64 to s64 · 93fdf147
  Authored by Jan Kara
slice_resid can be both positive and negative. Commit 9a7f38c4
(cfq-iosched: Convert from jiffies to nanoseconds) converted it from
long to u64. Although this did not introduce any functional regression
(the operations just overflow and the result was fine), it is certainly
wrong and could cause issues in the future. So convert the type to the more
appropriate s64.
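
A quick illustration of why a residual that can go negative wants a signed type: storing -5 in a u64 still "works" arithmetically, but it prints and compares as a huge positive number.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        int64_t  resid_s = -5;
        uint64_t resid_u = (uint64_t)-5;        /* wraps to 2^64 - 5 */

        printf("s64: %" PRId64 "\n", resid_s);
        printf("u64: %" PRIu64 "\n", resid_u);
        printf("u64 > 0: %d\n", resid_u > 0);   /* 1, which is misleading */
        return 0;
}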
      
Fixes: 9a7f38c4 ("cfq-iosched: Convert from jiffies to nanoseconds")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
      93fdf147
9. 10 Jun 2016 (1 commit)
• cfq-iosched: temporarily boost queue priority for idle classes · b8269db4
  Authored by Jens Axboe
      If we're queuing REQ_PRIO IO and the task is running at an idle IO
      class, then temporarily boost the priority. This prevents livelocks
      due to priority inversion, when a low priority task is holding file
      system resources while attempting to do IO.
      
      An example of that is shown below. An ioniced idle task is holding
      the directory mutex, while a normal priority task is trying to do
      a directory lookup.
      
      [478381.198925] ------------[ cut here ]------------
      [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
      [478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.203462] ionice          D ffff8803692736a8     0 1168369      1 0x00000080
      [478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
      [478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
      [478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
      [478381.206874] Call Trace:
      [478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
      [478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
      [478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
      [478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
      [478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
      [478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
      [478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
      [478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
      [478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
      [478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
      [478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
      [478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
      [478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
      [478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
      [478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
      [478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
      [478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
      [478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
      [478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
      [478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
      [478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
      [478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
      [478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
      [478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
      [478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
      [478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
      [478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
      [478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
      [478381.239126] ------------[ cut here ]------------
      [478381.239915] ------------[ cut here ]------------
      [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
      [478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
      [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 0x00000080
      [478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
      [478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
      [478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
      [478381.248252] Call Trace:
      [478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
      [478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
      [478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
      [478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
      [478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
      [478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
      [478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
      [478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
      [478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
      [478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
      [478381.257256]  [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
      [478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
      [478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
      [478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
      [478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
      [478381.261779]  [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170
      [478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
      [478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
      [478381.264551] ------------[ cut here ]------------
      [478381.265377] ------------[ cut here ]------------
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      b8269db4
10. 08 Jun 2016 (6 commits)
11. 05 Apr 2016 (1 commit)
• mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
  Authored by Kirill A. Shutemov
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
      
This promise never materialized, and it is unlikely it ever will.
      
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion whether
PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
especially on the border between fs and mm.
      
Switching globally to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
This patch contains automated changes generated with coccinelle using the
script below.  For some reason, coccinelle doesn't patch header files;
I've called spatch for them manually.
      
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
There are a few places in the code that coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation will
also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
12. 05 Feb 2016 (4 commits)
• cfq-iosched: Allow parent cgroup to preempt its child · 3984aa55
  Authored by Jan Kara
Currently we don't allow the sync workload of one cgroup to preempt the sync
workload of any other cgroup. This is because we want to achieve service
separation between cgroups. However, in cases where the preempting cgroup is an
ancestor of the current cgroup, there is no need for separation and
idling introduces unnecessary overhead. This hurts, for example, the case
where the workload is isolated within a cgroup but journalling threads are in
the root cgroup. A simple way to demonstrate the issue is using:
      
      dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
      
on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to make
the difference more visible). When all processes are in the root cgroup,
the reported throughput is 153.132 MB/sec. When the dbench process gets its own
blkio cgroup, the reported throughput drops to 26.1006 MB/sec.
      
Fix the problem by making the check in cfq_should_preempt() more benevolent
and allowing preemption by an ancestor cgroup. This improves the throughput
reported by dbench4 to 48.9106 MB/sec.
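
The ancestor test being described can be sketched in a few lines of plain C; the toy_cgroup type is illustrative and stands in for the blkcg group hierarchy.

#include <stdbool.h>
#include <stdio.h>

struct toy_cgroup {
        const char *name;
        struct toy_cgroup *parent;
};

/* Allow preemption when the preempting group is an ancestor of the active one. */
static bool is_ancestor(const struct toy_cgroup *anc, const struct toy_cgroup *node)
{
        for (; node; node = node->parent)
                if (node == anc)
                        return true;
        return false;
}

int main(void)
{
        struct toy_cgroup root  = { "root", NULL };
        struct toy_cgroup child = { "dbench", &root };

        /* e.g. a journalling thread in the root cgroup vs. an isolated workload */
        printf("root may preempt child: %d\n", is_ancestor(&root, &child));  /* 1 */
        printf("child may preempt root: %d\n", is_ancestor(&child, &root));  /* 0 */
        return 0;
}
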
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      3984aa55
• cfq-iosched: Allow sync noidle workloads to preempt each other · a257ae3e
  Authored by Jan Kara
      The original idea with preemption of sync noidle queues (introduced in
      commit 718eee05 "cfq-iosched: fairness for sync no-idle queues") was
      that we service all sync noidle queues together, we don't idle on any of
      the queues individually and we idle only if there is no sync noidle
      queue to be served. This intention also matches the original test:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
      	   && new_cfqq->service_tree == cfqq->service_tree)
      		return true;
      
      However since at that time cfqq->service_tree was not set for idling
      queues, this test was unreliable and was replaced in commit e4a22919
      "cfq-iosched: fix no-idle preemption logic" by:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
      	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
      	    new_cfqq->service_tree->count == 1)
      		return true;
      
      That was a reliable test but was actually doing something different -
      now we preempt sync noidle queue only if the new queue is the only one
      busy in the service tree.
      
These days a cfq queue is kept in the service tree even if it is idling and
thus the original check would be safe again. But since we actually check
      that cfq queues are in the same cgroup, of the same priority class and
      workload type (sync noidle), we know that new_cfqq is fine to preempt
      cfqq. So just remove the service tree check.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      a257ae3e
• cfq-iosched: Reorder checks in cfq_should_preempt() · 6c80731c
  Authored by Jan Kara
Move the check for preemption by the rt class up. There is no functional change,
but it makes arguing about the conditions simpler since we can be sure both
cfq queues are from the same ioprio class.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      6c80731c
• cfq-iosched: Don't group_idle if cfqq has big thinktime · e795421e
  Authored by Jan Kara
There is no point in idling on a cfq group if the only cfq queue in it
has too big a thinktime.
Signed-off-by: Jan Kara <jack@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      e795421e
13. 18 Sep 2015 (1 commit)
• cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
  Authored by Tejun Heo
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130
14. 19 Aug 2015 (15 commits)
• blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
  Authored by Tejun Heo
      cgroup is trying to make interface consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
        enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      69d7fde5
• blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/ · 3ecca629
  Authored by Tejun Heo
blkcg is going to switch to the cgroup common weight range as defined by
CGROUP_WEIGHT_* on the unified hierarchy.  In preparation, rename the
CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      3ecca629
• blkcg: implement interface for the unified hierarchy · 2ee867dc
  Authored by Tejun Heo
The blkcg interface grew to be the biggest of all controllers and
unfortunately also the most inconsistent.  The interface files are
inconsistent, with a number of close duplicates.  Some files have
recursive variants while others don't.  There's a distinction between
normal and leaf weights which isn't intuitive, and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too many implementation details to userland.
      
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks, rendering the two structural issues
twisting the current interface moot.  The interface has to be updated in a
significant way anyway and this is a good chance to revamp it as a whole.
This patch implements the blkcg interface for the unified hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
* The original interface, consisting of 27 files, is replaced with the
  following three files.
      
  io.stat	: per-blkcg stats
  io.weight	: per-cgroup and per-cgroup-queue weight settings
  io.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      2ee867dc
• blkcg: misc preparations for unified hierarchy interface · dd165eb3
  Authored by Tejun Heo
      * Export blkg_dev_name()
      
      * Drop unnecessary @cft from __cfq_set_weight().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      dd165eb3
• blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
  Authored by Tejun Heo
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
and reads the NUM part into blkg_conf_ctx->v.  This is quite
restrictive and gets in the way of implementing the blkcg interface for
the unified hierarchy.  This patch updates blkg_conf_prep() so that it
expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
non-const pointer, and to accommodate that, const is dropped from @input
too.
      
      This doesn't cause any behavior changes.
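
A userspace sketch of the MAJ:MIN BODY_STR split described above; the body string is just an example, and this is not the kernel function itself.

#define _DEFAULT_SOURCE                    /* for strsep() with glibc */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[] = "8:16 riops=1000 wbps=1048576";
        char *body = line;
        char *dev = strsep(&body, " \t");  /* "8:16"; body now points at the rest */
        unsigned int major, minor;

        if (!dev || !body || sscanf(dev, "%u:%u", &major, &minor) != 2) {
                fprintf(stderr, "bad input\n");
                return 1;
        }

        /* the caller is free to parse body further, e.g. with more strsep() calls */
        printf("dev %u:%u body '%s'\n", major, minor, body);
        return 0;
}
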
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      36aa9e5f
• blkcg: mark existing cftypes as legacy · 880f50e2
  Authored by Tejun Heo
blkcg is about to grow an interface for the unified hierarchy.  Add a
'legacy' tag to the existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      880f50e2
• blkcg: refine error codes returned during blkcg configuration · 20386ce0
  Authored by Tejun Heo
blkcg currently returns -EINVAL for most errors, which can be pretty
confusing given that the failure modes are quite varied.  Update the
error returns so that
      
      * -EINVAL only for syntactic errors.
      * -ERANGE if the value is out of range.
      * -ENODEV if the target device can't be found.
      * -EOPNOTSUPP if the policy is not enabled on the target device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      20386ce0
• blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device() · 5332dfc3
  Authored by Tejun Heo
      blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
      enabled are guaranteed to return non-NULL and the counterpart in
      blk-throttle doesn't have these checks either.  Remove the spurious
      NULL checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
      5332dfc3
• blkcg: remove cfqg_stats->sectors · 702747ca
  Authored by Tejun Heo
cfqg_stats->sectors is a blkg_stat which keeps track of the total
number of sectors serviced; however, this can be trivially calculated
from blkcg_gq->stat_bytes.  The only thing necessary is adding up
READs and WRITEs and then dividing by the sector size.
      
      Remove cfqg_stats->sectors and make cfq print "sectors" and
      "sectors_recursive" from stat_bytes.
      
      While this is a bit more code, it removes duplicate stat allocations
      and updates and ensures that the reported stats stay in tune with each
      other.
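
The calculation itself is trivial; for illustration:

#include <stdio.h>

int main(void)
{
        unsigned long long read_bytes = 1048576, write_bytes = 524288;
        unsigned long long sectors = (read_bytes + write_bytes) / 512;

        printf("sectors=%llu\n", sectors);   /* 3072 */
        return 0;
}
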
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      702747ca
• blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
  Authored by Tejun Heo
Currently, both cfq-iosched and blk-throttle keep track of
io_service_bytes and io_serviced stats.  While keeping track of them
separately may be useful during development, it doesn't make much
sense otherwise.  Also, blk-throttle was counting bios as IOs while
cfq-iosched counted requests, which is more confusing than informative.
      
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and let
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      
      v3: Update blkio-controller.txt accordingly.
      
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      77ea7338
• blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq · f12c74ca
  Authored by Tejun Heo
      Currently, blkg_[rw]stat_recursive_sum() assume that the target
      counter is located in pd (blkg_policy_data); however, some counters
      are planned to be moved to blkg (blkcg_gq).
      
      This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
      blkg_policy pointers instead of pd.  If policy is NULL, it indexes
      into blkg.  If non-NULL, into the blkg's pd of the policy.
      
      The existing usages are updated to maintain the current behaviors.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      f12c74ca
• blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
  Authored by Tejun Heo
blkcg_[rw]stat are used as stat counters for blkcg policies.  They aren't
per-cpu by themselves, and blk-throttle makes them per-cpu by wrapping around
them.  This patch makes blkcg_[rw]stat per-cpu and drops the ad-hoc
per-cpu wrapping in blk-throttle.
      
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      
      * Custom per-cpu stat implementation in blk-throttle is removed.
      
This makes all blkcg stat counters per-cpu without complicating policy
implementations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      24bdb8ef
• blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it · e6269c44
  Authored by Tejun Heo
cgroup stats are local to each cgroup and don't propagate to
ancestors by default.  When recursive stats are necessary, the sum is
calculated over all the descendants.  This initially was for backward
compatibility to support both group-local and recursive stats, but this
mode of operation makes general sense as stat updates are much hotter
than reporting those stats.
      
      This however ends up losing recursive stats when a child is removed.
      To work around this, cfq-iosched adds its stats to its parent
      cfq_group->dead_stats which is summed up together when calculating
      recursive stats.
      
      It's planned that the core stats will be moved to blkcg_gq, so we want
      to move the mechanism for keeping track of the stats of dead children
      from cfq to blkcg core.  This patch adds blkg_[rw]stat->aux_cnt which
      are atomic64_t's keeping track of auxiliary counts which are excluded
      when reading local counts but included for recursive.
      
      blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
      are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
      a dead cgroup to the aux counts of parent->stats instead of separate
      ->dead_stats.
      
      This will also help making blkg_[rw]stats per-cpu.
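
A toy model of the aux counter scheme: a group's recursive total is its own count plus aux (inherited from removed children) plus the recursive totals of live children, and removing a child folds its total into the parent's aux. The types are hypothetical stand-ins for blkg_[rw]stat.

#include <stdio.h>

struct toy_stat {
        unsigned long long cnt;   /* group-local count */
        unsigned long long aux;   /* counts inherited from removed children */
};

static unsigned long long recursive_sum(const struct toy_stat *s,
                                        const struct toy_stat *children,
                                        int nr_children)
{
        unsigned long long sum = s->cnt + s->aux;

        for (int i = 0; i < nr_children; i++)
                sum += recursive_sum(&children[i], NULL, 0);
        return sum;
}

static void remove_child(struct toy_stat *parent, const struct toy_stat *child)
{
        /* fold the dying child's totals into the parent's aux count */
        parent->aux += child->cnt + child->aux;
}

int main(void)
{
        struct toy_stat parent = { .cnt = 100, .aux = 0 };
        struct toy_stat child  = { .cnt = 40,  .aux = 5 };

        printf("recursive before: %llu\n", recursive_sum(&parent, &child, 1));
        remove_child(&parent, &child);
        printf("recursive after:  %llu\n", recursive_sum(&parent, NULL, 0));
        return 0;
}
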
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      e6269c44
• blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
  Authored by Tejun Heo
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      
This patch consolidates blkg creation into a new function,
blkcg_bio_issue_check(), which is called during bio issue from
generic_make_request_checks().  blkcg_bio_issue_check() is now the
only function which tries to create missing blkg's.  The subsequent
policy and request_list operations just perform blkg_lookup() and, if it is
missing, fall back to the root.
      
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
cfq_lookup_cfqg(), which uses blkg_lookup().
      
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      performed.
      
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
          !CONFIG_BLK_DEV_THROTTLING.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      ae118896
• blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods · e4a9bde9
  Authored by Tejun Heo
      Each active policy has a cpd (blkcg_policy_data) on each blkcg.  The
      cpd's were allocated by blkcg core and each policy could request to
      allocate extra space at the end by setting blkcg_policy->cpd_size
      larger than the size of cpd.
      
      This is a bit unusual but blkg (blkcg_gq) policy data used to be
      handled this way too so it made sense to be consistent; however, blkg
      policy data switched to alloc/free callbacks.
      
      This patch makes similar changes to cpd handling.
      blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size.  As
      cpd allocation is now done from policy side, it can simply allocate a
      larger area which embeds cpd at the beginning.
      
      As ->cpd_alloc_fn() may be able to perform all necessary
      initializations, this patch makes ->cpd_init_fn() optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
      e4a9bde9