1. 09 4月, 2010 3 次提交
    • D
      blkio: Add io_queued and avg_queue_size stats · cdc1184c
      Divyesh Shah 提交于
      These stats are useful for getting a feel for the queue depth of the cgroup,
      i.e., how filled up its queues are at a given instant and over the existence of
      the cgroup. This ability is useful when debugging problems in the wild as it
      helps understand the application's IO pattern w/o having to read through the
      userspace code (coz its tedious or just not available) or w/o the ability
      to run blktrace (since you may not have root access and/or not want to disturb
      performance).
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      cdc1184c
    • D
      blkio: Add io_merged stat · 812d4026
      Divyesh Shah 提交于
      This includes both the number of bios merged into requests belonging to this
      cgroup as well as the number of requests merged together.
      In the past, we've observed different merging behavior across upstream kernels,
      some by design some actual bugs. This stat helps a lot in debugging such
      problems when applications report decreased throughput with a new kernel
      version.
      
      This needed adding an extra elevator function to capture bios being merged as I
      did not want to pollute elevator code with blkiocg knowledge and hence needed
      the accounting invocation to come from CFQ.
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      812d4026
    • D
      blkio: Changes to IO controller additional stats patches · 84c124da
      Divyesh Shah 提交于
      that include some minor fixes and addresses all comments.
      
      Changelog: (most based on Vivek Goyal's comments)
      o renamed blkiocg_reset_write to blkiocg_reset_stats
      o more clarification in the documentation on io_service_time and io_wait_time
      o Initialize blkg->stats_lock
      o rename io_add_stat to blkio_add_stat and declare it static
      o use bool for direction and sync
      o derive direction and sync info from existing rq methods
      o use 12 for major:minor string length
      o define io_service_time better to cover the NCQ case
      o add a separate reset_stats interface
      o make the indexed stats a 2d array to simplify macro and function pointer code
      o blkio.time now exports in jiffies as before
      o Added stats description in patch description and
        Documentation/cgroup/blkio-controller.txt
      o Prefix all stats functions with blkio and make them static as applicable
      o replace IO_TYPE_MAX with IO_TYPE_TOTAL
      o Moved #define constant to top of blk-cgroup.c
      o Pass dev_t around instead of char *
      o Add note to documentation file about resetting stats
      o use BLK_CGROUP_MODULE in addition to BLK_CGROUP config option in #ifdef
        statements
      o Avoid struct request specific knowledge in blk-cgroup. blk-cgroup.h now has
        rq_direction() and rq_sync() functions which are used by CFQ and when using
        io-controller at a higher level, bio_* functions can be added.
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      84c124da
  2. 06 4月, 2010 1 次提交
    • M
      laptop-mode: Make flushes per-device · 31373d09
      Matthew Garrett 提交于
      One of the features of laptop-mode is that it forces a writeout of dirty
      pages if something else triggers a physical read or write from a device.
      The current implementation flushes pages on all devices, rather than only
      the one that triggered the flush. This patch alters the behaviour so that
      only the recently accessed block device is flushed, preventing other
      disks being spun up for no terribly good reason.
      Signed-off-by: NMatthew Garrett <mjg@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      31373d09
  3. 02 4月, 2010 4 次提交
  4. 25 3月, 2010 2 次提交
    • D
      cfq-iosched: Do not merge queues of BE and IDLE classes · 39c01b21
      Divyesh Shah 提交于
      Even if they are found to be co-operating.
      
      The prio_trees do not have any IDLE cfqqs on them. cfq_close_cooperator()
      is called from cfq_select_queue() and cfq_completed_request(). The latter
      ensures that the close cooperator code does not get invoked if the current
      cfqq is of class IDLE but the former doesn't seem to have any such checks.
      So an IDLE cfqq may get merged with a BE cfqq from the same group which
      should be avoided.
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      39c01b21
    • D
      cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging · b1ffe737
      Divyesh Shah 提交于
      These have helped us debug some issues we've noticed in earlier IO
      controller versions and should be useful now as well. The extra logging
      covers:
      - idling behavior. Since there are so many conditions based on which we decide
      to idle or not, this patch adds a log message for some conditions that we've
      found useful.
      - workload slices and current prio and workload type
      
      Changelog from v1:
      o moved log message from cfq_set_active_queue() to __cfq_set_active_queue()
      o changed queue_count to st->count
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b1ffe737
  5. 19 3月, 2010 1 次提交
  6. 16 3月, 2010 1 次提交
  7. 15 3月, 2010 2 次提交
  8. 13 3月, 2010 1 次提交
    • B
      cgroups: blkio subsystem as module · 67523c48
      Ben Blum 提交于
      Modify the Block I/O cgroup subsystem to be able to be built as a module.
      As the CFQ disk scheduler optionally depends on blk-cgroup, config options
      in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
      enhanced to support the new module dependency.
      Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67523c48
  9. 08 3月, 2010 1 次提交
  10. 01 3月, 2010 6 次提交
  11. 26 2月, 2010 6 次提交
  12. 23 2月, 2010 2 次提交
  13. 22 2月, 2010 1 次提交
  14. 09 2月, 2010 1 次提交
  15. 05 2月, 2010 1 次提交
  16. 03 2月, 2010 1 次提交
    • V
      cfq-iosched: Do not idle on async queues · 1efe8fe1
      Vivek Goyal 提交于
      Few weeks back, Shaohua Li had posted similar patch. I am reposting it
      with more test results.
      
      This patch does two things.
      
      - Do not idle on async queues.
      
      - It also changes the write queue depth CFQ drives (cfq_may_dispatch()).
        Currently, we seem to driving queue depth of 1 always for WRITES. This is
        true even if there is only one write queue in the system and all the logic
        of infinite queue depth in case of single busy queue as well as slowly
        increasing queue depth based on last delayed sync request does not seem to
        be kicking in at all.
      
      This patch will allow deeper WRITE queue depths (subjected to the other
      WRITE queue depth contstraints like cfq_quantum and last delayed sync
      request).
      
      Shaohua Li had reported getting more out of his SSD. For me, I have got
      one Lun exported from an HP EVA and when pure buffered writes are on, I
      can get more out of the system. Following are test results of pure
      buffered writes (with end_fsync=1) with vanilla and patched kernel. These
      results are average of 3 sets of run with increasing number of threads.
      
      AVERAGE[bufwfs][vanilla]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      bufwfs    3   1   0              0              95349          474141
      bufwfs    3   2   0              0              100282         806926
      bufwfs    3   4   0              0              109989         2.7301e+06
      bufwfs    3   8   0              0              116642         3762231
      bufwfs    3   16  0              0              118230         6902970
      
      AVERAGE[bufwfs] [patched kernel]
      -------
      bufwfs    3   1   0              0              270722         404352
      bufwfs    3   2   0              0              206770         1.06552e+06
      bufwfs    3   4   0              0              195277         1.62283e+06
      bufwfs    3   8   0              0              260960         2.62979e+06
      bufwfs    3   16  0              0              299260         1.70731e+06
      
      I also ran buffered writes along with some sequential reads and some
      buffered reads going on in the system on a SATA disk because the potential
      risk could be that we should not be driving queue depth higher in presence
      of sync IO going to keep the max clat low.
      
      With some random and sequential reads going on in the system on one SATA
      disk I did not see any significant increase in max clat. So it looks like
      other WRITE queue depth control logic is doing its job. Here are the
      results.
      
      AVERAGE[brr, bsr, bufw together] [vanilla]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      brr       3   1   850            546345         0              0
      bsr       3   1   14650          729543         0              0
      bufw      3   1   0              0              23908          8274517
      
      brr       3   2   981.333        579395         0              0
      bsr       3   2   14149.7        1175689        0              0
      bufw      3   2   0              0              21921          1.28108e+07
      
      brr       3   4   898.333        1.75527e+06    0              0
      bsr       3   4   12230.7        1.40072e+06    0              0
      bufw      3   4   0              0              19722.3        2.4901e+07
      
      brr       3   8   900            3160594        0              0
      bsr       3   8   9282.33        1.91314e+06    0              0
      bufw      3   8   0              0              18789.3        23890622
      
      AVERAGE[brr, bsr, bufw mixed] [patched kernel]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      brr       3   1   837            417973         0              0
      bsr       3   1   14357.7        591275         0              0
      bufw      3   1   0              0              24869.7        8910662
      
      brr       3   2   1038.33        543434         0              0
      bsr       3   2   13351.3        1205858        0              0
      bufw      3   2   0              0              18626.3        13280370
      
      brr       3   4   913            1.86861e+06    0              0
      bsr       3   4   12652.3        1430974        0              0
      bufw      3   4   0              0              15343.3        2.81305e+07
      
      brr       3   8   890            2.92695e+06    0              0
      bsr       3   8   9635.33        1.90244e+06    0              0
      bufw      3   8   0              0              17200.3        24424392
      
      So looks like it might make sense to include this patch.
      
      Thanks
      Vivek
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      1efe8fe1
  17. 01 2月, 2010 1 次提交
    • G
      blk-cgroup: Fix potential deadlock in blk-cgroup · bcf4dd43
      Gui Jianfeng 提交于
      I triggered a lockdep warning as following.
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.33-rc2 #1
      -------------------------------------------------------
      test_io_control/7357 is trying to acquire lock:
       (blkio_list_lock){+.+...}, at: [<c053a990>] blkiocg_weight_write+0x82/0x9e
      
      but task is already holding lock:
       (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&(&blkcg->lock)->rlock){......}:
             [<c04583b7>] validate_chain+0x8bc/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a
             [<c053a4e1>] blkiocg_add_blkio_group+0x1a/0x6d
             [<c053cac7>] cfq_get_queue+0x225/0x3de
             [<c053eec2>] cfq_set_request+0x217/0x42d
             [<c052c8a6>] elv_set_request+0x17/0x26
             [<c0532a0f>] get_request+0x203/0x2c5
             [<c0532ae9>] get_request_wait+0x18/0x10e
             [<c0533470>] __make_request+0x2ba/0x375
             [<c0531985>] generic_make_request+0x28d/0x30f
             [<c0532da7>] submit_bio+0x8a/0x8f
             [<c04d827a>] submit_bh+0xf0/0x10f
             [<c04d91d2>] ll_rw_block+0xc0/0xf9
             [<f86e9705>] ext3_find_entry+0x319/0x544 [ext3]
             [<f86eae58>] ext3_lookup+0x2c/0xb9 [ext3]
             [<c04c3e1b>] do_lookup+0xd3/0x172
             [<c04c56c8>] link_path_walk+0x5fb/0x95c
             [<c04c5a65>] path_walk+0x3c/0x81
             [<c04c5b63>] do_path_lookup+0x21/0x8a
             [<c04c66cc>] do_filp_open+0xf0/0x978
             [<c04c0c7e>] open_exec+0x1b/0xb7
             [<c04c1436>] do_execve+0xbb/0x266
             [<c04081a9>] sys_execve+0x24/0x4a
             [<c04028a2>] ptregs_execve+0x12/0x18
      
      -> #1 (&(&q->__queue_lock)->rlock){..-.-.}:
             [<c04583b7>] validate_chain+0x8bc/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a
             [<c053dd2a>] cfq_unlink_blkio_group+0x17/0x41
             [<c053a6eb>] blkiocg_destroy+0x72/0xc7
             [<c0467df0>] cgroup_diput+0x4a/0xb2
             [<c04ca473>] dentry_iput+0x93/0xb7
             [<c04ca4b3>] d_kill+0x1c/0x36
             [<c04cb5c5>] dput+0xf5/0xfe
             [<c04c6084>] do_rmdir+0x95/0xbe
             [<c04c60ec>] sys_rmdir+0x10/0x12
             [<c04027cc>] sysenter_do_call+0x12/0x32
      
      -> #0 (blkio_list_lock){+.+...}:
             [<c0458117>] validate_chain+0x61c/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c06929fd>] _raw_spin_lock+0x1e/0x4e
             [<c053a990>] blkiocg_weight_write+0x82/0x9e
             [<c0467f1e>] cgroup_file_write+0xc6/0x1c0
             [<c04bd2f3>] vfs_write+0x8c/0x116
             [<c04bd7c6>] sys_write+0x3b/0x60
             [<c04027cc>] sysenter_do_call+0x12/0x32
      
      other info that might help us debug this:
      
      1 lock held by test_io_control/7357:
       #0:  (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e
      stack backtrace:
      Pid: 7357, comm: test_io_control Not tainted 2.6.33-rc2 #1
      Call Trace:
       [<c045754f>] print_circular_bug+0x91/0x9d
       [<c0458117>] validate_chain+0x61c/0xb9c
       [<c0458dba>] __lock_acquire+0x723/0x789
       [<c0458eb0>] lock_acquire+0x90/0xa7
       [<c053a990>] ? blkiocg_weight_write+0x82/0x9e
       [<c06929fd>] _raw_spin_lock+0x1e/0x4e
       [<c053a990>] ? blkiocg_weight_write+0x82/0x9e
       [<c053a990>] blkiocg_weight_write+0x82/0x9e
       [<c0467f1e>] cgroup_file_write+0xc6/0x1c0
       [<c0454df5>] ? trace_hardirqs_off+0xb/0xd
       [<c044d93a>] ? cpu_clock+0x2e/0x44
       [<c050e6ec>] ? security_file_permission+0xf/0x11
       [<c04bcdda>] ? rw_verify_area+0x8a/0xad
       [<c0467e58>] ? cgroup_file_write+0x0/0x1c0
       [<c04bd2f3>] vfs_write+0x8c/0x116
       [<c04bd7c6>] sys_write+0x3b/0x60
       [<c04027cc>] sysenter_do_call+0x12/0x32
      
      To prevent deadlock, we should take locks as following sequence:
      
      blkio_list_lock -> queue_lock ->  blkcg_lock.
      
      The following patch should fix this bug.
      Signed-off-by: NGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      bcf4dd43
  18. 29 1月, 2010 1 次提交
    • A
      block: Added in stricter no merge semantics for block I/O · 488991e2
      Alan D. Brunelle 提交于
      Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
      merges at all are to be attempted (not even the simple one-hit cache).
      
      The following table illustrates the additional benefit - 5 minute runs of
      a random I/O load were applied to a dozen devices on a 16-way x86_64 system.
      
      nomerges        Throughput      %System         Improvement (tput / %sys)
      --------        ------------    -----------     -------------------------
      0               12.45 MB/sec    0.669365609
      1               12.50 MB/sec    0.641519199     0.40% / 2.71%
      2               12.52 MB/sec    0.639849750     0.56% / 2.96%
      Signed-off-by: NAlan D. Brunelle <alan.brunelle@hp.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      488991e2
  19. 11 1月, 2010 4 次提交