1. 15 Mar, 2010 1 commit
  2. 01 Mar, 2010 6 commits
  3. 26 Feb, 2010 6 commits
  4. 23 Feb, 2010 2 commits
  5. 22 Feb, 2010 1 commit
  6. 05 Feb, 2010 1 commit
  7. 03 Feb, 2010 1 commit
    • cfq-iosched: Do not idle on async queues · 1efe8fe1
      Committed by Vivek Goyal
      A few weeks back, Shaohua Li posted a similar patch. I am reposting it
      with more test results.
      
      This patch does two things.
      
      - Do not idle on async queues.
      
      - It also changes the write queue depth CFQ drives (cfq_may_dispatch()).
        Currently we always seem to be driving a queue depth of 1 for WRITES.
        This is true even if there is only one write queue in the system; the
        logic for infinite queue depth in the case of a single busy queue, as
        well as for slowly increasing queue depth based on the last delayed
        sync request, does not seem to be kicking in at all.
      
      This patch will allow deeper WRITE queue depths (subject to the other
      WRITE queue depth constraints like cfq_quantum and the last delayed sync
      request).
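      
      As a rough standalone C sketch of the two decisions described above (the
      struct fields and helpers are illustrative stand-ins, not the actual
      cfq-iosched symbols; the real patch folds these checks into the existing
      idling and cfq_may_dispatch() logic):
      
      #include <stdbool.h>
      
      struct cfq_queue {
              bool sync;        /* sync queue vs. async (buffered WRITE) queue */
              int  dispatched;  /* requests currently dispatched to the disk   */
      };
      
      struct cfq_data {
              int cfq_quantum;  /* per-queue dispatch quantum tunable          */
              int busy_queues;  /* queues that currently have pending requests */
      };
      
      /* First change: never arm the idle timer on an async queue. */
      static bool should_idle(const struct cfq_queue *cfqq)
      {
              return cfqq->sync;  /* async queues get no idling at all */
      }
      
      /*
       * Second change: async queues are no longer pinned to a depth of 1. A
       * lone busy queue may drive an effectively unlimited depth; otherwise
       * the depth is bounded by cfq_quantum (the other existing constraints,
       * e.g. the last delayed sync request, are not shown here).
       */
      static bool may_dispatch(const struct cfq_data *cfqd,
                               const struct cfq_queue *cfqq)
      {
              if (cfqd->busy_queues == 1)
                      return true;
              return cfqq->dispatched < cfqd->cfq_quantum;
      }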
      
      Shaohua Li had reported getting more out of his SSD. For me, I have one
      LUN exported from an HP EVA, and when only pure buffered writes are
      running I can get more out of the system. Following are test results of
      pure buffered writes (with end_fsync=1) with the vanilla and patched
      kernels. These results are the average of 3 sets of runs with an
      increasing number of threads.
      
      AVERAGE[bufwfs][vanilla]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      bufwfs    3   1   0              0              95349          474141
      bufwfs    3   2   0              0              100282         806926
      bufwfs    3   4   0              0              109989         2.7301e+06
      bufwfs    3   8   0              0              116642         3762231
      bufwfs    3   16  0              0              118230         6902970
      
      AVERAGE[bufwfs] [patched kernel]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      bufwfs    3   1   0              0              270722         404352
      bufwfs    3   2   0              0              206770         1.06552e+06
      bufwfs    3   4   0              0              195277         1.62283e+06
      bufwfs    3   8   0              0              260960         2.62979e+06
      bufwfs    3   16  0              0              299260         1.70731e+06
      
      I also ran buffered writes along with some sequential reads and some
      buffered reads on a SATA disk, because the potential risk is that driving
      a higher WRITE queue depth in the presence of sync IO could push up the
      max clat, which we want to keep low.
      
      With some random and sequential reads running on one SATA disk, I did
      not see any significant increase in max clat, so it looks like the other
      WRITE queue depth control logic is doing its job. Here are the results.
      
      AVERAGE[brr, bsr, bufw together] [vanilla]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      brr       3   1   850            546345         0              0
      bsr       3   1   14650          729543         0              0
      bufw      3   1   0              0              23908          8274517
      
      brr       3   2   981.333        579395         0              0
      bsr       3   2   14149.7        1175689        0              0
      bufw      3   2   0              0              21921          1.28108e+07
      
      brr       3   4   898.333        1.75527e+06    0              0
      bsr       3   4   12230.7        1.40072e+06    0              0
      bufw      3   4   0              0              19722.3        2.4901e+07
      
      brr       3   8   900            3160594        0              0
      bsr       3   8   9282.33        1.91314e+06    0              0
      bufw      3   8   0              0              18789.3        23890622
      
      AVERAGE[brr, bsr, bufw mixed] [patched kernel]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      brr       3   1   837            417973         0              0
      bsr       3   1   14357.7        591275         0              0
      bufw      3   1   0              0              24869.7        8910662
      
      brr       3   2   1038.33        543434         0              0
      bsr       3   2   13351.3        1205858        0              0
      bufw      3   2   0              0              18626.3        13280370
      
      brr       3   4   913            1.86861e+06    0              0
      bsr       3   4   12652.3        1430974        0              0
      bufw      3   4   0              0              15343.3        2.81305e+07
      
      brr       3   8   890            2.92695e+06    0              0
      bsr       3   8   9635.33        1.90244e+06    0              0
      bufw      3   8   0              0              17200.3        24424392
      
      So it looks like it might make sense to include this patch.
      
      Thanks
      Vivek
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  8. 01 Feb, 2010 1 commit
    • blk-cgroup: Fix potential deadlock in blk-cgroup · bcf4dd43
      Committed by Gui Jianfeng
      I triggered the following lockdep warning.
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.33-rc2 #1
      -------------------------------------------------------
      test_io_control/7357 is trying to acquire lock:
       (blkio_list_lock){+.+...}, at: [<c053a990>] blkiocg_weight_write+0x82/0x9e
      
      but task is already holding lock:
       (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&(&blkcg->lock)->rlock){......}:
             [<c04583b7>] validate_chain+0x8bc/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a
             [<c053a4e1>] blkiocg_add_blkio_group+0x1a/0x6d
             [<c053cac7>] cfq_get_queue+0x225/0x3de
             [<c053eec2>] cfq_set_request+0x217/0x42d
             [<c052c8a6>] elv_set_request+0x17/0x26
             [<c0532a0f>] get_request+0x203/0x2c5
             [<c0532ae9>] get_request_wait+0x18/0x10e
             [<c0533470>] __make_request+0x2ba/0x375
             [<c0531985>] generic_make_request+0x28d/0x30f
             [<c0532da7>] submit_bio+0x8a/0x8f
             [<c04d827a>] submit_bh+0xf0/0x10f
             [<c04d91d2>] ll_rw_block+0xc0/0xf9
             [<f86e9705>] ext3_find_entry+0x319/0x544 [ext3]
             [<f86eae58>] ext3_lookup+0x2c/0xb9 [ext3]
             [<c04c3e1b>] do_lookup+0xd3/0x172
             [<c04c56c8>] link_path_walk+0x5fb/0x95c
             [<c04c5a65>] path_walk+0x3c/0x81
             [<c04c5b63>] do_path_lookup+0x21/0x8a
             [<c04c66cc>] do_filp_open+0xf0/0x978
             [<c04c0c7e>] open_exec+0x1b/0xb7
             [<c04c1436>] do_execve+0xbb/0x266
             [<c04081a9>] sys_execve+0x24/0x4a
             [<c04028a2>] ptregs_execve+0x12/0x18
      
      -> #1 (&(&q->__queue_lock)->rlock){..-.-.}:
             [<c04583b7>] validate_chain+0x8bc/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a
             [<c053dd2a>] cfq_unlink_blkio_group+0x17/0x41
             [<c053a6eb>] blkiocg_destroy+0x72/0xc7
             [<c0467df0>] cgroup_diput+0x4a/0xb2
             [<c04ca473>] dentry_iput+0x93/0xb7
             [<c04ca4b3>] d_kill+0x1c/0x36
             [<c04cb5c5>] dput+0xf5/0xfe
             [<c04c6084>] do_rmdir+0x95/0xbe
             [<c04c60ec>] sys_rmdir+0x10/0x12
             [<c04027cc>] sysenter_do_call+0x12/0x32
      
      -> #0 (blkio_list_lock){+.+...}:
             [<c0458117>] validate_chain+0x61c/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c06929fd>] _raw_spin_lock+0x1e/0x4e
             [<c053a990>] blkiocg_weight_write+0x82/0x9e
             [<c0467f1e>] cgroup_file_write+0xc6/0x1c0
             [<c04bd2f3>] vfs_write+0x8c/0x116
             [<c04bd7c6>] sys_write+0x3b/0x60
             [<c04027cc>] sysenter_do_call+0x12/0x32
      
      other info that might help us debug this:
      
      1 lock held by test_io_control/7357:
       #0:  (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e
      stack backtrace:
      Pid: 7357, comm: test_io_control Not tainted 2.6.33-rc2 #1
      Call Trace:
       [<c045754f>] print_circular_bug+0x91/0x9d
       [<c0458117>] validate_chain+0x61c/0xb9c
       [<c0458dba>] __lock_acquire+0x723/0x789
       [<c0458eb0>] lock_acquire+0x90/0xa7
       [<c053a990>] ? blkiocg_weight_write+0x82/0x9e
       [<c06929fd>] _raw_spin_lock+0x1e/0x4e
       [<c053a990>] ? blkiocg_weight_write+0x82/0x9e
       [<c053a990>] blkiocg_weight_write+0x82/0x9e
       [<c0467f1e>] cgroup_file_write+0xc6/0x1c0
       [<c0454df5>] ? trace_hardirqs_off+0xb/0xd
       [<c044d93a>] ? cpu_clock+0x2e/0x44
       [<c050e6ec>] ? security_file_permission+0xf/0x11
       [<c04bcdda>] ? rw_verify_area+0x8a/0xad
       [<c0467e58>] ? cgroup_file_write+0x0/0x1c0
       [<c04bd2f3>] vfs_write+0x8c/0x116
       [<c04bd7c6>] sys_write+0x3b/0x60
       [<c04027cc>] sysenter_do_call+0x12/0x32
      
      To prevent deadlock, we should take the locks in the following order:
      
      blkio_list_lock -> queue_lock -> blkcg_lock.
      
      The following patch should fix this bug.
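      
      Below is a minimal user-space sketch of that ordering rule (the lock
      names follow the commit message; the two paths and everything else are
      illustrative, not the actual blk-cgroup code). As long as every path
      that needs more than one of these locks takes them in the same global
      order, the circular dependency reported above cannot form.
      
      #include <pthread.h>
      
      /* Global order: blkio_list_lock -> queue_lock -> blkcg_lock. */
      static pthread_mutex_t blkio_list_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t queue_lock      = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t blkcg_lock      = PTHREAD_MUTEX_INITIALIZER;
      
      /* A weight-write style path: list lock first, then the blkcg lock. */
      static void weight_write_path(void)
      {
              pthread_mutex_lock(&blkio_list_lock);
              pthread_mutex_lock(&blkcg_lock);
              /* ... update the weight for every registered policy ... */
              pthread_mutex_unlock(&blkcg_lock);
              pthread_mutex_unlock(&blkio_list_lock);
      }
      
      /* A group-destroy style path: same global order, so no ABBA cycle. */
      static void destroy_path(void)
      {
              pthread_mutex_lock(&blkio_list_lock);
              pthread_mutex_lock(&queue_lock);
              pthread_mutex_lock(&blkcg_lock);
              /* ... unlink the group from the policy and the cgroup ... */
              pthread_mutex_unlock(&blkcg_lock);
              pthread_mutex_unlock(&queue_lock);
              pthread_mutex_unlock(&blkio_list_lock);
      }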
      Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  9. 29 Jan, 2010 1 commit
    • block: Added in stricter no merge semantics for block I/O · 488991e2
      Committed by Alan D. Brunelle
      Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_
      merges at all are to be attempted (not even the simple one-hit cache).
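      
      A hedged C sketch of such a gate (the enum and helper names are made up
      for illustration; they only encode the three levels described above, not
      the kernel's actual flags):
      
      #include <stdbool.h>
      
      /* Levels of the nomerges tunable. */
      enum nomerges_level {
              NOMERGES_OFF     = 0,  /* full merge logic, incl. one-hit cache */
              NOMERGES_COMPLEX = 1,  /* skip costly lookups, keep the cache   */
              NOMERGES_ALL     = 2,  /* new: attempt no merges at all         */
      };
      
      /* May we consult even the simple one-hit merge cache? */
      static bool may_use_merge_cache(enum nomerges_level level)
      {
              return level < NOMERGES_ALL;
      }
      
      /* May we run the full merge lookups? */
      static bool may_run_full_merge(enum nomerges_level level)
      {
              return level == NOMERGES_OFF;
      }
      
      The tunable itself is exposed per device as
      /sys/block/<dev>/queue/nomerges, so writing 2 there selects the new
      behaviour.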
      
      The following table illustrates the additional benefit: 5-minute runs of
      a random I/O load were applied to a dozen devices on a 16-way x86_64
      system.
      
      nomerges        Throughput      %System         Improvement (tput / %sys)
      --------        ------------    -----------     -------------------------
      0               12.45 MB/sec    0.669365609
      1               12.50 MB/sec    0.641519199     0.40% / 2.71%
      2               12.52 MB/sec    0.639849750     0.56% / 2.96%
      Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  10. 11 Jan, 2010 6 commits
  11. 29 Dec, 2009 2 commits
  12. 28 Dec, 2009 1 commit
  13. 21 Dec, 2009 1 commit
  14. 18 Dec, 2009 3 commits
    • cfq-iosched: Remove prio_change logic for workload selection · 65b32a57
      Committed by Vivek Goyal
      o CFQ now internally divides cfq queues into three workload categories:
        sync-idle, sync-noidle and async. Which workload to run depends
        primarily on the rb_key offset across the three service trees, which is
        a combination of multiple things, including the time at which a queue
        got queued on the service tree.
      
        There is one exception though: if we switched the prio class, say we
        served some RT tasks and again started serving the BE class, then
        within the BE class we always started with the sync-noidle workload,
        irrespective of the rb_key offset in the service trees.

        This can provide better latencies for the sync-noidle workload in the
        presence of RT tasks.
      
      o This patch gets rid of that exception, and which workload to run within
        a class now always depends on the lowest rb_key across the service
        trees. The reason is that we now have multiple BE class groups, and if
        we always switch to the sync-noidle workload within a group, we can
        potentially starve a sync-idle workload within that group. The same is
        true for the async workload, which will be in the root group. Also,
        workload switching within a group would become very unpredictable, as
        it would depend on whether some RT workload was running in the system
        or not.
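      
      A compact standalone sketch of that selection rule (the types and names
      are illustrative, not the real cfq-iosched structures; the point is only
      that the lowest rb_key wins, with no special case for sync-noidle):
      
      enum wl_type { SYNC_IDLE, SYNC_NOIDLE, ASYNC, WL_COUNT };
      
      struct service_tree {
              unsigned long lowest_key;  /* rb_key of the leftmost queue */
              int           count;       /* queues on this tree          */
      };
      
      /* Pick the workload whose service tree holds the lowest rb_key. */
      static enum wl_type choose_workload(const struct service_tree st[WL_COUNT])
      {
              enum wl_type best = WL_COUNT;
              unsigned long best_key = 0;
              int t;
      
              for (t = 0; t < WL_COUNT; t++) {
                      if (!st[t].count)
                              continue;
                      if (best == WL_COUNT || st[t].lowest_key < best_key) {
                              best = t;
                              best_key = st[t].lowest_key;
                      }
              }
              return best;  /* WL_COUNT means nothing queued in this class */
      }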
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • cfq-iosched: Get rid of nr_groups · fb104db4
      Committed by Vivek Goyal
      o The code does not currently seem to be using cfqd->nr_groups. Get rid
        of it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • cfq-iosched: Remove the check for same cfq group from allow_merge · 1db32c40
      Committed by Vivek Goyal
      o allow_merge() already checks whether the submitting task points to the
        same cfqq the rq has been queued in. If everything is fine, we should
        not have a task in one cgroup holding a pointer to a cfqq in another
        cgroup.

        I guess in some situations it can happen, though, namely when a random
        IO queue has been moved into the root cgroup because group_isolation=0.
        In this case the task's cgroup/group is different from where the cfqq
        actually is, but this is intentional, and in this case merging should
        be allowed.
      
        The second situation is where, due to the close cooperator patches,
        multiple processes can be sharing a cfqq. If everything is implemented
        right, we should not end up in a situation where tasks from different
        processes in different groups share the same cfqq, as we allow merging
        of cooperating queues only if they are in the same group.
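      
      A minimal sketch of the remaining check (the structure names here are
      placeholders; the real code resolves the submitting task's cfqq from its
      io context and compares it against the one attached to the request):
      
      #include <stdbool.h>
      
      struct cfq_queue;                    /* opaque for this sketch */
      
      struct request {
              struct cfq_queue *cfqq;      /* queue the rq has been queued on     */
      };
      
      struct io_context_ref {
              struct cfq_queue *cfqq;      /* queue the submitting task points to */
      };
      
      /* Merging is allowed only when both refer to the very same cfqq. */
      static bool allow_merge(const struct io_context_ref *ic,
                              const struct request *rq)
      {
              return ic->cfqq == rq->cfqq;
      }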
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  15. 16 Dec, 2009 1 commit
    • block: temporarily disable discard granularity · b568be62
      Committed by Jens Axboe
      Commit 86b37281 adds a check for
      misaligned stacking offsets, but it's buggy since the defaults are 0.
      Hence all dm devices that pass in a non-zero starting offset will
      be marked as misaligned and dm will complain.
      
      A real fix is coming; in the meantime, disable the discard granularity
      check so that users don't have to worry about dm reporting misaligned
      devices.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  16. 15 Dec, 2009 1 commit
  17. 11 Dec, 2009 1 commit
    • Fix a CFQ crash in "for-2.6.33" branch of block tree · 82bbbf28
      Committed by Vivek Goyal
      I think my previous patch introduced a bug which can lead to CFQ hitting
      BUG_ON().
      
      The offending commit in the for-2.6.33 branch is:
      
      commit 7667aa06
      Author: Vivek Goyal <vgoyal@redhat.com>
      Date:   Tue Dec 8 17:52:58 2009 -0500
      
          cfq-iosched: Take care of corner cases of group losing share due to deletion
      
      While doing some stress testing on my box, I encountered the following.
      
      login: [ 3165.148841] BUG: scheduling while
      atomic: swapper/0/0x10000100
      [ 3165.149821] Modules linked in: cfq_iosched dm_multipath qla2xxx igb
      scsi_transport_fc dm_snapshot [last unloaded: scsi_wait_scan]
      [ 3165.149821] Pid: 0, comm: swapper Not tainted
      2.6.32-block-for-33-merged-new #3
      [ 3165.149821] Call Trace:
      [ 3165.149821]  <IRQ>  [<ffffffff8103fab8>] __schedule_bug+0x5c/0x60
      [ 3165.149821]  [<ffffffff8103afd7>] ? __wake_up+0x44/0x4d
      [ 3165.149821]  [<ffffffff8153a979>] schedule+0xe3/0x7bc
      [ 3165.149821]  [<ffffffff8103a796>] ? cpumask_next+0x1d/0x1f
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff810422d8>] __cond_resched+0x2a/0x35
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff8153b1ee>] _cond_resched+0x2c/0x37
      [ 3165.149821]  [<ffffffff8100e2db>] is_valid_bugaddr+0x16/0x2f
      [ 3165.149821]  [<ffffffff811e4161>] report_bug+0x18/0xac
      [ 3165.149821]  [<ffffffff8100f1fc>] die+0x39/0x63
      [ 3165.149821]  [<ffffffff8153cde1>] do_trap+0x11a/0x129
      [ 3165.149821]  [<ffffffff8100d470>] do_invalid_op+0x96/0x9f
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff81034b4d>] ? enqueue_task+0x5c/0x67
      [ 3165.149821]  [<ffffffff8103ae83>] ? task_rq_unlock+0x11/0x13
      [ 3165.149821]  [<ffffffff81041aae>] ? try_to_wake_up+0x292/0x2a4
      [ 3165.149821]  [<ffffffff8100c935>] invalid_op+0x15/0x20
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
      [ 3165.149821]  [<ffffffff811d8c2a>] blk_peek_request+0x191/0x1a7
      [ 3165.149821]  [<ffffffff811e5b8d>] ? kobject_get+0x1a/0x21
      [ 3165.149821]  [<ffffffff812c8d4c>] scsi_request_fn+0x82/0x3df
      [ 3165.149821]  [<ffffffff8110b2de>] ? bio_fs_destructor+0x15/0x17
      [ 3165.149821]  [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
      [ 3165.149821]  [<ffffffff811d931f>] __blk_run_queue+0x42/0x71
      [ 3165.149821]  [<ffffffff811d9403>] blk_run_queue+0x26/0x3a
      [ 3165.149821]  [<ffffffff812c8761>] scsi_run_queue+0x2de/0x375
      [ 3165.149821]  [<ffffffff812b60ac>] ? put_device+0x17/0x19
      [ 3165.149821]  [<ffffffff812c92d7>] scsi_next_command+0x3b/0x4b
      [ 3165.149821]  [<ffffffff812c9b9f>] scsi_io_completion+0x1c9/0x3f5
      [ 3165.149821]  [<ffffffff812c3c36>] scsi_finish_command+0xb5/0xbe
      
      I think I have hit the following BUG_ON() in cfq_dispatch_request().
      
      BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
      
      Please find attached the patch to fix it. I have done some stress testing
      with it and have not seen it happening again.
      
      o We should wait on a queue even after slice expiry only if it is empty.
        If the queue is not empty, then continue to expire it.

      o If we decide to keep the queue, then set cfqq=NULL. Otherwise
        select_queue() will return a valid cfqq and cfq_dispatch_request() can
        hit the following BUG_ON():
      
        BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list))
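      
      A simplified standalone sketch of the fixed decision (the names are
      illustrative; in the real patch this logic sits in the select_queue()
      path):
      
      #include <stdbool.h>
      #include <stddef.h>
      
      struct cfq_queue {
              int queued;  /* requests still on the queue's sort_list */
      };
      
      /* Stand-ins for the real expiry/selection machinery. */
      static void cfq_slice_expired(struct cfq_queue *cfqq) { (void)cfqq; }
      static struct cfq_queue *pick_next_queue(void) { return NULL; }
      
      /*
       * Wait on an expired queue only if it is empty; and if we do keep it,
       * return NULL so the dispatch path never sees a queue whose sort_list
       * is empty and never trips the BUG_ON() above.
       */
      static struct cfq_queue *handle_expired_slice(struct cfq_queue *cfqq,
                                                    bool want_wait_busy)
      {
              if (cfqq->queued > 0) {
                      cfq_slice_expired(cfqq);  /* not empty: keep expiring it */
                      return pick_next_queue();
              }
              if (want_wait_busy)
                      return NULL;              /* keep the queue, dispatch nothing */
              cfq_slice_expired(cfqq);
              return pick_next_queue();
      }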
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  18. 10 Dec, 2009 2 commits
  19. 09 Dec, 2009 2 commits
    • cfq-iosched: Take care of corner cases of group losing share due to deletion · 7667aa06
      Committed by Vivek Goyal
      If there is a sequential reader running in a group, we wait for the next
      request to come in for that group after slice expiry, and once the new
      request is in, we expire the queue. Otherwise we delete the group from
      the service tree and the group loses its fair share.
      
      So far I was marking a queue as wait_busy if it had consumed its slice
      and it was the last queue in the group. But this condition did not cover
      the following two cases.
      
      1. A request completed and the slice has not expired yet. The next
         request comes in and is dispatched to disk. Now select_queue() hits
         and the slice has expired, so this group will be deleted. Because the
         request is still on the disk, this queue never gets a chance to
         wait_busy.

      2. A request completed and the slice has not expired yet. Before the
         next request comes in (delay due to think time), select_queue() hits
         and expires the queue, and hence the group. This queue never got a
         chance to wait busy.
      
      Gui was hitting boundary condition 1 and not getting fairness numbers
      proportional to weight.

      This patch adds checks for the above two conditions and improves the
      fairness numbers for sequential workloads on rotational media. The check
      in select_queue() takes care of case 1, and an additional check in
      should_wait_busy() takes care of case 2.
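      
      As a rough sketch, a wait_busy test covering the two cases above might
      look like this (the fields and exact conditions are illustrative; the
      real patch splits the checks between select_queue() and
      should_wait_busy()):
      
      #include <stdbool.h>
      
      struct cfq_queue {
              bool last_in_group;      /* only busy queue left in its group      */
              bool slice_used;         /* queue has consumed its time slice      */
              bool slice_almost_done;  /* slice about to expire (think time)     */
              int  in_flight;          /* dispatched but not completed requests  */
              int  queued;             /* requests still queued on the cfqq      */
      };
      
      static bool should_wait_busy(const struct cfq_queue *cfqq)
      {
              if (!cfqq->last_in_group || cfqq->queued > 0)
                      return false;
              if (cfqq->slice_used)
                      return true;   /* the original condition                    */
              if (cfqq->in_flight > 0)
                      return true;   /* case 1: a request is still on the disk    */
              if (cfqq->slice_almost_done)
                      return true;   /* case 2: about to expire during think time */
              return false;
      }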
      Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • cfq-iosched: Get rid of cfqq wait_busy_done flag · c244bb50
      Committed by Vivek Goyal
      o Get rid of the wait_busy_done flag. This flag only tells us that we
        were doing wait-busy on a queue and that the queue got a request, so
        expire it. That information can easily be obtained from
        (cfq_cfqq_wait_busy() && queue_is_not_empty). So remove this flag and
        keep the code simple.
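      
      In other words (a trivial illustrative helper, not the kernel's code):
      
      #include <stdbool.h>
      
      struct cfq_queue {
              bool wait_busy;  /* were we wait-busy on this queue? */
              int  queued;     /* requests now queued on it        */
      };
      
      /* The removed flag, derived on the fly instead of being stored. */
      static bool wait_busy_done(const struct cfq_queue *cfqq)
      {
              return cfqq->wait_busy && cfqq->queued > 0;
      }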
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>