1. 07 March 2012, 2 commits
    • blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Committed by Tejun Heo
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that apply to them, as an interim step, all per-policy blkg
      data will be shot down.
      
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      
      * blk-throtl doesn't shoot down root_tg.  This is to ease transition.
        Unified blkg will always have persistent root group and not shooting
        down root_tg for now eases transition to that point by avoiding
        having to update td->root_tg and is safe as blk-throtl can never be
        disabled.
      
      -v2: Vivek pointed out that group list is not guaranteed to be empty
           on return from clear function if it raced cgroup removal and
           lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in limbo state between blkcg and request_queue locks.
      
           blk-throtl no longer shoots down root_tg to avoid breaking
           td->root_tg.
      
           Also, nest queue_lock inside blkio_list_lock, not the other way
           around, to avoid introducing a possible deadlock via the blkcg lock.
      
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
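
      A sketch of the clear-and-retry loop described in the -v2/-v3 notes
      (illustrative, not the literal patch; the policy callback and list
      names are approximate):

      static void blkg_destroy_all(struct request_queue *q)
      {
          struct blkio_policy_type *pol;

          while (true) {
              bool done = true;

              /* queue_lock nests inside blkio_list_lock (see the -v2 note) */
              spin_lock(&blkio_list_lock);
              spin_lock_irq(q->queue_lock);

              /* ask each registered policy to drop its per-queue groups;
               * a policy reports failure if it raced cgroup removal and lost */
              list_for_each_entry(pol, &blkio_list, list)
                  if (!pol->ops.blkio_clear_queue_fn(q))
                      done = false;

              spin_unlock_irq(q->queue_lock);
              spin_unlock(&blkio_list_lock);

              if (done)
                  break;

              /* kludge: wait a bit and retry until the racing removal finishes */
              msleep(10);
          }
      }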
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      72e06c25
    • blkcg: make CONFIG_BLK_CGROUP bool · 32e380ae
      Committed by Tejun Heo
      Block cgroup core can be built as a module; however, it isn't too useful
      as blk-throttle can only be built-in and cfq-iosched is usually the
      default built-in scheduler.  Scheduled blkcg cleanup requires calling
      into blkcg from block core.  To simplify that, disallow building blkcg
      as a module by making CONFIG_BLK_CGROUP bool.
      
      If building blkcg core as a module really matters, which I doubt, we can
      revisit it after blkcg API cleanup.
      
      -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
           depend on BLK_CGROUP.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      32e380ae
  2. 07 February 2012, 1 commit
    • block: strip out locking optimization in put_io_context() · 11a3122f
      Committed by Tejun Heo
      put_io_context() performed a complex trylock dance to avoid
      deferring ioc release to a workqueue.  It was also broken on UP because
      trylock was always assumed to succeed, which resulted in an unbalanced
      preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If workloads turn out to be affected by
      this change, the simpler optimization from the discussion thread can be
      applied later.
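
      A rough sketch of what put_io_context() reduces to once the trylock
      optimization is stripped (illustrative, not the literal patch; field
      names are approximate and the direct-free path for an already-empty
      list is omitted):

      void put_io_context(struct io_context *ioc)
      {
          unsigned long flags;

          if (!ioc)
              return;

          if (!atomic_long_dec_and_test(&ioc->refcount))
              return;

          /*
           * Releasing the ioc needs reverse-order double locking and the
           * caller may already hold a queue_lock, so defer the actual
           * release to the workqueue instead of any trylock dancing.
           */
          spin_lock_irqsave(&ioc->lock, flags);
          if (!hlist_empty(&ioc->icq_list))
              schedule_work(&ioc->release_work);
          spin_unlock_irqrestore(&ioc->lock, flags);
      }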
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      11a3122f
  3. 14 December 2011, 3 commits
    • block, cfq: unlink cfq_io_context's immediately · b2efa052
      Committed by Tejun Heo
      cic is the association between io_context and request_queue.  A cic is
      linked from both ioc and q and should be destroyed when either one
      goes away.  As ioc and q both have their own locks, locking becomes a
      bit complex - both orders work for removal from one but not from the
      other.
      
      Currently, cfq tries to circumvent this locking order issue with RCU.
      ioc->lock nests inside queue_lock but the radix tree and cic's are
      also protected by RCU, allowing either side to walk its lists without
      grabbing a lock.
      
      This rather unconventional use of RCU quickly devolves into an extremely
      fragile convolution.  For example, the following oops is from cfqd going
      away too soon after ioc and q exits raced.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       [   88.503444]
       Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
       RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
       ...
       Call Trace:
        [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
        [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
        [<ffffffff81389130>] exit_io_context+0x100/0x140
        [<ffffffff81098a29>] do_exit+0x579/0x850
        [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
        [<ffffffff81098de7>] sys_exit_group+0x17/0x20
        [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
      
      The only real hot path here is cic lookup during request
      initialization and avoiding extra locking requires very confined use
      of RCU.  This patch makes cic removal from both ioc and request_queue
      perform double-locking and unlink immediately.
      
      * From q side, the change is almost trivial as ioc->lock nests inside
        queue_lock.  It just needs to grab each ioc->lock as it walks
        cic_list and unlink it.
      
      * From ioc side, it's a bit more difficult because of the inverted lock
        order.  ioc needs its lock to walk its cic_list but can't grab the
        matching queue_lock and needs to perform an unlock-relock dance.
      
        Unlinking is now wholly done from put_io_context() and the fast path is
        optimized by using the queue_lock the caller already holds, which is
        by far the most common case.  If the ioc accessed multiple devices,
        it tries trylock.  In the unlikely case of fast path failure, it
        falls back to the full double-locking dance from a workqueue.
      
      Double-locking isn't the prettiest thing in the world but it's *far*
      simpler and more understandable than the RCU trick without adding any
      meaningful overhead.
      
      This still leaves a lot of now unnecessary RCU logic.  Future patches
      will trim them.
      
      -v2: Vivek pointed out that cic->q was being dereferenced after
           cic->release() was called.  Updated to use local variable @this_q
           instead.
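
      The ioc-side dance can be pictured with a simplified sketch (illustrative,
      not the literal patch; unlink_cic() is a hypothetical stand-in for the
      real unlink helper, and irq handling plus revalidation after relocking
      are omitted):

      /* called with ioc->lock held; may temporarily drop and retake it */
      static void cic_unlink_from_ioc(struct io_context *ioc,
                                      struct cfq_io_context *cic)
      {
          struct request_queue *this_q = cic->q;

          if (spin_trylock(this_q->queue_lock)) {
              /* fast path: got the outer queue_lock without dropping ioc->lock */
              unlink_cic(cic);
              spin_unlock(this_q->queue_lock);
              return;
          }

          /* slow path: drop the inner lock and retake both in the proper order */
          spin_unlock(&ioc->lock);
          spin_lock(this_q->queue_lock);
          spin_lock(&ioc->lock);
          unlink_cic(cic);
          spin_unlock(&ioc->lock);
          spin_unlock(this_q->queue_lock);
          spin_lock(&ioc->lock);   /* retaken for the caller's next list iteration */
      }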
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b2efa052
    • block, cfq: move ioc ioprio/cgroup changed handling to cic · dc86900e
      Committed by Tejun Heo
      ioprio/cgroup change was handled by marking the changed state in ioc
      and, on the following access to the ioc, performing RCU-protected
      iteration through all cic's grabbing the matching queue_lock.
      
      This patch moves the changed state to each cic.  When ioprio or cgroup
      changes, the respective bit is set on all cic's of the ioc and when
      each of those cic's (not the ioc) is accessed, the change is applied for
      that specific ioc-queue pair.
      
      This also fixes the following two race conditions between setting and
      clearing of changed states.
      
      * A missing barrier between assign/load of ioprio and ioprio_changed
        allowed applying an old ioprio.
      
      * Change requests could happen between application of a change and
        clearing of the changed variables.
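
      In sketch form (bit and helper names are illustrative, not the exact
      identifiers from the patch):

      enum {
          CIC_IOPRIO_CHANGED,
          CIC_CGROUP_CHANGED,
      };

      /* called for every cic of the ioc when its ioprio or cgroup changes */
      static void cic_set_changed(struct cfq_io_context *cic, int bit)
      {
          set_bit(bit, &cic->changed);
      }

      /* called when this specific ioc-queue pair is next accessed */
      static void cic_apply_changed(struct cfq_io_context *cic)
      {
          if (test_and_clear_bit(CIC_IOPRIO_CHANGED, &cic->changed))
              changed_ioprio(cic);    /* hypothetical per-pair handler */
          if (test_and_clear_bit(CIC_CGROUP_CHANGED, &cic->changed))
              changed_cgroup(cic);    /* hypothetical per-pair handler */
      }

      With an atomic test_and_clear_bit() per pair, a change request arriving
      after the clear simply sets the bit again, which addresses the second
      race above.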
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dc86900e
    • block: make ioc get/put interface more conventional and fix race on alloction · 6e736be7
      Committed by Tejun Heo
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from the local task while the latter can be called from a
      different task.  The synchronization between them is peculiar and
      dubious.
      
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
        assignment.
      
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  I.e., if ioc itself were being
        dereferenced as "ioc->xxx", it would mean something (though it's not
        clear what), but as the code currently stands, the dependent read
        barrier is a noop.
      
      As only one of the two test-assignment sequences is task_lock()
      protected, the task_lock() can't do much about the race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() from each
      allocating its own ioc for the same task and overwriting the other's.
      
      Also, set_task_ioprio() can race with an exiting task and create a new
      ioc after exit_io_context() is finished.
      
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub-structures without impacting anything.
      
      This patch updates ioc get/put so that it becomes more conventional.
      
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to ioc of another task.
        On return, the caller has an explicit reference to the object which
        should be put using put_io_context() afterwards.
      
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
        patch.
      
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
        specification.
      
      -v2: Vivek spotted contamination from debug patch.  Removed.
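
      A rough sketch of the resulting get path (error handling and the
      allocation slow path are compressed; create_task_io_context() is a
      stand-in name):

      struct io_context *get_task_io_context(struct task_struct *task,
                                             gfp_t gfp_flags, int node)
      {
          struct io_context *ioc;

          task_lock(task);
          ioc = task->io_context;
          if (ioc) {
              get_io_context(ioc);    /* caller now owns an explicit reference */
              task_unlock(task);
              return ioc;
          }
          task_unlock(task);

          /*
           * Slow path: allocate and install under task_lock(); creation is
           * refused once PF_EXITING is set, so after exit_io_context() both
           * acquisition paths return NULL.
           */
          return create_task_io_context(task, gfp_flags, node);  /* stand-in */
      }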
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6e736be7
  4. 13 December 2011, 1 commit
    • cgroup: don't use subsys->can_attach_task() or ->attach_task() · bb9d97b6
      Committed by Tejun Heo
      Now that subsys->can_attach() and attach() take @tset instead of
      @task, they can handle per-task operations.  Convert
      ->can_attach_task() and ->attach_task() users to use ->can_attach()
      and attach() instead.  Most conversions are straightforward.
      Noteworthy changes are:
      
      * In cgroup_freezer, remove unnecessary NULL assignments to unused
        methods.  It's useless and very prone to get out of sync, which
        already happened.
      
      * In cpuset, PF_THREAD_BOUND test is checked for each task.  This
        doesn't make any practical difference but is conceptually cleaner.
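
      Schematically, a converted subsystem now iterates the taskset inside
      ->can_attach() itself (prototypes are approximate for this era of the
      cgroup API; allowed_to_move() is a hypothetical per-task check):

      static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                    struct cgroup_taskset *tset)
      {
          struct task_struct *task;

          for (task = cgroup_taskset_first(tset); task;
               task = cgroup_taskset_next(tset)) {
              if (!allowed_to_move(task, cgrp))   /* hypothetical check */
                  return -EINVAL;
          }
          return 0;
      }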
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: James Morris <jmorris@namei.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      bb9d97b6
  5. 25 October 2011, 2 commits
  6. 19 October 2011, 1 commit
  7. 21 September 2011, 1 commit
  8. 27 May 2011, 1 commit
    • cgroups: add per-thread subsystem callbacks · f780bdb7
      Committed by Ben Blum
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups's subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, as it is replaced
      by this.  All subsystems are modified for the new interface - of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
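
      Schematic of the new hooks as they sit in the subsystem interface
      (surrounding members and the whole-group can_attach()/attach() are
      elided; prototypes approximate):

      struct cgroup_subsys {
          /* existing members and whole-group can_attach()/attach() elided */
          int  (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
          void (*pre_attach)(struct cgroup *cgrp);
          void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
      };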
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
  9. 23 May 2011, 1 commit
  10. 21 May 2011, 4 commits
  11. 16 May 2011, 1 commit
    • blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup · 70087dc3
      Committed by Vivek Goyal
      Currently we first map the task to the cgroup and then the cgroup to the
      blkio_cgroup. There is a more direct way to get to the blkio_cgroup
      from the task using task_subsys_state(). Use that.
      
      The real reason for the fix is that it also avoids a race in generic
      cgroup code. During remount/umount rebind_subsystems() is called and,
      without waiting for an RCU grace period, it can do the following:
      
      cgrp->subsys[i] = NULL;
      
      That means if somebody got hold of a cgroup under rcu and then tried
      to use cgroup->subsys[] to get to the blkio_cgroup, it would get NULL,
      which is wrong. I was running into this race condition with ltp running
      on an upstream-derived kernel and that led to a crash.
      
      So ideally we should also fix the generic cgroup code to wait for an rcu
      grace period before setting the pointer to NULL. Li Zefan is not very keen
      on introducing synchronize_wait() as he thinks it will slow
      down mount/remount/umount operations.
      
      So for the time being at least fix the kernel crash by taking a more
      direct route to the blkio_cgroup.
      
      One tester had reported a crash while running LTP on a derived kernel;
      with this fix the crash is no longer seen, with the test having been
      running for over 6 days.
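
      The direct lookup amounts to something like the following, essentially a
      container_of() around task_subsys_state() (treat exact names as
      approximate):

      static inline struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
      {
          /* go task -> css directly instead of task -> cgroup -> subsys[] */
          return container_of(task_subsys_state(tsk, blkio_subsys_id),
                              struct blkio_cgroup, css);
      }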
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      70087dc3
  12. 31 March 2011, 1 commit
  13. 23 March 2011, 1 commit
  14. 12 March 2011, 1 commit
  15. 16 November 2010, 1 commit
    • blk-cgroup: Allow creation of hierarchical cgroups · bdc85df7
      Committed by Vivek Goyal
      o Allow hierarchical cgroup creation for blkio controller
      
      o Currently we disallow it as both of the io controller policies (throttling
        as well as proportional bandwidth) do not support hierarchical accounting
        and control. But the flip side is that the blkio controller can not be used
        with libvirt, as libvirt creates a cgroup hierarchy deeper than 1 level.
      
        <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
      
      o So this patch will allow creation of a cgroup hierarchy but at the backend
        everything will be treated as flat. So if somebody created a hierarchy
        as follows:
      
      			root
      			/  \
      		     test1 test2
      			|
      		     test3
      
        CFQ and throttling will practically treat all groups at the same level.
      
      				pivot
      			     /  |   \  \
      			root  test1 test2  test3
      
      o Once we have actual support for hierarchical accounting and control
        then we can introduce another cgroup tunable file "blkio.use_hierarchy"
        which will be 0 by default but if the user wants to enforce hierarchical
        control then it can be set to 1. This way there should not be any
        ABI problems down the line.
      
      o The only not-so-pretty part is the introduction of the extra file
        "use_hierarchy" down the line. Kame-san had mentioned that hierarchical
        accounting is expensive in the memory controller, hence they keep it off
        by default. I suspect the same will be the case for the IO controller,
        as for each IO completion we shall have to account the IO through the
        hierarchy up to the root. If so, then it probably is not a very bad idea
        to introduce this extra file so that it will be used only when somebody
        needs it, and some people might enable hierarchy only in part of the
        hierarchy.
      
      o This is basically how the memory controller also uses "use_hierarchy",
        and they also allowed creation of hierarchies when actual backend support
        was not available.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      bdc85df7
  16. 02 October 2010, 1 commit
  17. 01 October 2010, 3 commits
    • blkio: Recalculate the throttled bio dispatch time upon throttle limit change · fe071437
      Committed by Vivek Goyal
      o Currently any cgroup throttle limit changes are processed asynchronously and
        the change does not take effect till a new bio is dispatched from the same group.
      
      o It might happen that a user sets a ridiculously low limit on throttling.
        Say 1 byte per second on reads. In such cases simple operations like mounting
        a disk can wait for a very long time.
      
      o Once a bio is throttled, there is no easy way to come out of that wait even if
        the user increases the read limit later.
      
      o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
        the bio dispatch time according to the new limits.
      
      o Can't take the queue lock under blkcg_lock, hence after the change I wake
        up the dispatch thread again, which recalculates the time. So there are some
        variables being synchronized across two threads without a lock and I had to
        make use of barriers. Hopefully I have used the barriers correctly. Any review,
        especially of the memory barrier code, will help.
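
      The intended barrier pairing is roughly the following (field and helper
      names are approximate; the real state lives in block/blk-throttle.c):

      /* limit-update side (runs under blkcg_lock, can't take the queue lock) */
      static void throtl_update_limits(struct throtl_grp *tg, u64 new_read_bps)
      {
          tg->bps[READ] = new_read_bps;
          smp_wmb();              /* publish the new limit before the flag */
          tg->limits_changed = true;
          /* caller then wakes the dispatch thread to recompute dispatch times */
      }

      /* dispatch-thread side */
      static void throtl_process_limit_change(struct throtl_grp *tg)
      {
          if (!tg->limits_changed)
              return;
          smp_rmb();              /* see the limit the flag refers to */
          tg->limits_changed = false;
          /* recompute the throttled bio's dispatch time using the new limits */
      }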
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      fe071437
    • blkio: deletion of a cgroup was causes oops · 61014e96
      Committed by Vivek Goyal
      o Now a cgroup's list of blkg elements can contain blkg's from multiple policies.
        Before sending an unlink event, make sure the blkg belongs to that policy. If
        the policy does not own the blkg, do not send an update for this blkg.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      61014e96
    • blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n · 13f98250
      Committed by Vivek Goyal
      Currently throttling-related files are visible even if the user has disabled
      throttling using config options. That switches off background throttling
      of bios but not the cgroup files. This patch fixes it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      13f98250
  18. 16 September 2010, 4 commits
  19. 23 August 2010, 1 commit
  20. 07 May 2010, 1 commit
  21. 03 May 2010, 1 commit
  22. 27 April 2010, 2 commits
    • blk-cgroup: config options re-arrangement · afc24d49
      Committed by Vivek Goyal
      This patch fixes a few usability and configurability issues.
      
      o All the cgroup-based controller options are configurable from the
        "General Setup/Control Group Support/" menu; blkio is the only exception.
        Hence make this option visible in the above menu and make it configurable from
        there to bring it in line with the rest of the cgroup-based controllers.
      
      o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.
      
        This option currently does two things:
      
        - Enables printing of cgroup paths in blktrace
        - Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional stat
          files in the cgroup.
      
        If we are using group scheduling, blktrace data is not of much use
        if cgroup information is not present. To get this data, currently one has to
        also enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings the overhead of
        all the additional debug stat files, which is not desired.
      
        Hence, this patch moves printing of cgroup paths under
        CONFIG_CFQ_GROUP_IOSCHED.
      
        This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
        the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP, which
        can be enabled through the config menu.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Divyesh Shah <dpshah@google.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      afc24d49
    • blkio: Fix another BUG_ON() crash due to cfqq movement across groups · e5ff082e
      Committed by Vivek Goyal
      o Once in a while, I was hitting a BUG_ON() in the blkio code. empty_time was
        assuming that upon slice expiry, the group can't already be marked empty
        (except for forced dispatch).
      
        But this assumption is broken if cfqq can move (group_isolation=0) across
        groups after receiving a request.
      
        I think most likely in this case we got a request in a cfqq and accounted
        the rq in one group; later, while adding the cfqq to the tree, we moved the
        queue to a different group which was already marked empty, and after dispatch
        from the slice we found the group already marked empty and raised the alarm.
      
        This patch does not error out if the group is already marked empty. This can
        introduce some empty_time stat error, but only in the case of group_isolation=0.
        This is better than crashing. In the case of group_isolation=1 we should still
        get the same stats as before this patch.
      
      [  222.308546] ------------[ cut here ]------------
      [  222.309311] kernel BUG at block/blk-cgroup.c:236!
      [  222.309311] invalid opcode: 0000 [#1] SMP
      [  222.309311] last sysfs file: /sys/devices/virtual/block/dm-3/queue/scheduler
      [  222.309311] CPU 1
      [  222.309311] Modules linked in: dm_round_robin dm_multipath qla2xxx scsi_transport_fc dm_zero dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      [  222.309311]
      [  222.309311] Pid: 4780, comm: fio Not tainted 2.6.34-rc4-blkio-config #68 0A98h/HP xw8600 Workstation
      [  222.309311] RIP: 0010:[<ffffffff8121ad88>]  [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83
      [  222.309311] RSP: 0018:ffff8800ba6e79f8  EFLAGS: 00010002
      [  222.309311] RAX: 0000000000000082 RBX: ffff8800a13b7990 RCX: ffff8800a13b7808
      [  222.309311] RDX: 0000000000002121 RSI: 0000000000000082 RDI: ffff8800a13b7a30
      [  222.309311] RBP: ffff8800ba6e7a18 R08: 0000000000000000 R09: 0000000000000001
      [  222.309311] R10: 000000000002f8c8 R11: ffff8800ba6e7ad8 R12: ffff8800a13b78ff
      [  222.309311] R13: ffff8800a13b7990 R14: 0000000000000001 R15: ffff8800a13b7808
      [  222.309311] FS:  00007f3beec476f0(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000
      [  222.309311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  222.309311] CR2: 000000000040e7f0 CR3: 00000000a12d5000 CR4: 00000000000006e0
      [  222.309311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  222.309311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  222.309311] Process fio (pid: 4780, threadinfo ffff8800ba6e6000, task ffff8800b3d6bf00)
      [  222.309311] Stack:
      [  222.309311]  0000000000000001 ffff8800bab17a48 ffff8800bab17a48 ffff8800a13b7800
      [  222.309311] <0> ffff8800ba6e7a68 ffffffff8121da35 ffff880000000001 00ff8800ba5c5698
      [  222.309311] <0> ffff8800ba6e7a68 ffff8800a13b7800 0000000000000000 ffff8800bab17a48
      [  222.309311] Call Trace:
      [  222.309311]  [<ffffffff8121da35>] __cfq_slice_expired+0x2af/0x3ec
      [  222.309311]  [<ffffffff8121fd7b>] cfq_dispatch_requests+0x2c8/0x8e8
      [  222.309311]  [<ffffffff8120f1cd>] ? spin_unlock_irqrestore+0xe/0x10
      [  222.309311]  [<ffffffff8120fb1a>] ? blk_insert_cloned_request+0x70/0x7b
      [  222.309311]  [<ffffffff81210461>] blk_peek_request+0x191/0x1a7
      [  222.309311]  [<ffffffffa0002799>] dm_request_fn+0x38/0x14c [dm_mod]
      [  222.309311]  [<ffffffff810ae61f>] ? sync_page_killable+0x0/0x35
      [  222.309311]  [<ffffffff81210fd4>] __generic_unplug_device+0x32/0x37
      [  222.309311]  [<ffffffff81211274>] generic_unplug_device+0x2e/0x3c
      [  222.309311]  [<ffffffffa00011a6>] dm_unplug_all+0x42/0x5b [dm_mod]
      [  222.309311]  [<ffffffff8120ca37>] blk_unplug+0x29/0x2d
      [  222.309311]  [<ffffffff8120ca4d>] blk_backing_dev_unplug+0x12/0x14
      [  222.309311]  [<ffffffff81109a7a>] block_sync_page+0x35/0x39
      [  222.309311]  [<ffffffff810ae616>] sync_page+0x41/0x4a
      [  222.309311]  [<ffffffff810ae62d>] sync_page_killable+0xe/0x35
      [  222.309311]  [<ffffffff8158aa59>] __wait_on_bit_lock+0x46/0x8f
      [  222.309311]  [<ffffffff810ae4f5>] __lock_page_killable+0x66/0x6d
      [  222.309311]  [<ffffffff81056f9c>] ? wake_bit_function+0x0/0x33
      [  222.309311]  [<ffffffff810ae528>] lock_page_killable+0x2c/0x2e
      [  222.309311]  [<ffffffff810afbc5>] generic_file_aio_read+0x361/0x4f0
      [  222.309311]  [<ffffffff810ea044>] do_sync_read+0xcb/0x108
      [  222.309311]  [<ffffffff811e42f7>] ? security_file_permission+0x16/0x18
      [  222.309311]  [<ffffffff810ea6ab>] vfs_read+0xab/0x108
      [  222.309311]  [<ffffffff810ea7c8>] sys_read+0x4a/0x6e
      [  222.309311]  [<ffffffff81002b5b>] system_call_fastpath+0x16/0x1b
      [  222.309311] Code: 58 01 00 00 00 48 89 c6 75 0a 48 83 bb 60 01 00 00 00 74 09 48 8d bb a0 00 00 00 eb 35 41 fe cc 74 0d f6 83 c0 01 00 00 04 74 04 <0f> 0b eb fe 48 89 75 e8 e8 be e0 de ff 66 83 8b c0 01 00 00 04
      [  222.309311] RIP  [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83
      [  222.309311]  RSP <ffff8800ba6e79f8>
      [  222.309311] ---[ end trace 32b4f71dffc15712 ]---
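
      The fix boils down to replacing the BUG_ON() that fires in the trace above
      with a tolerant early return, roughly as follows (simplified sketch; the
      queued-request accounting and exact helper names are approximate):

      void blkiocg_set_start_empty_time(struct blkio_group *blkg)
      {
          unsigned long flags;
          struct blkio_group_stats *stats;

          spin_lock_irqsave(&blkg->stats_lock, flags);
          stats = &blkg->stats;

          /* (queued-request accounting elided) */

          /*
           * The group can already be marked empty if a cfqq moved across
           * groups (group_isolation=0); don't BUG, keep the existing mark.
           */
          if (blkio_blkg_empty(stats)) {
              spin_unlock_irqrestore(&blkg->stats_lock, flags);
              return;
          }

          stats->start_empty_time = sched_clock();
          blkio_mark_blkg_empty(stats);
          spin_unlock_irqrestore(&blkg->stats_lock, flags);
      }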
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      e5ff082e
  23. 16 April 2010, 1 commit
  24. 14 April 2010, 2 commits
    • blkio: Fix compile errors · 28baf442
      Committed by Divyesh Shah
      Fixes compile errors in the blk-cgroup code for the empty_time stat and a merge
      fix in CFQ. The first error occurred when CONFIG_DEBUG_CFQ_IOSCHED is not set.
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      28baf442
    • block: Update to io-controller stats · a11cdaa7
      Committed by Divyesh Shah
      Changelog from v1:
      o Call blkiocg_update_idle_time_stats() at cfq_rq_enqueued() instead of at
        dispatch time.
      
      Changelog from original patchset: (in response to Vivek Goyal's comments)
      o group blkiocg_update_blkio_group_dequeue_stats() with other DEBUG functions
      o rename blkiocg_update_set_active_queue_stats() to
        blkiocg_update_avg_queue_size_stats()
      o s/request/io/ in blkiocg_update_request_add_stats() and
        blkiocg_update_request_remove_stats()
      o Call cfq_del_timer() at request dispatch() instead of
        blkiocg_update_idle_time_stats()
      
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      a11cdaa7
  25. 13 April 2010, 1 commit
    • io-controller: Add a new interface "weight_device" for IO-Controller · 34d0f179
      Committed by Gui Jianfeng
      Currently, the IO Controller makes use of blkio.weight to assign a weight for
      all devices. Here a new user interface "blkio.weight_device" is introduced to
      assign different weights to different devices. blkio.weight becomes the
      default value for devices which are not configured by "blkio.weight_device".
      
      You can use the following format to assign a specific weight for a given
      device:
      #echo "major:minor weight" > blkio.weight_device
      
      major:minor represents the device number.
      
      And you can remove the weight for a given device as follows:
      #echo "major:minor 0" > blkio.weight_device
      
      V1->V2 changes:
      - use user interface "weight_device" instead of "policy" suggested by Vivek
      - rename some struct suggested by Vivek
      - rebase to 2.6-block "for-linus" branch
      - remove a useless list_empty check pointed out by Li Zefan
      - some trivial typo fix
      
      V2->V3 changes:
      - Move policy_*_node() functions up to get rid of forward declarations
      - rename related functions by adding prefix "blkio_"
      Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      34d0f179
  26. 09 April 2010, 1 commit
    • blkio: Add more debug-only per-cgroup stats · 812df48d
      Committed by Divyesh Shah
      1) group_wait_time - This is the amount of time the cgroup had to wait to get a
        timeslice for one of its queues from when it became busy, i.e., went from 0
        to 1 request queued. This is different from io_wait_time, which is the
        cumulative total of the amount of time spent by each IO in that cgroup waiting
        in the scheduler queue. This stat is a great way to find out whether any jobs
        in the fleet are being starved or waiting longer than expected (due
        to an IO controller bug or any other issue).
      2) empty_time - This is the amount of time a cgroup spends without any pending
         requests. This stat is useful when a job does not seem to be able to use its
         assigned disk share: it helps check whether that is happening due to an IO
         controller bug or because the job is not submitting enough IOs.
      3) idle_time - This is the amount of time spent by the IO scheduler idling
         for a given cgroup in anticipation of a better request than the existing ones
         from other queues/cgroups.
      
      All these stats are recorded using start and stop events. When reading these
      stats, we do not add the delta between the current time and the last start time
      if we're between the start and stop events. We avoid doing this to make sure
      that these numbers are always monotonically increasing when read. Since we're
      using sched_clock(), which may use the tsc as its source, including the current
      delta may induce some inconsistency (due to tsc resync across cpus).
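
      The read-side rule in sketch form (field and function names are
      approximate, not the exact identifiers from the patch):

      /* report only fully accumulated time so the value never goes backwards,
       * even if sched_clock()/tsc drifts between CPUs */
      static u64 blkg_read_group_wait_time(struct blkio_group_stats *stats)
      {
          /* deliberately do NOT add (now - start_group_wait_time) for an
           * in-progress wait; that delta is folded in at the stop event */
          return stats->group_wait_time;
      }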
      
      Signed-off-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      812df48d