1. 07 Mar 2012 (6 commits)
    • blkcg: don't allow or retain configuration of missing devices · e56da7e2
      Authored by Tejun Heo
      blkcg is very peculiar in that it allows setting and remembering
      configurations for non-existent devices by maintaining separate data
      structures for configuration.
      
      This behavior is completely out of the usual norms and outright
      confusing; furthermore, it uses dev_t number to match the
      configuration to devices, which is unpredictable to begin with and
      becomes completely unusable if EXT_DEVT is fully used.
      
      It is wholly unnecessary - we already have a fully functional userland
      mechanism for programming devices as they are hotplugged, with full access
      to device identification, connection topology and filesystem information.
      
      Add a new struct blkio_group_conf, which holds all blkcg configuration,
      to blkio_group and let blkio_group carry the configuration.  A
      blkio_group can be created only if the associated device exists and is
      removed when the associated device goes away.
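
      As a rough illustration only (field names are assumptions based on the
      description above, not verbatim from the patch), the per-group
      configuration might look like:

        struct blkio_group_conf {
                unsigned int    weight;         /* proportional (cfq) weight */
                u64             bps[2];         /* throttle: read/write bytes/sec */
                unsigned int    iops[2];        /* throttle: read/write IOs/sec */
        };

        struct blkio_group {
                /* ... existing fields ... */
                struct blkio_group_conf conf;   /* lives and dies with the device */
        };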
      
      Note that, after this patch, all newly created blkgs will always have
      the default configuration (unlimited for throttling and the blkcg's
      weight for proportional IO).
      
      This patch makes blkio_policy_node meaningless but doesn't remove it.
      The next patch will.
      
      -v2: Updated to retry after a short sleep if blkg lookup/creation failed
           due to the queue being temporarily bypassed, as indicated by an
           -EBUSY return.  Pointed out by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: factor out blkio_group creation · cd1604fa
      Authored by Tejun Heo
      Currently both blk-throttle and cfq-iosched implement their own
      blkio_group creation code in throtl_get_tg() and cfq_get_cfqg().  This
      patch factors out the common code into blkg_lookup_create(), which
      returns an ERR_PTR value so that transitional failures due to queue
      bypass can be distinguished from other failures.
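
      To illustrate the ERR_PTR convention, a hedged fragment (names follow
      the text above; the sleep-and-retry idea comes from the previous
      commit's -v2 note and the false @for_root argument from the -v3 note
      below, everything else is assumed):

        blkg = blkg_lookup_create(blkcg, q, plid, false);
        if (IS_ERR(blkg)) {
                if (PTR_ERR(blkg) == -EBUSY) {
                        /* queue only temporarily bypassing: drop locks,
                         * sleep briefly and retry the lookup/creation */
                } else {
                        /* hard failure: propagate PTR_ERR(blkg) */
                }
        }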
      
      * New blkio_policy_ops methods blkio_alloc_group_fn() and
        blkio_link_group_fn() are added.  Both are transitional and will be
        removed once the blkg management code is fully moved into
        blk-cgroup.c.
      
      * blkio_alloc_group_fn() allocates the policy-specific blkg, which is
        usually a larger data structure with blkg as the first entry, and
        initializes it.  Note that initialization of blkg proper, including
        the percpu stats, is the responsibility of blk-cgroup proper.
      
        Note that default config (weight, bps...) initialization is done
        from this method; otherwise, we end up violating locking order
        between blkcg and q locks via blkcg_get_CONF() functions.
      
      * blkio_link_group_fn() is called under queue_lock and is responsible
        for linking the blkg to the queue.  The blkcg side is handled by
        blk-cgroup proper.
      
      * The common blkg creation function is named blkg_lookup_create() and
        blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
        Also, throtl / cfq related functions are similarly [re]named for
        consistency.
      
      This simplifies blkcg policy implementations and enables further
      cleanup.
      
      -v2: Vivek noticed that blkg_lookup_create() incorrectly tested
           blk_queue_dead() instead of blk_queue_bypass(), leading a user of
           the function to end up creating a new blkg on a bypassing queue.
           This is a bug introduced while relocating the bypass patches before
           this one.  Fixed.
      
      -v3: ERR_PTR patch folded into this one.  @for_root added to
           blkg_lookup_create() to allow creating root group on a bypassed
           queue during elevator switch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: add blkio_policy[] array and allow one policy per policy ID · 035d10b2
      Authored by Tejun Heo
      Block cgroup policies are maintained in a linked list and,
      theoretically, multiple policies sharing the same policy ID are
      allowed.
      
      This patch temporarily restricts one policy per plid and adds
      blkio_policy[] array which indexes registered policy types by plid.
      Both the restriction and blkio_policy[] array are transitional and
      will be removed once API cleanup is complete.
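
      A rough sketch of the transitional registry described above (the array
      bound and locking details are assumptions, not verbatim code):

        static struct blkio_policy_type *blkio_policy[BLKIO_NR_POLICIES];

        void blkio_policy_register(struct blkio_policy_type *blkiop)
        {
                spin_lock(&blkio_list_lock);
                BUG_ON(blkio_policy[blkiop->plid]);     /* one policy per plid */
                blkio_policy[blkiop->plid] = blkiop;
                list_add_tail(&blkiop->list, &blkio_list);
                spin_unlock(&blkio_list_lock);
        }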
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: use q and plid instead of opaque void * for blkio_group association · ca32aefc
      Authored by Tejun Heo
      blkio_group is the association between a block cgroup and a queue for a
      given policy.  Using an opaque void * for the association makes things
      confusing and hinders factoring of common code.  Use request_queue *
      and, if necessary, the policy id instead.
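
      Illustrative before/after of the association (prototypes are assumed
      from the description, not copied from the patch):

        /* before: the q side of the association was an opaque key */
        struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
                                                 void *key);

        /* after: explicit request_queue plus policy id */
        struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
                                                 struct request_queue *q,
                                                 enum blkio_policy_id plid);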
      
      This will help block cgroup API cleanup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Authored by Tejun Heo
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that apply to them, as an interim step, all per-policy blkg
      data will be shot down.
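
      Roughly, the elevator-switch path implied here looks like the following
      sketch (call sites and ordering are assumptions building on the bypass
      helpers from the earlier patches, not verbatim code):

        blk_queue_bypass_start(q);      /* quiesce: no new blkgs get created */
        blkg_destroy_all(q);            /* shoot down per-policy blkg data */
        /* ... tear down the old elevator and bring up the new one ... */
        blk_queue_bypass_end(q);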
      
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      
      * blk-throtl doesn't shoot down root_tg.  This is to ease the transition.
        The unified blkg will always have a persistent root group, and not
        shooting down root_tg for now eases the transition to that point by
        avoiding having to update td->root_tg; it is safe as blk-throtl can
        never be disabled.
      
      -v2: Vivek pointed out that the group list is not guaranteed to be empty
           on return from the clear function if it raced against cgroup removal
           and lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in a limbo state between the blkcg and request_queue locks.

           blk-throtl no longer shoots down root_tg to avoid breaking
           td->root_tg.

           Also, nest queue_lock inside blkio_list_lock, not the other way
           around, to avoid introducing a possible deadlock via the blkcg lock.
      
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: make CONFIG_BLK_CGROUP bool · 32e380ae
      Authored by Tejun Heo
      Block cgroup core can be built as a module; however, that isn't too
      useful as blk-throttle can only be built in and cfq-iosched is usually
      the default built-in scheduler.  The scheduled blkcg cleanup requires
      calling into blkcg from block core.  To simplify that, disallow building
      blkcg as a module by making CONFIG_BLK_CGROUP bool.
      
      If building blkcg core as a module really matters, which I doubt, we can
      revisit it after the blkcg API cleanup.
      
      -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
           depend on BLK_CGROUP.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 07 Feb 2012 (1 commit)
    • block: strip out locking optimization in put_io_context() · 11a3122f
      Authored by Tejun Heo
      put_io_context() performed a complex trylock dance to avoid deferring
      ioc release to a workqueue.  It was also broken on UP because the
      trylock was always assumed to succeed, which resulted in an unbalanced
      preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turn out to be workloads which
      are affected by this change, a simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 14 Dec 2011 (3 commits)
    • block, cfq: unlink cfq_io_context's immediately · b2efa052
      Authored by Tejun Heo
      A cic is the association between an io_context and a request_queue.  A cic is
      linked from both ioc and q and should be destroyed when either one
      goes away.  As ioc and q both have their own locks, locking becomes a
      bit complex - both orders work for removal from one but not from the
      other.
      
      Currently, cfq tries to circumvent this locking order issue with RCU.
      ioc->lock nests inside queue_lock but the radix tree and cic's are
      also protected by RCU, allowing either side to walk its list without
      grabbing the lock.
      
      This rather unconventional use of RCU quickly devolves into an extremely
      fragile convolution.  For example, the following oops is from cfqd going
      away too soon after ioc and q exits raced.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       [   88.503444]
       Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
       RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
       ...
       Call Trace:
        [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
        [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
        [<ffffffff81389130>] exit_io_context+0x100/0x140
        [<ffffffff81098a29>] do_exit+0x579/0x850
        [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
        [<ffffffff81098de7>] sys_exit_group+0x17/0x20
        [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
      
      The only real hot path here is cic lookup during request
      initialization and avoiding extra locking requires very confined use
      of RCU.  This patch makes cic removal from both ioc and request_queue
      perform double-locking and unlink immediately.
      
      * From q side, the change is almost trivial as ioc->lock nests inside
        queue_lock.  It just needs to grab each ioc->lock as it walks
        cic_list and unlink it.
      
      * From the ioc side, it's a bit more difficult because of the inverted
        lock order.  ioc needs its lock to walk its cic_list but can't grab
        the matching queue_lock, so it needs to perform an unlock-relock dance.
      
        Unlinking is now wholly done from put_io_context() and the fast path
        is optimized by using the queue_lock the caller already holds, which
        is by far the most common case.  If the ioc accessed multiple devices,
        it tries with trylock.  In the unlikely case of fast path failure, it
        falls back to the full double-locking dance from a workqueue.
      
      Double-locking isn't the prettiest thing in the world but it's *far*
      simpler and more understandable than the RCU trick, without adding any
      meaningful overhead.
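
      A minimal sketch of the ioc-side unlink described above (the helper
      cic_unlink_and_free() and the release_work field are hypothetical; only
      the trylock fast path and the workqueue fallback are the point):

        spin_lock_irqsave(&ioc->lock, flags);
        while (!hlist_empty(&ioc->cic_list)) {
                struct cfq_io_context *cic = hlist_entry(ioc->cic_list.first,
                                        struct cfq_io_context, cic_list);
                struct request_queue *this_q = cic->q;

                if (this_q != locked_q && !spin_trylock(this_q->queue_lock)) {
                        /* fast path failed: defer the full unlock/relock
                         * double-locking dance to the workqueue */
                        schedule_work(&ioc->release_work);
                        break;
                }
                cic_unlink_and_free(cic);               /* hypothetical helper */
                if (this_q != locked_q)
                        spin_unlock(this_q->queue_lock);
        }
        spin_unlock_irqrestore(&ioc->lock, flags);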
      
      This still leaves a lot of now-unnecessary RCU logic.  Future patches
      will trim it.
      
      -v2: Vivek pointed out that cic->q was being dereferenced after
           cic->release() was called.  Updated to use local variable @this_q
           instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move ioc ioprio/cgroup changed handling to cic · dc86900e
      Authored by Tejun Heo
      An ioprio/cgroup change was handled by marking the changed state in the
      ioc and, on the following access to the ioc, performing an RCU-protected
      iteration through all cic's, grabbing the matching queue_lock.
      
      This patch moves the changed state to each cic.  When ioprio or cgroup
      changes, the respective bit is set on all cic's of the ioc, and when
      each of those cic's (not the ioc) is accessed, the change is applied for
      that specific ioc-queue pair.
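
      Sketched in code (the flag and helper names approximate what the commit
      describes rather than quoting it):

        enum {
                CIC_IOPRIO_CHANGED,
                CIC_CGROUP_CHANGED,
        };

        /* on an ioprio/cgroup change: mark every cic hanging off the ioc */
        static void ioc_set_changed(struct io_context *ioc, int which)
        {
                struct cfq_io_context *cic;
                struct hlist_node *n;

                hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
                        set_bit(which, &cic->changed);
        }

        /* on access to one cic: apply and clear for that ioc-queue pair */
        if (test_and_clear_bit(CIC_IOPRIO_CHANGED, &cic->changed))
                changed_ioprio(cic);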
      
      This also fixes the following two race conditions between setting and
      clearing of changed states.
      
      * Missing barrier between assign/load of ioprio and ioprio_changed
        allowed applying old ioprio.
      
      * Change requests could happen between application of change and
        clearing of changed variables.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: make ioc get/put interface more conventional and fix race on allocation · 6e736be7
      Authored by Tejun Heo
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from the local task while the latter can be called from a
      different task.  The synchronization between them is peculiar and
      dubious.
      
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
        assignment.
      
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  ie, if ioc itself were being
        dereferenced as "ioc->xxx", it would mean something (not sure what though)
        but as the code currently stands, the dependent read barrier is a
        noop.
      
      As only one of the two test-assignment sequences is task_lock()
      protected, task_lock() can't do much about the race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() from each
      allocating its own ioc for the same task and overwriting the other's.
      
      Also, set_task_ioprio() can race with an exiting task and create a new
      ioc after exit_io_context() is finished.
      
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub structures without impacting anything.
      
      This patch updates ioc get/put so that it becomes more conventional.
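
      For reference, a hedged sketch of the resulting usage pattern (the
      get_task_io_context() signature and error handling are assumptions):

        struct io_context *ioc;

        /* the only way to reach another task's ioc; returns with a reference */
        ioc = get_task_io_context(task, GFP_KERNEL, NUMA_NO_NODE);
        if (ioc) {
                /* ... inspect or modify the ioc ... */
                put_io_context(ioc);            /* drop the reference */
        }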
      
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to ioc of another task.
        On return, the caller has an explicit reference to the object which
        should be put using put_io_context() afterwards.
      
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
        patch.
      
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
        specification.
      
      -v2: Vivek spotted contamination from a debug patch.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 13 Dec 2011 (1 commit)
    • cgroup: don't use subsys->can_attach_task() or ->attach_task() · bb9d97b6
      Authored by Tejun Heo
      Now that subsys->can_attach() and attach() take @tset instead of
      @task, they can handle per-task operations.  Convert
      ->can_attach_task() and ->attach_task() users to use ->can_attach()
      and attach() instead.  Most conversions are straightforward (a sketch
      of the converted callback shape follows the notes below).  Noteworthy
      changes are,
      
      * In cgroup_freezer, remove unnecessary NULL assignments to unused
        methods.  It's useless and very prone to getting out of sync, which
        has already happened.

      * In cpuset, the PF_THREAD_BOUND test is checked for each task.  This
        doesn't make any practical difference but is conceptually cleaner.
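
      As referenced above, a hedged sketch of the converted callback shape
      (the subsystem, the per-task check and the exact iterator arguments are
      assumptions):

        static int my_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                 struct cgroup_taskset *tset)
        {
                struct task_struct *task;

                /* per-task checks that used to live in ->can_attach_task() */
                cgroup_taskset_for_each(task, NULL, tset) {
                        if (my_per_task_check_fails(task))      /* hypothetical */
                                return -EINVAL;
                }
                return 0;
        }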
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: James Morris <jmorris@namei.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
  5. 25 Oct 2011 (2 commits)
  6. 19 Oct 2011 (1 commit)
  7. 21 Sep 2011 (1 commit)
  8. 27 May 2011 (1 commit)
    • cgroups: add per-thread subsystem callbacks · f780bdb7
      Authored by Ben Blum
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups's subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
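
      In sketch form (prototypes are assumptions matching the description,
      not copied from the patch):

        struct cgroup_subsys {
                /* ... */
                int  (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
                void (*pre_attach)(struct cgroup *cgrp);
                void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
                /* ... */
        };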
      
      Also, the old "bool threadgroup" interface is removed, replaced by
      this.  All subsystems are modified for the new interface - of note is
      cpuset, which requires the from/to nodemasks for attach to be globally
      scoped (though per-cpuset would work too) so they persist from its
      pre_attach to attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 23 May 2011 (1 commit)
  10. 21 May 2011 (4 commits)
  11. 16 May 2011 (1 commit)
    • blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup · 70087dc3
      Authored by Vivek Goyal
      Currently we first map the task to a cgroup and then the cgroup to the
      blkio_cgroup.  There is a more direct way to get to the blkio_cgroup
      from the task using task_subsys_state().  Use that.
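
      The direct path is essentially the following (a sketch; the helper name
      is an assumption):

        static inline struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
        {
                return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                    struct blkio_cgroup, css);
        }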
      
      The real reason for the fix is that it also avoids a race in generic
      cgroup code.  During remount/umount, rebind_subsystems() is called and
      it can do the following without waiting for an RCU grace period:
      
      cgrp->subsys[i] = NULL;
      
      That means that if somebody got hold of a cgroup under RCU and then
      tried to use cgroup->subsys[] to get to the blkio_cgroup, it would get
      NULL, which is wrong.  I was running into this race condition with LTP
      running on an upstream-derived kernel and that led to a crash.
      
      So ideally we should also fix the generic cgroup code to wait for an RCU
      grace period before setting the pointer to NULL.  Li Zefan is not very
      keen on introducing synchronize_wait() as he thinks it will slow down
      mount/remount/umount operations.
      
      So for the time being at least, fix the kernel crash by taking a more
      direct route to the blkio_cgroup.
      
      One tester had reported a crash while running LTP on a derived kernel;
      with this fix the crash is no longer seen and the test has been running
      for over 6 days.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  12. 31 Mar 2011 (1 commit)
  13. 23 Mar 2011 (1 commit)
  14. 12 Mar 2011 (1 commit)
  15. 16 Nov 2010 (1 commit)
    • blk-cgroup: Allow creation of hierarchical cgroups · bdc85df7
      Authored by Vivek Goyal
      o Allow hierarchical cgroup creation for blkio controller
      
      o Currently we disallow it as both the io controller policies (throttling
        as well as proportional bandwidth) do not support hierarchical
        accounting and control.  But the flip side is that the blkio controller
        cannot be used with libvirt, as libvirt creates a cgroup hierarchy
        deeper than 1 level.
      
        <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
      
      o So this patch will allow creation of a cgroup hierarchy, but at the
        backend everything will be treated as flat.  So if somebody created a
        hierarchy as follows,
      
      			root
      			/  \
      		     test1 test2
      			|
      		     test3
      
        CFQ and throttling will practically treat all groups at the same level.
      
      				pivot
      			     /  |   \  \
      			root  test1 test2  test3
      
      o Once we have actual support for hierarchical accounting and control,
        then we can introduce another cgroup tunable file "blkio.use_hierarchy"
        which will be 0 by default, but if the user wants to enforce
        hierarchical control it can be set to 1.  This way there should not be
        any ABI problems down the line.
      
      o The only not-so-pretty part is the introduction of the extra file
        "use_hierarchy" down the line.  Kame-san had mentioned that
        hierarchical accounting is expensive in the memory controller, hence
        they keep it off by default.  I suspect the same will be the case for
        the IO controller, as for each IO completion we shall have to account
        the IO through the hierarchy up to the root.  If yes, then it probably
        is not a very bad idea to introduce this extra file so that it will be
        used only when somebody needs it, and some people might enable
        hierarchy only in part of the hierarchy.
      
      o This is basically how the memory controller also uses "use_hierarchy";
        it likewise allowed creation of hierarchies when actual backend support
        was not available.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  16. 02 Oct 2010 (1 commit)
  17. 01 Oct 2010 (3 commits)
    • blkio: Recalculate the throttled bio dispatch time upon throttle limit change · fe071437
      Authored by Vivek Goyal
      o Currently any cgroup throttle limit changes are processed asynchronously
        and the change does not take effect till a new bio is dispatched from
        the same group.
      
      o It might happen that a user sets a ridiculously low limit on throttling.
        Say 1 byte per second on reads.  In such cases simple operations like
        mounting a disk can wait for a very long time.
      
      o Once a bio is throttled, there is no easy way to come out of that wait
        even if the user increases the read limit later.
      
      o This patch fixes it.  Now if a user changes the cgroup limits, we
        recalculate the bio dispatch time according to the new limits.
      
      o Can't take the queue lock under blkcg_lock, hence after the change I
        wake up the dispatch thread again which recalculates the time.  So
        there are some variables being synchronized across two threads without
        a lock and I had to make use of barriers.  Hoping I have used barriers
        correctly.  Any review of the memory barrier code especially will help.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • blkio: deletion of a cgroup was causing oops · 61014e96
      Authored by Vivek Goyal
      o Now a cgroup's list of blkg elements can contain blkgs from multiple
        policies.  Before sending an unlink event, make sure the blkg belongs
        to the policy.  If the policy does not own the blkg, do not send an
        update for this blkg.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n · 13f98250
      Authored by Vivek Goyal
      Currently throttling-related files were visible even if the user had
      disabled throttling using config options.  That switched off background
      throttling of bios but not the cgroup files.  This patch fixes it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  18. 16 Sep 2010 (4 commits)
  19. 23 Aug 2010 (1 commit)
  20. 07 May 2010 (1 commit)
  21. 03 May 2010 (1 commit)
  22. 27 Apr 2010 (2 commits)
    • blk-cgroup: config options re-arrangement · afc24d49
      Authored by Vivek Goyal
      This patch fixes a few usability and configurability issues.
      
      o All the cgroup-based controller options are configurable from the
        "General Setup/Control Group Support/" menu; blkio is the only
        exception.  Hence make this option visible in the above menu and make
        it configurable from there, to bring it in line with the rest of the
        cgroup-based controllers.
      
      o Get rid of CONFIG_DEBUG_CFQ_IOSCHED.
      
        This option currently does two things.
      
        - Enables printing of cgroup paths in blktrace
        - Enables CONFIG_DEBUG_BLK_CGROUP, which in turn displays additional
          stat files in cgroup.
      
        If we are using group scheduling, blktrace data is not really of much
        use if cgroup information is not present.  To get this data, currently
        one also has to enable CONFIG_DEBUG_CFQ_IOSCHED, which in turn brings
        the overhead of all the additional debug stat files, which is not
        desired.
      
        Hence, this patch moves printing of cgroup paths under
        CONFIG_CFQ_GROUP_IOSCHED.
      
        This allows us to get rid of CONFIG_DEBUG_CFQ_IOSCHED completely. Now all
        the debug stat files are controlled only by CONFIG_DEBUG_BLK_CGROUP which
        can be enabled through config menu.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Divyesh Shah <dpshah@google.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • blkio: Fix another BUG_ON() crash due to cfqq movement across groups · e5ff082e
      Authored by Vivek Goyal
      o Once in a while, I was hitting a BUG_ON() in blkio code.  empty_time
        was assuming that upon slice expiry, a group can't already be marked
        empty (except for forced dispatch).
      
        But this assumption is broken if cfqq can move (group_isolation=0) across
        groups after receiving a request.
      
        I think most likely in this case we got a request in a cfqq and
        accounted the rq in one group; later, while adding the cfqq to the
        tree, we moved the queue to a different group which was already marked
        empty, and after dispatch from the slice we found the group already
        marked empty and raised the alarm.
      
        This patch does not error out if the group is already marked empty.
        This can introduce some empty_time stat error, but only in the case of
        group_isolation=0.  This is better than crashing.  In the case of
        group_isolation=1 we should still get the same stats as before this
        patch.
      
      [  222.308546] ------------[ cut here ]------------
      [  222.309311] kernel BUG at block/blk-cgroup.c:236!
      [  222.309311] invalid opcode: 0000 [#1] SMP
      [  222.309311] last sysfs file: /sys/devices/virtual/block/dm-3/queue/scheduler
      [  222.309311] CPU 1
      [  222.309311] Modules linked in: dm_round_robin dm_multipath qla2xxx scsi_transport_fc dm_zero dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      [  222.309311]
      [  222.309311] Pid: 4780, comm: fio Not tainted 2.6.34-rc4-blkio-config #68 0A98h/HP xw8600 Workstation
      [  222.309311] RIP: 0010:[<ffffffff8121ad88>]  [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83
      [  222.309311] RSP: 0018:ffff8800ba6e79f8  EFLAGS: 00010002
      [  222.309311] RAX: 0000000000000082 RBX: ffff8800a13b7990 RCX: ffff8800a13b7808
      [  222.309311] RDX: 0000000000002121 RSI: 0000000000000082 RDI: ffff8800a13b7a30
      [  222.309311] RBP: ffff8800ba6e7a18 R08: 0000000000000000 R09: 0000000000000001
      [  222.309311] R10: 000000000002f8c8 R11: ffff8800ba6e7ad8 R12: ffff8800a13b78ff
      [  222.309311] R13: ffff8800a13b7990 R14: 0000000000000001 R15: ffff8800a13b7808
      [  222.309311] FS:  00007f3beec476f0(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000
      [  222.309311] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  222.309311] CR2: 000000000040e7f0 CR3: 00000000a12d5000 CR4: 00000000000006e0
      [  222.309311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  222.309311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  222.309311] Process fio (pid: 4780, threadinfo ffff8800ba6e6000, task ffff8800b3d6bf00)
      [  222.309311] Stack:
      [  222.309311]  0000000000000001 ffff8800bab17a48 ffff8800bab17a48 ffff8800a13b7800
      [  222.309311] <0> ffff8800ba6e7a68 ffffffff8121da35 ffff880000000001 00ff8800ba5c5698
      [  222.309311] <0> ffff8800ba6e7a68 ffff8800a13b7800 0000000000000000 ffff8800bab17a48
      [  222.309311] Call Trace:
      [  222.309311]  [<ffffffff8121da35>] __cfq_slice_expired+0x2af/0x3ec
      [  222.309311]  [<ffffffff8121fd7b>] cfq_dispatch_requests+0x2c8/0x8e8
      [  222.309311]  [<ffffffff8120f1cd>] ? spin_unlock_irqrestore+0xe/0x10
      [  222.309311]  [<ffffffff8120fb1a>] ? blk_insert_cloned_request+0x70/0x7b
      [  222.309311]  [<ffffffff81210461>] blk_peek_request+0x191/0x1a7
      [  222.309311]  [<ffffffffa0002799>] dm_request_fn+0x38/0x14c [dm_mod]
      [  222.309311]  [<ffffffff810ae61f>] ? sync_page_killable+0x0/0x35
      [  222.309311]  [<ffffffff81210fd4>] __generic_unplug_device+0x32/0x37
      [  222.309311]  [<ffffffff81211274>] generic_unplug_device+0x2e/0x3c
      [  222.309311]  [<ffffffffa00011a6>] dm_unplug_all+0x42/0x5b [dm_mod]
      [  222.309311]  [<ffffffff8120ca37>] blk_unplug+0x29/0x2d
      [  222.309311]  [<ffffffff8120ca4d>] blk_backing_dev_unplug+0x12/0x14
      [  222.309311]  [<ffffffff81109a7a>] block_sync_page+0x35/0x39
      [  222.309311]  [<ffffffff810ae616>] sync_page+0x41/0x4a
      [  222.309311]  [<ffffffff810ae62d>] sync_page_killable+0xe/0x35
      [  222.309311]  [<ffffffff8158aa59>] __wait_on_bit_lock+0x46/0x8f
      [  222.309311]  [<ffffffff810ae4f5>] __lock_page_killable+0x66/0x6d
      [  222.309311]  [<ffffffff81056f9c>] ? wake_bit_function+0x0/0x33
      [  222.309311]  [<ffffffff810ae528>] lock_page_killable+0x2c/0x2e
      [  222.309311]  [<ffffffff810afbc5>] generic_file_aio_read+0x361/0x4f0
      [  222.309311]  [<ffffffff810ea044>] do_sync_read+0xcb/0x108
      [  222.309311]  [<ffffffff811e42f7>] ? security_file_permission+0x16/0x18
      [  222.309311]  [<ffffffff810ea6ab>] vfs_read+0xab/0x108
      [  222.309311]  [<ffffffff810ea7c8>] sys_read+0x4a/0x6e
      [  222.309311]  [<ffffffff81002b5b>] system_call_fastpath+0x16/0x1b
      [  222.309311] Code: 58 01 00 00 00 48 89 c6 75 0a 48 83 bb 60 01 00 00 00 74 09 48 8d bb a0 00 00 00 eb 35 41 fe cc 74 0d f6 83 c0 01 00 00 04 74 04 <0f> 0b eb fe 48 89 75 e8 e8 be e0 de ff 66 83 8b c0 01 00 00 04
      [  222.309311] RIP  [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83
      [  222.309311]  RSP <ffff8800ba6e79f8>
      [  222.309311] ---[ end trace 32b4f71dffc15712 ]---
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Divyesh Shah <dpshah@google.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  23. 16 Apr 2010 (1 commit)