- 07 3月, 2012 15 次提交
-
-
由 Tejun Heo 提交于
To prepare for unifying blkgs for different policies, make blkg->pd an array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats, and ->stats_cpu into blkg_policy_data. This patch doesn't introduce any functional difference. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Currently, blkcg policy implementations manage blkg refcnt duplicating mostly identical code in both policies. This patch moves refcnt to blkg and let blkcg core handle refcnt and freeing of blkgs. * cfq blkgs now also get freed via RCU. * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If necessary, we can add blkio_exit_group_fn() to resurrect this. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Currently, blkg's are embedded in private data blkcg policy private data structure and thus allocated and freed by policies. This leads to duplicate codes in policies, hinders implementing common part in blkcg core with strong semantics, and forces duplicate blkg's for the same cgroup-q association. This patch introduces struct blkg_policy_data which is a separate data structure chained from blkg. Policies specifies the amount of private data it needs in its blkio_policy_type->pdata_size and blkcg core takes care of allocating them along with blkg which can be accessed using blkg_to_pdata(). blkg can be determined from pdata using pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated to blkio_init_group_fn(). For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in the reverse direction are added. Except that policy specific data now lives in a separate data structure from blkg, this patch doesn't introduce any functional difference. This will be used to unify blkg's for different policies. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Keep track of all request_queues which have blkcg initialized and turn on bypass and invoke blkcg_clear_queue() on all before making changes to blkcg policies. This is to prepare for moving blkg management into blkcg core. Note that this uses more brute force than necessary. Finer grained shoot down will be implemented later and given that policy [un]registration almost never happens on running systems (blk-throtl can't be built as a module and cfq usually is the builtin default iosched), this shouldn't be a problem for the time being. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Currently block core calls directly into blk-throttle for init, drain and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps the blk-throttle functions. This is to give more control and visiblity to blkcg core layer for proper layering. Further patches will add logic common to blkcg policies to the functions. While at it, collapse blk_throtl_release() into blk_throtl_exit(). There's no reason to keep them separate. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Currently, blkg points to the associated blkcg via its css_id. This unnecessarily complicates dereferencing blkcg. Let blkg hold a reference to the associated blkcg and point directly to it and disable css_id on blkio_subsys. This change requires splitting blkiocg_destroy() into blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be destroyed and all the blkcg references held by them dropped during cgroup removal. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Vivek Goyal 提交于
blk-cgroup printing code currently assumes that there is a device/disk associated with every queue in the system, but modules like floppy, can instantiate request queues without registering disk which can lead to oops. Skip the queue/blkg which don't have dev/disk associated with them. -tj: Factored out backing_dev_info check into blkg_dev_name(). Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
blkg->dev is dev_t recording the device number of the block device for the associated request_queue. It is used to identify the associated block device when printing out configuration or stats. This is redundant to begin with. A blkg is an association between a cgroup and a request_queue and it of course is possible to reach request_queue from blkg and synchronization conventions are in place for safe q dereferencing, so this shouldn't be necessary from the beginning. Furthermore, it's initialized by sscanf()ing the device name of backing_dev_info. The mind boggles. Anyways, if blkg is visible under rcu lock, we *know* that the associated request_queue hasn't gone away yet and its bdi is registered and alive - blkg can't be created for request_queue which hasn't been fully initialized and it can't go away before blkg is removed. Let stat and conf read functions get device name from blkg->q->backing_dev_info.dev and pass it down to printing functions and remove blkg->dev. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Now that blkcg configuration lives in blkg's, blkio_policy_node is no longer necessary. Kill it. blkio_policy_parse_and_set() now fails if invoked for missing device and functions to print out configurations are updated to print from blkg's. cftype_blkg_same_policy() is dropped along with other policy functions for consistency. Its one line is open coded in the only user - blkio_read_blkg_stats(). -v2: Update to reflect the retry-on-bypass logic change of the previous patch. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
blkcg is very peculiar in that it allows setting and remembering configurations for non-existent devices by maintaining separate data structures for configuration. This behavior is completely out of the usual norms and outright confusing; furthermore, it uses dev_t number to match the configuration to devices, which is unpredictable to begin with and becomes completely unuseable if EXT_DEVT is fully used. It is wholely unnecessary - we already have fully functional userland mechanism to program devices being hotplugged which has full access to device identification, connection topology and filesystem information. Add a new struct blkio_group_conf which contains all blkcg configurations to blkio_group and let blkio_group, which can be created iff the associated device exists and is removed when the associated device goes away, carry all configurations. Note that, after this patch, all newly created blkg's will always have the default configuration (unlimited for throttling and blkcg's weight for propio). This patch makes blkio_policy_node meaningless but doesn't remove it. The next patch will. -v2: Updated to retry after short sleep if blkg lookup/creation failed due to the queue being temporarily bypassed as indicated by -EBUSY return. Pointed out by Vivek. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Block cgroup policies are maintained in a linked list and, theoretically, multiple policies sharing the same policy ID are allowed. This patch temporarily restricts one policy per plid and adds blkio_policy[] array which indexes registered policy types by plid. Both the restriction and blkio_policy[] array are transitional and will be removed once API cleanup is complete. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
blkgio_group is association between a block cgroup and a queue for a given policy. Using opaque void * for association makes things confusing and hinders factoring of common code. Use request_queue * and, if necessary, policy id instead. This will help block cgroup API cleanup. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Block cgroup core can be built as module; however, it isn't too useful as blk-throttle can only be built-in and cfq-iosched is usually the default built-in scheduler. Scheduled blkcg cleanup requires calling into blkcg from block core. To simplify that, disallow building blkcg as module by making CONFIG_BLK_CGROUP bool. If building blkcg core as module really matters, which I doubt, we can revisit it after blkcg API cleanup. -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to depend on BLK_CGROUP. Fixed. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 07 2月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
put_io_context() performed a complex trylock dancing to avoid deferring ioc release to workqueue. It was also broken on UP because trylock was always assumed to succeed which resulted in unbalanced preemption count. While there are ways to fix the UP breakage, even the most pathological microbench (forced ioc allocation and tight fork/exit loop) fails to show any appreciable performance benefit of the optimization. Strip it out. If there turns out to be workloads which are affected by this change, simpler optimization from the discussion thread can be applied later. Signed-off-by: NTejun Heo <tj@kernel.org> LKML-Reference: <1328514611.21268.66.camel@sli10-conroe> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 12月, 2011 3 次提交
-
-
由 Tejun Heo 提交于
cic is association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other. Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock but the radix tree and cic's are also protected by RCU allowing either side to walk their lists without grabbing lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. The following is from cfqd going away too soon after ioc and q exits raced. general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: [ 88.503444] Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0 ... Call Trace: [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90 [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20 [<ffffffff81389130>] exit_io_context+0x100/0x140 [<ffffffff81098a29>] do_exit+0x579/0x850 [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0 [<ffffffff81098de7>] sys_exit_group+0x17/0x20 [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b The only real hot path here is cic lookup during request initialization and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately. * From q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it. * From ioc side, it's a bit more difficult because of inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing. Unlinking is now wholly done from put_io_context() and fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In unlikely cases of fast path failure, it falls back to full double-locking dance from workqueue. Double-locking isn't the prettiest thing in the world but it's *far* simpler and more understandable than RCU trick without adding any meaningful overhead. This still leaves a lot of now unnecessary RCU logics. Future patches will trim them. -v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
ioprio/cgroup change was handled by marking the changed state in ioc and, on the following access to the ioc, performing RCU-protected iteration through all cic's grabbing the matching queue_lock. This patch moves the changed state to each cic. When ioprio or cgroup changes, the respective bit is set on all cic's of the ioc and when each of those cic (not ioc) is accessed, change is applied for that specific ioc-queue pair. This also fixes the following two race conditions between setting and clearing of changed states. * Missing barrier between assign/load of ioprio and ioprio_changed allowed applying old ioprio. * Change requests could happen between application of change and clearing of changed variables. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from local task while the latter can be called from different task. The synchornization between them are peculiar and dubious. * current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment. * set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho) but as the code currently stands, the dependent read barrier is noop. As only one of the the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about race between the two. Nothing prevents current_io_context() and set_task_ioprio() allocating its own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with exiting task and create a new ioc after exit_io_context() is finished. ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything. This patch updates ioc get/put so that it becomes more conventional. * alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to ioc of another task. On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards. * The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock(). * get_io_context() now means incrementing ref on an ioc which the caller already has access to (be that an explicit refcnt or implicit %current one). * PF_EXITING inhibits creation of new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL. * All users are updated. Most are trivial but smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have wmb between init and install. There are other problems with this which will be fixed in another patch. * While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification. -v2: Vivek spotted contamination from debug patch. Removed. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 13 12月, 2011 1 次提交
-
-
由 Tejun Heo 提交于
Now that subsys->can_attach() and attach() take @tset instead of @task, they can handle per-task operations. Convert ->can_attach_task() and ->attach_task() users to use ->can_attach() and attach() instead. Most converions are straight-forward. Noteworthy changes are, * In cgroup_freezer, remove unnecessary NULL assignments to unused methods. It's useless and very prone to get out of sync, which already happened. * In cpuset, PF_THREAD_BOUND test is checked for each task. This doesn't make any practical difference but is conceptually cleaner. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com> Acked-by: NLi Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>
-
- 25 10月, 2011 2 次提交
-
-
由 Vivek Goyal 提交于
blkcg->policy_list is protected by blkcg->lock. Its not rcu protected list. So even for readers, they need to take blkcg->lock. There are few functions which were reading the list without taking lock. Fix it. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Vivek Goyal 提交于
If a rule is being deleted, free up associated policy node. Otherwise that memory is leaked. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 19 10月, 2011 1 次提交
-
-
由 Tejun Heo 提交于
blkio_policy_parse_and_set() calls blkio_check_dev_num() to check whether the given dev_t is valid. blkio_check_dev_num() uses get_gendisk() for verification but never puts the returned genhd leaking the reference. This patch collapses blkio_check_dev_num() into its caller and updates it such that the genhd is put before returning. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 9月, 2011 1 次提交
-
-
由 Wanlong Gao 提交于
The bug is we're not able to remove the device from blkio cgroup's per-device control files if it gets unplugged. To reproduce the bug: # mount -t cgroup -o blkio xxx /cgroup # cd /cgroup # echo "8:0 1000" > blkio.throttle.read_bps_device # unplug the device # cat blkio.throttle.read_bps_device 8:0 1000 # echo "8:0 0" > blkio.throttle.read_bps_device -bash: echo: write error: No such device After patching, the device removal will succeed. Thanks for the comments of Paul, Zefan, and Vivek. Signed-off-by: NWanlong Gao <gaowanlong@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Acked-by: NVivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 5月, 2011 1 次提交
-
-
由 Ben Blum 提交于
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: NBen Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: NPaul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 5月, 2011 1 次提交
-
-
由 Vivek Goyal 提交于
Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking blkg->stats_lock. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 21 5月, 2011 4 次提交
-
-
由 Vivek Goyal 提交于
Now dispatch stats update is lock free. But reset of these stats still takes blkg->stats_lock and is dependent on that. As stats are per cpu, we should be able to just reset the stats on each cpu without any locks. (Atleast for 64bit arch). On 32bit arch there is a small race where 64bit updates are not atomic. The result of this race can be that in the presence of other writers, one might not get 0 value after reset of a stat and might see something intermediate One can write more complicated code to cover this race like sending IPI to other cpus to reset stats and for offline cpus, reset these directly. Right not I am not taking that path because reset_update is more of a debug feature and it can happen only on 32bit arch and possibility of it happening is small. Will fix it if it becomes a real problem. For the time being going for code simplicity. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
由 Vivek Goyal 提交于
Some of the stats are 64bit and updation will be non atomic on 32bit architecture. Use sequence counters on 32bit arch to make reading of stats safe. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
由 Vivek Goyal 提交于
Currently we take blkg_stat lock for even updating the stats. So even if a group has no throttling rules (common case for root group), we end up taking blkg_lock, for updating the stats. Make dispatch stats per cpu so that these can be updated without taking blkg lock. If cpu goes offline, these stats simply disappear. No protection has been provided for that yet. Do we really need anything for that? Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
由 Vivek Goyal 提交于
cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y. there are some fields which are out side this config option. Fix that. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 16 5月, 2011 1 次提交
-
-
由 Vivek Goyal 提交于
Currentlly we first map the task to cgroup and then cgroup to blkio_cgroup. There is a more direct way to get to blkio_cgroup from task using task_subsys_state(). Use that. The real reason for the fix is that it also avoids a race in generic cgroup code. During remount/umount rebind_subsystems() is called and it can do following with and rcu protection. cgrp->subsys[i] = NULL; That means if somebody got hold of cgroup under rcu and then it tried to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which is wrong. I was running into this race condition with ltp running on a upstream derived kernel and that lead to crash. So ideally we should also fix cgroup generic code to wait for rcu grace period before setting pointer to NULL. Li Zefan is not very keen on introducing synchronize_wait() as he thinks it will slow down moun/remount/umount operations. So for the time being atleast fix the kernel crash by taking a more direct route to blkio_cgroup. One tester had reported a crash while running LTP on a derived kernel and with this fix crash is no more seen while the test has been running for over 6 days. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 31 3月, 2011 1 次提交
-
-
由 Lucas De Marchi 提交于
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
-
- 23 3月, 2011 1 次提交
-
-
由 Justin TerAvest 提交于
This change moves unaccounted_time to only be reported when CONFIG_DEBUG_BLK_CGROUP is true. Signed-off-by: NJustin TerAvest <teravest@google.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 12 3月, 2011 1 次提交
-
-
由 Justin TerAvest 提交于
There are two kind of times that tasks are not charged for: the first seek and the extra time slice used over the allocated timeslice. Both of these exported as a new unaccounted_time stat. I think it would be good to have this reported in 'time' as well, but that is probably a separate discussion. Signed-off-by: NJustin TerAvest <teravest@google.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 16 11月, 2010 1 次提交
-
-
由 Vivek Goyal 提交于
o Allow hierarchical cgroup creation for blkio controller o Currently we disallow it as both the io controller policies (throttling as well as proportion bandwidth) do not support hierarhical accounting and control. But the flip side is that blkio controller can not be used with libvirt as libvirt creates a cgroup hierarchy deeper than 1 level. <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups> o So this patch will allow creation of cgroup hierarhcy but at the backend everything will be treated as flat. So if somebody created a an hierarchy like as follows. root / \ test1 test2 | test3 CFQ and throttling will practically treat all groups at same level. pivot / | \ \ root test1 test2 test3 o Once we have actual support for hierarchical accounting and control then we can introduce another cgroup tunable file "blkio.use_hierarchy" which will be 0 by default but if user wants to enforce hierarhical control then it can be set to 1. This way there should not be any ABI problems down the line. o The only not so pretty part is introduction of extra file "use_hierarchy" down the line. Kame-san had mentioned that hierarhical accounting is expensive in memory controller hence they keep it off by default. I suspect same will be the case for IO controller also as for each IO completion we shall have to account IO through hierarchy up to the root. if yes, then it probably is not a very bad idea to introduce this extra file so that it will be used only when somebody needs it and some people might enable hierarchy only in part of the hierarchy. o This is how basically memory controller also uses "use_hierarhcy" and they also allowed creation of hierarchies when actual backend support was not available. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: NGui Jianfeng <guijianfeng@cn.fujitsu.com> Reviewed-by: NCiju Rajan K <ciju@linux.vnet.ibm.com> Tested-by: NCiju Rajan K <ciju@linux.vnet.ibm.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 02 10月, 2010 1 次提交
-
-
由 Vivek Goyal 提交于
- Limit max iops value to UINT_MAX and return error to user if value is more than that instead of accepting bigger values and truncating implicitly. Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 01 10月, 2010 3 次提交
-
-
由 Vivek Goyal 提交于
o Currently any cgroup throttle limit changes are processed asynchronousy and the change does not take affect till a new bio is dispatched from same group. o It might happen that a user sets a redicuously low limit on throttling. Say 1 bytes per second on reads. In such cases simple operations like mount a disk can wait for a very long time. o Once bio is throttled, there is no easy way to come out of that wait even if user increases the read limit later. o This patch fixes it. Now if a user changes the cgroup limits, we recalculate the bio dispatch time according to new limits. o Can't take queueu lock under blkcg_lock, hence after the change I wake up the dispatch thread again which recalculates the time. So there are some variables being synchronized across two threads without lock and I had to make use of barriers. Hoping I have used barriers correctly. Any review of memory barrier code especially will help. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
由 Vivek Goyal 提交于
o Now a cgroup list of blkg elements can contain blkg from multiple policies. Before sending an unlink event, make sure blkg belongs to they policy. If policy does not own the blkg, do not send update for this blkg. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
由 Vivek Goyal 提交于
Currently throttling related files were visible even if user had disabled throttling using config options. It was switching off background throttling of bio but not the cgroup files. This patch fixes it. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-
- 16 9月, 2010 1 次提交
-
-
由 Vivek Goyal 提交于
o cgroup changes for IOPS throttling rules. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
-