- 19 Aug, 2015 6 commits
-
-
Committed by Tejun Heo
Even when allocations fail, cfq_find_alloc_queue() always returns a valid cfq_queue by falling back to the oom cfq_queue. As such, there isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT is set. GFP_NOWAIT allocations don't fail often and even when they do the degraded behavior is acceptable and temporary.

After all, the only reason get_request(), which ultimately determines the gfp_mask, cares about __GFP_WAIT is to guarantee request allocation, assuming IO forward progress, for callers which are willing to wait. There's no reason for cfq_find_alloc_queue() to behave differently on __GFP_WAIT when it already has a fallback mechanism.

Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes to its callers. This simplifies the function quite a bit and will help make async queues per-cfq_group.

v2: Updated to reflect GFP_ATOMIC -> GFP_NOWAIT.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Tejun Heo
blkcg performs several allocations to track IOs per cgroup and enforce resource control. Most of these allocations are performed lazily on demand in the IO path and thus can't involve the reclaim path. Currently, these allocations use GFP_ATOMIC; however, blkcg can gracefully deal with occasional failures of these allocations by punting IOs to the root cgroup and there's no reason to reach into the emergency reserve.

This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following allocations.

* bdi_writeback_congested and blkcg_gq allocations in blkg_create().
* radix tree node allocations for blkcg->blkg_tree.
* cfq_queue allocation on ioprio changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Tejun Heo
* Some were accessing cic->cfqq[] directly. Always use cic_to_cfqq() and cic_set_cfqq().
* check_ioprio_changed() doesn't need to verify cfq_get_queue()'s return for NULL. It's always non-NULL. Simplify accordingly.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Tejun Heo
If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request() replaces it by invoking cfq_get_queue() again without putting the oom queue, leaking the reference it was holding. While oom queues are not released through reference counting, they're still reference counted, and this can theoretically lead to the reference count overflowing and the usual release path being incorrectly invoked on it.

Fix it by making cfq_set_request() put the ref it was holding.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
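The reference handling described in this commit can be sketched in user-space C. This is only an illustration under simplifying assumptions: `struct queue`, `struct io_ctx`, `queue_get()`, `queue_put()` and `replace_cached()` are hypothetical stand-ins, not the kernel's structures or helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct queue {
    int ref;                      /* reference count */
};

struct io_ctx {
    struct queue *cached;         /* queue cached for this I/O context */
};

static void queue_get(struct queue *q)
{
    q->ref++;
}

static void queue_put(struct queue *q)
{
    assert(q->ref > 0);
    q->ref--;
}

/*
 * Replace the queue cached in ctx with fresh. The reference held on the
 * old queue is dropped first, so repeatedly replacing a fallback queue
 * cannot keep inflating its reference count.
 */
static void replace_cached(struct io_ctx *ctx, struct queue *fresh)
{
    if (ctx->cached)
        queue_put(ctx->cached);   /* the step the leaking path skipped */
    queue_get(fresh);
    ctx->cached = fresh;
}
```

Without the `queue_put()` call, every replacement would leave one extra reference on the old queue, which is the leak the commit message describes.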
-
Committed by Tejun Heo
Async cfqq's (cfq_queue's) are shared across cfq_data. When cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it stashes the pointer in cfq_data and reuses it from then on; however, the function doesn't consider that cfq_find_alloc_queue() may return the oom_cfqq under memory pressure and installs the returned queue unconditionally.

If the oom_cfqq is installed as an async cfqq, cfq_set_request() will continue calling cfq_get_queue() hoping to replace it with a proper queue; however, cfq_get_queue() will keep returning the cached queue for the slot, which is the oom_cfqq.

Fix it by skipping caching if the queue is the oom one.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
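A minimal user-space sketch of the "do not cache the fallback" rule follows. The names `struct sched_data`, `async_slot` and `get_async_queue()` are hypothetical, and the allocator result is passed in as a parameter purely to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct queue {
    int id;
};

struct sched_data {
    struct queue oom_queue;       /* always-available fallback queue */
    struct queue *async_slot;     /* cached shared async queue, or NULL */
};

/*
 * allocated is whatever the allocator handed back, possibly the fallback.
 * Return the queue to use, caching it only if it is a real queue, so a
 * later call still gets the chance to install a proper one.
 */
static struct queue *get_async_queue(struct sched_data *sd,
                                     struct queue *allocated)
{
    assert(sd != NULL && allocated != NULL);

    if (sd->async_slot)
        return sd->async_slot;    /* a real queue was cached earlier */
    if (allocated != &sd->oom_queue)
        sd->async_slot = allocated;
    return allocated;
}
```

If the fallback were stored in `async_slot`, the first branch would return it forever, which mirrors the stuck state described above.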
-
Committed by Tejun Heo
cfq_get_queue()'s control flow looks like the following.

    async_cfqq = NULL;
    cfqq = NULL;

    if (!is_sync) {
        ...
        async_cfqq = ...;
        cfqq = *async_cfqq;
    }

    if (!cfqq)
        cfqq = ...;

    if (!is_sync && !(*async_cfqq))
        ...;

The only thing the local variable init, the second if, and the async_cfqq test in the third if achieve is to skip cfqq creation and installation if *async_cfqq was already non-NULL. This is needlessly complicated with different tests examining the same condition. Simplify it to the following.

    if (!is_sync) {
        ...
        async_cfqq = ...;
        cfqq = *async_cfqq;
        if (cfqq)
            goto out;
    }

    cfqq = ...;

    if (!is_sync)
        ...;
    out:

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 21 Jun, 2015 1 commit
-
-
Committed by Jens Axboe
Commit 9470e4a6 only covered the initial bug report; there are other spots in CFQ where we need to check that we actually have a valid cfq_group_data structure.

Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 20 Jun, 2015 2 commits
-
-
Committed by Jens Axboe
If none of the devices in the system are using CFQ, then attempting to read:

    /sys/fs/cgroup/blkio/blkio.leaf_weight

will result in a NULL dereference. Check for a valid cfq_group_data struct before attempting to dereference it.

Reported-by: Andrey Wagin <avagin@gmail.com>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by Jens Axboe
If CFQ_GROUP_IOSCHED is not set, the compiler produces the following warning:

    CC      block/cfq-iosched.o
    linux/block/cfq-iosched.c:469:2: warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function]
     *cpd_to_cfqgd(struct blkcg_policy_data *cpd)
     ^

In reality, two other lookup functions aren't used either if CFQ_GROUP_IOSCHED isn't set. Move all three under one of the CFQ_GROUP_IOSCHED sections in the code.

Reported-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 10 Jun, 2015 1 commit
-
-
Committed by Jens Axboe
A previous commit wanted to make CFQ default to IOPS mode on non-rotational storage; however, it did so when the queue was initialized, and the non-rotational flag is only set later on in the probe. Add an elevator hook that gets called off the add_disk() path. At that point we know that feature probing has finished, and we can reliably check for the various flags that drivers can set.

Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
Tested-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 07 Jun, 2015 1 commit
-
-
Committed by Arianna Avanzini
The block IO (blkio) controller enables the block layer to provide service guarantees in a hierarchical fashion. Specifically, service guarantees are provided by registered request-accounting policies. As of now, a proportional-share and a throttling policy are available. They are implemented, respectively, by the CFQ I/O scheduler and the blk-throttle subsystem. Unfortunately, as for adding new policies, the current implementation of the block IO controller is only halfway ready to allow new policies to be plugged in. This commit provides a solution to make the block IO controller fully ready to handle new policies. In what follows, we first describe briefly the current state, and then list the changes made by this commit.

The throttling policy does not need any per-cgroup information to perform its task. In contrast, the proportional-share policy uses, for each cgroup, both the weight assigned by the user to the cgroup, and a set of dynamically-computed weights, one for each device.

The first, user-defined weight is stored in the blkcg data structure: the block IO controller allocates a private blkcg data structure for each cgroup in the blkio cgroups hierarchy (regardless of which policy is active). In other words, the block IO controller internally mirrors the blkio cgroups with private blkcg data structures.

On the other hand, for each cgroup and device, the corresponding dynamically-computed weight is maintained in the following, different way. For each device, the block IO controller keeps a private blkcg_gq structure for each cgroup in blkio. In other words, block IO also keeps one private mirror copy of the blkio cgroups hierarchy for each device, made of blkcg_gq structures. Each blkcg_gq structure keeps per-policy information in a generic array of dynamically-allocated 'dedicated' data structures, one for each registered policy (so currently the array contains two elements). To be inserted into the generic array, each dedicated data structure embeds a generic blkg_policy_data structure. Consider now the array contained in the blkcg_gq structure corresponding to a given pair of cgroup and device: one of the elements of the array contains the dedicated data structure for the proportional-share policy, and this dedicated data structure contains the dynamically-computed weight for that pair of cgroup and device.

The generic strategy adopted for storing per-policy data in blkcg_gq structures is already capable of handling new policies, whereas the one adopted with blkcg structures is not, because per-policy data are hard-coded in the blkcg structures themselves (currently only data related to the proportional-share policy).

This commit addresses the above issues through the following changes:

. It generalizes blkcg structures so that per-policy data are stored in the same way as in blkcg_gq structures. Specifically, it also lets the blkcg structure store per-policy data in a generic array of dynamically-allocated dedicated data structures. We will refer to these data structures as blkcg dedicated data structures, to distinguish them from the dedicated data structures inserted in the generic arrays kept by blkcg_gq structures. To allow blkcg dedicated data structures to be inserted in the generic array inside a blkcg structure, this commit also introduces a new blkcg_policy_data structure, which is the equivalent of blkg_policy_data for blkcg dedicated data structures.
. It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a cpd_size field and a cpd_init field, to be initialized by the policy with, respectively, the size of the blkcg dedicated data structures, and the address of a constructor function for blkcg dedicated data structures.
. It moves the CFQ-specific fields embedded in the blkcg data structure (i.e., the fields related to the proportional-share policy), into a new blkcg dedicated data structure called cfq_group_data.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
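The "generic array of dedicated structures, each embedding a generic header" layout described in this commit can be sketched in plain C. Everything here is an illustrative assumption: `struct policy_data`, `struct policy`, `struct cgroup_blk`, `struct weight_data`, `attach_policy()` and the default weight of 500 are hypothetical stand-ins loosely modeled on blkcg_policy_data, cpd_size/cpd_init and cfq_group_data, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_POLICIES 2

/* Generic header that every policy-specific structure embeds. */
struct policy_data {
    int plid;                                /* policy id, i.e. array slot */
};

/* Policy descriptor: size of the dedicated structure plus a constructor. */
struct policy {
    size_t pd_size;
    void (*pd_init)(struct policy_data *pd);
};

/* Per-cgroup structure with one generic slot per registered policy. */
struct cgroup_blk {
    struct policy_data *pd[MAX_POLICIES];
};

/* Dedicated structure of a hypothetical proportional-share policy. */
struct weight_data {
    struct policy_data pd;                   /* embedded generic part */
    unsigned int weight;
};

/* Recover the enclosing structure from a pointer to its member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void weight_init(struct policy_data *pd)
{
    /* 500 is an arbitrary default chosen for the illustration. */
    container_of(pd, struct weight_data, pd)->weight = 500;
}

/* Allocate the policy's dedicated structure and store it in its slot. */
static int attach_policy(struct cgroup_blk *cg, int plid,
                         const struct policy *pol)
{
    struct policy_data *pd;

    assert(plid >= 0 && plid < MAX_POLICIES);
    pd = calloc(1, pol->pd_size);
    if (!pd)
        return -1;
    pd->plid = plid;
    pol->pd_init(pd);
    cg->pd[plid] = pd;
    return 0;
}
```

The generic code only ever sees `struct policy_data`, so a new policy can be plugged in by supplying a different size and constructor, without changing `struct cgroup_blk`.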
-
- 06 Jun, 2015 1 commit
-
-
Committed by Tahsin Erdogan
CFQ idling causes reduced IOPS throughput on non-rotational disks. Since disk head seeking is not applicable to SSDs, anticipating future near-by IO requests doesn't really help performance.

By turning off idling (and switching to IOPS mode), we allow other processes to dispatch IO requests down to the driver and so increase IO throughput.

The following FIO benchmark results were taken on a cloud SSD offering with idling on and off:

    Idling     iops    avg-lat(ms)    stddev            bw
    ------------------------------------------------------
        On     7054         90.107    38.697     28217KB/s
       Off    29255         21.836    11.730    117022KB/s

    fio --name=temp --size=100G --time_based --ioengine=libaio \
        --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
        --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
        --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10

And the following is from a local SSD run:

    Idling     iops    avg-lat(ms)    stddev            bw
    ------------------------------------------------------
        On    19320         33.043    14.068     77281KB/s
       Off    21626         29.465    12.662     86507KB/s

    fio --name=temp --size=5G --time_based --ioengine=libaio \
        --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
        --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
        --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10

Reviewed-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 02 Jun, 2015 1 commit
-
-
Committed by Tejun Heo
cgroup aware writeback support will require exposing some of blkcg details. In preparation, move block/blk-cgroup.h to include/linux/blk-cgroup.h. This patch is a pure file move.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 10 Feb, 2015 1 commit
-
-
Committed by Konstantin Khlebnikov
cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC. In cfq_find_alloc_queue() a possible allocation failure is not handled. As a result the kernel oopses on a NULL pointer dereference when cfq_link_cfqq_cfqg() calls cfqg_get() for a NULL pointer.

The bug was introduced in v3.5 in commit cd1604fa ("blkcg: factor out blkio_group creation"). Prior to that commit, cfq group lookup had returned a pointer to the root group as a fallback.

This patch handles this error using the existing fallback oom_cfqq.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Fixes: cd1604fa ("blkcg: factor out blkio_group creation")
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 22 Jan, 2015 1 commit
-
-
Committed by Jeff Moyer
Hi,

If you can manage to submit an async write as the first async I/O from the context of a process with realtime scheduling priority, then a cfq_queue is allocated, but filed into the wrong async_cfqq bucket. It ends up in the best effort array, but actually has realtime I/O scheduling priority set in cfqq->ioprio.

The reason is that cfq_get_queue assumes the default scheduling class and priority when there is no information present (i.e. when the async cfqq is created):

    static struct cfq_queue *
    cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
                  struct bio *bio, gfp_t gfp_mask)
    {
        const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
        const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);

cic->ioprio starts out as 0, which is "invalid". So, class of 0 (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:

    async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);

    static struct cfq_queue **
    cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
    {
        switch (ioprio_class) {
        case IOPRIO_CLASS_RT:
            return &cfqd->async_cfqq[0][ioprio];
        case IOPRIO_CLASS_NONE:
            ioprio = IOPRIO_NORM;
            /* fall through */
        case IOPRIO_CLASS_BE:
            return &cfqd->async_cfqq[1][ioprio];
        case IOPRIO_CLASS_IDLE:
            return &cfqd->async_idle_cfqq;
        default:
            BUG();
        }
    }

Here, instead of returning a class mapped from the process' scheduling priority, we get back the bucket associated with IOPRIO_CLASS_BE.

Now, there is no queue allocated there yet, so we create it:

    cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);

That function ends up doing this:

    cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
    cfq_init_prio_data(cfqq, cic);

cfq_init_cfqq marks the priority as having changed. Then, cfq_init_prio_data does this:

    ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
    switch (ioprio_class) {
    default:
        printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
    case IOPRIO_CLASS_NONE:
        /*
         * no prio set, inherit CPU scheduling settings
         */
        cfqq->ioprio = task_nice_ioprio(tsk);
        cfqq->ioprio_class = task_nice_ioclass(tsk);
        break;

So we basically have two code paths that treat IOPRIO_CLASS_NONE differently, which results in an RT async cfqq filed into a best effort bucket.

Attached is a patch which fixes the problem. I'm not sure how to make it cleaner. Suggestions would be welcome.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
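One way to picture the mismatch is to resolve the "none" class once and use the result for bucket selection as well as for the queue's own priority. The sketch below is a user-space illustration, not the patch itself: `enum prio_class`, `resolve_class()`, `bucket_for()` and `async_bucket()` are hypothetical names, and the mapping from CPU scheduling policy is reduced to a single `task_is_realtime` flag.

```c
#include <assert.h>

/* Hypothetical I/O priority classes mirroring NONE/RT/BE/IDLE. */
enum prio_class { CLASS_NONE, CLASS_RT, CLASS_BE, CLASS_IDLE };

/*
 * Resolve CLASS_NONE from the task's CPU scheduling class, the way the
 * queue initialization path inherits it. Simplified to one flag here.
 */
static enum prio_class resolve_class(enum prio_class requested,
                                     int task_is_realtime)
{
    if (requested != CLASS_NONE)
        return requested;
    return task_is_realtime ? CLASS_RT : CLASS_BE;
}

/* Bucket index for an already resolved class: 0 = RT, 1 = BE, 2 = IDLE. */
static int bucket_for(enum prio_class effective)
{
    switch (effective) {
    case CLASS_RT:
        return 0;
    case CLASS_BE:
        return 1;
    case CLASS_IDLE:
        return 2;
    default:
        return -1;                /* CLASS_NONE must be resolved first */
    }
}

/* Pick the bucket using the same resolution the queue itself will use. */
static int async_bucket(enum prio_class requested, int task_is_realtime)
{
    return bucket_for(resolve_class(requested, task_is_realtime));
}
```

With a single resolution step, a realtime task that never set an I/O priority lands in the RT bucket instead of the best-effort one.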
-
- 08 Sep, 2014 1 commit
-
-
Committed by Tejun Heo
blkcg->id is a unique id given to each blkcg; however, the cgroup_subsys_state which each blkcg embeds already has ->serial_nr which can be used for the same purpose. Drop blkcg->id and replace its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 28 Aug, 2014 1 commit
-
-
Committed by Toshiaki Makita
Explain that weight has to be updated on activation. This complements the previous fix e15693ef ("cfq-iosched: Fix wrong children_weight calculation").

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 27 Aug, 2014 1 commit
-
-
Committed by Toshiaki Makita
cfq_group_service_tree_add() is applying new_weight at the beginning of the function via cfq_update_group_weight(). This actually allows weight to change between adding it to and subtracting it from children_weight, and triggers WARN_ON_ONCE() in cfq_group_service_tree_del(), or even causes an oops by divide error during vfr calculation in cfq_group_service_tree_add().

The detailed scenario is as follows:

1. Create blkio cgroups X and Y as a child of X. Set X's weight to 500 and perform some I/O to apply new_weight. This X's I/O completes before starting Y's I/O.
2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
3. cfq_group_service_tree_add() walks up the tree during children_weight calculation and adds parent X's weight (500) to children_weight of root. children_weight becomes 500.
4. Set X's weight to 1000.
5. X starts I/O and cfq_group_service_tree_add() is called with X.
6. cfq_group_service_tree_add() applies its new_weight (1000).
7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
8. I/O of X completes and cfq_group_service_tree_del() is called with X.
9. cfq_group_service_tree_del() subtracts X's weight (1000) from children_weight of root. children_weight becomes -500. This triggers WARN_ON_ONCE().
10. Set X's weight to 500.
11. X starts I/O and cfq_group_service_tree_add() is called with X.
12. cfq_group_service_tree_add() applies its new_weight (500) and adds it to children_weight of root. children_weight becomes 0. Calculation of vfr triggers an oops by divide error.

weight should be updated right before adding it to children_weight.

Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
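The arithmetic in the scenario above can be replayed with a small user-space model. This is a sketch under simplifying assumptions: `struct group`, `group_activate()` and `group_deactivate()` are hypothetical, the hierarchy is reduced to one group and one parent sum, and a `counted` flag stands in for "this group's weight is currently included in the parent's children_weight".

```c
#include <assert.h>

/* Hypothetical, simplified model of one group under a parent. */
struct group {
    unsigned int weight;      /* weight currently in effect */
    unsigned int new_weight;  /* pending weight written by the user */
    int counted;              /* nonzero while weight is in the parent's sum */
};

/*
 * Activate a group. The pending weight is applied only at the moment the
 * group is added to the parent's sum, so the value added here is always
 * the value subtracted later.
 */
static void group_activate(struct group *g, long *children_weight)
{
    if (g->counted)
        return;               /* already in the sum: leave weight untouched */
    g->weight = g->new_weight;
    *children_weight += (long)g->weight;
    g->counted = 1;
}

/* Deactivate a group, removing exactly what activation added. */
static void group_deactivate(struct group *g, long *children_weight)
{
    assert(g->counted);
    *children_weight -= (long)g->weight;
    g->counted = 0;
}
```

Replaying steps 3 to 9 with this model keeps the sum at 500 while the group is counted and returns it to 0 on removal, instead of reaching -500.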
-
- 14 May, 2014 1 commit
-
-
Committed by Tejun Heo
Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments.
* Should return @nbytes on success instead of 0.
* @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces.

cftype->write_string() has no user left after the conversions and is removed.

While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion. Converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
-
- 01 May, 2014 1 commit
-
-
Committed by Masanari Iida
Fix format string mismatch in cfq_var_show()

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 10 Apr, 2014 1 commit
-
-
Committed by Jens Axboe
The queue parameter is never used, so just get rid of it.

Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 19 Mar, 2014 1 commit
-
-
Committed by Tejun Heo
cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
-
- 25 Feb, 2014 1 commit
-
-
Committed by Jan Kara
Block layer currently abuses rq->csd.list.next for storing fifo_time. That is a terrible hack and completely unnecessary as well. A union achieves the same space saving in a cleaner way.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
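The idea of sharing storage through a union, rather than reusing an unrelated pointer field, might look roughly like this in isolation. The structure and helper names (`struct completion_data`, `struct request_sketch`, `set_fifo_time()`, `get_fifo_time()`) are hypothetical and only approximate the shape of the real request structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the completion bookkeeping of a request. */
struct completion_data {
    void *next;
    void *prev;
    void (*func)(void *);
    void *info;
};

/*
 * The two members are never needed at the same time in this model, so
 * they share storage through an anonymous union (C11). Each use gets a
 * properly named and typed field at no extra space cost.
 */
struct request_sketch {
    int tag;
    union {
        struct completion_data csd;   /* used while completing */
        unsigned long fifo_time;      /* used while queued in the scheduler */
    };
};

static void set_fifo_time(struct request_sketch *rq, unsigned long t)
{
    assert(rq != NULL);
    rq->fifo_time = t;
}

static unsigned long get_fifo_time(const struct request_sketch *rq)
{
    assert(rq != NULL);
    return rq->fifo_time;
}
```

Compared with casting a list pointer to hold a timestamp, the union keeps the space saving while letting the compiler check the type of each access.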
-
- 12 Feb, 2014 1 commit
-
-
Committed by Tejun Heo
cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.

There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1.

.max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with.

This will also make transition to kernfs smoother.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
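The updated meaning of the limit, as stated in this commit message, can be written down as a tiny helper. This is a sketch of the described rule only: `effective_max_write_len()` and `write_allowed()` are hypothetical names, and `SKETCH_PAGE_SIZE` assumes a 4096-byte page for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the illustration; the real value is per-arch. */
#define SKETCH_PAGE_SIZE 4096

/*
 * The configured value means the maximum string length without the
 * terminating NUL. Anything below PAGE_SIZE - 1 is raised to
 * PAGE_SIZE - 1.
 */
static size_t effective_max_write_len(size_t max_write_len)
{
    size_t floor = SKETCH_PAGE_SIZE - 1;

    return max_write_len > floor ? max_write_len : floor;
}

/* Would a write of nbytes characters be accepted under this limit? */
static int write_allowed(size_t nbytes, size_t max_write_len)
{
    return nbytes <= effective_max_write_len(max_write_len);
}
```

Under this rule a file that leaves the field at 0 accepts up to 4095 characters, which is why the explicit initializations mentioned above become unnecessary.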
-
- 06 Dec, 2013 1 commit
-
-
Committed by Tejun Heo
In preparation of conversion to kernfs, cgroup file handling is updated so that it can be easily mapped to kernfs. This patch replaces cftype->read_seq_string() with cftype->seq_show() which is not limited to single_open() operation and will map directly to kernfs seq_file interface.

The conversions are mechanical. As ->seq_show() doesn't have @css and @cft, the functions which make use of them are converted to use seq_css() and seq_cft() respectively. On several occasions, e.g. if it has seq_string in its name, the function name is updated to fit the new method better.

This patch does not introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
-
- 13 Nov, 2013 1 commit
-
-
Committed by Peter Zijlstra
Now that seqcounts are lockdep enabled objects, we need to explicitly initialize runtime allocated seqcounts so that lockdep can track them.

Without this patch, Fengguang was seeing:

    [    4.127282] INFO: trying to register non-static key.
    [    4.128027] the code is fine but needs lockdep annotation.
    [    4.128027] turning off the locking correctness validator.
    [    4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
    [    4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ ... ]
    [    4.128027] Call Trace:
    [    4.128027]  [<7908e744>] ? console_unlock+0x353/0x380
    [    4.128027]  [<79dc7cf2>] dump_stack+0x48/0x60
    [    4.128027]  [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb
    [    4.128027]  [<7908a1c5>] lock_acquire+0x71/0x9a
    [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
    [    4.128027]  [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153
    [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
    [    4.128027]  [<794079aa>] blk_throtl_bio+0x1c3/0x485
    ...

Use u64_stats_init() for all affected data structures, which initializes the seqcount.

Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
[ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 23 Sep, 2013 1 commit
-
-
Committed by Anatol Pomozov
'samples' is a 64-bit operand, but the second parameter of do_div() is 32 bits wide. do_div() silently truncates the high 32 bits, so the calculated result is invalid.

If the low 32 bits of 'samples' are zero, do_div() crashes the kernel.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
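The truncation problem can be reproduced in user space with ordinary integer types. The helpers below are hypothetical models, not the kernel's do_div(): `div_u64_by_u32()` stands for any division routine whose divisor parameter is 32 bits wide, and `div_u64_by_u64()` for a full 64-bit division.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a division helper whose divisor is only 32 bits wide. */
static uint64_t div_u64_by_u32(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);     /* a zero divisor would fault */
    return dividend / divisor;
}

/* Full 64-bit division for comparison. */
static uint64_t div_u64_by_u64(uint64_t dividend, uint64_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}

/* What a 32-bit parameter actually receives from a 64-bit argument. */
static uint32_t truncated_divisor(uint64_t divisor)
{
    return (uint32_t)divisor; /* the high 32 bits are silently dropped */
}
```

For example, a divisor of 2^32 + 2 is seen as 2 by the narrow helper, giving a wildly different quotient, and a divisor of exactly 2^32 is seen as 0, which corresponds to the crash case described above.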
-
- 12 Sep, 2013 1 commit
-
-
Committed by Joe Perches
Use the helper function instead of __GFP_ZERO.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 09 Aug, 2013 1 commit
-
-
Committed by Tejun Heo
cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale.

This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsystem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way.

Most subsystem conversions are straightforward but there are some interesting ones.

* freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css.
* memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly.
* cpu: cgroup_tg() doesn't have any user left. Removed.
* cpuacct: cgroup_ca() doesn't have any user left. Removed.
* hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left. Removed.
* net_cls: cgrp_cls_state() doesn't have any user left. Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
-
- 03 Jul, 2013 1 commit
-
-
Committed by Jianpeng Ma
There's a race between elevator switching and normal I/O operation. Because struct elevator_queue and struct elevator_data are not allocated in one atomic operation, there is a chance of using a NULL ->elevator_data.

For example:

    Thread A:                        Thread B:
    blk_queue_bio                    elevator_switch
    spin_lock_irq(q->queue_lock)     elevator_alloc
    elv_merge                        elevator_init_fn

elevator_alloc is called without holding queue_lock, and at that point ->elevator_data is still NULL. If thread A calls elv_merge at the same time and needs some information from elevator_data, the crash happens.

Move elevator_alloc into elevator_init_fn so that both allocations happen in one atomic operation.

The bug can be reproduced easily as follows:

    1: dd if=/dev/sdb of=/dev/null
    2: while true; do echo noop > scheduler; echo deadline > scheduler; done

The fix was tested with the same method.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 Mar, 2013 1 commit
-
-
Committed by Kent Overstreet
Just a little convenience macro - the main reason to add it now is preparing for immutable bio vecs; it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
-
- 28 Feb, 2013 1 commit
-
-
Committed by Sasha Levin
I'm not sure why, but the hlist for-each-entry iterators were conceived differently from the list ones. The list iterator is:

    list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only do they not really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small number of places were using the 'node' parameter; this was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;
        <+... when != b
    (
    hlist_for_each_entry(a,
    - b,
    c, d) S
    |
    hlist_for_each_entry_continue(a,
    - b,
    c) S
    |
    hlist_for_each_entry_from(a,
    - b,
    c) S
    |
    hlist_for_each_entry_rcu(a,
    - b,
    c, d) S
    |
    hlist_for_each_entry_rcu_bh(a,
    - b,
    c, d) S
    |
    hlist_for_each_entry_continue_rcu_bh(a,
    - b,
    c) S
    |
    for_each_busy_worker(a, c,
    - b,
    d) S
    |
    ax25_uid_for_each(a,
    - b,
    c) S
    |
    ax25_for_each(a,
    - b,
    c) S
    |
    inet_bind_bucket_for_each(a,
    - b,
    c) S
    |
    sctp_for_each_hentry(a,
    - b,
    c) S
    |
    sk_for_each(a,
    - b,
    c) S
    |
    sk_for_each_rcu(a,
    - b,
    c) S
    |
    sk_for_each_from
    -(a, b)
    +(a)
    S
    + sk_for_each_from(a) S
    |
    sk_for_each_safe(a,
    - b,
    c, d) S
    |
    sk_for_each_bound(a,
    - b,
    c) S
    |
    hlist_for_each_entry_safe(a,
    - b,
    c, d, e) S
    |
    hlist_for_each_entry_continue_rcu(a,
    - b,
    c) S
    |
    nr_neigh_for_each(a,
    - b,
    c) S
    |
    nr_neigh_for_each_safe(a,
    - b,
    c, d) S
    |
    nr_node_for_each(a,
    - b,
    c) S
    |
    nr_node_for_each_safe(a,
    - b,
    c, d) S
    |
    - for_each_gfn_sp(a, c, d, b) S
    + for_each_gfn_sp(a, c, d) S
    |
    - for_each_gfn_indirect_valid_sp(a, c, d, b) S
    + for_each_gfn_indirect_valid_sp(a, c, d) S
    |
    for_each_host(a,
    - b,
    c) S
    |
    for_each_host_safe(a,
    - b,
    c, d) S
    |
    for_each_mesh_entry(a,
    - b,
    c, d) S
    )
        ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
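To see why the extra node cursor is not needed, here is a user-space sketch of an entry iterator that takes only the entry pointer, the head, and the member name. It is written in the spirit of the change but is not the kernel's macro: `struct hnode`, `struct hhead`, `entry_of()`, `entry_or_null()` and `for_each_entry()` are hypothetical names, and `__typeof__` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked node and list head. */
struct hnode {
    struct hnode *next;
};

struct hhead {
    struct hnode *first;
};

/* Recover the enclosing entry from a pointer to its embedded node. */
#define entry_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same as entry_of(), but maps a NULL node to a NULL entry. */
#define entry_or_null(ptr, type, member) \
    ((ptr) ? entry_of(ptr, type, member) : NULL)

/*
 * Iterate over entries using only the entry cursor. The next node is
 * reached through pos->member, so no separate node cursor is required.
 */
#define for_each_entry(pos, head, member)                                  \
    for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);   \
         pos;                                                              \
         pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

/* Example entry type embedding a node. */
struct item {
    int value;
    struct hnode link;
};

static int sum_items(struct hhead *head)
{
    struct item *it;
    int sum = 0;

    assert(head != NULL);
    for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

The call site now has the same shape as the list iterator quoted above: one cursor, one head, one member name.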
-
- 22 Feb, 2013 1 commit
-
-
Committed by Glauber Costa
While stress-running very-small container scenarios with the Kernel Memory Controller, I've run into a lockdep-detected lock imbalance in cfq-iosched.c.

I'll apologize beforehand for not posting a backlog: I didn't anticipate it would be so hard to reproduce, so I didn't save my serial output and went directly on to debugging. It turns out that it did not happen again in more than 20 runs, making it quite a rare pattern.

But here is my analysis:

When we are in very low-memory situations, we will arrive at cfq_find_alloc_queue() and may not find a queue, having to resort to the oom queue, in an rcu-locked condition:

        if (!cfqq || cfqq == &cfqd->oom_cfqq)
                [ ... ]

Next, we will release the rcu lock and try to allocate a queue, retrying if we succeed:

        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                        gfp_mask | __GFP_ZERO,
                        cfqd->queue->node);
        spin_lock_irq(cfqd->queue->queue_lock);
        if (new_cfqq)
                goto retry;

We are unlocked at this point, but it should be fine, since we will reacquire the rcu_read_lock when we retry.

Except, of course, that we may not retry: the allocation may very well fail and we'll keep on going through the flow.

The next branch is:

        if (cfqq) {
                [ ... ]
        } else
                cfqq = &cfqd->oom_cfqq;

And right before exiting, we'll issue rcu_read_unlock().

Being already unlocked, this is the likely source of our imbalance. Since cfqq is either already NULL or made NULL in the first statement of the outer branch, the only viable alternative here seems to be to return the oom queue right away in case of allocation failure.

Please review the following patch and apply if you agree with my analysis.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
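The control flow analysed above can be modelled in userspace, which may make the imbalance easier to see. This is only a sketch under stated assumptions: a plain counter stands in for RCU read-side nesting, the queue spinlock is left out, and find_alloc_queue(), struct queue, and the alloc_ok and fixed flags are hypothetical names invented for this model rather than the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a counter tracks rcu_read_lock() nesting. */
static int rcu_depth;
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

struct queue { int id; };
static struct queue oom_queue = { -1 };

/*
 * Models the slow path. 'spare' is what a successful allocation hands
 * back, 'alloc_ok' decides whether the allocation succeeds, and 'fixed'
 * selects the early return proposed in the commit message.
 */
static struct queue *find_alloc_queue(struct queue *spare, bool alloc_ok,
                                      bool fixed)
{
    struct queue *q = NULL, *new_q = NULL;

retry:
    rcu_read_lock();
    if (!q) {                       /* lookup found nothing usable */
        if (new_q) {
            q = new_q;              /* use the queue allocated earlier */
            new_q = NULL;
        } else {
            rcu_read_unlock();      /* drop the lock before allocating */
            new_q = alloc_ok ? spare : NULL;
            if (new_q)
                goto retry;         /* retry re-takes the lock */
            if (fixed)
                return &oom_queue;  /* bail out while still unlocked */
            /* unfixed path: falls through with the lock already dropped */
        }
    }
    if (!q)
        q = &oom_queue;
    rcu_read_unlock();              /* second unlock on the unfixed path */
    return q;
}
```

In this model a failed allocation leaves rcu_depth at -1 when fixed is false and at 0 when it is true, which mirrors the imbalance and the proposed remedy.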
-
- 10 Jan, 2013 7 commits
-
-
Committed by Tejun Heo
Unfortunately, at this point, there's no way to make the existing statistics hierarchical without creating nasty surprises for the existing users. Just create recursive counterparts of the existing stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
-
Committed by Tejun Heo
To support hierarchical stats, it's necessary to remember stats from dead children. Add cfqg->dead_stats and make a dying cfqg transfer its stats to the parent's dead_stats.

The transfer happens from ->pd_offline_fn() and it is possible that there are some residual IOs completing afterwards. Currently, we lose these stats. Given that cgroup removal isn't a very high frequency operation and the amount of residual IOs on offline is likely to be nil or small, this shouldn't be a big deal, and the complexity needed to handle residual IOs - another callback and rather elaborate synchronization to reach and lock the matching q - doesn't seem justified.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
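The transfer described above could look roughly like the following. This is a minimal sketch, not the cfq-iosched implementation: struct stats, struct group, stats_merge(), stats_reset(), and group_offline() are hypothetical names, the two counters are placeholders for the real per-group statistics, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder counters standing in for the real per-group statistics. */
struct stats {
    unsigned long long ios;
    unsigned long long bytes;
};

struct group {
    struct group *parent;     /* NULL for the root group */
    struct stats stats;       /* accumulated while the group is live */
    struct stats dead_stats;  /* inherited from offlined descendants */
};

/* Add every counter in 'from' to the matching counter in 'to'. */
static void stats_merge(struct stats *to, const struct stats *from)
{
    assert(to != NULL && from != NULL);
    to->ios += from->ios;
    to->bytes += from->bytes;
}

static void stats_reset(struct stats *s)
{
    s->ios = 0;
    s->bytes = 0;
}

/*
 * Called when a group goes offline: hand its own stats and anything it
 * inherited from its own dead children to the parent, then clear both so
 * nothing is counted twice. The root has no parent and keeps its stats.
 */
static void group_offline(struct group *g)
{
    if (!g->parent)
        return;
    stats_merge(&g->parent->dead_stats, &g->stats);
    stats_merge(&g->parent->dead_stats, &g->dead_stats);
    stats_reset(&g->stats);
    stats_reset(&g->dead_stats);
}
```

Any IO that completes after group_offline() has run is not folded into the parent in this sketch, matching the loss of residual stats that the message accepts.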
-
Committed by Tejun Heo
Separate out cfqg_stats_reset(), which takes struct cfqg_stats *, from cfq_pd_reset_stats(), and move the latter to where the other pd methods are defined. cfqg_stats_reset() will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
-
Committed by Tejun Heo
Rename blkg_rwstat_sum() to blkg_rwstat_total(). The name "sum" will be used for summing up stats from multiple blkgs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
-
Committed by Tejun Heo
With the previous two patches, all cfqg scheduling decisions are based on vfraction and ready for hierarchy support. The only thing which keeps the behavior flat is cfqg_flat_parent(), which makes the vfraction calculation consider all non-root cfqgs children of the root cfqg.

Replace it with cfqg_parent(), which returns the real parent. This enables full blkcg hierarchy support for cfq-iosched. For example, consider the following hierarchy.

          root
        /      \
     A:500      B:250
    /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are on the service tree. For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
 B:                  (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly. For more detail, please refer to Documentation/block/cfq-iosched.txt.

v2: cfq-iosched.txt updated to describe group scheduling as suggested by Vivek.

v3: blkio-controller.txt updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
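The arithmetic in the example hierarchy above can be reproduced with a short floating-point sketch. It assumes a simplified model in which each node stores its own weight and the summed weight of its active children; struct node, its field names, and vfraction() are hypothetical, and the kernel itself works in fixed point rather than doubles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: 'level_weight' is the sum of its active children's weights. */
struct node {
    const struct node *parent;  /* NULL for the root */
    unsigned int weight;
    unsigned int level_weight;
};

/*
 * Walk from the node up to the root, multiplying together the share the
 * node (and then each ancestor) holds among its siblings.
 */
static double vfraction(const struct node *n)
{
    double f = 1.0;

    for (; n->parent; n = n->parent) {
        assert(n->parent->level_weight > 0);
        f *= (double)n->weight / n->parent->level_weight;
    }
    return f;
}
```

For the tree in the message, root has a level_weight of 750 (500 + 250) and A has a level_weight of 1500 (500 + 1000), so the walk for AA multiplies 500/1500 by 500/750.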
-
Committed by Tejun Heo
cfq_group_slice() calculates the slice by taking a fraction of cfq_target_latency according to the ratio of cfqg->weight to service_tree->total_weight. This currently works only because all cfqgs are treated as being at the same level.

To prepare for proper hierarchy support, convert cfq_group_slice() to base the calculation on cfqg->vfraction. As cfqg->vfraction is always a fraction of 1 and represents the fraction allocated to the cfqg with hierarchy considered, the slice can be calculated simply by multiplying cfq_target_latency by cfqg->vfraction (with the fixed point shift factored in).

As the vfraction calculation currently treats all non-root cfqgs as children of the root cfqg, this patch doesn't introduce any noticeable behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
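The multiplication described above amounts to a single fixed-point operation, sketched here. The function name group_slice(), the SERVICE_SHIFT value of 12, and the latency figure used in the usage note are assumptions made for illustration; the message itself only says that the fixed point shift is factored in.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift, standing in for CFQ_SERVICE_SHIFT. */
#define SERVICE_SHIFT 12

/*
 * vfraction is a fixed-point fraction of 1, so 1 << SERVICE_SHIFT means
 * the group owns the whole target latency. Multiply first, then shift
 * the fixed-point factor back out.
 */
static uint64_t group_slice(uint64_t target_latency, unsigned int vfraction)
{
    assert(vfraction <= (1u << SERVICE_SHIFT));
    return (target_latency * vfraction) >> SERVICE_SHIFT;
}
```

With a hypothetical target latency of 300 time units, a group whose vfraction is one half (1 << 11) gets a slice of 150, and a group holding the full fraction gets all 300.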
-
Committed by Tejun Heo
Currently, cfqg charges are scaled directly according to cfqg->weight. Regardless of the number of active cfqgs or the amount of active weights, a given weight value always scales the charge the same way. This works fine as long as all cfqgs are treated equally regardless of their positions in the hierarchy, which is what cfq currently implements. It can't work in hierarchical settings because the interpretation of a given weight value depends on where the weight is located in the hierarchy.

This patch reimplements cfqg charge scaling so that it can be used to support hierarchy properly. The scheme is fairly simple and light-weight.

* When a cfqg is added to the service tree, v(disktime)weight is calculated. It walks up the tree to the root, calculating the fraction it has in the hierarchy. At each level, the fraction can be calculated as

        cfqg->weight / parent->level_weight

  By compounding these, the global fraction of vdisktime the cfqg has claim to - vfraction - can be determined.

* When the cfqg needs to be charged, the charge is scaled inversely proportionally to the vfraction.

The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point representation as before; however, the smallest scaling factor is now 1 (i.e. 1 << CFQ_SERVICE_SHIFT). This is different from before, where 1 was for CFQ_WEIGHT_DEFAULT and a higher weight would result in a smaller scaling factor.

While this shifts the global scale of vdisktime a bit, it doesn't change the relative relationships among cfqgs, and the scheduling result isn't different.

cfq_group_notify_queue_add uses a fixed CFQ_IDLE_DELAY when appending a new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY didn't have any relevance to vdisktime before and is unlikely to cause any visible behavior difference now, especially as the scale shift isn't that large.

As the new scheme now makes a proper distinction between cfqg->weight and ->leaf_weight, reverse the weight aliasing for root cfqgs. For root, both weights are now mapped to ->leaf_weight instead of the other way around.

Because we're still using cfqg_flat_parent(), this patch shouldn't change the scheduling behavior in any noticeable way.

v2: Beefed up comments on vfraction as requested by Vivek.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
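The two steps of the scheme above, compounding the per-level fraction and scaling a charge inversely to the result, can be sketched in fixed point as follows. The names compound() and scale_charge() and the SERVICE_SHIFT value of 12 are assumptions made for this example, and the handling of ->leaf_weight is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift, standing in for CFQ_SERVICE_SHIFT. */
#define SERVICE_SHIFT 12

/*
 * One step of the walk toward the root: multiply the running fraction by
 * weight / level_weight. Start the walk with 1 << SERVICE_SHIFT (1.0).
 */
static unsigned int compound(unsigned int vfr, unsigned int weight,
                             unsigned int level_weight)
{
    assert(level_weight > 0 && weight <= level_weight);
    return (unsigned int)((uint64_t)vfr * weight / level_weight);
}

/*
 * Scale a charge inversely to vfraction. The result stays in fixed
 * point, so a group holding the whole fraction is charged at factor 1,
 * i.e. charge << SERVICE_SHIFT, and smaller fractions are charged more.
 */
static uint64_t scale_charge(uint64_t charge, unsigned int vfraction)
{
    uint64_t c = charge << SERVICE_SHIFT;   /* convert to fixed point */

    assert(vfraction > 0);
    c <<= SERVICE_SHIFT;                    /* pre-shift to keep precision */
    return c / vfraction;
}
```

Halving a group's vfraction doubles what it is charged for the same service, which is the inverse proportionality the second bullet describes; integer division truncates, so a computed vfraction sits slightly below the exact ratio.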
-