提交 · af75cd3c67845ebe31d2df9a780889a5ebecef11 · openanolis / cloud-kernel

21 5月, 2011 8 次提交

blk-throttle: Make no throttling rule group processing lockless · af75cd3c

由 Vivek Goyal 提交于 5月 19, 2011

Currently we take a queue lock on each bio to check if there are any
throttling rules associated with the group and also update the stats.
Now access the group under rcu and update the stats without taking
the queue lock. Queue lock is taken only if there are throttling rules
associated with the group.

So the common case of root group when there are no rules, save
unnecessary pounding of request queue lock.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

af75cd3c

blk-throttle: Make dispatch stats per cpu · 5624a4e4

由 Vivek Goyal 提交于 5月 19, 2011

Currently we take blkg_stat lock for even updating the stats. So even if
a group has no throttling rules (common case for root group), we end
up taking blkg_lock, for updating the stats.

Make dispatch stats per cpu so that these can be updated without taking
blkg lock.

If cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

5624a4e4

blk-throttle: Free up a group only after one rcu grace period · 4843c69d

由 Vivek Goyal 提交于 5月 19, 2011

Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
start freeing up throtl_grp after one rcu grace period.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

4843c69d

blk-throttle: Use helper function to add root throtl group to lists · 5617cbef

由 Vivek Goyal 提交于 5月 19, 2011

Use same helper function for root group as we use with dynamically
allocated groups to add it to various lists.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

5617cbef

blk-throttle: Introduce a helper function to fill in device details · 269f5415

由 Vivek Goyal 提交于 5月 19, 2011

A helper function for the code which is used at 2-3 places. Makes reading
code little easier.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

269f5415

blk-throttle: Dynamically allocate root group · 29b12589

由 Vivek Goyal 提交于 5月 19, 2011

Currently, we allocate root throtl_grp statically. But as we will be
introducing per cpu stat pointers and that will be allocated
dynamically even for root group, we might as well make whole root
throtl_grp allocation dynamic and treat it in same manner as other
groups.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

29b12589

blk-cgroup: Allow sleeping while dynamically allocating a group · f469a7b4

由 Vivek Goyal 提交于 5月 19, 2011

Currently, all the cfq_group or throtl_group allocations happen while
we are holding ->queue_lock and sleeping is not allowed.

Soon, we will move to per cpu stats and also need to allocate the
per group stats. As one can not call alloc_percpu() from atomic
context as it can sleep, we need to drop ->queue_lock, allocate the
group, retake the lock and continue processing.

In throttling code, I check the queue DEAD flag again to make sure
that driver did not call blk_cleanup_queue() in the mean time.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

f469a7b4

blk-throttle: Do the new group initialization with the help of a function · a29a171e

由 Vivek Goyal 提交于 5月 19, 2011

Group initialization code seems to be at two places. root group
initialization in blk_throtl_init() and dynamically allocated group
in throtl_find_alloc_tg(). Create a common function and use at both
the places.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

a29a171e

16 5月, 2011 1 次提交

blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup · 70087dc3

由 Vivek Goyal 提交于 5月 16, 2011

Currentlly we first map the task to cgroup and then cgroup to
blkio_cgroup. There is a more direct way to get to blkio_cgroup
from task using task_subsys_state(). Use that.

The real reason for the fix is that it also avoids a race in generic
cgroup code. During remount/umount rebind_subsystems() is called and
it can do following with and rcu protection.

cgrp->subsys[i] = NULL;

That means if somebody got hold of cgroup under rcu and then it tried
to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
is wrong. I was running into this race condition with ltp running on a
upstream derived kernel and that lead to crash.

So ideally we should also fix cgroup generic code to wait for rcu
grace period before setting pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down moun/remount/umount operations.

So for the time being atleast fix the kernel crash by taking a more
direct route to blkio_cgroup.

One tester had reported a crash while running LTP on a derived kernel
and with this fix crash is no more seen while the test has been
running for over 6 days.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

70087dc3

06 4月, 2011 1 次提交

blk-throttle: don't call xchg on bool · 6f037937

由 Andreas Schwab 提交于 3月 30, 2011

xchg does not work portably with smaller than 32bit types.
Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

6f037937

31 3月, 2011 1 次提交

Fix common misspellings · 25985edc

由 Lucas De Marchi 提交于 3月 30, 2011

Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>

25985edc

23 3月, 2011 1 次提交

blk-throttle: Reset group slice when limits are changed · 04521db0

由 Vivek Goyal 提交于 3月 22, 2011

Lina reported that if throttle limits are initially very high and then
dropped, then no new bio might be dispatched for a long time. And the
reason being that after dropping the limits we don't reset the existing
slice and do the rate calculation with new low rate and account the bios
dispatched at high rate. To fix it, reset the slice upon rate change.

https://lkml.org/lkml/2011/3/10/298

Another problem with very high limit is that we never queued the
bio on throtl service tree. That means we kept on extending the
group slice but never trimmed it. Fix that also by regulary
trimming the slice even if bio is not being queued up.
Reported-by: NLina Lu <lulina_nuaa@foxmail.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

04521db0

10 3月, 2011 2 次提交

blk-throttle: Use blk_plug in throttle dispatch · 69d60eb9

由 Vivek Goyal 提交于 3月 09, 2011

Use plug in throttle dispatch also as we are dispatching a bunch of
bios in throttle context and some of them might merge.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

69d60eb9

block: remove per-queue plugging · 7eaceacc

由 Jens Axboe 提交于 3月 10, 2011

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

7eaceacc

08 3月, 2011 2 次提交

blk-throttle: Some cleanups and race fixes in limit update code · de701c74

由 Vivek Goyal 提交于 3月 07, 2011

When throttle group limits are updated through cgroups, a thread is
woken up to process these updates. While reviewing that code, oleg noted
couple of race conditions existed in the code and he also suggested that
code can be simplified.

This patch fixes the races simplifies the code based on Oleg's suggestions:

	- Use xchg().
	- Introduced a common function throtl_update_blkio_group_common()
          which is shared now by all iops/bps update functions.
Reviewed-by: NOleg Nesterov <oleg@redhat.com>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>

Fixed a merge issue, throtl_schedule_delayed_work() takes throtl_data
as the argument now, not the queue.
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

de701c74

blk-throttle: process limit change only through one function · 231d704b

由 Vivek Goyal 提交于 3月 07, 2011

With the help of cgroup interface one can go and upate the bps/iops
limits of existing group. Once the limits are udpated, a thread is
woken up to see if some blocked group needs recalculation based on new
limits and needs to be requeued.

There was also a piece of code where I was checking for group limit
update when a fresh bio comes in. This patch gets rid of that piece of
code and keeps processing the limit change at one place
throtl_process_limit_change().  It just keeps the code simple and easy
to understand.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

231d704b

03 3月, 2011 1 次提交

block: Move blk_throtl_exit() call to blk_cleanup_queue() · da527770

由 Vivek Goyal 提交于 3月 02, 2011

Move blk_throtl_exit() in blk_cleanup_queue() as blk_throtl_exit() is
written in such a way that it needs queue lock. In blk_release_queue()
there is no gurantee that ->queue_lock is still around.

Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
one problem.

  https://lkml.org/lkml/2010/10/23/86

  And a quick fix moved blk_throtl_exit() to blk_release_queue().

        commit 7ad58c02
        Author: Jens Axboe <jaxboe@fusionio.com>
        Date:   Sat Oct 23 20:40:26 2010 +0200

        block: fix use-after-free bug in blk throttle code

This patch reverts above change and does not try to shutdown the
throtl work in blk_sync_queue(). By avoiding call to
throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
the problem reported by Ingo.

blk_sync_queue() seems to be used only by md driver and it seems to be
using it to make sure q->unplug_fn is not called as md registers its
own unplug functions and it is about to free up the data structures
used by unplug_fn(). Block throttle does not call back into unplug_fn()
or into md. So there is no need to cancel blk throttle work.

In fact I think cancelling block throttle work is bad because it might
happen that some bios are throttled and scheduled to be dispatched later
with the help of pending work and if work is cancelled, these bios might
never be dispatched.

Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
blk_release_queue() time. That should be safe as we are also calling
blk_throtl_exit() which should make sure all the throttling related
data structures are cleaned up.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

da527770

02 3月, 2011 1 次提交

blk-throttle: Do not use kblockd workqueue for throtl work · 450adcbe

由 Vivek Goyal 提交于 3月 01, 2011

o Dominik Klein reported a system hang issue while doing some blkio
  throttling testing.

  https://lkml.org/lkml/2011/2/24/173

o Some tracing revealed that CFQ was not dispatching any more jobs as
  queue unplug was not happening. And queue unplug was not happening
  because unplug work was not being called as there was one throttling
  work on same cpu which as not finished yet. And throttling work had not
  finished as it was tyring to dispatch a bio to CFQ but all the request
  descriptors were consume to it was put to sleep.

o So basically it is a cyclic dependecny between CFQ unplug work and
  throtl dispatch work. Tejun suggested that use separate workqueue for
  such cases.

o This patch uses a separate workqueue for throttle related work and
  does not rely on kblockd workqueue anymore.

Cc: stable@kernel.org
Reported-by: NDominik Klein <dk@in-telegence.net>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

450adcbe

19 1月, 2011 1 次提交

blkio-throttle: Avoid calling blkiocg_lookup_group() for root group · be2c6b19

由 Vivek Goyal 提交于 1月 19, 2011

o Jeff Moyer was doing some testing on a RAM backed disk and
  blkiocg_lookup_group() showed up high overhead after memcpy(). Similarly
  somebody else reported that blkiocg_lookup_group() is eating 6% extra
  cpu. Though looking at the code I can't think why the overhead of
  this function is so high. One thing is that it is called with very high
  frequency (once for every IO).

o For lot of folks blkio controller will be compiled in but they might
  not have actually created cgroups. Hence optimize the case of root
  cgroup where we can avoid calling blkiocg_lookup_group() if IO is happening
  in root group (common case).
Reported-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Acked-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

be2c6b19

02 12月, 2010 2 次提交

blk-throttle: Correct the placement of smp_rmb() · 04a6b516

由 Vivek Goyal 提交于 12月 01, 2010

o I was discussing what are the variable being updated without spin lock and
  why do we need barriers and Oleg pointed out that location of smp_rmb()
  should be between read of td->limits_changed and tg->limits_changed. This
  patch fixes it.

o Following is one possible sequence of events. Say cpu0 is executing
  throtl_update_blkio_group_read_bps() and cpu1 is executing
  throtl_process_limit_change().

 cpu0                                                cpu1

 tg->limits_changed = true;
 smp_mb__before_atomic_inc();
 atomic_inc(&td->limits_changed);

                                     if (!atomic_read(&td->limits_changed))
                                             return;

                                     if (tg->limits_changed)
                                             do_something;

 If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
 make sure that if update to td->limits_changed is visible on cpu1, then
 update to tg->limits_changed should also be visible.

 Oleg pointed out to ensure that we need to insert an smp_rmb() between
 td->limits_changed read and tg->limits_changed read.

o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed).
  This patch fixes it.
Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

04a6b516

blk-throttle: Trim/adjust slice_end once a bio has been dispatched · d1ae8ffd

由 Vivek Goyal 提交于 12月 01, 2010

o During some testing I did following and noticed throttling stops working.

        - Put a very low limit on a cgroup, say 1 byte per second.
        - Start some reads, this will set slice_end to a very high value.
        - Change the limit to higher value say 1MB/s
        - Now IO unthrottles and finishes as expected.
        - Try to do the read again but IO is not limited to 1MB/s as expected.

o What is happening.
        - Initially low value of limit sets slice_end to a very high value.
        - During updation of limit, slice_end is not being truncated.
        - Very high value of slice_end leads to keeping the existing slice
          valid for a very long time and new slice does not start.
        - tg_may_dispatch() is called in blk_throtle_bio(), and trim_slice()
          is not called in this path. So slice_start is some old value and
          practically we are able to do huge amount of IO.

o There are many ways it can be fixed. I have fixed it by trying to
  adjust/cleanup slice_end in trim_slice(). Generally we extend slices if bio
  is big and can't be dispatched in one slice. After dispatch of bio, readjust
  the slice_end to make sure we don't end up with huge values.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

d1ae8ffd

16 11月, 2010 1 次提交

blk-throttle: Fix calculation of max number of WRITES to be dispatched · c2f6805d

由 Vivek Goyal 提交于 11月 15, 2010

o Currently we try to dispatch more READS and less WRITES (75%, 25%) in one
  dispatch round. ummy pointed out that there is a bug in max_nr_writes
  calculation. This patch fixes it.
Reported-by: Nummy y <yummylln@yahoo.com.cn>
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

c2f6805d

02 10月, 2010 2 次提交

blkio-throttle: Fix possible multiplication overflow in iops calculations · c49c06e4

由 Vivek Goyal 提交于 10月 01, 2010

o User can specify max iops value of 32bit (UINT_MAX), through cgroup
  interface. If a user has specified say 4294967294 (UNIT_MAX  - 2), then
  on 32bit platform, following multiplication can overflow.

  io_allowed = (tg->iops[rw] * jiffy_elapsed_rnd)

o Explicitly cast the multiplication to 64bit and then perform division and
  then check whether result is still great then UNINT_MAX.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

c49c06e4

blkio-throttle: There is no need to convert jiffies to milli seconds · 5e901a2b

由 Vivek Goyal 提交于 10月 01, 2010

o Do not convert jiffies to mili seconds as it is not required. Just work
  with jiffies and HZ.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

5e901a2b

01 10月, 2010 3 次提交

blkio-throttle: Fix link failure failure on i386 · 3aad5d3e

由 Vivek Goyal 提交于 10月 01, 2010

o Randy Dunlap reported following linux-next failure. This patch fixes it.

on i386:

blk-throttle.c:(.text+0x1abb8): undefined reference to `__udivdi3'
blk-throttle.c:(.text+0x1b1dc): undefined reference to `__udivdi3'

o bytes_per_second interface is 64bit and I was continuing to do 64 bit
  division even on 32bit platform without help of special macros/functions
  hence the failure.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Reported-by: NRandy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

3aad5d3e

blkio: Recalculate the throttled bio dispatch time upon throttle limit change · fe071437

由 Vivek Goyal 提交于 10月 01, 2010

o Currently any cgroup throttle limit changes are processed asynchronousy and
the change does not take affect till a new bio is dispatched from same group.

o It might happen that a user sets a redicuously low limit on throttling.
Say 1 bytes per second on reads. In such cases simple operations like mount
a disk can wait for a very long time.

o Once bio is throttled, there is no easy way to come out of that wait even if
user increases the read limit later.

o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
the bio dispatch time according to new limits.

o Can't take queueu lock under blkcg_lock, hence after the change I wake
up the dispatch thread again which recalculates the time. So there are some
variables being synchronized across two threads without lock and I had to
make use of barriers. Hoping I have used barriers correctly. Any review of
memory barrier code especially will help.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

fe071437

blkio: Add root group to td->tg_list · 02977e4a

由 Vivek Goyal 提交于 10月 01, 2010

o Currently all the dynamically allocated groups, except root grp is added
  to td->tg_list. This was not a problem so far but in next patch I will
  travel through td->tg_list to process any updates of limits on the group.
  If root group is not in tg_list, then root group's updates are not
  processed.

o It is better to root group also to tg_list instead of doing special
  processing for it during limit updates.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

02977e4a

16 9月, 2010 2 次提交

blkio: Implementation of IOPS limit logic · 8e89d13f

由 Vivek Goyal 提交于 9月 15, 2010

o core logic of implementing IOPS throttling.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

8e89d13f

blkio: Core implementation of throttle policy · e43473b7

由 Vivek Goyal 提交于 9月 15, 2010

o Actual implementation of throttling policy in block layer. Currently it
  implements READ and WRITE bytes per second throttling logic. IOPS throttling
  comes in later patches.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NJens Axboe <jaxboe@fusionio.com>

e43473b7

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功