1. 02 Aug, 2011 (1 commit)
    • cfq-iosched: Reduce linked group count upon group destruction · a5395b83
      By Vivek Goyal
      CFQ keeps track of the number of groups which are linked on blkcg->blkg_list.
      This is useful to avoid races between the queue exit and cgroup exit code
      paths. So if the linked group count is not zero at request queue exit time,
      there are some groups out there which are yet to be deleted under an rcu
      grace period, and the queue exit code should wait for one rcu grace period.
      
      In my previous patch I forgot to decrement the linked group count. So in
      the current form, nr_blkcg_linked_grps is always non-zero and we always
      wait one rcu grace period (if BLK_CGROUP=y). The side effect of this is
      that it can increase boot time. I am surprised nobody has complained so far.
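      
      A minimal sketch of the fix, assuming the counter lives on cfq_data and
      the group-destruction path already runs under the queue lock (the
      surrounding teardown is elided):
      
      static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
      {
              /* the group is leaving blkcg->blkg_list: drop the linked count
               * so cfq_exit_queue() no longer waits for an rcu grace period
               * it does not need */
              cfqd->nr_blkcg_linked_grps--;
      
              /* existing unlink and rcu-deferred free of cfqg follows */
      }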
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  2. 12 Jul, 2011 (4 commits)
    • CFQ: add think time check for group · 7700fc4f
      By Shaohua Li
      Currently, when the last queue of a group has no request, we don't expire
      the queue, in the hope that a request from the group comes soon, so the
      group doesn't miss its share. But if the think time is big, that assumption
      isn't correct and we just waste bandwidth. In such a case, don't idle.
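      
      Conceptually, the check compares the group's mean think time against the
      idle window we would otherwise spend waiting. A sketch, with sample_valid()
      and the ttime fields as used by CFQ at the time (treat the details as
      illustrative):
      
      static bool cfq_io_thinktime_big(struct cfq_data *cfqd,
                                       struct cfq_ttime *ttime, bool group_idle)
      {
              unsigned long slice;
      
              if (!sample_valid(ttime->ttime_samples))
                      return false;
      
              /* the window we would idle for before expiring the queue */
              slice = group_idle ? cfqd->cfq_group_idle : cfqd->cfq_slice_idle;
      
              /* thinking longer than the window: idling only wastes bandwidth */
              return ttime->ttime_mean > slice;
      }
      
      The fio job used for the numbers below: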
      
      [global]
      runtime=30
      direct=1
      
      [test1]
      cgroup=test1
      cgroup_weight=1000
      rw=randread
      ioengine=libaio
      size=500m
      runtime=30
      directory=/mnt
      filename=file1
      thinktime=9000
      
      [test2]
      cgroup=test2
      cgroup_weight=1000
      rw=randread
      ioengine=libaio
      size=500m
      runtime=30
      directory=/mnt
      filename=file2
      
      	patched		base
      test1	64k		39k
      test2	548k		540k
      total	604k		578k
      
      group1 gets much better throughput because it spends less time waiting.
      
      To check whether the patch changes the behavior of queues without think
      time, I also tried giving test1 a 2ms think time or no think time. The test
      result is stable; the throughput doesn't change with or without the patch.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • CFQ: add think time check for service tree · f5f2b6ce
      By Shaohua Li
      Currently, when the last queue of a service tree has no request, we don't
      expire the queue, in the hope that a request from the service tree comes
      soon, so the service tree doesn't miss its share. But if the think time is
      big, that assumption isn't correct and we just waste bandwidth. In such a
      case, don't idle.
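      
      With think time tracked per service tree (see the cfq_ttime struct under
      383cd721 below), the idle decision can bail out early. A sketch of the
      shape of the check only, not the full idling logic (the helper name is
      illustrative):
      
      static bool cfq_st_thinktime_big(struct cfq_data *cfqd,
                                       struct cfq_rb_root *st)
      {
              /* mean think time beyond the idle window: expiring now beats
               * idling for a request that is unlikely to arrive in time */
              return sample_valid(st->ttime.ttime_samples) &&
                     st->ttime.ttime_mean > cfqd->cfq_slice_idle;
      }
      
      The fio job used for the numbers below: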
      
      [global]
      runtime=10
      direct=1
      
      [test1]
      rw=randread
      ioengine=libaio
      size=500m
      directory=/mnt
      filename=file1
      thinktime=9000
      
      [test2]
      rw=read
      ioengine=libaio
      size=1G
      directory=/mnt
      filename=file2
      
      	patched		base
      test1	41k/s		33k/s
      test2	15868k/s	15789k/s
      total	15902k/s	15817k/s
      
      A slight improvement overall.
      
      To check whether the patch changes the behavior of queues without think
      time, I also tried giving test1 a 2ms think time or no think time. The test
      has some variation even without the patch, but the average throughput
      doesn't change with or without the patch.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • CFQ: move think time check variables to a separate struct · 383cd721
      By Shaohua Li
      Move the variables used for the think time check into a separate struct.
      This prepares for adding think time checks for the service tree and group.
      No functional change.
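      
      The grouping could look like the sketch below (the fields mirror the
      per-cic think time bookkeeping CFQ already kept; treat the exact layout
      as illustrative):
      
      struct cfq_ttime {
              unsigned long last_end_request; /* completion time of last request */
      
              unsigned long ttime_total;      /* sum of observed think times */
              unsigned long ttime_samples;    /* number of samples collected */
              unsigned long ttime_mean;       /* running mean think time */
      };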
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • fixlet: Remove fs_excl from struct task. · 4aede84b
      By Justin TerAvest
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Let's kill it.
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  3. 11 Jul, 2011 (1 commit)
  4. 27 Jun, 2011 (2 commits)
  5. 14 Jun, 2011 (1 commit)
  6. 13 Jun, 2011 (1 commit)
  7. 06 Jun, 2011 (3 commits)
  8. 03 Jun, 2011 (1 commit)
    • iosched: prevent aliased requests from starving other I/O · 796d5116
      By Jeff Moyer
      Hi, Jens,
      
      If you recall, I posted an RFC patch for this back in July of last year:
      http://lkml.org/lkml/2010/7/13/279
      
      The basic problem is that a process can issue a never-ending stream of
      async direct I/Os to the same sector on a device, thus starving out
      other I/O in the system (due to the way the alias handling works in both
      cfq and deadline).  The solution I proposed back then was to start
      dispatching from the fifo after a certain number of aliases had been
      dispatched.  Vivek asked why we had to treat aliases differently at all,
      and I never had a good answer.  So, I put together a simple patch which
      allows aliases to be added to the rb tree (it adds them to the right,
      though that doesn't matter as the order isn't guaranteed anyway).  I
      think this is the preferred solution, as it doesn't break up time slices
      in CFQ or batches in deadline.  I've tested it, and it does solve the
      starvation issue.  Let me know what you think.
      
      Cheers,
      Jeff
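      
      After the change, the elevator rb-tree insert simply sends equal-sector
      requests (aliases) to the right instead of rejecting them; roughly, modeled
      on elv_rb_add() (treat this as a sketch of the post-patch shape):
      
      void elv_rb_add(struct rb_root *root, struct request *rq)
      {
              struct rb_node **p = &root->rb_node;
              struct rb_node *parent = NULL;
              struct request *__rq;
      
              while (*p) {
                      parent = *p;
                      __rq = rb_entry(parent, struct request, rb_node);
      
                      if (blk_rq_pos(rq) < blk_rq_pos(__rq))
                              p = &(*p)->rb_left;
                      else
                              /* aliases go right; their relative order was
                               * never guaranteed anyway */
                              p = &(*p)->rb_right;
              }
      
              rb_link_node(&rq->rb_node, parent, p);
              rb_insert_color(&rq->rb_node, root);
      }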
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  9. 02 Jun, 2011 (1 commit)
  10. 01 Jun, 2011 (1 commit)
  11. 24 May, 2011 (4 commits)
  12. 23 May, 2011 (1 commit)
  13. 21 May, 2011 (4 commits)
    • blk-throttle: Make dispatch stats per cpu · 5624a4e4
      By Vivek Goyal
      Currently we take the blkg lock even just to update the stats. So even if
      a group has no throttling rules (the common case for the root group), we
      end up taking blkg_lock to update the stats.
      
      Make the dispatch stats per-cpu so that they can be updated without taking
      the blkg lock.
      
      If a cpu goes offline, these stats simply disappear. No protection has
      been provided for that yet. Do we really need anything for that?
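      
      A hedged sketch of the idea using the kernel's percpu API (the struct and
      field names are illustrative, not the patch's exact layout):
      
      struct tg_stats_cpu {
              u64 dispatched;         /* bios dispatched from this group */
      };
      
      /* allocated once per group: stats_cpu = alloc_percpu(struct tg_stats_cpu) */
      struct tg_stats_cpu __percpu *stats_cpu;
      
      /* hot path: a preempt-safe per-cpu increment, no blkg_lock taken */
      this_cpu_inc(stats_cpu->dispatched);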
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • blk-cgroup: Allow sleeping while dynamically allocating a group · f469a7b4
      By Vivek Goyal
      Currently, all the cfq_group or throtl_group allocations happen while we
      are holding ->queue_lock, where sleeping is not allowed.
      
      Soon we will move to per-cpu stats and will also need to allocate the
      per-group stats. As alloc_percpu() can sleep and therefore cannot be called
      from atomic context, we need to drop ->queue_lock, allocate the group,
      retake the lock and continue processing.
      
      In the throttling code, I check the queue DEAD flag again to make sure that
      the driver did not call blk_cleanup_queue() in the meantime.
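      
      The drop-allocate-retake pattern, sketched as a fragment of the allocation
      path with the queue flag test of that era's block layer (error handling
      abbreviated):
      
      spin_unlock_irq(q->queue_lock);
      
      /* safe to sleep here; per-cpu stats will need alloc_percpu() too */
      tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);
      
      spin_lock_irq(q->queue_lock);
      
      /* the driver may have run blk_cleanup_queue() while we slept */
      if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
              kfree(tg);
              return NULL;
      }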
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • cfq-iosched: Fix a possible race with cfq cgroup removal code · 56edf7d7
      By Vivek Goyal
      blkg->key = cfqd is an rcu-protected pointer, and hence we used to do
      call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.
      
      The problem here is that even though cfqd is around, there is no guarantee
      that the associated request queue (td->queue) or q->queue_lock is still
      around. A driver might have called blk_cleanup_queue() and released the
      lock.
      
      It might happen that after the lock is released we dereference
      blkg->key->queue->queue_lock and crash. This is possible in the following
      path.
      
      blkiocg_destroy()
       blkio_unlink_group_fn()
        cfq_unlink_blkio_group()
      
      Hence, wait for an rcu grace period if there are groups which have not
      been unlinked from blkcg->blkg_list. That way, any groups which are taking
      the cfq_unlink_blkio_group() path can safely take the queue lock.
      
      This is how the race has been taken care of in the throttling logic as
      well.
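      
      In queue-exit terms the wait can be shaped like this; a sketch of the tail
      of cfq_exit_queue(), assuming the nr_blkcg_linked_grps counter from the
      surrounding patches (the rest of the teardown is elided):
      
      static void cfq_exit_queue(struct elevator_queue *e)
      {
              struct cfq_data *cfqd = e->elevator_data;
              bool wait;
      
              spin_lock_irq(cfqd->queue->queue_lock);
              /* groups still linked may yet run cfq_unlink_blkio_group()
               * and take this queue lock */
              wait = cfqd->nr_blkcg_linked_grps;
              spin_unlock_irq(cfqd->queue->queue_lock);
      
              /* let in-flight unlink paths finish before cfqd goes away */
              if (wait)
                      synchronize_rcu();
      
              kfree(cfqd);
      }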
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • cfq-iosched: Get rid of redundant function parameter "create" · 3e59cf9d
      By Vivek Goyal
      Nobody seems to be using the cfq_find_alloc_cfqg() function parameter
      "create". Get rid of it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  14. 16 May, 2011 (1 commit)
    • blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup · 70087dc3
      By Vivek Goyal
      Currently we first map the task to a cgroup and then the cgroup to the
      blkio_cgroup. There is a more direct way to get to the blkio_cgroup from
      a task, using task_subsys_state(). Use that.
      
      The real reason for the fix is that it also avoids a race in generic
      cgroup code. During remount/umount, rebind_subsystems() is called and it
      can do the following without any rcu protection.
      
      cgrp->subsys[i] = NULL;
      
      That means that if somebody got hold of the cgroup under rcu and then
      tried to go through cgroup->subsys[] to get to the blkio_cgroup, it would
      get NULL, which is wrong. I was running into this race condition with ltp
      running on an upstream-derived kernel, and it led to a crash.
      
      So ideally we should also fix the generic cgroup code to wait for an rcu
      grace period before setting the pointer to NULL. Li Zefan is not very keen
      on introducing synchronize_rcu() there, as he thinks it will slow down
      mount/remount/umount operations.
      
      So for the time being, at least fix the kernel crash by taking a more
      direct route to the blkio_cgroup.
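      
      The direct route amounts to a helper along these lines, built on
      task_subsys_state() (a sketch; the helper name is illustrative):
      
      struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
      {
              /* go straight from the task to the subsystem state, skipping
               * the task -> cgroup -> subsys[] hop that races with rebind */
              return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                  struct blkio_cgroup, css);
      }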
      
      One tester had reported a crash while running LTP on a derived kernel;
      with this fix the crash is no longer seen, and the test has been running
      for over 6 days.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  15. 19 Apr, 2011 (1 commit)
  16. 18 Apr, 2011 (1 commit)
  17. 31 Mar, 2011 (1 commit)
  18. 23 Mar, 2011 (3 commits)
  19. 17 Mar, 2011 (1 commit)
  20. 12 Mar, 2011 (1 commit)
  21. 10 Mar, 2011 (1 commit)
  22. 07 Mar, 2011 (3 commits)
  23. 02 Mar, 2011 (2 commits)
    • block: add @force_kblockd to __blk_run_queue() · 1654e741
      By Tejun Heo
      __blk_run_queue() automatically either calls q->request_fn() directly or
      schedules kblockd, depending on whether the function is recursed. The
      blk-flush implementation needs to be able to explicitly choose kblockd.
      Add @force_kblockd.
      
      All the current users are converted to specify %false for the parameter,
      and this patch doesn't introduce any behavior change.
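      
      The resulting signature and typical call sites, per the description above
      (%false in kernel-doc notation is plain false in C):
      
      /* run the queue now, or punt to kblockd when force_kblockd is true
       * (or when request_fn recursion is detected) */
      void __blk_run_queue(struct request_queue *q, bool force_kblockd);
      
      /* existing callers keep today's behavior */
      __blk_run_queue(q, false);
      
      /* blk-flush can now explicitly defer to kblockd */
      __blk_run_queue(q, true);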
      
      stable: This is prerequisite for fixing ide oops caused by the new
              blk-flush implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • cfq-iosched: Always provide group isolation. · 0bbfeb83
      By Justin TerAvest
      Effectively, make group_isolation=1 the default and remove the tunable.
      The group_isolation=0 setting existed because, by default, we idle on the
      sync-noidle tree, and on fast devices this can be very harmful for
      throughput.
      
      However, this problem can also be addressed by tuning slice_idle and
      possibly group_idle on faster storage devices.
      
      This change simplifies the CFQ code by removing the feature entirely.
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>