1. 31 Mar, 2011 1 commit
  2. 26 Mar, 2011 2 commits
  3. 23 Mar, 2011 5 commits
  4. 21 Mar, 2011 1 commit
    • block: attempt to merge with existing requests on plug flush · 5e84ea3a
      Committed by Jens Axboe
      One of the disadvantages of on-stack plugging is that we potentially
      lose out on merging since all pending IO isn't always visible to
      everybody. When we flush the on-stack plugs, right now we don't do
      any checks to see if potential merge candidates could be utilized.
      
      Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
      It works just like ELEVATOR_INSERT_SORT, but first checks whether we
      can merge with an existing request, and only does the insertion if
      merging fails.
      
      This fixes a regression with multiple processes issuing IO that
      can be merged.
      
      Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
      an accounting bug.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
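
The "merge first, otherwise sorted insert" idea from the commit above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct req`, `struct queue`, `try_merge()`, `insert_sort()` and `insert_sort_merge()` are hypothetical names, and the array-backed queue stands in for the real elevator data structures, which the commit message does not describe.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REQS 16

/* Hypothetical, simplified request: a contiguous run of sectors. */
struct req {
    unsigned long start;   /* first sector */
    unsigned long len;     /* number of sectors */
};

/* Hypothetical queue, kept sorted by start sector. */
struct queue {
    struct req reqs[MAX_REQS];
    size_t nr;
};

/* Try to merge rq into an existing request; returns 1 on success. */
static int try_merge(struct queue *q, const struct req *rq)
{
    for (size_t i = 0; i < q->nr; i++) {
        struct req *cur = &q->reqs[i];

        if (cur->start + cur->len == rq->start) {   /* back merge */
            cur->len += rq->len;
            return 1;
        }
        if (rq->start + rq->len == cur->start) {    /* front merge */
            cur->start = rq->start;
            cur->len += rq->len;
            return 1;
        }
    }
    return 0;
}

/* Plain sorted insertion; returns 0 on success, -1 if the queue is full. */
static int insert_sort(struct queue *q, const struct req *rq)
{
    size_t pos = q->nr;

    if (q->nr == MAX_REQS)
        return -1;
    while (pos > 0 && q->reqs[pos - 1].start > rq->start) {
        q->reqs[pos] = q->reqs[pos - 1];
        pos--;
    }
    q->reqs[pos] = *rq;
    q->nr++;
    return 0;
}

/* Sorted insertion that first looks for a merge candidate. */
static int insert_sort_merge(struct queue *q, const struct req *rq)
{
    if (try_merge(q, rq))
        return 0;
    return insert_sort(q, rq);
}
```

With this shape, two processes submitting adjacent sector ranges end up as one request in the queue rather than two, which is the effect the commit message attributes to the new insert variant.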
  5. 17 Mar, 2011 1 commit
  6. 12 Mar, 2011 2 commits
  7. 11 Mar, 2011 1 commit
    • block: fix mis-synchronisation in blkdev_issue_zeroout() · 0aeea189
      Committed by Lukas Czerner
      BZ29402
      https://bugzilla.kernel.org/show_bug.cgi?id=29402
      
      We can hit serious mis-synchronization in the bio completion path of
      blkdev_issue_zeroout(), leading to a panic.

      The problem is that before calling wait_for_completion() in
      blkdev_issue_zeroout() we check whether bb.done equals issued (the
      number of submitted bios). If it does, we can skip the
      wait_for_completion() and just return from the function since there is
      nothing to wait for. However, there is an ordering problem because
      bio_batch_end_io() calls atomic_inc(&bb->done) before complete(), hence
      it might seem to blkdev_issue_zeroout() that all bios have been
      completed, and it exits. At that point, when bio_batch_end_io() goes on
      to call complete(bb->wait), bb and wait no longer exist since they were
      allocated on the stack in blkdev_issue_zeroout() ==> panic!
      
      (thread 1)                      (thread 2)
      bio_batch_end_io()              blkdev_issue_zeroout()
        if(bb) {                      ...
          if (bb->end_io)             ...
            bb->end_io(bio, err);     ...
          atomic_inc(&bb->done);      ...
          ...                         while (issued != atomic_read(&bb.done))
          ...                         (let issued == bb.done)
          ...                         (do the rest of the function)
          ...                         return ret;
          complete(bb->wait);
          ^^^^^^^^
          panic
      
      We can fix this easily by simplifying bio_batch and completion counting.
      
      Also remove bio_end_io_t *end_io since it is not used.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reported-by: Eric Whitney <eric.whitney@hp.com>
      Tested-by: Eric Whitney <eric.whitney@hp.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
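
The commit message does not spell out how the completion counting was simplified. One common way to avoid this kind of race, shown here purely as an assumption, is a single counter that starts at one (the submitter's own reference) and is only ever decremented by the completion side; whoever drops it to zero is the last to touch the batch. The sketch below is a single-threaded user-space model using C11 atomics; `struct batch` and the `batch_*` helpers are hypothetical names, and the `completed` flag stands in for `complete()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the on-stack batch state. */
struct batch {
    atomic_int pending;   /* outstanding bios, plus 1 for the submitter */
    bool completed;       /* models complete(): set by the last bio */
};

static void batch_init(struct batch *b)
{
    atomic_init(&b->pending, 1);   /* the submitter's own reference */
    b->completed = false;
}

/* Submitter: account for one more bio before sending it out. */
static void batch_submit_one(struct batch *b)
{
    atomic_fetch_add(&b->pending, 1);
}

/* Completion side: signal only when dropping the very last reference. */
static void batch_end_io(struct batch *b)
{
    if (atomic_fetch_sub(&b->pending, 1) == 1)
        b->completed = true;       /* would be complete() in the kernel */
}

/*
 * Submitter, after issuing everything: drop its own reference.
 * Returns true if bios are still outstanding and it has to wait.
 */
static bool batch_must_wait(struct batch *b)
{
    return atomic_fetch_sub(&b->pending, 1) != 1;
}
```

In this model a completion handler that is not the last one never touches the batch after its decrement, and the submitter only skips the wait when every handler has already finished its decrement, so the stack frame cannot disappear underneath a late `complete()`.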
  8. 10 Mar, 2011 6 commits
    • blk-throttle: Use blk_plug in throttle dispatch · 69d60eb9
      Committed by Vivek Goyal
      Also use a plug in throttle dispatch, as we are dispatching a bunch of
      bios in throttle context and some of them might merge.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: kill off REQ_UNPLUG · 721a9602
      Committed by Jens Axboe
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: remove per-queue plugging · 7eaceacc
      Committed by Jens Axboe
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So let's kill off the old plugging along with aops->sync_page().
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: initial patch for on-stack per-task plugging · 73c10101
      Committed by Jens Axboe
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
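
The batching described in the commit above can be modeled in a few lines of user-space C. This is a hedged sketch, not the kernel API: `struct plug`, `struct task`, `struct dev_queue` and all helper functions are hypothetical, and `lock_taken` merely counts how often the (imaginary) device queue lock would be acquired.

```c
#include <assert.h>
#include <stddef.h>

#define PLUG_MAX  8
#define QUEUE_MAX 32

/* Hypothetical per-task batch that lives on the submitter's stack. */
struct plug {
    int pending[PLUG_MAX];
    size_t nr;
};

/* Hypothetical task: points at the active plug, or NULL if unplugged. */
struct task {
    struct plug *plug;
};

/* Hypothetical device queue; lock_taken counts lock round-trips. */
struct dev_queue {
    int ios[QUEUE_MAX];
    size_t nr;
    unsigned lock_taken;
};

/* Move every batched IO to the device queue under one lock acquisition. */
static void flush_plug(struct plug *plug, struct dev_queue *q)
{
    if (plug->nr == 0)
        return;
    q->lock_taken++;
    for (size_t i = 0; i < plug->nr; i++) {
        assert(q->nr < QUEUE_MAX);   /* sketch: no overflow handling */
        q->ios[q->nr++] = plug->pending[i];
    }
    plug->nr = 0;
}

static void start_plug(struct task *t, struct plug *plug)
{
    plug->nr = 0;
    t->plug = plug;
}

static void finish_plug(struct task *t, struct dev_queue *q)
{
    if (!t->plug)
        return;
    flush_plug(t->plug, q);
    t->plug = NULL;
}

/* Called when the task is about to sleep: auto-unplug pending IO. */
static void task_schedule(struct task *t, struct dev_queue *q)
{
    if (t->plug)
        flush_plug(t->plug, q);
}

static void submit_io(struct task *t, struct dev_queue *q, int io)
{
    if (t->plug) {
        if (t->plug->nr == PLUG_MAX)     /* batch full: flush first */
            flush_plug(t->plug, q);
        t->plug->pending[t->plug->nr++] = io;
        return;
    }
    q->lock_taken++;                     /* unplugged: one lock per IO */
    assert(q->nr < QUEUE_MAX);
    q->ios[q->nr++] = io;
}
```

Because the plug is reachable from the task, a scheduling hook can flush it without the caller having to remember to unplug before waiting, which is the API problem the commit message describes for the old implicit scheme.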
    • block: add API for delaying work/request_fn a little bit · 3cca6dc1
      Committed by Jens Axboe
      Currently we use plugging for that, but as plugging is going away,
      we need an alternative mechanism.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: Don't implicitly trigger event check on disk_unblock_events() · facc31dd
      Committed by Tejun Heo
      Currently, disk_unblock_events() implicitly kicks an event check if the
      block count reaches zero.  This behavior is not described in the
      comment and hinders future changes.  Make the unblocker
      explicitly check events by calling disk_check_events() as necessary.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
  9. 09 Mar, 2011 1 commit
  10. 08 Mar, 2011 2 commits
  11. 07 Mar, 2011 3 commits
  12. 03 Mar, 2011 3 commits
    • block: Move blk_throtl_exit() call to blk_cleanup_queue() · da527770
      Committed by Vivek Goyal
      Move blk_throtl_exit() into blk_cleanup_queue(), as blk_throtl_exit() is
      written in such a way that it needs the queue lock. In
      blk_release_queue() there is no guarantee that ->queue_lock is still
      around.
      
      Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
      one problem.
      
        https://lkml.org/lkml/2010/10/23/86
      
        And a quick fix moved blk_throtl_exit() to blk_release_queue().
      
              commit 7ad58c02
              Author: Jens Axboe <jaxboe@fusionio.com>
              Date:   Sat Oct 23 20:40:26 2010 +0200
      
              block: fix use-after-free bug in blk throttle code
      
      This patch reverts the above change and does not try to shut down the
      throtl work in blk_sync_queue(). By avoiding the call to
      throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
      the problem reported by Ingo.
      
      blk_sync_queue() seems to be used only by md driver and it seems to be
      using it to make sure q->unplug_fn is not called as md registers its
      own unplug functions and it is about to free up the data structures
      used by unplug_fn(). Block throttle does not call back into unplug_fn()
      or into md. So there is no need to cancel blk throttle work.
      
      In fact I think cancelling block throttle work is bad because it might
      happen that some bios are throttled and scheduled to be dispatched later
      with the help of pending work and if work is cancelled, these bios might
      never be dispatched.
      
      Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
      blk_release_queue() time. That should be safe as we are also calling
      blk_throtl_exit() which should make sure all the throttling related
      data structures are cleaned up.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: Initialize ->queue_lock to internal lock at queue allocation time · c94a96ac
      Committed by Vivek Goyal
      There does not seem to be a clear convention whether q->queue_lock is
      initialized or not when blk_cleanup_queue() is called. In the past it
      was not necessary but now blk_throtl_exit() takes up queue lock by
      default and needs queue lock to be available.
      
      In fact, the elevator_exit() code has a similar requirement; it is just
      less stringent in the sense that elevator_exit() is called only if
      the elevator is initialized.
      
      Two problems have been noticed because of ambiguity about spin lock
      status.
      
            - If a driver calls blk_alloc_queue() and then calls
              blk_cleanup_queue() almost immediately (because some other
              driver structure allocation failed or some other error happened),
              then blk_throtl_exit() will run into issues as the queue lock is
              not initialized. The loop driver ran into this issue recently and
              I noticed error paths in the md driver too. Similar error paths
              should exist in other drivers too.

            - If some driver provided an external spin lock and zapped the lock
              before blk_cleanup_queue(), then it can lead to issues.
      
      So this patch initializes the default queue lock at queue allocation time.
      
      The block throttling code is one of the users of the queue lock and it
      is initialized at queue allocation time, so it makes sense to
      initialize ->queue_lock to the internal lock as well. A driver can
      override that lock later. This means a driver does not have to worry
      about initializing the queue lock to the default before calling
      blk_cleanup_queue().
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
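
The "default to the internal lock at allocation time" convention from the commit above can be illustrated with a small user-space model. Everything here is hypothetical: `struct lock` only tracks whether it was initialized, and `struct queue`, `queue_alloc_init()`, `queue_set_lock()` and `queue_cleanup()` are illustrative names rather than kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a spinlock; tracks only initialization. */
struct lock {
    int initialized;
};

static void lock_init(struct lock *l)
{
    l->initialized = 1;
}

/* Hypothetical queue: embeds a default lock plus a pointer to the lock in use. */
struct queue {
    struct lock internal_lock;
    struct lock *queue_lock;
};

/* Allocation time: the lock pointer is valid from the very start. */
static void queue_alloc_init(struct queue *q)
{
    lock_init(&q->internal_lock);
    q->queue_lock = &q->internal_lock;
}

/* A driver may still override the default with its own lock later. */
static void queue_set_lock(struct queue *q, struct lock *driver_lock)
{
    if (driver_lock)
        q->queue_lock = driver_lock;
}

/* Cleanup needs a usable lock; returns 0 on success, -1 otherwise. */
static int queue_cleanup(struct queue *q)
{
    if (!q->queue_lock || !q->queue_lock->initialized)
        return -1;
    /* ... take the lock and tear down throttling state here ... */
    return 0;
}
```

With the pointer set up during allocation, an error path that allocates a queue and cleans it up right away no longer depends on whether a later initialization step ever ran.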
    • block/genhd: Change some numerals into macros · 53f22956
      Committed by Liu Yuan
      Replace the numerals in diskstats_show() with macros.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  13. 02 Mar, 2011 5 commits
  14. 25 Feb, 2011 1 commit
  15. 24 Feb, 2011 1 commit
    • Fix over-zealous flush_disk when changing device size. · 93b270f7
      Committed by NeilBrown
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data we hold becomes irrelevant.
      In the other, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      
      flush_disk calls __invalidate_device.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      
      invalidate_bdev *does*not* discard dirty pages, but I don't really care
      about that at present.
      
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      
      flush_disk then passes true from check_disk_change and false from
      check_disk_size_change.

      dm avoids tripping over this problem by calling i_size_write directly
      rather than using check_disk_size_change.
      
      md does use check_disk_size_change and so is affected.
      
      This regression was introduced by commit 608aeef1 which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      
      Cc: stable@kernel.org
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NeilBrown <neilb@suse.de>
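
The effect of the new flag described in the commit above can be sketched as follows. This is a user-space model only: `struct inode_model` and every `*_model` or `on_*` function are hypothetical names chosen for illustration, and real inode invalidation involves far more state than two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached inode: only the state relevant to the sketch. */
struct inode_model {
    bool cached;
    bool dirty;
};

/*
 * Drop cached inodes. Dirty ones are dropped only when kill_dirty is set.
 * Returns the number of inodes dropped.
 */
static size_t invalidate_inodes_model(struct inode_model *inodes, size_t n,
                                      bool kill_dirty)
{
    size_t dropped = 0;

    for (size_t i = 0; i < n; i++) {
        if (!inodes[i].cached)
            continue;
        if (inodes[i].dirty && !kill_dirty)
            continue;            /* keep data that still needs writing */
        inodes[i].cached = false;
        inodes[i].dirty = false;
        dropped++;
    }
    return dropped;
}

static size_t flush_disk_model(struct inode_model *inodes, size_t n,
                               bool kill_dirty)
{
    return invalidate_inodes_model(inodes, n, kill_dirty);
}

/* Device disappeared: nothing can be written back, so drop everything. */
static size_t on_disk_change(struct inode_model *inodes, size_t n)
{
    return flush_disk_model(inodes, n, true);
}

/* Device only changed size: dirty data must survive. */
static size_t on_disk_size_change(struct inode_model *inodes, size_t n)
{
    return flush_disk_model(inodes, n, false);
}
```

The two wrappers make the asymmetry explicit: a resize keeps dirty entries around for write-back, while a media change discards them because there is no longer anywhere safe to write them.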
  16. 13 Feb, 2011 1 commit
  17. 11 Feb, 2011 2 commits
  18. 09 Feb, 2011 1 commit
    • cfq-iosched: Don't wait if queue already has requests. · 02a8f01b
      Committed by Justin TerAvest
      Commit 7667aa06 added logic to wait for
      the last queue of the group to become busy (have at least one request),
      so that the group does not lose out for not being continuously
      backlogged. The commit did not check for the condition that the last
      queue already has some requests. As a result, if the queue already has
      requests, wait_busy is set. Later on, cfq_select_queue() checks the
      flag, and decides that since the queue has a request now and wait_busy
      is set, the queue is expired.  This results in early expiration of the
      queue.
      
      This patch fixes the problem by adding a check to see if queue already
      has requests. If it does, wait_busy is not set. As a result, time slices
      do not expire early.
      
      The queues with more than one request are usually buffered writers.
      Testing shows improvement in isolation between buffered writers.
      
      Cc: stable@kernel.org
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
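
The added check from the commit above boils down to one early return. The sketch below is an assumption-laden simplification: `struct cfqq_model` and its fields are hypothetical, and the remaining conditions of the real decision are collapsed into two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of a CFQ queue, reduced to what the sketch needs. */
struct cfqq_model {
    unsigned nr_queued;       /* requests currently queued */
    bool last_in_group;       /* only queue left in its group */
    bool slice_nearly_used;   /* time slice is about to run out */
};

/* Decide whether to wait for the queue to become busy. */
static bool should_wait_busy(const struct cfqq_model *q)
{
    /* The fix: a queue that already has requests is busy, so do not wait. */
    if (q->nr_queued > 0)
        return false;
    return q->last_in_group && q->slice_nearly_used;
}
```

Without the early return, a queue holding requests would get the wait flag set and then be expired as soon as the selector saw "has a request and wait flag set", which is the early expiration the commit message describes.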
  19. 25 Jan, 2011 1 commit
    • block: reimplement FLUSH/FUA to support merge · ae1b1539
      Committed by Tejun Heo
      The current FLUSH/FUA support has evolved from the implementation
      which had to perform queue draining.  As such, sequencing is done
      queue-wide one flush request after another.  However, with the
      draining requirement gone, there's no reason to keep the queue-wide
      sequential approach.
      
      This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
      request is sequenced individually.  The actual FLUSH execution is
      double buffered and whenever a request wants to execute one for either
      PRE or POSTFLUSH, it queues on the pending queue.  Once certain
      conditions are met, a flush request is issued and on its completion
      all pending requests proceed to the next sequence.
      
      This allows arbitrary merging of different types of flushes.  How they
      are merged can be primarily controlled and tuned by adjusting the
      above-mentioned 'conditions' used to determine when to issue the next
      flush.
      
      This is inspired by Darrick's patches to merge multiple zero-data
      flushes which helps workloads with highly concurrent fsync requests.
      
      * As flush requests are never put on the IO scheduler, request fields
        used for flush share space with rq->rb_node.  rq->completion_data is
        moved out of the union.  This increases the request size by one
        pointer.
      
        As rq->elevator_private* are used only by the iosched too, it is
        possible to reduce the request size further.  However, to do that,
        we need to modify request allocation path such that iosched data is
        not allocated for flush requests.
      
      * FLUSH/FUA processing happens on insertion now instead of dispatch.
      
      - Comments updated as per Vivek and Mike.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Darrick J. Wong" <djwong@us.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
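
The double-buffered flush sequencing described in the commit above can be modeled with two waiter counts and two indices. This is a rough sketch under assumptions: `struct flush_model` and its helpers are hypothetical, and the "certain conditions" for issuing a flush are reduced here to "no flush in flight and at least one waiter", which the commit message does not confirm.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of double-buffered flush sequencing. */
struct flush_model {
    unsigned waiting[2];        /* waiters on each of the two lists */
    int pending_idx;            /* list collecting new waiters */
    int running_idx;            /* list served by the in-flight flush */
    unsigned flushes_issued;
    unsigned requests_advanced;
};

/* Issue a flush if none is in flight and somebody is waiting. */
static bool flush_kick(struct flush_model *f)
{
    if (f->pending_idx != f->running_idx)   /* a flush is in flight */
        return false;
    if (f->waiting[f->pending_idx] == 0)
        return false;
    f->pending_idx ^= 1;                    /* new waiters use the other list */
    f->flushes_issued++;
    return true;
}

/* A request that needs a flush joins the pending list. */
static void flush_queue_request(struct flush_model *f)
{
    f->waiting[f->pending_idx]++;
    flush_kick(f);
}

/* Flush completion: every waiter on the running list advances together. */
static unsigned flush_complete(struct flush_model *f)
{
    unsigned n;

    if (f->pending_idx == f->running_idx)   /* nothing in flight */
        return 0;
    n = f->waiting[f->running_idx];
    f->waiting[f->running_idx] = 0;
    f->running_idx ^= 1;
    f->requests_advanced += n;
    flush_kick(f);                          /* serve waiters that piled up */
    return n;
}
```

In this model, requests that arrive while a flush is in flight share the next one, so four flush-needing requests can be served by two actual flushes; that is the kind of merging the commit message is aiming at for concurrent fsync workloads.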