1. 08 Feb 2012 (1 commit)
    • block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Committed by Tejun Heo
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      bool.
      
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
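
      A minimal sketch of the resulting call pattern (a hypothetical caller,
      not the actual block-core code):

      /* sketch: merge eligibility + direction after the split */
      static int try_merge_bio(struct request *rq, struct bio *bio)
      {
              if (!elv_rq_merge_ok(rq, bio))  /* blk_rq_merge_ok() + allow_merge */
                      return ELEVATOR_NO_MERGE;

              return blk_try_merge(rq, bio);  /* BACK/FRONT/NO_MERGE */
      }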
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 27 Jan 2012 (1 commit)
  3. 14 Dec 2011 (9 commits)
    • block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Committed by Tejun Heo
      Now block layer knows everything necessary to create and associate
      icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
      get_request() such that, if elevator_type->icq_size is set, requests
      are automatically associated with their matching icq's before
      elv_set_request().  io_context reference is also managed by block core
      on request alloc/free.
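
      A hedged sketch of what an elevator declares so that the block core can
      do this association automatically (field values mirror what cfq uses;
      the ops shown are illustrative):

      static struct elevator_type iosched_cfq = {
              .ops = {
                      .elevator_set_req_fn    = cfq_set_request,
                      /* other elevator_ops ... */
              },
              /* block core allocates one icq of this size per (ioc, q) pair */
              .icq_size       = sizeof(struct cfq_io_cq),
              .icq_align      = __alignof__(struct cfq_io_cq),
              .elevator_name  = "cfq",
              .elevator_owner = THIS_MODULE,
      };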
      
      * Only ioprio/cgroup change handling remains from cfq_get_cic().
        Collapsed into cfq_set_request().
      
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect queue kicking
        achieved was possibly accelerating queue processing, this change
        shouldn't be noticeable.

        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icqs is unbounded and thus mempool can't be
        the solution either.  This effectively adds an allocation dependency
        on the memory free path and thus the possibility of deadlock.

        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has an icq.
      
        However, this is still possible especially if elevator is being
        switched under high memory pressure, so we better get it fixed.
        Probably the only solution is just bypassing elevator and appending
        to dispatch queue on any elevator allocation failure.
      
      * Comment added to explain how icq's are managed and synchronized.
      
      This completes cleanup of io_context interface.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Committed by Tejun Heo
      With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
      blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
      with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
      and q, and freeing is handled automatically by blk-ioc.  The elevator
      operation only needs to perform the exit work specific to the elevator
      - in cfq's case, exiting the cfqqs.
      
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      
      Because the q io_cq points to might be freed before RCU callback for
      the io_cq runs, blk-ioc code should remember to which cache the io_cq
      needs to be freed when the io_cq is released.  New field
      io_cq->__rcu_icq_cache is added for this purpose.  As both the new
      field and rcu_head are used only after io_cq is released and the
      q/ioc_node fields aren't, they are put into unions.
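
      The resulting layout looks roughly like this (a sketch of struct io_cq
      as described above; the unions work because the q_node/ioc_node links
      are no longer used once the icq has been released):

      struct io_cq {
              struct request_queue    *q;
              struct io_context       *ioc;

              union {
                      struct list_head  q_node;           /* linked on q */
                      struct kmem_cache *__rcu_icq_cache; /* used after release */
              };
              union {
                      struct hlist_node ioc_node;         /* linked on ioc */
                      struct rcu_head   __rcu_head;       /* used after release */
              };

              unsigned int            flags;
      };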
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move io_cq lookup to blk-ioc.c · 47fdd4ca
      Committed by Tejun Heo
      Now that all io_cq related data structures are in block core layer,
      io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.
      
      Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
      parameter and return type changes (cfqd -> request_queue, cfq_io_cq ->
      io_cq), and cfq_cic_lookup() becomes a thin wrapper around
      ioc_lookup_icq().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove elevator_queue->ops · 22f746e2
      Committed by Tejun Heo
      elevator_queue->ops points to the same ops struct that
      ->elevator_type.ops is pointing to.  The only effect of caching it in
      elevator_queue is shorter notation - it doesn't save any indirect
      dereference.

      Relocate elevator_type->list, which is used only during module
      init/exit, to the end of the structure, rename
      elevator_queue->elevator_type to ->type, and replace
      elevator_queue->ops with elevator_queue->type->ops.
      
      This doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: replace current_io_context() with create_io_context() · f2dbd76a
      Committed by Tejun Heo
      When called under queue_lock, current_io_context() triggers lockdep
      warning if it hits allocation path.  This is because io_context
      installation is protected by task_lock which is not IRQ safe, so it
      triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
      deadlock warning.
      
      Given the restriction, accessor + creator rolled into one doesn't work
      too well.  Drop current_io_context() and let the users access
      task->io_context directly inside queue_lock combined with explicit
      creation using create_io_context().
      
      Future ioc updates will further consolidate ioc access and the create
      interface will be unexported.
      
      While at it, relocate ioc internal interface declarations in blk.h and
      add section comments before and after.
      
      This patch does not introduce functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: misc updates to blk_get_queue() · 09ac46c4
      Committed by Tejun Heo
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      
      * Make sure the caller checks for the return value.
      
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      
      This patch doesn't introduce any functional changes.
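
      A hedged sketch of the resulting caller pattern (hypothetical caller):

      /* sketch: take a reference only if the queue isn't dead */
      if (!blk_get_queue(q))
              return -ENXIO;          /* queue is going away */
      /* ... use q ... */
      blk_put_queue(q);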
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: make ioc get/put interface more conventional and fix race on allocation · 6e736be7
      Committed by Tejun Heo
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from the local task while the latter can be called from
      a different task.  The synchronization between them is peculiar and
      dubious.
      
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
        assignment.
      
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  I.e., if ioc itself were
        being dereferenced as "ioc->xxx", it would mean something (not sure
        what, though), but as the code currently stands, the dependent read
        barrier is a noop.
      
      As only one of the two test-assignment sequences is task_lock()
      protected, task_lock() can't do much about the race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() from each
      allocating its own ioc for the same task and overwriting the other's.

      Also, set_task_ioprio() can race with an exiting task and create a
      new ioc after exit_io_context() is finished.
      
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub structures without impacting anything.
      
      This patch updates ioc get/put so that it becomes more conventional.
      
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to ioc of another task.
        On return, the caller has an explicit reference to the object which
        should be put using put_io_context() afterwards.
      
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
        patch.
      
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
        specification.
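
      A hedged usage sketch of the new interface (hypothetical caller; the
      node argument only matters when a new ioc has to be allocated):

      struct io_context *ioc;

      /* may return NULL if the task is exiting or allocation fails */
      ioc = get_task_io_context(task, GFP_KERNEL, NUMA_NO_NODE);
      if (ioc) {
              /* ... look at or update ioc ... */
              put_io_context(ioc);
      }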
      
      -v2: Vivek spotted contamination from debug patch.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move cfqd->cic_index to q->id · a73f730d
      Committed by Tejun Heo
      cfq allocates per-queue id using ida and uses it to index cic radix
      tree from io_context.  Move it to q->id and allocate on queue init and
      free on queue release.  This simplifies cfq a bit and will allow for
      further improvements of io context life-cycle management.
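
      A hedged sketch of the allocation described above (assuming the
      ida_simple_* helpers and a static blk_queue_ida; error handling
      elided):

      /* on queue init */
      q->id = ida_simple_get(&blk_queue_ida, 0, 0, GFP_KERNEL);

      /* on queue release */
      ida_simple_remove(&blk_queue_ida, q->id);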
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add blk_queue_dead() · 34f6055c
      Committed by Tejun Heo
      There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
      macro and use it.
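
      A minimal sketch of the helper, in the style of the other queue-flag
      test macros:

      #define blk_queue_dead(q)       test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)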
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 19 Oct 2011 (4 commits)
    • block: fix request_queue lifetime handling by making blk_cleanup_queue() properly shutdown · c9a929dd
      Committed by Tejun Heo
      request_queue is refcounted but actually depends on lifetime
      management from the queue owner - on blk_cleanup_queue(), the block
      layer expects that there's no request passing through the
      request_queue and that no new ones will arrive.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_cleanup_queue() destroys the queue and
      elevator whether IOs are in progress or not, and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immediately failed.  The rest of teardown
      happens on release.
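
      A minimal sketch of the owner-side pattern this separation implies:

      /* shutdown: drain pending requests, mark the queue dead, partial teardown */
      blk_cleanup_queue(q);

      /* release: the rest of the teardown runs when the last reference goes away */
      blk_put_queue(q);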
      
      This patch makes the following changes to make blk_cleanup_queue()
      behave as a proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and frees td when the queue
        is released.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: reorganize throtl_get_tg() and blk_throtl_bio() · bc16a4f9
      Committed by Tejun Heo
      blk_throtl_bio() and throtl_get_tg() have a rather unusual interface.
      
      * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
        and drops queue_lock in the latter case.  Different locking context
        depending on return value is error-prone and DEAD state is scheduled
        to be protected by queue_lock anyway.  Move DEAD check inside
        queue_lock and return valid tg or NULL.
      
      * blk_throtl_bio() indicates its return status both with its return
        value and with the in/out param **@bio.  The former is used to
        indicate whether the queue was found to be dead during throtl
        processing; the latter, whether the bio was throttled.
      
        There's no point in returning DEAD check result from
        blk_throtl_bio().  The queue can die after blk_throtl_bio() is
        finished but before make_request_fn() grabs queue lock.
      
        Make it take *@bio instead and return boolean result indicating
        whether the request is throttled or not.
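
        A hedged sketch of the new calling convention (hypothetical call
        site in the make_request path):

        if (blk_throtl_bio(q, bio))
                return;         /* bio was throttled; blk-throtl owns it now */
        /* otherwise continue issuing the bio normally */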
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: reorganize queue draining · e3c78ca5
      Committed by Tejun Heo
      Reorganize queue draining related code in preparation of queue exit
      changes.
      
      * Factor out actual draining from elv_quiesce_start() to
        blk_drain_queue().
      
      * Make elv_quiesce_start/end() responsible for their own locking.
      
      * Replace open-coded ELVSWITCH clearing in elevator_switch() with
        elv_quiesce_end().
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: move blk_throtl prototypes to block/blk.h · bc9fcbf9
      Committed by Tejun Heo
      The blk_throtl interface is block-internal and there's no reason to
      have its prototypes in linux/blkdev.h.  Move them to block/blk.h.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 16 Aug 2011 (1 commit)
    • block: fix flush machinery for stacking drivers with differing flush flags · 4853abaa
      Committed by Jeff Moyer
      Commit ae1b1539, block: reimplement
      FLUSH/FUA to support merge, introduced a performance regression when
      running any sort of fsyncing workload using dm-multipath and certain
      storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
      dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
      that dm-multipath always advertised flush+fua support, and passed
      commands on down the stack, where those flags used to get stripped off.
      The above commit changed that behavior:
      
      static inline struct request *__elv_next_request(struct request_queue *q)
      {
              struct request *rq;
      
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
                      }
      
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
      {
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
              REQ_FUA);
              unsigned skip = 0;
      ...
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
      		if (!has_fua)
      			rq->cmd_flags &= ~REQ_FUA;
      	        return rq;
      	}
      
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
      
      The agreed upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution, and
      make it function as designed.
      
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      appreciated.
      
      Cheers,
      Jeff
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  6. 19 May 2011 (1 commit)
  7. 07 May 2011 (1 commit)
    • block: hold queue if flush is running for non-queueable flush drive · 3ac0cc45
      Committed by shaohua.li@intel.com
      In some drives, flush requests are non-queueable: while a flush
      request is running, normal read/write requests can't run.  If the
      block layer dispatches such a request, the driver can't handle it and
      has to requeue it.  Tejun suggested we hold the queue while a flush is
      running.  This avoids the unnecessary requeue and can also improve
      performance.  For example, say we have requests flush1, write1,
      flush2.  flush1 is dispatched, then the queue is held and write1 isn't
      inserted into the queue.  After flush1 finishes, flush2 is dispatched.
      Since the disk cache is already clean, flush2 finishes very soon, so
      it looks as if flush2 were folded into flush1.
      
      In my test, holding the queue completely solves a regression
      introduced by commit 53d63e6b:
      
          block: make the flush insertion use the tail of the dispatch list
      
          It's not a preempt type request, in fact we have to insert it
          behind requests that do specify INSERT_FRONT.
      
      which causes about 20% regression running a sysbench fileio
      workload.
      
      Stable: 2.6.39 only
      
      Cc: stable@kernel.org
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  8. 19 Apr 2011 (1 commit)
    • block: get rid of QUEUE_FLAG_REENTER · c21e6beb
      Committed by Jens Axboe
      We are currently using this flag to check whether it's safe
      to call into ->request_fn(). If it is set, we punt to kblockd.
      But we get a lot of false positives and excessive punts to
      kblockd, which hurts performance.
      
      The only real abuser of this infrastructure is SCSI. So export
      the async queue run and convert SCSI over to use that. There's
      room for improvement in that SCSI need not always use the async
      call, but this fixes our performance issue and they can fix that
      up in due time.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  9. 18 Apr 2011 (1 commit)
  10. 31 Mar 2011 (1 commit)
  11. 21 Mar 2011 (1 commit)
    • block: attempt to merge with existing requests on plug flush · 5e84ea3a
      Committed by Jens Axboe
      One of the disadvantages of on-stack plugging is that we potentially
      lose out on merging since all pending IO isn't always visible to
      everybody. When we flush the on-stack plugs, right now we don't do
      any checks to see if potential merge candidates could be utilized.
      
      Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
      It works just like ELEVATOR_INSERT_SORT, but first checks whether we
      can merge with an existing request; the sorted insertion is only done
      if merging fails.
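
      A minimal sketch of the new insertion call (hypothetical caller
      flushing a plug):

      /* try merging into an existing request first, else sort-insert */
      __elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);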
      
      This fixes a regression with multiple processes issuing IO that
      can be merged.
      
      Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
      an accounting bug.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  12. 10 Mar 2011 (1 commit)
  13. 25 Jan 2011 (2 commits)
    • block: reimplement FLUSH/FUA to support merge · ae1b1539
      Committed by Tejun Heo
      The current FLUSH/FUA support has evolved from the implementation
      which had to perform queue draining.  As such, sequencing is done
      queue-wide one flush request after another.  However, with the
      draining requirement gone, there's no reason to keep the queue-wide
      sequential approach.
      
      This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
      request is sequenced individually.  The actual FLUSH execution is
      double buffered and whenever a request wants to execute one for either
      PRE or POSTFLUSH, it queues on the pending queue.  Once certain
      conditions are met, a flush request is issued and on its completion
      all pending requests proceed to the next sequence.
      
      This allows arbitrary merging of different type of flushes.  How they
      are merged can be primarily controlled and tuned by adjusting the
      above said 'conditions' used to determine when to issue the next
      flush.
      
      This is inspired by Darrick's patches to merge multiple zero-data
      flushes which helps workloads with highly concurrent fsync requests.
      
      * As flush requests are never put on the IO scheduler, request fields
        used for flush share space with rq->rb_node.  rq->completion_data is
        moved out of the union.  This increases the request size by one
        pointer.
      
        As rq->elevator_private* are used only by the iosched too, it is
        possible to reduce the request size further.  However, to do that,
        we need to modify request allocation path such that iosched data is
        not allocated for flush requests.
      
      * FLUSH/FUA processing happens on insertion now instead of dispatch.
      
      - Comments updated as per Vivek and Mike.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Darrick J. Wong" <djwong@us.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: add REQ_FLUSH_SEQ · 414b4ff5
      Committed by Tejun Heo
      rq == &q->flush_rq was used to determine whether a rq is part of a
      flush sequence, which worked because all requests in a flush sequence
      were sequenced using the single dedicated request.  This is about to
      change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
      requests.
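
      A minimal sketch of the distinguishing test this flag enables:

      if (rq->cmd_flags & REQ_FLUSH_SEQ) {
              /* request is part of a flush sequence, handled by blk-flush */
      }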
      
      This patch doesn't cause any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  14. 25 Oct 2010 (1 commit)
  15. 19 Oct 2010 (1 commit)
    • block: fix accounting bug on cross partition merges · 7681bfee
      Committed by Yasuaki Ishimatsu
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      The reason is incorrect accounting of hd_struct->in_flight when a bio
      is merged into a request belonging to a different partition by
      ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assume that there are two partitions, sda1 and sda2.
      
      1. A request for sda2 is in the request_queue.  Hence sda1's
         hd_struct->in_flight is 0 and sda2's is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belonging to sda1 is issued and is merged into the request
         mentioned in step 1 by ELEVATOR_BACK_MERGE.  The first sector of
         the request is changed from the sda2 region to the sda1 region.
         However, neither partition's hd_struct->in_flight is changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called.  In
         this case, sda2's hd_struct->in_flight, not sda1's, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  16. 11 Sep 2010 (1 commit)
  17. 10 Sep 2010 (5 commits)
    • block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests · 4fed947c
      Committed by Tejun Heo
      Now that the backend conversion is complete, export sequenced
      FLUSH/FUA capability through REQ_FLUSH/FUA flags.  REQ_FLUSH means the
      device cache should be flushed before executing the request.  REQ_FUA
      means that the data in the request should be on non-volatile media on
      completion.
      
      The block layer will choose the correct way of implementing the
      semantics and execute it.  The request may be passed to the device
      directly if the device can handle it; otherwise, it will be sequenced
      using one or more proxy requests.  Devices will never see REQ_FLUSH
      and/or REQ_FUA flags which they don't support.
      
      Also, unlike the original REQ_HARDBARRIER, REQ_FLUSH/FUA requests are
      never failed with -EOPNOTSUPP.  If the underlying device doesn't
      support FLUSH/FUA, the block layer simply makes them no-ops.  IOW, it
      no longer distinguishes between a writeback cache which doesn't
      support cache flush and writethrough/no cache.  Devices which have a
      WB cache w/o flush are very difficult to come by these days and
      there's nothing much we can do anyway, so it doesn't make sense to
      require everyone to implement -EOPNOTSUPP handling.  This will
      simplify filesystems and block drivers as they can drop -EOPNOTSUPP
      retry logic for barriers.
      
      * QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
        blk-flush.c.
      
      * REQ_FLUSH w/o data can also be passed directly to drivers without
        sequencing, but some drivers assume that zero-length requests don't
        have rq->bio, which isn't true for these requests, requiring the use
        of proxy requests.
      
      * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
        copied from bio to request.
      
      * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
        WRITE_FLUSH_FUA are added.
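
      A hedged sketch of a filesystem issuing a commit write with the new
      flags:

      /* preflush + FUA data write; never fails with -EOPNOTSUPP */
      submit_bio(WRITE_FLUSH_FUA, bio);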
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: rename barrier/ordered to flush · dd4c133f
      Committed by Tejun Heo
      With the ordering requirements dropped, barrier and ordered are
      misnomers.  Now all the block layer does is sequence FLUSH and FUA.
      Rename them to flush.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: drop barrier ordering by queue draining · 28e7d184
      Committed by Tejun Heo
      Filesystems will take all the responsibility for ordering requests
      around commit writes and will only indicate how the commit writes
      themselves should be handled by the block layer.  This patch drops
      barrier ordering by queue draining from the block layer.  The
      ordering-by-draining implementation was somewhat invasive to request
      handling.  A list of notable changes follows.
      
      * Each queue has a 1-bit color which is flipped on each barrier issue.
        This is used to track whether a given request is issued before the
        current barrier or not.  REQ_ORDERED_COLOR flag and coloring
        implementation in __elv_add_request() are removed.
      
      * Requests which shouldn't be processed yet for draining were stalled
        by returning -EAGAIN from blk_do_ordered() according to the test
        result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
        This logic is removed.
      
      * Draining completion logic in elv_completed_request() removed.
      
      * All barrier sequence requests were queued to the request queue and
        then trickled to the lower layer according to progress, and thus
        maintaining request order during requeue was necessary.  This is replaced by
        queueing the next request in the barrier sequence only after the
        current one is complete from blk_ordered_complete_seq(), which
        removes the need for multiple proxy requests in struct request_queue
        and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
        elv_insert().
      
      * As barriers no longer have ordering constraints, there's no need to
        dump the whole elevator onto the dispatch queue on each barrier.
        Insert barriers at the front instead.
      
      * If other barrier requests come to the front of the dispatch queue
        while one is already in progress, they are stored in
        q->pending_barriers and restored to dispatch queue one-by-one after
        each barrier completion from blk_ordered_complete_seq().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: misc cleanups in barrier code · dd831006
      Committed by Tejun Heo
      Make the following cleanups in preparation for the barrier/flush update.
      
      * blk_do_ordered() declaration is moved from include/linux/blkdev.h to
        block/blk.h.
      
      * blk_do_ordered() now returns pointer to struct request, with %NULL
        meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
        later".  The third case will be dropped with further changes.
      
      * In the initialization of proxy barrier request, data direction is
        already set by init_request_from_bio().  Drop unnecessary explicit
        REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
        flag setting.
      
      * add_request() is collapsed into __make_request().
      
      These changes don't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: Range check cpu in blk_cpu_to_group · be14eb61
      Committed by Brian King
      While testing CPU DLPAR, the following problem was discovered.
      We were DLPAR removing the first CPU, which in this case was
      logical CPUs 0-3. CPUs 0-2 were already marked offline and
      we were in the process of offlining CPU 3. After marking
      the CPU inactive and offline in cpu_disable, but before the
      cpu was completely idle (cpu_die), we ended up in __make_request
      on CPU 3. There we looked at the topology map to see which CPU
      to complete the I/O on and found no CPUs in the cpu_sibling_map.
      This resulted in the block layer setting the completion cpu
      to be NR_CPUS, which then caused an oops when we tried to
      complete the I/O.
      
      Fix this by sanity checking the value we return from blk_cpu_to_group
      to be a valid cpu value.
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  18. 08 Aug 2010 (1 commit)
  19. 11 Sep 2009 (1 commit)
    • block: implement mixed merge of different failfast requests · 80a761fd
      Committed by Tejun Heo
      Failfast differs in character from other attributes.  When issuing,
      executing and successfully completing requests, failfast doesn't make
      any difference; it only affects how a request is handled on failure.
      Allowing requests with different failfast settings to be merged causes
      normal IOs to fail prematurely, while not allowing it has performance
      penalties, as failfast is used for readaheads which are likely to be
      located near in-flight or to-be-issued normal IOs.
      
      This patch introduces the concept of 'mixed merge'.  A request is a
      mixed merge if it is a merge of segments which require different
      handling on failure.  Currently the only mixable attributes are
      failfast ones (or lack thereof).
      
      When a bio with different failfast settings is added to an existing
      request or requests of different failfast settings are merged, the
      merged request is marked mixed.  Each bio carries failfast settings
      and the request always tracks the failfast state of the first bio.
      When the request fails, blk_rq_err_bytes() can be used to determine
      how many bytes can be safely failed without crossing into an area
      which requires further retries.
      
      This allows request merging regardless of failfast settings while
      keeping the failure handling correct.
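
      A hedged sketch of how a driver's error path can use this
      (hypothetical completion code):

      /* fail only the leading bytes that share the first bio's failfast mask */
      unsigned int ok_to_fail = blk_rq_err_bytes(rq);

      blk_end_request(rq, -EIO, ok_to_fail);
      /* the remaining, differently-flagged part is handled separately */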
      
      This patch only implements mixed merge but doesn't enable it.  The
      next one will update SCSI to make use of mixed merge.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  20. 27 May 2009 (1 commit)
    • block: fix no diskstat problem · 3c4198e8
      Committed by Kiyoshi Ueda
      The commit below in 2.6-block/for-2.6.31 causes a "no diskstat"
      problem, because the blk_discard_rq() check was added with '&&'.
      It should be 'blk_fs_request() || blk_discard_rq()'.
      This patch does that and fixes the no-diskstat problem.
      Please review and apply.
      
      ------ /proc/diskstat without this patch -------------------------------------
         8       0 sda 0 0 0 0 0 0 0 0 0 0 0
      ------------------------------------------------------------------------------
      
      ----- /proc/diskstat with this patch applied ---------------------------------
         8       0 sda 4186 303 373621 61600 9578 3859 107468 169479 2 89755 231059
      ------------------------------------------------------------------------------
      
      --------------------------------------------------------------------------
      commit c69d4854
      Author: Jens Axboe <jens.axboe@oracle.com>
      Date:   Fri Apr 24 08:12:19 2009 +0200
      
          block: include discard requests in IO accounting
      
          We currently don't do merging on discard requests, but we potentially
          could. If we do, then we need to include discard requests in the IO
          accounting, or merging would end up decrementing in_flight IO counters
          for an IO which never incremented them.
      
          So enable accounting for discard requests.
      
      <snip>
      
       static inline int blk_do_io_stat(struct request *rq)
       {
      -       return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq);
      +       return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq) &&
      +               blk_discard_rq(rq);
       }
      --------------------------------------------------------------------------
      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  21. 19 May 2009 (1 commit)
  22. 11 May 2009 (2 commits)
    • block: implement and enforce request peek/start/fetch · 9934c8c0
      Committed by Tejun Heo
      Until now, the block layer allowed two separate modes of request
      execution.  A request is always acquired from the request queue via
      elv_next_request().  After that, drivers are free to either dequeue it
      or process it without dequeueing.  Dequeueing allows elv_next_request()
      to return the next request so that multiple requests can be in flight.
      
      Executing requests without dequeueing has its merits, mostly in
      allowing drivers for simpler devices which can't do scatter/gather to
      deal only with segments without considering request boundaries.
      However, the benefit this brings is dubious and declining while the
      cost of the API ambiguity is increasing.  Segment-based drivers are
      usually for very old or limited devices, and as converting to the
      dequeueing model isn't difficult, it doesn't justify the API overhead
      it puts on the block layer and its more modern users.
      
      Previous patches converted all block low level drivers to dequeueing
      model.  This patch completes the API transition by...
      
      * renaming elv_next_request() to blk_peek_request()
      
      * renaming blkdev_dequeue_request() to blk_start_request()
      
      * adding blk_fetch_request() which is combination of peek and start
      
      * disallowing completion of queued (not started) requests
      
      * applying new API to all LLDs
      
      Renamings are for consistency and to break out of tree code so that
      it's apparent that out of tree drivers need updating.
      
      [ Impact: block request issue API cleanup, no functional change ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Mike Miller <mike.miller@hp.com>
      Cc: unsik Kim <donari75@gmail.com>
      Cc: Paul Clements <paul.clements@steeleye.com>
      Cc: Tim Waugh <tim@cyberelk.net>
      Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Laurent Vivier <Laurent@lvivier.info>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Grant Likely <grant.likely@secretlab.ca>
      Cc: Adrian McMenamin <adrian@mcmen.demon.co.uk>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      Cc: Borislav Petkov <petkovbb@googlemail.com>
      Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
      Cc: Alex Dubov <oakad@yahoo.com>
      Cc: Pierre Ossman <drzeus@drzeus.cx>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
      Cc: Stefan Weinhuber <wein@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pete Zaitcev <zaitcev@redhat.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: drop request->hard_* and *nr_sectors · 2e46e8b2
      Committed by Tejun Heo
      struct request has had a few different ways to represent some
      properties of a request.  ->hard_* represent the block layer's view of
      the request progress (completion cursor) and the ones without the
      prefix are supposed to represent the issue cursor and are allowed to
      be updated as necessary by the low-level drivers.  The thing is that,
      as the block layer supports partial completion, the two cursors really
      aren't necessary and only cause confusion.  In addition, manual
      management of request details from low-level drivers is cumbersome and
      error-prone at the very least.
      
      Other interesting duplicate fields are rq->[hard_]nr_sectors and
      rq->{hard_cur|current}_nr_sectors against rq->data_len and
      rq->bio->bi_size.  This is more convoluted than the hard_ case.
      
      rq->[hard_]nr_sectors are initialized for requests with bio but
      blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
      initialized for all requests but blk_rq_bytes() uses it only for pc
      requests.  This causes a good amount of confusion throughout the block
      layer and its drivers, and determining the request length has been a bit of
      black magic which may or may not work depending on circumstances and
      what the specific LLD is actually doing.
      
      rq->{hard_cur|current}_nr_sectors represent the number of sectors in
      the contiguous data area at the front.  This is mainly used by drivers
      which transfer data by walking the request segment by segment.  This
      value always equals rq->bio->bi_size >> 9.  However, data length for
      pc requests may not be multiple of 512 bytes and using this field
      becomes a bit confusing.
      
      In general, having multiple fields to represent the same property
      leads only to confusion and subtle bugs.  With recent block low level
      driver cleanups, no driver is accessing or manipulating these
      duplicate fields directly.  Drop all the duplicates.  Now rq->sector
      means the current sector, rq->data_len the current total length and
      rq->bio->bi_size the current segment length.  Everything else is
      defined in terms of these three and available only through accessors.
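
      A minimal sketch of the accessor-only view that remains:

      sector_t pos       = blk_rq_pos(rq);       /* current sector */
      unsigned int bytes = blk_rq_bytes(rq);     /* total remaining bytes */
      unsigned int cur   = blk_rq_cur_bytes(rq); /* bytes in the leading segment */
      unsigned int sects = blk_rq_sectors(rq);   /* == blk_rq_bytes(rq) >> 9 */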
      
      * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
        now handles pc and fs requests equally other than rq->sector update.
        This means that now pc requests can use partial completion too (no
        in-kernel user yet, though).
      
      * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
        now uses byte count as the primary data length.
      
      * blk_rq_pos() is now guaranteed to be always correct.  In-block users
        converted.
      
      * blk_rq_bytes() is now guaranteed to be always valid as is
        blk_rq_sectors().  In-block users converted.
      
      * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
        The more convenient one is used.
      
      * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
        pointer to request.
      
      [ Impact: API cleanup, single way to represent one property of a request ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  23. 28 Apr 2009 (1 commit)
    • block: include discard requests in IO accounting · c69d4854
      Committed by Jens Axboe
      We currently don't do merging on discard requests, but we potentially
      could. If we do, then we need to include discard requests in the IO
      accounting, or merging would end up decrementing in_flight IO counters
      for an IO which never incremented them.
      
      So enable accounting for discard requests.
      
      Problem found by Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>