1. 23 Mar 2013, 1 commit
  2. 28 Feb 2013, 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
      I'm not sure why, but the hlist for-each-entry iterators were conceived
      differently from the list ones, which look like:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
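
      For illustration, a minimal before/after sketch (the struct and the
      names in it are hypothetical, not from the patch):

              struct item {
                      int val;
                      struct hlist_node node;
              };

              struct item *it;
              struct hlist_node *pos;         /* scratch node, needed before */

              /* before: extra 'pos' parameter */
              hlist_for_each_entry(it, pos, head, node)
                      printk("%d\n", it->val);

              /* after: same shape as list_for_each_entry() */
              hlist_for_each_entry(it, head, node)
                      printk("%d\n", it->val);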
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 23 Jan 2013, 1 commit
    • block: don't request module during elevator init · 21c3c5d2
      Committed by Tejun Heo
      The block layer allows an elevator which is built as a module to be
      selected as the system default via the kernel param "elevator=".  This is
      achieved by automatically invoking request_module() whenever a new
      block device is initialized and the elevator is not available.
      
      This led to an interesting deadlock problem involving async and module
      init.  Block device probing running off an async job invokes
      request_module().  While the module is being loaded, it performs
      async_synchronize_full() which ends up waiting for the async job which
      is already waiting for request_module() to finish, leading to
      deadlock.
      
      Invoking request_module() from deep in block device init path is
      already nasty in itself.  It seems best to avoid these situations from
      the beginning by moving on-demand module loading out of block init
      path.
      
      The previous patch made sure that the default elevator module is
      loaded early during boot if available.  This patch removes on-demand
      loading of the default elevator from elevator init path.  As the
      module would have been loaded during boot, userland-visible behavior
      difference should be minimal.
      
      For more details, please refer to the following thread.
      
        http://thread.gmane.org/gmane.linux.kernel/1420814
      
      v2: The bool parameter was named @request_module which conflicted with
          request_module().  This built okay w/ CONFIG_MODULES because
          request_module() was defined as a macro.  W/o CONFIG_MODULES, it
          causes build breakage.  Rename the parameter to @try_loading.
          Reported by Fengguang.
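
      A hedged sketch of the resulting shape (simplified; the locking and
      module-refcount details follow my reading, not the patch verbatim):

              /* look up an elevator type, optionally loading its module */
              static struct elevator_type *elevator_get(const char *name,
                                                        bool try_loading)
              {
                      struct elevator_type *e;

                      spin_lock(&elv_list_lock);
                      e = elevator_find(name);
                      if (!e && try_loading) {
                              spin_unlock(&elv_list_lock);
                              request_module("%s-iosched", name);
                              spin_lock(&elv_list_lock);
                              e = elevator_find(name);
                      }
                      if (e && !try_module_get(e->elevator_owner))
                              e = NULL;
                      spin_unlock(&elv_list_lock);
                      return e;
              }

      Elevator init now passes try_loading=false, so no request_module()
      call can happen from the block device init path.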
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Riesen <raa.lkml@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
  4. 19 Jan 2013, 1 commit
    • init, block: try to load default elevator module early during boot · bb813f4c
      Committed by Tejun Heo
      This patch adds default module loading and uses it to load the default
      block elevator.  During boot, it's called right after initramfs or
      initrd is made available and right before control is passed to
      userland.  This ensures that as long as the modules are available in
      the usual places in initramfs, initrd or the root filesystem, the
      default modules are loaded as soon as possible.
      
      This will replace the on-demand elevator module loading from elevator
      init path.
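
      As a hedged sketch of what this looks like (simplified from the patch):

              /* called from init/main.c once initramfs/initrd is available */
              void __init load_default_elevator_module(void)
              {
                      struct elevator_type *e;

                      if (!chosen_elevator[0])
                              return;         /* no elevator= parameter given */

                      spin_lock(&elv_list_lock);
                      e = elevator_find(chosen_elevator);
                      spin_unlock(&elv_list_lock);

                      if (!e)
                              request_module("%s-iosched", chosen_elevator);
              }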
      
      v2: Fixed build breakage when !CONFIG_BLOCK.  Reported by kbuild test
          robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Riesen <raa.lkml@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
  5. 11 Jan 2013, 1 commit
  6. 09 Nov 2012, 1 commit
    • block: recursive merge requests · bee0393c
      Committed by Shaohua Li
      In a workload, thread 1 accesses a, a+2, ..., and thread 2 accesses a+1, a+3, ....
      When the requests are flushed to the queue, a and a+1 are merged into (a, a+1),
      and a+2 and a+3 into (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged.
      
      If we do a recursive merge for such interleaved access, some workloads'
      throughput improves. A recent workload I'm checking on is swap; the change
      below (sketched after this message) boosts the throughput by around 5% ~ 10%.
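
      A hedged sketch of the recursive back-merge loop (simplified; the helper
      names follow the existing elevator code):

              static bool elv_attempt_insert_merge(struct request_queue *q,
                                                   struct request *rq)
              {
                      struct request *__rq;
                      bool ret = false;

                      /* keep back-merging: the merged request may merge again */
                      while (1) {
                              __rq = elv_rqhash_find(q, blk_rq_pos(rq));
                              if (!__rq || !blk_attempt_req_merge(q, __rq, rq))
                                      break;

                              ret = true;
                              rq = __rq;
                      }
                      return ret;
              }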
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 20 Sep 2012, 1 commit
  8. 20 Apr 2012, 1 commit
    • blkcg: implement per-queue policy activation · a2b1693b
      Committed by Tejun Heo
      All blkcg policies were assumed to be enabled on all request_queues.
      Due to various implementation obstacles, during the recent blkcg core
      updates, this was temporarily implemented as shooting down all !root
      blkgs on elevator switch and policy [de]registration combined with
      half-broken in-place root blkg updates.  In addition to being buggy
      and racy, this meant losing all blkcg configurations across those
      events.
      
      Now that blkcg is cleaned up enough, this patch replaces the temporary
      implementation with proper per-queue policy activation.  Each blkcg
      policy should call the new blkcg_[de]activate_policy() to enable and
      disable the policy on a specific queue.  blkcg_activate_policy()
      allocates and installs policy data for the policy for all existing
      blkgs.  blkcg_deactivate_policy() does the reverse.  If a policy is
      not enabled for a given queue, blkg printing / config functions skip
      the respective blkg for the queue.
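
      As a usage sketch (hedged; the policy symbol name is illustrative):

              /* enable the policy on a queue, e.g. from cfq_init_queue() */
              ret = blkcg_activate_policy(q, &blkio_policy_cfq);
              if (ret)
                      return ret;

              /* ... and disable it again on queue/elevator exit */
              blkcg_deactivate_policy(q, &blkio_policy_cfq);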
      
      blkcg_activate_policy() also takes care of root blkg creation, and
      cfq_init_queue() and blk_throtl_init() are updated accordingly.
      
      This makes blkcg_bypass_{start|end}() and update_root_blkg_pd()
      unnecessary.  Dropped.
      
      v2: cfq_init_queue() was returning uninitialized @ret on root_group
          alloc failure if !CONFIG_CFQ_GROUP_IOSCHED.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 07 Mar 2012, 7 commits
    • block: implement bio_associate_current() · 852c788f
      Committed by Tejun Heo
      IO scheduling and cgroup are tied to the issuing task via the io_context
      and cgroup of %current.  Unfortunately, there are cases where IOs need
      to be routed via a different task, which causes scheduling and cgroup
      limit enforcement to be applied completely incorrectly.
      
      For example, all bios delayed by blk-throttle end up being issued by a
      delayed work item, get assigned the io_context of the worker task
      which happens to serve the work item, and get dumped into the default
      block cgroup.  This is doubly confusing, as bios which aren't delayed
      end up in the correct cgroup, and it makes using blk-throttle and cfq
      propio together impossible.
      
      Any code which punts IO issuing to another task is affected which is
      getting more and more common (e.g. btrfs).  As both io_context and
      cgroup are firmly tied to task including userland visible APIs to
      manipulate them, it makes a lot of sense to match up tasks to bios.
      
      This patch implements bio_associate_current() which associates the
      specified bio with %current.  The bio will record the associated ioc
      and blkcg at that point and block layer will use the recorded ones
      regardless of which task actually ends up issuing the bio.  bio
      release puts the associated ioc and blkcg.
      
      It grabs and remembers ioc and blkcg instead of the task itself
      because the task may already be dead by the time the bio is issued,
      making ioc and blkcg inaccessible, and those are all the block layer
      cares about.
      
      elevator_set_req_fn() is updated such that the bio that elvdata is
      being allocated for is available to the elevator.
      
      This doesn't update block cgroup policies yet.  Further patches will
      implement the support.
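
      A usage sketch (hedged; everything apart from bio_associate_current()
      itself is illustrative):

              /* in the task on whose behalf the IO will be issued */
              bio_associate_current(bio);     /* records %current's ioc and blkcg */

              /* later, possibly from a kworker or other helper thread */
              submit_bio(rw, bio);            /* charged to the original task */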
      
      -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
           rq_ioc() to fix build breakage.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: unify blkg's for blkcg policies · e8989fae
      Committed by Tejun Heo
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      
      This patch makes blkg's per cgroup-queue and lets them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: let blkcg core manage per-queue blkg list and counter · 03aa264a
      Committed by Tejun Heo
      With the previous patch to move blkg list heads and counters to
      request_queue and blkg, logic to manage them in both policies are
      almost identical and can be moved to blkcg core.
      
      This patch moves blkg link logic into blkg_lookup_create(), implements
      common blkg unlink code in blkg_destroy(), and updates
      blkg_destroy_all() so that it's policy specific and can skip the root
      group.  The updated blkg_destroy_all() is now used both to clear the
      queue for bypassing and elevator switching, and to release all blkgs
      on queue exit.
      
      This patch introduces a race window where policy [de]registration may
      race against queue blkg clearing.  This can only be a problem on cfq
      unload and shouldn't be a real problem in practice (and we have many
      other places where this race already exists).  Future patches will
      remove these unlikely races.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Committed by Tejun Heo
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that apply to them, as an interim step, all per-policy blkg
      data will be shot down.
      
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      
      * blk-throtl doesn't shoot down root_tg.  This is to ease transition.
        Unified blkg will always have a persistent root group, and not shooting
        down root_tg for now eases the transition to that point by avoiding
        having to update td->root_tg; it is safe as blk-throtl can never be
        disabled.
      
      -v2: Vivek pointed out that group list is not guaranteed to be empty
           on return from clear function if it raced cgroup removal and
           lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in limbo state between blkcg and request_queue locks.
      
           blk-throtl no longer shoots down root_tg to avoid breaking
           td->root_tg.
      
           Also, nest queue_lock inside blkio_list_lock, not the other way
           around, to avoid introducing a possible deadlock via the blkcg lock.
      
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: implement blk_queue_bypass_start/end() · d732580b
      Committed by Tejun Heo
      Rename and extend elv_quiesce_start/end() to
      blk_queue_bypass_start/end(), which are exported and support nesting
      via @q->bypass_depth.  Also add blk_queue_bypass() to test bypass
      state.
      
      This will be further extended and used for blkio_group management.
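
      A hedged sketch of the nesting behavior (simplified from the patch):

              void blk_queue_bypass_start(struct request_queue *q)
              {
                      spin_lock_irq(q->queue_lock);
                      q->bypass_depth++;
                      queue_flag_set(QUEUE_FLAG_BYPASS, q);
                      spin_unlock_irq(q->queue_lock);

                      blk_drain_queue(q, false);
              }

              void blk_queue_bypass_end(struct request_queue *q)
              {
                      spin_lock_irq(q->queue_lock);
                      if (!--q->bypass_depth)
                              queue_flag_clear(QUEUE_FLAG_BYPASS, q);
                      WARN_ON_ONCE(q->bypass_depth < 0);
                      spin_unlock_irq(q->queue_lock);
              }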
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • elevator: make elevator_init_fn() return 0/-errno · b2fab5ac
      Committed by Tejun Heo
      elevator_ops->elevator_init_fn() has a weird return value.  It returns
      a void * which the caller should assign to q->elevator->elevator_data,
      and a %NULL return denotes init failure.
      
      Update such that it returns integer 0/-errno and sets elevator_data
      directly as necessary.
      
      This makes the interface more conventional and eases further cleanup.
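
      A before/after sketch of the interface (hedged; cfq's actual init is
      more involved):

              /* before: return elevator_data, NULL on failure */
              static void *cfq_init_queue(struct request_queue *q);

              /* after: set elevator_data directly, return 0/-errno */
              static int cfq_init_queue(struct request_queue *q)
              {
                      struct cfq_data *cfqd;

                      cfqd = kzalloc_node(sizeof(*cfqd), GFP_KERNEL, q->node);
                      if (!cfqd)
                              return -ENOMEM;

                      q->elevator->elevator_data = cfqd;
                      return 0;
              }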
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • elevator: clear auxiliary data earlier during elevator switch · 5a5bafdc
      Committed by Tejun Heo
      Elevator switch tries hard to keep as much context as possible until
      the new elevator is ready, so that it can revert to the original state
      if initializing the new elevator fails for some reason.  Unfortunately,
      with more auxiliary contexts to manage, this makes the elevator init
      and exit paths too complex and fragile.
      
      This patch makes elevator_switch() unregister the current elevator and
      flush icq's before it starts initializing the new one.  As we still keep
      the old elevator itself, the only difference is that we lose icq's on
      rare occasions of switching failure, which isn't critical at all.
      
      Note that this makes the explicit elevator parameter to
      elevator_init_queue() and __elv_register_queue() unnecessary, as they
      can always use the current elevator.
      
      This patch enables block cgroup cleanups.
      
      -v2: blk_add_trace_msg() prints elevator name from @new_e instead of
           @e->type as the local variable no longer exists.  This caused
           build failure on CONFIG_BLK_DEV_IO_TRACE.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 08 Feb 2012, 1 commit
    • block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Committed by Tejun Heo
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make the *_rq_merge_ok() functions return
      bool.
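
      A hedged sketch of the resulting split (simplified):

              /* elevator-neutral eligibility test */
              bool blk_rq_merge_ok(struct request *rq, struct bio *bio);

              /* direction: ELEVATOR_{BACK,FRONT,NO}_MERGE */
              int blk_try_merge(struct request *rq, struct bio *bio);

              bool elv_rq_merge_ok(struct request *rq, struct bio *bio)
              {
                      if (!blk_rq_merge_ok(rq, bio))
                              return false;

                      return elv_iosched_allow_merge(rq, bio);
              }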
      
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 15 Jan 2012, 1 commit
    • Revert "block: recursive merge requests" · 5d381efb
      Committed by Jens Axboe
      This reverts commit 27419322.
      
      We have some problems related to the selection of empty queues
      that need to be resolved; evidence so far points to the
      recursive merge logic being either the cause or at
      least the accelerator for this. So revert it for now, until
      we figure this out.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 16 Dec 2011, 1 commit
    • block: recursive merge requests · 27419322
      Committed by Shaohua Li
      In my workload, thread 1 accesses a, a+2, ..., and thread 2 accesses a+1,
      a+3, .... When the requests are flushed to the queue, a and a+1 are merged
      into (a, a+1), and a+2 and a+3 into (a+2, a+3), but (a, a+1) and (a+2, a+3)
      aren't merged.
      With the recursive merge below, the workload's throughput improves by 20%
      and context switches drop by 60%.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 14 Dec 2011, 6 commits
    • block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Committed by Tejun Heo
      With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
      blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
      with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
      and q, and freeing is automatically handled by blk-ioc.  The elevator
      operation only needs to perform the exit operation specific to the
      elevator; in cfq's case, exiting the cfqq's.
      
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      
      Because the q an io_cq points to might be freed before the RCU callback
      for the io_cq runs, blk-ioc code should remember to which cache the io_cq
      needs to be freed when the io_cq is released.  New field
      io_cq->__rcu_icq_cache is added for this purpose.  As both the new
      field and rcu_head are used only after io_cq is released and the
      q/ioc_node fields aren't, they are put into unions.
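
      A hedged sketch of the resulting layout (field names per the message;
      details may differ from the patch):

              struct io_cq {
                      struct request_queue    *q;
                      struct io_context       *ioc;

                      /* q_node is unused once the io_cq is released */
                      union {
                              struct list_head        q_node;
                              struct kmem_cache       *__rcu_icq_cache;
                      };
                      /* likewise ioc_node vs the RCU head */
                      union {
                              struct hlist_node       ioc_node;
                              struct rcu_head         __rcu_head;
                      };

                      unsigned long           flags;
              };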
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move icq cache management to block core · 3d3c2379
      Committed by Tejun Heo
      Let elevators set ->icq_size and ->icq_align in elevator_type, and let
      elv_register() and elv_unregister() respectively create and destroy the
      kmem_cache for icq.
      
      * elv_register() now can return failure.  All callers updated.
      
      * icq caches are automatically named "ELVNAME_io_cq".
      
      * cfq_slab_setup/kill() are collapsed into cfq_init/exit().
      
      * While at it, minor indentation change for iosched_cfq.elevator_name
        for consistency.
      
      This will help moving icq management to block core.  This doesn't
      introduce any functional change.
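
      A hedged sketch of the elv_register() side (simplified):

              int elv_register(struct elevator_type *e)
              {
                      /* create the icq cache if the elevator uses icq's */
                      if (e->icq_size) {
                              snprintf(e->icq_cache_name,
                                       sizeof(e->icq_cache_name),
                                       "%s_io_cq", e->elevator_name);
                              e->icq_cache = kmem_cache_create(e->icq_cache_name,
                                              e->icq_size, e->icq_align, 0, NULL);
                              if (!e->icq_cache)
                                      return -ENOMEM;
                      }

                      /* registration proper (list insertion etc.) follows */
                      return 0;
              }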
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Committed by Tejun Heo
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      
      * Move cfqd->icq_list to request_queue->icq_list
      
      * Make request explicitly point to icq instead of going through elevator
        private data.  ->elevator_private[3] is replaced with a sub-struct elv
        which contains the icq pointer and priv[2].  cfq is updated accordingly.
      
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      
      This patch doesn't introduce any functional change.
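
      A hedged sketch of the new request fields (surrounding members elided):

              struct request {
                      /* ... */

                      /* replaces void *elevator_private[3] */
                      struct {
                              struct io_cq    *icq;
                              void            *priv[2];
                      } elv;

                      /* ... */
              };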
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove elevator_queue->ops · 22f746e2
      Committed by Tejun Heo
      elevator_queue->ops points to the same ops struct that ->elevator_type.ops
      points to.  The only effect of caching it in elevator_queue is
      shorter notation; it doesn't save any indirect dereference.
      
      Relocate elevator_type->list, which is used only during module init/exit,
      to the end of the structure, rename elevator_queue->elevator_type to
      ->type, and replace elevator_queue->ops with elevator_queue->type.ops.
      
      This doesn't introduce any functional difference.
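
      A hedged sketch of the access change (the call site is illustrative):

              struct elevator_queue {
                      struct elevator_type    *type;  /* was ->elevator_type */
                      void                    *elevator_data;
                      /* ->ops is gone */
              };

              /* callers now reach the ops through the type */
              e->type->ops.elevator_merge_fn(q, &__rq, bio);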
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: reorder elevator switch sequence · f8fc877d
      Committed by Tejun Heo
      The elevator switch sequence first attached the new elevator, then tried
      registering it (sysfs) and, if that failed, re-attached the old
      elevator.  However, sysfs registration doesn't require the elevator to
      be attached, so there is no reason to do the "detach, attach new,
      register, maybe re-attach old" sequence.  It can just do "register,
      detach, attach".
      
      * elevator_init_queue() is updated to set ->elevator_data directly and
        return 0 / -errno.  This allows elevator_exit() on an unattached
        elevator.
      
      * __elv_unregister_queue() which was necessary to unregister
        unattached q is removed in favor of __elv_register_queue() which can
        register unattached q.
      
      * elevator_attach() becomes a single assignment and obscures more than
        it helps.  Dropped.
      
      This will help cleaning up io_context handling across elevator switch.
      
      This patch doesn't introduce visible behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: remove delayed unlink · b9a19208
      Committed by Tejun Heo
      Now that all cic's are immediately unlinked from both ioc and queue,
      lazy dropping from lookup path and trimming on elevator unregister are
      unnecessary.  Kill them and remove the now unused elevator_ops->trim().
      
      This also leaves call_for_each_cic() without any user.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 19 Oct 2011, 2 commits
    • block: fix request_queue lifetime handling by making blk_cleanup_queue() properly shutdown · c9a929dd
      Committed by Tejun Heo
      request_queue is refcounted but actually depends on lifetime
      management by the queue owner - on blk_cleanup_queue(), the block layer
      expects that no request is passing through the request_queue and that
      no new one will.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_cleanup_queue() destroys the queue and
      elevator whether IOs are in progress or not, and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost, putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immediately failed.  The rest of teardown
      happens on release.
      
      This patch makes the following changes to make blk_cleanup_queue()
      behave as a proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
        released.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: reorganize queue draining · e3c78ca5
      Committed by Tejun Heo
      Reorganize queue draining related code in preparation of queue exit
      changes.
      
      * Factor out actual draining from elv_quiesce_start() to
        blk_drain_queue().
      
      * Make elv_quiesce_start/end() responsible for their own locking.
      
      * Replace open-coded ELVSWITCH clearing in elevator_switch() with
        elv_quiesce_end().
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 12 Sep 2011, 1 commit
  16. 03 Jun 2011, 1 commit
    • iosched: prevent aliased requests from starving other I/O · 796d5116
      Committed by Jeff Moyer
      Hi, Jens,
      
      If you recall, I posted an RFC patch for this back in July of last year:
      http://lkml.org/lkml/2010/7/13/279
      
      The basic problem is that a process can issue a never-ending stream of
      async direct I/Os to the same sector on a device, thus starving out
      other I/O in the system (due to the way the alias handling works in both
      cfq and deadline).  The solution I proposed back then was to start
      dispatching from the fifo after a certain number of aliases had been
      dispatched.  Vivek asked why we had to treat aliases differently at all,
      and I never had a good answer.  So, I put together a simple patch which
      allows aliases to be added to the rb tree (it adds them to the right,
      though that doesn't matter as the order isn't guaranteed anyway).  I
      think this is the preferred solution, as it doesn't break up time slices
      in CFQ or batches in deadline.  I've tested it, and it does solve the
      starvation issue.  Let me know what you think.
      
      Cheers,
      Jeff
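
      A hedged sketch of the resulting insertion, with aliases simply going
      to the right (simplified):

              void elv_rb_add(struct rb_root *root, struct request *rq)
              {
                      struct rb_node **p = &root->rb_node;
                      struct rb_node *parent = NULL;
                      struct request *__rq;

                      while (*p) {
                              parent = *p;
                              __rq = rb_entry(parent, struct request, rb_node);

                              if (blk_rq_pos(rq) < blk_rq_pos(__rq))
                                      p = &(*p)->rb_left;
                              else    /* equal (alias) or greater: go right */
                                      p = &(*p)->rb_right;
                      }

                      rb_link_node(&rq->rb_node, parent, p);
                      rb_insert_color(&rq->rb_node, root);
              }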
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  17. 21 May 2011, 1 commit
  18. 06 May 2011, 1 commit
  19. 22 Apr 2011, 1 commit
  20. 18 Apr 2011, 1 commit
  21. 06 Apr 2011, 1 commit
  22. 21 Mar 2011, 1 commit
    • block: attempt to merge with existing requests on plug flush · 5e84ea3a
      Committed by Jens Axboe
      One of the disadvantages of on-stack plugging is that we potentially
      lose out on merging since all pending IO isn't always visible to
      everybody. When we flush the on-stack plugs, right now we don't do
      any checks to see if potential merge candidates could be utilized.
      
      Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
      It works just like ELEVATOR_INSERT_SORT, but first checks whether we can
      merge with an existing request; only if merging fails do we fall back
      to doing the insertion.
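
      A hedged sketch of the new insert case in __elv_add_request()
      (simplified):

              case ELEVATOR_INSERT_SORT_MERGE:
                      /*
                       * If this request merges with one already in the
                       * queue, we are done; rq has been freed.
                       */
                      if (elv_attempt_insert_merge(q, rq))
                              break;
                      /* otherwise fall through to a normal sorted insert */
              case ELEVATOR_INSERT_SORT:
                      /* regular sorted insertion follows */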
      
      This fixes a regression with multiple processes issuing IO that
      can be merged.
      
      Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing
      an accounting bug.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  23. 10 Mar 2011, 2 commits
    • block: remove per-queue plugging · 7eaceacc
      Committed by Jens Axboe
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So let's kill off the old plugging along with aops->sync_page().
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: initial patch for on-stack per-task plugging · 73c10101
      Committed by Jens Axboe
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
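
      A usage sketch of the explicit API this introduces (hedged; the bios
      are assumed to be set up already):

              struct blk_plug plug;

              blk_start_plug(&plug);      /* plug context on this task's stack */
              submit_bio(READ, bio1);     /* batched, not yet dispatched */
              submit_bio(READ, bio2);
              blk_finish_plug(&plug);     /* flush the batch to the queue */

      If the task schedules while plugged, the list is flushed automatically.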
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  24. 02 Mar 2011, 1 commit
    • block: add @force_kblockd to __blk_run_queue() · 1654e741
      Committed by Tejun Heo
      __blk_run_queue() automatically either calls q->request_fn() directly
      or schedules kblockd depending on whether the function is recursed.
      blk-flush implementation needs to be able to explicitly choose
      kblockd.  Add @force_kblockd.
      
      All the current users are converted to specify %false for the
      parameter and this patch doesn't introduce any behavior change.
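
      A hedged sketch of the call sites (illustrative):

              __blk_run_queue(q, false);  /* existing callers: unchanged */
              __blk_run_queue(q, true);   /* blk-flush: always punt to kblockd */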
      
      stable: This is prerequisite for fixing ide oops caused by the new
              blk-flush implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  25. 11 Feb 2011, 1 commit
  26. 25 Jan 2011, 1 commit
    • block: reimplement FLUSH/FUA to support merge · ae1b1539
      Committed by Tejun Heo
      The current FLUSH/FUA support has evolved from the implementation
      which had to perform queue draining.  As such, sequencing is done
      queue-wide one flush request after another.  However, with the
      draining requirement gone, there's no reason to keep the queue-wide
      sequential approach.
      
      This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
      request is sequenced individually.  The actual FLUSH execution is
      double buffered and whenever a request wants to execute one for either
      PRE or POSTFLUSH, it queues on the pending queue.  Once certain
      conditions are met, a flush request is issued and on its completion
      all pending requests proceed to the next sequence.
      
      This allows arbitrary merging of different types of flushes.  How they
      are merged can be primarily controlled and tuned by adjusting the
      above-mentioned 'conditions' used to determine when to issue the next
      flush.
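
      A hedged sketch of the per-request sequencing this implies (simplified):

              /* each FLUSH/FUA request walks these steps individually */
              enum {
                      REQ_FSEQ_PREFLUSH  = (1 << 0),  /* pre-flush in progress */
                      REQ_FSEQ_DATA      = (1 << 1),  /* data write in progress */
                      REQ_FSEQ_POSTFLUSH = (1 << 2),  /* post-flush in progress */
                      REQ_FSEQ_DONE      = (1 << 3),
              };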
      
      This is inspired by Darrick's patches to merge multiple zero-data
      flushes which helps workloads with highly concurrent fsync requests.
      
      * As flush requests are never put on the IO scheduler, request fields
        used for flush share space with rq->rb_node.  rq->completion_data is
        moved out of the union.  This increases the request size by one
        pointer.
      
        As rq->elevator_private* are used only by the iosched too, it is
        possible to reduce the request size further.  However, to do that,
        we need to modify request allocation path such that iosched data is
        not allocated for flush requests.
      
      * FLUSH/FUA processing happens on insertion now instead of dispatch.
      
      - Comments updated as per Vivek and Mike.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Darrick J. Wong" <djwong@us.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  27. 10 Nov 2010, 1 commit
    • block: remove REQ_HARDBARRIER · 02e031cb
      Committed by Christoph Hellwig
      REQ_HARDBARRIER is dead now, so remove the leftovers.  What's left
      at this point is:
      
       - various checks inside the block layer.
       - sanity checks in bio based drivers.
       - now unused bio_empty_barrier helper.
       - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it has been dead for a
         while, but Xen really needs to sort out its barrier situation.
       - setting of ordered tags in uas - dead code copied from old scsi
         drivers.
       - scsi different retry for barriers - it's dead and should have been
         removed when flushes were converted to FS requests.
       - blktrace handling of barriers - removed.  Someone who knows blktrace
         better should add support for REQ_FLUSH and REQ_FUA, though.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>