  1. 15 Feb 2012, 1 commit
  2. 08 Feb 2012, 1 commit
    • block: don't call elevator callbacks for plug merges · 07c2bd37
      Authored by Tejun Heo
      Plug merge calls two elevator callbacks outside queue lock -
      elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
      attempt_plug_merge() suggests that elevator is guaranteed to be there
      through the existing request on the plug list, nothing prevents plug
      merge from calling into dying or initializing elevator.
      
      For regular merges, bypass ensures that the elvpriv count reaches
      zero, which in turn prevents merges, as all !ELVPRIV requests get
      REQ_SOFTBARRIER from forced back insertion.  Plug merge doesn't check
      ELVPRIV and, as the requests haven't gone through elevator insertion
      yet, they don't have SOFTBARRIER set, allowing merges on a bypassed
      queue.
      
      This, for example, leads to the following crash during elevator
      switch.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       PGD 112cbc067 PUD 115d5c067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU 1
       Modules linked in: deadline_iosched
      
       Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
       RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
       RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
       RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
       RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
       R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
       FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
       Stack:
        ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
        ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
        ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
       Call Trace:
        [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
        [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
        [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
        [<ffffffff81396a5a>] generic_make_request+0xca/0x100
        [<ffffffff81396b04>] submit_bio+0x74/0x100
        [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
        [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff811986b2>] do_sync_read+0xe2/0x120
        [<ffffffff81199345>] vfs_read+0xc5/0x180
        [<ffffffff81199501>] sys_read+0x51/0x90
        [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
      
      There are multiple ways to fix this including making plug merge check
      ELVPRIV; however,
      
      * Calling into elevator outside queue lock is confusing and
        error-prone.
      
      * Requests on plug list aren't known to the elevator.  They aren't on
        the elevator yet, so there's no elevator specific state to update.
      
      * Given the nature of plug merges - collecting bio's for the same
        purpose from the same issuer - elevator specific restrictions aren't
        applicable.
      
      So, simply don't call into elevator methods from plug merge by moving
      elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
      using blk_try_merge() in attempt_plug_merge().
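      
      A condensed sketch of the resulting attempt_plug_merge() (following
      the shape described above; accounting and failure paths trimmed):
      
        static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
                                       unsigned int *request_count)
        {
            struct blk_plug *plug = current->plug;
            struct request *rq;
            int el_ret;
        
            if (!plug)
                return false;
            *request_count = 0;
        
            list_for_each_entry_reverse(rq, &plug->list, queuelist) {
                (*request_count)++;
                if (rq->q != q)
                    continue;
        
                /* pure rq/bio geometry check - no elevator callbacks */
                el_ret = blk_try_merge(rq, bio);
                if (el_ret == ELEVATOR_BACK_MERGE) {
                    if (bio_attempt_back_merge(q, rq, bio))
                        return true;
                } else if (el_ret == ELEVATOR_FRONT_MERGE) {
                    if (bio_attempt_front_merge(q, rq, bio))
                        return true;
                }
            }
            return false;
        }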
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      
      Note that this makes per-cgroup merged stats skip plug merging.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 07 Feb 2012, 1 commit
    • block: strip out locking optimization in put_io_context() · 11a3122f
      Authored by Tejun Heo
      put_io_context() performed a complex trylock dance to avoid deferring
      ioc release to a workqueue.  It was also broken on UP because the
      trylock was always assumed to succeed, which resulted in an
      unbalanced preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and a tight fork/exit
      loop) fails to show any appreciable performance benefit from the
      optimization.  Strip it out.  If there turn out to be workloads which
      are affected by this change, a simpler optimization from the
      discussion thread can be applied later.
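      
      After this change, the release path reduces to roughly the following
      (a sketch; ioc->release_work performs the remaining unlink and free
      from the workqueue):
      
        void put_io_context(struct io_context *ioc)
        {
            if (ioc == NULL)
                return;
        
            BUG_ON(atomic_long_read(&ioc->refcount) <= 0);
        
            /*
             * Releasing an ioc requires reverse-order double locking and
             * we may already be holding a queue_lock.  Defer to the wq.
             */
            if (atomic_long_dec_and_test(&ioc->refcount))
                schedule_work(&ioc->release_work);
        }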
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 19 Jan 2012, 1 commit
  5. 18 Jan 2012, 1 commit
    • cfq-iosched: fix use-after-free of cfqq · 54b466e4
      Authored by Jens Axboe
      With the changes in lifetime management between the cfq IO contexts
      and the cfq queues, we now risk cfqd->active_queue being freed while
      cfq_slice_expired() is being called. cfq_preempt_queue() caches this
      queue and uses it after calling said function, causing a
      use-after-free condition. This triggers the following oops when
      cfqq_type() attempts to dereference it:
      
      BUG: unable to handle kernel paging request at ffff8800746c4f0c
      IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20
      PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      CPU 3
      Modules linked in:
      
      Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs
      RIP: 0010:[<ffffffff81266d59>]  [<ffffffff81266d59>] cfqq_type+0xb/0x20
      RSP: 0018:ffff880079c11778  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000
      RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08
      RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001
      R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0
      R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002
      FS:  00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040)
      Stack:
       ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43
       ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98
       0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149
      Call Trace:
       [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c
       [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31
       [<ffffffff81254149>] __elv_add_request+0x1ca/0x200
       [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312
       [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0
       [<ffffffff8125907b>] submit_bio+0xbf/0xca
       [<ffffffff81136ec7>] submit_bh+0xdf/0xfe
       [<ffffffff81176d04>] ext3_bread+0x50/0x99
       [<ffffffff811785b3>] dx_probe+0x38/0x291
       [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219
       [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406
       [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0
       [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191
       [<ffffffff8117a330>] ext3_lookup+0x39/0xe1
       [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c
       [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5
       [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef
       [<ffffffff8111b557>] path_lookupat+0x59/0x5ea
       [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a
       [<ffffffff8111bce0>] do_path_lookup+0x23/0x59
       [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99
       [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56
       [<ffffffff8111d02d>] user_path_at+0x11/0x13
       [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64
       [<ffffffff8111425a>] vfs_stat+0x1b/0x1d
       [<ffffffff81114359>] sys_newstat+0x1a/0x33
       [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42
       [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b
      Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3
      RIP  [<ffffffff81266d59>] cfqq_type+0xb/0x20
       RSP <ffff880079c11778>
      CR2: ffff8800746c4f0c
      
      Get rid of the caching of cfqd->active_queue, and reorder the check
      so that it happens before we expire the active queue.
      
      Thanks to Tejun for pinpointing the error location.
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Tested-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 16 Dec 2011, 2 commits
  7. 14 Dec 2011, 18 commits
    • block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Authored by Tejun Heo
      Now the block layer knows everything necessary to create and
      associate icq's with requests.  Move ioc_create_icq() to blk-ioc.c
      and update get_request() such that, if elevator_type->icq_size is
      set, requests are automatically associated with their matching icq's
      before elv_set_request().  The io_context reference is also managed
      by the block core on request alloc/free.
      
      * Only the ioprio/cgroup-changed handling remains from
        cfq_get_cic().  It is collapsed into cfq_set_request().
      
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect queue kicking
        achieved was possibly accelerating queue processing, this change
        shouldn't be noticeable.
      
        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icq's is unbounded and thus mempool can't
        be the solution either.  This effectively adds an allocation
        dependency on the memory free path and thus the possibility of
        deadlock.
      
        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has an icq.
        
        However, this is still possible, especially if the elevator is
        being switched under high memory pressure, so we had better get it
        fixed.  Probably the only solution is bypassing the elevator and
        appending to the dispatch queue on any elevator allocation failure.
      
      * Comment added to explain how icq's are managed and synchronized.
      
      This completes cleanup of io_context interface.
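      
      Tying it together, the allocation side now looks roughly like this (a
      condensed sketch of get_request(); locking and failure handling
      omitted, and @et is assumed to be the elevator_type of @q):
      
        if (rw_flags & REQ_ELVPRIV) {
            rq->elv.icq = NULL;
            if (et->icq_size && ioc) {
                /* lookup under queue_lock; creation may drop it */
                icq = ioc_lookup_icq(ioc, q);
                if (!icq)
                    icq = ioc_create_icq(q, gfp_mask);
                rq->elv.icq = icq;
            }
            /* the elevator sees the icq already in place */
            elv_set_request(q, rq, gfp_mask);
        }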
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: restructure io_cq creation path for io_context interface cleanup · 9b84cacd
      Authored by Tejun Heo
      Add elevator_ops->elevator_init_icq_fn(), restructure
      cfq_create_cic(), and rename it to ioc_create_icq().
      
      The new function expects its caller to pass in an io_context, uses
      elevator_type->icq_cache, handles generic initialization, calls the
      new elevator operation for elevator-specific initialization, and
      returns a pointer to the created or looked-up icq.  This leaves the
      cfq_icq_pool variable without any user.  Removed.
      
      This prepares for io_context interface cleanup and doesn't introduce
      any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Authored by Tejun Heo
      With the kmem_cache managed by blk-ioc, io_cq exit/release can be
      moved to blk-ioc too.  The odd ->io_cq->exit/release() callbacks are
      replaced with elevator_ops->elevator_exit_icq_fn(), with unlinking
      from both ioc and q, and freeing, handled automatically by blk-ioc.
      The elevator operation only needs to perform the exit operation
      specific to the elevator - in cfq's case, exiting the cfqq's.
      
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      
      Because the q an io_cq points to might be freed before the RCU
      callback for the io_cq runs, blk-ioc code should remember to which
      cache the io_cq needs to be freed when the io_cq is released.  A new
      field, io_cq->__rcu_icq_cache, is added for this purpose.  As both
      the new field and rcu_head are used only after the io_cq is released,
      and the q/ioc_node fields aren't, they are put into unions.
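      
      The resulting layout, roughly (a sketch following the description
      above):
      
        struct io_cq {
            struct request_queue    *q;
            struct io_context       *ioc;
        
            /*
             * q_node and ioc_node link the io_cq into the icq_list of q
             * and ioc respectively.  Both are unused once the io_cq is
             * exited, so they share space with the fields used for RCU
             * free.
             */
            union {
                struct list_head        q_node;
                struct kmem_cache       *__rcu_icq_cache;
            };
            union {
                struct hlist_node       ioc_node;
                struct rcu_head         __rcu_head;
            };
        };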
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move icq cache management to block core · 3d3c2379
      Authored by Tejun Heo
      Let elevators set ->icq_size and ->icq_align in elevator_type, and
      have elv_register() and elv_unregister() respectively create and
      destroy the kmem_cache for icq's.
      
      * elv_register() now can return failure.  All callers updated.
      
      * icq caches are automatically named "ELVNAME_io_cq".
      
      * cfq_slab_setup/kill() are collapsed into cfq_init/exit().
      
      * While at it, minor indentation change for iosched_cfq.elevator_name
        for consistency.
      
      This will help moving icq management to block core.  This doesn't
      introduce any functional change.
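      
      The registration side, in outline (a sketch following the points
      above; icq_cache_name is assumed to be a small char array in
      elevator_type):
      
        int elv_register(struct elevator_type *e)
        {
            /* create icq_cache if requested */
            if (e->icq_size) {
                if (WARN_ON(e->icq_size < sizeof(struct io_cq)) ||
                    WARN_ON(e->icq_align < __alignof__(struct io_cq)))
                    return -EINVAL;
        
                snprintf(e->icq_cache_name, sizeof(e->icq_cache_name),
                         "%s_io_cq", e->elevator_name);
                e->icq_cache = kmem_cache_create(e->icq_cache_name,
                                                 e->icq_size, e->icq_align,
                                                 0, NULL);
                if (!e->icq_cache)
                    return -ENOMEM;
            }
        
            /* ... add to elv_list as before ... */
            return 0;
        }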
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move io_cq lookup to blk-ioc.c · 47fdd4ca
      Authored by Tejun Heo
      Now that all io_cq related data structures are in the block core
      layer, io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.
      
      Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
      parameter and return type changes (cfqd -> request_queue, cfq_io_cq
      -> io_cq), and cfq_cic_lookup() becomes a thin wrapper around
      ioc_lookup_icq().
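      
      The moved lookup, roughly (hint pointer first, then the radix tree,
      both under RCU):
      
        struct io_cq *ioc_lookup_icq(struct io_context *ioc,
                                     struct request_queue *q)
        {
            struct io_cq *icq;
        
            lockdep_assert_held(q->queue_lock);
        
            /*
             * All removals are done holding both q and ioc locks, and we
             * hold the q lock - an icq pointing to @q is guaranteed valid.
             */
            rcu_read_lock();
            icq = rcu_dereference(ioc->icq_hint);
            if (icq && icq->q == q)
                goto out;
        
            icq = radix_tree_lookup(&ioc->icq_tree, q->id);
            if (icq && icq->q == q)
                rcu_assign_pointer(ioc->icq_hint, icq); /* benign race */
            else
                icq = NULL;
        out:
            rcu_read_unlock();
            return icq;
        }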
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Authored by Tejun Heo
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      
      * Move cfqd->icq_list to request_queue->icq_list
      
      * Make the request explicitly point to the icq instead of going
        through elevator private data.  ->elevator_private[3] is replaced
        with a sub-struct, elv, which contains the icq pointer and priv[2]
        (see the sketch after this list).  cfq is updated accordingly.
      
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      
      This patch doesn't introduce any functional change.
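      
      In struct request, the replacement looks like this (a sketch per the
      second point above):
      
        /* was: void *elevator_private[3]; */
        struct {
            struct io_cq    *icq;       /* ioc-queue association */
            void            *priv[2];   /* elevator-private slots */
        } elv;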
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: reorganize cfq_io_context into generic and cfq specific parts · c5869807
      Authored by Tejun Heo
      Currently, io_context and cfq logic are mixed without a clear
      boundary.  Most of io_context is independent from cfq, but the
      cfq_io_context handling logic is dispersed between generic ioc code
      and cfq.
      
      cfq_io_context represents association between an io_context and a
      request_queue, which is a concept useful outside of cfq, but it also
      contains fields which are useful only to cfq.
      
      This patch takes out the generic part and puts it into io_cq (io
      context-queue) and the rest into cfq_io_cq (the cic moniker remains
      the same), which contains the io_cq.  The following changes are made
      together.
      
      * cfq_ttime and cfq_io_cq now live in cfq-iosched.c.
      
      * All related fields, functions and constants are renamed accordingly.
      
      * ioc->ioc_data is now "struct io_cq *" instead of "void *" and
        renamed to icq_hint.
      
      This prepares for the io_context API cleanup.  Documentation is
      currently sparse; it will be added later.
      
      Changes in this patch are mechanical and don't cause functional
      change.
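      
      The split, in outline: cfq embeds the generic io_cq as the first
      member and converts back with container_of() (a sketch of the shape
      in cfq-iosched.c):
      
        struct cfq_io_cq {
            struct io_cq        icq;    /* must be the first member */
            struct cfq_queue    *cfqq[2];
            struct cfq_ttime    ttime;
        };
        
        static inline struct cfq_io_cq *icq_to_cic(struct io_cq *icq)
        {
            /* icq is the first member, %NULL converts to %NULL */
            return container_of(icq, struct cfq_io_cq, icq);
        }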
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: replace current_io_context() with create_io_context() · f2dbd76a
      Authored by Tejun Heo
      When called under queue_lock, current_io_context() triggers a lockdep
      warning if it hits the allocation path.  This is because io_context
      installation is protected by task_lock, which is not IRQ safe, so it
      triggers an irq-unsafe-lock -> irq -> irq-safe-lock ->
      irq-unsafe-lock deadlock warning.
      
      Given the restriction, an accessor and creator rolled into one
      doesn't work too well.  Drop current_io_context() and let the users
      access task->io_context directly inside queue_lock, combined with
      explicit creation using create_io_context().
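      
      A sketch of the new helper's shape (assuming the slow-path creator is
      named create_io_context_slowpath(); it allocates and installs the ioc
      under task_lock()):
      
        /* must be called without queue_lock held; may sleep on alloc */
        static inline struct io_context *
        create_io_context(struct task_struct *task, gfp_t gfp_mask, int node)
        {
            WARN_ON_ONCE(irqs_disabled());
            if (unlikely(!task->io_context))
                create_io_context_slowpath(task, gfp_mask, node);
            return task->io_context;
        }
      
      A caller holding queue_lock then drops and re-grabs it around the
      creation, for example:
      
        spin_unlock_irq(q->queue_lock);
        create_io_context(current, gfp_mask, q->node);
        spin_lock_irq(q->queue_lock);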
      
      Future ioc updates will further consolidate ioc access and the create
      interface will be unexported.
      
      While at it, relocate ioc internal interface declarations in blk.h and
      add section comments before and after.
      
      This patch does not introduce functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: kill cic->key · 1238033c
      Authored by Tejun Heo
      Now that the lazy paths are removed, cfqd_dead_key() is meaningless
      and cic->q can be used wherever cic->key is used.  Kill cic->key.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: kill ioc_gone · b50b636b
      Authored by Tejun Heo
      Now that cic's are immediately unlinked under both locks, there's no
      need to count and drain cic's before module unload.  RCU callback
      completion is waited for with rcu_barrier().
      
      While at it, remove residual RCU operations on cic_list.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: remove delayed unlink · b9a19208
      Authored by Tejun Heo
      Now that all cic's are immediately unlinked from both ioc and queue,
      lazy dropping from lookup path and trimming on elevator unregister are
      unnecessary.  Kill them and remove now unused elevator_ops->trim().
      
      This also leaves call_for_each_cic() without any user.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: unlink cfq_io_context's immediately · b2efa052
      Authored by Tejun Heo
      A cic is the association between an io_context and a request_queue.
      A cic is linked from both the ioc and the q and should be destroyed
      when either one goes away.  As ioc and q both have their own locks,
      locking becomes a bit complex - both orders work for removal from one
      but not from the other.
      
      Currently, cfq tries to circumvent this locking order issue with RCU.
      ioc->lock nests inside queue_lock, but the radix tree and cic's are
      also protected by RCU, allowing either side to walk their lists
      without grabbing a lock.
      
      This rather unconventional use of RCU quickly devolves into an
      extremely fragile convolution.  For example, the following oops is
      from cfqd going away too soon after ioc and q exits raced.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       [   88.503444]
       Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
       RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
       ...
       Call Trace:
        [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
        [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
        [<ffffffff81389130>] exit_io_context+0x100/0x140
        [<ffffffff81098a29>] do_exit+0x579/0x850
        [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
        [<ffffffff81098de7>] sys_exit_group+0x17/0x20
        [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
      
      The only real hot path here is cic lookup during request
      initialization and avoiding extra locking requires very confined use
      of RCU.  This patch makes cic removal from both ioc and request_queue
      perform double-locking and unlink immediately.
      
      * From q side, the change is almost trivial as ioc->lock nests inside
        queue_lock.  It just needs to grab each ioc->lock as it walks
        cic_list and unlink it.
      
      * From the ioc side, it's a bit more difficult because of the
        inverted lock order.  ioc needs its lock to walk its cic_list but
        can't grab the matching queue_lock and needs to perform
        unlock-relock dancing.
      
        Unlinking is now done wholly from put_io_context() and the fast
        path is optimized by using the queue_lock the caller already holds,
        which is by far the most common case.  If the ioc accessed multiple
        devices, it tries with trylock.  In the unlikely case of fast-path
        failure, it falls back to the full double-locking dance from a
        workqueue.
      
      Double-locking isn't the prettiest thing in the world but it's *far*
      simpler and more understandable than RCU trick without adding any
      meaningful overhead.
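      
      An illustrative sketch of the ioc-side strategy (not the exact
      upstream code): use the caller's queue_lock when it matches, try a
      trylock otherwise, and punt leftovers to the workqueue:
      
        spin_lock_irqsave(&ioc->lock, flags);
        while (!hlist_empty(&ioc->cic_list)) {
            struct cfq_io_context *cic =
                hlist_entry(ioc->cic_list.first, struct cfq_io_context,
                            cic_list);
            struct request_queue *this_q = cic->q;
        
            if (this_q == locked_q) {
                cic->exit(cic);     /* caller already holds this q's lock */
                cic->release(cic);
            } else if (spin_trylock(this_q->queue_lock)) {
                cic->exit(cic);
                cic->release(cic);
                spin_unlock(this_q->queue_lock);
            } else {
                /* inverse lock order - do the full dance from the wq */
                schedule_work(&ioc->release_work);
                break;
            }
        }
        spin_unlock_irqrestore(&ioc->lock, flags);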
      
      This still leaves a lot of now-unnecessary RCU logic.  Future patches
      will trim it.
      
      -v2: Vivek pointed out that cic->q was being dereferenced after
           cic->release() was called.  Updated to use local variable @this_q
           instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: fix cic lookup locking · f1a4f4d3
      Authored by Tejun Heo
      * cfq_cic_lookup() may be called without queue_lock, and multiple
        tasks can execute it simultaneously for the same shared ioc.
        Nothing prevents them from racing each other and trying to drop the
        same dead cic entry multiple times.
      
      * smp_wmb() in cfq_exit_cic() doesn't really do anything, and nothing
        prevents cfq_cic_lookup() from seeing a stale cic->key.  This
        usually doesn't blow up because, by the time a cic is exited, all
        requests have been drained and new requests are terminated before
        going through the elevator.  However, it can still be triggered by
        the plug merge path, which doesn't grab queue_lock and thus can't
        check the DEAD state reliably.
      
      This patch updates lookup locking such that:
      
      * Lookup is always performed under queue_lock.  This doesn't add any
        more locking.  The only issue is cfq_allow_merge(), which can be
        called from the plug merge path without holding any lock.  For now,
        this is worked around by using the cic of the request to merge
        into, which is guaranteed to have the same ioc.  For the longer
        term, I think it would be best to separate out the plug merge
        method from the regular one.
      
      * Spurious ioc->lock locking around cic lookup hint assignment is
        dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: fix race condition in cic creation path and tighten locking · 216284c3
      Authored by Tejun Heo
      cfq_get_io_context() would fail if multiple tasks raced to insert
      cic's for the same association.  This patch restructures
      cfq_get_io_context() such that the slow-path insertion race is
      handled properly.
      
      Note that the restructuring also makes cfq_get_io_context() be called
      under queue_lock and perform both ioc and cfqd insertions while
      holding both the ioc and queue locks.  This is part of ongoing
      locking tightening and will be used to simplify synchronization
      rules.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move ioc ioprio/cgroup changed handling to cic · dc86900e
      Authored by Tejun Heo
      ioprio/cgroup changes were handled by marking the changed state in
      the ioc and, on the following access to the ioc, performing
      RCU-protected iteration through all cic's, grabbing the matching
      queue_lock.
      
      This patch moves the changed state to each cic.  When the ioprio or
      cgroup changes, the respective bit is set on all cic's of the ioc,
      and when each of those cic's (not the ioc) is accessed, the change is
      applied for that specific ioc-queue pair.
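      
      The setter side, roughly (a sketch; each cic carries its own changed
      bits):
      
        static void ioc_set_changed(struct io_context *ioc, int which)
        {
            struct cfq_io_context *cic;
            struct hlist_node *n;
        
            hlist_for_each_entry(cic, n, &ioc->cic_list, cic_list)
                set_bit(which, &cic->changed);
        }
        
        void ioc_ioprio_changed(struct io_context *ioc, int ioprio)
        {
            unsigned long flags;
        
            spin_lock_irqsave(&ioc->lock, flags);
            ioc->ioprio = ioprio;
            ioc_set_changed(ioc, CIC_IOPRIO_CHANGED);
            spin_unlock_irqrestore(&ioc->lock, flags);
        }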
      
      This also fixes the following two race conditions between setting and
      clearing of changed states.
      
      * Missing barrier between assign/load of ioprio and ioprio_changed
        allowed applying old ioprio.
      
      * Change requests could happen between application of change and
        clearing of changed variables.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: misc updates to cfq_io_context · 283287a5
      Authored by Tejun Heo
      Make the following changes to prepare for ioc/cic management cleanup.
      
      * Add cic->q so that ioc can determine the associated queue without
        querying cfq.  This will eventually replace ->key.
      
      * Factor out cfq_release_cic() from cic_free_func().  This function
        assumes that the caller handled locking.
      
      * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it
        take only @cic.
      
      * Restructure cfq_cic_link() for future updates.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: make ioc get/put interface more conventional and fix race on allocation · 6e736be7
      Authored by Tejun Heo
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from the local task while the latter can be called from
      a different task.  The synchronization between them is peculiar and
      dubious.
      
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
        assignment.
      
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  That is, if ioc itself were
        being dereferenced ("ioc->xxx"), it would mean something (not sure
        what, though), but as the code currently stands, the dependent read
        barrier is a noop.
      
      As only one of the two test-assignment sequences is task_lock()
      protected, task_lock() can't do much about the race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() from each
      allocating its own ioc for the same task and overwriting the other's.
      
      Also, set_task_ioprio() can race with exiting task and create a new
      ioc after exit_io_context() is finished.
      
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub structures without impacting anything.
      
      This patch updates ioc get/put so that it becomes more conventional.
      
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to the ioc of another
        task.  On return, the caller has an explicit reference to the
        object which should be put using put_io_context() afterwards (see
        the sketch after this list).
      
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
        patch.
      
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
        specification.
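      
      A sketch of the new accessor's shape (create_task_io_context() is the
      assumed name of the slow-path creator, which returns 0 on success):
      
        struct io_context *get_task_io_context(struct task_struct *task,
                                               gfp_t gfp_flags, int node)
        {
            struct io_context *ioc;
        
            might_sleep_if(gfp_flags & __GFP_WAIT);
        
            do {
                task_lock(task);
                ioc = task->io_context;
                if (likely(ioc)) {
                    get_io_context(ioc);    /* caller now owns a ref */
                    task_unlock(task);
                    return ioc;
                }
                task_unlock(task);
            } while (!create_task_io_context(task, gfp_flags, node));
        
            return NULL;    /* PF_EXITING set - creation refused */
        }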
      
      -v2: Vivek spotted contamination from debug patch.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, cfq: move cfqd->cic_index to q->id · a73f730d
      Authored by Tejun Heo
      cfq allocates a per-queue id using ida and uses it to index the cic
      radix tree from io_context.  Move it to q->id, allocated on queue
      init and freed on queue release.  This simplifies cfq a bit and will
      allow further improvements to io context life-cycle management.
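      
      A sketch of the mechanics (following the patch description;
      ida_simple_get()/ida_simple_remove() are the stock ida helpers):
      
        static DEFINE_IDA(blk_queue_ida);
        
        /* on queue init, e.g. in blk_alloc_queue_node() */
        q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
        if (q->id < 0)
            goto fail_q;
        
        /* on queue release */
        ida_simple_remove(&blk_queue_ida, q->id);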
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 02 Dec 2011, 1 commit
    • cfq-iosched: fix cfq_cic_link() race condition · 5eb46851
      Authored by Yasuaki Ishimatsu
      cfq_cic_link() has a race condition.  When processes sharing an ioc
      issue I/O to the same block device simultaneously, cfq_cic_link()
      sometimes returns -EEXIST.  The race condition can stop I/O through
      the following steps:
      
      step  1: Process A: Issue an I/O to /dev/sda
      step  2: Process A: Get an ioc (iocA here) in get_io_context() which
      		    is not yet linked with a cic for the device
      step  3: Process A: Get a new cic for the device (cicA here) in
      		    cfq_alloc_io_context()
      
      step  4: Process B: Issue an I/O to /dev/sda
      step  5: Process B: Get iocA in get_io_context() since process A and B share the
      		    same ioc
      step  6: Process B: Get a new cic for the device (cicB here) in
      		    cfq_alloc_io_context() since iocA has not been linked with a
      		    cic for the device yet
      
      step  7: Process A: Link cicA to iocA in cfq_cic_link()
      step  8: Process A: Dispatch I/O to driver and finish it
      
      step  9: Process B: Try to link cicB to iocA in cfq_cic_link()
      		    But it fails with the "cfq: cic link failed!" kernel
      		    message, since iocA was already linked with cicA at
      		    step 7.
      step 10: Process B: Wait for the I/O to finish in get_request_wait()
      		    The function does not wake up when there is no I/O
      		    to the device.
      
      When cfq_cic_link() returns -EEXIST, it means the ioc has already
      been linked with a cic.  So when cfq_cic_link() returns -EEXIST,
      retry cfq_cic_lookup().
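      
      The retry, in outline (a simplified sketch of the fix in
      cfq_get_io_context()):
      
        retry:
            cic = cfq_cic_lookup(cfqd, ioc);
            if (cic)
                goto out;
        
            cic = cfq_alloc_io_context(cfqd, gfp_mask);
            if (!cic)
                goto err;
        
            ret = cfq_cic_link(cfqd, ioc, cic, gfp_mask);
            if (ret == -EEXIST) {
                /* another task linked a cic meanwhile - look it up again */
                cfq_cic_free(cic);
                goto retry;
            } else if (ret)
                goto err_free;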
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 30 Nov 2011, 1 commit
  10. 23 Aug 2011, 1 commit
  11. 19 Aug 2011, 1 commit
    • Revert "cfq: Remove special treatment for metadata rqs." · b53d1ed7
      Authored by Jens Axboe
      We have had a kernel build regression since 3.1-rc1, about a 10%
      regression.  The kernel source is on an ext3 filesystem.
      Alex Shi bisected it to commit:
      commit a07405b7
      Author: Justin TerAvest <teravest@google.com>
      Date:   Sun Jul 10 22:09:19 2011 +0200
      
          cfq: Remove special treatment for metadata rqs.
      
      Apparently this is caused by the lack of metadata preemption, since
      ext3/ext4 do use READ_META.  I didn't see a way to fix the issue, so
      I suggest reverting the patch.
      
      This reverts commit a07405b7.
      
      Reported-by: Alex Shi <alex.shi@intel.com>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  12. 02 Aug 2011, 1 commit
    • cfq-iosched: Reduce linked group count upon group destruction · a5395b83
      Authored by Vivek Goyal
      CFQ keeps track of the number of groups which are linked on
      blkcg->blkg_list.  This is useful to avoid races between the queue
      exit and cgroup exit code paths.  So if, at request queue exit time,
      the linked group count is not zero, that means there are some groups
      out there yet to be deleted under an RCU read period, and the queue
      exit code should wait for an RCU period.
      
      In my previous patch I forgot to decrement the linked group count.
      So in its current form, nr_blkcg_linked_grps is always non-zero and
      we will always wait one RCU period (if BLK_CGROUP=y).  The side
      effect of this is that it can increase boot time.  I am surprised
      nobody complained so far.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  13. 12 Jul 2011, 4 commits
    • CFQ: add think time check for group · 7700fc4f
      Authored by Shaohua Li
      Currently, when the last queue of a group has no request, we don't
      expire the queue, in the hope that a request from the group comes
      soon, so the group doesn't miss its share.  But if the think time is
      big, the assumption isn't correct and we just waste bandwidth.  In
      such a case, we don't idle.
      
      [global]
      runtime=30
      direct=1
      
      [test1]
      cgroup=test1
      cgroup_weight=1000
      rw=randread
      ioengine=libaio
      size=500m
      runtime=30
      directory=/mnt
      filename=file1
      thinktime=9000
      
      [test2]
      cgroup=test2
      cgroup_weight=1000
      rw=randread
      ioengine=libaio
      size=500m
      runtime=30
      directory=/mnt
      filename=file2
      
      	patched		base
      test1	64k		39k
      test2	548k		540k
      total	604k		578k
      
      group1 gets much better throughput because it waits less time.
      
      To check whether the patch changes the behavior of a queue without
      think time, I also tried giving test1 a 2ms think time or no think
      time.  The test result is stable; the throughput doesn't change with
      or without the patch.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • CFQ: add think time check for service tree · f5f2b6ce
      Authored by Shaohua Li
      Currently, when the last queue of a service tree has no request, we
      don't expire the queue, in the hope that a request from the service
      tree comes soon, so the service tree doesn't miss its share.  But if
      the think time is big, the assumption isn't correct and we just waste
      bandwidth.  In such a case, we don't idle.
      
      [global]
      runtime=10
      direct=1
      
      [test1]
      rw=randread
      ioengine=libaio
      size=500m
      directory=/mnt
      filename=file1
      thinktime=9000
      
      [test2]
      rw=read
      ioengine=libaio
      size=1G
      directory=/mnt
      filename=file2
      
      	patched		base
      test1	41k/s		33k/s
      test2	15868k/s	15789k/s
      total	15902k/s	15817k/s
      
      A slight improvement overall.
      
      To check whether the patch changes the behavior of a queue without
      think time, I also tried giving test1 a 2ms think time or no think
      time.  The test has variation even without the patch, but the average
      throughput doesn't change with or without the patch.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • CFQ: move think time check variables to a separate struct · 383cd721
      Authored by Shaohua Li
      Move the variables used for the think time check into a separate
      struct.  This is to prepare for adding think time checks for the
      service tree and group.  No functional change.
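      
      The struct, roughly as introduced (a sketch):
      
        struct cfq_ttime {
            unsigned long last_end_request; /* completion time of last req */
        
            unsigned long ttime_total;      /* sum of measured think times */
            unsigned long ttime_samples;    /* number of samples */
            unsigned long ttime_mean;       /* running mean think time */
        };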
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • fixlet: Remove fs_excl from struct task. · 4aede84b
      Authored by Justin TerAvest
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Let's kill it.
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  14. 11 Jul 2011, 1 commit
  15. 27 Jun 2011, 2 commits
  16. 14 Jun 2011, 1 commit
  17. 13 Jun 2011, 1 commit
  18. 06 Jun 2011, 1 commit