1. 07 3月, 2012 3 次提交
  2. 08 2月, 2012 2 次提交
    • T
      block: don't call elevator callbacks for plug merges · 07c2bd37
      Tejun Heo 提交于
      Plug merge calls two elevator callbacks outside queue lock -
      elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
      attempt_plug_merge() suggests that elevator is guaranteed to be there
      through the existing request on the plug list, nothing prevents plug
      merge from calling into dying or initializing elevator.
      
      For regular merges, bypass ensures elvpriv count to reach zero, which
      in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
      from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
      the requests haven't gone through elevator insertion yet, it doesn't
      have SOFTBARRIER set allowing merges on a bypassed queue.
      
      This, for example, leads to the following crash during elevator
      switch.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       PGD 112cbc067 PUD 115d5c067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU 1
       Modules linked in: deadline_iosched
      
       Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
       RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
       RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
       RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
       RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
       R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
       FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
       Stack:
        ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
        ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
        ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
       Call Trace:
        [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
        [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
        [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
        [<ffffffff81396a5a>] generic_make_request+0xca/0x100
        [<ffffffff81396b04>] submit_bio+0x74/0x100
        [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
        [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff811986b2>] do_sync_read+0xe2/0x120
        [<ffffffff81199345>] vfs_read+0xc5/0x180
        [<ffffffff81199501>] sys_read+0x51/0x90
        [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
      
      There are multiple ways to fix this including making plug merge check
      ELVPRIV; however,
      
      * Calling into elevator outside queue lock is confusing and
        error-prone.
      
      * Requests on plug list aren't known to the elevator.  They aren't on
        the elevator yet, so there's no elevator specific state to update.
      
      * Given the nature of plug merges - collecting bio's for the same
        purpose from the same issuer - elevator specific restrictions aren't
        applicable.
      
      So, simply don't call into elevator methods from plug merge by moving
      elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
      using blk_try_merge() in attempt_plug_merge().
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      
      Note that this makes per-cgroup merged stats skip plug merging.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      07c2bd37
    • T
      block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Tejun Heo 提交于
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      bool.
      
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      050c8ea8
  3. 07 2月, 2012 1 次提交
    • T
      block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo 提交于
      put_io_context() performed a complex trylock dancing to avoid
      deferring ioc release to workqueue.  It was also broken on UP because
      trylock was always assumed to succeed which resulted in unbalanced
      preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      11a3122f
  4. 19 1月, 2012 1 次提交
    • S
      block: fix NULL icq_cache reference · 05c30b95
      Shaohua Li 提交于
      Vivek reported a kernel crash:
      [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
      [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [   94.218004] CPU 0
      [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
      [   94.218004]
      [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
      [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
      [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
      [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
      [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
      [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
      [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
      [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
      [   94.218004] Stack:
      [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
      [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
      [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
      [   94.218004] Call Trace:
      [   94.218004]  <IRQ>
      [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
      [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
      [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
      [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
      [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
      [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
      [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
      [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
      [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
      [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
      [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
      
      Once a queue is quiesced, it's not supposed to have any elvpriv data or
      icq's, and elevator switching depends on that.  Request alloc path
      followed the rule for elvpriv data but forgot apply it to icq's
      leading to the following crash during elevator switch. Fix it by not
      allocating icq's if ELVPRIV is not set for the request.
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      Tested-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05c30b95
  5. 16 12月, 2011 1 次提交
    • T
      block: don't kick empty queue in blk_drain_queue() · 4eabc941
      Tejun Heo 提交于
      While probing, fd sets up queue, probes hardware and tears down the
      queue if probing fails.  In the process, blk_drain_queue() kicks the
      queue which failed to finish initialization and fd is unhappy about
      that.
      
        floppy0: no floppy controllers found
        ------------[ cut here ]------------
        WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
        Hardware name: To Be Filled By O.E.M.
        VFS: do_fd_request called on non-open device
        Modules linked in:
        Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2b #2
        Call Trace:
         [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
         [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
         [<ffffffff813d657f>] do_fd_request+0xbf/0xd0
         [<ffffffff81322b95>] blk_drain_queue+0x65/0x80
         [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
         [<ffffffff818a809d>] floppy_init+0xdeb/0xe28
         [<ffffffff818a72b2>] ? daring+0x6b/0x6b
         [<ffffffff810002af>] do_one_initcall+0x3f/0x170
         [<ffffffff81884b34>] kernel_init+0x9d/0x11e
         [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
         [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
         [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
         [<ffffffff815dbb10>] ? gs_change+0xb/0xb
      
      Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
      something on it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NRalf Hildebrandt <Ralf.Hildebrandt@charite.de>
      Reported-by: NWu Fengguang <fengguang.wu@intel.com>
      Tested-by: NSergei Trofimovich <slyich@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4eabc941
  6. 14 12月, 2011 9 次提交
    • T
      block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Tejun Heo 提交于
      Now block layer knows everything necessary to create and associate
      icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
      get_request() such that, if elevator_type->icq_size is set, requests
      are automatically associated with their matching icq's before
      elv_set_request().  io_context reference is also managed by block core
      on request alloc/free.
      
      * Only ioprio/cgroup changed handling remains from cfq_get_cic().
        Collapsed into cfq_set_request().
      
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect of queue kicking
        achieved was possibily accelerating queue processing, this change
        shouldn't be noticeable.
      
        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icq is unbound and thus mempool can't be the
        solution either.  This effectively adds allocation dependency on
        memory free path and thus possibility of deadlock.
      
        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has icq.
      
        However, this is still possible especially if elevator is being
        switched under high memory pressure, so we better get it fixed.
        Probably the only solution is just bypassing elevator and appending
        to dispatch queue on any elevator allocation failure.
      
      * Comment added to explain how icq's are managed and synchronized.
      
      This completes cleanup of io_context interface.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f1f8cc94
    • T
      block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Tejun Heo 提交于
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      
      * Move cfqd->icq_list to request_queue->icq_list
      
      * Make request explicitly point to icq instead of through elevator
        private data.  ->elevator_private[3] is replaced with sub struct elv
        which contains icq pointer and priv[2].  cfq is updated accordingly.
      
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a612fddf
    • T
      block, cfq: replace current_io_context() with create_io_context() · f2dbd76a
      Tejun Heo 提交于
      When called under queue_lock, current_io_context() triggers lockdep
      warning if it hits allocation path.  This is because io_context
      installation is protected by task_lock which is not IRQ safe, so it
      triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
      deadlock warning.
      
      Given the restriction, accessor + creator rolled into one doesn't work
      too well.  Drop current_io_context() and let the users access
      task->io_context directly inside queue_lock combined with explicit
      creation using create_io_context().
      
      Future ioc updates will further consolidate ioc access and the create
      interface will be unexported.
      
      While at it, relocate ioc internal interface declarations in blk.h and
      add section comments before and after.
      
      This patch does not introduce functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f2dbd76a
    • T
      block: misc updates to blk_get_queue() · 09ac46c4
      Tejun Heo 提交于
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      
      * Make sure the caller checks for the return value.
      
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      09ac46c4
    • T
      block, cfq: move cfqd->cic_index to q->id · a73f730d
      Tejun Heo 提交于
      cfq allocates per-queue id using ida and uses it to index cic radix
      tree from io_context.  Move it to q->id and allocate on queue init and
      free on queue release.  This simplifies cfq a bit and will allow for
      further improvements of io context life-cycle management.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a73f730d
    • T
      block: add missing blk_queue_dead() checks · 8ba61435
      Tejun Heo 提交于
      blk_insert_cloned_request(), blk_execute_rq_nowait() and
      blk_flush_plug_list() either didn't check whether the queue was dead
      or did it without holding queue_lock.  Update them so that dead state
      is checked while holding queue_lock.
      
      AFAICS, this plugs all holes (requeue doesn't matter as the request is
      transitioning atomically from in_flight to queued).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8ba61435
    • T
      block: fix drain_all condition in blk_drain_queue() · 481a7d64
      Tejun Heo 提交于
      When trying to drain all requests, blk_drain_queue() checked only
      q->rq.count[]; however, this only tracks REQ_ALLOCED requests.  This
      patch updates blk_drain_queue() such that it looks at all the counters
      and queues so that request_queue is actually empty on completion.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      481a7d64
    • T
      block: add blk_queue_dead() · 34f6055c
      Tejun Heo 提交于
      There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
      macro and use it.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      34f6055c
    • T
      block, sx8: kill blk_insert_request() · 1ba64ede
      Tejun Heo 提交于
      The only user left for blk_insert_request() is sx8 and it can be
      trivially switched to use blk_execute_rq_nowait() - special requests
      aren't included in io stat and sx8 doesn't use block layer tagging.
      Switch sx8 and kill blk_insert_requeset().
      
      This patch doesn't introduce any functional difference.
      
      Only compile tested.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Garzik <jgarzik@pobox.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1ba64ede
  7. 23 11月, 2011 1 次提交
    • M
      block: initialize request_queue's numa node during · 5151412d
      Mike Snitzer 提交于
      struct request_queue is allocated with __GFP_ZERO so its "node" field is
      zero before initialization.  This causes an oops if node 0 is offline in
      the page allocator because its zonelists are not initialized.  From Dave
      Young's dmesg:
      
      	SRAT: Node 1 PXM 2 0-d0000000
      	SRAT: Node 1 PXM 2 100000000-330000000
      	SRAT: Node 0 PXM 1 330000000-630000000
      	Initmem setup node 1 0000000000000000-000000000affb000
      	...
      	Built 1 zonelists in Node order, mobility grouping on.
      	...
      	BUG: unable to handle kernel paging request at 0000000000001c08
      	IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870
      
      and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
      zonelist->_zonerefs.
      
      The fix is to initialize q->node at the time of allocation so the correct
      node is passed to the slab allocator later.
      
      Since blk_init_allocated_queue_node() is no longer needed, merge it with
      blk_init_allocated_queue().
      
      [rientjes@google.com: changelog, initializing q->node]
      Cc: stable@vger.kernel.org [2.6.37+]
      Reported-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Tested-by: NDave Young <dyoung@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5151412d
  8. 16 11月, 2011 2 次提交
  9. 04 11月, 2011 1 次提交
    • T
      block: don't call blk_drain_queue() if elevator is not up · 6dd9ad7d
      Tejun Heo 提交于
      blk_cleanup_queue() may be called before elevator is set up on a
      queue which triggers the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       ...
       Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
       Bochs Bochs
       RIP: 0010:[<ffffffff8125a69c>]  [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       ...
       Call Trace:
        [<ffffffff8125da92>] blk_drain_queue+0x42/0x70
        [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
        [<ffffffff81469640>] md_free+0x50/0x70
        [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
        [<ffffffff81270d56>] kref_put+0x36/0xa0
        [<ffffffff8126f2b7>] kobject_put+0x27/0x60
        [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
        [<ffffffff81083450>] process_one_work+0x100/0x3b0
        [<ffffffff8108527f>] worker_thread+0x15f/0x3a0
        [<ffffffff81089937>] kthread+0x87/0x90
        [<ffffffff81621834>] kernel_thread_helper+0x4/0x10
      
      Fix it by making blk_cleanup_queue() check whether q->elevator is set
      up before invoking blk_drain_queue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6dd9ad7d
  10. 24 10月, 2011 2 次提交
    • J
      blk-flush: move the queue kick into · e67b77c7
      Jeff Moyer 提交于
      A dm-multipath user reported[1] a problem when trying to boot
      a kernel with commit 4853abaa
      (block: fix flush machinery for stacking drivers with differring
      flush flags) applied.  It turns out that an empty flush request
      can be sent into blk_insert_flush.  When the BUG_ON was fixed
      to allow for this, I/O on the underlying device would stall.  The
      reason is that blk_insert_cloned_request does not kick the queue.
      In the aforementioned commit, I had added a special case to
      kick the queue if data was sent down but the queue flags did
      not require a flush.  A better solution is to push the queue
      kick up into blk_insert_cloned_request.
      
      This patch, along with a follow-on which fixes the BUG_ON, fixes
      the issue reported.
      
      [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.htmlReported-by: NChristophe Saout <christophe@saout.de>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      
      Stable note: 3.1
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e67b77c7
    • T
      block: Remove the control of complete cpu from bio. · 9562ad9a
      Tao Ma 提交于
      bio originally has the functionality to set the complete cpu, but
      it is broken.
      
      Chirstoph said that "This code is unused, and from the all the
      discussions lately pretty obviously broken.  The only thing keeping
      it serves is creating more confusion and possibly more bugs."
      
      And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
      with leaving cpu control to the request based drivers, they are the
      only ones that can toggle the setting anyway".
      
      So this patch tries to remove all the work of controling complete cpu
      from a bio.
      
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9562ad9a
  11. 19 10月, 2011 6 次提交
    • T
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo 提交于
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
        released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c9a929dd
    • T
      block: drop @tsk from attempt_plug_merge() and explain sync rules · bd87b589
      Tejun Heo 提交于
      attempt_plug_merge() accesses elevator without holding queue_lock and
      may call into ->elevator_bio_merge_fn().  The elvator is guaranteed to
      be valid because it's accessed iff the plugged list has requests and
      elevator is never exited with live requests, so as long as the
      elevator method can deal with unlocked access, this is safe.
      
      Explain the sync rules around attempt_plug_merge() and drop the
      unnecessary @tsk parameter.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bd87b589
    • T
      block: make get_request[_wait]() fail if queue is dead · da8303c6
      Tejun Heo 提交于
      Currently get_request[_wait]() allocates request whether queue is dead
      or not.  This patch makes get_request[_wait]() return NULL if @q is
      dead.  blk_queue_bio() is updated to fail the submitted bio if request
      allocation fails.  While at it, add docbook comments for
      get_request[_wait]().
      
      Note that the current code has rather unclear (there are spurious DEAD
      tests scattered around) assumption that the owner of a queue
      guarantees that no request travels block layer if the queue is dead
      and this patch in itself doesn't change much; however, this will allow
      fixing the broken assumption in the next patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      da8303c6
    • T
      block: reorganize throtl_get_tg() and blk_throtl_bio() · bc16a4f9
      Tejun Heo 提交于
      blk_throtl_bio() and throtl_get_tg() have rather unusual interface.
      
      * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
        and drops queue_lock in the latter case.  Different locking context
        depending on return value is error-prone and DEAD state is scheduled
        to be protected by queue_lock anyway.  Move DEAD check inside
        queue_lock and return valid tg or NULL.
      
      * blk_throtl_bio() indicates return status both with its return value
        and in/out param **@bio.  The former is used to indicate whether
        queue is found to be dead during throtl processing.  The latter
        whether the bio is throttled.
      
        There's no point in returning DEAD check result from
        blk_throtl_bio().  The queue can die after blk_throtl_bio() is
        finished but before make_request_fn() grabs queue lock.
      
        Make it take *@bio instead and return boolean result indicating
        whether the request is throttled or not.
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bc16a4f9
    • T
      block: reorganize queue draining · e3c78ca5
      Tejun Heo 提交于
      Reorganize queue draining related code in preparation of queue exit
      changes.
      
      * Factor out actual draining from elv_quiesce_start() to
        blk_drain_queue().
      
      * Make elv_quiesce_start/end() responsible for their own locking.
      
      * Replace open-coded ELVSWITCH clearing in elevator_switch() with
        elv_quiesce_end().
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e3c78ca5
    • T
      block: pass around REQ_* flags instead of broken down booleans during request alloc/free · 75eb6c37
      Tejun Heo 提交于
      blk_alloc_request() and freed_request() take different combinations of
      REQ_* @flags, @priv and @is_sync when @flags is superset of the latter
      two.  Make them take @flags only.  This cleans up the code a bit and
      will ease updating allocation related REQ_* flags.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      75eb6c37
  12. 28 9月, 2011 1 次提交
    • H
      block: Free queue resources at blk_release_queue() · 777eb1bf
      Hannes Reinecke 提交于
      A kernel crash is observed when a mounted ext3/ext4 filesystem is
      physically removed. The problem is that blk_cleanup_queue() frees up
      some resources eg by calling elevator_exit(), which are not checked for
      in normal operation. So we should rather move these calls to the
      destructor function blk_release_queue() as at that point all remaining
      references are gone. However, in doing so we have to ensure that any
      externally supplied queue_lock is disconnected as the driver might free
      up the lock after the call of blk_cleanup_queue(),
      Signed-off-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      777eb1bf
  13. 21 9月, 2011 1 次提交
    • S
      block: document blk-plug · 75df7136
      Suresh Jayaraman 提交于
      Thus spake Andrew Morton:
      
      "And I have the usual maintainability whine.  If someone comes up to
      vmscan.c and sees it calling blk_start_plug(), how are they supposed to
      work out why that call is there?  They go look at the blk_start_plug()
      definition and it is undocumented.  I think we can do better than this?"
      
      Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens
      Axboe and from an earlier attempt by Shaohua Li to document blk-plug.
      
      [akpm@linux-foundation.org: grammatical and spelling tweaks]
      Signed-off-by: NSuresh Jayaraman <sjayaraman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      75df7136
  14. 15 9月, 2011 1 次提交
    • C
      block: refactor generic_make_request · 27a84d54
      Christoph Hellwig 提交于
      Move all the checks performed on a bio into a new helper, and call it as
      soon as bio is submitted even if it is a re-submission from ->make_request.
      
      We explicitly mark the new helper as beeing non-inlined as the stack
      usage for printing the block device name in the failure case is quite
      high and this a patch where we have to be extremely conservative about
      stack usage.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27a84d54
  15. 12 9月, 2011 3 次提交
  16. 24 8月, 2011 2 次提交
  17. 16 8月, 2011 1 次提交
    • J
      block: fix flush machinery for stacking drivers with differring flush flags · 4853abaa
      Jeff Moyer 提交于
      Commit ae1b1539, block: reimplement
      FLUSH/FUA to support merge, introduced a performance regression when
      running any sort of fsyncing workload using dm-multipath and certain
      storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
      dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
      that dm-multipath always advertised flush+fua support, and passed
      commands on down the stack, where those flags used to get stripped off.
      The above commit changed that behavior:
      
      static inline struct request *__elv_next_request(struct request_queue *q)
      {
              struct request *rq;
      
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
                      }
      
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
      {
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
              REQ_FUA);
              unsigned skip = 0;
      ...
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
      		if (!has_fua)
      			rq->cmd_flags &= ~REQ_FUA;
      	        return rq;
      	}
      
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
      
      The agreed upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution, and
      make it function as designed.
      
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      appreciated.
      
      Cheers,
      Jeff
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4853abaa
  18. 04 8月, 2011 1 次提交
    • A
      fault-injection: add ability to export fault_attr in arbitrary directory · dd48c085
      Akinobu Mita 提交于
      init_fault_attr_dentries() is used to export fault_attr via debugfs.
      But it can only export it in debugfs root directory.
      
      Per Forlin is working on mmc_fail_request which adds support to inject
      data errors after a completed host transfer in MMC subsystem.
      
      The fault_attr for mmc_fail_request should be defined per mmc host and
      export it in debugfs directory per mmc host like
      /sys/kernel/debug/mmc0/mmc_fail_request.
      
      init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
      introduces fault_create_debugfs_attr() which is able to create a
      directory in the arbitrary directory and replace
      init_fault_attr_dentries().
      
      [akpm@linux-foundation.org: extraneous semicolon, per Randy]
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Tested-by: NPer Forlin <per.forlin@linaro.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd48c085
  19. 27 7月, 2011 1 次提交