1. 09 Nov, 2013 2 commits
    • A
      block: Enable sysfs nomerge control for I/O requests in the plug list · 23779fbc
      Alireza Haghdoost committed
      This patch enables the sysfs nomerge setting to control I/O request
      merging in the plug list. While this control has been implemented for
      the request queue, it was omitted from the plug list. As a result, the
      block layer merges requests (or attempts to merge them) even if
      merging has been disabled by setting the sysfs nomerge parameter to 2.
      
      This limitation directly affects the functionality of the io_submit()
      system call. The system call lets a user submit a batch of I/O
      requests from user space using the struct iocb **ios input argument.
      However, the unconditional merging in the plug list can merge these
      requests together down the road. Therefore, there is no way to
      distinguish between an application sending a batch of sequential I/Os
      and an application sending one big I/O. Ultimately, all requests
      generated by the former application are merged within the plug list
      and look like those of the latter.
      
      While merging is a desirable feature that improves the performance of
      the I/O subsystem for some applications, it is not useful at all for
      other applications such as ours.
      Signed-off-by: Alireza Haghdoost <alireza@cs.umn.edu>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      
      Coding style modified.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      23779fbc
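The sysfs control described in this commit can be exercised from user space. The following is a minimal sketch, assuming the attribute is exposed as /sys/block/<dev>/queue/nomerges (the message itself only says "sysfs nomerge parameter"); the helper names nomerges_path() and set_nomerges() are hypothetical and not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sys/block/<dev>/queue/nomerges" into buf (assumed attribute path).
 * Returns 0 on success, -1 if the buffer is too small. */
int nomerges_path(const char *dev, char *buf, size_t len)
{
    int n;

    assert(dev != NULL && buf != NULL);
    n = snprintf(buf, len, "/sys/block/%s/queue/nomerges", dev);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a nomerge value (0, 1 or 2) to the given attribute file.
 * Returns 0 on success, -1 on a bad value or an I/O error
 * (for example, insufficient permissions). */
int set_nomerges(const char *path, int value)
{
    FILE *f;
    int ok;

    if (value < 0 || value > 2)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling nomerges_path("sda", buf, sizeof buf) followed by set_nomerges(buf, 2) as root would request that merging be disabled for that queue, which with this patch should also cover the plug list.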
    • T
      elevator: Fix a race in elevator switching and md device initialization · eb1c160b
      Tomoki Sekiyama committed
      The soft lockup below happens at the boot time of the system using dm
      multipath and the udev rules to switch scheduler.
      
      [  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
      [  356.127001] RIP: 0010:[<ffffffff81072a7d>]  [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
      ...
      [  356.127001] Call Trace:
      [  356.127001]  [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
      [  356.127001]  [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
      [  356.127001]  [<ffffffff810738b2>] del_timer_sync+0x52/0x60
      [  356.127001]  [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
      [  356.127001]  [<ffffffff812c98df>] elevator_exit+0x2f/0x50
      [  356.127001]  [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
      [  356.127001]  [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
      [  356.127001]  [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
      [  356.127001]  [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
      [  356.127001]  [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
      [  356.127001]  [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
      [  356.127001]  [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
      
      This is caused by a race between md device initialization by multipathd
      and a shell script that switches the scheduler using sysfs.
      
       - multipathd:
         SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
         -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
          q->elevator = elevator_alloc(q, e); // not yet initialized
      
       - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
         elevator_switch (in the call trace above)
          struct elevator_queue *old = q->elevator;
          q->elevator = elevator_alloc(q, new_e);
          elevator_exit(old);                 // lockup! (*)
      
       - multipathd: (cont.)
          err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified
      
      (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
      while timer->base == NULL. In this case, as the timer will never be
      initialized, this results in a lockup.
      
      This patch acquires q->sysfs_lock around elevator_init() in
      blk_init_allocated_queue() to provide mutual exclusion between
      initialization of the q->scheduler and switching of the scheduler.
      
      This should fix this bugzilla:
      https://bugzilla.redhat.com/show_bug.cgi?id=902012
      Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb1c160b
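The locking idea in this fix can be illustrated in user space. This is only a sketch under stated assumptions: a pthread mutex stands in for q->sysfs_lock, and struct sk_queue, struct sk_elevator and both functions are hypothetical stand-ins, not the kernel's types or functions. The point it shows is that the initializer publishes the elevator and finishes setting it up inside the same critical section that the switcher must enter before tearing the old elevator down.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures. */
struct sk_elevator {
    const char *name;
    int timer_ready;   /* models "the timer has been initialized" */
};

struct sk_queue {
    pthread_mutex_t sysfs_lock;
    struct sk_elevator *elevator;
};

static struct sk_elevator *sk_elevator_alloc(const char *name)
{
    struct sk_elevator *e = calloc(1, sizeof(*e));

    if (e != NULL)
        e->name = name;
    return e;
}

/* Initialization path: allocate and finish setup under the lock, so no
 * other thread can observe a half-initialized elevator. */
int sk_queue_init_elevator(struct sk_queue *q, const char *name)
{
    int ret = -1;

    pthread_mutex_lock(&q->sysfs_lock);
    q->elevator = sk_elevator_alloc(name);
    if (q->elevator != NULL) {
        q->elevator->timer_ready = 1;
        ret = 0;
    }
    pthread_mutex_unlock(&q->sysfs_lock);
    return ret;
}

/* Switch path: takes the same lock before replacing the old elevator. */
int sk_queue_switch_elevator(struct sk_queue *q, const char *name)
{
    struct sk_elevator *old, *new_e = sk_elevator_alloc(name);

    if (new_e == NULL)
        return -1;
    new_e->timer_ready = 1;

    pthread_mutex_lock(&q->sysfs_lock);
    old = q->elevator;
    /* With the lock held, an existing elevator is always fully set up. */
    assert(old == NULL || old->timer_ready);
    q->elevator = new_e;
    pthread_mutex_unlock(&q->sysfs_lock);

    free(old);
    return 0;
}
```

Without the lock in the initialization path, the switcher could free an elevator whose timer was never set up, which is the situation the call trace shows.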
  2. 08 Nov, 2013 2 commits
    • M
      blk-core: Fix memory corruption if blkcg_init_queue fails · fff4996b
      Mikulas Patocka committed
      If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
      to clean up structures allocated by the backing dev.
      
      ------------[ cut here ]------------
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
      ODEBUG: free active (active state 0) object type: percpu_counter hint:           (null)
      Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
      CPU: 0 PID: 2739 Comm: lvchange Tainted: G        W
      3.10.15-devel #14
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
       0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
       ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
       ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
      Call Trace:
       [<ffffffff813c8fd4>] dump_stack+0x19/0x1b
       [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0
       [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50
       [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250
       [<ffffffff81229a15>] debug_print_object+0x85/0xa0
       [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250
       [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0
       [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0
       [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10
       [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod]
       [<ffffffff81097662>] ? get_lock_stats+0x22/0x70
       [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
       [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520
       [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0
       [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      ---[ end trace 4b5ff0d55673d986 ]---
      ------------[ cut here ]------------
      
      This fix should be backported to stable kernels starting with 2.6.37. Note
      that in the kernels prior to 3.5 the affected code is different, but the
      bug is still there - bdi_init is called and bdi_destroy isn't.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 2.6.37+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fff4996b
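The bug class fixed here, an error path that skips the teardown of an earlier successful step, can be sketched in plain C. The types and functions below (struct fake_queue, fake_bdi_init() and so on) are hypothetical models, not the kernel's blk_alloc_queue_node(); the counters exist only to make the pairing of init and destroy visible.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a queue with two setup stages. */
struct fake_queue {
    int bdi_ready;
    int blkcg_ready;
};

int bdi_inits;      /* number of successful fake_bdi_init() calls */
int bdi_destroys;   /* number of fake_bdi_destroy() calls */

static int fake_bdi_init(struct fake_queue *q)
{
    q->bdi_ready = 1;
    bdi_inits++;
    return 0;
}

static void fake_bdi_destroy(struct fake_queue *q)
{
    q->bdi_ready = 0;
    bdi_destroys++;
}

static int fake_blkcg_init(struct fake_queue *q, int fail)
{
    if (fail)
        return -1;
    q->blkcg_ready = 1;
    return 0;
}

/* Each failure label undoes exactly the stages that already succeeded. */
struct fake_queue *fake_alloc_queue(int fail_blkcg)
{
    struct fake_queue *q = calloc(1, sizeof(*q));

    if (q == NULL)
        return NULL;
    if (fake_bdi_init(q) != 0)
        goto fail_q;
    if (fake_blkcg_init(q, fail_blkcg) != 0)
        goto fail_bdi;   /* the step the buggy code skipped */
    return q;

fail_bdi:
    fake_bdi_destroy(q);
fail_q:
    free(q);
    return NULL;
}
```

Jumping straight to fail_q after a blkcg failure would free the structure while the backing-dev state is still live, which matches the "free active object" warning in the log above.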
    • J
      block: fix race between request completion and timeout handling · 4912aa6c
      Jeff Moyer committed
      crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      
      Pid: 491, comm: scsi_eh_0 Tainted: G        W  ----------------   2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
      RIP: 0010:[<ffffffff8124e424>]  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
      RSP: 0018:ffff881057eefd60  EFLAGS: 00010012
      RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
      RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
      RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
      R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
      FS:  0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
      Stack:
       0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
      <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
      <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
      Call Trace:
       [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
       [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
       [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
       [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
       [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
       [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
       [<ffffffff810908c6>] kthread+0x96/0xa0
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffff81090830>] ? kthread+0x0/0xa0
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
      RIP  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
       RSP <ffff881057eefd60>
      
      The RIP is this line:
              BUG_ON(blk_queued_rq(rq));
      
      After digging through the code, I think there may be a race between the
      request completion and the timer handler running.
      
      A timer is started for each request put on the device's queue (see
      blk_start_request->blk_add_timer).  If the request does not complete
      before the timer expires, the timer handler (blk_rq_timed_out_timer)
      will mark the request complete atomically:
      
      static inline int blk_mark_rq_complete(struct request *rq)
      {
              return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
      }
      
      and then call blk_rq_timed_out.  The latter function will call
      scsi_times_out, which will return one of BLK_EH_HANDLED,
      BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
      returned, blk_clear_rq_complete is called, and blk_add_timer is again
      called to simply wait longer for the request to complete.
      
      Now, if the request happens to complete while this is going on, what
      happens?  Given that we know the completion handler will bail if it
      finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
      handler running after that bit is cleared.  So, from the above
      paragraph, after the call to blk_clear_rq_complete.  If the completion
      sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
      there (I haven't seen this in the cores).  Next, if we get the
      completion before the call to list_add_tail, then the timer will
      eventually fire for an old req, which may either be freed or reallocated
      (there is evidence that this might be the case).  Finally, if the
      completion comes in *after* the addition to the timeout list, I think
      it's harmless.  The request will be removed from the timeout list,
      REQ_ATOM_COMPLETE will be set, and all will be well.
      
      This will only actually explain the coredumps *IF* the request
      structure was freed, reallocated *and* queued before the error handler
      thread had a chance to process it.  That is possible, but it may make
      sense to keep digging for another race.  I think that if this is what
      was happening, we would see other instances of this problem showing up
      as null pointer or garbage pointer dereferences, for example when the
      request structure was not re-used.  It looks like we actually do run
      into that situation in other reports.
      
      This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
      &req->atomic_flags)); from blk_add_timer to the only caller that could
      trip over it (blk_start_request).  It then inverts the calls to
      blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
      the race.  I've boot tested this patch, but nothing more.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4912aa6c
  3. 12 Sep, 2013 1 commit
  4. 24 Aug, 2013 2 commits
  5. 01 Jul, 2013 1 commit
  6. 17 May, 2013 1 commit
  7. 15 May, 2013 1 commit
    • V
      block: queue work on power efficient wq · 695588f9
      Viresh Kumar committed
      The block layer uses workqueues for multiple purposes. There is no real
      dependency of scheduling these on the CPU which scheduled them.
      
      On an idle system, it is observed that an idle CPU wakes up many times
      just to service this work. It would be better if we could schedule it on
      a CPU which the scheduler believes to be the most appropriate one.
      
      This patch replaces normal workqueues with power efficient versions.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      695588f9
  8. 19 Apr, 2013 1 commit
  9. 24 Mar, 2013 2 commits
    • K
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet committed
      Just a little convenience macro. The main reason to add it now is to
      prepare for immutable bio vecs: it will reduce the size of the patch that
      puts bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      f73a1c7d
    • K
      block: Refactor blk_update_request() · f79ea416
      Kent Overstreet committed
      Converts it to use bio_advance(), simplifying it quite a bit in the
      process.
      
      Note that req_bio_endio() now always calls bio_advance() - which means
      it always loops over the biovec, not just on partial completions. Don't
      expect it to affect performance, but worth noting.
      
      Tested it by forcing partial updates, and dumping before and after on
      various bio/bvec fields when doing a partial update.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      f79ea416
  10. 23 Mar, 2013 2 commits
  11. 22 Feb, 2013 1 commit
    • D
      block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong committed
      This provides a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffecfd1a
  12. 14 Jan, 2013 2 commits
    • T
      block: add @req to bio_{front|back}_merge tracepoints · 8c1cf6bb
      Tejun Heo committed
      bio_{front|back}_merge tracepoints report a bio merging into an
      existing request but didn't specify which request the bio is being
      merged into.  Add @req to it.  This makes it impossible to share the
      event template with block_bio_queue - split it out.
      
      @req isn't used or exported to userland at this point and there is no
      userland visible behavior change.  Later changes will make use of the
      extra parameter.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8c1cf6bb
    • T
      block: add missing block_bio_complete() tracepoint · 3a366e61
      Tejun Heo committed
      bio completion didn't kick block_bio_complete TP.  Only dm was
      explicitly triggering the TP on IO completion.  This makes
      block_bio_complete TP useless for tracers which want to know about
      bios, and all other bio based drivers skip generating blktrace
      completion events.
      
      This patch makes all bio completions via bio_endio() generate
      block_bio_complete TP.
      
      * Explicit trace_block_bio_complete() invocation removed from dm and
        the trace point is unexported.
      
      * @rq dropped from trace_block_bio_complete().  bios may fly around
        w/o queue associated.  Verifying and accessing the associated queue
        belongs to TP probes.
      
      * blktrace now gets both request and bio completions.  Make it ignore
        bio completions if request completion path is happening.
      
      This makes all bio based drivers generate blktrace completion events
      properly and makes the block_bio_complete TP actually useful.
      
      v2: With this change, block_bio_complete TP could be invoked on sg
          commands which have bio's with %NULL bi_bdev.  Update TP
          assignment code to check whether bio->bi_bdev is %NULL before
          dereferencing.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Original-patch-by: Namhyung Kim <namhyung@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a366e61
  13. 11 Jan, 2013 1 commit
  14. 15 Dec, 2012 1 commit
  15. 06 Dec, 2012 5 commits
    • B
      block: Make blk_cleanup_queue() wait until request_fn finished · 24faf6f6
      Bart Van Assche committed
      Some request_fn implementations, e.g. scsi_request_fn(), unlock
      the queue lock internally. This may result in multiple threads
      executing request_fn for the same queue simultaneously. Keep
      track of the number of active request_fn calls and make sure that
      blk_cleanup_queue() waits until all active request_fn invocations
      have finished. A block driver may start cleaning up resources
      needed by its request_fn as soon as blk_cleanup_queue() finished,
      so blk_cleanup_queue() must wait for all outstanding request_fn
      invocations to finish.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reported-by: Chanho Min <chanho.min@lge.com>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      24faf6f6
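The bookkeeping described in this commit, counting active request_fn invocations so that cleanup can wait for them, can be sketched in a single-threaded model. All names below are hypothetical, and the real kernel code performs these updates under the queue lock; this sketch only shows the counter discipline, not the locking or the waiting loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a queue that tracks in-flight request_fn calls. */
struct drain_queue {
    int request_fn_active;   /* number of request_fn calls in progress */
    int saw_busy;            /* set by the sample handler, for illustration */
    void (*request_fn)(struct drain_queue *q);
};

/* Cleanup may only release driver resources once no call is in flight. */
int cleanup_may_proceed(const struct drain_queue *q)
{
    return q->request_fn_active == 0;
}

/* Every invocation of request_fn is bracketed by the counter. */
void run_queue_sketch(struct drain_queue *q)
{
    assert(q != NULL && q->request_fn != NULL);
    q->request_fn_active++;
    q->request_fn(q);
    q->request_fn_active--;
}

/* Sample handler: records whether cleanup would have been blocked while
 * it was running. */
void sample_request_fn(struct drain_queue *q)
{
    q->saw_busy = !cleanup_may_proceed(q);
}
```

In the threaded case the cleanup path would poll or sleep until cleanup_may_proceed() holds, which is the waiting behaviour the commit adds to blk_cleanup_queue().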
    • B
      block: Avoid scheduling delayed work on a dead queue · 70460571
      Bart Van Assche committed
      Running a queue must continue after it has been marked dying until
      it has been marked dead. So the function blk_run_queue_async() must
      not schedule delayed work after blk_cleanup_queue() has marked a queue
      dead. Hence add a test for that queue state in blk_run_queue_async()
      and make sure that queue_unplugged() invokes that function with the
      queue lock held. This avoids that the queue state can change after
      it has been tested and before mod_delayed_work() is invoked. Drop
      the queue dying test in queue_unplugged() since it is now
      superfluous: __blk_run_queue() already tests whether or not the
      queue is dead.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      70460571
    • B
      block: Avoid that request_fn is invoked on a dead queue · c246e80d
      Bart Van Assche committed
      A block driver may start cleaning up resources needed by its
      request_fn as soon as blk_cleanup_queue() finished, so request_fn
      must not be invoked after draining finished. This is important
      when blk_run_queue() is invoked without any requests in progress.
      As an example, if blk_drain_queue() and scsi_run_queue() run in
      parallel, blk_drain_queue() may have finished all requests after
      scsi_run_queue() has taken a SCSI device off the starved list but
      before that last function has had a chance to run the queue.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Chanho Min <chanho.min@lge.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c246e80d
    • B
      block: Let blk_drain_queue() caller obtain the queue lock · 807592a4
      Bart Van Assche committed
      Let the caller of blk_drain_queue() obtain the queue lock to improve
      readability of the patch called "Avoid that request_fn is invoked on
      a dead queue".
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      807592a4
    • B
      block: Rename queue dead flag · 3f3299d5
      Bart Van Assche committed
      QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
      stop. After this flag has been set queue draining starts. However,
      during the queue draining phase it is still safe to invoke the
      queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
      flag.
      
      This patch has been generated by running the following command
      over the kernel source tree:
      
      git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
          xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g'      \
              -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g';                \
      sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
          include/linux/blkdev.h;                                       \
      sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
          -e 's/Dead queue/A dying queue/' block/blk-core.c
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3f3299d5
  16. 10 Nov, 2012 1 commit
  17. 26 Oct, 2012 1 commit
    • J
      block: Add blk_rq_pos(rq) to sort rq when plushing · 975927b9
      Jianpeng Ma committed
      My workload is a raid5 array with 16 disks, written to by our
      filesystem in direct-io mode.
      
      Using blktrace, I found the following messages:
      8,16   0     6647     2.453665504  2579  M   W 7493152 + 8 [md0_raid5]
      8,16   0     6648     2.453672411  2579  Q   W 7493160 + 8 [md0_raid5]
      8,16   0     6649     2.453672606  2579  M   W 7493160 + 8 [md0_raid5]
      8,16   0     6650     2.453679255  2579  Q   W 7493168 + 8 [md0_raid5]
      8,16   0     6651     2.453679441  2579  M   W 7493168 + 8 [md0_raid5]
      8,16   0     6652     2.453685948  2579  Q   W 7493176 + 8 [md0_raid5]
      8,16   0     6653     2.453686149  2579  M   W 7493176 + 8 [md0_raid5]
      8,16   0     6654     2.453693074  2579  Q   W 7493184 + 8 [md0_raid5]
      8,16   0     6655     2.453693254  2579  M   W 7493184 + 8 [md0_raid5]
      8,16   0     6656     2.453704290  2579  Q   W 7493192 + 8 [md0_raid5]
      8,16   0     6657     2.453704482  2579  M   W 7493192 + 8 [md0_raid5]
      8,16   0     6658     2.453715016  2579  Q   W 7493200 + 8 [md0_raid5]
      8,16   0     6659     2.453715247  2579  M   W 7493200 + 8 [md0_raid5]
      8,16   0     6660     2.453721730  2579  Q   W 7493208 + 8 [md0_raid5]
      8,16   0     6661     2.453721974  2579  M   W 7493208 + 8 [md0_raid5]
      8,16   0     6662     2.453728202  2579  Q   W 7493216 + 8 [md0_raid5]
      8,16   0     6663     2.453728436  2579  M   W 7493216 + 8 [md0_raid5]
      8,16   0     6664     2.453734782  2579  Q   W 7493224 + 8 [md0_raid5]
      8,16   0     6665     2.453735019  2579  M   W 7493224 + 8 [md0_raid5]
      8,16   0     6666     2.453741401  2579  Q   W 7493232 + 8 [md0_raid5]
      8,16   0     6667     2.453741632  2579  M   W 7493232 + 8 [md0_raid5]
      8,16   0     6668     2.453748148  2579  Q   W 7493240 + 8 [md0_raid5]
      8,16   0     6669     2.453748386  2579  M   W 7493240 + 8 [md0_raid5]
      8,16   0     6670     2.453851843  2579  I   W 7493144 + 104 [md0_raid5]
      8,16   0        0     2.453853661     0  m   N cfq2579 insert_request
      8,16   0     6671     2.453854064  2579  I   W 7493120 + 24 [md0_raid5]
      8,16   0        0     2.453854439     0  m   N cfq2579 insert_request
      8,16   0     6672     2.453854793  2579  U   N [md0_raid5] 2
      8,16   0        0     2.453855513     0  m   N cfq2579 Not idling.st->count:1
      8,16   0        0     2.453855927     0  m   N cfq2579 dispatch_insert
      8,16   0        0     2.453861771     0  m   N cfq2579 dispatched a request
      8,16   0        0     2.453862248     0  m   N cfq2579 activate rq,drv=1
      8,16   0     6673     2.453862332  2579  D   W 7493120 + 24 [md0_raid5]
      8,16   0        0     2.453865957     0  m   N cfq2579 Not idling.st->count:1
      8,16   0        0     2.453866269     0  m   N cfq2579 dispatch_insert
      8,16   0        0     2.453866707     0  m   N cfq2579 dispatched a request
      8,16   0        0     2.453867061     0  m   N cfq2579 activate rq,drv=2
      8,16   0     6674     2.453867145  2579  D   W 7493144 + 104 [md0_raid5]
      8,16   0     6675     2.454147608     0  C   W 7493120 + 24 [0]
      8,16   0        0     2.454149357     0  m   N cfq2579 complete rqnoidle 0
      8,16   0     6676     2.454791505     0  C   W 7493144 + 104 [0]
      8,16   0        0     2.454794803     0  m   N cfq2579 complete rqnoidle 0
      8,16   0        0     2.454795160     0  m   N cfq schedule dispatch
      
      From the messages above, we can see that rq[W 7493144 + 104] and
      rq[W 7493120 + 24] are not merged.
      This is because the bio order is:
        8,16   0     6638     2.453619407  2579  Q   W 7493144 + 8 [md0_raid5]
        8,16   0     6639     2.453620460  2579  G   W 7493144 + 8 [md0_raid5]
        8,16   0     6640     2.453639311  2579  Q   W 7493120 + 8 [md0_raid5]
        8,16   0     6641     2.453639842  2579  G   W 7493120 + 8 [md0_raid5]
      bio(7493144) comes first and bio(7493120) later, so the subsequent
      bios are divided into two parts.
      When flushing the plug list, elv_attempt_insert_merge only supports
      back merges, not front merges.
      So rq[7493120 + 24] can't merge with rq[7493144 + 104].
      
      In my tests, this situation accounts for about 25% of cases in our system.
      With this patch, the situation no longer occurs.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      CC: Shaohua Li <shli@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      975927b9
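Why sorting by start position helps when only back merges are possible can be shown with the two requests from the trace above. This is a simplified user-space sketch: struct fake_rq and flush_plug_sketch() are hypothetical, each request is compared only against the most recently emitted one, and the sketch sorts by position alone.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a request: start sector and length in sectors. */
struct fake_rq {
    unsigned long long pos;
    unsigned int sectors;
};

static int cmp_pos(const void *a, const void *b)
{
    const struct fake_rq *x = a, *y = b;

    return (x->pos > y->pos) - (x->pos < y->pos);
}

/* Flush a plug list in place, applying back merges only: a request is
 * folded into the previous one when it starts exactly where that one ends.
 * If sort_first is nonzero, requests are first ordered by start sector.
 * Returns the number of requests left after merging. */
size_t flush_plug_sketch(struct fake_rq *rqs, size_t n, int sort_first)
{
    size_t out = 0, i;

    assert(rqs != NULL || n == 0);
    if (n == 0)
        return 0;
    if (sort_first)
        qsort(rqs, n, sizeof(*rqs), cmp_pos);

    for (i = 1; i < n; i++) {
        if (rqs[out].pos + rqs[out].sectors == rqs[i].pos)
            rqs[out].sectors += rqs[i].sectors;   /* back merge */
        else
            rqs[++out] = rqs[i];                  /* keep as separate request */
    }
    return out + 1;
}
```

With the trace's order (7493144 + 104 first, then 7493120 + 24) the unsorted flush leaves two requests, while the sorted flush produces a single request 7493120 + 128.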
  18. 21 Sep, 2012 2 commits
    • T
      block: fix request_queue->flags initialization · 60ea8226
      Tejun Heo committed
      A queue newly allocated with blk_alloc_queue_node() has only
      QUEUE_FLAG_BYPASS set.  For request-based drivers,
      blk_init_allocated_queue() is called and q->queue_flags is overwritten
      with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
      initial bypass is still in effect.
      
      In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into q->queue_flags
      instead of overwriting it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      60ea8226
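The difference between overwriting a flags word and OR-ing defaults into it can be shown in a few lines. The bit values and names below are hypothetical placeholders, not the kernel's QUEUE_FLAG_* definitions.

```c
#include <assert.h>

/* Hypothetical bit assignments, for illustration only. */
#define FAKE_FLAG_BYPASS   (1ul << 0)
#define FAKE_FLAG_DEFAULT  ((1ul << 1) | (1ul << 2))

/* Old behaviour: the defaults replace whatever was set at allocation,
 * so the bypass bit is silently lost. */
unsigned long init_flags_overwrite(unsigned long flags)
{
    (void)flags;
    return FAKE_FLAG_DEFAULT;
}

/* Fixed behaviour: the defaults are added and existing bits survive. */
unsigned long init_flags_or(unsigned long flags)
{
    return flags | FAKE_FLAG_DEFAULT;
}
```

Starting from a flags word that has only the bypass bit set, the first function returns a value without it, while the second keeps it alongside the defaults.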
    • T
      block: lift the initial queue bypass mode on blk_register_queue() instead of... · 749fefe6
      Tejun Heo committed
      block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
      
      b82d4b19 ("blkcg: make request_queue bypassing on allocation") made
      request_queues bypassed on allocation to avoid switching bypass mode
      on and off on a queue being initialized.  Some drivers allocate and
      then destroy a lot of queues without fully initializing them, and
      incurring bypass latency overhead on each of them could add up to
      significant overhead.
      
      Unfortunately, blk_init_allocated_queue() is never used by queues of
      bio-based drivers, which means that all bio-based driver queues are in
      bypass mode even after initialization and registration complete
      successfully.
      
      Due to the limited way request_queues are used by bio drivers, this
      problem is hidden pretty well but it shows up when blk-throttle is
      used in combination with a bio-based driver.  Trying to configure
      (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
      indefinitely in blkg_conf_prep() waiting for bypass mode to end.
      
      This patch moves the initial blk_queue_bypass_end() call from
      blk_init_allocated_queue() to blk_register_queue() which is called for
      any userland-visible queues regardless of its type.
      
      I believe this is correct because I don't think there is any block
      driver which needs or wants working elevator and blk-cgroup on a queue
      which isn't visible to userland.  If there are such users, we need a
      different solution.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
      Cc: stable@vger.kernel.org
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      749fefe6
  19. 20 Sep, 2012 3 commits
  20. 09 Sep, 2012 4 commits
  21. 31 Aug, 2012 1 commit
    • Y
      block: rate-limit the error message from failing commands · 37d7b34f
      Yi Zou committed
      When performing a cable pull test with active stress I/O using fio over
      a dual port Intel 82599 FCoE CNA, with 256 LUNs on one port and about 32
      LUNs on the other, the system becomes unusable because scsi-ml is busy
      printing error messages for all the failing commands. I don't believe
      this problem is specific to FCoE, and these commands are failing anyway
      because the link is down (DID_NO_CONNECT), so just rate-limit the
      messages here to solve this issue.
      
      v1->v2: use __ratelimit(), which Tomas Henzl mentioned as the proper way
      to rate-limit per function. However, in this case, the failed I/O gets to
      blk_end_request_err() and then blk_update_request(), which also has to
      be rate-limited, as added in v2 of this patch.
      
      v2->v3: resolved conflict to apply on current 3.6-rc3 upstream tip.
      Signed-off-by: Yi Zou <yi.zou@intel.com>
      Cc: www.Open-FCoE.org <devel@open-fcoe.org>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: <linux-scsi@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      37d7b34f
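The general idea of interval-and-burst rate limiting for log messages can be sketched as below. This is not the kernel's __ratelimit() implementation: struct rl_state, its fields and ratelimit_sketch() are hypothetical, and the current time is passed in explicitly so that the behaviour is deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rate-limit state: allow at most `burst` messages per
 * `interval` time units. */
struct rl_state {
    long interval;   /* window length */
    int burst;       /* messages allowed per window */
    long begin;      /* start of the current window */
    int printed;     /* messages emitted in the current window */
    int missed;      /* messages suppressed so far */
};

/* Returns 1 if the caller may print a message at time `now`, 0 if the
 * message should be suppressed. */
int ratelimit_sketch(struct rl_state *rs, long now)
{
    assert(rs != NULL);

    /* Open a new window once the current one has expired. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    rs->missed++;
    return 0;
}
```

An error path would guard its print call with a check of this kind, so that thousands of failing commands in a burst produce a bounded number of log lines plus a count of suppressed ones.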
  22. 22 Aug, 2012 2 commits
    • T
      workqueue: deprecate __cancel_delayed_work() · 136b5721
      Tejun Heo committed
      Now that cancel_delayed_work() can be safely called from IRQ handlers,
      there's no reason to use __cancel_delayed_work().  Use
      cancel_delayed_work() instead of __cancel_delayed_work() and mark the
      latter deprecated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      136b5721
    • T
      workqueue: use mod_delayed_work() instead of __cancel + queue · e7c2f967
      Tejun Heo committed
      Now that mod_delayed_work() is safe to call from IRQ handlers,
      __cancel_delayed_work() followed by queue_delayed_work() can be
      replaced with mod_delayed_work().
      
      Most conversions are straight-forward except for the following.
      
      * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
        elaborate dance around its delayed_work.  Collapse it such that
        linkwatch_work is queued for immediate execution if LW_URGENT and
        the existing timer is kept otherwise.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      e7c2f967
  23. 31 Jul, 2012 1 commit