1. 22 Feb 2013 (1 commit)
    • block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Committed by Darrick J. Wong
      This provides a band-aid for stable page writes on jbd without
      needing to backport the fixed locking and page-writeback-bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting for writeback to finish (see the sketch
      after this entry).
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffecfd1a
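      A minimal userspace analogue of the snapshot idea (an illustration, not
      the kernel patch; the stable_write() helper is hypothetical): the data is
      copied into a private bounce buffer before the write is issued, so the
      caller may keep modifying the original page without corrupting the
      in-flight I/O.

          #include <stdlib.h>
          #include <string.h>
          #include <unistd.h>

          /* Hypothetical helper: snapshot the page, then write the copy. */
          static ssize_t stable_write(int fd, const void *page, size_t len)
          {
                  void *bounce = malloc(len);     /* bounce buffer */
                  ssize_t ret;

                  if (!bounce)
                          return -1;
                  memcpy(bounce, page, len);      /* snapshot instead of waiting */
                  ret = write(fd, bounce, len);   /* the device only sees the copy */
                  free(bounce);
                  return ret;
          }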
  2. 14 Jan 2013 (2 commits)
    • block: add @req to bio_{front|back}_merge tracepoints · 8c1cf6bb
      Committed by Tejun Heo
      The bio_{front|back}_merge tracepoints report a bio merging into an
      existing request but did not specify which request the bio was being
      merged into.  Add @req to them.  This makes it impossible to share the
      event template with block_bio_queue, so split it out.
      
      @req isn't used or exported to userland at this point and there is no
      userland visible behavior change.  Later changes will make use of the
      extra parameter.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8c1cf6bb
    • block: add missing block_bio_complete() tracepoint · 3a366e61
      Committed by Tejun Heo
      bio completion didn't kick block_bio_complete TP.  Only dm was
      explicitly triggering the TP on IO completion.  This makes
      block_bio_complete TP useless for tracers which want to know about
      bios, and all other bio based drivers skip generating blktrace
      completion events.
      
      This patch makes all bio completions via bio_endio() generate
      block_bio_complete TP.
      
      * Explicit trace_block_bio_complete() invocation removed from dm and
        the trace point is unexported.
      
      * @rq dropped from trace_block_bio_complete().  bios may fly around
        w/o a queue associated.  Verifying and accessing the associated queue
        belongs in the TP probes.
      
      * blktrace now gets both request and bio completions.  Make it ignore
        bio completions if request completion path is happening.
      
      This makes all bio based drivers generate blktrace completion events
      properly and makes the block_bio_complete TP actually useful.
      
      v2: With this change, the block_bio_complete TP could be invoked on sg
          commands whose bios have a %NULL bi_bdev.  Update the TP
          assignment code to check whether bio->bi_bdev is %NULL before
          dereferencing it (see the sketch after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Original-patch-by: Namhyung Kim <namhyung@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a366e61
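      A simplified sketch of the resulting completion path (the real function
      also tracks BIO_UPTODATE and error propagation; treat the structure as
      an approximation, not the verbatim patch):

          void bio_endio(struct bio *bio, int error)
          {
                  /* some sg-originated bios have no bdev; guard before the
                   * queue lookup (the v2 note above) */
                  if (bio->bi_bdev)
                          trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
                                                   bio, error);
                  if (bio->bi_end_io)
                          bio->bi_end_io(bio, error);
          }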
  3. 11 Jan 2013 (1 commit)
  4. 15 Dec 2012 (1 commit)
  5. 06 Dec 2012 (5 commits)
    • block: Make blk_cleanup_queue() wait until request_fn finished · 24faf6f6
      Committed by Bart Van Assche
      Some request_fn implementations, e.g. scsi_request_fn(), unlock
      the queue lock internally. This may result in multiple threads
      executing request_fn for the same queue simultaneously. Keep
      track of the number of active request_fn calls and make sure that
      blk_cleanup_queue() waits until all active request_fn invocations
      have finished. A block driver may start cleaning up resources
      needed by its request_fn as soon as blk_cleanup_queue() has finished,
      so blk_cleanup_queue() must wait for all outstanding request_fn
      invocations to finish (a sketch of the bookkeeping follows this entry).
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reported-by: Chanho Min <chanho.min@lge.com>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      24faf6f6
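      A rough sketch of the bookkeeping (helper name assumed, not verbatim):
      every request_fn invocation is counted, and cleanup loops until the
      count drops to zero.

          static void blk_invoke_request_fn(struct request_queue *q)
          {
                  q->request_fn_active++;
                  q->request_fn(q);        /* may drop and retake queue_lock */
                  q->request_fn_active--;
          }

          /* in blk_cleanup_queue(), after new requests are blocked: */
          while (q->request_fn_active)     /* wait for in-flight request_fn */
                  msleep(10);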
    • block: Avoid scheduling delayed work on a dead queue · 70460571
      Committed by Bart Van Assche
      Running a queue must continue after it has been marked dying until
      it has been marked dead. So the function blk_run_queue_async() must
      not schedule delayed work after blk_cleanup_queue() has marked a queue
      dead. Hence add a test for that queue state in blk_run_queue_async()
      and make sure that queue_unplugged() invokes that function with the
      queue lock held. This prevents the queue state from changing between
      the test and the mod_delayed_work() invocation. Drop
      the queue-dying test in queue_unplugged() since it is now
      superfluous: __blk_run_queue() already tests whether or not the
      queue is dead. (A sketch of the check follows this entry.)
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      70460571
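      A sketch of the resulting check, assuming the post-series flag names
      (an approximation of the upstream code, not a verbatim quote):

          void blk_run_queue_async(struct request_queue *q)
          {
                  /* never schedule delayed work once the queue is dead */
                  if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
                          mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
          }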
    • block: Avoid that request_fn is invoked on a dead queue · c246e80d
      Committed by Bart Van Assche
      A block driver may start cleaning up resources needed by its
      request_fn as soon as blk_cleanup_queue() has finished, so request_fn
      must not be invoked after draining has finished (see the sketch after
      this entry). This is important when blk_run_queue() is invoked without
      any requests in progress.
      As an example, if blk_drain_queue() and scsi_run_queue() run in
      parallel, blk_drain_queue() may have finished all requests after
      scsi_run_queue() has taken a SCSI device off the starved list but
      before that last function has had a chance to run the queue.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Chanho Min <chanho.min@lge.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c246e80d
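      A simplified sketch of the guard (an approximation, not the verbatim
      patch): __blk_run_queue() bails out once the queue is dead, so
      request_fn is never called on a queue whose driver resources may
      already be gone.

          void __blk_run_queue(struct request_queue *q)
          {
                  if (unlikely(blk_queue_dead(q)))
                          return;
                  __blk_run_queue_uncond(q);   /* ultimately calls q->request_fn(q) */
          }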
    • block: Let blk_drain_queue() caller obtain the queue lock · 807592a4
      Committed by Bart Van Assche
      Let the caller of blk_drain_queue() obtain the queue lock to improve
      readability of the patch called "Avoid that request_fn is invoked on
      a dead queue".
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      807592a4
    • block: Rename queue dead flag · 3f3299d5
      Committed by Bart Van Assche
      QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
      stop. After this flag has been set queue draining starts. However,
      during the queue draining phase it is still safe to invoke the
      queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
      flag.
      
      This patch has been generated by running the following command
      over the kernel source tree:
      
      git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
          xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g'      \
              -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g';                \
      sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
          include/linux/blkdev.h;                                       \
      sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
          -e 's/Dead queue/A dying queue/' block/blk-core.c
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chanho Min <chanho.min@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3f3299d5
  6. 10 Nov 2012 (1 commit)
  7. 26 Oct 2012 (1 commit)
    • block: Add blk_rq_pos(rq) to sort rq when flushing · 975927b9
      Committed by Jianpeng Ma
      My workload is a RAID5 array with 16 disks, written through our
      filesystem in direct-I/O mode.
      
      I used blktrace and found these messages:
      8,16   0     6647     2.453665504  2579  M   W 7493152 + 8 [md0_raid5]
      8,16   0     6648     2.453672411  2579  Q   W 7493160 + 8 [md0_raid5]
      8,16   0     6649     2.453672606  2579  M   W 7493160 + 8 [md0_raid5]
      8,16   0     6650     2.453679255  2579  Q   W 7493168 + 8 [md0_raid5]
      8,16   0     6651     2.453679441  2579  M   W 7493168 + 8 [md0_raid5]
      8,16   0     6652     2.453685948  2579  Q   W 7493176 + 8 [md0_raid5]
      8,16   0     6653     2.453686149  2579  M   W 7493176 + 8 [md0_raid5]
      8,16   0     6654     2.453693074  2579  Q   W 7493184 + 8 [md0_raid5]
      8,16   0     6655     2.453693254  2579  M   W 7493184 + 8 [md0_raid5]
      8,16   0     6656     2.453704290  2579  Q   W 7493192 + 8 [md0_raid5]
      8,16   0     6657     2.453704482  2579  M   W 7493192 + 8 [md0_raid5]
      8,16   0     6658     2.453715016  2579  Q   W 7493200 + 8 [md0_raid5]
      8,16   0     6659     2.453715247  2579  M   W 7493200 + 8 [md0_raid5]
      8,16   0     6660     2.453721730  2579  Q   W 7493208 + 8 [md0_raid5]
      8,16   0     6661     2.453721974  2579  M   W 7493208 + 8 [md0_raid5]
      8,16   0     6662     2.453728202  2579  Q   W 7493216 + 8 [md0_raid5]
      8,16   0     6663     2.453728436  2579  M   W 7493216 + 8 [md0_raid5]
      8,16   0     6664     2.453734782  2579  Q   W 7493224 + 8 [md0_raid5]
      8,16   0     6665     2.453735019  2579  M   W 7493224 + 8 [md0_raid5]
      8,16   0     6666     2.453741401  2579  Q   W 7493232 + 8 [md0_raid5]
      8,16   0     6667     2.453741632  2579  M   W 7493232 + 8 [md0_raid5]
      8,16   0     6668     2.453748148  2579  Q   W 7493240 + 8 [md0_raid5]
      8,16   0     6669     2.453748386  2579  M   W 7493240 + 8 [md0_raid5]
      8,16   0     6670     2.453851843  2579  I   W 7493144 + 104 [md0_raid5]
      8,16   0        0     2.453853661     0  m   N cfq2579 insert_request
      8,16   0     6671     2.453854064  2579  I   W 7493120 + 24 [md0_raid5]
      8,16   0        0     2.453854439     0  m   N cfq2579 insert_request
      8,16   0     6672     2.453854793  2579  U   N [md0_raid5] 2
      8,16   0        0     2.453855513     0  m   N cfq2579 Not idling.st->count:1
      8,16   0        0     2.453855927     0  m   N cfq2579 dispatch_insert
      8,16   0        0     2.453861771     0  m   N cfq2579 dispatched a request
      8,16   0        0     2.453862248     0  m   N cfq2579 activate rq,drv=1
      8,16   0     6673     2.453862332  2579  D   W 7493120 + 24 [md0_raid5]
      8,16   0        0     2.453865957     0  m   N cfq2579 Not idling.st->count:1
      8,16   0        0     2.453866269     0  m   N cfq2579 dispatch_insert
      8,16   0        0     2.453866707     0  m   N cfq2579 dispatched a request
      8,16   0        0     2.453867061     0  m   N cfq2579 activate rq,drv=2
      8,16   0     6674     2.453867145  2579  D   W 7493144 + 104 [md0_raid5]
      8,16   0     6675     2.454147608     0  C   W 7493120 + 24 [0]
      8,16   0        0     2.454149357     0  m   N cfq2579 complete rqnoidle 0
      8,16   0     6676     2.454791505     0  C   W 7493144 + 104 [0]
      8,16   0        0     2.454794803     0  m   N cfq2579 complete rqnoidle 0
      8,16   0        0     2.454795160     0  m   N cfq schedule dispatch
      
      From the messages above, we can see that rq[W 7493144 + 104] and rq[W
      7493120 + 24] do not merge, because the bio order is:
        8,16   0     6638     2.453619407  2579  Q   W 7493144 + 8 [md0_raid5]
        8,16   0     6639     2.453620460  2579  G   W 7493144 + 8 [md0_raid5]
        8,16   0     6640     2.453639311  2579  Q   W 7493120 + 8 [md0_raid5]
        8,16   0     6641     2.453639842  2579  G   W 7493120 + 8 [md0_raid5]
      bio(7493144) comes first and bio(7493120) later, so the subsequent
      bios are divided into two parts. When flushing the plug list,
      elv_attempt_insert_merge() only supports back-merging, not
      front-merging, so rq[7493120 + 24] cannot merge with rq[7493144 + 104].
      
      In my tests this situation accounts for about 25% of cases on our
      system. With this patch, sorting the plugged requests by position (see
      the sketch after this entry), the situation no longer occurs.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      975927b9
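      A sketch of the plug-list comparator this change implies (an
      approximation of the upstream code): sort first by queue, then by
      starting sector, so adjacent requests end up next to each other and
      back-merging can succeed.

          static int plug_rq_cmp(void *priv, struct list_head *a,
                                 struct list_head *b)
          {
                  struct request *rqa = container_of(a, struct request, queuelist);
                  struct request *rqb = container_of(b, struct request, queuelist);

                  /* group by queue, then order by starting sector */
                  return !(rqa->q < rqb->q ||
                           (rqa->q == rqb->q && blk_rq_pos(rqa) < blk_rq_pos(rqb)));
          }

          /* when flushing the plug list:
           *     list_sort(NULL, &plug->list, plug_rq_cmp);
           */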
  8. 21 Sep 2012 (2 commits)
    • block: fix request_queue->flags initialization · 60ea8226
      Committed by Tejun Heo
      A queue newly allocated with blk_alloc_queue_node() has only
      QUEUE_FLAG_BYPASS set.  For request-based drivers,
      blk_init_allocated_queue() is called and q->queue_flags is overwritten
      with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
      initial bypass is still in effect.
      
      In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into q->queue_flags
      instead of overwriting it (see the one-line sketch after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      60ea8226
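      The essence of the fix, sketched (assuming the flags are simply OR-ed
      in blk_init_allocated_queue()):

          /* was: q->queue_flags = QUEUE_FLAG_DEFAULT;  (this clobbered BYPASS) */
          q->queue_flags |= QUEUE_FLAG_DEFAULT;          /* preserve BYPASS */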
    • block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() · 749fefe6
      Committed by Tejun Heo
      
      b82d4b19 ("blkcg: make request_queue bypassing on allocation") made
      request_queues bypassed on allocation to avoid switching bypass mode on
      and off while a queue is being initialized.  Some drivers allocate and
      then destroy a lot of queues without fully initializing them, and
      incurring the bypass latency overhead on each of them could add up to
      significant overhead.
      
      Unfortunately, blk_init_allocated_queue() is never used by queues of
      bio-based drivers, which means that all bio-based driver queues are in
      bypass mode even after initialization and registration complete
      successfully.
      
      Due to the limited way request_queues are used by bio drivers, this
      problem is hidden pretty well but it shows up when blk-throttle is
      used in combination with a bio-based driver.  Trying to configure
      (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
      indefinitely in blkg_conf_prep() waiting for bypass mode to end.
      
      This patch moves the initial blk_queue_bypass_end() call from
      blk_init_allocated_queue() to blk_register_queue() which is called for
      any userland-visible queues regardless of its type.
      
      I believe this is correct because I don't think there is any block
      driver which needs or wants working elevator and blk-cgroup on a queue
      which isn't visible to userland.  If there are such users, we need a
      different solution.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
      Cc: stable@vger.kernel.org
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      749fefe6
  9. 20 Sep 2012 (3 commits)
  10. 09 Sep 2012 (4 commits)
  11. 31 Aug 2012 (1 commit)
    • block: rate-limit the error message from failing commands · 37d7b34f
      Committed by Yi Zou
      When performing a cable-pull test with active stress I/O using fio over
      a dual-port Intel 82599 FCoE CNA, with 256 LUNs on one port and about 32
      LUNs on the other, the system becomes unusable because scsi-ml is busy
      printing error messages for all the failing commands. I don't believe
      this problem is specific to FCoE; the commands are failing anyway because
      the link is down (DID_NO_CONNECT), so just rate-limit the messages here
      to solve this issue (see the sketch after this entry).
      
      v1->v2: use __ratelimit(), as Tomas Henzl suggested, as the proper way to
      rate-limit per function. However, in this case the failed I/O also reaches
      blk_end_request_err() and then blk_update_request(), which has to be
      rate-limited as well; that is added in v2 of this patch.
      
      v2->v3: resolved a conflict so the patch applies to the current 3.6-rc3
      upstream tip.
      Signed-off-by: Yi Zou <yi.zou@intel.com>
      Cc: www.Open-FCoE.org <devel@open-fcoe.org>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: <linux-scsi@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      37d7b34f
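      A sketch of the rate-limiting approach (the state name is assumed, and
      the printk format is illustrative rather than verbatim):

          static DEFINE_RATELIMIT_STATE(blk_io_error_rs,
                                        DEFAULT_RATELIMIT_INTERVAL,
                                        DEFAULT_RATELIMIT_BURST);

          /* in blk_update_request(), replacing the unconditional printk: */
          if (__ratelimit(&blk_io_error_rs))
                  printk(KERN_ERR "end_request: I/O error, dev %s, sector %llu\n",
                         req->rq_disk ? req->rq_disk->disk_name : "?",
                         (unsigned long long)blk_rq_pos(req));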
  12. 22 Aug 2012 (2 commits)
    • workqueue: deprecate __cancel_delayed_work() · 136b5721
      Committed by Tejun Heo
      Now that cancel_delayed_work() can be safely called from IRQ handlers,
      there's no reason to use __cancel_delayed_work().  Use
      cancel_delayed_work() instead of __cancel_delayed_work() and mark the
      latter deprecated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      136b5721
    • workqueue: use mod_delayed_work() instead of __cancel + queue · e7c2f967
      Committed by Tejun Heo
      Now that mod_delayed_work() is safe to call from IRQ handlers,
      __cancel_delayed_work() followed by queue_delayed_work() can be
      replaced with mod_delayed_work().
      
      Most conversions are straightforward (see the sketch after this entry) except for the following.
      
      * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
        elaborate dance around its delayed_work.  Collapse it so that
        linkwatch_work is queued for immediate execution if LW_URGENT and the
        existing timer is kept otherwise.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> 
      e7c2f967
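      A sketch of the typical conversion (the delayed_work name is
      hypothetical; the workqueue APIs are the real ones):

          /* before */
          __cancel_delayed_work(&dev->poll_work);
          queue_delayed_work(system_wq, &dev->poll_work, delay);

          /* after */
          mod_delayed_work(system_wq, &dev->poll_work, delay);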
  13. 31 Jul 2012 (3 commits)
  14. 27 Jun 2012 (1 commit)
    • blkcg: implement per-blkg request allocation · a051661c
      Committed by Tejun Heo
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, put a program which can issue a lot of random direct IOs there
      and running a sequential IO from a different cgroup.  As soon as the
      request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit hairier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description updated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking the error return from
          blkg_lookup_create() and could cause an oops on lookup failure.  Fix
          it by falling back to root_rl on blkg lookup failures (see the
          sketch after this entry).  This problem was spotted by Rakesh Iyer
          <rni@google.com>.
      
      v4: Updated to accommodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a051661c
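      A rough sketch of the per-blkg request_list selection with the v3
      fallback (heavily simplified; the real helper also deals with RCU and
      reference counting, and the names should be treated as an
      approximation):

          static struct request_list *blk_get_rl(struct request_queue *q,
                                                 struct bio *bio)
          {
                  struct blkcg_gq *blkg = blkg_lookup_create(bio_blkcg(bio), q);

                  /* v3 fix: never oops, fall back to the root request_list */
                  if (IS_ERR_OR_NULL(blkg))
                          return &q->root_rl;
                  return &blkg->rl;
          }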
  15. 25 Jun 2012 (5 commits)
    • block: prepare for multiple request_lists · 5b788ce3
      Committed by Tejun Heo
      Request allocation is about to be made per-blkg meaning that there'll
      be multiple request lists.
      
      * Make queue full state per request_list.  blk_*queue_full() functions
        are renamed to blk_*rl_full() and take @rl instead of @q.
      
      * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
        instead of @q.  Also add @gfp_mask parameter.
      
      * Add blk_exit_rl() instead of destroying rl directly from
        blk_release_queue().
      
      * Add request_list->q and make request alloc/free functions -
        blk_free_request(), [__]freed_request(), __get_request() - take @rl
        instead of @q.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5b788ce3
    • block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv · 8a5ecdd4
      Committed by Tejun Heo
      Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
      move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
      to use q->nr_rqs[] instead of q->rq.count[].
      
      These counters separate queue-wide request statistics from the
      request list and allow implementation of per-queue request allocation.
      
      While at it, properly indent fields of struct request_list.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a5ecdd4
    • block: allocate io_context upfront · 7f4b35d1
      Committed by Tejun Heo
      The block layer is very lazy about allocating the ioc.  It waits until
      the moment the ioc is absolutely necessary; unfortunately, that moment
      can be inside the queue lock, and __get_request() has to perform an
      unlock / try-alloc / retry dance.
      
      Just allocate it up-front on entry to block layer.  We're not saving
      the rain forest by deferring it to the last possible moment and
      complicating things unnecessarily.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7f4b35d1
    • block: refactor get_request[_wait]() · a06e05e6
      Committed by Tejun Heo
      Currently, there are two request allocation functions - get_request()
      and get_request_wait().  The former tries to allocate a request once;
      the latter wraps it and keeps retrying until allocation succeeds.
      
      The combination of the two delivers fallible non-waiting allocation,
      fallible waiting allocation and unfailing waiting allocation.  However,
      given that forward progress is guaranteed, fallible waiting allocation
      isn't all that useful and in fact nobody uses it.  (A sketch of the
      simplified interface follows this entry.)
      
      This patch simplifies the interface as follows.
      
      * get_request() is renamed to __get_request() and is only used by the
        wrapper function.
      
      * get_request_wait() is renamed to get_request().  It now takes
        @gfp_mask and retries iff it contains %__GFP_WAIT.
      
      This patch doesn't introduce any functional change and is to prepare
      for further updates to request allocation path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a06e05e6
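      A simplified sketch of the new single entry point (the waiting logic is
      elided; treat this as an approximation of the described interface, not
      the verbatim patch):

          static struct request *get_request(struct request_queue *q, int rw_flags,
                                             struct bio *bio, gfp_t gfp_mask)
          {
                  struct request *rq;

                  for (;;) {
                          rq = __get_request(q, rw_flags, bio, gfp_mask);
                          if (rq)
                                  return rq;
                          /* retry only if the caller can wait */
                          if (!(gfp_mask & __GFP_WAIT))
                                  return NULL;
                          /* sleep until a request is freed, then retry (elided) */
                  }
          }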
    • mempool: add @gfp_mask to mempool_create_node() · a91a5ac6
      Committed by Tejun Heo
      mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
      blk_init_free_list(), is about to be updated to use other allocation
      flags - add a @gfp_mask argument to the function (the resulting
      signature is sketched after this entry).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a91a5ac6
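      The resulting prototype, sketched from the description (argument order
      is an assumption):

          mempool_t *mempool_create_node(int min_nr,
                                         mempool_alloc_t *alloc_fn,
                                         mempool_free_t *free_fn,
                                         void *pool_data,
                                         gfp_t gfp_mask, int node_id);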
  16. 15 Jun 2012 (2 commits)
    • block: Mitigate lock unbalance caused by lock switching · 5e5cfac0
      Committed by Asias He
      Commit 777eb1bf disconnects the externally supplied queue_lock before
      blk_drain_queue(). Switching the lock would introduce a lock imbalance
      because threads which have taken the external lock might unlock the
      internal lock during the queue drain. This patch mitigates this by
      disconnecting the lock after the queue drain, since draining makes a
      lot of request_queue users go away.
      
      However, please note that this patch only makes the problem less likely
      to happen. Anyone who still holds a ref might try to issue a new request
      on a dead queue after blk_cleanup_queue() finishes draining, and the
      lock imbalance might still happen in that case.
      
       =====================================
       [ BUG: bad unlock balance detected! ]
       3.4.0+ #288 Not tainted
       -------------------------------------
       fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
       [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
       but there are no more locks to release!
      
       other info that might help us debug this:
       1 lock held by fio/17706:
        #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
       get_request_wait+0x19a/0x250
      
       stack backtrace:
       Pid: 17706, comm: fio Not tainted 3.4.0+ #288
       Call Trace:
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
        [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
        [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
        [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
        [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810e0079>] lock_release+0xd9/0x250
        [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
        [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
        [<ffffffff81328faa>] generic_make_request+0xca/0x100
        [<ffffffff81329056>] submit_bio+0x76/0xf0
        [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
        [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
        [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
        [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
        [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
        [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
        [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
        [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
        [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
        [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
        [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
        [<ffffffff811eae50>] sys_io_submit+0x10/0x20
        [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b
      
      Changes since v2: Update commit log to explain how the code is still
                        broken even if we delay the lock switching after the drain.
      Changes since v1: Update commit log as Tejun suggested.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5e5cfac0
    • block: Avoid missed wakeup in request waitqueue · 458f27a9
      Committed by Asias He
      After hot-unplugging a stressed disk, I found that rl->wait[] is not
      empty while rl->count[] is empty and there are threads still sleeping
      on get_request after the queue cleanup. With simple debug code, I found
      there are exactly nr_sleep - nr_wakeup threads in D state. So there are
      missed wakeups.
      
        $ dmesg | grep nr_sleep
        [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
        $ vmstat 1
        1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0
      
      To quote Tejun:
      
        Ah, okay, freed_request() wakes up single waiter with the assumption
        that after the wakeup there will at least be one successful allocation
        which in turn will continue the wakeup chain until the wait list is
        empty - ie. waiter wakeup is dependent on successful request
        allocation happening after each wakeup.  With queue marked dead, any
        woken up waiter fails the allocation path, so the wakeup chaining is
        lost and we're left with hung waiters. What we need is wake_up_all()
        after drain completion.
      
      This patch fixes the missed wakeups by waking up all the threads which
      are sleeping on the wait queue after the queue drain (see the sketch
      after this entry).
      
      Changes in v2: Drop waitqueue_active() optimization
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      
      Also fixed a bug where stacked devices would oops on calling
      blk_drain_queue(), since ->rq.wait[] does not get initialized unless
      it's a full queue setup.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      458f27a9
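      A simplified sketch of the wake-up-all step, placed after the drain
      (the helper name is hypothetical):

          static void blk_wake_all_request_waiters(struct request_queue *q)
          {
                  struct request_list *rl = &q->rq;
                  int i;

                  /* wake every waiter: none can succeed on a dead queue, so
                   * the usual one-wakeup-per-free chain would leave them hung */
                  for (i = 0; i < ARRAY_SIZE(rl->wait); i++)
                          wake_up_all(&rl->wait[i]);
          }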
  17. 20 Apr 2012 (4 commits)
    • block: fix elvpriv allocation failure handling · aaf7c680
      Committed by Tejun Heo
      Request allocation is mempool backed to guarantee forward progress
      under memory pressure; unfortunately, this property got broken while
      adding elvpriv data.  Failures during elvpriv allocation, including
      ioc and icq creation failures, currently make get_request() fail as a
      whole.  There's no forward progress guarantee for these allocations -
      they may fail indefinitely under memory pressure, stalling IO and
      deadlocking the system.
      
      This patch updates get_request() such that elvpriv allocation failure
      doesn't make the whole function fail.  If elvpriv allocation fails,
      the allocation is degraded into !ELVPRIV.  This will force the request
      to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
      failures should be rare (nothing is per-request) and anything is
      better than deadlocking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aaf7c680
    • block: collapse blk_alloc_request() into get_request() · 29e2b09a
      Committed by Tejun Heo
      Allocation failure handling in get_request() is about to be updated.
      To ease the update, collapse blk_alloc_request() into get_request().
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      29e2b09a
    • blkcg: make request_queue bypassing on allocation · b82d4b19
      Committed by Tejun Heo
      With the previous change to guarantee bypass visibility for RCU read
      lock regions, entering bypass mode involves non-trivial overhead, and
      future changes are scheduled to make use of bypass mode during the init
      path.  Combined, these may end up adding noticeable delay during boot.
      
      This patch makes request_queue start its life in bypass mode, which is
      ended on queue init completion at the end of
      blk_init_allocated_queue(), and updates blk_queue_bypass_start() such
      that draining and RCU synchronization are performed only when the
      queue actually enters bypass mode.
      
      This avoids unnecessarily switching in and out of bypass mode during
      init, avoiding the overhead and any nasty surprises which may stem from
      leaving bypass mode on half-initialized queues.
      
      The boot time overhead was pointed out by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b82d4b19
    • blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing · 80fd9979
      Committed by Tejun Heo
      Currently, blkg_lookup() doesn't check the @q bypass state.  This patch
      updates blk_queue_bypass_start() to do synchronize_rcu() before
      returning and updates blkg_lookup() to check blk_queue_bypass() and
      return %NULL if bypassing.  This ensures blkg_lookup() returns %NULL
      whenever @q is bypassing (see the sketch after this entry).
      
      This is to guarantee that nobody is accessing policy data while @q is
      bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in
      place on policy [de]activation.
      
      v2: Added more comments explaining bypass guarantees as suggested by
          Vivek.
      
      v3: Added more comments explaining why there's no synchronize_rcu() in
          blk_cleanup_queue() as suggested by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80fd9979
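      A sketch of the check (the type and helper names from the blkcg code of
      that period are assumptions, not quotations):

          struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
          {
                  /* bypass is guaranteed visible here thanks to the
                   * synchronize_rcu() in blk_queue_bypass_start() */
                  if (unlikely(blk_queue_bypass(q)))
                          return NULL;
                  return __blkg_lookup(blkcg, q);
          }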
  18. 07 Apr 2012 (1 commit)
    • block: make auto block plug flush threshold per-disk based · 1b2e19f1
      Committed by Shaohua Li
      We do automatic block-plug flushing to reduce latency; the threshold is
      16 requests. This works well if the task is accessing one or two drives.
      The problem is that if the task is accessing a RAID 0 device with a large
      member-disk count, say 8 or 16, that is only 16/8 = 2 or 16/16 = 1
      requests per disk, and we will have heavy lock contention.
      
      This patch makes the threshold per-disk based (see the sketch after this
      entry). Latency should still be fine when accessing one or two drives.
      A setup where an application accesses a lot of drives at the same time
      is usually a big machine, where avoiding lock contention matters more,
      because any contention will actually increase latency.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1b2e19f1
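      A sketch of the per-disk accounting (the helper name is hypothetical;
      only requests targeting the same queue count toward the flush
      threshold):

          static unsigned int plug_count_for_queue(struct blk_plug *plug,
                                                   struct request_queue *q)
          {
                  struct request *rq;
                  unsigned int count = 0;

                  list_for_each_entry(rq, &plug->list, queuelist)
                          if (rq->q == q)
                                  count++;
                  return count;
          }

          /* flush only when this one queue exceeds the threshold, e.g.:
           *     if (plug_count_for_queue(plug, q) >= BLK_MAX_REQUEST_COUNT)
           *             blk_flush_plug_list(plug, false);
           */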