1. 02 6月, 2015 1 次提交
  2. 30 1月, 2015 1 次提交
  3. 10 12月, 2014 1 次提交
    • B
      blk-mq: Fix a use-after-free · 45a9c9d9
      Bart Van Assche 提交于
      blk-mq users are allowed to free the memory request_queue.tag_set
      points at after blk_cleanup_queue() has finished but before
      blk_release_queue() has started. This can happen e.g. in the SCSI
      core. The SCSI core namely embeds the tag_set structure in a SCSI
      host structure. The SCSI host structure is freed by
      scsi_host_dev_release(). This function is called after
      blk_cleanup_queue() finished but can be called before
      blk_release_queue().
      
      This means that it is not safe to access request_queue.tag_set from
      inside blk_release_queue(). Hence remove the blk_sync_queue() call
      from blk_release_queue(). This call is not necessary - outstanding
      requests must have finished before blk_release_queue() is
      called. Additionally, move the blk_mq_free_queue() call from
      blk_release_queue() to blk_cleanup_queue() to avoid that struct
      request_queue.tag_set gets accessed after it has been freed.
      
      This patch avoids that the following kernel oops can be triggered
      when deleting a SCSI host for which scsi-mq was enabled:
      
      Call Trace:
       [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
       [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
       [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
       [<ffffffff8124d654>] blk_release_queue+0x84/0xd0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff81245895>] blk_put_queue+0x15/0x20
       [<ffffffff8125c409>] disk_release+0x99/0xd0
       [<ffffffff8133d056>] device_release+0x36/0xb0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff8125a78a>] put_disk+0x1a/0x20
       [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
       [<ffffffff811d56a0>] blkdev_put+0x50/0x160
       [<ffffffff81199eb4>] kill_block_super+0x44/0x70
       [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
       [<ffffffff8119a87e>] deactivate_super+0x4e/0x70
       [<ffffffff811b9833>] cleanup_mnt+0x43/0x90
       [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
       [<ffffffff8107252c>] task_work_run+0xac/0xe0
       [<ffffffff81002c01>] do_notify_resume+0x61/0xa0
       [<ffffffff814d2c58>] int_signal+0x12/0x17
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      45a9c9d9
  4. 26 9月, 2014 3 次提交
  5. 25 9月, 2014 1 次提交
    • T
      blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode · 17497acb
      Tejun Heo 提交于
      blk-mq uses percpu_ref for its usage counter which tracks the number
      of in-flight commands and used to synchronously drain the queue on
      freeze.  percpu_ref shutdown takes measureable wallclock time as it
      involves a sched RCU grace period.  This means that draining a blk-mq
      takes measureable wallclock time.  One would think that this shouldn't
      matter as queue shutdown should be a rare event which takes place
      asynchronously w.r.t. userland.
      
      Unfortunately, SCSI probing involves synchronously setting up and then
      tearing down a lot of request_queues back-to-back for non-existent
      LUNs.  This means that SCSI probing may take above ten seconds when
      scsi-mq is used.
      
        [    0.949892] scsi host0: Virtio SCSI HBA
        [    1.007864] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
        [    1.021299] scsi 0:0:1:0: Direct-Access     QEMU     QEMU HARDDISK    1.1. PQ: 0 ANSI: 5
        [    1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
      
        <stall>
      
        [   16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
        [   16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
        [   16.194099] osd: LOADED open-osd 0.2.1
        [   16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
        [   16.208478] sd 0:0:0:0: [sda] Write Protect is off
        [   16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
        [   16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
        [   16.223264] sd 0:0:1:0: [sdb] Write Protect is off
        [   16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      
      This is also the reason why request_queues start in bypass mode which
      is ended on blk_register_queue() as shutting down a fully functional
      queue also involves a RCU grace period and the queues for non-existent
      SCSI devices never reach registration.
      
      blk-mq basically needs to do the same thing - start the mq in a
      degraded mode which is faster to shut down and then make it fully
      functional only after the queue reaches registration.  percpu_ref
      recently grew facilities to force atomic operation until explicitly
      switched to percpu mode, which can be used for this purpose.  This
      patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
      switch it to percpu mode only once blk_register_queue() is reached.
      
      Note that this issue was previously worked around by 0a30288d
      ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
      probe") for v3.17.  The temp fix was reverted in preparation of adding
      persistent atomic mode to percpu_ref by 9eca8046 ("Revert "blk-mq,
      percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
      This patch and the prerequisite percpu_ref changes will be merged
      during v3.18 devel cycle.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
      Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
      Reviewed-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      17497acb
  6. 10 9月, 2014 1 次提交
    • A
      Block: fix unbalanced bypass-disable in blk_register_queue · df35c7c9
      Alan Stern 提交于
      When a queue is registered, the block layer turns off the bypass
      setting (because bypass is enabled when the queue is created).  This
      doesn't work well for queues that are unregistered and then registered
      again; we get a WARNING because of the unbalanced calls to
      blk_queue_bypass_end().
      
      This patch fixes the problem by making blk_register_queue() call
      blk_queue_bypass_end() only the first time the queue is registered.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      Acked-by: NTejun Heo <tj@kernel.org>
      CC: James Bottomley <James.Bottomley@HansenPartnership.com>
      CC: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      df35c7c9
  7. 02 7月, 2014 1 次提交
    • T
      block, blk-mq: draining can't be skipped even if bypass_depth was non-zero · 776687bc
      Tejun Heo 提交于
      Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
      skip queue draining if bypass_depth was already above zero.  The
      assumption is that the one which bumped the bypass_depth should have
      performed draining already; however, there's nothing which prevents a
      new instance of bypassing/freezing from starting before the previous
      one finishes draining.  The current code may allow the later
      bypassing/freezing instances to complete while there still are
      in-flight requests which haven't finished draining.
      
      Fix it by draining regardless of bypass_depth.  We still skip draining
      from blk_queue_bypass_start() while the queue is initializing to avoid
      introducing excessive delays during boot.  INIT_DONE setting is moved
      above the initial blk_queue_bypass_end() so that bypassing attempts
      can't slip inbetween.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      776687bc
  8. 27 5月, 2014 1 次提交
  9. 21 5月, 2014 1 次提交
    • J
      blk-mq: allow changing of queue depth through sysfs · e3a2b3f9
      Jens Axboe 提交于
      For request_fn based devices, the block layer exports a 'nr_requests'
      file through sysfs to allow adjusting of queue depth on the fly.
      Currently this returns -EINVAL for blk-mq, since it's not wired up.
      Wire this up for blk-mq, so that it now also always dynamic
      adjustments of the allowed queue depth for any given block device
      managed by blk-mq.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e3a2b3f9
  10. 11 2月, 2014 1 次提交
    • C
      blk-mq: rework flush sequencing logic · 18741986
      Christoph Hellwig 提交于
      Witch to using a preallocated flush_rq for blk-mq similar to what's done
      with the old request path.  This allows us to set up the request properly
      with a tag from the actually allowed range and ->rq_disk as needed by
      some drivers.  To make life easier we also switch to dynamic allocation
      of ->flush_rq for the old path.
      
      This effectively reverts most of
      
          "blk-mq: fix for flush deadlock"
      
      and
      
          "blk-mq: Don't reserve a tag for flush request"
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      18741986
  11. 01 1月, 2014 1 次提交
  12. 15 11月, 2013 1 次提交
  13. 25 10月, 2013 1 次提交
    • J
      blk-mq: new multi-queue block IO queueing mechanism · 320ae51f
      Jens Axboe 提交于
      Linux currently has two models for block devices:
      
      - The classic request_fn based approach, where drivers use struct
        request units for IO. The block layer provides various helper
        functionalities to let drivers share code, things like tag
        management, timeout handling, queueing, etc.
      
      - The "stacked" approach, where a driver squeezes in between the
        block layer and IO submitter. Since this bypasses the IO stack,
        driver generally have to manage everything themselves.
      
      With drivers being written for new high IOPS devices, the classic
      request_fn based driver doesn't work well enough. The design dates
      back to when both SMP and high IOPS was rare. It has problems with
      scaling to bigger machines, and runs into scaling issues even on
      smaller machines when you have IOPS in the hundreds of thousands
      per device.
      
      The stacked approach is then most often selected as the model
      for the driver. But this means that everybody has to re-invent
      everything, and along with that we get all the problems again
      that the shared approach solved.
      
      This commit introduces blk-mq, block multi queue support. The
      design is centered around per-cpu queues for queueing IO, which
      then funnel down into x number of hardware submission queues.
      We might have a 1:1 mapping between the two, or it might be
      an N:M mapping. That all depends on what the hardware supports.
      
      blk-mq provides various helper functions, which include:
      
      - Scalable support for request tagging. Most devices need to
        be able to uniquely identify a request both in the driver and
        to the hardware. The tagging uses per-cpu caches for freed
        tags, to enable cache hot reuse.
      
      - Timeout handling without tracking request on a per-device
        basis. Basically the driver should be able to get a notification,
        if a request happens to fail.
      
      - Optional support for non 1:1 mappings between issue and
        submission queues. blk-mq can redirect IO completions to the
        desired location.
      
      - Support for per-request payloads. Drivers almost always need
        to associate a request structure with some driver private
        command structure. Drivers can tell blk-mq this at init time,
        and then any request handed to the driver will have the
        required size of memory associated with it.
      
      - Support for merging of IO, and plugging. The stacked model
        gets neither of these. Even for high IOPS devices, merging
        sequential IO reduces per-command overhead and thus
        increases bandwidth.
      
      For now, this is provided as a potential 3rd queueing model, with
      the hope being that, as it matures, it can replace both the classic
      and stacked model. That would get us back to having just 1 real
      model for block devices, leaving the stacked approach to dm/md
      devices (as it was originally intended).
      
      Contributions in this patch from the following people:
      
      Shaohua Li <shli@fusionio.com>
      Alexander Gordeev <agordeev@redhat.com>
      Christoph Hellwig <hch@infradead.org>
      Mike Christie <michaelc@cs.wisc.edu>
      Matias Bjorling <m@bjorling.me>
      Jeff Moyer <jmoyer@redhat.com>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      320ae51f
  14. 12 9月, 2013 1 次提交
  15. 04 4月, 2013 1 次提交
    • A
      block: avoid using uninitialized value in from queue_var_store · c678ef52
      Arnd Bergmann 提交于
      As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
      that use a value generated by queue_var_store independent of whether
      that value was set or not.
      
      block/blk-sysfs.c: In function 'queue_store_nonrot':
      block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Unlike most other such warnings, this one is not a false positive,
      writing any non-number string into the sysfs files indeed has
      an undefined result, rather than returning an error.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c678ef52
  16. 10 1月, 2013 1 次提交
  17. 06 12月, 2012 1 次提交
    • B
      block: Rename queue dead flag · 3f3299d5
      Bart Van Assche 提交于
      QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
      stop. After this flag has been set queue draining starts. However,
      during the queue draining phase it is still safe to invoke the
      queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
      flag.
      
      This patch has been generated by running the following command
      over the kernel source tree:
      
      git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
          xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g'      \
              -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g';                \
      sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
          include/linux/blkdev.h;                                       \
      sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
          -e 's/Dead queue/A dying queue/' block/blk-core.c
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chanho Min <chanho.min@lge.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3f3299d5
  18. 21 9月, 2012 1 次提交
    • T
      block: lift the initial queue bypass mode on blk_register_queue() instead of... · 749fefe6
      Tejun Heo 提交于
      block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
      
      b82d4b19 ("blkcg: make request_queue bypassing on allocation") made
      request_queues bypassed on allocation to avoid switching on and off
      bypass mode on a queue being initialized.  Some drivers allocate and
      then destroy a lot of queues without fully initializing them and
      incurring bypass latency overhead on each of them could add upto
      significant overhead.
      
      Unfortunately, blk_init_allocated_queue() is never used by queues of
      bio-based drivers, which means that all bio-based driver queues are in
      bypass mode even after initialization and registration complete
      successfully.
      
      Due to the limited way request_queues are used by bio drivers, this
      problem is hidden pretty well but it shows up when blk-throttle is
      used in combination with a bio-based driver.  Trying to configure
      (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
      indefinitely in blkg_conf_prep() waiting for bypass mode to end.
      
      This patch moves the initial blk_queue_bypass_end() call from
      blk_init_allocated_queue() to blk_register_queue() which is called for
      any userland-visible queues regardless of its type.
      
      I believe this is correct because I don't think there is any block
      driver which needs or wants working elevator and blk-cgroup on a queue
      which isn't visible to userland.  If there are such users, we need a
      different solution.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJoseph Glanville <joseph.glanville@orionvm.com.au>
      Cc: stable@vger.kernel.org
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      749fefe6
  19. 20 9月, 2012 1 次提交
  20. 09 9月, 2012 1 次提交
    • D
      block: reject invalid queue attribute values · b1f3b64d
      Dave Reisner 提交于
      Instead of using simple_strtoul which "converts" invalid numbers to 0,
      use strict_strtoul and perform error checking to ensure that userspace
      passes us a valid unsigned long. This addresses problems with functions
      such as writev, which might want to write a trailing newline -- the
      newline should rightfully be rejected, but the value preceeding it
      should be preserved.
      
      Fixes BZ#46981.
      Signed-off-by: NDave Reisner <dreisner@archlinux.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b1f3b64d
  21. 27 6月, 2012 1 次提交
    • T
      blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo 提交于
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, put a program which can issue a lot of random direct IOs there
      and running a sequential IO from a different cgroup.  As soon as the
      request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit harier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description udpated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accomodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a051661c
  22. 25 6月, 2012 1 次提交
    • T
      block: prepare for multiple request_lists · 5b788ce3
      Tejun Heo 提交于
      Request allocation is about to be made per-blkg meaning that there'll
      be multiple request lists.
      
      * Make queue full state per request_list.  blk_*queue_full() functions
        are renamed to blk_*rl_full() and takes @rl instead of @q.
      
      * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
        instead of @q.  Also add @gfp_mask parameter.
      
      * Add blk_exit_rl() instead of destroying rl directly from
        blk_release_queue().
      
      * Add request_list->q and make request alloc/free functions -
        blk_free_request(), [__]freed_request(), __get_request() - take @rl
        instead of @q.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5b788ce3
  23. 07 3月, 2012 2 次提交
    • T
      blkcg: unify blkg's for blkcg policies · e8989fae
      Tejun Heo 提交于
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      
      This patch make blkg's per cgroup-queue and let them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e8989fae
    • T
      blkcg: add blkcg_{init|drain|exit}_queue() · 5efd6113
      Tejun Heo 提交于
      Currently block core calls directly into blk-throttle for init, drain
      and exit.  This patch adds blkcg_{init|drain|exit}_queue() which wraps
      the blk-throttle functions.  This is to give more control and
      visiblity to blkcg core layer for proper layering.  Further patches
      will add logic common to blkcg policies to the functions.
      
      While at it, collapse blk_throtl_release() into blk_throtl_exit().
      There's no reason to keep them separate.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5efd6113
  24. 14 12月, 2011 3 次提交
    • T
      block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Tejun Heo 提交于
      With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
      blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
      with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
      and q, and freeing automatically handled by blk-ioc.  The elevator
      operation only need to perform exit operation specific to the elevator
      - in cfq's case, exiting the cfqq's.
      
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      
      Because the q io_cq points to might be freed before RCU callback for
      the io_cq runs, blk-ioc code should remember to which cache the io_cq
      needs to be freed when the io_cq is released.  New field
      io_cq->__rcu_icq_cache is added for this purpose.  As both the new
      field and rcu_head are used only after io_cq is released and the
      q/ioc_node fields aren't, they are put into unions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7e5a8794
    • T
      block, cfq: move cfqd->cic_index to q->id · a73f730d
      Tejun Heo 提交于
      cfq allocates per-queue id using ida and uses it to index cic radix
      tree from io_context.  Move it to q->id and allocate on queue init and
      free on queue release.  This simplifies cfq a bit and will allow for
      further improvements of io context life-cycle management.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a73f730d
    • T
      block: add blk_queue_dead() · 34f6055c
      Tejun Heo 提交于
      There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
      macro and use it.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      34f6055c
  25. 19 10月, 2011 1 次提交
    • T
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo 提交于
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
        released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c9a929dd
  26. 28 9月, 2011 1 次提交
    • H
      block: Free queue resources at blk_release_queue() · 777eb1bf
      Hannes Reinecke 提交于
      A kernel crash is observed when a mounted ext3/ext4 filesystem is
      physically removed. The problem is that blk_cleanup_queue() frees up
      some resources eg by calling elevator_exit(), which are not checked for
      in normal operation. So we should rather move these calls to the
      destructor function blk_release_queue() as at that point all remaining
      references are gone. However, in doing so we have to ensure that any
      externally supplied queue_lock is disconnected as the driver might free
      up the lock after the call of blk_cleanup_queue(),
      Signed-off-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      777eb1bf
  27. 21 9月, 2011 1 次提交
  28. 24 8月, 2011 1 次提交
  29. 24 7月, 2011 1 次提交
    • D
      block: strict rq_affinity · 5757a6d7
      Dan Williams 提交于
      Some systems benefit from completions always being steered to the strict
      requester cpu rather than the looser "per-socket" steering that
      blk_cpu_to_group() attempts by default. This is because the first
      CPU in the group mask ends up being completely overloaded with work,
      while the others (including the original submitter) has power left
      to spare.
      
      Allow the strict mode to be set by writing '2' to the sysfs control
      file. This is identical to the scheme used for the nomerges file,
      where '2' is a more aggressive setting than just being turned on.
      
      echo 2 > /sys/block/<bdev>/queue/rq_affinity
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Roland Dreier <roland@purestorage.com>
      Tested-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      5757a6d7
  30. 18 5月, 2011 1 次提交
  31. 19 4月, 2011 2 次提交
  32. 14 4月, 2011 1 次提交
  33. 03 3月, 2011 1 次提交
    • V
      block: Move blk_throtl_exit() call to blk_cleanup_queue() · da527770
      Vivek Goyal 提交于
      Move blk_throtl_exit() in blk_cleanup_queue() as blk_throtl_exit() is
      written in such a way that it needs queue lock. In blk_release_queue()
      there is no gurantee that ->queue_lock is still around.
      
      Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
      one problem.
      
        https://lkml.org/lkml/2010/10/23/86
      
        And a quick fix moved blk_throtl_exit() to blk_release_queue().
      
              commit 7ad58c02
              Author: Jens Axboe <jaxboe@fusionio.com>
              Date:   Sat Oct 23 20:40:26 2010 +0200
      
              block: fix use-after-free bug in blk throttle code
      
      This patch reverts above change and does not try to shutdown the
      throtl work in blk_sync_queue(). By avoiding call to
      throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
      the problem reported by Ingo.
      
      blk_sync_queue() seems to be used only by md driver and it seems to be
      using it to make sure q->unplug_fn is not called as md registers its
      own unplug functions and it is about to free up the data structures
      used by unplug_fn(). Block throttle does not call back into unplug_fn()
      or into md. So there is no need to cancel blk throttle work.
      
      In fact I think cancelling block throttle work is bad because it might
      happen that some bios are throttled and scheduled to be dispatched later
      with the help of pending work and if work is cancelled, these bios might
      never be dispatched.
      
      Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
      blk_release_queue() time. That should be safe as we are also calling
      blk_throtl_exit() which should make sure all the throttling related
      data structures are cleaned up.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      da527770
  34. 17 12月, 2010 1 次提交
    • M
      block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead · e692cb66
      Martin K. Petersen 提交于
      When stacking devices, a request_queue is not always available. This
      forced us to have a no_cluster flag in the queue_limits that could be
      used as a carrier until the request_queue had been set up for a
      metadevice.
      
      There were several problems with that approach. First of all it was up
      to the stacking device to remember to set queue flag after stacking had
      completed. Also, the queue flag and the queue limits had to be kept in
      sync at all times. We got that wrong, which could lead to us issuing
      commands that went beyond the max scatterlist limit set by the driver.
      
      The proper fix is to avoid having two flags for tracking the same thing.
      We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
      block layer merging functions. The queue_limit 'no_cluster' is turned
      into 'cluster' to avoid double negatives and to ease stacking.
      Clustering defaults to being enabled as before. The queue flag logic is
      removed from the stacking function, and explicitly setting the cluster
      flag is no longer necessary in DM and MD.
      Reported-by: NEd Lin <ed.lin@promise.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e692cb66