1. 04 11月, 2011 1 次提交
    • T
      block: don't call blk_drain_queue() if elevator is not up · 6dd9ad7d
      Tejun Heo 提交于
      blk_cleanup_queue() may be called before elevator is set up on a
      queue which triggers the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       ...
       Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
       Bochs Bochs
       RIP: 0010:[<ffffffff8125a69c>]  [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       ...
       Call Trace:
        [<ffffffff8125da92>] blk_drain_queue+0x42/0x70
        [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
        [<ffffffff81469640>] md_free+0x50/0x70
        [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
        [<ffffffff81270d56>] kref_put+0x36/0xa0
        [<ffffffff8126f2b7>] kobject_put+0x27/0x60
        [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
        [<ffffffff81083450>] process_one_work+0x100/0x3b0
        [<ffffffff8108527f>] worker_thread+0x15f/0x3a0
        [<ffffffff81089937>] kthread+0x87/0x90
        [<ffffffff81621834>] kernel_thread_helper+0x4/0x10
      
      Fix it by making blk_cleanup_queue() check whether q->elevator is set
      up before invoking blk_drain_queue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6dd9ad7d
  2. 25 10月, 2011 4 次提交
  3. 24 10月, 2011 5 次提交
    • T
      block: make gendisk hold a reference to its queue · f992ae80
      Tejun Heo 提交于
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f992ae80
    • J
      blk-flush: move the queue kick into · e67b77c7
      Jeff Moyer 提交于
      A dm-multipath user reported[1] a problem when trying to boot
      a kernel with commit 4853abaa
      (block: fix flush machinery for stacking drivers with differring
      flush flags) applied.  It turns out that an empty flush request
      can be sent into blk_insert_flush.  When the BUG_ON was fixed
      to allow for this, I/O on the underlying device would stall.  The
      reason is that blk_insert_cloned_request does not kick the queue.
      In the aforementioned commit, I had added a special case to
      kick the queue if data was sent down but the queue flags did
      not require a flush.  A better solution is to push the queue
      kick up into blk_insert_cloned_request.
      
      This patch, along with a follow-on which fixes the BUG_ON, fixes
      the issue reported.
      
      [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.htmlReported-by: NChristophe Saout <christophe@saout.de>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      
      Stable note: 3.1
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e67b77c7
    • J
      blk-flush: fix invalid BUG_ON in blk_insert_flush · 834f9f61
      Jeff Moyer 提交于
      A user reported a regression due to commit
      4853abaa (block: fix flush
      machinery for stacking drivers with differring flush flags).
      Part of the problem is that blk_insert_flush required a
      single bio be attached to the request.  In reality, having
      no attached bio is also a valid case, as can be observed with
      an empty flush.
      
      [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.htmlReported-by: NChristophe Saout <christophe@saout.de>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com
      Acked-by: NTejun Heo <tj@kernel.org>
      
      Stable note: 3.1
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      834f9f61
    • T
      block: Remove the control of complete cpu from bio. · 9562ad9a
      Tao Ma 提交于
      bio originally has the functionality to set the complete cpu, but
      it is broken.
      
      Chirstoph said that "This code is unused, and from the all the
      discussions lately pretty obviously broken.  The only thing keeping
      it serves is creating more confusion and possibly more bugs."
      
      And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
      with leaving cpu control to the request based drivers, they are the
      only ones that can toggle the setting anyway".
      
      So this patch tries to remove all the work of controling complete cpu
      from a bio.
      
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9562ad9a
    • J
      block: fix a typo in the blk-cgroup.h file · e890413a
      Jie Liu 提交于
      byptes -> bytes.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e890413a
  4. 19 10月, 2011 10 次提交
    • T
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo 提交于
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
        released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c9a929dd
    • T
      block: drop @tsk from attempt_plug_merge() and explain sync rules · bd87b589
      Tejun Heo 提交于
      attempt_plug_merge() accesses elevator without holding queue_lock and
      may call into ->elevator_bio_merge_fn().  The elvator is guaranteed to
      be valid because it's accessed iff the plugged list has requests and
      elevator is never exited with live requests, so as long as the
      elevator method can deal with unlocked access, this is safe.
      
      Explain the sync rules around attempt_plug_merge() and drop the
      unnecessary @tsk parameter.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bd87b589
    • T
      block: make get_request[_wait]() fail if queue is dead · da8303c6
      Tejun Heo 提交于
      Currently get_request[_wait]() allocates request whether queue is dead
      or not.  This patch makes get_request[_wait]() return NULL if @q is
      dead.  blk_queue_bio() is updated to fail the submitted bio if request
      allocation fails.  While at it, add docbook comments for
      get_request[_wait]().
      
      Note that the current code has rather unclear (there are spurious DEAD
      tests scattered around) assumption that the owner of a queue
      guarantees that no request travels block layer if the queue is dead
      and this patch in itself doesn't change much; however, this will allow
      fixing the broken assumption in the next patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      da8303c6
    • T
      block: reorganize throtl_get_tg() and blk_throtl_bio() · bc16a4f9
      Tejun Heo 提交于
      blk_throtl_bio() and throtl_get_tg() have rather unusual interface.
      
      * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
        and drops queue_lock in the latter case.  Different locking context
        depending on return value is error-prone and DEAD state is scheduled
        to be protected by queue_lock anyway.  Move DEAD check inside
        queue_lock and return valid tg or NULL.
      
      * blk_throtl_bio() indicates return status both with its return value
        and in/out param **@bio.  The former is used to indicate whether
        queue is found to be dead during throtl processing.  The latter
        whether the bio is throttled.
      
        There's no point in returning DEAD check result from
        blk_throtl_bio().  The queue can die after blk_throtl_bio() is
        finished but before make_request_fn() grabs queue lock.
      
        Make it take *@bio instead and return boolean result indicating
        whether the request is throttled or not.
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bc16a4f9
    • T
      block: reorganize queue draining · e3c78ca5
      Tejun Heo 提交于
      Reorganize queue draining related code in preparation of queue exit
      changes.
      
      * Factor out actual draining from elv_quiesce_start() to
        blk_drain_queue().
      
      * Make elv_quiesce_start/end() responsible for their own locking.
      
      * Replace open-coded ELVSWITCH clearing in elevator_switch() with
        elv_quiesce_end().
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e3c78ca5
    • T
      block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() · 315fceee
      Tejun Heo 提交于
      blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
      completely bogus.  The caller must have a reference to the queue on
      entry and taking an extra reference doesn't change anything.
      
      For scsi_cmd_ioctl(), the only effect is that it ends up checking
      QUEUE_FLAG_DEAD on entry; however, this is bogus as queue can die
      right after blk_get_queue().  Dead queue should be and is handled in
      request issue path (it's somewhat broken now but that's a separate
      problem and doesn't affect this one much).
      
      throtl_get_tg() incorrectly assumes that q is rcu freed.  Also, it
      doesn't check return value of blk_get_queue().  If the queue is
      already dead, it ends up doing an extra put.
      
      Drop them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      315fceee
    • T
      block: pass around REQ_* flags instead of broken down booleans during request alloc/free · 75eb6c37
      Tejun Heo 提交于
      blk_alloc_request() and freed_request() take different combinations of
      REQ_* @flags, @priv and @is_sync when @flags is superset of the latter
      two.  Make them take @flags only.  This cleans up the code a bit and
      will ease updating allocation related REQ_* flags.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      75eb6c37
    • T
      block: move blk_throtl prototypes to block/blk.h · bc9fcbf9
      Tejun Heo 提交于
      blk_throtl interface is block internal and there's no reason to have
      them in linux/blkdev.h.  Move them to block/blk.h.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bc9fcbf9
    • T
      block: fix genhd refcounting in blkio_policy_parse_and_set() · ece84241
      Tejun Heo 提交于
      blkio_policy_parse_and_set() calls blkio_check_dev_num() to check
      whether the given dev_t is valid.  blkio_check_dev_num() uses
      get_gendisk() for verification but never puts the returned genhd
      leaking the reference.
      
      This patch collapses blkio_check_dev_num() into its caller and updates
      it such that the genhd is put before returning.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ece84241
    • T
      block: make gendisk hold a reference to its queue · 523e1d39
      Tejun Heo 提交于
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      523e1d39
  5. 28 9月, 2011 1 次提交
    • H
      block: Free queue resources at blk_release_queue() · 777eb1bf
      Hannes Reinecke 提交于
      A kernel crash is observed when a mounted ext3/ext4 filesystem is
      physically removed. The problem is that blk_cleanup_queue() frees up
      some resources eg by calling elevator_exit(), which are not checked for
      in normal operation. So we should rather move these calls to the
      destructor function blk_release_queue() as at that point all remaining
      references are gone. However, in doing so we have to ensure that any
      externally supplied queue_lock is disconnected as the driver might free
      up the lock after the call of blk_cleanup_queue(),
      Signed-off-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      777eb1bf
  6. 21 9月, 2011 3 次提交
  7. 15 9月, 2011 1 次提交
    • C
      block: refactor generic_make_request · 27a84d54
      Christoph Hellwig 提交于
      Move all the checks performed on a bio into a new helper, and call it as
      soon as bio is submitted even if it is a re-submission from ->make_request.
      
      We explicitly mark the new helper as beeing non-inlined as the stack
      usage for printing the block device name in the failure case is quite
      high and this a patch where we have to be extremely conservative about
      stack usage.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27a84d54
  8. 14 9月, 2011 1 次提交
  9. 12 9月, 2011 4 次提交
  10. 24 8月, 2011 3 次提交
  11. 23 8月, 2011 1 次提交
  12. 19 8月, 2011 1 次提交
    • J
      Revert "cfq: Remove special treatment for metadata rqs." · b53d1ed7
      Jens Axboe 提交于
      We have a kernel build regression since 3.1-rc1, which is about 10%
      regression. The kernel source is in an ext3 filesystem.
      Alex Shi bisect it to commit:
      commit a07405b7
      Author: Justin TerAvest <teravest@google.com>
      Date:   Sun Jul 10 22:09:19 2011 +0200
      
          cfq: Remove special treatment for metadata rqs.
      
      Apparently this is caused by lack metadata preemption, where ext3/ext4
      do use READ_META. I didn't see a way to fix the issue, so suggest
      reverting the patch.
      
      This reverts commit a07405b7.
      
      Reported-by: Alex Shi<alex.shi@intel.com>
      Reported-by: Shaohua Li<shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      b53d1ed7
  13. 16 8月, 2011 1 次提交
    • J
      block: fix flush machinery for stacking drivers with differring flush flags · 4853abaa
      Jeff Moyer 提交于
      Commit ae1b1539, block: reimplement
      FLUSH/FUA to support merge, introduced a performance regression when
      running any sort of fsyncing workload using dm-multipath and certain
      storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
      dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
      that dm-multipath always advertised flush+fua support, and passed
      commands on down the stack, where those flags used to get stripped off.
      The above commit changed that behavior:
      
      static inline struct request *__elv_next_request(struct request_queue *q)
      {
              struct request *rq;
      
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
                      }
      
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
      {
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
              REQ_FUA);
              unsigned skip = 0;
      ...
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
      		if (!has_fua)
      			rq->cmd_flags &= ~REQ_FUA;
      	        return rq;
      	}
      
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
      
      The agreed upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution, and
      make it function as designed.
      
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      appreciated.
      
      Cheers,
      Jeff
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4853abaa
  14. 11 8月, 2011 1 次提交
    • S
      block: improve rq_affinity placement · bcf30e75
      Shaohua Li 提交于
      This patch reverts commit 35ae66e0(block: Make rq_affinity = 1
      work as expected). The purpose is to avoid an unnecessary IPI.
      Let's take an example. My test box has cpu 0-7, one socket. Say request is
      added from CPU 1, blk_complete_request() occurs at CPU 7. Without the reverted
      patch, softirq will be done at CPU 7. With it, an IPI will be directed to CPU
      0, and softirq will be done at CPU 0. In this case, doing softirq at CPU 0 and
      CPU 7 have no difference from cache sharing point view and we can avoid an
      ipi if doing it in CPU 7.
      An immediate concern is this is just like QUEUE_FLAG_SAME_FORCE, but actually
      not. blk_complete_request() is running in interrupt handler, and currently
      I/O controller doesn't support multiple interrupts (I checked several LSI
      cards and AHCI), so only one CPU can run blk_complete_request(). This is
      still quite different as QUEUE_FLAG_SAME_FORCE.
      Since only one CPU runs softirq, the only difference with below patch is
      softirq not always runs at the first CPU of a group.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      bcf30e75
  15. 10 8月, 2011 1 次提交
    • J
      allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH · fa1bf42f
      Jeff Moyer 提交于
      blk_insert_flush has the following check:
      
      	/*
      	 * If there's data but flush is not necessary, the request can be
      	 * processed directly without going through flush machinery.  Queue
      	 * for normal execution.
      	 */
      	if ((policy & REQ_FSEQ_DATA) &&
      	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
      		list_add_tail(&rq->queuelist, &q->queue_head);
      		return;
      	}
      
      However, blk_flush_policy will not return with policy set to only
      REQ_FSEQ_DATA:
      
      static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
      {
      	unsigned int policy = 0;
      
      	if (fflags & REQ_FLUSH) {
      		if (rq->cmd_flags & REQ_FLUSH)
      			policy |= REQ_FSEQ_PREFLUSH;
      		if (blk_rq_sectors(rq))
      			policy |= REQ_FSEQ_DATA;
      		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
      			policy |= REQ_FSEQ_POSTFLUSH;
      	}
      	return policy;
      }
      
      Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set.  Fix this
      mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH
      check.
      
      Tejun notes:
      
        Hmmm... yes, this can become a correctness issue if (and only if)
        blk_queue_flush() is called to change q->flush_flags while requests
        are in-flight; otherwise, requests wouldn't reach the function at all.
        Also, I think it would be a generally good idea to always set
        FSEQ_DATA if the request has data.
      
      Cheers,
      Jeff
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      fa1bf42f
  16. 05 8月, 2011 1 次提交
    • T
      block: Make rq_affinity = 1 work as expected · 35ae66e0
      Tao Ma 提交于
      Commit 5757a6d7 introduced a new rq_affinity = 2 so as to make
      the request completed in the __make_request cpu. But it makes the
      old rq_affinity = 1 not work any more. The root cause is that
      if the 'cpu' and 'req->cpu' is in the same group and cpu != req->cpu,
      ccpu will be the same as group_cpu, so the completion will be
      excuted in the 'cpu' not 'group_cpu'.
      
      This patch fix problem by simpling removing group_cpu and the codes
      are more explicit now. If ccpu == cpu, we complete in cpu, otherwise
      we raise_blk_irq to ccpu.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Reviewed-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      35ae66e0
  17. 04 8月, 2011 1 次提交
    • A
      fault-injection: add ability to export fault_attr in arbitrary directory · dd48c085
      Akinobu Mita 提交于
      init_fault_attr_dentries() is used to export fault_attr via debugfs.
      But it can only export it in debugfs root directory.
      
      Per Forlin is working on mmc_fail_request which adds support to inject
      data errors after a completed host transfer in MMC subsystem.
      
      The fault_attr for mmc_fail_request should be defined per mmc host and
      export it in debugfs directory per mmc host like
      /sys/kernel/debug/mmc0/mmc_fail_request.
      
      init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
      introduces fault_create_debugfs_attr() which is able to create a
      directory in the arbitrary directory and replace
      init_fault_attr_dentries().
      
      [akpm@linux-foundation.org: extraneous semicolon, per Randy]
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Tested-by: NPer Forlin <per.forlin@linaro.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd48c085