1. 04 11月, 2011 1 次提交
    • T
      block: don't call blk_drain_queue() if elevator is not up · 6dd9ad7d
      Tejun Heo 提交于
      blk_cleanup_queue() may be called before elevator is set up on a
      queue which triggers the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       ...
       Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
       Bochs Bochs
       RIP: 0010:[<ffffffff8125a69c>]  [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       ...
       Call Trace:
        [<ffffffff8125da92>] blk_drain_queue+0x42/0x70
        [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
        [<ffffffff81469640>] md_free+0x50/0x70
        [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
        [<ffffffff81270d56>] kref_put+0x36/0xa0
        [<ffffffff8126f2b7>] kobject_put+0x27/0x60
        [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
        [<ffffffff81083450>] process_one_work+0x100/0x3b0
        [<ffffffff8108527f>] worker_thread+0x15f/0x3a0
        [<ffffffff81089937>] kthread+0x87/0x90
        [<ffffffff81621834>] kernel_thread_helper+0x4/0x10
      
      Fix it by making blk_cleanup_queue() check whether q->elevator is set
      up before invoking blk_drain_queue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6dd9ad7d
  2. 25 10月, 2011 4 次提交
  3. 24 10月, 2011 6 次提交
    • J
      Merge branch 'for-linus' into for-3.2/core · 83157223
      Jens Axboe 提交于
      83157223
    • T
      block: make gendisk hold a reference to its queue · f992ae80
      Tejun Heo 提交于
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f992ae80
    • J
      blk-flush: move the queue kick into · e67b77c7
      Jeff Moyer 提交于
      A dm-multipath user reported[1] a problem when trying to boot
      a kernel with commit 4853abaa
      (block: fix flush machinery for stacking drivers with differring
      flush flags) applied.  It turns out that an empty flush request
      can be sent into blk_insert_flush.  When the BUG_ON was fixed
      to allow for this, I/O on the underlying device would stall.  The
      reason is that blk_insert_cloned_request does not kick the queue.
      In the aforementioned commit, I had added a special case to
      kick the queue if data was sent down but the queue flags did
      not require a flush.  A better solution is to push the queue
      kick up into blk_insert_cloned_request.
      
      This patch, along with a follow-on which fixes the BUG_ON, fixes
      the issue reported.
      
      [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.htmlReported-by: NChristophe Saout <christophe@saout.de>
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      
      Stable note: 3.1
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e67b77c7
    • J
      blk-flush: fix invalid BUG_ON in blk_insert_flush · 834f9f61
      Jeff Moyer 提交于
      A user reported a regression due to commit
      4853abaa (block: fix flush
      machinery for stacking drivers with differring flush flags).
      Part of the problem is that blk_insert_flush required a
      single bio be attached to the request.  In reality, having
      no attached bio is also a valid case, as can be observed with
      an empty flush.
      
      [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.htmlReported-by: NChristophe Saout <christophe@saout.de>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com
      Acked-by: NTejun Heo <tj@kernel.org>
      
      Stable note: 3.1
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      834f9f61
    • T
      block: Remove the control of complete cpu from bio. · 9562ad9a
      Tao Ma 提交于
      bio originally has the functionality to set the complete cpu, but
      it is broken.
      
      Chirstoph said that "This code is unused, and from the all the
      discussions lately pretty obviously broken.  The only thing keeping
      it serves is creating more confusion and possibly more bugs."
      
      And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
      with leaving cpu control to the request based drivers, they are the
      only ones that can toggle the setting anyway".
      
      So this patch tries to remove all the work of controling complete cpu
      from a bio.
      
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9562ad9a
    • J
      block: fix a typo in the blk-cgroup.h file · e890413a
      Jie Liu 提交于
      byptes -> bytes.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e890413a
  4. 21 10月, 2011 1 次提交
    • D
      block: initialize the bounce pool if high memory may be added later · 3bcfeaf9
      David Vrabel 提交于
      init_emergency_pool() does not create the page pool for bouncing block
      requests if the current count of high pages is zero.  If high memory
      may be added later (either via memory hotplug or a balloon driver in a
      virtualized system) then a oops occurs if a request with a high page
      need bouncing because the pool does not exist.
      
      So, always create the pool if memory hotplug is enabled and change the
      test so it's valid even if all high pages are currently in the balloon
      (the balloon drivers adjust totalhigh_pages but not max_pfn).
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3bcfeaf9
  5. 19 10月, 2011 11 次提交
    • T
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo 提交于
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      reference.
      
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
      
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       ...
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       ...
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      synchronization.
      
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
        queue_lock.
      
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
        released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c9a929dd
    • T
      block: drop @tsk from attempt_plug_merge() and explain sync rules · bd87b589
      Tejun Heo 提交于
      attempt_plug_merge() accesses elevator without holding queue_lock and
      may call into ->elevator_bio_merge_fn().  The elvator is guaranteed to
      be valid because it's accessed iff the plugged list has requests and
      elevator is never exited with live requests, so as long as the
      elevator method can deal with unlocked access, this is safe.
      
      Explain the sync rules around attempt_plug_merge() and drop the
      unnecessary @tsk parameter.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bd87b589
    • T
      block: make get_request[_wait]() fail if queue is dead · da8303c6
      Tejun Heo 提交于
      Currently get_request[_wait]() allocates request whether queue is dead
      or not.  This patch makes get_request[_wait]() return NULL if @q is
      dead.  blk_queue_bio() is updated to fail the submitted bio if request
      allocation fails.  While at it, add docbook comments for
      get_request[_wait]().
      
      Note that the current code has rather unclear (there are spurious DEAD
      tests scattered around) assumption that the owner of a queue
      guarantees that no request travels block layer if the queue is dead
      and this patch in itself doesn't change much; however, this will allow
      fixing the broken assumption in the next patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      da8303c6
    • T
      block: reorganize throtl_get_tg() and blk_throtl_bio() · bc16a4f9
      Tejun Heo 提交于
      blk_throtl_bio() and throtl_get_tg() have rather unusual interface.
      
      * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
        and drops queue_lock in the latter case.  Different locking context
        depending on return value is error-prone and DEAD state is scheduled
        to be protected by queue_lock anyway.  Move DEAD check inside
        queue_lock and return valid tg or NULL.
      
      * blk_throtl_bio() indicates return status both with its return value
        and in/out param **@bio.  The former is used to indicate whether
        queue is found to be dead during throtl processing.  The latter
        whether the bio is throttled.
      
        There's no point in returning DEAD check result from
        blk_throtl_bio().  The queue can die after blk_throtl_bio() is
        finished but before make_request_fn() grabs queue lock.
      
        Make it take *@bio instead and return boolean result indicating
        whether the request is throttled or not.
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bc16a4f9
    • T
      block: reorganize queue draining · e3c78ca5
      Tejun Heo 提交于
      Reorganize queue draining related code in preparation of queue exit
      changes.
      
      * Factor out actual draining from elv_quiesce_start() to
        blk_drain_queue().
      
      * Make elv_quiesce_start/end() responsible for their own locking.
      
      * Replace open-coded ELVSWITCH clearing in elevator_switch() with
        elv_quiesce_end().
      
      This patch doesn't cause any visible functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e3c78ca5
    • T
      block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() · 315fceee
      Tejun Heo 提交于
      blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
      completely bogus.  The caller must have a reference to the queue on
      entry and taking an extra reference doesn't change anything.
      
      For scsi_cmd_ioctl(), the only effect is that it ends up checking
      QUEUE_FLAG_DEAD on entry; however, this is bogus as queue can die
      right after blk_get_queue().  Dead queue should be and is handled in
      request issue path (it's somewhat broken now but that's a separate
      problem and doesn't affect this one much).
      
      throtl_get_tg() incorrectly assumes that q is rcu freed.  Also, it
      doesn't check return value of blk_get_queue().  If the queue is
      already dead, it ends up doing an extra put.
      
      Drop them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      315fceee
    • T
      block: pass around REQ_* flags instead of broken down booleans during request alloc/free · 75eb6c37
      Tejun Heo 提交于
      blk_alloc_request() and freed_request() take different combinations of
      REQ_* @flags, @priv and @is_sync when @flags is superset of the latter
      two.  Make them take @flags only.  This cleans up the code a bit and
      will ease updating allocation related REQ_* flags.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      75eb6c37
    • T
      block: move blk_throtl prototypes to block/blk.h · bc9fcbf9
      Tejun Heo 提交于
      blk_throtl interface is block internal and there's no reason to have
      them in linux/blkdev.h.  Move them to block/blk.h.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bc9fcbf9
    • T
      block: fix genhd refcounting in blkio_policy_parse_and_set() · ece84241
      Tejun Heo 提交于
      blkio_policy_parse_and_set() calls blkio_check_dev_num() to check
      whether the given dev_t is valid.  blkio_check_dev_num() uses
      get_gendisk() for verification but never puts the returned genhd
      leaking the reference.
      
      This patch collapses blkio_check_dev_num() into its caller and updates
      it such that the genhd is put before returning.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ece84241
    • T
      block: make gendisk hold a reference to its queue · 523e1d39
      Tejun Heo 提交于
      The following command sequence triggers an oops.
      
      # mount /dev/sdb1 /mnt
      # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
      # umount /mnt
      
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
      
       Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
       RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
      ...
       Call Trace:
        [<ffffffff810d2845>] lock_acquire+0x95/0x140
        [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
        [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
        [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
        [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
        [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
        [<ffffffff811c40df>] blkdev_put+0x5f/0x190
        [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
        [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
        [<ffffffff8119003a>] deactivate_super+0x4a/0x70
        [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
        [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
        [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b
      
      This is because bdev holds on to disk but disk doesn't pin the
      associated queue.  If a SCSI device is removed while the device is
      still open, the sdev puts the base reference to the queue on release.
      When the bdev is finally released, the associated queue is already
      gone along with the bdi and bdev_inode_switch_bdi() ends up
      dereferencing already freed bdi.
      
      Even if it were not for this bug, disk not holding onto the associated
      queue is very unusual and error-prone.
      
      Fix it by making add_disk() take an extra reference to its queue and
      put it on disk_release() and ensuring that disk and its fops owner are
      put in that order after all accesses to the disk and queue are
      complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      523e1d39
    • J
      Merge branch 'v3.1-rc10' into for-3.2/core · 5c04b426
      Jens Axboe 提交于
      Conflicts:
      	block/blk-core.c
      	include/linux/blkdev.h
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5c04b426
  6. 18 10月, 2011 1 次提交
  7. 17 10月, 2011 3 次提交
    • L
      Avoid using variable-length arrays in kernel/sys.c · a84a79e4
      Linus Torvalds 提交于
      The size is always valid, but variable-length arrays generate worse code
      for no good reason (unless the function happens to be inlined and the
      compiler sees the length for the simple constant it is).
      
      Also, there seems to be some code generation problem on POWER, where
      Henrik Bakken reports that register r28 can get corrupted under some
      subtle circumstances (interrupt happening at the wrong time?).  That all
      indicates some seriously broken compiler issues, but since variable
      length arrays are bad regardless, there's little point in trying to
      chase it down.
      
      "Just don't do that, then".
      Reported-by: NHenrik Grindal Bakken <henribak@cisco.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a84a79e4
    • C
      loop: remove the incorrect write_begin/write_end shortcut · 456be148
      Christoph Hellwig 提交于
      Currently the loop device tries to call directly into write_begin/write_end
      instead of going through ->write if it can.  This is a fairly nasty shortcut
      as write_begin and write_end are only callbacks for the generic write code
      and expect to be called with filesystem specific locks held.
      
      This code currently causes various issues for clustered filesystems as it
      doesn't take the required cluster locks, and it also causes issues for XFS
      as it doesn't properly lock against the swapext ioctl as called by the
      defragmentation tools.  This in case causes data corruption if
      defragmentation hits a busy loop device in the wrong time window, as
      reported by RH QA.
      
      The reason why we have this shortcut is that it saves a data copy when
      doing a transformation on the loop device, which is the technical term
      for using cryptoloop (or an XOR transformation).  Given that cryptoloop
      has been deprecated in favour of dm-crypt my opinion is that we should
      simply drop this shortcut instead of finding complicated ways to to
      introduce a formal interface for this shortcut.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      456be148
    • L
      Merge branch 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm · 8bc03e8f
      Linus Torvalds 提交于
      * 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm:
        ARM: 7128/1: vic: Don't write to the read-only register VIC_IRQ_STATUS
        ARM: 7122/1: localtimer: add header linux/errno.h explicitly
        ARM: 7117/1: perf: fix HW_CACHE_* events on Cortex-A9
        ARM: 7113/1: mm: Align bank start to MAX_ORDER_NR_PAGES
      8bc03e8f
  8. 15 10月, 2011 4 次提交
  9. 14 10月, 2011 8 次提交
  10. 13 10月, 2011 1 次提交