1. 07 3月, 2012 13 次提交
    • T
      blkcg: use the usual get blkg path for root blkio_group · f51b802c
      Tejun Heo 提交于
      For root blkg, blk_throtl_init() was using throtl_alloc_tg()
      explicitly and cfq_init_queue() was manually initializing embedded
      cfqd->root_group, adding unnecessarily different code paths to blkg
      handling.
      
      Make both use the usual blkio_group get functions - throtl_get_tg()
      and cfq_get_cfqg() - for the root blkio_group too.  Note that
      blk_throtl_init() callsite is pushed downwards in
      blk_alloc_queue_node() so that @q is sufficiently initialized for
      throtl_get_tg().
      
      This simplifies root blkg handling noticeably for cfq and will allow
      further modularization of blkcg API.
      
      -v2: Vivek pointed out that using cfq_get_cfqg() won't work if
           CONFIG_CFQ_GROUP_IOSCHED is disabled.  Fix it by factoring out
           initialization of base part of cfqg into cfq_init_cfqg_base() and
           alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f51b802c
    • T
      blkcg: add blkio_policy[] array and allow one policy per policy ID · 035d10b2
      Tejun Heo 提交于
      Block cgroup policies are maintained in a linked list and,
      theoretically, multiple policies sharing the same policy ID are
      allowed.
      
      This patch temporarily restricts one policy per plid and adds
      blkio_policy[] array which indexes registered policy types by plid.
      Both the restriction and blkio_policy[] array are transitional and
      will be removed once API cleanup is complete.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      035d10b2
    • T
      blkcg: use q and plid instead of opaque void * for blkio_group association · ca32aefc
      Tejun Heo 提交于
      blkgio_group is association between a block cgroup and a queue for a
      given policy.  Using opaque void * for association makes things
      confusing and hinders factoring of common code.  Use request_queue *
      and, if necessary, policy id instead.
      
      This will help block cgroup API cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ca32aefc
    • T
      blkcg: update blkg get functions take blkio_cgroup as parameter · 0a5a7d0e
      Tejun Heo 提交于
      In both blkg get functions - throtl_get_tg() and cfq_get_cfqg(),
      instead of obtaining blkcg of %current explicitly, let the caller
      specify the blkcg to use as parameter and make both functions hold on
      to the blkcg.
      
      This is part of block cgroup interface cleanup and will help making
      blkcg API more modular.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0a5a7d0e
    • T
      blkcg: move rcu_read_lock() outside of blkio_group get functions · 2a7f1244
      Tejun Heo 提交于
      rcu_read_lock() in throtl_get_tb() and cfq_get_cfqg() holds onto
      @blkcg while looking up blkg.  For API cleanup, the next patch will
      make the caller responsible for determining @blkcg to look blkg from
      and let them specify it as a parameter.  Move rcu read locking out to
      the callers to prepare for the change.
      
      -v2: Originally this patch was described as a fix for RCU read locking
           bug around @blkg, which Vivek pointed out to be incorrect.  It
           was from misunderstanding the role of rcu locking as protecting
           @blkg not @blkcg.  Patch description updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2a7f1244
    • T
      blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Tejun Heo 提交于
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that applies to them, as an interim step, all per-policy blkg
      data will be shot down.
      
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      
      * blk-throtl dosen't shoot down root_tg.  This is to ease transition.
        Unified blkg will always have persistent root group and not shooting
        down root_tg for now eases transition to that point by avoiding
        having to update td->root_tg and is safe as blk-throtl can never be
        disabled
      
      -v2: Vivek pointed out that group list is not guaranteed to be empty
           on return from clear function if it raced cgroup removal and
           lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in limbo state between blkcg and request_queue locks.
      
           blk-throtl no longer shoots down root_tg to avoid breaking
           td->root_tg.
      
           Also, Nest queue_lock inside blkio_list_lock not the other way
           around to avoid introduce possible deadlock via blkcg lock.
      
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      72e06c25
    • T
      block: extend queue bypassing to cover blkcg policies · 6ecf23af
      Tejun Heo 提交于
      Extend queue bypassing such that dying queue is always bypassing and
      blk-throttle is drained on bypass.  With blkcg policies updated to
      test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
      no bio or request is held by or going through blkcg policies on a
      bypassing queue.
      
      This will be used to implement blkg cleanup on elevator switches and
      policy changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6ecf23af
    • T
      block: implement blk_queue_bypass_start/end() · d732580b
      Tejun Heo 提交于
      Rename and extend elv_queisce_start/end() to
      blk_queue_bypass_start/end() which are exported and supports nesting
      via @q->bypass_depth.  Also add blk_queue_bypass() to test bypass
      state.
      
      This will be further extended and used for blkio_group management.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d732580b
    • T
      elevator: make elevator_init_fn() return 0/-errno · b2fab5ac
      Tejun Heo 提交于
      elevator_ops->elevator_init_fn() has a weird return value.  It returns
      a void * which the caller should assign to q->elevator->elevator_data
      and %NULL return denotes init failure.
      
      Update such that it returns integer 0/-errno and sets elevator_data
      directly as necessary.
      
      This makes the interface more conventional and eases further cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b2fab5ac
    • T
      elevator: clear auxiliary data earlier during elevator switch · 5a5bafdc
      Tejun Heo 提交于
      Elevator switch tries hard to keep as much as context until new
      elevator is ready so that it can revert to the original state if
      initializing the new elevator fails for some reason.  Unfortunately,
      with more auxiliary contexts to manage, this makes elevator init and
      exit paths too complex and fragile.
      
      This patch makes elevator_switch() unregister the current elevator and
      flush icq's before start initializing the new one.  As we still keep
      the old elevator itself, the only difference is that we lose icq's on
      rare occassions of switching failure, which isn't critical at all.
      
      Note that this makes explicit elevator parameter to
      elevator_init_queue() and __elv_register_queue() unnecessary as they
      always can use the current elevator.
      
      This patch enables block cgroup cleanups.
      
      -v2: blk_add_trace_msg() prints elevator name from @new_e instead of
           @e->type as the local variable no longer exists.  This caused
           build failure on CONFIG_BLK_DEV_IO_TRACE.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5a5bafdc
    • T
      cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED · b95ada55
      Tejun Heo 提交于
      cfq has been registering zeroed blkio_poilcy_cfq if CFQ_GROUP_IOSCHED
      is disabled.  This fortunately doesn't collide with blk-throtl as
      BLKIO_POLICY_PROP is zero but is unnecessary and risky.  Just don't
      register it if not enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b95ada55
    • T
      blkcg: make CONFIG_BLK_CGROUP bool · 32e380ae
      Tejun Heo 提交于
      Block cgroup core can be built as module; however, it isn't too useful
      as blk-throttle can only be built-in and cfq-iosched is usually the
      default built-in scheduler.  Scheduled blkcg cleanup requires calling
      into blkcg from block core.  To simplify that, disallow building blkcg
      as module by making CONFIG_BLK_CGROUP bool.
      
      If building blkcg core as module really matters, which I doubt, we can
      revisit it after blkcg API cleanup.
      
      -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
           depend on BLK_CGROUP.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      32e380ae
    • T
      block: blk-throttle should be drained regardless of q->elevator · b855b04a
      Tejun Heo 提交于
      Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if
      q->elevator doesn't exist; however, bio based drivers don't have
      elevator initialized but can still use blk-throttle.  This patch moves
      q->elevator test inside blk_drain_queue() such that only
      elv_drain_elevator() is skipped if !q->elevator.
      
      -v2: loop can have registered queue which has NULL request_fn.  Make
           sure we don't call into __blk_run_queue() in such cases.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      
      Fold in bug fix from Vivek.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b855b04a
  2. 02 3月, 2012 3 次提交
    • A
      Block: use a freezable workqueue for disk-event polling · 62d3c543
      Alan Stern 提交于
      This patch (as1519) fixes a bug in the block layer's disk-events
      polling.  The polling is done by a work routine queued on the
      system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
      polling continues even in the middle of a system sleep transition.
      
      Obviously, polling a suspended drive for media changes and such isn't
      a good thing to do; in the case of USB mass-storage devices it can
      lead to real problems requiring device resets and even re-enumeration.
      
      The patch fixes things by creating a new system-wide, non-reentrant,
      freezable workqueue and using it for disk-events polling.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      CC: <stable@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      62d3c543
    • S
      block: fix __blkdev_get and add_disk race condition · 9f53d2fe
      Stanislaw Gruszka 提交于
      The following situation might occur:
      
      __blkdev_get:			add_disk:
      
      				register_disk()
      get_gendisk()
      
      disk_block_events()
      	disk->ev == NULL
      
      				disk_add_events()
      
      __disk_unblock_events()
      	disk->ev != NULL
      	--ev->block
      
      Then we unblock events, when they are suppose to be blocked. This can
      trigger events related block/genhd.c warnings, but also can crash in
      sd_check_events() or other places.
      
      I'm able to reproduce crashes with the following scripts (with
      connected usb dongle as sdb disk).
      
      <snip>
      DEV=/dev/sdb
      ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue
      
      function stop_me()
      {
      	for i in `jobs -p` ; do kill $i 2> /dev/null ; done
      	exit
      }
      
      trap stop_me SIGHUP SIGINT SIGTERM
      
      for ((i = 0; i < 10; i++)) ; do
      	while true; do fdisk -l $DEV  2>&1 > /dev/null ; done &
      done
      
      while true ; do
      echo 1 > $ENABLE
      sleep 1
      echo 0 > $ENABLE
      done
      </snip>
      
      I use the script to verify patch fixing oops in sd_revalidate_disk
      http://marc.info/?l=linux-scsi&m=132935572512352&w=2
      Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
      sd_revalidate_disk" or this one, script easily crash kernel within
      a few seconds. With both patches applied I do not observe crash.
      Unfortunately after some time (dozen of minutes), script will hung in:
      
      [ 1563.906432]  [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
      [ 1563.906437]  [<c04532d5>] msleep+0x15/0x20
      [ 1563.906443]  [<c05d60b2>] blk_drain_queue+0x32/0xd0
      [ 1563.906447]  [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
      [ 1563.906454]  [<c06d278f>] scsi_free_queue+0x3f/0x60
      [ 1563.906459]  [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
      [ 1563.906463]  [<c06d4aff>] scsi_forget_host+0x4f/0x60
      [ 1563.906468]  [<c06cd84a>] scsi_remove_host+0x5a/0xf0
      [ 1563.906482]  [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
      [ 1563.906490]  [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]
      
      Anyway I think this patch is some step forward.
      
      As drawback, I do not teardown on sysfs file create error, because I do
      not know how to nullify disk->ev (since it can be used). However add_disk
      error handling practically does not exist too, and things will work
      without this sysfs file, except events will not be exported to user
      space.
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9f53d2fe
    • J
      block: Fix NULL pointer dereference in sd_revalidate_disk · fe316bf2
      Jun'ichi Nomura 提交于
      Since 2.6.39 (1196f8b8), when a driver returns -ENOMEDIUM for open(),
      __blkdev_get() calls rescan_partitions() to remove
      in-kernel partition structures and raise KOBJ_CHANGE uevent.
      
      However it ends up calling driver's revalidate_disk without open
      and could cause oops.
      
      In the case of SCSI:
      
        process A                  process B
        ----------------------------------------------
        sys_open
          __blkdev_get
            sd_open
              returns -ENOMEDIUM
                                   scsi_remove_device
                                     <scsi_device torn down>
            rescan_partitions
              sd_revalidate_disk
                <oops>
      Oopses are reported here:
      http://marc.info/?l=linux-scsi&m=132388619710052
      
      This patch separates the partition invalidation from rescan_partitions()
      and use it for -ENOMEDIUM case.
      Reported-by: NHuajun Li <huajun.li.lee@gmail.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fe316bf2
  3. 15 2月, 2012 3 次提交
    • T
      block: exit_io_context() should call elevator_exit_icq_fn() · 621032ad
      Tejun Heo 提交于
      While updating locking, b2efa052 "block, cfq: unlink
      cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation
      from exit_io_context() to the final ioc put.  While this doesn't cause
      catastrophic failure, it effectively removes task exit notification to
      elevator and cause noticeable IO performance degradation with CFQ.
      
      On task exit, CFQ used to immediately expire the slice if it was being
      used by the exiting task as no more IO would be issued by the task;
      however, after b2efa052, the notification is lost and disk could sit
      idle needlessly, leading to noticeable IO performance degradation for
      certain workloads.
      
      This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
      elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
      from exit_io_context().  ICQ_EXITED flag is added to avoid invoking
      the callback more than once for the same icq.
      
      Walking icq_list from ioc side and invoking elevator callback requires
      reverse double locking.  This may be better implemented using RCU;
      unfortunately, using RCU isn't trivial.  e.g. RCU protection would
      need to cover request_queue and queue_lock switch on cleanup makes
      grabbing queue_lock from RCU unsafe.  Reverse double locking should
      do, at least for now.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-bisected-by: NShaohua Li <shli@kernel.org>
      LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com>
      Tested-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      621032ad
    • T
      block: simplify ioc_release_fn() · 2274b029
      Tejun Heo 提交于
      Reverse double lock dancing in ioc_release_fn() can be simplified by
      just using trylock on the queue_lock and back out from ioc lock on
      trylock failure.  Simplify it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Tested-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2274b029
    • T
      block: replace icq->changed with icq->flags · d705ae6b
      Tejun Heo 提交于
      icq->changed was used for ICQ_*_CHANGED bits.  Rename it to flags and
      access it under ioc->lock instead of using atomic bitops.
      ioc_get_changed() is added so that the changed part can be fetched and
      cleared as before.
      
      icq->flags will be used to carry other flags.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Tested-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d705ae6b
  4. 11 2月, 2012 1 次提交
    • T
      block: fix lockdep warning on io_context release put_io_context() · d8c66c5d
      Tejun Heo 提交于
      11a3122f "block: strip out locking optimization in put_io_context()"
      removed ioc_lock depth lockdep annoation along with locking
      optimization; however, while recursing from put_io_context() is no
      longer possible, ioc_release_fn() may still end up putting the last
      reference of another ioc through elevator, which wlil grab ioc->lock
      triggering spurious (as the ioc is always different one) A-A deadlock
      warning.
      
      As this can only happen one time from ioc_release_fn(), using non-zero
      subclass from ioc_release_fn() is enough.  Use subclass 1.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d8c66c5d
  5. 09 2月, 2012 1 次提交
    • S
      bsg: fix sysfs link remove warning · 37b40adf
      Stanislaw Gruszka 提交于
      We create "bsg" link if q->kobj.sd is not NULL, so remove it only
      when the same condition is true.
      
      Fixes:
      
      WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
      sysfs: can not remove 'bsg', no directory
      Call Trace:
        [<c0429683>] warn_slowpath_common+0x6a/0x7f
        [<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
        [<c042970b>] warn_slowpath_fmt+0x2b/0x2f
        [<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
        [<c053969a>] sysfs_remove_link+0x20/0x23
        [<c05d88f1>] bsg_unregister_queue+0x40/0x6d
        [<c0692263>] __scsi_remove_device+0x31/0x9d
        [<c069149f>] scsi_forget_host+0x41/0x52
        [<c0689fa9>] scsi_remove_host+0x71/0xe0
        [<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
        [<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
        [<c06c29de>] usb_unbind_interface+0x4e/0x109
        [<c067a80f>] __device_release_driver+0x6b/0xa6
        [<c067a861>] device_release_driver+0x17/0x22
        [<c067a46a>] bus_remove_device+0xd6/0xe6
        [<c06785e2>] device_del+0xf2/0x137
        [<c06c101f>] usb_disable_device+0x94/0x1a0
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      37b40adf
  6. 08 2月, 2012 2 次提交
    • T
      block: don't call elevator callbacks for plug merges · 07c2bd37
      Tejun Heo 提交于
      Plug merge calls two elevator callbacks outside queue lock -
      elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
      attempt_plug_merge() suggests that elevator is guaranteed to be there
      through the existing request on the plug list, nothing prevents plug
      merge from calling into dying or initializing elevator.
      
      For regular merges, bypass ensures elvpriv count to reach zero, which
      in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
      from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
      the requests haven't gone through elevator insertion yet, it doesn't
      have SOFTBARRIER set allowing merges on a bypassed queue.
      
      This, for example, leads to the following crash during elevator
      switch.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       PGD 112cbc067 PUD 115d5c067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU 1
       Modules linked in: deadline_iosched
      
       Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
       RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
       RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
       RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
       RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
       R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
       FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
       Stack:
        ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
        ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
        ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
       Call Trace:
        [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
        [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
        [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
        [<ffffffff81396a5a>] generic_make_request+0xca/0x100
        [<ffffffff81396b04>] submit_bio+0x74/0x100
        [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
        [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff811986b2>] do_sync_read+0xe2/0x120
        [<ffffffff81199345>] vfs_read+0xc5/0x180
        [<ffffffff81199501>] sys_read+0x51/0x90
        [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
      
      There are multiple ways to fix this including making plug merge check
      ELVPRIV; however,
      
      * Calling into elevator outside queue lock is confusing and
        error-prone.
      
      * Requests on plug list aren't known to the elevator.  They aren't on
        the elevator yet, so there's no elevator specific state to update.
      
      * Given the nature of plug merges - collecting bio's for the same
        purpose from the same issuer - elevator specific restrictions aren't
        applicable.
      
      So, simply don't call into elevator methods from plug merge by moving
      elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
      using blk_try_merge() in attempt_plug_merge().
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      
      Note that this makes per-cgroup merged stats skip plug merging.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      07c2bd37
    • T
      block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Tejun Heo 提交于
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      bool.
      
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      050c8ea8
  7. 07 2月, 2012 1 次提交
    • T
      block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo 提交于
      put_io_context() performed a complex trylock dancing to avoid
      deferring ioc release to workqueue.  It was also broken on UP because
      trylock was always assumed to succeed which resulted in unbalanced
      preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      11a3122f
  8. 06 2月, 2012 1 次提交
    • S
      block: fix ioc locking warning · 9fa73472
      Shaohua Li 提交于
      Meelis reported a warning:
      
      WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
      Hardware name: 939Dual-SATA2
      timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
      Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
      Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd1256667 #176
      Call Trace:
       <IRQ>  [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96
       [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
       [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43
       [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa
       [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
       [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec
       [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31
       [<ffffffff810145fd>] ? apic_write+0x11/0x13
       [<ffffffff81027475>] __do_softirq+0x74/0xfa
       [<ffffffff812f337a>] call_softirq+0x1a/0x30
       [<ffffffff81002ff9>] do_softirq+0x31/0x68
       [<ffffffff810276cf>] irq_exit+0x3d/0xa3
       [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77
       [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70
       <EOI>  [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
       [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
       [<ffffffff8100801f>] ? default_idle+0x1e/0x32
       [<ffffffff81008019>] ? default_idle+0x18/0x32
       [<ffffffff810008b1>] cpu_idle+0x87/0xd1
       [<ffffffff812de861>] rest_init+0x85/0x89
       [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8
       [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82
       [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7
      
      this_q == locked_q is possible. There are two problems here:
      1. In UP case, there is preemption counter issue as spin_trylock always
      successes.
      2. In SMP case, the loop breaks too earlier.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Reported-by: NKnut Petersen <Knut_Petersen@t-online.de>
      Tested-by: NKnut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9fa73472
  9. 19 1月, 2012 2 次提交
    • S
      block: fix NULL icq_cache reference · 05c30b95
      Shaohua Li 提交于
      Vivek reported a kernel crash:
      [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
      [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [   94.218004] CPU 0
      [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
      [   94.218004]
      [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
      [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
      [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
      [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
      [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
      [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
      [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
      [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
      [   94.218004] Stack:
      [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
      [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
      [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
      [   94.218004] Call Trace:
      [   94.218004]  <IRQ>
      [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
      [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
      [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
      [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
      [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
      [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
      [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
      [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
      [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
      [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
      [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
      
      Once a queue is quiesced, it's not supposed to have any elvpriv data or
      icq's, and elevator switching depends on that.  Request alloc path
      followed the rule for elvpriv data but forgot apply it to icq's
      leading to the following crash during elevator switch. Fix it by not
      allocating icq's if ELVPRIV is not set for the request.
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      Tested-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05c30b95
    • S
      block,cfq: change code order · df0793ab
      Shaohua Li 提交于
      cfq_slice_expired will change saved_workload_slice. It should be called
      first so saved_workload_slice is correctly set to 0 after workload type
      is changed.
      This fixes the code order changed by 54b466e4.
      Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      df0793ab
  10. 18 1月, 2012 1 次提交
    • J
      cfq-iosched: fix use-after-free of cfqq · 54b466e4
      Jens Axboe 提交于
      With the changes in life time management between the cfq IO contexts
      and the cfq queues, we now risk having cfqd->active_queue being
      freed when cfq_slice_expired() is being called. cfq_preempt_queue()
      caches this queue and uses it after calling said function, causing
      a use-after-free condition. This triggers the following oops,
      when cfqq_type() attempts to dereference it:
      
      BUG: unable to handle kernel paging request at ffff8800746c4f0c
      IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20
      PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      CPU 3
      Modules linked in:
      
      Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs
      RIP: 0010:[<ffffffff81266d59>]  [<ffffffff81266d59>] cfqq_type+0xb/0x20
      RSP: 0018:ffff880079c11778  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000
      RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08
      RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001
      R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0
      R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002
      FS:  00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040)
      Stack:
       ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43
       ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98
       0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149
      Call Trace:
       [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c
       [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31
       [<ffffffff81254149>] __elv_add_request+0x1ca/0x200
       [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312
       [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0
       [<ffffffff8125907b>] submit_bio+0xbf/0xca
       [<ffffffff81136ec7>] submit_bh+0xdf/0xfe
       [<ffffffff81176d04>] ext3_bread+0x50/0x99
       [<ffffffff811785b3>] dx_probe+0x38/0x291
       [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219
       [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406
       [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0
       [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191
       [<ffffffff8117a330>] ext3_lookup+0x39/0xe1
       [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c
       [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5
       [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef
       [<ffffffff8111b557>] path_lookupat+0x59/0x5ea
       [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a
       [<ffffffff8111bce0>] do_path_lookup+0x23/0x59
       [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99
       [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56
       [<ffffffff8111d02d>] user_path_at+0x11/0x13
       [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64
       [<ffffffff8111425a>] vfs_stat+0x1b/0x1d
       [<ffffffff81114359>] sys_newstat+0x1a/0x33
       [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42
       [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b
      Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3
      RIP  [<ffffffff81266d59>] cfqq_type+0xb/0x20
       RSP <ffff880079c11778>
      CR2: ffff8800746c4f0c
      
      Get rid of the caching of cfqd->active_queue, and reorder the
      check so that it happens before we expire the active queue.
      
      Thanks to Tejun for pin pointing the error location.
      Reported-by: NChris Mason <chris.mason@oracle.com>
      Tested-by: NChris Mason <chris.mason@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      54b466e4
  11. 15 1月, 2012 3 次提交
    • J
      Revert "block: recursive merge requests" · 5d381efb
      Jens Axboe 提交于
      This reverts commit 27419322.
      
      We have some problems related to selection of empty queues
      that need to be resolved, evidence so far points to the
      recursive merge logic making either being the cause or at
      least the accelerator for this. So revert it for now, until
      we figure this out.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5d381efb
    • P
      block: fail SCSI passthrough ioctls on partition devices · 0bfc96cb
      Paolo Bonzini 提交于
      Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
      will pass the command to the underlying block device.  This is
      well-known, but it is also a large security problem when (via Unix
      permissions, ACLs, SELinux or a combination thereof) a program or user
      needs to be granted access only to part of the disk.
      
      This patch lets partitions forward a small set of harmless ioctls;
      others are logged with printk so that we can see which ioctls are
      actually sent.  In my tests only CDROM_GET_CAPABILITY actually occurred.
      Of course it was being sent to a (partition on a) hard disk, so it would
      have failed with ENOTTY and the patch isn't changing anything in
      practice.  Still, I'm treating it specially to avoid spamming the logs.
      
      In principle, this restriction should include programs running with
      CAP_SYS_RAWIO.  If for example I let a program access /dev/sda2 and
      /dev/sdb, it still should not be able to read/write outside the
      boundaries of /dev/sda2 independent of the capabilities.  However, for
      now programs with CAP_SYS_RAWIO will still be allowed to send the
      ioctls.  Their actions will still be logged.
      
      This patch does not affect the non-libata IDE driver.  That driver
      however already tests for bd != bd->bd_contains before issuing some
      ioctl; it could be restricted further to forbid these ioctls even for
      programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
      
      Cc: linux-scsi@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: James Bottomley <JBottomley@parallels.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      [ Make it also print the command name when warning - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bfc96cb
    • P
      block: add and use scsi_blk_cmd_ioctl · 577ebb37
      Paolo Bonzini 提交于
      Introduce a wrapper around scsi_cmd_ioctl that takes a block device.
      
      The function will then be enhanced to detect partition block devices
      and, in that case, subject the ioctls to whitelisting.
      
      Cc: linux-scsi@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: James Bottomley <JBottomley@parallels.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      577ebb37
  12. 11 1月, 2012 2 次提交
  13. 06 1月, 2012 1 次提交
    • L
      vfs: fix up ENOIOCTLCMD error handling · 07d106d0
      Linus Torvalds 提交于
      We're doing some odd things there, which already messes up various users
      (see the net/socket.c code that this removes), and it was going to add
      yet more crud to the block layer because of the incorrect error code
      translation.
      
      ENOIOCTLCMD is not an error return that should be returned to user mode
      from the "ioctl()" system call, but it should *not* be translated as
      EINVAL ("Invalid argument").  It should be translated as ENOTTY
      ("Inappropriate ioctl for device").
      
      That EINVAL confusion has apparently so permeated some code that the
      block layer actually checks for it, which is sad.  We continue to do so
      for now, but add a big comment about how wrong that is, and we should
      remove it entirely eventually.  In the meantime, this tries to keep the
      changes localized to just the EINVAL -> ENOTTY fix, and removing code
      that makes it harder to do the right thing.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07d106d0
  14. 04 1月, 2012 5 次提交
  15. 29 12月, 2011 1 次提交