1. 07 3月, 2012 4 次提交
    • T
      elevator: clear auxiliary data earlier during elevator switch · 5a5bafdc
      Tejun Heo 提交于
      Elevator switch tries hard to keep as much as context until new
      elevator is ready so that it can revert to the original state if
      initializing the new elevator fails for some reason.  Unfortunately,
      with more auxiliary contexts to manage, this makes elevator init and
      exit paths too complex and fragile.
      
      This patch makes elevator_switch() unregister the current elevator and
      flush icq's before start initializing the new one.  As we still keep
      the old elevator itself, the only difference is that we lose icq's on
      rare occassions of switching failure, which isn't critical at all.
      
      Note that this makes explicit elevator parameter to
      elevator_init_queue() and __elv_register_queue() unnecessary as they
      always can use the current elevator.
      
      This patch enables block cgroup cleanups.
      
      -v2: blk_add_trace_msg() prints elevator name from @new_e instead of
           @e->type as the local variable no longer exists.  This caused
           build failure on CONFIG_BLK_DEV_IO_TRACE.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5a5bafdc
    • T
      cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED · b95ada55
      Tejun Heo 提交于
      cfq has been registering zeroed blkio_poilcy_cfq if CFQ_GROUP_IOSCHED
      is disabled.  This fortunately doesn't collide with blk-throtl as
      BLKIO_POLICY_PROP is zero but is unnecessary and risky.  Just don't
      register it if not enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b95ada55
    • T
      blkcg: make CONFIG_BLK_CGROUP bool · 32e380ae
      Tejun Heo 提交于
      Block cgroup core can be built as module; however, it isn't too useful
      as blk-throttle can only be built-in and cfq-iosched is usually the
      default built-in scheduler.  Scheduled blkcg cleanup requires calling
      into blkcg from block core.  To simplify that, disallow building blkcg
      as module by making CONFIG_BLK_CGROUP bool.
      
      If building blkcg core as module really matters, which I doubt, we can
      revisit it after blkcg API cleanup.
      
      -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
           depend on BLK_CGROUP.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      32e380ae
    • T
      block: blk-throttle should be drained regardless of q->elevator · b855b04a
      Tejun Heo 提交于
      Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if
      q->elevator doesn't exist; however, bio based drivers don't have
      elevator initialized but can still use blk-throttle.  This patch moves
      q->elevator test inside blk_drain_queue() such that only
      elv_drain_elevator() is skipped if !q->elevator.
      
      -v2: loop can have registered queue which has NULL request_fn.  Make
           sure we don't call into __blk_run_queue() in such cases.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      
      Fold in bug fix from Vivek.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b855b04a
  2. 02 3月, 2012 3 次提交
    • A
      Block: use a freezable workqueue for disk-event polling · 62d3c543
      Alan Stern 提交于
      This patch (as1519) fixes a bug in the block layer's disk-events
      polling.  The polling is done by a work routine queued on the
      system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
      polling continues even in the middle of a system sleep transition.
      
      Obviously, polling a suspended drive for media changes and such isn't
      a good thing to do; in the case of USB mass-storage devices it can
      lead to real problems requiring device resets and even re-enumeration.
      
      The patch fixes things by creating a new system-wide, non-reentrant,
      freezable workqueue and using it for disk-events polling.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      CC: <stable@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      62d3c543
    • S
      block: fix __blkdev_get and add_disk race condition · 9f53d2fe
      Stanislaw Gruszka 提交于
      The following situation might occur:
      
      __blkdev_get:			add_disk:
      
      				register_disk()
      get_gendisk()
      
      disk_block_events()
      	disk->ev == NULL
      
      				disk_add_events()
      
      __disk_unblock_events()
      	disk->ev != NULL
      	--ev->block
      
      Then we unblock events, when they are suppose to be blocked. This can
      trigger events related block/genhd.c warnings, but also can crash in
      sd_check_events() or other places.
      
      I'm able to reproduce crashes with the following scripts (with
      connected usb dongle as sdb disk).
      
      <snip>
      DEV=/dev/sdb
      ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue
      
      function stop_me()
      {
      	for i in `jobs -p` ; do kill $i 2> /dev/null ; done
      	exit
      }
      
      trap stop_me SIGHUP SIGINT SIGTERM
      
      for ((i = 0; i < 10; i++)) ; do
      	while true; do fdisk -l $DEV  2>&1 > /dev/null ; done &
      done
      
      while true ; do
      echo 1 > $ENABLE
      sleep 1
      echo 0 > $ENABLE
      done
      </snip>
      
      I use the script to verify patch fixing oops in sd_revalidate_disk
      http://marc.info/?l=linux-scsi&m=132935572512352&w=2
      Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
      sd_revalidate_disk" or this one, script easily crash kernel within
      a few seconds. With both patches applied I do not observe crash.
      Unfortunately after some time (dozen of minutes), script will hung in:
      
      [ 1563.906432]  [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
      [ 1563.906437]  [<c04532d5>] msleep+0x15/0x20
      [ 1563.906443]  [<c05d60b2>] blk_drain_queue+0x32/0xd0
      [ 1563.906447]  [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
      [ 1563.906454]  [<c06d278f>] scsi_free_queue+0x3f/0x60
      [ 1563.906459]  [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
      [ 1563.906463]  [<c06d4aff>] scsi_forget_host+0x4f/0x60
      [ 1563.906468]  [<c06cd84a>] scsi_remove_host+0x5a/0xf0
      [ 1563.906482]  [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
      [ 1563.906490]  [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]
      
      Anyway I think this patch is some step forward.
      
      As drawback, I do not teardown on sysfs file create error, because I do
      not know how to nullify disk->ev (since it can be used). However add_disk
      error handling practically does not exist too, and things will work
      without this sysfs file, except events will not be exported to user
      space.
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9f53d2fe
    • J
      block: Fix NULL pointer dereference in sd_revalidate_disk · fe316bf2
      Jun'ichi Nomura 提交于
      Since 2.6.39 (1196f8b8), when a driver returns -ENOMEDIUM for open(),
      __blkdev_get() calls rescan_partitions() to remove
      in-kernel partition structures and raise KOBJ_CHANGE uevent.
      
      However it ends up calling driver's revalidate_disk without open
      and could cause oops.
      
      In the case of SCSI:
      
        process A                  process B
        ----------------------------------------------
        sys_open
          __blkdev_get
            sd_open
              returns -ENOMEDIUM
                                   scsi_remove_device
                                     <scsi_device torn down>
            rescan_partitions
              sd_revalidate_disk
                <oops>
      Oopses are reported here:
      http://marc.info/?l=linux-scsi&m=132388619710052
      
      This patch separates the partition invalidation from rescan_partitions()
      and use it for -ENOMEDIUM case.
      Reported-by: NHuajun Li <huajun.li.lee@gmail.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fe316bf2
  3. 15 2月, 2012 3 次提交
    • T
      block: exit_io_context() should call elevator_exit_icq_fn() · 621032ad
      Tejun Heo 提交于
      While updating locking, b2efa052 "block, cfq: unlink
      cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation
      from exit_io_context() to the final ioc put.  While this doesn't cause
      catastrophic failure, it effectively removes task exit notification to
      elevator and cause noticeable IO performance degradation with CFQ.
      
      On task exit, CFQ used to immediately expire the slice if it was being
      used by the exiting task as no more IO would be issued by the task;
      however, after b2efa052, the notification is lost and disk could sit
      idle needlessly, leading to noticeable IO performance degradation for
      certain workloads.
      
      This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
      elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
      from exit_io_context().  ICQ_EXITED flag is added to avoid invoking
      the callback more than once for the same icq.
      
      Walking icq_list from ioc side and invoking elevator callback requires
      reverse double locking.  This may be better implemented using RCU;
      unfortunately, using RCU isn't trivial.  e.g. RCU protection would
      need to cover request_queue and queue_lock switch on cleanup makes
      grabbing queue_lock from RCU unsafe.  Reverse double locking should
      do, at least for now.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-bisected-by: NShaohua Li <shli@kernel.org>
      LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com>
      Tested-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      621032ad
    • T
      block: simplify ioc_release_fn() · 2274b029
      Tejun Heo 提交于
      Reverse double lock dancing in ioc_release_fn() can be simplified by
      just using trylock on the queue_lock and back out from ioc lock on
      trylock failure.  Simplify it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Tested-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2274b029
    • T
      block: replace icq->changed with icq->flags · d705ae6b
      Tejun Heo 提交于
      icq->changed was used for ICQ_*_CHANGED bits.  Rename it to flags and
      access it under ioc->lock instead of using atomic bitops.
      ioc_get_changed() is added so that the changed part can be fetched and
      cleared as before.
      
      icq->flags will be used to carry other flags.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Tested-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d705ae6b
  4. 11 2月, 2012 1 次提交
    • T
      block: fix lockdep warning on io_context release put_io_context() · d8c66c5d
      Tejun Heo 提交于
      11a3122f "block: strip out locking optimization in put_io_context()"
      removed ioc_lock depth lockdep annoation along with locking
      optimization; however, while recursing from put_io_context() is no
      longer possible, ioc_release_fn() may still end up putting the last
      reference of another ioc through elevator, which wlil grab ioc->lock
      triggering spurious (as the ioc is always different one) A-A deadlock
      warning.
      
      As this can only happen one time from ioc_release_fn(), using non-zero
      subclass from ioc_release_fn() is enough.  Use subclass 1.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d8c66c5d
  5. 09 2月, 2012 1 次提交
    • S
      bsg: fix sysfs link remove warning · 37b40adf
      Stanislaw Gruszka 提交于
      We create "bsg" link if q->kobj.sd is not NULL, so remove it only
      when the same condition is true.
      
      Fixes:
      
      WARNING: at fs/sysfs/inode.c:323 sysfs_hash_and_remove+0x2b/0x77()
      sysfs: can not remove 'bsg', no directory
      Call Trace:
        [<c0429683>] warn_slowpath_common+0x6a/0x7f
        [<c0537a68>] ? sysfs_hash_and_remove+0x2b/0x77
        [<c042970b>] warn_slowpath_fmt+0x2b/0x2f
        [<c0537a68>] sysfs_hash_and_remove+0x2b/0x77
        [<c053969a>] sysfs_remove_link+0x20/0x23
        [<c05d88f1>] bsg_unregister_queue+0x40/0x6d
        [<c0692263>] __scsi_remove_device+0x31/0x9d
        [<c069149f>] scsi_forget_host+0x41/0x52
        [<c0689fa9>] scsi_remove_host+0x71/0xe0
        [<f7de5945>] quiesce_and_remove_host+0x51/0x83 [usb_storage]
        [<f7de5a1e>] usb_stor_disconnect+0x18/0x22 [usb_storage]
        [<c06c29de>] usb_unbind_interface+0x4e/0x109
        [<c067a80f>] __device_release_driver+0x6b/0xa6
        [<c067a861>] device_release_driver+0x17/0x22
        [<c067a46a>] bus_remove_device+0xd6/0xe6
        [<c06785e2>] device_del+0xf2/0x137
        [<c06c101f>] usb_disable_device+0x94/0x1a0
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      37b40adf
  6. 08 2月, 2012 2 次提交
    • T
      block: don't call elevator callbacks for plug merges · 07c2bd37
      Tejun Heo 提交于
      Plug merge calls two elevator callbacks outside queue lock -
      elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
      attempt_plug_merge() suggests that elevator is guaranteed to be there
      through the existing request on the plug list, nothing prevents plug
      merge from calling into dying or initializing elevator.
      
      For regular merges, bypass ensures elvpriv count to reach zero, which
      in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
      from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
      the requests haven't gone through elevator insertion yet, it doesn't
      have SOFTBARRIER set allowing merges on a bypassed queue.
      
      This, for example, leads to the following crash during elevator
      switch.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       PGD 112cbc067 PUD 115d5c067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU 1
       Modules linked in: deadline_iosched
      
       Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
       RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
       RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
       RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
       RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
       R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
       FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
       Stack:
        ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
        ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
        ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
       Call Trace:
        [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
        [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
        [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
        [<ffffffff81396a5a>] generic_make_request+0xca/0x100
        [<ffffffff81396b04>] submit_bio+0x74/0x100
        [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
        [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff811986b2>] do_sync_read+0xe2/0x120
        [<ffffffff81199345>] vfs_read+0xc5/0x180
        [<ffffffff81199501>] sys_read+0x51/0x90
        [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
      
      There are multiple ways to fix this including making plug merge check
      ELVPRIV; however,
      
      * Calling into elevator outside queue lock is confusing and
        error-prone.
      
      * Requests on plug list aren't known to the elevator.  They aren't on
        the elevator yet, so there's no elevator specific state to update.
      
      * Given the nature of plug merges - collecting bio's for the same
        purpose from the same issuer - elevator specific restrictions aren't
        applicable.
      
      So, simply don't call into elevator methods from plug merge by moving
      elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
      using blk_try_merge() in attempt_plug_merge().
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      
      Note that this makes per-cgroup merged stats skip plug merging.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      07c2bd37
    • T
      block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Tejun Heo 提交于
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      bool.
      
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      050c8ea8
  7. 07 2月, 2012 1 次提交
    • T
      block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo 提交于
      put_io_context() performed a complex trylock dancing to avoid
      deferring ioc release to workqueue.  It was also broken on UP because
      trylock was always assumed to succeed which resulted in unbalanced
      preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      11a3122f
  8. 06 2月, 2012 1 次提交
    • S
      block: fix ioc locking warning · 9fa73472
      Shaohua Li 提交于
      Meelis reported a warning:
      
      WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
      Hardware name: 939Dual-SATA2
      timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
      Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
      Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd1256667 #176
      Call Trace:
       <IRQ>  [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96
       [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
       [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43
       [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa
       [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
       [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec
       [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31
       [<ffffffff810145fd>] ? apic_write+0x11/0x13
       [<ffffffff81027475>] __do_softirq+0x74/0xfa
       [<ffffffff812f337a>] call_softirq+0x1a/0x30
       [<ffffffff81002ff9>] do_softirq+0x31/0x68
       [<ffffffff810276cf>] irq_exit+0x3d/0xa3
       [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77
       [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70
       <EOI>  [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
       [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
       [<ffffffff8100801f>] ? default_idle+0x1e/0x32
       [<ffffffff81008019>] ? default_idle+0x18/0x32
       [<ffffffff810008b1>] cpu_idle+0x87/0xd1
       [<ffffffff812de861>] rest_init+0x85/0x89
       [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8
       [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82
       [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7
      
      this_q == locked_q is possible. There are two problems here:
      1. In UP case, there is preemption counter issue as spin_trylock always
      successes.
      2. In SMP case, the loop breaks too earlier.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Reported-by: NKnut Petersen <Knut_Petersen@t-online.de>
      Tested-by: NKnut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9fa73472
  9. 19 1月, 2012 2 次提交
    • S
      block: fix NULL icq_cache reference · 05c30b95
      Shaohua Li 提交于
      Vivek reported a kernel crash:
      [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
      [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [   94.218004] CPU 0
      [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
      [   94.218004]
      [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
      [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
      [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
      [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
      [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
      [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
      [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
      [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
      [   94.218004] Stack:
      [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
      [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
      [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
      [   94.218004] Call Trace:
      [   94.218004]  <IRQ>
      [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
      [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
      [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
      [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
      [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
      [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
      [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
      [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
      [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
      [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
      [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
      
      Once a queue is quiesced, it's not supposed to have any elvpriv data or
      icq's, and elevator switching depends on that.  Request alloc path
      followed the rule for elvpriv data but forgot apply it to icq's
      leading to the following crash during elevator switch. Fix it by not
      allocating icq's if ELVPRIV is not set for the request.
      Reported-by: NVivek Goyal <vgoyal@redhat.com>
      Tested-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05c30b95
    • S
      block,cfq: change code order · df0793ab
      Shaohua Li 提交于
      cfq_slice_expired will change saved_workload_slice. It should be called
      first so saved_workload_slice is correctly set to 0 after workload type
      is changed.
      This fixes the code order changed by 54b466e4.
      Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      df0793ab
  10. 18 1月, 2012 1 次提交
    • J
      cfq-iosched: fix use-after-free of cfqq · 54b466e4
      Jens Axboe 提交于
      With the changes in life time management between the cfq IO contexts
      and the cfq queues, we now risk having cfqd->active_queue being
      freed when cfq_slice_expired() is being called. cfq_preempt_queue()
      caches this queue and uses it after calling said function, causing
      a use-after-free condition. This triggers the following oops,
      when cfqq_type() attempts to dereference it:
      
      BUG: unable to handle kernel paging request at ffff8800746c4f0c
      IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20
      PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      CPU 3
      Modules linked in:
      
      Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs
      RIP: 0010:[<ffffffff81266d59>]  [<ffffffff81266d59>] cfqq_type+0xb/0x20
      RSP: 0018:ffff880079c11778  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000
      RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08
      RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001
      R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0
      R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002
      FS:  00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040)
      Stack:
       ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43
       ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98
       0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149
      Call Trace:
       [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c
       [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31
       [<ffffffff81254149>] __elv_add_request+0x1ca/0x200
       [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312
       [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0
       [<ffffffff8125907b>] submit_bio+0xbf/0xca
       [<ffffffff81136ec7>] submit_bh+0xdf/0xfe
       [<ffffffff81176d04>] ext3_bread+0x50/0x99
       [<ffffffff811785b3>] dx_probe+0x38/0x291
       [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219
       [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406
       [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0
       [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191
       [<ffffffff8117a330>] ext3_lookup+0x39/0xe1
       [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c
       [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5
       [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef
       [<ffffffff8111b557>] path_lookupat+0x59/0x5ea
       [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a
       [<ffffffff8111bce0>] do_path_lookup+0x23/0x59
       [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99
       [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56
       [<ffffffff8111d02d>] user_path_at+0x11/0x13
       [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64
       [<ffffffff8111425a>] vfs_stat+0x1b/0x1d
       [<ffffffff81114359>] sys_newstat+0x1a/0x33
       [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42
       [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b
      Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3
      RIP  [<ffffffff81266d59>] cfqq_type+0xb/0x20
       RSP <ffff880079c11778>
      CR2: ffff8800746c4f0c
      
      Get rid of the caching of cfqd->active_queue, and reorder the
      check so that it happens before we expire the active queue.
      
      Thanks to Tejun for pin pointing the error location.
      Reported-by: NChris Mason <chris.mason@oracle.com>
      Tested-by: NChris Mason <chris.mason@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      54b466e4
  11. 15 1月, 2012 3 次提交
    • J
      Revert "block: recursive merge requests" · 5d381efb
      Jens Axboe 提交于
      This reverts commit 27419322.
      
      We have some problems related to selection of empty queues
      that need to be resolved, evidence so far points to the
      recursive merge logic making either being the cause or at
      least the accelerator for this. So revert it for now, until
      we figure this out.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5d381efb
    • P
      block: fail SCSI passthrough ioctls on partition devices · 0bfc96cb
      Paolo Bonzini 提交于
      Linux allows executing the SG_IO ioctl on a partition or LVM volume, and
      will pass the command to the underlying block device.  This is
      well-known, but it is also a large security problem when (via Unix
      permissions, ACLs, SELinux or a combination thereof) a program or user
      needs to be granted access only to part of the disk.
      
      This patch lets partitions forward a small set of harmless ioctls;
      others are logged with printk so that we can see which ioctls are
      actually sent.  In my tests only CDROM_GET_CAPABILITY actually occurred.
      Of course it was being sent to a (partition on a) hard disk, so it would
      have failed with ENOTTY and the patch isn't changing anything in
      practice.  Still, I'm treating it specially to avoid spamming the logs.
      
      In principle, this restriction should include programs running with
      CAP_SYS_RAWIO.  If for example I let a program access /dev/sda2 and
      /dev/sdb, it still should not be able to read/write outside the
      boundaries of /dev/sda2 independent of the capabilities.  However, for
      now programs with CAP_SYS_RAWIO will still be allowed to send the
      ioctls.  Their actions will still be logged.
      
      This patch does not affect the non-libata IDE driver.  That driver
      however already tests for bd != bd->bd_contains before issuing some
      ioctl; it could be restricted further to forbid these ioctls even for
      programs running with CAP_SYS_ADMIN/CAP_SYS_RAWIO.
      
      Cc: linux-scsi@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: James Bottomley <JBottomley@parallels.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      [ Make it also print the command name when warning - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bfc96cb
    • P
      block: add and use scsi_blk_cmd_ioctl · 577ebb37
      Paolo Bonzini 提交于
      Introduce a wrapper around scsi_cmd_ioctl that takes a block device.
      
      The function will then be enhanced to detect partition block devices
      and, in that case, subject the ioctls to whitelisting.
      
      Cc: linux-scsi@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: James Bottomley <JBottomley@parallels.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      577ebb37
  12. 11 1月, 2012 2 次提交
  13. 06 1月, 2012 1 次提交
    • L
      vfs: fix up ENOIOCTLCMD error handling · 07d106d0
      Linus Torvalds 提交于
      We're doing some odd things there, which already messes up various users
      (see the net/socket.c code that this removes), and it was going to add
      yet more crud to the block layer because of the incorrect error code
      translation.
      
      ENOIOCTLCMD is not an error return that should be returned to user mode
      from the "ioctl()" system call, but it should *not* be translated as
      EINVAL ("Invalid argument").  It should be translated as ENOTTY
      ("Inappropriate ioctl for device").
      
      That EINVAL confusion has apparently so permeated some code that the
      block layer actually checks for it, which is sad.  We continue to do so
      for now, but add a big comment about how wrong that is, and we should
      remove it entirely eventually.  In the meantime, this tries to keep the
      changes localized to just the EINVAL -> ENOTTY fix, and removing code
      that makes it harder to do the right thing.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07d106d0
  14. 04 1月, 2012 5 次提交
  15. 29 12月, 2011 1 次提交
  16. 28 12月, 2011 1 次提交
    • T
      block: remove WARN_ON_ONCE() in exit_io_context() · c98b2cc2
      Tejun Heo 提交于
      6e736be7 "block: make ioc get/put interface more conventional and fix
      race on alloction" added WARN_ON_ONCE() in exit_io_context() which
      triggers if !PF_EXITING.  All tasks hitting exit_io_context() from
      task exit should have PF_EXITING set but task struct tearing down
      after fork failure calls into the function without PF_EXITING,
      triggering the condition.
      
        WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
        Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
        Call Trace:
         [<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2
         [<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a
         [<ffffffff8181a7a2>] exit_io_context+0x40/0x92
         [<ffffffff810b58c9>] copy_process+0x126f/0x1453
         [<ffffffff810b5c1b>] do_fork+0x120/0x3e9
         [<ffffffff8106242f>] sys_clone+0x26/0x28
         [<ffffffff82425803>] stub_clone+0x13/0x20
        ---[ end trace a2e4eb670b375238 ]---
      Reported-by: NSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c98b2cc2
  17. 25 12月, 2011 1 次提交
    • T
      block: an exiting task should be allowed to create io_context · fd638368
      Tejun Heo 提交于
      While fixing io_context creation / task exit race condition,
      6e736be7 "block: make ioc get/put interface more conventional and
      fix race on alloction" also prevented an exiting (%PF_EXITING) task
      from creating its own io_context.  This is incorrect as exit path may
      issue IOs, e.g. from exit_files(), and if those IOs are the first ones
      issued by the task, io_context needs to be created to process the IOs.
      
      Combined with the existing problem of io_context / io_cq creation
      failure having the possibility of stalling IO, this problem results in
      deterministic full IO lockup with certain workloads.
      
      Fix it by allowing io_context creation regardless of %PF_EXITING for
      %current.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Reported-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fd638368
  18. 21 12月, 2011 1 次提交
  19. 19 12月, 2011 1 次提交
  20. 16 12月, 2011 4 次提交
    • S
      block, cfq: fix empty queue crash caused by request merge · 6ae0516b
      Shaohua Li 提交于
      All requests of a queue could be merged to other requests of other queue.
      Such queue will not have request in it, but it's in service tree. This
      will cause kernel oops.
      I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
      issue should exist without the patch.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6ae0516b
    • S
      block: recursive merge requests · 27419322
      Shaohua Li 提交于
      In my workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1,
      a+3,.... When the requests are flushed to queue, a and a+1 are merged
      to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3)
      aren't merged.
      With recursive merge below, the workload throughput gets improved 20%
      and context switch drops 60%.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27419322
    • S
      block, cfq: fix empty queue crash caused by request merge · 4a0b75c7
      Shaohua Li 提交于
      All requests of a queue could be merged to other requests of other queue.
      Such queue will not have request in it, but it's in service tree. This
      will cause kernel oops.
      I encounter a BUG_ON() in cfq_dispatch_request() with next patch, but the
      issue should exist without the patch.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4a0b75c7
    • T
      block: don't kick empty queue in blk_drain_queue() · 4eabc941
      Tejun Heo 提交于
      While probing, fd sets up queue, probes hardware and tears down the
      queue if probing fails.  In the process, blk_drain_queue() kicks the
      queue which failed to finish initialization and fd is unhappy about
      that.
      
        floppy0: no floppy controllers found
        ------------[ cut here ]------------
        WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
        Hardware name: To Be Filled By O.E.M.
        VFS: do_fd_request called on non-open device
        Modules linked in:
        Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2b #2
        Call Trace:
         [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
         [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
         [<ffffffff813d657f>] do_fd_request+0xbf/0xd0
         [<ffffffff81322b95>] blk_drain_queue+0x65/0x80
         [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
         [<ffffffff818a809d>] floppy_init+0xdeb/0xe28
         [<ffffffff818a72b2>] ? daring+0x6b/0x6b
         [<ffffffff810002af>] do_one_initcall+0x3f/0x170
         [<ffffffff81884b34>] kernel_init+0x9d/0x11e
         [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
         [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
         [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
         [<ffffffff815dbb10>] ? gs_change+0xb/0xb
      
      Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
      something on it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NRalf Hildebrandt <Ralf.Hildebrandt@charite.de>
      Reported-by: NWu Fengguang <fengguang.wu@intel.com>
      Tested-by: NSergei Trofimovich <slyich@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4eabc941
  21. 14 12月, 2011 1 次提交
    • T
      block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Tejun Heo 提交于
      Now block layer knows everything necessary to create and associate
      icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
      get_request() such that, if elevator_type->icq_size is set, requests
      are automatically associated with their matching icq's before
      elv_set_request().  io_context reference is also managed by block core
      on request alloc/free.
      
      * Only ioprio/cgroup changed handling remains from cfq_get_cic().
        Collapsed into cfq_set_request().
      
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect of queue kicking
        achieved was possibily accelerating queue processing, this change
        shouldn't be noticeable.
      
        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icq is unbound and thus mempool can't be the
        solution either.  This effectively adds allocation dependency on
        memory free path and thus possibility of deadlock.
      
        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has icq.
      
        However, this is still possible especially if elevator is being
        switched under high memory pressure, so we better get it fixed.
        Probably the only solution is just bypassing elevator and appending
        to dispatch queue on any elevator allocation failure.
      
      * Comment added to explain how icq's are managed and synchronized.
      
      This completes cleanup of io_context interface.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f1f8cc94