1. 09 Sep, 2012 5 commits
  2. 03 Aug, 2012 3 commits
    • block: Don't use static to define "void *p" in show_partition_start() · 06768067
      Jianpeng Ma committed
      I hit an odd problem: reading /proc/partitions may return zero.
      
      I wrote a file test.c:
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      
      int main(void)
      {
      	char buff[4096];
      	int ret;
      	int fd;
      	printf("pid=%d\n", (int)getpid());
      	while (1) {
      		fd = open("/proc/partitions", O_RDONLY);
      		if (fd < 0) {
      			printf("open error %s\n", strerror(errno));
      			return 1;
      		}
      		ret = read(fd, buff, 4096);
      		/* A zero or negative return is the symptom being chased. */
      		if (ret <= 0)
      			printf("ret=%d, %s, %ld\n", ret,
      				strerror(errno), (long)lseek(fd, 0, SEEK_CUR));
      		close(fd);
      	}
      }
      
      You can reproduce it by running these two concurrently:
      1: while true; do cat /proc/partitions > /dev/null; done
      2: ./test
      
      I reviewed the code and found:
      
      >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
      >> {
      >> 	static void *p;
      >>
      >> 	p = disk_seqf_start(seqf, pos);
      >> 	if (!IS_ERR_OR_NULL(p) && !*pos)
      >> 		seq_puts(seqf, "major minor  #blocks  name\n\n");
      >> 	return p;
      >> }
      		test								cat /proc/partitions
      	p = disk_seqf_start()(Not NULL)
      									p = disk_seqf_start()(NULL because pos)
      	if (!IS_ERR_OR_NULL(p) && !*pos)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
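The hazard of a `static` local in a function that several readers can enter at once can be imitated in a single thread. The sketch below is a userspace illustration only, not the kernel code: `lookup`, `preempt_hook`, `start_shared` and `start_private` are hypothetical names, and the hook stands in for a second reader being scheduled between the assignment to `p` and its use. The exact interleaving is an assumption, since the diagram in the commit message is cut off.

```c
#include <stddef.h>

static int disk0;                  /* stand-in for the only disk entry */
static void (*preempt_hook)(void); /* simulates another reader running mid-call */

/* Entry at pos, or NULL once pos is past the end of the list. */
static void *lookup(long pos)
{
	return pos == 0 ? &disk0 : NULL;
}

/* Run the simulated second reader at most once. */
static void run_hook_once(void)
{
	void (*hook)(void) = preempt_hook;

	preempt_hook = NULL;	/* the simulated reader must not re-trigger itself */
	if (hook)
		hook();
}

/* Buggy shape: "static" makes p a single slot shared by every caller. */
static void *start_shared(long pos)
{
	static void *p;

	p = lookup(pos);
	run_hook_once();	/* another reader may overwrite p here */
	return p;
}

/* Fixed shape: p lives on each caller's own stack. */
static void *start_private(long pos)
{
	void *p = lookup(pos);

	run_hook_once();
	return p;
}

/* A second reader that is already past the end of the list. */
static void other_reader_shared(void)
{
	start_shared(1);
}

static void other_reader_private(void)
{
	start_private(1);
}
```

With `preempt_hook` set to `other_reader_shared`, `start_shared(0)` returns NULL even though entry 0 exists, which would look like an empty read to the first caller. `start_private(0)` is unaffected by the same interleaving.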
    • block: Add blk_bio_map_sg() helper · 85b9f66a
      Asias He committed
      Add a helper to map a bio to a scatterlist, modelled after
      blk_rq_map_sg.
      
      This helper is useful for any driver that wants to create
      a scatterlist from its ->make_request_fn method.
      
      Changes in v2:
       - Use __blk_segment_map_sg to avoid duplicated code
       - Add docbook style function comment
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Introduce __blk_segment_map_sg() helper · 963ab9e5
      Asias He committed
      Split the mapping code in blk_rq_map_sg() to a helper
      __blk_segment_map_sg(), so that other mapping functions, e.g.
      blk_bio_map_sg(), can share the code.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 02 Aug, 2012 2 commits
    • block: split discard into aligned requests · c6e66634
      Paolo Bonzini committed
      When a disk has large discard_granularity and small max_discard_sectors,
      discards are not split with optimal alignment.  In the limit case of
      discard_granularity == max_discard_sectors, no request could be aligned
      correctly, so in fact you might end up with no discarded logical blocks
      at all.
      
      Another example that helps show the condition in the patch is with
      discard_granularity == 64, max_discard_sectors == 128.  A request that is
      submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
      However, only 2 aligned blocks out of 3 are included in the request;
      128..191 may be left intact and not discarded.  With this patch, the
      first request will be truncated to ensure good alignment of what's left,
      and the split will be 2..127, 128..255, 256..257.  The patch will also
      take into account the discard_alignment.
      
      At most one extra request will be introduced, because the first request
      will be reduced by at most granularity-1 sectors, and granularity
      must be less than max_discard_sectors.  Subsequent requests will run
      on round_down(max_discard_sectors, granularity) sectors, as in the
      current code.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
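The worked example in this commit message (granularity 64, max_discard_sectors 128, sectors 2..257) can be checked with a small userspace sketch. This is not the kernel implementation: `split_discard` and `struct discard_req` are hypothetical names, and `alignment` is assumed to be the offset within a granule on which request boundaries should land.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One discard request covering sectors [start, start + len). */
struct discard_req {
	uint64_t start;
	uint64_t len;
};

/*
 * Split [sector, sector + nr_sects) into requests of at most max_sectors,
 * ending every non-final request on a granularity boundary (shifted by
 * alignment). Returns the number of requests written to out[].
 */
static size_t split_discard(uint64_t sector, uint64_t nr_sects,
			    uint64_t granularity, uint64_t alignment,
			    uint64_t max_sectors,
			    struct discard_req *out, size_t out_cap)
{
	size_t n = 0;

	assert(granularity > 0 && alignment < granularity);
	/* Round the per-request limit down to a multiple of granularity. */
	max_sectors -= max_sectors % granularity;
	assert(max_sectors > 0);

	while (nr_sects > 0 && n < out_cap) {
		uint64_t req = nr_sects < max_sectors ? nr_sects : max_sectors;
		uint64_t end = sector + req;

		/* Not the last piece and the end is misaligned: pull it back. */
		if (req < nr_sects && end % granularity != alignment) {
			end = (end - alignment) / granularity * granularity
			      + alignment;
			req = end - sector;
		}

		out[n].start = sector;
		out[n].len = req;
		n++;

		sector = end;
		nr_sects -= req;
	}
	return n;
}
```

Calling `split_discard(2, 256, 64, 0, 128, out, 8)` yields three requests, 2..127, 128..255 and 256..257, matching the split described in the commit message. Only the first request is shortened; later ones start aligned and run for the rounded-down maximum.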
    • block: reorganize rounding of max_discard_sectors · f6ff53d3
      Paolo Bonzini committed
      Mostly a preparation for the next patch.
      
      In principle this fixes an infinite loop if max_discard_sectors < granularity,
      but that really shouldn't happen.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 01 Aug, 2012 4 commits
    • block: remove dead func declaration · 80799fbb
      Yuanhan Liu committed
      The __generic_unplug_device() function was removed by commit
      7eaceacc, which forgot to
      remove its declaration at the same time. Remove it here.
      Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal committed
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch protects hd_struct->nr_sects with a sequence counter because
      one might extend a partition while IO is happening to it, and the update
      of nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t.
      This can lead to issues like reading an inconsistent partition size. A
      sequence counter is used so that readers don't have to take the bdev
      mutex, as we call sector_in_part() very frequently.
      
      Now all access to hd_struct->nr_sects should go through the sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoretically a race should exist there too, but this patch does not
      modify set_capacity()/get_capacity() due to the sheer number of call sites
      and I am afraid that the change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t. set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Phillip Susi <psusi@ubuntu.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
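The idea of guarding a 64-bit size with a sequence counter, so that a reader on a 32-bit machine never returns a half-updated value, can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the kernel's seqcount API: `struct part_size`, `part_size_write` and `part_size_read` are hypothetical names, the value is deliberately stored as two 32-bit halves to mimic a torn update, and writers are assumed to be serialized by some outer lock.

```c
#include <stdatomic.h>
#include <stdint.h>

struct part_size {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	_Atomic uint32_t lo;	/* low half of the 64-bit sector count */
	_Atomic uint32_t hi;	/* high half of the 64-bit sector count */
};

/* Writer side. Callers are assumed to hold a lock that excludes other writers. */
static void part_size_write(struct part_size *p, uint64_t nr_sects)
{
	atomic_fetch_add(&p->seq, 1);	/* seq becomes odd */
	atomic_store(&p->lo, (uint32_t)nr_sects);
	atomic_store(&p->hi, (uint32_t)(nr_sects >> 32));
	atomic_fetch_add(&p->seq, 1);	/* seq becomes even again */
}

/* Reader side: lock-free, retries if a write overlapped the read. */
static uint64_t part_size_read(struct part_size *p)
{
	unsigned int before, after;
	uint32_t lo, hi;

	do {
		before = atomic_load(&p->seq);
		lo = atomic_load(&p->lo);
		hi = atomic_load(&p->hi);
		after = atomic_load(&p->seq);
	} while ((before & 1) || before != after);

	return ((uint64_t)hi << 32) | lo;
}
```

The sketch uses sequentially consistent atomics throughout for simplicity; a tuned version would relax the orderings. Readers pay only a few loads in the common case, which fits the commit's point that the size is read very frequently.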
    • block: uninitialized ioc->nr_tasks triggers WARN_ON · 4638a83e
      Olof Johansson committed
      Hi,
      
      I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
      below warning as of 3.5-rc1 and later (3.4 is fine):
      
      [   10.886893] ------------[ cut here ]------------
      [   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
      [   10.886905] Hardware name: Bochs
      [   10.886906] Modules linked in:
      [   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
      [   10.886908] Call Trace:
      [   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
      [   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
      [   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
      [   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
      [   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
      [   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
      [   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
      [   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
      [   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
      [   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
      [   10.886928] ---[ end trace 32a14af7ee6a590b ]---
      
      Reproducing is easy, I can hit it on a KVM system with a very basic
      config (x86_64 make defconfig + enable the drivers needed). To hit it,
      just install dump (on debian/ubuntu, not sure what the package might be
      called on Fedora), and:
      
      dump -o -f /tmp/foo /
      
      You'll see the warning in dmesg once it forks off the I/O process and
      starts dumping filesystem contents.
      
      I bisected it down to the following commit:
      
      commit f6e8d01b
      Author: Tejun Heo <tj@kernel.org>
      Date:   Mon Mar 5 13:15:26 2012 -0800
      
          block: add io_context->active_ref
      
          Currently ioc->nr_tasks is used to decide two things - whether an ioc
          is done issuing IOs and whether it's shared by multiple tasks.  This
          patch separates out the first into ioc->active_ref, which is acquired
          and released using {get|put}_io_context_active() respectively.
      
          This will be used to associate bio's with a given task.  This patch
          doesn't introduce any visible behavior change.
          Signed-off-by: Tejun Heo <tj@kernel.org>
          Cc: Vivek Goyal <vgoyal@redhat.com>
          Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      It seems like the init of ioc->nr_tasks was removed in that patch,
      so it starts out at 0 instead of 1.
      
      Tejun, is the right thing here to add back the init, or should something else
      be done?
      
      The below patch removes the warning, but I haven't done any more extensive
      testing on it.
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: do not artificially constrain max_sectors for stacking drivers · fe86cdce
      Mike Snitzer committed
      blk_set_stacking_limits is intended to allow stacking drivers to build
      up the limits of the stacked device based on the underlying devices'
      limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
      doesn't allow the stacking driver to inherit a max_sectors larger than
      1024 -- due to blk_stack_limits' use of min_not_zero.
      
      It is now clear that this artificial limit is getting in the way so
      change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
      stacking drivers like dm-multipath to inherit 'max_sectors' from the
      underlying paths).
      Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
      Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
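Why the starting value matters can be seen by folding the lower devices' limits with a min-not-zero rule. The sketch below is a userspace illustration; `min_not_zero_u` and `stack_max_sectors` are hypothetical helpers that mirror the behaviour the commit message attributes to min_not_zero and blk_stack_limits, not the kernel functions themselves.

```c
#include <limits.h>

/* Smaller of two limits, treating 0 as "no limit set". */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Fold each underlying device's max_sectors into the stacked device's limit. */
static unsigned int stack_max_sectors(unsigned int initial,
				      const unsigned int *lower, int n)
{
	unsigned int top = initial;

	for (int i = 0; i < n; i++)
		top = min_not_zero_u(top, lower[i]);
	return top;
}
```

With two paths that support 2048 and 4096 sectors, starting from 1024 leaves the stacked device at 1024, while starting from `UINT_MAX` lets it inherit 2048, the smallest limit among the underlying devices.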
  5. 31 Jul, 2012 3 commits
  6. 20 Jul, 2012 1 commit
  7. 27 Jun, 2012 1 commit
    • blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo committed
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, putting a program which can issue a lot of random direct IOs there
      and running a sequential IO from a different cgroup.  As soon as the
      request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit hairier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description updated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accommodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 25 Jun, 2012 9 commits
    • block: prepare for multiple request_lists · 5b788ce3
      Tejun Heo committed
      Request allocation is about to be made per-blkg meaning that there'll
      be multiple request lists.
      
      * Make queue full state per request_list.  blk_*queue_full() functions
        are renamed to blk_*rl_full() and take @rl instead of @q.
      
      * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
        instead of @q.  Also add @gfp_mask parameter.
      
      * Add blk_exit_rl() instead of destroying rl directly from
        blk_release_queue().
      
      * Add request_list->q and make request alloc/free functions -
        blk_free_request(), [__]freed_request(), __get_request() - take @rl
        instead of @q.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv · 8a5ecdd4
      Tejun Heo committed
      Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
      move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
      to use q->nr_rqs[] instead of q->rq.count[].
      
      These counters separate queue-wide request statistics from the
      request list and allow implementation of per-queue request allocation.
      
      While at it, properly indent fields of struct request_list.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: inline bio_blkcg() and friends · b1208b56
      Tejun Heo committed
      Make bio_blkcg() and friends inline.  They all are very simple and
      used only in a few places.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: allocate io_context upfront · 7f4b35d1
      Tejun Heo committed
      The block layer allocates ioc very lazily.  It waits until the moment
      ioc is absolutely necessary; unfortunately, that time could be inside
      queue lock and __get_request() performs unlock - try alloc - retry
      dancing.
      
      Just allocate it up-front on entry to block layer.  We're not saving
      the rain forest by deferring it to the last possible moment and
      complicating things unnecessarily.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: refactor get_request[_wait]() · a06e05e6
      Tejun Heo committed
      Currently, there are two request allocation functions - get_request()
      and get_request_wait().  The former tries to allocate a request once
      and the latter keeps retrying until it succeeds.  The latter wraps the
      former and keeps retrying until allocation succeeds.
      
      The combination of the two functions delivers fallible non-wait allocation,
      fallible wait allocation and unfailing wait allocation.  However,
      given that forward progress is guaranteed, fallible wait allocation
      isn't all that useful and in fact nobody uses it.
      
      This patch simplifies the interface as follows.
      
      * get_request() is renamed to __get_request() and is only used by the
        wrapper function.
      
      * get_request_wait() is renamed to get_request().  It now takes
        @gfp_mask and retries iff it contains %__GFP_WAIT.
      
      This patch doesn't introduce any functional change and is to prepare
      for further updates to request allocation path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: drop custom queue draining used by scsi_transport_{iscsi|fc} · 86072d81
      Tejun Heo committed
      iscsi_remove_host() uses bsg_remove_queue() which implements custom
      queue draining.  fc_bsg_remove() open-codes mostly identical logic.
      
      The draining logic isn't correct in that blk_stop_queue() doesn't
      prevent new requests from being queued - it just stops processing, so
      nothing prevents new requests from being queued after the logic determines
      that the queue is drained.
      
      blk_cleanup_queue() now implements proper queue draining and these
      custom draining logics aren't necessary.  Drop them and use
      bsg_unregister_queue() + blk_cleanup_queue() instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: James Smart <james.smart@emulex.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • mempool: add @gfp_mask to mempool_create_node() · a91a5ac6
      Tejun Heo committed
      mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
      blk_init_free_list(), is about to be updated to use other allocation
      flags - add @gfp_mask argument to the function.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: make root blkcg allocation use %GFP_KERNEL · 15974993
      Tejun Heo committed
      Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
      from __blkg_lookup_create() for root blkcg creation.  This could make
      policy fail unnecessarily.
      
      Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
      optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
      preload radix tree and preallocate blkg with %GFP_KERNEL before trying
      to create the root blkg.
      
      v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
         instead of ERR_PTR() value.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: __blkg_lookup_create() doesn't need radix preload · 13589864
      Tejun Heo committed
      There's no point in calling radix_tree_preload() if preloading doesn't
      use a more permissive GFP mask.  Drop preloading from
      __blkg_lookup_create().
      
      While at it, drop sparse locking annotation which no longer applies.
      
      v2: Vivek pointed out the odd preload usage.  Instead of updating,
          just drop it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 15 Jun, 2012 4 commits
    • scsi: Silence unnecessary warnings about ioctl to partition · 6d935928
      Jan Kara committed
      Sometimes, warnings about ioctls to partition happen often enough that they
      form the majority of the warnings in the kernel log and users complain. In some
      cases the warnings are about ioctls such as SG_IO, so it's not good to get rid of
      the warnings completely as they can ease debugging of userspace problems
      when an ioctl is refused.
      
      Since I have seen warnings from lots of commands, including some proprietary
      userspace applications, I don't think disallowing the ioctls for processes
      with CAP_SYS_RAWIO will happen in the near future, if ever. So let's just
      stop warning for processes with CAP_SYS_RAWIO, for which the ioctl is allowed.
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: James Bottomley <JBottomley@parallels.com>
      CC: linux-scsi@vger.kernel.org
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Drop dead function blk_abort_queue() · 76aaa510
      Asias He committed
      This function was only used by btrfs code in btrfs_abort_devices()
      (seemingly in a wrong way).
      
      btrfs_abort_devices() was removed in commit d07eb911, so let's remove
      the dead code to avoid any confusion.
      
      Changes in v2: update commit log, btrfs_abort_devices() was removed
      already.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-kernel@vger.kernel.org
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Mitigate lock unbalance caused by lock switching · 5e5cfac0
      Asias He committed
      Commit 777eb1bf disconnects the externally
      supplied queue_lock before blk_drain_queue(). Switching the lock would
      introduce lock unbalance because threads which have taken the external
      lock might unlock the internal lock during the queue drain. This
      patch mitigates this by disconnecting the lock after the queue draining,
      since queue draining makes a lot of request_queue users go away.
      
      However, please note that this patch only makes the problem less likely to
      happen. Anyone who still holds a ref might try to issue a new request on
      a dead queue after blk_cleanup_queue() finishes draining; the lock
      unbalance might still happen in this case.
      
       =====================================
       [ BUG: bad unlock balance detected! ]
       3.4.0+ #288 Not tainted
       -------------------------------------
       fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
       [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
       but there are no more locks to release!
      
       other info that might help us debug this:
       1 lock held by fio/17706:
        #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
       get_request_wait+0x19a/0x250
      
       stack backtrace:
       Pid: 17706, comm: fio Not tainted 3.4.0+ #288
       Call Trace:
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
        [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
        [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
        [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
        [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810e0079>] lock_release+0xd9/0x250
        [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
        [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
        [<ffffffff81328faa>] generic_make_request+0xca/0x100
        [<ffffffff81329056>] submit_bio+0x76/0xf0
        [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
        [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
        [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
        [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
        [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
        [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
        [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
        [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
        [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
        [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
        [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
        [<ffffffff811eae50>] sys_io_submit+0x10/0x20
        [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b
      
      Changes since v2: Update commit log to explain how the code is still
                        broken even if we delay the lock switching after the drain.
      Changes since v1: Update commit log as Tejun suggested.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Avoid missed wakeup in request waitqueue · 458f27a9
      Asias He committed
      After hot-unplugging a stressed disk, I found that rl->wait[] is not empty
      while rl->count[] is empty and there are threads still sleeping on
      get_request after the queue cleanup. With simple debug code, I found
      there are exactly nr_sleep - nr_wakeup threads in D state. So there
      are missed wakeups.
      
        $ dmesg | grep nr_sleep
        [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
        $ vmstat 1
        1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0
      
      To quote Tejun:
      
        Ah, okay, freed_request() wakes up single waiter with the assumption
        that after the wakeup there will at least be one successful allocation
        which in turn will continue the wakeup chain until the wait list is
        empty - ie. waiter wakeup is dependent on successful request
        allocation happening after each wakeup.  With queue marked dead, any
        woken up waiter fails the allocation path, so the wakeup chaining is
        lost and we're left with hung waiters. What we need is wake_up_all()
        after drain completion.
      
      This patch fixes the missed wakeup by waking up all the threads which
      are sleeping on the wait queue after queue drain.
      
      Changes in v2: Drop waitqueue_active() optimization
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      
      Fixed a bug by me, where stacked devices would oops on calling
      blk_drain_queue() since ->rq.wait[] does not get initialized unless
      it's a full queue setup.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 06 Jun, 2012 1 commit
  11. 04 Jun, 2012 3 commits
    • blkcg: fix blkg_alloc() failure path · 9b2ea86b
      Tejun Heo committed
      When policy data allocation fails in the middle, blkg_alloc() invokes
      blkg_free() to destroy the half constructed blkg.  This ends up
      calling pd_exit_fn() on policy data which didn't go through
      pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
      immediately after each policy data allocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
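The fix pattern described here, initializing each piece of per-policy data right after it is allocated so that the shared free path never runs an exit hook on uninitialized data, can be sketched in userspace. All names below (`struct group`, `struct policy_data`, `group_alloc`, `group_free`, `pd_init`, `pd_exit`) are hypothetical stand-ins, and `fail_at` is an illustrative knob that forces an allocation failure at a given index.

```c
#include <stdlib.h>

#define NR_POLICIES 3

struct policy_data {
	int inited;
};

struct group {
	struct policy_data *pd[NR_POLICIES];
};

static int exits_without_init;	/* exit hooks run on data that was never inited */

static void pd_init(struct policy_data *pd)
{
	pd->inited = 1;
}

static void pd_exit(struct policy_data *pd)
{
	if (!pd->inited)
		exits_without_init++;
}

/* Free a fully or partially constructed group. */
static void group_free(struct group *g)
{
	if (!g)
		return;
	for (int i = 0; i < NR_POLICIES; i++) {
		if (g->pd[i]) {
			pd_exit(g->pd[i]);
			free(g->pd[i]);
		}
	}
	free(g);
}

/* Allocate a group; fail_at >= 0 forces the allocation at that index to fail. */
static struct group *group_alloc(int fail_at)
{
	struct group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	for (int i = 0; i < NR_POLICIES; i++) {
		struct policy_data *pd =
			(i == fail_at) ? NULL : calloc(1, sizeof(*pd));

		if (!pd) {
			group_free(g);	/* only sees data that was already inited */
			return NULL;
		}
		g->pd[i] = pd;
		pd_init(pd);	/* init immediately after each allocation */
	}
	return g;
}
```

If `pd_init()` were instead deferred to a second loop after all allocations, a mid-way failure would hand `group_free()` entries that had been allocated but never initialized, which is the situation the commit fixes.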
    • block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED · ffea73fc
      Tejun Heo committed
      cfq may be built w/ or w/o blkcg support depending on
      CONFIG_CFQ_GROUP_IOSCHED.  If blkcg support is disabled, most of
      the related code is ifdef'd out but some part is left dangling -
      blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
      calls are made on it.
      
      Feeding zero filled policy to blkcg_policy_register() is incorrect and
      triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
      !CONFIG_CFQ_GROUP_IOSCHED.
      
       ------------[ cut here ]------------
       WARNING: at block/blk-cgroup.c:867
       Modules linked in:
       Modules linked in:
       CPU: 3 Not tainted 3.4.0-09547-gfb21affa #1
       Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
       Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
      	    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
       Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
      	    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
      	    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
      	    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
       Krnl Code: 00000000003d76be: a7280001           lhi     %r2,1
      	    00000000003d76c2: a7f4ffdf           brc     15,3d7680
      	   #00000000003d76c6: a7f40001           brc     15,3d76c8
      	   >00000000003d76ca: a7c8ffea           lhi     %r12,-22
      	    00000000003d76ce: a7f4ffce           brc     15,3d766a
      	    00000000003d76d2: a7f40001           brc     15,3d76d4
      	    00000000003d76d6: a7c80000           lhi     %r12,0
      	    00000000003d76da: a7f4ffc2           brc     15,3d765e
       Call Trace:
       ([<0000000000b6f000>] initcall_debug+0x0/0x4)
        [<0000000000989e8a>] cfq_init+0x62/0xd4
        [<00000000001000ba>] do_one_initcall+0x3a/0x170
        [<000000000096fb60>] kernel_init+0x214/0x2bc
        [<0000000000623202>] kernel_thread_starter+0x6/0xc
        [<00000000006231fc>] kernel_thread_starter+0x0/0xc
       no locks held by swapper/0/1.
       Last Breaking-Event-Address:
        [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
       ---[ end trace b8ef4903fcbf9dd3 ]---
      
      This patch fixes the problem by ensuring all blkcg support code is
      inside CONFIG_CFQ_GROUP_IOSCHED.
      
      * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
        inside the first CONFIG_CFQ_GROUP_IOSCHED block.  __maybe_unused is
        dropped from blkcg_policy_cfq decl.
      
      * blkcg_deactivate_policy() invocation is moved inside ifdef.  This
        also makes the activation logic match cfq_init_queue().
      
      * All blkcg_policy_[un]register() invocations are moved inside ifdef.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix return value on cfq_init() failure · fd794956
      Tejun Heo committed
      cfq_init() would return zero after kmem cache creation failure.  Fix
      so that it returns -ENOMEM.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 31 May, 2012 1 commit
  13. 23 May, 2012 1 commit
  14. 16 May, 2012 1 commit
  15. 15 May, 2012 1 commit
    • block: fix buffer overflow when printing partition UUIDs · 05c69d29
      Tejun Heo committed
      6d1d8050 "block, partition: add partition_meta_info to hd_struct"
      added part_unpack_uuid() which assumes that the passed in buffer has
      enough space for sprintfing "%pU" - 37 characters including '\0'.
      
      Unfortunately, b5af921e "init: add support for root devices
      specified by partition UUID" supplied a 33-byte buffer to the function,
      leading to the following panic with stackprotector enabled.
      
        Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
      
        [<ffffffff815e226b>] panic+0xba/0x1c6
        [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
        [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
        [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
        [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
        [<ffffffff81aee0fa>] mount_root+0x57/0x5b
        [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
        [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
        [<ffffffff81aedd60>] kernel_init+0x155/0x15a
        [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
        [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
        [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
        [<ffffffff815f4d20>] ? gs_change+0x13/0x13
      
      Increase the buffer size, remove the dangerous part_unpack_uuid() and
      use snprintf() directly from printk_all_partitions().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
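The size arithmetic behind this fix can be shown in userspace: the text form of a UUID is 32 hex digits plus 4 hyphens, so 36 characters and 37 bytes with the terminating NUL. The sketch below is an assumption-laden illustration, not the kernel code: userspace printf has no "%pU", so `format_uuid` (a hypothetical helper) spells the format out, prints the bytes in plain storage order, and uses `snprintf()` so that a too-small buffer is truncated rather than overrun.

```c
#include <stdio.h>

#define UUID_TEXT_LEN 36	/* 32 hex digits + 4 hyphens, excluding the NUL */

/*
 * Format a 16-byte UUID into buf, writing at most size bytes including
 * the NUL. Returns the length the full text needs, as snprintf() does.
 */
static int format_uuid(char *buf, size_t size, const unsigned char u[16])
{
	return snprintf(buf, size,
			"%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
			"%02x%02x%02x%02x%02x%02x",
			u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
			u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}
```

With a 33-byte buffer the call still returns 36, signalling truncation, but never writes past the buffer; an unbounded sprintf-style helper would have written 37 bytes into it. A buffer of `UUID_TEXT_LEN + 1` bytes holds the whole string.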