1. 10 2月, 2015 7 次提交
    • M
      dm: allocate requests in target when stacking on blk-mq devices · e5863d9a
      Mike Snitzer 提交于
      For blk-mq request-based DM the responsibility of allocating a cloned
      request is transfered from DM core to the target type.  Doing so
      enables the cloned request to be allocated from the appropriate
      blk-mq request_queue's pool (only the DM target, e.g. multipath, can
      know which block device to send a given cloned request to).
      
      Care was taken to preserve compatibility with old-style block request
      completion that requires request-based DM _not_ acquire the clone
      request's queue lock in the completion path.  As such, there are now 2
      different request-based DM target_type interfaces:
      1) the original .map_rq() interface will continue to be used for
         non-blk-mq devices -- the preallocated clone request is passed in
         from DM core.
      2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
         blk-mq devices -- blk_get_request() and blk_put_request() are used
         respectively from these hooks.
      
      dm_table_set_type() was updated to detect if the request-based target is
      being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
      DM core disallows switching the DM table's type after it is set.  This
      means that there is no mixing of non-blk-mq and blk-mq devices within
      the same request-based DM table.
      
      [This patch was started by Keith and later heavily modified by Mike]
      Tested-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      e5863d9a
    • K
      dm: prepare for allocating blk-mq clone requests in target · 466d89a6
      Keith Busch 提交于
      For blk-mq request-based DM the responsibility of allocating a cloned
      request will be transfered from DM core to the target type.
      
      To prepare for conditionally using this new model the original
      request's 'special' now points to the dm_rq_target_io because the
      clone is allocated later in the block layer rather than in DM core.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      466d89a6
    • K
      dm: submit stacked requests in irq enabled context · 2eb6e1e3
      Keith Busch 提交于
      Switch to having request-based DM enqueue all prep'ed requests into work
      processed by another thread.  This allows request-based DM to invoke
      block APIs that assume interrupt enabled context (e.g. blk_get_request)
      and is a prerequisite for adding blk-mq support to request-based DM.
      
      The new kernel thread is only initialized for request-based DM devices.
      
      multipath_map() is now always in irq enabled context so change multipath
      spinlock (m->lock) locking to always disable interrupts.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      2eb6e1e3
    • M
      dm: split request structure out from dm_rq_target_io structure · 1ae49ea2
      Mike Snitzer 提交于
      Request-based DM support for blk-mq devices requires that
      dm_rq_target_io structures not be allocated with an embedded request
      structure.  The request-based DM target (e.g. dm-multipath) must
      allocate the request from the blk-mq devices' request_queue using
      blk_get_request().
      
      The unfortunate side-effect of this change is old-style request-based DM
      support will no longer use contiguous memory for the dm_rq_target_io and
      request structures for each clone.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      1ae49ea2
    • M
      dm: remove exports for request-based interfaces without external callers · dbf9782c
      Mike Snitzer 提交于
      Remove exports for dm_dispatch_request, dm_requeue_unmapped_request,
      and dm_kill_unmapped_request.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      dbf9782c
    • M
      dm: fix multipath regression due to initializing wrong request · db507b3f
      Mike Snitzer 提交于
      Commit febf7158 ("block: require blk_rq_prep_clone() be given an
      initialized clone request") introduced a regression by calling
      blk_rq_init() on the original request rather than the clone
      request that is passed to setup_clone().
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Fixes: febf7158 ("block: require blk_rq_prep_clone() be given an initialized clone request")
      Signed-off-by: NJens Axboe <axboe@fb.com>
      db507b3f
    • K
      cfq-iosched: handle failure of cfq group allocation · 69abaffe
      Konstantin Khlebnikov 提交于
      Cfq_lookup_create_cfqg() allocates struct blkcg_gq using GFP_ATOMIC.
      In cfq_find_alloc_queue() possible allocation failure is not handled.
      As a result kernel oopses on NULL pointer dereference when
      cfq_link_cfqq_cfqg() calls cfqg_get() for NULL pointer.
      
      Bug was introduced in v3.5 in commit cd1604fa ("blkcg: factor
      out blkio_group creation"). Prior to that commit cfq group lookup
      had returned pointer to root group as fallback.
      
      This patch handles this error using existing fallback oom_cfqq.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Fixes: cd1604fa ("blkcg: factor out blkio_group creation")
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      69abaffe
  2. 06 2月, 2015 8 次提交
  3. 29 1月, 2015 4 次提交
  4. 24 1月, 2015 2 次提交
    • S
      blk-mq: add tag allocation policy · 24391c0d
      Shaohua Li 提交于
      This is the blk-mq part to support tag allocation policy. The default
      allocation policy isn't changed (though it's not a strict FIFO). The new
      policy is round-robin for libata. But it's a try-best implementation. If
      multiple tasks are competing, the tags returned will be mixed (which is
      unavoidable even with !mq, as requests from different tasks can be
      mixed in queue)
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      24391c0d
    • S
      block: support different tag allocation policy · ee1b6f7a
      Shaohua Li 提交于
      The libata tag allocation is using a round-robin policy. Next patch will
      make libata use block generic tag allocation, so let's add a policy to
      tag allocation.
      
      Currently two policies: FIFO (default) and round-robin.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ee1b6f7a
  5. 22 1月, 2015 3 次提交
    • B
      block: Remove annoying "unknown partition table" message · bb5c3cdd
      Boaz Harrosh 提交于
      As Christoph put it:
        Can we just get rid of the warnings?  It's fairly annoying as devices
        without partitions are perfectly fine and very useful.
      
      Me too I see this message every VM boot for ages on all my
      devices. Would love to just remove it. For me a partition-table
      is only needed for a booting BIOS, grub, and stuff.
      
      CC: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NBoaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bb5c3cdd
    • M
      block: Add discard flag to blkdev_issue_zeroout() function · d93ba7a5
      Martin K. Petersen 提交于
      blkdev_issue_discard() will zero a given block range. This is done by
      way of explicit writing, thus provisioning or allocating the blocks on
      disk.
      
      There are use cases where the desired behavior is to zero the blocks but
      unprovision them if possible. The blocks must deterministically contain
      zeroes when they are subsequently read back.
      
      This patch adds a flag to blkdev_issue_zeroout() that provides this
      variant. If the discard flag is set and a block device guarantees
      discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
      the device does not support discard_zeroes_data or if the discard
      request fails we will fall back to first REQ_WRITE_SAME and then a
      regular REQ_WRITE.
      
      Also update the callers of blkdev_issue_zero() to reflect the new flag
      and make sb_issue_zeroout() prefer the discard approach.
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d93ba7a5
    • J
      cfq-iosched: fix incorrect filing of rt async cfqq · c6ce1943
      Jeff Moyer 提交于
      Hi,
      
      If you can manage to submit an async write as the first async I/O from
      the context of a process with realtime scheduling priority, then a
      cfq_queue is allocated, but filed into the wrong async_cfqq bucket.  It
      ends up in the best effort array, but actually has realtime I/O
      scheduling priority set in cfqq->ioprio.
      
      The reason is that cfq_get_queue assumes the default scheduling class and
      priority when there is no information present (i.e. when the async cfqq
      is created):
      
      static struct cfq_queue *
      cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
      	      struct bio *bio, gfp_t gfp_mask)
      {
      	const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
      
      cic->ioprio starts out as 0, which is "invalid".  So, class of 0
      (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:
      
      		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
      
      static struct cfq_queue **
      cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
      {
              switch (ioprio_class) {
              case IOPRIO_CLASS_RT:
                      return &cfqd->async_cfqq[0][ioprio];
              case IOPRIO_CLASS_NONE:
                      ioprio = IOPRIO_NORM;
                      /* fall through */
              case IOPRIO_CLASS_BE:
                      return &cfqd->async_cfqq[1][ioprio];
              case IOPRIO_CLASS_IDLE:
                      return &cfqd->async_idle_cfqq;
              default:
                      BUG();
              }
      }
      
      Here, instead of returning a class mapped from the process' scheduling
      priority, we get back the bucket associated with IOPRIO_CLASS_BE.
      
      Now, there is no queue allocated there yet, so we create it:
      
      		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);
      
      That function ends up doing this:
      
      			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
      			cfq_init_prio_data(cfqq, cic);
      
      cfq_init_cfqq marks the priority as having changed.  Then, cfq_init_prio
      data does this:
      
      	ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	switch (ioprio_class) {
      	default:
      		printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
      	case IOPRIO_CLASS_NONE:
      		/*
      		 * no prio set, inherit CPU scheduling settings
      		 */
      		cfqq->ioprio = task_nice_ioprio(tsk);
      		cfqq->ioprio_class = task_nice_ioclass(tsk);
      		break;
      
      So we basically have two code paths that treat IOPRIO_CLASS_NONE
      differently, which results in an RT async cfqq filed into a best effort
      bucket.
      
      Attached is a patch which fixes the problem.  I'm not sure how to make
      it cleaner.  Suggestions would be welcome.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Tested-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c6ce1943
  6. 14 1月, 2015 2 次提交
    • J
      blk-mq: fix false negative out-of-tags condition · 0bf36498
      Jens Axboe 提交于
      The blk-mq tagging tries to maintain some locality between CPUs and
      the tags issued. The tags are split into groups of words, and the
      words may not be fully populated. When searching for a new free tag,
      blk-mq may look at partial words, hence it passes in an offset/size
      to find_next_zero_bit(). However, it does that wrong, the size must
      always be the full length of the number of tags in that word,
      otherwise we'll potentially miss some near the end.
      
      Another issue is when __bt_get() goes from one word set to the next.
      It bumps the index, but not the last_tag associated with the
      previous index. Bump that to be in the range of the new word.
      
      Finally, clean up __bt_get() and __bt_get_word() a bit and get
      rid of the goto in there, and the unnecessary 'wrap' variable.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0bf36498
    • M
      block: Change direct_access calling convention · dd22f551
      Matthew Wilcox 提交于
      In order to support accesses to larger chunks of memory, pass in a
      'size' parameter (counted in bytes), and return the amount available at
      that address.
      
      Add a new helper function, bdev_direct_access(), to handle common
      functionality including partition handling, checking the length requested
      is positive, checking for the sector being page-aligned, and checking
      the length of the request does not pass the end of the partition.
      Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NBoaz Harrosh <boaz@plexistor.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dd22f551
  7. 03 1月, 2015 3 次提交
  8. 01 1月, 2015 1 次提交
  9. 29 12月, 2014 3 次提交
  10. 28 12月, 2014 5 次提交
  11. 27 12月, 2014 2 次提交