1. 06 Feb 2015: 3 commits
  2. 29 Jan 2015: 3 commits
  3. 24 Jan 2015: 2 commits
    • blk-mq: add tag allocation policy · 24391c0d
      Shaohua Li authored
      This is the blk-mq part of the tag allocation policy support. The default
      allocation policy is unchanged (though it is not a strict FIFO). The new
      policy is round-robin for libata, but it is a best-effort implementation:
      if multiple tasks are competing, the tags returned will be interleaved
      (which is unavoidable even with !mq, as requests from different tasks can
      be mixed in the queue).
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      24391c0d
    • block: support different tag allocation policy · ee1b6f7a
      Shaohua Li authored
      The libata tag allocation uses a round-robin policy. The next patch will
      make libata use the block layer's generic tag allocation, so add a policy
      argument to tag allocation.
      
      Currently there are two policies: FIFO (the default) and round-robin.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ee1b6f7a
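      
      For illustration, here is a minimal, self-contained sketch of the
      difference between the two policies added by the two tag-allocation
      commits above. The struct and helpers are hypothetical stand-ins, not the
      kernel's blk_queue_tag or blk-mq code; the only point is where the
      free-tag scan starts.
      
        #include <stdbool.h>
      
        /* Illustrative only: 'used[]' marks allocated tags. */
        struct tag_map {
                unsigned int depth;      /* number of tags */
                unsigned int next_tag;   /* round-robin cursor */
                bool used[64];           /* true = tag in use */
        };
      
        /* FIFO-like policy: always scan from the lowest tag. */
        static int alloc_tag_fifo(struct tag_map *m)
        {
                for (unsigned int t = 0; t < m->depth; t++) {
                        if (!m->used[t]) {
                                m->used[t] = true;
                                return t;
                        }
                }
                return -1;
        }
      
        /* Round-robin policy: resume the scan after the last issued tag. */
        static int alloc_tag_rr(struct tag_map *m)
        {
                for (unsigned int i = 0; i < m->depth; i++) {
                        unsigned int t = (m->next_tag + i) % m->depth;
                        if (!m->used[t]) {
                                m->used[t] = true;
                                m->next_tag = t + 1;
                                return t;
                        }
                }
                return -1;
        }
      
      A FIFO-style scan tends to reuse low tag numbers quickly, while the
      round-robin scan cycles through the whole tag space, which is the
      behaviour libata wants.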
  4. 22 Jan 2015: 3 commits
    • block: Remove annoying "unknown partition table" message · bb5c3cdd
      Boaz Harrosh authored
      As Christoph put it:
        Can we just get rid of the warnings?  It's fairly annoying as devices
        without partitions are perfectly fine and very useful.
      
      I too have been seeing this message on every VM boot, for ages, on all my
      devices, and would love to just remove it. For me a partition table is
      only needed by a booting BIOS, grub, and the like.
      
      CC: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bb5c3cdd
    • block: Add discard flag to blkdev_issue_zeroout() function · d93ba7a5
      Martin K. Petersen authored
      blkdev_issue_zeroout() will zero a given block range. This is done by
      way of explicit writing, thus provisioning or allocating the blocks on
      disk.
      
      There are use cases where the desired behavior is to zero the blocks but
      unprovision them if possible. The blocks must deterministically contain
      zeroes when they are subsequently read back.
      
      This patch adds a flag to blkdev_issue_zeroout() that provides this
      variant. If the discard flag is set and a block device guarantees
      discard_zeroes_data, we will use REQ_DISCARD to clear the block range.
      If the device does not support discard_zeroes_data, or if the discard
      request fails, we fall back first to REQ_WRITE_SAME and then to a
      regular REQ_WRITE.
      
      Also update the callers of blkdev_issue_zeroout() to reflect the new flag
      and make sb_issue_zeroout() prefer the discard approach.
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d93ba7a5
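      
      The fallback order described above can be illustrated with a short
      sketch. The helper functions are hypothetical placeholders declared here
      only so the example is self-contained (the real code builds REQ_DISCARD,
      REQ_WRITE_SAME and REQ_WRITE bios); only the decision order mirrors the
      commit message.
      
        #include <stdbool.h>
      
        /* Hypothetical stand-ins for the real request paths. */
        extern bool discard_zeroes_data(void *bdev);
        extern int issue_discard(void *bdev, unsigned long long sector,
                                 unsigned long long nr_sects);
        extern int issue_write_same_zero(void *bdev, unsigned long long sector,
                                         unsigned long long nr_sects);
        extern int issue_write_zeroes(void *bdev, unsigned long long sector,
                                      unsigned long long nr_sects);
      
        static int zeroout_sketch(void *bdev, unsigned long long sector,
                                  unsigned long long nr_sects, bool discard)
        {
                /* Unprovision via discard only if zeroing is guaranteed. */
                if (discard && discard_zeroes_data(bdev) &&
                    issue_discard(bdev, sector, nr_sects) == 0)
                        return 0;
      
                /* Next best: a single WRITE SAME command with a zero page. */
                if (issue_write_same_zero(bdev, sector, nr_sects) == 0)
                        return 0;
      
                /* Last resort: plain writes of zero-filled pages. */
                return issue_write_zeroes(bdev, sector, nr_sects);
        }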
    • cfq-iosched: fix incorrect filing of rt async cfqq · c6ce1943
      Jeff Moyer authored
      Hi,
      
      If you can manage to submit an async write as the first async I/O from
      the context of a process with realtime scheduling priority, then a
      cfq_queue is allocated, but filed into the wrong async_cfqq bucket.  It
      ends up in the best effort array, but actually has realtime I/O
      scheduling priority set in cfqq->ioprio.
      
      The reason is that cfq_get_queue assumes the default scheduling class and
      priority when there is no information present (i.e. when the async cfqq
      is created):
      
      static struct cfq_queue *
      cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
      	      struct bio *bio, gfp_t gfp_mask)
      {
      	const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
      
      cic->ioprio starts out as 0, which is "invalid".  So, class of 0
      (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:
      
      		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
      
      static struct cfq_queue **
      cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
      {
              switch (ioprio_class) {
              case IOPRIO_CLASS_RT:
                      return &cfqd->async_cfqq[0][ioprio];
              case IOPRIO_CLASS_NONE:
                      ioprio = IOPRIO_NORM;
                      /* fall through */
              case IOPRIO_CLASS_BE:
                      return &cfqd->async_cfqq[1][ioprio];
              case IOPRIO_CLASS_IDLE:
                      return &cfqd->async_idle_cfqq;
              default:
                      BUG();
              }
      }
      
      Here, instead of returning a class mapped from the process' scheduling
      priority, we get back the bucket associated with IOPRIO_CLASS_BE.
      
      Now, there is no queue allocated there yet, so we create it:
      
      		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);
      
      That function ends up doing this:
      
      			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
      			cfq_init_prio_data(cfqq, cic);
      
      cfq_init_cfqq() marks the priority as having changed.  Then,
      cfq_init_prio_data() does this:
      
      	ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	switch (ioprio_class) {
      	default:
      		printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
      	case IOPRIO_CLASS_NONE:
      		/*
      		 * no prio set, inherit CPU scheduling settings
      		 */
      		cfqq->ioprio = task_nice_ioprio(tsk);
      		cfqq->ioprio_class = task_nice_ioclass(tsk);
      		break;
      
      So we basically have two code paths that treat IOPRIO_CLASS_NONE
      differently, which results in an RT async cfqq filed into a best effort
      bucket.
      
      Attached is a patch which fixes the problem.  I'm not sure how to make
      it cleaner.  Suggestions would be welcome.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c6ce1943
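      
      One possible shape of the fix, sketched against the cfq_get_queue()
      excerpt quoted above (an illustration, not necessarily the committed
      patch): resolve IOPRIO_CLASS_NONE to the task's CPU-scheduling derived
      class before choosing the async bucket, so that cfq_async_queue_prio()
      and cfq_init_prio_data() agree.
      
        int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
        int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
      
        if (!is_sync && ioprio_class == IOPRIO_CLASS_NONE) {
                /*
                 * No explicit I/O priority set yet: derive class and level
                 * from the task's CPU scheduling settings, the same way
                 * cfq_init_prio_data() does, so an RT task lands in the RT
                 * async bucket.
                 */
                ioprio_class = task_nice_ioclass(current);
                ioprio = task_nice_ioprio(current);
        }
      
        async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);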
  5. 14 Jan 2015: 1 commit
    • blk-mq: fix false negative out-of-tags condition · 0bf36498
      Jens Axboe authored
      The blk-mq tagging tries to maintain some locality between CPUs and
      the tags issued. The tags are split into groups of words, and the
      words may not be fully populated. When searching for a new free tag,
      blk-mq may look at partial words, hence it passes an offset/size
      to find_next_zero_bit(). However, it does that incorrectly: the size must
      always be the full number of tags in that word, otherwise we will
      potentially miss some near the end.
      
      Another issue is when __bt_get() goes from one word set to the next.
      It bumps the index, but not the last_tag associated with the
      previous index. Bump that to be in the range of the new word.
      
      Finally, clean up __bt_get() and __bt_get_word() a bit and get
      rid of the goto in there, and the unnecessary 'wrap' variable.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0bf36498
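      
      A simplified sketch of the corrected per-word scan (not the actual
      __bt_get_word() code; find_next_zero_bit() and test_and_set_bit() are the
      usual kernel bitops): the offset tells find_next_zero_bit() where to
      start looking, but the size argument must remain the full depth of the
      word.
      
        static int get_tag_from_word(unsigned long *word, unsigned int depth,
                                     unsigned int last_tag)
        {
                unsigned int tag;
      
                do {
                        /*
                         * Start at last_tag, but always pass the full word
                         * depth as the size; passing a smaller size would
                         * hide free tags near the end of the word.
                         */
                        tag = find_next_zero_bit(word, depth, last_tag);
                        if (tag >= depth)
                                return -1;
                } while (test_and_set_bit(tag, word));
      
                return tag;
        }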
  6. 03 Jan 2015: 1 commit
  7. 01 Jan 2015: 1 commit
  8. 21 Dec 2014: 2 commits
  9. 15 Dec 2014: 1 commit
  10. 12 Dec 2014: 1 commit
    • bio: modify __bio_add_page() to accept pages that don't start a new segment · fcbf6a08
      Maurizio Lombardi authored
      The original behaviour is to refuse to add a new page if the maximum
      number of segments has been reached, regardless of the fact the page we
      are going to add can be merged into the last segment or not.
      
      Unfortunately, when the system runs under heavy memory fragmentation
      conditions, a driver may try to add multiple pages to the last segment.
      The original code won't accept them and EBUSY will be reported to
      userspace.
      
      This patch modifies the function so it refuses to add a page only in case
      the latter starts a new segment and the maximum number of segments has
      already been reached.
      
      The bug can be easily reproduced with the st driver:
      
      1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
      2) modprobe st buffer_kbs=1024
      3) # dd if=/dev/zero of=/dev/st0 bs=1M count=10
         dd: error writing `/dev/st0': Device or resource busy
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Jet Chen <jet.chen@intel.com>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fcbf6a08
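      
      The relaxed check can be sketched as follows. The types and helper below
      are simplified stand-ins, not the kernel's struct bio/bio_vec, and the
      real __bio_add_page() also has to honour queue limits such as the maximum
      segment size; the point is only that a page is rejected solely when it
      would start a new segment and the segment budget is already exhausted.
      
        #include <stdbool.h>
      
        struct seg { unsigned long phys; unsigned int len; };
        struct sbio {
                struct seg vecs[256];
                unsigned int vcnt;      /* pages added so far */
                unsigned int segs;      /* physical segments so far */
                unsigned int max_segs;  /* hardware segment limit */
        };
      
        /* Is the new page physically contiguous with the last one added? */
        static bool merges_into_last_segment(const struct sbio *b,
                                             unsigned long phys)
        {
                const struct seg *prev;
      
                if (!b->vcnt)
                        return false;
                prev = &b->vecs[b->vcnt - 1];
                return prev->phys + prev->len == phys;
        }
      
        static int add_page(struct sbio *b, unsigned long phys, unsigned int len)
        {
                bool merge = merges_into_last_segment(b, phys);
      
                /* Only refuse when a *new* segment would exceed the limit. */
                if (!merge && b->segs >= b->max_segs)
                        return -1;
      
                b->vecs[b->vcnt].phys = phys;
                b->vecs[b->vcnt].len = len;
                b->vcnt++;
                if (!merge)
                        b->segs++;
                return 0;
        }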
  11. 10 Dec 2014: 6 commits
    • blk-mq: Fix uninitialized kobject at CPU hotplugging · 06a41a99
      Takashi Iwai authored
      When a CPU is hotplugged, the current blk-mq spews a warning like:
      
        kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
        CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
         0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
         ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
         ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
        Call Trace:
         [<ffffffff81005306>] dump_trace+0x86/0x330
         [<ffffffff81005644>] show_stack_log_lvl+0x94/0x170
         [<ffffffff81006d21>] show_stack+0x21/0x50
         [<ffffffff81605f07>] dump_stack+0x41/0x51
         [<ffffffff8132c7a0>] kobject_add+0xa0/0xb0
         [<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0
         [<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60
         [<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190
         [<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70
         [<ffffffff8105fd23>] cpu_notify+0x23/0x50
         [<ffffffff81060037>] _cpu_up+0x157/0x170
         [<ffffffff810600d9>] cpu_up+0x89/0xb0
         [<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80
         [<ffffffff814323cd>] device_online+0x5d/0xa0
         [<ffffffff81432485>] online_store+0x75/0x80
         [<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150
         [<ffffffff811c5532>] vfs_write+0xb2/0x1f0
         [<ffffffff811c5f42>] SyS_write+0x42/0xb0
         [<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b
         [<00007f0132fb24e0>] 0x7f0132fb24e0
      
      This is indeed because of an uninitialized kobject for blk_mq_ctx.
      The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it
      loops over hctx_for_each_ctx(), i.e. it initializes them only for
      online CPUs.  Thus, when a CPU is hotplugged, the ctx for the newly
      onlined CPU is registered without initialization.
      
      This patch fixes the issue by initializing all the ctx kobjects
      belonging to each queue.
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      06a41a99
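      
      Roughly, the fix initializes the ctx kobjects for every software queue of
      the request queue up front instead of only those reachable through the
      online hardware-queue map. A sketch, assuming the queue_for_each_ctx()
      iterator and kobject type name of that era (not the exact committed
      code):
      
        static void blk_mq_sysfs_init_sketch(struct request_queue *q)
        {
                struct blk_mq_ctx *ctx;
                int i;
      
                /* Init every per-CPU ctx kobject, including offline CPUs. */
                queue_for_each_ctx(q, ctx, i)
                        kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
        }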
    • blk-mq: Use all available hardware queues · 959f5f5b
      Bart Van Assche authored
      Suppose that a system has two CPU sockets, three cores per socket,
      that it does not support hyperthreading and that four hardware
      queues are provided by a block driver. With the current algorithm
      this will lead to the following assignment of CPU cores to hardware
      queues:
      
        HWQ 0: 0 1
        HWQ 1: 2 3
        HWQ 2: 4 5
        HWQ 3: (none)
      
      This patch changes the queue assignment into:
      
        HWQ 0: 0 1
        HWQ 1: 2
        HWQ 2: 3 4
        HWQ 3: 5
      
      In other words, this patch has the following three effects:
      - All four hardware queues are used instead of only three.
      - CPU cores are spread more evenly over hardware queues. For the
        above example the range of the number of CPU cores associated
        with a single HWQ is reduced from [0..2] to [1..2].
      - If the number of HWQs is a multiple of the number of CPU sockets
        it is now guaranteed that all CPU cores associated with a single
        HWQ reside on the same CPU socket.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      959f5f5b
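      
      The stated properties can be reproduced with a small self-contained
      computation (illustrative arithmetic only, not the kernel's actual CPU
      map code; it may differ from the example above in which queues receive
      the extra CPU, but every queue is used and the per-queue counts differ by
      at most one):
      
        #include <stdio.h>
      
        int main(void)
        {
                unsigned int n_cpus = 6, n_queues = 4;
      
                for (unsigned int q = 0; q < n_queues; q++) {
                        unsigned int first = q * n_cpus / n_queues;
                        unsigned int last = (q + 1) * n_cpus / n_queues;
      
                        printf("HWQ %u:", q);
                        for (unsigned int cpu = first; cpu < last; cpu++)
                                printf(" %u", cpu);
                        printf("\n");
                }
                return 0;
        }
      
      For the six-core, four-queue example this prints one or two CPUs per
      hardware queue and leaves no queue empty.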
    • blk-mq: Micro-optimize bt_get() · 52f7eb94
      Bart Van Assche authored
      Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
      calls into a single call.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52f7eb94
    • blk-mq: Fix a race between bt_clear_tag() and bt_get() · c38d185d
      Bart Van Assche authored
      What we need is the following two guarantees:
      * Any thread that observes the effect of the test_and_set_bit() by
        __bt_get_word() also observes the preceding addition of 'current'
        to the appropriate wait list. This is guaranteed by the semantics
        of the spin_unlock() operation performed by prepare_to_wait().
        Hence the conversion of test_and_set_bit_lock() into
        test_and_set_bit().
      * The wait lists are examined by bt_clear() after the tag bit has
        been cleared. clear_bit_unlock() guarantees that any thread that
        observes that the bit has been cleared also observes the store
        operations preceding clear_bit_unlock(). However,
        clear_bit_unlock() does not prevent the wait lists from being examined
        before the tag bit is cleared. Hence the addition of a memory barrier
        between clear_bit() and the wait list examination.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c38d185d
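      
      The release side described in the second bullet can be sketched as
      follows (simplified; the wait-queue selection in the real tag-clear path
      is more involved):
      
        static void clear_tag_sketch(unsigned long *word, unsigned int tag,
                                     wait_queue_head_t *wq)
        {
                clear_bit(tag, word);
      
                /*
                 * Make the cleared bit visible before examining the wait
                 * list; this pairs with the ordering the allocation side
                 * gets from prepare_to_wait(). Without it, a waiter that
                 * just added itself could be missed.
                 */
                smp_mb__after_atomic();
      
                if (waitqueue_active(wq))
                        wake_up(wq);
        }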
    • blk-mq: Avoid that __bt_get_word() wraps multiple times · 9e98e9d7
      Bart Van Assche authored
      If __bt_get_word() is called with last_tag != 0, the following sequence
      causes a second wrap-around: the first find_next_zero_bit() fails; after
      the wrap-around, the test_and_set_bit() call fails but
      find_next_zero_bit() succeeds; the next test_and_set_bit() call fails and
      the subsequent find_next_zero_bit() does not find a zero bit. Avoid this
      by introducing an additional local variable.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9e98e9d7
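      
      The single-wrap behaviour can be sketched like this (a simplified
      stand-in for __bt_get_word(), using an extra local variable as the commit
      describes):
      
        static int get_tag_wrap_once(unsigned long *word, unsigned int depth,
                                     unsigned int last_tag)
        {
                unsigned int org_last_tag = last_tag;
                unsigned int tag;
      
                for (;;) {
                        tag = find_next_zero_bit(word, depth, last_tag);
                        if (tag >= depth) {
                                /* Nothing free at or after last_tag. */
                                if (!org_last_tag)
                                        return -1;
                                /* Wrap to the start of the word, once only. */
                                org_last_tag = 0;
                                last_tag = 0;
                                continue;
                        }
                        if (!test_and_set_bit(tag, word))
                                return tag;
                        last_tag = tag + 1;
                }
        }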
    • blk-mq: Fix a use-after-free · 45a9c9d9
      Bart Van Assche authored
      blk-mq users are allowed to free the memory request_queue.tag_set
      points at after blk_cleanup_queue() has finished but before
      blk_release_queue() has started. This can happen, e.g., in the SCSI
      core: the SCSI core embeds the tag_set structure in a SCSI host
      structure, and the SCSI host structure is freed by
      scsi_host_dev_release(). That function is called after
      blk_cleanup_queue() has finished but can be called before
      blk_release_queue().
      
      This means that it is not safe to access request_queue.tag_set from
      inside blk_release_queue(). Hence remove the blk_sync_queue() call
      from blk_release_queue(). This call is not necessary - outstanding
      requests must have finished before blk_release_queue() is
      called. Additionally, move the blk_mq_free_queue() call from
      blk_release_queue() to blk_cleanup_queue() to avoid struct
      request_queue.tag_set being accessed after it has been freed.
      
      This patch avoids that the following kernel oops can be triggered
      when deleting a SCSI host for which scsi-mq was enabled:
      
      Call Trace:
       [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
       [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
       [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
       [<ffffffff8124d654>] blk_release_queue+0x84/0xd0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff81245895>] blk_put_queue+0x15/0x20
       [<ffffffff8125c409>] disk_release+0x99/0xd0
       [<ffffffff8133d056>] device_release+0x36/0xb0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff8125a78a>] put_disk+0x1a/0x20
       [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
       [<ffffffff811d56a0>] blkdev_put+0x50/0x160
       [<ffffffff81199eb4>] kill_block_super+0x44/0x70
       [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
       [<ffffffff8119a87e>] deactivate_super+0x4e/0x70
       [<ffffffff811b9833>] cleanup_mnt+0x43/0x90
       [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
       [<ffffffff8107252c>] task_work_run+0xac/0xe0
       [<ffffffff81002c01>] do_notify_resume+0x61/0xa0
       [<ffffffff814d2c58>] int_signal+0x12/0x17
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: Jens Axboe <axboe@fb.com>
      45a9c9d9
  12. 09 Dec 2014: 1 commit
  13. 08 Dec 2014: 2 commits
  14. 04 Dec 2014: 1 commit
  15. 02 Dec 2014: 1 commit
    • block: fix regression where bio_integrity_process uses wrong bio_vec iterator · 594416a7
      Darrick J. Wong authored
      bio integrity handling is broken on a system with LVM layered atop a
      DIF/DIX SCSI drive because device mapper clones the bio, modifies the
      clone, and sends the clone to the lower layers for processing.
      However, the clone bio has bi_vcnt == 0, which means that when the sd
      driver calls bio_integrity_process to attach DIX data, the
      for_each_segment_all() call (which uses bi_vcnt) returns immediately
      and random garbage is sent to the disk on a disk write.  The disk of
      course returns an error.
      
      Therefore, teach bio_integrity_process() to use bio_for_each_segment()
      to iterate the bio_vecs, since the per-bio iterator tracks which
      bio_vecs are associated with that particular bio.  The integrity
      handling code is effectively part of the "driver" (it's not the bio
      owner), so it must use the correct iterator function.
      
      v2: Fix a compiler warning about abandoned local variables.  This
      patch supersedes "block: bio_integrity_process uses wrong bio_vec
      iterator".  Patch applies against 3.18-rc6.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      594416a7
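      
      The difference between the two iterators can be sketched as follows,
      assuming the 3.18-era bio accessors; process_segment() is a hypothetical
      placeholder for whatever the caller does with each segment:
      
        /* Walk only the bio_vecs covered by this bio's own iterator. */
        static void integrity_walk_sketch(struct bio *bio)
        {
                struct bio_vec bv;
                struct bvec_iter iter;
      
                /*
                 * bio_for_each_segment() honours bio->bi_iter, so a cloned
                 * bio with bi_vcnt == 0 is still walked correctly.
                 * bio_for_each_segment_all() walks bi_io_vec[0..bi_vcnt) and
                 * is only valid for the bio's owner.
                 */
                bio_for_each_segment(bv, bio, iter)
                        process_segment(bv.bv_page, bv.bv_offset, bv.bv_len);
        }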
  16. 01 Dec 2014: 1 commit
  17. 25 Nov 2014: 3 commits
  18. 24 Nov 2014: 2 commits
  19. 20 Nov 2014: 1 commit
  20. 18 Nov 2014: 2 commits
  21. 12 Nov 2014: 2 commits