1. 10 Feb 2016, 1 commit
    • blk-mq: dynamic h/w context count · 868f2f0b
      Keith Busch authored
      The hardware's provided queue count may change at runtime with resource
      provisioning. This patch allows a block driver to alter the number of
      h/w queues available when its resource count changes.
      
      The main part is a new blk-mq API to request a new number of h/w queues
      for a given live tag set. The new API freezes all queues using that set,
      then adjusts the allocated count prior to remapping these to CPUs.
      
      The bulk of the rest just shifts where h/w contexts and all their
      artifacts are allocated and freed.
      
      The maximum number of h/w contexts is capped to the number of possible
      CPUs, since there is no use for more than that. As such, all
      pre-allocated memory for pointers needs to account for the maximum
      possible rather than the initial number of queues.

      A side effect of this is that blk-mq will now proceed successfully as
      long as it can allocate at least one h/w context. Previously it would
      fail request queue initialization if fewer than the requested number
      were allocated.
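
      A minimal sketch of how a driver might consume the new API;
      blk_mq_update_nr_hw_queues() is the entry point this patch adds, while
      my_dev and my_dev_resize_queues() are hypothetical:

    #include <linux/blk-mq.h>

    /* Hypothetical driver context embedding its tag set. */
    struct my_dev {
            struct blk_mq_tag_set tagset;
    };

    /* Called when the device renegotiates its queue resources. */
    static void my_dev_resize_queues(struct my_dev *dev, int new_count)
    {
            /*
             * Freezes every request queue using this tag set, adjusts
             * the allocated h/w context count, then remaps the
             * contexts to CPUs.
             */
            blk_mq_update_nr_hw_queues(&dev->tagset, new_count);
    }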
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
2. 05 Feb 2016, 4 commits
    • cfq-iosched: Allow parent cgroup to preempt its child · 3984aa55
      Jan Kara authored
      Currently we don't allow the sync workload of one cgroup to preempt the
      sync workload of any other cgroup. This is because we want to achieve
      service separation between cgroups. However, in cases where the
      preempting cgroup is an ancestor of the current cgroup, there is no
      need for separation, and idling introduces unnecessary overhead. This
      hurts, for example, the case where a workload is isolated within a
      cgroup but journalling threads are in the root cgroup. A simple way to
      demonstrate the issue is using:
      
      dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1
      
      on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to
      make the difference more visible). When all processes are in the root
      cgroup, the reported throughput is 153.132 MB/sec. When the dbench
      process gets its own blkio cgroup, the reported throughput drops to
      26.1006 MB/sec.
      
      Fix the problem by making the check in cfq_should_preempt() more
      benevolent and allowing preemption by an ancestor cgroup. This improves
      the throughput reported by dbench4 to 48.9106 MB/sec.
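
      The relaxed check amounts to roughly the following sketch of the cgroup
      test in cfq_should_preempt(), where cfqg_is_descendant() is the
      ancestor test this patch introduces:

    /*
     * Cross-group preemption is still refused unless the preempted
     * queue's group descends from the preempting queue's group,
     * i.e. the preempting cgroup is an ancestor.
     */
    if (new_cfqq->cfqg != cfqq->cfqg &&
        !cfqg_is_descendant(cfqq->cfqg, new_cfqq->cfqg))
            return false;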
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Allow sync noidle workloads to preempt each other · a257ae3e
      Jan Kara authored
      The original idea with preemption of sync noidle queues (introduced in
      commit 718eee05 "cfq-iosched: fairness for sync no-idle queues") was
      that we service all sync noidle queues together: we don't idle on any
      of the queues individually, and we idle only if there is no sync noidle
      queue to be served. This intention also matches the original test:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
      	   && new_cfqq->service_tree == cfqq->service_tree)
      		return true;
      
      However, since at that time cfqq->service_tree was not set for idling
      queues, this test was unreliable, and it was replaced in commit
      e4a22919 "cfq-iosched: fix no-idle preemption logic" with:
      
      	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
      	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
      	    new_cfqq->service_tree->count == 1)
      		return true;
      
      That was a reliable test, but it actually did something different: it
      preempts a sync noidle queue only if the new queue is the only busy one
      in the service tree.
      
      These days a cfq queue is kept in the service tree even while it is
      idling, and thus the original check would be safe again. But since we
      already check that the cfq queues are in the same cgroup, of the same
      priority class, and of the same workload type (sync noidle), we know
      that new_cfqq may preempt cfqq. So just remove the service tree check.
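
      With the service tree condition dropped, the test reduces to roughly
      the following (a sketch using the field names from the snippets quoted
      above; newer trees spell some of them differently):

    /* both queues are already known to share the cgroup, ioprio
     * class and workload type at this point */
    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
        cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD)
            return true;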
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Reorder checks in cfq_should_preempt() · 6c80731c
      Jan Kara authored
      Move the check for preemption by the rt class up. There is no
      functional change, but it makes reasoning about the conditions simpler
      since we can then be sure both cfq queues are from the same ioprio
      class.
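
      The hoisted test is the classic rt-over-best-effort check, roughly:

    /* an rt queue always preempts a non-rt queue; doing this first
     * lets the later checks assume same-class queues */
    if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
            return true;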
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: Don't group_idle if cfqq has big thinktime · e795421e
      Jan Kara authored
      There is no point in idling on a cfq group if the only cfq queue in it
      has too big a thinktime.
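
      A sketch of the intent (not the literal diff) in the idle-timer path,
      assuming the existing ttime bookkeeping and the cfq_group_idle tunable:

    /*
     * Don't arm the group idle timer when the lone queue's mean
     * think time already exceeds the group idle window; the device
     * would sit needlessly idle.
     */
    if (group_idle && sample_valid(cfqq->ttime.ttime_samples) &&
        cfqq->ttime.ttime_mean > cfqd->cfq_group_idle)
            return;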
      Signed-off-by: Jan Kara <jack@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
3. 23 Jan 2016, 2 commits
4. 13 Jan 2016, 1 commit
    • block: split bios to max possible length · e36f6204
      Keith Busch authored
      This splits a bio in the middle of a vector to form the largest
      possible bio at the h/w's desired alignment, and guarantees that the
      bio being split will contain some data.
      
      The criterion for splitting is changed from the max sectors to the
      h/w's optimal sector alignment, if provided. For h/w that advertises
      its block storage's underlying chunk size, it is a big performance win
      not to submit commands that cross chunk boundaries. If no sector
      alignment is provided, this patch uses the max sectors as before.
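
      The alignment arithmetic mirrors the existing blk_max_size_offset()
      helper; a sketch, where max_io_sectors() is a hypothetical name and
      chunk_sectors is a power of two:

    static unsigned int max_io_sectors(struct request_queue *q,
                                       sector_t offset)
    {
            /* no chunk size advertised: fall back to max sectors */
            if (!q->limits.chunk_sectors)
                    return q->limits.max_sectors;

            /* sectors left before the next chunk boundary */
            return q->limits.chunk_sectors -
                    (offset & (q->limits.chunk_sectors - 1));
    }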
      
      This addresses the performance issue that commit d3805611 attempted to
      fix but which was reverted due to an error in the splitting logic.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.4.x-
      Signed-off-by: Jens Axboe <axboe@fb.com>
5. 10 Jan 2016, 6 commits
6. 09 Jan 2016, 4 commits
    • badblocks: Add core badblock management code · 9e0e252a
      Vishal Verma authored
      Take the core badblocks implementation from md and make it generally
      available. This follows the same style as the kernel implementations of
      linked lists, rb-trees, etc., where you have a structure that can be
      embedded anywhere and accessor functions to manipulate the data.
      
      The only changes in this copy of the code are ones to generalize
      function/variable names from md-specific ones. Also add init and free
      functions.
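
      A minimal usage sketch of the resulting API; struct my_disk is
      hypothetical, and the accessor signatures follow this patch's
      generalization of the md code:

    #include <linux/badblocks.h>

    struct my_disk {
            struct badblocks bb;    /* embedded, list-style */
    };

    static int my_disk_init(struct my_disk *d)
    {
            int rc = badblocks_init(&d->bb, 1);     /* 1 = enabled */

            if (rc)
                    return rc;
            /* record 8 bad sectors at LBA 4096, unacknowledged */
            return badblocks_set(&d->bb, 4096, 8, 0);
    }

    static bool my_disk_range_ok(struct my_disk *d, sector_t s, int n)
    {
            sector_t first_bad;
            int bad_sectors;

            /* 0 means clean; > 0 means the range overlaps a known bad
             * region (badblocks_exit() pairs with badblocks_init() on
             * teardown) */
            return badblocks_check(&d->bb, s, n, &first_bad,
                                   &bad_sectors) == 0;
    }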
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • block: fix del_gendisk() vs blkdev_ioctl crash · ac34f15e
      Dan Williams authored
      When tearing down a block device early in its lifetime, userspace may
      still be performing discovery actions like blkdev_ioctl() to re-read
      partitions.
      
      The nvdimm_revalidate_disk() implementation depends on
      disk->driverfs_dev being valid at entry.  However, it is set to NULL in
      del_gendisk(), and fatally this happens *before* the disk device is
      deleted from userspace's view.
      
      There's no reason for del_gendisk() to clear ->driverfs_dev.  That
      device is the parent of the disk.  It is guaranteed to not be freed
      until the disk, as a child, drops its ->parent reference.
      
      We could also fix this issue locally in nvdimm_revalidate_disk() by
      using disk_to_dev(disk)->parent, but let's fix it globally since
      ->driverfs_dev follows the lifetime of the parent.  Longer term, we
      should probably just add a @parent parameter to add_disk() and stop
      carrying this pointer in the gendisk.
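
      For reference, the rejected local fix would have been roughly a
      one-liner in nvdimm_revalidate_disk():

    /* sketch of the local alternative: resolve the parent via the
     * device model instead of trusting ->driverfs_dev */
    struct device *parent = disk_to_dev(disk)->parent;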
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
       CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
       [..]
       Call Trace:
        [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0
        [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70
        [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0
        [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40
        [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0
        [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0
        [<ffffffff812916dd>] block_ioctl+0x3d/0x50
        [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560
        [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100
        [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff81264f09>] SyS_ioctl+0x79/0x90
        [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76
      Reported-by: Robert Hu <robert.hu@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • block: enable dax for raw block devices · 5a023cdb
      Dan Williams authored
      If an application wants exclusive access to all of the persistent
      memory provided by an NVDIMM namespace, it can use this raw-block-dax
      facility to forgo establishing a filesystem.  This capability is
      targeted primarily at hypervisors wanting to provision persistent
      memory for guests.  It can be disabled / enabled dynamically via the
      new BLKDAXSET ioctl.
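
      A userspace sketch of toggling the flag, assuming BLKDAXSET takes the
      flag directly as its ioctl argument (the calling convention is defined
      by this patch, so verify it against your kernel's headers):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    static int set_raw_dax(const char *path, unsigned long on)
    {
            int fd = open(path, O_RDWR);
            int rc;

            if (fd < 0)
                    return -1;
            rc = ioctl(fd, BLKDAXSET, on);  /* 1 = enable, 0 = disable */
            close(fd);
            return rc;
    }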
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • Revert "block: Split bios on chunk boundaries" · 6126eb24
      Jens Axboe authored
      This reverts commit d3805611.
      
      If we end up splitting on the first segment, we don't adjust the sector
      count. That results in hitting a BUG() when attempting to split 0
      sectors.

      As this is just a performance issue and not a regression since the 4.3
      release, let's just revert this change. That gives us more time to test
      a real fix for 4.5, which would be marked for stable anyway.
7. 29 Dec 2015, 1 commit
8. 23 Dec 2015, 4 commits
9. 12 Dec 2015, 1 commit
10. 09 Dec 2015, 1 commit
11. 04 Dec 2015, 5 commits
12. 03 Dec 2015, 1 commit
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo authored
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B, A's processes should be moved to
      the former, and B's processes to the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly, but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated, which is the parent of the
      target csses.  The pids controller depends on the migration methods to
      move charges, and this made the controller attribute charges to the
      wrong csses, often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(),
      and updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated.  All
      controllers are updated accordingly.
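
      After the change, a controller's migration method iterates tasks paired
      with their destination css; a sketch, where my_can_attach() stands in
      for any controller's method:

    static int my_can_attach(struct cgroup_taskset *tset)
    {
            struct cgroup_subsys_state *dst_css;
            struct task_struct *task;

            /* each task comes paired with the css it actually migrates
             * into, which may differ between tasks */
            cgroup_taskset_for_each(task, dst_css, tset) {
                    /* charge against dst_css, not the css whose
                     * subtree_control was written */
            }
            return 0;
    }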
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there is a single source
        and destination and thus doesn't support the v2 hierarchy already.
        The only change made by this patchset is how that single destination
        css is obtained.
      
      * The memory migration path already does nothing on v2.  How the
        single destination css is obtained is updated, and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
13. 02 Dec 2015, 1 commit
14. 01 Dec 2015, 1 commit
15. 30 Nov 2015, 1 commit
    • block: Always check queue limits for cloned requests · bf4e6b4e
      Hannes Reinecke authored
      When a cloned request is retried on other queues, it always needs to be
      checked against the queue limits of that queue. Otherwise the
      calculations for nr_phys_segments might be wrong, leading to a crash in
      scsi_init_sgtable().
      
      To clarify this, the patch renames blk_rq_check_limits() to
      blk_cloned_rq_check_limits() and removes the symbol export, as the new
      function should only be used for cloned requests and never exported.
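
      After the rename, the one legitimate caller looks roughly like this (a
      sketch; the body of the insertion path is elided):

    int blk_insert_cloned_request(struct request_queue *q,
                                  struct request *rq)
    {
            /* a cloned request is (re)checked against the limits of
             * the queue it is actually inserted into */
            if (blk_cloned_rq_check_limits(q, rq))
                    return -EIO;

            /* ... normal insertion path continues here ... */
            return 0;
    }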
      
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Ewan Milne <emilne@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Fixes: e2a60da7 ("block: Clean up special command handling logic")
      Cc: stable@vger.kernel.org # 3.7+
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
16. 26 Nov 2015, 4 commits
    • Return EBUSY from BLKRRPART for mounted whole-dev fs · 77032ca6
      Eric Sandeen authored
      Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
      partition of sda is mounted (and will fail with EINVAL if pointed
      at a partition).  But it will pass if the entire block device is
      formatted with a filesystem and mounted.  I don't think this makes
      sense; partitioning should surely not ever change out from under
      a mounted device.
      
      So also check for bdev->bd_super, and fail that case with -EBUSY as
      well.
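
      The check lands next to the existing open-count test in the reread
      path; roughly:

    /* refuse BLKRRPART while a partition is in use or a filesystem
     * is mounted on the whole device */
    if (bdev->bd_part_count || bdev->bd_super)
            return -EBUSY;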
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block/sd: Fix device-imposed transfer length limits · ca369d51
      Martin K. Petersen authored
      Commit 4f258a46 ("sd: Fix maximum I/O size for BLOCK_PC requests")
      had the unfortunate side-effect of removing an implicit clamp to
      BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
      code. This caused problems for some SMR drives.
      
      Debugging this issue revealed a few problems with the existing
      infrastructure since the block layer didn't know how to deal with
      device-imposed limits, only limits set by the I/O controller.
      
       - Introduce a new queue limit, max_dev_sectors, which is used by the
         ULD to signal the maximum sectors for a REQ_TYPE_FS request.
      
       - Ensure that max_dev_sectors is correctly stacked and taken into
         account when overriding max_sectors through sysfs (see the sketch
         after this list).
      
       - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
         values for later processing.
      
       - In sd_revalidate() set the queue's max_dev_sectors based on the
         MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
         is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
         field size.
      
       - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
         VPD--if reported and sane--to signal the preferred device transfer
         size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.
      
       - blk_limits_max_hw_sectors() is no longer used and can be removed.
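
      A sketch of the resulting clamping in blk_queue_max_hw_sectors(),
      composing the controller cap, the device cap, and the FS default; field
      names per this patch:

    void blk_queue_max_hw_sectors(struct request_queue *q,
                                  unsigned int max_hw_sectors)
    {
            struct queue_limits *limits = &q->limits;
            unsigned int max_sectors;

            limits->max_hw_sectors = max_hw_sectors;
            /* a device-imposed cap (0 = unset) tightens the
             * controller cap ... */
            max_sectors = min_not_zero(max_hw_sectors,
                                       limits->max_dev_sectors);
            /* ... and FS requests are further bounded by the
             * block-layer default */
            max_sectors = min_t(unsigned int, max_sectors,
                                BLK_DEF_MAX_SECTORS);
            limits->max_sectors = max_sectors;
    }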
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: sweeneygj@gmx.com
      Tested-by: Arzeets <anatol.pomozov@gmail.com>
      Tested-by: David Eisner <david.eisner@oriel.oxon.org>
      Tested-by: Mario Kicherer <dev@kicherer.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • Revert "blk-flush: Queue through IO scheduler when flush not required" · d7cf931d
      Jens Axboe authored
      This reverts commit 1b2ff19e.
      
      Jan writes:
      
      --
      
      Thanks for the report! After some investigation I found out we allocate
      elevator-specific data in __get_request() only for non-flush requests.
      And this is actually required, since the flush machinery uses the space
      in struct request for something else. Doh. So my patch is just wrong
      and not easy to fix, since at the time __get_request() is called we are
      not sure whether the flush machinery will be used in the end. Jens,
      please revert 1b2ff19e. Thanks!
      
      I'm somewhat surprised that you can reliably hit the race where flushing
      gets disabled for the device just while the request is in flight. But I
      guess during boot it makes some sense.
      
      --
      
      So let's just revert it; we can fix the queue run manually after the
      fact. This race is rare enough that it didn't trigger in testing; it
      requires the specific disable-while-in-flight scenario to trigger.
    • Revert "blk-flush: Queue through IO scheduler when flush not required" · dcd8376c
      Jens Axboe authored
      This reverts commit 1b2ff19e.
      
      Jan writes:
      
      --
      
      Thanks for the report! After some investigation I found out we allocate
      elevator-specific data in __get_request() only for non-flush requests.
      And this is actually required, since the flush machinery uses the space
      in struct request for something else. Doh. So my patch is just wrong
      and not easy to fix, since at the time __get_request() is called we are
      not sure whether the flush machinery will be used in the end. Jens,
      please revert 1b2ff19e. Thanks!
      
      I'm somewhat surprised that you can reliably hit the race where flushing
      gets disabled for the device just while the request is in flight. But I
      guess during boot it makes some sense.
      
      --
      
      So let's just revert it; we can fix the queue run manually after the
      fact. This race is rare enough that it didn't trigger in testing; it
      requires the specific disable-while-in-flight scenario to trigger.
17. 25 Nov 2015, 2 commits