1. 23 9月, 2016 1 次提交
  2. 22 9月, 2016 2 次提交
  3. 21 9月, 2016 1 次提交
  4. 17 9月, 2016 5 次提交
  5. 15 9月, 2016 1 次提交
    • M
      blk-mq: introduce blk_mq_delay_kick_requeue_list() · 2849450a
      Mike Snitzer 提交于
      blk_mq_delay_kick_requeue_list() provides the ability to kick the
      q->requeue_list after a specified time.  To do this the request_queue's
      'requeue_work' member was changed to a delayed_work.
      
      blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
      requests while it doesn't make sense to immediately requeue them
      (e.g. when all paths in a DM multipath have failed).
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2849450a
  6. 14 9月, 2016 3 次提交
  7. 29 8月, 2016 3 次提交
  8. 25 8月, 2016 2 次提交
    • J
      blk-mq: improve warning for running a queue on the wrong CPU · 0e87e58b
      Jens Axboe 提交于
      __blk_mq_run_hw_queue() currently warns if we are running the queue on a
      CPU that isn't set in its mask. However, this can happen if a CPU is
      being offlined, and the workqueue handling will place the work on CPU0
      instead. Improve the warning so that it only triggers if the batch cpu
      in the hardware queue is currently online.  If it triggers for that
      case, then it's indicative of a flow problem in blk-mq, so we want to
      retain it for that case.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0e87e58b
    • J
      blk-mq: don't overwrite rq->mq_ctx · e57690fe
      Jens Axboe 提交于
      We do this in a few places, if the CPU is offline. This isn't allowed,
      though, since on multi queue hardware, we can't just move a request
      from one software queue to another, if they map to different hardware
      queues. The request and tag isn't valid on another hardware queue.
      
      This can happen if plugging races with CPU offlining. But it does
      no harm, since it can only happen in the window where we are
      currently busy freezing the queue and flushing IO, in preparation
      for redoing the software <-> hardware queue mappings.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e57690fe
  9. 24 8月, 2016 1 次提交
    • M
      block: make sure a big bio is split into at most 256 bvecs · 4d70dca4
      Ming Lei 提交于
      After arbitrary bio size was introduced, the incoming bio may
      be very big. We have to split the bio into small bios so that
      each holds at most BIO_MAX_PAGES bvecs for safety reason, such
      as bio_clone().
      
      This patch fixes the following kernel crash:
      
      > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
      > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
      > [  172.660399] Oops: 0000 [#1] SMP
      > [...]
      > [  172.664780] Call Trace:
      > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
      > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
      > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
      > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
      > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
      > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
      > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
      
      The issue can be reproduced by the following steps:
      	- create one raid1 over two virtio-blk
      	- build bcache device over the above raid1 and another cache device
      	and bucket size is set as 2Mbytes
      	- set cache mode as writeback
      	- run random write over ext4 on the bcache device
      
      Fixes: 54efd50b(block: make generic_make_request handle arbitrarily sized bios)
      Reported-by: NSebastian Roesner <sroesner-kernelorg@roesner-online.de>
      Reported-by: NEric Wheeler <bcache@lists.ewheeler.net>
      Cc: stable@vger.kernel.org (4.3+)
      Cc: Shaohua Li <shli@fb.com>
      Acked-by: NKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4d70dca4
  10. 17 8月, 2016 1 次提交
  11. 16 8月, 2016 1 次提交
  12. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  13. 05 8月, 2016 6 次提交
    • J
      blk-mq: fix deadlock in blk_mq_register_disk() error path · c0f3fd2b
      Jens Axboe 提交于
      If we fail registering any of the hardware queues, we call
      into blk_mq_unregister_disk() with the hotplug mutex already
      held. Since blk_mq_unregister_disk() attempts to acquire the
      same mutex, we end up in a less than happy place.
      Reported-by: NJinpu Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c0f3fd2b
    • D
      block: fix bdi vs gendisk lifetime mismatch · df08c32c
      Dan Williams 提交于
      The name for a bdi of a gendisk is derived from the gendisk's devt.
      However, since the gendisk is destroyed before the bdi it leaves a
      window where a new gendisk could dynamically reuse the same devt while a
      bdi with the same name is still live.  Arrange for the bdi to hold a
      reference against its "owner" disk device while it is registered.
      Otherwise we can hit sysfs duplicate name collisions like the following:
      
       WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
       sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'
      
       Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
        0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
        ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
        0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
       Call Trace:
        [<ffffffff8134caec>] dump_stack+0x63/0x87
        [<ffffffff8108c351>] __warn+0xd1/0xf0
        [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
        [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
        [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
        [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
        [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
        [<ffffffff8134ff55>] kobject_add+0x75/0xd0
        [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
        [<ffffffff8148b0a5>] device_add+0x125/0x610
        [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
        [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
        [<ffffffff811b775c>] bdi_register+0x8c/0x180
        [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
        [<ffffffff813317f5>] add_disk+0x175/0x4a0
      
      Cc: <stable@vger.kernel.org>
      Reported-by: NYi Zhang <yizhan@redhat.com>
      Tested-by: NYi Zhang <yizhan@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      
      Fixed up missing 0 return in bdi_register_owner().
      Signed-off-by: NJens Axboe <axboe@fb.com>
      df08c32c
    • G
      blk-mq: Allow timeouts to run while queue is freezing · 71f79fb3
      Gabriel Krisman Bertazi 提交于
      In case a submitted request gets stuck for some reason, the block layer
      can prevent the request starvation by starting the scheduled timeout work.
      If this stuck request occurs at the same time another thread has started
      a queue freeze, the blk_mq_timeout_work will not be able to acquire the
      queue reference and will return silently, thus not issuing the timeout.
      But since the request is already holding a q_usage_counter reference and
      is unable to complete, it will never release its reference, preventing
      the queue from completing the freeze started by first thread.  This puts
      the request_queue in a hung state, forever waiting for the freeze
      completion.
      
      This was observed while running IO to a NVMe device at the same time we
      toggled the CPU hotplug code. Eventually, once a request got stuck
      requiring a timeout during a queue freeze, we saw the CPU Hotplug
      notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
      the trace below.
      
      [c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
      [c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
      [c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
      [c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
      [c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
      [c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
      [c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
      [c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
      [c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
      [c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
      [c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
      [c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
      [c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
      [c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
      [c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
      [c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
      [c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
      [c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
      [c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
      [c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
      
      The fix is to allow the timeout work to execute in the window between
      dropping the initial refcount reference and the release of the last
      reference, which actually marks the freeze completion.  This can be
      achieved with percpu_refcount_tryget, which does not require the counter
      to be alive.  This way the timeout work can do it's job and terminate a
      stuck request even during a freeze, returning its reference and avoiding
      the deadlock.
      
      Allowing the timeout to run is just a part of the fix, since for some
      devices, we might get stuck again inside the device driver's timeout
      handler, should it attempt to allocate a new request in that path -
      which is a quite common action for Abort commands, which need to be sent
      after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
      from inside the timeout handler, which will fail during a freeze, since
      it also tries to acquire a queue reference.
      
      I considered a similar change to blk_mq_alloc_request as a generic
      solution for further device driver hangs, but we can't do that, since it
      would allow new requests to disturb the freeze process.  I thought about
      creating a new function in the block layer to support unfreezable
      requests for these occasions, but after working on it for a while, I
      feel like this should be handled in a per-driver basis.  I'm now
      experimenting with changes to the NVMe timeout path, but I'm open to
      suggestions of ways to make this generic.
      Signed-off-by: NGabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-block@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      71f79fb3
    • V
      block: fix use-after-free in seq file · 77da1605
      Vegard Nossum 提交于
      I got a KASAN report of use-after-free:
      
          ==================================================================
          BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
          Read of size 8 by task trinity-c1/315
          =============================================================================
          BUG kmalloc-32 (Not tainted): kasan: bad access detected
          -----------------------------------------------------------------------------
      
          Disabling lock debugging due to kernel taint
          INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
                  ___slab_alloc+0x4f1/0x520
                  __slab_alloc.isra.58+0x56/0x80
                  kmem_cache_alloc_trace+0x260/0x2a0
                  disk_seqf_start+0x66/0x110
                  traverse+0x176/0x860
                  seq_read+0x7e3/0x11a0
                  proc_reg_read+0xbc/0x180
                  do_loop_readv_writev+0x134/0x210
                  do_readv_writev+0x565/0x660
                  vfs_readv+0x67/0xa0
                  do_preadv+0x126/0x170
                  SyS_preadv+0xc/0x10
                  do_syscall_64+0x1a1/0x460
                  return_from_SYSCALL_64+0x0/0x6a
          INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
                  __slab_free+0x17a/0x2c0
                  kfree+0x20a/0x220
                  disk_seqf_stop+0x42/0x50
                  traverse+0x3b5/0x860
                  seq_read+0x7e3/0x11a0
                  proc_reg_read+0xbc/0x180
                  do_loop_readv_writev+0x134/0x210
                  do_readv_writev+0x565/0x660
                  vfs_readv+0x67/0xa0
                  do_preadv+0x126/0x170
                  SyS_preadv+0xc/0x10
                  do_syscall_64+0x1a1/0x460
                  return_from_SYSCALL_64+0x0/0x6a
      
          CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
           ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
           ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
          Call Trace:
           [<ffffffff81d6ce81>] dump_stack+0x65/0x84
           [<ffffffff8146c7bd>] print_trailer+0x10d/0x1a0
           [<ffffffff814704ff>] object_err+0x2f/0x40
           [<ffffffff814754d1>] kasan_report_error+0x221/0x520
           [<ffffffff8147590e>] __asan_report_load8_noabort+0x3e/0x40
           [<ffffffff83888161>] klist_iter_exit+0x61/0x70
           [<ffffffff82404389>] class_dev_iter_exit+0x9/0x10
           [<ffffffff81d2e8ea>] disk_seqf_stop+0x3a/0x50
           [<ffffffff8151f812>] seq_read+0x4b2/0x11a0
           [<ffffffff815f8fdc>] proc_reg_read+0xbc/0x180
           [<ffffffff814b24e4>] do_loop_readv_writev+0x134/0x210
           [<ffffffff814b4c45>] do_readv_writev+0x565/0x660
           [<ffffffff814b8a17>] vfs_readv+0x67/0xa0
           [<ffffffff814b8de6>] do_preadv+0x126/0x170
           [<ffffffff814b92ec>] SyS_preadv+0xc/0x10
      
      This problem can occur in the following situation:
      
      open()
       - pread()
          - .seq_start()
             - iter = kmalloc() // succeeds
             - seqf->private = iter
          - .seq_stop()
             - kfree(seqf->private)
       - pread()
          - .seq_start()
             - iter = kmalloc() // fails
          - .seq_stop()
             - class_dev_iter_exit(seqf->private) // boom! old pointer
      
      As the comment in disk_seqf_stop() says, stop is called even if start
      failed, so we need to reinitialise the private pointer to NULL when seq
      iteration stops.
      
      An alternative would be to set the private pointer to NULL when the
      kmalloc() in disk_seqf_start() fails.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NVegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      77da1605
    • P
      block: add missing group association in bio-cloning functions · 20bd723e
      Paolo Valente 提交于
      When a bio is cloned, the newly created bio must be associated with
      the same blkcg as the original bio (if BLK_CGROUP is enabled). If
      this operation is not performed, then the new bio is not associated
      with any group, and the group of the current task is returned when
      the group of the bio is requested.
      
      Depending on the cloning frequency, this may cause a large
      percentage of the bios belonging to a given group to be treated
      as if belonging to other groups (in most cases as if belonging to
      the root group). The expected group isolation may thereby be broken.
      
      This commit adds the missing association in bio-cloning functions.
      
      Fixes: da2f0f74 ("Btrfs: add support for blkio controllers")
      Cc: stable@vger.kernel.org # v4.3+
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Reviewed-by: NNikolay Borisov <kernel@kyup.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      20bd723e
    • H
      blkcg: kill unused field nr_undestroyed_grps · bfd279a8
      Hou Tao 提交于
      'nr_undestroyed_grps' in struct throtl_data was used to count
      the number of throtl_grp related with throtl_data, but now
      throtl_grp is tracked by blkcg_gq, so it is useless anymore.
      Signed-off-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bfd279a8
  14. 04 8月, 2016 1 次提交
  15. 21 7月, 2016 11 次提交