1. Sep 11, 2019 (3 commits)
    • blk-iocost: Don't let merges push vtime into the future · e1518f63
      Tejun Heo committed
      Merges have the same problem that forced bios had, which was fixed by
      the previous patch.  The cost of a merge is calculated at the time of
      issue and force-advances vtime into the future.  Until global vtime
      catches up, how the cgroup's hweight changes in the meantime doesn't
      matter and it often leads to situations where the cost is calculated
      at one hweight and paid at a very different one.  See the previous
      patch for more details.
      
      Fix it by never advancing vtime into the future for merges.  If budget
      is available, vtime is advanced.  Otherwise, the cost is charged as
      debt.
      
      This brings merge cost handling in line with issue cost handling in
      ioc_rqos_throttle().
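      The policy described above can be sketched in a few lines; this is a
      hedged stand-alone model with hypothetical names and a simplified
      iocg, not the actual kernel code: vtime advances only while budget
      remains, and any excess is booked as debt instead of winding vtime
      past the current global time.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical, simplified iocg state; not the kernel's struct. */
struct iocg {
    u64 vtime;     /* local virtual time */
    u64 abs_vdebt; /* unpaid cost, remembered in absolute vtime */
};

/* Charge a merge's cost without ever winding vtime past vnow:
 * spend whatever budget is available, book the rest as debt. */
static void iocg_charge_merge(struct iocg *iocg, u64 vnow, u64 cost)
{
    u64 budget = vnow > iocg->vtime ? vnow - iocg->vtime : 0;

    if (cost <= budget) {
        iocg->vtime += cost;              /* budget available: pay now */
    } else {
        iocg->vtime += budget;            /* use up remaining budget */
        iocg->abs_vdebt += cost - budget; /* remainder becomes debt */
    }
}
```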
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: Account force-charged overage in absolute vtime · 36a52481
      Tejun Heo committed
      Currently, when a bio needs to be force-charged and there isn't enough
      budget, vtime is simply pushed into the future.  This means that the
      cost of the whole bio is scaled using the current hweight and then
      charged immediately.  Until the global vtime advances beyond this
      future vtime, the cgroup won't be allowed to issue normal IOs.
      
      This is incorrect and can lead to, for example, exploding vrate or
      extended stalls if vrate range is constrained.  Consider the following
      scenario.
      
      1. A cgroup with a very low hweight runs out of budget.
      
      2. A storm of swap-out happens on it.  All of them are scaled
         according to the current low hweight and charged to vtime pushing
         it to a far future.
      
      3. All other cgroups go idle and now the above cgroup has access to
         the whole device.  However, because vtime is already wound using
         the past low hweight, what its current hweight is doesn't matter
         until global vtime catches up to the local vtime.
      
      4. As a result, either vrate gets ramped up extremely or the IOs stall
         while the underlying device is idle.
      
      This is because the hweight the overage is calculated at is different
      from the hweight that it's being paid at.
      
      Fix it by remembering the overage in absolute vtime and continuously
      paying with the actual budget according to the current hweight at
      each period.
      
      Note that non-forced bios which wait already remember the cost in
      absolute vtime.  This brings forced-bio accounting in line.
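      The payment side can be sketched as follows, again with hypothetical
      names and a deliberately simplified scaling rule: debt stays in
      absolute units and is converted at the hweight in effect when it is
      paid, not the hweight at which it was incurred.

```c
#include <stdint.h>

typedef uint64_t u64;

#define HWEIGHT_WHOLE 100 /* hypothetical scale: 100 == 100% share */

/* Convert an absolute cost into a vtime cost at the *current* hweight:
 * the lower the share, the more vtime the same absolute cost consumes. */
static u64 abs_cost_to_vcost(u64 abs_cost, u64 hweight)
{
    return abs_cost * HWEIGHT_WHOLE / hweight;
}

/* Each period, pay down as much debt as the current budget allows. */
static u64 pay_debt(u64 *abs_vdebt, u64 vbudget, u64 hweight)
{
    u64 payable = vbudget * hweight / HWEIGHT_WHOLE; /* in abs units */
    u64 pay = payable < *abs_vdebt ? payable : *abs_vdebt;

    *abs_vdebt -= pay;
    return abs_cost_to_vcost(pay, hweight); /* vtime actually charged */
}
```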
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: Fix incorrect operation order during iocg free · e036c4ca
      Tejun Heo committed
      ioc_pd_free() first cancels the hrtimers and then deactivates the
      iocg.  However, the iocg timer can run in between and reschedule the
      hrtimers, which will then end up running after the iocg is freed,
      leading to crashes like the following.
      
        general protection fault: 0000 [#1] SMP
        ...
        RIP: 0010:iocg_kick_delay+0xbe/0x1b0
        RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
        RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
        RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
        R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
        FS:  0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         iocg_delay_timer_fn+0x3d/0x60
         __hrtimer_run_queues+0xfe/0x270
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x5e/0x120
         apic_timer_interrupt+0xf/0x20
         </IRQ>
      
      Fix it by canceling hrtimers after deactivating the iocg.
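      The race can be modeled in miniature (stub types and flags standing
      in for hrtimers and iocg state; not the kernel code): a timer
      callback that consults an "active" flag can no longer re-arm itself
      once the iocg is deactivated before the cancel.

```c
#include <stdbool.h>

/* Simplified model: timer_fn may re-arm the timer, but only while
 * the iocg is still active. */
struct miniocg {
    bool active;
    bool timer_armed;
};

static void timer_fn(struct miniocg *iocg)
{
    iocg->timer_armed = false;
    if (iocg->active)          /* an active iocg may reschedule */
        iocg->timer_armed = true;
}

/* Fixed order: deactivate first, then cancel; nothing can re-arm,
 * so no timer fires after the iocg is freed. */
static void iocg_free_fixed(struct miniocg *iocg)
{
    iocg->active = false;      /* deactivate first */
    iocg->timer_armed = false; /* then cancel the timer */
}
```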
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Sep 07, 2019 (3 commits)
  3. Sep 06, 2019 (10 commits)
  4. Sep 04, 2019 (7 commits)
    • paride/pcd: need to check if cd->disk is null in pcd_detect · 03754ea3
      zhengbin committed
      If alloc_disk fails in pcd_init_units, cd->disk and pi are left
      unset, so pcd_detect needs to check whether cd->disk is NULL.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • paride/pcd: need to set queue to NULL before put_disk · d821cce8
      zhengbin committed
      In pcd_init_units, if blk_mq_init_sq_queue fails, disk->queue needs
      to be set to NULL before put_disk; otherwise a null-ptr-deref read
      will occur:
      
      put_disk
        kobject_put
          disk_release
            blk_put_queue(disk->queue)
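      The pattern behind this fix can be sketched with stub types (the
      real put_disk/disk_release/blk_put_queue live in the kernel's block
      layer; the sentinel pointer below merely stands in for the stale
      value a failed init leaves behind): clearing disk->queue before
      dropping the disk keeps the release path from touching it.

```c
#include <stddef.h>
#include <stdbool.h>

/* Stub model; the real types live in the kernel's block layer. */
struct queue { int ref; };
struct disk  { struct queue *queue; };

static bool bad_deref; /* records the would-be null-ptr-deref/crash */

static void blk_put_queue(struct queue *q)
{
    if (q == (struct queue *)-1) /* stale pointer from a failed init */
        bad_deref = true;
    else if (q)
        q->ref--;
}

/* disk_release() drops the queue reference if one is set. */
static void disk_release(struct disk *disk)
{
    if (disk->queue)
        blk_put_queue(disk->queue);
}

/* Error path after a failed blk_mq_init_sq_queue():
 * clear the stale queue pointer before dropping the disk. */
static void put_disk_safe(struct disk *disk)
{
    disk->queue = NULL; /* the fix: release can no longer touch it */
    disk_release(disk);
}
```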
      
      Fixes: f0d17625 ("paride/pcd: Fix potential NULL pointer dereference and mem leak")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • paride/pf: need to set queue to NULL before put_disk · ecf4d59a
      zhengbin committed
      In pf_init_units, if blk_mq_init_sq_queue fails, disk->queue needs
      to be set to NULL before put_disk; otherwise a null-ptr-deref read
      will occur:
      
      put_disk
        kobject_put
          disk_release
            blk_put_queue(disk->queue)
      
      Fixes: 77218ddf ("paride: convert pf to blk-mq")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.4/block · c5ef62e6
      Jens Axboe committed
      Pull MD fixes from Song.
      
      * 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid5: use bio_end_sector to calculate last_sector
        md/raid1: fail run raid1 array when active disk less than one
        md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
    • md/raid5: use bio_end_sector to calculate last_sector · b0f01ecf
      Guoqing Jiang committed
      Use the common way to get last_sector.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md/raid1: fail run raid1 array when active disk less than one · 07f1a685
      Yufen Yu committed
      When running the test case:
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      mdadm fails to run, with kernel messages as follows:
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      
      In fact, when the number of active disks in a raid1 array is less
      than one, raid1_run() needs to return failure.
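      The added check amounts to something like the following sketch
      (hypothetical helper and plain arrays instead of the real mddev/conf
      structures; the actual change is in raid1_run()): count the active
      mirrors and refuse to start when none remain.

```c
#include <errno.h>

/* Hypothetical, simplified: count active (in-sync) mirrors and fail
 * the run when none remain, instead of starting with 0/N mirrors. */
static int raid1_run_check(int raid_disks, const int *active)
{
    int cnt = 0;

    for (int i = 0; i < raid_disks; i++)
        if (active[i])
            cnt++;

    if (cnt < 1)
        return -EINVAL; /* no active disk: refuse to run the array */
    return 0;
}
```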
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone · 62f7b198
      Guilherme G. Piccoli committed
      Currently md raid0/linear are not provided with any mechanism to validate
      if an array member got removed or failed. The driver keeps sending BIOs
      regardless of the state of array members, and kernel shows state 'clean'
      in the 'array_state' sysfs attribute. This leads to the following
      situation: if a raid0/linear array member is removed and the array is
      mounted, some user writing to this array won't realize that errors are
      happening unless they check dmesg or perform one fsync per written file.
      Despite udev signaling the member device is gone, 'mdadm' cannot issue the
      STOP_ARRAY ioctl successfully, given the array is mounted.
      
      In other words, no -EIO is returned and writes (except direct ones)
      appear normal.  This means the user might think the written data is
      correctly stored in the array when in fact garbage was written, given
      that raid0 does striping (and so requires all its members to be
      working in order not to corrupt data).  For md/linear, writes to the
      available members will work fine, but writes that go to the missing
      member(s) cause file corruption, since that portion of the writes is
      never actually stored.
      
      This patch changes this behavior: we check if the block device's gendisk
      is UP when submitting the BIO to the array member, and if it isn't, we flag
      the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
      request to the array requiring data from a valid member is still completed.
      While flagging the device as MD_BROKEN, we also show a rate-limited warning
      in the kernel log.
      
      A new array state 'broken' was added too: it mimics the state 'clean'
      in every aspect, being useful only to distinguish whether the array
      has a missing member.  We rely on the MD_BROKEN flag to put the array
      in the 'broken' state.  This state cannot be written to 'array_state':
      since it only indicates that one or more members are missing while
      otherwise behaving like 'clean', writing it would make no sense.
      
      With this patch, the filesystem reacts much faster to a missing array
      member: after some I/O errors, ext4 for instance aborts the journal
      and prevents corruption.  Without this change, we are able to keep
      writing to the disk, and after a machine reboot e2fsck shows severe
      fs errors that demand fixing.  This patch was tested on ext4 and xfs
      filesystems, and requires an 'mdadm' counterpart to handle the
      'broken' state.
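      The submit-path check can be sketched like this (stub types and
      illustrative names; the real patch modifies md.c and the
      raid0/linear personalities): if a member's gendisk is no longer UP,
      flag the array broken and fail the BIO, and keep failing subsequent
      I/O while the flag is set.

```c
#include <stdbool.h>

/* Stub model of the new behavior; names are illustrative only. */
enum { GENHD_FL_UP = 1 };

struct member { int flags; };
struct mddev  { bool broken; }; /* stands in for the MD_BROKEN flag */

/* Returns true if a BIO may be submitted to this member. */
static bool md_member_ok(struct mddev *md, struct member *m)
{
    if (md->broken)
        return false;          /* fail I/O once the array is broken */
    if (!(m->flags & GENHD_FL_UP)) {
        md->broken = true;     /* member gone: mark broken and fail */
        return false;
    }
    return true;
}
```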
      
      Cc: Song Liu <songliubraving@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  5. Sep 03, 2019 (7 commits)
  6. Aug 31, 2019 (2 commits)
    • writeback: don't access page->mapping directly in track_foreign_dirty TP · 0feacaa2
      Tejun Heo committed
      page->mapping may encode different values, and page_mapping() should
      always be used to access the mapping pointer.  The
      track_foreign_dirty tracepoint was incorrectly accessing
      page->mapping directly.  Use page_mapping() instead.  Also, add NULL
      checks while at it.
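      The shape of the fix can be sketched with stub types (the kernel's
      real page_mapping() also handles swap-cache pages and more; the
      low-bit tag below is only a simplified illustration of why the raw
      field must not be dereferenced): go through the accessor, which
      decodes the field, and tolerate a NULL result.

```c
#include <stddef.h>

/* Stub model: the low bit of page->mapping tags a non-file mapping,
 * mirroring why raw page->mapping must not be used directly. */
#define PAGE_MAPPING_ANON 0x1UL

struct address_space { int id; };
struct page { unsigned long mapping; };

/* Accessor: decode the field; return NULL for anon/absent mappings. */
static struct address_space *page_mapping(struct page *page)
{
    if (!page->mapping || (page->mapping & PAGE_MAPPING_ANON))
        return NULL;
    return (struct address_space *)page->mapping;
}

/* Tracepoint-style consumer: use the accessor and check for NULL. */
static int mapping_id(struct page *page)
{
    struct address_space *m = page_mapping(page);

    return m ? m->id : -1;
}
```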
      
      Fixes: 3a8e9ac8 ("writeback: add tracepoints for cgroup foreign writebacks")
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-5.4/block · 8f5914bc
      Jens Axboe committed
      Pull NVMe changes from Sagi:
      
      "The nvme updates include:
       - ana log parse fix from Anton
       - nvme quirks support for Apple devices from Ben
       - fix missing bio completion tracing for multipath stack devices from
         Hannes and Mikhail
       - IP TOS settings for nvme rdma and tcp transports from Israel
       - rq_dma_dir cleanups from Israel
       - tracing for Get LBA Status command from Minwoo
       - Some nvme-tcp cleanups from Minwoo, Potnuri and myself
       - Some consolidation between the fabrics transports for handling the CAP
         register
       - reset race with ns scanning fix for fabrics (move fabrics commands to
         a dedicated request queue with a different lifetime from the admin
         request queue)."
      
      * 'nvme-5.4' of git://git.infradead.org/nvme: (30 commits)
        nvme-rdma: Use rq_dma_dir macro
        nvme-fc: Use rq_dma_dir macro
        nvme-pci: Tidy up nvme_unmap_data
        nvme: make fabrics command run on a separate request queue
        nvme-pci: Support shared tags across queues for Apple 2018 controllers
        nvme-pci: Add support for Apple 2018+ models
        nvme-pci: Add support for variable IO SQ element size
        nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros
        nvme: trace bio completion
        nvme-multipath: fix ana log nsid lookup when nsid is not found
        nvmet-tcp: Add TOS for tcp transport
        nvme-tcp: Add TOS for tcp transport
        nvme-tcp: Use struct nvme_ctrl directly
        nvme-rdma: Add TOS for rdma transport
        nvme-fabrics: Add type of service (TOS) configuration
        nvmet-tcp: fix possible memory leak
        nvmet-tcp: fix possible NULL deref
        nvmet: trace: parse Get LBA Status command in detail
        nvme: trace: parse Get LBA Status command in detail
        nvme: trace: support for Get LBA Status opcode parsed
        ...
  7. Aug 30, 2019 (8 commits)