1. 11 Sep 2018 (1 commit)
    • dm thin metadata: try to avoid ever aborting transactions · 3ab91828
      Joe Thornber authored
      Committing a transaction can consume some metadata of its own, so we
      now reserve a small amount of metadata to cover this.  Free metadata
      reported by the kernel will not include this reserve.
      
      If any of the reserve has been used after a commit, we enter a new
      internal state, PM_OUT_OF_METADATA_SPACE.  This is reported to
      userland as PM_READ_ONLY, so no userland changes are needed (see the
      sketch after this entry).  If the metadata device is resized, the
      pool moves back to PM_WRITE.
      
      These changes mean we never need to abort and roll back a transaction
      due to running out of metadata space.  This is particularly important
      because there have been a handful of reports of data corruption against
      DM thin-provisioning that can all be attributed to the thin-pool having
      run out of metadata space.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
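      A minimal sketch (self-contained user-space C, not the kernel's code;
      the real enum pool_mode has more members than the states named in the
      message above) of how the new internal state is folded into
      PM_READ_ONLY for reporting:

        #include <stdio.h>

        /* Only the states named in the commit message are modeled here. */
        enum pool_mode {
                PM_WRITE,                 /* metadata writable */
                PM_OUT_OF_METADATA_SPACE, /* new internal state */
                PM_READ_ONLY,
        };

        /* Fold the new internal state into PM_READ_ONLY for userland,
         * so no userland changes are needed. */
        static enum pool_mode reported_mode(enum pool_mode m)
        {
                return m == PM_OUT_OF_METADATA_SPACE ? PM_READ_ONLY : m;
        }

        int main(void)
        {
                printf("%d\n", reported_mode(PM_OUT_OF_METADATA_SPACE) == PM_READ_ONLY);
                return 0;
        }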
  2. 07 Sep 2018 (6 commits)
  3. 04 Sep 2018 (1 commit)
  4. 01 Sep 2018 (3 commits)
    • md-cluster: release RESYNC lock after the last resync message · 41a95041
      Guoqing Jiang authored
      All the RESYNC messages are sent with the resync lock held; the only
      exception is resync_finish, which releases resync_lockres before
      sending the last resync message.  This should be changed as well (a
      toy sketch of the corrected ordering follows this entry).  Otherwise,
      we can hit a deadlock as follows:
      
      clustermd2-gqjiang2:~ # cat /proc/mdstat
      Personalities : [raid10] [raid1]
      md0 : active raid1 sdg[0] sdf[1]
            134144 blocks super 1.2 [2/2] [UU]
            [===================>.]  resync = 99.6% (134144/134144) finish=0.0min speed=26K/sec
            bitmap: 1/1 pages [4KB], 65536KB chunk
      
      unused devices: <none>
      clustermd2-gqjiang2:~ # ps aux|grep md|grep D
      root     20497  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
      clustermd2-gqjiang2:~ # cat /proc/20497/stack
      [<ffffffffc05ff51e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
      [<ffffffffc05ff7e8>] __sendmsg+0x98/0x130 [md_cluster]
      [<ffffffffc05ff900>] sendmsg+0x20/0x30 [md_cluster]
      [<ffffffffc05ffc35>] resync_info_update+0xb5/0xc0 [md_cluster]
      [<ffffffffc0593e84>] md_reap_sync_thread+0x134/0x170 [md_mod]
      [<ffffffffc059514c>] md_check_recovery+0x28c/0x510 [md_mod]
      [<ffffffffc060c882>] raid1d+0x42/0x800 [raid1]
      [<ffffffffc058ab61>] md_thread+0x121/0x150 [md_mod]
      [<ffffffff9a0a5b3f>] kthread+0xff/0x140
      [<ffffffff9a800235>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff
      
      clustermd-gqjiang1:~ # ps aux|grep md|grep D
      root     20531  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_raid1]
      root     20537  0.0  0.0      0     0 ?        D    16:00   0:00 [md0_cluster_rec]
      root     20676  0.0  0.0      0     0 ?        D    16:01   0:00 [md0_resync]
      clustermd-gqjiang1:~ # cat /proc/mdstat
      Personalities : [raid10] [raid1]
      md0 : active raid1 sdf[1] sdg[0]
            134144 blocks super 1.2 [2/2] [UU]
            [===================>.]  resync = 97.3% (131072/134144) finish=8076.8min speed=0K/sec
            bitmap: 1/1 pages [4KB], 65536KB chunk
      
      unused devices: <none>
      clustermd-gqjiang1:~ # cat /proc/20531/stack
      [<ffffffffc080974d>] metadata_update_start+0xcd/0xd0 [md_cluster]
      [<ffffffffc079c897>] md_update_sb.part.61+0x97/0x820 [md_mod]
      [<ffffffffc079f15b>] md_check_recovery+0x29b/0x510 [md_mod]
      [<ffffffffc0816882>] raid1d+0x42/0x800 [raid1]
      [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
      [<ffffffff9e0a5b3f>] kthread+0xff/0x140
      [<ffffffff9e800235>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff
      clustermd-gqjiang1:~ # cat /proc/20537/stack
      [<ffffffffc0813222>] freeze_array+0xf2/0x140 [raid1]
      [<ffffffffc080a56e>] recv_daemon+0x41e/0x580 [md_cluster]
      [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
      [<ffffffff9e0a5b3f>] kthread+0xff/0x140
      [<ffffffff9e800235>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff
      clustermd-gqjiang1:~ # cat /proc/20676/stack
      [<ffffffffc080951e>] dlm_lock_sync+0x8e/0xc0 [md_cluster]
      [<ffffffffc080957f>] lock_token+0x2f/0xa0 [md_cluster]
      [<ffffffffc0809622>] lock_comm+0x32/0x90 [md_cluster]
      [<ffffffffc08098f5>] sendmsg+0x15/0x30 [md_cluster]
      [<ffffffffc0809c0a>] resync_info_update+0x8a/0xc0 [md_cluster]
      [<ffffffffc08130ba>] raid1_sync_request+0xa9a/0xb10 [raid1]
      [<ffffffffc079b8ea>] md_do_sync+0xbaa/0xf90 [md_mod]
      [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod]
      [<ffffffff9e0a5b3f>] kthread+0xff/0x140
      [<ffffffff9e800235>] ret_from_fork+0x35/0x40
      [<ffffffffffffffff>] 0xffffffffffffffff
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
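      A toy sketch (self-contained user-space C with pthreads; all names
      are invented stand-ins, not md-cluster code) of the corrected
      ordering: the final message is sent while the resync lock is still
      held, like every other RESYNC message, and the lock is released only
      afterwards:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t resync_lock = PTHREAD_MUTEX_INITIALIZER;

        static void send_resync_msg(const char *msg)
        {
                printf("sent under lock: %s\n", msg);
        }

        static void resync_finish_fixed(void)
        {
                pthread_mutex_lock(&resync_lock);
                send_resync_msg("resync finished"); /* last message, lock held */
                pthread_mutex_unlock(&resync_lock); /* release only afterwards */
        }

        int main(void)
        {
                resync_finish_fixed();
                return 0;
        }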
    • RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0 · 1d0ffd26
      Xiao Ni authored
      In raid10's reshape_request, max_sectors is obtained via read_balance.
      If the underlying disks have bad blocks, max_sectors can be less than
      last, so the code takes the goto read_more path many times.  Each
      iteration calls raise_barrier(conf, sectors_done != 0), and at that
      point sectors_done is not 0, so the value passed to raise_barrier's
      force argument is true.

      When force is true, raise_barrier checks conf->barrier; if force is
      true and conf->barrier is 0, it hits a BUG_ON.  In this case
      reshape_request submits bios to the underlying disks, and the bio
      completion callback calls lower_barrier.  If a bio finishes before
      raise_barrier is called again, conf->barrier drops to 0 and the
      BUG_ON triggers.

      Add one extra raise_barrier/lower_barrier pair around the whole
      sequence to fix this bug (a toy model follows this entry).
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Suggested-by: Neil Brown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
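      A toy model (self-contained user-space C; the counter and assert are
      simplified stand-ins for the real barrier logic) of why the extra
      outer raise_barrier/lower_barrier pair keeps the count from reaching
      0 while bio completions are still running:

        #include <assert.h>
        #include <stdio.h>

        static int barrier;

        static void raise_barrier(int force)
        {
                if (force)
                        assert(barrier != 0); /* the BUG_ON in the real code */
                barrier++;
        }

        static void lower_barrier(void)
        {
                barrier--;
        }

        int main(void)
        {
                raise_barrier(0);                    /* the added outer pair */
                for (int chunk = 0; chunk < 3; chunk++) {
                        raise_barrier(chunk != 0);   /* goto read_more iterations */
                        lower_barrier();             /* bio completion may run early */
                }
                lower_barrier();                     /* close the outer pair */
                printf("barrier=%d\n", barrier);
                return 0;
        }

      Without the initial raise_barrier(0), the second loop iteration calls
      raise_barrier with force true while the count is already 0, which is
      exactly the BUG_ON this patch avoids.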
    • md/raid5-cache: disable reshape completely · e254de6b
      Shaohua Li authored
      We don't support reshape yet if an array supports a log device.
      Previously we determined this by checking ->log.  However, ->log can
      be NULL after a log device is removed while the array is still marked
      as supporting a log device, so don't allow reshape in that case
      either (see the sketch after this entry).  Users can disable log
      device support by setting 'consistency_policy' to 'resync' and then
      reshape.
      Reported-by: Xiao Ni <xni@redhat.com>
      Tested-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
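      A sketch (self-contained user-space C; the struct, field, and helper
      names are invented for illustration, not raid5-cache's actual types)
      of the stricter check described above:

        #include <stdio.h>

        struct conf_sketch {
                void *log;         /* NULL after the log device is removed */
                int   has_journal; /* array still marked to support a log */
        };

        /* Refuse reshape when a log device is present OR the array is
         * still marked as supporting one. */
        static int reshape_allowed(const struct conf_sketch *conf)
        {
                return conf->log == NULL && !conf->has_journal;
        }

        int main(void)
        {
                struct conf_sketch removed_log = { .log = NULL, .has_journal = 1 };
                printf("%d\n", reshape_allowed(&removed_log)); /* 0: still refused */
                return 0;
        }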
  5. 23 Aug 2018 (2 commits)
  6. 17 Aug 2018 (1 commit)
  7. 14 Aug 2018 (1 commit)
  8. 12 Aug 2018 (17 commits)
  9. 11 Aug 2018 (1 commit)
    • bcache: fix error setting writeback_rate through sysfs interface · 46451874
      Coly Li authored
      Commit ea8c5356 ("bcache: set max writeback rate when I/O request
      is idle") changes struct bch_ratelimit member rate from uint32_t to
      atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
      to set the new writeback rate, after the input is converted from the
      memory buffer to a long int by sysfs_strtoul_clamp().
      
      The above change has a problem: there is an implicit return inside
      sysfs_strtoul_clamp(), so the following atomic_long_set() is never
      called (a toy demonstration follows this entry).  The error was
      detected by the 0day system with the following snipped smatch
      warning:
      
      drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
      symbol 'v'.
      270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      @271 atomic_long_set(&dc->writeback_rate.rate, v);
      
      This patch fixes the above error by using strtoul_safe_clamp() to
      convert the input buffer into a long int result.
      
      Fixes: ea8c5356 ("bcache: set max writeback rate when I/O request is idle")
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
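      A toy demonstration (self-contained user-space C; the macro body is
      a simplified stand-in for the kernel's, which returns early once the
      matching attribute is handled) of why an implicit return inside a
      helper macro silently skips the statement after it:

        #include <limits.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Simplified stand-in: parse, clamp, then return -- so any code
         * placed after the macro invocation never runs. */
        #define sysfs_strtoul_clamp(buf, var, min, max)      \
        do {                                                 \
                var = strtol(buf, NULL, 10);                 \
                if (var < (min)) var = (min);                \
                if (var > (max)) var = (max);                \
                return; /* the implicit return */            \
        } while (0)

        static long rate; /* stays 0: the assignment below is skipped */

        static void store_writeback_rate(const char *buf)
        {
                long v;
                sysfs_strtoul_clamp(buf, v, 1, INT_MAX);
                rate = v; /* never reached -- the reported bug */
        }

        int main(void)
        {
                store_writeback_rate("12345");
                printf("rate=%ld\n", rate); /* prints rate=0 */
                return 0;
        }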
  10. 10 Aug 2018 (1 commit)
    • dm cache metadata: set dirty on all cache blocks after a crash · 5b1fe7be
      Ilya Dryomov authored
      Quoting Documentation/device-mapper/cache.txt:
      
        The 'dirty' state for a cache block changes far too frequently for us
        to keep updating it on the fly.  So we treat it as a hint.  In normal
        operation it will be written when the dm device is suspended.  If the
        system crashes all cache blocks will be assumed dirty when restarted.
      
      This got broken in commit f177940a ("dm cache metadata: switch to
      using the new cursor api for loading metadata") in 4.9, which removed
      the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
      flag) when loading cache blocks.  This results in data corruption on an
      unclean shutdown with dirty cache blocks on the fast device.  After the
      crash those blocks are considered clean and may get evicted from the
      cache at any time.  This can be demonstrated by doing a lot of reads
      to trigger individual evictions, but uncache is more predictable:
      
        ### Disable auto-activation in lvm.conf to be able to do uncache in
        ### time (i.e. see uncache doing flushing) when the fix is applied.
      
        # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
        # vgcreate vg_cache /dev/vdb /dev/vdc
        # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
        # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
        # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
        # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
        # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
        # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # dmsetup status vg_cache-lv_slowdev
        0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
                                                                  ^^^^
                                      7065 * 64k = 441M yet to be written to the slow device
        # echo b >/proc/sysrq-trigger
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 0 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
        0fe00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
      
      This is the case with both v1 and v2 cache pool metadata formats (a
      sketch of the restored policy follows this entry).
      
      After applying this patch:
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 3724 blocks for cache vg_cache/lv_slowdev.
        ...
        Flushing 71 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
      
      Cc: stable@vger.kernel.org
      Fixes: f177940a ("dm cache metadata: switch to using the new cursor api for loading metadata")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
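      A sketch (self-contained user-space C; the function is an invented
      illustration of the policy quoted from cache.txt above, not the
      metadata-loading code itself) of the restored behaviour: on-disk
      dirty bits are only hints, trusted solely after a clean shutdown:

        #include <stdio.h>

        /* If CLEAN_SHUTDOWN was not set when the metadata was opened,
         * ignore the hint and treat the block as dirty. */
        static int block_is_dirty(int dirty_hint, int clean_when_opened)
        {
                return clean_when_opened ? dirty_hint : 1;
        }

        int main(void)
        {
                /* after a crash, even "clean" hints come back dirty */
                printf("%d\n", block_is_dirty(0, 0)); /* prints 1 */
                return 0;
        }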
  11. 09 Aug 2018 (6 commits)
    • bcache: trivial - remove tailing backslash in macro BTREE_FLAG · cbb751c0
      Shenghui Wang authored
      Remove the trailing backslash in macro BTREE_FLAG in btree.h.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section · e921efeb
      Shenghui Wang authored
      The pr_err statement in the sysfs_attach code path runs for various
      error codes, which may be confusing.
      
      E.g.,
      
      Run the command twice:
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
         [the backing dev got attached on the first run]
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
      
      In dmesg, after the command has run twice, we get:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be891
                     : cache set not found
      The first message is right, but the second one is confusing.
      
      bch_cached_dev_attach() has its own pr_ statements for the various
      error codes, except ENOENT, so the sysfs code should print its
      message only in that case (see the sketch after this entry).
      
      After the change, rerun the above command twice:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      
      In dmesg we only got:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      No confusing "cache set not found" message anymore.
      
      And for a nonexistent SET-UUID:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
      			/sys/block/bcache0/bcache/attach
      In dmesg we can get:
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be898
      	               : cache set not found
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
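      A sketch (self-contained user-space C; the function name is invented,
      the real change lives in __cached_dev_store()) of printing the
      message only for -ENOENT:

        #include <errno.h>
        #include <stdio.h>

        /* Print "cache set not found" only when attach reported -ENOENT;
         * every other error path already prints its own message. */
        static void report_attach_error(int ret, const char *uuid)
        {
                if (ret == -ENOENT)
                        fprintf(stderr,
                                "Can't attach %s: cache set not found\n", uuid);
        }

        int main(void)
        {
                report_attach_error(-EEXIST, "796b5c05-..."); /* silent */
                report_attach_error(-ENOENT, "796b5c05-..."); /* prints */
                return 0;
        }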
    • bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li authored
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device.  It works well if there is only one bcache device
      attached to the cache set.  If many bcache devices are attached to a
      cache set, it may introduce a performance regression because the
      faster writeback threads of the idle bcache devices compete for the
      btree-level locks with the bcache devices that have I/O requests
      coming in.

      This patch fixes the above issue by only permitting fast writeback
      when all bcache devices attached to the cache set are idle (see the
      sketch after this entry).  If one of the bcache devices gets a new
      I/O request, the writeback throughput of all devices is minimized
      immediately, and the PI controller __update_writeback_rate() decides
      the upcoming writeback rate for each bcache device.

      Also, when all bcache devices are idle, limiting the writeback rate
      to a small number wastes throughput, especially when the backing
      devices are slower non-rotational devices (e.g. SATA SSDs).  This
      patch sets a max writeback rate for each backing device if the whole
      cache set is idle.  A faster writeback rate in idle time means new
      I/Os may have more available space for dirty data, so a better write
      performance may be observed.

      Please note that bcache may change its cache mode at run time, and
      this patch still works if the cache mode is switched away from
      writeback mode while there is still dirty data in the cache.
      
      Fixes: b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: Coly Li <colyli@suse.de>
      Tested-by: Kai Krakow <kai@kaishome.de>
      Tested-by: Stefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
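      A toy sketch (self-contained user-space C; names are invented, and
      the real logic also tracks per-device idle time) of the "all devices
      idle" condition described above:

        #include <stdio.h>

        /* Allow the maximum writeback rate only if every bcache device
         * on the cache set is idle; a single busy device drops everyone
         * back to the PI-controlled rate. */
        static int set_at_max_writeback_rate(int n_devices, const int *idle)
        {
                for (int i = 0; i < n_devices; i++)
                        if (!idle[i])
                                return 0;
                return 1;
        }

        int main(void)
        {
                int idle[] = { 1, 1, 0 }; /* third device has I/O coming in */
                printf("%d\n", set_at_max_writeback_rate(3, idle)); /* 0 */
                return 0;
        }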
    • bcache: add code comments for bset.c · b467a6ac
      Coly Li authored
      This patch adds code comments in bset.c to make some tricky code and
      design decisions more comprehensible.  Most of the information in
      this patch comes from discussion between Kent and me; he offered very
      informative details.  If there is any mistake in the ideas behind the
      code, it is no doubt my misrepresentation.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix mistaken comments in request.c · 0cba2e71
      Coly Li authored
      This patch updates a code comment in bch_keylist_realloc() by fixing
      incorrect function names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix mistaken code comments in bcache.h · cb329dec
      Coly Li authored
      This patch updates the code comment in struct cache with the correct
      array names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>