1. 23 Aug, 2018 1 commit
  2. 17 Aug, 2018 1 commit
  3. 14 Aug, 2018 1 commit
  4. 12 Aug, 2018 17 commits
  5. 11 Aug, 2018 1 commit
    • bcache: fix error setting writeback_rate through sysfs interface · 46451874
      Coly Li committed
      Commit ea8c5356 ("bcache: set max writeback rate when I/O request
      is idle") changes struct bch_ratelimit member rate from uint32_t to
      atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
      to set new writeback rate, after the input is converted from memory
      buf to long int by sysfs_strtoul_clamp().
      
      The above change has a problem because there is an implicit return
      inside sysfs_strtoul_clamp() so the following atomic_long_set()
      won't be called. This error is detected by 0day system with following
      snipped smatch warnings:
      
      drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
      symbol 'v'.
      270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      @271 atomic_long_set(&dc->writeback_rate.rate, v);
      
      This patch fixes the above error by using strtoul_safe_clamp() to
      convert the input buffer into a long int type result.
      
      Fixes: ea8c5356 ("bcache: set max writeback rate when I/O request is idle")
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
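The failure mode in the commit above (a helper macro that returns from its caller, so the statement after it never runs) can be reproduced outside the kernel. The following is a minimal userspace sketch, not the bcache code: `parse_clamped()`, `PARSE_AND_RETURN()`, `stored_rate`, and both `store_rate_*()` functions are hypothetical names, and the macro only assumes the general shape of a statement-style helper with a hidden `return`.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-in for the field the store handler is meant to update. */
static long stored_rate;

/* Parse buf as a decimal long and clamp it to [min, max]; 0 or -EINVAL. */
static int parse_clamped(const char *buf, long *out, long min, long max)
{
	char *end;
	long v;

	assert(min <= max);
	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (v < min)
		v = min;
	if (v > max)
		v = max;
	*out = v;
	return 0;
}

/* Assumed shape of the problem: the macro returns from the calling function. */
#define PARSE_AND_RETURN(buf, var, min, max) \
	do { return parse_clamped((buf), &(var), (min), (max)); } while (0)

/* Buggy variant: the assignment after the macro is unreachable. */
static int store_rate_buggy(const char *buf)
{
	long v;

	PARSE_AND_RETURN(buf, v, 1, INT_MAX);
	stored_rate = v;	/* never executed */
	return 0;
}

/* Fixed variant: call the expression-style helper and check its result. */
static int store_rate_fixed(const char *buf)
{
	long v;
	int err = parse_clamped(buf, &v, 1, INT_MAX);

	if (err)
		return err;
	stored_rate = v;
	return 0;
}
```

Calling `store_rate_buggy("42")` reports success but leaves `stored_rate` untouched, while `store_rate_fixed("42")` stores the parsed value and `store_rate_fixed("abc")` returns `-EINVAL` without modifying it.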
  6. 10 Aug, 2018 1 commit
    • dm cache metadata: set dirty on all cache blocks after a crash · 5b1fe7be
      Ilya Dryomov committed
      Quoting Documentation/device-mapper/cache.txt:
      
        The 'dirty' state for a cache block changes far too frequently for us
        to keep updating it on the fly.  So we treat it as a hint.  In normal
        operation it will be written when the dm device is suspended.  If the
        system crashes all cache blocks will be assumed dirty when restarted.
      
      This got broken in commit f177940a ("dm cache metadata: switch to
      using the new cursor api for loading metadata") in 4.9, which removed
      the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
      flag) when loading cache blocks.  This results in data corruption on an
      unclean shutdown with dirty cache blocks on the fast device.  After the
      crash those blocks are considered clean and may get evicted from the
      cache at any time.  This can be demonstrated by doing a lot of reads
      to trigger individual evictions, but uncache is more predictable:
      
        ### Disable auto-activation in lvm.conf to be able to do uncache in
        ### time (i.e. see uncache doing flushing) when the fix is applied.
      
        # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
        # vgcreate vg_cache /dev/vdb /dev/vdc
        # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
        # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
        # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
        # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
        # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
        # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # dmsetup status vg_cache-lv_slowdev
        0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
                                                                  ^^^^
                                      7065 * 64k = 441M yet to be written to the slow device
        # echo b >/proc/sysrq-trigger
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 0 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
        0fe00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
      
      This is the case with both v1 and v2 cache pool metadata formats.
      
      After applying this patch:
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 3724 blocks for cache vg_cache/lv_slowdev.
        ...
        Flushing 71 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
      
      Cc: stable@vger.kernel.org
      Fixes: f177940a ("dm cache metadata: switch to using the new cursor api for loading metadata")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
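The rule quoted from cache.txt in the commit above (the on-disk dirty state is only a hint, and every block is assumed dirty after a crash) can be written as a small decision function. This is a hedged sketch under assumed names: `block_is_dirty()` and `count_dirty()` are hypothetical and do not mirror the dm-cache metadata code; `clean_shutdown` plays the role of the CLEAN_SHUTDOWN flag mentioned in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Decide whether one cache block must be treated as dirty after loading. */
static bool block_is_dirty(bool clean_shutdown, bool dirty_flag_on_disk)
{
	/* The on-disk flag is only a hint; trust it only after a clean shutdown. */
	return clean_shutdown ? dirty_flag_on_disk : true;
}

/* Count the blocks that would need writeback for the given on-disk flags. */
static size_t count_dirty(const bool *flags, size_t n, bool clean_shutdown)
{
	size_t dirty = 0;

	assert(flags != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (block_is_dirty(clean_shutdown, flags[i]))
			dirty++;
	return dirty;
}
```

With on-disk flags `{true, false, false, true}`, a clean shutdown yields two dirty blocks, whereas an unclean shutdown yields four, which corresponds to the uncache step flushing blocks instead of reporting zero.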
  7. 09 Aug, 2018 11 commits
    • bcache: trivial - remove tailing backslash in macro BTREE_FLAG · cbb751c0
      Shenghui Wang committed
      Remove the trailing backslash in macro BTREE_FLAG in btree.h.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section · e921efeb
      Shenghui Wang committed
      The pr_err statement in the sysfs_attach section of the code runs
      for various error codes, which may be confusing.

      E.g.,
      
      Run the command twice:
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
         [the backing dev got attached on the first run]
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
      
      In dmesg, after the command is run twice, we get:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be891
                     : cache set not found
      The first statement in the message was right, but the second was
      confusing.
      
      bch_cached_dev_attach has various pr_ statements for various error
      codes, except ENOENT.
      
      After the change, rerun above command twice:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      
      In dmesg we only got:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      No confusing "cache set not found" message anymore.
      
      And for a nonexistent SET-UUID:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
      			/sys/block/bcache0/bcache/attach
      In dmesg we can get:
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be898
      	               : cache set not found
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li committed
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce a performance regression, because multiple faster
      writeback threads of the idle bcache devices will compete for the btree
      level locks with the bcache devices that have I/O requests coming.

      This patch fixes the above issue by only permitting fast writeback when
      all bcache devices attached to the cache set are idle. If one of the
      bcache devices has a new I/O request coming, it minimizes all writeback
      throughput immediately and lets the PI controller
      __update_writeback_rate() decide the upcoming writeback rate for each
      bcache device.

      Also, when all bcache devices are idle, limiting the writeback rate to a
      small number is a waste of throughput, especially when the backing
      devices are slower non-rotational devices (e.g. SATA SSD). This patch
      sets a max writeback rate for each backing device if the whole cache set
      is idle. A faster writeback rate in idle time means new I/Os may have
      more available space for dirty data, and people may observe better write
      performance.

      Please note that bcache may change its cache mode at run time, and this
      patch still works if the cache mode is switched from writeback mode and
      there is still dirty data on the cache.
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: Coly Li <colyli@suse.de>
      Tested-by: Kai Krakow <kai@kaishome.de>
      Tested-by: Stefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add code comments for bset.c · b467a6ac
      Coly Li committed
      This patch adds code comments to bset.c, to make some tricky code and
      design more comprehensible. Most of the information in this patch
      comes from a discussion between Kent and me; he offered very
      informative details. If there is any mistake in the idea behind the
      code, it is no doubt my misrepresentation.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix mistaken comments in request.c · 0cba2e71
      Coly Li committed
      This patch updates the code comment in bch_keylist_realloc() by fixing
      incorrect function names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix mistaken code comments in bcache.h · cb329dec
      Coly Li committed
      This patch updates the code comment in struct cache with correct array
      names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add a comment in super.c · e57fd746
      Coly Li committed
      This patch adds a line of code comment in super.c:register_bdev(), to
      make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: avoid unncessary cache prefetch bch_btree_node_get() · c2e8dcf7
      Coly Li committed
      In bch_btree_node_get() the read-in btree node will be partially
      prefetched into L1 cache for the following bset iteration (if there is
      one). But if the btree node read fails, the prefetch operations will
      waste L1 cache space. This patch checks whether the read operation
      succeeded and only does the cache prefetch when the read I/O succeeded.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: display rate debug parameters to 0 when writeback is not running · b4cb6efc
      Coly Li committed
      When writeback is not running, the writeback rate should be 0; any other
      value is misleading. The following dynamic writeback rate debug
      parameters should be 0 too,
      	rate, proportional, integral, change
      otherwise they are misleading when writeback is not running.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: do not check return value of debugfs_create_dir() · 78ac2107
      Coly Li committed
      Greg KH suggests that normal code should not care about debugfs.
      Therefore, whether debugfs_create_dir() succeeds or fails, it is
      unnecessary to check its return value.

      Two functions call debugfs_create_dir() and check the return value:
      bch_debug_init() and closure_debug_init(). This patch changes these two
      functions from int to void type, and ignores the return values of
      debugfs_create_dir().

      This patch does not fix an actual bug, it just makes things work as they
      should.
      Signed-off-by: Coly Li <colyli@suse.de>
      Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • dm snapshot: remove stale FIXME in snapshot_map() · c9a5e6a9
      Mike Snitzer committed
      Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore")
      eliminated the need to worry about read vs write locking.  So remove a
      FIXME in snapshot_map() that is concerned about selectively taking a
      write lock.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  8. 08 Aug, 2018 4 commits
    • dm snapshot: improve performance by switching out_of_order_list to rbtree · 3db2776d
      David Jeffery committed
      copy_complete()'s processing of out_of_order_list can result in
      quadratic complexity in the worst case.  As such it consumed too much
      CPU and caused a significant loss in performance.
      
      Fix this by converting out_of_order_list to an rbtree.  This improved
      a dm-snapshot test copy workload from 32 seconds to 4 seconds.
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Brett Hull <bhull@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
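The idea behind the commit above, keeping out-of-order completions in an ordered tree rather than a sorted list, can be sketched in userspace. This is an illustration under assumptions, not the dm-snapshot code: `struct tracker`, `struct pending`, and `complete_seq()` are hypothetical names, and the kernel rbtree is replaced by POSIX `tsearch()`/`tfind()`/`tdelete()` (a red-black tree in glibc; POSIX itself does not promise balancing).

```c
#include <assert.h>
#include <search.h>
#include <stdlib.h>

/* One completion that arrived before its predecessors. */
struct pending {
	unsigned long seq;
};

/* Tracks the next sequence number to commit and the parked completions. */
struct tracker {
	void *root;			/* tsearch() tree of struct pending */
	unsigned long next_seq;		/* next sequence number to commit */
};

static int cmp_seq(const void *a, const void *b)
{
	const struct pending *pa = a, *pb = b;

	return (pa->seq > pb->seq) - (pa->seq < pb->seq);
}

/*
 * Record that seq has completed. Returns how many entries were committed
 * in order by this call, or -1 on allocation failure.
 */
static int complete_seq(struct tracker *t, unsigned long seq)
{
	int committed = 0;

	assert(t != NULL);
	if (seq < t->next_seq)
		return 0;		/* already committed, ignore */

	if (seq != t->next_seq) {
		/* Out of order: park it in the tree instead of a sorted list. */
		struct pending *p = malloc(sizeof(*p));
		void **slot;

		if (!p)
			return -1;
		p->seq = seq;
		slot = tsearch(p, &t->root, cmp_seq);
		if (!slot) {
			free(p);
			return -1;
		}
		if (*(struct pending **)slot != p)
			free(p);	/* duplicate sequence number */
		return 0;
	}

	/* In order: commit it, then drain any consecutive parked successors. */
	t->next_seq++;
	committed++;
	for (;;) {
		struct pending key = { .seq = t->next_seq };
		void **slot = tfind(&key, &t->root, cmp_seq);
		struct pending *p;

		if (!slot)
			break;
		p = *(struct pending **)slot;
		tdelete(p, &t->root, cmp_seq);
		free(p);
		t->next_seq++;
		committed++;
	}
	return committed;
}
```

Completing sequence numbers 2 and 1 first parks both; completing 0 then commits all three in one call. With a balanced tree each park or lookup is logarithmic in the number of parked entries, instead of a linear walk per insertion.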
    • dm kcopyd: avoid softlockup in run_complete_job · 784c9a29
      John Pittman committed
      It was reported that softlockups occur when using dm-snapshot on top of
      slow (rbd) storage.  E.g.:
      
      [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
      ...
      [ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
      [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
      ...
      [ 4048.034190] Call Trace:
      [ 4048.034196]  ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
      [ 4048.034200]  run_complete_job+0x5f/0xb0 [dm_mod]
      [ 4048.034205]  process_jobs+0x91/0x220 [dm_mod]
      [ 4048.034210]  ? kcopyd_put_pages+0x40/0x40 [dm_mod]
      [ 4048.034214]  do_work+0x46/0xa0 [dm_mod]
      [ 4048.034219]  process_one_work+0x171/0x370
      [ 4048.034221]  worker_thread+0x1fc/0x3f0
      [ 4048.034224]  kthread+0xf8/0x130
      [ 4048.034226]  ? max_active_store+0x80/0x80
      [ 4048.034227]  ? kthread_bind+0x10/0x10
      [ 4048.034231]  ret_from_fork+0x35/0x40
      [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks
      
      Fix this by calling cond_resched() after run_complete_job()'s callout to
      the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
      trace).
      Signed-off-by: John Pittman <jpittman@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
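The pattern behind the fix in the commit above, giving the scheduler a chance to run after each completion callback in a long loop, can be approximated in userspace. This is only an analogy under stated assumptions: `struct job`, `run_complete_jobs()`, and `count_cb()` are hypothetical names, and `sched_yield()` stands in for the kernel's `cond_resched()`, which has no direct userspace equivalent.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Completion callback type, loosely modelled on a notify function. */
typedef void (*notify_fn)(void *context);

struct job {
	notify_fn fn;
	void *context;
};

/*
 * Run the callback of every completed job. Yielding after each callback
 * keeps a long batch of slow callbacks from monopolising the CPU.
 * Returns the number of callbacks that ran.
 */
static size_t run_complete_jobs(struct job *jobs, size_t n)
{
	size_t done = 0;

	assert(jobs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		jobs[i].fn(jobs[i].context);
		done++;
		sched_yield();	/* userspace stand-in for cond_resched() */
	}
	return done;
}

/* Example callback: increments the int that context points to. */
static void count_cb(void *context)
{
	(*(int *)context)++;
}
```

A batch of three jobs that all use `count_cb` on the same counter runs three callbacks and leaves the counter at 3; the yield does not change the result, only how long the loop can hold the CPU without a scheduling point.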
    • dm cache metadata: save in-core policy_hint_size to on-disk superblock · fd2fa954
      Mike Snitzer committed
      policy_hint_size starts as 0 during __write_initial_superblock().  It
      isn't until the policy is loaded that policy_hint_size is set in-core
      (cmd->policy_hint_size).  But it never got recorded in the on-disk
      superblock because __commit_transaction() didn't deal with transferring
      the in-core cmd->policy_hint_size to the on-disk superblock.
      
      The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
      __begin_transaction_flags() which re-reads all superblock fields.
      Because the superblock's policy_hint_size was never properly stored, when
      the cache was created, hints_array_available() would always return false
      when re-activating a previously created cache.  This means
      __load_mappings() always considered the hints invalid and never made use
      of the hints (these hints served to optimize).
      
      Another detrimental side-effect of this oversight is the cache_check
      utility would fail with: "invalid hint width: 0"
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: stop no_space_timeout worker when switching to write-mode · 75294442
      Hou Tao committed
      Now both check_for_space() and do_no_space_timeout() will read & write
      pool->pf.error_if_no_space.  If these functions run concurrently, as
      shown in the following case, the default setting of "queue_if_no_space"
      can get lost.
      
      precondition:
          * error_if_no_space = false (aka "queue_if_no_space")
          * pool is in Out-of-Data-Space (OODS) mode
          * no_space_timeout worker has been queued
      
      CPU 0:                          CPU 1:
      // delete a thin device
      process_delete_mesg()
      // check_for_space() invoked by commit()
      set_pool_mode(pool, PM_WRITE)
          pool->pf.error_if_no_space = \
           pt->requested_pf.error_if_no_space
      
      				// timeout, pool is still in OODS mode
      				do_no_space_timeout
      				    // "queue_if_no_space" config is lost
      				    pool->pf.error_if_no_space = true
          pool->pf.mode = new_mode
      
      Fix it by stopping no_space_timeout worker when switching to write mode.
      
      Fixes: bcc696fa ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  9. 03 Aug, 2018 1 commit
    • md/raid5: fix data corruption of replacements after originals dropped · d63e2fc8
      BingJing Chang committed
      During raid5 replacement, stripes can be marked with the R5_NeedReplace
      flag. Data can be read from being-replaced devices and written to
      replacing spares without reading all other devices. (This is 'replace'
      mode, s.replacing = 1.) If a being-replaced device is dropped, the
      replacement progress will be interrupted and resumed in pure recovery
      mode. However, stripes that existed before the interruption can no
      longer read from the dropped device. This prints lots of WARN_ON
      messages, and it results in data corruption because the existing
      stripes write problematic data into the replacement device and update
      the progress.
      
      # Erase disks (1MB + 2GB)
      dd if=/dev/zero of=/dev/sda bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
      mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
      # Ensure array stores non-zero data
      dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
      # Start replacement
      mdadm /dev/md0 -a /dev/sdd
      mdadm /dev/md0 --replace /dev/sda
      
      Then hot-unplug /dev/sda during recovery, and wait for the recovery to
      finish.
      echo check > /sys/block/md0/md/sync_action
      cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

      Soon after you hot-unplug /dev/sda, you will see many WARN_ON
      messages. The replacement recovery will be interrupted shortly. After
      the recovery finishes, it will result in data corruption.
      
      Actually, it's just an unhandled case of replacement. In commit
      <f94c0b66> (md/raid5: fix interaction of 'replace' and 'recovery'.),
      if a NeedReplace device is not UPTODATE then that is an error. The
      commit simply prints a WARN_ON but still marks these corrupted stripes
      with R5_WantReplace (meaning they are ready for writes).

      To fix this case, we can leverage the 'sync and replace' mode mentioned
      in commit <9a3e1101> (md/raid5: detect and handle replacements during
      recovery.). We can add logic to detect and use 'sync and replace' mode
      for these stripes.
      Reported-by: Alex Chen <alexchen@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  10. 02 Aug, 2018 2 commits