1. 11 Aug, 2018 1 commit
    • C
      bcache: fix error setting writeback_rate through sysfs interface · 46451874
      Coly Li committed
      Commit ea8c5356 ("bcache: set max writeback rate when I/O request
      is idle") changes struct bch_ratelimit member rate from uint32_t to
      atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
      to set new writeback rate, after the input is converted from memory
      buf to long int by sysfs_strtoul_clamp().
      
      The above change has a problem: sysfs_strtoul_clamp() contains an
      implicit return, so the atomic_long_set() that follows is never
      called. The error was detected by the 0day system, which reported the
      following snipped smatch warnings:
      
      drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
      symbol 'v'.
      270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      @271 atomic_long_set(&dc->writeback_rate.rate, v);
      
      This patch fixes the above error by using strtoul_safe_clamp() to
      convert the input buffer into a long int type result.
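
The hazard described above, a macro that hides a `return`, can be sketched in user-space C. This is a minimal illustration and not the kernel code: `parse_clamped()`, `STORE_CLAMPED()`, `struct dev_model` and both store functions are hypothetical stand-ins, and the real bcache macros differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical clamping parser: 0 on success, -EINVAL on bad input. */
static int parse_clamped(const char *buf, long *res, long min, long max)
{
	char *end;
	long v;

	assert(buf != NULL && res != NULL);
	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	*res = v < min ? min : (v > max ? max : v);
	return 0;
}

/* Hypothetical macro with a hidden return: the caller's next
 * statement is never reached. */
#define STORE_CLAMPED(var, buf, min, max)			\
	return parse_clamped((buf), &(var), (min), (max))

struct dev_model {
	long rate;
};

/* Buggy shape: the assignment after the macro is unreachable. */
int store_rate_buggy(struct dev_model *d, const char *buf)
{
	long v = 0;

	STORE_CLAMPED(v, buf, 1, INT_MAX);
	d->rate = v;	/* never executed */
	return 0;
}

/* Fixed shape: call the parser directly, then store the result. */
int store_rate_fixed(struct dev_model *d, const char *buf)
{
	long v;
	int ret = parse_clamped(buf, &v, 1, INT_MAX);

	if (ret)
		return ret;
	d->rate = v;
	return 0;
}
```

With these stand-ins, `store_rate_buggy()` reports success but leaves `rate` untouched, while `store_rate_fixed()` actually stores the parsed value.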
      
      Fixes: ea8c5356 ("bcache: set max writeback rate when I/O request is idle")
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 09 Aug, 2018 10 commits
    • S
      bcache: trivial - remove tailing backslash in macro BTREE_FLAG · cbb751c0
      Shenghui Wang committed
      Remove the trailing backslash in macro BTREE_FLAG in btree.h.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • S
      bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section · e921efeb
      Shenghui Wang committed
      The pr_err statement in the code for the sysfs_attatch section would
      run for various error codes, which may be confusing.
      
      E.g.,
      
      Run the command twice:
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
         [the backing dev got attached on the first run]
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
      
      After running the command twice, dmesg shows:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be891
                     : cache set not found
      The first statement in the message was right, but the second was
      confusing.
      
      bch_cached_dev_attach has various pr_ statements for various error
      codes, except ENOENT.
      
      After the change, rerun above command twice:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      
      In dmesg we only got:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      No confusing "cache set not found" message anymore.
      
      And for a SET-UUID that does not exist:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
      			/sys/block/bcache0/bcache/attach
      In dmesg we can get:
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be898
      	               : cache set not found
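
The division of labour described above, where the attach routine reports every failure except ENOENT and the caller reports only ENOENT, can be sketched as follows. This is a simplified user-space model: `attach_dev_model()` and `store_attach_model()` are hypothetical names, and the error code chosen for the "already attached" case is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the attach routine: it prints its own
 * message for every failure except -ENOENT. */
static int attach_dev_model(int set_found, int already_attached)
{
	if (!set_found)
		return -ENOENT;		/* silent: left to the caller */
	if (already_attached) {
		fprintf(stderr, "Can't attach: already attached\n");
		return -EINVAL;		/* assumed error code */
	}
	return 0;
}

/* Caller: prints "cache set not found" only for -ENOENT.
 * Returns 1 if it printed that message, 0 otherwise. */
int store_attach_model(int set_found, int already_attached, int *err)
{
	assert(err != NULL);
	*err = attach_dev_model(set_found, already_attached);
	if (*err == -ENOENT) {
		fprintf(stderr, "Can't attach: cache set not found\n");
		return 1;
	}
	return 0;
}
```
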
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li committed
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce a performance regression, because the faster
      writeback threads of the idle bcache devices compete for the btree level
      locks with the bcache devices that have I/O requests coming.
      
      This patch fixes the above issue by only permitting fast writeback when
      all bcache devices attached to the cache set are idle. If any of the
      bcache devices has a new I/O request coming, all writeback throughput
      is minimized immediately and the PI controller
      __update_writeback_rate() decides the upcoming writeback rate for each
      bcache device.
      
      Also, when all bcache devices are idle, limiting the writeback rate to a
      small number is a waste of throughput, especially when the backing
      devices are slower non-rotational devices (e.g. SATA SSD). This patch
      sets a max writeback rate for each backing device if the whole cache set
      is idle. A faster writeback rate in idle time means new I/Os may have
      more available space for dirty data, and people may observe better write
      performance then.
      
      Please note bcache may change its cache mode in run time, and this patch
      still works if the cache mode is switched from writeback mode and there
      is still dirty data on cache.
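
The policy described above (maximum rate only when every device on the cache set is idle, minimum rate as soon as any device sees I/O) can be modelled roughly as below. The structure, the constants `MIN_RATE_MODEL` and `MAX_RATE_MODEL`, and the per-device `busy` flag are all assumptions made for illustration. The kernel implementation tracks idleness differently and hands control back to the PI controller afterwards, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS_MODEL	8
#define MIN_RATE_MODEL	1L		/* hypothetical minimum rate */
#define MAX_RATE_MODEL	1000000L	/* hypothetical maximum rate */

struct cache_set_model {
	size_t ndev;
	bool busy[MAX_DEVS_MODEL];	/* device saw I/O recently */
	long rate[MAX_DEVS_MODEL];	/* per-device writeback rate */
};

/* Set every device to the max rate only if the whole set is idle;
 * otherwise drop every device to the min rate.
 * Returns true when the max rate was applied. */
bool update_idle_rate_model(struct cache_set_model *c)
{
	bool any_busy = false;
	size_t i;

	assert(c != NULL && c->ndev <= MAX_DEVS_MODEL);
	for (i = 0; i < c->ndev; i++)
		any_busy = any_busy || c->busy[i];

	for (i = 0; i < c->ndev; i++)
		c->rate[i] = any_busy ? MIN_RATE_MODEL : MAX_RATE_MODEL;

	return !any_busy;
}
```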
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: Coly Li <colyli@suse.de>
      Tested-by: Kai Krakow <kai@kaishome.de>
      Tested-by: Stefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: add code comments for bset.c · b467a6ac
      Coly Li committed
      This patch adds code comments to bset.c, to make some tricky code and
      design more comprehensible. Most of the information in this patch comes
      from a discussion between Kent and me; he offered very informative
      details. If there is any mistake in the idea behind the code, no doubt
      it comes from my misrepresentation.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: fix mistaken comments in request.c · 0cba2e71
      Coly Li committed
      This patch updates the code comment in bch_keylist_realloc() by fixing
      incorrect function names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: fix mistaken code comments in bcache.h · cb329dec
      Coly Li committed
      This patch updates the code comment in struct cache with correct array
      names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: add a comment in super.c · e57fd746
      Coly Li committed
      This patch adds a line of code comment in super.c:register_bdev(), to
      make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: avoid unncessary cache prefetch bch_btree_node_get() · c2e8dcf7
      Coly Li committed
      In bch_btree_node_get() the read-in btree node will be partially
      prefetched into L1 cache for the following bset iteration (if there is
      one). But if the btree node read fails, the prefetch operations will
      waste L1 cache space. This patch checks whether the read operation
      succeeded and only does the cache prefetch when the read I/O succeeded.
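
The idea of skipping the prefetch after a failed read can be sketched in user-space C. The sketch assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`) and a 64-byte cache line. `struct node_model` and `prefetch_if_read_ok()` are hypothetical names, not bcache code.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_MODEL 64	/* assumed cache line size in bytes */

struct node_model {
	int io_error;		/* non-zero if the read failed */
	char data[256];
};

/* Issue prefetch hints for the first `lines` cache lines of the node,
 * but only when the read succeeded. Returns the number of hints issued. */
size_t prefetch_if_read_ok(const struct node_model *n, size_t lines)
{
	size_t i, issued = 0;

	assert(n != NULL);
	if (n->io_error)
		return 0;	/* failed read: do not pollute the cache */

	for (i = 0; i < lines && i * CACHE_LINE_MODEL < sizeof(n->data); i++) {
		__builtin_prefetch(n->data + i * CACHE_LINE_MODEL);
		issued++;
	}
	return issued;
}
```
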
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: display rate debug parameters to 0 when writeback is not running · b4cb6efc
      Coly Li committed
      When writeback is not running, the writeback rate should be 0; any other
      value is misleading. The following dynamic writeback rate debug
      parameters should be 0 too,
      	rate, proportional, integral, change
      otherwise they are misleading when writeback is not running.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: do not check return value of debugfs_create_dir() · 78ac2107
      Coly Li committed
      Greg KH suggests that normal code should not care about debugfs.
      Therefore, whether debugfs_create_dir() succeeds or fails, it is
      unnecessary to check its return value.
      
      Two functions call debugfs_create_dir() and check the return value:
      bch_debug_init() and closure_debug_init(). This patch changes these two
      functions from int to void type, and ignores the return values of
      debugfs_create_dir().
      
      This patch does not fix an actual bug, it just makes things work as
      they should.
      Signed-off-by: Coly Li <colyli@suse.de>
      Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 03 Aug, 2018 1 commit
    • B
      md/raid5: fix data corruption of replacements after originals dropped · d63e2fc8
      BingJing Chang committed
      During raid5 replacement, the stripes can be marked with R5_NeedReplace
      flag. Data can be read from being-replaced devices and written to
      replacing spares without reading all other devices. (It's 'replace'
      mode. s.replacing = 1) If a being-replaced device is dropped, the
      replacement progress will be interrupted and resumed with pure recovery
      mode. However, existing stripes before being interrupted cannot read
      from the dropped device anymore. It prints lots of WARN_ON messages.
      And it results in data corruption because existing stripes write
      problematic data into its replacement device and update the progress.
      
      # Erase disks (1MB + 2GB)
      dd if=/dev/zero of=/dev/sda bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
      mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
      # Ensure array stores non-zero data
      dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
      # Start replacement
      mdadm /dev/md0 -a /dev/sdd
      mdadm /dev/md0 --replace /dev/sda
      
      Then hot-unplug /dev/sda during recovery, and wait for the recovery to finish.
      echo check > /sys/block/md0/md/sync_action
      cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.
      
      Soon after you hot-unplug /dev/sda, you will see many WARN_ON
      messages. The replacement recovery will be interrupted shortly. After
      the recovery finishes, it will result in data corruption.
      
      Actually, it's just an unhandled case of replacement. In commit
      <f94c0b66> (md/raid5: fix interaction of 'replace' and 'recovery'.),
      a NeedReplace device that is not UPTODATE is treated as an error, but
      the commit simply prints a WARN_ON and still marks these corrupted
      stripes with R5_WantReplace (meaning they are ready for writes).
      
      To fix this case, we can leverage the 'sync and replace' mode mentioned
      in commit <9a3e1101> (md/raid5: detect and handle replacements during
      recovery.). We can add logic to detect and use 'sync and replace' mode
      for these stripes.
      Reported-by: Alex Chen <alexchen@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 27 Jul, 2018 9 commits
  5. 25 Jul, 2018 2 commits
  6. 24 Jul, 2018 1 commit
  7. 19 Jul, 2018 1 commit
  8. 18 Jul, 2018 2 commits
    • M
      block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan committed
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  They are now
      indexed by op_is_write().
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • M
      block: Add part_stat_read_accum to read across field entries. · 59767fbd
      Michael Callahan committed
      Add a part_stat_read_accum macro to genhd.h to read and sum across
      field entries.  For example, to sum up the number of read and write
      sectors completed.  In addition to being a reasonable cleanup by
      itself, this will make it easier to add new stat fields in the future.
      
      tj: Refreshed on top of v4.17.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 06 Jul, 2018 4 commits
    • C
      md/r5cache: remove redundant pointer bio · ebc7709f
      Colin Ian King committed
      Pointer bio is being assigned but is never used hence it is redundant
      and can be removed.
      
      Cleans up clang warning:
      warning: variable 'bio' set but not used [-Wunused-but-set-variable]
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • G
      md-cluster: don't send msg if array is closing · df8c6764
      Guoqing Jiang committed
      If we close an array whose resync thread is running,
      then the node doesn't need to send a msg, since another
      node would launch the resync thread to continue the
      remaining work. Also, sending a message is time consuming,
      so we should avoid it.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • G
      md-cluster: show array's status more accurate · 0357ba27
      Guoqing Jiang committed
      When resync or recovery is happening in one node,
      other nodes don't show the appropriate info now.
      
      For example, when creating an array in the master node
      without "--assume-clean" and then assembling the array
      in slave nodes, you can see "resync=PENDING" when
      reading /proc/mdstat in slave nodes. However, the info
      is confusing since the "PENDING" status was introduced
      for starting an array in read-only mode.
      
      We introduce the RESYNCING_REMOTE flag to indicate that
      the resync thread is running in a remote node. The flag
      is set when a node receives a RESYNCING msg. And we clear
      the REMOTE flag in the following cases:
      
      1. resync or recovery is finished in the master node,
         which means slaves receive a msg with both lo
         and hi set to 0.
      2. node continues resync/recovery in recover_bitmaps.
      3. when resync_finish is called.
      
      Then we show accurate information in status_resync
      by checking the REMOTE flag along with other conditions.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • G
      md-cluster: clear another node's suspend_area after the copy is finished · 010228e4
      Guoqing Jiang committed
      When one node leaves the cluster or stops the resyncing
      (resync or recovery) array, other nodes need to
      call recover_bitmaps to continue the unfinished task.
      
      But we need to clear suspend_area later, after the other
      nodes copy the resync information to their bitmap
      (by calling bitmap_copy_from_slot). Otherwise, all nodes
      could write to the suspend_area even though the suspend_area
      is not handled by any node, because area_resyncing
      returns 0 at the beginning of raid1_write_request.
      This means one node could write to suspend_area while
      another node is resyncing the same area, and then data
      could be inconsistent.
      
      So let's clear suspend_area later to avoid the above issue,
      under the protection of the bm lock. It is also straightforward
      to clear suspend_area after nodes have copied the resync
      info to the bitmap.
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  10. 03 Jul, 2018 1 commit
  11. 29 Jun, 2018 2 commits
    • R
      dm: prevent DAX mounts if not supported · dbc62659
      Ross Zwisler committed
      Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
      flag is set on the device's request queue to decide whether or not the
      device supports filesystem DAX.  Really we should be using
      bdev_dax_supported() like filesystems do at mount time.  This performs
      other tests like checking to make sure the dax_direct_access() path works.
      
      We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if
      any of the underlying devices do not support DAX.  This makes the handling
      of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags
      in dm_table_set_restrictions().
      
      Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this
      will ensure that filesystems built upon DM devices will only be able to
      mount with DAX if all underlying devices also support DAX.
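
The rule stated above, that the stacked device advertises DAX only if every underlying device supports it, amounts to an all-of check followed by setting or clearing a queue flag. The sketch below models that in user-space C. The structure, the flag bit and the function names are hypothetical, and treating an empty table as "not supported" is a conservative choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_FLAG_DAX_MODEL (1UL << 0)	/* hypothetical flag bit */

struct target_dev_model {
	bool dax_capable;	/* result of a per-device DAX check */
};

/* True only if the table is non-empty and every device supports DAX. */
bool table_supports_dax_model(const struct target_dev_model *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (!devs[i].dax_capable)
			return false;
	return n > 0;
}

/* Set the flag when supported and explicitly clear it otherwise. */
void set_queue_dax_flag_model(unsigned long *queue_flags, bool supported)
{
	assert(queue_flags != NULL);
	if (supported)
		*queue_flags |= QUEUE_FLAG_DAX_MODEL;
	else
		*queue_flags &= ~QUEUE_FLAG_DAX_MODEL;
}
```
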
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Fixes: commit 545ed20e ("dm: add infrastructure for DAX support")
      Cc: stable@vger.kernel.org
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • B
      md/raid10: fix that replacement cannot complete recovery after reassemble · bda31539
      BingJing Chang committed
      During assemble, the spare marked for replacement is not checked, so
      conf->fullsync cannot be set to 1. As a result, recovery will
      treat it as a clean array. All recovering sectors are skipped. The
      original device is replaced with the not-recovered spare.
      
      mdadm -C /dev/md0 -l10 -n4 -pn2 /dev/loop[0123]
      mdadm /dev/md0 -a /dev/loop4
      mdadm /dev/md0 --replace /dev/loop0
      mdadm -S /dev/md0 # stop array during recovery
      
      mdadm -A /dev/md0 /dev/loop[01234]
      
      After reassemble, you can see recovery go on, but it completes
      immediately. In fact, recovery is not actually processed.
      
      To solve this problem, we just add the missing logic for replacement
      spares. (In raid1.c or raid5.c, they have already been checked.)
      Reported-by: Alex Chen <alexchen@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  12. 27 Jun, 2018 1 commit
    • M
      dm thin: handle running out of data space vs concurrent discard · a685557f
      Mike Snitzer committed
      Discards issued to a DM thin device can complete to userspace (via
      fstrim) _before_ the metadata changes associated with the discards are
      reflected in the thinp superblock (e.g. free blocks).  As such, if a
      user constructs a test that loops repeatedly over these steps, block
      allocation can fail due to discards not having completed yet:
      1) fill thin device via filesystem file
      2) remove file
      3) fstrim
      
      From initial report, here:
      https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html
      
      "The root cause of this issue is that dm-thin will first remove
      mapping and increase corresponding blocks' reference count to prevent
      them from being reused before DISCARD bios get processed by the
      underlying layers. However, increasing blocks' reference count could
      also increase the nr_allocated_this_transaction in struct sm_disk
      which makes smd->old_ll.nr_allocated +
      smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks.
      In this case, alloc_data_block() will never commit metadata to reset
      the begin pointer of struct sm_disk, because sm_disk_get_nr_free()
      always return an underflow value."
      
      While there is room for improvement to the space-map accounting that
      thinp is making use of: the reality is this test is inherently racy and
      will result in the previous iteration's fstrim's discard(s) completing
      vs concurrent block allocation, via dd, in the next iteration of the
      loop.
      
      No amount of space map accounting improvements will be able to allow
      users to use a block before a discard of that block has completed.
      
      So the best we can really do is allow DM thinp to gracefully handle such
      aggressive use of all the pool's data by degrading the pool into
      out-of-data-space (OODS) mode.  We _should_ get that behaviour already
      (if space map accounting didn't falsely cause alloc_data_block() to
      believe free space was available).. but short of that we handle the
      current reality that dm_pool_alloc_data_block() can return -ENOSPC.
      Reported-by: Dennis Yang <dennisyang@qnap.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  13. 23 Jun, 2018 5 commits
    • A
      dm raid: don't use 'const' in function return · f2ccaa59
      Arnd Bergmann committed
      A newly introduced function has 'const int' as the return type,
      but as "make W=1" reports, that has no meaning:
      
      drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
      
      This changes the return type to plain 'int'.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 33e53f06 ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 552aa679 ("dm raid: use rs_is_raid*()")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • B
      dm zoned: avoid triggering reclaim from inside dmz_map() · 2d0b2d64
      Bart Van Assche committed
      This patch prevents lockdep from reporting the following:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      4.18.0-rc1 #62 Not tainted
      ------------------------------------------------------
      kswapd0/84 is trying to acquire lock:
      00000000c313516d (&xfs_nondir_ilock_class){++++}, at: xfs_free_eofblocks+0xa2/0x1e0
      
      but task is already holding lock:
      00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (fs_reclaim){+.+.}:
        kmem_cache_alloc+0x2c/0x2b0
        radix_tree_node_alloc.constprop.19+0x3d/0xc0
        __radix_tree_create+0x161/0x1c0
        __radix_tree_insert+0x45/0x210
        dmz_map+0x245/0x2d0 [dm_zoned]
        __map_bio+0x40/0x260
        __split_and_process_non_flush+0x116/0x220
        __split_and_process_bio+0x81/0x180
        __dm_make_request.isra.32+0x5a/0x100
        generic_make_request+0x36e/0x690
        submit_bio+0x6c/0x140
        mpage_readpages+0x19e/0x1f0
        read_pages+0x6d/0x1b0
        __do_page_cache_readahead+0x21b/0x2d0
        force_page_cache_readahead+0xc4/0x100
        generic_file_read_iter+0x7c6/0xd20
        __vfs_read+0x102/0x180
        vfs_read+0x9b/0x140
        ksys_read+0x55/0xc0
        do_syscall_64+0x5a/0x1f0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #1 (&dmz->chunk_lock){+.+.}:
        dmz_map+0x133/0x2d0 [dm_zoned]
        __map_bio+0x40/0x260
        __split_and_process_non_flush+0x116/0x220
        __split_and_process_bio+0x81/0x180
        __dm_make_request.isra.32+0x5a/0x100
        generic_make_request+0x36e/0x690
        submit_bio+0x6c/0x140
        _xfs_buf_ioapply+0x31c/0x590
        xfs_buf_submit_wait+0x73/0x520
        xfs_buf_read_map+0x134/0x2f0
        xfs_trans_read_buf_map+0xc3/0x580
        xfs_read_agf+0xa5/0x1e0
        xfs_alloc_read_agf+0x59/0x2b0
        xfs_alloc_pagf_init+0x27/0x60
        xfs_bmap_longest_free_extent+0x43/0xb0
        xfs_bmap_btalloc_nullfb+0x7f/0xf0
        xfs_bmap_btalloc+0x428/0x7c0
        xfs_bmapi_write+0x598/0xcc0
        xfs_iomap_write_allocate+0x15a/0x330
        xfs_map_blocks+0x1cf/0x3f0
        xfs_do_writepage+0x15f/0x7b0
        write_cache_pages+0x1ca/0x540
        xfs_vm_writepages+0x65/0xa0
        do_writepages+0x48/0xf0
        __writeback_single_inode+0x58/0x730
        writeback_sb_inodes+0x249/0x5c0
        wb_writeback+0x11e/0x550
        wb_workfn+0xa3/0x670
        process_one_work+0x228/0x670
        worker_thread+0x3c/0x390
        kthread+0x11c/0x140
        ret_from_fork+0x3a/0x50
      
      -> #0 (&xfs_nondir_ilock_class){++++}:
        down_read_nested+0x43/0x70
        xfs_free_eofblocks+0xa2/0x1e0
        xfs_fs_destroy_inode+0xac/0x270
        dispose_list+0x51/0x80
        prune_icache_sb+0x52/0x70
        super_cache_scan+0x127/0x1a0
        shrink_slab.part.47+0x1bd/0x590
        shrink_node+0x3b5/0x470
        balance_pgdat+0x158/0x3b0
        kswapd+0x1ba/0x600
        kthread+0x11c/0x140
        ret_from_fork+0x3a/0x50
      
      other info that might help us debug this:
      
      Chain exists of:
        &xfs_nondir_ilock_class --> &dmz->chunk_lock --> fs_reclaim
      
      Possible unsafe locking scenario:
      
           CPU0                    CPU1
           ----                    ----
      lock(fs_reclaim);
                                   lock(&dmz->chunk_lock);
                                   lock(fs_reclaim);
      lock(&xfs_nondir_ilock_class);
      
      *** DEADLOCK ***
      
      3 locks held by kswapd0/84:
       #0: 00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
       #1: 000000000f8208f5 (shrinker_rwsem){++++}, at: shrink_slab.part.47+0x3f/0x590
       #2: 00000000cacefa54 (&type->s_umount_key#43){.+.+}, at: trylock_super+0x16/0x50
      
      stack backtrace:
      CPU: 7 PID: 84 Comm: kswapd0 Not tainted 4.18.0-rc1 #62
      Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
      Call Trace:
       dump_stack+0x85/0xcb
       print_circular_bug.isra.36+0x1ce/0x1db
       __lock_acquire+0x124e/0x1310
       lock_acquire+0x9f/0x1f0
       down_read_nested+0x43/0x70
       xfs_free_eofblocks+0xa2/0x1e0
       xfs_fs_destroy_inode+0xac/0x270
       dispose_list+0x51/0x80
       prune_icache_sb+0x52/0x70
       super_cache_scan+0x127/0x1a0
       shrink_slab.part.47+0x1bd/0x590
       shrink_node+0x3b5/0x470
       balance_pgdat+0x158/0x3b0
       kswapd+0x1ba/0x600
       kthread+0x11c/0x140
       ret_from_fork+0x3a/0x50
      Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
      Fixes: 4218a955 ("dm zoned: use GFP_NOIO in I/O path")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • K
      dm writecache: use 2-factor allocator arguments · 50a7d3ba
      Kees Cook committed
      This adjusts the allocator calls to use the 2-factor argument style, as
      already done treewide for better defense against allocator overflows.
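
Why the 2-factor style helps can be shown in user-space C: multiplying a count by an element size in an open-coded argument can wrap silently, while passing the two factors separately lets the allocator check for overflow first. `mul_overflows_model()` and `alloc_array_checked_model()` are hypothetical helpers written for this sketch; they are not the kernel allocator helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Compute n * size into *bytes and report whether the true product
 * does not fit in size_t. Unsigned multiplication wraps silently. */
bool mul_overflows_model(size_t n, size_t size, size_t *bytes)
{
	assert(bytes != NULL);
	*bytes = n * size;
	return size != 0 && n > SIZE_MAX / size;
}

/* 2-factor allocation: refuse the request instead of returning an
 * undersized buffer when n * size overflows. */
void *alloc_array_checked_model(size_t n, size_t size)
{
	size_t bytes;

	if (mul_overflows_model(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

An open-coded `malloc(n * size)` with the same oversized arguments would request the wrapped byte count and could hand back a buffer far smaller than the caller expects.
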
      Signed-off-by: Kees Cook <keescook@chromium.org>
      [snitzer: tweaked code to leave assignment in a test alone]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • M
      dm thin metadata: remove needless work from __commit_transaction · 7ccdbf85
      Mike Snitzer committed
      Commit 5a32083d ("dm: take care to copy the space map roots before
      locking the superblock") properly removed the calls to dm_sm_root_size()
      from __write_initial_superblock().  But the dm_sm_root_size() calls were
      left dangling in __commit_transaction().
      
      Fixes: 5a32083d ("dm: take care to copy the space map roots before locking the superblock")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • M
      dm: use bio_split() when splitting out the already processed bio · f21c601a
      Mike Snitzer committed
      Use of bio_clone_bioset() is inefficient if there is no need to clone
      the original bio's bio_vec array.  Best to use the bio_clone_fast()
      variant.  Also, just using bio_advance() is only part of what is needed
      to properly set up the clone -- it doesn't account for the various
      bio_integrity() related work that also needs to be performed (see
      bio_split).
      
      Address both of these issues by switching from bio_clone_bioset() to
      bio_split().
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.15+, requires removal of '&' before md->queue->bio_split
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>