1. 08 Aug 2019, 5 commits
  2. 05 Aug 2019, 1 commit
    • blk-mq: add callback of .cleanup_rq · 226b4fc7
      Authored by Ming Lei
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding a
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
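
      A minimal sketch of what the new helper can look like (the callback
      placement in struct blk_mq_ops and the helper body below are assumptions
      consistent with the description above, not a quote of the final code):

          /* Invoke the optional per-driver cleanup hook, if the driver set one. */
          static inline void blk_mq_cleanup_rq(struct request *rq)
          {
                  if (rq->q->mq_ops->cleanup_rq)
                          rq->q->mq_ops->cleanup_rq(rq);
          }

      dm-rq would call this right before freeing a cloned request it is about
      to re-queue, so SCSI gets a chance to release its per-request data.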
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      226b4fc7
  3. 31 Jul 2019, 2 commits
  4. 22 Jul 2019, 1 commit
  5. 17 Jul 2019, 3 commits
    • dm kcopyd: Increase default sub-job size to 512KB · c663e040
      Authored by Nikos Tsironis
      Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
      sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
      I/O in flight.
      
      This upper limit to the amount of in-flight I/O under-utilizes fast
      devices and results in decreased throughput, e.g., when writing to a
      snapshotted thin LV with I/O size less than the pool's block size (so
      COW is performed using kcopyd).
      
      Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
      4MB of I/O in flight for each kcopyd job. This results in an up to 96%
      improvement of bandwidth when writing to a snapshotted thin LV, with I/O
      sizes less than the pool's block size.
      
      Also, add a dm_mod.kcopyd_subjob_size_kb module parameter to allow users
      to fine-tune the sub-job size of kcopyd. The default value of this
      parameter is 512KB and the maximum allowed value is 1024KB.
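
      A minimal sketch of how a bounded module parameter like this can be
      wired up (the helper name and clamping policy are illustrative
      assumptions; only the parameter name and its 512KB/1024KB bounds come
      from the description above):

          #include <linux/kernel.h>
          #include <linux/module.h>

          #define DEFAULT_SUB_JOB_SIZE_KB 512
          #define MAX_SUB_JOB_SIZE_KB     1024

          static unsigned int kcopyd_subjob_size_kb = DEFAULT_SUB_JOB_SIZE_KB;
          module_param(kcopyd_subjob_size_kb, uint, S_IRUGO | S_IWUSR);
          MODULE_PARM_DESC(kcopyd_subjob_size_kb, "Sub-job size for dm-kcopyd clients");

          /* Clamp the user-supplied value (in KB) and convert it to bytes. */
          static unsigned int kcopyd_subjob_size_bytes(void)
          {
                  return clamp_val(kcopyd_subjob_size_kb, 1u, MAX_SUB_JOB_SIZE_KB) << 10;
          }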
      
      We evaluate the performance impact of the change by running the
      snap_breaking_throughput benchmark, from the device mapper test suite
      [1].
      
      The benchmark:
      
        1. Creates a 1G thin LV
        2. Provisions the thin LV
        3. Takes a snapshot of the thin LV
        4. Writes to the thin LV with:
      
            dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>
      
      Running this benchmark with various thin pool block sizes and dd I/O
      sizes (all combinations triggering the use of kcopyd) we get the
      following results:
      
      +-----------------+-------------+------------------+-----------------+
      | Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
      +-----------------+-------------+------------------+-----------------+
      |       1 MB      |      256 KB |       242        |       280       |
      |       1 MB      |      512 KB |       238        |       295       |
      |                 |             |                  |                 |
      |       2 MB      |      256 KB |       238        |       354       |
      |       2 MB      |      512 KB |       241        |       380       |
      |       2 MB      |        1 MB |       245        |       394       |
      |                 |             |                  |                 |
      |       4 MB      |      256 KB |       248        |       412       |
      |       4 MB      |      512 KB |       234        |       432       |
      |       4 MB      |        1 MB |       251        |       474       |
      |       4 MB      |        2 MB |       257        |       504       |
      |                 |             |                  |                 |
      |       8 MB      |      256 KB |       239        |       420       |
      |       8 MB      |      512 KB |       256        |       431       |
      |       8 MB      |        1 MB |       264        |       467       |
      |       8 MB      |        2 MB |       264        |       502       |
      |       8 MB      |        4 MB |       281        |       537       |
      +-----------------+-------------+------------------+-----------------+
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      c663e040
    • dm snapshot: fix oversights in optional discard support · 3ee25485
      Authored by Mike Snitzer
      __find_snapshots_sharing_cow() should always be used with _origins_lock
      held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
      being merged discards must not be allowed -- otherwise incorrect or
      duplicate work will be performed.
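
      A minimal sketch of the locking part of the fix in snapshot_io_hints(),
      assuming the usual dm-snap.c conventions (_origins_lock as a rwsem and
      the __find_snapshots_sharing_cow() argument list as used by its other
      callers):

          down_read(&_origins_lock);
          (void) __find_snapshots_sharing_cow(snap, &snap_src, &snap_dest, NULL);
          /* ... adjust discard limits based on snap_src/snap_dest ... */
          up_read(&_origins_lock);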
      
      Fixes: 2e602385 ("dm snapshot: add optional discard support features")
      Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3ee25485
    • dm zoned: fix zone state management race · 3b8cafdd
      Authored by Damien Le Moal
      dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
      backend device is being actively read or written and so cannot be
      reclaimed. This flag is set as long as the zone atomic reference
      counter is not 0. When this atomic is decremented and reaches 0 (e.g.
      on BIO completion), the active flag is cleared; it is set again whenever
      the zone is reused and a BIO is issued, with the atomic counter incremented.
      These 2 operations (atomic inc/dec and flag set/clear) are however not
      always executed atomically under the target metadata mutex lock and
      this causes the warning:
      
      WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));
      
      in dmz_deactivate_zone() to be displayed. This problem is regularly
      triggered with xfstests generic/209, generic/300, generic/451 and
      xfs/077 with XFS being used as the file system on the dm-zoned target
      device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
      generic/300 trigger the warning with ext4 use.
      
      This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
      and managing the "ACTIVE" state by directly looking at the reference
      counter value. To do so, the functions dmz_activate_zone() and
      dmz_deactivate_zone() are changed to inline functions respectively
      calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
      is changed to an inline function calling atomic_read().
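
      A minimal sketch of the resulting helpers, assuming the zone reference
      counter is an atomic_t field named refcount in struct dm_zone:

          static inline void dmz_activate_zone(struct dm_zone *zone)
          {
                  atomic_inc(&zone->refcount);
          }

          static inline void dmz_deactivate_zone(struct dm_zone *zone)
          {
                  atomic_dec(&zone->refcount);
          }

          static inline bool dmz_is_active(struct dm_zone *zone)
          {
                  /* "Active" now simply means the refcount is non-zero. */
                  return atomic_read(&zone->refcount) != 0;
          }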
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      3b8cafdd
  6. 15 Jul 2019, 1 commit
  7. 12 Jul 2019, 3 commits
    • dm bufio: fix deadlock with loop device · bd293d07
      Authored by Junxiao Bi
      When a thin volume is built on a loop device, if available memory is low,
      the following deadlock can be triggered:
      
      One process P1 allocates memory with GFP_FS flag, direct alloc fails,
      memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
      IO to complete in __try_evict_buffer().
      
      But this IO may never complete if issued to an underlying loop device
      that forwards it using direct-IO, which allocates memory using
      GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
      reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
      will be invoked, and since the mutex is already held by P1 the loop
      thread will hang and IO will never complete, resulting in an ABBA
      deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      bd293d07
    • dm snapshot: add optional discard support features · 2e602385
      Authored by Mike Snitzer
      discard_zeroes_cow - a discard issued to the snapshot device that maps
      to entire chunks will zero the corresponding exception(s) in the
      snapshot's exception store.
      
      discard_passdown_origin - a discard to the snapshot device is passed down
      to the snapshot-origin's underlying device.  This doesn't cause copy-out
      to the snapshot exception store because the snapshot-origin target is
      bypassed.
      
      The discard_passdown_origin feature depends on the discard_zeroes_cow
      feature being enabled.
      
      When these 2 features are enabled they allow a temporarily read-only
      device that has completely exhausted its free space to recover space.
      To do so dm-snapshot provides a temporary buffer to accommodate writes
      that the temporarily read-only device cannot handle yet.  Once the upper
      layer frees space (e.g. fstrim to XFS) the discards issued to the
      dm-snapshot target will be issued to the underlying read-only device whose
      free space was exhausted.  In addition those discards will also cause
      zeroes to be written to the snapshot exception store if corresponding
      exceptions exist.  If the underlying origin device provides
      deduplication for zero blocks then if/when the snapshot is merged back
      to the origin those blocks will become unused.  Once the origin has
      gained adequate space, merging the snapshot back to the thinly
      provisioned device will permit continued use of that device without the
      temporary space provided by the snapshot.
      Requested-by: John Dorminy <jdorminy@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      2e602385
    • block: Kill gfp_t argument of blkdev_report_zones() · bd976e52
      Authored by Damien Le Moal
      Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
      preparation for using vmalloc() for large report buffer and zone array
      allocations used by this function, remove its "gfp_t gfp_mask" argument
      and rely on the caller context to use memalloc_noio_save/restore() where
      necessary (block layer zone revalidation and dm-zoned I/O error path).
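
      A minimal sketch of the caller-side pattern, assuming the post-change
      blkdev_report_zones() prototype takes just the device, start sector,
      zone array and in/out zone count:

          unsigned int noio_flag;
          int ret;

          /* Forbid I/O-triggering allocations for the duration of the report. */
          noio_flag = memalloc_noio_save();
          ret = blkdev_report_zones(bdev, sector, zones, &nr_zones);
          memalloc_noio_restore(noio_flag);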
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bd976e52
  8. 11 Jul 2019, 1 commit
  9. 10 Jul 2019, 9 commits
  10. 06 Jul 2019, 2 commits
  11. 03 Jul 2019, 1 commit
  12. 28 Jun 2019, 11 commits
    • bcache: add reclaimed_journal_buckets to struct cache_set · dff90d58
      Authored by Coly Li
      Now we have counters for how many times the journal is reclaimed and how
      many times cached dirty btree nodes are flushed, but we don't know how
      many journal buckets are really reclaimed.
      
      This patch adds reclaimed_journal_buckets into struct cache_set; it is
      an increase-only counter that tells how many journal buckets have been
      reclaimed since the cache set started running. From these three counters
      (reclaim, reclaimed_journal_buckets, flush_write), we can get an idea of
      how well the current journal space reclaim code works.
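
      A minimal sketch of the counter, assuming it is kept as an atomic_long_t
      in struct cache_set and bumped once per bucket in the journal reclaim
      path (field placement and exact call site are assumptions):

          /* In struct cache_set: */
          atomic_long_t           reclaimed_journal_buckets;

          /* In the journal reclaim path, once per bucket handed back: */
          atomic_long_inc(&c->reclaimed_journal_buckets);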
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dff90d58
    • bcache: performance improvement for btree_flush_write() · 91be66e1
      Authored by Coly Li
      This patch improves the performance of btree_flush_write() in the
      following ways:
      - Use another spinlock journal.flush_write_lock to replace the very
        hot journal.lock. We don't have to use journal.lock here: selecting
        candidate btree nodes takes a lot of time, and holding journal.lock
        would block other journaling threads and drop the overall I/O
        performance.
      - Only select btree nodes to flush from the c->btree_cache list. When
        the machine has a large amount of system memory, the mca cache may
        hold a huge number of cached btree nodes. Iterating all the cached
        nodes takes a lot of CPU time, and most of the nodes on the
        c->btree_cache_freeable and c->btree_cache_freed lists are cleared
        and have no need to be flushed. So traversing only the mca list
        c->btree_cache to select btree nodes to flush should be enough for
        most cases.
      - Don't iterate the whole c->btree_cache list; only select, in reverse
        order, the first BTREE_FLUSH_NR btree nodes to flush (see the sketch
        after this list). Iterating all btree nodes from c->btree_cache and
        selecting those with the oldest journal pins consumes a huge number
        of CPU cycles if the list is huge (pushing and popping a node
        into/out of a heap is expensive). The last several dirty btree nodes
        on the tail of the c->btree_cache list are the earliest allocated and
        cached btree nodes, so they tend to hold the oldest journal pins.
        Therefore flushing only BTREE_FLUSH_NR btree nodes from the tail of
        c->btree_cache probably covers the btree nodes with the oldest
        journal pins.
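
      A minimal sketch of the tail selection, assuming btree nodes are linked
      on c->btree_cache through their list member and that BTREE_FLUSH_NR,
      the btree_nodes[] array and journal.flush_write_lock are the names this
      patch introduces (illustrative, not the exact upstream code):

          struct btree *b;
          unsigned int nr = 0;

          spin_lock(&c->journal.flush_write_lock);
          /* Walk from the tail: the oldest cached, likely oldest-pinned, nodes. */
          list_for_each_entry_reverse(b, &c->btree_cache, list) {
                  if (!btree_node_dirty(b))
                          continue;
                  btree_nodes[nr++] = b;
                  if (nr == BTREE_FLUSH_NR)
                          break;
          }
          spin_unlock(&c->journal.flush_write_lock);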
      
      In my testing, the above change decreases CPU consumption by 50%+ when
      journal space is full. Sometimes IOPS drops to 0 for 5-8 seconds, but
      compared with the 120+ seconds of blocked I/O with the previous code,
      this is much better. Maybe there is room to improve in the future, but
      at this moment the fix looks fine and performs well in my testing.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91be66e1
    • bcache: fix race in btree_flush_write() · 50a260e8
      Authored by Coly Li
      There is a race between mca_reap(), btree_node_free() and the journal
      code btree_flush_write(), which results in a very rare and strange
      deadlock or panic that is very hard to reproduce.
      
      Let me explain how the race happens. In btree_flush_write() the btree
      node with the oldest journal pin is selected, then it is flushed to the
      cache device; the select-and-flush is a two-step operation. Between
      these two steps, the following may happen inside the race window:
      - The selected btree node is reaped by mca_reap() and its memory is
        allocated to another btree node for other requesters.
      - The selected btree node is selected, flushed and released by the mca
        shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, it
      first takes b->write_lock with mutex_lock(). If the race happens and
      the memory of the selected btree node has been allocated to another
      btree node whose write_lock is already held, a deadlock very probably
      happens here. A worse case is when the memory of the selected btree
      node has been released; then any reference to this btree node (e.g.
      b->write_lock) will trigger a NULL pointer dereference panic.
      
      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one-by-one over quite a long time period.
      
      Such a race was not easy to reproduce before. On a Lenovo SR650 server
      with 48 Xeon cores, with 1 NVMe SSD configured as the cache device and
      an MD raid0 device assembled from 3 NVMe SSDs as the backing device,
      this race can be observed around once every 10,000 calls to
      btree_flush_write(). Both deadlock and kernel panic have happened as
      aftermath of the race.
      
      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
      is set when selecting btree nodes, and cleared after the btree nodes are
      flushed. Then when mca_reap() encounters a btree node with this bit set,
      that btree node will be skipped. Since mca_reap() only reaps btree nodes
      without the BTREE_NODE_journal_flush flag, such a race is avoided.
      
      One corner case should be noticed: btree_node_free(). It might be
      called in some error handling code paths. For example, the following
      code piece from btree_split():
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At lines 2151 and 2155, the btree nodes n2 and n1 are released without
      mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
      If btree_node_free() is called directly in such an error handling path
      and the selected btree node has the BTREE_NODE_journal_flush bit set,
      just delay for 1 us and retry. In this case the btree node won't be
      skipped; btree_node_free() retries until the BTREE_NODE_journal_flush
      bit is cleared, and then frees the btree node memory.
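
      A minimal sketch of that retry loop in btree_node_free(), assuming a
      btree_node_journal_flush() test macro paired with the new
      BTREE_NODE_journal_flush bit (names follow the description above):

          retry:
                  mutex_lock(&b->write_lock);
                  if (btree_node_journal_flush(b)) {
                          /* Still being flushed by btree_flush_write(); back off briefly. */
                          mutex_unlock(&b->write_lock);
                          udelay(1);
                          goto retry;
                  }
                  /* Now it is safe to complete the write and free the node. */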
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-and-tested-by: kbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      50a260e8
    • bcache: remove retry_flush_write from struct cache_set · d91ce757
      Authored by Coly Li
      In struct cache_set, retry_flush_write was added by commit c4dc2497
      ("bcache: fix high CPU occupancy during journal"), which is reverted in
      the previous patch.
      
      Now it is useless, and this patch removes it from the bcache code.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d91ce757
    • bcache: add comments for mutex_lock(&b->write_lock) · 41508bb7
      Authored by Coly Li
      When accessing or modifying the BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, so this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing the BTREE_NODE_dirty bit there.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      41508bb7
    • bcache: only clear BTREE_NODE_dirty bit when it is set · e5ec5f47
      Authored by Coly Li
      In bch_btree_cache_free() and btree_node_free(), the BTREE_NODE_dirty
      bit is always cleared no matter whether the btree node is dirty or not.
      The code looks like this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed, if btree_node_dirty(b) returns false, the BTREE_NODE_dirty bit
      is already cleared, so it is unnecessary to clear the bit again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
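
      A sketch of the intended shape after the fix, with the clear_bit() moved
      under the same dirty check:

      	if (btree_node_dirty(b)) {
      		btree_complete_write(b, btree_current_write(b));
      		clear_bit(BTREE_NODE_dirty, &b->flags);
      	}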
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e5ec5f47
    • bcache: Revert "bcache: fix high CPU occupancy during journal" · 249a5f6d
      Authored by Coly Li
      This reverts commit c4dc2497.
      
      The reverted patch enlarged a race between the normal btree flush code
      path and btree_flush_write(), which causes a deadlock when journal
      space is exhausted. Reverting it shrinks the race window from 128 btree
      nodes to only 1 btree node.
      
      Fixes: c4dc2497 ("bcache: fix high CPU occupancy during journal")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Tang Junhui <tang.junhui.linux@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      249a5f6d
    • bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" · ba82c1ac
      Authored by Coly Li
      This reverts commit 6268dc2c.
      
      The reverted commit depends on commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which is reverted in the previous patch. So
      revert this one too.
      
      Fixes: 6268dc2c ("bcache: free heap cache_set->flush_btree in bch_journal_free")
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ba82c1ac
    • bcache: shrink btree node cache after bch_btree_check() · 1df3877f
      Authored by Coly Li
      When a cache set starts, bch_btree_check() will check all bkeys on the
      cache device by calculating their checksums. This operation will consume
      a huge amount of system memory if there is a lot of data cached. Since
      bcache uses its own mca cache to maintain all its read-in btree nodes,
      it only releases the cache space when the system memory management code
      starts to shrink caches. So before the memory manager calls the mca
      cache shrinker callback, the bcache mca cache will compete for memory
      with user space applications, which may have a negative effect on the
      performance of user space workloads (e.g. databases, or the I/O service
      of a distributed storage node).
      
      This patch calls the bcache mca shrinker routine to proactively release
      mca cache memory, to decrease the memory pressure on the system and
      avoid a negative effect on the overall system I/O performance.
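
      A minimal sketch of proactively invoking the registered mca shrinker
      once bch_btree_check() returns, assuming cache_set keeps its shrinker in
      c->shrink and counts cached nodes in c->btree_cache_used (names are
      assumptions about the mca code, not guaranteed to match exactly):

          struct shrink_control sc;

          sc.gfp_mask = GFP_KERNEL;
          sc.nr_to_scan = c->btree_cache_used * c->btree_pages;
          /* First pass clears the accessed bits, second pass actually reaps nodes. */
          c->shrink.scan_objects(&c->shrink, &sc);
          c->shrink.scan_objects(&c->shrink, &sc);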
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1df3877f
    • bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() · a231f07a
      Authored by Coly Li
      In journal_read_bucket(), when setting ja->seq[bucket_index], there is
      a potential case where a later, non-maximum sequence number overwrites
      a better sequence number in ja->seq[bucket_index]. This patch adds a
      check to make sure that ja->seq[bucket_index] is only set to a new
      value if it is bigger than the current value.
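
      A minimal sketch of the added check, assuming j points to the journal
      set (struct jset) just decoded from this bucket:

          /* Only record a newly seen sequence number if it is larger. */
          if (j->seq > ja->seq[bucket_index])
                  ja->seq[bucket_index] = j->seq;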
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a231f07a
    • bcache: add code comments for journal_read_bucket() · 2464b693
      Authored by Coly Li
      This patch adds more code comments in journal_read_bucket(); this is an
      effort to make the code more understandable.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2464b693