1. 15 Nov, 2022 1 commit
  2. 11 Nov, 2022 3 commits
    • sbitmap: Use single per-bitmap counting to wake up queued tags · 4f8126bb
      Gabriel Krisman Bertazi committed
      sbitmap suffers from code complexity, as demonstrated by recent fixes,
      and from occasional lost wakeups on nested I/O completion.  The latter
      happens, from what I understand, due to the non-atomic nature of the
      updates to wait_cnt, which needs to be decremented and then reset once
      it reaches zero.  This two-step process can miss an update when a
      nested completion interrupts the CPU between the two wait_cnt updates.
      This is very hard to fix, as shown by the recent changes to this code.
      
      The code complexity arises mostly from the corner cases to avoid missed
      wakes in this scenario.  In addition, the handling of wake_batch
      recalculation plus the synchronization with sbq_queue_wake_up is
      non-trivial.
      
      This patchset implements the idea originally proposed by Jan [1], which
      removes the need for the two-step updates of wait_cnt.  This is done by
      tracking the number of completions and wakeups in always increasing,
      per-bitmap counters.  Instead of having to reset the wait_cnt when it
      reaches zero, we simply keep counting, and attempt to wake up N threads
      in a single wait queue whenever there is enough space for a batch.
      Waking up fewer than wake_batch shouldn't be a problem, because we
      haven't changed the conditions for wake up, and the existing batch
      calculation guarantees at least enough remaining completions to wake up
      a batch for each queue at any time.
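
The counting scheme described above can be sketched in userspace C11 as follows. This is a minimal model under stated assumptions, not the code from the patch: `struct batch_counter`, `batch_counter_init` and `account_completions` are hypothetical names, `waiters_active` merely stands in for ws_active, and the wait-queue selection and the actual wakeup call are left out.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical userspace model of the counting scheme described above.
 * Struct and function names are illustrative, not the kernel's.
 */
struct batch_counter {
    atomic_uint completion_cnt; /* completions seen while queued; only grows */
    atomic_uint wakeup_cnt;     /* wakeups handed out; grows by wake_batch */
    atomic_int waiters_active;  /* stand-in for ws_active */
    unsigned int wake_batch;    /* waiters woken per batch */
};

static void batch_counter_init(struct batch_counter *bc, unsigned int wake_batch)
{
    assert(wake_batch > 0);
    atomic_init(&bc->completion_cnt, 0);
    atomic_init(&bc->wakeup_cnt, 0);
    atomic_init(&bc->waiters_active, 0);
    bc->wake_batch = wake_batch;
}

/*
 * Account nr completions. Returns how many waiters the caller should wake:
 * wake_batch if this call claimed a batch, 0 otherwise.
 */
static unsigned int account_completions(struct batch_counter *bc, unsigned int nr)
{
    unsigned int wakeups;

    /* No queueing: skip the atomic read-modify-write operations entirely. */
    if (atomic_load(&bc->waiters_active) == 0)
        return 0;

    atomic_fetch_add(&bc->completion_cnt, nr);
    wakeups = atomic_load(&bc->wakeup_cnt);

    do {
        /* Unsigned subtraction stays correct when the counters wrap. */
        if (atomic_load(&bc->completion_cnt) - wakeups < bc->wake_batch)
            return 0;
        /* On failure the CAS reloads wakeups and the check is repeated. */
    } while (!atomic_compare_exchange_weak(&bc->wakeup_cnt, &wakeups,
                                           wakeups + bc->wake_batch));

    return bc->wake_batch;
}
```

In this model neither counter is ever reset, so there is no window between a decrement and a reset in which a nested completion could be lost. A batch is claimed with a single compare-and-swap on `wakeup_cnt`, and each call claims at most one batch. With `wake_batch` set to 4 and waiters active, ten single completions claim two batches (8 wakeups) and leave two completions counted towards the next batch.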
      
      Performance-wise, one should expect very similar performance to the
      original algorithm for the case where there is no queueing.  In both the
      old algorithm and this implementation, the first thing is to check
      ws_active, which bails out if there is no queueing to be managed. In the
      new code, we took care to avoid accounting completions and wakeups when
      there is no queueing, to not pay the cost of atomic operations
      unnecessarily, since it doesn't skew the numbers.
      
      For more interesting cases, where there is queueing, we need to take
      into account the cross-communication of the atomic operations.  I've
      been benchmarking by running parallel fio jobs against a single hctx
      nullb in different hardware queue depth scenarios, and verifying both
      IOPS and queueing.
      
      Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
      jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
      varying only the hardware queue length per test.
      
      queue size 2                 4                 8                 16                 32                 64
      6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
      patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
      
      The following is a similar experiment, run against a nullb with a single
      bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
      parallel fio jobs operating on the same device.
      
      queue size 2                 4                 8                 16                 32                 64
      6.1-rc2    1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)   12227.9K (37.7K)   13286.6K (92.9K)
      patched    1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
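
To read the two tables as relative changes, the following sketch computes the percentage difference of the patched kernel over 6.1-rc2 per queue size. The helper name `relative_change_pct` and the array names are hypothetical; the numbers are the first value of each cell copied from the tables (assumed here to be mean IOPS in thousands), and the parenthesized values are left out.

```c
#include <assert.h>

/* Relative change of patched over baseline, in percent. */
static double relative_change_pct(double baseline, double patched)
{
    assert(baseline > 0.0);
    return (patched - baseline) / baseline * 100.0;
}

/* Columns are queue sizes 2, 4, 8, 16, 32, 64; values copied from the tables. */
static const double single_hctx_base[6] = {
    1681.1, 2633.0, 6940.8, 8172.3, 8391.7, 8606.1
};
static const double single_hctx_patched[6] = {
    1721.8, 3016.7, 7543.0, 8132.5, 8324.2, 8401.8
};
static const double shared_bitmap_base[6] = {
    1081.0, 957.2, 1699.1, 6178.2, 12227.9, 13286.6
};
static const double shared_bitmap_patched[6] = {
    1081.8, 1316.5, 2364.4, 6151.4, 11893.6, 12385.6
};
```

For queue size 4, for example, `relative_change_pct(single_hctx_base[1], single_hctx_patched[1])` comes to about +14.6%, and the shared-bitmap case to about +37.5%. At queue size 64 the shared-bitmap case comes to about -6.8%.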
      
      It has also survived blktests and a 12h-stress run against nullb. I also
      ran the code against nvme and a scsi SSD, and I didn't observe
      performance regression in those. If there are other tests you think I
      should run, please let me know and I will follow up with results.
      
      [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Liu Song <liusong@linux.alibaba.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: simplify blk_mq_realloc_tag_set_tags · ee9d5521
      Christoph Hellwig committed
      Use set->nr_hw_queues for the current number of tags, and remove the
      duplicate set->nr_hw_queues update in the caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: remove blk_mq_alloc_tag_set_tags · 5ee20298
      Christoph Hellwig committed
      There is no point in trying to share any code with the realloc case when
      all that is needed by the initial tagset allocation is a simple
      kcalloc_node.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 10 Nov, 2022 14 commits
  4. 07 Nov, 2022 1 commit
  5. 02 Nov, 2022 21 commits