1. 18 12月, 2018 4 次提交
    • M
      blk-mq: skip zero-queue maps in blk_mq_map_swqueue · e5edd5f2
      Ming Lei 提交于
      From 7e849dd9 ("nvme-pci: don't share queue maps"), the mapping
      table won't be initialized actually if map->nr_queues is zero, so
      we can't use blk_mq_map_queue_type() to retrieve hctx any more.
      
      This way still may cause broken mapping, fix it by skipping zero-queues
      maps in blk_mq_map_swqueue().
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e5edd5f2
    • D
      block: fix blk-iolatency accounting underflow · 13369816
      Dennis Zhou 提交于
      The blk-iolatency controller measures the time from rq_qos_throttle() to
      rq_qos_done_bio() and attributes this time to the first bio that needs
      to create the request. This means if a bio is plug-mergeable or
      bio-mergeable, it gets to bypass the blk-iolatency controller.
      
      The recent series [1], to tag all bios w/ blkgs undermined how iolatency
      was determining which bios it was charging and should process in
      rq_qos_done_bio(). Because all bios are being tagged, this caused the
      atomic_t for the struct rq_wait inflight count to underflow and result
      in a stall.
      
      This patch adds a new flag BIO_TRACKED to let controllers know that a
      bio is going through the rq_qos path. blk-iolatency now checks if this
      flag is set to see if it should process the bio in rq_qos_done_bio().
      
      Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
      BIO_THROTTLED was another candidate, but the flag is set for all bios
      that have gone through blk-throttle code. Overloading a flag comes with
      the burden of making sure that when either implementation changes, a
      change in setting rules for one doesn't cause a bug in the other. So
      here, we unfortunately opt for adding a new flag.
      
      [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      13369816
    • M
      blk-mq: fix dispatch from sw queue · c16d6b5a
      Ming Lei 提交于
      When a request is added to rq list of sw queue(ctx), the rq may be from
      a different type of hctx, especially after multi queue mapping is
      introduced.
      
      So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or
      blk_mq_dequeue_from_ctx(), one request belonging to other queue type of
      hctx can be dispatched to current hctx in case that read queue or poll
      queue is enabled.
      
      This patch fixes this issue by introducing per-queue-type list.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      
      Changed by me to not use separately cacheline aligned lists, just
      place them all in the same cacheline where we had just the one list
      and lock before.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c16d6b5a
    • D
      block: mq-deadline: Fix write completion handling · 7211aef8
      Damien Le Moal 提交于
      For a zoned block device using mq-deadline, if a write request for a
      zone is received while another write was already dispatched for the same
      zone, dd_dispatch_request() will return NULL and the newly inserted
      write request is kept in the scheduler queue waiting for the ongoing
      zone write to complete. With this behavior, when no other request has
      been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
      and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to
      __blk_mq_free_request() call of blk_mq_sched_restart() to not run the
      queue when the already dispatched write request completes. The newly
      dispatched request stays stuck in the scheduler queue until eventually
      another request is submitted.
      
      This problem does not affect SCSI disk as the SCSI stack handles queue
      restart on request completion. However, this problem is can be triggered
      the nullblk driver with zoned mode enabled.
      
      Fix this by always requesting a queue restart in dd_dispatch_request()
      if no request was dispatched while WRITE requests are queued.
      
      Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      
      Add missing export of blk_mq_sched_restart()
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7211aef8
  2. 17 12月, 2018 10 次提交
  3. 16 12月, 2018 5 次提交
  4. 14 12月, 2018 6 次提交
  5. 13 12月, 2018 15 次提交
    • G
      bcache: print number of keys in trace_bcache_journal_write · e78bd0d2
      Guoju Fang 提交于
      Sometimes flush journal may be very frequent, so it's useful to dump
      number of keys every time write journal.
      Signed-off-by: NGuoju Fang <fangguoju@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e78bd0d2
    • C
      bcache: set writeback_percent in a flexible range · cc38ca7e
      Coly Li 提交于
      Because CUTOFF_WRITEBACK is defined as 40, so before the changes of
      dynamic cutoff writeback values, writeback_percent is limited to [0,
      CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK will be fixed
      up to 40.
      
      Now cutof writeback limit is a dynamic value bch_cutoff_writeback, so
      the range of writeback_percent can be a more flexible range as [0,
      bch_cutoff_writeback]. The flexibility is, it can be expended to a
      larger or smaller range than [0, 40], depends on how value
      bch_cutoff_writeback is specified.
      
      The default value is still strongly recommended to most of users for
      most of workloads. But for people who want to do research on bcache
      writeback perforamnce tuning, they may have chance to specify more
      flexible writeback_percent in range [0, 70].
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cc38ca7e
    • C
      bcache: make cutoff_writeback and cutoff_writeback_sync tunable · 9aaf5165
      Coly Li 提交于
      Currently the cutoff writeback and cutoff writeback sync thresholds are
      defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
      static values. Most of time these they work fine, but when people want
      to do research on bcache writeback mode performance tuning, there is no
      chance to modify the soft and hard cutoff writeback values.
      
      This patch introduces two module parameters bch_cutoff_writeback_sync
      and bch_cutoff_writeback which permit people to tune the values when
      loading bcache.ko. If they are not specified by module loading, current
      values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK will be used as
      default and nothing changes.
      
      When people want to tune this two values,
      - cutoff_writeback can be set in range [1, 70]
      - cutoff_writeback_sync can be set in range [1, 90]
      - cutoff_writeback always <= cutoff_writeback_sync
      
      The default values are strongly recommended to most of users for most of
      workloads. Anyway, if people wants to take their own risk to do research
      on new writeback cutoff tuning for their own workload, now they can make
      it.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9aaf5165
    • C
      bcache: add MODULE_DESCRIPTION information · 009673d0
      Coly Li 提交于
      This patch moves MODULE_AUTHOR and MODULE_LICENSE to end of super.c, and
      add MODULE_DESCRIPTION("Bcache: a Linux block layer cache").
      
      This is preparation for adding module parameters.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      009673d0
    • C
      bcache: option to automatically run gc thread after writeback · 7a671d8e
      Coly Li 提交于
      The option gc_after_writeback is disabled by default, because garbage
      collection will discard SSD data which drops cached data.
      
      Echo 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
      enable this option, which wakes up gc thread when writeback accomplished
      and all cached data is clean.
      
      This option is helpful for people who cares writing performance more. In
      heavy writing workload, all cached data can be clean only happens when
      writeback thread cleans all cached data in I/O idle time. In such
      situation a following gc running may help to shrink bcache B+ tree and
      discard more clean data, which may be helpful for future writing
      requests.
      
      If you are not sure whether this is helpful for your own workload,
      please leave it as disabled by default.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7a671d8e
    • C
      bcache: introduce force_wake_up_gc() · cb07ad63
      Coly Li 提交于
      Garbage collection thread starts to work when c->sectors_to_gc is
      negative value, otherwise nothing will happen even the gc thread is
      woken up by wake_up_gc().
      
      force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
      wake_up_gc(), then gc thread may have chance to run if no one else sets
      c->sectors_to_gc to a positive value before gc_should_run().
      
      This routine can be called where the gc thread is woken up and required
      to run in force.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cb07ad63
    • S
      bcache: cannot set writeback_running via sysfs if no writeback kthread created · f383ae30
      Shenghui Wang 提交于
      "echo 1 > writeback_running" marks writeback_running even if no
      writeback kthread created as "d_strtoul(writeback_running)" will simply
      set dc-> writeback_running without checking the existence of
      dc->writeback_thread.
      
      Add check for setting writeback_running via sysfs: if no writeback
      kthread available, reject setting to 1.
      
      v2 -> v3:
        * Make message on wrong assignment more clear.
        * Print name of bcache device instead of name of backing device.
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f383ae30
    • S
      bcache: do not mark writeback_running too early · 79b79146
      Shenghui Wang 提交于
      A fresh backing device is not attached to any cache_set, and
      has no writeback kthread created until first attached to some
      cache_set.
      
      But bch_cached_dev_writeback_init run
      "
      	dc->writeback_running		= true;
      	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
      			&dc->disk.flags));
      "
      for any newly formatted backing devices.
      
      For a fresh standalone backing device, we can get something like
      following even if no writeback kthread created:
      ------------------------
      /sys/block/bcache0/bcache# cat writeback_running
      1
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		512.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	-15427384ms
      
      The none ZERO fields are misleading as no alive writeback kthread yet.
      
      Set dc->writeback_running false as no writeback thread created in
      bch_cached_dev_writeback_init().
      
      We have writeback thread created and woken up in bch_cached_dev_writeback
      _start(). Set dc->writeback_running true before bch_writeback_queue()
      called, as a writeback thread will check if dc->writeback_running is true
      before writing back dirty data, and hung if false detected.
      
      After the change, we can get the following output for a fresh standalone
      backing device:
      -----------------------
      /sys/block/bcache0/bcache$ cat writeback_running
      0
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		0.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	0ms
      
      v1 -> v2:
        Set dc->writeback_running before bch_writeback_queue() called,
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      79b79146
    • S
      bcache: update comment in sysfs.c · 4e361e02
      Shenghui Wang 提交于
      We have struct cached_dev allocated by kzalloc in register_bcache(),
      which initializes all the fields of cached_dev with 0s. And commit
      ce4c3e19 ("bcache: Replace bch_read_string_list() by
      __sysfs_match_string()") has remove the string "default".
      
      Update the comment.
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4e361e02
    • S
      bcache: update comment for bch_data_insert · 3db4d078
      Shenghui Wang 提交于
      commit 220bb38c ("bcache: Break up struct search") introduced
      changes to struct search and s->iop. bypass/bio are fields of struct
      data_insert_op now. Update the comment.
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3db4d078
    • S
      bcache: do not check if debug dentry is ERR or NULL explicitly on remove · ae171023
      Shenghui Wang 提交于
      debugfs_remove and debugfs_remove_recursive will check if the dentry
      pointer is NULL or ERR, and will do nothing in that case.
      
      Remove the check in cache_set_free and bch_debug_init.
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ae171023
    • S
      bcache: add comment for cache_set->fill_iter · d2f96f48
      Shenghui Wang 提交于
      We have the following define for btree iterator:
      	struct btree_iter {
      		size_t size, used;
      	#ifdef CONFIG_BCACHE_DEBUG
      		struct btree_keys *b;
      	#endif
      		struct btree_iter_set {
      			struct bkey *k, *end;
      		} data[MAX_BSETS];
      	};
      
      We can see that the length of data[] field is static MAX_BSETS, which is
      defined as 4 currently.
      
      But a btree node on disk could have too many bsets for an iterator to fit
      on the stack - maybe far more that MAX_BSETS. Have to dynamically allocate
      space to host more btree_iter_sets.
      
      bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
      allocate an iterator equipped with enough room that can host
      	(sb.bucket_size / sb.block_size)
      btree_iter_sets, which is more than static MAX_BSETS.
      
      bch_btree_node_read_done() will use that pool to allocate one iterator, to
      host many bsets in one btree node.
      
      Add more comment around cache_set->fill_iter to make code less confusing.
      Signed-off-by: NShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d2f96f48
    • S
      nvme-rdma: support separate queue maps for read and write · b65bb777
      Sagi Grimberg 提交于
      llow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues.  In
      addition, implement .map_queues that will apply 2 queue maps for read
      and write queue sets.
      
      Note that with the separate queue map, HCTX_TYPE_READ will always use
      nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      b65bb777
    • S
      nvme-tcp: support separate queue maps for read and write · 873946f4
      Sagi Grimberg 提交于
      Allow NVMF_OPT_NR_WRITE_QUEUES to describe additional write queues.  In
      addition, implement .map_queues that will apply 2 queue maps for read
      and write queue sets.
      
      Note that with the separate queue map, HCTX_TYPE_READ will always use
      nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      873946f4
    • S
      nvme-fabrics: allow user to set nr_write_queues for separate queue maps · 330f6b8a
      Sagi Grimberg 提交于
      This argument will specify how many I/O queues will be connected in
      create_ctrl in addition to nr_io_queues. With this configuration, I/O
      that carries payload from the host to the target, will use the default
      hctx queue map, and I/O that involves target to host transfers will use
      the read hctx queue map.
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      330f6b8a