1. 16 October 2017, 5 commits
    • bcache: rearrange writeback main thread ratelimit · a8500fc8
      Committed by Michael Lyle
      The time spent searching for things to write back "counts" toward the
      actual rate achieved, so don't flush the accumulated rate with each
      chunk.
      
      This will maintain better fidelity to user-commanded rates, but it
      may slightly increase the burstiness of writeback.  The writeback
      lock needs improvement to help mitigate this.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
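      A minimal sketch of the idea in plain C (the names and structure are
      simplified, not the actual bch_next_delay() code): work is credited
      against an absolute deadline, so time spent searching for dirty data
      between calls automatically counts toward the achieved rate, and the
      limiter is consulted once per batch rather than once per chunk.

      #include <stdint.h>

      /* Illustrative rate limiter, not the bcache implementation. */
      struct ratelimit {
              uint64_t next_ns;  /* absolute time the next unit may start */
              uint32_t rate;     /* target rate, units per second         */
      };

      /* Credit 'done' units of work and report how long to sleep (ns). */
      static uint64_t next_delay(struct ratelimit *d, uint64_t done, uint64_t now_ns)
      {
              d->next_ns += done * 1000000000ULL / d->rate;

              /* Cap how much credit an idle or searching period can bank. */
              if (d->next_ns + 1000000000ULL < now_ns)
                      d->next_ns = now_ns - 1000000000ULL;

              return d->next_ns > now_ns ? d->next_ns - now_ns : 0;
      }
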
    • bcache: writeback rate shouldn't artificially clamp · e41166c5
      Committed by Michael Lyle
      The previous code artificially limited writeback rate to 1000000
      blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
      hardware.  The rate limiting code works fine (though with decreased
      precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.
      
      Additionally, ensure that uint32_t is used as a type for rate throughout
      the rate management so that type checking/clamp_t can work properly.
      
      bch_next_delay should be rewritten for increased precision and better
      handling of high rates and long sleep periods, but this is adequate for
      now.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reported-by: Coly Li <colyli@suse.de>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
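      For illustration, a stand-alone C sketch of the new clamping (clamp_t()
      here is a simplified stand-in for the kernel macro; the surrounding
      bcache code is not reproduced): the computed rate is kept in a uint32_t
      and capped at NSEC_PER_SEC instead of NSEC_PER_MSEC.

      #include <stdint.h>

      #define NSEC_PER_MSEC 1000000L     /* old, artificially low ceiling */
      #define NSEC_PER_SEC  1000000000L  /* new ceiling, 1000x higher     */

      /* Simplified stand-in for the kernel's clamp_t(): clamp after casting
       * everything to one type so signed/unsigned comparisons stay sane. */
      #define clamp_t(type, val, lo, hi) \
              ((type)(val) < (type)(lo) ? (type)(lo) : \
               (type)(val) > (type)(hi) ? (type)(hi) : (type)(val))

      static uint32_t clamp_writeback_rate(int64_t computed_rate)
      {
              /* Illustrative only: the rate is stored as uint32_t sectors/sec. */
              return clamp_t(int64_t, computed_rate, 1, NSEC_PER_SEC);
      }
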
    • bcache: smooth writeback rate control · ae82ddbf
      Committed by Michael Lyle
      This works in conjunction with the new PI controller.  Currently, in
      real-world workloads, the rate controller attempts to write back 1
      sector per second.  In practice, these minimum-rate writebacks are
      between 4k and 60k in test scenarios, since bcache aggregates and
      attempts to do contiguous writes and because filesystems on top of
      bcache typically write 4k or more.
      
      Previously, bcache guaranteed to write at least once per second.  This
      meant that the actual writeback rate would exceed the configured amount
      by a factor of 8-120 or more.
      
      This patch allows the writeback thread to sleep for up to 2.5 seconds
      and targets writing 4k/second.  On the smallest writes it will still
      sleep 1 second as before, but it will often sleep longer and load the
      backing device less.  This keeps the writeback-related load on the cache
      and backing devices more consistent when writing back at low rates (a
      sketch of the delay calculation follows this entry).
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
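      A rough sketch of the delay calculation in plain C (illustrative names,
      not the bcache implementation): at the minimum rate the thread may now
      sleep up to 2.5 seconds for a single chunk instead of waking at least
      once per second.

      #include <stdint.h>

      #define NSEC_PER_SEC 1000000000ULL
      #define MAX_WRITEBACK_SLEEP_NS (NSEC_PER_SEC * 5 / 2)  /* 2.5 s cap */

      /* How long writing 'sectors' sectors "should" take at 'rate' sectors
       * per second, capped at the maximum allowed sleep. */
      static uint64_t writeback_sleep_ns(uint64_t sectors, uint32_t rate)
      {
              uint64_t ns = sectors * NSEC_PER_SEC / rate;

              return ns > MAX_WRITEBACK_SLEEP_NS ? MAX_WRITEBACK_SLEEP_NS : ns;
      }

      At the 4k/second floor (8 sectors/second), a single 4k chunk still sleeps
      1 second, while a 16k chunk hits the 2.5 second cap instead of forcing a
      wakeup every second.
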
    • bcache: implement PI controller for writeback rate · 1d316e65
      Committed by Michael Lyle
      bcache uses a control system to attempt to keep the amount of dirty data
      in cache at a user-configured level, while not responding excessively to
      transients and variations in write rate.  Previously, the system was a
      PD controller; but the output from it was integrated, turning the
      Proportional term into an Integral term, and turning the Derivative term
      into a crude Proportional term.  Performance of the controller has been
      uneven in production, and it has tended to respond slowly, oscillate,
      and overshoot.
      
      This patch set replaces the current control system with an explicit PI
      controller and tuning that should be correct for most hardware.  By
      default, it attempts to write at a rate that would retire 1/40th of the
      current excess blocks per second.  An integral term in turn works to
      remove steady state errors.
      
      IMO, this yields benefits in simplicity (removing weighted average
      filtering, etc) and system performance.
      
      Another small change is that a tunable parameter is introduced to allow
      the user to specify a minimum rate at which dirty blocks are retired.
      
      There is a slight difference from earlier versions of the patch in
      integral handling to prevent excessive negative integral windup.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
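      A simplified, self-contained C sketch of a PI controller with the tuning
      described above (the field names, default values, and anti-windup detail
      are illustrative, not the exact bcache code): the proportional term tries
      to retire 1/40th of the excess per second, and the integral term slowly
      removes any remaining steady-state error.

      #include <stdint.h>

      struct wb_pi {
              int64_t  integral;        /* accumulated error, sector-seconds  */
              uint32_t p_term_inverse;  /* e.g. 40: retire 1/40th of excess/s */
              uint32_t i_term_inverse;  /* e.g. 10000: slow integral action   */
              uint32_t update_seconds;  /* controller period, e.g. 5 seconds  */
              uint32_t rate_minimum;    /* user-tunable floor, sectors/second */
      };

      static uint32_t update_writeback_rate(struct wb_pi *c,
                                            int64_t dirty, int64_t target)
      {
              int64_t error = dirty - target;
              int64_t proportional = error / c->p_term_inverse;
              int64_t rate;

              /* Anti-windup: once the integral is already negative, don't let
               * a negative error drive it further down. */
              if (!(error < 0 && c->integral < 0))
                      c->integral += error * c->update_seconds;

              rate = proportional + c->integral / c->i_term_inverse;

              return rate > c->rate_minimum ? (uint32_t)rate : c->rate_minimum;
      }
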
    • bcache: don't write back data if reading it failed · 5fa89fb9
      Committed by Michael Lyle
      If an I/O operation fails and we didn't successfully read the data from
      the cache, don't write back invalid/partial data to the backing disk.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
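      The check itself is small; a hedged sketch of its shape (the struct and
      helper names here are hypothetical, not the patched bcache functions):

      /* Sketch only: if the read from the cache failed, skip the write to the
       * backing device and just drop the dirty key, rather than writing back
       * invalid or partial data. */
      static void write_one_dirty_key(struct dirty_io *io)
      {
              if (io->read_error) {                /* hypothetical field  */
                      release_dirty_key(io);       /* hypothetical helper */
                      return;
              }
              submit_backing_write(io);            /* hypothetical helper */
      }
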
  2. 08 September 2017, 1 commit
    • bcache: initialize dirty stripes in flash_dev_run() · 175206cf
      Committed by Tang Junhui
      bcache uses a Proportional-Derivative (PD) controller algorithm to
      control the writeback rate to cached devices. In the PD controller
      algorithm, dirty stripes of thin flash devices should not be counted,
      because flash only volumes never write back dirty data.
      
      Currently the dirty stripe counter for a thin flash device is not
      initialized when the device starts, which means the PD controller
      calculation will reference an undefined dirty stripe count, and all
      cached devices attached to the same cache set as the thin flash device
      may get an inaccurate writeback rate.
      
      This patch calls bch_sectors_dirty_init() in flash_dev_run() to
      correctly initialize the dirty stripe counter when the thin flash device
      starts to run. It also changes the parameter type,
       -void bch_sectors_dirty_init(struct cached_dev *dc);
       +void bch_sectors_dirty_init(struct bcache_device *);
      so that the function can be called conveniently from flash_dev_run()
      (a sketch of the call site follows this entry).
      
      (Commit log is composed by Coly Li)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
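      A sketch of where the call lands (illustrative; only the prototype change
      quoted above is from the patch itself):

      /* After the signature change, the flash-only start-up path can simply
       * hand its struct bcache_device to the shared initializer, roughly as
       * flash_dev_run() now does: */
      static void flash_dev_start_sketch(struct bcache_device *d)
      {
              /* ... other flash-only device start-up work ... */

              bch_sectors_dirty_init(d);  /* count existing dirty sectors up front */
      }
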
  3. 06 September 2017, 2 commits
    • bcache: fix for gc and write-back race · 9baf3097
      Committed by Tang Junhui
      gc and writeback race with each other (see the earlier email,
      "bcache get stucked"):
      gc thread                               write-back thread
      |                                       |bch_writeback_thread()
      |bch_gc_thread()                        |
      |                                       |==>read_dirty()
      |==>bch_btree_gc()                      |
      |==>btree_root() //get btree root       |
      |                //node write lock      |
      |==>bch_btree_gc_root()                 |
      |                                       |==>read_dirty_submit()
      |                                       |==>write_dirty()
      |                                       |==>continue_at(cl,
      |                                       |               write_dirty_finish,
      |                                       |               system_wq);
      |                                       |==>write_dirty_finish()//execute
      |                                       |               //in system_wq
      |                                       |==>bch_btree_insert()
      |                                       |==>bch_btree_map_leaf_nodes()
      |                                       |==>__bch_btree_map_nodes()
      |                                       |==>btree_root //try to get btree
      |                                       |              //root node read
      |                                       |              //lock
      |                                       |-----stuck here
      |==>bch_btree_set_root()
      |==>bch_journal_meta()
      |==>bch_journal()
      |==>journal_try_write()
      |==>journal_write_unlocked() //journal_full(&c->journal)
      |                            //condition satisfied
      |==>continue_at(cl, journal_write, system_wq); //try to execute
      |                               //journal_write in system_wq
      |                               //but the work queue is executing
      |                               //write_dirty_finish()
      |==>closure_sync(); //wait for journal_write to
      |                   //finish and wake up gc,
      |-------------stuck here
      |==>release root node write lock
      
      This patch allocates a separate workqueue for the writeback thread to
      avoid this race (a sketch follows this entry).
      
      (Commit log re-organized by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Acked-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
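      A sketch of the fix using the generic workqueue API and bcache's closure
      continue_at() (the workqueue name and the variable holding it are
      illustrative): once writeback completions run on their own queue, a
      stalled write_dirty_finish() can no longer block the journal work that gc
      is waiting on.

      /* Illustrative: a dedicated workqueue for writeback completions,
       * allocated when the cached device is set up. */
      struct workqueue_struct *bcache_writeback_wq;

      bcache_writeback_wq = alloc_workqueue("bcache_writeback",
                                            WQ_MEM_RECLAIM, 0);

      /* write_dirty() then continues here instead of on system_wq: */
      continue_at(cl, write_dirty_finish, bcache_writeback_wq);
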
    • bcache: correct cache_dirty_target in __update_writeback_rate() · a8394090
      Committed by Tang Junhui
      __update_writeback_rate() uses a Proportional-Derivative (PD) controller
      algorithm to control the writeback rate. A dirty target number is used
      in this PD controller: a larger target number makes the writeback rate
      smaller, and conversely a smaller target number makes the writeback rate
      larger.
      
      bcache uses the following steps to calculate the target number,
      1) cache_sectors = all-buckets-of-cache-set * buckets-size
      2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
      3) target = cache_dirty_target *
      (sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)
      
      The calculation of cache_sectors at step 1) is incorrect, because it
      does not account for dirty blocks occupied by flash only volumes.
      
      A flash only volume can be regarded as a bcache device without a cached
      device. All data sectors allocated for it are persistent on the cache
      device and marked dirty; they are never touched by the bcache writeback
      and garbage collection code. So data blocks of flash only volumes should
      be ignored when calculating cache_sectors of the cache set.
      
      The current code does not subtract the dirty sectors of flash only
      volumes, which results in a larger target number from the above three
      steps. As a consequence, the cached device's writeback rate is smaller
      than the correct value, and writeback is slower on all cached devices.
      
      This patch fixes the incorrectly slow writeback rate by subtracting the
      dirty sectors of flash only volumes in __update_writeback_rate() (the
      corrected calculation is sketched after this entry).
      
      (Commit log composed by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
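      The corrected calculation, sketched in C (variable and helper names are
      illustrative; in particular flash_dev_dirty_sectors() stands in for
      whatever helper sums the dirty sectors of flash only volumes):

      /* Step 1, fixed: exclude sectors that flash only volumes keep
       * permanently dirty, since they are never written back. */
      uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
                               flash_dev_dirty_sectors(c);

      /* Step 2: scale by the configured writeback_percent. */
      uint64_t cache_dirty_target = cache_sectors * dc->writeback_percent / 100;

      /* Step 3: this cached device's share, proportional to its size. */
      int64_t target = cache_dirty_target * bdev_sectors(dc->bdev) /
                       c->cached_dev_sectors;
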
  4. 24 August 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Committed by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
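      In rough terms the change looks like this (a sketch; the field names come
      from the commit message, and bio_set_dev() is shown as the typical way a
      driver now points a bio at a device):

      struct bio {
              /* ... */
              struct gendisk *bi_disk;    /* was: struct block_device *bi_bdev */
              u8              bi_partno;  /* partition index used for remapping
                                             in generic_make_request()         */
              /* ... */
      };

      /* Drivers set both fields through a helper instead of assigning
       * bi_bdev directly: */
      bio_set_dev(bio, bdev);
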
  5. 09 June 2017, 1 commit
  6. 02 March 2017, 1 commit
  7. 22 November 2016, 1 commit
  8. 22 September 2016, 1 commit
  9. 08 June 2016, 1 commit
  10. 24 May 2016, 1 commit
  11. 31 December 2015, 1 commit
  12. 14 August 2015, 1 commit
  13. 29 July 2015, 1 commit
    • block: add a bi_error field to struct bio · 4246a0b6
      Committed by Christoph Hellwig
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
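      A sketch of how completion code looks after the change (example_endio()
      is a hypothetical driver callback, not code from the patch): the error
      travels in the bio itself instead of in a flag plus a callback argument.

      /* Completion side: read the error directly from the bio. */
      static void example_endio(struct bio *bio)
      {
              if (bio->bi_error)
                      pr_err("I/O failed: %d\n", bio->bi_error);

              bio_put(bio);
      }

      /* Submission side: signal failure by setting bi_error, then complete. */
      bio->bi_error = -EIO;
      bio_endio(bio);
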
  14. 05 August 2014, 1 commit
    • bcache: fix uninterruptible sleep in writeback thread · 9e5c3535
      Committed by Slava Pestov
      There were two issues here:
      
      - writeback thread did not start until the device first became dirty
      - writeback thread used uninterruptible sleep once running
      
      Without this patch I see kernel warnings printed and a load average of
      1.52 after booting my test VM. With this patch the warnings are gone and
      the load average is near 0.00 as expected.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
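      The general pattern of the fix, sketched (illustrative; have_dirty_data()
      is a hypothetical stand-in for the real check): the thread parks in
      TASK_INTERRUPTIBLE, so it neither inflates the load average nor trips the
      hung-task detector while it waits for dirty data.

      static int writeback_thread_sketch(void *arg)
      {
              while (!kthread_should_stop()) {
                      set_current_state(TASK_INTERRUPTIBLE);
                      if (!have_dirty_data(arg)) {   /* hypothetical check */
                              schedule();            /* interruptible wait */
                              continue;
                      }
                      __set_current_state(TASK_RUNNING);

                      /* ... write back one batch of dirty data ... */
              }
              return 0;
      }
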
  15. 17 December 2013, 3 commits
  16. 24 November 2013, 1 commit
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
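      The shape of the change, sketched (fields as described in the commit
      message; exact types are illustrative): the per-bio iteration state moves
      into an embedded struct bvec_iter.

      struct bvec_iter {
              sector_t     bi_sector;     /* device address, 512-byte sectors */
              unsigned int bi_size;       /* residual I/O count               */
              unsigned int bi_idx;        /* current index into bi_io_vec     */
              unsigned int bi_bvec_done;  /* added by the later immutable-
                                             biovec patch mentioned above     */
      };

      /* old: bio->bi_sector, bio->bi_size, bio->bi_idx         */
      /* new: bio->bi_iter.bi_sector, bio->bi_iter.bi_size, ... */
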
  17. 11 November 2013, 12 commits
  18. 25 September 2013, 2 commits
  19. 02 July 2013, 1 commit
    • bcache: Use standard utility code · 8e51e414
      Committed by Kent Overstreet
      Some of bcache's utility code has made it into the rest of the kernel,
      so drop the bcache versions.
      
      Bcache used to have a workaround for allocating from a bio set under
      generic_make_request() (if you allocated more than once, the bios you
      already allocated would get stuck on current->bio_list when you
      submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
      when allocating bios under generic_make_request() so that allocation
      could fail and it could retry from workqueue. But bio_alloc_bioset() has
      a workaround now, so we can drop this hack and the associated error
      handling.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
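      For reference, a sketch of the kind of workaround being removed
      (illustrative, not the deleted bcache code; retry_wq and retry_work are
      hypothetical): allocation under generic_make_request() masked out
      __GFP_WAIT so it could fail instead of deadlocking on current->bio_list,
      and the caller retried from a workqueue.

      /* Old-style workaround, now unnecessary: */
      struct bio *bio = bio_alloc_bioset(GFP_NOIO & ~__GFP_WAIT, nr_vecs, bs);
      if (!bio) {
              /* Punt the allocation to process context and retry there. */
              queue_work(retry_wq, &retry_work);
              return;
      }

      /* With bio_alloc_bioset() handling the deadlock internally, callers can
       * simply pass GFP_NOIO and drop the retry path entirely. */
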
  20. 27 June 2013, 2 commits