1. 22 Jun 2022, 3 commits
    • bcache: remove bcache device self-defined readahead · fe62491a
      Coly Li authored
      mainline inclusion
      from v5.13-rc6
      commit 1616a4c2
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
      CVE: N/A
      
      ------------------------------------
      
      For a read cache miss, bcache defines a readahead size for the read I/O
      request to the backing device for the missing data. This readahead size
      is initialized to 0, and almost no one uses it, to avoid unnecessary
      read amplification onto the backing device and write amplification onto
      the cache device. Considering that upper-layer file system code already
      has readahead logic and works fine with the readahead_cache_policy
      sysfs interface, we don't have to keep bcache's self-defined readahead
      anymore.
      
      This patch removes the bcache self-defined readahead for cache-miss
      requests to the backing device, and the readahead sysfs file interfaces
      are removed as well.
      
      This is the preparation for the next patch, which fixes a potential
      kernel panic due to oversized requests in a simpler way.
      Reported-by: Alexander Ullrich <ealex1979@gmail.com>
      Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
      Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
      Reported-by: Marco Rebhan <me@dblsaiko.net>
      Reported-by: Matthias Ferdinand <bcache@mfedv.net>
      Reported-by: Victor Westerhuis <victor@westerhu.is>
      Reported-by: Vojtech Pavlik <vojtech@suse.cz>
      Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
      Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Takashi Iwai <tiwai@suse.com>
      Link: https://lore.kernel.org/r/20210607125052.21277-2-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • bcache: remove PTR_CACHE · 4f793082
      Christoph Hellwig authored
      mainline inclusion
      from v5.13-rc1
      commit 11e9560e
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
      CVE: N/A
      
      ----------------------------------
      
      Remove the PTR_CACHE inline and replace it with a direct dereference
      of c->cache.
      
      (Coly Li: fix the typo from PTR_BUCKET to PTR_CACHE in commit log)
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20210411134316.80274-3-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • bcache: consider the fragmentation when update the writeback rate · c49b38e4
      dongdong tao authored
      mainline inclusion
      from v5.12-rc1
      commit 71dda2a5
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
      CVE: N/A
      
      ---------------------------------------------
      
      The current way of calculating the writeback rate only considers dirty
      sectors. This usually works fine when fragmentation is not high, but it
      gives us an unreasonably small rate when very few dirty sectors consume
      a lot of dirty buckets. In some cases the dirty buckets can reach
      CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) hasn't even
      reached the writeback_percent; the writeback rate will still be the
      minimum value (4k), so all writes get stuck in a non-writeback mode
      because of the slow writeback.
      
      We accelerate the rate in 3 stages with different aggressiveness: the
      first stage starts when the dirty bucket percentage rises above
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second at
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), and the third at
      BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default the first stage
      tries to write back the amount of dirty data in one bucket (on average)
      in (1 / (dirty_buckets_percent - 50)) seconds, the second stage tries
      to write back the amount of dirty data in one bucket in
      (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third
      stage tries to write back the amount of dirty data in one bucket in
      (1 / (dirty_buckets_percent - 64)) milliseconds.
      
      The initial rate at each stage can be controlled by 3 configurable
      parameters, writeback_rate_fp_term_{low|mid|high}; they default to
      1, 10, and 1000 respectively. The I/O throughput these values try to
      achieve is described by the paragraph above. The reason I chose those
      values as defaults is based on testing and production data; below are
      some details:
      
      A. When it comes to the low stage, we are still a fair way from the 70
         threshold, so we only want to give it a little push by setting the
         term to 1. That means the initial rate will be 170 if the fragment
         is 6 (calculated as bucket_size/fragment). This rate is very small,
         but still much more reasonable than the minimum of 8.
         For a production bcache with an unheavy workload, if the cache
         device is bigger than 1 TB, it may take hours to consume 1% of the
         buckets, so it is very possible to reclaim enough dirty buckets in
         this stage and thus avoid entering the next stage.
      
      B. If the dirty bucket ratio didn't turn around during the first stage,
         we come to the mid stage, which needs to be more aggressive than the
         low stage. So I chose an initial rate 10 times higher, which means
         1700 as the initial rate if the fragment is 6. This is a fairly
         normal rate that we usually see for a normal workload when writeback
         happens because of writeback_percent.
      
      C. If the dirty bucket ratio didn't turn around during the low and mid
         stages, we come to the third stage, and it is the last chance to
         turn around and avoid the horrible cutoff writeback sync issue. Here
         I chose to be 100 times more aggressive than the mid stage, which
         means 170000 as the initial rate if the fragment is 6. This is also
         inferred from a production bcache: I've got one week's writeback
         rate data from a production bcache with quite heavy workloads
         (again, the writeback is triggered by the writeback percent), and
         the highest rate area is around 100000 to 240000, so I believe this
         kind of aggressiveness at this stage is reasonable for production.
         It should also be mostly enough, because the hint tries to reclaim
         1000 buckets per second, while that heavy production environment
         consumed 50 buckets per second on average over one week's data.
      
      The option writeback_consider_fragment controls whether this feature
      is on or off; it is on by default.
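      
      As a rough standalone model of the three-stage term described above
      (an illustrative sketch, not the kernel code; the threshold names
      follow the commit text, the harness values are assumptions):
      
          #include <stdint.h>
          #include <stdio.h>
          
          /* dirty-bucket thresholds named in the commit message */
          #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW  50
          #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID  57
          #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
          
          /*
           * Pick a per-stage term (writeback_rate_fp_term_{low,mid,high},
           * default 1/10/1000), scale it by how far the dirty-bucket
           * percentage is past the stage threshold, and multiply by the
           * average dirty sectors per dirty bucket.
           */
          static int64_t fragment_rate_term(int dirty_buckets_percent,
                                            int64_t dirty_sectors,
                                            int64_t dirty_buckets,
                                            int64_t fp_low, int64_t fp_mid,
                                            int64_t fp_high)
          {
              int64_t fp_term;
          
              if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW ||
                  dirty_buckets <= 0)
                  return 0;   /* feature not triggered */
          
              if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID)
                  fp_term = fp_low *
                      (dirty_buckets_percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW);
              else if (dirty_buckets_percent <= BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH)
                  fp_term = fp_mid *
                      (dirty_buckets_percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID);
              else
                  fp_term = fp_high *
                      (dirty_buckets_percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH);
          
              /* average dirty sectors per dirty bucket, times the stage term */
              return (dirty_sectors / dirty_buckets) * fp_term;
          }
          
          int main(void)
          {
              /* fragment 6: ~170 dirty sectors per 1024-sector bucket */
              printf("low stage rate: %lld\n",
                     (long long)fragment_rate_term(51, 170, 1, 1, 10, 1000));
              return 0;
          }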
      
      Lastly, below is the performance data for all the testing results,
      including the data from the production environment:
      https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
      Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 12 Jan 2022, 2 commits
  3. 10 Jan 2022, 2 commits
  4. 09 Apr 2021, 2 commits
  5. 03 Oct 2020, 8 commits
  6. 25 Jul 2020, 3 commits
    • bcache: handle cache prio_buckets and disk_buckets properly for bucket size > 8MB · c954ac8d
      Coly Li authored
      Similar to c->uuids, struct cache's prio_buckets and disk_buckets can
      also hit memory allocation failure during cache registration if the
      bucket size is larger than 8MB.
      
      ca->prio_buckets can be stored on the cache device in multiple buckets.
      Its in-memory space is allocated via the kzalloc() interface, but is
      normally allocated by alloc_pages() because the size exceeds
      KMALLOC_MAX_CACHE_SIZE.
      
      So the allocation of ca->prio_buckets has the MAX_ORDER restriction
      too. If the bucket size is larger than 8MB, by default the page
      allocator will fail because the page order exceeds 11 (the default
      MAX_ORDER value). ca->prio_buckets should also use meta_bucket_bytes()
      and meta_bucket_pages() to decide its memory size, and use
      alloc_meta_bucket_pages() to allocate pages, to avoid the allocation
      failure during cache set registration when the bucket size is larger
      than 8MB.
      
      ca->disk_buckets is a memory buffer of a single bucket size. It is used
      to iterate over each bucket of ca->prio_buckets, composing a bio from
      the memory of ca->disk_buckets and writing that memory to the cache
      disk bucket-by-bucket for each bucket of ca->prio_buckets.
      ca->disk_buckets should have an in-memory size of exactly
      meta_bucket_pages() pages; this is the amount of ca->prio_buckets that
      will be stored into each on-disk bucket.
      
      This patch fixes the above issues and handles the cache's prio_buckets
      and disk_buckets properly for bucket sizes larger than 8MB.
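      
      A userspace model of the staging relationship described above
      (made-up sizes; the real buffers are sized from
      meta_bucket_bytes(&ca->sb)):
      
          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          
          #define META_BUCKET_BYTES (4UL << 20)  /* one clamped meta bucket */
          #define PRIO_CHUNKS       3            /* prios span three buckets */
          
          static void write_bucket(unsigned idx, const void *buf, size_t len)
          {
              (void)buf;  /* the real code builds a bio over this buffer */
              printf("writing %zu bytes of prios to bucket %u\n", len, idx);
          }
          
          int main(void)
          {
              /* in-memory prio array, allocated in meta-bucket chunks */
              uint8_t *prio_buckets = calloc(PRIO_CHUNKS, META_BUCKET_BYTES);
              /* single staging buffer, exactly one meta bucket large */
              uint8_t *disk_buckets = malloc(META_BUCKET_BYTES);
          
              /* iterate prio_buckets, staging each chunk in disk_buckets */
              for (unsigned i = 0; i < PRIO_CHUNKS; i++) {
                  memcpy(disk_buckets,
                         prio_buckets + (size_t)i * META_BUCKET_BYTES,
                         META_BUCKET_BYTES);
                  write_bucket(i, disk_buckets, META_BUCKET_BYTES);
              }
          
              free(disk_buckets);
              free(prio_buckets);
              return 0;
          }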
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: introduce meta_bucket_pages() related helper routines · de1fafab
      Coly Li authored
      Currently, in-memory meta data such as c->uuids or c->disk_buckets is
      allocated by alloc_bucket_pages(). The macro alloc_bucket_pages()
      calls __get_free_pages() to allocate contiguous pages with the order
      indicated by ilog2(bucket_pages(c)):
       #define alloc_bucket_pages(gfp, c)                      \
           ((void *) __get_free_pages(__GFP_ZERO|gfp, ilog2(bucket_pages(c))))
      
      The maximum order is defined as MAX_ORDER; the default value is 11 (and
      can be overridden by CONFIG_FORCE_MAX_ZONEORDER). In the bcache code,
      the maximum bucket size width is 16 bits; this is restricted both by
      the KEY_SIZE width and by the bucket_size field of struct
      cache_sb_disk. The maximum power-of-2 value that fits in 16 bits is
      (1<<15), in units of sectors (512 bytes). This means the maximum bucket
      size is (1<<24) bytes, a.k.a. 4096 pages.
      
      When the bucket size is set to the maximum permitted value, ilog2(4096)
      is 12, which exceeds the default maximum order __get_free_pages() can
      accept. The failed page allocation will abort the cache set
      registration procedure and print a kernel oops message for the
      exceeded page order.
      
      This patch introduces the meta_bucket_pages(), meta_bucket_bytes(), and
      alloc_meta_bucket_pages() helper routines. meta_bucket_pages()
      indicates the maximum number of pages that can be allocated for a
      meta-data bucket, meta_bucket_bytes() indicates the corresponding
      maximum number of bytes, and alloc_meta_bucket_pages() does the page
      allocation for a meta bucket. Because meta_bucket_pages() chooses the
      smaller value between the bucket size and MAX_ORDER_NR_PAGES, it still
      works when MAX_ORDER is overridden by CONFIG_FORCE_MAX_ZONEORDER.
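      
      A userspace sketch of these helpers and the arithmetic above (the
      clamp constant mirrors the kernel's MAX_ORDER_NR_PAGES; everything
      else here is an illustrative assumption, not the kernel code):
      
          #include <stdint.h>
          #include <stdio.h>
          
          #define PAGE_SHIFT          12
          #define MAX_ORDER           11  /* default kernel value */
          #define MAX_ORDER_NR_PAGES  (1UL << (MAX_ORDER - 1))
          
          static unsigned ilog2_ul(unsigned long v)
          {
              unsigned r = 0;
              while (v >>= 1)
                  r++;
              return r;
          }
          
          /* clamp a meta-data bucket to the largest allocatable page run */
          static unsigned long meta_bucket_pages(unsigned long bucket_sectors)
          {
              unsigned long pages = (bucket_sectors << 9) >> PAGE_SHIFT;
          
              return pages < MAX_ORDER_NR_PAGES ? pages : MAX_ORDER_NR_PAGES;
          }
          
          static unsigned long meta_bucket_bytes(unsigned long bucket_sectors)
          {
              return meta_bucket_pages(bucket_sectors) << PAGE_SHIFT;
          }
          
          int main(void)
          {
              unsigned long bucket_sectors = 1UL << 15; /* max 16-bit power of 2 */
              unsigned long raw = (bucket_sectors << 9) >> PAGE_SHIFT;
          
              /* the raw order 12 is what used to blow up __get_free_pages() */
              printf("raw: %lu pages (order %u)\n", raw, ilog2_ul(raw));
              printf("meta: %lu pages (order %u), %lu usable bytes\n",
                     meta_bucket_pages(bucket_sectors),
                     ilog2_ul(meta_bucket_pages(bucket_sectors)),
                     meta_bucket_bytes(bucket_sectors));
              return 0;
          }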
      
      Following patches will use these helper routines to decide the maximum
      number of pages that can be allocated for the different meta-data
      buckets. If the bucket size is larger than meta_bucket_bytes(), bcache
      registration can still succeed; only the space beyond
      meta_bucket_bytes() inside the bucket is wasted. Compared with bcache
      failing outright for large bucket sizes, wasting some space in
      meta-data buckets is acceptable at this moment.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix overflow in offset_to_stripe() · 7a148126
      Coly Li authored
      offset_to_stripe() returns the stripe number (of type unsigned int) for
      an offset (of type uint64_t) by the following calculation:
      	do_div(offset, d->stripe_size);
      For a large-capacity backing device (e.g. 18TB) with a small stripe
      size (e.g. 4KB), the result is 4831838208, which exceeds UINT_MAX. The
      actual value the caller receives is 536870912, due to the overflow.
      
      Indeed, in bcache_device_init(), bcache_device->nr_stripes is limited
      to the range [1, INT_MAX]. Therefore all valid stripe numbers in bcache
      are in the range [0, bcache_device->nr_stripes - 1].
      
      This patch adds an upper-limit check in offset_to_stripe(): the maximum
      valid stripe number must be less than bcache_device->nr_stripes. If the
      stripe number calculated by do_div() is equal to or larger than
      bcache_device->nr_stripes, -EINVAL is returned. (Normally nr_stripes is
      less than INT_MAX, so exceeding the upper limit doesn't mean an
      overflow; therefore -EOVERFLOW is not used as the error code.)
      
      This patch also changes the type of struct bcache_device's nr_stripes
      from 'unsigned int' to 'int', and the return type of
      offset_to_stripe() from 'unsigned int' to 'int', to match their exact
      data ranges.
      
      All locations where bcache_device->nr_stripes and offset_to_stripe()
      are referenced are also updated for the above type change.
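      
      A simplified standalone sketch of the fixed helper (do_div() replaced
      by plain division, -EINVAL written as -22; the nr_stripes value in the
      harness is an assumption):
      
          #include <stdint.h>
          #include <stdio.h>
          
          /* simplified stand-in for struct bcache_device */
          struct bcache_device {
              int nr_stripes;           /* valid range [1, INT_MAX] */
              unsigned int stripe_size; /* in sectors */
          };
          
          /* divide, then reject out-of-range stripe numbers */
          static int offset_to_stripe(const struct bcache_device *d,
                                      uint64_t offset)
          {
              offset /= d->stripe_size; /* do_div(offset, d->stripe_size) */
          
              if (offset >= (uint64_t)d->nr_stripes)
                  return -22;           /* -EINVAL */
          
              return (int)offset;
          }
          
          int main(void)
          {
              /* 18TB backing device, 4KB (8-sector) stripes */
              struct bcache_device d = {
                  .nr_stripes = 2147483647, .stripe_size = 8,
              };
              uint64_t offset = 18ULL << 31; /* 18TB in 512-byte sectors */
          
              /*
               * The quotient is 4831838208; the old unsigned-int return
               * silently truncated it to 536870912, the check rejects it.
               */
              printf("stripe = %d\n", offset_to_stripe(&d, offset));
              return 0;
          }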
      Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 01 Jul 2020, 1 commit
  8. 27 May 2020, 1 commit
    • bcache: Convert pr_<level> uses to a more typical style · 46f5aa88
      Joe Perches authored
      Remove the trailing newline from the define of pr_fmt and add newlines
      to the uses.
      
      Miscellanea:
      
      o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont,
        as the earlier conversion was inappropriately done, causing
        multiple lines to be emitted where only a single output line
        was desired
      o Use the vsprintf extension %pV in bch_cache_set_error to avoid
        multiple-line output where only a single line of output was desired
      o Coalesce formats
      
      Fixes: 6ae63e35 ("bcache: replace printk() by pr_*() routines")
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 01 Feb 2020, 1 commit
    • bcache: add readahead cache policy options via sysfs interface · 038ba8cc
      Coly Li authored
      In 2007, high-performance SSDs were still expensive. In order to save
      more space for real workloads or meta data, readahead I/Os for
      non-meta data were bypassed and not cached on the SSD.
      
      Nowadays, SSD prices have dropped a lot and people can find larger
      SSDs at a more comfortable price. It is unnecessary to always bypass
      normal readahead I/Os to save SSD space anymore.
      
      This patch adds options for the readahead data cache policy via the
      sysfs file /sys/block/bcache<N>/readahead_cache_policy. The options are:
      - "all": cache all readahead data I/Os.
      - "meta-only": only cache meta data, and bypass other regular I/Os.
      
      If users want bcache to keep caching only readahead requests for
      metadata and bypassing regular data readahead, please write "meta-only"
      to this sysfs file. By default, bcache now goes back to caching all
      readahead requests.
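      
      A sketch of the bypass decision this option drives (the enum and
      function names here are illustrative, not the kernel identifiers):
      
          #include <stdbool.h>
          #include <stdio.h>
          
          /* the two sysfs-selectable policies from the commit message */
          enum readahead_cache_policy {
              READA_CACHE_ALL,       /* "all": cache every readahead (default) */
              READA_CACHE_META_ONLY, /* "meta-only": bypass data readahead */
          };
          
          /* cache-miss path: should this readahead bypass the SSD? */
          static bool bypass_readahead(enum readahead_cache_policy policy,
                                       bool is_readahead, bool is_meta)
          {
              if (!is_readahead || policy == READA_CACHE_ALL)
                  return false;
              return !is_meta; /* meta-only: only meta data gets cached */
          }
          
          int main(void)
          {
              /* a regular data readahead under "meta-only" is bypassed */
              printf("%d\n", bypass_readahead(READA_CACHE_META_ONLY,
                                              true, false));
              return 0;
          }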
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Acked-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 24 Jan 2020, 1 commit
  11. 14 Nov 2019, 3 commits
    • bcache: add idle_max_writeback_rate sysfs interface · c5fcdedc
      Coly Li authored
      For writeback mode, if there is no regular I/O request for a while,
      the writeback rate will be set to the maximum value (1TB/s for now).
      This is good for most storage workloads, but there are still people
      who don't want the maximum writeback rate during I/O idle time.
      
      This patch adds a sysfs interface file, idle_max_writeback_rate, to
      permit people to disable the maximum writeback rate. The minimum
      writeback rate can then be advised via writeback_rate_minimum in the
      bcache device's sysfs interface.
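      
      A minimal sketch of the resulting decision, assuming simplified field
      names (not the exact kernel identifiers):
      
          #include <stdbool.h>
          #include <stdint.h>
          #include <stdio.h>
          
          /* choose the idle-time writeback rate, in sectors per second */
          static int64_t idle_rate(bool idle_max_writeback_rate,
                                   int64_t writeback_rate_minimum)
          {
              if (idle_max_writeback_rate) /* default: historical behaviour */
                  return INT64_MAX;        /* stand-in for the 1TB/s maximum */
              return writeback_rate_minimum; /* user-advised gentle rate */
          }
          
          int main(void)
          {
              /* echo 0 > .../idle_max_writeback_rate, minimum of 8 sectors */
              printf("%lld\n", (long long)idle_rate(false, 8));
              return 0;
          }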
      Reported-by: Christian Balzer <chibi@gol.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix deadlock in bcache_allocator · 84c529ae
      Andrea Righi authored
      bcache_allocator can call the following:
      
       bch_allocator_thread()
        -> bch_prio_write()
           -> bch_bucket_alloc()
              -> wait on &ca->set->bucket_wait
      
      But the wake-up event on bucket_wait is supposed to come from
      bch_allocator_thread() itself => deadlock:
      
      [ 1158.490744] INFO: task bcache_allocato:15861 blocked for more than 10 seconds.
      [ 1158.495929]       Not tainted 5.3.0-050300rc3-generic #201908042232
      [ 1158.500653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1158.504413] bcache_allocato D    0 15861      2 0x80004000
      [ 1158.504419] Call Trace:
      [ 1158.504429]  __schedule+0x2a8/0x670
      [ 1158.504432]  schedule+0x2d/0x90
      [ 1158.504448]  bch_bucket_alloc+0xe5/0x370 [bcache]
      [ 1158.504453]  ? wait_woken+0x80/0x80
      [ 1158.504466]  bch_prio_write+0x1dc/0x390 [bcache]
      [ 1158.504476]  bch_allocator_thread+0x233/0x490 [bcache]
      [ 1158.504491]  kthread+0x121/0x140
      [ 1158.504503]  ? invalidate_buckets+0x890/0x890 [bcache]
      [ 1158.504506]  ? kthread_park+0xb0/0xb0
      [ 1158.504510]  ret_from_fork+0x35/0x40
      
      Fix this by making the call to bch_prio_write() non-blocking, so that
      bch_allocator_thread() never waits on itself.
      
      Moreover, make sure to wake up the garbage collector thread when
      bch_prio_write() fails to allocate buckets.
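      
      A standalone sketch of the non-blocking pattern (the function names
      approximate the kernel ones; the failure path is simulated):
      
          #include <stdbool.h>
          #include <stdio.h>
          
          /* illustrative stand-ins, not the kernel signatures */
          static bool bucket_alloc(bool wait)
          {
              (void)wait;
              return false; /* pretend allocation fails right now */
          }
          
          static void wake_up_gc(void)
          {
              puts("gc woken to reclaim buckets");
          }
          
          /*
           * The allocator thread calls prio_write in non-waiting mode, so it
           * never sleeps on the wait queue that it alone is responsible for
           * waking; on failure it kicks gc and retries on a later iteration.
           */
          static bool prio_write(bool wait)
          {
              if (!bucket_alloc(wait)) {
                  wake_up_gc();
                  return false; /* caller retries instead of sleeping */
              }
              return true;
          }
          
          int main(void)
          {
              if (!prio_write(false))
                  puts("prio write deferred, allocator thread keeps running");
              return 0;
          }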
      
      BugLink: https://bugs.launchpad.net/bugs/1784665
      BugLink: https://bugs.launchpad.net/bugs/1796292
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix a lost wake-up problem caused by mca_cannibalize_lock · 34cf78bf
      Guoju Fang authored
      This patch fixes a lost wake-up problem caused by the race between
      mca_cannibalize_lock and bch_cannibalize_unlock.
      
      Consider two processes, A and B. Process A is executing
      mca_cannibalize_lock, while process B holds c->btree_cache_alloc_lock
      and is executing bch_cannibalize_unlock. Process A executes cmpxchg
      and is about to execute prepare_to_wait. In this timeslice process B
      executes wake_up; only after that does process A execute
      prepare_to_wait and set its state to TASK_INTERRUPTIBLE. Process A
      then goes to sleep, but no one will ever wake it up. This problem may
      cause the bcache device to hang.
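      
      A generic model of the lost-wakeup-safe ordering (the wait-queue
      helpers are stubs standing in for the kernel primitives):
      
          #include <stdatomic.h>
          #include <stdbool.h>
          
          /* illustrative lock word: NULL means "unlocked" */
          static _Atomic(void *) btree_cache_alloc_lock;
          
          /* stand-ins for the kernel wait-queue primitives */
          static void prepare_to_wait(void) { /* register on the queue */ }
          static void finish_wait(void)     { /* unregister */ }
          static void schedule(void)        { /* really sleep */ }
          
          /*
           * Corrected ordering: register on the wait queue *before* the
           * decisive lock attempt. A wake_up racing with the cmpxchg now
           * finds the sleeper registered, so it cannot be lost.
           */
          static bool cannibalize_lock(void *me)
          {
              void *expected = NULL;
          
              prepare_to_wait(); /* moved before the decisive check */
              if (atomic_compare_exchange_strong(&btree_cache_alloc_lock,
                                                 &expected, me)) {
                  finish_wait();
                  return true;   /* got the lock, no sleep needed */
              }
              schedule();        /* any wake_up after prepare_to_wait is seen */
              finish_wait();
              return false;      /* retry after waking */
          }
          
          int main(void)
          {
              int token;
              return cannibalize_lock(&token) ? 0 : 1;
          }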
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 28 Jun 2019, 4 commits
  13. 13 Dec 2018, 2 commits
    • bcache: option to automatically run gc thread after writeback · 7a671d8e
      Coly Li authored
      The option gc_after_writeback is disabled by default, because garbage
      collection will discard SSD data, which drops cached data.
      
      Echoing 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
      enable this option, which wakes up the gc thread when writeback is
      accomplished and all cached data is clean.
      
      This option is helpful for people who care more about write
      performance. In a heavy-write workload, all cached data being clean can
      only happen when the writeback thread cleans all cached data during I/O
      idle time. In such a situation, a following gc run may help to shrink
      the bcache B+ tree and discard more clean data, which may be helpful
      for future write requests.
      
      If you are not sure whether this is helpful for your own workload,
      please leave it disabled, as it is by default.
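      
      A minimal sketch of the behaviour, assuming simplified flag and
      function names (not the kernel identifiers):
      
          #include <stdbool.h>
          #include <stdio.h>
          
          /* echo 1 > /sys/fs/bcache/<UUID>/internal/gc_after_writeback */
          static bool gc_after_writeback; /* off by default */
          
          static void wake_up_gc(void)
          {
              puts("gc thread woken: shrink B+ tree, discard clean data");
          }
          
          /* called when the writeback thread finds all cached data clean */
          static void writeback_idle_and_clean(void)
          {
              if (gc_after_writeback)
                  wake_up_gc();
          }
          
          int main(void)
          {
              gc_after_writeback = true;
              writeback_idle_and_clean();
              return 0;
          }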
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add comment for cache_set->fill_iter · d2f96f48
      Shenghui Wang authored
      We have the following definition for the btree iterator:
      	struct btree_iter {
      		size_t size, used;
      	#ifdef CONFIG_BCACHE_DEBUG
      		struct btree_keys *b;
      	#endif
      		struct btree_iter_set {
      			struct bkey *k, *end;
      		} data[MAX_BSETS];
      	};
      
      We can see that the length of the data[] field is the static MAX_BSETS,
      which is currently defined as 4.
      
      But a btree node on disk could have too many bsets for such an iterator
      to fit on the stack, maybe far more than MAX_BSETS, so we have to
      dynamically allocate space to host more btree_iter_sets.
      
      bch_cache_set_alloc() makes sure the pool cache_set->fill_iter can
      allocate an iterator equipped with enough room to host
      	(sb.bucket_size / sb.block_size)
      btree_iter_sets, which is more than the static MAX_BSETS.
      
      bch_btree_node_read_done() uses that pool to allocate one iterator, to
      host the many bsets in one btree node.
      
      Add more comments around cache_set->fill_iter to make the code less
      confusing.
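      
      A rough userspace model of the sizing rule (the exact kernel
      arithmetic differs slightly; the sizes here are assumptions):
      
          #include <stddef.h>
          #include <stdio.h>
          
          /* simplified stand-ins for the superblock fields, in sectors */
          struct cache_sb {
              size_t bucket_size;
              size_t block_size;
          };
          
          struct btree_iter_set { void *k, *end; };
          
          /*
           * A btree node spans one bucket and every block in it may begin
           * its own bset, so the pooled iterator needs roughly one
           * btree_iter_set slot per block, far more than MAX_BSETS.
           */
          static size_t fill_iter_size(const struct cache_sb *sb)
          {
              size_t nsets = sb->bucket_size / sb->block_size;
          
              return 2 * sizeof(size_t) +   /* the size/used header fields */
                     nsets * sizeof(struct btree_iter_set);
          }
          
          int main(void)
          {
              struct cache_sb sb = { .bucket_size = 1024, .block_size = 8 };
          
              /* 128 slots here, versus the static MAX_BSETS of 4 */
              printf("fill_iter item size: %zu bytes\n", fill_iter_size(&sb));
              return 0;
          }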
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 08 Oct 2018, 1 commit
  15. 27 Sep 2018, 1 commit
    • bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang authored
      After a write to the SSD completes, bcache schedules the journal_write
      work on system_wq, a public system workqueue without the WQ_MEM_RECLAIM
      flag. system_wq is also a bound workqueue, and there may be no idle
      kworker on the current processor. Creating a new kworker may
      unfortunately need to reclaim memory first, by shrinking caches and
      slabs used by the vfs, which depends on the bcache device. That's a
      deadlock.
      
      This patch creates a new workqueue for journal_write with the
      WQ_MEM_RECLAIM flag. Its rescuer thread will work to avoid the
      deadlock.
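      
      A kernel-style sketch of the shape of the fix (the variable names are
      illustrative and the work item is assumed to be initialized elsewhere):
      
          #include <linux/errno.h>
          #include <linux/workqueue.h>
          
          static struct workqueue_struct *bch_journal_wq;
          static struct work_struct journal_work; /* INIT_WORK()ed at setup */
          
          /*
           * A workqueue created with WQ_MEM_RECLAIM gets a dedicated rescuer
           * thread, so journal work can still make progress when creating a
           * new kworker would itself require reclaiming memory through bcache.
           */
          static int bch_journal_wq_init(void)
          {
              bch_journal_wq = alloc_workqueue("bcache_journal",
                                               WQ_MEM_RECLAIM, 0);
              return bch_journal_wq ? 0 : -ENOMEM;
          }
          
          /* on SSD write completion: queue to the private queue, not system_wq */
          static void journal_write_complete(void)
          {
              queue_work(bch_journal_wq, &journal_work);
          }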
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 12 Aug 2018, 5 commits