1. 08 Feb 2018 (7 commits)
    • bcache: return attach error when no cache set exist · 7f4fc93d
      Committed by Tang Junhui
      I attached a backing device to a cache set that was not yet
      registered; the backing device failed to attach, but no error was
      returned:
      [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
      [root]#
      
      In sysfs_attach(), the return value "v" is initialized to "size" at
      the beginning. If no cache set exists in bch_cache_sets, "v" never
      changes and is returned to sysfs, which treats the write as a success
      because "size" is a positive number.
      
      This patch fixes the issue by initializing "v" to "-ENOENT" instead.
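
      A minimal sketch of the resulting store handler in
      drivers/md/bcache/sysfs.c (shape inferred from the description above;
      the exact diff may differ):

          if (attr == &sysfs_attach) {
                  ssize_t v = -ENOENT;    /* was: ssize_t v = size; */

                  if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
                          return -EINVAL;

                  list_for_each_entry(c, &bch_cache_sets, list)
                          v = bch_cached_dev_attach(dc, c);

                  /* bch_cache_sets empty: "v" stays -ENOENT, so the error
                   * now reaches user space instead of a fake success */
                  return v;
          }
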
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set writeback_rate_update_seconds in range [1, 60] seconds · 7a5e3ecb
      Committed by Coly Li
      dc->writeback_rate_update_seconds can be set via sysfs, and currently
      any value in [1, ULONG_MAX] is accepted. It makes no sense to allow
      such large values: 60 seconds is long enough, considering that the
      default of 5 seconds has worked well for a long time.
      
      Because dc->writeback_rate_update is a special delayed work, it
      re-arms itself inside the delayed work routine
      update_writeback_rate(). When stopping it with
      cancel_delayed_work_sync(), there should be a timeout to wait on, to
      make sure the re-armed delayed work is stopped too. A small maximum
      for dc->writeback_rate_update_seconds also helps in choosing a
      reasonably small timeout.
      
      This patch limits the sysfs interface to setting
      dc->writeback_rate_update_seconds within the range [1, 60] seconds,
      and replaces the hand-coded numbers with macros, as sketched below.
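
      A sketch of the new limits, assuming the macro names
      WRITEBACK_RATE_UPDATE_SECS_MAX/_DEFAULT and bcache's existing
      sysfs_strtoul_clamp() helper:

          /* drivers/md/bcache/writeback.h */
          #define WRITEBACK_RATE_UPDATE_SECS_MAX          60
          #define WRITEBACK_RATE_UPDATE_SECS_DEFAULT      5

          /* drivers/md/bcache/sysfs.c, STORE(__cached_dev) */
          sysfs_strtoul_clamp(writeback_rate_update_seconds,
                              dc->writeback_rate_update_seconds,
                              1, WRITEBACK_RATE_UPDATE_SECS_MAX);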
      
      Changelog:
      v2: fix a rebase typo in v4, as pointed out by Michael Lyle.
      v1: initial version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix for allocator and register thread race · 682811b3
      Committed by Tang Junhui
      After a long run of random small-I/O writes, I rebooted the machine,
      and after power-on I found bcache stuck; the stacks were:
      [root@ceph153 ~]# cat /proc/2510/task/*/stack
      [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
      [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
      [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
      [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
      [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
      [<ffffffff810a631f>] kthread+0xcf/0xe0
      [<ffffffff8164c318>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      [root@ceph153 ~]# cat /proc/2038/task/*/stack
      [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
      [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
      [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
      [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
      [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
      [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
      [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
      [<ffffffff811e069f>] SyS_write+0x7f/0xe0
      [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
      The stacks show that the register thread and the allocator thread
      were stuck while registering the cache device.

      I rebooted the machine several times; the issue was always present on
      this machine.
      
      I debugged the code and found the call chain below:
      register_bcache()
         ==>run_cache_set()
            ==>bch_journal_replay()
               ==>bch_btree_insert()
                  ==>__bch_btree_map_nodes()
                     ==>btree_insert_fn()
                        ==>btree_split() //node need split
                           ==>btree_check_reserve()
      btree_check_reserve() checks whether there are enough buckets of
      RESERVE_BTREE type. Since the allocator thread is not running yet, no
      RESERVE_BTREE buckets have been allocated, so the register thread
      waits on c->btree_cache_wait and goes to sleep.
      
      Then the allocator thread starts; its call chain is below:
      bch_allocator_thread()
      ==>bch_prio_write()
         ==>bch_journal_meta()
            ==>bch_journal()
               ==>journal_wait_for_write()
      journal_wait_for_write() checks whether the journal is full via
      journal_full(). The long run of random small-I/O writes has exhausted
      the journal buckets (journal.blocks_free = 0), so, to release journal
      buckets, the allocator calls btree_flush_write() to flush keys to
      btree nodes, then waits on c->journal.wait until the btree node
      writes finish or some journal bucket space becomes free, and goes to
      sleep. But in btree_flush_write(), since bch_journal_replay() has not
      finished, no btree node holds a journal reference (the condition
      "if (btree_current_write(b)->journal)" is never satisfied), so there
      is no btree node to flush, no journal bucket is released, and the
      allocator sleeps forever.
      
      From the above analysis we can see that:
      1) the register thread waits for the allocator thread to allocate
         buckets of RESERVE_BTREE type;
      2) the allocator thread waits for the register thread to replay the
         journal, so that it can flush btree nodes and free a journal
         bucket.
      Each waits for the other, so the two threads deadlock.
      
      Hua Rui provided me a patch that allocates some RESERVE_BTREE buckets
      in advance, so the register thread can get a bucket during a btree
      node split without waiting for the allocator thread. I tested it: it
      helps, and the register thread gets one step further, but eventually
      both threads still get stuck. The reason is that only 8 RESERVE_BTREE
      buckets were allocated, and in bch_journal_replay(), after 2 btree
      node splits only 4 RESERVE_BTREE buckets were left, so
      btree_check_reserve() failed again and the register thread went back
      to sleep, while the allocator thread had still not flushed enough
      btree nodes to release a journal bucket; both got stuck again.
      
      So we need to allocate more RESERVE_BTREE buckets in advance, but how
      many are enough? From experience and testing, I think there should be
      as many as there are journal buckets. I modified the code
      accordingly, tested it on the machine, and it works.
      
      This patch is based on Hua Rui's patch, and allocates more
      RESERVE_BTREE buckets in advance so that the register thread and the
      allocator thread no longer wait for each other (see the sketch after
      the v2 note below).
      
      [patch v2] ca->sb.njournal_buckets is 0 the first time after cache
      creation, when no journal exists yet, so 8 btree buckets are enough
      in that case.
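
      A sketch of the sizing change in super.c:cache_alloc(), per the
      description above (exact context abbreviated):

          /* Worst case: every journal bucket holds valid journal entries
           * and all the keys need replay, so reserve one RESERVE_BTREE
           * bucket per journal bucket. A freshly created cache has no
           * journal yet (njournal_buckets == 0), so 8 is enough there. */
          size_t btree_buckets = ca->sb.njournal_buckets ?: 8;

          if (!init_fifo(&ca->free[RESERVE_BTREE], btree_buckets,
                         GFP_KERNEL))
                  return -ENOMEM;
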
      Signed-off-by: Hua Rui <huarui.dev@gmail.com>
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set error_limit correctly · 7ba0d830
      Committed by Coly Li
      Struct cache uses io_errors for two purposes:
      - Error decay: when the cache set error_decay is set, io_errors is
        used to generate a small delay when an I/O error happens.
      - I/O error counter: in order to generate a big enough value for
        error decay, the I/O error count is stored left-shifted by 20 bits
        (a.k.a. IO_ERROR_SHIFT).
      
      In bch_count_io_errors(), if the I/O errors counter reaches the cache
      set error limit, bch_cache_set_error() is called to retire the whole
      cache set. But the current code is problematic when checking the
      error limit; see the following piece of code from
      bch_count_io_errors(),
      
       90     if (error) {
       91             char buf[BDEVNAME_SIZE];
       92             unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
       93                                                 &ca->io_errors);
       94             errors >>= IO_ERROR_SHIFT;
       95
       96             if (errors < ca->set->error_limit)
       97                     pr_err("%s: IO error on %s, recovering",
       98                            bdevname(ca->bdev, buf), m);
       99             else
      100                     bch_cache_set_error(ca->set,
      101                                         "%s: too many IO errors %s",
      102                                         bdevname(ca->bdev, buf), m);
      103     }
      
      At line 94, errors is right shifted by IO_ERROR_SHIFT bits, making it
      the real error count that is compared at line 96. But
      ca->set->error_limit is initialized to an amplified value in
      bch_cache_set_alloc(),
      1545         c->error_limit  = 8 << IO_ERROR_SHIFT;
      
      This means that by default, bch_cache_set_error() won't be called to
      retire the problematic cache device until 8<<20 errors have happened
      in bch_count_io_errors(). If the average request size is 64KB, bcache
      won't handle the failed device until 512GB of data has been
      requested. That is far too large for an I/O error threshold, so the
      correct error limit should be much smaller.
      
      This patch sets the default cache set error limit to 8, so that in
      bch_count_io_errors(), when the errors counter reaches 8 (if the
      default is kept), bch_cache_set_error() is called to retire the whole
      cache set. The patch also removes the bit shifting when storing or
      showing io_error_limit via the sysfs interface.
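
      A sketch of the change, assuming the macro name DEFAULT_IO_ERROR_LIMIT
      (details inferred from the description above):

          /* drivers/md/bcache/bcache.h */
          #define DEFAULT_IO_ERROR_LIMIT 8

          /* bch_cache_set_alloc(), was: c->error_limit = 8 << IO_ERROR_SHIFT; */
          c->error_limit = DEFAULT_IO_ERROR_LIMIT;

          /* sysfs show/store use c->error_limit directly, without the
           * ">> IO_ERROR_SHIFT" / "<< IO_ERROR_SHIFT" conversions */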
      
      Nowadays most SSDs handle internal flash failures automatically by
      re-mapping the LBA to another flash block. If an I/O error can be
      observed by upper layer code, it is a notable error, because the SSD
      could not re-map the problematic LBA to an available flash block,
      which indicates the whole SSD will fail very soon. Therefore setting
      8 as the default I/O error limit makes sense; it is enough for most
      cache devices.
      
      Changelog:
      v2: add reviewed-by from Hannes.
      v1: initial version for review.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: properly set task state in bch_writeback_thread() · 99361bbf
      Committed by Coly Li
      Kernel thread routine bch_writeback_thread() has the following code block,
      
      447         down_write(&dc->writeback_lock);
      448~450     if (check conditions) {
      451                 up_write(&dc->writeback_lock);
      452                 set_current_state(TASK_INTERRUPTIBLE);
      453
      454                 if (kthread_should_stop())
      455                         return 0;
      456
      457                 schedule();
      458                 continue;
      459         }
      
      If the condition check is true, the task state is set to
      TASK_INTERRUPTIBLE and schedule() is called, waiting for someone else
      to wake the thread up.
      
      There are 2 issues in the current code:
      1, The task state is set to TASK_INTERRUPTIBLE after the condition
         checks. If another process changes the condition and calls
         wake_up_process(dc->writeback_thread) before line 452 runs, the
         task state is then set back to TASK_INTERRUPTIBLE and the
         writeback kernel thread loses a chance to be woken up.
      2, At line 454, if kthread_should_stop() is true, the writeback
         kernel thread returns to kernel/kthread.c:kthread() with state
         TASK_INTERRUPTIBLE and calls do_exit(). It is not good to enter
         do_exit() with task state TASK_INTERRUPTIBLE: on the following
         code path might_sleep() is called, and __might_sleep() reports
         "WARNING: do not call blocking ops when !TASK_RUNNING; state=1
         set at [xxxx]".
      
      For the first issue, the task state should be set before the
      condition checks. Indeed, because dc->writeback_lock is required when
      modifying all the conditions, calling set_current_state() inside the
      code block where dc->writeback_lock is held is safe. But this is
      quite implicit, so I still move set_current_state() before all the
      condition checks.
      
      For the second issue, frankly speaking it does not hurt when a kernel
      thread exits in TASK_INTERRUPTIBLE state, but the warning message
      scares users, making them feel there might be something risky with
      bcache that could hurt their data. Setting the task state to
      TASK_RUNNING before returning fixes this problem.
      
      alloc.c:allocator_wait() has a similar issue, which is also fixed in
      this patch.
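
      A sketch of the corrected main loop (conditions as in mainline at the
      time, surrounding work abbreviated):

          while (!kthread_should_stop()) {
                  down_write(&dc->writeback_lock);
                  /* set the state *before* checking, so a concurrent
                   * wake_up_process() after a condition change is not lost */
                  set_current_state(TASK_INTERRUPTIBLE);

                  if (!atomic_read(&dc->has_dirty) ||
                      (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
                       !dc->writeback_running)) {
                          up_write(&dc->writeback_lock);

                          if (kthread_should_stop()) {
                                  /* never enter do_exit() with
                                   * TASK_INTERRUPTIBLE */
                                  set_current_state(TASK_RUNNING);
                                  return 0;
                          }

                          schedule();
                          continue;
                  }
                  set_current_state(TASK_RUNNING);
                  /* ... do the writeback work ... */
          }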
      
      Changelog:
      v3: merge two similar fixes into one patch
      v2: fix the race issue in v1 patch.
      v1: initial buggy fix.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix high CPU occupancy during journal · c4dc2497
      Committed by Tang Junhui
      After a long run of small-write I/O, we found CPU occupancy was very
      high and I/O performance had dropped by about half:
      
      [root@ceph151 internal]# top
      top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
      Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
      %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
      KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
      KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem
      
        PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
       2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
      14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
       9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
       8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
      12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
      31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
       1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
      32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
      21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
       2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
      16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
      15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
       9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
      11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
       9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
      11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
       6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
      
      The kernel workers consumed a lot of CPU, and I found they were
      running journal work: the journal was reclaiming resources and
      flushing btree nodes with surprising frequency.
      
      Further analysis showed that in btree_flush_write() we try to find
      the btree node with the smallest fifo index to flush by traversing
      all the btree nodes in c->bucket_hash. After we pick one, since no
      lock protects it, the node may already have been written to the cache
      device by other work; if that happens, we traverse c->bucket_hash
      again and pick another node. When the problem occurred, the number of
      retries was very high, and we burned a lot of CPU looking for an
      appropriate btree node.
      
      In this patch, we record the 128 btree nodes with the smallest fifo
      indexes in a heap and pop them one by one when we need to flush a
      btree node. This greatly reduces the time spent looping to find an
      appropriate btree node, and also reduces CPU occupancy.
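
      A sketch of the idea using bcache's heap helpers (identifiers such as
      flush_btree, journal_cmp and flush_one are illustrative; the size 128
      is from the description):

          struct btree *b;

          /* one pass over the hash table fills a bounded heap, ordered
           * by journal fifo index (oldest first) */
          for_each_cached_btree(b, c, iter)
                  if (btree_current_write(b)->journal)
                          heap_add(&c->flush_btree, b, journal_cmp);

          /* flush candidates by popping the heap instead of rescanning
           * c->bucket_hash on every retry */
          while (heap_pop(&c->flush_btree, b, journal_cmp))
                  if (flush_one(b))       /* illustrative helper */
                          break;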
      
      [note by mpl: this triggers a checkpatch error because of adjacent,
      pre-existing style violations]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add journal statistic · a728eacb
      Committed by Tang Junhui
      Sometimes the journal takes up a lot of CPU, and we need statistics
      to know what the journal is doing. So this patch provides some
      journal statistics:
      1) reclaim: how many times the journal tries to reclaim resources,
         usually because the journal buckets and/or the pins are exhausted.
      2) flush_write: how many times the journal tries to flush a btree
         node to the cache device, usually because the journal buckets are
         exhausted.
      3) retry_flush_write: how many times the journal retries flushing the
         next btree node, usually because the previous btree node has
         already been flushed by another thread.
      These statistics are exposed via the sysfs interface (see the sketch
      below); with them we can fully see the status of the journal module
      when CPU usage is too high.
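      A sketch of the plumbing (counter type and exact increment sites are
      my assumptions from the description):

          /* struct cache_set (bcache.h) */
          atomic_long_t           reclaim;
          atomic_long_t           flush_write;
          atomic_long_t           retry_flush_write;

          /* journal.c: bump the counters where the events happen */
          atomic_long_inc(&c->reclaim);           /* journal_reclaim() */
          atomic_long_inc(&c->flush_write);       /* btree_flush_write() */
          atomic_long_inc(&c->retry_flush_write); /* on each retry */

          /* sysfs.c: read-only files under .../internal */
          sysfs_print(reclaim, atomic_long_read(&c->reclaim));
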
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 10 Jan 2018 (1 commit)
  3. 09 Jan 2018 (13 commits)
    • bcache: fix writeback target calc on large devices · 616486ab
      Committed by Michael Lyle
      Bcache needs to scale the dirty data in the cache over the multiple
      backing disks in order to calculate writeback rates for each. The
      previous code did this by multiplying the target number of dirty
      sectors by the backing device size, and expected the product to fit
      into a uint64_t; this blows up on relatively large backing devices.
      
      The new approach figures out the bdev's share in 16384ths of the overall
      cached data.  This is chosen to cope well when bdevs drastically vary in
      size and to ensure that bcache can cross the petabyte boundary for each
      backing device.
      
      This has been improved based on Tang Junhui's feedback to ensure that
      every device gets a share of dirty data, no matter how small it is
      compared to the total backing pool.
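
      A sketch of the new target calculation (16384 = 1 << 14; member names
      follow my reading of the patch):

          #define WRITEBACK_SHARE_SHIFT   14

          int64_t bdev_share = div64_u64(bdev_sectors(dc->bdev)
                                             << WRITEBACK_SHARE_SHIFT,
                                         c->cached_dev_sectors);

          /* per Tang Junhui's feedback: even a tiny bdev gets a share */
          if (bdev_share < 1)
                  bdev_share = 1;

          dc->writeback_rate_target =
                  (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;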
      
      The existing mechanism is very limited; this is purely a bug fix to
      remove limits on volume size. However, a further change is still
      needed to make this "fair" over many volumes when some are idle.
      Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix misleading error message in bch_count_io_errors() · 5138ac67
      Committed by Coly Li
      Bcache only does recoverable I/O for read operations, by calling
      cached_dev_read_error(). For write operations there is no I/O
      recovery for failed requests.
      
      But in bch_count_io_errors(), for both read and write I/Os, pr_err()
      always prints "IO error on %s, recovering" before the error counter
      reaches the I/O error limit. For write requests this message is
      misleading, because there is no I/O recovery at all.
      
      This patch adds a parameter 'is_read' to bch_count_io_errors(), so
      that pr_err() only prints "recovering" when the bio direction is
      READ.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: reduce cache_set devices iteration by devices_max_used · 2831231d
      Committed by Coly Li
      The member devices of struct cache_set references all bcache devices
      attached to the cache set. It is an array of pointers, whose size is
      given by the member nr_uuids of struct cache_set.
      
      nr_uuids is calculated in drivers/md/super.c:bch_cache_set_alloc() as
      	bucket_bytes(c) / sizeof(struct uuid_entry)
      The bucket size is determined by the user space tool "make-bcache";
      by default it is 1024 sectors (defined in
      bcache-tools/make-bcache.c:main()), so the default nr_uuids value
      from the above calculation is 4096.
      
      Every time bcache iterates over the bcache devices of a cache set,
      all 4096 pointers are checked even if only 1 bcache device is
      attached to the cache set; that is a waste of time and unnecessary.
      
      This patch adds a member devices_max_used to struct cache_set. Its
      value is 1 + the maximum used index of devices[] in the cache set.
      When iterating over all valid bcache devices of a cache set, using
      c->devices_max_used as the for-loop bound avoids a lot of useless
      checking.
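
      A sketch of both halves of the change (context is illustrative):

          /* bcache_device_attach(): remember the highest slot handed out */
          if (id >= c->devices_max_used)
                  c->devices_max_used = id + 1;

          /* iteration, was: for (i = 0; i < c->nr_uuids; i++) */
          for (i = 0; i < c->devices_max_used; i++) {
                  struct bcache_device *d = c->devices[i];

                  if (!d)
                          continue;
                  /* ... operate on d ... */
          }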
      
      Personally, my motivation for this patch is not performance: I use it
      in bcache debugging, where it helps me narrow down the scope when
      checking the valid bcache devices of a cache set.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct() · b40503ea
      Committed by Zhai Zhaoxuan
      cached_dev_make_request() and flash_dev_make_request() call
      generic_start_io_acct() with (struct bcache_device)->disk when they
      start a closure, but bio_complete() calls generic_end_io_acct() with
      (struct search)->orig_bio->bi_disk when the closure is done. Since
      bi_disk is not the bcache device, generic_end_io_acct() is called
      with the wrong device queue.
      
      As a result, the "inflight" counter (in struct hd_struct) keeps
      increasing without ever decreasing.
      
      This patch fixes the problem by calling generic_end_io_acct() with
      (struct bcache_device)->disk.
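
      A sketch of the corrected completion path (fields per the description
      above):

          static void bio_complete(struct search *s)
          {
                  if (s->orig_bio) {
                          /* was s->orig_bio->bi_disk->queue: a queue that
                           * never saw the matching generic_start_io_acct() */
                          generic_end_io_acct(s->d->disk->queue,
                                              bio_data_dir(s->orig_bio),
                                              &s->d->disk->part0,
                                              s->start_time);
                          /* ... complete s->orig_bio ... */
                  }
          }
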
      Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: mark closure_sync() __sched · ce439bf7
      Committed by Kent Overstreet
      [edit by mlyle: include sched/debug.h to get __sched]
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: Fix, improve efficiency of closure_sync() · e4bf7919
      Committed by Kent Overstreet
      Eliminates cases where sync can race and fail to complete / get stuck.
      Removes many status flags and simplifies entering-and-exiting closure
      sleeping behaviors.
      
      [mlyle: fixed conflicts due to changed return behavior in mainline.
      extended commit comment, and squashed down two commits that were mostly
      contradictory to get to this state.  Changed __set_current_state to
      set_current_state per Jens review comment]
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: allow quick writeback when backing idle · b1092c9a
      Committed by Michael Lyle
      If the control system would wait for at least half a second, and
      there have been no requests hitting the backing disk for a while, use
      an alternate mode where we have at most one contiguous set of
      writebacks in flight at a time (but don't otherwise delay). If
      front-end I/O appears, it will still be quick, as it will only have
      to contend with one real operation in flight. Otherwise, we'll be
      sending data to the backing disk as quickly as it can accept it (with
      one op at a time).
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Acked-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: writeback: properly order backing device IO · 6e6ccc67
      Committed by Michael Lyle
      Writeback keys are presently iterated and dispatched for writeback in
      order of the logical block address on the backing device. Multiple
      keys may be read from the cache device in parallel and then written
      back (especially when the I/O is contiguous).

      However, there was no guarantee with the existing code that the
      writes would be issued in LBA order, as the reads from the cache
      device are often reordered. In turn, when writing back quickly, the
      backing disk often has to seek backwards; this slows writeback and
      increases utilization.
      
      This patch introduces an ordering mechanism that guarantees that the
      original order of issue is maintained for the write portion of the I/O.
      Performance for writeback is significantly improved when there are
      multiple contiguous keys or high writeback rates.
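
      A sketch of the mechanism as I understand it (member names such as
      writeback_sequence_next and writeback_ordering_wait are my reading of
      the patch): each dirty_io is stamped with a sequence number when it
      is dispatched, and the write half only proceeds in sequence order:

          /* write_dirty(): runs after the cache-device read completes */
          if (atomic_read(&dc->writeback_sequence_next) != io->sequence) {
                  /* not our turn to write; park until a write completes */
                  closure_wait(&dc->writeback_ordering_wait, cl);

                  if (atomic_read(&dc->writeback_sequence_next) ==
                      io->sequence)
                          /* our predecessor finished while we parked */
                          closure_wake_up(&dc->writeback_ordering_wait);

                  /* re-enter write_dirty() when woken */
                  continue_at(cl, write_dirty, io->dc->writeback_write_wq);
                  return;
          }

          atomic_inc(&dc->writeback_sequence_next);
          closure_wake_up(&dc->writeback_ordering_wait);
          /* ... submit the backing-device write ... */
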
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Tested-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix wrong return value in bch_debug_init() · 539d39eb
      Committed by Tang Junhui
      In bch_debug_init(), ret is always 0, so the return value is useless.
      Change it to return 0 on success after calling debugfs_create_dir(),
      and a non-zero value otherwise.
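
      A sketch of the shape of the fix:

          int __init bch_debug_init(struct kobject *kobj)
          {
                  bcache_debug = debugfs_create_dir("bcache", NULL);

                  /* non-zero when the dentry is an ERR_PTR or NULL */
                  return IS_ERR_OR_NULL(bcache_debug);
          }
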
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: segregate flash only volume write streams · 4eca1cb2
      Committed by Tang Junhui
      In a scenario with some flash-only volumes and some cached devices,
      when many tasks request these devices in writeback mode, the write
      I/Os may fall into the same bucket, like this:
      | cached data | flash data | cached data | cached data | flash data |
      Then, after writeback of the cached devices, the bucket looks like:
      | free | flash data | free | free | flash data |
      
      So there is a lot of free space in this bucket, but since data from
      the flash-only volumes still exists there, the bucket cannot be
      reclaimed, which wastes bucket space.
      
      In this patch, we segregate flash-only volume write streams from
      cached devices, so data from flash-only volumes and cached devices is
      stored in different buckets.
      
      Compared to the v1 patch, this patch does not add an additional
      open-bucket list; it makes a best effort to segregate flash-only
      volume write streams from cached devices. Sectors of flash-only
      volumes may still be mixed with dirty sectors of cached devices, but
      the number is very small.
      
      [mlyle: fixed commit log formatting, permissions, line endings]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: Use PTR_ERR_OR_ZERO() · 9d134117
      Committed by Vasyl Gomonovych
      Fix ptr_ret.cocci warnings:
      drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used
      
      Use PTR_ERR_OR_ZERO rather than if (IS_ERR(...)) + PTR_ERR
      
      Generated by: scripts/coccinelle/api/ptr_ret.cocci
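
      The transformation the coccinelle rule suggests, sketched:

          /* before */
          if (IS_ERR(k))
                  return PTR_ERR(k);
          return 0;

          /* after */
          return PTR_ERR_OR_ZERO(k);
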
      Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: stop writeback thread after detaching · 8d29c442
      Committed by Tang Junhui
      Currently, when a cached device is detached from a cache, the
      writeback thread is not stopped and the writeback_rate_update work is
      not canceled. For example, after the following command:
      echo 1 >/sys/block/sdb/bcache/detach
      you can still see the writeback thread. If you then attach the device
      to the cache again, bcache creates another writeback thread; for
      example, after the command:
      echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
      you will see 2 writeback threads.
      This patch stops the writeback thread and cancels the
      writeback_rate_update work when a cached device is detached from the
      cache.
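
      A sketch of the teardown in cached_dev_detach_finish(), under the
      register lock as described in the v2 note below (context
      abbreviated):

          mutex_lock(&bch_register_lock);

          cancel_delayed_work_sync(&dc->writeback_rate_update);
          if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
                  kthread_stop(dc->writeback_thread);
                  dc->writeback_thread = NULL;
          }

          /* ... rest of the detach teardown ... */
          mutex_unlock(&bch_register_lock);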
      
      Compared with patch v1, this v2 patch moves the code down into the
      register lock, for safety against any future changes, as Coly and
      Mike suggested.
      
      [edit by mlyle: commit log spelling/formatting]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: ret IOERR when read meets metadata error · b221fc13
      Committed by Rui Hua
      A read request might hit an error while searching the btree, but the
      error was not handled in cache_lookup(), so this kind of metadata
      failure never went into cached_dev_read_error() and the upper layer
      received bi_status=0. In this patch we judge the metadata error by
      the return value of bch_btree_map_keys(); two potential paths give
      rise to the error:
      
      1. Because the btree is not entirely cached in memory, we may get an
         error when reading a btree node from the cache device (see
         bch_btree_node_get()); the likely errnos are -EIO and -ENOMEM.

      2. When a read miss happens, bch_btree_insert_check_key() is called
         to insert a "replace_key" into the btree (see
         cached_dev_cache_miss(); this is preparatory work before inserting
         the missed data into the cache device). A failure can also happen
         in this situation; the likely errno is -ENOMEM.
      
      bch_btree_map_keys() returns MAP_DONE in the normal scenario, but
      either -EIO or -ENOMEM in the two cases above. If that happens, we
      should NOT recover data from the backing device (when the cache
      device is dirty), because we don't know whether all the bkeys covered
      by the read request are clean. At that point s->iop.status still has
      its initial value (0) before we submit s->bio.bio; we set it to
      BLK_STS_IOERR so the request goes into cached_dev_read_error() and
      the error is finally passed to the upper layer, or recovered by
      rereading from the backing device.
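
      A sketch of the check in cache_lookup() (error-handling shape per the
      description; exact context abbreviated):

          int ret = bch_btree_map_keys(&s->op, s->iop.c,
                                       &KEY(s->iop.inode,
                                            bio->bi_iter.bi_sector, 0),
                                       cache_lookup_fn, MAP_END_KEY);

          /*
           * -EAGAIN is retried as before; -EIO/-ENOMEM mean the btree
           * lookup itself failed, so mark the request as an I/O error and
           * let it take the cached_dev_read_error() path.
           */
          if (ret < 0 && ret != -EAGAIN)
                  s->iop.status = BLK_STS_IOERR;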
      
      [edit by mlyle: patch formatting, word-wrap, comment spelling,
      commit log format]
      Signed-off-by: Hua Rui <huarui.dev@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 07 Jan 2018 (3 commits)
  5. 25 Nov 2017 (4 commits)
    • bcache: check return value of register_shrinker · 6c4ca1e3
      Committed by Michael Lyle
      register_shrinker is now __must_check, so check its result to kill a
      warning. The caller of bch_btree_cache_alloc in super.c appropriately
      checks the return value, so this is fully plumbed through.
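
      The tail of bch_btree_cache_alloc(), sketched per the description:

          /* was:
           *      register_shrinker(&c->shrink);
           *      return 0;
           */
          return register_shrinker(&c->shrink);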
      
      This V2 fixes checkpatch warnings and improves the commit description,
      as I was too hasty getting the previous version out.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Vojtech Pavlik <vojtech@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: recover data from backing when data is clean · e393aa24
      Committed by Rui Hua
      When we send a read request and hit clean data in the cache device, a
      situation called a cache read race can occur in bcache (see the
      comment in the tail of cache_lookup(); the following explanation is
      copied from there): the bucket we're reading from might be reused
      while our bio is in flight, and we could then end up reading the
      wrong data. We guard against this by checking (in
      bch_cache_read_endio()) if the pointer is stale again; if so, we
      treat it as an error (s->iop.error = -EINTR) and reread from the
      backing device (but we don't pass that error up anywhere).
      
      It should be noted that cache read races happen under normal
      circumstances, not only when the SSD has failed; they are counted and
      shown in /sys/fs/bcache/XXX/internal/cache_read_races.
      
      Without this patch, when we use writeback mode we never reread from
      the backing device when a cache read race happens, until the whole
      cache device is clean, because the condition
      (s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
      cached_dev_read_error(). In this situation s->iop.error (= -EINTR)
      is passed up, and in the end the user receives -EINTR when the bio
      ends; this is not suitable, and confusing to the upper application.
      
      In this patch, we use s->read_dirty_data to judge whether the read
      request hit dirty data in the cache device; it is safe to reread from
      the backing device when the request hit only clean data. This not
      only handles the cache read race, but also recovers data when a read
      request from the cache device fails.
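
      A sketch of the new recovery condition in cached_dev_read_error()
      (shape per the description above):

          /* reread from the backing device only if this request never
           * touched dirty data in the cache; previously this was gated
           * on the whole device: dc && !atomic_read(&dc->has_dirty) */
          if (s->recoverable && !s->read_dirty_data) {
                  /* ... retry s->orig_bio against the backing device ... */
          }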
      
      [edited by mlyle to fix up whitespace, commit log title, comment
      spelling]
      
      Fixes: d59b2379 ("bcache: only permit to recovery read error when cache device is clean")
      Cc: <stable@vger.kernel.org> # 4.14
      Signed-off-by: Hua Rui <huarui.dev@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: Fix building error on MIPS · cf33c1ee
      Committed by Huacai Chen
      This patch tries to fix the build error on MIPS. The reason is that
      MIPS already defines a PTR macro, which conflicts with the PTR macro
      in include/uapi/linux/bcache.h.
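
      A sketch of the obvious shape of the fix: rename the uapi macro (to
      MAKE_PTR, if memory serves) and update its callers; the macro body
      here is my recollection, not quoted from the patch:

          /* arch/mips/include/asm/asm.h defines PTR for the assembler
           * word size, so the uapi macro gets a non-conflicting name */
          #define MAKE_PTR(gen, offset, dev)                              \
                  ((((__u64) dev) << 51) | ((__u64) offset) << 8 | gen)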
      
      [fixed by mlyle: corrected a line-length issue]
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add a comment in journal bucket reading · bb22cafd
      Committed by Tang Junhui
      The journal buckets form a circular buffer, so the buckets can look
      like YYYNNNYY, which means the first valid journal entry is in the
      7th bucket and the latest valid journal entry is in the third bucket.
      In this case, if we do not try the zero index first, we may find a
      valid journal in the 7th bucket, and then call
      find_next_bit(bitmap, ca->sb.njournal_buckets, l + 1) to get the
      first invalid bucket after the 7th bucket. Because all of those
      buckets are valid, no bit is set in the bitmap, so find_next_bit()
      returns ca->sb.njournal_buckets (8). After that, bcache would only
      read the journal in the 7th and 8th buckets; the first through third
      buckets would be lost.
      
      So it is important to let developers know that the hash-search must
      try the zero index first, to avoid breaking this in future code
      modifications.
      
      [ML: Fixed whitespace & formatting & file permissions]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 15 Nov 2017 (1 commit)
  7. 02 Nov 2017 (1 commit)
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Committed by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 31 Oct 2017 (5 commits)
    • bcache: explicitly destroy mutex while exiting · 330a4db8
      Committed by Liang Chen
      mutex_destroy does nothing most of the time, but it's better to call
      it to make the code future proof, and it also has some value for
      mutex debugging.

      As Coly pointed out in a previous review, bcache_exit() may not be
      able to handle all the references properly if userspace registers
      cache and backing devices right before bch_debug_init runs and
      bch_debug_init fails later. So do not expose the userspace interface
      until everything is ready, to avoid that issue.
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix wrong cache_misses statistics · c1573137
      Committed by tang.junhui
      Currently, cache-missed I/Os are identified by s->cache_miss, but in
      many situations missed I/Os are never assigned a value for
      s->cache_miss in cached_dev_cache_miss(), for example a bypassed I/O
      (s->iop.bypass = 1), or when the cache_bio allocation fails. In these
      situations the code goes to out_put or out_submit with s->cache_miss
      NULL, which leads bch_mark_cache_accounting() to treat the I/O as a
      hit.
      
      [ML: applied by 3-way merge]
      Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: update bucket_in_use in real time · d44c2f9e
      Committed by Tang Junhui
      bucket_in_use is updated in the gc thread, which is triggered by
      invalidating or by writing sectors_to_gc dirty data; that is a long
      interval. Therefore, when we compare it against the threshold, it is
      often stale, which leads to inaccurate judgment and often results in
      bucket depletion.
      
      We sent a patch before that updated bucket_in_use periodically in the
      gc thread, but Coly thought that would lead to high latency. In this
      patch, we add avail_nbuckets to record the count of available
      buckets, and we recalculate bucket_in_use in real time whenever a
      bucket is allocated or freed.
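
      A sketch of the recalculation, assuming the helper name from my
      reading of the patch (called under bucket_lock on every bucket
      alloc/free):

          static void bch_update_bucket_in_use(struct cache_set *c,
                                               struct gc_stat *stats)
          {
                  stats->in_use = (c->nbuckets - c->avail_nbuckets) * 100 /
                                  c->nbuckets;
          }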
      
      [edited by ML: eliminated some whitespace errors]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: convert cached_dev.count from atomic_t to refcount_t · 3b304d24
      Committed by Elena Reshetova
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable cached_dev.count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
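
      A sketch of the conversion on the bcache helpers (shape assumed from
      the standard refcount_t API; memory-barrier details of the real
      helpers are omitted):

          /* struct cached_dev: atomic_t count  ->  refcount_t count */

          /* initialization: refcount_set(&dc->count, 1); */

          static inline void cached_dev_put(struct cached_dev *dc)
          {
                  if (refcount_dec_and_test(&dc->count))
                          schedule_work(&dc->detach);
          }

          static inline bool cached_dev_get(struct cached_dev *dc)
          {
                  /* the "inc_not_zero" schema: no new refs once it hits 0 */
                  return refcount_inc_not_zero(&dc->count);
          }
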
      Suggested-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: only permit to recovery read error when cache device is clean · d59b2379
      Committed by Coly Li
      When bcache does read I/O, for example in writeback or writethrough
      mode, if a read request on the cache device fails, bcache will try to
      recover the request by reading from the cached device. If the data on
      the cached device is not in sync with the cache device, the requester
      will get stale data.
      
      For a critical storage system like a database, serving stale data
      from recovery may result in application-level data corruption, which
      is unacceptable.
      
      With this patch, a failed read request in writeback or writethrough
      mode is only recovered when the cache device is clean, that is to
      say, when all data on the cached device is up to date.
      
      For the other cache modes in bcache, read requests never hit
      cached_dev_read_error(), so they don't need this patch.
      
      Please note that because the cache mode can be switched arbitrarily
      at run time, a writethrough mode might have been switched from
      writeback mode. Therefore checking dc->has_dirty in writethrough mode
      still makes sense.
      
      Changelog:
      V4: fix a parens error pointed out by Michael Lyle.
      v3: per the response from Kent Overstreet, recovering stale data is a
          bug to fix, and an option to permit it is unnecessary, so this
          version removes the sysfs file.
      v2: rename sysfs entry from allow_stale_data_on_failure  to
          allow_stale_data_on_failure, and fix the confusing commit log.
      v1: initial patch posted.
      
      [small change to patch comment spelling by mlyle]
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reported-by: Arne Wolf <awolf@lenovo.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Kai Krakow <hurikhan77@gmail.com>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 17 Oct 2017 (1 commit)
  10. 16 Oct 2017 (4 commits)
    • bcache: safeguard a dangerous addressing in closure_queue · 6446c684
      Committed by Liang Chen
      The use of the union reduces the size of struct closure by taking
      advantage of the current layout of its members. The offset of func in
      work_struct equals the size of the first three members, so that
      work.func ends up referencing the fourth member, fn.

      This is smart but dangerous: it can be broken if work_struct or the
      other structs change, and it can be a bit difficult to debug.
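
      A sketch of the safeguard: a compile-time check in closure_queue()
      that the aliased offsets still line up (shape per my reading of the
      patch):

          static inline void closure_queue(struct closure *cl)
          {
                  struct workqueue_struct *wq = cl->wq;

                  /* catch, at compile time, any layout drift that would
                   * break the work.func <-> fn aliasing in the union */
                  BUILD_BUG_ON(offsetof(struct closure, fn)
                               != offsetof(struct work_struct, func));

                  if (wq) {
                          INIT_WORK(&cl->work, cl->work.func);
                          BUG_ON(!queue_work(wq, &cl->work));
                  } else
                          cl->fn(cl);
          }
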
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: rearrange writeback main thread ratelimit · a8500fc8
      Committed by Michael Lyle
      The time spent searching for things to write back "counts" for the
      actual rate achieved, so don't flush the accumulated rate with each
      chunk.
      
      This will maintain better fidelity to user-commanded rates, but it
      may slightly increase the burstiness of writeback.  The writeback
      lock needs improvement to help mitigate this.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: writeback rate shouldn't artifically clamp · e41166c5
      Committed by Michael Lyle
      The previous code artificially limited writeback rate to 1000000
      blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
      hardware.  The rate limiting code works fine (though with decreased
      precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.
      
      Additionally, ensure that uint32_t is used as a type for rate throughout
      the rate management so that type checking/clamp_t can work properly.
      
      bch_next_delay should be rewritten for increased precision and better
      handling of high rates and long sleep periods, but this is adequate for
      now.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reported-by: Coly Li <colyli@suse.de>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: smooth writeback rate control · ae82ddbf
      Committed by Michael Lyle
      This works in conjunction with the new PI controller. Currently, in
      real-world workloads, the rate controller attempts to write back 1
      sector per second. In practice, these minimum-rate writebacks are
      between 4k and 60k in test scenarios, since bcache aggregates and
      attempts to do contiguous writes and because filesystems on top of
      bcache typically write 4k or more.
      
      Previously, bcache used to guarantee to write at least once per second.
      This means that the actual writeback rate would exceed the configured
      amount by a factor of 8-120 or more.
      
      This patch adjusts to be willing to sleep up to 2.5 seconds, and to
      target writing 4k/second.  On the smallest writes, it will sleep 1
      second like before, but many times it will sleep longer and load the
      backing device less.  This keeps the loading on the cache and backing
      device related to writeback more consistent when writing back at low
      rates.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>